[Bug] SEGFAULT in bolt_thrustjit_test #151

@ZacBlanco

Component Selection

  • Core Engine (Expression eval, Memory, Vector)
  • Connectors / File Formats (Hive, Parquet, etc.)
  • API / Bindings (Python, etc.)
  • Build
  • Other

Describe the Bug

I have seen some CI runs where bolt_thrustjit_test fails with an error.

99% tests passed, 1 tests failed out of 415

Total Test time (real) = 164.95 sec

The following tests did not run:
	 90 - */MemoryAllocatorTest.allocContiguousVsize/* (Disabled)
	128 - */MemoryCapExceededTest.singleDriver/* (Disabled)
Errors while running CTest
	130 - */MemoryCapExceededTest.allocatorCapacityExceededError/* (Disabled)
	151 - */MemoryPoolTest.memoryLeakCheck/* (Disabled)
	190 - */MemoryPoolTest.concurrentUpdateToSharedPools/* (Disabled)
	282 - */SharedArbitrationTestWithThreadingModes.raceBetweenTaskTerminateAndReclaim/* (Disabled)

The following tests FAILED:
	315 - bolt_thrustjit_test (SEGFAULT)
make: *** [Makefile:360: unittest_release] Error 8

The root cause of the failures has varied. One example is reproduced below:

302/421 Test #315: bolt_thrustjit_test ..............................................................................***Exception: SegFault  0.27 sec
Running main() from /github/home/.conan2/p/b/gtest9f9ec4a65659c/b/src/googletest/src/gtest_main.cc
[==========] Running 10 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 7 tests from RowContainerJitTest
[ RUN      ] RowContainerJitTest.all_types
[       OK ] RowContainerJitTest.all_types (95 ms)
[ RUN      ] RowContainerJitTest.two_float_point
[       OK ] RowContainerJitTest.two_float_point (5 ms)
[ RUN      ] RowContainerJitTest.float_with_nulls
[       OK ] RowContainerJitTest.float_with_nulls (16 ms)
[ RUN      ] RowContainerJitTest.float_point_nan_test
[       OK ] RowContainerJitTest.float_point_nan_test (0 ms)
[ RUN      ] RowContainerJitTest.timestamp
[       OK ] RowContainerJitTest.timestamp (5 ms)
[ RUN      ] RowContainerJitTest.singleKey
[       OK ] RowContainerJitTest.singleKey (4 ms)
[ RUN      ] RowContainerJitTest.stringview
[       OK ] RowContainerJitTest.stringview (8 ms)
[----------] 7 tests from RowContainerJitTest (136 ms total)

[----------] 3 tests from JitEngineTest
[ RUN      ] JitEngineTest.basic
[       OK ] JitEngineTest.basic (3 ms)
[ RUN      ] JitEngineTest.cacheLimit
JIT session error: Resource tracker 0x38f7e190 became defunct

We need to find the root cause and fix it so that this test passes reliably. Doing so will stabilize CI and help ensure that the same failure does not occur in production workloads.

Reproduction Steps

$ make release_with_test
$ _build/Release/bolt/jit/tests/bolt_thrustjit_test
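
Because the failure only shows up in some CI runs, a single invocation of the binary may well pass. The following is a minimal sketch of a rerun loop, assuming a POSIX host with Python 3. `run_until_failure` and `describe` are hypothetical helper names made up for this illustration and are not part of Bolt. The binary path is the one from the steps above, and `--gtest_filter` is the standard GoogleTest flag for selecting a test.

```python
import signal
import subprocess


def run_until_failure(cmd, max_runs=100):
    """Run cmd up to max_runs times.

    Returns (run_number, returncode) for the first run that exits with a
    non-zero status, or None if every run succeeds.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            return run, result.returncode
    return None


def describe(returncode):
    """Turn a subprocess return code into a short human-readable string.

    On POSIX, subprocess reports a child killed by signal N as -N,
    so a segfault shows up as -SIGSEGV.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Example use (assumes the build from the reproduction steps exists):
#
#   cmd = [
#       "_build/Release/bolt/jit/tests/bolt_thrustjit_test",
#       "--gtest_filter=JitEngineTest.cacheLimit",
#   ]
#   failure = run_until_failure(cmd, max_runs=200)
#   if failure is None:
#       print("no failure reproduced")
#   else:
#       run, code = failure
#       print(f"run {run}: {describe(code)}")
```

The filter narrows the loop to the test that was running when the sample log ends. Since the root cause has varied between runs, dropping the filter and rerunning the whole binary may be needed to catch the other variants.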

Bolt Version / Commit ID

0ea8492

System Configuration

- **OS**: Debian Bookworm
- **Compiler**: GCC 12.5.0
- **Build Type**: Release
- **CPU Arch**: x86
- **Framework**: N/A

Logs / Stack Trace

[----------] 3 tests from JitEngineTest
[ RUN      ] JitEngineTest.basic
[       OK ] JitEngineTest.basic (3 ms)
[ RUN      ] JitEngineTest.cacheLimit
JIT session error: Resource tracker 0x38f7e190 became defunct



Sample run: https://github.com/bytedance/bolt/actions/runs/21076158907/job/60618467545

Expected Behavior

bolt_thrustjit_test runs to completion without a SEGFAULT.

Additional context

No response

Labels

bug (Something isn't working)
