Description
Bug Report
Describe the bug
During a migration from x86-64 to aarch64 VMs on EC2, we discovered that fluent-bit crashes sporadically on the new nodes, often within minutes of startup. The crashes have been observed in the flb-pipeline and flb-in-prometheus threads. They always leave a stack trace like the following, in which the crash occurs immediately after a call to co_switch():
#0 0x0000aaaabbd79550 in input_pre_cb_collect.lto_priv ()
#1 0x0000aaaabbd81614 [PAC] in co_switch (handle=<optimized out>) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/lib/monkey/deps/flb_libco/aarch64.c:133
#2 input_params_set (context=<optimized out>, config=0x13, coll=0x0, coro=<optimized out>) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/include/fluent-bit/flb_input.h:573
#3 flb_input_coro_collect (config=0x13, coll=0x0) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/include/fluent-bit/flb_input.h:642
#4 input_collector_fd (ins=<optimized out>, fd=<optimized out>) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/src/flb_input_thread.c:159
#5 engine_handle_event (config=<optimized out>, ins=<optimized out>, mask=<optimized out>, fd=<optimized out>) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/src/flb_input_thread.c:181
#6 input_thread (data=0x0) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/src/flb_input_thread.c:420
#7 0x0000aaaabc3772d4 [PAC] in extent_recycle_split (expand_edata=<optimized out>, growing_retained=255, edata=<optimized out>, alignment=281473625227872, size=4096, ecache=0xaaaabc3807f8 <je_pa_alloc+248>, ehooks=0xffffb00000c0, pac=0xffffb0003a00, tsdn=0xffff9dbcf280)
at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/lib/jemalloc-5.3.0/src/extent.c:543
#8 extent_recycle (tsdn=0xffff9dbcf280, pac=0xffffb0003a00, ehooks=0xffffb00000c0, ecache=0xaaaabc3807f8 <je_pa_alloc+248>, expand_edata=<optimized out>, size=4096, alignment=281473625227872, zero=128, commit=0xffffaf7204d0, growing_retained=92, guarded=128)
at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/lib/jemalloc-5.3.0/src/extent.c:613
#9 0x0000000000001000 in ?? ()
Negative findings:
- It did not matter whether jemalloc was enabled or not.
- Disabling LTO made the trace clearer but did not eliminate the crash.
- ASAN did not log any problems or abort the program before the crash happened.
- Valgrind did not detect any problems but, curiously, did prevent the crash from occurring.
- Disabling aarch64 hardware stack protection at the compiler level did not help.
- Declaring 16 byte alignment for the thread-local variables the coroutine state is stored in did not help.
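For readers unfamiliar with the last item, the following is a minimal sketch of what a 16-byte alignment declaration on a thread-local variable can look like in C11. The variable and function names are hypothetical and are not taken from flb_libco; the sketch only illustrates the kind of change that was tried, not the actual patch. AArch64 requires the stack pointer to be 16-byte aligned, which is why this alignment was a natural thing to rule out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for per-thread coroutine state.
 * The name and size are assumptions for illustration only.
 */
static _Thread_local _Alignas(16) uint64_t demo_active_buffer[64];

/* The buffer size is a whole number of 16-byte units. */
static_assert(sizeof(demo_active_buffer) % 16 == 0,
              "buffer size should be a multiple of 16 bytes");

/* Returns nonzero when the calling thread's copy starts on a 16-byte boundary. */
int demo_buffer_is_16_byte_aligned(void)
{
    return ((uintptr_t)demo_active_buffer % 16u) == 0;
}
```

Calling demo_buffer_is_16_byte_aligned() from any thread checks that thread's own copy of the variable.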
Positive findings:
- Disabling the GCC stack protector and glibc fortify eliminates the crash - or at least reduces its frequency to the order of days (the present period of observation) rather than the order of minutes.
- Disabling both RPM hardening at the package build level and building with either FLB_DEBUG=On or FLB_SMALL=On at the fluent-bit build level (due to CMakeLists.txt logic) is required to accomplish the above.
- TSAN and helgrind both log an overwhelming number of unsynchronized read/write sequences in fluent-bit 4.2.3.
To Reproduce
I am not sure how to reproduce the crash generically. It may be specific to our particular configuration with lots of input and output threads:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0xffff8b086020 (LWP 476652) "fluent-bit" 0x0000ffff8a2f42d0 in clock_nanosleep@@GLIBC_2.17 () from /lib64/libc.so.6
2 Thread 0xffff69aeaaa0 (LWP 476669) "rdk:broker75478" 0x0000ffff8a3206c8 in poll () from /lib64/libc.so.6
3 Thread 0xffff6a4faaa0 (LWP 476668) "rdk:broker31876" 0x0000ffff8a3206c8 in poll () from /lib64/libc.so.6
4 Thread 0xffff6c56aaa0 (LWP 476666) "rdk:broker-1" 0x0000ffff8a2ba794 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
5 Thread 0xffff6cf7aaa0 (LWP 476665) "rdk:main" 0x0000ffff8a2ba794 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
6 Thread 0xffff6d98aaa0 (LWP 476664) "flb-out-s3.4-w0" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
7 Thread 0xffff6e39aaa0 (LWP 476663) "flb-out-prometh" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
8 Thread 0xffff6edaaaa0 (LWP 476662) "flb-out-prometh" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
9 Thread 0xffff6f7baaa0 (LWP 476661) "monkey: wrk/0" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
10 Thread 0xffff701caaa0 (LWP 476660) "monkey: clock" 0x0000ffff8a2f42d0 in clock_nanosleep@@GLIBC_2.17 () from /lib64/libc.so.6
11 Thread 0xffff70bdaaa0 (LWP 476659) "monkey: server" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
12 Thread 0xffff715eaaa0 (LWP 476658) "flb-out-kinesis" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
13 Thread 0xffff71ffaaa0 (LWP 476657) "flb-out-s3.0-w0" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
14 Thread 0xffff737e7aa0 (LWP 476656) "flb-in-promethe" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
15 Thread 0xffff741f7aa0 (LWP 476655) "flb-in-promethe" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
16 Thread 0xffff883faaa0 (LWP 476654) "flb-logger" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
17 Thread 0xffff8937aaa0 (LWP 476653) "flb-pipeline" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
Expected behavior
fluent-bit should never crash when switching coroutines.
Your Environment
- Version used: 4.2.3
- Configuration: extensive
- Environment name and version (e.g. Kubernetes? What version?): AL2023
- Server type and version: aarch64
- Operating System and version: kernel 6.1.161
- Filters and plugins: extensive
Additional context
The libco upstream that fluent-bit's copy is based on is an obsolete fork. The following appears to be the currently active upstream, with several bug fixes and optimizations that have not been incorporated:
https://github.com/higan-emu/libco
This crash is quite similar to this previous coroutine swap crash:
#7061
I sometimes noticed that the coroutine context is clobbered by unrelated (but structured) data, which makes me suspect that the TSAN errors I mentioned earlier are surfacing a real problem with cross-thread memory management:
(gdb) print coro
$3 = (struct flb_coro *) 0xfffff3ed1060
(gdb) print *coro
$4 = {valgrind_stack_id = 926430515, caller = 0x3035393336313737, callee = 0x3038363734372e35, data = 0x626c662e393637}
(gdb) x/ds coro
0xfffff3ed1060: "358753-1771639505.747680769.flb"
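To make the link between the two gdb views explicit, here is a small sketch that reinterprets the printed field values as little-endian bytes, the in-memory order on this aarch64 system. The helper names are hypothetical, and the field widths (a 4-byte valgrind_stack_id followed by 8-byte pointers) are an assumption based on the values gdb printed, not on the struct definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write the `width` low-order bytes of `value` into `out`, least significant
 * byte first, then add a terminating NUL. `out` needs width + 1 bytes.
 */
void le_bytes_to_text(uint64_t value, size_t width, char *out)
{
    assert(width <= sizeof(value));
    for (size_t i = 0; i < width; i++) {
        out[i] = (char)((value >> (8 * i)) & 0xff);
    }
    out[width] = '\0';
}

/* Print each field from the gdb dump above as text. */
void print_clobbered_coro(void)
{
    char text[sizeof(uint64_t) + 1];

    le_bytes_to_text(926430515u, 4, text);             /* valgrind_stack_id */
    printf("valgrind_stack_id: \"%s\"\n", text);

    le_bytes_to_text(0x3035393336313737ULL, 8, text);  /* caller */
    printf("caller:            \"%s\"\n", text);

    le_bytes_to_text(0x3038363734372e35ULL, 8, text);  /* callee */
    printf("callee:            \"%s\"\n", text);

    le_bytes_to_text(0x626c662e393637ULL, 8, text);    /* data */
    printf("data:              \"%s\"\n", text);
}
```

The four fields decode to "3587", "77163950", "5.747680" and "769.flb", which match the string at byte offsets 0, 8, 16 and 24. The four characters in between ("53-1") presumably sit in the alignment padding after valgrind_stack_id, which gdb does not print. In other words, the whole struct was overwritten by one contiguous ASCII string rather than by scattered writes.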
Build config diff from aarch64 non-working to working:
-- summary of build options:
Package version: 1.65.0
Library version: 42:4:28
Install prefix: /usr/local
@@ -7,14 +6,10 @@
Compiler:
Build type: RelWithDebInfo
C compiler: /usr/bin/gcc
- CFLAGS: -O2 -g -O2 -ftree-vectorize -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -march=armv8.2-a+crypto -mtune=neoverse-n1 -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -Wall -D__FLB_FILENAME__=__FILE__ -fsigned-char -Wl,-z,relro,-z,now -Wl,-z,noexecstack -fstack-protector -D_FORTIFY_SOURCE=1
+ CFLAGS: -O2 -g -Wall -D__FLB_FILENAME__=__FILE__ -fsigned-char -Os -g0 -s -fno-stack-protector -fomit-frame-pointer -DNDEBUG -U_FORTIFY_SOURCE -Wl,-z,relro,-z,now -Wl,-z,noexecstack
C++ compiler: /usr/bin/g++
- CXXFLAGS: -O2 -g -O2 -ftree-vectorize -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -march=armv8.2-a+crypto -mtune=neoverse-n1 -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection
+ CXXFLAGS: -O2 -g
WARNCFLAGS: -W -Wall -Wconversion -Winline -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wshadow -Wundef -Wwrite-strings -Waddress -Wattributes -Wcast-align -Wdeclaration-after-statement -Wdiv-by-zero -Wempty-body -Wendif-labels -Wfloat-equal -Wformat-nonliteral -Wformat-security -Wmissing-field-initializers -Wmissing-noreturn -Wno-format-nonliteral -Wredundant-decls -Wsign-conversion -Wstrict-prototypes -Wunreachable-code -Wunused-parameter -Wvla -Wclobbered -Wpragmas
CXX1XCXXFLAGS:
WARNCXXFLAGS: -Wall -Wformat-security