Sporadic coroutine switch crashes on AL2023 aarch64 #11488

@runderwo

Bug Report

Describe the bug

During a migration from x86-64 to aarch64 VMs on EC2, we discovered that fluent-bit crashes sporadically on the new nodes, often within minutes of startup. The crashes have been observed in the flb-pipeline and flb-in-prometheus threads. Each crash leaves a stack trace like the following, faulting immediately after a call to co_switch():

#0  0x0000aaaabbd79550 in input_pre_cb_collect.lto_priv ()
#1  0x0000aaaabbd81614 [PAC] in co_switch (handle=<optimized out>) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/lib/monkey/deps/flb_libco/aarch64.c:133
#2  input_params_set (context=<optimized out>, config=0x13, coll=0x0, coro=<optimized out>) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/include/fluent-bit/flb_input.h:573
#3  flb_input_coro_collect (config=0x13, coll=0x0) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/include/fluent-bit/flb_input.h:642
#4  input_collector_fd (ins=<optimized out>, fd=<optimized out>) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/src/flb_input_thread.c:159
#5  engine_handle_event (config=<optimized out>, ins=<optimized out>, mask=<optimized out>, fd=<optimized out>) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/src/flb_input_thread.c:181
#6  input_thread (data=0x0) at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/src/flb_input_thread.c:420
#7  0x0000aaaabc3772d4 [PAC] in extent_recycle_split (expand_edata=<optimized out>, growing_retained=255, edata=<optimized out>, alignment=281473625227872, size=4096, ecache=0xaaaabc3807f8 <je_pa_alloc+248>, ehooks=0xffffb00000c0, pac=0xffffb0003a00, tsdn=0xffff9dbcf280)
    at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/lib/jemalloc-5.3.0/src/extent.c:543
#8  extent_recycle (tsdn=0xffff9dbcf280, pac=0xffffb0003a00, ehooks=0xffffb00000c0, ecache=0xaaaabc3807f8 <je_pa_alloc+248>, expand_edata=<optimized out>, size=4096, alignment=281473625227872, zero=128, commit=0xffffaf7204d0, growing_retained=92, guarded=128)
    at /usr/src/debug/fluent-bit-4.0.8-608.amzn2023.aarch64/lib/jemalloc-5.3.0/src/extent.c:613
#9  0x0000000000001000 in ?? ()

Negative findings:

  • It did not matter whether jemalloc was enabled or not.
  • Disabling LTO made the trace clearer but did not eliminate the crash.
  • ASAN did not log any problems or abort the program before the crash happened.
  • Valgrind did not detect any problems but, curiously, did prevent the crash from occurring.
  • Disabling aarch64 hardware stack protection at the compiler level did not help.
  • Declaring 16 byte alignment for the thread-local variables the coroutine state is stored in did not help.

Positive findings:

  • Disabling the GCC stack protector and glibc fortify eliminates the crash, or at least reduces its frequency from the order of minutes to the order of days (the present period of observation).
  • Achieving the above requires both disabling RPM hardening at the package build level and building with either FLB_DEBUG=On or FLB_SMALL=On at the fluent-bit build level (because of logic in CMakeLists.txt).
  • TSAN and helgrind both log an overwhelming number of unsynchronized read/write sequences in fluent-bit 4.2.3.

To Reproduce
I am not sure how to reproduce the crash generically. It may be specific to our particular configuration with lots of input and output threads:

(gdb) info threads
  Id   Target Id                                            Frame 
* 1    Thread 0xffff8b086020 (LWP 476652) "fluent-bit"      0x0000ffff8a2f42d0 in clock_nanosleep@@GLIBC_2.17 () from /lib64/libc.so.6
  2    Thread 0xffff69aeaaa0 (LWP 476669) "rdk:broker75478" 0x0000ffff8a3206c8 in poll () from /lib64/libc.so.6
  3    Thread 0xffff6a4faaa0 (LWP 476668) "rdk:broker31876" 0x0000ffff8a3206c8 in poll () from /lib64/libc.so.6
  4    Thread 0xffff6c56aaa0 (LWP 476666) "rdk:broker-1"    0x0000ffff8a2ba794 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
  5    Thread 0xffff6cf7aaa0 (LWP 476665) "rdk:main"        0x0000ffff8a2ba794 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
  6    Thread 0xffff6d98aaa0 (LWP 476664) "flb-out-s3.4-w0" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
  7    Thread 0xffff6e39aaa0 (LWP 476663) "flb-out-prometh" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
  8    Thread 0xffff6edaaaa0 (LWP 476662) "flb-out-prometh" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
  9    Thread 0xffff6f7baaa0 (LWP 476661) "monkey: wrk/0"   0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
  10   Thread 0xffff701caaa0 (LWP 476660) "monkey: clock"   0x0000ffff8a2f42d0 in clock_nanosleep@@GLIBC_2.17 () from /lib64/libc.so.6
  11   Thread 0xffff70bdaaa0 (LWP 476659) "monkey: server"  0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
  12   Thread 0xffff715eaaa0 (LWP 476658) "flb-out-kinesis" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
  13   Thread 0xffff71ffaaa0 (LWP 476657) "flb-out-s3.0-w0" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
  14   Thread 0xffff737e7aa0 (LWP 476656) "flb-in-promethe" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
  15   Thread 0xffff741f7aa0 (LWP 476655) "flb-in-promethe" 0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
  16   Thread 0xffff883faaa0 (LWP 476654) "flb-logger"      0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6
  17   Thread 0xffff8937aaa0 (LWP 476653) "flb-pipeline"    0x0000ffff8a32b404 in epoll_pwait () from /lib64/libc.so.6

Expected behavior
fluent-bit should never crash when switching coroutines.

Your Environment

  • Version used: 4.2.3
  • Configuration: extensive
  • Environment name and version (e.g. Kubernetes? What version?): AL2023
  • Server type and version: aarch64
  • Operating System and version: kernel 6.1.161
  • Filters and plugins: extensive

Additional context
The libco upstream that fluent-bit's copy is based on is an obsolete fork. The following appears to be the currently active upstream, with several bug fixes and optimizations that have not been incorporated:
https://github.com/higan-emu/libco

This crash is quite similar to this previous coroutine swap crash:
#7061

I sometimes noticed that the coroutine context is clobbered by unrelated (but structured) data, which makes me suspect that the TSAN errors mentioned earlier are surfacing a real problem with cross-thread memory management:

(gdb) print coro
$3 = (struct flb_coro *) 0xfffff3ed1060
(gdb) print *coro
$4 = {valgrind_stack_id = 926430515, caller = 0x3035393336313737, callee = 0x3038363734372e35, data = 0x626c662e393637}
(gdb) x/ds coro
0xfffff3ed1060: "358753-1771639505.747680769.flb"
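
The field values in the dump above read as printable ASCII once their bytes are taken in little-endian memory order, which matches the string that x/ds shows. A minimal sketch of that decoding, assuming a little-endian target; le32_to_text and le64_to_text are hypothetical helper names, not fluent-bit functions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies the 8 bytes of v, least significant byte first (little-endian
 * memory order), into out and NUL-terminates it. out must hold 9 bytes. */
static void le64_to_text(uint64_t v, char out[9])
{
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++) {
        out[i] = (char)((v >> (8 * i)) & 0xff);
    }
    out[8] = '\0';
}

/* Same idea for a 32-bit field. out must hold 5 bytes. */
static void le32_to_text(uint32_t v, char out[5])
{
    assert(out != NULL);
    for (size_t i = 0; i < 4; i++) {
        out[i] = (char)((v >> (8 * i)) & 0xff);
    }
    out[4] = '\0';
}
```

Feeding it valgrind_stack_id, caller, callee and data from the dump yields "3587", "77163950", "5.747680" and "769.flb". The four bytes "53-1" between the first two pieces fall in what is presumably alignment padding after the 4-byte field, so gdb does not print them as a member.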

Build config diff from aarch64 non-working to working:

-- summary of build options:
     Package version: 1.65.0
     Library version: 42:4:28
     Install prefix:  /usr/local
@@ -7,14 +6,10 @@
     Compiler:
       Build type:     RelWithDebInfo
       C compiler:     /usr/bin/gcc
-      CFLAGS:         -O2 -g  -O2 -ftree-vectorize -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -march=armv8.2-a+crypto -mtune=neoverse-n1 -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -Wall -D__FLB_FILENAME__=__FILE__ -fsigned-char -Wl,-z,relro,-z,now -Wl,-z,noexecstack -fstack-protector -D_FORTIFY_SOURCE=1
+      CFLAGS:         -O2 -g   -Wall -D__FLB_FILENAME__=__FILE__ -fsigned-char -Os -g0  -s  -fno-stack-protector -fomit-frame-pointer -DNDEBUG -U_FORTIFY_SOURCE -Wl,-z,relro,-z,now -Wl,-z,noexecstack
       C++ compiler:   /usr/bin/g++
-      CXXFLAGS:       -O2 -g  -O2 -ftree-vectorize -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -march=armv8.2-a+crypto -mtune=neoverse-n1 -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection
+      CXXFLAGS:       -O2 -g  
       WARNCFLAGS:       -W -Wall -Wconversion -Winline -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wshadow -Wundef -Wwrite-strings -Waddress -Wattributes -Wcast-align -Wdeclaration-after-statement -Wdiv-by-zero -Wempty-body -Wendif-labels -Wfloat-equal -Wformat-nonliteral -Wformat-security -Wmissing-field-initializers -Wmissing-noreturn -Wno-format-nonliteral -Wredundant-decls -Wsign-conversion -Wstrict-prototypes -Wunreachable-code -Wunused-parameter -Wvla -Wclobbered -Wpragmas
       CXX1XCXXFLAGS:  
       WARNCXXFLAGS:     -Wall -Wformat-security
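
Because the hardening flags arrive from several layers (RPM macros, CMake, the spec files), it can help to confirm what a given translation unit was actually compiled with. The following is a rough sketch, not part of fluent-bit: the function names are hypothetical, and it relies on the GCC predefined macros __SSP__, __SSP_ALL__ and __SSP_STRONG__, the _FORTIFY_SOURCE macro, and the ACLE macros __ARM_FEATURE_PAC_DEFAULT and __ARM_FEATURE_BTI_DEFAULT.

```c
#include <assert.h>
#include <stdio.h>

/* 0 = none, 1 = -fstack-protector, 2 = -fstack-protector-all,
 * 3 = -fstack-protector-strong (per GCC's predefined macros). */
static int stack_protector_level(void)
{
#if defined(__SSP_STRONG__)
    return 3;
#elif defined(__SSP_ALL__)
    return 2;
#elif defined(__SSP__)
    return 1;
#else
    return 0;
#endif
}

/* Value of _FORTIFY_SOURCE seen by this translation unit, 0 if undefined. */
static int fortify_level(void)
{
#if defined(_FORTIFY_SOURCE)
    return _FORTIFY_SOURCE;
#else
    return 0;
#endif
}

/* Bit 0: pointer authentication enabled by default; bit 1: BTI enabled. */
static int branch_protection_mask(void)
{
    int mask = 0;
#if defined(__ARM_FEATURE_PAC_DEFAULT)
    mask |= 1;
#endif
#if defined(__ARM_FEATURE_BTI_DEFAULT)
    mask |= 2;
#endif
    return mask;
}

/* Writes a three-line summary of the settings above to out. */
static void report_hardening(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "stack protector level: %d\n", stack_protector_level());
    fprintf(out, "_FORTIFY_SOURCE:       %d\n", fortify_level());
    fprintf(out, "branch protection:     %d\n", branch_protection_mask());
}
```

Compiling this file with each of the two CFLAGS lines from the diff and calling report_hardening(stdout) should show whether the stack protector, fortify and branch-protection settings really differ between the two builds.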
