Fix macOS SIGSEGV in task execution by using fork+exec #64874

kaxil wants to merge 3 commits into apache:main

Conversation
```python
fds = json.loads(os.environ.pop(_CHILD_FDS_ENV_VAR))
child_requests = _socket.socket(fileno=fds["requests"])
child_stdout = _socket.socket(fileno=fds["stdout"])
child_stderr = _socket.socket(fileno=fds["stderr"])
log_fd = fds["logs"]
```
We've already got a mechanism to pass the FD and re-open the logs socket. We should use that rather than implement a new way. Or at least we only need to pass the log socket FD, as all the others are guaranteed to be 0, 1, 2.
Also stdin, stdout and stderr are inheritable by default and kept when doing exec, so we shouldn't need to handle those differently at all.
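The inheritance rules mentioned here can be checked directly: since PEP 446, newly created FDs are non-inheritable (closed on exec) unless explicitly marked, while 0/1/2 survive exec. A quick sketch with a hypothetical helper:

```python
import os
import subprocess
import sys


def fd_survives_exec(fd: int) -> bool:
    """Return True if `fd` is still open in an exec'd child process."""
    code = (
        "import os, sys\n"
        "try:\n"
        "    os.fstat(int(sys.argv[1]))\n"
        "except OSError:\n"
        "    sys.exit(1)\n"
        "sys.exit(0)\n"
    )
    # close_fds=False: don't close inheritable FDs before exec;
    # non-inheritable FDs still vanish because they carry O_CLOEXEC.
    return subprocess.run(
        [sys.executable, "-c", code, str(fd)], close_fds=False
    ).returncode == 0
```

Stdio (FDs 0-2) is inherited with no extra work; any other FD needs `os.set_inheritable(fd, True)` first.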
Good call. Pushed 87883db -- now dup2s onto 0/1/2 before exec (no set_inheritable needed for those), and the log channel uses the existing ResendLoggingFD + reinit_supervisor_comms() mechanism rather than a new env var. The exec'd child starts with log_fd=0 (structured logging skipped), sets _AIRFLOW_FORK_EXEC=1, and main() in task_runner.py calls reinit_supervisor_comms() after get_startup_details() to request the log channel. Same flow as the sudo/virtualenv re-exec path.
```python
with suppress(BaseException):
    print(f"execv failed, exiting with code 124: {e}", file=sys.stderr)
    traceback.print_exception(type(e), e, e.__traceback__, file=sys.stderr)
else:
```
Style/diff nit:
No need for the `else` here, as the `if` block never exits, so the `else` can be removed and its contents un-indented.
I think the `else` actually is needed here: `os.execv()` is inside a `try/except BaseException`, so if execv fails, the except prints the error and falls through. Without the `else` we'd also run `_fork_main` on a half-broken macOS child. Could move `os._exit(124)` into the except block so the if-branch always terminates, then drop the `else` and un-indent. lmk if you prefer that.
```python
# We can't use log here, as if we except out of _fork_main something _weird_ went on.
print("Exception in _fork_main, exiting with code 124", file=sys.stderr)
traceback.print_exception(type(e), e, e.__traceback__, file=sys.stderr)
```

```python
if _USE_FORK_EXEC and target is _subprocess_main:
```
I'm not sure how I feel about the `target is _subprocess_main` part...
Yeah, fair. Only task execution runs user code that makes network calls (HTTP/DNS); the DAG processor and triggerer don't hit the ObjC crash. A `use_exec: bool = False` kwarg on `start()` would be cleaner and lets callers opt in explicitly. Want me to switch to that?
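A sketch of what that opt-in flag could look like (hypothetical shape; the real class lives in supervisor.py):

```python
import sys


class WatchedSubprocess:
    """Sketch of the suggested use_exec opt-in on start()."""

    @classmethod
    def _should_use_exec(cls, use_exec: bool) -> bool:
        # fork+exec only when the caller opted in AND we're on macOS,
        # where bare fork corrupts ObjC runtime state.
        return use_exec and sys.platform == "darwin"

    @classmethod
    def start(cls, target, *, use_exec: bool = False):
        if cls._should_use_exec(use_exec):
            ...  # fork + immediate execv path
        else:
            ...  # existing bare-fork path
```

Task execution would pass `use_exec=True`; the DAG processor and triggerer keep the default and never hit the exec path.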
I wonder if we should also set the env var we have to not load settings in this exec'd process, to speed up `airflow` import?
On macOS, the task supervisor's bare `os.fork()` copies the parent's Objective-C runtime state into the child process. When the child later triggers ObjC class initialization (e.g. `socket.getaddrinfo` -> system DNS resolver -> Security.framework -> `+[NSNumber initialize]`), the runtime detects the corrupted state and crashes with SIGABRT/SIGSEGV.

This is a well-documented macOS platform limitation: Apple's ObjC runtime, CoreFoundation, and libdispatch are not fork-safe. CPython changed multiprocessing's default start method to "spawn" on macOS in 3.8 for this reason, but Airflow's TaskSDK supervisor uses `os.fork()` directly.

The fix: on macOS, immediately call `os.execv()` after `os.fork()` for task execution subprocesses. The exec replaces the child's address space, giving it clean ObjC state. The socketpair FDs survive across exec (marked inheritable) and the child reads their numbers from an environment variable.

Only task execution (`target=_subprocess_main`) uses fork+exec. The DAG processor and triggerer pass different targets and keep bare fork; they don't make network calls that trigger the macOS crash.

References:
- python/cpython#105912
- python/cpython#58037
- apache#24463
Address review feedback: instead of passing all 4 FD numbers via JSON env var, dup2 the requests/stdout/stderr sockets onto FDs 0/1/2 before exec (inheritable by default). Only the log channel FD needs explicit passing via _AIRFLOW_SUPERVISOR_LOG_FD.
Force-pushed c32f2e6 to e25461a
Instead of passing the log channel FD via env var, use the existing ResendLoggingFD protocol: the exec'd child starts with log_fd=0 (no structured logging), and after startup the task runner calls reinit_supervisor_comms() to request the log channel from the supervisor. This reuses the same mechanism as sudo/virtualenv re-exec rather than introducing a new env var.
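Handing the log channel back to the exec'd child means sending an FD over the already-open requests socket. Whether ResendLoggingFD uses exactly this wire format is an assumption, but SCM_RIGHTS FD passing looks roughly like this (Python 3.9+ `socket.send_fds`/`recv_fds`):

```python
import socket


def send_log_fd(requests_sock: socket.socket, log_fd: int) -> None:
    # SCM_RIGHTS: the kernel duplicates log_fd into the receiving process
    socket.send_fds(requests_sock, [b"logfd"], [log_fd])


def recv_log_fd(requests_sock: socket.socket) -> int:
    msg, fds, flags, addr = socket.recv_fds(requests_sock, 1024, maxfds=1)
    assert msg == b"logfd"
    return fds[0]
```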
Summary
Discovered while testing the AIP-99 `LLMSQLQueryOperator` example DAGs from #64824 on macOS. Tasks that make network calls (LLM API requests, HTTP calls) crash intermittently with `SIGSEGV` or `SIGABRT` when running via `airflow standalone` or any executor on macOS.

Root cause: `WatchedSubprocess.start()` in `supervisor.py` uses bare `os.fork()` to create task child processes. On macOS, the forked child inherits corrupted Objective-C runtime state from the parent. When the child later triggers ObjC class initialization -- for example via `socket.getaddrinfo()` -> macOS system DNS resolver -> Security.framework -> `+[NSNumber initialize]` -- the ObjC runtime detects the half-initialized state and deliberately crashes.

The fix: on macOS, call `os.execv()` immediately after `os.fork()` for task execution subprocesses. The exec replaces the child's address space with a fresh Python interpreter, giving it clean ObjC state. The socketpair FDs survive across exec (marked inheritable via `os.set_inheritable()`), and the child reads their FD numbers from the `_AIRFLOW_SUPERVISOR_CHILD_FDS` environment variable.

The crash chain
The faulthandler traceback that identified the root cause:
Why macOS only
Apple's ObjC runtime, CoreFoundation, and `libdispatch` are not fork-safe. This is why CPython changed `multiprocessing`'s default start method from `fork` to `spawn` on macOS in Python 3.8 (BPO-33725). Linux uses glibc's resolver, which has no ObjC dependency, so bare `fork()` works fine there.

Why only task execution, not DAG processor or triggerer
`WatchedSubprocess.start()` accepts a `target` parameter. Task execution passes `_subprocess_main`, while the DAG processor and triggerer pass different targets. Only task execution runs arbitrary user code that makes network calls (HTTP/DNS). The fix gates on `target is _subprocess_main`; the DAG processor and triggerer keep bare `fork()`.

Scope: all executors on macOS, not just LocalExecutor
This affects any executor running on macOS (Local, Celery worker, etc.) because the fork happens inside `supervise()` in the Task SDK, not in the executor itself. The executor spawns a worker process (which is safe -- `multiprocessing.Process` uses `spawn` on macOS), but that worker then calls `supervise()`, which does the bare `os.fork()`.

The two-fork architecture:
What we tried that didn't work
- (`pydantic_ai`, `datafusion`)
- `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`
- `NO_PROXY='*'` (`socket.getaddrinfo()` is a C library call that uses the macOS system resolver directly)
- importing `_scproxy` before fork
- `subprocess.Popen` instead of `os.fork()`

Why `fork+exec` and not `spawn`

Python's `multiprocessing.Process(start_method='spawn')` would also work, but it requires pickling the target function and arguments. The supervisor's communication is built around socketpairs created before fork, with FDs inherited by the child. fork+exec preserves this design: FDs marked inheritable survive across `execv()`, and the child reconstructs `socket.socket` objects from the FD numbers.

References
- `getaddrinfo` SIGSEGV after fork on macOS
- `_scproxy` crash after fork

Got it working on @vikramkoka's Dag on the 46th try! It took that much debugging, unfortunately.
Was generative AI tooling used to co-author this PR?