
Fix macOS SIGSEGV in task execution by using fork+exec#64874

Draft
kaxil wants to merge 3 commits into apache:main from astronomer:fix/macos-fork-exec-supervisor

Conversation


@kaxil kaxil commented Apr 7, 2026

Summary

Discovered while testing the AIP-99 LLMSQLQueryOperator example DAGs from #64824 on macOS. Tasks that make network calls (LLM API requests, HTTP calls) crash intermittently with SIGSEGV or SIGABRT when running via airflow standalone or any executor on macOS.

Root cause: WatchedSubprocess.start() in supervisor.py uses bare os.fork() to create task child processes. On macOS, the forked child inherits corrupted Objective-C runtime state from the parent. When the child later triggers ObjC class initialization -- for example via socket.getaddrinfo() -> macOS system DNS resolver -> Security.framework -> +[NSNumber initialize] -- the ObjC runtime detects the half-initialized state and deliberately crashes.

The fix: On macOS, call os.execv() immediately after os.fork() for task execution subprocesses. The exec replaces the child's address space with a fresh Python interpreter, giving it clean ObjC state. The socketpair FDs survive across exec (marked inheritable via os.set_inheritable()), and the child reads their FD numbers from the _AIRFLOW_SUPERVISOR_CHILD_FDS environment variable.
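The fork+exec pattern described above can be sketched as follows. This is a simplified illustration, not the real supervisor.py code: the env var name `_AIRFLOW_SUPERVISOR_CHILD_FDS` comes from the PR description, but `prepare_child_fds`, `exec_child`, and the `my_task_entrypoint` module are hypothetical names for this sketch.

```python
import json
import os
import socket
import sys

# Env var name taken from the PR description; everything else here is a
# simplified sketch, not the actual supervisor.py implementation.
_CHILD_FDS_ENV_VAR = "_AIRFLOW_SUPERVISOR_CHILD_FDS"


def prepare_child_fds() -> tuple[socket.socket, socket.socket, dict[str, int]]:
    """Create a supervisor<->child channel and mark the child's end
    inheritable so the FD survives os.execv().  Since PEP 446
    (Python 3.4), FDs are non-inheritable by default, so this opt-in
    is required for the socketpair to outlive the exec."""
    supervisor_sock, child_sock = socket.socketpair()
    os.set_inheritable(child_sock.fileno(), True)
    return supervisor_sock, child_sock, {"requests": child_sock.fileno()}


def exec_child(fd_map: dict[str, int]) -> None:
    """In the forked child: publish the FD numbers, then replace the
    process image with a fresh interpreter (clean ObjC runtime state).
    'my_task_entrypoint' is a placeholder module name."""
    os.environ[_CHILD_FDS_ENV_VAR] = json.dumps(fd_map)
    os.execv(sys.executable, [sys.executable, "-m", "my_task_entrypoint"])
```

The key detail is that `os.execv()` preserves open file descriptors that are marked inheritable, while wiping everything else — which is exactly what clears the corrupted ObjC state.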

The crash chain

supervisor.py: os.fork()
  -> child runs pydantic_ai.Agent.run_sync()
    -> httpx creates ThreadPoolExecutor for DNS
      -> socket.getaddrinfo() in worker thread
        -> macOS system resolver (not glibc)
          -> _scproxy / Security.framework
            -> ObjC runtime detects fork-unsafe state
              -> SIGABRT / SIGSEGV

The faulthandler traceback that identified the root cause:

Current thread (most recent call first):
  File "socket.py", line 978 in getaddrinfo
  File "concurrent/futures/thread.py", line 59 in run
  ...
Full trace
[2026-04-07 22:52:11] INFO - Using explicit credentials for provider with model 'anthropic:claude-sonnet-4-6': ['api_key'] source=airflow.task.hooks.airflow.providers.common.ai.hooks.pydantic_ai.PydanticAIHook loc=pydantic_ai.py:151
[2026-04-07 22:52:12] ERROR - Fatal Python error: Segmentation fault source=task.stderr
[2026-04-07 22:52:12] ERROR -  source=task.stderr
[2026-04-07 22:52:12] ERROR - Current thread 0x00000001708db000 (most recent call first): source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/socket.py", line 978 in getaddrinfo source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/concurrent/futures/thread.py", line 59 in run source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/concurrent/futures/thread.py", line 93 in _worker source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/threading.py", line 1012 in run source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/threading.py", line 1075 in _bootstrap_inner source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/threading.py", line 1032 in _bootstrap source=task.stderr
[2026-04-07 22:52:12] ERROR -  source=task.stderr
[2026-04-07 22:52:12] ERROR - Thread 0x00000001f61bf100 (most recent call first): source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/selectors.py", line 566 in select source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/asyncio/base_events.py", line 1961 in _run_once source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/asyncio/base_events.py", line 645 in run_forever source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/asyncio/base_events.py", line 678 in run_until_complete source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/pydantic_ai/agent/abstract.py", line 443 in run_sync source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Documents/GitHub/astronomer/airflow/providers/common/ai/src/airflow/providers/common/ai/operators/llm_sql.py", line 150 in execute source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/sdk/bases/operator.py", line 417 in wrapper source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 1528 in _execute_task source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 1112 in run source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 1697 in main source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 206 in _subprocess_main source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 388 in _fork_main source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 489 in start source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 953 in start source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 1995 in supervise source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/executors/local_executor.py", line 124 in _execute_work source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/Desktop/astro-projects/3.2.0/.venv/lib/python3.12/site-packages/airflow/executors/local_executor.py", line 96 in _run_worker source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/multiprocessing/process.py", line 108 in run source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/multiprocessing/spawn.py", line 135 in _main source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "/Users/kaxilnaik/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main source=task.stderr
[2026-04-07 22:52:12] ERROR -   File "<string>", line 1 in <module> source=task.stderr
[2026-04-07 22:52:12] ERROR -  source=task.stderr
[2026-04-07 22:52:12] CRITICAL - 
******************************************* Received SIGSEGV *******************************************
SIGSEGV (Segmentation Violation) signal indicates Segmentation Fault error which refers to
an attempt by a program/library to write or read outside its allocated memory.

In Python environment usually this signal refers to libraries which use low level C API.
Make sure that you use right libraries/Docker Images
for your architecture (Intel/ARM) and/or Operational System (Linux/macOS).

Suggested way to debug
======================
  - Set environment variable 'PYTHONFAULTHANDLER' to 'true'.
  - Start airflow services.
  - Restart failed airflow task.
  - Check 'scheduler' and 'worker' services logs for additional traceback
    which might contain information about module/library where actual error happen.

Known Issues
============

Note: Only Linux-based distros supported as "Production" execution environment for Airflow.

macOS
-----
 1. Due to limitations in Apple's libraries not every process might 'fork' safe.
    One of the general error is unable to query the macOS system configuration for network proxies.
    If your are not using a proxy you could disable it by set environment variable 'no_proxy' to '*'.
    See: https://github.com/python/cpython/issues/58037 and https://bugs.python.org/issue30385#msg293958
******************************************************************************************************** source=task
[2026-04-07 22:52:12] ERROR - Extension modules: yaml._yaml, sqlalchemy.cyextension.collections, sqlalchemy.cyextension.immutabledict, sqlalchemy.cyextension.processors, sqlalchemy.cyextension.resultproxy, sqlalchemy.cyextension.util, greenlet._greenlet, lazy_object_proxy.cext, _cffi_backend, msgspec._core, markupsafe._speedups, psutil._psutil_osx, pyarrow.lib (total: 13) source=task.stderr

Why macOS only

Apple's ObjC runtime, CoreFoundation, and libdispatch are not fork-safe. This is why CPython changed multiprocessing's default start method from fork to spawn on macOS in Python 3.8 (BPO-33725). Linux uses glibc's resolver which has no ObjC dependency, so bare fork() works fine there.
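The platform difference is observable directly from Python. A small check (illustrative only — the exact Linux default varies by Python version):

```python
import multiprocessing
import sys

# CPython's default multiprocessing start method is "spawn" on macOS
# since 3.8 (bpo-33725); Linux historically defaults to "fork".
default = multiprocessing.get_start_method()
if sys.platform == "darwin":
    assert default == "spawn"

# "spawn" is available on every platform, so fork-unsafe code can
# always opt in explicitly via a context:
ctx = multiprocessing.get_context("spawn")
```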

Why only task execution, not DAG processor or triggerer

WatchedSubprocess.start() accepts a target parameter. Task execution passes _subprocess_main, while DAG processor and triggerer pass different targets. Only task execution runs arbitrary user code that makes network calls (HTTP/DNS). The fix gates on target is _subprocess_main -- DAG processor and triggerer keep bare fork().

Scope: all executors on macOS, not just LocalExecutor

This affects any executor running on macOS (Local, Celery worker, etc.) because the fork happens inside supervise() in the Task SDK, not in the executor itself. The executor spawns a worker process (which is safe -- multiprocessing.Process uses spawn on macOS), but that worker then calls supervise() which does the bare os.fork().

The two-fork architecture:

Executor -> multiprocessing.Process (spawn, safe)
  -> worker calls supervise()
    -> os.fork() (bare fork, UNSAFE on macOS)
      -> child runs task

What we tried that didn't work

| Approach | Why it failed |
| --- | --- |
| Lazy imports (`pydantic_ai`, `datafusion`) | Crash happens at runtime during DNS resolution, not at import time |
| `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` | Undocumented Apple debug knob. Suppresses the abort but doesn't fix the underlying memory corruption. Cached at ObjC runtime init; unreliable on newer macOS |
| `NO_PROXY='*'` | Python-level env var; `socket.getaddrinfo()` is a C library call that uses the macOS system resolver directly |
| Pre-initialize `_scproxy` before fork | Fragile: any new dependency that touches ObjC frameworks would break it again |
| `subprocess.Popen` instead of `os.fork()` | Loses the socketpair FDs; the child can't communicate with the supervisor |

Why fork + exec and not spawn

Python's multiprocessing.Process(start_method='spawn') would also work, but it requires pickling the target function and arguments. The supervisor's communication is built around socketpairs created before fork, with FDs inherited by the child. fork + exec preserves this design: FDs marked inheritable survive across execv(), and the child reconstructs socket.socket objects from the FD numbers.
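The FD-number round-trip that makes this work can be demonstrated in a single process. This sketch mimics what the exec'd child does: only a bare FD number "survives", and `socket.socket(fileno=...)` reconstructs a working socket object from it.

```python
import socket

# Supervisor side: create the channel.
supervisor_end, child_end = socket.socketpair()
fd = child_end.fileno()

# Detach so closing the original wrapper won't close the FD itself --
# after a real exec only the bare FD number is left anyway.
child_end.detach()

# "Child side": rebuild a socket object from the inherited FD number.
# socket.socket(fileno=...) auto-detects family/type from the FD.
rebuilt = socket.socket(fileno=fd)

supervisor_end.sendall(b"startup details")
assert rebuilt.recv(1024) == b"startup details"
```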

References

Got it working on @vikramkoka's Dag on the 46th try! It took that much debugging, unfortunately

- [ ] Was generative AI tooling used to co-author this PR?

@kaxil kaxil requested a review from potiuk April 7, 2026 23:26
Comment on lines +490 to +494
fds = json.loads(os.environ.pop(_CHILD_FDS_ENV_VAR))
child_requests = _socket.socket(fileno=fds["requests"])
child_stdout = _socket.socket(fileno=fds["stdout"])
child_stderr = _socket.socket(fileno=fds["stderr"])
log_fd = fds["logs"]
Member


We've already got a mechanism to pass the fd and re-open the logs socket. We should use that rather than implement a new way. Or at least we only need to pass the log socket fd, as all the others are guaranteed to be 0, 1, 2

Also stdin, stdout and stderr are inheritable by default and kept across exec, so we shouldn't need to handle those differently at all

Member Author


Good call. Pushed 87883db -- now dup2s onto 0/1/2 before exec (no set_inheritable needed for those), and the log channel uses the existing ResendLoggingFD + reinit_supervisor_comms() mechanism rather than a new env var. The exec'd child starts with log_fd=0 (structured logging skipped), sets _AIRFLOW_FORK_EXEC=1, and main() in task_runner.py calls reinit_supervisor_comms() after get_startup_details() to request the log channel. Same flow as the sudo/virtualenv re-exec path.
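The dup2-onto-0/1/2 trick described in that reply relies on a property of `os.dup2()`: unlike freshly created FDs, duplicates made by `dup2` are inheritable by default (PEP 446), so no `set_inheritable()` call is needed. A sketch, demonstrated without clobbering this process's real stdio (`dup_onto` is a hypothetical helper):

```python
import os
import socket

def dup_onto(sock: socket.socket, target_fd: int) -> int:
    """Park a socket's FD on a fixed number via dup2.  FDs created by
    os.dup2() are inheritable by default (PEP 446), so channels moved
    onto 0/1/2 this way survive os.execv() with no extra work."""
    os.dup2(sock.fileno(), target_fd)
    return target_fd

supervisor_end, child_end = socket.socketpair()
# Reserve a currently-free FD number (a real child would use 0/1/2).
reserved = os.dup(child_end.fileno())
fd = dup_onto(child_end, reserved)
assert os.get_inheritable(fd) is True  # inheritable, unlike os.dup()'s result
```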

with suppress(BaseException):
    print(f"execv failed, exiting with code 124: {e}", file=sys.stderr)
    traceback.print_exception(type(e), e, e.__traceback__, file=sys.stderr)
else:
Member


Style/diff nit:

No need for the else here, as the if block never exits; the else can be removed and its contents un-indented

Member Author


The else is I think actually needed here. os.execv() is inside a try/except BaseException, so if execv fails, the except prints the error and falls through. Without the else we'd also run _fork_main on a half-broken macOS child. Could move os._exit(124) into the except block so the if-branch always terminates, then drop the else and un-indent. lmk if you prefer that.
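The restructure proposed in that reply — move `os._exit(124)` into the failure path so the exec branch always terminates — can be sketched as below. `exec_or_die` is a hypothetical name, with the exec and exit calls injectable purely so the failure path is testable:

```python
import os
import sys
import traceback
from typing import Callable

def exec_or_die(
    argv: list[str],
    exec_fn: Callable[[str, list[str]], None] = os.execv,
    exit_fn: Callable[[int], None] = os._exit,
) -> None:
    """Replace the process image; on any failure, report and hard-exit
    with code 124 so a half-broken forked child never falls through
    into _fork_main."""
    try:
        exec_fn(argv[0], argv)
    except BaseException as e:
        try:
            print(f"execv failed, exiting with code 124: {e}", file=sys.stderr)
            traceback.print_exception(type(e), e, e.__traceback__, file=sys.stderr)
        except BaseException:
            pass
    # A successful execv never returns, so reaching this line means it raised.
    exit_fn(124)
```

With this shape the branch always terminates, so the `else` (and its extra indentation) can be dropped.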

# We can't use log here, as if we except out of _fork_main something _weird_ went on.
print("Exception in _fork_main, exiting with code 124", file=sys.stderr)
traceback.print_exception(type(e), e, e.__traceback__, file=sys.stderr)
if _USE_FORK_EXEC and target is _subprocess_main:
Member


I'm not sure how i feel about the target is _subprocess_main part...

Member Author


Yeah fair. Only task execution runs user code that makes network calls (HTTP/DNS), DAG processor and triggerer don't hit the ObjC crash. A use_exec: bool = False kwarg on start() would be cleaner, lets callers opt in explicitly. Want me to switch to that?


ashb commented Apr 8, 2026

I wonder if we should also set the env var we have to not load settings in this exec'd process, to speed up airflow import?

kaxil added 2 commits April 9, 2026 16:45
On macOS, the task supervisor's bare os.fork() copies the parent's
Objective-C runtime state into the child process.  When the child
later triggers ObjC class initialization (e.g. socket.getaddrinfo ->
system DNS resolver -> Security.framework -> +[NSNumber initialize]),
the runtime detects the corrupted state and crashes with SIGABRT/SIGSEGV.

This is a well-documented macOS platform limitation -- Apple's ObjC
runtime, CoreFoundation, and libdispatch are not fork-safe.  CPython
changed multiprocessing's default start method to "spawn" on macOS in
3.8 for this reason, but Airflow's TaskSDK supervisor uses os.fork()
directly.

The fix: on macOS, immediately call os.execv() after os.fork() for
task execution subprocesses.  The exec replaces the child's address
space, giving it clean ObjC state.  The socketpair FDs survive across
exec (marked inheritable) and the child reads their numbers from an
environment variable.

Only task execution (target=_subprocess_main) uses fork+exec.  DAG
processor and triggerer pass different targets and keep bare fork --
they don't make network calls that trigger the macOS crash.

References:
- python/cpython#105912
- python/cpython#58037
- apache#24463
Address review feedback: instead of passing all 4 FD numbers via
JSON env var, dup2 the requests/stdout/stderr sockets onto FDs
0/1/2 before exec (inheritable by default). Only the log channel
FD needs explicit passing via _AIRFLOW_SUPERVISOR_LOG_FD.
@kaxil kaxil force-pushed the fix/macos-fork-exec-supervisor branch from c32f2e6 to e25461a April 9, 2026 15:45
Instead of passing the log channel FD via env var, use the existing
ResendLoggingFD protocol: the exec'd child starts with log_fd=0
(no structured logging), and after startup the task runner calls
reinit_supervisor_comms() to request the log channel from the
supervisor. This reuses the same mechanism as sudo/virtualenv
re-exec rather than introducing a new env var.
