Reproducer test for #2431 (getaddrinfo_a use-after-free) #2433

Merged
yhirose merged 6 commits into master from issue-2431-repro
Apr 28, 2026

@yhirose yhirose commented Apr 27, 2026

Summary

Adds a CI-gated reproducer for #2431: the use-after-free in detail::getaddrinfo_with_timeout() on Linux/glibc. When gai_suspend() hits the connection timeout, the non-blocking gai_cancel() can return EAI_NOTCANCELED, which leaves the resolver worker thread writing into the destroyed stack frame after the function has already returned.

This PR adds only the reproducer, not the fix. The new issue-2431 repro (Linux + ASAN) CI job is expected to fail on this branch; that failure is the proof that the test exercises the bug. The follow-up fix lives in #2434, where the same job goes green.

What's included

  • test/test.cc — three opt-in gtest cases under GetAddrInfoAsyncCancelTest.*:
    • DirectCallSingleThread — tight loop calling detail::getaddrinfo_with_timeout() with a 1s timeout against unique unresolvable hostnames.
    • DirectCallMultiThread — same, but from 8 worker threads concurrently.
    • ClientGetMultiThread exercises the same path through the high-level Client::Get() API (matching the original repro from #2431).
    • All three are guarded by #if defined(__linux__) && defined(__GLIBC__) && defined(CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO) and skipped at runtime unless CPPHTTPLIB_TEST_ISSUE_2431=1 is set, so normal make test runs are unaffected.
  • test/dns_test_fixture.py — small (~50 line, stdlib-only) loopback UDP responder that answers DNS queries after a 3s delay. The deliberate delay (longer than the test's 1s timeout) is what gives the resolver worker thread something to write back after the caller's stack frame is gone.
  • .github/workflows/test.yaml — new issue-2431-repro job on ubuntu-latest. It starts the test fixture on 127.0.0.1:15353, installs an iptables -t nat OUTPUT -p udp --dport 53 -j REDIRECT --to-port 15353 rule so glibc's lookups land on the fixture (without touching /etc/resolv.conf), and runs the gtest filter under setarch -R with ASAN_OPTIONS=detect_stack_use_after_return=1. A tear down test fixture step (if: always()) flushes the iptables rule and stops the fixture.
  • test/run_issue_2431_repro.sh — Docker-based runner with the same fixture wiring so the failure can be reproduced locally on macOS / non-Linux dev machines.
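
The runtime gating and the per-iteration hostnames described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in test/test.cc: the helper names `issue_2431_repro_enabled` and `unique_unresolvable_host` are hypothetical, and the reserved `.invalid` TLD is an assumption about how one might build names that never legitimately resolve.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: true only when the opt-in variable is exactly "1".
inline bool issue_2431_repro_enabled() {
  const char *v = std::getenv("CPPHTTPLIB_TEST_ISSUE_2431");
  return v != nullptr && std::strcmp(v, "1") == 0;
}

// Hypothetical helper: builds a distinct hostname per (thread, iteration)
// pair so that no cached answer can short-circuit the lookup.
// Assumption: ".invalid" is used because it is a reserved TLD.
inline std::string unique_unresolvable_host(int thread_id, int iteration) {
  assert(thread_id >= 0 && iteration >= 0);
  return "issue2431-t" + std::to_string(thread_id) + "-i" +
         std::to_string(iteration) + ".invalid";
}
```

A gtest case could then start with something like `if (!issue_2431_repro_enabled()) GTEST_SKIP();` so that it stays inert in a normal test run.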

Why a delayed-reply fixture, not a DNS sinkhole

Earlier iterations of this PR dropped UDP/53 outright. That made glibc hang waiting for an answer that would never arrive, so the worker never reached the buggy write-back path, and the broken HEAD and any candidate fix produced the same CI behaviour (tests pass, then "Cleaning up orphan processes" hangs for ~1h). Replacing the sinkhole with a fixture that answers after the test's timeout is what actually exercises the bug.
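
The difference between the two setups can be modelled with a small simulation. This is only a sketch under stated assumptions: `Upstream`, `Outcome`, and `simulate_lookup` are hypothetical names, the durations are scaled down to milliseconds, and the shared state is heap-allocated so the model is safe to run (in the real bug the worker's target was a stack frame). It shows that a sinkhole never produces a write-back after the caller gives up, while a delayed reply does.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical model of the two upstream behaviours discussed above.
enum class Upstream { Sinkhole, DelayedReply };

struct Outcome {
  bool caller_timed_out; // the caller gave up before any reply arrived
  bool late_write_back;  // the worker wrote its result after the caller gave up
};

// Simulates one lookup; reply_delay is ignored for Upstream::Sinkhole.
inline Outcome simulate_lookup(Upstream upstream,
                               std::chrono::milliseconds caller_timeout,
                               std::chrono::milliseconds reply_delay,
                               std::chrono::milliseconds observe_for) {
  assert(caller_timeout.count() > 0 && observe_for.count() > 0);
  struct State {
    std::mutex m;
    std::condition_variable cv;
    bool replied = false;
    bool shutdown = false;
  };
  auto st = std::make_shared<State>();

  std::thread worker([st, upstream, reply_delay] {
    std::unique_lock<std::mutex> lk(st->m);
    if (upstream == Upstream::Sinkhole) {
      // No answer ever arrives: block until the simulation shuts down.
      st->cv.wait(lk, [&] { return st->shutdown; });
      return;
    }
    // The answer arrives after reply_delay unless we are shut down first.
    bool stopped =
        st->cv.wait_for(lk, reply_delay, [&] { return st->shutdown; });
    if (!stopped) {
      st->replied = true; // the write-back
      st->cv.notify_all();
    }
  });

  Outcome out{false, false};
  std::unique_lock<std::mutex> lk(st->m);
  out.caller_timed_out =
      !st->cv.wait_for(lk, caller_timeout, [&] { return st->replied; });
  // The caller has "returned"; keep watching for a write-back that follows.
  bool replied = st->cv.wait_for(lk, observe_for, [&] { return st->replied; });
  out.late_write_back = out.caller_timed_out && replied;
  st->shutdown = true;
  lk.unlock();
  st->cv.notify_all();
  worker.join();
  return out;
}
```

With a short caller timeout, `Upstream::Sinkhole` yields a timeout and no late write-back, whereas `Upstream::DelayedReply` with a delay longer than the timeout yields both, which is the condition the reproducer needs.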

Expected CI outcome on this branch

  • issue-2431 repro (Linux + ASAN) fails fast at the test step itself (~2 min total). Currently surfaces as DirectCallMultiThread exiting with SIGSEGV (exit 139) when the resolver worker writes back into reused stack memory; depending on the runner's stack reuse pattern, ASAN may instead emit a stack-use-after-return diagnostic. Either signal is a clear test-step failure, not a job-level timeout.
  • ✓ All other jobs unaffected (the new tests are skipped without the env var).

Test plan

  • CI: issue-2431 repro (Linux + ASAN) job fails at the test step in ~2 min (proof the test exercises the bug).
  • CI: ubuntu (openssl|mbedtls|wolfssl|no-tls), macos, windows, test-no-exceptions, etc. all pass unchanged.
  • Local: bash test/run_issue_2431_repro.sh reproduces the failure inside Docker on a dev machine.

Refs: #2431. Fix: #2434.

yhirose and others added 5 commits April 27, 2026 22:25
On Linux/glibc, getaddrinfo_with_timeout() runs DNS asynchronously via
getaddrinfo_a(GAI_NOWAIT) using a stack-local gaicb. When gai_suspend()
hits the connection timeout, gai_cancel() is called and the function
returns immediately — but gai_cancel() is non-blocking and can return
EAI_NOTCANCELED, leaving the resolver worker thread alive and still
referencing the destroyed stack frame.

Adds three opt-in gtest cases (GetAddrInfoAsyncCancelTest.*) that
exercise the cancel path repeatedly. They are gated on Linux/glibc +
CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO at compile time, and on the
CPPHTTPLIB_TEST_ISSUE_2431=1 env var at runtime, so a normal `make
test` run is unaffected.

Also adds a dedicated CI job (issue-2431-repro) and a Docker-based
local runner (test/run_issue_2431_repro.sh) that sinkhole UDP/53 so
the timeout branch is taken, and run the test under ASAN/LSAN. With
the bug present these runs are expected to fail; with a fix applied
they should pass.

Refs: #2431

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The new GetAddrInfoAsyncCancelTest cases call detail::getaddrinfo_with_timeout
directly. In split builds (make test_split) split.py moves the definition into
httplib.cc and strips `inline`, so the symbol is not declared in the public
httplib.h and test.cc fails to compile -- breaking the ubuntu/test-no-exceptions
CI jobs that the PR description says should be unaffected.

Add a forward declaration in test.cc, gated by the same #if as the tests
themselves, so it links against the split-build symbol without changing the
header-only build.
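
The forward-declaration approach can be illustrated with a simplified stand-in. Everything named here is hypothetical: `DEMO_ENABLE_REPRO_TESTS` stands in for the tests' real `#if` condition, and `demo_detail::lookup_with_timeout` has a made-up two-argument signature rather than the library's actual one. The definition is placed in the same file only to keep the sketch self-contained; in a split build it would live in a separate translation unit.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical gate standing in for the condition that guards the tests.
#define DEMO_ENABLE_REPRO_TESTS 1

#if DEMO_ENABLE_REPRO_TESTS
namespace demo_detail {
// Forward declaration: tells the compiler the symbol exists even though
// no header declares it; the linker resolves it against the definition.
int lookup_with_timeout(const char *node, std::time_t timeout_sec);
} // namespace demo_detail

// Test-side code can now call the function by its qualified name.
inline int call_lookup(const char *host) {
  return demo_detail::lookup_with_timeout(host, 1);
}
#endif

// Stand-in for the out-of-line definition that a split build would place
// in a separate .cc file (note: not marked inline).
namespace demo_detail {
int lookup_with_timeout(const char *node, std::time_t timeout_sec) {
  return (node != nullptr && timeout_sec > 0) ? 0 : -1;
}
} // namespace demo_detail
```

Because the declaration sits under the same gate as the tests, builds that compile the tests out never see it.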
The bug manifests as orphan getaddrinfo_a resolver workers that keep the
runner from completing job teardown -- the previous run had all steps
succeed in ~1m37s but then hung in "Cleaning up orphan processes" for
~57m before GitHub force-killed the job.

A job-level timeout-minutes makes the failure signal fast and predictable:
bug present -> killed at 5 min, bug fixed -> ~2 min pass. Step-level timeout
isn't enough since the hang is in post-job cleanup, not the test step.
The bug is a textbook stack-use-after-return: a stack-local struct gaicb
is destroyed when getaddrinfo_with_timeout returns after gai_cancel()
yields EAI_NOTCANCELED, then the still-live resolver worker thread writes
back into the freed frame. ASAN's detect_stack_use_after_return is the
direct detector for exactly this pattern -- enabling it lets the failure
surface as a clear ASAN diagnostic during the test run instead of as an
orphan-process hang at job teardown.
The option did not detect the bug in CI -- the resolver worker write
likely lands on the heap (via the gaicb's pai pointer) or happens after
the test process exits, neither of which stack-use-after-return can
catch. Roll back to relying on the job-level timeout: bug present ->
post-cleanup hangs ~8min then job-level timeout cancels at 10min total;
bug fixed -> job completes in ~2min.
The previous repro setup dropped UDP/53 outright, which made glibc's
resolver hang forever on every lookup -- the worker never actually
received a response and so never reached the buggy write-back path
that #2431 is about. As a result, neither the broken HEAD nor the
fix made any visible difference in CI: both produced "tests pass +
post-cleanup hangs ~10min" because the orphan resolver thread is a
structural property of *any* getaddrinfo path on a hung resolver,
not a property of the bug.

Replace the sinkhole with a small loopback test fixture
(test/dns_test_fixture.py, ~50 lines, stdlib only) that answers DNS
queries after a 3s delay -- longer than the test's 1s timeout. An
iptables NAT rule routes the test job's lookups to the fixture
without touching /etc/resolv.conf, so the rest of the runner's DNS
behaviour is unaffected.

With ASAN's detect_stack_use_after_return enabled, the worker's
late write-back into the destroyed gaicb stack frame is now caught
as a stack-use-after-return diagnostic, so the broken HEAD fails
fast at the test step (clear red) and the fix turns the same job
green in well under a minute.

Same fixture is wired into both the GitHub Actions job and the
docker-based test/run_issue_2431_repro.sh script, so local repro on
macOS and CI repro on Linux exercise the identical path.
yhirose added a commit that referenced this pull request Apr 28, 2026
The Linux/glibc branch of detail::getaddrinfo_with_timeout used
getaddrinfo_a(GAI_NOWAIT) with a stack-local struct gaicb. On the
connection-timeout branch it called gai_cancel(), which is non-blocking
and may return EAI_NOTCANCELED -- in that case the resolver worker
thread is still alive and writes back to ar_result on the now-destroyed
stack frame after the function has already returned.

Drop the entire #elif _GNU_SOURCE && __GLIBC__ branch and let glibc
fall through to the existing std::thread + std::shared_ptr<State>
implementation that the file already uses for other Unix systems. That
path captures shared ownership in the resolver lambda, so the state
outlives the caller's frame whether or not the worker finishes in
time -- no stack frame is ever referenced after return.

The reproducer added in #2433 (issue-2431 repro CI job) goes from
hanging at job teardown to passing in ~25s with this change.
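
The shared-ownership idea described in this commit message can be sketched generically. This is not the library's implementation: `run_with_timeout` and its `State` struct are hypothetical, and an arbitrary `int`-returning task stands in for the blocking resolver call. The point is that the worker lambda copies the `std::shared_ptr`, so the state it writes to stays alive even when the caller has already timed out and returned.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical helper: runs `task` on a detached thread and waits up to
// `timeout` for its result. Returns true and fills `out` only on success.
inline bool run_with_timeout(const std::function<int()> &task,
                             std::chrono::milliseconds timeout, int &out) {
  assert(static_cast<bool>(task));
  struct State {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    int value = 0;
  };
  auto st = std::make_shared<State>();

  // The lambda copies `st`, so the worker co-owns the state and can write
  // its result safely no matter when the caller returns.
  std::thread([st, task] {
    int v = task();
    {
      std::lock_guard<std::mutex> lk(st->m);
      st->value = v;
      st->done = true;
    }
    st->cv.notify_one();
  }).detach();

  std::unique_lock<std::mutex> lk(st->m);
  if (!st->cv.wait_for(lk, timeout, [&] { return st->done; })) {
    return false; // timed out; the worker may still be running
  }
  out = st->value;
  return true;
}
```

On timeout the caller simply drops its reference; the state is freed when the worker finishes and releases the last one. The trade-off in this sketch is that a slow task keeps its thread alive until it completes.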
@yhirose yhirose merged commit d14e4fc into master Apr 28, 2026
36 of 37 checks passed
@yhirose yhirose deleted the issue-2431-repro branch April 28, 2026 09:17
yhirose added a commit that referenced this pull request Apr 28, 2026
…2436)
