Reproducer test for #2431 (getaddrinfo_a use-after-free) by yhirose · Pull Request #2433 · yhirose/cpp-httplib

yhirose · 2026-04-27T13:26:30Z

Summary

Adds a CI-gated reproducer for #2431 — the use-after-free in detail::getaddrinfo_with_timeout() on Linux/glibc when gai_suspend() hits the connection timeout and the non-blocking gai_cancel() returns EAI_NOTCANCELED, leaving the resolver worker thread writing into the destroyed stack frame after the function has already returned.

This PR only adds the reproducer, not the fix. The new issue-2431 repro (Linux + ASAN) CI job is expected to fail on this branch; that failure is the proof that the test exercises the bug. The follow-up fix lives in #2434, where the same job goes green.

What's included

- test/test.cc: three opt-in gtest cases under GetAddrInfoAsyncCancelTest.*:
  - DirectCallSingleThread: tight loop calling detail::getaddrinfo_with_timeout() with a 1s timeout against unique unresolvable hostnames.
  - DirectCallMultiThread: same, but from 8 worker threads concurrently.
  - ClientGetMultiThread: exercises the same path through the high-level Client::Get() API (matching the original repro from #2431, "Bug: getaddrinfo_with_timeout use-after-free on Linux (getaddrinfo_a path)").
  - All three are guarded by #if defined(__linux__) && defined(__GLIBC__) && defined(CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO) and skipped at runtime unless CPPHTTPLIB_TEST_ISSUE_2431=1 is set, so normal make test runs are unaffected.
- test/dns_test_fixture.py: small (~50 line, stdlib-only) loopback UDP responder that answers DNS queries after a 3s delay. The deliberate delay (longer than the test's 1s timeout) is what gives the resolver worker thread something to write back after the caller's stack frame is gone.
- .github/workflows/test.yaml: new issue-2431-repro job on ubuntu-latest. It starts the test fixture on 127.0.0.1:15353, installs an iptables -t nat OUTPUT -p udp --dport 53 -j REDIRECT --to-port 15353 rule so glibc's lookups land on the fixture (without touching /etc/resolv.conf), and runs the gtest filter under setarch -R with ASAN_OPTIONS=detect_stack_use_after_return=1. A "tear down test fixture" step (if: always()) flushes the iptables rule and stops the fixture.
- test/run_issue_2431_repro.sh: Docker-based runner with the same fixture wiring so the failure can be reproduced locally on macOS / non-Linux dev machines.
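The two-level gating described in the list above (compile-time platform check plus a runtime environment variable) can be sketched roughly as follows. This is a minimal illustration, not the code in test/test.cc: the helper names issue_2431_compiled_in(), issue_2431_runtime_enabled() and should_run_issue_2431_tests() are hypothetical, and the exact string comparison against "1" is an assumption about how the variable is interpreted.

```cpp
#include <cstdlib>
#include <cstring>

// Compile-time gate: mirrors the #if condition quoted in the PR description.
// On any other platform or configuration the tests are not built at all.
inline bool issue_2431_compiled_in() {
#if defined(__linux__) && defined(__GLIBC__) &&                                \
    defined(CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO)
  return true;
#else
  return false;
#endif
}

// Runtime gate (hypothetical helper): true only when the environment variable
// CPPHTTPLIB_TEST_ISSUE_2431 is set to exactly "1" (assumed interpretation).
inline bool issue_2431_runtime_enabled() {
  const char *value = std::getenv("CPPHTTPLIB_TEST_ISSUE_2431");
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// A test body would return early (or skip) unless both gates are open.
inline bool should_run_issue_2431_tests() {
  return issue_2431_compiled_in() && issue_2431_runtime_enabled();
}
```

In a gtest case, a check like this would typically sit at the top of the test body and skip the test when it returns false, which is what keeps ordinary make test runs unaffected.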

Why a delayed-reply fixture, not a DNS sinkhole

Earlier iterations of this PR dropped UDP/53 outright. That made glibc hang waiting for an answer that would never arrive, so the worker never reached the buggy write-back path, and the broken HEAD and any candidate fix produced the same CI behaviour (tests pass, then "Cleaning up orphan processes" hangs for ~1h). Replacing the sinkhole with a fixture that answers after the test's timeout is what actually exercises the bug.

Expected CI outcome on this branch

✗ issue-2431 repro (Linux + ASAN) — fails fast at the test step itself (~2 min total). Currently surfaces as DirectCallMultiThread exiting with SIGSEGV (exit 139) when the resolver worker writes back into reused stack memory; depending on the runner's stack reuse pattern ASAN may instead emit a stack-use-after-return diagnostic. Either signal is a clear test-step failure, not a job-level timeout.
✓ All other jobs unaffected (the new tests are skipped without the env var).

Test plan

CI: issue-2431 repro (Linux + ASAN) job fails at the test step in ~2 min (proof the test exercises the bug).
CI: ubuntu (openssl|mbedtls|wolfssl|no-tls), macos, windows, test-no-exceptions, etc. all pass unchanged.
Local: bash test/run_issue_2431_repro.sh reproduces the failure inside Docker on a dev machine.

Refs: #2431. Fix: #2434.

On Linux/glibc, getaddrinfo_with_timeout() runs DNS asynchronously via getaddrinfo_a(GAI_NOWAIT) using a stack-local gaicb. When gai_suspend() hits the connection timeout, gai_cancel() is called and the function returns immediately, but gai_cancel() is non-blocking and can return EAI_NOTCANCELED, leaving the resolver worker thread alive and still referencing the destroyed stack frame.

Adds three opt-in gtest cases (GetAddrInfoAsyncCancelTest.*) that exercise the cancel path repeatedly. They are gated on Linux/glibc + CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO at compile time, and on the CPPHTTPLIB_TEST_ISSUE_2431=1 env var at runtime, so a normal `make test` run is unaffected.

Also adds a dedicated CI job (issue-2431-repro) and a Docker-based local runner (test/run_issue_2431_repro.sh) that sinkhole UDP/53 so the timeout branch is taken, and run the test under ASAN/LSAN. With the bug present these runs are expected to fail; with a fix applied they should pass.

Refs: #2431

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The new GetAddrInfoAsyncCancelTest cases call detail::getaddrinfo_with_timeout directly. In split builds (make test_split) split.py moves the definition into httplib.cc and strips `inline`, so the symbol is not declared in the public httplib.h and test.cc fails to compile -- breaking the ubuntu/test-no-exceptions CI jobs that the PR description says should be unaffected. Add a forward declaration in test.cc, gated by the same #if as the tests themselves, so it links against the split-build symbol without changing the header-only build.

The bug manifests as orphan getaddrinfo_a resolver workers that keep the runner from completing job teardown -- the previous run had all steps succeed in ~1m37s but then hung in "Cleaning up orphan processes" for ~57m before GitHub force-killed the job. A job-level timeout-minutes makes the failure signal fast and predictable: bug present -> killed at 5 min, bug fixed -> ~2 min pass. Step-level timeout isn't enough since the hang is in post-job cleanup, not the test step.

The bug is a textbook stack-use-after-return: a stack-local struct gaicb is destroyed when getaddrinfo_with_timeout returns after gai_cancel() yields EAI_NOTCANCELED, then the still-live resolver worker thread writes back into the freed frame. ASAN's detect_stack_use_after_return is the direct detector for exactly this pattern -- enabling it lets the failure surface as a clear ASAN diagnostic during the test run instead of as an orphan-process hang at job teardown.

The option did not detect the bug in CI -- the resolver worker write likely lands on the heap (via the gaicb's pai pointer) or happens after the test process exits, neither of which stack-use-after-return can catch. Roll back to relying on the job-level timeout: bug present -> post-cleanup hangs ~8min then job-level timeout cancels at 10min total; bug fixed -> job completes in ~2min.

The previous repro setup dropped UDP/53 outright, which made glibc's resolver hang forever on every lookup -- the worker never actually received a response and so never reached the buggy write-back path that #2431 is about. As a result, neither the broken HEAD nor the fix made any visible difference in CI: both produced "tests pass + post-cleanup hangs ~10min" because the orphan resolver thread is a structural property of *any* getaddrinfo path on a hung resolver, not a property of the bug.

Replace the sinkhole with a small loopback test fixture (test/dns_test_fixture.py, ~50 lines, stdlib only) that answers DNS queries after a 3s delay -- longer than the test's 1s timeout. An iptables NAT rule routes the test job's lookups to the fixture without touching /etc/resolv.conf, so the rest of the runner's DNS behaviour is unaffected.

With ASAN's detect_stack_use_after_return enabled, the worker's late write-back into the destroyed gaicb stack frame is now caught as a stack-use-after-return diagnostic, so the broken HEAD fails fast at the test step (clear red) and the fix turns the same job green in well under a minute.

The same fixture is wired into both the GitHub Actions job and the Docker-based test/run_issue_2431_repro.sh script, so local repro on macOS and CI repro on Linux exercise the identical path.

The Linux/glibc branch of detail::getaddrinfo_with_timeout used getaddrinfo_a(GAI_NOWAIT) with a stack-local struct gaicb. On the connection-timeout branch it called gai_cancel(), which is non-blocking and may return EAI_NOTCANCELED -- in that case the resolver worker thread is still alive and writes back to ar_result on the now-destroyed stack frame after the function has already returned.

Drop the entire #elif _GNU_SOURCE && __GLIBC__ branch and let glibc fall through to the existing std::thread + std::shared_ptr<State> implementation that the file already uses for other Unix systems. That path captures shared ownership in the resolver lambda, so the state outlives the caller's frame whether or not the worker finishes in time -- no stack frame is ever referenced after return.

The reproducer added in #2433 (issue-2431 repro CI job) goes from hanging at job teardown to passing in ~25s with this change.
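The shared-ownership idea behind the std::thread + std::shared_ptr<State> path can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation in httplib.h: run_with_timeout() is a hypothetical helper, the fields of State are invented for the example, and a generic int-returning callable stands in for the blocking getaddrinfo() call.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical state shared between the caller and the worker thread.
// It lives on the heap and is freed only when the last owner releases it.
struct State {
  std::mutex mtx;
  std::condition_variable cv;
  bool done = false;
  int result = 0;
};

// Runs `work` on a detached thread and waits up to `timeout` for it.
// Returns true and fills `out` if the worker finished in time, false otherwise.
// On timeout the worker keeps running, but it only touches the heap-allocated
// State it co-owns, never the caller's stack frame.
inline bool run_with_timeout(std::function<int()> work,
                             std::chrono::milliseconds timeout, int &out) {
  auto state = std::make_shared<State>();

  // The lambda copies the shared_ptr, so State outlives this function
  // for as long as the worker still needs it.
  std::thread([state, work]() {
    int r = work();
    {
      std::lock_guard<std::mutex> lock(state->mtx);
      state->result = r;
      state->done = true;
    }
    state->cv.notify_one();
  }).detach();

  std::unique_lock<std::mutex> lock(state->mtx);
  if (!state->cv.wait_for(lock, timeout, [&] { return state->done; })) {
    return false; // timed out; the late write lands in State, not on our stack
  }
  out = state->result;
  return true;
}
```

The contrast with the removed branch is that a late-finishing worker here writes into memory whose lifetime it shares, whereas the stack-local gaicb had no owner once the caller returned. The cost of this design is one extra thread per call and a worker that may linger until the blocking call completes.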


yhirose and others added 5 commits April 27, 2026 22:25

yhirose mentioned this pull request Apr 28, 2026: Fix #2431: drop getaddrinfo_a path (stack-use-after-free) #2434 (Closed)

yhirose merged commit d14e4fc into master Apr 28, 2026
36 of 37 checks passed

yhirose deleted the issue-2431-repro branch April 28, 2026 09:17

yhirose mentioned this pull request Apr 28, 2026: Fix #2431: drop getaddrinfo_a path (stack-use-after-free) #2436 (Merged)

yhirose merged 6 commits into master from issue-2431-repro
