Reproducer test for #2431 (getaddrinfo_a use-after-free) #2433
Merged
On Linux/glibc, getaddrinfo_with_timeout() runs DNS asynchronously via getaddrinfo_a(GAI_NOWAIT) using a stack-local gaicb. When gai_suspend() hits the connection timeout, gai_cancel() is called and the function returns immediately, but gai_cancel() is non-blocking and can return EAI_NOTCANCELED, leaving the resolver worker thread alive and still referencing the destroyed stack frame.

Adds three opt-in gtest cases (GetAddrInfoAsyncCancelTest.*) that exercise the cancel path repeatedly. They are gated on Linux/glibc + CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO at compile time, and on the CPPHTTPLIB_TEST_ISSUE_2431=1 env var at runtime, so a normal `make test` run is unaffected.

Also adds a dedicated CI job (issue-2431-repro) and a Docker-based local runner (test/run_issue_2431_repro.sh) that sinkhole UDP/53 so the timeout branch is taken, and run the test under ASAN/LSAN. With the bug present these runs are expected to fail; with a fix applied they should pass.

Refs: #2431
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
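The two-level gating described in this commit message might look roughly like the sketch below. The helper name `issue_2431_repro_enabled` and the constant `kIssue2431CompiledIn` are hypothetical and not taken from the PR; only the macro names and the environment variable come from the text above. The real tests presumably skip through gtest's `GTEST_SKIP()`, which is left out here so the sketch stays self-contained.

```cpp
#include <cstdlib>
#include <cstring>

// Compile-time gate: mirrors the condition named in the commit message.
// kIssue2431CompiledIn is a hypothetical name used only for illustration.
#if defined(__linux__) && defined(__GLIBC__) && \
    defined(CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO)
constexpr bool kIssue2431CompiledIn = true;
#else
constexpr bool kIssue2431CompiledIn = false;
#endif

// Runtime gate (hypothetical helper): true only when the environment
// variable CPPHTTPLIB_TEST_ISSUE_2431 is set to exactly "1".
inline bool issue_2431_repro_enabled() {
  const char *v = std::getenv("CPPHTTPLIB_TEST_ISSUE_2431");
  return v != nullptr && std::strcmp(v, "1") == 0;
}
```

A test body would check the runtime gate first and skip when it returns false, so an ordinary `make test` run never enters the cancel path.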
The new GetAddrInfoAsyncCancelTest cases call detail::getaddrinfo_with_timeout directly. In split builds (make test_split) split.py moves the definition into httplib.cc and strips `inline`, so the symbol is not declared in the public httplib.h and test.cc fails to compile -- breaking the ubuntu/test-no-exceptions CI jobs that the PR description says should be unaffected. Add a forward declaration in test.cc, gated by the same #if as the tests themselves, so it links against the split-build symbol without changing the header-only build.
The bug manifests as orphan getaddrinfo_a resolver workers that keep the runner from completing job teardown -- the previous run had all steps succeed in ~1m37s but then hung in "Cleaning up orphan processes" for ~57m before GitHub force-killed the job. A job-level timeout-minutes makes the failure signal fast and predictable: bug present -> killed at 5 min, bug fixed -> ~2 min pass. Step-level timeout isn't enough since the hang is in post-job cleanup, not the test step.
The bug is a textbook stack-use-after-return: a stack-local struct gaicb is destroyed when getaddrinfo_with_timeout returns after gai_cancel() yields EAI_NOTCANCELED, then the still-live resolver worker thread writes back into the freed frame. ASAN's detect_stack_use_after_return is the direct detector for exactly this pattern -- enabling it lets the failure surface as a clear ASAN diagnostic during the test run instead of as an orphan-process hang at job teardown.
ASAN's detect_stack_use_after_return option did not detect the bug in CI -- the resolver worker write likely lands on the heap (via the gaicb's pai pointer) or happens after the test process exits, neither of which stack-use-after-return can catch. Roll back to relying on the job-level timeout: bug present -> post-cleanup hangs ~8min then job-level timeout cancels at 10min total; bug fixed -> job completes in ~2min.
The previous repro setup dropped UDP/53 outright, which made glibc's resolver hang forever on every lookup -- the worker never actually received a response and so never reached the buggy write-back path that #2431 is about. As a result, neither the broken HEAD nor the fix made any visible difference in CI: both produced "tests pass + post-cleanup hangs ~10min" because the orphan resolver thread is a structural property of *any* getaddrinfo path on a hung resolver, not a property of the bug. Replace the sinkhole with a small loopback test fixture (test/dns_test_fixture.py, ~50 lines, stdlib only) that answers DNS queries after a 3s delay -- longer than the test's 1s timeout. An iptables NAT rule routes the test job's lookups to the fixture without touching /etc/resolv.conf, so the rest of the runner's DNS behaviour is unaffected. With ASAN's detect_stack_use_after_return enabled, the worker's late write-back into the destroyed gaicb stack frame is now caught as a stack-use-after-return diagnostic, so the broken HEAD fails fast at the test step (clear red) and the fix turns the same job green in well under a minute. Same fixture is wired into both the GitHub Actions job and the docker-based test/run_issue_2431_repro.sh script, so local repro on macOS and CI repro on Linux exercise the identical path.
yhirose added a commit that referenced this pull request on Apr 28, 2026
The Linux/glibc branch of detail::getaddrinfo_with_timeout used getaddrinfo_a(GAI_NOWAIT) with a stack-local struct gaicb. On the connection-timeout branch it called gai_cancel(), which is non-blocking and may return EAI_NOTCANCELED -- in that case the resolver worker thread is still alive and writes back to ar_result on the now-destroyed stack frame after the function has already returned. Drop the entire #elif _GNU_SOURCE && __GLIBC__ branch and let glibc fall through to the existing std::thread + std::shared_ptr<State> implementation that the file already uses for other Unix systems. That path captures shared ownership in the resolver lambda, so the state outlives the caller's frame whether or not the worker finishes in time -- no stack frame is ever referenced after return. The reproducer added in #2433 (issue-2431 repro CI job) goes from hanging at job teardown to passing in ~25s with this change.
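The shared-ownership idea behind the fix can be sketched as follows. This is not the library's code: `State`, `run_with_timeout`, and the `int` result are hypothetical stand-ins, and the blocking resolver call is replaced by an arbitrary `work` callable. The point it illustrates is that the worker holds its own `std::shared_ptr` copy, so a late write-back lands in heap state that is still alive rather than in the caller's stack frame.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the shared State mentioned in the commit message.
struct State {
  std::mutex mtx;
  std::condition_variable cv;
  bool done = false;
  int result = 0;
};

// Runs `work` on a detached thread and waits up to `timeout` for it.
// Returns true and stores the result in `out` on success, false on timeout.
// The worker owns a shared_ptr copy, so State stays alive until the worker
// finishes, even if the caller has already returned.
inline bool run_with_timeout(std::function<int()> work,
                             std::chrono::milliseconds timeout, int &out) {
  auto state = std::make_shared<State>();

  std::thread([state, work = std::move(work)]() {
    int r = work();  // may finish long after the caller gave up
    {
      std::lock_guard<std::mutex> lock(state->mtx);
      state->result = r;  // writes to heap state, never to the caller's stack
      state->done = true;
    }
    state->cv.notify_one();
  }).detach();

  std::unique_lock<std::mutex> lock(state->mtx);
  if (!state->cv.wait_for(lock, timeout, [&] { return state->done; })) {
    return false;  // timed out; the worker keeps its own reference to State
  }
  out = state->result;
  return true;
}
```

On timeout the caller simply drops its reference; the last owner, whichever side that turns out to be, frees the state.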
Summary

Adds a CI-gated reproducer for #2431: the use-after-free in `detail::getaddrinfo_with_timeout()` on Linux/glibc when `gai_suspend()` hits the connection timeout and the non-blocking `gai_cancel()` returns `EAI_NOTCANCELED`, leaving the resolver worker thread writing into the destroyed stack frame after the function has already returned.

This PR only adds the reproducer, not the fix. The new `issue-2431 repro (Linux + ASAN)` Linux CI job is expected to fail on this branch; that failure is the proof the test exercises the bug. The follow-up fix lives in #2434, where the same job goes green.

What's included

- `test/test.cc`: three opt-in gtest cases under `GetAddrInfoAsyncCancelTest.*`:
  - `DirectCallSingleThread`: tight loop calling `detail::getaddrinfo_with_timeout()` with a 1s timeout against unique unresolvable hostnames.
  - `DirectCallMultiThread`: same, but from 8 worker threads concurrently.
  - `ClientGetMultiThread`: exercises the same path through the high-level `Client::Get()` API (matching the original repro from #2431, "Bug: getaddrinfo_with_timeout use-after-free on Linux (getaddrinfo_a path)").

  The cases are compiled only under `#if defined(__linux__) && defined(__GLIBC__) && defined(CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO)` and skipped at runtime unless `CPPHTTPLIB_TEST_ISSUE_2431=1` is set, so normal `make test` runs are unaffected.
- `test/dns_test_fixture.py`: small (~50 line, stdlib-only) loopback UDP responder that answers DNS queries after a 3s delay. The deliberate delay (longer than the test's 1s timeout) is what gives the resolver worker thread something to write back after the caller's stack frame is gone.
- `.github/workflows/test.yaml`: new `issue-2431-repro` job on `ubuntu-latest`. It starts the test fixture on `127.0.0.1:15353`, installs an `iptables -t nat OUTPUT -p udp --dport 53 -j REDIRECT --to-port 15353` rule so glibc's lookups land on the fixture (without touching `/etc/resolv.conf`), and runs the gtest filter under `setarch -R` with `ASAN_OPTIONS=detect_stack_use_after_return=1`. A `tear down test fixture` step (`if: always()`) flushes the iptables rule and stops the fixture.
- `test/run_issue_2431_repro.sh`: Docker-based runner with the same fixture wiring so the failure can be reproduced locally on macOS / non-Linux dev machines.

Why a delayed-reply fixture, not a DNS sinkhole

Earlier iterations of this PR dropped UDP/53 outright. That made glibc hang waiting for an answer that would never arrive: the worker never reached the buggy write-back path, so the broken HEAD and any candidate fix produced the same CI behaviour (tests pass + `Cleaning up orphan processes` hangs ~1h). Replacing the sinkhole with a fixture that answers after the test's timeout is what actually exercises the bug.

Expected CI outcome on this branch

- `issue-2431 repro (Linux + ASAN)`: fails fast at the test step itself (~2 min total). Currently surfaces as `DirectCallMultiThread` exiting with SIGSEGV (exit 139) when the resolver worker writes back into reused stack memory; depending on the runner's stack reuse pattern ASAN may instead emit a `stack-use-after-return` diagnostic. Either signal is a clear test-step failure, not a job-level timeout.

Test plan

- `issue-2431 repro (Linux + ASAN)` job fails at the test step in ~2 min (proof the test exercises the bug).
- `ubuntu (openssl|mbedtls|wolfssl|no-tls)`, `macos`, `windows`, `test-no-exceptions`, etc. all pass unchanged.
- `bash test/run_issue_2431_repro.sh` reproduces the failure inside Docker on a dev machine.

Refs: #2431. Fix: #2434.
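The shape of a multi-threaded driver like `DirectCallMultiThread` can be sketched as below. This is an illustration under assumptions, not the test from the PR: `unique_host` and `run_concurrently` are hypothetical helpers, the hostname scheme is invented, and the actual lookup is passed in as a callable so the sketch does not depend on httplib. The `.invalid` top-level domain is reserved by RFC 2606 and never resolves; whether the real tests use it is not stated in the description.

```cpp
#include <atomic>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: builds a hostname that is unique per (thread,
// iteration) so no earlier answer can short-circuit a later lookup.
inline std::string unique_host(int thread_id, int iter) {
  return "issue2431-t" + std::to_string(thread_id) + "-i" +
         std::to_string(iter) + ".invalid";
}

// Calls `lookup` `iters` times from each of `num_threads` threads and
// returns the total number of calls made.
inline int run_concurrently(
    int num_threads, int iters,
    const std::function<void(const std::string &)> &lookup) {
  std::atomic<int> calls{0};
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      for (int i = 0; i < iters; ++i) {
        lookup(unique_host(t, i));
        ++calls;
      }
    });
  }
  for (auto &w : workers) w.join();  // all workers finish before returning
  return calls.load();
}
```

In the real test the callable would wrap `detail::getaddrinfo_with_timeout()` with the 1s timeout, and the thread count would be 8 as listed under What's included.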