Fail fast on AWS sandbox STS endpoint timeout/connection failures #2965

Open
SanketMeghale wants to merge 1 commit into Netflix:master from SanketMeghale:fix/aws-sandbox-sts-timeout-2924

Conversation

@SanketMeghale

PR Type

  • Bug fix
  • New feature
  • Core Runtime change (higher bar -- see CONTRIBUTING.md)
  • Docs / tooling
  • Refactoring

Summary

Fail fast when AWS sandbox STS endpoint is unreachable by adding explicit request timeout and error wrapping in Boto3ClientProvider.get_client.

Issue

Fixes #2924

Reproduction

Runtime: local

Commands to run:

export METAFLOW_AWS_SANDBOX_ENABLED=1
export METAFLOW_AWS_SANDBOX_STS_ENDPOINT_URL="http://192.0.2.1"
python -c "from metaflow.plugins.aws.aws_client import Boto3ClientProvider; Boto3ClientProvider.get_client('s3')"

Where evidence shows up: parent console traceback

Before (error / log snippet)
Sandbox STS fetch may hang for a long time due to missing timeout, then fail with networking exceptions.
After (evidence that fix works)
Sandbox STS fetch fails fast with MetaflowException for timeout/connection failures,
including endpoint context and reachability guidance.

Root Cause

In sandbox mode, the STS token fetch called requests.get without an explicit timeout. Requests to an unreachable or misconfigured endpoint therefore fell back on OS/network timeout behavior, which could cause long hangs. In addition, timeout and connection failures were not wrapped in MetaflowException, so the resulting errors were less actionable.

Why This Fix Is Correct

The patch restores predictable fail-fast behavior by setting an explicit (connect, read) timeout tuple and wrapping timeout/connection failures as MetaflowException with endpoint context. Scope is limited to sandbox credential fetch path and does not alter non-sandbox AWS client behavior.
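
A minimal sketch of the pattern described above, not the actual diff. The function name `fetch_sandbox_creds`, the local `MetaflowException` stand-in, the message wording, and the `STS_TIMEOUT` constant are assumptions for illustration; the `(5, 30)` values are the ones reported in the review summary on this PR.

```python
import requests


class MetaflowException(Exception):
    """Local stand-in for Metaflow's exception class (assumption)."""


# Hypothetical constant: (connect seconds, read seconds).
STS_TIMEOUT = (5, 30)


def fetch_sandbox_creds(url, headers=None):
    """Fetch sandbox credentials, failing fast on network problems."""
    try:
        resp = requests.get(url, headers=headers, timeout=STS_TIMEOUT)
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.Timeout as e:
        # Covers both ConnectTimeout and ReadTimeout.
        raise MetaflowException(
            "Timed out fetching sandbox credentials from %s; "
            "check that the endpoint is reachable." % url
        ) from e
    except requests.exceptions.ConnectionError as e:
        # Refused connections, DNS failures, routing problems.
        raise MetaflowException(
            "Could not connect to sandbox STS endpoint %s; "
            "check the URL and network routing." % url
        ) from e
    except requests.exceptions.HTTPError as e:
        # 4xx/5xx responses surfaced by raise_for_status().
        raise MetaflowException(repr(e)) from e
```

Passing a tuple rather than a single number lets the connect phase fail quickly while still allowing a slower endpoint time to send its response.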

Failure Modes Considered

  1. Endpoint accepts connection but stalls response: handled via explicit read timeout.
  2. Endpoint unreachable / routing / DNS failure: handled via ConnectionError and wrapped with actionable message.

Tests

  • Unit tests added/updated
  • Reproduction script provided (required for Core Runtime)
  • CI passes
  • If tests are impractical: explain why below and provide manual evidence above

Added test/unit/test_aws_client.py with focused coverage for:

  • timeout tuple usage in sandbox STS request
  • timeout error path
  • connection error path
  • existing HTTPError wrapping behavior
  • cached sandbox credentials path (no STS refetch)

Non-Goals

  • No retry/backoff policy changes.
  • No changes to non-sandbox AWS credential acquisition paths.

AI Tool Usage

  • No AI tools were used in this contribution
  • AI tools were used (describe below)

Used AI tooling for drafting and iterating patch/test scaffolding. All changes were reviewed, adjusted, and validated by me before submission.

@greptile-apps
Contributor

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR adds fail-fast behavior to the AWS sandbox STS credential fetch path in Boto3ClientProvider.get_client. Previously, requests.get was called without a timeout, causing long hangs when the sandbox STS endpoint was unreachable. The fix adds an explicit timeout=(5, 30) (5s connect, 30s read) and wraps Timeout and ConnectionError exceptions in MetaflowException with actionable error messages including the endpoint URL.

  • Adds explicit request timeout (5, 30) to sandbox STS requests.get call
  • Wraps requests.exceptions.Timeout and requests.exceptions.ConnectionError in MetaflowException with endpoint context and guidance
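  • Exception ordering is correct: Timeout is caught before ConnectionError. This matters because ConnectTimeout in the requests library inherits from both Timeout and ConnectionError, so catching ConnectionError first would report connect timeouts as generic connection errors. (Timeout itself is not a subclass of ConnectionError.)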
  • Existing HTTPError wrapping behavior is preserved unchanged
  • New test file test/unit/test_aws_client.py provides focused coverage for all error paths, timeout tuple verification, and credential caching
  • Scoped entirely to the sandbox credential fetch path; non-sandbox AWS client behavior is unaffected
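
The ordering point can be checked against the requests exception classes directly. The snippet below is a small sketch; `classify` is a hypothetical helper that mirrors the order of the except clauses, not code from the PR.

```python
import requests.exceptions as rexc

# ConnectTimeout inherits from both ConnectionError and Timeout.
assert issubclass(rexc.ConnectTimeout, rexc.ConnectionError)
assert issubclass(rexc.ConnectTimeout, rexc.Timeout)
# ReadTimeout is a Timeout but not a ConnectionError.
assert issubclass(rexc.ReadTimeout, rexc.Timeout)
assert not issubclass(rexc.ReadTimeout, rexc.ConnectionError)
# Timeout itself does not inherit from ConnectionError.
assert not issubclass(rexc.Timeout, rexc.ConnectionError)


def classify(exc):
    """Return which except clause handles exc, given Timeout-first ordering."""
    try:
        raise exc
    except rexc.Timeout:
        return "timeout"
    except rexc.ConnectionError:
        return "connection"
```

With this ordering, a `ConnectTimeout` is reported as a timeout; if the two clauses were swapped, it would fall into the connection-error branch instead.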

Confidence Score: 5/5

  • This PR is safe to merge — it's a narrowly scoped, well-tested fix to a known hang issue in the sandbox credential path.
  • The change is minimal (adding a timeout and two except clauses), correctly ordered (Timeout before ConnectionError, since ConnectTimeout inherits from both), scoped only to the sandbox STS path, and thoroughly covered by unit tests. No existing behavior is altered for non-sandbox users.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| metaflow/plugins/aws/aws_client.py | Adds explicit timeout=(5, 30) to requests.get and wraps Timeout/ConnectionError in MetaflowException with actionable messages. Exception ordering is correct: Timeout is caught before ConnectionError, which matters because ConnectTimeout inherits from both in requests. |
| test/unit/test_aws_client.py | New unit test file covering all error paths (timeout, connection error, HTTP error), the happy path with timeout verification, and credential caching behavior. Tests use appropriate mocking and a fixture-based cache reset. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Boto3ClientProvider
    participant STS as Sandbox STS Endpoint
    participant Cache as cached_aws_sandbox_creds

    Caller->>Boto3ClientProvider: get_client("s3")
    Boto3ClientProvider->>Cache: Check cache
    alt Cache hit
        Cache-->>Boto3ClientProvider: Return cached creds
    else Cache miss
        Boto3ClientProvider->>STS: requests.get(url, timeout=(5, 30))
        alt Success
            STS-->>Boto3ClientProvider: 200 + JSON creds
            Boto3ClientProvider->>Cache: Store creds
        else Timeout (connect >5s or read >30s)
            STS-->>Boto3ClientProvider: Timeout
            Boto3ClientProvider-->>Caller: MetaflowException (timeout message)
        else Connection Error
            STS-->>Boto3ClientProvider: ConnectionError
            Boto3ClientProvider-->>Caller: MetaflowException (connection message)
        else HTTP Error
            STS-->>Boto3ClientProvider: 4xx/5xx
            Boto3ClientProvider-->>Caller: MetaflowException (HTTP error)
        end
    end
    Boto3ClientProvider->>Caller: boto3 client

Last reviewed commit: 887d2b5

@npow npow added the gsoc Google Summer of Code label Mar 5, 2026

Labels

gsoc Google Summer of Code

Development

Successfully merging this pull request may close these issues.

AWS Sandbox STS fetch hangs indefinitely when endpoint is unreachable

2 participants