feat: retry mechanism for known status codes #71

Open
christianalfoni wants to merge 3 commits into main from CSB-1399

@christianalfoni
Collaborator

Automatic retry on transient failures — both SDKs

❌ Current behavior

All API calls fail immediately on any error, including transient ones:

startSandbox → HTTP 503 → throws RuntimeError / Error immediately
waitForSandbox → connection refused → throws immediately
files.read → HTTP 429 → throws immediately

Callers have no way to customise retry behaviour, so network-level failures and rate-limit responses require hand-written retry loops.

✅ New behavior

All operations retry automatically on transient failures. Users can observe and override this behaviour via RetryConfig:

// TypeScript
const sdk = new TogetherSandbox({
  apiKey: process.env.TOGETHER_API_KEY!,
  retry: {
    maxAttempts: 4,
    shouldRetry: ({ operation, status }) =>
      operation !== "snapshots.create" && status !== 401,
    onRetry: ({ operation, attempt, delay }) =>
      console.warn(`[retry] ${operation} attempt ${attempt} — retrying in ${delay}ms`),
  },
});

# Python
sdk = TogetherSandbox(
    api_key="...",
    retry=RetryConfig(
        max_attempts=4,
        should_retry=lambda ctx: ctx.operation != "snapshots.create",
        on_retry=lambda ctx: print(f"[retry] {ctx.operation} attempt {ctx.attempt}"),
    ),
)

🤔 Assumptions

  • Default of 3 total attempts is sufficient for most transient scenarios without being too aggressive
  • 408 / 429 / 500 / 502 / 503 / 504 covers the standard transient server-side codes; 4xx auth/validation errors are intentionally excluded
  • TypeScript network errors are detected as TypeError (the fetch API convention); Python uses httpx.TimeoutException, httpx.ConnectError, httpx.RemoteProtocolError
  • snapshots.create is not idempotent — retrying after a transient 500 can register a duplicate; this is documented prominently and users are shown the exclusion pattern
  • delay is expressed in milliseconds in TypeScript and seconds in Python — consistent with each language's sleep convention (setTimeout vs asyncio.sleep)
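
To make the status-code assumption concrete, here is a minimal sketch of the default classification. The codes are the ones listed above; the constant and function names are hypothetical and are not taken from the SDK source.

```python
from typing import Optional

# Transient status codes named in the assumptions above.
# The names below are illustrative only, not the SDK's own identifiers.
RETRYABLE_STATUS_CODES = frozenset({408, 429, 500, 502, 503, 504})


def is_retryable_status(status: Optional[int]) -> bool:
    """Return True when the HTTP status is one of the transient codes."""
    # None stands for "no HTTP response"; network errors are classified separately.
    return status in RETRYABLE_STATUS_CODES
```

Auth and validation errors such as 401 or 422 fall outside the set, so under this sketch they are never retried by default.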

🧠 Decisions

  • Centralised in a single callApi / _call_api helper — all operations go through it; retry config flows down from the SDK constructor through every namespace
  • shouldRetry return type is boolean | number (TS) / bool | float (Python) — returning a numeric value lets the user both approve the retry AND set the delay in one callback, without needing to mutate the context object
  • onRetry receives the final resolved delay (after shouldRetry may have overridden it) so loggers and UI code see the same value the SDK will sleep for
  • RetryConfig and RetryContext are exported from the package root in both SDKs — users do not need to reach into internal modules
  • Python Sandbox instance methods hibernate() and shutdown() were defined with a private prefix (_hibernate / _shutdown) despite being documented as public, so calling the documented names raised AttributeError at runtime; they are renamed to the public names
  • TypeScript README rewritten as a developer/contributor guide (codegen, test commands) to match the Python README; full API reference lives in docs/typescript-sdk.md
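
The decisions above (one central helper, a should_retry that returns either a boolean or a delay, and an on_retry that sees the final delay) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual _call_api: the status and delay fields on RetryContext, the ApiError constructor, the base_delay parameter, the doubling schedule, the request callable returning a (status, body) tuple, and the choice to consult should_retry only for failures the defaults already treat as retryable are all hypothetical. Only synchronous callbacks are handled here.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional, Tuple, Union


@dataclass
class RetryContext:
    operation: str
    attempt: int            # 1-based number of the attempt that just failed
    status: Optional[int]   # assumed field: HTTP status of the failure
    delay: float            # assumed field: seconds the helper will sleep


@dataclass
class RetryConfig:
    max_attempts: int = 3
    should_retry: Optional[Callable[[RetryContext], Union[bool, float]]] = None
    on_retry: Optional[Callable[[RetryContext], None]] = None


class ApiError(Exception):
    def __init__(self, operation: str, status: int) -> None:
        super().__init__(f"{operation} failed with HTTP {status}")
        self.status = status


RETRYABLE = frozenset({408, 429, 500, 502, 503, 504})


async def call_api(
    operation: str,
    request: Callable[[], Awaitable[Tuple[int, Any]]],
    config: RetryConfig,
    base_delay: float = 0.5,
) -> Any:
    attempt = 1
    while True:
        status, body = await request()
        if status < 400:
            return body
        # Default delay doubles per attempt (jitter omitted for brevity).
        ctx = RetryContext(operation, attempt, status, base_delay * 2 ** (attempt - 1))
        retry = status in RETRYABLE and attempt < config.max_attempts
        if retry and config.should_retry is not None:
            verdict = config.should_retry(ctx)
            # bool is a subclass of int, so test for it before treating
            # the verdict as a numeric delay override.
            if isinstance(verdict, bool):
                retry = verdict
            else:
                ctx.delay = float(verdict)
        if not retry:
            raise ApiError(operation, status)
        if config.on_retry is not None:
            config.on_retry(ctx)  # sees the final, possibly overridden delay
        await asyncio.sleep(ctx.delay)
        attempt += 1
```

Returning a number from should_retry both approves the retry and replaces the delay, which is why on_retry is invoked only after that override has been applied.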

🔄 Discussions

  • Initially _unwrap_or_raise (Python) and direct throwOnError (TypeScript) were used for error handling — replaced entirely by _call_api / callApi to unify error unwrapping and retry logic in one place

🧪 Testing

  • TypeScript: comprehensive unit tests in src/utils.test.ts covering all retryable status codes, network errors (TypeError), shouldRetry return shapes (false / true / number), onRetry call count and delay values, exponential backoff math, ApiError / SandboxError typed throws, 204 no-body path, and context string formatting
  • Python: equivalent unit tests in tests/test_call_api.py, including parametrized status codes, all three httpx exception types, backoff delay values with patched random.random, and async should_retry / on_retry coroutines
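
As an illustration of why the Python tests patch random.random, the sketch below pins the jitter so that exact backoff delays can be asserted. The backoff formula, its base and cap defaults, and the function name are assumptions made for this example; the PR text does not state the SDK's actual formula.

```python
import random
from unittest.mock import patch


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Hypothetical exponential backoff with jitter, in seconds.

    attempt is 1-based; the un-jittered delay doubles each attempt up to cap.
    """
    exponential = min(cap, base * 2 ** (attempt - 1))
    # Scale by a factor in [0.5, 1.0) so retries from many clients spread out.
    return exponential * (0.5 + random.random() / 2)


# Pinning random.random removes the jitter's randomness, so the test can
# compare against exact values instead of ranges.
with patch("random.random", return_value=0.0):
    delays = [backoff_delay(n) for n in (1, 2, 3)]

assert delays == [0.25, 0.5, 1.0]
```

Without the patch, a test could only assert that each delay falls inside a range, which makes regressions in the backoff math harder to detect.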

📁 References
