feat: Production-Grade: Persistent Store & Distributed Event Bus #343
hashtekconsulting wants to merge 5 commits into a2aproject:main
…yments
Introduce DynamoDB-backed TaskStore and SNS/SQS-based ExecutionEventBusManager to support horizontally-scaled A2A server deployments.
- DynamoDBTaskStore: persistent TaskStore backed by DynamoDB with optimistic locking (version attribute + conditional writes), configurable exponential back-off on conflicts, and optional TTL for automatic task expiry.
- SnsEventBusManager: drop-in replacement for DefaultExecutionEventBusManager that fans out execution events across instances via SNS/SQS, enabling SSE delivery to clients connected to any node in the cluster.
- SqsEventPoller: per-instance SQS long-poll loop that deduplicates messages by instanceId to prevent double-delivery on the originating node.
- TaskStoreError hierarchy: typed errors (TaskNotFoundError, TaskConflictError, StoreUnavailableError) with retryable classification.
- Add AWS SDK dependencies: @aws-sdk/client-dynamodb, @aws-sdk/client-sns, @aws-sdk/client-sqs, @aws-sdk/lib-dynamodb.
- Add test dependencies: aws-sdk-client-mock, aws-sdk-client-mock-vitest.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
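The optimistic-locking behaviour described in this commit message can be illustrated with a minimal in-memory sketch. It is not the PR's DynamoDBTaskStore: `FakeTable`, `VersionedTask`, `updateWithRetry`, and the constructor shapes of the error classes are hypothetical names chosen for illustration, and a plain `Map` stands in for a table with conditional writes.

```typescript
// Hypothetical error hierarchy: class names follow the commit message,
// but the constructor signatures are assumptions.
class TaskStoreError extends Error {
  readonly retryable: boolean;
  constructor(message: string, retryable: boolean) {
    super(message);
    this.name = new.target.name;
    this.retryable = retryable;
  }
}
class TaskNotFoundError extends TaskStoreError {
  constructor(id: string) { super(`task ${id} not found`, false); }
}
class TaskConflictError extends TaskStoreError {
  constructor(id: string) { super(`version conflict on task ${id}`, true); }
}

interface VersionedTask { id: string; status: string; version: number; }

// In-memory stand-in for a table that supports conditional writes.
class FakeTable {
  private rows = new Map<string, VersionedTask>();
  get(id: string): VersionedTask | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined;
  }
  // Writes only if the stored version equals expectedVersion
  // (0 means "the row must not exist yet"), then bumps the version.
  putIfVersion(task: VersionedTask, expectedVersion: number): void {
    const current = this.rows.get(task.id)?.version ?? 0;
    if (current !== expectedVersion) throw new TaskConflictError(task.id);
    this.rows.set(task.id, { ...task, version: expectedVersion + 1 });
  }
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Read, mutate, conditionally write; on a retryable conflict, back off
// exponentially (baseDelayMs, 2x, 4x, ...) and re-read before retrying.
async function updateWithRetry(
  table: FakeTable,
  id: string,
  mutate: (t: VersionedTask) => VersionedTask,
  maxAttempts = 5,
  baseDelayMs = 10,
): Promise<VersionedTask> {
  for (let attempt = 0; ; attempt++) {
    const current = table.get(id);
    if (!current) throw new TaskNotFoundError(id);
    try {
      table.putIfVersion(mutate(current), current.version);
      return table.get(id)!;
    } catch (err) {
      const retryable = err instanceof TaskStoreError && err.retryable;
      if (!retryable || attempt + 1 >= maxAttempts) throw err;
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

Re-reading the row on every attempt matters: a retry that reused the stale version would fail the conditional write again.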
…scale in and out scenarios
…nents part of a2a SDK and also made it optional.
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the A2A SDK's server-side capabilities by introducing robust solutions for production-grade scalability. It replaces in-memory task storage and event management with AWS-backed persistent storage (DynamoDB) and a distributed event bus (SNS/SQS). These changes ensure that task states are durable and events are reliably fanned out across multiple server instances, which is crucial for high-availability and load-balanced deployments.
Code Review
This pull request introduces significant new functionality for running the A2A server in a distributed, production environment: a persistent TaskStore backed by DynamoDB and a distributed event bus built on SNS/SQS. The code is well-structured, with robust error handling and comprehensive tests, but it has a critical memory leak in the distributed event bus implementation: every server instance creates and retains event bus objects for every task in the fleet, regardless of whether a local client is connected. In addition, a resource leak in the SQS polling loop can cause async functions to hang indefinitely on shutdown, which could lead to system instability and leave the server open to denial of service. These problems, along with a couple of other identified issues (one critical for usability), need to be addressed.
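One common way to avoid the shutdown hang described above is to tie the in-flight receive to an `AbortSignal`, so that stopping the poller cancels the pending long poll instead of waiting for it. The sketch below is an illustration under assumptions, not the PR's SqsEventPoller: `Poller`, `Receive`, and `fakeLongPoll` are hypothetical names, and a timer stands in for the SQS long-poll call.

```typescript
// A receive call that honours cancellation (hypothetical shape).
type Receive = (signal: AbortSignal) => Promise<string[]>;

class Poller {
  private readonly controller = new AbortController();
  private loop: Promise<void> | undefined;
  private readonly receive: Receive;
  private readonly onMessage: (m: string) => void;

  constructor(receive: Receive, onMessage: (m: string) => void) {
    this.receive = receive;
    this.onMessage = onMessage;
  }

  start(): void {
    if (!this.loop) this.loop = this.run();
  }

  // Aborts the in-flight receive and resolves once the loop has exited.
  async stop(): Promise<void> {
    this.controller.abort();
    await this.loop;
  }

  private async run(): Promise<void> {
    const { signal } = this.controller;
    while (!signal.aborted) {
      try {
        for (const m of await this.receive(signal)) this.onMessage(m);
      } catch {
        if (signal.aborted) return; // shutdown, not a failure
        // A real poller would log and back off here before retrying.
      }
    }
  }
}

// Stand-in for a long poll: resolves empty after waitMs unless aborted first.
function fakeLongPoll(waitMs: number): Receive {
  return (signal) =>
    new Promise<string[]>((resolve, reject) => {
      if (signal.aborted) {
        reject(new Error("aborted"));
        return;
      }
      const timer = setTimeout(() => {
        signal.removeEventListener("abort", onAbort);
        resolve([]);
      }, waitMs);
      const onAbort = () => {
        clearTimeout(timer);
        reject(new Error("aborted"));
      };
      signal.addEventListener("abort", onAbort, { once: true });
    });
}
```

With this shape, `await poller.stop()` returns as soon as the pending receive rejects, rather than after the full long-poll wait.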
feat: Persistent Task Store and Distributed Event Bus/Queue for scaling A2A server instances in production environments
By default, the SDK ships with InMemoryTaskStore and DefaultExecutionEventBusManager, which work well for a single-process server. In a production deployment with multiple server instances behind a load balancer, you need:
- Persistent task state: any instance can serve a tasks/get request regardless of which instance originally handled the task.
- Distributed SSE fan-out: a client that opens an SSE stream on Instance B receives events published by the executor running on Instance A.
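The fan-out requirement can be sketched without any AWS dependency. In the example below, `FakeTopic` stands in for an SNS topic with one SQS queue per instance, and `Instance` for a server node; both names, and the `BusMessage` shape, are hypothetical. The sketch assumes each message carries the publisher's instanceId so the originating node can drop its own copy and avoid double delivery.

```typescript
interface BusMessage { instanceId: string; taskId: string; payload: string; }

// Stand-in for a topic with one queue per instance: every publish is
// copied to every subscriber, including the publisher's own queue.
class FakeTopic {
  private subscribers: Array<(m: BusMessage) => void> = [];
  subscribe(handler: (m: BusMessage) => void): void {
    this.subscribers.push(handler);
  }
  publish(m: BusMessage): void {
    for (const s of this.subscribers) s(m);
  }
}

class Instance {
  // What SSE clients connected to this instance would receive.
  readonly delivered: string[] = [];
  readonly instanceId: string;
  private readonly topic: FakeTopic;

  constructor(instanceId: string, topic: FakeTopic) {
    this.instanceId = instanceId;
    this.topic = topic;
    topic.subscribe((m) => this.onRemote(m));
  }

  // Called by an executor running on this instance.
  publish(taskId: string, payload: string): void {
    this.deliverLocal(taskId, payload); // local clients see it immediately
    this.topic.publish({ instanceId: this.instanceId, taskId, payload });
  }

  private onRemote(m: BusMessage): void {
    if (m.instanceId === this.instanceId) return; // already delivered locally
    this.deliverLocal(m.taskId, m.payload);
  }

  private deliverLocal(taskId: string, payload: string): void {
    this.delivered.push(`${taskId}:${payload}`);
  }
}
```

If Instance A publishes an event for a task, a client streaming from Instance B sees it exactly once, and so does a client on Instance A.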
This feature introduces new components that support A2A server scalability in production environments.