feat(BA-4330): Enable OpenTelemetry distributed tracing in Manager #8694
base: main
Pull request overview
Enables OpenTelemetry distributed tracing in the Manager and expands the typed action/audit/service-discovery foundations to improve observability, RBAC classification, and query capabilities across services.
Changes:
- Activate global OTel tracer provider and instrument aiohttp server/client; add GraphQL resolver spans.
- Replace string-based `entity_type`/`operation_type` with typed enums across actions and RBAC validators.
- Add service discovery foundations (config schema, events, DB models/migration) and extend several GraphQL filter/order features.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| .claude/skills/README.md | Documents new /bep-guide skill and usage flow. |
| CLAUDE.md | Updates contributor guidance to reference /bep-guide. |
| changes/8615.fix.md | Changelog entry for role-change group deletion fix. |
| changes/8628.fix.md | Changelog entry for scoped_query project override fix. |
| changes/8643.fix.md | Changelog entry for GraphQL async resolver timing fix. |
| changes/8647.fix.md | Changelog entry for deployment duplicate name returning 409. |
| changes/8648.feature.md | Changelog entry for usage bucket CLI date filtering. |
| changes/8652.fix.md | Changelog entry for collecting error_code in history records. |
| changes/8671.feature.md | Changelog entry for nested filter/order helpers. |
| changes/8672.doc.md | Changelog entry for BEP-1046 docs. |
| changes/8674.doc.md | Changelog entry for BEP-1047 docs. |
| changes/8675.feature.md | Changelog entry for DomainV2 nested filters/orders. |
| changes/8676.doc.md | Changelog entry for BEP conventions. |
| changes/8677.feature.md | Changelog entry for resource slot normalization foundations. |
| changes/8678.feature.md | Changelog entry for ProjectV2 nested filters/orders. |
| changes/8679.enhance.md | Changelog entry for StrEnum action type system changes. |
| changes/8680.enhance.md | Changelog entry for separating child entity types. |
| changes/8681.feature.md | Changelog entry for UserV2 nested filters. |
| changes/8682.feature.md | Changelog entry for ResourceGroup filter/order + Agent status order. |
| changes/8683.feature.md | Changelog entry for fair share GQL node references. |
| changes/8684.enhance.md | Changelog entry for removing AuditLogEntityType etc. |
| changes/8685.feature.md | Changelog entry for service discovery foundations. |
| docs/manager/rest-reference/openapi.json | Adds OpenAPI schema for DateRangeFilter and usage-bucket date filtering. |
| proposals/BEP-1046/cli-migration.md | Adds BEP-1046 CLI migration details. |
| proposals/BEP-1046/data-model.md | Adds BEP-1046 service discovery DB model documentation. |
| proposals/BEP-1046/event-registration.md | Adds BEP-1046 event registration/flow documentation. |
| proposals/BEP-1047/current-design.md | Adds BEP-1047 current design detail. |
| proposals/BEP-1047/migration-compatibility.md | Adds BEP-1047 migration/compat guidance. |
| proposals/BEP-1047/phase-2-scheduler-integration.md | Adds BEP-1047 phase 2 scheduler integration detail. |
| proposals/BEP-1047/phase-3-usage-bucket-optimization.md | Adds BEP-1047 phase 3 usage-bucket optimization detail. |
| src/ai/backend/common/configs/__init__.py | Exposes ServiceEndpointConfig in configs package exports. |
| src/ai/backend/common/configs/service_discovery.py | Adds endpoint/instance metadata config schema for service discovery. |
| src/ai/backend/common/dto/manager/fair_share/types.py | Adds period_start date range filtering to usage bucket DTO filters. |
| src/ai/backend/common/dto/manager/query.py | Introduces DateRangeFilter DTO. |
| src/ai/backend/common/events/event_types/service_discovery/anycast.py | Adds service discovery anycast events and endpoint payload model. |
| src/ai/backend/common/events/types.py | Adds SERVICE_DISCOVERY event domain. |
| src/ai/backend/common/exception.py | Adds DeploymentNameAlreadyExists (HTTP 409) error type. |
| src/ai/backend/common/metrics/metric.py | Types action-observation entity_type as EntityType. |
| src/ai/backend/common/types.py | Adds ServiceCatalogStatus enum. |
| src/ai/backend/logging/otel.py | Sets global tracer provider to activate tracing. |
| src/ai/backend/logging/utils.py | Wires tracer initialization into BraceStyleAdapter.apply_otel(). |
| src/ai/backend/manager/actions/action/base.py | Changes action API to return typed EntityType and ActionOperationType. |
| src/ai/backend/manager/actions/action/batch.py | Removes permission_operation_type() in favor of operation_type(). |
| src/ai/backend/manager/actions/action/scope.py | Types scope entity as ScopeType and removes permission_operation_type(). |
| src/ai/backend/manager/actions/action/single_entity.py | Removes permission_operation_type() in favor of operation_type(). |
| src/ai/backend/manager/actions/types.py | Adds ActionOperationType and updates ActionSpec types. |
| src/ai/backend/manager/actions/validators/rbac/scope.py | Uses typed action/scope enums and maps operation to permission operation. |
| src/ai/backend/manager/actions/validators/rbac/single_entity.py | Uses typed action enums and maps operation to permission operation. |
| src/ai/backend/manager/api/gql/agent/types.py | Adds Agent ordering by STATUS. |
| src/ai/backend/manager/api/gql/domain_v2/types/__init__.py | Exposes DomainV2 nested filters. |
| src/ai/backend/manager/api/gql/fair_share/types/__init__.py | Exposes new fair-share nested filters/types. |
| src/ai/backend/manager/api/gql/fair_share/types/domain.py | Replaces child navigation with domain node reference; adds nested filter/order. |
| src/ai/backend/manager/api/gql/project_v2/types/__init__.py | Exposes ProjectV2 nested filters. |
| src/ai/backend/manager/api/gql/resource_group/types.py | Extends ResourceGroup filter/order fields. |
| src/ai/backend/manager/api/gql/user_v2/fetcher/__init__.py | Exposes new fetch_user API. |
| src/ai/backend/manager/api/gql/user_v2/fetcher/user.py | Adds single-user fetch by UUID. |
| src/ai/backend/manager/api/gql/user_v2/types/__init__.py | Exposes new UserV2 nested filters. |
| src/ai/backend/manager/api/gql_legacy/audit_log.py | Switches audit log entity type variants to common EntityType. |
| src/ai/backend/manager/api/gql_legacy/base.py | Prevents scoped_query from overriding project param with group_id. |
| src/ai/backend/manager/api/gql_legacy/schema.py | Adds GraphQL resolver OTel spans and improved async timing/error handling. |
| src/ai/backend/manager/cli/main.py | Ensures AgentRow mapper is registered for relationship resolution. |
| src/ai/backend/manager/models/alembic/versions/acd2e76c1e40_add_service_catalog_tables.py | Adds service catalog tables migration. |
| src/ai/backend/manager/models/audit_log/__init__.py | Removes AuditLogEntityType export. |
| src/ai/backend/manager/models/audit_log/row.py | Removes AuditLogEntityType enum definition. |
| src/ai/backend/manager/models/resource_slot/__init__.py | Adds resource slot model exports. |
| src/ai/backend/manager/models/service_catalog/__init__.py | Adds service catalog model exports. |
| src/ai/backend/manager/models/service_catalog/row.py | Adds service catalog ORM models. |
| src/ai/backend/manager/repositories/agent/options.py | Adds ordering by agent status. |
| src/ai/backend/manager/repositories/deployment/db_source/db_source.py | Adds endpoint-name uniqueness check with a 409 error on conflict. |
| src/ai/backend/manager/repositories/fair_share/db_source/db_source.py | Adds joins to support nested filters/orders in fair share searches. |
| src/ai/backend/manager/repositories/user/db_source/db_source.py | Avoids clearing user groups implicitly on role change. |
| src/ai/backend/manager/reporters/base.py | Types action messages with EntityType and ActionOperationType. |
| src/ai/backend/manager/server.py | Instruments aiohttp server/client when OTel enabled. |
| src/ai/backend/manager/services/agent/actions/base.py | Switches entity_type() to EntityType.AGENT. |
| src/ai/backend/manager/services/agent/actions/get_total_resources.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/get_watcher_status.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/handle_heartbeat.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/load_container_counts.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/mark_agent_exit.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/mark_agent_running.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/recalculate_usage.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/remove_agent_from_images.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/remove_agent_from_images_by_canonicals.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/search_agents.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/sync_agent_registry.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/watcher_agent_restart.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/watcher_agent_start.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/agent/actions/watcher_agent_stop.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/app_config/actions/domain.py | Switches action entity/operation typing to enums. |
| src/ai/backend/manager/services/app_config/actions/get_merged.py | Switches action entity/operation typing to enums. |
| src/ai/backend/manager/services/app_config/actions/user.py | Switches action entity/operation typing to enums. |
| src/ai/backend/manager/services/artifact/actions/base.py | Switches entity_type() to EntityType.ARTIFACT. |
| src/ai/backend/manager/services/artifact/actions/delegate_scan.py | Refines entity/operation typing; adds ARTIFACT_SCAN entity type. |
| src/ai/backend/manager/services/artifact/actions/delete_multi.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact/actions/get.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact/actions/get_revisions.py | Refines entity/operation typing for artifact revisions. |
| src/ai/backend/manager/services/artifact/actions/list_with_revisions.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact/actions/restore_multi.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact/actions/retrieve_model.py | Refines entity/operation typing for artifact model retrieval. |
| src/ai/backend/manager/services/artifact/actions/retrieve_model_multi.py | Refines entity/operation typing for artifact model retrieval. |
| src/ai/backend/manager/services/artifact/actions/scan.py | Refines entity/operation typing for artifact scan. |
| src/ai/backend/manager/services/artifact/actions/search.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact/actions/search_with_revisions.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact/actions/update.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact/actions/upsert_multi.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/base.py | Switches entity_type() to EntityType.ARTIFACT_REGISTRY. |
| src/ai/backend/manager/services/artifact_registry/actions/common/get_meta.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/common/get_multi.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/common/search.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/huggingface/create.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/huggingface/delete.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/huggingface/get.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/huggingface/get_multi.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/huggingface/list.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/huggingface/search.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/huggingface/update.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/reservoir/create.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/reservoir/delete.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/reservoir/get.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/reservoir/get_multi.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/reservoir/list.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/reservoir/search.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_registry/actions/reservoir/update.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/approve.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/associate_with_storage.py | Refines entity/operation typing; adds storage-link entity type. |
| src/ai/backend/manager/services/artifact_revision/actions/base.py | Switches entity_type() to EntityType.ARTIFACT_REVISION. |
| src/ai/backend/manager/services/artifact_revision/actions/cancel_import.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/cleanup.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/delegate_import_revision_batch.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/disassociate_with_storage.py | Refines entity/operation typing; adds storage-link entity type. |
| src/ai/backend/manager/services/artifact_revision/actions/get.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/get_download_progress.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/get_readme.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/get_verification_result.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/import_revision.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/import_revision_batch.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/reject.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/artifact_revision/actions/search.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/audit_log/actions/base.py | Switches entity_type() to EntityType.AUDIT_LOG. |
| src/ai/backend/manager/services/audit_log/actions/create.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/audit_log/actions/search.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/auth/actions/authorize.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/auth/actions/base.py | Switches entity_type() to EntityType.AUTH. |
| src/ai/backend/manager/services/auth/actions/generate_ssh_keypair.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/auth/actions/get_role.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/auth/actions/get_ssh_keypair.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/auth/actions/signout.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/auth/actions/signup.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/auth/actions/update_full_name.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/auth/actions/update_password.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/auth/actions/update_password_no_auth.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/auth/actions/upload_ssh_keypair.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/container_registry/actions/base.py | Switches entity_type() to EntityType.CONTAINER_REGISTRY. |
| src/ai/backend/manager/services/container_registry/actions/clear_images.py | Refines entity/operation typing for registry images. |
| src/ai/backend/manager/services/container_registry/actions/create_container_registry.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/container_registry/actions/delete_container_registry.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/container_registry/actions/get_container_registries.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/container_registry/actions/load_all_container_registries.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/container_registry/actions/load_container_registries.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/container_registry/actions/modify_container_registry.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/container_registry/actions/rescan_images.py | Refines entity/operation typing for registry images. |
| src/ai/backend/manager/services/deployment/actions/access_token/base.py | Switches entity_type() to EntityType.DEPLOYMENT_ACCESS_TOKEN. |
| src/ai/backend/manager/services/deployment/actions/access_token/create_access_token.py | Switches base type and operation_type() to typed enums. |
| src/ai/backend/manager/services/deployment/actions/access_token/search_access_tokens.py | Switches base type and operation_type() to typed enums. |
| src/ai/backend/manager/services/deployment/actions/auto_scaling_rule/base.py | Switches entity_type() to deployment auto-scaling-rule entity. |
| src/ai/backend/manager/services/deployment/actions/auto_scaling_rule/create_auto_scaling_rule.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/auto_scaling_rule/delete_auto_scaling_rule.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/auto_scaling_rule/search_auto_scaling_rules.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/auto_scaling_rule/update_auto_scaling_rule.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/base.py | Switches entity_type() to EntityType.DEPLOYMENT. |
| src/ai/backend/manager/services/deployment/actions/create_deployment.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/create_legacy_deployment.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/deployment_policy/base.py | Switches entity_type() to EntityType.DEPLOYMENT_POLICY. |
| src/ai/backend/manager/services/deployment/actions/deployment_policy/get_deployment_policy.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/destroy_deployment.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/get_deployment_by_id.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/get_replica_by_id.py | Switches base type and operation_type() to typed enums. |
| src/ai/backend/manager/services/deployment/actions/model_revision/add_model_revision.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/model_revision/base.py | Switches entity_type() to model revision entity type. |
| src/ai/backend/manager/services/deployment/actions/model_revision/create_model_revision.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/model_revision/get_revision_by_id.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/model_revision/search_revisions.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/replica/base.py | Introduces replica base action with replica entity type. |
| src/ai/backend/manager/services/deployment/actions/revision_operations/activate_revision.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/revision_operations/base.py | Switches entity_type() to deployment revision entity type. |
| src/ai/backend/manager/services/deployment/actions/route/base.py | Switches entity_type() to route entity type. |
| src/ai/backend/manager/services/deployment/actions/route/search_routes.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/route/update_route_traffic_status.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/search_deployments.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/deployment/actions/search_replicas.py | Switches base type and operation_type() to typed enums. |
| src/ai/backend/manager/services/deployment/actions/sync_replicas.py | Switches base type and operation_type() to typed enums. |
| src/ai/backend/manager/services/deployment/actions/update_deployment.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/domain/actions/base.py | Switches entity_type() to EntityType.DOMAIN. |
| src/ai/backend/manager/services/domain/actions/create_domain.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/domain/actions/create_domain_node.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/domain/actions/delete_domain.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/domain/actions/get_domain.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/domain/actions/modify_domain.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/domain/actions/modify_domain_node.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/domain/actions/purge_domain.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/domain/actions/search_domains.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/domain/actions/search_rg_domains.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/error_log/actions/base.py | Switches entity_type() to EntityType.ERROR_LOG. |
| src/ai/backend/manager/services/error_log/actions/create.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/error_log/actions/search.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/export/actions/base.py | Switches entity_type() to EntityType.EXPORT and typed operation. |
| src/ai/backend/manager/services/export/actions/export_audit_logs_csv.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/export/actions/export_keypairs_csv.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/export/actions/export_projects_csv.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/export/actions/export_sessions_csv.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/export/actions/export_users_csv.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/export/actions/get_report.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/export/actions/list_reports.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/group/actions/base.py | Switches entity_type() to EntityType.GROUP. |
| src/ai/backend/manager/services/group/actions/create_group.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/group/actions/delete_group.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/group/actions/modify_group.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/group/actions/purge_group.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/group/actions/search_projects.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/group/actions/usage_per_month.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/group/actions/usage_per_period.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/image/actions/alias_base.py | Introduces base action for image alias entity typing. |
| src/ai/backend/manager/services/image/actions/alias_image.py | Switches to alias base action and typed operation. |
| src/ai/backend/manager/services/image/actions/base.py | Switches entity_type() to EntityType.IMAGE. |
| src/ai/backend/manager/services/image/actions/clear_image_custom_resource_limit.py | Switches to resource-limit base action and typed operation. |
| src/ai/backend/manager/services/image/actions/dealias_image.py | Switches to alias base action and typed operation. |
| src/ai/backend/manager/services/image/actions/forget_image.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/image/actions/get_all_images.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/image/actions/get_image_installed_agents.py | Refines entity/operation typing for image-agent queries. |
| src/ai/backend/manager/services/image/actions/get_images.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/image/actions/modify_image.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/image/actions/preload_image.py | Refines entity/operation typing for image preload. |
| src/ai/backend/manager/services/image/actions/purge_images.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/image/actions/resource_limit_base.py | Introduces base action for image resource-limit entity typing. |
| src/ai/backend/manager/services/image/actions/scan_image.py | Refines entity/operation typing for image scan. |
| src/ai/backend/manager/services/image/actions/search_aliases.py | Switches to alias base action and typed operation. |
| src/ai/backend/manager/services/image/actions/search_images.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/image/actions/set_image_resource_limit.py | Switches to resource-limit base action and typed operation. |
| src/ai/backend/manager/services/image/actions/unload_image.py | Refines entity/operation typing for image preload/unload. |
| src/ai/backend/manager/services/image/actions/untag_image_from_registry.py | Refines entity/operation typing for image tag deletion. |
| src/ai/backend/manager/services/keypair_resource_policy/actions/base.py | Switches entity_type() to EntityType.KEYPAIR_RESOURCE_POLICY. |
| src/ai/backend/manager/services/keypair_resource_policy/actions/create_keypair_resource_policy.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/keypair_resource_policy/actions/delete_keypair_resource_policy.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/keypair_resource_policy/actions/modify_keypair_resource_policy.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/metric/actions/container.py | Switches entity/operation typing for container metric actions. |
| src/ai/backend/manager/services/model_serving/actions/base.py | Switches entity_type() to EntityType.MODEL_SERVICE. |
| src/ai/backend/manager/services/model_serving/actions/clear_error.py | Refines entity/operation typing for deployment error clear. |
| src/ai/backend/manager/services/model_serving/actions/create_auto_scaling_rule.py | Refines entity/operation typing for deployment auto-scaling rule creation. |
| src/ai/backend/manager/services/model_serving/actions/create_model_service.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/model_serving/actions/delete_auto_scaling_rule.py | Refines entity/operation typing for deployment auto-scaling rule deletion. |
| src/ai/backend/manager/services/model_serving/actions/delete_model_service.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/model_serving/actions/delete_route.py | Refines entity/operation typing for route deletion. |
| src/ai/backend/manager/services/model_serving/actions/dry_run_model_service.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/model_serving/actions/force_sync.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/model_serving/actions/generate_token.py | Refines entity/operation typing for deployment token. |
| src/ai/backend/manager/services/model_serving/actions/get_model_service_info.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/model_serving/actions/list_errors.py | Refines entity/operation typing for deployment errors listing. |
| src/ai/backend/manager/services/model_serving/actions/list_model_service.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/model_serving/actions/modify_auto_scaling_rule.py | Refines entity/operation typing for auto scaling rule updates. |
| src/ai/backend/manager/services/model_serving/actions/modify_endpoint.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/model_serving/actions/scale_service_replicas.py | Refines entity/operation typing for replica scaling. |
| src/ai/backend/manager/services/model_serving/actions/search_auto_scaling_rules.py | Refines entity/operation typing for auto scaling rule searches. |
| src/ai/backend/manager/services/model_serving/actions/update_route.py | Refines entity/operation typing for route updates. |
| src/ai/backend/manager/services/notification/actions/base.py | Switches entity/operation typing for notification actions. |
| src/ai/backend/manager/services/notification/actions/create_channel.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/notification/actions/create_rule.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/notification/actions/delete_channel.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/notification/actions/delete_rule.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/notification/actions/get_channel.py | Switches operation_type() to typed ActionOperationType. |
| src/ai/backend/manager/services/notification/actions/get_rule.py | Switches operation_type() to typed ActionOperationType. |
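
Most rows above describe the same refactor: `entity_type()` and `operation_type()` now return enum members instead of free-form strings. The following is a minimal sketch of that pattern, not the PR's actual code. Only the names `EntityType`, `ActionOperationType`, `KEYPAIR_RESOURCE_POLICY`, and `MODEL_SERVICE` come from the summary; the member values, the `CREATE`/`DELETE` members, the simplified `BaseAction` class, and the `audit_key()` helper are assumptions made for illustration.

```python
import enum
from abc import ABC, abstractmethod


class EntityType(str, enum.Enum):
    # Member names come from the file summary; the string values are assumed.
    KEYPAIR_RESOURCE_POLICY = "keypair_resource_policy"
    MODEL_SERVICE = "model_service"


class ActionOperationType(str, enum.Enum):
    # Hypothetical members; the real enum may define different ones.
    CREATE = "create"
    DELETE = "delete"


class BaseAction(ABC):
    """Simplified stand-in for the manager's action base class."""

    @classmethod
    @abstractmethod
    def entity_type(cls) -> EntityType: ...

    @classmethod
    @abstractmethod
    def operation_type(cls) -> ActionOperationType: ...


class KeypairResourcePolicyAction(BaseAction):
    """Per-service base: fixes the entity type once for all its actions."""

    @classmethod
    def entity_type(cls) -> EntityType:
        return EntityType.KEYPAIR_RESOURCE_POLICY


class CreateKeypairResourcePolicyAction(KeypairResourcePolicyAction):
    @classmethod
    def operation_type(cls) -> ActionOperationType:
        return ActionOperationType.CREATE


def audit_key(action: type[BaseAction]) -> str:
    """Hypothetical helper: build a log key from the typed classification."""
    return f"{action.entity_type().value}:{action.operation_type().value}"
```

With typed members, a misspelled entity name fails at lookup time (`EntityType("typo")` raises `ValueError`) instead of silently producing an unmatched string in an RBAC validator.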
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Activate the previously stubbed OTel tracing pipeline so Backend.AI participates in distributed traces initiated by external clients.
- Set the global TracerProvider in apply_otel_tracer() so trace.get_tracer() returns a functioning tracer
- Call apply_otel_tracer() from BraceStyleAdapter.apply_otel()
- Instrument aiohttp server/client to propagate W3C Trace Context
- Add OTel spans to GQLMetricMiddleware.resolve() with attributes for operation name, field name, and parent type
- Register AgentRow ORM mapper in clear-history CLI to fix lazy relationship resolution

Co-Authored-By: Claude Opus 4.6 <[email protected]>
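
For readers unfamiliar with W3C Trace Context, here is a rough sketch of what extracting a `traceparent` header involves. In this PR the extraction is done by the OTel aiohttp middleware, not by hand-written code; `parse_traceparent()` and `TraceParent` are hypothetical names used only to illustrate the header layout (`version-trace_id-parent_id-flags`, all lowercase hex). As a simplification, the sketch rejects any header that does not have exactly four fields.

```python
import re
from dataclasses import dataclass
from typing import Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


@dataclass(frozen=True)
class TraceParent:
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    if match["version"] == "ff":
        # Version 0xff is reserved as invalid.
        return None
    if match["trace_id"] == "0" * 32 or match["parent_id"] == "0" * 16:
        # All-zero identifiers are invalid.
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceParent(
        version=match["version"],
        trace_id=match["trace_id"],
        parent_id=match["parent_id"],
        sampled=sampled,
    )
```

A downstream span that reuses the parsed `trace_id` and records `parent_id` as its parent is what lets a tracing backend stitch the Manager's spans into the caller's trace.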
…pagation

Move instrument_aiohttp_server/client() from service_discovery_ctx to server_main() before the app is frozen. The instrumentor patches the Application class, but since root_app is already instantiated by that point, we must manually inject the OTel server middleware into the existing app's middleware list. This ensures incoming W3C traceparent headers are extracted and cross-service traces are correlated.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
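
The ordering problem this commit describes can be reproduced with a toy model that uses no aiohttp at all. Everything in the sketch is a stand-in: `web`, `Application`, `instrument_server()`, and `OTEL_MIDDLEWARE` are hypothetical, and the real instrumentor's internals may differ. The only behaviour taken from the PR is that the class is patched via `setattr`, so instances created earlier are unaffected.

```python
import types

# Toy stand-in for the aiohttp.web module (not the real library).
web = types.SimpleNamespace()


class Application:
    """Toy application holding an ordered middleware list."""

    def __init__(self) -> None:
        self.middlewares: list[str] = []


web.Application = Application

# Placeholder for the real OTel middleware callable.
OTEL_MIDDLEWARE = "otel_server_middleware"


def instrument_server() -> None:
    """Swap in a subclass that adds the middleware at construction time."""
    original = web.Application

    class InstrumentedApplication(original):
        def __init__(self) -> None:
            super().__init__()
            self.middlewares.insert(0, OTEL_MIDDLEWARE)

    setattr(web, "Application", InstrumentedApplication)


# The root app is created first, before OTel config is available.
root_app = web.Application()

# Instrumentation runs later; it only changes what web.Application refers to.
instrument_server()
late_app = web.Application()

# Snapshot showing that the pre-existing instance was not touched.
before_workaround = list(root_app.middlewares)

# Workaround: inject the middleware into the existing instance by hand.
if OTEL_MIDDLEWARE not in root_app.middlewares:
    root_app.middlewares.insert(0, OTEL_MIDDLEWARE)
```

In the toy model `before_workaround` is empty while `late_app` already carries the middleware, which mirrors why the commit injects the middleware into `root_app` manually. The membership check keeps the injection idempotent if the code path runs twice.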
Add max_queue_size and max_export_batch_size to OTELConfig and OpenTelemetrySpec, defaulting to 65536 and 4096 respectively. The SDK defaults (2048/512) are insufficient for production GraphQL workloads and cause span drops during burst traffic.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
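
To make the sizing argument concrete, here is a small sketch of the two settings and a worst-case drop estimate. `OTelBatchSettings` and `spans_dropped()` are hypothetical and use a plain dataclass rather than the project's actual `OTELConfig` model. The keyword names `max_queue_size` and `max_export_batch_size` match the Python SDK's `BatchSpanProcessor` constructor, which also requires the batch size not to exceed the queue size. The drop model assumes an entire burst arrives before the exporter drains anything, which is deliberately pessimistic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OTelBatchSettings:
    """Hypothetical stand-in for the two new OTELConfig fields."""

    max_queue_size: int = 65536
    max_export_batch_size: int = 4096

    def __post_init__(self) -> None:
        if self.max_queue_size <= 0 or self.max_export_batch_size <= 0:
            raise ValueError("sizes must be positive")
        if self.max_export_batch_size > self.max_queue_size:
            raise ValueError("max_export_batch_size must not exceed max_queue_size")

    def processor_kwargs(self) -> dict[str, int]:
        """Keyword arguments in the shape BatchSpanProcessor accepts."""
        return {
            "max_queue_size": self.max_queue_size,
            "max_export_batch_size": self.max_export_batch_size,
        }


def spans_dropped(burst: int, settings: OTelBatchSettings) -> int:
    """Worst case: spans lost if `burst` spans arrive before any export."""
    return max(0, burst - settings.max_queue_size)


sdk_defaults = OTelBatchSettings(max_queue_size=2048, max_export_batch_size=512)
tuned = OTelBatchSettings()
```

Under this model a burst of 10,000 resolver spans overflows the SDK-default queue (`spans_dropped(10_000, sdk_defaults)` is 7952) but fits within the tuned one (`spans_dropped(10_000, tuned)` is 0). Real drop rates depend on export latency and schedule delay, which the sketch ignores.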
resolves #8693 (BA-4330)
Overview
Enables OpenTelemetry distributed tracing in the Manager by activating the global TracerProvider and instrumenting aiohttp server/client for W3C Trace Context propagation. The aiohttp instrumentation ordering issue is worked around with manual middleware injection (see TODO below). Also tunes the BatchSpanProcessor queue and batch sizes for production GraphQL workloads.

Problem Statement
- apply_otel_tracer() never called trace.set_tracer_provider(), so no TracerProvider was registered globally.
- instrument_aiohttp_server() and instrument_aiohttp_client() were called inside service_discovery_ctx, which runs during runner.setup(), after the aiohttp Application is already instantiated and frozen. Since AioHttpServerInstrumentor works by patching the Application class via setattr, it only affects instances created after the call, leaving the existing root app without OTel middleware.
- traceparent/tracestate headers from upstream services (e.g., the Kubernetes Bridge) were never extracted, and cross-service trace correlation was impossible.
- The SDK-default BatchSpanProcessor queue size (2048) and export batch size (512) are insufficient for production GraphQL workloads, causing span drops during burst traffic.

Architecture
sequenceDiagram
    participant Bridge as Bridge Service
    participant AIOHTTP as aiohttp Server<br/>(OTel Middleware)
    participant GQL as GraphQL Layer
    participant Tempo as OTel Backend
    Bridge->>AIOHTTP: HTTP + traceparent header
    Note over AIOHTTP: OTel middleware extracts<br/>trace context
    AIOHTTP->>GQL: Request with trace context
    GQL-->>AIOHTTP: Response
    AIOHTTP-->>Bridge: HTTP Response
    AIOHTTP->>Tempo: Export spans

BatchSpanProcessor Tuning
Added max-queue-size (default: 65536) and max-export-batch-size (default: 4096) to OTELConfig, following the same Annotated[..., Field(default=...), BackendAIConfigMeta(...)] pattern used by other configs (Redis, Pyroscope, etc.). Values flow through OTELConfig → OpenTelemetrySpec → BatchSpanProcessor.

TODO: Instrumentation Ordering Workaround
The manager startup creates web.Application() before OTel config is available (config is loaded from etcd via config_provider_ctx). This PR works around the ordering constraint by:
- Calling instrument_aiohttp_server()/instrument_aiohttp_client() after config is loaded but before the app is frozen
- Manually injecting the OTel server middleware into root_app.middlewares

This is a workaround, not a proper fix. The proper fix would decouple the manager's setup procedure from the aiohttp Application lifecycle so that OTel instrumentation runs before web.Application() is created (e.g., by moving OTel config to BootstrapConfig or by restructuring startup to separate config loading from aiohttp cleanup contexts).

Checklist: (if applicable)
- ai.backend.test
- docs directory