Commit 8f0c264

hhoikoo and claude committed
fix(BA-4330): Fix OTel aiohttp instrumentation ordering for trace propagation
Move instrument_aiohttp_server() and instrument_aiohttp_client() from service_discovery_ctx to server_main(), before the app is frozen. The instrumentor patches the Application class, but root_app is already instantiated by that point, so the OTel server middleware must be injected manually into the existing app's middleware list. This ensures incoming W3C traceparent headers are extracted and cross-service traces are correlated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent e9bdb91 commit 8f0c264
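
The reasoning in the commit message can be illustrated with a minimal, stdlib-only sketch. It assumes the instrumentor works by replacing the Application attribute on a module with a subclass that prepends a tracing middleware in its constructor, which is consistent with the "patches the class via setattr" comment in this commit. The names Application, InstrumentedApplication, tracing_middleware, web, and instrument below are hypothetical stand-ins, not the aiohttp or OpenTelemetry APIs.

```python
from types import SimpleNamespace


class Application:
    """Hypothetical, simplified stand-in for an aiohttp-style application."""

    def __init__(self, middlewares=None):
        self.middlewares = list(middlewares or [])


def tracing_middleware(request, handler):
    """Hypothetical placeholder for a tracing middleware."""
    return handler(request)


class InstrumentedApplication(Application):
    """Subclass that prepends the tracing middleware at construction time."""

    def __init__(self, middlewares=None):
        super().__init__([tracing_middleware, *(middlewares or [])])


# Hypothetical module object whose Application attribute gets patched.
web = SimpleNamespace(Application=Application)


def instrument(module):
    # Assumption: instrumentation swaps the class on the module via setattr.
    setattr(module, "Application", InstrumentedApplication)


root_app = web.Application()   # created BEFORE instrumentation
instrument(web)
late_app = web.Application()   # created AFTER instrumentation

# The already-created instance is untouched by the class swap.
before_injection = list(root_app.middlewares)

# Manual injection, mirroring the approach taken in this commit.
root_app.middlewares.insert(0, tracing_middleware)
```

Only instances constructed after the swap pick up the middleware on their own, so an app object that already exists needs the explicit insert.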

4 files changed

Lines changed: 13 additions & 11 deletions

changes/8694.feature.md

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-Enable OpenTelemetry distributed tracing in Manager by activating the global TracerProvider, instrumenting aiohttp server/client for W3C Trace Context propagation, and adding OTel spans to GraphQL resolver middleware.
+Enable OpenTelemetry distributed tracing in Manager by activating the global TracerProvider and instrumenting aiohttp server/client for W3C Trace Context propagation.
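
For readers unfamiliar with W3C Trace Context, the traceparent header that the server instrumentation extracts has four dash-separated lowercase hex fields: version, trace-id (32 hex digits), parent-id (16 hex digits), and trace-flags (2 hex digits). The following is a minimal stdlib-only sketch of parsing it, not the OpenTelemetry propagator itself. It handles only the version-00 layout, and the names TraceParent and parse_traceparent are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (version-00 layout).
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The least significant bit of trace-flags is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

A downstream service that reuses the same trace-id for its own spans is what allows traces to be correlated across services.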

src/ai/backend/logging/otel.py

Lines changed: 0 additions & 2 deletions
@@ -69,12 +69,10 @@ def apply_otel_tracer(spec: OpenTelemetrySpec) -> None:


 def instrument_aiohttp_server() -> None:
-    # TODO: Apply after the setup procedure is decoupled from aiohttp
     AioHttpServerInstrumentor().instrument()
     logging.info("OpenTelemetry tracing for aiohttp server initialized successfully.")


 def instrument_aiohttp_client() -> None:
-    # TODO: Apply after the setup procedure is decoupled from aiohttp
     AioHttpClientInstrumentor().instrument()
     logging.info("OpenTelemetry tracing for aiohttp client initialized successfully.")

src/ai/backend/manager/cli/__main__.py

Lines changed: 0 additions & 6 deletions
@@ -265,17 +265,11 @@ def clear_history(cli_ctx: CLIContext, retention: str, vacuum_full: bool) -> Non
     from more_itertools import chunked

     from ai.backend.common.validators import TimeDuration
-    from ai.backend.manager.models.agent import AgentRow
     from ai.backend.manager.models.error_logs import error_logs
     from ai.backend.manager.models.kernel import kernels
     from ai.backend.manager.models.session import SessionRow
     from ai.backend.manager.models.utils import connect_database, vacuum_db

-    # AgentRow must be imported to register it with SQLAlchemy's ORM mapper.
-    # KernelRow (via SessionRow) has a relationship to AgentRow using a string reference,
-    # which requires AgentRow to be in the mapper registry.
-    _ = AgentRow
-
     from .context import redis_ctx

     log = _get_logger()

src/ai/backend/manager/server.py

Lines changed: 12 additions & 2 deletions
@@ -41,6 +41,9 @@
 import uvloop
 from aiohttp import web
 from aiohttp.typedefs import Handler, Middleware
+from opentelemetry.instrumentation.aiohttp_server import (
+    middleware as otel_server_middleware,
+)
 from setproctitle import setproctitle
 from zmq.auth.certs import load_certificate

@@ -835,8 +838,6 @@ async def service_discovery_ctx(root_ctx: RootContext) -> AsyncIterator[None]:
         service_instance_name=meta.display_name,
     )
     BraceStyleAdapter.apply_otel(otel_spec)
-    instrument_aiohttp_server()
-    instrument_aiohttp_client()
     try:
         yield
     finally:
@@ -1787,6 +1788,15 @@ async def webapp_ctx(root_app: web.Application) -> AsyncGenerator[None]:
     jwt_config = root_ctx.config_provider.config.jwt.to_jwt_config()
     root_ctx.jwt_validator = JWTValidator(jwt_config)

+    # TODO: Remove manual middleware injection once the manager startup is
+    # decoupled from the aiohttp Application lifecycle. Currently root_app is
+    # instantiated before OTel config is available, so instrument_aiohttp_server()
+    # (which patches the class via setattr) cannot take effect automatically.
+    if root_ctx.config_provider.config.otel.enabled:
+        instrument_aiohttp_server()
+        instrument_aiohttp_client()
+        root_app.middlewares.insert(0, otel_server_middleware)
+
     # Plugin webapps should be loaded before runner.setup() because root_app is frozen upon on_startup event.
     await manager_init_stack.enter_async_context(webapp_plugin_ctx(root_app))
     await manager_init_stack.enter_async_context(webapp_ctx(root_app))
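
The choice of insert(0, ...) matters because aiohttp builds the handler chain by applying the middleware list in reverse, so the first entry becomes the outermost wrapper. The sketch below simulates that ordering with the standard library only. The functions make_middleware, build_chain, and run_chain are hypothetical helpers for illustration, and the "otel" and "auth" labels are placeholders rather than the manager's actual middleware names.

```python
import asyncio
from functools import partial


async def base_handler(request):
    request["calls"].append("handler")
    return "ok"


def make_middleware(name):
    """Build a middleware that records when it runs around the handler."""

    async def middleware(request, handler):
        request["calls"].append(f"{name}:before")
        response = await handler(request)
        request["calls"].append(f"{name}:after")
        return response

    return middleware


def build_chain(middlewares, handler):
    # Apply in reverse so the FIRST list entry ends up outermost.
    for mw in reversed(middlewares):
        handler = partial(mw, handler=handler)
    return handler


def run_chain(middlewares):
    request = {"calls": []}
    asyncio.run(build_chain(middlewares, base_handler)(request))
    return request["calls"]


middlewares = [make_middleware("auth")]
# Prepending makes the tracing middleware wrap everything else.
middlewares.insert(0, make_middleware("otel"))
calls = run_chain(middlewares)
```

With the tracing middleware outermost, a span opened there would cover the other middlewares as well as the handler, and the trace context is extracted before any of them run.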
