Conversation
gagandeepb
left a comment
Thanks for this. I like this overall approach quite a bit, especially the embedded PromQL query capabilities. Just one comment about the PromQL parser implementation from an extensibility perspective, among others (I'd also be happy to sync on it).
nelsonkopliku
left a comment
Hey @balanza, thanks for this. Good job!
I mainly left questions to better understand and possibly raise some open points.
I feel neither strongly against nor strongly confident; OTOH, it is true that if we do not experiment, we won't get information about the behavior.
In this spirit I'd like to hear your answers so that we are on the same page before we move forward.
```elixir
    ]
  ],
  responses: [
    ok:
```
thought: this endpoint could return 400 as well, right? Might be worth adding it.
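For illustration, the responses list with a 400 added might look like this sketch; only bad_request: Schema.BadRequest.response() appears elsewhere in this diff, the ok tuple shown here (including the HostMetricsResponse name) is assumed:

```elixir
responses: [
  ok: {"Host metrics", "application/json", HostMetricsResponse.response()},
  bad_request: Schema.BadRequest.response()
]
```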
```elixir
    ]
  ],
  responses: [
    ok:
```
same thing about error responses.
```elixir
get "/hosts/:id/metrics/query", HostMetricsController, :query
get "/hosts/:id/metrics/cpu", HostMetricsController, :aggregated_cpu
get "/hosts/:id/metrics/memory", HostMetricsController, :aggregated_memory
```
question: I am not against changing to these paths instead of the previous /charts/hosts/:id/whatever. It depends on the semantics we want to give, and the fact that the deprecated ones still work makes me feel comfortable.
I don't want to end up in a "what if this or that happens in the future" discussion; nonetheless, it is true that metrics based on an entity other than host (would that even make sense, though?) could reopen the question: "should we have this family of endpoints scattered under the specific resource, or group them into a /charts/... subpath?"
I am ok with the proposal: it works, it's clean, we don't know if non host specific metrics even make sense, legacy is still working and if anything we can change again.
I'd just like to hear more about the rationale.
question 2: about /hosts/:id/metrics/query
What happens if, for instance, we ask the AI for metrics that span multiple hosts?
Like
- what is the overall CPU load of the SAP System xyz?
- what has been the memory usage trend of cluster foo in the last week?
Haven't set this up locally yet to try it myself, just curious.
The sooner we have more sample questions to test against, the better.
Technically speaking I'd expect several requests to be made via mcp to this API (or even also to the more specific cpu/mem/filesystem ones), one or more for each involved host.
Bottom line:
- I think having this endpoint bound/restricted to a host is a reasonable guardrail to start with (have you considered an unbound query endpoint? i.e. a full proxy to prom?)
- the sooner we test more questions, the better
- performance is not an issue, at this time, however we do not have control on the number of requests a question might translate to
There are several things we need to validate as we proceed, as usual 😄
I am not sure I get the whole point. I'll try to answer:
- In the Trento architecture, metrics are always per-host, as they are collected from each host and labelled with agentID.
- AI agents can use this endpoint alongside other APIs, and then they can usually answer questions like "gimme the CPU status for all hosts in cluster X" by querying the host list first.
- If we find it inconvenient, we can wrap all this into a /cluster/:id/metrics API that does the listing and the aggregation itself.
- Unbounded query: I have no strong opinion, either against or in favour; I wonder if the AI agent would mess up when reconciling the raw results with the Trento model (i.e. the agentID label that maps to the host ID).
- About moving the endpoint under another subpath: it's been a bit bold of me; I just realised that chart is really a UI term, and it would just make more sense to me to have the APIs under metrics. But it's just lexical freakiness, no big deal to move them back under chart.
Thanks for the answer.
(5) I am not against that, and I actually agree on the semantic aspect
(4) The best thing we can do is start with something, see how it behaves, iterate and improve
About (1): it's true, and it's also true that a host is more often than not a specific part of a broader concept (cluster/sap systems...) which leads to:
- (2) yes, to get a non-host-specific result, AI agents will call the host list and, for each relevant host, issue any to all of the /metrics/cpu, /metrics/memory, /metrics/filesystem, /metrics/query requests, which, in turn, result in 1 to ~10 requests to prom
- (3) not that I strongly believe we should do it now, just raising my hand about the fact that, based on the size of the SAP systems trento monitors (aka the number of hosts), the unpredictability of the kinds of queries users would run, and the resulting requests in the trento->prom chain, we could consider the benefits of more guardrailed and more predictable paths
But again, we don't have all the answers and the way to get them is to proceed somehow 👍
For "multiple hosts" kind of queries, it would be very interesting IMHO to provide the results in an aggregated form that makes sense in the SAP/Trento model. That would help users (both human and agents) better understand data, given the context.
That is, of course, if we have relevant queries to offer to our users.
At some point, we could even have the /metrics query without any hostID, where we could query anything raw from Prometheus, as a totally open door. Maybe the AI is smart enough to do the aggregation as long as it gets the data.
```elixir
    bad_request: Schema.BadRequest.response()
  ]

def query(conn, %{"id" => host_id, "query" => _query} = params) do
```
question: have we considered any ability to gate this endpoint? Sure, it can be added later, just asking whether we want to play it safe.
arbulu89
left a comment
Hey @balanza ,
Could we maybe try to isolate the raw query implementation, without touching the current chart endpoints?
There is other parallel work on this exact codebase that will have an impact, and it will make everything more complex to work on.
If we like the idea of moving this API under
@nelsonkopliku @arbulu89 sorry for the direct mention. The conversation got a bit broad, so let me try to narrow it down before going forward. I see 3 kinds of feedback:
The rationale of (1) is to make room for other metrics or specific queries that we may want to add. It's only about naming and organising things. I arbitrarily chose
I'm gonna address (3) punctually. About (2), I think it's a broader discussion that we can take outside of this PR, unless it's a blocker for this PR.
@balanza for me we can go ahead with (1) and (3). (2) has never really been a blocker at this stage 👍
@balanza @nelsonkopliku For me, the doubt about renaming from
I think that having potentially both
For example, (and this might be the current PR nitpick) the
After the conversation, it emerged that moving the current chart endpoint under the host metrics was a bad idea, so I reverted it.
```elixir
sample: %{
  value: value
}
"metric" => %{"exporter_name" => exporter_name},
```
question:
is this change in parse_exporter_status intentional?
It was changed just in the previous PR.
```elixir
url = "#{prometheus_url}/api/v1/query_range"
headers = [{"Accept", "application/json"}]
params = %{query: query, start: start_parameter, end: end_parameter, step: "60s"}

defp perform_simple_query(query) do
```
comment:
I'm not sure about this.
Now we would have perform_simple_query/1 and perform_simple_query/2, and they don't do the same thing.
Wouldn't it have been better to keep the default value usage?
Why doesn't the first one call ChartIntegration.vector_results_to_samples()?
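For reference, the default-value variant the comment suggests might look like this sketch; do_query/2 is a hypothetical stand-in for the actual Prometheus call:

```elixir
# One head with an optional time argument instead of two diverging arities,
# so both call paths shape their results the same way.
defp perform_simple_query(query, time \\ nil) do
  query
  |> do_query(time)
  |> ChartIntegration.vector_results_to_samples()
end
```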
```elixir
in: :path,
description: "Host ID (UUID).",
required: true,
schema: %OpenApiSpex.Schema{type: :string, format: :uuid}
```
nit:
maybe we should put examples in the schema entries. It makes the composition of the query easier to learn
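A sketch of what an example in the schema entry could look like (the UUID value is, of course, made up):

```elixir
schema: %OpenApiSpex.Schema{
  type: :string,
  format: :uuid,
  example: "9cd46919-5f19-59aa-993e-cf3736c71053"
}
```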
```elixir
in: :query,
description: "Range query start timestamp (RFC3339 or Unix).",
required: false,
schema: %OpenApiSpex.Schema{type: :string}
```
nit:
maybe format: :"date-time"?
For this one, end and time.
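Sketch of the suggested change; note that a strict :"date-time" format would no longer match plain Unix timestamps, which the description currently allows:

```elixir
schema: %OpenApiSpex.Schema{
  type: :string,
  format: :"date-time",
  example: "2024-01-01T00:00:00Z"
}
```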
```elixir
],
timeout: [
  in: :query,
  description: "Evaluation timeout.",
```
question:
how does the timeout work, and what is the user supposed to put here?
```elixir
    bad_request: Schema.BadRequest.response()
  ]

def metrics(conn, %{"id" => _host_id} = params) when not is_map_key(params, "query") do
```
comment:
I don't think we need this entry. OpenAPI validation should fail if the query is not provided.
We just need to configure it at the top of the file, before the fallback controller:

```elixir
plug OpenApiSpex.Plug.CastAndValidate, json_render_error_v2: true
```
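Sketched in context; the module name and use lines are assumed from the usual Phoenix controller layout, not taken from this PR:

```elixir
defmodule TrentoWeb.V1.HostMetricsController do
  use TrentoWeb, :controller
  use OpenApiSpex.ControllerSpecs

  # Rejects requests that don't match the operation spec (e.g. a missing
  # "query" param) before any action runs, and casts params to atom keys.
  plug OpenApiSpex.Plug.CastAndValidate, json_render_error_v2: true

  action_fallback TrentoWeb.FallbackController
end
```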
```elixir
  |> render(:"400", reason: "Missing required query parameter: query")
end

def metrics(conn, %{"id" => host_id, "query" => _query} = params) do
```
comment:
once the validation and casting are done, we need to change this to:

```elixir
%{id: host_id, query: _query}
```
```elixir
  perform_simple_query(query, time)
end

def proxy_query(host_id, params) do
```
I think def proxy_query(host_id, query, params) could be more appropriate.
The query param is a totally meaningful and "special" param.
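The suggested signature as a sketch; the helper names in the body are hypothetical, only the argument split is the point:

```elixir
def proxy_query(host_id, query, params) do
  # query is promoted to an explicit argument since it is required and
  # "special"; params keeps the optional knobs (start, end, step, timeout).
  query
  |> PromQL.inject_agent_id(host_id)
  |> perform_query(params)
end
```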
```elixir
  } = response
end

describe "metrics proxy" do
```
comment:
I personally think that most of these tests should live in the prometheus_api_test.exs file, and the controller test could only cover schema validation, etc.
Anyway, that's my opinion and how I usually do testing.
The PR has undergone several refactors and rebases; closing to open a fresh one.
Add new host metrics API endpoints under /api/v1/hosts/:id/metrics/:
- GET /metrics/memory
- GET /metrics/query

The old /api/v1/charts/hosts/:id/cpu and /memory endpoints are preserved as deprecated, delegating to the new controller. CPU and memory chart assembly logic (host_cpu_chart, host_memory_chart) has been moved from Trento.Charts into Trento.Infrastructure.Prometheus, where it belongs alongside other Prometheus adapter operations. The ChartController now simply delegates to HostMetricsController. A dedicated PromQL module handles agentID label injection into PromQL expressions.

The host metrics API is also tagged as "MCP", thus the endpoints are discoverable by the trento mcp server.

EDIT: after the conversation, it emerged that moving the current chart endpoint under the host metrics was a bad idea, so I reverted it.
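Conceptually, the agentID label injection could be pictured with this naive sketch; the real PromQL module presumably parses the expression rather than doing regex surgery, which this toy version does (and which misses bare selectors like node_load1 without braces):

```elixir
defmodule PromQLSketch do
  # Adds an agentID matcher to every braced selector, e.g.
  # node_load1{mode="user"} becomes node_load1{mode="user",agentID="abc"}.
  def inject_agent_id(expr, host_id) do
    String.replace(expr, ~r/\{([^}]*)\}/, "{\\1,agentID=\"#{host_id}\"}")
  end
end
```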