Host metrics API #4069

Closed

balanza wants to merge 1 commit into main from prometheus-mcp

Conversation

Member

@balanza commented Mar 10, 2026

Add new host metrics API endpoints under /api/v1/hosts/:id/metrics/:

  • GET /metrics/cpu
  • GET /metrics/memory
  • GET /metrics/query

The old /api/v1/charts/hosts/:id/cpu and /memory endpoints are preserved as deprecated, delegating to the new controller.

CPU and memory chart assembly logic (host_cpu_chart, host_memory_chart) has been moved from
Trento.Charts into Trento.Infrastructure.Prometheus, where it belongs alongside other Prometheus adapter operations. The ChartController now simply delegates to HostMetricsController.

A dedicated PromQL module handles agentID label injection into PromQL expressions.
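
The injection such a module performs could be sketched roughly like this (a hedged illustration; `PromQLSketch` and `inject_agent_id/2` are made-up names, not the PR's actual API, and a real implementation would rewrite the parsed PromQL AST rather than rely on string replacement):

```elixir
# Hypothetical sketch only: scopes a PromQL expression to one agent by adding
# an agentID matcher to every label-matcher block. It assumes each selector
# already carries braces (e.g. metric{...}); bare metric names are not handled.
defmodule PromQLSketch do
  def inject_agent_id(expr, agent_id) do
    String.replace(expr, "{", ~s({agentID="#{agent_id}",))
  end
end

PromQLSketch.inject_agent_id(~s(rate(node_cpu_seconds_total{mode="idle"}[5m])), "agent-1")
# => "rate(node_cpu_seconds_total{agentID=\"agent-1\",mode=\"idle\"}[5m])"
```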

The host metrics APIs are also tagged as "MCP" so that they are discoverable by the Trento MCP server.

EDIT: after the conversation, it emerged that moving the current chart endpoint under the host metrics was a bad idea, so I reverted it.

Contributor

@gagandeepb left a comment


Thanks for this. I like this overall approach quite a bit, especially the embedded PromQL query capabilities. Just one comment about the PromQL parser implementation from an extensibility perspective, among others (which I would also be happy to sync on).

Comment thread lib/trento/infrastructure/prometheus/promql.ex
@balanza force-pushed the prometheus-mcp branch 3 times, most recently from 7512451 to 4fd8a99 on March 10, 2026 at 17:45
@balanza marked this pull request as ready for review on March 10, 2026 at 18:52
@balanza requested a review from nelsonkopliku on March 10, 2026 at 18:52
Member

@nelsonkopliku left a comment


Hey @balanza, thanks for this. Good job!

I mainly left questions to better understand and possibly raise some open points.

I feel neither strongly against nor strongly confident; OTOH, it is true that if we do not experiment, we won't get information about the behavior.

In this spirit I'd like to hear your answers so that we are on the same page before we move forward.

Comment thread lib/trento/infrastructure/prometheus/adapter/prometheus_api.ex Outdated
]
],
responses: [
ok:
Member


thought: this endpoint could return 400 as well, right? Might be worth adding it.
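
Applied to the snippet above, the responses list might read as follows (a sketch; the ok entry is illustrative, while Schema.BadRequest.response() is the helper already used elsewhere in this PR):

```elixir
responses: [
  ok: {"Host CPU metrics", "application/json", Schema.HostCpuMetrics},
  bad_request: Schema.BadRequest.response()
]
```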

]
],
responses: [
ok:
Member


same thing about error responses.

Comment thread lib/trento/infrastructure/prometheus/adapter/prometheus_api.ex
Comment thread lib/trento_web/router.ex Outdated
Comment on lines +147 to +149
get "/hosts/:id/metrics/query", HostMetricsController, :query
get "/hosts/:id/metrics/cpu", HostMetricsController, :aggregated_cpu
get "/hosts/:id/metrics/memory", HostMetricsController, :aggregated_memory
Member

@nelsonkopliku Mar 11, 2026


question: I am not against changing to these paths instead of the previous /charts/hosts/:id/whatever. It depends on the semantics we want to give, and the fact that the deprecated ones still work makes me feel comfortable.

I don't want to end up in the realm of "what if this or that happens in the future" discussions; nonetheless, it is true that metrics based on a different entity than host (would that even make sense, though?) could reopen the "should we have this family of endpoints scattered under the specific resource, or group them into a /charts/... subpath?" question.

I am ok with the proposal: it works, it's clean, we don't know if non-host-specific metrics even make sense, legacy still works, and if anything we can change again.

I'd just like to hear more about the rationale.


question 2: about /hosts/:id/metrics/query

What happens if, for instance, we ask the AI for metrics that span multiple hosts?
Like:

  • what is the overall CPU load of the SAP System xyz?
  • what has been the memory usage trend of cluster foo in the last week?

Haven't set this up locally yet to try it myself, just curious.
The sooner we have more sample questions to test against, the better.

Technically speaking, I'd expect several requests to be made via MCP to this API (or also to the more specific cpu/mem/filesystem ones), one or more for each involved host.

Bottom line:

  • I think having this endpoint bound/restricted to a host is a reasonable guardrail to start with (have you considered an unbound query endpoint, i.e. a full proxy to Prometheus?)
  • the sooner we test more questions, the better
  • performance is not an issue at this time; however, we do not have control over the number of requests a question might translate to

There are several things we need to validate as we proceed, as usual 😄

Member Author


I am not sure I get the whole point. I'll try to answer:

  1. In Trento architecture, metrics are always per-host, as they are collected from each host and labelled with agentID.
  2. AI agents can use this endpoint alongside other APIs, and then they can usually answer questions like "gimme the CPU status for all hosts in cluster X" by querying the host list first.
  3. If we find it inconvenient, we can wrap all this into a /cluster/:id/metrics API that does the listing and the aggregation itself
  4. Unbounded query: I have no strong opinion, either against or in favour; I wonder if the AI agent would mess up when reconciling the raw results with the Trento model (i.e. the agentID label that maps to the host ID)
  5. About moving the endpoint under another subpath: it was a bit bold of me; I just realised that chart is really a UI term, and it would make more sense to me to have the APIs under metrics. But it's just lexical fussiness, no big deal to move them back under chart

Member


Thanks for the answer.

(5) I am not against that, and I actually agree on the semantic aspect.
(4) The best thing we can do is start with something, see how it behaves, iterate and improve.

About (1): it's true, and it's also true that a host is more often than not a specific part of a broader concept (clusters/SAP systems...), which leads to:

  • (2) yes, to get a non-host-specific result, AI agents will call the host list and, for each relevant host, issue any or all of the /metrics/cpu, /metrics/memory, /metrics/filesystem, /metrics/query requests, which in turn result in 1 to ~10 requests to prom
  • (3) not that I strongly believe we should do it now; I'm just raising my hand about the fact that, based on the size of the SAP systems trento monitors (aka the number of hosts), the unpredictability of the kind of queries users would run, and the resulting requests in the trento->prom chain, we could consider the benefits of more guardrailed and more predictable paths

But again, we don't have all the answers and the way to get them is to proceed somehow 👍

Member Author


For "multiple hosts" kind of queries, it would be very interesting IMHO to provide the results in an aggregated form that makes sense in the SAP/Trento model. That would help users (both human and agents) better understand data, given the context.

That is, of course, if we have relevant queries to offer to our users.

Contributor


At some point, we could even have the /metrics query without any hostID, where we could query anything raw from Prometheus, as a totally open door. Maybe the AI is smart enough to do the aggregation as long as it gets the data.

bad_request: Schema.BadRequest.response()
]

def query(conn, %{"id" => host_id, "query" => _query} = params) do
Member


question: have we considered any ability to gate this endpoint? Sure, it can be added later, just asking whether we want to play it safe.

Comment thread lib/trento/infrastructure/prometheus/promql.ex
Contributor

@arbulu89 left a comment


Hey @balanza,
Could we maybe try to isolate the raw query implementation, without touching the current chart endpoints?
There is other parallel work on this exact codebase that will have an impact, and it will make everything more complex to work on.

Member Author

balanza commented Mar 12, 2026

Hey @balanza , Could we maybe try to isolate the raw query implementation? Without touching the current chart endpoints There are other parallel works on this exact codebase that will have impact and it will make everything more complex to work on

If we like the idea of moving this API under metrics, I'd rather wait and rebase - it's all about renaming files. Otherwise, I can roll back, no problem.

Member Author

balanza commented Mar 12, 2026

@nelsonkopliku @arbulu89 sorry for the direct mention. The conversation got a bit broad, so let me try to narrow it down before going forward. I see 3 kinds of feedback:

  1. About the refactor of moving the endpoints under /hosts/:id/metrics
  2. About how AI agents would make use of metrics
  3. Specific implementation feedback.

The rationale of (1) is to make room for other metrics or specific queries that we may want to add. It's only about naming and organising things. I arbitrarily chose /hosts/:id/metrics, but we can go for another one or roll back to /charts.

I'm gonna address (3) point by point.

About (2), I think it's a broader discussion that we can take outside of this PR, unless it's a blocker for this PR.

@nelsonkopliku
Member

@balanza for me we can go ahead with (1) and (3). (2) has never really been a blocker at this stage 👍

@arbulu89
Contributor

@balanza @nelsonkopliku For me, the doubt about renaming from chart to metrics is that right now, the metrics are composed in a way that is easy to use by charts (more specifically, by our current charts implementation).
Maybe there are other, better ways to expose metrics in a format better suited for AI.

I think that potentially having both charts and metrics endpoints might not be a bad idea. Otherwise, the /metrics endpoint is kind of hijacked by the charts.
Maybe even here, the way Prometheus exposes the metrics is better:
in a list, with metric name, metric value and some metric attributes,
instead of a transformed format.

For example (and this might be the current PR nitpick), the /metrics endpoint's description is "Get a CPU chart of a host", and maybe I wouldn't expect any chart-related wording for a metrics endpoint.

@balanza force-pushed the prometheus-mcp branch 5 times, most recently from a9541ee to cef153b on March 22, 2026 at 20:24
Member Author

balanza commented Mar 22, 2026

After the conversation, it emerged that moving the current chart endpoint under the host metrics was a bad idea, so I reverted it.

Contributor

@arbulu89 left a comment


Hey @balanza,
Looks like the right direction.
I left some comments/questions/nits.

sample: %{
value: value
}
"metric" => %{"exporter_name" => exporter_name},
Contributor


question:
is this change in parse_exporter_status intentional?
It was just changed in the previous PR.

url = "#{prometheus_url}/api/v1/query_range"
headers = [{"Accept", "application/json"}]
params = %{query: query, start: start_parameter, end: end_parameter, step: "60s"}
defp perform_simple_query(query) do
Contributor


comment:
I'm not sure about this.
Now we would have perform_simple_query/1 and perform_simple_query/2, and they don't do the same thing.
Wouldn't it have been better to keep the default-value usage?
Why doesn't the first one call ChartIntegration.vector_results_to_samples()?

in: :path,
description: "Host ID (UUID).",
required: true,
schema: %OpenApiSpex.Schema{type: :string, format: :uuid}
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
maybe we should put examples in the schema entries. It makes composing the query easier to learn.

in: :query,
description: "Range query start timestamp (RFC3339 or Unix).",
required: false,
schema: %OpenApiSpex.Schema{type: :string}
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
maybe format: :"date-time"?
The same for end and time.
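
Applied to the start parameter shown above, the entry would become (a sketch based on this thread's snippet; note the trade-off, since a strict date-time format no longer admits Unix timestamps):

```elixir
start: [
  in: :query,
  description: "Range query start timestamp (RFC3339).",
  required: false,
  # :"date-time" documents RFC3339; keeping Unix timestamps too would
  # require a plain string or a oneOf schema instead.
  schema: %OpenApiSpex.Schema{type: :string, format: :"date-time"}
]
```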

],
timeout: [
in: :query,
description: "Evaluation timeout.",
Contributor


question:
how does the timeout work, and what is the user supposed to put here?

bad_request: Schema.BadRequest.response()
]

def metrics(conn, %{"id" => _host_id} = params) when not is_map_key(params, "query") do
Contributor


comment:
I don't think we need this clause. OpenAPI validation should fail if the query is missing.
We just need to configure it at the top of the file, before the fallback controller:
plug OpenApiSpex.Plug.CastAndValidate, json_render_error_v2: true
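
Wired together with the atom-key matching mentioned further down in this review, the controller might look like this (a sketch under assumptions: the module name follows the thread, and the action body is purely illustrative):

```elixir
defmodule TrentoWeb.V1.HostMetricsController do
  use TrentoWeb, :controller
  use OpenApiSpex.ControllerSpecs

  # Rejects requests that fail the OpenAPI schema (e.g. a missing required
  # `query` param) with a 400 before the action runs, so no manual
  # "missing parameter" clause is needed.
  plug OpenApiSpex.Plug.CastAndValidate, json_render_error_v2: true

  action_fallback TrentoWeb.FallbackController

  # After CastAndValidate, params are cast and arrive with atom keys.
  def metrics(conn, %{id: host_id, query: query}) do
    json(conn, %{host_id: host_id, query: query})
  end
end
```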

|> render(:"400", reason: "Missing required query parameter: query")
end

def metrics(conn, %{"id" => host_id, "query" => _query} = params) do
Contributor


comment:
once the validation and casting are done, we need to change this to:
%{id: host_id, query: _query}

perform_simple_query(query, time)
end

def proxy_query(host_id, params) do
Contributor


I think def proxy_query(host_id, query, params) could be more appropriate.
The query param is a totally meaningful and "special" param.

} = response
end

describe "metrics proxy" do
Contributor


comment:
I personally think that most of these tests should live in the prometheus_api_test.exs file, and the controller test could cover just schema validation, etc.
Anyway, that's my opinion and how I usually do testing.

Member Author

balanza commented Mar 23, 2026

The PR has undergone several refactors and rebases; closing to open a fresh one.

@balanza closed this Mar 23, 2026
@balanza deleted the prometheus-mcp branch on March 23, 2026 at 21:26
@balanza mentioned this pull request Mar 23, 2026