Conversation
gagandeepb
left a comment
Thanks for this. I like this overall approach quite a bit, especially the embedded PromQL query capabilities. Just one comment about the PromQL parser implementation from an extensibility perspective, among others (I'd also be happy to sync on it).
nelsonkopliku
left a comment
Hey @balanza, thanks for this. Good job!
I mainly left questions to better understand and possibly raise some open points.
I feel neither strongly against nor strongly confident; OTOH, it is true that if we do not experiment, we won't get information about the behavior.
In this spirit I'd like to hear your answers so that we are on the same page before we move forward.
```elixir
    ]
  ],
  responses: [
    ok:
```
thought: this endpoint could return 400 as well, right? Might be worth adding it.
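For illustration, the responses list with a 400 added might look like this sketch; only bad_request: Schema.BadRequest.response() appears elsewhere in this diff, the ok tuple shown here (including the HostMetricsResponse name) is assumed:

```elixir
responses: [
  ok: {"Host metrics", "application/json", HostMetricsResponse.response()},
  bad_request: Schema.BadRequest.response()
]
```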
```elixir
    ]
  ],
  responses: [
    ok:
```
same thing about error responses.
```elixir
get "/hosts/:id/metrics/query", HostMetricsController, :query
get "/hosts/:id/metrics/cpu", HostMetricsController, :aggregated_cpu
get "/hosts/:id/metrics/memory", HostMetricsController, :aggregated_memory
```
question: I am not against changing to these paths instead of the previous /charts/hosts/:id/whatever. It depends on the semantics we want to give, and the fact that the deprecated ones still work makes me feel comfortable.
I don't want to end up in a "what if this or that happens in the future" discussion; nonetheless, it is true that metrics based on an entity other than host (would that even make sense, though?) could reopen the question: "should we have this family of endpoints scattered under the specific resource, or group them into a /charts/... subpath?"
I am ok with the proposal: it works, it's clean, we don't know if non host specific metrics even make sense, legacy is still working and if anything we can change again.
I'd just like to hear more about the rationale.
question 2: about /hosts/:id/metrics/query
What happens if, for instance, we ask the AI for metrics that span multiple hosts?
Like
- what is the overall CPU load of the SAP System xyz?
- what has been the memory usage trend of cluster foo in the last week?
Haven't set this up locally yet to try it myself, just curious.
The sooner we have more sample questions to test against, the better.
Technically speaking I'd expect several requests to be made via mcp to this API (or even also to the more specific cpu/mem/filesystem ones), one or more for each involved host.
Bottom line:
- I think having this endpoint bound/restricted to a host is a reasonable guardrail to start with (have you considered an unbound query endpoint? i.e. a full proxy to prom?)
- the sooner we test more questions, the better
- performance is not an issue, at this time, however we do not have control on the number of requests a question might translate to
There are several things we need to validate as we proceed, as usual 😄
I am not sure I get the whole point. I'll try to answer:
- In the Trento architecture, metrics are always per-host, as they are collected from each host and labelled with agentID.
- AI agents can use this endpoint alongside other APIs, and then they can usually answer questions like "gimme the CPU status for all hosts in cluster X" by querying the host list first.
- If we find it inconvenient, we can wrap all this into a /cluster/:id/metrics API that does the listing and the aggregation itself.
- Unbounded query: I have no strong opinion, either against or in favour; I wonder if the AI agent would mess up when reconciling the raw results with the Trento model (i.e. the agentID label that maps to the host ID).
- About moving the endpoint under another subpath: it's been a bit bold of me; I just realised that chart is really a UI term, and it would just make more sense to me to have the APIs under metrics. But it's just lexical freakiness, no big deal to move them back under chart.
Thanks for the answer.
(5) I am not against that, and I actually agree on the semantic aspect
(4) The best thing we can do is start with something, see how it behaves, iterate and improve
About (1): it's true, and it's also true that a host is more often than not a specific part of a broader concept (cluster/sap systems...) which leads to:
- (2) yes, to get a non-host-specific result, AI agents will call the host list and, for each relevant host, issue any to all of the /metrics/cpu, /metrics/memory, /metrics/filesystem, /metrics/query requests, which, in turn, result in 1 to ~10 requests to prom
- (3) not that I strongly believe we should do it now, just raising my hand about the fact that, based on the size of the SAP systems trento monitors (aka the number of hosts), the unpredictability of the kinds of queries users would run, and the resulting requests in the trento->prom chain, we could consider the benefits of more guardrailed and more predictable paths
But again, we don't have all the answers and the way to get them is to proceed somehow 👍
For "multiple hosts" kind of queries, it would be very interesting IMHO to provide the results in an aggregated form that makes sense in the SAP/Trento model. That would help users (both human and agents) better understand data, given the context.
That is, of course, if we have relevant queries to offer to our users.
At some point, we could even have the /metrics query without any hostID, where we could query anything raw from Prometheus, as a totally open door. Maybe the AI is smart enough to do the aggregation as long as it gets the data.
```elixir
    bad_request: Schema.BadRequest.response()
  ]

def query(conn, %{"id" => host_id, "query" => _query} = params) do
```
question: have we considered any ability to gate this endpoint? Sure, it can be added later, just asking whether we want to play it safe.
arbulu89
left a comment
Hey @balanza ,
Could we maybe try to isolate the raw query implementation, without touching the current chart endpoints?
There is other parallel work on this exact codebase that will have an impact, and it will make everything more complex to work on.
If we like the idea of moving this API under
@nelsonkopliku @arbulu89 sorry for the direct mention. The conversation got a bit broad, so let me try to narrow it down before going forward. I see 3 kinds of feedback:
The rationale of (1) is to make room for other metrics or specific queries that we may want to add. It's only about naming and organising things. I arbitrarily chose
I'm gonna address (3) punctually. About (2), I think it's a broader discussion that we can take outside of this PR, unless it's a blocker for this PR.
@balanza for me we can go ahead with (1) and (3). (2) has never really been a blocker at this stage 👍
@balanza @nelsonkopliku For me, the doubt about renaming from
I think that having potentially both
For example, (and this might be the current PR nitpick) the
After the conversation, it emerged that moving the current chart endpoint under the host metrics was a bad idea, so I reverted it.
```elixir
sample: %{
  value: value
}
"metric" => %{"exporter_name" => exporter_name},
```
question:
is this change in parse_exporter_status intentional?
It was changed just in the previous PR.
```elixir
url = "#{prometheus_url}/api/v1/query_range"
headers = [{"Accept", "application/json"}]
params = %{query: query, start: start_parameter, end: end_parameter, step: "60s"}

defp perform_simple_query(query) do
```
comment:
I'm not sure about this.
Now we would have perform_simple_query/1 and perform_simple_query/2, and they don't do the same thing.
Wouldn't it have been better to keep the default value usage?
Why doesn't the first one call ChartIntegration.vector_results_to_samples()?
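For reference, the default-value variant the comment suggests might look like this sketch; do_query/2 is a hypothetical stand-in for the actual Prometheus call:

```elixir
# One head with an optional time argument instead of two diverging arities,
# so both call paths shape their results the same way.
defp perform_simple_query(query, time \\ nil) do
  query
  |> do_query(time)
  |> ChartIntegration.vector_results_to_samples()
end
```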
```elixir
in: :path,
description: "Host ID (UUID).",
required: true,
schema: %OpenApiSpex.Schema{type: :string, format: :uuid}
```
nit:
maybe we should put examples in the schema entries. It makes the composition of the query easier to learn
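A sketch of what an example in the schema entry could look like (the UUID value is, of course, made up):

```elixir
schema: %OpenApiSpex.Schema{
  type: :string,
  format: :uuid,
  example: "9cd46919-5f19-59aa-993e-cf3736c71053"
}
```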
```elixir
in: :query,
description: "Range query start timestamp (RFC3339 or Unix).",
required: false,
schema: %OpenApiSpex.Schema{type: :string}
```
nit:
maybe format: :"date-time"?
For this one, end and time.
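Sketch of the suggested change; note that a strict :"date-time" format would no longer match plain Unix timestamps, which the description currently allows:

```elixir
schema: %OpenApiSpex.Schema{
  type: :string,
  format: :"date-time",
  example: "2024-01-01T00:00:00Z"
}
```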
```elixir
],
timeout: [
  in: :query,
  description: "Evaluation timeout.",
```
question:
how does the timeout work, and what is the user supposed to put here?
```elixir
    bad_request: Schema.BadRequest.response()
  ]

def metrics(conn, %{"id" => _host_id} = params) when not is_map_key(params, "query") do
```
comment:
I don't think we need this entry. OpenAPI validation should fail if the query is not provided.
We just need to configure it at the top of the file, before the fallback controller:

```elixir
plug OpenApiSpex.Plug.CastAndValidate, json_render_error_v2: true
```
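Sketched in context; the module name and use lines are assumed from the usual Phoenix controller layout, not taken from this PR:

```elixir
defmodule TrentoWeb.V1.HostMetricsController do
  use TrentoWeb, :controller
  use OpenApiSpex.ControllerSpecs

  # Rejects requests that don't match the operation spec (e.g. a missing
  # "query" param) before any action runs, and casts params to atom keys.
  plug OpenApiSpex.Plug.CastAndValidate, json_render_error_v2: true

  action_fallback TrentoWeb.FallbackController
end
```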
```elixir
  |> render(:"400", reason: "Missing required query parameter: query")
end

def metrics(conn, %{"id" => host_id, "query" => _query} = params) do
```
comment:
once the validation and casting are done, we need to change this to:

```elixir
%{id: host_id, query: _query}
```
```elixir
  perform_simple_query(query, time)
end

def proxy_query(host_id, params) do
```
I think def proxy_query(host_id, query, params) could be more appropriate.
The query param is a totally meaningful and "special" param.
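The suggested signature as a sketch; the helper names in the body are hypothetical, only the argument split is the point:

```elixir
def proxy_query(host_id, query, params) do
  # query is promoted to an explicit argument since it is required and
  # "special"; params keeps the optional knobs (start, end, step, timeout).
  query
  |> PromQL.inject_agent_id(host_id)
  |> perform_query(params)
end
```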
```elixir
  } = response
end

describe "metrics proxy" do
```
comment:
I personally think that most of these tests should live in the prometheus_api_test.exs file, and the controller test could only cover schema validation, etc.
Anyway, that's my opinion and how I usually do testing.
The PR has undergone several refactors and rebases; closing to open a fresh one.
Add new host metrics API endpoints under /api/v1/hosts/:id/metrics/:
- GET /metrics/memory
- GET /metrics/query

The old /api/v1/charts/hosts/:id/cpu and /memory endpoints are preserved as deprecated, delegating to the new controller. CPU and memory chart assembly logic (host_cpu_chart, host_memory_chart) has been moved from Trento.Charts into Trento.Infrastructure.Prometheus, where it belongs alongside other Prometheus adapter operations. The ChartController now simply delegates to HostMetricsController. A dedicated PromQL module handles agentID label injection into PromQL expressions.

The host metrics API is also tagged as "MCP", thus the endpoints are discoverable by the trento mcp server.

EDIT: after the conversation, it emerged that moving the current chart endpoint under the host metrics was a bad idea, so I reverted it.
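Conceptually, the agentID label injection could be pictured with this naive sketch; the real PromQL module presumably parses the expression rather than doing regex surgery, which this toy version does (and which misses bare selectors like node_load1 without braces):

```elixir
defmodule PromQLSketch do
  # Adds an agentID matcher to every braced selector, e.g.
  # node_load1{mode="user"} becomes node_load1{mode="user",agentID="abc"}.
  def inject_agent_id(expr, host_id) do
    String.replace(expr, ~r/\{([^}]*)\}/, "{\\1,agentID=\"#{host_id}\"}")
  end
end
```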