Skip to content

Helm installed application metric Grafana dashboards#800

Merged
poussa merged 4 commits intoopea-project:mainfrom
eero-t:dashboards
Feb 18, 2025
Merged

Helm installed application metric Grafana dashboards#800
poussa merged 4 commits intoopea-project:mainfrom
eero-t:dashboards

Conversation

@eero-t
Copy link
Collaborator

@eero-t eero-t commented Feb 11, 2025

Description

  • Add Helm installed Grafana dashboards
    • App component metrics (when monitoring is enabled)
    • App component scaling (when autoscaling is enabled)
  • Update docs accordingly & move files referred by docs to better locations

Note: to make reviewing the changes easier, I'll file separate PR for removing redundant files from under kubernetes/Observability/ after this is merged. Scaling dashboard in there included LLM metrics only for TGI, new one supports also vLLM.

Issues

n/a.

Type of change

  • New feature (non-breaking change which adds new functionality)

Dependencies

Requires HG workflow fix #814.

Tests

Manually tested that dashboard installation, and the dashboards themselves work.

@lianhao
Copy link
Collaborator

lianhao commented Feb 12, 2025

@eero-t Is opea-project/GenAIComps#1280 related to this failure?

@eero-t
Copy link
Collaborator Author

eero-t commented Feb 12, 2025

@eero-t Is opea-project/GenAIComps#1280 related to this failure?

@lianhao Yes. For some reason CI did not catch the the original change failing the CI...

(That issue does not happen in production. Only CI creates multiple Orchestrator instances in the same program.)

@eero-t
Copy link
Collaborator Author

eero-t commented Feb 13, 2025

Rebased to main to rerun CI, no other changes.

@lianhao CI still failing, but this time all failures are due to:

for img in `helm template -n $NAMESPACE -f helm-charts//chatqna/${value_file} $RELEASE_NAME helm-charts//chatqna | grep 'image:' | grep 'opea/' | awk '{print $2}' | xargs`
+ .github/workflows/scripts/e2e/chart_test.sh check_local_opea_image 100.83.111.229:5000/opea/chatqna-ui:latest
Failed to get image manifest 100.83.111.229:5000/opea/chatqna-ui:latest

@lianhao
Copy link
Collaborator

lianhao commented Feb 14, 2025

Bugs in GenAIExamples and GenAIComps resulted the latest changes are not populated into the CI's image registry repo. These bugs just got fixed yesterday. I'll manually populate the related container images

@yongfengdu
Copy link
Collaborator

I'm afraid the job of release charts will think "assets" is a helm chart and report fail.
https://github.com/opea-project/GenAIInfra/blob/main/.github/workflows/push-release-charts.yaml#L52
Either move the assets to elsewhere, or have the push-release-charts skip "assets" directory.

UIDs identifying the dashboards for Grafana, are based on the chart
"fullname" values.

That way, if dashboard configMaps get installed multiple times as
dependencies for different OPEA application charts, Grafana can
differentiate them based on those UIDs.

(At least as long as those application instances are given different
Helm release names, not just installed to separate namespaces.)

Another alternative would have been omitting the dashboard UIDs, to
let Grafana generate them.  However, that would have meant dashboard
URLs changing on every Helm re-deployment, as UID is part of the
Grafana dashboard URL!

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
* s/{{err}}/{{ printf "{{err}}" }}/
* s/{{le}}/{{ printf "{{le}}" }}/

Another possibility would be putting the dashboard JSON spec to a
separate file and reading it to Helm configMap template with:
  {{ .Files.Get "dashboards/metrics.json" | toJson | indent 4 }}

But that would require extra scripting to replace dashboard spec
"title" and "uid" fields with suitable templated values.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
This assume "prometheusNamespace" variable value to be correct.

Metrics dashboard is installed when monitoring is enabled, and scaling
dashboard when also HPA autoscaling is enabled.

NOTE: Regardless of which application installs the dashboard(s), they
are identical except for the title in the Grafana dashboards list (and
dashboard internal "uid").  Both dashboards show metrics for _any_ of
the OPEA applications that process streamed tokens with an LLM, not
just for ChatQnA or DocSum.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
* Describe new Helm functionality in monitoring.md
* Update kubernetes-addons/Observability/
  - Add index to README + move dashboard import info to its own section
  - Remove duplicate Protheus / Grafana install info
  - Move related files to better locations
* New dashboad screenshots

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@yongfengdu
Copy link
Collaborator

@eero-t Would you please add filter for assets here?
grep -vE 'valuefiles.yaml|common|assets|.md|.sh' |
https://github.com/opea-project/GenAIInfra/blob/main/.github/workflows/push-release-charts.yaml#L54C3-L54C5

@eero-t
Copy link
Collaborator Author

eero-t commented Feb 17, 2025

@eero-t Would you please add filter for assets here? grep -vE 'valuefiles.yaml|common|assets|.md|.sh' | https://github.com/opea-project/GenAIInfra/blob/main/.github/workflows/push-release-charts.yaml#L54C3-L54C5

@yongfengdu I filed separate PR for fixing the workflow, as it had bugs that were not related to this PR, see #814.

@eero-t
Copy link
Collaborator Author

eero-t commented Feb 17, 2025

Btw. In this PR "ChatQnA" & "DocSum" both have dashboard chart as dep when monitoring is enabled. However the resulting dashboard(s) are not really app specific and can also be installed separately (for apps that don't have that dependency). This could confuse users, if user installs multiple apps.

@poussa poussa merged commit f46e8c1 into opea-project:main Feb 18, 2025
18 checks passed
@eero-t eero-t deleted the dashboards branch February 21, 2025 15:14
@eero-t eero-t mentioned this pull request Apr 22, 2025
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants