feat: Switch Metric Collectors to Opentelemetry-kube-stack#273

Merged
denis-ryzhkov merged 10 commits into
k0rdent:mainfrom
aglarendil:feature/otel-kube-stack
Jul 21, 2025

@aglarendil aglarendil commented May 13, 2025

closes #204
refer to opentelemetry-kube-stack and use it to collect all the metrics from within the kof-collectors chart

we have 4 collectors:

  1. kube cluster collectors (cluster-stats)

enable as much as possible for kube collection

  2. node daemon collectors

host metrics
additional scrape config for "kubernetes-pods" jobs and kubelet metrics (/cadvisor, /metrics, /metrics/resource, /metrics/probes)

  3. k0s components collector (hostnetwork, polling etcd, kube-controller-manager)

prometheus receiver with a scrape config to poll pods such as kube-controller-manager, scheduler, etcd (launched by k0s)

syslog collector that also extracts fields using Grok patterns, because, for instance, the default Ubuntu 24.04 log format forwarded by systemd to rsyslog is not syslog-RFC-compliant at all

  4. target-allocator collector - dedicated

works only against prometheus objects. ta is enabled independently, as it modifies how node collectors receive targets to scrape and affects the part of the scrape config for the daemon that relies on hacks around env variables such as OTEK_KUBE_NODE_NAME and others. so we separate those 2 daemons so that they do not step on each other's toes
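
The Grok extraction mentioned for the syslog collector can be sketched roughly as below. This is only an illustration in Python, with a regular expression standing in for a Grok pattern; the sample line layout (ISO-8601 timestamp, host, program with optional pid, message) and the field names are assumptions, not the pattern shipped in the chart.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: timestamp, host, program with optional [pid], message.
# The layout is an assumption about what a non-RFC syslog line may look like.
SYSLOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<program>[^\s\[:]+)"
    r"(?:\[(?P<pid>\d+)\])?:\s*"
    r"(?P<message>.*)$"
)


def parse_syslog_line(line: str) -> Optional[Dict[str, str]]:
    """Return the extracted fields, or None when the line does not match."""
    match = SYSLOG_LINE.match(line)
    if match is None:
        return None
    # Optional groups that did not participate (e.g. pid) are left out.
    return {key: value for key, value in match.groupdict().items() if value is not None}
```

Lines that do not match return None, so a caller could forward them unparsed instead of dropping them.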

collectors are sprinkled with attribute transformers that populate node/job/instance and/or their opentelemetry counterparts (e.g. service.instance.id), so that kube-prom-stack dashboards and alerts work without label discrepancies.
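
To make the label alignment concrete, here is a minimal Python sketch of the mapping idea, not the transformer configuration used in the chart. The job and instance mapping follows the usual OpenTelemetry-to-Prometheus convention (service.namespace/service.name and service.instance.id); deriving node from k8s.node.name, and the function name itself, are assumptions made for illustration.

```python
from typing import Dict


def prometheus_labels(resource: Dict[str, str]) -> Dict[str, str]:
    """Derive Prometheus-style labels from OpenTelemetry resource attributes."""
    labels: Dict[str, str] = {}
    name = resource.get("service.name")
    namespace = resource.get("service.namespace")
    if name:
        # job is "<namespace>/<name>" when a namespace is present, else "<name>".
        labels["job"] = f"{namespace}/{name}" if namespace else name
    if "service.instance.id" in resource:
        labels["instance"] = resource["service.instance.id"]
    # Assumption: the node label is taken from k8s.node.name.
    if "k8s.node.name" in resource:
        labels["node"] = resource["k8s.node.name"]
    return labels
```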

possible known issues with this version:

  1. some attributes need to be additionally renamed ('/hostfs' for /var/log/syslog)
  2. some collectors are commented out (journald is still alpha, but can be parametrised to be enabled, requires json parsing and ugly hacks with LD_LIBRARY_PATH - honestly, to be removed in this version)
  3. requires additional filtering of redundant and very noisy metrics, such as some of the kubeapi latency buckets, etc.
  4. some servicemonitors might collect the same metrics as other collectors (e.g. node-exporter for the daemon collector and node-exporter via service monitor - to be cleaned up as well)
  5. otel-operator requires an explicit setting for fallbackstrategy to collect non-Node service monitors such as apiserver
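
The filtering of noisy metrics from the list above could look roughly like this. It is a Python sketch of a deny-list filter, not the collector configuration; the two metric names are examples picked for illustration and are not a list taken from this PR.

```python
import re
from typing import Iterable, List

# Hypothetical deny-list; the metric names are examples only.
NOISY_METRIC_PATTERNS = [
    re.compile(r"^apiserver_request_duration_seconds_bucket$"),
    re.compile(r"^etcd_request_duration_seconds_bucket$"),
]


def drop_noisy(metric_names: Iterable[str]) -> List[str]:
    """Return the names that match none of the deny-list patterns, in order."""
    return [
        name
        for name in metric_names
        if not any(pattern.search(name) for pattern in NOISY_METRIC_PATTERNS)
    ]
```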

P.S. the otel-kube-stack directory is a sandbox playground, to be removed in a subsequent commit a bit later

@aglarendil aglarendil force-pushed the feature/otel-kube-stack branch 7 times, most recently from faaad58 to a386e86 Compare May 19, 2025 14:02
@aglarendil aglarendil force-pushed the feature/otel-kube-stack branch 2 times, most recently from d6ca143 to dce4d33 Compare June 3, 2025 15:39
@aglarendil aglarendil changed the title Switch Metric Collectors to Opentelemetry-kube-stack feat: Switch Metric Collectors to Opentelemetry-kube-stack Jun 3, 2025
@aglarendil aglarendil marked this pull request as ready for review June 3, 2025 15:47
@aglarendil aglarendil force-pushed the feature/otel-kube-stack branch from 089bb98 to 344ceea Compare June 9, 2025 15:24
@aglarendil aglarendil force-pushed the feature/otel-kube-stack branch 3 times, most recently from ae9e70d to 580928e Compare June 17, 2025 17:12
@aglarendil aglarendil force-pushed the feature/otel-kube-stack branch from 580928e to 8bef154 Compare June 23, 2025 12:07
@aglarendil aglarendil requested a review from AndrejsPon00 as a code owner June 23, 2025 12:07
@aglarendil aglarendil force-pushed the feature/otel-kube-stack branch 7 times, most recently from 0af01a1 to 817e0df Compare June 30, 2025 15:40
@aglarendil aglarendil force-pushed the feature/otel-kube-stack branch 2 times, most recently from 8cf9bc2 to 8d857a3 Compare June 30, 2025 17:19
Comment thread charts/kof-child/templates/child-multi-cluster-service.yaml Outdated
Comment thread charts/kof-regional/templates/regional-multi-cluster-service.yaml Outdated
Comment thread charts/kof-child/templates/child-multi-cluster-service.yaml
Comment thread charts/kof-child/templates/child-multi-cluster-service.yaml Outdated
Comment thread charts/kof-collectors/Chart.lock
Comment thread charts/kof-regional/templates/regional-multi-cluster-service.yaml
Comment thread charts/kof-regional/templates/regional-multi-cluster-service.yaml Outdated
Comment thread charts/kof-collectors/values.yaml Outdated
Comment thread charts/kof-collectors/values.yaml
Comment thread charts/kof-collectors/values.yaml
Comment thread charts/kof-istio/templates/kof-regional-cluster-profile.yaml Outdated
Comment thread charts/kof-istio/templates/_helpers.tpl Outdated
Comment thread charts/kof-mothership/values.yaml
refer to opentelemetry-kube-stack and use it to collect all the metrics
from within the kof-collectors chart

also modify the MCS for kof-child and kof-regional accordingly to refer to the secrets related to basic-auth and also
"parametrise" exporters configuration

we have 4 collectors:

1. kube cluster collectors (cluster-stats)

enable as much as possible for kube collection

2. node daemon collectors

host metrics
additional scrape config for "kubernetes-pods" jobs and kubelet metrics (/cadvisor, /metrics, /metrics/resource, /metrics/probes)

3. k0s components collector (hostnetwork, polling etcd, kube-controller-manager)

prometheus receiver with a scrape config to poll pods such as kube-controller-manager, scheduler, etcd (launched by k0s)

syslog collector that also extracts fields using Grok patterns, because, for instance, the default Ubuntu 24.04 log format forwarded by systemd to rsyslog
is not syslog-RFC-compliant at all

4. target-allocator collector - dedicated

works only against prometheus objects. ta is enabled independently, as it modifies how node collectors receive targets to scrape and affects the part
of the scrape config for the daemon that relies on hacks around env variables such as OTEK_KUBE_NODE_NAME and others. so we separate those 2 daemons
so that they do not step on each other's toes

collectors are sprinkled with attribute transformers that populate node/job/instance and/or their
opentelemetry counterparts (e.g. service.instance.id), so that kube-prom-stack dashboards and alerts work without
label discrepancies.

possible known issues with this version:

1. some attributes need to be additionally renamed ('/hostfs' for /var/log/syslog)
2. some collectors are commented out (journald is still alpha, but can be parametrised to be enabled, requires json parsing
   and ugly hacks with LD_LIBRARY_PATH - honestly, to be removed in this version)
3. requires additional filtering of redundant and very noisy metrics, such as some of the kubeapi latency buckets, etc.
4. some servicemonitors might collect the same metrics as other collectors (e.g. node-exporter for the daemon collector and node-exporter via service monitor
 - to be cleaned up as well)
5. otel-operator requires explicit setting for fallbackstrategy to collect non-Node service monitors such as apiserver
6. istiod and other pod-based collection from upstream kof was also deleted as it should be picked up by "kubernetes-pods" job
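
Item 5 above (the fallback strategy for service monitors that are not tied to a node) can be illustrated with a small Python sketch. It only shows the idea of per-node assignment with a fallback for node-less targets such as the apiserver; the function and parameter names are hypothetical, and the plain modulo hash used here is a simplification, not the allocation logic of the real target allocator.

```python
import hashlib
from typing import Dict, Optional


def pick_collector(
    target_url: str,
    target_node: Optional[str],
    collector_by_node: Dict[str, str],
) -> Optional[str]:
    """Per-node assignment with a hash-based fallback for node-less targets."""
    # Targets that live on a known node go to the collector on that node.
    if target_node is not None and target_node in collector_by_node:
        return collector_by_node[target_node]
    # Fallback: spread the remaining targets deterministically over all collectors.
    collectors = sorted(collector_by_node.values())
    if not collectors:
        return None
    digest = hashlib.sha256(target_url.encode("utf-8")).digest()
    return collectors[int.from_bytes(digest[:8], "big") % len(collectors)]
```

Without the fallback branch, a target with no node would be assigned to no collector at all, which matches the symptom described in item 5.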

cleanup: remove journald binary dirty hack and comments

fix: defaultCRConfig values for default helm install

fix: remove kube-state-metrics and prom-node-exporter from subcharts

they are installed as subcharts of opentelemetry-kube-stack

fix: remove unneeded auth exts from regional mcs

fix: add auth exts removed from default config

fix: adopt istio to passing values explicitly to otel-kube-stack

misc: use cluster label instead of clusterName

address reviews

resolve global.cluster vs global.clusterName conundrum

global.clusterName may still be used by a couple of subcharts, while
clusterLabel should definitely be used for dashboards and metrics/alerts
to mirror the default behaviour in upstream.
however, for vmoperator and vlogscluster, global.cluster is already
reserved to contain some info in map format, so we just resort to:

clusterLabel - used for dashboards and metrics
clusterName - to have clustername in chart values

fix: fix variable interpolation for istio child template

fix: follow cluster and config changes in corresponding ns

fixup insecure flag, as it is not needed
@aglarendil aglarendil force-pushed the feature/otel-kube-stack branch from 8a6e09b to 1f0ce6c Compare July 14, 2025 12:04
@aglarendil aglarendil requested a review from denis-ryzhkov July 14, 2025 13:33
@denis-ryzhkov denis-ryzhkov merged commit eb2f283 into k0rdent:main Jul 21, 2025
6 checks passed
@github-project-automation github-project-automation Bot moved this to Done in k0rdent Jul 21, 2025

Development

Successfully merging this pull request may close these issues.

Switch to Upstream OpenTelemetry Kube Stack

3 participants