To help diagnose issues, the operator exposes a set of [Prometheus metrics](https://prometheus.io/). This repo also includes a ServiceMonitor definition (`yaml`) that can be deployed so that an existing (or new) Prometheus deployment can scrape these metrics.
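As a rough sketch, a ServiceMonitor for the operator looks like the following; the name, namespace, labels, and port here are illustrative assumptions, so check the definition shipped in this repo for the actual values:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: azure-databricks-operator          # assumed name
  namespace: azure-databricks-operator-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager    # assumed label on the metrics Service
  endpoints:
    - port: metrics                        # assumed port name
      path: /metrics
      interval: 30s
```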
## Operator metrics
- Enabling the Operator to output Prometheus metrics is done by customizing `config/default/kustomization.yaml`
- If you don't want Prometheus-Operator configuration generated, it can be disabled by commenting out the line indicated in `config/default/kustomization.yaml`
> *NOTE:* If you don't have the Prometheus-Operator installed, the ServiceMonitor CRD will not be available to you. Please see the section below for further information about installation.
- All custom operator metrics exposed on the metrics endpoint are prefixed with `databricks_`
In addition to the standard metrics that kubebuilder provides, the following custom metrics have been added.
The `databricks_request_duration_seconds` histogram provides metrics on the duration of calls made via the Databricks SDK and has the following labels:
|Name|Description|
|-|-|
|`object_type`|The type of CRD that the call relates to, e.g. `dcluster`|
|`action`| The action being performed, e.g. `get`, `create`|
|`outcome`|`success` or `failure`|
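Once scraped, the histogram can be queried in Prometheus with expressions of this shape (assuming the standard `_bucket` and `_count` series that Prometheus derives from a histogram):

```promql
# 95th percentile Databricks call duration per action, over the last 5 minutes
histogram_quantile(0.95,
  sum(rate(databricks_request_duration_seconds_bucket[5m])) by (le, action))

# Fraction of Databricks calls that succeeded over the last 5 minutes
sum(rate(databricks_request_duration_seconds_count{outcome="success"}[5m]))
  / sum(rate(databricks_request_duration_seconds_count[5m]))
```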
## Accessing Prometheus
- [Prometheus-Operator](https://github.com/coreos/prometheus-operator) can be installed in your cluster easily via Helm
> This repo provides an easy `make install-prometheus` target to perform the Helm installation
- Determine the name of the Prometheus service running in your cluster (if you used our `make` command, this will default to `prom-azure-databricks-oper-prometheus`)
- Port forward localhost:9090 to your service: `kubectl port-forward service/prom-azure-databricks-oper-prometheus 9090:9090`
> If using VSCode and a Dev Container, you may need to expose the internal port out to your host machine (Command Palette > Remote Containers: Forward Port from Container)
- Using a browser navigate to `http://localhost:9090` to view the Prometheus dashboard
- For more information regarding the usage of Prometheus please view the [docs here](https://prometheus.io/)
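Put together, the steps above look like this against a live cluster (the service name assumes the default `make` installation):

```shell
# Confirm the Prometheus service exists (name assumes the default make install)
kubectl get service prom-azure-databricks-oper-prometheus

# Forward the Prometheus UI to localhost:9090
kubectl port-forward service/prom-azure-databricks-oper-prometheus 9090:9090 &

# Check Prometheus is responding before opening http://localhost:9090 in a browser
curl -s http://localhost:9090/-/healthy
```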
## Grafana Dashboard
This repo also includes a Grafana dashboard named `Databricks Operator` that can be installed:
- If Prometheus-Operator is being used, then by default a sidecar is available to automatically install dashboards via `configmap`:
- Update `config/prometheus/grafana-dashboard-configmap.yaml` to have a namespace matching your Grafana service
- Apply the `configmap` into the same namespace as the Grafana service running the sidecar: `kubectl apply -f ./config/prometheus/grafana-dashboard-configmap.yaml`
- If you are not using Grafana/Prometheus-Operator, the dashboard JSON can be extracted and imported manually
- The dashboard provides general metrics regarding the health of your operator (see below for information about interpreting the chart data)
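For reference, dashboard `configmap`s for the Grafana sidecar generally follow this shape; the `grafana_dashboard` label is the common default watched by the Grafana Helm chart sidecar, but the actual names and labels live in `config/prometheus/grafana-dashboard-configmap.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: databricks-operator-dashboard   # assumed name
  namespace: monitoring                 # must match your Grafana service's namespace
  labels:
    grafana_dashboard: "1"              # label the sidecar watches for (assumed default)
data:
  databricks-operator.json: |           # the dashboard JSON payload goes here
    {}
```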
## Dashboard Charts
| Panel Name | Description | Usage |
|---|---|---|
|**Reconciliations Per Controller**| Increase/decrease in the total count of reconcile loops being performed | Useful for determining the number of reconcile loops that result in Error vs Success. <br /><br />A spike in errors can indicate something wrong inside the operator logic, such as a missing config Secret containing the Databricks URI.|
|**Controller Reconcile Time**| Median, 95th percentile and mean time taken to perform a reconciliation loop | Useful for seeing how long reconciliations take to complete; this is the complete lifecycle time and includes execution time in addition to upstream Databricks calls|
|**Workqueue Adds**| Increase/decrease of new work for the Operator to perform | Useful as it shows the incoming rate of Operator work requests to create CRDs. <br /><br />Note: the Operator also re-queues items for re-processing (for example, polling runs for completion status), so the graph will show a rate increase even when there is not strictly new work to be performed |
|**Workqueue Depth**| Increase/decrease of the Operator work queue depth | The work queue shows the number of reconcile loops currently awaiting an opportunity to run. <br /><br />Useful for seeing if the Operator is struggling to cope with incoming demands for work |
|**Average Databricks Request Duration**| Average and 95th percentile request duration when the Operator calls Databricks via its REST API | Useful for seeing how long Databricks takes to respond to requests from the Operator; can help diagnose network issues from the K8s cluster and potential timeout issues. |
|**Databricks REST endpoint calls - Success**| Increase/decrease of successful calls to Databricks REST endpoints | Useful for identifying the throughput rate of the Operator's calls to Databricks |
|**Databricks REST endpoint calls - Failure**| Increase/decrease of failed calls to Databricks REST endpoints | Useful for identifying the error rate of external Databricks calls; a sudden spike could indicate a Databricks outage, or a breaking change to the Databricks REST services affecting all requests to a specific endpoint |
| **Workqueue - Work Duration** | Median and 95th percentile of how long in seconds processing an item from the workqueue takes | Useful for measuring whether one type of CRD request takes longer than others to complete<br /><br />*Note:* This metric differs from Controller Reconcile Time because it includes overhead execution time, not just the time spent executing within the Controller. |
| **Workqueue - Queue Duration** | Median and 95th percentile of how long in seconds an item stays in the workqueue before being requested | Useful for measuring whether the work queue is backing up; can indicate that something is starving the Operator of CPU |
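For instance, the failure panel above can be reproduced with a query of this shape (assuming the `_count` series derived from the request-duration histogram carries the labels described earlier):

```promql
# Rate of failed Databricks REST calls, broken down by CRD type and action
sum(rate(databricks_request_duration_seconds_count{outcome="failure"}[5m]))
  by (object_type, action)
```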