This repository was archived by the owner on Jan 28, 2022. It is now read-only.

Commit 2988515

Author: Dave Storey
Updated metrics readme and added dashboard info (#171)
* updated metrics readme and added dashboard info
1 parent 0978e28 commit 2988515

File tree

- docs/deploy.md
- docs/metrics.md
- docs/resources.md

3 files changed: +55 −50 lines


docs/deploy.md

Lines changed: 2 additions & 0 deletions
````diff
@@ -101,3 +101,5 @@ kubectl --namespace azure-databricks-operator-system get pods
 # pull the logs
 kubectl --namespace azure-databricks-operator-system logs -f [name_of_the_operator_pod]
 ```
+
+To further aid debugging, diagnostic metrics are produced by the operator. Please review [the metrics page](metrics.md) for further information.
````

docs/metrics.md

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@

# Azure-Databricks-Operator Metrics

To help diagnose issues, the operator exposes a set of [Prometheus metrics](https://prometheus.io/). Also included with this repo is a ServiceMonitor definition `yaml` that can be deployed to enable an existing (or new) Prometheus deployment to scrape these metrics.
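To illustrate the shape such a definition takes, here is a minimal ServiceMonitor sketch. The names, namespace, labels and port below are illustrative assumptions, not the values shipped in this repo; refer to the repo's own `yaml` for the real definition.

```yaml
# Hypothetical ServiceMonitor sketch -- metadata, selector labels and
# port name are assumptions for illustration, not this repo's values.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: azure-databricks-operator
  namespace: azure-databricks-operator-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager   # must match the metrics Service's labels
  endpoints:
    - port: metrics   # named port on the metrics Service
      interval: 30s   # scrape frequency
```

Prometheus-Operator watches for ServiceMonitor resources and generates the corresponding scrape configuration automatically.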

## Operator metrics

- Enabling the operator to output Prometheus metrics is done via the customization of `config/default/kustomization.yaml`:
- If you don't want Prometheus-Operator configuration generated, it can be disabled by commenting out the line indicated in `config/default/kustomization.yaml`
> *NOTE:* If you don't have the Prometheus-Operator installed, the ServiceMonitor CRD will not be available to you. Please see the section below for further information about installation.
- All custom operator metrics exposed on the metrics endpoint are prefixed `databricks_`

In addition to the standard metrics that kubebuilder provides, the following custom metrics have been added.

The `databricks_request_duration_seconds` histogram provides metrics on the duration of calls made via the Databricks SDK and has the following labels:

|Name|Description|
|-|-|
|`object_type`|The type of CRD that the call relates to, e.g. `dcluster`|
|`action`|The action being performed, e.g. `get`, `create`|
|`outcome`|`success` or `failure`|
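Because this is a Prometheus histogram, it exposes the usual `_bucket`, `_count` and `_sum` series, which can be sliced by the labels above. As an illustration (these queries are not shipped with the repo), typical PromQL might look like:

```
# 95th percentile Databricks call latency over the last 5m, per action
histogram_quantile(0.95,
  sum by (le, action) (rate(databricks_request_duration_seconds_bucket[5m])))

# Rate of failed Databricks calls, per CRD type
sum by (object_type) (
  rate(databricks_request_duration_seconds_count{outcome="failure"}[5m]))
```

These can be run from the Prometheus web UI described in the next section.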

## Accessing Prometheus

- [Prometheus-Operator](https://github.com/coreos/prometheus-operator) can be installed in your cluster easily via Helm
  > This repo provides an easy `make install-prometheus` to perform the Helm installation
- Determine the name of the Prometheus service running in your cluster (if you used our `make` command this will default to `prom-azure-databricks-oper-prometheus`)
- Port forward localhost:9090 to your service: `kubectl port-forward service/prom-azure-databricks-oper-prometheus 9090:9090`
  > If using VSCode and a Dev Container, you may need to expose the internal port out to your host machine (Command Palette > Remote Containers: Forward Port from Container)
- Using a browser, navigate to `http://localhost:9090` to view the Prometheus dashboard
- For more information regarding the usage of Prometheus, please view the [docs here](https://prometheus.io/)
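The section of `docs/resources.md` removed by this commit also documented scraping a single operator pod directly, without going through Prometheus; those steps still apply and can be sketched as:

```shell
# Find the operator pod (the name suffix varies per deployment)
kubectl get pods -n azure-databricks-operator-system

# Forward the metrics port from the pod to localhost
# (substitute the real pod name for the <id> placeholder)
kubectl port-forward -n azure-databricks-operator-system \
  pod/azure-databricks-operator-controller-manager-<id> 8080:8080

# In another terminal, fetch the raw metrics and filter for the
# operator's custom databricks_ series
curl -s localhost:8080/metrics | grep '^databricks_'
```

This is useful for confirming the operator is emitting metrics before wiring up a Prometheus scrape.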

## Grafana Dashboard

This repo also includes a Grafana dashboard named `Databricks Operator` that can be installed:

- If Prometheus-Operator is being used, then by default a sidecar is available to automatically install dashboards via a `configmap`:
  - Update `config/prometheus/grafana-dashboard-configmap.yaml` to have a namespace matching your Grafana service
  - Apply the `configmap` into the same namespace as your Grafana service running the sidecar: `kubectl apply -f ./config/prometheus/grafana-dashboard-configmap.yaml`
- If you are not using Grafana/Prometheus-Operator, the json can be extracted and imported manually
- The dashboard provides general metrics regarding the health of your operator (see below for information about interpreting the chart data)

## Dashboard Charts

| Panel Name | Description | Usage |
|---|---|---|
| **Reconciliations Per Controller** | Increase/decrease in the total count of reconcile loops being performed | Useful for determining the number of reconcile loops that result in Error vs Success. <br /><br />A spike in errors can indicate something wrong inside the operator logic, such as a missing config Secret containing the Databricks URI. |
| **Controller Reconcile Time** | Median, 95th percentile and mean time taken to perform a reconciliation loop | Useful for seeing how long reconciliations take to complete. This is the complete lifecycle time and includes execution time in addition to upstream Databricks calls. |
| **Workqueue Adds** | Increase/decrease of new work for the operator to perform | Useful as it shows the incoming rate of operator work requests to create CRDs. <br /><br />*Note:* The operator also re-queues items to re-process (for example, polling runs for completion status), so the graph will show a rate increase even when there is no strictly "new" work to be performed. |
| **Workqueue Depth** | Increase/decrease of the operator work queue depth | The work queue shows the number of reconcile loops currently awaiting an opportunity to run. <br /><br />Useful for seeing if the operator is struggling to cope with incoming demands for work. |
| **Average Databricks Request Duration** | Average and 95th percentile request duration when the operator calls Databricks via its REST API | Useful for seeing how long Databricks takes to respond to requests from the operator; can help diagnose network issues from the K8s cluster and potential timeout issues. |
| **Databricks REST endpoint calls - Success** | Increase/decrease of successful calls to Databricks REST endpoints | Useful for identifying the throughput rate of operator calls to Databricks. |
| **Databricks REST endpoint calls - Failure** | Increase/decrease of failed calls to Databricks REST endpoints | Useful for identifying the error rate of external Databricks calls; a sudden spike could indicate a Databricks outage, or a breaking change to the Databricks REST services affecting all requests to a specific endpoint. |
| **Workqueue - Work Duration** | Median and 95th percentile of how long (in seconds) processing an item from the workqueue takes | Useful for measuring whether one type of CRD request takes longer than others to complete. <br /><br />*Note:* This metric differs from Controller Reconcile Time because it includes overhead execution time, not just the time spent executing within the controller. |
| **Workqueue - Queue Duration** | Median and 95th percentile of how long (in seconds) an item stays in the workqueue before being requested | Useful for measuring whether the work queue is backing up. Can indicate that something is starving the operator of CPU. |

docs/resources.md

Lines changed: 2 additions & 50 deletions
```diff
@@ -34,54 +34,6 @@ More info:
 - [Create a pipeline and add a status badge to Github](https://docs.microsoft.com/en-us/azure/devops/pipelines/create-first-pipeline?view=azure-devops&tabs=tfs-2018-2)
 - [Customize status badge with shields.io](https://shields.io/)
 
-## Operator metrics
+## Controller metrics and dashboards
 
-- Operator telemetry metrics are exposed via standard [Prometheus](https://prometheus.io/) format endpoints.
-- [Prometheus-Operator](https://github.com/coreos/prometheus-operator) is included as part of the operator deployment via Helm chart.
-- Prometheus configuration is generated via the `config/default/kustomization.yaml`
-- Installation of Prometheus-Operator can be manually triggered via command `make install-prometheus`
-- If you don't want Prometheus-Operator configuration generated, it can be disabled by commenting out the line indicated in `config/default/kustomization.yaml`
-- *NOTE:* If you don't have the Prometheus-Operator installed, the ServiceMonitor CRD will not be available to you
-- Custom metrics exposed by the Operator can be found by searching for `databricks_` inside the Prometheus web ui
-- Metrics follow the naming guidlines recommended by Prometheus
-
-### How to access the Prometheus instance
-- Have the operator installed and running locally. See [deploy.md](https://github.com/microsoft/azure-databricks-operator/blob/master/docs/deploy.md)
-- Determine the name of Prometheus service running in your cluster (by default this will be prom-azure-databricks-oper-prometheus)
-- Port forward localhost:9090 to your service: `kubectl port-forward service/prom-azure-databricks-oper-prometheus 9090:9090`
-- If using VSCode and Dev Container, you may need to expose the internal port out to your host machine (Command Pallete > Remote Containers Forward Port From Container)
-- Using a browser navigate to `http://localhost:9090` to view the Prometheus dashboard
-- For more information regarding the usage of Prometheus please view the [docs here](https://prometheus.io/)
-
-### How To scrape the metrics from a single intance of the Operator running on a Pod:
-- Have the operator installed and running locally. See [deploy.md](https://github.com/microsoft/azure-databricks-operator/blob/master/docs/deploy.md)
-- Determine the name of the pod running your operator: `kubectl get pods -n azure-databricks-operator-system`
-- Port forward localhost:8080 to your pod: `kubectl port-forward -n azure-databricks-operator-system pod/azure-databricks-operator-controller-manager-<id> 8080:8080`
-- Open another terminal and curl request the metric endpoint: `curl localhost:8080/metrics`
-
-### How to access metrics via Grafana
-- Have the operator installed and running locally. See [deploy.md](https://github.com/microsoft/azure-databricks-operator/blob/master/docs/deploy.md)
-- Determine the name of Grafana service running in your cluster (by default this will be prom-azure-databricks-operator-grafana)
-- Port forward localhost:8080 to your service: `kubectl port-forward service/prom-azure-databricks-operator-grafana 8080:80`
-- If using VSCode and Dev Container, you may need to expose the internal port out to your host machine (Command Pallete > Remote Containers Forward Port From Container)
-- Using a browser navigate to `http://localhost:8080` to view the Prometheus dashboard
-- If you are using the default helm installation of the Prometheus-Operator (as provided) then you can find the [default login details here](https://github.com/helm/charts/tree/master/stable/grafana#configuration)
-
-This repo also includes a Grafana dashboard that can be installed:
-- If Prometheus-Operator is being used ensure then by default a sidecar is available to automatically install dashboards via `configmap`:
-- Update `config/prometheus/grafana-dashboard-configmap.yaml` to have a namespace matching your Grafana service
-- Apply `configmap` into the same namespace as your Grafana service running the sidecar `kubectl apply -f ./config/prometheus/grafana-dashboard-configmap.yaml`
-- If you are not using Grafana/Prometheus-Operator, then the json can be extracted and imported manually
-- The dashboard provides you general metrics regarding the health of your operator (upstream databricks call success/failure rates and general health of the operator)
-
-### Counter metrics
-
-In addition to the standard metrics that kubebuilder provides, the following custom metrics have been added.
-
-The `databricks_request_duration_seconds` histogram provides metrics on the duration of calls via the databricks SDK and has the following labels:
-
-|Name|Description|
-|-|-|
-|`object_type`|The type of object that the call relatest to, e.g. `dcluster`|
-|`action`| The action being performed, e.g. `get`, `create`|
-|`outcome`| `success` or `failure`|
+
+For information on how to monitor metrics published from the operator, please review [the metrics page](metrics.md).
```
