Metrics and Monitoring V0.0.2: Model Container, Latencies, and Example#357
dcrankshaw merged 31 commits into ucbrise:develop from
Conversation
This commit adds an example in the example folder. It helps the user visualize Clipper metrics. The init_grafana.py script launches a grafana/grafana Docker container and adds Prometheus as a data source. I attempted to add the dashboard via the Grafana API but failed after trying for hours. This seems to have been a persistent issue for at least two years: grafana/grafana#2816 <Simon Mo>
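For reference, a minimal sketch of the data-source step through Grafana's HTTP API. The URLs, port numbers, and default admin:admin credentials are assumptions about a fresh grafana/grafana container, not necessarily what init_grafana.py does:

```python
def build_datasource_payload(prometheus_url="http://localhost:9090"):
    """Build the JSON body for Grafana's POST /api/datasources endpoint."""
    return {
        "name": "Prometheus",
        "type": "prometheus",   # Grafana's data-source type identifier
        "url": prometheus_url,
        "access": "proxy",      # Grafana proxies queries to Prometheus
        "isDefault": True,
    }


def add_prometheus_datasource(grafana_url="http://localhost:3000"):
    """POST the data source to a running Grafana instance."""
    import requests  # third-party; kept local so the payload helper has no deps
    resp = requests.post(
        grafana_url + "/api/datasources",
        json=build_datasource_payload(),
        auth=("admin", "admin"),  # Grafana's factory-default credentials
    )
    resp.raise_for_status()
    return resp.json()
```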
Test PASSed.

Test FAILed.

Issue #352 is failing the build again:

Test PASSed.
dcrankshaw
left a comment
Very cool. I like the grafana example.
```python
    """
    REGISTRY.register(MetricCollector(child_conn))
    collector = MetricCollector(child_conn)
    start_http_server(1390)
```
Does this port need to be open in the Docker container?
Nope. This port does not need to be exposed to the user. Prometheus can detect and access it since the Prometheus container and the model container are effectively in the same "pod". This might need to change for Kubernetes, though.
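To illustrate what start_http_server(1390) provides, here is a stdlib-only sketch of a metrics endpoint serving the Prometheus text exposition format. The port and the prediction_count metric name are illustrative, and the real container uses prometheus_client rather than a hand-rolled server:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PREDICTION_COUNT = 0  # incremented by the model container on each prediction


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus text exposition format: one "metric_name value" per line.
        body = "prediction_count {}\n".format(PREDICTION_COUNT).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging


def serve_metrics(port=1390):
    """Serve /metrics in a background thread; Prometheus scrapes this port."""
    server = HTTPServer(("", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```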
```python
    print("Stopping Clipper...")
    clipper_conn = ClipperConnection(DockerContainerManager())
    clipper_conn.stop_all()
    sys.exit(0)
```
```python
    return res


def check_three_node_healthy(res, node_num):
```
What is this testing for?
It was a helper method to test how many container metrics Prometheus is collecting. This is just bad naming; I should change it to parse_res_and_assert_node and write a docstring for it.
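A possible shape for the renamed helper (hypothetical sketch; the field names assume the standard Prometheus HTTP API response layout, where each entry in data.result corresponds to one scraped container):

```python
def parse_res_and_assert_node(res, node_num):
    """Parse a decoded Prometheus query response and assert that metrics
    from exactly `node_num` model containers were collected."""
    results = res["data"]["result"]
    assert len(results) == node_num, (
        "expected metrics from {} containers, got {}".format(
            node_num, len(results)))
    return results
```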
```yaml
@@ -0,0 +1,49 @@
# This Configuration files details the all metrics Clipper is collecting.
```
nit: "This configuration file details all the metrics Clipper collects."
```yaml
end_to_end_latency_us:
  type: Histogram
  description: The time in ms takes from receive RPC call to send off prediction.
```
Is this in milliseconds (ms) or microseconds (us)? If it's in microseconds, change the description. If it's milliseconds, change the name to end_to_end_latency_ms.
Changed it to microseconds (us)
```yaml
parse_time_us:
  type: Histogram
  description: The time in ms takes to parse RPC call.
  bucket: [0, 100, 5]
```
Parsing is also very fast. Change to [0.2, 10, 0.2]
```yaml
handle_time_us:
  type: Histogram
  description: The time in ms takes to make prediction.
  bucket: [0, 100, 5]
```
Handling could take longer than 100 ms. Update to [0, 500, 5]
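For context on these bucket suggestions, the values read as [start, end, step]. Assuming Clipper expands that spec into explicit histogram bucket boundaries (an assumption about the implementation, shown for illustration), the expansion would look like:

```python
def expand_buckets(start, end, step):
    """Expand a [start, end, step] bucket spec from metrics_config.yaml
    into an explicit list of histogram bucket upper bounds."""
    # Generate by index rather than accumulating `step`, so floating-point
    # error does not drop the final boundary (e.g. with step 0.2).
    n = int(round((end - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]
```

For example, the suggested [0, 500, 5] spec yields 101 boundaries from 0 to 500.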
```python
if __name__ == '__main__':
    signal.signal(signal.SIGINT, signal_handler)

    print("(1/3) Initiating Grafana")
```
s/Initiating/Initializing
```python
    client = docker.from_env()
    container = client.containers.run(
        "grafana/grafana", ports={'3000/tcp': 3000}, detach=True)
    print("(2/3) Grafana Initiated ")
```
s/Initiated/Initialized
@simon-mo This is basically ready. Just had a couple of small comments.
- use 1000.0 instead of 1000 in RPC arithmetic
- change docs
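A small illustration of the 1000.0-vs-1000 change: with integer operands, Python 2-style division truncates, so latency conversions silently lose sub-millisecond precision. The helper name below is illustrative, not Clipper's API:

```python
def us_to_ms(latency_us):
    """Convert microseconds to milliseconds without integer truncation."""
    # Dividing by the float 1000.0 forces float division:
    # in Python 2, 1500 / 1000 == 1, while 1500 / 1000.0 == 1.5.
    return latency_us / 1000.0
```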
Test PASSed.

Test FAILed.

jenkins test this please

Test FAILed.

jenkins test this please

Test FAILed.

jenkins test this please

Test FAILed.
Add a block of code to make sure the config file is found.
Test FAILed.

jenkins test this please

Test FAILed.
Last try for today... The new failure does not seem related to the "conda install tensorflow" issue.
Test FAILed.

Similar issue here for reference: tensorflow/tensorflow#8096 (comment)

Test PASSed.
As mentioned in #339.
This PR adds the following features to the monitoring MVP:
- monitoring/metrics_config.yaml: with this configuration file, we load up the model container with the corresponding metrics just once, so the histogram and summary observations will be much more accurate.
- New metrics: prediction count, end-to-end latency, and {receive | parse | handle} latency. The first is in Counter format and the rest are in Histogram format.
- In example/monitoring, the user can launch two scripts, query.py and init_grafana.py, to view the Clipper Dashboard like this: