Skip to content

Metrics and Monitoring V0.0.2: Model Container, Latencies, and Example#357

Merged
dcrankshaw merged 31 commits intoucbrise:developfrom
simon-mo:metrics
Jan 24, 2018
Merged

Metrics and Monitoring V0.0.2: Model Container, Latencies, and Example#357
dcrankshaw merged 31 commits intoucbrise:developfrom
simon-mo:metrics

Conversation

@simon-mo
Copy link
Copy Markdown
Contributor

As mentioned in #339.

This PR adds the following features to monitoring MVP:

  1. Create a configuration for metrics we are actively monitoring in monitoring/metrics_config.yaml. With this configuration file, we load up model container with corresponding metrics just for one time. Thus the histogram and summary observation will much more accurate.
  2. Monitoring 5 metrics in model container now. In particular, prediction count, end-to-end latency, {receive | parse | handle} latency. The first in Counter format and the rest in Histogram format.
  3. Create an example for monitoring in example/monitoring, user can launch two scripts query.py and init_grafana to view Clipper Dashboard like this:

screen shot 2018-01-18 at 1 10 47 am

This commit adds an example in the example folder. It helps
the user to visualize clipper metric.

The init_grafana.py script launches a grafana/grafana docker
container. It adds prometheus as a data source. I attempted
to add the dashboard via Grafana API but failed after trying
it out for hours. This seems like a persistent issue for at
least two years: grafana/grafana#2816

<Simon Mo>
This commit adds an example in the example folder. It helps
the user to visualize clipper metric.

The init_grafana.py script launches a grafana/grafana docker
container. It adds prometheus as a data source. I attempted
to add the dashboard via Grafana API but failed after trying
it out for hours. This seems like a persistent issue for at
least two years: grafana/grafana#2816

<Simon Mo>
@simon-mo simon-mo requested a review from dcrankshaw January 18, 2018 10:21
@simon-mo simon-mo self-assigned this Jan 18, 2018
@simon-mo simon-mo added this to the 0.3.0 Release milestone Jan 18, 2018
@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/859/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/861/
Test FAILed.

@simon-mo
Copy link
Copy Markdown
Contributor Author

Issue #352 failing the build again:

APIError: 500 Server Error: Internal Server Error 
("driver failed programming external connectivity on endpoint query_frontend-89208 
(5a4d3ec0ecef36ece07bce7b5ff390e794c9ea483861a6da618bffa1252607cb): 
Error starting userland proxy: listen tcp 0.0.0.0:38782: bind: address already in use")

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/862/
Test PASSed.

Copy link
Copy Markdown
Contributor

@dcrankshaw dcrankshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool. I like the grafana example.

Comment thread containers/python/rpc.py
"""
REGISTRY.register(MetricCollector(child_conn))
collector = MetricCollector(child_conn)
start_http_server(1390)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this port need to be open in the Docker container?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. This port does not need to be exposed to the user. Prometheus can detect and access this port since the Prometheus container and model container are technically in the same "pod". This might need to change to k8s though.

print("Stopping Clipper...")
clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.stop_all()
sys.exit(0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great!

Comment thread integration-tests/clipper_metric.py Outdated
return res


def check_three_node_healthy(res, node_num):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this testing for?

Copy link
Copy Markdown
Contributor Author

@simon-mo simon-mo Jan 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a helper method to test how many container metric is prometheus collecting. This is just bad naming. I should change it to parse_res_and_assert_node and write a docstring for it.

Comment thread monitoring/metrics_config.yaml Outdated
@@ -0,0 +1,49 @@
# This Configuration files details the all metrics Clipper is collecting.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "This configuration file details all the metrics Clipper collects."

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


end_to_end_latency_us:
type: Histogram
description: The time in ms takes from receive RPC call to send off prediction.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this in milliseconds (ms) or microseconds (us)? If it's in microseconds, change the description. If it's milliseconds, change the name to end_to_end_latency_ms.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to microseconds (us)

Comment thread monitoring/metrics_config.yaml Outdated
parse_time_us:
type: Histogram
description: The time in ms takes to parse RPC call.
bucket: [0, 100, 5]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing is also very fast. Change to [0.2, 10, 0.2]


handle_time_us:
type: Histogram
description: The time in ms takes to make prediction.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ms vs us

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to us

Comment thread monitoring/metrics_config.yaml Outdated
handle_time_us:
type: Histogram
description: The time in ms takes to make prediction.
bucket: [0, 100, 5]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling could take longer than 100 ms. Update to [0, 500, 5]

Comment thread examples/monitoring/init_grafana.py Outdated
if __name__ == '__main__':
signal.signal(signal.SIGINT, signal_handler)

print("(1/3) Initiating Grafana")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Initiating/Initializing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment thread examples/monitoring/init_grafana.py Outdated
client = docker.from_env()
container = client.containers.run(
"grafana/grafana", ports={'3000/tcp': 3000}, detach=True)
print("(2/3) Grafana Initiated ")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Initiated/Initialized

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@dcrankshaw
Copy link
Copy Markdown
Contributor

@simon-mo This basically ready. Just had a couple small comments.

- use 1000.0 instead 1000 rpc arithmatic.
- change docs
@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/889/
Test PASSed.

Copy link
Copy Markdown
Contributor

@dcrankshaw dcrankshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/892/
Test FAILed.

@dcrankshaw
Copy link
Copy Markdown
Contributor

jenkins test this please

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/894/
Test FAILed.

@dcrankshaw
Copy link
Copy Markdown
Contributor

jenkins test this please

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/898/
Test FAILed.

@dcrankshaw
Copy link
Copy Markdown
Contributor

jenkins test this please

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/899/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/900/
Test FAILed.

@simon-mo
Copy link
Copy Markdown
Contributor Author

jenkins test this please

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/901/
Test FAILed.

@simon-mo
Copy link
Copy Markdown
Contributor Author

Last try of today... The new update does not seem relevant to the "conda install tensorflow" issue.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/902/
Test FAILed.

@simon-mo
Copy link
Copy Markdown
Contributor Author

Similar issue here for reference: tensorflow/tensorflow#8096 (comment)

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/904/
Test PASSed.

@dcrankshaw dcrankshaw merged commit ffef540 into ucbrise:develop Jan 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants