Metrics and Monitoring V0.0.2: Model Container, Latencies, and Example by simon-mo · Pull Request #357 · ucbrise/clipper

simon-mo · 2018-01-18T10:21:10Z

As mentioned in #339.

This PR adds the following features to monitoring MVP:

Create a configuration for the metrics we actively monitor in monitoring/metrics_config.yaml. With this configuration file, the model container loads the corresponding metrics only once, so histogram and summary observations will be much more accurate.
The model container now monitors 5 metrics: prediction count, end-to-end latency, and {receive | parse | handle} latency. The first is a Counter and the rest are Histograms.
Create an example for monitoring in example/monitoring. Users can launch two scripts, query.py and init_grafana.py, to view the Clipper dashboard.
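
As a rough illustration of the "load the metrics once from a config" idea in the list above, here is a minimal standard-library sketch. It is not the code in this PR: the `Counter` and `Histogram` classes are simplified stand-ins for the Prometheus client types, `build_registry` and `EXAMPLE_CONFIG` are hypothetical names, and `pred_total` is a made-up metric name (only `end_to_end_latency_us` appears in the reviewed config).

```python
class Counter:
    """Monotonically increasing count (simplified stand-in, not prometheus_client)."""

    def __init__(self, description):
        self.description = description
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


class Histogram:
    """Tracks only the observation count and sum; bucket handling is omitted here."""

    def __init__(self, description):
        self.description = description
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        self.count += 1
        self.total += value


METRIC_TYPES = {"Counter": Counter, "Histogram": Histogram}

# Hypothetical stand-in for the parsed YAML config.
EXAMPLE_CONFIG = {
    "pred_total": {"type": "Counter", "description": "Predictions served."},
    "end_to_end_latency_us": {
        "type": "Histogram",
        "description": "Time from receiving the RPC call to sending the prediction.",
    },
}


def build_registry(config):
    """Create every configured metric exactly once, keyed by metric name."""
    registry = {}
    for name, spec in config.items():
        metric_cls = METRIC_TYPES.get(spec["type"])
        if metric_cls is None:
            raise ValueError("unsupported metric type for %s: %s" % (name, spec["type"]))
        registry[name] = metric_cls(spec["description"])
    return registry


# Built once at startup; later observations accumulate in the same objects.
registry = build_registry(EXAMPLE_CONFIG)
registry["pred_total"].inc()
registry["end_to_end_latency_us"].observe(120.0)
registry["end_to_end_latency_us"].observe(80.0)
```

Because the registry is created a single time, repeated observations update the same metric objects instead of re-registering them on every report.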

This commit adds an example in the example folder that helps the user visualize Clipper metrics. The init_grafana.py script launches a grafana/grafana Docker container and adds Prometheus as a data source. I attempted to add the dashboard via the Grafana API but failed after trying for hours. This seems to have been a persistent issue for at least two years: grafana/grafana#2816 <Simon Mo>
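
For readers unfamiliar with the "add Prometheus as a data source" step, the sketch below builds (but does not send) a request to Grafana's `POST /api/datasources` endpoint using only the standard library. It is an assumption about how such a step could look, not the contents of init_grafana.py: the helper name `build_datasource_request`, the data source name, the Prometheus URL on port 9090, and the default `admin`/`admin` credentials are all illustrative.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, prometheus_url,
                             user="admin", password="admin"):
    """Build a POST request that registers Prometheus as a Grafana data source."""
    payload = {
        "name": "Clipper Metrics",  # hypothetical display name
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend fetches from Prometheus
        "isDefault": True,
    }
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


# Port 3000 matches the Grafana port mapping in this PR; 9090 is an assumption.
req = build_datasource_request("http://localhost:3000", "http://localhost:9090")
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```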

AmplabJenkins · 2018-01-18T11:38:06Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/859/
Test PASSed.

AmplabJenkins · 2018-01-19T07:17:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/861/
Test FAILed.

simon-mo · 2018-01-19T07:23:30Z

Issue #352 is failing the build again:

APIError: 500 Server Error: Internal Server Error 
("driver failed programming external connectivity on endpoint query_frontend-89208 
(5a4d3ec0ecef36ece07bce7b5ff390e794c9ea483861a6da618bffa1252607cb): 
Error starting userland proxy: listen tcp 0.0.0.0:38782: bind: address already in use")

AmplabJenkins · 2018-01-20T09:07:04Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/862/
Test PASSed.

dcrankshaw

Very cool. I like the grafana example.

dcrankshaw · 2018-01-19T19:01:09Z

    """
-    REGISTRY.register(MetricCollector(child_conn))
+    collector = MetricCollector(child_conn)
    start_http_server(1390)


Does this port need to be open in the Docker container?

Nope. This port does not need to be exposed to the user. Prometheus can detect and access this port since the Prometheus container and the model container are technically in the same "pod". This might need to change for k8s though.

dcrankshaw · 2018-01-19T19:09:31Z

+    print("Stopping Clipper...")
+    clipper_conn = ClipperConnection(DockerContainerManager())
+    clipper_conn.stop_all()
+    sys.exit(0)


This is great!

dcrankshaw · 2018-01-19T19:25:17Z

+    return res
+
+
+def check_three_node_healthy(res, node_num):


What is this testing for?

It is a helper method that tests how many container metrics Prometheus is collecting. This is just bad naming. I should rename it to parse_res_and_assert_node and write a docstring for it.
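
A possible shape for the renamed helper, with a docstring, is sketched below. This is an assumption rather than the code that was merged: it supposes `res` is the JSON body (text or already-parsed dict) of a Prometheus instant-query response, which has the form `{"status": ..., "data": {"result": [...]}}`. The metric name and instance labels in `SAMPLE_RESPONSE` are made up.

```python
import json


def parse_res_and_assert_node(res, node_num):
    """Parse a Prometheus query response and assert it reports `node_num` series.

    `res` may be the raw JSON text or an already-parsed dict. Returns the list
    of result entries so that callers can run further checks on them.
    """
    body = json.loads(res) if isinstance(res, str) else res
    assert body["status"] == "success", "query failed: %r" % (body,)
    results = body["data"]["result"]
    assert len(results) == node_num, (
        "expected %d series, got %d" % (node_num, len(results)))
    return results


# Hypothetical response in which two model containers report a metric.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "container-a:1390"}, "value": [0, "12"]},
            {"metric": {"instance": "container-b:1390"}, "value": [0, "7"]},
        ],
    },
})

series = parse_res_and_assert_node(SAMPLE_RESPONSE, 2)
```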

dcrankshaw · 2018-01-21T17:43:55Z

@@ -0,0 +1,49 @@
+# This Configuration files details the all metrics Clipper is collecting.


nit: "This configuration file details all the metrics Clipper collects."

dcrankshaw · 2018-01-21T17:45:05Z

+
+  end_to_end_latency_us:
+    type: Histogram
+    description: The time in ms takes from receive RPC call to send off prediction.


Is this in milliseconds (ms) or microseconds (us)? If it's in microseconds, change the description. If it's milliseconds, change the name to end_to_end_latency_ms.

Changed it to microseconds (us)

dcrankshaw · 2018-01-21T17:48:08Z

+  parse_time_us:
+    type: Histogram
+    description: The time in ms takes to parse RPC call.
+    bucket: [0, 100, 5]


Parsing is also very fast. Change to [0.2, 10, 0.2]

dcrankshaw · 2018-01-21T17:48:37Z

+
+  handle_time_us:
+    type: Histogram
+    description: The time in ms takes to make prediction.


changed to us

dcrankshaw · 2018-01-21T17:49:27Z

+  handle_time_us:
+    type: Histogram
+    description: The time in ms takes to make prediction.
+    bucket: [0, 100, 5]


Handling could take longer than 100 ms. Update to [0, 500, 5]
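
To make the bucket discussion concrete, the sketch below assumes that a `bucket` entry means `[start, stop, step]` (the PR does not spell this out) and shows how a too-narrow range pushes observations past the top bucket. `expand_buckets`, `overflow_fraction`, and the sample values are hypothetical.

```python
def expand_buckets(spec):
    """Expand an assumed [start, stop, step] spec into a list of bucket bounds."""
    start, stop, step = spec
    if step <= 0 or stop < start:
        raise ValueError("invalid bucket spec: %r" % (spec,))
    # Compute each bound from its index to avoid accumulating float error.
    count = int(round((stop - start) / float(step)))
    return [round(start + i * step, 6) for i in range(count + 1)]


def overflow_fraction(samples, spec):
    """Fraction of samples that fall above the largest bucket bound."""
    top = expand_buckets(spec)[-1]
    return sum(1 for s in samples if s > top) / float(len(samples))


# Made-up handle times, in the same unit as the bucket bounds.
handle_times = [40, 90, 150, 320, 480]

narrow = overflow_fraction(handle_times, [0, 100, 5])  # 3 of 5 samples overflow
wide = overflow_fraction(handle_times, [0, 500, 5])    # no samples overflow
```

With the narrower range, most of these samples land in the overflow bucket and their distribution is lost, which is the motivation for widening the range for slower operations and tightening it for fast ones such as parsing.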

dcrankshaw · 2018-01-21T20:04:42Z

+if __name__ == '__main__':
+    signal.signal(signal.SIGINT, signal_handler)
+
+    print("(1/3) Initiating Grafana")


s/Initiating/Initializing

dcrankshaw · 2018-01-21T20:05:00Z

+    client = docker.from_env()
+    container = client.containers.run(
+        "grafana/grafana", ports={'3000/tcp': 3000}, detach=True)
+    print("(2/3) Grafana Initiated ")


s/Initiated/Initialized

dcrankshaw · 2018-01-23T01:23:52Z

@simon-mo This is basically ready. Just had a couple of small comments.

- use 1000.0 instead of 1000 in rpc arithmetic. - change docs
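
The commit above does not show the arithmetic in question, so the following is only a guess at the motivation. Assuming the code divides a duration by 1000 to convert units, and given that the branch is formatted under Python 2.7.12 (where `/` between two integers floors the result), dividing by `1000.0` keeps the fractional part. The function names are hypothetical; floor division (`//`) is used to reproduce the Python 2 integer behavior on any interpreter.

```python
def us_to_ms_truncating(duration_us):
    """Mimic Python 2's int / int: the fractional part is discarded."""
    return duration_us // 1000


def us_to_ms(duration_us):
    """Divide by a float so the result keeps sub-millisecond precision."""
    return duration_us / 1000.0


truncated = us_to_ms_truncating(1500)  # fractional part is lost
precise = us_to_ms(1500)               # fractional part is kept
```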

AmplabJenkins · 2018-01-23T07:30:08Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/889/
Test PASSed.

dcrankshaw

LGTM

AmplabJenkins · 2018-01-23T18:06:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/892/
Test FAILed.

dcrankshaw · 2018-01-23T18:08:02Z

jenkins test this please

AmplabJenkins · 2018-01-23T18:21:46Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/894/
Test FAILed.

dcrankshaw · 2018-01-23T20:37:28Z

jenkins test this please

AmplabJenkins · 2018-01-23T22:39:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/898/
Test FAILed.

dcrankshaw · 2018-01-23T23:17:35Z

jenkins test this please

AmplabJenkins · 2018-01-24T00:23:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/899/
Test FAILed.

Add a block of code to make sure the config file is found.

AmplabJenkins · 2018-01-24T01:12:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/900/
Test FAILed.

simon-mo · 2018-01-24T02:03:28Z

jenkins test this please

AmplabJenkins · 2018-01-24T02:11:00Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/901/
Test FAILed.

simon-mo · 2018-01-24T02:21:16Z

Last try for today... The new update does not seem related to the "conda install tensorflow" issue.

AmplabJenkins · 2018-01-24T02:25:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/902/
Test FAILed.

simon-mo · 2018-01-24T03:16:49Z

Similar issue here for reference: tensorflow/tensorflow#8096 (comment)

AmplabJenkins · 2018-01-24T07:48:08Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/904/
Test PASSed.

simon-mo added 18 commits January 14, 2018 00:50

Allow port specification for Prometheus

d6cb327

Fix dict unpacking lint error

e6c8263

Format Code

797f319

Update yapf

299491b

Update yapf and reformat code

7b54e6e

Check yapf version in shell script

7a96dcf

Change echo to cat

65c7f80

Add python version check; almost ready

f3d5eac

Format code with python version 2.7.12

ab9cc98

Allow port specification for Prometheus

db908aa

Update Metric Config; Finish Implement Model Container

d272152

Add Integration Test

491467a

Fix Integration Tests

aa50423

Format Code and Rebase from Develop

7338f6e

Merge branch 'metrics' of https://github.com/simon-mo/clipper into me…

66cc56a

…trics

Format Code Again with python2.7.12

c514378

simon-mo requested a review from dcrankshaw January 18, 2018 10:21

simon-mo self-assigned this Jan 18, 2018

simon-mo added the status: needs review label Jan 18, 2018

simon-mo added this to the 0.3.0 Release milestone Jan 18, 2018

Merge branch 'develop' into metrics

d05b63f

simon-mo added 2 commits January 20, 2018 00:08

Update monitoring readme; trigger Jenkins

3806bfe

Merge branch 'metrics' of https://github.com/simon-mo/clipper into me…

7f29ab2

…trics

dcrankshaw requested changes Jan 21, 2018

dcrankshaw removed the status: needs review label Jan 23, 2018

Address comments

e395b1f

- use 1000.0 instead of 1000 in rpc arithmetic. - change docs

simon-mo added status: fixed and removed status: needs revision labels Jan 23, 2018

Merge branch 'develop' into metrics

9b1cd28

dcrankshaw approved these changes Jan 23, 2018

dcrankshaw added status: accepted and removed status: fixed labels Jan 23, 2018

simon-mo added 2 commits January 23, 2018 16:54

Address RPC Test issue

fe9e8b8

Add a block of code to make sure the config file is found.

Merge branch 'metrics' of https://github.com/simon-mo/clipper into me…

458cdc3

…trics

Merge branch 'develop' into metrics

1cecfd6

simon-mo added 2 commits January 23, 2018 22:37

Update tensowflowcifar

34bf606

Merge branch 'metrics' of https://github.com/simon-mo/clipper into me…

6d56d97

…trics

dcrankshaw merged commit ffef540 into ucbrise:develop Jan 24, 2018
