PyTorch + ONNX + Caffe2 Model deployer #362

Merged
dcrankshaw merged 38 commits into ucbrise:develop from haofanwang:develop on Feb 23, 2018

Conversation

@haofanwang (Member)

@Corey-Zumar #340 Please take a look and test it. It seems to work fine in my local environment.

@AmplabJenkins

Can one of the admins verify this patch?

@dcrankshaw (Contributor)

dcrankshaw commented Jan 19, 2018 via email

@dcrankshaw (Contributor)

jenkins ok to test

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/865/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/870/

@Corey-Zumar (Contributor) left a comment

I think this is close. The Caffe2 container is currently crashing, and the unit test does not pass. Here are the logs from the container:

Attempting to run Caffe2 container without installing dependencies
Contents of /model
environment.yml
func.pkl
modules
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: libcuda.so.1: cannot open shared object file: No such file or directory
CRITICAL:root:Cannot load caffe2.python. Error: /opt/conda/lib/python2.7/site-packages/caffe2/python/../../../../libcaffe2.so: undefined symbol: _ZN7leveldb2DB4OpenERKNS_7OptionsERKSsPPS0_

&& conda install -c anaconda cloudpickle=0.5.2

RUN conda install -c ezyang onnx \
&& RUN conda install -c conda-forge protobuf==3.4.0 \
Contributor:

Remove the RUN directives after &&. Also, align &&'s with the ones on lines 7-11
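Applied to the quoted lines, the fix would look roughly like this (a sketch keeping the channels as quoted; only the first command of the chained layer carries RUN, and the continuations are joined with aligned &&'s):

```dockerfile
# Single RUN layer; subsequent commands are chained with '&&',
# not with additional RUN directives.
RUN conda install -c anaconda cloudpickle=0.5.2 \
    && conda install -c ezyang onnx \
    && conda install -c conda-forge protobuf==3.4.0
```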

registry=None,
base_image="clipper/caffe2-container:{}".format(__version__),
num_replicas=1):
"""Registers an app and deploys the provided predict function with Caffe2 model as
Contributor:

This function deploys the prediction function with a PyTorch model. It serializes the PyTorch model in Onnx format and creates a container that loads it as a Caffe2 model. Let's update the documentation here and for deploy_caffe2_model with this information.


import numpy as np

from clipper_admin.deployers import cloudpickle
Contributor:

this should just be import cloudpickle

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/883/

@haofanwang (Member Author)

I suspect a mismatch in a library such as protobuf or opencv, but there is no such error in my local environment. Will update the PR later.

@dcrankshaw (Contributor)

What's the status on this?

@haofanwang (Member Author)

Ok to test. @dcrankshaw

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/929/

@dcrankshaw (Contributor) left a comment

This looks great! Just some small naming and API changes requested.

create_image tf_cifar_container TensorFlowCifarDockerfile $public
create_image tf-container TensorFlowDockerfile $public
create_image pytorch-container PyTorchContainerDockerfile $public
create_image caffe2-container Caffe2Dockerfile $public
Contributor:

Let's call this caffe2-onnx-container and rename the Dockerfile to Caffe2OnnxDockerfile

@@ -0,0 +1,159 @@
from __future__ import print_function, with_statement, absolute_import
Contributor:

This file shouldn't be called the Caffe2 deployer, because it doesn't let people deploy Caffe2 models. Let's rename this to onnx.py and we'll start to centralize all our ONNX-related model deployer functionality in here.

logger = logging.getLogger(__name__)


def create_endpoint(
Contributor:

Rename this to create_pytorch_endpoint, add an argument called onnx_backend="caffe2", then change the base_image default value to None (base_image=None).

Some context: We might want to support multiple ONNX backends soon. The cool thing about ONNX is that it decouples the choice of backend from the source of the model. So within the onnx model deployer we can support multiple training frameworks and inference frameworks. E.g. along with our create_pytorch_endpoint we might have a create_caffe2_endpoint and a create_mxnet_endpoint, any of which could deploy their model to a Caffe2 backend or an mxnet backend or a pytorch backend.
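The decoupling described above could be sketched like this (a hypothetical illustration; the function name, dict, and image tags are assumptions for this sketch, not Clipper's actual API):

```python
# Hypothetical sketch: one exported ONNX model, multiple interchangeable
# backends. Exporters produce ONNX; backends consume it, independent of
# the training framework that produced the model.
SUPPORTED_ONNX_BACKENDS = {
    "caffe2": "clipper/caffe2-onnx-container:{version}",
    # Future backends (mxnet, pytorch, ...) would only need a new entry here.
}


def resolve_container_image(onnx_backend, version):
    """Map an ONNX backend name to a container image tag, or raise."""
    template = SUPPORTED_ONNX_BACKENDS.get(onnx_backend)
    if template is None:
        raise ValueError(
            "{backend} ONNX backend is not currently supported.".format(
                backend=onnx_backend))
    return template.format(version=version)
```

Each create_*_endpoint variant would then export its framework's model to ONNX and call a shared deployment path keyed only on onnx_backend.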

clipper_conn.link_model_to_app(name, name)


def deploy_caffe2_model(
Contributor:

Rename this to deploy_pytorch_model, add an argument onnx_backend="caffe2", and change the base_image default value to be None (base_image=None).

Modify the method documentation to match the new functionality.

Then inside the function, add the following code:

if base_image is None:
    if onnx_backend == "caffe2":
        base_image = "clipper/caffe2-container:{}".format(__version__)
    else:
        logger.error("{backend} ONNX backend is not currently supported.".format(backend=onnx_backend))


try:
torch_out = torch.onnx._export(
pytorch_model, inputs, "pytorch_model.onnx", export_params=True)
Contributor:

Save this as just "model.onnx"; the Caffe2 backend can load any ONNX model, not just ones saved from PyTorch.

&& apt-get install -yqq -t jessie-backports openjdk-8-jdk \
&& conda install -y --file /lib/python_container_conda_deps.txt \
&& conda install -c anaconda cloudpickle=0.5.2 \
&& conda install -c ezyang onnx\
Contributor:

This should be && conda install -c conda-forge onnx \

&& conda install -c anaconda cloudpickle=0.5.2 \
&& conda install -c ezyang onnx\
&& conda install -c conda-forge protobuf==3.4.0 \
&& conda install -c ezyang/label/devgpu caffe2 \
Contributor:

This should be && conda install -c caffe2 caffe2 \

Member Author:

I tried this command, but it seems to conflict with other packages in my local environment, so I installed a different build from Anaconda Cloud. I will change them as you suggest and see whether the test passes. @dcrankshaw

&& conda install -c ezyang onnx\
&& conda install -c conda-forge protobuf==3.4.0 \
&& conda install -c ezyang/label/devgpu caffe2 \
&& conda install -c ezyang onnx-caffe2 \
Contributor:

This should be && pip install onnx-caffe2 \

&& conda install -c ezyang/label/devgpu caffe2 \
&& conda install -c ezyang onnx-caffe2 \
&& conda install -c jjh_pytorch pytorch \
&& conda install -c jjh_pytorch torchvision
Contributor:

Why are you installing pytorch and torchvision in this container?

@haofanwang (Member Author) commented Feb 7, 2018:

torch and torchvision are imported in the test file. Should I delete these lines? @dcrankshaw

Contributor:

Umm, actually the Python function serialization code will find PyTorch as a dependency and install it anyway, so you can leave these lines in. But install them as

&& conda install -c pytorch pytorch torchvision \

@@ -0,0 +1,222 @@
from __future__ import absolute_import, print_function
Contributor:

Rename this to deploy_pytorch_to_caffe2_with_onnx.py

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/937/

@haofanwang (Member Author)

The logs say that there are some Python format violations; could you check it for me? @Corey-Zumar

@dcrankshaw (Contributor) left a comment

This is looking good! Almost there.

for a model can be changed at any time with
:py:meth:`clipper.ClipperConnection.set_num_replicas`.
onnx_backend : str, optional
The provided onnx backend.
Contributor:

Add a comment that caffe2 is the only currently supported ONNX backend.

deploy_pytorch_model(clipper_conn, name, version, input_type, inputs, func,
pytorch_model, base_image, labels, registry,
num_replicas)
num_replicas,onnx_backend)
Contributor:

You need a space here after the comma

num_replicas, onnx_backend)

for a model can be changed at any time with
:py:meth:`clipper.ClipperConnection.set_num_replicas`.
onnx_backend : str, optional
The provided onnx backend.
Contributor:

Add the same comment as above about caffe2 being the only currently supported ONNX backend

if onnx_backend == "caffe2":
base_image = "clipper/caffe2-onnx-container:{}".format(__version__)
else:
logger.error("{backend} ONNX backend is not currently supported.".format(backend=onnx_backend))
Contributor:

👍

except Exception as e:
logger.warn("Error serializing torch model: %s" % e)

logger.info("Torch model has be serialized to ONNX foamat")
Contributor:

Oh, just that you have a typo in the word "format". You spelled it "foamat", and I want you to spell it "format".

That comment was using vim syntax for find/replace ("substitute format for foamat")

link_model=False,
predict_fn=predict):
deploy_caffe2_model(clipper_conn, model_name, version, "integers", inputs,
deploy_pytorch_model(clipper_conn, model_name, version, "integers", inputs,
Contributor:

Specify the onnx_backend argument here


app_and_model_name = "easy-register-app-model"
create_endpoint(clipper_conn, app_and_model_name, "integers",
create_pytorch_endpoint(clipper_conn, app_and_model_name, "integers",
Contributor:

Specify the onnx_backend argument here

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1001/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1002/

@dcrankshaw (Contributor)

The test failed the format checker. Run ./bin/format_code.sh.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1009/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1017/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1027/

@haofanwang (Member Author)

haofanwang commented Feb 21, 2018

The logs are below.

18-02-21:01:29:23 ERROR [kubernetes_integration_test.py:145] Exception: HTTPConnectionPool(host='ec2-54-187-245-87.us-west-2.compute.amazonaws.com', port=31170): Max retries exceeded with url: /testapp0-app/predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f96f006c110>: Failed to establish a new connection: [Errno 111] Connection refused',))

Why does this connection error happen after merging? Could you take a look? @dcrankshaw @Corey-Zumar

@dcrankshaw (Contributor) left a comment

Some more cleanup changes

:py:meth:`clipper.ClipperConnection.set_num_replicas`.
onnx_backend : str, optional
The provided onnx backend.Caffe2 is the only currently supported ONNX backend.
"""
Contributor:

You're missing documentation for the batch_size argument
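A possible entry in the surrounding numpydoc style (the wording and the -1 default are suggestions for this sketch and should be checked against Clipper's actual behavior):

```python
def deploy_pytorch_model_doc_stub(batch_size=-1):
    """Stub carrying only the parameter entry under discussion.

    Parameters
    ----------
    batch_size : int, optional
        The user-defined query batch size for the model. Defaults to -1,
        which lets Clipper choose the batch size.
    """
    return batch_size
```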

:py:meth:`clipper.ClipperConnection.set_num_replicas`.
onnx_backend : str, optional
The provided onnx backend.Caffe2 is the only currently supported ONNX backend.
"""
Contributor:

You're missing documentation for the batch_size argument

# Deploy model
clipper_conn.build_and_deploy_model(name, version, input_type,
serialization_dir, base_image, labels,
registry, num_replicas, batch_size)
Contributor:

Move the call to clipper_conn.build_and_deploy_model() to inside the try statement
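The requested restructuring could be sketched like this (stub names; clipper_conn and the serialization step are placeholders for this sketch, not Clipper's real internals):

```python
import logging

logger = logging.getLogger(__name__)


def build_and_deploy_guarded(clipper_conn, name, version, input_type,
                             serialization_dir, base_image, labels,
                             registry, num_replicas, batch_size):
    """Serialize and deploy inside one try block, so a failure in
    build_and_deploy_model is caught and logged like serialization errors."""
    try:
        # ... model serialization would happen here ...
        clipper_conn.build_and_deploy_model(name, version, input_type,
                                            serialization_dir, base_image,
                                            labels, registry, num_replicas,
                                            batch_size)
        return True
    except Exception as e:
        logger.error("Error deploying model: %s", e)
        return False
```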

&& apt-get install -yqq -t jessie-backports openjdk-8-jdk \
&& conda install -y --file /lib/python_container_conda_deps.txt \
&& conda install -c anaconda cloudpickle=0.5.2 \
&& conda install -c ezyang onnx\
Contributor:

install onnx from the official channel: conda install -c conda-forge onnx

&& conda install -c anaconda cloudpickle=0.5.2 \
&& conda install -c ezyang onnx\
&& conda install -c conda-forge protobuf==3.4.0 \
&& conda install -c ezyang/label/devgpu caffe2 \
Contributor:

install caffe2 from the official source

&& conda install -c ezyang onnx\
&& conda install -c conda-forge protobuf==3.4.0 \
&& conda install -c ezyang/label/devgpu caffe2 \
&& conda install -c ezyang onnx-caffe2 \
Contributor:

install caffe2 onnx extensions from official source

&& conda install -c ezyang/label/devgpu caffe2 \
&& conda install -c ezyang onnx-caffe2 \
&& conda install -c jjh_pytorch pytorch \
&& conda install -c jjh_pytorch torchvision
Contributor:

install pytorch and torchvision from the official channel: conda install pytorch torchvision -c pytorch
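Collecting the channel corrections from this review, the install chain would read approximately as follows (a sketch; the apt line and package pins are carried over from the quoted Dockerfile):

```dockerfile
RUN apt-get install -yqq -t jessie-backports openjdk-8-jdk \
    && conda install -y --file /lib/python_container_conda_deps.txt \
    && conda install -c anaconda cloudpickle=0.5.2 \
    && conda install -c conda-forge onnx \
    && conda install -c conda-forge protobuf==3.4.0 \
    && conda install -c caffe2 caffe2 \
    && pip install onnx-caffe2 \
    # -y added so conda runs non-interactively during the docker build
    && conda install -y pytorch torchvision -c pytorch
```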

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1044/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1045/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1046/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1047/

@dcrankshaw (Contributor) left a comment

LGTM. Nice job!

@dcrankshaw dismissed Corey-Zumar’s stale review, February 23, 2018 20:46

Comments have been addressed

@dcrankshaw merged commit f8ce8ec into ucbrise:develop on Feb 23, 2018
gtfierro pushed a commit to gtfierro/clipper that referenced this pull request Feb 27, 2018
* update caffe2 deployer

* update caffe2 container

* update caffe2 container

* update Caffe2Dockerfile

* update deploy_caffe2_models.py

* Update build_docker_images.sh

* Format code

* Update caffe2 container entrypoint permissions

* Update Caffe2Dockerfile

* Update caffe2_container.py

* Update caffe2.py

* Update build_docker_images.sh

* Rename caffe2.py to onnx.py

* Update onnx.py

* Update and rename caffe2_container.py to caffe2_onnx_container.py

* Update and rename caffe2_container_entry.sh to caffe2_onnx_container_entry.sh

* Rename Caffe2Dockerfile to Caffe2OnnxDockerfile

* Rename deploy_caffe2_models.py to deploy_pytorch_to_caffe2_with_onnx.py

* Update caffe2_onnx_container_entry.sh

* Update Caffe2OnnxDockerfile

* Update deploy_pytorch_to_caffe2_with_onnx.py

* Update onnx.py

* Update onnx.py

* Update caffe2_onnx_container_entry.sh

* Update onnx.py

* Update onnx.py

* Update deploy_pytorch_to_caffe2_with_onnx.py

* Update onnx.py

* Update deploy_pytorch_to_caffe2_with_onnx.py

* Update onnx.py

* Update deploy_pytorch_to_caffe2_with_onnx.py

* Support PyTorch + ONNX + Caffe2 Model deployer

* Support PyTorch + ONNX + Caffe2 Model deployer

* Update onnx.py
4 participants