PyTorch + ONNX + Caffe2 Model deployer #362

Merged
dcrankshaw merged 38 commits into ucbrise:develop from haofanwang:develop on Feb 23, 2018

Conversation

@haofanwang (Member)

@Corey-Zumar #340 Please take a look and test it. It seems to work fine in my local environment.

@AmplabJenkins

Can one of the admins verify this patch?

@dcrankshaw (Contributor)

dcrankshaw commented Jan 19, 2018 via email

@dcrankshaw (Contributor)

jenkins ok to test

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/865/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/870/

@Corey-Zumar (Contributor) left a comment

I think this is close. The Caffe2 container is currently crashing, and the unit test does not pass. Here are the logs from the container:

Attempting to run Caffe2 container without installing dependencies
Contents of /model
environment.yml
func.pkl
modules
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: libcuda.so.1: cannot open shared object file: No such file or directory
CRITICAL:root:Cannot load caffe2.python. Error: /opt/conda/lib/python2.7/site-packages/caffe2/python/../../../../libcaffe2.so: undefined symbol: _ZN7leveldb2DB4OpenERKNS_7OptionsERKSsPPS0_

&& conda install -c anaconda cloudpickle=0.5.2

RUN conda install -c ezyang onnx \
&& RUN conda install -c conda-forge protobuf==3.4.0 \
Contributor:

Remove the RUN directives after &&. Also, align &&'s with the ones on lines 7-11
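Applied to the quoted lines, the fix would look roughly like this (a sketch keeping the channels as quoted; only the first command of the chained layer carries RUN, and the continuations are joined with aligned &&'s):

```dockerfile
# Single RUN layer; subsequent commands are chained with '&&',
# not with additional RUN directives.
RUN conda install -c anaconda cloudpickle=0.5.2 \
    && conda install -c ezyang onnx \
    && conda install -c conda-forge protobuf==3.4.0
```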

registry=None,
base_image="clipper/caffe2-container:{}".format(__version__),
num_replicas=1):
"""Registers an app and deploys the provided predict function with Caffe2 model as
Contributor:

This function deploys the prediction function with a PyTorch model. It serializes the PyTorch model in Onnx format and creates a container that loads it as a Caffe2 model. Let's update the documentation here and for deploy_caffe2_model with this information.


import numpy as np

from clipper_admin.deployers import cloudpickle
Contributor:

this should just be import cloudpickle

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/883/

@haofanwang (Member Author)

I suspect a mismatch in a library such as protobuf or opencv, but there is no such error in my local environment. Will update the PR later.

@dcrankshaw (Contributor)

What's the status on this?

@haofanwang (Member Author)

Ok to test. @dcrankshaw

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/929/

@dcrankshaw (Contributor) left a comment

This looks great! Just some small naming and API changes requested.

create_image tf_cifar_container TensorFlowCifarDockerfile $public
create_image tf-container TensorFlowDockerfile $public
create_image pytorch-container PyTorchContainerDockerfile $public
create_image caffe2-container Caffe2Dockerfile $public
Contributor:

Let's call this caffe2-onnx-container and rename the Dockerfile to Caffe2OnnxDockerfile

@@ -0,0 +1,159 @@
from __future__ import print_function, with_statement, absolute_import
Contributor:

This file shouldn't be called the Caffe2 deployer, because it doesn't let people deploy Caffe2 models. Let's rename this to onnx.py and we'll start to centralize all our ONNX-related model deployer functionality in here.

logger = logging.getLogger(__name__)


def create_endpoint(
Contributor:

Rename this to create_pytorch_endpoint, add an argument called onnx_backend="caffe2", then change the base_image default value to None (base_image=None).

Some context: We might want to support multiple ONNX backends soon. The cool thing about ONNX is that it decouples the choice of backend from the source of the model. So within the onnx model deployer we can support multiple training frameworks and inference frameworks. E.g. along with our create_pytorch_endpoint we might have a create_caffe2_endpoint and a create_mxnet_endpoint, any of which could deploy their model to a Caffe2 backend or an mxnet backend or a pytorch backend.
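The decoupling described above could be sketched like this (a hypothetical illustration; the function name, dict, and image tags are assumptions for this sketch, not Clipper's actual API):

```python
# Hypothetical sketch: one exported ONNX model, multiple interchangeable
# backends. Exporters produce ONNX; backends consume it, independent of
# the training framework that produced the model.
SUPPORTED_ONNX_BACKENDS = {
    "caffe2": "clipper/caffe2-onnx-container:{version}",
    # Future backends (mxnet, pytorch, ...) would only need a new entry here.
}


def resolve_container_image(onnx_backend, version):
    """Map an ONNX backend name to a container image tag, or raise."""
    template = SUPPORTED_ONNX_BACKENDS.get(onnx_backend)
    if template is None:
        raise ValueError(
            "{backend} ONNX backend is not currently supported.".format(
                backend=onnx_backend))
    return template.format(version=version)
```

Each create_*_endpoint variant would then export its framework's model to ONNX and call a shared deployment path keyed only on onnx_backend.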

clipper_conn.link_model_to_app(name, name)


def deploy_caffe2_model(
Contributor:

Rename this to deploy_pytorch_model, add an argument onnx_backend="caffe2", and change the base_image default value to be None (base_image=None).

Modify the method documentation to match the new functionality.

Then inside the function, add the following code:

if base_image is None:
    if onnx_backend == "caffe2":
        base_image = "clipper/caffe2-container:{}".format(__version__)
    else:
        logger.error("{backend} ONNX backend is not currently supported.".format(backend=onnx_backend))


try:
torch_out = torch.onnx._export(
pytorch_model, inputs, "pytorch_model.onnx", export_params=True)
Contributor:

Save this as just "model.onnx"; the Caffe2 backend can load any ONNX model, not just ones saved from PyTorch.

&& apt-get install -yqq -t jessie-backports openjdk-8-jdk \
&& conda install -y --file /lib/python_container_conda_deps.txt \
&& conda install -c anaconda cloudpickle=0.5.2 \
&& conda install -c ezyang onnx\
Contributor:

This should be && conda install -c conda-forge onnx \

&& conda install -c anaconda cloudpickle=0.5.2 \
&& conda install -c ezyang onnx\
&& conda install -c conda-forge protobuf==3.4.0 \
&& conda install -c ezyang/label/devgpu caffe2 \
Contributor:

This should be && conda install -c caffe2 caffe2 \

Member Author:

I tried this command, but it seems to conflict with other packages in my local environment, so I installed a different build from Anaconda Cloud. I will change them as you suggest and see whether the test passes. @dcrankshaw

&& conda install -c ezyang onnx\
&& conda install -c conda-forge protobuf==3.4.0 \
&& conda install -c ezyang/label/devgpu caffe2 \
&& conda install -c ezyang onnx-caffe2 \
Contributor:

This should be && pip install onnx-caffe2 \

&& conda install -c ezyang/label/devgpu caffe2 \
&& conda install -c ezyang onnx-caffe2 \
&& conda install -c jjh_pytorch pytorch \
&& conda install -c jjh_pytorch torchvision
Contributor:

Why are you installing pytorch and torchvision in this container?

@haofanwang (Member Author) commented Feb 7, 2018:

torch and torchvision are imported in the test file. Should I delete these lines? @dcrankshaw

Contributor:

Umm, actually the Python function serialization code will find PyTorch as a dependency and install it anyway, so you can leave these lines in. But install them as

&& conda install -c pytorch pytorch torchvision \

@@ -0,0 +1,222 @@
from __future__ import absolute_import, print_function
Contributor:

Rename this to deploy_pytorch_to_caffe2_with_onnx.py

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/937/

@haofanwang (Member Author)

The logs say that there are some Python format violations; could you check it for me? @Corey-Zumar

@dcrankshaw (Contributor) left a comment

This is looking good! Almost there.

for a model can be changed at any time with
:py:meth:`clipper.ClipperConnection.set_num_replicas`.
onnx_backend : str, optional
The provided onnx backend.
Contributor:

Add a comment that caffe2 is the only currently supported ONNX backend.

deploy_pytorch_model(clipper_conn, name, version, input_type, inputs, func,
pytorch_model, base_image, labels, registry,
num_replicas)
num_replicas,onnx_backend)
Contributor:

You need a space here after the comma

num_replicas, onnx_backend)

for a model can be changed at any time with
:py:meth:`clipper.ClipperConnection.set_num_replicas`.
onnx_backend : str, optional
The provided onnx backend.
Contributor:

Add the same comment as above about caffe2 being the only currently supported ONNX backend

if onnx_backend == "caffe2":
base_image = "clipper/caffe2-onnx-container:{}".format(__version__)
else:
logger.error("{backend} ONNX backend is not currently supported.".format(backend=onnx_backend))
Contributor:

👍

except Exception as e:
logger.warn("Error serializing torch model: %s" % e)

logger.info("Torch model has be serialized to ONNX foamat")
Contributor:

Oh, just that you have a typo in the word "format". You spelled it "foamat", and I want you to spell it "format".

That comment was using vim syntax for find/replace ("substitute format for foamat")

link_model=False,
predict_fn=predict):
deploy_caffe2_model(clipper_conn, model_name, version, "integers", inputs,
deploy_pytorch_model(clipper_conn, model_name, version, "integers", inputs,
Contributor:

Specify the onnx_backend argument here


app_and_model_name = "easy-register-app-model"
create_endpoint(clipper_conn, app_and_model_name, "integers",
create_pytorch_endpoint(clipper_conn, app_and_model_name, "integers",
Contributor:

Specify the onnx_backend argument here

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1001/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1002/

@dcrankshaw (Contributor)

The test failed the format checker. Run ./bin/format_code.sh.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1009/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1017/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1027/

@haofanwang (Member Author)

haofanwang commented Feb 21, 2018

The logs are below.

18-02-21:01:29:23 ERROR [kubernetes_integration_test.py:145] Exception: HTTPConnectionPool(host='ec2-54-187-245-87.us-west-2.compute.amazonaws.com', port=31170): Max retries exceeded with url: /testapp0-app/predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f96f006c110>: Failed to establish a new connection: [Errno 111] Connection refused',))

Why does this connection error happen after merging? Could you take a look? @dcrankshaw @Corey-Zumar

@dcrankshaw (Contributor) left a comment

Some more cleanup changes

:py:meth:`clipper.ClipperConnection.set_num_replicas`.
onnx_backend : str, optional
The provided onnx backend.Caffe2 is the only currently supported ONNX backend.
"""
Contributor:

You're missing documentation for the batch_size argument
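A possible entry in the surrounding numpydoc style (the wording and the -1 default are suggestions for this sketch and should be checked against Clipper's actual behavior):

```python
def deploy_pytorch_model_doc_stub(batch_size=-1):
    """Stub carrying only the parameter entry under discussion.

    Parameters
    ----------
    batch_size : int, optional
        The user-defined query batch size for the model. Defaults to -1,
        which lets Clipper choose the batch size.
    """
    return batch_size
```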

:py:meth:`clipper.ClipperConnection.set_num_replicas`.
onnx_backend : str, optional
The provided onnx backend.Caffe2 is the only currently supported ONNX backend.
"""
Contributor:

You're missing documentation for the batch_size argument

# Deploy model
clipper_conn.build_and_deploy_model(name, version, input_type,
serialization_dir, base_image, labels,
registry, num_replicas, batch_size)
Contributor:

Move the call to clipper_conn.build_and_deploy_model() to inside the try statement
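The requested restructuring could be sketched like this (stub names; clipper_conn and the serialization step are placeholders for this sketch, not Clipper's real internals):

```python
import logging

logger = logging.getLogger(__name__)


def build_and_deploy_guarded(clipper_conn, name, version, input_type,
                             serialization_dir, base_image, labels,
                             registry, num_replicas, batch_size):
    """Serialize and deploy inside one try block, so a failure in
    build_and_deploy_model is caught and logged like serialization errors."""
    try:
        # ... model serialization would happen here ...
        clipper_conn.build_and_deploy_model(name, version, input_type,
                                            serialization_dir, base_image,
                                            labels, registry, num_replicas,
                                            batch_size)
        return True
    except Exception as e:
        logger.error("Error deploying model: %s", e)
        return False
```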

&& apt-get install -yqq -t jessie-backports openjdk-8-jdk \
&& conda install -y --file /lib/python_container_conda_deps.txt \
&& conda install -c anaconda cloudpickle=0.5.2 \
&& conda install -c ezyang onnx\
Contributor:

install onnx from the official channel: conda install -c conda-forge onnx

&& conda install -c anaconda cloudpickle=0.5.2 \
&& conda install -c ezyang onnx\
&& conda install -c conda-forge protobuf==3.4.0 \
&& conda install -c ezyang/label/devgpu caffe2 \
Contributor:

install caffe2 from the official source

&& conda install -c ezyang onnx\
&& conda install -c conda-forge protobuf==3.4.0 \
&& conda install -c ezyang/label/devgpu caffe2 \
&& conda install -c ezyang onnx-caffe2 \
Contributor:

install caffe2 onnx extensions from official source

&& conda install -c ezyang/label/devgpu caffe2 \
&& conda install -c ezyang onnx-caffe2 \
&& conda install -c jjh_pytorch pytorch \
&& conda install -c jjh_pytorch torchvision
Contributor:

install pytorch and torchvision from the official channel: conda install pytorch torchvision -c pytorch
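Collecting the channel corrections from this review, the install chain would read approximately as follows (a sketch; the apt line and package pins are carried over from the quoted Dockerfile):

```dockerfile
RUN apt-get install -yqq -t jessie-backports openjdk-8-jdk \
    && conda install -y --file /lib/python_container_conda_deps.txt \
    && conda install -c anaconda cloudpickle=0.5.2 \
    && conda install -c conda-forge onnx \
    && conda install -c conda-forge protobuf==3.4.0 \
    && conda install -c caffe2 caffe2 \
    && pip install onnx-caffe2 \
    # -y added so conda runs non-interactively during the docker build
    && conda install -y pytorch torchvision -c pytorch
```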

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1044/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1045/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1046/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1047/

@dcrankshaw (Contributor) left a comment

LGTM. Nice job!

@dcrankshaw dismissed Corey-Zumar’s stale review, February 23, 2018 20:46

Comments have been addressed

@dcrankshaw merged commit f8ce8ec into ucbrise:develop on Feb 23, 2018
gtfierro pushed a commit to gtfierro/clipper that referenced this pull request Feb 27, 2018
* update caffe2 deployer

* update caffe2 container

* update caffe2 container

* update Caffe2Dockerfile

* update deploy_caffe2_models.py

* Update build_docker_images.sh

* Format code

* Update caffe2 container entrypoint permissions

* Update Caffe2Dockerfile

* Update caffe2_container.py

* Update caffe2.py

* Update build_docker_images.sh

* Rename caffe2.py to onnx.py

* Update onnx.py

* Update and rename caffe2_container.py to caffe2_onnx_container.py

* Update and rename caffe2_container_entry.sh to caffe2_onnx_container_entry.sh

* Rename Caffe2Dockerfile to Caffe2OnnxDockerfile

* Rename deploy_caffe2_models.py to deploy_pytorch_to_caffe2_with_onnx.py

* Update caffe2_onnx_container_entry.sh

* Update Caffe2OnnxDockerfile

* Update deploy_pytorch_to_caffe2_with_onnx.py

* Update onnx.py

* Update onnx.py

* Update caffe2_onnx_container_entry.sh

* Update onnx.py

* Update onnx.py

* Update deploy_pytorch_to_caffe2_with_onnx.py

* Update onnx.py

* Update deploy_pytorch_to_caffe2_with_onnx.py

* Update onnx.py

* Update deploy_pytorch_to_caffe2_with_onnx.py

* Support PyTorch + ONNX + Caffe2 Model deployer

* Support PyTorch + ONNX + Caffe2 Model deployer

* Update onnx.py
4 participants