Skip to content

Support for PyTorch model#322

Closed
haofanwang wants to merge 0 commit intoucbrise:developfrom
haofanwang:develop
Closed

Support for PyTorch model#322
haofanwang wants to merge 0 commit intoucbrise:developfrom
haofanwang:develop

Conversation

@haofanwang
Copy link
Member

Implement a model deployer for PyTorch,support deploying PyTorch models directly from the clipper_admin tool, similarly to the way we support deploying PySpark models directly by calling Clipper.deploy_pyspark_model.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Copy link
Contributor

@Corey-Zumar Corey-Zumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works well - just a few minor issues to address. Nice job!

registry=None,
base_image="clipper/pytorch-container:{}".format(__version__),
num_replicas=1):
"""Registers an app and deploys the provided predict function with PySpark model as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change "PySpark" to "PyTorch"


try:
torch.save(pytorch_model,torch_model_save_loc)
#torch.save(pytorch_model.state_dict(), torch_model_save_loc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented code


import numpy as np

# sys.path.append(os.path.abspath("/lib/"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented code

with open(file_path, 'r') as serialized_func_file:
return cloudpickle.load(serialized_func_file)

##### Need write code to load pytorch model here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line


##### Need write code to load pytorch model here
def load_pytorch_model(model_path):
print(model_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove debugging print statement

model = torch.load(model_path)
return model

##### Need write code for pytorch container here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

return [str(p) for p in preds]

def predict_doubles(self, inputs):
preds = self.predict_func(inputs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call to self.predict_func is missing the self.model parameter.

@@ -0,0 +1,15 @@
#!/usr/bin/env sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modify the permissions of this script such that it can be executed (like we did during the last meeting) and commit the permissions change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I‘m not sure what should I modify here,I just do the same thing as all other .sh file,do you mean I need change "#!/usr/bin/env sh" to " #!/bin/bash"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, by permissions change I mean chmod the script so that the container can execute it. 777 should work just fine

@haofanwang
Copy link
Member Author

haofanwang commented Nov 25, 2017

Done,please check it. @Corey-Zumar

Copy link
Contributor

@dcrankshaw dcrankshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks pretty good. But you need to add some tests. Check out the PySpark model deployer tests for an example.

@haofanwang
Copy link
Member Author

Done.

Copy link
Contributor

@Corey-Zumar Corey-Zumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@haofanwang The deployment functionality looks really good. Most of the comments I left concern the test. The training method doesn't currently run due to an exception being thrown with the forward propagation step. If training the model before deploying it is too difficult, it's fine to write a test that deploys one of the pretrained PyTorch models, such as VGG16 from TorchVision, which can be used as

> from torchvision.models import vgg16
> model = vgg16()
...

labels=None,
registry=None,
num_replicas=1):
"""Deploy a Python function with a PySpark model.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change PySpark to PyTorch

registry=None,
num_replicas=1):
"""Deploy a Python function with a PySpark model.
The function must take 3 arguments (in order): a SparkSession, the PySpark model, and a list of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this description. The function no longer takes a SparkSession object. Replace all instances of PySpark of PyTorch

running_loss = 0.0
for i, data in enumerate(train_loader, 1):
img, label = data
if use_gpu:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is not defined

img = Variable(img)
label = Variable(label)
else:
img = Variable(img)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both the "if" and "else" cases are the same

def __len__(self):
return len(self.imgs)

def getName(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is never used

return 0


def parseData(train_path, pos_label):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Replace camelcase method name with parse_data.



def parseData(train_path, pos_label):
trainData = np.genfromtxt(train_path, delimiter=',', dtype=int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace camelcase: train_data

cleanup=True, start_clipper=True)

train_path = os.path.join(cur_dir, "data/train.data")
(x, y) = parseData(train_path, pos_label)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace with train_x, train_y = parse_data(train_path, pos_label)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
for epoch in range(1000):
print('*' * 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove print statement

optimizer = optim.SGD(model.parameters(), lr=0.001)
for epoch in range(1000):
print('*' * 10)
print('epoch {}'.format(epoch + 1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove print statement

@haofanwang
Copy link
Member Author

@Corey-Zumar I do some changes,please check it.

Copy link
Contributor

@Corey-Zumar Corey-Zumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@haofanwang There's an error in the deployment test. The container crashed with the following message:

Traceback (most recent call last):
  File "/container/pytorch_container.py", line 110, in <module>
    input_type)
  File "/container/rpc.py", line 491, in start
    self.server.run()
  File "/container/rpc.py", line 303, in run
    prediction_request)
  File "/container/rpc.py", line 131, in handle_prediction_request
    outputs = predict_fn(prediction_request.inputs)
  File "/container/pytorch_container.py", line 43, in predict_ints
    preds = self.predict_func(self.model, inputs)
  File "deploy_pytorch_models.py", line 62, in predict
TypeError: 'NoneType' object is not callable

@haofanwang
Copy link
Member Author

haofanwang commented Dec 7, 2017

Oh,I forgot to return the trained model. Have fixed it. @Corey-Zumar

Copy link
Contributor

@Corey-Zumar Corey-Zumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@haofanwang The container launched by the integration test crashes with the follow stack trace

Traceback (most recent call last):
  File "/container/pytorch_container.py", line 103, in <module>
    model = PyTorchContainer(model_path, input_type)
  File "/container/pytorch_container.py", line 36, in __init__
    self.model = load_pytorch_model(torch_model_path)
  File "/container/pytorch_container.py", line 22, in load_pytorch_model
    model = torch.load(model_path)
  File "/opt/conda/lib/python2.7/site-packages/torch/serialization.py", line 231, in load
    return _load(f, map_location, pickle_module)
  File "/opt/conda/lib/python2.7/site-packages/torch/serialization.py", line 379, in _load
    result = unpickler.load()
AttributeError: 'module' object has no attribute 'BasicNN'

Before submitting this for another round of reviews, please make sure that the integration test passes. You can test it by running python integration-tests/deploy_pytorch_models.py from the clipper root directory.

def train(model):
model.train()
optimizer = optim.SGD(model.parameters(), lr=0.001)
for epoch in range(2000):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's only train the model for 10 epochs

@haofanwang
Copy link
Member Author

@Corey-Zumar Integration test has passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants