
[CI] Fix maintenance scripts#707

Merged
simon-mo merged 19 commits into ucbrise:develop from rkooo567:mxnet_ci_debug
May 27, 2019
Conversation

@rkooo567
Collaborator

This will print why mxnet fails even before it starts. We should probably not merge it as-is; let's keep re-running CI until the mxnet failure reproduces.

Reference: #703
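The idea in this PR, a retry wrapper that surfaces the failing command's stderr, can be sketched as follows (a minimal illustration, not the actual CI script; the `retry` helper name and the attempt count are assumptions):

```shell
#!/bin/bash
# Hypothetical sketch of a retry wrapper that prints the failed command's
# stderr on every attempt, so CI logs show *why* a test failed instead of
# only "All retry failed."
retry() {
  local attempts=$1; shift
  local i err
  for ((i = 0; i < attempts; i++)); do
    # Capture stderr only: stdout goes to /dev/null, stderr into $err.
    if err=$("$@" 2>&1 >/dev/null); then
      return 0
    fi
    echo "Trial $i failed; stderr was:"
    echo "$err"
  done
  echo "All retry failed."
  return 1
}
```

For example, `retry 3 make integration_py2_mxnet` would print make's error output on each failed attempt instead of swallowing it.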

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2015/
Test FAILed.

@rkooo567
Collaborator Author

@withsmilo Looks like even with this change, the error log is not printed. Let me think of another way to do this.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2016/
Test PASSed.

@rkooo567
Collaborator Author

jenkins test this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2017/
Test FAILed.

@withsmilo
Collaborator

withsmilo commented May 25, 2019

@rkooo567

```
[integration_py2_mxnet] Sleep 482 secs before starting a test
[integration_py2_mxnet] Starting Trial 0 with timeout 2400.0 seconds
[integration_py2_mxnet] output: None
[integration_py2_mxnet] err: None
[integration_py2_mxnet] Sleep 230
[integration_py2_mxnet] Starting Trial 1 with timeout 2400.0 seconds
[integration_py2_mxnet] output: None
[integration_py2_mxnet] err: None
[integration_py2_mxnet] Sleep 295
[integration_py2_mxnet] All retry failed.
CI_test.Makefile:152: recipe for target 'integration_py2_mxnet' failed
make: *** [integration_py2_mxnet] Error 1
make: *** Waiting for unfinished jobs....
```

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2018/
Test PASSed.

@withsmilo
Collaborator

@rkooo567
I found an interesting thing: if Jenkins runs the job on amp-jenkins-staging-worker-02, 'integration_py2_mxnet' always succeeds, but on amp-jenkins-staging-worker-07 or amp-jenkins-staging-worker-08 it always fails.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2019/
Test FAILed.

@rkooo567
Collaborator Author

@withsmilo That's interesting... @simon-mo Do you have any input about this?

Also, it looks like the error is still not printed. I will push a new commit to resolve this.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2021/
Test FAILed.

@rkooo567
Collaborator Author

rkooo567 commented May 25, 2019

@withsmilo If it were a docker command failure, it would still print the logs. My guess is that the request to the docker daemon is hanging and timing out. (It is still weird, because then other tests should fail as well, yet mxnet is the only one that fails. I will also shuffle the order of the tests to see if mxnet is still the one that breaks.) I will try calling some docker commands to check connectivity.
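A minimal way to probe daemon connectivity with a timeout (assuming GNU `timeout` is available on the workers; the helper name is illustrative):

```shell
#!/bin/bash
# Hypothetical probe: if `docker info` does not answer within 10 seconds,
# assume the daemon is hung rather than the test itself being broken.
check_docker_daemon() {
  if timeout 10 docker info >/dev/null 2>&1; then
    echo "docker daemon reachable"
  else
    echo "docker daemon unreachable or hung"
  fi
}

check_docker_daemon
```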

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2023/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2025/
Test FAILed.

@rkooo567
Collaborator Author

@simon-mo @withsmilo One possibility is that docker volumes are not cleaned up properly on workers 7 & 8, and the daemon hangs because of that. I printed the volumes in the new commit here, and it looks like there are lots of volumes that were never deleted (or there may simply be many volumes because we run many tests; this needs verification).

@withsmilo
Collaborator

withsmilo commented May 26, 2019

Good point. According to https://docs.docker.com/storage/volumes/#remove-volumes, anonymous volumes are deleted automatically with the `--rm` option, but named volumes have to be removed with `docker volume prune`. Therefore we have to add a volume-cleanup command to `cleanup_jenkins.sh`. I will do it tonight.
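A sketch of what that addition to `cleanup_jenkins.sh` could look like (the `cleanup_volumes` helper and the guards are illustrative assumptions, not the script's actual contents):

```shell
#!/bin/bash
# Hypothetical volume-cleanup step for cleanup_jenkins.sh.
# `docker volume prune -f` removes every volume not referenced by any
# container; -f skips the confirmation prompt so it can run unattended.
cleanup_volumes() {
  docker volume prune -f
}

# Only attempt the cleanup when the docker CLI and daemon are available.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  cleanup_volumes
fi
```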

@withsmilo
Collaborator

@simon-mo
Thanks. https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2030/ was executed at worker 2. Let's run it again.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2031/
Test FAILed.

@withsmilo
Collaborator

@rkooo567
https://amplab.cs.berkeley.edu/jenkins/job/Clipper-PRB/2031/ shows that many old Docker images were not deleted.

@withsmilo
Collaborator

I added a command, `docker image ls --filter "label=maintainer=Dan Crankshaw <dscrankshaw@gmail.com>" | awk '{ print $3 }' | xargs docker image rm -f`, to clean up old Docker images. This is a temporary maintenance patch. After cleaning, I will add a new label to all the Clipper images and use that label for cleanup.
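As a side note, the same cleanup can be written with `docker image ls -q`, which prints only image IDs and so avoids piping the table header row into `docker image rm` (a sketch, assuming the Clipper images carry this maintainer label):

```shell
#!/bin/bash
# Hypothetical variant of the image cleanup: -q emits bare image IDs, so
# no awk is needed and the "IMAGE ID" header never reaches `docker image rm`.
LABEL='maintainer=Dan Crankshaw <dscrankshaw@gmail.com>'

remove_labeled_images() {
  # xargs -r skips the rm entirely when no images match the label.
  docker image ls -q --filter "label=$LABEL" | xargs -r docker image rm -f
}

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  remove_labeled_images
fi
```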

@rkooo567
Collaborator Author

@withsmilo Sounds good.

Also, why do you think it only happens for the mxnet test? If the cause is running out of disk space (as we suspect), shouldn't other tests fail as well? There may be some other cause, but I cannot think of one.

@simon-mo Let us know if you have any clue.

@withsmilo
Collaborator

@rkooo567 I don't know why.
@simon-mo Can you connect to worker 7 or 8 through a terminal? If so, the best approach might be to run the mxnet test command directly on worker 7 or 8.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2032/
Test FAILed.

@withsmilo
Collaborator

withsmilo commented May 27, 2019

@rkooo567 @simon-mo
https://amplab.cs.berkeley.edu/jenkins/job/Clipper-PRB/2033/ shows,

```
[integration_py3_mxnet] /usr/bin/python: line 2:     6 Illegal instruction     (core dumped) python3 "$@"
```

I guess that the CPUs in workers 7 & 8 are incompatible with the latest mxnet: https://github.com/apache/incubator-mxnet/issues?utf8=%E2%9C%93&q=is%3Aissue+Illegal+instruction
Can we exclude workers 7 & 8 from the Clipper workers?
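The CPU-compatibility hypothesis can be checked directly on each worker by reading the CPU flags (a Linux-only sketch; the helper name is illustrative):

```shell
#!/bin/bash
# Hypothetical check: does this machine's CPU advertise the avx2 flag that
# recent prebuilt mxnet binaries rely on? Reads /proc/cpuinfo (Linux only).
has_avx2() {
  grep -qw 'avx2' /proc/cpuinfo 2>/dev/null
}

if has_avx2; then
  echo "avx2 supported"
else
  echo "avx2 missing: prebuilt mxnet may crash with SIGILL (Illegal instruction)"
fi
```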

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2033/
Test FAILed.

@withsmilo
Collaborator

Some of the modifications in this PR are worth merging into the develop branch. After resolving the integration_py3_mxnet issue, let's create a new PR or modify this one.

@rkooo567
Collaborator Author

@withsmilo Agree. Let's just remove debug logs and merge this PR. Once it is merged, I will create an issue to handle how to delete dangling images.

@simon-mo
Contributor

@withsmilo, @rkooo567 and I found out that only workers 1 and 2 have AVX2, and mxnet requires AVX2. I constrained the test to run on worker 2 for now (worker 1 is down).

This is a temporary fix. I will start investigating alternative ways to do our CIs.

@simon-mo
Contributor

Also, the volumes and images are cleaned each night via a cron job, using:

1) docker-gc (https://github.com/spotify/docker-gc)
2) stackoverflow knowledge:

```shell
$ cat cleanup-docker.sh
#!/bin/bash

# remove exited containers:
docker ps --filter status=dead --filter status=exited -aq | xargs -r docker rm -v

# remove unused images:
docker images --no-trunc | grep '<none>' | awk '{ print $3 }' | xargs -r docker rmi

# remove unused volumes:
find '/var/lib/docker/volumes/' -mindepth 1 -maxdepth 1 -type d | grep -vFf <(
  docker ps -aq | xargs docker inspect | jq -r '.[] | .Mounts | .[] | .Name | select(.)'
) | xargs -r rm -fr

docker volume ls -qf dangling=true | xargs -r docker volume rm
```

@withsmilo
Collaborator

@simon-mo Great! Understood: we should run only one Jenkins job at a time until the CI workers' problem is resolved.

@withsmilo changed the title from "Change retry function to print stderr to debug" to "[CI] Fix maintenance scripts" on May 27, 2019
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2035/
Test PASSed.

@withsmilo withsmilo requested review from simon-mo and withsmilo May 27, 2019 23:31
Collaborator

@withsmilo withsmilo left a comment


LGTM
