
[CI] Fix maintenance scripts#707

Merged
simon-mo merged 19 commits into ucbrise:develop from rkooo567:mxnet_ci_debug
May 27, 2019
Conversation

@rkooo567
Collaborator

This will print why mxnet fails even before it starts. We should probably not merge it as-is; let's keep re-running CI until the mxnet failure reproduces.

Reference: #703
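The idea in this PR, a retry wrapper that surfaces the failing command's stderr, can be sketched as follows (a minimal illustration, not the actual CI script; the `retry` helper name and the attempt count are assumptions):

```shell
#!/bin/bash
# Hypothetical sketch of a retry wrapper that prints the failed command's
# stderr on every attempt, so CI logs show *why* a test failed instead of
# only "All retry failed."
retry() {
  local attempts=$1; shift
  local i err
  for ((i = 0; i < attempts; i++)); do
    # Capture stderr only: stdout goes to /dev/null, stderr into $err.
    if err=$("$@" 2>&1 >/dev/null); then
      return 0
    fi
    echo "Trial $i failed; stderr was:"
    echo "$err"
  done
  echo "All retry failed."
  return 1
}
```

For example, `retry 3 make integration_py2_mxnet` would print make's error output on each failed attempt instead of swallowing it.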

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2015/
Test FAILed.

@rkooo567
Collaborator Author

@withsmilo Looks like even with this change, the error log is not printed. Let me think of another way to do this.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2016/
Test PASSed.

@rkooo567
Collaborator Author

jenkins test this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2017/
Test FAILed.

@withsmilo
Collaborator

withsmilo commented May 25, 2019

@rkooo567

```
[integration_py2_mxnet] Sleep 482 secs before starting a test
[integration_py2_mxnet] Starting Trial 0 with timeout 2400.0 seconds
[integration_py2_mxnet] output: None
[integration_py2_mxnet] err: None
[integration_py2_mxnet] Sleep 230
[integration_py2_mxnet] Starting Trial 1 with timeout 2400.0 seconds
[integration_py2_mxnet] output: None
[integration_py2_mxnet] err: None
[integration_py2_mxnet] Sleep 295
[integration_py2_mxnet] All retry failed.
CI_test.Makefile:152: recipe for target 'integration_py2_mxnet' failed
make: *** [integration_py2_mxnet] Error 1
make: *** Waiting for unfinished jobs....
```

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2018/
Test PASSed.

@withsmilo
Collaborator

@rkooo567
I found an interesting thing: if Jenkins runs the job on amp-jenkins-staging-worker-02, 'integration_py2_mxnet' always succeeds, but on amp-jenkins-staging-worker-07 or amp-jenkins-staging-worker-08 it always fails.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2019/
Test FAILed.

@rkooo567
Collaborator Author

@withsmilo That's interesting... @simon-mo Do you have any input about this?

Also, it looks like the error is still not printed. I will push a new commit to resolve this.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2021/
Test FAILed.

@rkooo567
Collaborator Author

rkooo567 commented May 25, 2019

@withsmilo If it were a docker command failure, it would still print the logs. My guess is that the request to the docker daemon is hanging and timing out. (It is still weird, because then other tests should fail as well, yet mxnet is the only one that fails. I will also shuffle the order of the tests to see if mxnet is still the one that breaks.) I will try calling some docker commands to check connectivity.
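A minimal way to probe daemon connectivity with a timeout (assuming GNU `timeout` is available on the workers; the helper name is illustrative):

```shell
#!/bin/bash
# Hypothetical probe: if `docker info` does not answer within 10 seconds,
# assume the daemon is hung rather than the test itself being broken.
check_docker_daemon() {
  if timeout 10 docker info >/dev/null 2>&1; then
    echo "docker daemon reachable"
  else
    echo "docker daemon unreachable or hung"
  fi
}

check_docker_daemon
```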

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2023/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2025/
Test FAILed.

@rkooo567
Collaborator Author

@simon-mo @withsmilo One possibility is that docker volumes are not cleaned up properly on workers 7 & 8, and the daemon hangs because of that. I printed the volumes in the new commit here, and it looks like there are lots of volumes that were never deleted (or there may simply be many volumes because we run many tests; this needs verification).

@withsmilo
Collaborator

withsmilo commented May 26, 2019

Good point. According to https://docs.docker.com/storage/volumes/#remove-volumes, anonymous volumes are deleted automatically with the `--rm` option, but named volumes have to be removed with `docker volume prune`. Therefore we have to add a volume-cleanup command to `cleanup_jenkins.sh`. I will do it tonight.
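A sketch of what that addition to `cleanup_jenkins.sh` could look like (the `cleanup_volumes` helper and the guards are illustrative assumptions, not the script's actual contents):

```shell
#!/bin/bash
# Hypothetical volume-cleanup step for cleanup_jenkins.sh.
# `docker volume prune -f` removes every volume not referenced by any
# container; -f skips the confirmation prompt so it can run unattended.
cleanup_volumes() {
  docker volume prune -f
}

# Only attempt the cleanup when the docker CLI and daemon are available.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  cleanup_volumes
fi
```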

@withsmilo
Collaborator

@simon-mo
Thanks. https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2030/ was executed at worker 2. Let's run it again.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2031/
Test FAILed.

@withsmilo
Collaborator

@rkooo567
https://amplab.cs.berkeley.edu/jenkins/job/Clipper-PRB/2031/ shows that many old Docker images were not deleted.

@withsmilo
Collaborator

I added a command, `docker image ls --filter "label=maintainer=Dan Crankshaw <dscrankshaw@gmail.com>" | awk '{ print $3 }' | xargs docker image rm -f`, to clean up old Docker images. This is a temporary maintenance patch. After cleaning, I will add a new label to all the Clipper images and use that label for cleanup.
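As a side note, the same cleanup can be written with `docker image ls -q`, which prints only image IDs and so avoids piping the table header row into `docker image rm` (a sketch, assuming the Clipper images carry this maintainer label):

```shell
#!/bin/bash
# Hypothetical variant of the image cleanup: -q emits bare image IDs, so
# no awk is needed and the "IMAGE ID" header never reaches `docker image rm`.
LABEL='maintainer=Dan Crankshaw <dscrankshaw@gmail.com>'

remove_labeled_images() {
  # xargs -r skips the rm entirely when no images match the label.
  docker image ls -q --filter "label=$LABEL" | xargs -r docker image rm -f
}

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  remove_labeled_images
fi
```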

@rkooo567
Collaborator Author

@withsmilo Sounds good.

Also, why do you think it only happens for the mxnet test? If the cause is running out of disk space (as we suspect), shouldn't other tests fail as well? There may be some other cause, but I cannot think of one.

@simon-mo Let us know if you have any clue.

@withsmilo
Collaborator

@rkooo567 I don't know why.
@simon-mo Can you connect to worker 7 or 8 through a terminal? If so, the best approach might be to run the mxnet test command directly on worker 7 or 8.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2032/
Test FAILed.

@withsmilo
Collaborator

withsmilo commented May 27, 2019

@rkooo567 @simon-mo
https://amplab.cs.berkeley.edu/jenkins/job/Clipper-PRB/2033/ shows,

```
[integration_py3_mxnet] /usr/bin/python: line 2:     6 Illegal instruction     (core dumped) python3 "$@"
```

I guess that the CPUs in workers 7 & 8 are incompatible with the latest mxnet: https://github.com/apache/incubator-mxnet/issues?utf8=%E2%9C%93&q=is%3Aissue+Illegal+instruction
Can we exclude workers 7 & 8 from the Clipper workers?
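The CPU-compatibility hypothesis can be checked directly on each worker by reading the CPU flags (a Linux-only sketch; the helper name is illustrative):

```shell
#!/bin/bash
# Hypothetical check: does this machine's CPU advertise the avx2 flag that
# recent prebuilt mxnet binaries rely on? Reads /proc/cpuinfo (Linux only).
has_avx2() {
  grep -qw 'avx2' /proc/cpuinfo 2>/dev/null
}

if has_avx2; then
  echo "avx2 supported"
else
  echo "avx2 missing: prebuilt mxnet may crash with SIGILL (Illegal instruction)"
fi
```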

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2033/
Test FAILed.

@withsmilo
Collaborator

Some of the modifications in this PR are worth merging into the develop branch. After resolving the integration_py3_mxnet issue, let's create a new PR or modify this one.

@rkooo567
Collaborator Author

@withsmilo Agree. Let's just remove debug logs and merge this PR. Once it is merged, I will create an issue to handle how to delete dangling images.

@simon-mo
Contributor

@withsmilo, @rkooo567 and I found out that only workers 1 and 2 have AVX2, and mxnet requires AVX2. I constrained the test to run on worker 2 for now (worker 1 is down).

This is a temporary fix. I will start investigating alternative ways to do our CIs.

@simon-mo
Contributor

Also, the volumes and images are cleaned each night via a cron job, using:

1) docker-gc (https://github.com/spotify/docker-gc)
2) stackoverflow knowledge:

```shell
$ cat cleanup-docker.sh
#!/bin/bash

# remove exited containers:
docker ps --filter status=dead --filter status=exited -aq | xargs -r docker rm -v

# remove unused images:
docker images --no-trunc | grep '<none>' | awk '{ print $3 }' | xargs -r docker rmi

# remove unused volumes:
find '/var/lib/docker/volumes/' -mindepth 1 -maxdepth 1 -type d | grep -vFf <(
  docker ps -aq | xargs docker inspect | jq -r '.[] | .Mounts | .[] | .Name | select(.)'
) | xargs -r rm -fr

docker volume ls -qf dangling=true | xargs -r docker volume rm
```

@withsmilo
Collaborator

@simon-mo Great! Understood: we should run only one Jenkins job at a time until the CI workers' problem is resolved.

@withsmilo changed the title from "Change retry function to print stderr to debug" to "[CI] Fix maintenance scripts" on May 27, 2019
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2035/
Test PASSed.

@withsmilo withsmilo requested review from simon-mo and withsmilo May 27, 2019 23:31
Collaborator

@withsmilo withsmilo left a comment


LGTM
