[CI] Fix maintenance scripts #707
|
Test FAILed. |
|
@withsmilo Looks like even with this change, the error log is not printed. Let me think of another way to do this. |
|
Test PASSed. |
|
jenkins test this please |
|
Test FAILed. |
|
|
Test PASSed. |
|
@rkooo567 |
|
Test FAILed. |
|
@withsmilo That's interesting... @simon-mo Do you have any input about this? Also, it looks like the error is still not printed. I will push a new commit to resolve this. |
|
Test FAILed. |
|
@withsmilo If it is a docker command failure, it should still print out the logs. I guess the request to the docker daemon is hanging and timing out. (That is still weird, because then other tests should fail as well, but mxnet is the only test that fails. I will also change the order of the tests and see if mxnet is still broken.) I will try to call some docker commands to check connectivity. |
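A connectivity check along these lines could help tell a hanging daemon apart from a command that simply fails. This is only a sketch: the helper name `run_with_timeout` and the 30-second default are assumptions for illustration, not what this PR actually runs.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=30.0):
    """Run cmd and return (status, output).

    status is one of 'ok', 'failed', 'timeout' or 'missing'.
    The name and the default timeout are hypothetical.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error text together with normal output
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The command did not return in time, e.g. a request to the daemon hanging.
        return "timeout", ""
    except FileNotFoundError:
        # The executable itself is not installed on this worker.
        return "missing", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout
```

On a worker, something like `run_with_timeout(["docker", "info"])` would then report `timeout` if the daemon hangs, and `failed` together with the captured output if the command returns an error.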
|
Test FAILed. |
|
Test FAILed. |
|
@simon-mo @withsmilo One possibility is that the docker volumes are not cleaned up properly on workers 7 & 8, and that the daemon hangs because of that. I printed the volumes in the new commit here, and it seems there are lots of volumes that have not been deleted (I am not sure whether they were left undeleted or there are just lots of volumes because we are running lots of tests; I need some verification). |
|
Good point. According to https://docs.docker.com/storage/volumes/#remove-volumes, an anonymous volume is deleted automatically with the --rm option. However, in the case of a named volume, we have to execute |
|
@simon-mo |
|
Test FAILed. |
|
@rkooo567 |
|
I added a command |
|
@withsmilo Sounds good. Also, why do you think it only occurs in the mxnet test? If it is because we are running out of disk space (which is what we suspect), isn't it supposed to happen in other tests as well? I guess there might be some other cause, but I cannot think of any. @simon-mo Let us know if you have any clue. |
|
Test FAILed. |
|
@rkooo567 @simon-mo I guess that the CPUs in workers 7 & 8 are incompatible with the latest mxnet. https://github.com/apache/incubator-mxnet/issues?utf8=%E2%9C%93&q=is%3Aissue+Illegal+instruction |
|
Test FAILed. |
|
Some of the modifications included in this PR are worth merging with |
|
@withsmilo Agreed. Let's just remove the debug logs and merge this PR. Once it is merged, I will create an issue to handle how to delete dangling images. |
|
@withsmilo, @rkooo567 and I found out that only workers 1 and 2 have avx2, and mxnet requires avx2. I constrained the test to run on worker 2 for now (worker 1 is down). This is a temporary fix. I will start investigating alternative ways to run our CI. |
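One way a worker could be checked for avx2 before scheduling the mxnet test is sketched below. It assumes x86 Linux workers that expose a `flags` line in `/proc/cpuinfo`. The function names `cpu_flags` and `has_avx2` are hypothetical and are not part of this PR.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text.

    Only the first 'flags' line is read; an empty set is returned if none exists.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx2(cpuinfo_path="/proc/cpuinfo"):
    """True if the first CPU listed in cpuinfo_path reports the avx2 flag.

    Assumption: x86 Linux, where the kernel exposes feature flags in this file.
    """
    with open(cpuinfo_path) as f:
        return "avx2" in cpu_flags(f.read())
```

Running `has_avx2()` on each worker would show which ones can run the mxnet test without hitting an illegal instruction.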
|
Also, the volumes and images are cleaned each night via a cronjob. |
|
@simon-mo Great! I understand that we have to run only one Jenkins job at a time until the CI workers' problem is resolved. |
|
Test PASSed. |
This will print why mxnet fails even before it starts. We should probably not merge it. Let's keep running it until mxnet fails without running.
Reference: #703