Handle tee_stdout.TimeoutExpired with warn() and terminate() #740
thequilo merged 2 commits into IDSIA:master
|
The flake8 error is unrelated to these changes. It may be a new error caused by a flake8 upgrade, since running flake8 on master right now gives this error: Anyways, I ended up renaming |
|
This seems related to #289. I'd rather get to the bottom of this, but given that the bug has been open for over a year, a workaround like this is not crazy. |
|
Interestingly, in my case this is a deterministic failure. |
Runs often failed with a tee timeout error in multiprocessing situations; this fixes that.
Cf.: IDSIA/sacred#740
Change-Id: I81071deee538f864802206a0860fee4bcfba9f23
|
Is there a more elegant way to fix this than just taking a shotgun to the

```python
import os


def stop_forkserver():
    from multiprocessing.forkserver import _forkserver as _fs
    if hasattr(_fs, '_stop'):
        _fs._stop()  # Python 3.8+
    else:
        # this next bit is copied from Python 3.8's ForkServer._stop()
        # (should work in 3.7, at least on Linux)
        if _fs._forkserver_pid is not None:
            os.close(_fs._forkserver_alive_fd)  # child now dies of its own accord
            os.waitpid(_fs._forkserver_pid, 0)  # reap the child
            os.unlink(_fs._forkserver_address)  # remove IPC channel
            _fs._forkserver_alive_fd = _fs._forkserver_pid = _fs._forkserver_address = None
```

Of course, the most elegant solution would be one that allows Sacred to stop before all subprocesses exit. Unfortunately, I'm not enough of a Unix wizard to say whether that exists. At the very least, I don't see a way to do it while still retaining the current method of redirecting output with |
|
Flake8 is fixed now. This should be passing soon. |
|
I would prefer a solution to the problem and not killing the processes, but as long as we don't have a good solution, this workaround is fine I think. |
When I run Sacred jobs in parallel using Ray, I run into an error where the experiment first prints that it completed successfully and then crashes with an unhandled subprocess.TimeoutExpired while waiting for tee_stdout to finish.

I don't know why this happens, but handling the exception, warning the user, and then forcibly calling terminate on the tee subprocess seems like a pretty harmless solution.

Closing the tee forcibly here could lead to some of the captured stdout not making it into Observers, but this is still better than the current behavior for TimeoutExpired, which is to simply exit on error and not terminate.

If it is helpful, I can try to come up with a minimal reproducible example of this error.
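
For readers who want to see the shape of the workaround, here is a minimal sketch of the warn-and-terminate pattern described above. It is not Sacred's actual code: `wait_or_terminate` and its `timeout` parameter are hypothetical names chosen for illustration, and the sketch assumes the tee process is an ordinary `subprocess.Popen` handle.

```python
import subprocess
import sys
import warnings


def wait_or_terminate(proc, timeout=1.0):
    """Wait for proc to exit; on timeout, warn and terminate it.

    Hypothetical helper for illustration only.
    Returns True if the process exited on its own, False if it was terminated.
    """
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        # Surface the problem instead of crashing the finished run.
        warnings.warn(
            f"process {proc.pid} did not exit within {timeout}s; terminating it"
        )
        proc.terminate()
        proc.wait()  # reap the terminated child so it does not linger
        return False


if __name__ == "__main__":
    # Stand-in for a tee process that never sees EOF on its input.
    stuck = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(wait_or_terminate(stuck, timeout=0.5))
```

The trade-off is the one noted in the description: output still buffered in the terminated process may be lost, but the run no longer fails after it has already completed.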