Skip to content

Handle tee_stdout.TimeoutExpired with warn() and terminate()#740

Merged
thequilo merged 2 commits intoIDSIA:masterfrom
HumanCompatibleAI:handle_tee_timeout
Oct 16, 2020
Merged

Handle tee_stdout.TimeoutExpired with warn() and terminate()#740
thequilo merged 2 commits intoIDSIA:masterfrom
HumanCompatibleAI:handle_tee_timeout

Conversation

@shwang
Copy link
Contributor

@shwang shwang commented May 31, 2020

When I run Sacred jobs in parallel using Ray I run into an error where the experiment first prints that it finished complete successfully, and then crashes with a unhandled subprocess.TimeoutExpired while waiting for tee_stdout to finish.

I don't know why this happens, but it seems like a pretty harmless solution to handle the exception, warn the user, and then terminate the tee subproc forcibly.

Closing the tee forcibly here could lead to some of the captured stdout not making it into Observers, but this is still better than the current behavior for TimeoutExpired, which is to simply exit on error and not terminate.

If it is helpful, then I can try to come up a minimal reproducible example of this error.

@shwang
Copy link
Contributor Author

shwang commented May 31, 2020

The flake8 error is unrelated to these changes. Maybe it's a new error due to a flake8 upgrade, as if we ran flake8 on master right now we would get this error:

flake8 run-test: commands[0] | flake8 sacred
sacred/config/config_scope.py:189:48: E741 ambiguous variable name 'l'
sacred/config/config_scope.py:190:35: E741 ambiguous variable name 'l'

Anyways, I ended renaming l something longer in #737 to avoid this flake8 error.

@AdamGleave
Copy link
Contributor

This seems related to #289

I'd rather get to the bottom of this but given the bug has been open for over a year a workaround like this is not crazy.

@shwang
Copy link
Contributor Author

shwang commented Jun 3, 2020

Interestingly in my case this is a deterministic fail

@EspenHa
Copy link

EspenHa commented Jun 26, 2020

I also submitted a PR for this issue here: #718. Also related: #705

@AdamGleave
Copy link
Contributor

@shwang I think you forgot to push the flake8 fix since that test is still failing?

@EspenHa thanks for the link! seems like a similar approach.

muffgaga pushed a commit to electronicvisions/spack that referenced this pull request Jul 10, 2020
Runs often failed with a tee timeout error in multiprocessing situations, this
fixes this. Cf.: IDSIA/sacred#740

Change-Id: I81071deee538f864802206a0860fee4bcfba9f23
@qxcv
Copy link

qxcv commented Aug 25, 2020

Is there a more elegant way to fix this than just taking a shotgun to the tee children? I've run into this problem running multiprocessing with forkserver. It seems like the problem is that forkserver subprocess doesn't shut down until the parent process exits. Hence, tee keeps waiting on output from the forkserver (not the main process) indefinitely, and we get a timeout. There is a hack that will let you shut down the forkserver at the end of your script, thereby allowing Sacred to close the file descriptor used to capture stdout (in Python 3.7+):

def stop_forkserver():
    from multiprocessing.forkserver import _forkserver as _fs
    if hasattr(_fs, '_stop'):
        _fs._stop()  # Python 3.8+
    else:
        # this next bit is copied from Python 3.8's ForkServer._stop()
        # (should work in 3.7, at least on Linux)
        if _fs._forkserver_pid is not None:
            os.close(_fs._forkserver_alive_fd)  # child now dies of its own accord
            os.waitpid(_fs._forkserver_pid, 0)  # reap the child
            os.unlink(_fs._forkserver_address)  # remove IPC channel
            _fs._forkserver_alive_fd = _fs._forkserver_pid = _fs._forkserver_address = None

Of course, the most elegant solution would be one that allows Sacred to stop before all subprocesses exit. Unfortunately I'm not enough of a Unix wizard to say whether that exists. At the very least, I don't see a way to do it while still retaining the current method of redirecting output with tee.

@shwang
Copy link
Contributor Author

shwang commented Oct 15, 2020

Flake8 is fixed now. This should be passing soon.

@AdamGleave
Copy link
Contributor

Nice! @thequilo could you review?

Not the most elegant solution as @qxcv points out but seems a significant improvement on the current behavior (Sacred failing spuriously on certain systems).

@thequilo
Copy link
Collaborator

I would prefer a solution to the problem and not killing the processes, but as long as we don't have a good solution, this workaround is fine I think.

@thequilo thequilo merged commit e9e8b6f into IDSIA:master Oct 16, 2020
@shwang shwang deleted the handle_tee_timeout branch November 2, 2020 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants