Fix leaking orphan processes on macOS/Linux#62782
Fix leaking orphan processes on macOS/Linux#62782zafnz wants to merge 3 commits intodart-lang:mainfrom
Conversation
Process.start can leak orphaned child processes that sleep forever. When ProcessStarter::Start() encounters an error **after** fork() but **before** signaling the child to proceed, the cleanup path (CleanupAndReturnError) closes pipes but never kills the forked child. The child blocks forever in read(), waiting for a "go" signal that will never arrive. This is exacerbated by a secondary issue: the child inherits the write end of its own signal pipe, so even when the parent closes its copy, the pipe doesn't break, the child's read() never returns EOF. Closes: dart-lang#62781 dart-lang#62781
|
Thank you for your contribution! This project uses Gerrit for code reviews. Your pull request has automatically been converted into a code review at: https://dart-review.googlesource.com/c/sdk/+/484060 Please wait for a developer to review your code review at the above link; you can speed up the review if you sign into Gerrit and manually add a reviewer that has recently worked on the relevant code. See CONTRIBUTING.md to learn how to upload changes to Gerrit directly. Additional commits pushed to this PR will update both the PR and the corresponding Gerrit CL. After the review is complete on the CL, your reviewer will merge the CL (automatically closing this PR). |
runtime/bin/process_linux.cc
Outdated
| if (err != 0) { | ||
| // The child is blocked in read() waiting for the go signal that | ||
| // will never arrive. Kill it and reap immediately. | ||
| // ProcessStarted() has not been called yet so the exit-code |
There was a problem hiding this comment.
This comment and reordering of ProcessStarted and RegisterProcess is questionable ExitCodeHandler might already be running (e.g. if there was another process started before this one) - so the exit handler can be involved. Is it necessary to reorder, instead of relying on ExitCodeHandler?
Another questionable thing is: here it says that the child is blocked on read(), but the comment below in NewProcess says that child will be unblocked with read() returning EOF because pipe's write-fd is closed.
There was a problem hiding this comment.
If we switch to using pidfd (#60854), the child won't need to wait at all. There's no race with adding the pidfd to the epoll set after it dies like there is with another thread calling wait.
There was a problem hiding this comment.
You're right that it's not great, and yeah this fix has a race condition.
There is a simplier approach i could do, which is keep the original order, kill the child, and just undo the ProcessStarted() call. Only problem there is there's no ProcessTerminated() method.
Could add one though
static void ProcessTerminated() {
MonitorLocker locker(monitor_);
process_count_--;
}
And then change the fix to:
ExitCodeHandler::ProcessStarted();
err = RegisterProcess(pid);
if (err != 0) {
kill(pid, SIGKILL);
TEMP_FAILURE_RETRY(waitpid(pid, nullptr, 0));
ExitCodeHandler::ProcessTerminated();
return err;
}
How does that sound?
There was a problem hiding this comment.
The variant you are proposing has another race: we might end up reaping pid before ExitCodeHandler reaches its wait (but after it checks process_count_ == 0) - in which case wait will fail with "no children" error (which is treated as fatal).
I suggest we do something like this instead. Lets leave reaping to the exit process thread, so that we don't need to think about races.
Note that we don't actually need kill(...) because ReadFromBlocking will fail naturally.
There was a problem hiding this comment.
Ok, yeah, i've put that in, that works and so far I'm unable to crash or lockup any children in my testing.
There was a problem hiding this comment.
Welp. Trying to run tests on the change actually reveals why read_in_[1] was not closed. Because down in ExecProcess we do:
if (TEMP_FAILURE_RETRY(dup2(read_in_[1], STDOUT_FILENO)) == -1) {
ReportChildError();
}
close(read_in_[0]);
close(read_in_[1]);This obviously does not work if we close read_in_[1].
I think it's a bit mind-boggling that this code uses stdout pipe in both directions. After fork child is reading from that pipe and parent is writing to it, but after exec child will start writing into it and parent is reading. mind blown gif
I suggest we make another change and replace read_in_ with write_out_ in the child-parent hand-shake process, because child process does not need write_out_[1] so (unlike read_in_[1]) it can be closed without issues.
There was a problem hiding this comment.
Oh, lol 🤦 . I just spent 20 minutes trying to figure out how I managed to test this without it failing, only to find that I was testing against an earlier patch, not the latest -- very broken -- "fix". I also didn't run the full test suite for the codebase. Mea culpa. Sigh. Sometimes getting out of bed is the only the first mistake you make of the day.
New patch incoming shortly 😂
ps Oh I don't know the repo policies re PRs and commits. So far I've kept each change as a separate commit, but I can squash them if you'd prefer.
There was a problem hiding this comment.
I would suggest a simplification. It will require you to change more places, but result will be simpler logic overall. I think you can update CreatePipes method and flip creation of read_in_ and write_out_ so that write_out_ is always created (instead of always creating read_in_ like the code does now). After that you would need to update comments to match (there are few places) and update code in ExecDetachedProcess and at the end of Start where it closes descriptors. It might be good to quickly review all existing places where read_in_ occurs - because maybe I missed some other logic change. This will make us uniformly use write_out_ for child->parent communication.
Also note that presubmit checks are failing for the change because code is not formatted. Try running git cl presubmit and/or git cl format (this assumes you have installed depot_tools in your path - which I assume you did because otherwise you would not be able to build the SDK).
ps Oh I don't know the repo policies re PRs and commits. So far I've kept each change as a separate commit, but I can squash them if you'd prefer.
tldr is that it does not matter. It will all get squashed in the end.
We don't really land PRs - GitHub is just a mirror of our main repository which resides here and we use Gerrit as our main code review1 tool.
For convenience of external contributors who might not want to learn Gerrit workflow we provide PR-to-Gerrit sync via copybara service.
When you open a PR copybara creates a Gerrit Change for it. Sometimes these are referred to as Change Lists or CLs for historical reasons. Your PR corresponds to this CL. Each time you update the PR copybara pushes a new Patch Set to the CL. So Gerrit always keeps the immutable history of the evolution of the change - unlike Github where you just look at commits and you can just rewrite the whole thing by force-pushing.
When we arrive to the state in which the code is ready to land we submit the CL - rather than merge the PR - and CL becomes a single atomic commit in the repo.
Footnotes
-
In my opinion Gerrit - despite being somewhat fringe compared to ubiquity of Github - provides superior code review experience compared to GitHub PRs, especially for big changes. ↩
There was a problem hiding this comment.
Ok, I'll have to get to this in a few days, I've run out of time this week.
Apply patch from @mraleph: instead of killing and reaping the child ourselves (which races with ExitCodeHandler), always register the child in ProcessInfoList (even with fd=-1 when pipe() fails) and let the exit handler reap it naturally via EOF on the read pipe. I had to also make RemoveProcess use take_fd() to transfer fd ownership to the caller before deleting the ProcessInfo, preventing the destructor from closing an fd that ExitCodeHandler still needs to write to (caused EBADF crash without this).
|
https://dart-review.googlesource.com/c/sdk/+/484060 has been updated with the latest commits from this pull request. |
1 similar comment
|
https://dart-review.googlesource.com/c/sdk/+/484060 has been updated with the latest commits from this pull request. |
Fixing a rather braindead mistake in previous commit. The child needs read_in[1] after exec for dup2 to STDOUT_FILENO, so it cannot be closed early. Use write_out (stdin pipe) for the handshake instead — the child doesn't need write_out[1] so it can safely close it, enabling EOF detection if the parent errors out. In detached mode write_out_ doesn't exist but read_in[1] is not dup2'd, so closing it there remains safe. This time I have run the full test suite, as well as tested it with my own repro case, and it seems to work correctly. Fingers crossed
|
https://dart-review.googlesource.com/c/sdk/+/484060 has been updated with the latest commits from this pull request. |
1 similar comment
|
https://dart-review.googlesource.com/c/sdk/+/484060 has been updated with the latest commits from this pull request. |
|
Gerrit CL has build or test failures as well as new comments, please review them in Gerrit. You will need to address both the comments and the failures before requesting another review. If a reviewer requested changes, push new commits to this PR and it will be automatically copied to Gerrit. After that you can mark reviewer comments as resolved in Gerrit and request another round of reviews. Note: when you add comments in Gerrit they only become visible after you send them by clicking Reply and Send. |
Process.start can leak orphaned child processes that sleep forever. When
ProcessStarter::Start()encounters an error afterfork()but before signalling the child to proceed, the cleanup path (CleanupAndReturnError) closes pipes but never kills the forked child. The child blocks forever inread(), waiting for a "go" signal that will never arrive.This is exacerbated by a secondary issue: the child inherits the write end of its own signal pipe, so even when the parent closes its copy, the pipe doesn't break, the child's
read()never returns EOF.See #62781 for a full breakdown of everything.
Closes: #62781