Skip to content

Fix leaking orphan processes on macOS/Linux#62782

Open
zafnz wants to merge 3 commits intodart-lang:mainfrom
zafnz:fix-leaking-orphans
Open

Fix leaking orphan processes on macOS/Linux#62782
zafnz wants to merge 3 commits intodart-lang:mainfrom
zafnz:fix-leaking-orphans

Conversation

@zafnz
Copy link

@zafnz zafnz commented Feb 27, 2026

Process.start can leak orphaned child processes that sleep forever. When ProcessStarter::Start() encounters an error after fork() but before signalling the child to proceed, the cleanup path (CleanupAndReturnError) closes pipes but never kills the forked child. The child blocks forever in read(), waiting for a "go" signal that will never arrive.

This is exacerbated by a secondary issue: the child inherits the write end of its own signal pipe, so even when the parent closes its copy, the pipe doesn't break, the child's read() never returns EOF.

See #62781 for a full breakdown of everything.

Closes: #62781

  • I’ve reviewed the contributor guide and applied the relevant portions to this PR.

Process.start can leak orphaned child processes that sleep forever.
When ProcessStarter::Start() encounters an error **after** fork() but
**before** signaling the child to proceed, the cleanup path
(CleanupAndReturnError) closes pipes but never kills the forked child.
The child blocks forever in read(), waiting for a "go" signal that will
never arrive.

This is exacerbated by a secondary issue: the child inherits the write
end of its own signal pipe, so even when the parent closes its copy,
the pipe doesn't break, the child's read() never returns EOF.

Closes: dart-lang#62781 dart-lang#62781
@copybara-service
Copy link

Thank you for your contribution! This project uses Gerrit for code reviews. Your pull request has automatically been converted into a code review at:

https://dart-review.googlesource.com/c/sdk/+/484060

Please wait for a developer to review your code review at the above link; you can speed up the review if you sign into Gerrit and manually add a reviewer that has recently worked on the relevant code. See CONTRIBUTING.md to learn how to upload changes to Gerrit directly.

Additional commits pushed to this PR will update both the PR and the corresponding Gerrit CL. After the review is complete on the CL, your reviewer will merge the CL (automatically closing this PR).

if (err != 0) {
// The child is blocked in read() waiting for the go signal that
// will never arrive. Kill it and reap immediately.
// ProcessStarted() has not been called yet so the exit-code
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment and reordering of ProcessStarted and RegisterProcess is questionable ExitCodeHandler might already be running (e.g. if there was another process started before this one) - so the exit handler can be involved. Is it necessary to reorder, instead of relying on ExitCodeHandler?

Another questionable thing is: here it says that the child is blocked on read(), but the comment below in NewProcess says that child will be unblocked with read() returning EOF because pipe's write-fd is closed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we switch to using pidfd (#60854), the child won't need to wait at all. There's no race with adding the pidfd to the epoll set after it dies like there is with another thread calling wait.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that it's not great, and yeah this fix has a race condition.

There is a simplier approach i could do, which is keep the original order, kill the child, and just undo the ProcessStarted() call. Only problem there is there's no ProcessTerminated() method.

Could add one though

static void ProcessTerminated() {
      MonitorLocker locker(monitor_);
      process_count_--;
  }

And then change the fix to:

  ExitCodeHandler::ProcessStarted();
  err = RegisterProcess(pid);
  if (err != 0) {
      kill(pid, SIGKILL);
      TEMP_FAILURE_RETRY(waitpid(pid, nullptr, 0));
      ExitCodeHandler::ProcessTerminated();
      return err;
  }

How does that sound?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variant you are proposing has another race: we might end up reaping pid before ExitCodeHandler reaches its wait (but after it checks process_count_ == 0) - in which case wait will fail with "no children" error (which is treated as fatal).

I suggest we do something like this instead. Lets leave reaping to the exit process thread, so that we don't need to think about races.

Note that we don't actually need kill(...) because ReadFromBlocking will fail naturally.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yeah, i've put that in, that works and so far I'm unable to crash or lockup any children in my testing.

Copy link
Member

@mraleph mraleph Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welp. Trying to run tests on the change actually reveals why read_in_[1] was not closed. Because down in ExecProcess we do:

      if (TEMP_FAILURE_RETRY(dup2(read_in_[1], STDOUT_FILENO)) == -1) {
        ReportChildError();
      }
      close(read_in_[0]);
      close(read_in_[1]);

This obviously does not work if we close read_in_[1].

I think it's a bit mind-boggling that this code uses stdout pipe in both directions. After fork child is reading from that pipe and parent is writing to it, but after exec child will start writing into it and parent is reading. mind blown gif

I suggest we make another change and replace read_in_ with write_out_ in the child-parent hand-shake process, because child process does not need write_out_[1] so (unlike read_in_[1]) it can be closed without issues.

Copy link
Author

@zafnz zafnz Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, lol 🤦 . I just spent 20 minutes trying to figure out how I managed to test this without it failing, only to find that I was testing against an earlier patch, not the latest -- very broken -- "fix". I also didn't run the full test suite for the codebase. Mea culpa. Sigh. Sometimes getting out of bed is the only the first mistake you make of the day.

New patch incoming shortly 😂

ps Oh I don't know the repo policies re PRs and commits. So far I've kept each change as a separate commit, but I can squash them if you'd prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a simplification. It will require you to change more places, but result will be simpler logic overall. I think you can update CreatePipes method and flip creation of read_in_ and write_out_ so that write_out_ is always created (instead of always creating read_in_ like the code does now). After that you would need to update comments to match (there are few places) and update code in ExecDetachedProcess and at the end of Start where it closes descriptors. It might be good to quickly review all existing places where read_in_ occurs - because maybe I missed some other logic change. This will make us uniformly use write_out_ for child->parent communication.

Also note that presubmit checks are failing for the change because code is not formatted. Try running git cl presubmit and/or git cl format (this assumes you have installed depot_tools in your path - which I assume you did because otherwise you would not be able to build the SDK).

ps Oh I don't know the repo policies re PRs and commits. So far I've kept each change as a separate commit, but I can squash them if you'd prefer.

tldr is that it does not matter. It will all get squashed in the end.

We don't really land PRs - GitHub is just a mirror of our main repository which resides here and we use Gerrit as our main code review1 tool.

For convenience of external contributors who might not want to learn Gerrit workflow we provide PR-to-Gerrit sync via copybara service.

When you open a PR copybara creates a Gerrit Change for it. Sometimes these are referred to as Change Lists or CLs for historical reasons. Your PR corresponds to this CL. Each time you update the PR copybara pushes a new Patch Set to the CL. So Gerrit always keeps the immutable history of the evolution of the change - unlike Github where you just look at commits and you can just rewrite the whole thing by force-pushing.

When we arrive to the state in which the code is ready to land we submit the CL - rather than merge the PR - and CL becomes a single atomic commit in the repo.

Footnotes

  1. In my opinion Gerrit - despite being somewhat fringe compared to ubiquity of Github - provides superior code review experience compared to GitHub PRs, especially for big changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkustermann had a simpler suggestion here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll have to get to this in a few days, I've run out of time this week.

Apply patch from @mraleph: instead of killing and reaping the child
ourselves (which races with ExitCodeHandler), always register the child
in ProcessInfoList (even with fd=-1 when pipe() fails) and let the
exit handler reap it naturally via EOF on the read pipe.

I had to also make RemoveProcess use take_fd() to transfer fd
ownership to the caller before deleting the ProcessInfo,
preventing the destructor from closing an fd that ExitCodeHandler
still needs to write to (caused EBADF crash without this).
@copybara-service
Copy link

https://dart-review.googlesource.com/c/sdk/+/484060 has been updated with the latest commits from this pull request.

1 similar comment
@copybara-service
Copy link

https://dart-review.googlesource.com/c/sdk/+/484060 has been updated with the latest commits from this pull request.

Fixing a rather braindead mistake in previous commit.

The child needs read_in[1] after exec for dup2 to STDOUT_FILENO, so
it cannot be closed early. Use write_out (stdin pipe) for the
handshake instead — the child doesn't need write_out[1] so it can
safely close it, enabling EOF detection if the parent errors out.

In detached mode write_out_ doesn't exist but read_in[1] is not
dup2'd, so closing it there remains safe.

This time I have run the full test suite, as well as tested it with
my own repro case, and it seems to work correctly. Fingers crossed
@copybara-service
Copy link

https://dart-review.googlesource.com/c/sdk/+/484060 has been updated with the latest commits from this pull request.

1 similar comment
@copybara-service
Copy link

https://dart-review.googlesource.com/c/sdk/+/484060 has been updated with the latest commits from this pull request.

@copybara-service
Copy link

Gerrit CL has build or test failures as well as new comments, please review them in Gerrit.

You will need to address both the comments and the failures before requesting another review.

If a reviewer requested changes, push new commits to this PR and it will be automatically copied to Gerrit. After that you can mark reviewer comments as resolved in Gerrit and request another round of reviews.

Note: when you add comments in Gerrit they only become visible after you send them by clicking Reply and Send.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Process.start leaks orphaned child processes when parent fails after fork() on macOS+Linux

3 participants