
Don't crash the server when there's a listen error#587

Merged
jbr merged 2 commits into http-rs:master from jbr:dont-crash-on-listen-errors
Jun 16, 2020

Conversation

@jbr jbr (Member) commented Jun 11, 2020

This change is an attempt to address #577, and also seems like correct behavior for a server: regardless of what goes wrong with any individual request, we should attempt to continue responding to subsequent requests. Previously, the use of `?` returned from the `listen` function's scope, which shut down the server.
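The difference can be sketched with a toy loop (hypothetical helper names, not tide's actual code): using `?` on each accept result propagates the first error out of the loop and the server stops, while matching and continuing keeps serving later connections.

```rust
use std::io;

// Toy stand-in for an accept loop: each item is one incoming-connection result.
// Mirrors the old behavior: `?` returns from this function on the first error,
// so the "server" shuts down.
fn serve_with_question_mark(incoming: Vec<io::Result<&str>>) -> io::Result<usize> {
    let mut handled = 0;
    for stream in incoming {
        let _stream = stream?; // first Err aborts the whole loop
        handled += 1;
    }
    Ok(handled)
}

// Mirrors the fix: a failed accept is skipped and every subsequent
// "connection" is still handled.
fn serve_with_continue(incoming: Vec<io::Result<&str>>) -> usize {
    let mut handled = 0;
    for stream in incoming {
        let _stream = match stream {
            Ok(s) => s,
            Err(_) => continue, // log and keep accepting
        };
        handled += 1;
    }
    handled
}

fn main() {
    let incoming = || {
        vec![
            Ok("conn-1"),
            Err(io::Error::new(io::ErrorKind::Other, "accept failed")),
            Ok("conn-2"),
        ]
    };
    assert!(serve_with_question_mark(incoming()).is_err()); // stopped early
    assert_eq!(serve_with_continue(incoming()), 2); // kept going
}
```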

Extracting the inner functions was mostly for clarity; they will be further refactored/extracted in future PRs.

closes #577

@jbr jbr requested a review from yoshuawuyts June 11, 2020 23:43
@yoshuawuyts yoshuawuyts (Member) left a comment


LGTM

@yoshuawuyts (Member)
I think we should merge this, though the more structural fix seems to be #386 by @tailhook, which allows setting more fine-grained limits. Happy to merge this in the interim!

@tailhook

As far as I can see, this will peg a single CPU at 100% usage if ENFILE or EMFILE is encountered (depending on your use case, that may be worse than simply crashing, because after a crash a supervisor would restart the process). I've tried to explain this in the book: https://book.async.rs/patterns/accept-loop.html#handling-errors. Let me know if the book is not clear enough on this issue.

Sorry for not having had enough time to finish #386.
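The distinction at stake can be sketched as follows (a minimal sketch, not tide's actual implementation; the errno values are the Linux ones): transient errors are tied to one connection and are safe to skip, while EMFILE/ENFILE signal fd exhaustion and call for backing off before the next accept.

```rust
use std::io;

// Linux errno values (assumption: Linux; other platforms may differ).
const ENFILE: i32 = 23; // system-wide file table full
const EMFILE: i32 = 24; // per-process fd limit reached

/// Errors scoped to one connection: safe to skip and accept the next one.
fn is_transient_error(e: &io::Error) -> bool {
    use io::ErrorKind::*;
    matches!(e.kind(), ConnectionRefused | ConnectionAborted | ConnectionReset)
}

/// Resource exhaustion: retrying immediately would spin the CPU at 100%,
/// so the accept loop should sleep before trying again.
fn is_fd_exhaustion(e: &io::Error) -> bool {
    matches!(e.raw_os_error(), Some(ENFILE) | Some(EMFILE))
}

fn main() {
    assert!(is_fd_exhaustion(&io::Error::from_raw_os_error(EMFILE)));
    assert!(!is_transient_error(&io::Error::from_raw_os_error(EMFILE)));
    assert!(is_transient_error(&io::Error::from(io::ErrorKind::ConnectionReset)));
}
```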

```rust
let stream = match stream {
    Err(ref e) if is_transient_error(e) => continue,
    Err(error) => {
        let delay = std::time::Duration::from_millis(500);
```
Member
This seems kind of large? I guess it should be pretty unusual so maybe it's fine. Ideally we'd have a dynamically increasing backoff here.

Member Author

I didn't have a good sense of an appropriate number here, so I used the same number as in the book section cited above.

@tailhook

Well, it's 500ms in the book because it's reasonably fine to write a log message every 500ms. Always skipping the log message would make the issue a nightmare to debug, and sleeping less while logging only rarely would substantially complicate the code (which is fine for production, but bad for a book).

But at the end of the day, you should not expect code to hit this timeout often. I always consider this an infrastructure issue: either bump the file descriptor limit if you have enough memory, lower the listen queue, or put a load balancer/queue in front of the process to get rid of the error. So a large timeout here is fine. Also, TCP delays can be larger than 500ms, so any client should be fine waiting that long.
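The "dynamically increasing backoff" suggested above could look like doubling the delay after each consecutive failure up to a cap, resetting to the base once an accept succeeds. A sketch with illustrative bounds (the values are made up, not from this PR):

```rust
use std::time::Duration;

// Illustrative bounds, not values from this PR.
const BASE: Duration = Duration::from_millis(50);
const MAX: Duration = Duration::from_millis(2000);

/// Doubles the retry delay after each consecutive accept failure, capped at
/// `MAX`. The caller would reset the delay to `BASE` after a successful accept.
fn next_delay(current: Duration) -> Duration {
    (current * 2).min(MAX)
}

fn main() {
    let mut d = BASE;
    let mut seen = Vec::new();
    for _ in 0..7 {
        seen.push(d.as_millis());
        d = next_delay(d);
    }
    // Delay grows geometrically, then saturates at the cap.
    assert_eq!(seen, [50, 100, 200, 400, 800, 1600, 2000]);
}
```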

@jbr jbr (Member Author) commented Jun 16, 2020

Merging this based on @yoshuawuyts' comment above, especially since this is a short-term fix for multiple reasons:

  • When/if async-listen is Clone (Derive Clone for ByteStream and Token tailhook/async-listen#3), tide will be able to use that instead of this code
  • When tide supports multiple simultaneous listeners (a near-term goal), we'll need a totally different approach that pools open fds across multiple listeners of different types, and that will require rewriting this yet again

@jbr jbr merged commit d2f56d3 into http-rs:master Jun 16, 2020


Development

Successfully merging this pull request may close this issue: "Too many open files" error when benchmarking.

4 participants