
fix(extension-redis): avoid hard crashes of expected errors when acquiring a lock #983

Merged
janthurau merged 2 commits into ueberdosis:main from
0xb4lamx:fix/unhandled-err-on-store-failing-to-aquire-lock
Oct 6, 2025

Conversation

@0xb4lamx
Contributor

@0xb4lamx 0xb4lamx commented Sep 1, 2025

Description

Currently, when running the Hocuspocus server at scale with the Redis extension enabled (many HP servers connecting to one Redis server), concurrency issues can arise when trying to acquire the lock during onStoreDocument. On failure, the server hard-crashes with:

[onStoreDocument] The operation was unable to achieve a quorum during its retry window.
Error: Unhandled Rejection at promise Promise {
  <rejected> ExecutionError: The operation was unable to achieve a quorum during its retry window.
      at Redlock._execute (/usr/src/app/node_modules/.pnpm/@sesamecare-oss+redlock@1.4.0_ioredis@5.6.1/node_modules/@sesamecare-oss/redlock/build/index.js:231:15)
      at process.processTicksAndRejections (node:internal/process/task_queues:105:5) {
    attempts: [
     ....
    ]
  }
}

This seems to be a regression introduced in #958 when switching to @sesamecare-oss/redlock and the related refactor.

Reproduction Scenario

To imitate the lock issue, you can configure your server with:

const server = new Server({
    ...
    debounce: 1,
    maxDebounce: 2,
    extensions: [
      new Redis({
        ...,
        lockTimeout: 10000000000000000000,
      }),
      ...
    ],
});

Then try to generate a couple of saves; one way to do that from the server side is sketched below.
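A minimal sketch of triggering those saves, continuing from the config above and assuming the openDirectConnection API described in the Hocuspocus docs (the document name and the 'debug' map are made up, and the method's exact location may differ between versions):

// Make a few small edits through a server-side direct connection so the
// debounced onStoreDocument hook fires and contends for the (absurdly long) lock.
const connection = await server.openDirectConnection('example-document', {})

for (let i = 0; i < 3; i++) {
  await connection.transact(document => {
    document.getMap('debug').set('tick', i)
  })
}

await connection.disconnect()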

This PR includes

  • A silent handler for expected errors when trying to acquire a lock (sketched below).
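A minimal sketch of that idea, not the extension's actual code: the lock key, the 2-second TTL, and the storeDocument callback are placeholders, and the Redlock/ExecutionError named exports are assumed to follow the node-redlock-style API of @sesamecare-oss/redlock.

import { ExecutionError, Redlock } from '@sesamecare-oss/redlock'

// Try to take the per-document lock before persisting. If another server
// already holds it, skip quietly instead of letting the rejection bubble up
// and crash the process.
export async function storeWithLock(
  redlock: Redlock,
  documentName: string,
  storeDocument: () => Promise<void>,
): Promise<void> {
  try {
    const lock = await redlock.acquire([`hocuspocus:lock:${documentName}`], 2000)
    try {
      await storeDocument()
    } finally {
      await lock.release().catch(() => {})
    }
  } catch (error) {
    // Expected under contention: "unable to achieve a quorum during its retry window".
    if (error instanceof ExecutionError) return
    throw error
  }
}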

@0xb4lamx
Contributor Author

0xb4lamx commented Sep 2, 2025

cc @mattkrick @janthurau

@mejackreed

FWIW, we are seeing something similar here too and are looking at solutions

@mattkrick
Contributor

We've seen similar. As your PR points out, the hard crashes come from throwing something that has a message property, so just like you, we stopped throwing an error with a message: https://github.com/ParabolInc/parabol/pull/11937/files.

I also increased the retries to 30 and now maintain the lock for 2 seconds: https://github.com/ParabolInc/parabol/pull/11935/files.
I also started tracing the work done in the onStoreDocument hook to make sure we're under 1 second for server-established direct connections, which it looks like we are, so we could rule out hooks taking longer than expected.
[screenshot: trace of the onStoreDocument hook, showing durations under 1 second]
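For reference, the kind of tuning described above might look roughly like this (a sketch assuming the node-redlock-style options exposed by @sesamecare-oss/redlock; the values and key usage are illustrative, not taken from the linked PRs):

import Redis from 'ioredis'
import { Redlock } from '@sesamecare-oss/redlock'

// Retry up to 30 times with jittered delays instead of failing fast.
const redlock = new Redlock([new Redis()], {
  retryCount: 30,
  retryDelay: 100, // ms between attempts
  retryJitter: 50, // randomize delays to avoid thundering herds
})

// Hold each lock for 2 seconds, e.g. redlock.acquire([key], lockDurationMs).
const lockDurationMs = 2_000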

Those changes reduced the number of errors by about 70%, so we know it's doing something, but we've still hit this error about 5 times in the last 2 days. What's even stranger is that it was hit during off-peak hours (midnight in the US), so we can't attribute it to high load.

I think the next area to explore is replacing redlock with something else.

@mejackreed

redis-semaphore looks more interesting to me

@0xb4lamx
Contributor Author

0xb4lamx commented Sep 5, 2025

In our case, we don’t use direct connections. As I understand it, the issue mainly occurs when our load balancer’s sharding key strategy fails, causing the same doc to be loaded on multiple servers (all of which connect to the same Redis server), leading to lock contention during onStoreDocument.
The metrics below show how stable things are now after releasing a patched version that includes the changes from this PR 👇
[chart: save/lock error metrics after deploying the patched version]

@ChasLui
Contributor

ChasLui commented Sep 9, 2025

Friend, I have submitted a new extension-smartroute to address the Redlock issue, which has significantly improved performance and availability.

@mattkrick
Contributor

@0xb4lamx that's really helpful to know you don't use direct connections! I could absolutely see this PR reducing contention. Do you have any data on how many of your custom errors got thrown? My concern is if you set the retryCount to 0, then yes that reduces contention, but it also means the hook didn't run if a lock failed.

@0xb4lamx
Contributor Author

@0xb4lamx that's really helpful to know you don't use direct connections! I could absolutely see this PR reducing contention. Do you have any data on how many of your custom errors got thrown?

Sure, the attached chart 👇 shows the error rate over the last 7 days:
[chart: lock error rate over the last 7 days]
It shows spikes, which occur when new servers join the consistent-hashing pool. The rebalancing causes some Y.Docs to be loaded on multiple servers at the same time, leading to lock contention.
[chart: connection counts per server over the same period]

Comparing it with the chart above ☝️ confirms these events: each sharp drop represents connections being redistributed to a newly added server.
(Such LB issues are something we're working on improving...)

My concern is if you set the retryCount to 0, then yes that reduces contention, but it also means the hook didn't run if a lock failed.

The first server that acquires the lock is guaranteed to perform the save, so in our case, additional retries from other servers would just create unnecessary contention. Thus, setting retryCount = 0 ensures we avoid redundant attempts while still guaranteeing the document is saved once. wdyt? @mattkrick 🤔

@mattkrick
Contributor

Ah, that's interesting! We should probably expose the redlock config (or whatever locking tool we end up using).

How do you guarantee a save of the most recent state? Here's the scenario I imagine:

  • User1 (U1) connects to Server1 (S1) to modify Document A (A).
  • User2 (U2) connects to Server2 (S2) to modify Document A (A).
  • At time 0, U1 writes "hi", S1 locks A while persisting that write to the DB.
  • At time 1, U2 writes "mom", S2 attempts to lock A, fails.
  • At time 2, the persisted state is "hi". The in-memory state is "hi mom" and propagates to both users.
  • At time 3, both users disconnect without further edits to the document.

How can you guarantee that "hi mom" is the eventual persisted state? True, in the real world all you need is one more client mutation to trigger onStoreDocument, but if that mutation doesn't come, won't the in-memory state be ahead of the persisted state?

@0xb4lamx
Contributor Author

  • At time 3, both users disconnect without further edits to the document
    How can you guarantee that "hi mom" is the eventual persisted state?

I think there will be one final auto-triggered save on the last WS connection close event, which should guarantee that 🤔

That said, edge cases could still exist — for example, if U1 and U2 somehow end up with unsynced Y.Doc states and their WS connections both drop at the exact same moment, one will fail to lock... 🤷

@mattkrick
Contributor

mattkrick commented Sep 15, 2025

Yeah, definitely edge cases. For example, with the auto-triggered save: if S2 unloads the doc and triggers onStoreDocument but fails to acquire the lock, and then U2 reconnects to it, it loads the document from persisted storage and U2 gets a stale read.

I suppose it can be summed up that with retryCount: 0 you're targeting an "at most once" strategy, whereas I'm going for "at least once". Neither is right or wrong, and in a perfect world we both want "exactly once", but that seems pretty difficult to achieve with redlock. The past week we saw 10 failed locks out of 34.3k; I'll try swapping out redlock for something else and report back in a week or so.

UPDATE: After more thought, I think the right architecture is to only have the document open on a single server, like Figma. I'll still use Redis to ensure this and publish messages across servers, but solving the affinity problem seems cleaner than consistency + conflicts.
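For illustration, one way such single-server affinity could be enforced (a sketch, not part of this PR; the key name, TTL, and serverId value are made up) is a SET NX ownership claim in Redis:

import Redis from 'ioredis'

const redis = new Redis()
const serverId = 'server-1' // hypothetical identifier for this instance

// Claim ownership of a document before loading it; refresh the TTL while it
// stays open. If the claim fails, another server owns the document and the
// connection should be routed (or proxied) there instead.
async function claimDocument(documentName: string): Promise<boolean> {
  const result = await redis.set(`doc-owner:${documentName}`, serverId, 'EX', 60, 'NX')
  return result === 'OK'
}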

@0xb4lamx
Contributor Author

0xb4lamx commented Oct 1, 2025

any thoughts on this @janthurau ?

@janthurau
Contributor

Yes, in general the way to go is to make sure that users working on the same document are connected to the same server (as @mattkrick mentioned). We haven't seen this issue in our environments for this reason.

The advantage of a crash is that this usually triggers external monitoring and may fix temporary connection issues, though obviously it may lead to data loss. On the other hand, we can't just ignore an error in onStoreDocument and proceed with unloading, as this could still cause data loss.

I was actually already thinking about removing the locking logic altogether. It basically only ensures that no database update is sent concurrently, but the DB handles that anyway. What would be needed is some way to figure out whether the DB-stored update is newer than the update that is about to be stored, so that two concurrent writes don't overwrite each other but the second one just merges onto the first.
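For illustration, the merge itself could lean on Yjs (a sketch, not part of this PR; loadUpdate and saveUpdate stand in for whatever database layer onStoreDocument talks to, and a DB transaction or compare-and-set would still be needed to make the read-merge-write atomic):

import * as Y from 'yjs'

// Merge the incoming update with whatever is already persisted so that two
// concurrent writers converge instead of overwriting each other.
async function storeMerged(
  documentName: string,
  incoming: Uint8Array,
  loadUpdate: (name: string) => Promise<Uint8Array | null>,
  saveUpdate: (name: string, update: Uint8Array) => Promise<void>,
): Promise<void> {
  const stored = await loadUpdate(documentName)
  const merged = stored ? Y.mergeUpdates([stored, incoming]) : incoming
  await saveUpdate(documentName, merged)
}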

@0xb4lamx
Contributor Author

0xb4lamx commented Oct 3, 2025

I see the point about crashes triggering monitoring, but IMHO a server should never hard-crash for this kind of issue. A crash would take down all active WS connections, not just the document in question. For monitoring, proper error handling/logs and alerts should provide the necessary visibility instead.

Regarding the locking logic: it may make sense to rethink or even remove it in the long run. But this PR should at least gracefully handle this issue as it was handled before merging #958. Once that baseline is preserved, we can then discuss and decide on broader changes.

wdyt? @janthurau

@0xb4lamx 0xb4lamx requested a review from janthurau October 6, 2025 10:09
@janthurau janthurau force-pushed the fix/unhandled-err-on-store-failing-to-aquire-lock branch from a22be19 to 7b68fdc on October 6, 2025 10:48
@janthurau janthurau merged commit 38e58a4 into ueberdosis:main Oct 6, 2025
12 of 18 checks passed