
fix(extension-redis): avoid hard crashes of expected errors when acquiring a lock #983

Merged
janthurau merged 2 commits into ueberdosis:main from
0xb4lamx:fix/unhandled-err-on-store-failing-to-aquire-lock
Oct 6, 2025

Conversation

@0xb4lamx
Contributor

@0xb4lamx 0xb4lamx commented Sep 1, 2025

Description

Currently, when running the Hocuspocus server at scale with the Redis extension enabled (many HP servers connecting to one Redis server), concurrency issues can arise when trying to acquire the lock during onStoreDocument. On failure, the server hard-crashes with:

[onStoreDocument] The operation was unable to achieve a quorum during its retry window.
Error: Unhandled Rejection at promise Promise {
  <rejected> ExecutionError: The operation was unable to achieve a quorum during its retry window.
      at Redlock._execute (/usr/src/app/node_modules/.pnpm/@sesamecare-oss+redlock@1.4.0_ioredis@5.6.1/node_modules/@sesamecare-oss/redlock/build/index.js:231:15)
      at process.processTicksAndRejections (node:internal/process/task_queues:105:5) {
    attempts: [
     ....
    ]
  }
}

This seems to be a regression introduced in #958 when switching to @sesamecare-oss/redlock and the related refactor.

Reproduction Scenario

To imitate the lock issue, you can configure your server with:

const server = new Server({
    ...
    debounce: 1,
    maxDebounce: 2,
    extensions: [
      new Redis({
        ...,
        lockTimeout: 10000000000000000000,
      }),
      ...
    ],
});

Then try to generate a couple of saves; one way to do that from the server side is sketched below.
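A minimal sketch of triggering those saves, continuing from the config above and assuming the openDirectConnection API described in the Hocuspocus docs (the document name and the 'debug' map are made up, and the method's exact location may differ between versions):

// Make a few small edits through a server-side direct connection so the
// debounced onStoreDocument hook fires and contends for the (absurdly long) lock.
const connection = await server.openDirectConnection('example-document', {})

for (let i = 0; i < 3; i++) {
  await connection.transact(document => {
    document.getMap('debug').set('tick', i)
  })
}

await connection.disconnect()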

This PR includes

  • A silent handler for expected errors when trying to acquire a lock (sketched below).
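A minimal sketch of that idea, not the extension's actual code: the lock key, the 2-second TTL, and the storeDocument callback are placeholders, and the Redlock/ExecutionError named exports are assumed to follow the node-redlock-style API of @sesamecare-oss/redlock.

import { ExecutionError, Redlock } from '@sesamecare-oss/redlock'

// Try to take the per-document lock before persisting. If another server
// already holds it, skip quietly instead of letting the rejection bubble up
// and crash the process.
export async function storeWithLock(
  redlock: Redlock,
  documentName: string,
  storeDocument: () => Promise<void>,
): Promise<void> {
  try {
    const lock = await redlock.acquire([`hocuspocus:lock:${documentName}`], 2000)
    try {
      await storeDocument()
    } finally {
      await lock.release().catch(() => {})
    }
  } catch (error) {
    // Expected under contention: "unable to achieve a quorum during its retry window".
    if (error instanceof ExecutionError) return
    throw error
  }
}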

@0xb4lamx
Contributor Author

0xb4lamx commented Sep 2, 2025

cc @mattkrick @janthurau

@mejackreed

FWIW, we are seeing something similar here too and are looking at solutions

@mattkrick
Contributor

We've seen similar. As your PR points out, the hard crashes come from throwing something that has a message property, so just like you, we stopped throwing an error with a message: https://github.com/ParabolInc/parabol/pull/11937/files.

I also increased the retries to 30 and now maintain the lock for 2 seconds: https://github.com/ParabolInc/parabol/pull/11935/files.
I also started tracing the work done in the onStoreDocument hook to make sure we're under 1 second for server-established direct connections, which it looks like we are, so we could rule out hooks taking longer than expected.
[screenshot: trace of the onStoreDocument hook, showing durations under 1 second]
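For reference, the kind of tuning described above might look roughly like this (a sketch assuming the node-redlock-style options exposed by @sesamecare-oss/redlock; the values and key usage are illustrative, not taken from the linked PRs):

import Redis from 'ioredis'
import { Redlock } from '@sesamecare-oss/redlock'

// Retry up to 30 times with jittered delays instead of failing fast.
const redlock = new Redlock([new Redis()], {
  retryCount: 30,
  retryDelay: 100, // ms between attempts
  retryJitter: 50, // randomize delays to avoid thundering herds
})

// Hold each lock for 2 seconds, e.g. redlock.acquire([key], lockDurationMs).
const lockDurationMs = 2_000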

Those changes reduced the number of errors by about 70%, so we know it's doing something, but we've still hit this error about 5 times in the last 2 days. What's even stranger is that it was hit during off-peak hours (midnight in the US), so we can't attribute it to high load.

I think the next area to explore is replacing redlock with something else.

@mejackreed

redis-semaphore looks more interesting to me

@0xb4lamx
Contributor Author

0xb4lamx commented Sep 5, 2025

In our case, we don’t use direct connections. As I understand it, the issue mainly occurs when our load balancer’s sharding key strategy fails, causing the same doc to be loaded on multiple servers (all of which connect to the same Redis server), leading to lock contention during onStoreDocument.
The metrics below show how stable things are now after releasing a patched version that includes the changes from this PR 👇
[chart: save/lock error metrics after deploying the patched version]

@ChasLui
Contributor

ChasLui commented Sep 9, 2025

Friend, I have submitted a new extension-smartroute to address the Redlock issue, which has significantly improved performance and availability.

@mattkrick
Contributor

@0xb4lamx that's really helpful to know you don't use direct connections! I could absolutely see this PR reducing contention. Do you have any data on how many of your custom errors got thrown? My concern is if you set the retryCount to 0, then yes that reduces contention, but it also means the hook didn't run if a lock failed.

@0xb4lamx
Contributor Author

@0xb4lamx that's really helpful to know you don't use direct connections! I could absolutely see this PR reducing contention. Do you have any data on how many of your custom errors got thrown?

Sure, the attached chart 👇 shows the error rate over the last 7 days:
[chart: lock error rate over the last 7 days]
It shows spikes, which occur when new servers join the consistent-hashing pool. The rebalancing causes some Y.Docs to be loaded on multiple servers at the same time, leading to lock contention.
[chart: connection counts per server over the same period]

Comparing it with the chart above ☝️ confirms these events: each sharp drop represents connections being redistributed to a newly added server.
(Such LB issues are something we're working on improving...)

My concern is if you set the retryCount to 0, then yes that reduces contention, but it also means the hook didn't run if a lock failed.

The first server that acquires the lock is guaranteed to perform the save, so in our case, additional retries from other servers would just create unnecessary contention. Thus, setting retryCount = 0 ensures we avoid redundant attempts while still guaranteeing the document is saved once. wdyt? @mattkrick 🤔

@mattkrick
Contributor

Ah, that's interesting! We should probably expose the redlock config (or whatever locking tool we end up using).

How do you guarantee a save of the most recent state? Here's the scenario I imagine:

  • User1 (U1) connects to Server1 (S1) to modify Document A (A).
  • User2 (U2) connects to Server2 (S2) to modify Document A (A).
  • At time 0, U1 writes "hi", S1 locks A while persisting that write to the DB.
  • At time 1, U2 writes "mom", S2 attempts to lock A, fails.
  • At time 2, the persisted state is "hi". The in-memory state is "hi mom" and propagates to both users.
  • At time 3, both users disconnect without further edits to the document.

How can you guarantee that "hi mom" is the eventual persisted state? True, in the real world all you need is one more client mutation to trigger onStoreDocument, but if that mutation doesn't come, won't the in-memory state be ahead of the persisted state?

@0xb4lamx
Contributor Author

  • At time 3, both users disconnect without further edits to the document
    How can you guarantee that "hi mom" is the eventual persisted state?

I think there will be one final auto-triggered save on the last WS connection close event, which should guarantee that 🤔

That said, edge cases could still exist — for example, if U1 and U2 somehow end up with unsynced Y.Doc states and their WS connections both drop at the exact same moment, one will fail to lock... 🤷

@mattkrick
Contributor

mattkrick commented Sep 15, 2025

Yeah, definitely edge cases. For example, with the auto-triggered save: if S2 unloads the doc and triggers onStoreDocument but fails to acquire the lock, and then U2 reconnects to it, it loads the document from persisted storage and U2 gets a stale read.

I suppose it can be summed up that with retryCount: 0 you're targeting an "at most once" strategy, whereas I'm going for "at least once". Neither is right or wrong, and in a perfect world we both want "exactly once", but that seems pretty difficult to achieve with redlock. The past week we saw 10 failed locks out of 34.3k; I'll try swapping out redlock for something else and report back in a week or so.

UPDATE: After more thought, I think the right architecture is to only have the document open on a single server, like Figma. I'll still use Redis to ensure this and publish messages across servers, but solving the affinity problem seems cleaner than consistency + conflicts.
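For illustration, one way such single-server affinity could be enforced (a sketch, not part of this PR; the key name, TTL, and serverId value are made up) is a SET NX ownership claim in Redis:

import Redis from 'ioredis'

const redis = new Redis()
const serverId = 'server-1' // hypothetical identifier for this instance

// Claim ownership of a document before loading it; refresh the TTL while it
// stays open. If the claim fails, another server owns the document and the
// connection should be routed (or proxied) there instead.
async function claimDocument(documentName: string): Promise<boolean> {
  const result = await redis.set(`doc-owner:${documentName}`, serverId, 'EX', 60, 'NX')
  return result === 'OK'
}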

@0xb4lamx
Contributor Author

0xb4lamx commented Oct 1, 2025

any thoughts on this @janthurau ?

@janthurau
Contributor

Yes, in general the way to go is to make sure that users working on the same document are connected to the same server (as @mattkrick mentioned). We haven't seen this issue in our environments for this reason.

The advantage of a crash is that this usually triggers external monitoring and may fix temporary connection issues, though obviously it may lead to data loss. On the other hand, we can't just ignore an error in onStoreDocument and proceed with unloading, as this could still cause data loss.

I was actually already thinking about removing the locking logic altogether. It basically only ensures that no database update is sent concurrently, but the DB handles that anyway. What would be needed is some way to figure out whether the DB-stored update is newer than the update that is about to be stored, so that two concurrent writes don't overwrite each other but the second one just merges onto the first.
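For illustration, the merge itself could lean on Yjs (a sketch, not part of this PR; loadUpdate and saveUpdate stand in for whatever database layer onStoreDocument talks to, and a DB transaction or compare-and-set would still be needed to make the read-merge-write atomic):

import * as Y from 'yjs'

// Merge the incoming update with whatever is already persisted so that two
// concurrent writers converge instead of overwriting each other.
async function storeMerged(
  documentName: string,
  incoming: Uint8Array,
  loadUpdate: (name: string) => Promise<Uint8Array | null>,
  saveUpdate: (name: string, update: Uint8Array) => Promise<void>,
): Promise<void> {
  const stored = await loadUpdate(documentName)
  const merged = stored ? Y.mergeUpdates([stored, incoming]) : incoming
  await saveUpdate(documentName, merged)
}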

@0xb4lamx
Contributor Author

0xb4lamx commented Oct 3, 2025

I see the point about crashes triggering monitoring, but IMHO a server should never hard-crash for this kind of issue. A crash would take down all active WS connections, not just the document in question. For monitoring, proper error handling/logs and alerts should provide the necessary visibility instead.

Regarding the locking logic: it may make sense to rethink or even remove it in the long run. But this PR should at least gracefully handle this issue as it was handled before merging #958. Once that baseline is preserved, we can then discuss and decide on broader changes.

wdyt? @janthurau

@0xb4lamx 0xb4lamx requested a review from janthurau October 6, 2025 10:09
@janthurau janthurau force-pushed the fix/unhandled-err-on-store-failing-to-aquire-lock branch from a22be19 to 7b68fdc on October 6, 2025 10:48
@janthurau janthurau merged commit 38e58a4 into ueberdosis:main Oct 6, 2025
12 of 18 checks passed