fix(extension-redis): avoid hard crashes on expected errors when acquiring a lock #983
Conversation
FWIW, we are seeing something similar here too and are looking at solutions.
We've seen similar. As your PR points out, the hard crashes come from the error thrown when the lock can't be acquired. I also increased the retries to 30 and maintain the lock for 2 seconds: https://github.com/ParabolInc/parabol/pull/11935/files. Those changes reduced the number of errors by about 70%, so we know it's doing something, but we've still hit this error about 5 times in the last 2 days. What's even stranger is that it was hit during off-peak hours (midnight in the US), so we can't attribute it to high load. I think the next area to explore is replacing redlock with something else.
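The linked Parabol change isn't reproduced here, but in redlock terms, "increase the retries to 30" and "maintain the lock for 2 seconds" correspond roughly to settings like the sketch below. This is illustrative only; the import style and client setup are assumptions, not the linked diff.

```ts
import { Redlock } from '@sesamecare-oss/redlock'
import { Redis } from 'ioredis'

// "increase the retries to 30": keep retrying a contended lock instead of
// giving up after the default handful of attempts.
const redlock = new Redlock([new Redis()], {
  retryCount: 30,
  retryDelay: 100, // ms between attempts
})

// "maintain the lock for 2 seconds": ask for a 2000 ms lock duration.
export const acquireDocumentLock = (documentName: string) =>
  redlock.acquire([`document:${documentName}`], 2000)
```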
Friend, I have submitted a new extension-smartroute to address the Redlock issue, which has significantly improved performance and availability.
@0xb4lamx that's really helpful to know you don't use direct connections! I could absolutely see this PR reducing contention. Do you have any data on how many of your custom errors got thrown? My concern is that if you set retryCount to 0, then yes, that reduces contention, but it also means the hook didn't run if a lock failed.
Sure, the attached chart 👇 shows the error rate over the last 7 days; comparing it with the chart above ☝️ confirms these events: each sharp drop represents connections being redistributed to a newly added server.
The first server that acquires the lock is guaranteed to perform the save, so in our case additional retries from other servers would just create unnecessary contention. Thus, setting retryCount to 0 is enough for our setup.
ah, that's interesting! we should probably expose the redlock config (or whatever locking tool we end up using). How do you guarantee a save of the most recent state? Here's the scenario I imagine: U1 on S1 types "hi" and S1 starts its save, taking the lock; U2 on S2 then types "mom", but S2's save fails because it can't acquire the lock and, with retryCount at 0, never retries.
How can you guarantee that "hi mom" is the eventual persisted state? True, in the real world all you need is one more client mutation to trigger the next onStoreDocument.
I think there will be one final auto-triggered save on the last WS connection close event, which should guarantee that 🤔 That said, edge cases could still exist; for example, if U1 and U2 somehow end up with unsynced Y.Doc states and their WS connections both drop at the exact same moment, one will fail to lock... 🤷
yeah, definitely edge cases, for example with the auto-triggered save: if S2 unloads the doc and triggers its final onStoreDocument while S1 still holds the lock, that state never gets persisted. I suppose it can be summed up that with retryCount at 0, the most recent state isn't guaranteed to be saved. UPDATE: After more thought, I think the right architecture is to only have the document open on a single server, like Figma. I'll still use Redis to ensure this and publish messages across servers, but solving the affinity problem seems cleaner than consistency + conflicts.
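To make the single-owner idea concrete, one common way to implement it (purely illustrative, not what Hocuspocus or this PR does) is to claim document ownership in Redis with SET NX and route clients for that document to whichever server holds the claim. The SERVER_ID variable below is hypothetical.

```ts
import { Redis } from 'ioredis'

const redis = new Redis()
const SERVER_ID = process.env.SERVER_ID ?? 'server-1' // hypothetical identifier

// Try to become the single owner of a document. Returns the owning server id.
export async function claimDocument(documentName: string): Promise<string> {
  const key = `doc-owner:${documentName}`
  // NX: only set if no owner exists yet; EX: ownership expires unless refreshed.
  const claimed = await redis.set(key, SERVER_ID, 'EX', 60, 'NX')
  if (claimed === 'OK') return SERVER_ID
  // Someone else owns it; clients for this doc should be routed to that server.
  return (await redis.get(key)) ?? SERVER_ID
}
```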
any thoughts on this, @janthurau?
Yes, in general the way to go is to make sure that users working on the same document are connected to the same server (as @mattkrick mentioned). We haven't seen this issue in our environments for that reason. The advantage of a crash is that it usually triggers external monitoring and may fix temporary connection issues, though obviously it may lead to data loss. On the other hand, we can't just ignore an error in onStoreDocument and proceed with unloading, as this could still cause data loss. I was actually already thinking about removing the locking logic altogether. It basically only ensures that no database update is sent concurrently, but the DB handles that anyway. What would be needed is some way to figure out whether the DB-stored update is newer than the update that is about to be stored, so that two concurrent writes don't overwrite each other but the second one just merges onto the first.
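For what it's worth, Yjs makes the "merge instead of overwrite" part mechanical: updates are commutative, so a store hook can merge the incoming update with whatever is already in the database rather than deciding which write is newer. A minimal sketch, assuming yjs and a hypothetical db.load/db.save pair supplied by the caller:

```ts
import * as Y from 'yjs'

export async function storeMerged(
  documentName: string,
  incoming: Uint8Array,
  db: {
    load: (name: string) => Promise<Uint8Array | null>
    save: (name: string, state: Uint8Array) => Promise<void>
  },
) {
  const existing = await db.load(documentName)
  // Y.mergeUpdates combines update blobs into one equivalent update, so
  // concurrent writers converge instead of the second overwriting the first.
  const merged = existing ? Y.mergeUpdates([existing, incoming]) : incoming
  await db.save(documentName, merged)
}
```

The read-merge-write itself would still need to be atomic at the database level (a transaction or compare-and-set), which is the part the current lock papers over.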
I see the point about crashes triggering monitoring, but IMHO a server should never hard-crash for this kind of issue. A crash would take down all active WS connections, not just the document in question. For monitoring, proper error handling, logs, and alerts should provide the necessary visibility instead. Regarding the locking logic: it may make sense to rethink or even remove it in the long run. But this PR should at least gracefully handle this issue as it was handled before merging #958. Once that baseline is preserved, we can then discuss and decide on broader changes. wdyt @janthurau?
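As a point of reference for what "gracefully handle" could look like, here is a rough sketch (not this PR's actual diff; the error classes are assumed from the redlock family of libraries): treat a failed acquisition as an expected outcome, log it, and skip the store instead of letting the rejection crash the process.

```ts
import { Redlock, ExecutionError, ResourceLockedError } from '@sesamecare-oss/redlock'
import { Redis } from 'ioredis'

const redlock = new Redlock([new Redis()], { retryCount: 0 })

export async function storeDocumentSafely(documentName: string, save: () => Promise<void>) {
  try {
    const lock = await redlock.acquire([`document:${documentName}`], 2000)
    try {
      await save()
    } finally {
      await lock.release()
    }
  } catch (error) {
    if (error instanceof ExecutionError || error instanceof ResourceLockedError) {
      // Expected: another instance holds the lock. Log and skip the store
      // instead of letting the rejection take down the whole server.
      console.warn(`[extension-redis] could not lock ${documentName}, skipping store`)
      return
    }
    throw error // anything else is a real bug and should still surface
  }
}
```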
revert retryCount to 0
force-pushed from a22be19 to 7b68fdc




Description
Currently, when running the Hocuspocus server at scale with the Redis extension enabled (many HP servers connecting to one Redis server), concurrency issues can occur while trying to acquire the lock during onStoreDocument. On failure, the server hard-crashes. This seems to be a regression introduced in #958 when switching to @sesamecare-oss/redlock and the related refactor.
Reproduction Scenario
To imitate the lock issue, configure your server along the lines sketched below, then trigger a couple of saves...
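The exact reproduction config isn't shown above; the following is a rough illustration rather than the author's setup, assuming @hocuspocus/server, @hocuspocus/extension-redis, and @hocuspocus/extension-database. Run several instances of this process (different PORT values) against one Redis and make the store callback slow, so that concurrent onStoreDocument runs on different instances fight over the lock.

```ts
import { Server } from '@hocuspocus/server'
import { Redis } from '@hocuspocus/extension-redis'
import { Database } from '@hocuspocus/extension-database'

// In-memory stand-in for a real database, just to keep the sketch self-contained.
const storedStates = new Map<string, Uint8Array>()

const server = Server.configure({
  port: Number(process.env.PORT ?? 1234),
  debounce: 200, // save often so concurrent onStoreDocument runs overlap
  extensions: [
    new Redis({ host: '127.0.0.1', port: 6379 }),
    new Database({
      fetch: async ({ documentName }) => storedStates.get(documentName) ?? null,
      store: async ({ documentName, state }) => {
        // Artificially slow save so the lock stays held long enough for
        // another instance's save to collide with it.
        await new Promise((resolve) => setTimeout(resolve, 3000))
        storedStates.set(documentName, state)
      },
    }),
  ],
})

server.listen()
```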
This PR includes