Lockup in library when client shuts down #900

@andysCaplin

Description

C++ compiler version:
Hazelcast Cpp client version: 4.0.1
Hazelcast server version: N/A
Number of the clients:
Cluster size, i.e. the number of Hazelcast cluster members:
OS version (Windows/Linux/OSX): Linux

Please attach relevant logs and files for client and server side.

Expected behaviour

When SIGTERM is handled, the binary should shut down.

Actual behaviour

When SIGTERM is received, the Hazelcast library sometimes gets stuck, which prevents the binary from shutting down.
We intercept SIGTERM and write to a pipe to trigger our own signal handling code, i.e. none of that code, including the call to Hazelcast shutdown, runs inside the signal handler itself.
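For readers unfamiliar with this pattern, here is a minimal sketch of the "write to a pipe from the handler" approach on Linux. It is not our actual code: the names `install_sigterm_pipe`, `wait_for_signal`, and `on_sigterm` are made up for illustration. The handler only calls `write`, which is async-signal-safe, and everything else runs later in normal thread context.

```cpp
#include <cerrno>
#include <csignal>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical names throughout; this only illustrates the pattern.
static int g_signal_pipe[2] = {-1, -1};  // [0] = read end, [1] = write end

// Signal handler: does nothing except write one byte to the pipe.
static void on_sigterm(int signo) {
    const int saved_errno = errno;  // write() may clobber errno
    const unsigned char byte = static_cast<unsigned char>(signo);
    const ssize_t rc = write(g_signal_pipe[1], &byte, 1);
    (void)rc;  // nothing safe to do on failure inside a handler
    errno = saved_errno;
}

// Creates the pipe and installs the SIGTERM handler. Returns false on failure.
bool install_sigterm_pipe() {
    if (pipe(g_signal_pipe) != 0) return false;
    // Make the write end non-blocking so the handler can never block.
    const int flags = fcntl(g_signal_pipe[1], F_GETFL);
    if (flags == -1) return false;
    if (fcntl(g_signal_pipe[1], F_SETFL, flags | O_NONBLOCK) == -1) return false;

    struct sigaction sa {};
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGTERM, &sa, nullptr) == 0;
}

// Blocks until a signal number arrives on the pipe. Returns it, or -1 on error.
int wait_for_signal() {
    unsigned char byte = 0;
    ssize_t n;
    do {
        n = read(g_signal_pipe[0], &byte, 1);
    } while (n == -1 && errno == EINTR);
    return n == 1 ? static_cast<int>(byte) : -1;
}
```

An ordinary thread calls `wait_for_signal()` and, once it returns, runs the application's shutdown routine (in our case the routine that ends up calling Hazelcast shutdown), so the library is never entered from signal context.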

You can see that in thread 41 of the attached file. I don't know why frame 3 isn't shown, but it's a call to a routine in our code that calls Hazelcast shutdown. When that returns, we exit.
I can see from the logs that we've called the Hazelcast shutdown:

2021/07/09-22:49:19.542328 +0000: (rttpd): NOTIFY: Discovery: Shutting down client
2021/07/09-22:49:19.542349 +0000: (rttpd): NOTIFY: Discovery: Peer - Removing entry <ccbbfb16-e5be-4c1a-af5b-840492060004>
2021/07/09-22:49:19.542465 +0000: (4276): INFO: Discovery: Library - <hz.client_1><soaktest> LifecycleService::LifecycleEvent SHUTTING_DOWN
2021/07/09-22:49:19.542631 +0000: (4276): INFO: Discovery: Library - <hz.client_1><soaktest> Removed connection to endpoint: Address[10.42.3.224:5701], connection: ClientConnection{alive=0, connectionId=3, remoteEndpoint=Address[10.42.3.224:5701], lastReadTime=2021-07-09 22:49:19.0-5, closedTime=2021-07-09 22:49:19.000, connected server version=4.0.3}
2021/07/09-22:49:19.542676 +0000: (4276): INFO: Discovery: Library - <hz.client_1><soaktest> Removed connection to endpoint: Address[10.42.3.226:5701], connection: ClientConnection{alive=0, connectionId=1, remoteEndpoint=Address[10.42.3.226:5701], lastReadTime=2021-07-09 22:49:19.0-5, closedTime=2021-07-09 22:49:19.000, connected server version=4.0.3}
2021/07/09-22:49:19.542755 +0000: (4276): INFO: Discovery: Library - <hz.client_1><soaktest> Removed connection to endpoint: Address[10.42.3.230:5701], connection: ClientConnection{alive=0, connectionId=2, remoteEndpoint=Address[10.42.3.230:5701], lastReadTime=2021-07-09 22:49:19.-35, closedTime=2021-07-09 22:49:19.000, connected server version=4.0.3}
2021/07/09-22:49:19.542763 +0000: (4276): INFO: Discovery: Library - <hz.client_1><soaktest> LifecycleService::LifecycleEvent CLIENT_DISCONNECTED

In the attached trace file:

  • There is a Hazelcast thread (40) that seems to be doing shutdown work
  • Thread 22 is handling a queue item
  • None of the other threads appear to be locked up

The reason we shut down is that we timed out waiting for an expected response on a queue, which may be related to thread 22, i.e.

  • Thread 22 got stuck (don't know why)
  • We time out and try to shut down
  • We call hazelcast->shutdown() and it can't shut down because thread 22 still holds some resource?
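To make the suspected sequence concrete, here is a small self-contained sketch of how a shutdown that joins its worker threads can hang when one worker never finishes. This is a guess at the mechanism, not Hazelcast's actual implementation: `FakeClient`, `release`, and `finished_within` are hypothetical names invented for the example.

```cpp
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a client library whose shutdown() joins a worker.
class FakeClient {
public:
    FakeClient() : worker_([this] { run(); }) {}
    ~FakeClient() {
        release();
        if (worker_.joinable()) worker_.join();
    }

    // Blocks until the worker exits, so it hangs while the worker is stuck.
    void shutdown() {
        if (worker_.joinable()) worker_.join();
    }

    // Models the stuck operation finally completing.
    void release() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            done_ = true;
        }
        cv_.notify_all();
    }

private:
    // Models a thread waiting for a response that never arrives.
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_; });
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist first
};

// Returns true if the asynchronous call completed within the timeout.
bool finished_within(std::future<void>& call, std::chrono::milliseconds timeout) {
    return call.wait_for(timeout) == std::future_status::ready;
}
```

Running `shutdown()` through `std::async` and checking it with `finished_within` shows the hang: the call does not complete until `release()` is invoked. Note that a `std::future` obtained from `std::async` blocks in its destructor until the task finishes, so a bounded wait like this can only detect a hung shutdown; it cannot abandon one.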

The queue that timed out had previously received several responses.
We have had issues with queues randomly not receiving messages sent by the server, but in those cases, when I looked at the client threads, none of them were stuck the way they are in the trace below.

Steps to reproduce the behaviour

This doesn't happen every time.

gdb.txt
