C++ compiler version:
Hazelcast Cpp client version: 4.0.1
Hazelcast server version: N/A
Number of the clients:
Cluster size, i.e. the number of Hazelcast cluster members:
OS version (Windows/Linux/OSX): Linux
Please attach relevant logs and files for client and server side.
Expected behaviour
When SIGTERM is handled, the binary should shut down.
Actual behaviour
When SIGTERM is received, the Hazelcast library sometimes gets stuck, which prevents the binary from shutting down.
We intercept SIGTERM and write to a pipe to trigger our own signal handling code, i.e. none of that code, including the call to Hazelcast shutdown, runs inside the signal handler itself.
You can see that in thread 41 of the attached file. I don't know why frame 3 isn't shown, but it is a call to a routine in our code that calls Hazelcast shutdown. When that returns, we exit.
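For readers unfamiliar with the pattern, the pipe-based signal handling described above looks roughly like the sketch below. This is not our actual code: the names `g_sig_pipe`, `on_sigterm`, `install_sigterm_pipe` and `wait_for_sigterm` are hypothetical, and it only assumes a POSIX system.

```cpp
#include <cerrno>
#include <csignal>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical names throughout; an illustration of the pattern only.
static int g_sig_pipe[2] = {-1, -1};  // [0] = read end, [1] = write end

// Signal handler: only writes one byte. write(2) is async-signal-safe.
static void on_sigterm(int) {
    const int saved_errno = errno;
    const char byte = 1;
    ssize_t rc = write(g_sig_pipe[1], &byte, 1);
    (void)rc;  // nothing safe to do about a failure inside a handler
    errno = saved_errno;
}

// Creates the pipe and installs the SIGTERM handler. Returns false on failure.
bool install_sigterm_pipe() {
    if (pipe(g_sig_pipe) != 0) return false;
    // Non-blocking write end, so the handler can never block on a full pipe.
    const int flags = fcntl(g_sig_pipe[1], F_GETFL);
    if (flags == -1 || fcntl(g_sig_pipe[1], F_SETFL, flags | O_NONBLOCK) == -1) {
        return false;
    }
    struct sigaction sa {};
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGTERM, &sa, nullptr) == 0;
}

// Blocks in ordinary thread context until the handler has written a byte.
bool wait_for_sigterm() {
    char byte;
    for (;;) {
        const ssize_t n = read(g_sig_pipe[0], &byte, 1);
        if (n == 1) return true;
        if (n == -1 && errno == EINTR) continue;  // interrupted, retry
        return false;
    }
}
```

The thread that returns from `wait_for_sigterm()` is a normal thread, so it can safely call the shutdown routine (in our case the one that ends up in Hazelcast shutdown) without any async-signal-safety restrictions.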
I can see from the logs that we have called the Hazelcast shutdown:
2021/07/09-22:49:19.542328 +0000: (rttpd): NOTIFY: Discovery: Shutting down client
2021/07/09-22:49:19.542349 +0000: (rttpd): NOTIFY: Discovery: Peer - Removing entry <ccbbfb16-e5be-4c1a-af5b-840492060004>
2021/07/09-22:49:19.542465 +0000: (4276): INFO: Discovery: Library - <hz.client_1><soaktest> LifecycleService::LifecycleEvent SHUTTING_DOWN
2021/07/09-22:49:19.542631 +0000: (4276): INFO: Discovery: Library - <hz.client_1><soaktest> Removed connection to endpoint: Address[10.42.3.224:5701], connection: ClientConnection{alive=0, connectionId=3, remoteEndpoint=Address[10.42.3.224:5701], lastReadTime=2021-07-09 22:49:19.0-5, closedTime=2021-07-09 22:49:19.000, connected server version=4.0.3}
2021/07/09-22:49:19.542676 +0000: (4276): INFO: Discovery: Library - <hz.client_1><soaktest> Removed connection to endpoint: Address[10.42.3.226:5701], connection: ClientConnection{alive=0, connectionId=1, remoteEndpoint=Address[10.42.3.226:5701], lastReadTime=2021-07-09 22:49:19.0-5, closedTime=2021-07-09 22:49:19.000, connected server version=4.0.3}
2021/07/09-22:49:19.542755 +0000: (4276): INFO: Discovery: Library - <hz.client_1><soaktest> Removed connection to endpoint: Address[10.42.3.230:5701], connection: ClientConnection{alive=0, connectionId=2, remoteEndpoint=Address[10.42.3.230:5701], lastReadTime=2021-07-09 22:49:19.-35, closedTime=2021-07-09 22:49:19.000, connected server version=4.0.3}
2021/07/09-22:49:19.542763 +0000: (4276): INFO: Discovery: Library - <hz.client_1><soaktest> LifecycleService::LifecycleEvent CLIENT_DISCONNECTED
In the attached trace file:
- There is a Hazelcast thread, 40, that seems to be doing shutdown work
- Thread 22 is handling a queue item
- None of the other threads appear to be locked up
The reason we shut down is that we timed out waiting for an expected response on a queue, which may be related to thread 22, i.e.
- Thread 22 got stuck (I don't know why)
- We time out and try to shut down
- We call hazelcast->shutdown() and it can't shut down because thread 22 still holds some resource?
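To check whether the shutdown call itself is what never returns, one option is to run it on a helper thread and wait for it with a deadline. The sketch below is generic C++ and makes no assumptions about the Hazelcast API; `shutdown_with_deadline` is a hypothetical helper, and the callable passed in would wrap whatever blocking shutdown call the application makes. It is a diagnostic aid, not a fix for the hang.

```cpp
#include <chrono>
#include <exception>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical helper: runs `blocking_shutdown` on a detached thread and
// waits at most `timeout` for it to complete.
// Returns true if the call returned (or threw) in time, false if it is
// still running when the deadline expires.
bool shutdown_with_deadline(std::function<void()> blocking_shutdown,
                            std::chrono::milliseconds timeout) {
    // The promise is shared so it outlives this function if the call hangs.
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();

    std::thread([done, fn = std::move(blocking_shutdown)]() {
        try {
            fn();
            done->set_value();
        } catch (...) {
            done->set_exception(std::current_exception());
        }
    }).detach();

    return finished.wait_for(timeout) == std::future_status::ready;
}
```

If this returns false, the caller can log that shutdown is stuck and decide how to exit. The helper thread is detached, so a shutdown call that never returns does not block the caller.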
The queue that timed out had previously received several responses.
We have had issues with queues randomly not receiving messages sent by the server, but in those cases, when I looked at the client threads, none of them were stuck the way they are in the attached trace.
Steps to reproduce the behaviour
This doesn't happen every time
gdb.txt