The Dragonfly pods seem healthy, then suddenly and unpredictably enter an unhealthy state. When a pod becomes unhealthy, it raises the warning:
W20251202 12:04:38.763005 13 listener_interface.cc:124] Error calling accept system:22/Invalid argument
W20251202 12:04:38.763013 14 listener_interface.cc:124] Error calling accept system:22/Invalid argument
Below are the logs for a Dragonfly pod that is stuck in an infinite restart loop:
kubectl logs -f -n online-store t4459-dragonfly-0
I20251202 12:04:38.325198 7 init.cc:127] dragonfly running in opt mode.
I20251202 12:04:38.325289 7 dfly_main.cc:902] Starting dragonfly df-v1.35.0-67c51eb70c5aa16e38ddaa906689e7aa31037590
.--::--.
:+*=: =@@@@@@@@= :+*+:
%@@@@@@%*=. =@@@@@@@@- .=*%@@@@@@#
@@@@@@@@@@@@#+-. .%@@@@#. .-+#@@@@@@@@@@@%
-@@@@@@@@@@@@@@@@*:#@@#:*@@@@@@@@@@@@@@@@-
:+*********####-%@%%@%-####********++.
.%@@@@@@@@@@@@@%:@@@@@@:@@@@@@@@@@@@@@%
.@@@@@@@@%*+-: =@@@@= .:-+*%@@@@@@@%.
=*+-: ###* .:-+*=
%@@%
*@@*
+@@=
:##:
:@@:
@@
..
* Logs will be written to the first available of the following paths:
/tmp/dragonfly.*
./dragonfly.*
* For the available flags type dragonfly [--help | --helpfull]
* Documentation can be found at: https://www.dragonflydb.io/docs
I20251202 12:04:38.325392 7 dfly_main.cc:963] Max memory limit is: 15.00GiB
I20251202 12:04:38.325886 12 uring_proactor.cc:224] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
I20251202 12:04:38.326862 7 proactor_pool.cc:149] Running 8 io threads
I20251202 12:04:38.329512 7 dragonfly_listener.cc:150] SSL version: OpenSSL 3.0.17 1 Jul 2025
I20251202 12:04:38.329754 7 dfly_main.cc:298] Listening on admin socket any:9999
I20251202 12:04:38.331471 7 server_family.cc:1092] Host OS: Linux 6.1.134 x86_64 with 8 threads
I20251202 12:04:38.366417 17 snapshot_storage.cc:510] Creating AWS S3 client; region=us-east-2; https=true; endpoint=
I20251202 12:04:38.384289 17 credentials_provider_chain.cc:28] aws: disabled EC2 metadata
I20251202 12:04:38.408107 12 accept_server.cc:60] Exiting on signal Terminated
I20251202 12:04:38.457690 17 credentials_provider_chain.cc:36] aws: loaded credentials; provider=web-identity
I20251202 12:04:38.473973 7 snapshot_storage.cc:567] Load snapshot: Searching for snapshot in S3 path: s3://shaped-t4459-feature-e9ed9cad56c92652263953755852bedb/4459/dragonfly/
I20251202 12:04:38.630169 7 server_family.cc:1340] Loading s3://shaped-t4459-feature-e9ed9cad56c92652263953755852bedb/4459/dragonfly/t4459-dragonfly-summary.dfs
I20251202 12:04:38.762454 13 listener_interface.cc:102] sock[19] AcceptServer - listening on 0.0.0.0:9999
I20251202 12:04:38.762465 14 listener_interface.cc:102] sock[20] AcceptServer - listening on 0.0.0.0:6379
W20251202 12:04:38.763005 13 listener_interface.cc:124] Error calling accept system:22/Invalid argument
W20251202 12:04:38.763013 14 listener_interface.cc:124] Error calling accept system:22/Invalid argument
I20251202 12:04:38.763972 14 listener_interface.cc:230] Listener stopped for port 6379
I20251202 12:04:38.763993 13 listener_interface.cc:230] Listener stopped for port 9999
I20251202 12:04:38.839461 13 server_family.cc:1394] Load finished, num keys read: 0
I20251202 12:04:39.024096 16 version_monitor.cc:174] Your current version '1.35.0' is not the latest version. A newer version '1.35.1' is now available. Please consider an update.
I20251202 12:04:39.600379 17 save_stages_controller.cc:355] Saving "s3://shaped-t4459-feature-e9ed9cad56c92652263953755852bedb/4459/dragonfly/t4459-dragonfly-summary.dfs" finished after 0 us
When I force a Helm upgrade with no changes to the values, the Dragonfly pod returns to a healthy state, with the logs below:
kubectl logs -f -n online-store t4408-dragonfly-0
I20251202 12:17:18.588176 7 init.cc:127] dragonfly running in opt mode.
I20251202 12:17:18.588313 7 dfly_main.cc:902] Starting dragonfly df-v1.35.0-67c51eb70c5aa16e38ddaa906689e7aa31037590
[ASCII startup banner and help text identical to the first log, omitted]
I20251202 12:17:18.588423 7 dfly_main.cc:963] Max memory limit is: 15.00GiB
I20251202 12:17:18.588954 12 uring_proactor.cc:224] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
I20251202 12:17:18.589929 7 proactor_pool.cc:149] Running 8 io threads
I20251202 12:17:18.592590 7 dragonfly_listener.cc:150] SSL version: OpenSSL 3.0.17 1 Jul 2025
I20251202 12:17:18.592851 7 dfly_main.cc:298] Listening on admin socket any:9999
I20251202 12:17:18.595541 7 server_family.cc:1092] Host OS: Linux 6.1.134 x86_64 with 8 threads
I20251202 12:17:18.630002 17 snapshot_storage.cc:510] Creating AWS S3 client; region=us-east-2; https=true; endpoint=
I20251202 12:17:18.646389 17 credentials_provider_chain.cc:28] aws: disabled EC2 metadata
I20251202 12:17:18.679507 17 credentials_provider_chain.cc:36] aws: loaded credentials; provider=web-identity
I20251202 12:17:18.697150 7 snapshot_storage.cc:567] Load snapshot: Searching for snapshot in S3 path: s3://prod-magnusstackrankingserv-featurerepos3f51007ca-1fv44vqc7s7as/4408/dragonfly/
I20251202 12:17:18.797996 7 server_family.cc:1340] Loading s3://prod-magnusstackrankingserv-featurerepos3f51007ca-1fv44vqc7s7as/4408/dragonfly/t4408-dragonfly-summary.dfs
I20251202 12:17:18.879534 13 listener_interface.cc:102] sock[19] AcceptServer - listening on 0.0.0.0:9999
I20251202 12:17:18.879840 14 listener_interface.cc:102] sock[20] AcceptServer - listening on 0.0.0.0:6379
I20251202 12:17:18.910612 13 server_family.cc:1394] Load finished, num keys read: 0
I20251202 12:17:19.138098 16 version_monitor.cc:174] Your current version '1.35.0' is not the latest version. A newer version '1.35.1' is now available. Please consider an update.
I20251202 12:17:29.334545 15 server_family.cc:3647] Initiate replication with: NO ONE
I20251202 12:17:29.335666 16 server_family.cc:3647] Initiate replication with: NO ONE
I20251202 12:17:40.734303 19 dflycmd.cc:749] Registered replica 100.64.33.131:6379
I20251202 12:17:40.735044 12 dflycmd.cc:749] Registered replica 100.64.33.131:6379
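Once the pod is back up, the replication state can be double-checked over the Redis protocol, which Dragonfly speaks. A sketch, assuming the pod name and namespace from the logs above:

```shell
# Forward the Dragonfly port locally (pod name from the logs above).
kubectl -n online-store port-forward pod/t4408-dragonfly-0 6379:6379 &

# Dragonfly is Redis-protocol compatible, so redis-cli works;
# a healthy master should list its connected replicas here.
redis-cli -p 6379 INFO replication
```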
At the time the Dragonfly pod got into the unhealthy state, the operator logs were:
2025-12-02T03:16:56Z INFO received {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b", "pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}}
2025-12-02T03:16:56Z INFO non-deletion event for a pod with an existing role. checking if something is wrong {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b", "pod": "t4408-dragonfly-2", "role": "replica"}
2025-12-02T03:16:56Z INFO getting all pods relevant to the dragonfly instance {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b"}
2025-12-02T03:16:56Z INFO checking if replica is configured correctly {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b", "pod": "t4408-dragonfly-1"}
2025-12-02T03:16:56Z INFO checking if replica is configured correctly {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b", "pod": "t4408-dragonfly-2"}
2025-12-02T03:16:56Z INFO received {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-1","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-1", "reconcileID": "27cffc41-a69f-45ce-b21f-7a125d555df3", "pod": {"name":"t4408-dragonfly-1","namespace":"online-store"}}
...
Posting this as a workaround, since I don't have access to reopen issue #402. For context: our Dragonfly pods unpredictably get into a bad state, where the pods are perpetually killed and restarted. Below are logs for an unhealthy vs. a healthy Dragonfly pod, and the Dragonfly operator logs from the time a pod entered the unhealthy restart-loop state.
The only way we have found to resolve the restart loop is to run a forced Helm upgrade with no changes to the values.
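The forced no-op upgrade can be sketched as below. The release and chart names are placeholders for whatever your deployment actually uses:

```shell
# Release and chart names are placeholders -- substitute whatever your
# deployment uses. --reuse-values keeps the current values unchanged;
# --force replaces the Kubernetes resources, recreating the stuck pods.
helm upgrade t4459-dragonfly dragonfly/dragonfly \
  --namespace online-store \
  --reuse-values \
  --force
```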
dragonfly-operator-logs.csv
Originally posted by @xuekat in #402