Dragonfly stuck in restart loop #426

@xuekat

I'm opening this as a workaround because I don't have access to reopen issue #402. For context, our Dragonfly pods unpredictably get into a bad state in which they are perpetually killed and restarted. Below are logs for an unhealthy and a healthy Dragonfly pod, along with the Dragonfly operator logs from the time a pod entered the restart loop.

The only way to resolve the restart loop is to force a helm upgrade with no changes to the values.

dragonfly-operator-logs.csv

The Dragonfly pods appear healthy, then suddenly and unpredictably become unhealthy. When that happens, the pod logs the following warnings:

W20251202 12:04:38.763005    13 listener_interface.cc:124] Error calling accept system:22/Invalid argument
W20251202 12:04:38.763013    14 listener_interface.cc:124] Error calling accept system:22/Invalid argument

Below are the logs for a Dragonfly pod that is stuck in an infinite restart loop:

kubectl logs -f -n online-store t4459-dragonfly-0
I20251202 12:04:38.325198     7 init.cc:127] dragonfly running in opt mode.
I20251202 12:04:38.325289     7 dfly_main.cc:902] Starting dragonfly df-v1.35.0-67c51eb70c5aa16e38ddaa906689e7aa31037590
                   .--::--.                   
   :+*=:          =@@@@@@@@=          :+*+:   
  %@@@@@@%*=.     =@@@@@@@@-     .=*%@@@@@@#  
  @@@@@@@@@@@@#+-. .%@@@@#. .-+#@@@@@@@@@@@%  
  -@@@@@@@@@@@@@@@@*:#@@#:*@@@@@@@@@@@@@@@@-  
    :+*********####-%@%%@%-####********++.    
   .%@@@@@@@@@@@@@%:@@@@@@:@@@@@@@@@@@@@@%    
   .@@@@@@@@%*+-:   =@@@@=  .:-+*%@@@@@@@%.   
     =*+-:           ###*          .:-+*=     
                     %@@%                     
                     *@@*                     
                     +@@=                     
                     :##:                     
                     :@@:                     
                      @@                      
                      ..                      
* Logs will be written to the first available of the following paths:
/tmp/dragonfly.*
./dragonfly.*
* For the available flags type dragonfly [--help | --helpfull]
* Documentation can be found at: https://www.dragonflydb.io/docs
I20251202 12:04:38.325392     7 dfly_main.cc:963] Max memory limit is: 15.00GiB
I20251202 12:04:38.325886    12 uring_proactor.cc:224] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
I20251202 12:04:38.326862     7 proactor_pool.cc:149] Running 8 io threads
I20251202 12:04:38.329512     7 dragonfly_listener.cc:150] SSL version: OpenSSL 3.0.17 1 Jul 2025
I20251202 12:04:38.329754     7 dfly_main.cc:298] Listening on admin socket any:9999
I20251202 12:04:38.331471     7 server_family.cc:1092] Host OS: Linux 6.1.134 x86_64 with 8 threads
I20251202 12:04:38.366417    17 snapshot_storage.cc:510] Creating AWS S3 client; region=us-east-2; https=true; endpoint=
I20251202 12:04:38.384289    17 credentials_provider_chain.cc:28] aws: disabled EC2 metadata
I20251202 12:04:38.408107    12 accept_server.cc:60] Exiting on signal Terminated
I20251202 12:04:38.457690    17 credentials_provider_chain.cc:36] aws: loaded credentials; provider=web-identity
I20251202 12:04:38.473973     7 snapshot_storage.cc:567] Load snapshot: Searching for snapshot in S3 path: s3://shaped-t4459-feature-e9ed9cad56c92652263953755852bedb/4459/dragonfly/
I20251202 12:04:38.630169     7 server_family.cc:1340] Loading s3://shaped-t4459-feature-e9ed9cad56c92652263953755852bedb/4459/dragonfly/t4459-dragonfly-summary.dfs
I20251202 12:04:38.762454    13 listener_interface.cc:102] sock[19] AcceptServer - listening on 0.0.0.0:9999
I20251202 12:04:38.762465    14 listener_interface.cc:102] sock[20] AcceptServer - listening on 0.0.0.0:6379
W20251202 12:04:38.763005    13 listener_interface.cc:124] Error calling accept system:22/Invalid argument
W20251202 12:04:38.763013    14 listener_interface.cc:124] Error calling accept system:22/Invalid argument
I20251202 12:04:38.763972    14 listener_interface.cc:230] Listener stopped for port 6379
I20251202 12:04:38.763993    13 listener_interface.cc:230] Listener stopped for port 9999
I20251202 12:04:38.839461    13 server_family.cc:1394] Load finished, num keys read: 0
I20251202 12:04:39.024096    16 version_monitor.cc:174] Your current version '1.35.0' is not the latest version. A newer version '1.35.1' is now available. Please consider an update.
I20251202 12:04:39.600379    17 save_stages_controller.cc:355] Saving "s3://shaped-t4459-feature-e9ed9cad56c92652263953755852bedb/4459/dragonfly/t4459-dragonfly-summary.dfs" finished after 0 us

When I force a helm upgrade with no changes to the values, the Dragonfly pod returns to a healthy state, with the logs below:

kubectl logs -f -n online-store t4408-dragonfly-0
I20251202 12:17:18.588176     7 init.cc:127] dragonfly running in opt mode.
                   .--::--.                   
   :+*=:          =@@@@@@@@=          :+*+:   
  %@@@@@@%*=.     =@@@@@@@@-     .=*%@@@@@@#  
I20251202 12:17:18.588313     7 dfly_main.cc:902] Starting dragonfly df-v1.35.0-67c51eb70c5aa16e38ddaa906689e7aa31037590
  @@@@@@@@@@@@#+-. .%@@@@#. .-+#@@@@@@@@@@@%  
  -@@@@@@@@@@@@@@@@*:#@@#:*@@@@@@@@@@@@@@@@-  
    :+*********####-%@%%@%-####********++.    
   .%@@@@@@@@@@@@@%:@@@@@@:@@@@@@@@@@@@@@%    
   .@@@@@@@@%*+-:   =@@@@=  .:-+*%@@@@@@@%.   
     =*+-:           ###*          .:-+*=     
                     %@@%                     
                     *@@*                     
                     +@@=                     
                     :##:                     
                     :@@:                     
                      @@                      
                      ..                      
* Logs will be written to the first available of the following paths:
/tmp/dragonfly.*
./dragonfly.*
* For the available flags type dragonfly [--help | --helpfull]
* Documentation can be found at: https://www.dragonflydb.io/docs
I20251202 12:17:18.588423     7 dfly_main.cc:963] Max memory limit is: 15.00GiB
I20251202 12:17:18.588954    12 uring_proactor.cc:224] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
I20251202 12:17:18.589929     7 proactor_pool.cc:149] Running 8 io threads
I20251202 12:17:18.592590     7 dragonfly_listener.cc:150] SSL version: OpenSSL 3.0.17 1 Jul 2025
I20251202 12:17:18.592851     7 dfly_main.cc:298] Listening on admin socket any:9999
I20251202 12:17:18.595541     7 server_family.cc:1092] Host OS: Linux 6.1.134 x86_64 with 8 threads
I20251202 12:17:18.630002    17 snapshot_storage.cc:510] Creating AWS S3 client; region=us-east-2; https=true; endpoint=
I20251202 12:17:18.646389    17 credentials_provider_chain.cc:28] aws: disabled EC2 metadata
I20251202 12:17:18.679507    17 credentials_provider_chain.cc:36] aws: loaded credentials; provider=web-identity
I20251202 12:17:18.697150     7 snapshot_storage.cc:567] Load snapshot: Searching for snapshot in S3 path: s3://prod-magnusstackrankingserv-featurerepos3f51007ca-1fv44vqc7s7as/4408/dragonfly/
I20251202 12:17:18.797996     7 server_family.cc:1340] Loading s3://prod-magnusstackrankingserv-featurerepos3f51007ca-1fv44vqc7s7as/4408/dragonfly/t4408-dragonfly-summary.dfs
I20251202 12:17:18.879534    13 listener_interface.cc:102] sock[19] AcceptServer - listening on 0.0.0.0:9999
I20251202 12:17:18.879840    14 listener_interface.cc:102] sock[20] AcceptServer - listening on 0.0.0.0:6379
I20251202 12:17:18.910612    13 server_family.cc:1394] Load finished, num keys read: 0
I20251202 12:17:19.138098    16 version_monitor.cc:174] Your current version '1.35.0' is not the latest version. A newer version '1.35.1' is now available. Please consider an update.
I20251202 12:17:29.334545    15 server_family.cc:3647] Initiate replication with: NO ONE
I20251202 12:17:29.335666    16 server_family.cc:3647] Initiate replication with: NO ONE
I20251202 12:17:40.734303    19 dflycmd.cc:749] Registered replica 100.64.33.131:6379
I20251202 12:17:40.735044    12 dflycmd.cc:749] Registered replica 100.64.33.131:6379
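
One visible difference between the two logs above: the unhealthy pod logs `Exiting on signal Terminated` at 12:04:38.408107, before its listeners report listening at 12:04:38.762454, while the healthy log contains no such line. Below is a minimal Python sketch for checking other pod logs for the same pattern. It assumes the glog-style prefix seen in these logs (severity letter, date, time, thread id, `file:line]`). The `summarize` helper and its field names are hypothetical, not part of Dragonfly, and the check only reports ordering, not a root cause.

```python
import re
from datetime import datetime

# Assumed prefix format, taken from the log lines in this issue:
#   I20251202 12:04:38.408107    12 accept_server.cc:60] message
GLOG = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)

def parse(lines):
    """Yield (timestamp, severity, source, message); skip banner and other lines."""
    for line in lines:
        m = GLOG.match(line)
        if m:
            ts = datetime.strptime(m["date"] + " " + m["time"], "%Y%m%d %H:%M:%S.%f")
            yield ts, m["sev"], m["src"], m["msg"]

def summarize(lines):
    """Hypothetical helper: report whether a termination signal preceded listening."""
    events = list(parse(lines))
    sigterm = next((ts for ts, _, _, msg in events if "Exiting on signal" in msg), None)
    listening = [ts for ts, _, _, msg in events if "AcceptServer - listening" in msg]
    first_listen = min(listening) if listening else None
    accept_errors = sum(
        1 for _, sev, _, msg in events if sev == "W" and "Error calling accept" in msg
    )
    return {
        "sigterm_at": sigterm,
        "first_listen_at": first_listen,
        "accept_errors": accept_errors,
        "sigterm_before_listen": bool(sigterm and first_listen and sigterm < first_listen),
    }

# Lines copied from the unhealthy pod log in this issue.
SAMPLE = [
    "I20251202 12:04:38.408107    12 accept_server.cc:60] Exiting on signal Terminated",
    "I20251202 12:04:38.762454    13 listener_interface.cc:102] sock[19] AcceptServer - listening on 0.0.0.0:9999",
    "I20251202 12:04:38.762465    14 listener_interface.cc:102] sock[20] AcceptServer - listening on 0.0.0.0:6379",
    "W20251202 12:04:38.763005    13 listener_interface.cc:124] Error calling accept system:22/Invalid argument",
    "W20251202 12:04:38.763013    14 listener_interface.cc:124] Error calling accept system:22/Invalid argument",
]

if __name__ == "__main__":
    result = summarize(SAMPLE)
    print(result["sigterm_before_listen"], result["accept_errors"])
```

To run it against a live pod, the output of `kubectl logs` can be saved to a file and its lines passed to `summarize`.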

At the time the Dragonfly pod became unhealthy, the operator logs were:

2025-12-02T03:16:56Z	INFO	received	{"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b", "pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}}
2025-12-02T03:16:56Z	INFO	non-deletion event for a pod with an existing role. checking if something is wrong	{"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b", "pod": "t4408-dragonfly-2", "role": "replica"}
2025-12-02T03:16:56Z	INFO	getting all pods relevant to the dragonfly instance	{"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b"}
2025-12-02T03:16:56Z	INFO	checking if replica is configured correctly	{"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b", "pod": "t4408-dragonfly-1"}
2025-12-02T03:16:56Z	INFO	checking if replica is configured correctly	{"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b", "pod": "t4408-dragonfly-2"}
2025-12-02T03:16:56Z	INFO	received	{"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-1","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-1", "reconcileID": "27cffc41-a69f-45ce-b21f-7a125d555df3", "pod": {"name":"t4408-dragonfly-1","namespace":"online-store"}}
...
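
For anyone reading the operator logs, the following is a minimal Python sketch that groups the log messages by pod and `reconcileID`, so each reconcile pass can be read as one sequence. It assumes each line is tab-separated as timestamp, level, message, and a JSON payload, as in the excerpt above, and it tolerates the `\"` escaping that a CSV export may leave in the payload (that blanket replacement is itself an assumption and would corrupt a payload that legitimately contains escaped quotes). The function names are hypothetical and not part of the operator.

```python
import json
from collections import defaultdict

def parse_operator_line(line):
    """Split one operator log line into (timestamp, level, message, fields)."""
    ts, level, msg, payload = line.rstrip("\n").split("\t", 3)
    # CSV exports may leave quotes escaped as \" ; undo that before parsing.
    fields = json.loads(payload.replace('\\"', '"'))
    return ts, level, msg, fields

def group_by_reconcile(lines):
    """Map (pod name, reconcileID) to the ordered list of (timestamp, message)."""
    groups = defaultdict(list)
    for line in lines:
        ts, _level, msg, fields = parse_operator_line(line)
        groups[(fields["name"], fields["reconcileID"])].append((ts, msg))
    return dict(groups)

# Lines copied from the operator log excerpt in this issue (quotes unescaped).
SAMPLE = [
    '2025-12-02T03:16:56Z\tINFO\treceived\t{"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b", "pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}}',
    '2025-12-02T03:16:56Z\tINFO\tnon-deletion event for a pod with an existing role. checking if something is wrong\t{"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-2","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-2", "reconcileID": "85a08725-87bb-4299-b1f2-cf86d7614b6b", "pod": "t4408-dragonfly-2", "role": "replica"}',
    '2025-12-02T03:16:56Z\tINFO\treceived\t{"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"t4408-dragonfly-1","namespace":"online-store"}, "namespace": "online-store", "name": "t4408-dragonfly-1", "reconcileID": "27cffc41-a69f-45ce-b21f-7a125d555df3", "pod": {"name":"t4408-dragonfly-1","namespace":"online-store"}}',
]

if __name__ == "__main__":
    for (pod, reconcile_id), events in group_by_reconcile(SAMPLE).items():
        print(pod, reconcile_id)
        for ts, msg in events:
            print("   ", ts, msg)
```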

I have attached the full CSV logs for the operator at this timestamp:

data_exported_2025-12-02_123531.csv

Originally posted by @xuekat in #402
