Dragonfly stuck in restart loop

I am opening this as a new issue because I don't have permission to reopen #402. Our Dragonfly pods unpredictably get into a bad state in which they are killed and restarted in a perpetual loop. Below are logs from an unhealthy and a healthy Dragonfly pod, plus the Dragonfly operator logs from the time a pod entered the restart loop.

The only way we have found to resolve the restart loop is to force a Helm upgrade with no changes to the values.

[dragonfly-operator-logs.csv](https://github.com/user-attachments/files/23881101/data_exported_2025-12-02_123531.csv)

> The Dragonfly pods appear healthy, then suddenly and unpredictably become unhealthy. When that happens, the pod logs the following warnings:
> ```
> W20251202 12:04:38.763005    13 listener_interface.cc:124] Error calling accept system:22/Invalid argument
> W20251202 12:04:38.763013    14 listener_interface.cc:124] Error calling accept system:22/Invalid argument
> ```
> Below are the logs from a Dragonfly pod that is stuck in an infinite restart loop:
> ```
> kubectl logs -f -n online-store t4459-dragonfly-0
> I20251202 12:04:38.325198     7 init.cc:127] dragonfly running in opt mode.
> I20251202 12:04:38.325289     7 dfly_main.cc:902] Starting dragonfly df-v1.35.0-67c51eb70c5aa16e38ddaa906689e7aa31037590
>                    .--::--.                   
>    :+*=:          =@@@@@@@@=          :+*+:   
>   %@@@@@@%*=.     =@@@@@@@@-     .=*%@@@@@@#  
>   @@@@@@@@@@@@#+-. .%@@@@#. .-+#@@@@@@@@@@@%  
>   -@@@@@@@@@@@@@@@@*:#@@#:*@@@@@@@@@@@@@@@@-  
>     :+*********####-%@%%@%-####********++.    
>    .%@@@@@@@@@@@@@%:@@@@@@:@@@@@@@@@@@@@@%    
>    .@@@@@@@@%*+-:   =@@@@=  .:-+*%@@@@@@@%.   
>      =*+-:           ###*          .:-+*=     
>                      %@@%                     
>                      *@@*                     
>                      +@@=                     
>                      :##:                     
>                      :@@:                     
>                       @@                      
>                       ..                      
> * Logs will be written to the first available of the following paths:
> /tmp/dragonfly.*
> ./dragonfly.*
> * For the available flags type dragonfly [--help | --helpfull]
> * Documentation can be found at: https://www.dragonflydb.io/docs
> I20251202 12:04:38.325392     7 dfly_main.cc:963] Max memory limit is: 15.00GiB
> I20251202 12:04:38.325886    12 uring_proactor.cc:224] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
> I20251202 12:04:38.326862     7 proactor_pool.cc:149] Running 8 io threads
> I20251202 12:04:38.329512     7 dragonfly_listener.cc:150] SSL version: OpenSSL 3.0.17 1 Jul 2025
> I20251202 12:04:38.329754     7 dfly_main.cc:298] Listening on admin socket any:9999
> I20251202 12:04:38.331471     7 server_family.cc:1092] Host OS: Linux 6.1.134 x86_64 with 8 threads
> I20251202 12:04:38.366417    17 snapshot_storage.cc:510] Creating AWS S3 client; region=us-east-2; https=true; endpoint=
> I20251202 12:04:38.384289    17 credentials_provider_chain.cc:28] aws: disabled EC2 metadata
> I20251202 12:04:38.408107    12 accept_server.cc:60] Exiting on signal Terminated
> I20251202 12:04:38.457690    17 credentials_provider_chain.cc:36] aws: loaded credentials; provider=web-identity
> I20251202 12:04:38.473973     7 snapshot_storage.cc:567] Load snapshot: Searching for snapshot in S3 path: s3://shaped-t4459-feature-e9ed9cad56c92652263953755852bedb/4459/dragonfly/
> I20251202 12:04:38.630169     7 server_family.cc:1340] Loading s3://shaped-t4459-feature-e9ed9cad56c92652263953755852bedb/4459/dragonfly/t4459-dragonfly-summary.dfs
> I20251202 12:04:38.762454    13 listener_interface.cc:102] sock[19] AcceptServer - listening on 0.0.0.0:9999
> I20251202 12:04:38.762465    14 listener_interface.cc:102] sock[20] AcceptServer - listening on 0.0.0.0:6379
> W20251202 12:04:38.763005    13 listener_interface.cc:124] Error calling accept system:22/Invalid argument
> W20251202 12:04:38.763013    14 listener_interface.cc:124] Error calling accept system:22/Invalid argument
> I20251202 12:04:38.763972    14 listener_interface.cc:230] Listener stopped for port 6379
> I20251202 12:04:38.763993    13 listener_interface.cc:230] Listener stopped for port 9999
> I20251202 12:04:38.839461    13 server_family.cc:1394] Load finished, num keys read: 0
> I20251202 12:04:39.024096    16 version_monitor.cc:174] Your current version '1.35.0' is not the latest version. A newer version '1.35.1' is now available. Please consider an update.
> I20251202 12:04:39.600379    17 save_stages_controller.cc:355] Saving "s3://shaped-t4459-feature-e9ed9cad56c92652263953755852bedb/4459/dragonfly/t4459-dragonfly-summary.dfs" finished after 0 us
> ```
> 
> When I force a Helm upgrade with no changes to the values, the Dragonfly pod returns to a healthy state, with the logs below:
> ```
> kubectl logs -f -n online-store t4408-dragonfly-0
> I20251202 12:17:18.588176     7 init.cc:127] dragonfly running in opt mode.
>                    .--::--.                   
>    :+*=:          =@@@@@@@@=          :+*+:   
>   %@@@@@@%*=.     =@@@@@@@@-     .=*%@@@@@@#  
> I20251202 12:17:18.588313     7 dfly_main.cc:902] Starting dragonfly df-v1.35.0-67c51eb70c5aa16e38ddaa906689e7aa31037590
>   @@@@@@@@@@@@#+-. .%@@@@#. .-+#@@@@@@@@@@@%  
>   -@@@@@@@@@@@@@@@@*:#@@#:*@@@@@@@@@@@@@@@@-  
>     :+*********####-%@%%@%-####********++.    
>    .%@@@@@@@@@@@@@%:@@@@@@:@@@@@@@@@@@@@@%    
>    .@@@@@@@@%*+-:   =@@@@=  .:-+*%@@@@@@@%.   
>      =*+-:           ###*          .:-+*=     
>                      %@@%                     
>                      *@@*                     
>                      +@@=                     
>                      :##:                     
>                      :@@:                     
>                       @@                      
>                       ..                      
> * Logs will be written to the first available of the following paths:
> /tmp/dragonfly.*
> ./dragonfly.*
> * For the available flags type dragonfly [--help | --helpfull]
> * Documentation can be found at: https://www.dragonflydb.io/docs
> I20251202 12:17:18.588423     7 dfly_main.cc:963] Max memory limit is: 15.00GiB
> I20251202 12:17:18.588954    12 uring_proactor.cc:224] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
> I20251202 12:17:18.589929     7 proactor_pool.cc:149] Running 8 io threads
> I20251202 12:17:18.592590     7 dragonfly_listener.cc:150] SSL version: OpenSSL 3.0.17 1 Jul 2025
> I20251202 12:17:18.592851     7 dfly_main.cc:298] Listening on admin socket any:9999
> I20251202 12:17:18.595541     7 server_family.cc:1092] Host OS: Linux 6.1.134 x86_64 with 8 threads
> I20251202 12:17:18.630002    17 snapshot_storage.cc:510] Creating AWS S3 client; region=us-east-2; https=true; endpoint=
> I20251202 12:17:18.646389    17 credentials_provider_chain.cc:28] aws: disabled EC2 metadata
> I20251202 12:17:18.679507    17 credentials_provider_chain.cc:36] aws: loaded credentials; provider=web-identity
> I20251202 12:17:18.697150     7 snapshot_storage.cc:567] Load snapshot: Searching for snapshot in S3 path: s3://prod-magnusstackrankingserv-featurerepos3f51007ca-1fv44vqc7s7as/4408/dragonfly/
> I20251202 12:17:18.797996     7 server_family.cc:1340] Loading s3://prod-magnusstackrankingserv-featurerepos3f51007ca-1fv44vqc7s7as/4408/dragonfly/t4408-dragonfly-summary.dfs
> I20251202 12:17:18.879534    13 listener_interface.cc:102] sock[19] AcceptServer - listening on 0.0.0.0:9999
> I20251202 12:17:18.879840    14 listener_interface.cc:102] sock[20] AcceptServer - listening on 0.0.0.0:6379
> I20251202 12:17:18.910612    13 server_family.cc:1394] Load finished, num keys read: 0
> I20251202 12:17:19.138098    16 version_monitor.cc:174] Your current version '1.35.0' is not the latest version. A newer version '1.35.1' is now available. Please consider an update.
> I20251202 12:17:29.334545    15 server_family.cc:3647] Initiate replication with: NO ONE
> I20251202 12:17:29.335666    16 server_family.cc:3647] Initiate replication with: NO ONE
> I20251202 12:17:40.734303    19 dflycmd.cc:749] Registered replica 100.64.33.131:6379
> I20251202 12:17:40.735044    12 dflycmd.cc:749] Registered replica 100.64.33.131:6379
> ```
> 
> At the time Dragonfly became unhealthy, the operator logs were:
> ```
> 2025-12-02T03:16:56Z	INFO	received	{\"controller\": \"DragonflyPodLifecycle\", \"controllerGroup\": \"\", \"controllerKind\": \"Pod\", \"Pod\": {\"name\":\"t4408-dragonfly-2\",\"namespace\":\"online-store\"}, \"namespace\": \"online-store\", \"name\": \"t4408-dragonfly-2\", \"reconcileID\": \"85a08725-87bb-4299-b1f2-cf86d7614b6b\", \"pod\": {\"name\":\"t4408-dragonfly-2\",\"namespace\":\"online-store\"}}
> 2025-12-02T03:16:56Z	INFO	non-deletion event for a pod with an existing role. checking if something is wrong	{\"controller\": \"DragonflyPodLifecycle\", \"controllerGroup\": \"\", \"controllerKind\": \"Pod\", \"Pod\": {\"name\":\"t4408-dragonfly-2\",\"namespace\":\"online-store\"}, \"namespace\": \"online-store\", \"name\": \"t4408-dragonfly-2\", \"reconcileID\": \"85a08725-87bb-4299-b1f2-cf86d7614b6b\", \"pod\": \"t4408-dragonfly-2\", \"role\": \"replica\"}
> 2025-12-02T03:16:56Z	INFO	getting all pods relevant to the dragonfly instance	{\"controller\": \"DragonflyPodLifecycle\", \"controllerGroup\": \"\", \"controllerKind\": \"Pod\", \"Pod\": {\"name\":\"t4408-dragonfly-2\",\"namespace\":\"online-store\"}, \"namespace\": \"online-store\", \"name\": \"t4408-dragonfly-2\", \"reconcileID\": \"85a08725-87bb-4299-b1f2-cf86d7614b6b\"}
> 2025-12-02T03:16:56Z	INFO	checking if replica is configured correctly	{\"controller\": \"DragonflyPodLifecycle\", \"controllerGroup\": \"\", \"controllerKind\": \"Pod\", \"Pod\": {\"name\":\"t4408-dragonfly-2\",\"namespace\":\"online-store\"}, \"namespace\": \"online-store\", \"name\": \"t4408-dragonfly-2\", \"reconcileID\": \"85a08725-87bb-4299-b1f2-cf86d7614b6b\", \"pod\": \"t4408-dragonfly-1\"}
> 2025-12-02T03:16:56Z	INFO	checking if replica is configured correctly	{\"controller\": \"DragonflyPodLifecycle\", \"controllerGroup\": \"\", \"controllerKind\": \"Pod\", \"Pod\": {\"name\":\"t4408-dragonfly-2\",\"namespace\":\"online-store\"}, \"namespace\": \"online-store\", \"name\": \"t4408-dragonfly-2\", \"reconcileID\": \"85a08725-87bb-4299-b1f2-cf86d7614b6b\", \"pod\": \"t4408-dragonfly-2\"}
> 2025-12-02T03:16:56Z	INFO	received	{\"controller\": \"DragonflyPodLifecycle\", \"controllerGroup\": \"\", \"controllerKind\": \"Pod\", \"Pod\": {\"name\":\"t4408-dragonfly-1\",\"namespace\":\"online-store\"}, \"namespace\": \"online-store\", \"name\": \"t4408-dragonfly-1\", \"reconcileID\": \"27cffc41-a69f-45ce-b21f-7a125d555df3\", \"pod\": {\"name\":\"t4408-dragonfly-1\",\"namespace\":\"online-store\"}}
> ...
> ```
> I have attached the full CSV logs for the operator at this timestamp:
> 
> [data_exported_2025-12-02_123531.csv](https://github.com/user-attachments/files/23881010/data_exported_2025-12-02_123531.csv)
> 
>  

 _Originally posted by @xuekat in [#402](https://github.com/dragonflydb/dragonfly-operator/issues/402#issuecomment-3601832543)_
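
To make the unhealthy signature easier to spot across many pods, the output of `kubectl logs` can be scanned for the three lines that differ from the healthy run: `Exiting on signal ...`, `Error calling accept ...`, and `Listener stopped for port ...`. The following is a minimal sketch, assuming the glog-style line prefix seen in the logs above. `scan_startup_log` and `StartupReport` are hypothetical names chosen for illustration and are not part of Dragonfly or the operator.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# glog-style prefix as it appears in the pod logs above, e.g.
# "W20251202 12:04:38.763005    13 listener_interface.cc:124] message"
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[\w.]+:\d+)\] (?P<message>.*)$"
)
STOPPED = re.compile(r"Listener stopped for port (\d+)")


@dataclass
class StartupReport:
    """Summary of the restart-loop markers found in one pod log (hypothetical type)."""
    exit_signal_at: Optional[str] = None      # time of the first "Exiting on signal" line
    accept_errors: int = 0                    # count of "Error calling accept" warnings
    stopped_ports: list = field(default_factory=list)

    @property
    def looks_unhealthy(self) -> bool:
        return self.exit_signal_at is not None or self.accept_errors > 0


def scan_startup_log(text: str) -> StartupReport:
    """Scan raw log text and collect the markers seen only in the unhealthy run."""
    report = StartupReport()
    for raw in text.splitlines():
        # Tolerate Markdown quote prefixes ("> ") when pasting from the issue.
        line = raw.lstrip("> ").strip()
        match = GLOG_LINE.match(line)
        if match is None:
            continue  # banner art, help text, shell prompt, etc.
        message = match["message"]
        if message.startswith("Exiting on signal"):
            report.exit_signal_at = report.exit_signal_at or match["time"]
        elif message.startswith("Error calling accept"):
            report.accept_errors += 1
        else:
            stopped = STOPPED.match(message)
            if stopped:
                report.stopped_ports.append(int(stopped.group(1)))
    return report
```

Run against the unhealthy log above, this should report the `Exiting on signal Terminated` line at `12:04:38.408107`, which appears before the listeners start at `12:04:38.762`, followed by the two accept warnings and both listeners stopping. The healthy log contains none of these lines.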

Dragonfly stuck in restart loop #426