Skip to content

bug: single node invalid will cause service unavailable #12937

@jizhuozhi

Description

@jizhuozhi

Bug Description

There is a bug in the Consul service discovery implementation where a single malformed node (missing Service field) can cause the entire service to become unavailable.

Root Cause

The :: CONTINUE :: label is placed outside the inner for loop (line 648), but the goto CONTINUE statement is used inside the inner loop (line 578). This causes the goto statement to skip not only the current node but also:

  • All remaining nodes in the result
  • The sorting logic
  • The critical up_services[service_name] = nodes assignment

Affected Code

File: apisix/discovery/consul/init.lua

for service_name, _ in pairs(catalog_res.body) do
    -- ... get nodes from consul ...
    
    for _, node in ipairs(result.body) do
        local service = node.Service
        if not service then
            goto CONTINUE  -- Line 578: BUG! This jumps outside the inner loop
        end
        
        -- ... process node ...
    end  -- Inner loop ends here (line 633)
    
    -- Sorting logic
    if nodes then
        -- ... sort nodes ...
    end
    up_services[service_name] = nodes
end  -- Outer loop ends here

:: CONTINUE ::  -- Line 648: Label is HERE, outside both loops!

Impact

When a single node in the Consul response has a missing Service field:

  1. The goto CONTINUE is triggered
  2. It jumps to line 648, skipping the entire remaining processing for that service
  3. The up_services[service_name] = nodes assignment is never executed
  4. APISIX does not update the node list for that service
  5. If this happens during startup or after a previous service removal, the service becomes completely unavailable

Proposed Fix

Move the :: CONTINUE :: label inside the inner loop, or rename the labels to avoid confusion:

for service_name, _ in pairs(catalog_res.body) do
    -- ... get nodes from consul ...
    
    for _, node in ipairs(result.body) do
        local service = node.Service
        if not service then
            goto CONTINUE_NODE  -- Skip only this node
        end
        
        -- ... process node ...
        
        :: CONTINUE_NODE ::  -- Label inside the inner loop
    end  -- Inner loop ends
    
    -- Sorting and assignment continue normally
    if nodes then
        -- ... sort nodes ...
    end
    up_services[service_name] = nodes
end

:: CONTINUE ::  -- Keep outer label for service-level skipping

Additional Notes

This bug has likely existed for a long time but may have gone unnoticed due to the rarity of malformed Consul responses. However, it represents a critical reliability issue as a single bad node entry can cause complete service unavailability.

Expected Behavior

When a single node is malformed, only that specific node should be skipped. Other healthy nodes should still be processed and added to up_services.

Error Logs

None

Steps to Reproduce

  1. Register a service in Consul with multiple instances
  2. Somehow create a node entry with a missing Service field (e.g., through API manipulation or data corruption)
  3. APISIX will fail to discover ANY nodes for that service, even the healthy ones
  4. Requests to that service will return "no valid upstream node" errors

Environment

  • APISIX version (run apisix version):
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V):
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    Status

    🏗 In progress

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions