-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Description
Bug Description
There is a bug in the Consul service discovery implementation where a single malformed node (missing Service field) can cause the entire service to become unavailable.
Root Cause
The :: CONTINUE :: label is placed outside the inner for loop (line 648), but the goto CONTINUE statement is used inside the inner loop (line 578). This causes the goto statement to skip not only the current node but also:
- All remaining nodes in the result
- The sorting logic
- The critical
up_services[service_name] = nodesassignment
Affected Code
File: apisix/discovery/consul/init.lua
for service_name, _ in pairs(catalog_res.body) do
-- ... get nodes from consul ...
for _, node in ipairs(result.body) do
local service = node.Service
if not service then
goto CONTINUE -- Line 578: BUG! This jumps outside the inner loop
end
-- ... process node ...
end -- Inner loop ends here (line 633)
-- Sorting logic
if nodes then
-- ... sort nodes ...
end
up_services[service_name] = nodes
end -- Outer loop ends here
:: CONTINUE :: -- Line 648: Label is HERE, outside both loops!Impact
When a single node in the Consul response has a missing Service field:
- The
goto CONTINUEis triggered - It jumps to line 648, skipping the entire remaining processing for that service
- The
up_services[service_name] = nodesassignment is never executed - APISIX does not update the node list for that service
- If this happens during startup or after a previous service removal, the service becomes completely unavailable
Proposed Fix
Move the :: CONTINUE :: label inside the inner loop, or rename the labels to avoid confusion:
for service_name, _ in pairs(catalog_res.body) do
-- ... get nodes from consul ...
for _, node in ipairs(result.body) do
local service = node.Service
if not service then
goto CONTINUE_NODE -- Skip only this node
end
-- ... process node ...
:: CONTINUE_NODE :: -- Label inside the inner loop
end -- Inner loop ends
-- Sorting and assignment continue normally
if nodes then
-- ... sort nodes ...
end
up_services[service_name] = nodes
end
:: CONTINUE :: -- Keep outer label for service-level skippingAdditional Notes
This bug has likely existed for a long time but may have gone unnoticed due to the rarity of malformed Consul responses. However, it represents a critical reliability issue as a single bad node entry can cause complete service unavailability.
Expected Behavior
When a single node is malformed, only that specific node should be skipped. Other healthy nodes should still be processed and added to up_services.
Error Logs
None
Steps to Reproduce
- Register a service in Consul with multiple instances
- Somehow create a node entry with a missing
Servicefield (e.g., through API manipulation or data corruption) - APISIX will fail to discover ANY nodes for that service, even the healthy ones
- Requests to that service will return "no valid upstream node" errors
Environment
- APISIX version (run
apisix version): - Operating system (run
uname -a): - OpenResty / Nginx version (run
openresty -Vornginx -V): - etcd version, if relevant (run
curl http://127.0.0.1:9090/v1/server_info): - APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run
luarocks --version):
Metadata
Metadata
Assignees
Labels
Type
Projects
Status