fix: nil pointer panic in AWS ALB provider crashes process#743
fix: nil pointer panic in AWS ALB provider crashes process#743
Conversation
lb.DNSName and target.Target.Id can be nil for internal/failed ALBs. Dereferencing without check at alb.go:66 panics in a nested goroutine, crashing the entire Aurora pod and leaving enumerations stuck in queued. Confirmed via prod crash log: goroutine 4511917 panicked in listELBV2Resources at alb.go:66 for Group1001 customer (user 349180).
Same pattern as ALB — lb.DNSName and instance.InstanceId can be nil.
A panic in any nested goroutine (e.g., nil pointer from unexpected AWS API response) crashes the entire process. Add defer/recover to worker() and all GetResource goroutines so panics are logged but don't kill the process. Root cause of Group1001 incident: alb.go goroutine panicked, no recovery existed, Aurora pod crashed, enumeration stuck in queued.
Remove [cloudlist-debug] prints from aws.go and instances.go, clean up debug variables (workerCount, totalGoroutines) added during Group1001 incident investigation.
Remove [cloudlist-debug] prints from aws.go and instances.go, clean up debug variables (workerCount, totalGoroutines) added during Group1001 incident investigation.
Neo - PR Security ReviewCaution Review could not be completed Task aborted before completion Suggestion: Try again with Comment |
WalkthroughAdds panic-recovery around many AWS provider goroutines and the core worker, introduces nil-safety guards in ELB/ALB listing logic to avoid nil dereferences, adds unit tests for recovery and nil-handling, and updates Changes
Sequence Diagram(s)sequenceDiagram
participant Coordinator as Coordinator
participant WorkerFunc as worker()
participant RegionGoroutine as Region Goroutine
participant ResultsCh as Results Channel
participant Aggregator as Aggregator / Merge
Coordinator->>WorkerFunc: spawn worker per region/client
WorkerFunc->>RegionGoroutine: invoke getResourcesFunc()
Note right of RegionGoroutine: defer recover() installed
alt getResourcesFunc panics
RegionGoroutine-->>ResultsCh: send result{resources: nil, err: "panic..."}
else success or error
RegionGoroutine-->>ResultsCh: send result{resources:*, err:nil/err}
end
Aggregator->>ResultsCh: read results, merge successful resources, collect errs
Aggregator->>Coordinator: return merged resources or aggregated "all workers failed" error
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Neo - PR Security ReviewNo security issues found Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/providers/aws/elb.go (1)
93-95:⚠️ Potential issue | 🟠 MajorPotential bug: checking wrong client variable.
Line 93 checks
ep.elbClient == nil, but the goroutine usesec2Client(passed as parameter) at line 102. This check should verifyec2Clientinstead, otherwise a nilec2Clientwould cause a panic whenep.elbClientis non-nil.Proposed fix
- if ep.elbClient == nil { + if ec2Client == nil { continue }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/providers/aws/elb.go` around lines 93 - 95, The nil-check is using ep.elbClient but the goroutine uses the ec2Client parameter, so replace or augment that check to verify ec2Client (and ideally both) before launching the goroutine; update the conditional that currently references ep.elbClient to check ec2Client != nil (or both ep.elbClient != nil && ec2Client != nil) to prevent a nil ec2Client panic when the goroutine runs.
🧹 Nitpick comments (1)
pkg/providers/aws/recovery_test.go (1)
11-57: Good test coverage for worker panic recovery.The tests appropriately cover the three key scenarios: panic recovery, normal operation, and error propagation. Consider adding
t.Parallel()to each test for consistency with the ALB and ELB test files.Optional: Add t.Parallel() for consistency
func TestWorkerPanicRecovery(t *testing.T) { + t.Parallel() results := make(chan result, 1)func TestWorkerNormalOperation(t *testing.T) { + t.Parallel() results := make(chan result, 1)func TestWorkerErrorPropagation(t *testing.T) { + t.Parallel() results := make(chan result, 1)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/providers/aws/recovery_test.go` around lines 11 - 57, Add t.Parallel() as the first statement in each test function (TestWorkerPanicRecovery, TestWorkerNormalOperation, TestWorkerErrorPropagation) so they run concurrently like the ALB/ELB tests; insert t.Parallel() at the top of each test body right after the function signature to enable parallel test execution and ensure consistent test behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/providers/aws/aws.go`:
- Around line 438-440: The panic recovered in worker() is being sent back as
result.err but Provider.Resources() currently swallows non-nil errors (it does
`continue`) and returns nil, so callers never see worker panics; update
Provider.Resources() to handle result.err by returning the error (or aggregating
errors) instead of continuing—e.g., when reading from the channel in
Provider.Resources(), if res.err != nil then return nil, res.err (or append to
an error slice and return a combined error) so panic-derived errors from
worker() are propagated to callers.
In `@pkg/providers/aws/instances.go`:
- Around line 42-46: The local defer with recover() inside GetResource()
swallows child-goroutine panics and still returns nil, causing partial
successes; change it so panics are propagated or translated into returned
errors: remove the local defer that calls recover() in GetResource() (or replace
it with a panic-to-error handler that sends the recovered value into an error
channel), ensure goroutine workers report errors into a shared errs channel (or
set a named error return), wait for all goroutines to finish, and if any
panic/error was received return a non-nil error from GetResource() instead of
nil; apply the same pattern for the other providers (s3.go, route53.go,
lightsail.go, lambda-api-gateway.go, elb.go, eks.go, ecs.go, cloudfront.go,
alb.go) so the worker-level recovery in pkg/providers/aws/aws.go can handle
panics consistently.
---
Outside diff comments:
In `@pkg/providers/aws/elb.go`:
- Around line 93-95: The nil-check is using ep.elbClient but the goroutine uses
the ec2Client parameter, so replace or augment that check to verify ec2Client
(and ideally both) before launching the goroutine; update the conditional that
currently references ep.elbClient to check ec2Client != nil (or both
ep.elbClient != nil && ec2Client != nil) to prevent a nil ec2Client panic when
the goroutine runs.
---
Nitpick comments:
In `@pkg/providers/aws/recovery_test.go`:
- Around line 11-57: Add t.Parallel() as the first statement in each test
function (TestWorkerPanicRecovery, TestWorkerNormalOperation,
TestWorkerErrorPropagation) so they run concurrently like the ALB/ELB tests;
insert t.Parallel() at the top of each test body right after the function
signature to enable parallel test execution and ensure consistent test behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fbfef235-cadd-4a38-a659-68a00e7044cb
📒 Files selected for processing (15)
.gitignorepkg/providers/aws/alb.gopkg/providers/aws/alb_test.gopkg/providers/aws/aws.gopkg/providers/aws/cloudfront.gopkg/providers/aws/ecs.gopkg/providers/aws/eks.gopkg/providers/aws/elb.gopkg/providers/aws/elb_test.gopkg/providers/aws/instances.gopkg/providers/aws/lambda-api-gateway.gopkg/providers/aws/lightsail.gopkg/providers/aws/recovery_test.gopkg/providers/aws/route53.gopkg/providers/aws/s3.go
…heck - Resources() now logs worker errors and returns error when all workers fail instead of silently swallowing panic-derived errors - Inner GetResource goroutines collect panics into error slice instead of just logging — returns error if all goroutines failed - elb.go: fix wrong nil check (ep.elbClient → ec2Client parameter) - recovery_test.go: add t.Parallel() to all tests
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/providers/aws/ecs.go (1)
45-63:⚠️ Potential issue | 🟠 MajorRecord worker errors and successes explicitly.
errsonly grows on recovered panics. OrdinarylistECSResourcesfailures are still dropped by theif ... err == nilguard, so a full AccessDenied/throttling outage can return an empty list withnil. Also,len(list.Items)is not a reliable success signal because a worker can legitimately scan a region/account and find zero ECS resources. The same wrapper was copied into the sibling AWS providers in this PR.🛠️ Minimal fix
list := schema.NewResources() var wg sync.WaitGroup var mu sync.Mutex var errs []error +var successfulWorkers int @@ - if resources, err := ep.listECSResources(ecsClient, ec2Client); err == nil { - mu.Lock() - list.Merge(resources) - mu.Unlock() - } + resources, err := ep.listECSResources(ecsClient, ec2Client) + mu.Lock() + defer mu.Unlock() + if err != nil { + errs = append(errs, errors.Wrap(err, "ecs worker failed")) + return + } + successfulWorkers++ + list.Merge(resources) @@ -if len(errs) > 0 && len(list.Items) == 0 { +if len(errs) > 0 && successfulWorkers == 0 { return nil, fmt.Errorf("ecs: all workers failed: %v", errs) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/providers/aws/ecs.go` around lines 45 - 63, The code currently only records panics into errs and ignores non-panic errors from listECSResources; change the goroutine body so that when listECSResources returns an error you append that error to errs (with mu locked), and when it succeeds you merge into list (as currently done); after wg.Wait, determine failure by comparing len(errs) to the total number of workers (e.g., len(ecsClients) or the iteration count) rather than checking list.Items, and return an aggregated error only if all workers errored. Keep the existing panic recover block, but add explicit error appends for the listECSResources call and use a worker-count check to decide "all workers failed."pkg/providers/aws/alb.go (1)
45-56:⚠️ Potential issue | 🔴 CriticalFix the mutex deadlock in panic recovery.
A panic during
list.Merge()whilemuis locked will cause the deferred panic handler to deadlock when it triesmu.Lock()on the same goroutine that already holds the mutex. Go'ssync.Mutexis not reentrant.Use a nested anonymous function with
defer mu.Unlock()to ensure the mutex is released before the outer panic handler runs:Safer pattern
- if resources, err := ep.listELBV2Resources(albClient, ec2Client); err == nil { - mu.Lock() - list.Merge(resources) - mu.Unlock() - } + if resources, err := ep.listELBV2Resources(albClient, ec2Client); err == nil { + func() { + mu.Lock() + defer mu.Unlock() + list.Merge(resources) + }() + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/providers/aws/alb.go` around lines 45 - 56, The deferred panic handler deadlocks if a panic occurs while mu is already locked around list.Merge; fix by removing mu.Lock() from the panic handler and enclosing the mutex-protected work in a nested anonymous function that does mu.Lock() with defer mu.Unlock() before calling list.Merge (i.e., call ep.listELBV2Resources(...), then if err==nil invoke an inline func that does mu.Lock(); defer mu.Unlock(); list.Merge(resources)) so the mutex is always released before any outer panic recovery runs and errs is appended safely.
🧹 Nitpick comments (1)
pkg/providers/aws/elb.go (1)
31-31: Consider using context for cancellation support in a follow-up.The
ctxparameter is received but not passed to AWS SDK calls (DescribeLoadBalancers,DescribeInstances). While this is pre-existing behavior and outside the scope of this panic-fix PR, the coding guidelines specify that context should be honored in long-running API calls. The AWS SDK v1 provides*WithContextvariants for this purpose.As per coding guidelines: "Always pass and honor context.Context in Resources() and long-running API calls to support cancellation"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/providers/aws/elb.go` at line 31, GetResource currently accepts ctx but does not pass it to AWS SDK calls; update elbProvider.GetResource to use the context-aware SDK variants (e.g., call DescribeLoadBalancersWithContext and DescribeInstancesWithContext) and propagate ctx into any subsequent AWS API invocations so those long-running calls honor cancellation; locate calls to DescribeLoadBalancers and DescribeInstances in GetResource and replace them with their *WithContext counterparts, forwarding the ctx and preserving existing request/input parameters and error handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/providers/aws/aws.go`:
- Around line 511-521: The code currently infers any successful worker by
checking finalResources.Items length, which is wrong because a successful worker
may return an empty *schema.Resources; change the loop over results to track
successful workers explicitly (e.g., introduce a successCount and increment it
when result.err == nil, alongside calling
finalResources.Merge(result.resources)), then replace the condition `if
len(errs) > 0 && len(finalResources.Items) == 0` with `if len(errs) > 0 &&
successCount == 0` (and update the error message if needed) so failures are
reported only when no worker actually succeeded; reference the results loop,
result.err, finalResources.Merge, finalResources.Items, and errs to locate where
to add successCount.
---
Outside diff comments:
In `@pkg/providers/aws/alb.go`:
- Around line 45-56: The deferred panic handler deadlocks if a panic occurs
while mu is already locked around list.Merge; fix by removing mu.Lock() from the
panic handler and enclosing the mutex-protected work in a nested anonymous
function that does mu.Lock() with defer mu.Unlock() before calling list.Merge
(i.e., call ep.listELBV2Resources(...), then if err==nil invoke an inline func
that does mu.Lock(); defer mu.Unlock(); list.Merge(resources)) so the mutex is
always released before any outer panic recovery runs and errs is appended
safely.
In `@pkg/providers/aws/ecs.go`:
- Around line 45-63: The code currently only records panics into errs and
ignores non-panic errors from listECSResources; change the goroutine body so
that when listECSResources returns an error you append that error to errs (with
mu locked), and when it succeeds you merge into list (as currently done); after
wg.Wait, determine failure by comparing len(errs) to the total number of workers
(e.g., len(ecsClients) or the iteration count) rather than checking list.Items,
and return an aggregated error only if all workers errored. Keep the existing
panic recover block, but add explicit error appends for the listECSResources
call and use a worker-count check to decide "all workers failed."
---
Nitpick comments:
In `@pkg/providers/aws/elb.go`:
- Line 31: GetResource currently accepts ctx but does not pass it to AWS SDK
calls; update elbProvider.GetResource to use the context-aware SDK variants
(e.g., call DescribeLoadBalancersWithContext and DescribeInstancesWithContext)
and propagate ctx into any subsequent AWS API invocations so those long-running
calls honor cancellation; locate calls to DescribeLoadBalancers and
DescribeInstances in GetResource and replace them with their *WithContext
counterparts, forwarding the ctx and preserving existing request/input
parameters and error handling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 85278d54-811c-404f-bb80-1a9341584bcd
📒 Files selected for processing (12)
pkg/providers/aws/alb.gopkg/providers/aws/aws.gopkg/providers/aws/cloudfront.gopkg/providers/aws/ecs.gopkg/providers/aws/eks.gopkg/providers/aws/elb.gopkg/providers/aws/instances.gopkg/providers/aws/lambda-api-gateway.gopkg/providers/aws/lightsail.gopkg/providers/aws/recovery_test.gopkg/providers/aws/route53.gopkg/providers/aws/s3.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/providers/aws/recovery_test.go
- pkg/providers/aws/lambda-api-gateway.go
- pkg/providers/aws/cloudfront.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/providers/aws/aws.go (1)
438-442: Consider capturing stack trace for better panic diagnostics.The panic recovery correctly prevents process crashes, but
%von the recovered value loses the stack trace. For production incidents like the one mentioned in the PR (Group1001), having the full stack trace would significantly aid debugging.Capture stack trace in panic error
+import "runtime/debug" + func worker(ctx context.Context, fn getResourcesFunc, ch chan<- result) { defer func() { if r := recover(); r != nil { - ch <- result{resources: nil, err: fmt.Errorf("panic in provider worker: %v", r)} + ch <- result{resources: nil, err: fmt.Errorf("panic in provider worker: %v\n%s", r, debug.Stack())} } }() resources, err := fn(ctx) ch <- result{resources, err} }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/providers/aws/aws.go` around lines 438 - 442, The defer recover block in the provider worker currently sends only the recovered value (r) over ch via fmt.Errorf, losing the stack trace; update the deferred func (the recover block that sends result on ch) to capture the goroutine stack (e.g., runtime/debug.Stack() or runtime.Stack) and include it in the error passed to ch so the sent error contains both the panic value and full stack trace (ensure you add the required import for runtime/debug and format the error like "panic in provider worker: %v\n%s" combining r and the stack).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/providers/aws/aws.go`:
- Around line 438-442: The defer recover block in the provider worker currently
sends only the recovered value (r) over ch via fmt.Errorf, losing the stack
trace; update the deferred func (the recover block that sends result on ch) to
capture the goroutine stack (e.g., runtime/debug.Stack() or runtime.Stack) and
include it in the error passed to ch so the sent error contains both the panic
value and full stack trace (ensure you add the required import for runtime/debug
and format the error like "panic in provider worker: %v\n%s" combining r and the
stack).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bc52823c-d75a-447d-92e7-5849ef114e44
📒 Files selected for processing (1)
pkg/providers/aws/aws.go
Summary
alb.go:66) that crashes the entire process whenlb.DNSNameis nilelb.go:66)worker()function and all 10 AWS provider goroutines — a panic in any nested goroutine no longer kills the processRoot Cause
Group1001 customer (user 349180) had ~55 cloud discoveries stuck in
queuedstate for months. Investigation traced to:alb.go:66—*lb.DNSNamedereferences nil for internal/failed ALBsGetResource(line 42)queuedpermanentlyProd crash log:
Changes
lb.DNSName,lb.LoadBalancerName,target.Target.Idlb.DNSName,lb.LoadBalancerName,instance.InstanceIdworker()function (returns error instead of crashing)defer recover()in everyGetResourcegoroutinealb_test.go,elb_test.go,recovery_test.go— 9 tests covering nil safety and panic recoveryTest plan
go test ./pkg/providers/aws/ -v— 9/9 passgo build ./pkg/providers/aws/— cleanSummary by CodeRabbit
Bug Fixes
Tests
Chores