feat: force-delete Subnet/SG blocked by orphan Lambda VPC ENIs #643

Merged
go-to-k merged 8 commits into main from feat/issue-637-vpc-lambda-eni
May 4, 2026

@go-to-k go-to-k commented May 4, 2026

Summary

  • Adds EC2SubnetOperator and EC2SecurityGroupOperator so delstack can force-delete AWS::EC2::Subnet / AWS::EC2::SecurityGroup stuck in DELETE_FAILED because AWS Lambda has not yet released its VPC ENIs after a VPC-attached function was deleted.
  • New cleanupOrphanLambdaENIsByFilter package-level helper in internal/operation/lambda_function.go finds available ENIs whose description starts with AWS Lambda VPC ENI and that match the given Subnet ID / SecurityGroup ID, then deletes them in parallel via errgroup + semaphore.NewWeighted(runtime.NumCPU()). The operators then delete the Subnet / SecurityGroup itself via the EC2 API; CFN's existing RetainResources path skips the resource on the next DeleteStack retry.
  • pkg/client/ec2.go is kept as thin SDK wrappers (per review feedback): only adds DeleteSubnet / DeleteSecurityGroup next to the existing DescribeNetworkInterfaces / DeleteNetworkInterface.
  • Filter is intentionally Lambda-scoped only (status=available + description prefix) to avoid touching ENIs from VPC Endpoints / ELB / RDS / EFS, which are out of scope for this PR.
  • E2E scenario e2e/vpc_lambda/ reproduces the orphan ENI condition deterministically: CDK deploys VPC + private Subnet + SecurityGroup, then deploy.go injects synthetic ENIs (CreateNetworkInterface with description AWS Lambda VPC ENI-...) into that Subnet+SG. delstack itself drives DeleteStack and waits via its internal CloudFormation delete waiter. Wired into the Makefile as testgen_vpc_lambda / e2e_vpc_lambda. Real Hyperplane ENI release timing is non-deterministic, so deploying a real VPC Lambda made the test flaky; synthetic ENIs exercise the same operator code path every run.
  • Updates .claude/skills/add-operator/SKILL.md with what was learned: pkg/client stays SDK-1:1, the operator deletion model (operator deletes via SDK, CFN skips via RetainResources), the cross-resource cleanup pattern (operator removes out-of-stack dependencies with tight filters), and the dedicated e2e/<scenario>/ layout (cdk/.gitignore, module-root /<basename> ignore, Makefile testgen_ / e2e_ targets).
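The filter-then-delete-in-parallel idea from the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in internal/operation/lambda_function.go: the `ENI` struct, the `eniDeleter` interface, and `fakeEC2` are hypothetical stand-ins, context plumbing is omitted, and a buffered channel plus `sync.WaitGroup` is swapped in for errgroup + semaphore.NewWeighted so the sketch needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// ENI is a hypothetical, trimmed-down view of an EC2 network interface.
type ENI struct {
	ID          string
	Status      string
	Description string
}

// eniDeleter is a hypothetical stand-in for the thin DeleteNetworkInterface wrapper.
type eniDeleter interface {
	DeleteNetworkInterface(id string) error
}

const lambdaENIPrefix = "AWS Lambda VPC ENI"

// filterOrphanLambdaENIs keeps only unattached ("available") ENIs whose
// description starts with the Lambda prefix, so ENIs owned by other
// services are never selected.
func filterOrphanLambdaENIs(enis []ENI) []ENI {
	var out []ENI
	for _, e := range enis {
		if e.Status == "available" && strings.HasPrefix(e.Description, lambdaENIPrefix) {
			out = append(out, e)
		}
	}
	return out
}

// deleteENIs deletes the given ENIs with at most limit deletions in flight
// and returns every failure joined into one error (nil if none failed).
func deleteENIs(c eniDeleter, enis []ENI, limit int) error {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, e := range enis {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit deletions are running
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := c.DeleteNetworkInterface(id); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("delete %s: %w", id, err))
				mu.Unlock()
			}
		}(e.ID)
	}
	wg.Wait()
	return errors.Join(errs...)
}

// fakeEC2 is an in-memory fake that records which ENI IDs were deleted.
type fakeEC2 struct {
	mu      sync.Mutex
	deleted map[string]bool
}

func (f *fakeEC2) DeleteNetworkInterface(id string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.deleted[id] = true
	return nil
}

func main() {
	enis := []ENI{
		{ID: "eni-1", Status: "available", Description: "AWS Lambda VPC ENI-example"},
		{ID: "eni-2", Status: "in-use", Description: "AWS Lambda VPC ENI-example"},
		{ID: "eni-3", Status: "available", Description: "ELB app/example"},
	}
	fake := &fakeEC2{deleted: map[string]bool{}}
	orphans := filterOrphanLambdaENIs(enis)
	if err := deleteENIs(fake, orphans, runtime.NumCPU()); err != nil {
		fmt.Println("cleanup failed:", err)
		return
	}
	fmt.Println("orphans selected:", len(orphans), "deleted:", len(fake.deleted))
}
```

Only `eni-1` passes the filter here: `eni-2` is still attached and `eni-3` does not carry the Lambda description prefix, which mirrors the "Lambda-scoped only" constraint described above.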

Closes #637

Test plan

  • make test passes (unit tests for new EC2 client methods via SDK middleware mocks; new operators via gomock against IEC2)
  • make lint is clean (0 issues.)
  • make e2e_vpc_lambda end-to-end: CDK deploy + synthetic ENIs + delstack → stack reached DELETE_FAILED, the new operators removed the ENIs and the Subnet / SecurityGroup, and the stack was successfully deleted. Verified via AWS CLI that the stack, both ENIs, the Subnet, and the SecurityGroup all return NotFound.
  • Regression check (maintainer): make e2e_full and make e2e_preprocessor still pass.

…Lambda VPC ENIs (#637)

Adds EC2SubnetOperator and EC2SecurityGroupOperator that detect and
delete orphan AWS Lambda VPC ENIs (status=available, description prefix
"AWS Lambda VPC ENI") which block Subnet/SecurityGroup deletion after a
VPC-attached Lambda function has been deleted but Lambda has not yet
released its ENIs.

The new IEC2.DeleteOrphanLambdaENIsByFilter helper finds orphan ENIs
filtered by subnet-id or group-id and deletes them in parallel; the
operators then delete the Subnet/SecurityGroup itself via the EC2 API.
CFN reuses the existing RetainResources path to skip the resource on
the next DeleteStack retry.

Also adds an E2E scenario under e2e/vpc_lambda/ that reproduces the
orphan ENI condition by invoking the function and then deleting it
out-of-band via the Lambda SDK.
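
The ordering described in this commit message (remove the orphan ENIs first, then delete the Subnet itself) can be illustrated with a small in-memory model. This is a sketch under stated assumptions: `fakeVPC`, `forceDeleteSubnet`, and the error value are hypothetical names, and the fake only imitates the DependencyViolation behaviour of the EC2 API rather than calling it.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errDependencyViolation imitates the EC2 error returned when a subnet
// still contains network interfaces.
var errDependencyViolation = errors.New("DependencyViolation")

type eni struct {
	subnetID    string
	status      string
	description string
}

// fakeVPC is a hypothetical in-memory stand-in for EC2.
type fakeVPC struct {
	enis    map[string]eni // ENI ID -> ENI
	subnets map[string]bool
}

// orphanLambdaENIs returns IDs of unattached Lambda ENIs in the subnet.
func (v *fakeVPC) orphanLambdaENIs(subnetID string) []string {
	var ids []string
	for id, e := range v.enis {
		if e.subnetID == subnetID && e.status == "available" &&
			strings.HasPrefix(e.description, "AWS Lambda VPC ENI") {
			ids = append(ids, id)
		}
	}
	return ids
}

func (v *fakeVPC) deleteENI(id string) { delete(v.enis, id) }

// deleteSubnet refuses to delete a subnet that still holds any ENI.
func (v *fakeVPC) deleteSubnet(subnetID string) error {
	for _, e := range v.enis {
		if e.subnetID == subnetID {
			return errDependencyViolation
		}
	}
	delete(v.subnets, subnetID)
	return nil
}

// forceDeleteSubnet removes orphan Lambda ENIs, then deletes the subnet.
// ENIs that do not match the Lambda filter are left alone, so the subnet
// delete can still fail if another service owns an ENI there.
func forceDeleteSubnet(v *fakeVPC, subnetID string) error {
	for _, id := range v.orphanLambdaENIs(subnetID) {
		v.deleteENI(id)
	}
	return v.deleteSubnet(subnetID)
}

func main() {
	v := &fakeVPC{
		enis: map[string]eni{
			"eni-1": {"subnet-a", "available", "AWS Lambda VPC ENI-example"},
		},
		subnets: map[string]bool{"subnet-a": true},
	}
	fmt.Println("plain delete:", v.deleteSubnet("subnet-a"))
	fmt.Println("force delete:", forceDeleteSubnet(v, "subnet-a"))
}
```

Once the operator has removed the resource this way, nothing is left for CloudFormation to delete, which is why the next DeleteStack retry skips it via RetainResources.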
Comment thread .gitignore Outdated
go-to-k added 3 commits May 4, 2026 16:15
PR #643 review: keep build-artifact ignores scoped to each e2e module
(matches the existing cdk/.gitignore pattern) instead of leaking into
the repo-root .gitignore.

…del, and E2E layouts

Lessons learned from PR #643 (issue #637):
- Make explicit that pkg/client methods may be high-level helpers (not
  only thin SDK 1:1 wrappers).
- Document the operator deletion model (operator deletes via SDK, CFN
  skips the resource via RetainResources on retry) so contributors do
  not assume CFN re-deletes.
- Document the cross-resource cleanup pattern (an operator may remove
  out-of-stack dependencies that block deletion, with tight filters).
- Document the dedicated e2e/<scenario>/ layout (cdk/.gitignore for
  cdk.context.json + module-root /<basename> ignore for the deploy.go
  build artifact, plus Makefile testgen_/e2e_ targets).

Previously deploy.go deleted the Lambda via the Lambda SDK and then let
delstack drive the first DeleteStack itself. AWS Lambda often released
the ENIs by the time delstack reached the Subnet/SG, so the stack
deleted cleanly on the first try and the new EC2SubnetOperator /
EC2SecurityGroupOperator code paths were never exercised.

Switch to the actual user-facing scenario from issue #637:
- Invoke the Lambda to provision the VPC ENIs.
- Trigger an ordinary CloudFormation DeleteStack (NOT delstack).
- Poll until the stack reaches DELETE_FAILED. If it reaches
  DELETE_COMPLETE instead (AWS released ENIs in time) we report failure
  because the orphan ENI scenario was not reproduced.

Update README to explain both scenarios (delstack-first works via the
LambdaVPCDetacher preprocessor; CFN-first fails and needs the new
operators).
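
The polling step in this commit could look roughly like the sketch below. It is an illustration only: `waitForDeleteFailed`, `scriptedStatuses`, and both error values are hypothetical names, and the status source is a scripted function rather than a real DescribeStacks call. The status strings are the standard CloudFormation stack statuses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	// errNotReproduced: AWS released the ENIs in time, so the stack deleted cleanly.
	errNotReproduced = errors.New("stack reached DELETE_COMPLETE: orphan ENI scenario not reproduced")
	// errTimedOut: neither terminal status was seen within maxAttempts polls.
	errTimedOut = errors.New("stack did not reach DELETE_FAILED in time")
)

// waitForDeleteFailed polls getStatus until the stack reaches DELETE_FAILED
// (success for this scenario), DELETE_COMPLETE (failure), or the attempt
// budget is used up.
func waitForDeleteFailed(getStatus func() (string, error), interval time.Duration, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		status, err := getStatus()
		if err != nil {
			return err
		}
		switch status {
		case "DELETE_FAILED":
			return nil
		case "DELETE_COMPLETE":
			return errNotReproduced
		}
		time.Sleep(interval)
	}
	return errTimedOut
}

// scriptedStatuses replays the given statuses in order and then keeps
// returning the last one; it stands in for a real status lookup.
func scriptedStatuses(statuses ...string) func() (string, error) {
	i := 0
	return func() (string, error) {
		s := statuses[i]
		if i < len(statuses)-1 {
			i++
		}
		return s, nil
	}
}

func main() {
	poll := scriptedStatuses("DELETE_IN_PROGRESS", "DELETE_IN_PROGRESS", "DELETE_FAILED")
	err := waitForDeleteFailed(poll, 10*time.Millisecond, 5)
	fmt.Println("scenario reproduced:", err == nil)
}
```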
Comment thread pkg/client/ec2.go Outdated
// "AWS Lambda VPC ENI" and that match the given filter (e.g. subnet-id or group-id),
// then deletes them in parallel. Used to unblock Subnet / SecurityGroup deletion when
// AWS Lambda has not yet released its ENIs after the function was already deleted.
func (c *EC2Client) DeleteOrphanLambdaENIsByFilter(ctx context.Context, filterName, filterValue string) error {

Shouldn't this be in the operator? The client is a wrapper around the SDK client.

go-to-k added 4 commits May 4, 2026 18:55
PR #643 review: pkg/client should stay a thin SDK wrapper. The
discover-and-delete-in-parallel helper for orphan Lambda VPC ENIs is
operation-layer logic, not client-layer.

- Drop IEC2.DeleteOrphanLambdaENIsByFilter and the related test +
  mockgen entries.
- Add cleanupOrphanLambdaENIsByFilter as a package-level helper in
  internal/operation/lambda_function.go (Lambda VPC ENI is Lambda
  domain knowledge, even though the helper calls EC2 APIs).
- Rewire EC2SubnetOperator and EC2SecurityGroupOperator to call the new
  helper. Update their gomock tests to expect DescribeNetworkInterfaces
  and DeleteNetworkInterface directly.
- Update add-operator skill: pkg/client methods stay 1:1 SDK wrappers;
  shared cross-operator helpers live in the operator file that owns the
  domain knowledge.

Real Hyperplane ENIs are released non-deterministically by AWS Lambda,
so deploying a real VPC Lambda often let CFN delete the stack cleanly
on the first pass and the new operator code path was never exercised.

Switch to synthetic ENIs:
- CDK now provisions just VPC + private Subnet + Lambda SecurityGroup,
  exporting their IDs as CFN outputs.
- deploy.go reads those outputs and creates two unattached ENIs whose
  description matches the Lambda VPC ENI prefix so the new operators
  recognise them. CFN DeleteStack then fails with DependencyViolation
  on the SecurityGroup / Subnet, leaving the stack in DELETE_FAILED
  every time.
- README updated to explain why the synthetic approach is used.

Removing triggerCfnDeleteAndWaitForFailure: CFN's DependencyViolation
retry loop on SG/Subnet ran past my 15-min wait without ever flipping
to DELETE_FAILED. delstack already wraps DeleteStack with a CloudFormation
delete-complete waiter (~75 min cap) and tolerates the DELETE_FAILED
end-state, so the script just needs to leave the synthetic ENIs in place
and let delstack take over. README updated accordingly.

A stack with a CfnOutput Export, when it lands in DELETE_FAILED, makes
delstack's dependency graph trip ListImports with ValidationError
("Export ... does not exist"). The synthetic ENI E2E does not need
Exports — the deploy script reads OutputValue via DescribeStacks. Keep
the outputs but drop ExportName so delstack can analyse the stack.
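
Reading an output by key, with no dependency on ExportName, might look like the following sketch. The `StackOutput` struct is a hypothetical, simplified mirror of what DescribeStacks returns, and the output keys used here (`SubnetId`, `SecurityGroupId`) are illustrative guesses, not the keys the CDK app actually defines.

```go
package main

import "fmt"

// StackOutput is a hypothetical, simplified view of a CloudFormation output.
type StackOutput struct {
	Key        string
	Value      string
	ExportName string // empty when the output is not exported
}

// outputValue looks an output up by its key. ExportName plays no part in
// the lookup, so outputs without an export work the same way.
func outputValue(outputs []StackOutput, key string) (string, error) {
	for _, o := range outputs {
		if o.Key == key {
			return o.Value, nil
		}
	}
	return "", fmt.Errorf("output %q not found", key)
}

func main() {
	// Illustrative data only; keys and IDs are made up.
	outputs := []StackOutput{
		{Key: "SubnetId", Value: "subnet-0123"},
		{Key: "SecurityGroupId", Value: "sg-0456"},
	}
	subnetID, err := outputValue(outputs, "SubnetId")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("subnet:", subnetID)
}
```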
@go-to-k go-to-k merged commit 875ed5f into main May 4, 2026
5 checks passed
@go-to-k go-to-k deleted the feat/issue-637-vpc-lambda-eni branch May 4, 2026 10:59
@github-actions github-actions Bot mentioned this pull request May 4, 2026
Linked issue: #637 Force-delete fails on stacks blocked by orphan Lambda VPC ENIs (Subnet/SG dependencies)