feat: force-delete Subnet/SG blocked by orphan Lambda VPC ENIs #643
Merged
…Lambda VPC ENIs (#637) Adds EC2SubnetOperator and EC2SecurityGroupOperator that detect and delete orphan AWS Lambda VPC ENIs (status=available, description prefix "AWS Lambda VPC ENI") which block Subnet/SecurityGroup deletion after a VPC-attached Lambda function has been deleted but Lambda has not yet released its ENIs. The new IEC2.DeleteOrphanLambdaENIsByFilter helper finds orphan ENIs filtered by subnet-id or group-id and deletes them in parallel; the operators then delete the Subnet/SecurityGroup itself via the EC2 API. CFN reuses the existing RetainResources path to skip the resource on the next DeleteStack retry. Also adds an E2E scenario under e2e/vpc_lambda/ that reproduces the orphan ENI condition by invoking the function and then deleting it out-of-band via the Lambda SDK.
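The orphan-ENI filter described in this commit (status=available plus the "AWS Lambda VPC ENI" description prefix) can be sketched as a pure predicate. This is only an illustration: `networkInterface`, `isOrphanLambdaENI`, and `selectOrphanLambdaENIs` are hypothetical names, and the struct is a trimmed-down stand-in for the EC2 SDK type rather than the PR's actual code.

```go
package main

import "strings"

// lambdaENIDescriptionPrefix is the description prefix named in the commit message.
const lambdaENIDescriptionPrefix = "AWS Lambda VPC ENI"

// networkInterface is a hypothetical stand-in for the EC2 SDK's network
// interface type; only the fields the filter needs are modelled here.
type networkInterface struct {
	ID          string
	Status      string
	Description string
}

// isOrphanLambdaENI reports whether an ENI is unattached (status "available")
// and carries the Lambda VPC ENI description prefix.
func isOrphanLambdaENI(eni networkInterface) bool {
	return eni.Status == "available" &&
		strings.HasPrefix(eni.Description, lambdaENIDescriptionPrefix)
}

// selectOrphanLambdaENIs returns the IDs of the ENIs that pass the filter,
// preserving input order.
func selectOrphanLambdaENIs(enis []networkInterface) []string {
	var ids []string
	for _, eni := range enis {
		if isOrphanLambdaENI(eni) {
			ids = append(ids, eni.ID)
		}
	}
	return ids
}
```

Requiring both conditions is what keeps in-use Lambda ENIs and ENIs owned by other services out of the deletion set.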
go-to-k
commented
May 4, 2026
PR #643 review: keep build-artifact ignores scoped to each e2e module (matches the existing cdk/.gitignore pattern) instead of leaking into the repo-root .gitignore.
…del, and E2E layouts

Lessons learned from PR #643 (issue #637):
- Make explicit that pkg/client methods may be high-level helpers (not only thin SDK 1:1 wrappers).
- Document the operator deletion model (operator deletes via SDK, CFN skips the resource via RetainResources on retry) so contributors do not assume CFN re-deletes.
- Document the cross-resource cleanup pattern (an operator may remove out-of-stack dependencies that block deletion, with tight filters).
- Document the dedicated e2e/<scenario>/ layout (cdk/.gitignore for cdk.context.json + module-root /<basename> ignore for the deploy.go build artifact, plus Makefile testgen_/e2e_ targets).
Previously deploy.go deleted the Lambda via the Lambda SDK and then let delstack drive the first DeleteStack itself. AWS Lambda often released the ENIs by the time delstack reached the Subnet/SG, so the stack deleted cleanly on the first try and the new EC2SubnetOperator / EC2SecurityGroupOperator code paths were never exercised.

Switch to the actual user-facing scenario from issue #637:
- Invoke the Lambda to provision the VPC ENIs.
- Trigger an ordinary CloudFormation DeleteStack (NOT delstack).
- Poll until the stack reaches DELETE_FAILED. If it reaches DELETE_COMPLETE instead (AWS released ENIs in time) we report failure because the orphan ENI scenario was not reproduced.

Update README to explain both scenarios (delstack-first works via the LambdaVPCDetacher preprocessor; CFN-first fails and needs the new operators).
go-to-k
commented
May 4, 2026
// "AWS Lambda VPC ENI" and that match the given filter (e.g. subnet-id or group-id),
// then deletes them in parallel. Used to unblock Subnet / SecurityGroup deletion when
// AWS Lambda has not yet released its ENIs after the function was already deleted.
func (c *EC2Client) DeleteOrphanLambdaENIsByFilter(ctx context.Context, filterName, filterValue string) error {
Shouldn't this be in the operator? The client is a wrapper around the SDK client.
PR #643 review: pkg/client should stay a thin SDK wrapper. The discover-and-delete-in-parallel helper for orphan Lambda VPC ENIs is operation-layer logic, not client-layer.
- Drop IEC2.DeleteOrphanLambdaENIsByFilter and the related test + mockgen entries.
- Add cleanupOrphanLambdaENIsByFilter as a package-level helper in internal/operation/lambda_function.go (Lambda VPC ENI is Lambda domain knowledge, even though the helper calls EC2 APIs).
- Rewire EC2SubnetOperator and EC2SecurityGroupOperator to call the new helper. Update their gomock tests to expect DescribeNetworkInterfaces and DeleteNetworkInterface directly.
- Update add-operator skill: pkg/client methods stay 1:1 SDK wrappers; shared cross-operator helpers live in the operator file that owns the domain knowledge.
Real Hyperplane ENIs are released non-deterministically by AWS Lambda, so deploying a real VPC Lambda often let CFN delete the stack cleanly on the first pass and the new operator code path was never exercised. Switch to synthetic ENIs:
- CDK now provisions just VPC + private Subnet + Lambda SecurityGroup, exporting their IDs as CFN outputs.
- deploy.go reads those outputs and creates two unattached ENIs whose description matches the Lambda VPC ENI prefix so the new operators recognise them. CFN DeleteStack then fails with DependencyViolation on the SecurityGroup / Subnet, leaving the stack in DELETE_FAILED every time.
- README updated to explain why the synthetic approach is used.
Removing triggerCfnDeleteAndWaitForFailure: CFN's DependencyViolation retry loop on SG/Subnet ran past my 15-min wait without ever flipping to DELETE_FAILED. delstack already wraps DeleteStack with a CloudFormation delete-complete waiter (~75 min cap) and tolerates the DELETE_FAILED end-state, so the script just needs to leave the synthetic ENIs in place and let delstack take over. README updated accordingly.
A stack with a CfnOutput Export, when it lands in DELETE_FAILED, makes
delstack's dependency graph trip ListImports with ValidationError
("Export ... does not exist"). The synthetic ENI E2E does not need
Exports — the deploy script reads OutputValue via DescribeStacks. Keep
the outputs but drop ExportName so delstack can analyse the stack.
Summary
- Add `EC2SubnetOperator` and `EC2SecurityGroupOperator` so `delstack` can force-delete `AWS::EC2::Subnet` / `AWS::EC2::SecurityGroup` stuck in `DELETE_FAILED` because AWS Lambda has not yet released its VPC ENIs after a VPC-attached function was deleted.
- The `cleanupOrphanLambdaENIsByFilter` package-level helper in `internal/operation/lambda_function.go` finds available ENIs whose description starts with `AWS Lambda VPC ENI` and that match the given Subnet ID / SecurityGroup ID, then deletes them in parallel via `errgroup` + `semaphore.NewWeighted(runtime.NumCPU())`. The operators then delete the Subnet / SecurityGroup itself via the EC2 API; CFN's existing `RetainResources` path skips the resource on the next `DeleteStack` retry.
- `pkg/client/ec2.go` is kept as thin SDK wrappers (per review feedback): it only adds `DeleteSubnet` / `DeleteSecurityGroup` next to the existing `DescribeNetworkInterfaces` / `DeleteNetworkInterface`.
- The ENI filter is kept tight (`status=available` + description prefix) to avoid touching ENIs from VPC Endpoints / ELB / RDS / EFS, which are out of scope for this PR.
- `e2e/vpc_lambda/` reproduces the orphan ENI condition deterministically: CDK deploys VPC + private Subnet + SecurityGroup, then `deploy.go` injects synthetic ENIs (`CreateNetworkInterface` with description `AWS Lambda VPC ENI-...`) into that Subnet+SG. `delstack` itself drives `DeleteStack` and waits via its internal CloudFormation delete waiter. Wired into the Makefile as `testgen_vpc_lambda` / `e2e_vpc_lambda`. Real Hyperplane ENI release timing is non-deterministic, so deploying a real VPC Lambda made the test flaky; synthetic ENIs exercise the same operator code path every run.
- Update `.claude/skills/add-operator/SKILL.md` with what was learned: `pkg/client` stays SDK-1:1, the operator deletion model (operator deletes via SDK, CFN skips via `RetainResources`), the cross-resource cleanup pattern (operator removes out-of-stack dependencies with tight filters), and the dedicated `e2e/<scenario>/` layout (`cdk/.gitignore`, module-root `/<basename>` ignore, Makefile `testgen_` / `e2e_` targets).

Closes #637
Test plan
- `make test` passes (unit tests for new EC2 client methods via SDK middleware mocks; new operators via gomock against `IEC2`)
- `make lint` is clean (`0 issues.`)
- `make e2e_vpc_lambda` end-to-end: CDK deploy + synthetic ENIs + `delstack` → stack reached `DELETE_FAILED`, the new operators removed the ENIs and the Subnet / SecurityGroup, and the stack was successfully deleted. Verified via AWS CLI that the stack, both ENIs, the Subnet, and the SecurityGroup all return `NotFound`.
- `make e2e_full` and `make e2e_preprocessor` still pass.