fix: Instead of panicking, the code now handles duplicates gracefully:#38085
fix: Instead of panicking, the code now handles duplicates gracefully:#38085pgaijin66 wants to merge 3 commits intohashicorp:mainfrom
Conversation
|
Thank you for your submission! We require that all contributors sign our Contributor License Agreement ("CLA") before we can accept the contribution. Read and sign the agreement Learn more about why HashiCorp requires a CLA and what the CLA includes Have you signed the CLA already but the status is still pending? Recheck it. |
b318390 to
0e12a87
Compare
Incrementing the `aws-sdk-go-base` dependecy will also pull in the latest version of the AWS SDK for Go V2, which includes support for the newly launched `aws login` authentication method.
- Logging a warning and continuing if the same status is reported twice - Logging an error and preserving the first status if conflicting statuses are reported (first-write-wins semantics)
0e12a87 to
29a4403
Compare
|
Hello @pgaijin66 ! Thank you for submitting this. I have a concern about your report: if this issue is sometimes triggered by a race condition, we must look into that. A race condition in terraform is never benign 😅 ! I would rather leave the panic logic (replacing it with a DuplicateCheck error, but still causing failures) and fix the underlying issue. Race conditions that seem benign in one configuration can be disastrous in a larger one, so we don't just paper over them. We can add handling for duplicate entires, but it's important that we do so deliberately - terraform should know that it's processing streaming events and in that case use a different method (ReportPossiblyDuplicateCheck 🤷🏻). If we're still getting a check twice for other reasons, we can return the previously-mentioned DuplicateCheck error. I'd also question your assertion that the first check result is the one saved, and this is a genuine question - if it's due to a retry, isn't it possible that what we want is the "final" check value? If the resource is slow to provision I would think it's possible the first check value wouldn't have all the information. |
|
@mildwonkey Thank you for the thorough feedback, you're absolutely right that we shouldn't paper over race conditions. I appreciate you taking the time to explain the deeper concerns here. After reading your feedback, I agree that my current approach is treating the symptom rather than the disease. A race condition that seems benign in my specific scenario could absolutely cause serious issues in larger configurations. Would you prefer I close this PR and open a new one after proper investigation? Or should I update this PR with findings and a revised approach? I want to get this right rather than rush a fix that might mask real problems. |
|
Excellent, and thank you for understanding! (fixing a panic is a good instinct no matter how you look at it 😁) 🤔 I think the easiest would be if you updated this pr to return an error instead of panic when duplicate checks are registered, and then someone (you, the core team, some other kindly contributor) can look into the other items in follow up PRs. That or open a new PR, either works! If you have any more information on the race condition you mentioned, that would be greatly appreciated, however you wish to share it. A reproduction or even a test that fails occasionally would be great, but whatever you have or know is great 👍🏻 (right now I'm fighting with race conditions that only show up in ~1% of test runs; it's not always easy to reproduce!) |
|
Sounds good. I'll update the PR. Thank you for the quick response. Appreciate it alot. |
mildwonkey
left a comment
There was a problem hiding this comment.
(nothing new, sorry for the noise - I just realized I should've left this with my first comment)
Summary
Replace panic with graceful error handling when duplicate check status reports occur in the internal checks system.
Background
This panic was encountered during a
terraform applyoperation that was recreating AWS ACM certificates and API Gateway resources. During the apply phase, Terraform's internal checks system attempted to report the same resource precondition status multiple times for the same resource, triggering a panic that crashed the entire apply operation.What Was Happening
Running the following command:
The Terraform plan was recreating production ACM certificates that were in
PENDING_VALIDATIONstatus along with their associated API Gateway resources. During concurrent resource evaluation, Terraform's checks system received duplicate status reports for the same precondition check on the ACM certificate resource.Root Cause
The issue occurs in
internal/checks/state_report.go:114where the code panics when it detects that a check status has already been reported (i.e., the status is notStatusUnknown). In concurrent execution scenarios, particularly during resource recreation or when evaluating resource preconditions, it's possible for the same check to be reported multiple times due to race conditions or retry logic in the graph walker.The previous implementation assumed this should never happen and treated it as a fatal error, causing Terraform to crash instead of handling the duplicate gracefully.
Changes
Error Before Fix
Test Plan
Added
TestChecksDuplicateReportthat verifies:Impact
This change makes Terraform more resilient to duplicate or conflicting check reports while maintaining data integrity. It prevents unnecessary crashes during legitimate concurrent execution scenarios, particularly when applying plans that involve resource recreation or complex dependency graphs.
Target Release
1.15.x
Rollback Plan
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.
CHANGELOG entry