Skip to content
This repository was archived by the owner on Feb 18, 2025. It is now read-only.

improving and fixing failure detection#196

Merged
shlomi-noach merged 7 commits into
masterfrom
detection-fixes
Jun 8, 2017
Merged

improving and fixing failure detection#196
shlomi-noach merged 7 commits into
masterfrom
detection-fixes

Conversation

@shlomi-noach
Copy link
Copy Markdown
Collaborator

This PR will improve upon the failure detection process, by:

  • prioritizing detection by severity
  • more useful output

@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 6, 2017 08:01 Active
@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 6, 2017 14:04 Active
@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 6, 2017 16:05 Active
@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 6, 2017 16:08 Active
Comment thread go/logic/topology_recovery.go Outdated
}
} else {
go executeCheckAndRecoverFunction(analysisEntry, candidateInstanceKey, false, skipProcesses)
analysisEntry := analysisEntry
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the proper behavior here would be to pass analysisEntry into the goroutine:

go func(analysisEntry inst.ReplicationAnalysis) {
    // call things in here
}(analysisEntry)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickvanw thank you; I became accustomed to this paradigm following https://golang.org/doc/effective_go.html#channels (req := req).
I thought at first it looked funny, but then it is also very explicit and clear.

@shlomi-noach shlomi-noach temporarily deployed to production June 7, 2017 05:12 Inactive
@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 7, 2017 05:27 Active
@shlomi-noach
Copy link
Copy Markdown
Collaborator Author

Time to elaborate. This PR fixes a subtle bug which does not manifest consistently. The subtle fix is on https://github.com/github/orchestrator/pull/196/files#diff-86933c5afedc1f1a02e011c53af7b039R1366

The executeCheckAndRecoverFunction() function was called within a goroutine even while analysisEntry was iterating replicationAnalysis, which means by the time the goroutine actually executed, analysisEntry would have switched to a/the next value.

This doesn't manifest in a consistent manner because of the nature of unpredictable concurrency. It did manifest more on prior array entries than on latter array entries.

When this PR is merged (being hammered right now) I'll release a version as this is a critical recovery fix. I'm surprised it hasn't surfaced sooner, to be honest.

@shlomi-noach
Copy link
Copy Markdown
Collaborator Author

This PR adds a distinction between an actionably-recoverable problem (e.g. DeadMaster) and one that hasn't got a recovering action (e.g. UnreachableMaster).

It allows orchestrator to register up to one actionable and one non-actionable entry in the backend topology_failure_detection table.

Detection is known to be of an escalating nature. A dying master may for example first be diagnosed as UnreachableMaster and then as DeadMaster. While recoveries work for such escalations (pending the subtle fix in this PR), detection only shows the first met analysis, and this causes for lacking visibility.

With the changes in this PR we will get more information via topology_failure_detection and, as result, via /api/audit-failure-detection and /web/audit-failure-detection.

@shlomi-noach
Copy link
Copy Markdown
Collaborator Author

an also more logging to be able to figure out what's up with detection at any step. And logging is Info rather than Debug because it's important™.

@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 7, 2017 11:53 Active
@shlomi-noach
Copy link
Copy Markdown
Collaborator Author

Recent commit also intentionally changes recovery iteration to random order.

@shlomi-noach shlomi-noach merged commit a74c07a into master Jun 8, 2017
@shlomi-noach shlomi-noach deleted the detection-fixes branch June 8, 2017 06:59
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants