improving and fixing failure detection#196
Conversation
| } | ||
| } else { | ||
| go executeCheckAndRecoverFunction(analysisEntry, candidateInstanceKey, false, skipProcesses) | ||
| analysisEntry := analysisEntry |
There was a problem hiding this comment.
I believe the proper behavior here would be to pass analysisEntry into the goroutine:
go func(analysisEntry inst.ReplicationAnalysis) {
// call things in here
}(analysisEntry)
There was a problem hiding this comment.
@nickvanw thank you; I became accustomed to this paradigm following https://golang.org/doc/effective_go.html#channels (req := req).
I thought at first it looked funny, but then it is also very explicit and clear.
|
Time to elaborate. This PR fixes a subtle bug which does not manifest consistently. The subtle fix is on https://github.com/github/orchestrator/pull/196/files#diff-86933c5afedc1f1a02e011c53af7b039R1366 The This doesn't manifest in a consistent manner because of the nature of unpredictable concurrency. It did manifest more on prior array entries than on latter array entries. When this PR is merged (being hammered right now) I'll release a version as this is a critical recovery fix. I'm surprised it hasn't surfaced sooner, to be honest. |
|
This PR adds a distinction between an actionably-recoverable problem (e.g. It allows Detection is known to be of an escalating nature. A dying master may for example first be diagnosed as With the changes in this PR we will get more information via |
|
an also more logging to be able to figure out what's up with detection at any step. And logging is |
|
Recent commit also intentionally changes recovery iteration to random order. |
This PR will improve upon the failure detection process, by: