join new etcd members as learners and auto-promote#7629

Open
makhov wants to merge 1 commit into k0sproject:main from makhov:join-etcd-as-learner
Conversation

Contributor

@makhov makhov commented May 14, 2026

Description

Adds new etcd members through the k0s join API as raft learners instead of voting members. The leader-elected EtcdMemberReconciler promotes them once etcd reports them caught up. This prevents an unreachable joiner (e.g. one advertising a wrong-interface peer URL) from breaking quorum on the existing cluster — notably the 1-node case where the surviving controller went from quorum=1 to quorum=2 and stalled waiting for an unreachable peer.
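The quorum arithmetic behind the 1-node case can be sketched as follows. This is an illustration only, not code from this PR: the `cluster` type and its methods are hypothetical, and the only fact it relies on is that raft quorum is a majority of voting members, with learners not counted.

```go
package main

import "fmt"

// cluster is a hypothetical model of etcd membership used only for
// illustration: voters count toward quorum, learners do not.
type cluster struct {
	voters   int
	learners int
}

// quorum returns how many voting members must be reachable for the
// cluster to make progress (a strict majority of the voters).
func (c cluster) quorum() int {
	return c.voters/2 + 1
}

// hasQuorum reports whether the given number of reachable voters is
// enough to keep the cluster writable.
func (c cluster) hasQuorum(reachableVoters int) bool {
	return reachableVoters >= c.quorum()
}

func main() {
	// Single controller before the join.
	before := cluster{voters: 1}
	// Old behavior: the joiner is added as a voting member right away.
	joinedAsVoter := cluster{voters: 2}
	// New behavior: the joiner is added as a learner.
	joinedAsLearner := cluster{voters: 1, learners: 1}

	// In every case only the original controller is reachable.
	fmt.Println("before join:     ", before.hasQuorum(1))
	fmt.Println("joined as voter: ", joinedAsVoter.hasQuorum(1))
	fmt.Println("joined as learner:", joinedAsLearner.hasQuorum(1))
}
```

With an unreachable joiner, only the voter case loses quorum; the learner case keeps quorum at 1 until the member is promoted.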

Fixes #7628

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@makhov makhov force-pushed the join-etcd-as-learner branch 6 times, most recently from 02cbc2e to 6814586 Compare May 14, 2026 12:14
@makhov makhov changed the title feat(etcd): join new members as learners and auto-promote join new etcd members as learners and auto-promote May 14, 2026
@makhov makhov marked this pull request as ready for review May 14, 2026 14:15
@makhov makhov requested review from a team as code owners May 14, 2026 14:15
@makhov makhov requested review from kke and ncopa May 14, 2026 14:15
twz123 previously approved these changes May 15, 2026
Comment thread pkg/apis/etcd/v1beta1/types.go Outdated

 type JoinCondition struct {
-	// +kubebuilder:validation:Enum=Joined
+	// +kubebuilder:validation:Enum=Joined;Promoted
Member

Hmm, perhaps we should remove the validation altogether so that adding a new condition doesn't require a CRD change. This is rather limiting for conditions in general: They are free-form and are supposed to offer flexibility, but then we're restricting them to well-known values, which defeats the whole point of not having bespoke status fields for each condition.

Comment thread pkg/apis/etcd/v1beta1/types_test.go Outdated
Comment on lines +54 to +56
if got == nil {
	t.Fatalf("expected Promoted condition to be set")
}
Member

We usually use testify assertions across the unit tests.

// Force a periodic resync even when no EtcdMember CR changes. Learners
// that join before their own controller is ready never have a matching
// CR, so the watch alone would never fire for them.
periodic := time.NewTicker(30 * time.Second)
Member

We could merge this with the retry channel, and simply resync every 10 seconds (maybe adding wait.Jitter(...)), not only when a retry is necessary.

Comment thread pkg/etcd/client.go Outdated
}

// LearnerInfo describes a learner member as reported by etcd.
type LearnerInfo struct {
Member

I was just about to add something similar for another etcd member issue I'm working on. What about merging the old member list with this? I don't think we need to have two separate methods. /xref 64b77ea.
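One possible shape for a merged listing, sketched under assumptions: etcd's member list response already carries an `IsLearner` flag per member, so a single list call could return everything and callers could filter. The `Member` type and `Learners` helper below are hypothetical and are not the types in pkg/etcd/client.go.

```go
package main

import "fmt"

// Member is a hypothetical merged view of an etcd member. It keeps the
// learner flag next to the regular member data instead of exposing a
// separate LearnerInfo type.
type Member struct {
	ID        uint64
	Name      string
	PeerURLs  []string
	IsLearner bool
}

// Learners returns only the members that are still raft learners.
func Learners(members []Member) []Member {
	var out []Member
	for _, m := range members {
		if m.IsLearner {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	members := []Member{
		{ID: 1, Name: "controller-0", PeerURLs: []string{"https://10.0.0.1:2380"}},
		{ID: 2, Name: "controller-1", PeerURLs: []string{"https://10.0.0.2:2380"}, IsLearner: true},
	}
	for _, m := range Learners(members) {
		fmt.Printf("learner %x (%s)\n", m.ID, m.Name)
	}
}
```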

@makhov makhov force-pushed the join-etcd-as-learner branch 2 times, most recently from ce62cb9 to 642cae9 Compare May 15, 2026 08:47
@makhov makhov requested a review from twz123 May 15, 2026 09:23
@makhov makhov closed this May 15, 2026
@makhov makhov reopened this May 15, 2026
@makhov makhov force-pushed the join-etcd-as-learner branch from 642cae9 to 0df160f Compare May 15, 2026 13:29
Contributor Author

makhov commented May 15, 2026

Self-promotion is impossible in etcd, so we have to do it in etcd_member_reconcile.

Adds new etcd members through the k0s join API as raft learners
instead of voting members. The leader-elected EtcdMemberReconciler
promotes them once etcd reports them caught up. This prevents an
unreachable joiner (e.g. one advertising a wrong-interface peer URL)
from breaking quorum on the existing cluster — notably the 1-node
case where the surviving controller went from quorum=1 to quorum=2
and stalled waiting for an unreachable peer.

Signed-off-by: amakhov <amakhov@mirantis.com>
@makhov makhov force-pushed the join-etcd-as-learner branch from 0df160f to c4a0ad9 Compare May 15, 2026 13:38
Development

Successfully merging this pull request may close these issues.

Unexpected desintegration of k0s in attempt to add one more controller

2 participants