
feat: enable restarting precreated qemu cluster #12733

Open
Jaakkonen wants to merge 1 commit into siderolabs:main from Jaakkonen:enable-restarting-cluster

Conversation

@Jaakkonen Jaakkonen commented Feb 5, 2026

This is useful when using talosctl created clusters as development clusters where rapid iteration is required.

Pull Request

What? (description)

Adds a talosctl cluster start command to start existing clusters that were shut down (for example, by a system restart).

Why? (reasoning)

Talos takes ~5 minutes to do an install and fresh start, whereas restarting an existing cluster takes ~2 minutes to get ready. My use case involves using Talos on a host I need to restart often, so slow starts waste a lot of developer time. I'd still prefer Talos over K3s, but boot speed is the main problem; this addresses 50% of it.

Acceptance

Please use the following checklist:

  • you linked an issue (if applicable)
  • you included tests (if applicable)
  • you ran conformance (make conformance)
  • you formatted your code (make fmt)
  • you linted your code (make lint)
  • you generated documentation (make docs)
  • you ran unit-tests (make unit-tests)

See make help for a description of the available targets.

@github-project-automation github-project-automation bot moved this to To Do in Planning Feb 5, 2026
@talos-bot talos-bot moved this from To Do to In Review in Planning Feb 5, 2026
Allows Talos clusters created by `talosctl cluster create dev` to be started again after a system restart or similar.
This is useful when using talosctl created clusters as development clusters where rapid iteration is required.

Signed-off-by: Jaakko Sirén <jaakko@craci.com>
@Jaakkonen Jaakkonen force-pushed the enable-restarting-cluster branch from 3812354 to 44f5433 on February 5, 2026 22:02
@smira smira moved this from In Review to On Hold in Planning Feb 9, 2026
@frezbo frezbo requested a review from Orzelius February 19, 2026 08:24
@shanduur shanduur requested review from Copilot February 19, 2026 14:02

Copilot AI left a comment


Pull request overview

This pull request adds support for restarting precreated QEMU clusters without full recreation, significantly reducing development iteration time from ~5 minutes to ~2 minutes. The feature is particularly useful when development hosts are frequently restarted.

Changes:

  • Added talosctl cluster start command to restart existing stopped clusters
  • Refactored helper service methods (LoadBalancer, DHCPd, DNSd) to separate configuration persistence from service startup
  • Added network bridge recreation logic with state-based configuration
  • Implemented process status checking and node restart functionality

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
website/content/v1.13/reference/cli.md Added documentation for the new talosctl cluster start command
pkg/provision/state.go Added configuration structs for helper services (LoadBalancerConfig, DHCPdConfig, DNSdConfig, CNIConfig) to enable restart
pkg/provision/result.go Added IPMasquerade field to NetworkInfo for state persistence
pkg/provision/provision.go Added Start method to Provisioner interface
pkg/provision/providers/vm/start_test.go Added unit tests for IsProcessRunning function
pkg/provision/providers/vm/start.go Added IsProcessRunning helper to check if a process is alive
pkg/provision/providers/vm/network_linux.go Refactored CreateNetwork into reusable components; added RecreateNetwork for restart
pkg/provision/providers/vm/network_darwin.go Added RecreateNetwork stub that returns error (restart not supported on darwin)
pkg/provision/providers/vm/loadbalancer*.go Capitalized GetLbBindIP and refactored CreateLoadBalancer to use StartLoadBalancer
pkg/provision/providers/vm/dhcpd*.go Refactored to separate config persistence (Create) from service start (Start)
pkg/provision/providers/vm/dnsd*.go Refactored to separate config persistence (Create) from service start (Start)
pkg/provision/providers/qemu/start.go Implemented Start method to restart clusters by recreating network and restarting all services/nodes
pkg/provision/providers/qemu/create.go Added config persistence for helper services to support restart
pkg/provision/providers/docker/docker.go Added Start stub that returns error (not supported for docker)
cmd/talosctl/cmd/mgmt/cluster/start.go Added CLI command implementation for cluster start


Comment on lines +179 to +204
// Save helper service configurations for restart support.
state.SelfExecutable = request.SelfExecutable

controlPlaneIPs := xslices.Map(request.Nodes.ControlPlaneNodes(), func(req provision.NodeRequest) string {
	return req.IPs[0].String()
})

state.LoadBalancerConfig = &provision.LoadBalancerConfig{
	BindAddress: vm.GetLbBindIP(request.Network.GatewayAddrs[0]),
	Upstreams:   controlPlaneIPs,
	Ports:       request.Network.LoadBalancerPorts,
}

state.DHCPdConfig = &provision.DHCPdConfig{
	GatewayAddrs:   request.Network.GatewayAddrs,
	IPXEBootScript: request.IPXEBootScript,
}

state.DNSdConfig = &provision.DNSdConfig{
	GatewayAddrs: request.Network.GatewayAddrs,
}

state.CNIConfig = &provision.CNIConfig{
	BinPath:  request.Network.CNI.BinPath,
	CacheDir: request.Network.CNI.CacheDir,
}

Copilot AI Feb 19, 2026


This code duplicates configuration that is already set by the Create methods for helper services (CreateLoadBalancer, CreateDHCPd, CreateDNSd). Each of these methods already sets the respective config fields (LoadBalancerConfig, DHCPdConfig, DNSdConfig) and SelfExecutable on the state object, making these assignments redundant. Only CNIConfig needs to be saved here since CreateNetwork doesn't set it on state.

Comment thread pkg/provision/providers/vm/dhcpd.go
Comment thread pkg/provision/providers/vm/dnsd_linux.go
Comment thread pkg/provision/providers/vm/dnsd_darwin.go
// Bridge doesn't exist, need to recreate it
// Use CNI config from state (saved during cluster create)
if state.CNIConfig == nil {
	return fmt.Errorf("no CNI config in state")

Copilot AI Feb 19, 2026


The RecreateNetwork function checks for CNIConfig but provides a non-specific error message. Consider making the error message consistent with other helper service error messages by including guidance about older talosctl versions. The error should be: "no CNI config in state; cluster was created with older talosctl, please destroy and recreate"

Suggested change
return fmt.Errorf("no CNI config in state")
return fmt.Errorf("no CNI config in state; cluster was created with older talosctl, please destroy and recreate")

### Synopsis

Starts a local Talos Kubernetes cluster that was previously created but is now stopped.
This is useful when the development container is restarted and the VM processes are no longer running.

Copilot AI Feb 19, 2026


The documentation states "This is useful when the development container is restarted" but the PR description mentions a different use case: "My use case involves using talos on a host I need to restart often". The documentation should reflect the broader use case of system restarts, not just development container restarts. Consider updating to: "This is useful when the host system or development container is restarted and the VM processes are no longer running."

Suggested change
This is useful when the development container is restarted and the VM processes are no longer running.
This is useful when the host system or development container is restarted and the VM processes are no longer running.

Comment on lines +18 to +54
func TestIsProcessRunning(t *testing.T) {
	t.Run("non-existent pid file", func(t *testing.T) {
		assert.False(t, vm.IsProcessRunning("/nonexistent/path/to/pid"))
	})

	t.Run("invalid pid in file", func(t *testing.T) {
		tmpDir := t.TempDir()
		pidPath := filepath.Join(tmpDir, "test.pid")

		err := os.WriteFile(pidPath, []byte("not-a-number"), 0o644)
		assert.NoError(t, err)

		assert.False(t, vm.IsProcessRunning(pidPath))
	})

	t.Run("non-existent process", func(t *testing.T) {
		tmpDir := t.TempDir()
		pidPath := filepath.Join(tmpDir, "test.pid")

		// Use a very high PID that's unlikely to exist
		err := os.WriteFile(pidPath, []byte("999999999"), 0o644)
		assert.NoError(t, err)

		assert.False(t, vm.IsProcessRunning(pidPath))
	})

	t.Run("running process", func(t *testing.T) {
		tmpDir := t.TempDir()
		pidPath := filepath.Join(tmpDir, "test.pid")

		// Use current process PID
		err := os.WriteFile(pidPath, []byte(strconv.Itoa(os.Getpid())), 0o644)
		assert.NoError(t, err)

		assert.True(t, vm.IsProcessRunning(pidPath))
	})
}

Copilot AI Feb 19, 2026


The test suite should include a test case for process cleanup after signal handling. The IsProcessRunning function uses syscall.Signal(0) to check if a process exists, but doesn't verify that the process actually belongs to the expected program (e.g., a rogue process could have reused the PID). Consider adding a test that verifies behavior when a PID file contains a valid PID for a different process, though this edge case may be acceptable given the development-focused nature of this feature.

Comment on lines +25 to +70
func (p *provisioner) Start(ctx context.Context, cluster provision.Cluster, opts ...provision.Option) error {
	options := provision.DefaultOptions()

	for _, opt := range opts {
		if err := opt(&options); err != nil {
			return err
		}
	}

	state, ok := cluster.(*provision.State)
	if !ok {
		return fmt.Errorf("cluster is not a *provision.State")
	}

	fmt.Fprintln(options.LogWriter, "recreating network bridge", state.BridgeName)

	if err := p.RecreateNetwork(ctx, state, options); err != nil {
		return fmt.Errorf("error recreating network: %w", err)
	}

	fmt.Fprintln(options.LogWriter, "starting load balancer")

	if err := p.StartLoadBalancer(state); err != nil {
		return fmt.Errorf("error starting loadbalancer: %w", err)
	}

	fmt.Fprintln(options.LogWriter, "starting dnsd")

	if err := p.StartDNSd(state); err != nil {
		return fmt.Errorf("error starting dnsd: %w", err)
	}

	fmt.Fprintln(options.LogWriter, "starting nodes")

	if err := p.startNodes(ctx, state, &options); err != nil {
		return err
	}

	fmt.Fprintln(options.LogWriter, "starting dhcpd")

	if err := p.StartDHCPd(state); err != nil {
		return fmt.Errorf("error starting dhcpd: %w", err)
	}

	return nil
}

Copilot AI Feb 19, 2026


The Start method doesn't implement any rollback or cleanup if an error occurs partway through starting services. If starting the load balancer succeeds but dnsd fails, the load balancer will remain running. While this may be acceptable for a development-focused feature, consider whether partial startup states could cause confusion or resource leaks. Either document this behavior or add cleanup logic similar to how the Destroy method handles errors.


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


github-actions bot commented Apr 6, 2026

This PR is stale because it has been open 45 days with no activity.

@github-actions github-actions bot added the Stale label Apr 6, 2026

Labels

Projects

Status: On Hold

Development

Successfully merging this pull request may close these issues.

5 participants