feat: enable restarting precreated qemu cluster#12733
Jaakkonen wants to merge 1 commit into siderolabs:main from
Conversation
Allows Talos clusters created by `talosctl cluster create dev` to be started again after a system restart or similar. This is useful when using talosctl-created clusters as development clusters where rapid iteration is required.

Signed-off-by: Jaakko Sirén <jaakko@craci.com>
Pull request overview
This pull request adds support for restarting precreated QEMU clusters without full recreation, significantly reducing development iteration time from ~5 minutes to ~2 minutes. The feature is particularly useful when development hosts are frequently restarted.
Changes:
- Added `talosctl cluster start` command to restart existing stopped clusters
- Refactored helper service methods (LoadBalancer, DHCPd, DNSd) to separate configuration persistence from service startup
- Added network bridge recreation logic with state-based configuration
- Implemented process status checking and node restart functionality
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| website/content/v1.13/reference/cli.md | Added documentation for the new talosctl cluster start command |
| pkg/provision/state.go | Added configuration structs for helper services (LoadBalancerConfig, DHCPdConfig, DNSdConfig, CNIConfig) to enable restart |
| pkg/provision/result.go | Added IPMasquerade field to NetworkInfo for state persistence |
| pkg/provision/provision.go | Added Start method to Provisioner interface |
| pkg/provision/providers/vm/start_test.go | Added unit tests for IsProcessRunning function |
| pkg/provision/providers/vm/start.go | Added IsProcessRunning helper to check if a process is alive |
| pkg/provision/providers/vm/network_linux.go | Refactored CreateNetwork into reusable components; added RecreateNetwork for restart |
| pkg/provision/providers/vm/network_darwin.go | Added RecreateNetwork stub that returns error (restart not supported on darwin) |
| pkg/provision/providers/vm/loadbalancer*.go | Capitalized GetLbBindIP and refactored CreateLoadBalancer to use StartLoadBalancer |
| pkg/provision/providers/vm/dhcpd*.go | Refactored to separate config persistence (Create) from service start (Start) |
| pkg/provision/providers/vm/dnsd*.go | Refactored to separate config persistence (Create) from service start (Start) |
| pkg/provision/providers/qemu/start.go | Implemented Start method to restart clusters by recreating network and restarting all services/nodes |
| pkg/provision/providers/qemu/create.go | Added config persistence for helper services to support restart |
| pkg/provision/providers/docker/docker.go | Added Start stub that returns error (not supported for docker) |
| cmd/talosctl/cmd/mgmt/cluster/start.go | Added CLI command implementation for cluster start |
```go
// Save helper service configurations for restart support.
state.SelfExecutable = request.SelfExecutable

controlPlaneIPs := xslices.Map(request.Nodes.ControlPlaneNodes(), func(req provision.NodeRequest) string {
	return req.IPs[0].String()
})

state.LoadBalancerConfig = &provision.LoadBalancerConfig{
	BindAddress: vm.GetLbBindIP(request.Network.GatewayAddrs[0]),
	Upstreams:   controlPlaneIPs,
	Ports:       request.Network.LoadBalancerPorts,
}

state.DHCPdConfig = &provision.DHCPdConfig{
	GatewayAddrs:   request.Network.GatewayAddrs,
	IPXEBootScript: request.IPXEBootScript,
}

state.DNSdConfig = &provision.DNSdConfig{
	GatewayAddrs: request.Network.GatewayAddrs,
}

state.CNIConfig = &provision.CNIConfig{
	BinPath:  request.Network.CNI.BinPath,
	CacheDir: request.Network.CNI.CacheDir,
}
```
This code duplicates configuration that is already set by the Create methods for helper services (CreateLoadBalancer, CreateDHCPd, CreateDNSd). Each of these methods already sets the respective config fields (LoadBalancerConfig, DHCPdConfig, DNSdConfig) and SelfExecutable on the state object, making these assignments redundant. Only CNIConfig needs to be saved here since CreateNetwork doesn't set it on state.
```go
// Bridge doesn't exist, need to recreate it
// Use CNI config from state (saved during cluster create)
if state.CNIConfig == nil {
	return fmt.Errorf("no CNI config in state")
}
```
The RecreateNetwork function checks for CNIConfig but provides a non-specific error message. Consider making the error message consistent with other helper service error messages by including guidance about older talosctl versions. The error should be: "no CNI config in state; cluster was created with older talosctl, please destroy and recreate"
Suggested change:

```go
return fmt.Errorf("no CNI config in state; cluster was created with older talosctl, please destroy and recreate")
```
```text
### Synopsis

Starts a local Talos Kubernetes cluster that was previously created but is now stopped.
This is useful when the development container is restarted and the VM processes are no longer running.
```
The documentation states "This is useful when the development container is restarted" but the PR description mentions a different use case: "My use case involves using talos on a host I need to restart often". The documentation should reflect the broader use case of system restarts, not just development container restarts. Consider updating to: "This is useful when the host system or development container is restarted and the VM processes are no longer running."
Suggested change:

```text
This is useful when the host system or development container is restarted and the VM processes are no longer running.
```
```go
func TestIsProcessRunning(t *testing.T) {
	t.Run("non-existent pid file", func(t *testing.T) {
		assert.False(t, vm.IsProcessRunning("/nonexistent/path/to/pid"))
	})

	t.Run("invalid pid in file", func(t *testing.T) {
		tmpDir := t.TempDir()
		pidPath := filepath.Join(tmpDir, "test.pid")

		err := os.WriteFile(pidPath, []byte("not-a-number"), 0o644)
		assert.NoError(t, err)

		assert.False(t, vm.IsProcessRunning(pidPath))
	})

	t.Run("non-existent process", func(t *testing.T) {
		tmpDir := t.TempDir()
		pidPath := filepath.Join(tmpDir, "test.pid")

		// Use a very high PID that's unlikely to exist
		err := os.WriteFile(pidPath, []byte("999999999"), 0o644)
		assert.NoError(t, err)

		assert.False(t, vm.IsProcessRunning(pidPath))
	})

	t.Run("running process", func(t *testing.T) {
		tmpDir := t.TempDir()
		pidPath := filepath.Join(tmpDir, "test.pid")

		// Use current process PID
		err := os.WriteFile(pidPath, []byte(strconv.Itoa(os.Getpid())), 0o644)
		assert.NoError(t, err)

		assert.True(t, vm.IsProcessRunning(pidPath))
	})
}
```
The test suite should include a test case for process cleanup after signal handling. The IsProcessRunning function uses syscall.Signal(0) to check if a process exists, but doesn't verify that the process actually belongs to the expected program (e.g., a rogue process could have reused the PID). Consider adding a test that verifies behavior when a PID file contains a valid PID for a different process, though this edge case may be acceptable given the development-focused nature of this feature.
```go
func (p *provisioner) Start(ctx context.Context, cluster provision.Cluster, opts ...provision.Option) error {
	options := provision.DefaultOptions()

	for _, opt := range opts {
		if err := opt(&options); err != nil {
			return err
		}
	}

	state, ok := cluster.(*provision.State)
	if !ok {
		return fmt.Errorf("cluster is not a *provision.State")
	}

	fmt.Fprintln(options.LogWriter, "recreating network bridge", state.BridgeName)

	if err := p.RecreateNetwork(ctx, state, options); err != nil {
		return fmt.Errorf("error recreating network: %w", err)
	}

	fmt.Fprintln(options.LogWriter, "starting load balancer")

	if err := p.StartLoadBalancer(state); err != nil {
		return fmt.Errorf("error starting loadbalancer: %w", err)
	}

	fmt.Fprintln(options.LogWriter, "starting dnsd")

	if err := p.StartDNSd(state); err != nil {
		return fmt.Errorf("error starting dnsd: %w", err)
	}

	fmt.Fprintln(options.LogWriter, "starting nodes")

	if err := p.startNodes(ctx, state, &options); err != nil {
		return err
	}

	fmt.Fprintln(options.LogWriter, "starting dhcpd")

	if err := p.StartDHCPd(state); err != nil {
		return fmt.Errorf("error starting dhcpd: %w", err)
	}

	return nil
}
```
The Start method doesn't implement any rollback or cleanup if an error occurs partway through starting services. If starting the load balancer succeeds but dnsd fails, the load balancer will remain running. While this may be acceptable for a development-focused feature, consider whether partial startup states could cause confusion or resource leaks. Either document this behavior or add cleanup logic similar to how the Destroy method handles errors.
This PR is stale because it has been open 45 days with no activity.
Pull Request
What? (description)
Adds a `talosctl cluster start` command to start existing clusters that were shut down (for example by a system restart).

Why? (reasoning)
Talos takes ~5 minutes to do an install and fresh start, whereas simply restarting an existing cluster takes ~2 minutes to get ready. My use case involves using Talos on a host I need to restart often, so slow starts waste a lot of developer time. I'd still want to use Talos instead of K3s, but boot speed is the main problem; this addresses half of it.
Acceptance
Please use the following checklist:
- `make conformance`
- `make fmt`
- `make lint`
- `make docs`
- `make unit-tests`