feat: enable restarting precreated qemu cluster#12733
Jaakkonen wants to merge 1 commit into siderolabs:main from
Conversation
Allows Talos clusters created by `talosctl cluster create dev` to be started again after a system restart or similar. This is useful when using talosctl-created clusters as development clusters where rapid iteration is required.

Signed-off-by: Jaakko Sirén <jaakko@craci.com>
Pull request overview
This pull request adds support for restarting precreated QEMU clusters without full recreation, significantly reducing development iteration time from ~5 minutes to ~2 minutes. The feature is particularly useful when development hosts are frequently restarted.
Changes:
- Added `talosctl cluster start` command to restart existing stopped clusters
- Refactored helper service methods (LoadBalancer, DHCPd, DNSd) to separate configuration persistence from service startup
- Added network bridge recreation logic with state-based configuration
- Implemented process status checking and node restart functionality
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| website/content/v1.13/reference/cli.md | Added documentation for the new talosctl cluster start command |
| pkg/provision/state.go | Added configuration structs for helper services (LoadBalancerConfig, DHCPdConfig, DNSdConfig, CNIConfig) to enable restart |
| pkg/provision/result.go | Added IPMasquerade field to NetworkInfo for state persistence |
| pkg/provision/provision.go | Added Start method to Provisioner interface |
| pkg/provision/providers/vm/start_test.go | Added unit tests for IsProcessRunning function |
| pkg/provision/providers/vm/start.go | Added IsProcessRunning helper to check if a process is alive |
| pkg/provision/providers/vm/network_linux.go | Refactored CreateNetwork into reusable components; added RecreateNetwork for restart |
| pkg/provision/providers/vm/network_darwin.go | Added RecreateNetwork stub that returns error (restart not supported on darwin) |
| pkg/provision/providers/vm/loadbalancer*.go | Capitalized GetLbBindIP and refactored CreateLoadBalancer to use StartLoadBalancer |
| pkg/provision/providers/vm/dhcpd*.go | Refactored to separate config persistence (Create) from service start (Start) |
| pkg/provision/providers/vm/dnsd*.go | Refactored to separate config persistence (Create) from service start (Start) |
| pkg/provision/providers/qemu/start.go | Implemented Start method to restart clusters by recreating network and restarting all services/nodes |
| pkg/provision/providers/qemu/create.go | Added config persistence for helper services to support restart |
| pkg/provision/providers/docker/docker.go | Added Start stub that returns error (not supported for docker) |
| cmd/talosctl/cmd/mgmt/cluster/start.go | Added CLI command implementation for cluster start |
```go
// Save helper service configurations for restart support.
state.SelfExecutable = request.SelfExecutable

controlPlaneIPs := xslices.Map(request.Nodes.ControlPlaneNodes(), func(req provision.NodeRequest) string {
	return req.IPs[0].String()
})

state.LoadBalancerConfig = &provision.LoadBalancerConfig{
	BindAddress: vm.GetLbBindIP(request.Network.GatewayAddrs[0]),
	Upstreams:   controlPlaneIPs,
	Ports:       request.Network.LoadBalancerPorts,
}

state.DHCPdConfig = &provision.DHCPdConfig{
	GatewayAddrs:   request.Network.GatewayAddrs,
	IPXEBootScript: request.IPXEBootScript,
}

state.DNSdConfig = &provision.DNSdConfig{
	GatewayAddrs: request.Network.GatewayAddrs,
}

state.CNIConfig = &provision.CNIConfig{
	BinPath:  request.Network.CNI.BinPath,
	CacheDir: request.Network.CNI.CacheDir,
}
```
This code duplicates configuration that is already set by the Create methods for helper services (CreateLoadBalancer, CreateDHCPd, CreateDNSd). Each of these methods already sets the respective config fields (LoadBalancerConfig, DHCPdConfig, DNSdConfig) and SelfExecutable on the state object, making these assignments redundant. Only CNIConfig needs to be saved here since CreateNetwork doesn't set it on state.
```go
// Bridge doesn't exist, need to recreate it
// Use CNI config from state (saved during cluster create)
if state.CNIConfig == nil {
	return fmt.Errorf("no CNI config in state")
}
```
The RecreateNetwork function checks for CNIConfig but provides a non-specific error message. Consider making the error message consistent with other helper service error messages by including guidance about older talosctl versions. The error should be: "no CNI config in state; cluster was created with older talosctl, please destroy and recreate"
Suggested change:

```go
return fmt.Errorf("no CNI config in state; cluster was created with older talosctl, please destroy and recreate")
```
```text
### Synopsis

Starts a local Talos Kubernetes cluster that was previously created but is now stopped.
This is useful when the development container is restarted and the VM processes are no longer running.
```
The documentation states "This is useful when the development container is restarted" but the PR description mentions a different use case: "My use case involves using talos on a host I need to restart often". The documentation should reflect the broader use case of system restarts, not just development container restarts. Consider updating to: "This is useful when the host system or development container is restarted and the VM processes are no longer running."
Suggested change:

```text
This is useful when the host system or development container is restarted and the VM processes are no longer running.
```
```go
func TestIsProcessRunning(t *testing.T) {
	t.Run("non-existent pid file", func(t *testing.T) {
		assert.False(t, vm.IsProcessRunning("/nonexistent/path/to/pid"))
	})

	t.Run("invalid pid in file", func(t *testing.T) {
		tmpDir := t.TempDir()
		pidPath := filepath.Join(tmpDir, "test.pid")

		err := os.WriteFile(pidPath, []byte("not-a-number"), 0o644)
		assert.NoError(t, err)

		assert.False(t, vm.IsProcessRunning(pidPath))
	})

	t.Run("non-existent process", func(t *testing.T) {
		tmpDir := t.TempDir()
		pidPath := filepath.Join(tmpDir, "test.pid")

		// Use a very high PID that's unlikely to exist
		err := os.WriteFile(pidPath, []byte("999999999"), 0o644)
		assert.NoError(t, err)

		assert.False(t, vm.IsProcessRunning(pidPath))
	})

	t.Run("running process", func(t *testing.T) {
		tmpDir := t.TempDir()
		pidPath := filepath.Join(tmpDir, "test.pid")

		// Use current process PID
		err := os.WriteFile(pidPath, []byte(strconv.Itoa(os.Getpid())), 0o644)
		assert.NoError(t, err)

		assert.True(t, vm.IsProcessRunning(pidPath))
	})
}
```
The test suite should include a test case for process cleanup after signal handling. The IsProcessRunning function uses syscall.Signal(0) to check if a process exists, but doesn't verify that the process actually belongs to the expected program (e.g., a rogue process could have reused the PID). Consider adding a test that verifies behavior when a PID file contains a valid PID for a different process, though this edge case may be acceptable given the development-focused nature of this feature.
```go
func (p *provisioner) Start(ctx context.Context, cluster provision.Cluster, opts ...provision.Option) error {
	options := provision.DefaultOptions()

	for _, opt := range opts {
		if err := opt(&options); err != nil {
			return err
		}
	}

	state, ok := cluster.(*provision.State)
	if !ok {
		return fmt.Errorf("cluster is not a *provision.State")
	}

	fmt.Fprintln(options.LogWriter, "recreating network bridge", state.BridgeName)

	if err := p.RecreateNetwork(ctx, state, options); err != nil {
		return fmt.Errorf("error recreating network: %w", err)
	}

	fmt.Fprintln(options.LogWriter, "starting load balancer")

	if err := p.StartLoadBalancer(state); err != nil {
		return fmt.Errorf("error starting loadbalancer: %w", err)
	}

	fmt.Fprintln(options.LogWriter, "starting dnsd")

	if err := p.StartDNSd(state); err != nil {
		return fmt.Errorf("error starting dnsd: %w", err)
	}

	fmt.Fprintln(options.LogWriter, "starting nodes")

	if err := p.startNodes(ctx, state, &options); err != nil {
		return err
	}

	fmt.Fprintln(options.LogWriter, "starting dhcpd")

	if err := p.StartDHCPd(state); err != nil {
		return fmt.Errorf("error starting dhcpd: %w", err)
	}

	return nil
}
```
The Start method doesn't implement any rollback or cleanup if an error occurs partway through starting services. If starting the load balancer succeeds but dnsd fails, the load balancer will remain running. While this may be acceptable for a development-focused feature, consider whether partial startup states could cause confusion or resource leaks. Either document this behavior or add cleanup logic similar to how the Destroy method handles errors.
This PR is stale because it has been open 45 days with no activity.
Pull Request
What? (description)
Adds a `talosctl cluster start` command to start existing clusters that were shut down (for example by a system restart).

Why? (reasoning)
Talos takes ~5 minutes to do an install and fresh start, whereas simply restarting an existing cluster takes ~2 minutes to get ready. My use case involves using Talos on a host I need to restart often, so slow starts waste a lot of developer time. I'd still want to use Talos instead of K3s, but boot speed is the main problem; this addresses half of it.
Acceptance
Please use the following checklist:
- `make conformance`
- `make fmt`
- `make lint`
- `make docs`
- `make unit-tests`