Per-instance resource + update_policy for stateful replica sets (GCP MIG / k8s StatefulSet equivalent) #243

@h4x3rotab

Description

Component: Phala Cloud + the official Phala-Network/terraform-provider-phala
Use case: running stateful clusters (Consul, Postgres/Patroni, etcd, Kafka, …) across dstack CVMs.

Background

dstack already has the right structural shape for stateful
replicas — an app can have multiple instances, each instance
is bound to its own disk, and the instance's identity is persisted
on that disk. That maps cleanly to GCP's Managed Instance Group
with a stateful_policy, or to Kubernetes' StatefulSet + PVC
template, or to AWS ASG built on top of EBS reattach + lifecycle
hooks.

What's missing today is the operator-facing control surface to
drive that model from Terraform, and the rollout policy for
multi-instance updates.

What works today (validated empirically)

We exercised the provider in a small shakedown (the same run also
surfaced the related storage_fs ForceNew bug), and confirmed:

  • phala_app with replicas: N provisions N CVMs all sharing the
    same app_id (so a TEE-derived getKey() returns the same bytes
    on every replica — important for cluster-wide secrets).
  • In-place compose / env updates preserve app_id and
    primary_cvm_id (~3m39s on a tdx.small).
  • Replicas show up in Terraform state as cvm_ids = […].

So the basic "scale N stateful replicas under one app" already
works. The gaps below show up the moment you want fine-grained
operational control over the rollout and per-instance lifecycle.
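
For concreteness, a minimal sketch of the shape we validated (the
compose/env attribute names are paraphrased from our config, not
authoritative):

resource "phala_app" "consul_servers" {
  name     = "consul-servers"
  replicas = 3   # provisions 3 CVMs sharing one app_id

  # shared by all replicas; in-place edits here preserved app_id
  # and primary_cvm_id in our runs
  compose = file("${path.module}/docker-compose.yml")
  env = {
    CONSUL_BOOTSTRAP_EXPECT = "3"
  }
}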

Asks

1. Per-instance Terraform resource

Mirror GCP MIG's
google_compute_per_instance_config
or k8s' StatefulSet+PVC template:

resource "phala_app" "consul_servers" {
  name     = "consul-servers"
  replicas = 3
  # ... shared compose, env, etc.
}

resource "phala_app_instance" "consul_servers" {
  for_each = { for i in range(3) : i => "consul-server-${i}" }

  app_id          = phala_app.consul_servers.app_id
  ordinal         = each.key            # stable slot number
  preserved_state {
    disk          = "data"              # never delete, even on instance recreate
  }
  metadata = {
    role = "server"
  }
}

Use cases this unlocks:

  • terraform apply -target='phala_app_instance.consul_servers["1"]'
    to upgrade exactly one replica.
  • Per-instance overrides (e.g. AZ pinning, instance-specific
    metadata).
  • Per-instance state in Terraform — each replica has its own
    terraform state entry instead of being collapsed into the
    parent's cvm_ids list.

This is the structural primitive that lets external tooling
(rollout scripts, operators) gate updates on workload health.
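
For comparison, here is the GCP primitive being mirrored (real
google provider schema; the IGM and disk resources are assumed
defined elsewhere):

resource "google_compute_per_instance_config" "consul_server_1" {
  zone                   = "us-central1-a"
  instance_group_manager = google_compute_instance_group_manager.consul.name
  name                   = "consul-server-1"

  preserved_state {
    disk {
      device_name = "data"
      source      = google_compute_disk.consul_data_1.id
      delete_rule = "NEVER"   # the disk survives instance recreation
    }
  }
}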

2. update_policy block on phala_app

Mirror GCP MIG's
update_policy:

resource "phala_app" "consul_servers" {
  name     = "consul-servers"
  replicas = 3

  update_policy {
    type                  = "PROACTIVE"   # or "OPPORTUNISTIC"
    minimal_action        = "RESTART"     # NONE / REFRESH / RESTART / REPLACE
    max_unavailable_fixed = 0             # never reduce live capacity
    max_surge_fixed       = 1             # surge one extra during rollout
    min_ready_seconds     = 30            # green for at least 30s before next
  }
}

Reads as "one out at a time, never two simultaneously, give it 30s
of green before moving on." Same shape as a k8s StatefulSet's
RollingUpdate strategy (partition) combined with podManagementPolicy.
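
For reference, the GCP block being mirrored (real google provider
schema; note GCP spells the last field min_ready_sec):

resource "google_compute_instance_group_manager" "consul" {
  name               = "consul-servers"
  base_instance_name = "consul-server"
  zone               = "us-central1-a"
  target_size        = 3

  version {
    instance_template = google_compute_instance_template.consul.id
  }

  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "RESTART"
    max_unavailable_fixed = 0
    max_surge_fixed       = 1
    min_ready_sec         = 30
  }
}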

Without this, every Phala app update today is opaque — the operator
can't tell whether the platform will roll instances one at a time
or restart them all at once. For a Consul quorum, the difference
is "uptime" vs "split-brain incident."

3. Workload health-gate hooks (lifecycle hooks equivalent)

AWS ASGs expose lifecycle hooks (EC2_INSTANCE_TERMINATING /
EC2_INSTANCE_LAUNCHING) and k8s has preStop hooks, so the cluster
can pause a rollout while a workload-specific drain runs. Examples:

  • Consul: consul operator raft transfer-leader before killing
    the leader.
  • Postgres+Patroni: patronictl switchover --candidate ... before
    draining the primary.
  • etcd: etcdctl member promote after a new joiner is in-sync.

A simple version in dstack would be: an HTTP endpoint or a script
the platform execs in the CVM before terminating it, with a
configurable timeout. Without this, the only safe "stateful rolling
update" today is to do it manually outside of Terraform.

4. auto_healing with custom health checks

If a CVM dies, today the operator notices via Consul's
/v1/agent/members going red and runs terraform apply to recreate
it. GCP MIG and k8s both reconcile automatically off a health
check. The same shape would fit:

# inside resource "phala_app" "consul_servers":
auto_healing_policies {
  health_check      = phala_health_check.consul_healthy.id
  initial_delay_sec = 60
}

resource "phala_health_check" "consul_healthy" {
  http {
    port         = 8500
    request_path = "/v1/health/state/passing"
    expect_200   = true
  }

  check_interval_sec  = 10
  unhealthy_threshold = 3
}

Combined with preserved_state, the recreated CVM re-attaches its
existing disk → the cluster heals without losing membership.
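
Again for reference, the GCP analog (real google provider schema):

resource "google_compute_health_check" "consul_healthy" {
  name                = "consul-healthy"
  check_interval_sec  = 10
  unhealthy_threshold = 3

  http_health_check {
    port         = 8500
    request_path = "/v1/health/state/passing"
  }
}

# inside google_compute_instance_group_manager.consul:
#   auto_healing_policies {
#     health_check      = google_compute_health_check.consul_healthy.id
#     initial_delay_sec = 60
#   }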

Why this matters

Right now, anyone running a stateful cluster on Phala Cloud has to
either:

  1. Build the orchestration outside of Phala (custom scripts that
     call the API directly, drain workloads, then run
     terraform apply -target=... per instance), or
  2. Run an in-cluster operator (Consul Operator, Patroni controller,
     etc.) that wraps the platform and adds the missing primitives.

(1) is what we'll do for our experiment in the meantime; it works
but the orchestration logic ends up duplicated across every project.
(2) is the "operator pattern" but only really pays off in
Kubernetes-shaped environments.

Adding the four primitives above would let the standard cloud-style
HCL pattern (the GCP MIG one) work natively on Phala Cloud, which
seems like the right abstraction to converge on long-term.

Related

Both are small but in the same vicinity (tooling for multi-replica apps).

Happy to chat further

We're prototyping a Consul service mesh across dstack CVMs (mesh-conn
overlay + ICE/yamux) and these would graduate it from "demo" to
"managed cluster." Glad to provide more detail / iterate on the API
shape if useful.
