Per-instance resource + update_policy for stateful replica sets (GCP MIG / k8s StatefulSet equivalent) #243

@h4x3rotab

Description

Component: Phala Cloud + the official Phala-Network/terraform-provider-phala
Use case: running stateful clusters (Consul, Postgres/Patroni, etcd, Kafka, …) across dstack CVMs.

Background

dstack already has the right structural shape for stateful
replicas — an app can have multiple instances, each instance
is bound to its own disk, and the instance's identity is persisted
on that disk. That maps cleanly to GCP's Managed Instance Group
with a stateful_policy, or to Kubernetes' StatefulSet + PVC
template, or to AWS ASG built on top of EBS reattach + lifecycle
hooks.

What's missing today is the operator-facing control surface to
drive that model from Terraform, and the rollout policy for
multi-instance updates.

What works today (validated empirically)

We exercised the provider in a small shakedown (the same run also
surfaced the related storage_fs ForceNew bug), and confirmed:

  • phala_app with replicas: N provisions N CVMs all sharing the
    same app_id (so a TEE-derived getKey() returns the same bytes
    on every replica — important for cluster-wide secrets).
  • In-place compose / env updates preserve app_id and
    primary_cvm_id (~3m39s on a tdx.small).
  • Replicas show up in Terraform state as cvm_ids = […].

So the basic "scale N stateful replicas under one app" already
works. The gaps below show up the moment you want fine-grained
operational control over the rollout and per-instance lifecycle.
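
For concreteness, a minimal sketch of the shape we validated (the
compose/env attribute names are paraphrased from our config, not
authoritative):

resource "phala_app" "consul_servers" {
  name     = "consul-servers"
  replicas = 3   # provisions 3 CVMs sharing one app_id

  # shared by all replicas; in-place edits here preserved app_id
  # and primary_cvm_id in our runs
  compose = file("${path.module}/docker-compose.yml")
  env = {
    CONSUL_BOOTSTRAP_EXPECT = "3"
  }
}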

Asks

1. Per-instance Terraform resource

Mirror GCP MIG's
google_compute_per_instance_config
or k8s' StatefulSet+PVC template:

resource "phala_app" "consul_servers" {
  name     = "consul-servers"
  replicas = 3
  # ... shared compose, env, etc.
}

resource "phala_app_instance" "consul_servers" {
  for_each = { for i in range(3) : i => "consul-server-${i}" }

  app_id          = phala_app.consul_servers.app_id
  ordinal         = each.key            # stable slot number
  preserved_state {
    disk          = "data"              # never delete, even on instance recreate
  }
  metadata = {
    role = "server"
  }
}

Use cases this unlocks:

  • terraform apply -target='phala_app_instance.consul_servers["1"]'
    to upgrade exactly one replica.
  • Per-instance overrides (e.g. AZ pinning, instance-specific
    metadata).
  • Per-instance state in Terraform — each replica has its own
    terraform state entry instead of being collapsed into the
    parent's cvm_ids list.

This is the structural primitive that lets external tooling
(rollout scripts, operators) gate updates on workload health.
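
For comparison, here is the GCP primitive being mirrored (real
google provider schema; the IGM and disk resources are assumed
defined elsewhere):

resource "google_compute_per_instance_config" "consul_server_1" {
  zone                   = "us-central1-a"
  instance_group_manager = google_compute_instance_group_manager.consul.name
  name                   = "consul-server-1"

  preserved_state {
    disk {
      device_name = "data"
      source      = google_compute_disk.consul_data_1.id
      delete_rule = "NEVER"   # the disk survives instance recreation
    }
  }
}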

2. update_policy block on phala_app

Mirror GCP MIG's
update_policy:

resource "phala_app" "consul_servers" {
  name     = "consul-servers"
  replicas = 3

  update_policy {
    type                  = "PROACTIVE"   # or "OPPORTUNISTIC"
    minimal_action        = "RESTART"     # NONE / REFRESH / RESTART / REPLACE
    max_unavailable_fixed = 0             # never reduce live capacity
    max_surge_fixed       = 1             # surge one extra during rollout
    min_ready_seconds     = 30            # green for at least 30s before next
  }
}

Reads as "one out at a time, never two simultaneously, give it 30s
of green before moving on." Same shape as a k8s StatefulSet's
RollingUpdate strategy (partition) combined with podManagementPolicy.
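
For reference, the GCP block being mirrored (real google provider
schema; note GCP spells the last field min_ready_sec):

resource "google_compute_instance_group_manager" "consul" {
  name               = "consul-servers"
  base_instance_name = "consul-server"
  zone               = "us-central1-a"
  target_size        = 3

  version {
    instance_template = google_compute_instance_template.consul.id
  }

  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "RESTART"
    max_unavailable_fixed = 0
    max_surge_fixed       = 1
    min_ready_sec         = 30
  }
}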

Without this, every Phala app update today is opaque — the operator
can't tell whether the platform will roll instances one at a time
or restart them all at once. For a Consul quorum, the difference
is "uptime" vs "split-brain incident."

3. Workload health-gate hooks (lifecycle hooks equivalent)

AWS ASGs expose lifecycle hooks (EC2_INSTANCE_TERMINATING /
EC2_INSTANCE_LAUNCHING) and k8s has preStop hooks, so the cluster
can pause a rollout while a workload-specific drain runs. Examples:

  • Consul: consul operator raft transfer-leader before killing
    the leader.
  • Postgres+Patroni: patronictl switchover --candidate ... before
    draining the primary.
  • etcd: etcdctl member promote after a new joiner is in-sync.

A simple version in dstack would be: an HTTP endpoint or a script
the platform execs in the CVM before terminating it, with a
configurable timeout. Without this, the only safe "stateful rolling
update" today is to do it manually outside of Terraform.

4. auto_healing with custom health checks

If a CVM dies, today the operator notices via Consul's
/v1/agent/members going red and runs terraform apply to recreate
it. GCP MIG and k8s both reconcile automatically off a health
check. The same shape would fit:

# inside resource "phala_app" "consul_servers":
auto_healing_policies {
  health_check      = phala_health_check.consul_healthy.id
  initial_delay_sec = 60
}

resource "phala_health_check" "consul_healthy" {
  http {
    port         = 8500
    request_path = "/v1/health/state/passing"
    expect_200   = true
  }

  check_interval_sec  = 10
  unhealthy_threshold = 3
}

Combined with preserved_state, the recreated CVM re-attaches its
existing disk → the cluster heals without losing membership.
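
Again for reference, the GCP analog (real google provider schema):

resource "google_compute_health_check" "consul_healthy" {
  name                = "consul-healthy"
  check_interval_sec  = 10
  unhealthy_threshold = 3

  http_health_check {
    port         = 8500
    request_path = "/v1/health/state/passing"
  }
}

# inside google_compute_instance_group_manager.consul:
#   auto_healing_policies {
#     health_check      = google_compute_health_check.consul_healthy.id
#     initial_delay_sec = 60
#   }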

Why this matters

Right now, anyone running a stateful cluster on Phala Cloud has to
either:

  1. Build the orchestration outside of Phala (custom scripts that
     call the API directly, drain workloads, then run
     terraform apply -target=... per instance), or
  2. Run an in-cluster operator (Consul Operator, Patroni controller,
     etc.) that wraps the platform and adds the missing primitives.

(1) is what we'll do for our experiment in the meantime; it works
but the orchestration logic ends up duplicated across every project.
(2) is the "operator pattern" but only really pays off in
Kubernetes-shaped environments.

Adding the four primitives above would let the standard cloud-style
HCL pattern (the GCP MIG one) work natively on Phala Cloud, which
seems like the right abstraction to converge on long-term.

Related

Both are small but in the same vicinity (tooling for multi-replica apps).

Happy to chat further

We're prototyping a Consul service mesh across dstack CVMs (mesh-conn
overlay + ICE/yamux) and these would graduate it from "demo" to
"managed cluster." Glad to provide more detail / iterate on the API
shape if useful.
