consul-postgres-ha stage 4 — Postgres HA across dstack-TEE workers#95
Draft
Adds a phase-0 experiment to verify whether dstack CVMs can establish
direct UDP paths via NAT hole-punching, as a prerequisite for running
Consul (or any UDP-gossip service mesh) across CVMs over the TCP-only
dstack-gateway.
Components:
- coordinator/: docker-compose for coturn (STUN+TURN, UDP+TCP) plus a
tiny HTTP signaling broker, deployed on a user-provided public-IP host
- phase0/icetest/: single Go binary with two modes
- signaling: ferries ICE candidates and ufrag/pwd between two peers
- peer: runs pion/ice against coturn, exchanges candidates via the
broker, performs connectivity check, sends 20 echo round-trips, and
logs the winning candidate-pair type + RTT
- phase0/docker-compose.yaml: dstack-CVM compose that runs the peer
- deploy/phase0-results.md: result of the live run
Result: direct hole-punched UDP works between two dstack CVMs (srflx
candidates via NAT hairpinning), median RTT ~6.6 ms over the public
internet path, no TURN relay needed. TURN is available as fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on phase-0's "direct UDP hole-punch works on dstack" finding.
Stage 1 wraps a pion/ice connection in a TUN device so arbitrary IP
traffic can flow between two CVMs, not just hand-written echo packets.
Components:
- stage1/mesh-conn: ~280 LoC Go, single binary
- opens TUN mesh0 with a virtual /24 IP
- establishes one pion/ice connection to its partner via the same
coturn + signaling broker phase-0 used
- 1:1 pumps L3 packets between TUN and ice.Conn (no framing — ice
rides on UDP, datagram boundaries are preserved)
- logs the selected ICE candidate pair for visibility
- stage1/docker-compose.yaml: mesh-conn + nicolaka/netshoot tester,
both on network_mode: host
Result (deploy/stage1-mvp-results.md): direct host<->srflx hole-punched
path, ICMP through the tunnel runs at 4.8–8.4 ms RTT, matching phase-0
native UDP latency. Confirms userspace overhead is negligible.
Caveat: docker-bridge networking forces ICE onto the TURN relay path
(observed 163 ms RTT in the broken run) because srflx replies can't
route through the bridge NAT. mesh-conn must run with
network_mode: host on dstack.
This MVP is a stepping stone: the next iteration replaces the TUN
with a userspace port-forwarding agent so apps simply dial
localhost:&lt;port> to reach upstreams on peers; no virtual L3, no
kernel routing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the TUN-based overlay with a much simpler userspace UDP
port-forwarding agent. No TUN device, no virtual L3, no NET_ADMIN
capability, no songgao/water dependency — just `net.ListenUDP` per
peer-pair, bridged 1:1 with one pion/ice connection.
Identity-port convention: each peer has a unique 16-bit identity port.
On every host:
- the local app binds 127.0.0.1:<own_port>
- mesh-conn binds 127.0.0.1:<other_peer_port> for every OTHER peer
- apps reach peer X by sending UDP to 127.0.0.1:<X_port>
The source-port-preservation trick (mesh-conn's bound socket is the
*sender peer's* identity port) means the receiving app sees inbound
packets as coming from 127.0.0.1:<sender_id_port>, which is the
address the cluster's peer-discovery / membership protocol uses to
identify the sender. So Consul or any membership-aware service plugs
in unchanged.
Verified end-to-end on two dstack CVMs (deploy/stage1-portfwd-results.md):
ICE selected the direct host<->prflx hole-punched path; 5/5 socat-based
UDP round-trips delivered the correct payload through the bridge.
Why this is the right shape: the TUN approach (committed earlier as a
milestone) gave us a virtual L3 we didn't actually need. Apps in a
service-mesh demo just want "send UDP to a stable peer address" — a
userspace bridge is enough, cheaper to operate (no TUN device on the
host), and easier to reason about for stage-2 attestation gating.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…orkers)
Deploys the port-forwarder on 4 dstack CVMs with a shared PEERS_JSON,
verifies all 6 ICE links come up concurrently and traffic flows in
every direction without code changes — mesh-conn already iterates
peers generically. 12/12 cross-peer one-way UDP datagrams delivered,
all paths direct hole-punch (no TURN relay selected).
Sets up the next decision: how to carry TCP across CVMs so Consul's
RPC + gossip-state-sync work. Plan documented at the bottom of the
result file: add a multiplexed TCP path to mesh-conn rather than
routing TCP via dstack-gateway, so apps only have to know about one
transport.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layers TCP forwarding onto the existing UDP port-forwarder so Consul
(which uses both UDP gossip and TCP RPC + gossip-state-sync on the same
serf port) and any other TCP-needing service can run cross-CVM through
the same agent.
How it works:
- Each peer-pair still has exactly one pion/ice connection.
- That connection is wrapped in a yamux session; the lex-smaller peer
is the yamux client, matching the ICE Dial/Accept convention.
- Each yamux stream's first byte tags its purpose:
0x55 streamUDP — long-lived control stream, length-prefixed UDP
datagrams flow both ways
0x33 streamTCP — per-connection ephemeral, raw byte splice
- mesh-conn now binds both a UDP socket and a TCP listener on
127.0.0.1:<peer-port>; local Accept on either opens a stream of the
matching tag. On the remote side a new TCP stream causes a Dial to
127.0.0.1:<self-port> and bidirectional splice.
Verified end-to-end on the existing 4-CVM cluster: 12/12 cross-peer HTTP
curls succeeded through the bridge (deploy/stage1-tcp-results.md). UDP
fan-out from earlier still works.
Single ICE conn + yamux mux trades a small head-of-line risk for
halving NAT-mapping pressure vs running separate UDP and TCP ICE
connections. Acceptable for Consul-grade traffic; can split later if
jitter sensitivity demands it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pair
Extends mesh-conn so each peer can forward several ports through a
single ICE+yamux pair. Required for Consul, which advertises one bind
address but uses several ports for distinct protocols (serf-LAN gossip,
server-RPC, HTTP API, gRPC/xDS); each protocol needs its own per-peer
identity port for the source-port-preservation trick to work.
Changes:
- Peer.Port int -> Peer.Ports []int (PEERS_JSON now carries a list per
peer; index i is the same protocol across peers)
- yamux stream header grew from 1 byte to 3:
[tag (1)] [receiver-side port, uint16 BE (2)]
- Per peer-pair: still one ICE conn + one yamux session
* lex-smaller side opens N long-lived UDP streams up front, one per
port, each tagged with the peer's port for that index
* lex-larger side accepts, looks the port up in self.Ports, pairs
with the matching local UDP socket
* TCP: per-connection ephemeral streams, header carries the dst
port so the receiver dials its own matching local listener
Verified: 4-CVM cluster (ctrl + 3 workers), 4 ports per peer.
deploy/stage1-multiport-results.md — 48/48 cross-peer HTTP fetches
through the bridge succeeded (4 protocol slots × 12 directed peer-pairs).
All ICE pairs landed on direct host<->{prflx,srflx} paths, no relay.
Trade-offs in design notes: one ICE+yamux per pair was preferred over
one ICE per port to keep NAT-mapping pressure low (6 pairs vs 24) and
to give an all-or-none readiness guarantee for the protocol slots.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…verlay
Stands up a real HashiCorp Consul cluster (1 server + 3 clients) on
four TEE-isolated dstack CVMs whose only inter-CVM data path is the
userspace mesh-conn ICE+yamux port-forwarder built in stage 1.
What's new:
- stage2/docker-compose.yaml: per-peer compose with mesh-conn,
a hashicorp/consul:1.19 agent, and a netshoot tester sidecar,
all on network_mode: host.
- Consul launched via shell wrapper that branches on ROLE env var:
server (-server -bootstrap-expect=1 -ui) for ctrl,
client (-retry-join=127.0.0.1:CTRL_SERF_LAN_PORT) for workers.
- Each peer's agent binds to 127.0.0.1 with its own per-protocol
identity ports (serf=180XX, RPC=181XX, HTTP=182XX, gRPC=183XX),
matching the mesh-conn port plan; mesh-conn forwards each port
to the corresponding peer and source-port-preservation makes the
addresses look right from every Consul agent's perspective.
- stage2/README.md documents the port plan and how Consul gossips
peer ports so workers can dial the leader's RPC port through the
overlay.
Verified (deploy/stage2-results.md):
- All 4 peers see all 4 members alive in /v1/agent/members.
- All 4 peers agree leader = 127.0.0.1:18100 (ctrl's RPC port via
the overlay).
- KV write from w1 (curl PUT) is readable from w3 (curl GET) — RPC
to the leader and Raft replication both work across the overlay.
Confirms that Consul's three transport classes (UDP gossip, TCP RPC,
TCP HTTP API) all round-trip cleanly through one yamux session per
peer-pair.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…the overlay
Walks through the exact recipe for running a Consul cluster across
dstack CVMs that have no direct connectivity to each other:
- Why the per-protocol identity-port plan exists and how it falls out
  of mesh-conn's source-port-preservation behaviour.
- The compose layout (mesh-conn + consul + tester, all on host
  networking) and each non-obvious flag explained: bind/advertise on
  127.0.0.1, per-peer -serf-lan-port / -server-port / -http-port /
  -grpc-port overrides, why -dns-port=-1.
- The per-CVM env-var matrix for PEER_ID / ROLE / *_PORT /
  CTRL_SERF_LAN_PORT / PEERS_JSON.
- What the boot sequence actually looks like (mesh-conn → Consul
  agents → leader election → workers join).
- How to verify membership, leader, and cross-peer KV.
The aim is that the next person setting this up doesn't have to
reverse-engineer the trick from the compose file. The whole thing
collapses to: "Consul never sees the overlay; identity ports +
source-port preservation make every peer look like it's on the same
loopback."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fan-out
Layers a small user-facing demo on top of the stage-2 Consul cluster.
Each peer runs a tiny Go service (stage3a/webdemo, ~150 LoC) that:
- registers with the local Consul agent on startup as service
"webdemo" with an HTTP health-check on /hello
- exposes /hello returning "hello from <peer>"
- exposes /all that queries /v1/catalog/service/webdemo on local
Consul and fans out /hello calls to every instance returned
The addresses Consul hands back (127.0.0.1:<peer's webdemo port>)
are routed through mesh-conn to the right peer with no app-side
awareness of the overlay. Per-peer port plan grew by one slot
(index 4 = webdemo HTTP, ports 18500-18503).
Verified end-to-end across 4 CVMs (deploy/stage3a-results.md):
- all 4 webdemos register with the cluster (catalog visible from
every peer)
- /all from every peer returns 4 hellos: ctrl, w1, w2, w3
- HTTP fan-out crosses CVM boundaries via mesh-conn for every
non-self peer
Bug caught and fixed in this round: Consul's
/v1/agent/service/register requires PUT, not POST (returned 405 on
first try).
Sets up stage 3b: replace the plain HTTP path with Connect sidecars
and explicit intentions for mTLS between services.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er the overlay
Replaces stage-3a's plain HTTP service-to-service calls with a real
Consul Connect mesh. Each peer now also runs an Envoy sidecar in front
of its webdemo; sidecars do mTLS to each other across the overlay and
intentions gate the connections.
What's new:
- stage3b/sidecar/Dockerfile: small custom image combining the
consul CLI (for `consul connect envoy -bootstrap`) with Envoy
contrib v1.30. Tiny — no full Consul agent, just the CLI.
- stage3b/webdemo: webdemo registers with a Connect.SidecarService
block telling Consul to manage a sidecar that listens on the
per-peer sidecar_public port and exposes one upstream
"webdemo" on local 127.0.0.1:19000. /all hits the upstream N
times so Envoy's LB rotates across all 4 instances.
- stage3b/docker-compose.yaml: adds the sidecar service, enables
Connect on the Consul agent (-hcl 'connect{enabled=true}'),
PEERS_JSON now has 6-element ports lists (the new sidecar_public
slot, 18600..18603) so mesh-conn forwards mTLS traffic between
peer sidecars.
Verified end-to-end (deploy/stage3b-results.md):
- All 4 sidecars boot cleanly; Envoy logs show clusters loaded and
listeners up (public_listener and webdemo upstream).
- With intention webdemo->webdemo: allow, /all from w1 returns
perfectly balanced load: 2/2/2/2 across ctrl, w1, w2, w3.
- Flip intention to deny: 6/8 calls fail with EOF (peer sidecars
reject the mTLS handshake). Flip back to allow: full balance
restored. Intention enforcement is real.
Bug caught: Consul's /v1/connect/intentions create wants POST (not
PUT). Update-by-ID uses PUT. Two endpoints, two methods — easy to
trip on; called out in the results doc.
Combined picture: a HashiCorp Consul service mesh — Envoy sidecars,
mTLS, intention enforcement — running across four TEE-isolated dstack
CVMs whose only inter-CVM data path is our userspace ICE+yamux
overlay. Apps and Envoy never see the overlay; from any CVM the mesh
looks like a single loopback-only host with peers on 127.0.0.1:<port>.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ARCHITECTURE.md walks the four-layer data plane (rendezvous infra →
ICE+yamux overlay → identity-port forwarder → apps), traces a single
Connect mTLS call all the way through, and is precise about the
mesh-conn × yamux wire format: one ICE conn per peer-pair, one yamux
session per ICE conn, 3-byte stream header (tag, receiver-side port),
2-byte length prefix on UDP datagrams. Includes the four-pump
diagram and the actual pump bodies for both UDP and TCP paths.
ROBUSTNESS.md is an honest review: per-layer failure modes, what
recovers automatically, what doesn't, plus a prioritised punch list.
Headlines:
- mesh-conn has one real bug today (auth-channel reconnect
deadlock) that will bite the first ICE drop. ~30 LoC fix.
- Single Consul server is the biggest structural SPOF; 3-server
quorum is the obvious "leave it running" upgrade.
- Gossip key + RPC TLS not configured today; defence-in-depth gap
masked by Layer-3 mTLS but should be closed.
- Coordinator is a SPOF for new joins (not for established
traffic); two-coordinator setup + signed signalling messages
closes both that and the metadata-spoof gap.
- "Are we playing too many tricks?" — no. The clever-and-ours
surface is just mesh-conn (~330 LoC) and the identity-port
plan; everything else is well-trodden libraries (pion/ice,
yamux, Consul, Envoy). Risk is concentrated in the small
custom shim, not in the count of layers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, validation
Three robustness fixes from ROBUSTNESS.md's punch list, paired
together because they all touch mesh-conn / Consul agent config.
#1: mesh-conn auth-channel reconnect deadlock
Each dialICE attempt now installs a fresh peerSession (new *ice.Agent
+ new authCh), replacing any prior one in the global map. pollLoop
looks up currentSession() per message; if no active attempt exists,
the message is dropped (rather than buffered into a stale channel
that would later poison a reconnect). Fixes the hang where, after an
ICE drop, the next dialICE blocks forever on <-sess.authCh because
the channel still held a stale auth from the previous attempt.
#4: PEERS_JSON validation at mesh-conn startup
validatePeers() in main.go fails fast on:
- <2 peers
- empty peer id, duplicate id
- empty Ports list, port out of [1, 65535]
- duplicate port within a peer's own Ports list
- port collision between two peers (must be globally unique because
  mesh-conn binds OTHER peers' ports on 127.0.0.1)
- port-list length mismatch across peers (every peer must use the
  same number of protocol slots, by index)
- PEER_ID not in PEERS_JSON
Also logs a digest of the canonical PEERS_JSON so operators can grep
across CVM logs to confirm every peer sees the same config. Tests in
validate_test.go cover all cases (8 tests, all passing).
#2: Consul gossip key
Stage 2/3a/3b composes now require a GOSSIP_KEY env var and pass it
to consul agent via -encrypt=$GOSSIP_KEY. Encrypts serf-LAN gossip
end-to-end (UDP+TCP) on every agent. Generated at deploy time via
openssl rand -base64 32. Layer-3 mTLS already protects payloads; this
hardens the membership/check-result path which rides outside Connect.
RPC TLS deferred to the dev-experience restructure where central cert
provisioning fits naturally; gossip key is the bigger gap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan for collapsing the per-stage shell-script + per-peer env-var
matrix into a single cluster.yaml + a small `cluster` CLI that drives
phala deploy.
Headlines:
- cluster.yaml is the single source of truth: peers, protocol port
plan, intentions, secrets policy, deploy params.
- One CLI: validate / plan / up / down / status / logs.
- Control plane is an "embedded" mode where one dstack CVM bundles
coturn + signaling + Consul server, removing the external Vultr
box; requires Phala admin to enable UDP ingress on that CVM.
Falls back to "external" mode (separate non-TEE coordinator host)
when UDP ingress isn't available.
- Mesh-conn / webdemo / sidecar code stays unchanged; the change is
entirely in deploy ergonomics.
- TEE-app constraint is respected: one compose template per role,
only env vars vary per peer; compose-hash audit surface is small.
Future direction noted but not in this stage: derive GOSSIP_KEY /
TURN_SHARED_SECRET inside each TEE via dstack-sdk getKey() so the
deploy host never sees them. Requires AppAuth-shared app-id across
peers; reuses stage-2 attestation work.
Open questions for the user listed at the end of the doc (CLI
language, secret handling, control-plane HA, redeploy semantics).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three live on the redeployed stage-3b cluster:
- mesh-conn validation logs an identical PEERS_JSON digest
  (NiNhinoUekif) on every peer, confirming cross-peer config
  consistency.
- Consul logs Gossip=true → serf-LAN encrypted with the shared
  gossip key.
- Connect mTLS /all still perfectly balanced 2/2/2/2 across the four
  webdemo instances; cluster operation unchanged by the fixes.
#1 (reconnect bug) is verified by code review + the new
validate_test.go test suite; live failure-injection (kill mesh-conn
mid-run) is queued for the stage-4 CI work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, in-place updates
Reshapes stage-4 around four user decisions:
1. No new CLI — use Phala's official
terraform-provider-phala (resource phala_app supports
replicas, env, encrypted_env, custom_app_id+nonce, in-place
update). One cluster.tf is the source of truth.
2. Secrets never in human hands. A small bootstrap-secrets init
container per peer mounts /var/run/dstack.sock, derives
gossip key / TURN secret / Connect-CA seed via getKey(),
writes them to a tmpfs volume, exits. consul + mesh-conn
read those files at startup. All peers share the same
app_id (via custom_app_id + cluster_nonce) so getKey()
returns the same bytes on every peer.
3. Multi-server Consul stays the next stage but unlocks
self-discovering rendezvous: each control CVM registers as
service "mesh-coordinator" and "mesh-turn" in Consul; new
peers know ONE bootstrap endpoint and learn the rest from
the catalog. Topology of the rendezvous becomes a
service-mesh-managed concern.
4. In-place updates preserve disk volumes (Consul Raft state,
KV, sidecar certs, future Patroni WAL). Compose/env diffs
update existing CVMs without recreate; only
custom_app_id/nonce changes rotate identity. Per-node
rollout for the control plane via terraform -target.
Includes a full HCL skeleton, the bootstrap-secrets sketch, and
maps each ROBUSTNESS.md punch-list item to stage 4 vs the next
stage.
Open item: confirm phala_app behaviour (replicas, encrypted_env,
in-place env update, custom_app_id) on the 0.2.0-beta.1 provider
before committing the dev-experience to it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… provider
Empirical verification before committing the dev-experience to
terraform. Spun up a tdx.small nginx via terraform apply, exercised
in-place updates, replicas, then destroyed.
Outcomes (stage4-experiments/tf-shakedown/RESULTS.md):
- create works (~2 min for tdx.small)
- in-place compose+env update preserves app_id and primary_cvm_id
(~3m39s for the upgrade flow); disk volumes survive
- replicas: 1 -> 2 plans in-place; both CVMs land under the same
app_id, which is exactly what TEE-derived secrets via getKey()
need across replicas (no out-of-band coordination required)
- destroy clean (~23s)
Three gotchas baked into RESULTS.md:
- storage_fs MUST be pinned in HCL ("zfs"); otherwise the next
apply diffs "zfs -> (known after apply)" which the provider
treats as ForceNew → destroys the CVM. Without pin, every diff
becomes a recreate.
- provider is at 0.2.0-beta.2; Terraform's >= constraint excludes
pre-release by default — pin exactly.
- field-name shape is positive (listed/public_logs/public_sysinfo),
not the CLI's --no-... shape.
Verdict: provider is good enough for stage 4. Open follow-ups
listed in RESULTS.md (encrypted_env behaviour, custom_app_id +
nonce determinism, failure-mode handling, AppAuth-shared-id
pattern via on-chain KMS).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two closeout pieces for the architecture-testing phase before
implementing stage 4 itself.
STAGE4_PLAN.md revision 2:
- Drops the per-peer phala_app pattern from rev-1 in favour of
"one phala_app per role with replicas: N", matching dstack's
native app->instance grain.
- Each instance reads its identity from a UUID file written to
its persisted disk on first boot. No PEER_ID env var; no
PEERS_JSON env var.
- Peer discovery via Consul: each instance registers itself with
role + ordinal + identity-port set as service tags. Adding a
peer is a `replicas` bump.
- bootstrap-secrets init container is the keystone: derives all
cluster-wide secrets (gossip, TURN, Connect-CA seed) via
getKey() AND manages per-instance UUID + ordinal claim via
Consul KV CAS.
- Rolling updates without per-instance Terraform resources: a
rollout.sh that calls workload-aware drain verbs (consul
operator raft transfer-leader, etc.) gates each replica.
Once phala-cloud#243 lands `update_policy`, most of this
collapses into HCL.
- Updated migration notes: stages 0-3b stay frozen as historical
reference; stage 4 is the integrated product.
stage4-experiments/disk-persistence/ — empirical verification of
THE keystone assumption: docker named volumes survive in-place
phala_app compose updates.
Test: deploy a CVM, write UUID 90ce33e5... to a named volume, bump
a tfvar that flips the compose body, terraform apply. After ~3 min
in-place update (same app_id, same primary_cvm_id), curl the
volume-served file -> identical UUID. Disk persisted. ✅
Caveats noted in RESULTS.md: didn't test under replica scaling
or image bumps. Will exercise both inline during stage-4 build.
Live state cleanup: stage3b cluster (4 CVMs at $0.058/hr each)
torn down. coturn + signaling on 155.138.146.255 still up
(dirt cheap, useful as TURN fallback for any future test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng the actual repo
Verified the Go SDK at github.com/Dstack-TEE/dstack/sdk/go/dstack —
two corrections to the previous draft:
1. Per-instance identity comes from client.Info(ctx).InstanceID
directly. The plan's "write UUID to /var/lib/dstack/instance-id
on first boot, read it back on subsequent boots" was redundant —
dstack already exposes a stable per-CVM ID through the SDK,
rooted in the platform rather than a file we wrote. Drop the
on-disk UUID dance.
2. GetKey signature is (path, purpose, algorithm) returning a
hex-encoded secp256k1 (or other) key, decoded via .DecodeKey()
to 32 bytes. Pseudo-call shape gossipKey = GetKey("...:gossip")
was wrong; real shape is
seed, _ := client.GetKey(ctx, "dstack-mesh/gossip", "cluster", "secp256k1")
gossipBytes, _ := seed.DecodeKey()
The 32-byte output is fine to use as the gossip key directly,
or to HKDF for multiple sub-keys.
bootstrap-secrets simplifies as a result: no on-disk UUID
write/read logic, just GetKey() + Info() into tmpfs. ~80 LoC.
Bonus finding: same SDK exposes Sign(), Verify(), GetQuote() — so
the deferred "attestation-gated mesh join" work (originally Stage 2
in the plan) now fits cleanly into stage 4 with no new tooling.
Each peer signs its mesh-conn auth message, the coordinator
verifies before letting it onto the overlay. Noted as a bonus
add-on, not a stage-4 requirement.
Open items list updated: disk-persistence ✅, SDK existence ✅;
container ordering + Consul CAS-vs-hash for ordinal claim still
TBD inline during build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The keystone of the stage-4 design: the only piece that holds
plaintext cluster secrets, and it does so entirely inside the TEE.
What it does (one-shot, ~250 LoC):
1. Connects to /var/run/dstack.sock via the official Go SDK
(github.com/Dstack-TEE/dstack/sdk/go/dstack).
2. client.Info(ctx) -> self identity (AppID, InstanceID,
ComposeHash). Per-CVM identity comes from the SDK directly;
no on-disk UUID write/read.
3. client.GetKey(ctx, path, purpose, "secp256k1") for each of:
- dstack-mesh/gossip
- dstack-mesh/turn
- dstack-mesh/connect-ca
Same path/purpose/algorithm tuple yields the same 32 bytes on
every replica that shares an app_id (which all replicas of
one phala_app do). No secret material ever transits the
deploy host.
4. Workers claim a stable ordinal (0..N-1) via Consul KV CAS on
`cluster/<name>/slots/<i>`. InstanceID is the slot's permanent
owner so restarts re-find their own slot. Coordinator skips
this — it's always ordinal 0 (chicken-and-egg: it IS Consul).
5. Computes per-protocol ports from PROTOCOL_BASES env +
ordinal.
6. Writes secrets (hex-encoded, mode 0400) to /run/secrets/* on
a tmpfs volume. Writes /run/instance/info.json with identity
+ ports for sibling services to read.
7. Exits cleanly so docker-compose `depends_on` with
`condition: service_completed_successfully` releases consul,
mesh-conn, sidecar, etc.
Required env:
CLUSTER_NAME, ROLE, PROTOCOL_BASES (JSON).
Workers also need CONSUL_HTTP_ADDR (the local agent).
Compile chain:
- Go module pinned to dstack/sdk/go @5cfd7db (2026-03-19; latest
commit on master at the time of writing).
- SDK requires Go >= 1.24; the local toolchain auto-upgrades via
GOTOOLCHAIN=auto.
- Multi-stage Dockerfile produces a ~11MB static binary on
alpine.
Note on stale slots: when an instance is permanently retired (vs
restarted), its slot's KV entry stays. Cleanup is an operator
task today; production version would key the KV entry with a
Consul Session that has a TTL so stale slots auto-clear. Flagged
in code comments.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The whole cluster is now defined in one HCL file and brought up with
one `terraform apply`.
What's new under stage4/:
- mesh-conn/: clone of stage1 with two small additions —
self identity loaded from /run/instance/info.json (written by
bootstrap-secrets from the dstack SDK's Info()), and TURN secret
loaded from /run/secrets/turn (also bootstrap-secrets-derived).
PEERS_JSON still env-passed; cluster.tf computes it from the
`replicas` count so adding a peer is a `replicas` bump.
- compose/coordinator.yaml + compose/worker.yaml: frozen
templates that wire bootstrap-secrets + mesh-conn + consul +
{coturn,signaling} (coord) or {webdemo,sidecar} (worker), all on
network_mode: host, with a tmpfs volume for /run/secrets and
/run/instance so derived state never touches the persistent disk.
- cluster-example/cluster.tf: the user-facing surface. Two
phala_app resources (coordinator replicas:1, worker
replicas:N), shared protocol_bases, computed peers_json. Adding
a peer = `worker_replicas` bump + apply.
- cluster-example/rollout.sh: workload-aware rolling update
driver. Snapshots Consul, applies one app at a time via
-target, waits for cluster green between steps. Replaces the
update_policy block we'd want once phala-cloud#243 lands.
- stage4/README.md: how a deploy works, how to add a peer,
how to update images, what was deferred.
Boot sequence end-to-end:
1. terraform apply provisions both phala_apps; CVMs come up.
2. bootstrap-secrets (init container) calls dstack SDK
Info()+GetKey(), writes /run/secrets/{gossip,turn,ca-seed} +
/run/instance/info.json (identity + ordinal + ports), exits.
3. consul + mesh-conn + sidecar + workload start in dependency
order via `depends_on: { bootstrap-secrets: { condition:
service_completed_successfully } }`. They read their config
from the tmpfs files written in step 2.
4. mesh-conn opens ICE+yamux per peer-pair; consul forms its
cluster through the overlay; Connect mTLS works between
workers via Envoy sidecars.
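The ordering in step 3 can be expressed in compose roughly as follows; this is a sketch, with service names assumed to match the templates:

```yaml
services:
  bootstrap-secrets:
    # one-shot init: writes /run/secrets/* and /run/instance/info.json,
    # then exits 0 so dependents are released
    restart: "no"
  consul:
    depends_on:
      bootstrap-secrets:
        condition: service_completed_successfully
  mesh-conn:
    depends_on:
      bootstrap-secrets:
        condition: service_completed_successfully
```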
Three properties this delivers vs the per-stage scripts we had
before:
- Single source of truth (cluster.tf), no per-peer env-var matrix
duplicated across deploys.
- Secrets never seen by the deploy host — bootstrap-secrets is the
only piece that holds plaintext keys, and it does so entirely
inside the TEE.
- Disk volumes preserved across in-place updates (verified in
stage4-experiments/disk-persistence/RESULTS.md), so Consul Raft
state, KV, and any future Patroni WAL survive image bumps and
config changes.
Carry-overs to next iteration: stale-slot cleanup needs Consul
Sessions with TTL (not unconditional CAS-claim); multi-server Consul
HA is a one-line `replicas: 3` change but pulls that question
forward. README spells out what's deferred and why.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iterating toward the first green stage-4 deploy. Six substantive
fixes and additions came out of the smoke test:
1. Worker compose was missing WORKER_ORDINAL in bootstrap-secrets's
environment block. Cluster.tf passed it via --env to the CVM,
bootstrap-secrets's Go code read it from env, but compose
never plumbed it into the container. Result: bootstrap-secrets
fell into the Consul-CAS-claim path, found no Consul (it's on
the unreachable coordinator), exited 1, and dstack tore the
whole CVM down with `service "bootstrap-secrets" didn't
complete successfully: exit 1`. One missing line, ~3 hours of
serial-log archaeology to find.
2. Workers now declared as N separate phala_app resources via
for_each (not one app with replicas:N). Each gets its own
WORKER_ORDINAL env so bootstrap-secrets can compute the ports
without Consul-side coordination. The replicas-N path requires
per-instance env which phala_app doesn't expose today (filed as
phala-cloud#243).
3. bootstrap-secrets now picks an ordinal source explicitly:
a. WORKER_ORDINAL env (preferred when present)
b. ROLE=coordinator → ordinal 0
c. Consul KV CAS (fallback for the eventual replicas:N path)
This breaks the chicken-and-egg between bootstrap-secrets and
Consul that the worker hit.
4. Gossip/turn/ca-seed each emitted in a format the consumer can
actually use: gossip is base64 (consul -encrypt), turn is hex
(coturn --static-auth-secret), ca-seed is hex (HKDF-friendly
bytes). Previously everything was hex which made consul reject
the gossip key.
5. Compose templates now use bind-mounts to /tmp/dstack-runtime
instead of named docker volumes — initially debugged thinking
named volumes didn't share on dstack (filed phala-cloud#245
then closed as user error after retesting cleanly). Bind
mounts work fine and the comment notes it's for "secrets are
re-derived from getKey() each boot anyway, so /tmp ephemerality
is fine".
6. Added compose/worker-debug.yaml — minimal worker (just
bootstrap-secrets + a no-depends sleeper) for diagnosing
future boot-sequence regressions in isolation.
Coordinator still needs Phala admin to enable UDP ingress on its
app to make embedded mode (coturn + signaling on the same CVM)
fully functional. Next iteration: fall back to external
coordinator (the existing Vultr coturn+signaling) so we can land
end-to-end smoke without that gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke now passes — `/all` on each worker fans out across all
3 webdemo instances via Consul Connect mTLS over the mesh-conn UDP
path. Three independent issues surfaced during the smoke and are fixed:
1. mesh-conn ICE wedge after first failure. pion/ice's
   `agent.Dial`/`Accept` blocks indefinitely once ICE transitions to
   Failed, so the outer `runPeerLink` retry loop never fires and the
   peer slot stays dead until the container is bounced. Cancel the
   dial context from the state callback (Failed/Closed) and add a 60s
   belt-and-suspenders timeout. Tighten the auth wait from 10 min to
   60 s for the same reason — the long timeout was the only reason a
   retry was even *theoretically* possible, and it left a 10-minute
   window where the slot looked silent. Also call `agent.Close()` on
   every error path so a stuck attempt doesn't hold pion goroutines.
2. webdemo + sidecar entrypoints needed jq. Both compose entrypoints
   parse `/run/instance/info.json` with jq; the alpine/envoy base
   images don't ship it. Add jq to both Dockerfiles. The missing
   binary fast-failed the workload service, which is what kept
   webdemo + sidecar in a restart loop on every smoke until now.
3. Coordinator-internal coturn/signaling were unreachable from
   workers. dstack-gateway is TCP-only and doesn't surface arbitrary
   CVM ports, so `SIGNALING_URL=http://<coord-app-id>...:7000` and
   TURN against the coordinator's own coturn never worked. Switch
   both coordinator and worker mesh-conn to the external (Vultr)
   signaling+coturn that workers were already using; the
   coordinator's embedded copies still run but are unused. Wire the
   new paths through cluster.tf as `external_*` variables.
Drop `-encrypt` from the consul launch — we'd already removed gossip
encryption to unstick the cluster — and the now-unused
TURN_SHARED_SECRET-from-/run/secrets path is replaced by env-first
resolution in mesh-conn.
Add two follow-up notes to stage4/README.md based on what the smoke turned up: shared TEE-derived secrets across separate phala_apps need a shared AppAuth contract (gating for stage 2 attestation-gated join), and mesh-conn's ICE recovery is now in-process but the signaling broker should also age out stale auth/candidate entries. Cross-link the new terraform-provider-phala#6 env-drift issue.
The provider issue (#6) was a downstream symptom; root cause likely lives in the API surface, so the bug was moved to phala-cloud#246.
Adds the original-goal Patroni service to the worker compose. Each
worker runs:
- patroni 4.0 + PostgreSQL 16 (single image, ~250MB)
- entrypoint.sh renders /etc/patroni.yml from /run/instance/info.json
(ordinal, postgres + patroni_rest ports) plus CLUSTER_NAME
- data dir lives in the named docker volume `patroni-pgdata` so
it survives container restarts (CVM reboots wipe it; persistence
across reboots is a future stage4-experiments topic)
Cluster wiring:
- cluster.tf grows two new protocol slots: postgres=18700 and
patroni_rest=18800. Adds `var.patroni_image` + threads
PATRONI_IMAGE through worker env.
- bootstrap-secrets derives two more cluster-wide secrets via
getKey() — patroni-superuser and patroni-replication. They're
identical on every replica because all peers derive against the
same path + ClusterName, so any peer can bootstrap as leader
without out-of-band secret distribution.
- All Patroni instances point at 127.0.0.1:<own_consul_http>; cross-
peer replication uses 127.0.0.1:<peer_postgres_port>, which the
mesh-conn UDP forwarder maps to the right CVM transparently.
Patroni's own leader election runs through Consul KV — no separate
DCS needed. With three workers we get fault tolerance of one (1
leader + 2 replicas).
…, not the first

After a peer bounce, multiple auths from that peer can reach pollLoop
in a single batch. The original `select case authCh <- ... default`
kept the FIRST auth and silently dropped every later one. dialICE
then consumed the stale auth, called `agent.Dial` against the wrong
ufrag/pwd, and ICE Failed.

The earlier ICE-state cancel fix correctly aborts and retries — but
on retry pollLoop has no fresh auth in the queue (already drained),
so dialICE waits 60s and retries again, while the *peer* in turn
publishes a NEW auth that pollLoop also drops because the channel is
still buffered with the original stale auth. Both sides repeat
forever and the link never re-establishes.

Drain-then-push so the channel always holds the most-recent auth. The
channel is buffered to 1 and only one goroutine writes (pollLoop), so
there is no contention and the drain is safe.
Coordinator goes from a single phala_app with replicas:1 to a
for_each over `var.coordinator_replicas` (default 3), giving an
actual Raft-replicated 3-server Consul cluster instead of
bootstrap-expect=1.

Per-instance ordinal is passed in via env (`COORDINATOR_ORDINAL`),
mirroring the worker pattern, since bootstrap-secrets needs to know
its own ordinal before Consul KV is reachable (we can't ask Consul KV
for the ordinal because Consul is *on* the coordinators we're trying
to bootstrap). The KV-CAS claim path stays as a fallback for the
eventual replicas:N future once phala-cloud#243 lands.

Worker ordinals shift by `coordinator_replicas` so the peer ID space
stays contiguous (coordinators 0..C-1, workers C..C+W-1). Workers
retry-join *every* coordinator's serf port (mesh-conn forwards each
one), and pick any coordinator's HTTP port for KV calls.
Coordinator's consul launches with `-server -bootstrap-expect=N` and
loops over COORDINATOR_SERF_PORTS to retry-join its server peers
(skipping its own).

What this gets us: fault tolerance of 1 (3-server quorum) with the
Consul UI/API still served from any coordinator. Patroni's DCS now
sits on top of a real HA Consul, not a single point of failure.
…es on new auth

The mailbox previously kept appending forever — and because mesh-conn
republishes auth+candidates on every dialICE retry, a recipient would
drain a long backlog where the FIRST auth was the oldest. After my
recent mesh-conn pollLoop fix that backlog became less catastrophic
(the latest auth wins in the buffered channel), but the candidates in
between are still added to the new ICE agent. pion then dials against
addresses whose UDP sockets are gone, ICE Fails, and the loop repeats
forever for a peer that bounced.

Drop all stale messages from a sender when a NEW auth from that
sender lands in the recipient's queue. Auth marks the start of a
fresh epoch — mesh-conn always publishes auth BEFORE its candidates
(candidates come from OnCandidate AFTER GatherCandidates, which
happens after the auth publish), so anything in queue from before
this auth is by definition stale.

This is the signaling-broker mate of the mesh-conn drain-then-push
fix from 4c36c76 — the broker now actively reaps the backlog instead
of relying on the consumer to do it correctly.

Note: the same mailbox impl is used by the stage4 signaling image
(which is built from this phase0 source). Deploying this requires
rebuilding + pushing the signaling image and restarting it on the
Vultr coordinator host.
Concurrent phala_app creates against the same workspace return 400 'parameters not compatible'. Workaround: terraform apply -parallelism=1. Track upstream fix for the misleading error code.
mesh-conn computes its self_id as `role-ordinal` from
/run/instance/info.json, then looks for that ID in PEERS_JSON. The
multi-coord change shifted worker ordinals to start at C
(coordinator_replicas), but the peer-list IDs were still using slot
(`worker-1`, `worker-2`, `worker-3`) — so e.g. worker-1's mesh-conn
saw self_id="worker-3" but PEERS_JSON only had "worker-1", and
exited with `PEER_ID "worker-3" not in PEERS_JSON`.
Use ordinal in the peer ID. The phala_app name still uses the
1-based slot for human-friendly CVM names ("stage4-worker-1"), but
the peer-id and the in-CVM identifier are now consistent.
worker↔worker instability under load

Adds MESH_CONN_RELAY_ONLY env (default off) that restricts pion's ICE
candidate gathering to Relay only — useful as an escape hatch when
direct (host/srflx/prflx) candidates establish but flap.

Tested on the live stage4 cluster: relay-only made things WORSE for
this dstack worker NAT pattern (pion's relay-relay pair selection
isn't reliable, observable as TURN allocation churn on coturn). Left
the flag in as a debug switch but documented it as not-the-fix in
README.

The actual symptom — `srflx <-> prflx` link goes Connected, yamux
throws `accept: short buffer` 5–60s later, pg_basebackup keeps
failing — is captured in the new "Known limitation" section with a
concrete next-steps list (instrumentation, MaxStreamWindowSize cap,
QUIC, WireGuard).
The instrumentation pass added byte counters per-link, yamux's own
log output (was io.Discard), full ICE selected-pair addresses (not
just types), and a 10s telemetry tick. That trace pinpointed two bugs
that were previously silent:
1. ice.Conn.Read returned io.ErrShortBuffer because pion is
   packet-oriented — when the caller's buffer is smaller than the
   next UDP datagram, pion truncates. yamux's 4096-byte bufio.Reader
   was too small for TURN-encapsulated datagrams. Fixed by a
   65535-byte packetizing adapter (countingConn) that always reads
   full datagrams and re-serves them to yamux as a stream.
2. My own attempted 5s yamux keepalive killed the link under load
   when a pg_basebackup burst delayed a keepalive past the timeout.
   Reverted to 30s/10s defaults.
Adds two debug env switches that didn't pan out for our specific NAT
environment but are kept as escape hatches:
- MESH_CONN_RELAY_ONLY=1: only Relay candidates. Made things worse on
  dstack (relay-relay pair selection unreliable).
- MESH_CONN_TCP_ONLY=1: TCP NetworkTypes + filter URLs to Proto=TCP.
  pion still picks `relay (proto=udp)` because relay transport is the
  *relayed* leg, always UDP unless RFC 6062 TCP allocation is
  requested (pion's TURN client doesn't).
End state for stage 4: Consul (3-server Raft + 6 members) and Patroni
leader election are solid. Patroni replication still requires
sustained worker↔worker bulk transfer, which hits the
yamux-on-lossy-UDP wall documented in the README "Known limitation"
section. Real fix needs a different transport (QUIC, WireGuard, or
TCP-relay end-to-end).
Captures the live cluster's app IDs, SSH command pattern, terraform.tfvars image tags, the 60-second reproducer for the open worker↔worker mesh-conn drop, what was already tried (so the next session doesn't re-walk the same paths), and open hypotheses to investigate with fresh eyes — deliberately without committing to a fix direction.
…working tree

Working-tree mesh-conn/main.go has been swapped from yamux to quic-go
on top of the same pion/ice packet conn, plus a sibling
stage4/quic-on-ice/ experimental module. Neither is committed and the
live cluster still runs the previous yamux image. RESUME now flags
the discrepancy so tomorrow's session sees it on first read.
yamux assumes a reliable byte-stream underlay, but pion/ice.Conn is
UDP and the path between dstack worker CVMs is extremely lossy (~99%
direction-asymmetric loss when same-NAT hairpinning, ~78% on the
coturn-relay path). The "keepalive timeout" / "recv window exceeded"
errors we kept seeing were yamux's reliability invariants firing on
dropped packets, not yamux bugs.

Replace yamux with quic-go on the same pion/ice.Conn (wrapped as a
net.PacketConn). QUIC has built-in loss recovery + stream
multiplexing, so a lossy UDP underlay is exactly what it expects. TLS
uses a self-signed cert because mesh peer trust is established
out-of-band by the dstack TEE layer + TURN HMAC. The 3-byte (tag,
port) stream header convention is unchanged; runAcceptLoop and the
TCP/UDP pumps are line-for-line near-equivalents on *quic.Stream.
Same hairpin path that killed yamux at 3 KB now sustains 25-28 MB/s
for pg_basebackup. Both replicas (worker-4, worker-5) bootstrap and
stream cleanly from leader worker-3.

Also drops the old packetizing read-buffer in countingConn (no longer
needed — quic-go reads through the PacketConn shim, which preserves
datagram boundaries) and introduces a sibling smoke-test module
stage4/quic-on-ice/ that proves QUIC over pion/ice.Conn end to end
(10 MB worker↔worker hairpin in ~1s). RESUME.md rewritten as a "done"
note with the QUIC story and verification recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Soft-kill leader-failover walkthrough verified end-to-end on the live cluster: Patroni elects via Consul KV, worker-4 promotes, writes resume, worker-3 rejoins as a streaming replica without pg_basebackup. Measured RTO ~24s (kill → first successful write on new leader), well within Patroni's default ttl=30s. Captures the reproducible recipe, a measured timeline, knobs for the RTO/availability tradeoff, and what's still untested (hard CVM kill, network partition, disk-loss rejoin). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed RTO

Extends FAILOVER.md with the whole-userspace failure scenario: kill
all containers on the leader CVM simultaneously, then bring them back
via `docker compose up -d`. Measured RTO ~33s (9s longer than
soft-kill due to Consul gossip-failure detection on top of Patroni's
TTL). Also confirms best-replica selection under uneven replica lag,
QUIC mesh-conn ICE redial after a peer's userspace evaporates, and
cheap rejoin via local WAL replay (no pg_basebackup).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the MESH_CONN_TCP_ONLY env knob entirely (from dialICE, both
compose templates, and reportLinkStats's tick cadence). The flag was
investigated as a yamux-era escape hatch and proven non-helpful — pion
still selects relay-UDP candidates regardless because the relay
candidate's transport comes from the TURN allocation's relayed leg
(always UDP unless RFC 6062 TCP-allocation requested), not from the
client→TURN leg. With the QUIC switch, the underlying loss is handled
by the transport layer, so the knob has no remaining purpose and was
becoming misleading.
Also quiets reportLinkStats: tick 10s → 60s and skip the log line
entirely when bytes haven't moved since the last tick. Idle peer pairs
no longer spam every 10 seconds. Final-stats line on stop is unchanged
so postmortems still get a summary regardless of activity.
Drops the unused *quic.Conn parameter from reportLinkStats, refreshes
the stale "log every 10s" banner, and tightens the MESH_CONN_RELAY_ONLY
comment in worker.yaml so the rationale ("flip on if worker-to-worker
direct pairs fail") doesn't contradict itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…venance

Adds .github/workflows/consul-postgres-ha-publish.yml — a matrix
build that builds and pushes the six stage-4 images (mesh-conn,
bootstrap-secrets, signaling, webdemo, sidecar, patroni) to
ghcr.io/dstack-tee/dstack-examples/consul-postgres-ha-* on push to
main, tagged with both the long-form commit SHA and `latest`. PRs
build to verify but do not push.

Each push is signed with a Sigstore-backed GitHub Build Provenance
attestation via actions/attest-build-provenance@v2 — the workflow's
GitHub OIDC token gets a short-lived Sigstore cert, no keys we
manage. Consumers verify with
`gh attestation verify oci://...@<digest> --repo Dstack-TEE/dstack-examples`,
which proves the image came from this commit of this workflow.

Replaces ttl.sh references in terraform.tfvars.example with the GHCR
ones, fills in the previously-missing patroni_image and
coordinator_replicas lines, and adds inline docs on pinning to a
sha-tag for prod stability and on running the verification command.

PUBLISHING.md walks through the three paths a stage-4 user actually
hits: the CI publish (steady state), manual one-off ttl.sh /
personal-GHCR builds for dev iteration, and the on-CVM hot-patch flow
that sidesteps phala-cloud#246 when iterating on a running cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…patch

terraform-provider-phala#8 fixed the env-block in-place-update bug
(phala-cloud#246) and shipped as v0.2.0-beta.3, so:
- cluster.tf required_providers now pins ">= 0.2.0-beta.3" with a
  comment explaining why earlier versions are unusable for this
  stack.
- PUBLISHING.md's hot-patch section reframes its motivation: the
  per-CVM hot-patch path remains useful as a dev shortcut and as the
  only option on clusters still running 0.2.0-beta.2, but it is no
  longer the workaround for env updates not landing — terraform apply
  works correctly now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rejoin

Two small follow-ups after verifying the v0.2.0-beta.3 env-update
path against the live cluster:
1. Provider pin in cluster.tf changed from `>= 0.2.0-beta.3` to
   `0.2.0-beta.3` exactly. Terraform's `>=` operator does NOT include
   later prerelease versions, so `>= 0.2.0-beta.3` only matches
   stable `>= 0.2.0` — `terraform init` failed with "no available
   releases match the given constraints". Pin exactly until we hit a
   stable.
2. FAILOVER.md gains a disk-loss rejoin section: stop patroni, wipe
   the patroni-pgdata volume, restart, watch Patroni's bootstrap path
   pull a full pg_basebackup from the leader over mesh-conn's QUIC
   tunnel. Measured 5.2 MB / 7s end-to-end on the live cluster
   (handshake-dominated for a small dataset; the real throughput
   number remains the ~25 MB/s pg_basebackup observed during the
   soft-kill section). Closes the last "What this demo does NOT
   cover" item.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lines

Discovered while verifying terraform's beta.3 env-update path against
the live cluster on 2026-05-04: when terraform recreates a CVM, the
peers' QUIC links to it die, but the *redial* path can hang.

Specifically: dialICE returns a Connected ice.Conn, dialAndPump
enters quic.Dial, ICE later goes Failed (peer went away again,
hairpin lost, etc.). quic.Dial's context times out and quic-go calls
SetReadDeadline(past) to interrupt the blocked ReadFrom in our
iceConnPacketConn shim. The shim was returning nil from
SetReadDeadline, so the call had no effect on the underlying
ice.Conn.Read, and the goroutine hung forever. The surrounding
runPeerLink retry loop never got to retry, leaving the peer slot
permanently dead until the entire mesh-conn process was restarted.

Fix: delegate SetDeadline / SetReadDeadline / SetWriteDeadline to the
underlying conn (pion/ice.Conn implements net.Conn deadlines
properly). Same fix applied to the stage4/quic-on-ice smoke test so
future debugging stays trustworthy. Adds a regression test using
net.Pipe (which honors deadlines) that asserts ReadFrom returns a
Timeout-flagged net.Error within ~50ms of SetReadDeadline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The directory was an engineering log — phase0/, stage1/, stage2/,
stage3a/, stage3b/, stage4/, stage4-experiments/ — useful while
building, useless to a user landing here cold who just wants to
deploy HA Postgres on dstack-TEE.
Promote stage4/ contents up one level to consul-postgres-ha/ as the
canonical, opinionated shape. Rename phase0/icetest → signaling/.
Move stage3b/{webdemo,sidecar} up. Drop the predecessor stage*/ +
phase0/ + stage4-experiments/ + deploy/ (historical results) +
STAGE4_PLAN.md. Git history preserves everything.
Final layout:
consul-postgres-ha/
├── README.md / ARCHITECTURE.md / FAILOVER.md / PUBLISHING.md / ROBUSTNESS.md
├── cluster-example/ one cluster.tf
├── compose/ coordinator.yaml + worker.yaml templates
├── coordinator/ external-coordinator docker-compose
├── mesh-conn/ QUIC-over-pion/ICE overlay
├── bootstrap-secrets/ TEE-derives per-CVM secrets
├── patroni/ Patroni + Postgres
├── webdemo/ sidecar/ example workload + Envoy bootstrapper
├── signaling/ HTTP /publish + /poll broker for ICE rendezvous
└── quic-on-ice/ standalone smoke test for the QUIC-over-ICE transport
Updates beyond the moves:
- README.md rewritten as a deploy-first story; old stage-4-internal
README's "Known limitation" + punch-list (yamux + worker-pair
instability) is obsolete since the QUIC swap and isn't preserved.
- ARCHITECTURE.md: 4-CVM topology (ctrl+w1/w2/w3) → 6-CVM (3+3),
yamux deep-dive section replaced with a tight QUIC summary that
matches the actual code.
- ROBUSTNESS.md: yamux → QUIC mentions, "single Consul server SPOF"
section updated to reflect the 3-server quorum that's been live
since `17f4642`, "real registry" recommended-fix moved to "already
shipped" since GHCR + Sigstore is now the publish path.
- All Go module paths bumped: github.com/Dstack-TEE/dstack-examples
/consul-postgres-ha/<name> (no stage4/ or phase0/ infix).
- CI workflow path filters + matrix `context:` paths updated.
- .gitignore rewritten to match the new layout.
- Builds + tests pass on all 5 Go modules under the new paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion admission

Two open architectural gaps surfaced during the consolidation pass on
PR #95 — both deserve their own focused implementation passes rather
than being squeezed into the mega PR. Capturing them as design docs
so a future agent (or a future-me) can pick up either one cold and
start.
- design/single-sidecar.md: collapse the 5 platform-plumbing
  containers (keepalive, bootstrap-secrets, mesh-conn, consul,
  sidecar/Envoy) into one image with a shell-init multi-process
  supervisor. Per-CVM container count goes 8 → 3.
- design/attestation-admission.md: replace the TURN-HMAC-only mesh
  admission with dstack TEE attestation as the credential. Phased
  plan: per-app-id check first (Phase 1, smallest delta, no
  rolling-upgrade pain), Consul-KV-rooted policy doc later (Phase 2).
  Recommends the post-QUIC-handshake-stream insertion point over the
  public signaling broker for privacy.
Both docs include current state, approach, risks, open questions, and
explicit hand-off instructions. Each is ~250-350 lines, written to be
self-contained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…decar image

Per-CVM container count drops from 7 → 3 on workers (sidecar +
patroni + webdemo) and from 6 → 1 on coordinators (sidecar). The new
sidecar image bundles bootstrap-secrets, mesh-conn, consul, and
(workers only) envoy behind a tini-wrapped shell init that dispatches
on ROLE; the old keepalive placeholder, the four-image lockstep, and
the vestigial on-CVM signaling/coturn that had been documented as
unused all drop.

CI matrix: 6 → 4 (sidecar, patroni, webdemo, signaling). The sidecar
build uses the parent consul-postgres-ha/ as docker context so its
multi-stage Dockerfile can pull bootstrap-secrets/ and mesh-conn/ Go
sources from sibling subdirs. cluster.tf: BOOTSTRAP_SECRETS_IMAGE,
MESH_CONN_IMAGE, SIGNALING_IMAGE (coordinator) and the matching
tfvars all collapse into SIDECAR_IMAGE.

Smoke-tested against a fresh terraform apply on dstack-pha-prod5
(2026-05-04). Soft-kill RTO 27s, hard-kill RTO 33s, cheap rejoin
verified, disk-loss rejoin 26s — all within noise of the pre-Gap-2
baselines on the previous multi-container cluster.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
refactor(consul-postgres-ha): collapse platform plumbing to single mesh-sidecar
Summary
Adds stage 4 of the `consul-postgres-ha` example: a 3-coordinator +
3-worker dstack cluster running highly-available Postgres via
Patroni, with leader election driven by Consul KV, all replication
and Consul gossip carried over a custom userspace mesh (mesh-conn)
that hole-punches between TEE CVMs via pion/ICE and multiplexes
streams over QUIC.

The PR is large (43 commits, ~9.5k LoC) because it brings the full
example end-to-end — phase-0 ICE feasibility through stage-3 Consul
Connect through stage-4 Patroni with TEE-derived secrets and in-place
env updates. Each commit is independently reviewable and the
conventional-commits structure makes the chronological narrative
trivial to follow.
What ships in stage 4
- `mesh-conn` — userspace UDP+TCP port-forwarder over pion/ICE +
  quic-go. Each peer multiplexes 8 identity ports per pair; the QUIC
  layer provides loss recovery + stream multiplexing on top of pion's
  lossy UDP underlay.
- `bootstrap-secrets` — init container that calls the dstack SDK's
  `getKey()` and writes per-CVM TEE-derived secrets to
  `/run/secrets/`, plus the per-CVM identity (role, ordinal, port
  table) to `/run/instance/info.json`.
- `patroni` — Postgres + Patroni baked together; reads identity from
  `/run/instance/info.json`, joins Consul via mesh-conn, participates
  in leader election.
- `cluster-example/cluster.tf` — one `terraform apply` brings up 3
  coordinators + 3 workers across CVMs, propagates `PEERS_JSON` via
  in-place env updates, preserves disks across topology changes
  (`storage_fs = "zfs"`).
- `consul-postgres-ha-publish.yml` workflow builds and publishes all
  six images (mesh-conn, bootstrap-secrets, signaling, webdemo,
  sidecar, patroni) to GHCR with Sigstore-backed GitHub Build
  Provenance attestations on every push to main. Consumers verify
  with `gh attestation verify oci://...@<digest> --repo Dstack-TEE/dstack-examples`.

What's verified end-to-end on a live cluster
Full reproducible recipes in
`consul-postgres-ha/stage4/{FAILOVER,PUBLISHING}.md`, plus diagnostic
artifacts. In-place env updates verified via `terraform apply`
against `phala-network/phala 0.2.0-beta.3` (see
Phala-Network/terraform-provider-phala#8).

Notable transport-layer story (the highlight reel)
The mesh-conn started life on yamux. Yamux assumes a reliable
byte-stream underlay; pion/ice.Conn is UDP. Between dstack worker
CVMs the UDP path is brutally lossy (~99% one direction on hairpin,
~78% on coturn relay), and yamux's keepalive/recv-window invariants
tripped under any sustained load, manifesting as "keepalive timeout"
/ "recv window exceeded" — but the real cause was dropped packets
violating yamux's reliability assumptions.
Swapped yamux → quic-go with a `net.PacketConn` shim around
`ice.Conn`. QUIC has loss recovery + stream multiplexing built in —
exactly what an unreliable datagram underlay needs. Same hairpin path
that killed yamux at 3 KB now sustains 25–28 MB/s for pg_basebackup.
See `consul-postgres-ha/stage4/RESUME.md` for the full diagnosis and
`consul-postgres-ha/stage4/quic-on-ice/` for the standalone smoke
test.

Drafting why
Marking draft because:
Test plan
- `go test ./...` clean across all stage-4 modules (`mesh-conn`,
  `bootstrap-secrets`, `quic-on-ice`)
- `docker build` + push to ttl.sh + live deploy
- `terraform apply` env-update propagates without CVM
  destroy/recreate (verified against `phala-network/phala
  0.2.0-beta.3`)
- CI workflow (`consul-postgres-ha-publish.yml`) runs cleanly on
  first push and produces verifiable attestations on GHCR — pending
  this PR's first run

Known follow-ups (not blocking merge)
- `consul-postgres-ha/stage4/RESUME.md` is now obsolete (originally a
  session-bridging doc; superseded by `README.md` + `FAILOVER.md` +
  `PUBLISHING.md`). Will delete once the PR is approaching
  ready-for-review, unless reviewers want to keep it as engineering
  narrative.
- Once `phala-network/phala` ships a stable `0.2.0` release, the
  `cluster.tf` provider pin can move from exact (`0.2.0-beta.3`) to
  `~> 0.2`.

🤖 Generated with Claude Code