consul-postgres-ha stage 4 — Postgres HA across dstack-TEE workers#95
Draft
Adds a phase-0 experiment to verify whether dstack CVMs can establish
direct UDP paths via NAT hole-punching, as a prerequisite for running
Consul (or any UDP-gossip service mesh) across CVMs over the TCP-only
dstack-gateway.
Components:
- coordinator/: docker-compose for coturn (STUN+TURN, UDP+TCP) plus a
tiny HTTP signaling broker, deployed on a user-provided public-IP host
- phase0/icetest/: single Go binary with two modes
- signaling: ferries ICE candidates and ufrag/pwd between two peers
- peer: runs pion/ice against coturn, exchanges candidates via the
broker, performs connectivity check, sends 20 echo round-trips, and
logs the winning candidate-pair type + RTT
- phase0/docker-compose.yaml: dstack-CVM compose that runs the peer
- deploy/phase0-results.md: result of the live run
Result: direct hole-punched UDP works between two dstack CVMs (srflx
candidates via NAT hairpinning), median RTT ~6.6 ms over the public
internet path, no TURN relay needed. TURN is available as fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on phase-0's "direct UDP hole-punch works on dstack" finding.
Stage 1 wraps a pion/ice connection in a TUN device so arbitrary IP
traffic can flow between two CVMs, not just hand-written echo packets.
Components:
- stage1/mesh-conn: ~280 LoC Go, single binary
- opens TUN mesh0 with a virtual /24 IP
- establishes one pion/ice connection to its partner via the same
coturn + signaling broker phase-0 used
- 1:1 pumps L3 packets between TUN and ice.Conn (no framing — ice
rides on UDP, datagram boundaries are preserved)
- logs the selected ICE candidate pair for visibility
- stage1/docker-compose.yaml: mesh-conn + nicolaka/netshoot tester,
both on network_mode: host
Result (deploy/stage1-mvp-results.md): direct host<->srflx hole-punched
path, ICMP through the tunnel runs at 4.8–8.4 ms RTT, matching phase-0
native UDP latency. Confirms userspace overhead is negligible.
Caveat: docker-bridge networking forces ICE onto the TURN relay path
(observed 163 ms RTT in the broken run) because srflx replies can't
route through the bridge NAT. mesh-conn must run with
network_mode: host on dstack.
This MVP is a stepping stone: the next iteration replaces the TUN
with a userspace port-forwarding agent so apps simply dial
localhost:&lt;port> to reach upstreams on peers; no virtual L3, no
kernel routing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the TUN-based overlay with a much simpler userspace UDP
port-forwarding agent. No TUN device, no virtual L3, no NET_ADMIN
capability, no songgao/water dependency — just `net.ListenUDP` per
peer-pair, bridged 1:1 with one pion/ice connection.
Identity-port convention: each peer has a unique 16-bit identity port.
On every host:
- the local app binds 127.0.0.1:<own_port>
- mesh-conn binds 127.0.0.1:<other_peer_port> for every OTHER peer
- apps reach peer X by sending UDP to 127.0.0.1:<X_port>
The source-port-preservation trick (mesh-conn's bound socket is the
*sender peer's* identity port) means the receiving app sees inbound
packets as coming from 127.0.0.1:<sender_id_port>, which is the
address the cluster's peer-discovery / membership protocol uses to
identify the sender. So Consul or any membership-aware service plugs
in unchanged.
Verified end-to-end on two dstack CVMs (deploy/stage1-portfwd-results.md):
ICE selected the direct host<->prflx hole-punched path; 5/5 socat-based
UDP round-trips delivered the correct payload through the bridge.
Why this is the right shape: the TUN approach (committed earlier as a
milestone) gave us a virtual L3 we didn't actually need. Apps in a
service-mesh demo just want "send UDP to a stable peer address" — a
userspace bridge is enough, cheaper to operate (no TUN device on the
host), and easier to reason about for stage-2 attestation gating.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…orkers)
Deploys the port-forwarder on 4 dstack CVMs with a shared PEERS_JSON,
verifies all 6 ICE links come up concurrently and traffic flows in
every direction without code changes — mesh-conn already iterates
peers generically. 12/12 cross-peer one-way UDP datagrams delivered,
all paths direct hole-punch (no TURN relay selected).
Sets up the next decision: how to carry TCP across CVMs so Consul's
RPC + gossip-state-sync work. Plan documented at the bottom of the
result file: add a multiplexed TCP path to mesh-conn rather than
routing TCP via dstack-gateway, so apps only have to know about one
transport.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layers TCP forwarding onto the existing UDP port-forwarder so Consul
(which uses both UDP gossip and TCP RPC + gossip-state-sync on the same
serf port) and any other TCP-needing service can run cross-CVM through
the same agent.
How it works:
- Each peer-pair still has exactly one pion/ice connection.
- That connection is wrapped in a yamux session; the lex-smaller peer
is the yamux client, matching the ICE Dial/Accept convention.
- Each yamux stream's first byte tags its purpose:
0x55 streamUDP — long-lived control stream, length-prefixed UDP
datagrams flow both ways
0x33 streamTCP — per-connection ephemeral, raw byte splice
- mesh-conn now binds both a UDP socket and a TCP listener on
127.0.0.1:<peer-port>; local Accept on either opens a stream of the
matching tag. On the remote side a new TCP stream causes a Dial to
127.0.0.1:<self-port> and bidirectional splice.
Verified end-to-end on the existing 4-CVM cluster: 12/12 cross-peer HTTP
curls succeeded through the bridge (deploy/stage1-tcp-results.md). UDP
fan-out from earlier still works.
Single ICE conn + yamux mux trades a small head-of-line risk for
halving NAT-mapping pressure vs running separate UDP and TCP ICE
connections. Acceptable for Consul-grade traffic; can split later if
jitter sensitivity demands it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pair
Extends mesh-conn so each peer can forward several ports through a
single ICE+yamux pair. Required for Consul, which advertises one bind
address but uses several ports for distinct protocols (serf-LAN gossip,
server-RPC, HTTP API, gRPC/xDS); each protocol needs its own per-peer
identity port for the source-port-preservation trick to work.
Changes:
- Peer.Port int -> Peer.Ports []int (PEERS_JSON now carries a list per
peer; index i is the same protocol across peers)
- yamux stream header grew from 1 byte to 3:
[tag (1)] [receiver-side port, uint16 BE (2)]
- Per peer-pair: still one ICE conn + one yamux session
* lex-smaller side opens N long-lived UDP streams up front, one per
port, each tagged with the peer's port for that index
* lex-larger side accepts, looks the port up in self.Ports, pairs
with the matching local UDP socket
* TCP: per-connection ephemeral streams, header carries the dst
port so the receiver dials its own matching local listener
Verified: 4-CVM cluster (ctrl + 3 workers), 4 ports per peer.
deploy/stage1-multiport-results.md — 48/48 cross-peer HTTP fetches
through the bridge succeeded (4 protocol slots × 12 directed peer-pairs).
All ICE pairs landed on direct host<->{prflx,srflx} paths, no relay.
Trade-offs in design notes: one ICE+yamux per pair was preferred over
one ICE per port to keep NAT-mapping pressure low (6 pairs vs 24) and
to give an all-or-none readiness guarantee for the protocol slots.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…verlay
Stands up a real HashiCorp Consul cluster (1 server + 3 clients) on
four TEE-isolated dstack CVMs whose only inter-CVM data path is the
userspace mesh-conn ICE+yamux port-forwarder built in stage 1.
What's new:
- stage2/docker-compose.yaml: per-peer compose with mesh-conn,
a hashicorp/consul:1.19 agent, and a netshoot tester sidecar,
all on network_mode: host.
- Consul launched via shell wrapper that branches on ROLE env var:
server (-server -bootstrap-expect=1 -ui) for ctrl,
client (-retry-join=127.0.0.1:CTRL_SERF_LAN_PORT) for workers.
- Each peer's agent binds to 127.0.0.1 with its own per-protocol
identity ports (serf=180XX, RPC=181XX, HTTP=182XX, gRPC=183XX),
matching the mesh-conn port plan; mesh-conn forwards each port
to the corresponding peer and source-port-preservation makes the
addresses look right from every Consul agent's perspective.
- stage2/README.md documents the port plan and how Consul gossips
peer ports so workers can dial the leader's RPC port through the
overlay.
Verified (deploy/stage2-results.md):
- All 4 peers see all 4 members alive in /v1/agent/members.
- All 4 peers agree leader = 127.0.0.1:18100 (ctrl's RPC port via
the overlay).
- KV write from w1 (curl PUT) is readable from w3 (curl GET) — RPC
to the leader and Raft replication both work across the overlay.
Confirms that Consul's three transport classes (UDP gossip, TCP RPC,
TCP HTTP API) all round-trip cleanly through one yamux session per
peer-pair.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…the overlay
Walks through the exact recipe for running a Consul cluster across
dstack CVMs that have no direct connectivity to each other:
- Why the per-protocol identity-port plan exists and how it falls out
  of mesh-conn's source-port-preservation behaviour.
- The compose layout (mesh-conn + consul + tester, all on host
  networking) and each non-obvious flag explained: bind/advertise on
  127.0.0.1, per-peer -serf-lan-port / -server-port / -http-port /
  -grpc-port overrides, why -dns-port=-1.
- The per-CVM env-var matrix for PEER_ID / ROLE / *_PORT /
  CTRL_SERF_LAN_PORT / PEERS_JSON.
- What the boot sequence actually looks like (mesh-conn → Consul
  agents → leader election → workers join).
- How to verify membership, leader, and cross-peer KV.
The aim is that the next person setting this up doesn't have to
reverse-engineer the trick from the compose file. The whole thing
collapses to: "Consul never sees the overlay; identity ports +
source-port preservation make every peer look like it's on the same
loopback."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fan-out
Layers a small user-facing demo on top of the stage-2 Consul cluster.
Each peer runs a tiny Go service (stage3a/webdemo, ~150 LoC) that:
- registers with the local Consul agent on startup as service
"webdemo" with an HTTP health-check on /hello
- exposes /hello returning "hello from <peer>"
- exposes /all that queries /v1/catalog/service/webdemo on local
Consul and fans out /hello calls to every instance returned
The addresses Consul hands back (127.0.0.1:<peer's webdemo port>)
are routed through mesh-conn to the right peer with no app-side
awareness of the overlay. Per-peer port plan grew by one slot
(index 4 = webdemo HTTP, ports 18500-18503).
Verified end-to-end across 4 CVMs (deploy/stage3a-results.md):
- all 4 webdemos register with the cluster (catalog visible from
every peer)
- /all from every peer returns 4 hellos: ctrl, w1, w2, w3
- HTTP fan-out crosses CVM boundaries via mesh-conn for every
non-self peer
Bug caught and fixed in this round: Consul's
/v1/agent/service/register requires PUT, not POST (returned 405 on
first try).
Sets up stage 3b: replace the plain HTTP path with Connect sidecars
and explicit intentions for mTLS between services.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er the overlay
Replaces stage-3a's plain HTTP service-to-service calls with a real
Consul Connect mesh. Each peer now also runs an Envoy sidecar in front
of its webdemo; sidecars do mTLS to each other across the overlay and
intentions gate the connections.
What's new:
- stage3b/sidecar/Dockerfile: small custom image combining the
consul CLI (for `consul connect envoy -bootstrap`) with Envoy
contrib v1.30. Tiny — no full Consul agent, just the CLI.
- stage3b/webdemo: webdemo registers with a Connect.SidecarService
block telling Consul to manage a sidecar that listens on the
per-peer sidecar_public port and exposes one upstream
"webdemo" on local 127.0.0.1:19000. /all hits the upstream N
times so Envoy's LB rotates across all 4 instances.
- stage3b/docker-compose.yaml: adds the sidecar service, enables
Connect on the Consul agent (-hcl 'connect{enabled=true}'),
PEERS_JSON now has 6-element ports lists (the new sidecar_public
slot, 18600..18603) so mesh-conn forwards mTLS traffic between
peer sidecars.
Verified end-to-end (deploy/stage3b-results.md):
- All 4 sidecars boot cleanly; Envoy logs show clusters loaded and
listeners up (public_listener and webdemo upstream).
- With intention webdemo->webdemo: allow, /all from w1 returns
perfectly balanced load: 2/2/2/2 across ctrl, w1, w2, w3.
- Flip intention to deny: 6/8 calls fail with EOF (peer sidecars
reject the mTLS handshake). Flip back to allow: full balance
restored. Intention enforcement is real.
Bug caught: Consul's /v1/connect/intentions create wants POST (not
PUT). Update-by-ID uses PUT. Two endpoints, two methods — easy to
trip on; called out in the results doc.
Combined picture: a HashiCorp Consul service mesh — Envoy sidecars,
mTLS, intention enforcement — running across four TEE-isolated dstack
CVMs whose only inter-CVM data path is our userspace ICE+yamux
overlay. Apps and Envoy never see the overlay; from any CVM the mesh
looks like a single loopback-only host with peers on 127.0.0.1:<port>.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ARCHITECTURE.md walks the four-layer data plane (rendezvous infra →
ICE+yamux overlay → identity-port forwarder → apps), traces a single
Connect mTLS call all the way through, and is precise about the
mesh-conn × yamux wire format: one ICE conn per peer-pair, one yamux
session per ICE conn, 3-byte stream header (tag, receiver-side port),
2-byte length prefix on UDP datagrams. Includes the four-pump
diagram and the actual pump bodies for both UDP and TCP paths.
ROBUSTNESS.md is an honest review: per-layer failure modes, what
recovers automatically, what doesn't, plus a prioritised punch list.
Headlines:
- mesh-conn has one real bug today (auth-channel reconnect
deadlock) that will bite the first ICE drop. ~30 LoC fix.
- Single Consul server is the biggest structural SPOF; 3-server
quorum is the obvious "leave it running" upgrade.
- Gossip key + RPC TLS not configured today; defence-in-depth gap
masked by Layer-3 mTLS but should be closed.
- Coordinator is a SPOF for new joins (not for established
traffic); two-coordinator setup + signed signalling messages
closes both that and the metadata-spoof gap.
- "Are we playing too many tricks?" — no. The clever-and-ours
surface is just mesh-conn (~330 LoC) and the identity-port
plan; everything else is well-trodden libraries (pion/ice,
yamux, Consul, Envoy). Risk is concentrated in the small
custom shim, not in the count of layers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, validation
Three robustness fixes from ROBUSTNESS.md's punch list, paired
together because they all touch mesh-conn / Consul agent config.
#1: mesh-conn auth-channel reconnect deadlock
Each dialICE attempt now installs a fresh peerSession (new *ice.Agent
+ new authCh), replacing any prior one in the global map. pollLoop
looks up currentSession() per message; if no active attempt exists,
the message is dropped (rather than buffered into a stale channel
that would later poison a reconnect). Fixes the hang where, after an
ICE drop, the next dialICE blocks forever on <-sess.authCh because
the channel still held a stale auth from the previous attempt.
#4: PEERS_JSON validation at mesh-conn startup
validatePeers() in main.go fails fast on:
- <2 peers
- empty peer id, duplicate id
- empty Ports list, port out of [1, 65535]
- duplicate port within a peer's own Ports list
- port collision between two peers (must be globally unique because
  mesh-conn binds OTHER peers' ports on 127.0.0.1)
- port-list length mismatch across peers (every peer must use the
  same number of protocol slots, by index)
- PEER_ID not in PEERS_JSON
Also logs a digest of the canonical PEERS_JSON so operators can grep
across CVM logs to confirm every peer sees the same config. Tests in
validate_test.go cover all cases (8 tests, all passing).
#2: Consul gossip key
Stage 2/3a/3b composes now require a GOSSIP_KEY env var and pass it
to consul agent via -encrypt=$GOSSIP_KEY. Encrypts serf-LAN gossip
end-to-end (UDP+TCP) on every agent. Generated at deploy time via
openssl rand -base64 32. Layer-3 mTLS already protects payloads; this
hardens the membership/check-result path which rides outside Connect.
RPC TLS deferred to the dev-experience restructure where central cert
provisioning fits naturally; gossip key is the bigger gap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan for collapsing the per-stage shell-script + per-peer env-var
matrix into a single cluster.yaml + a small `cluster` CLI that drives
phala deploy.
Headlines:
- cluster.yaml is the single source of truth: peers, protocol port
plan, intentions, secrets policy, deploy params.
- One CLI: validate / plan / up / down / status / logs.
- Control plane is an "embedded" mode where one dstack CVM bundles
coturn + signaling + Consul server, removing the external Vultr
box; requires Phala admin to enable UDP ingress on that CVM.
Falls back to "external" mode (separate non-TEE coordinator host)
when UDP ingress isn't available.
- Mesh-conn / webdemo / sidecar code stays unchanged; the change is
entirely in deploy ergonomics.
- TEE-app constraint is respected: one compose template per role,
only env vars vary per peer; compose-hash audit surface is small.
Future direction noted but not in this stage: derive GOSSIP_KEY /
TURN_SHARED_SECRET inside each TEE via dstack-sdk getKey() so the
deploy host never sees them. Requires AppAuth-shared app-id across
peers; reuses stage-2 attestation work.
Open questions for the user listed at the end of the doc (CLI
language, secret handling, control-plane HA, redeploy semantics).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three live on the redeployed stage-3b cluster:
- mesh-conn validation logs an identical PEERS_JSON digest
  (NiNhinoUekif) on every peer, confirming cross-peer config
  consistency.
- Consul logs Gossip=true → serf-LAN encrypted with the shared
  gossip key.
- Connect mTLS /all still perfectly balanced 2/2/2/2 across the four
  webdemo instances; cluster operation unchanged by the fixes.
#1 (reconnect bug) is verified by code review + the new
validate_test.go test suite; live failure-injection (kill mesh-conn
mid-run) is queued for the stage-4 CI work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, in-place updates
Reshapes stage-4 around four user decisions:
1. No new CLI — use Phala's official
terraform-provider-phala (resource phala_app supports
replicas, env, encrypted_env, custom_app_id+nonce, in-place
update). One cluster.tf is the source of truth.
2. Secrets never in human hands. A small bootstrap-secrets init
container per peer mounts /var/run/dstack.sock, derives
gossip key / TURN secret / Connect-CA seed via getKey(),
writes them to a tmpfs volume, exits. consul + mesh-conn
read those files at startup. All peers share the same
app_id (via custom_app_id + cluster_nonce) so getKey()
returns the same bytes on every peer.
3. Multi-server Consul stays the next stage but unlocks
self-discovering rendezvous: each control CVM registers as
service "mesh-coordinator" and "mesh-turn" in Consul; new
peers know ONE bootstrap endpoint and learn the rest from
the catalog. Topology of the rendezvous becomes a
service-mesh-managed concern.
4. In-place updates preserve disk volumes (Consul Raft state,
KV, sidecar certs, future Patroni WAL). Compose/env diffs
update existing CVMs without recreate; only
custom_app_id/nonce changes rotate identity. Per-node
rollout for the control plane via terraform -target.
Includes a full HCL skeleton, the bootstrap-secrets sketch, and
maps each ROBUSTNESS.md punch-list item to stage 4 vs the next
stage.
Open item: confirm phala_app behaviour (replicas, encrypted_env,
in-place env update, custom_app_id) on the 0.2.0-beta.1 provider
before committing the dev-experience to it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… provider
Empirical verification before committing the dev-experience to
terraform. Spun up a tdx.small nginx via terraform apply, exercised
in-place updates, replicas, then destroyed.
Outcomes (stage4-experiments/tf-shakedown/RESULTS.md):
- create works (~2 min for tdx.small)
- in-place compose+env update preserves app_id and primary_cvm_id
(~3m39s for the upgrade flow); disk volumes survive
- replicas: 1 -> 2 plans in-place; both CVMs land under the same
app_id, which is exactly what TEE-derived secrets via getKey()
need across replicas (no out-of-band coordination required)
- destroy clean (~23s)
Three gotchas baked into RESULTS.md:
- storage_fs MUST be pinned in HCL ("zfs"); otherwise the next
apply diffs "zfs -> (known after apply)" which the provider
treats as ForceNew → destroys the CVM. Without pin, every diff
becomes a recreate.
- provider is at 0.2.0-beta.2; Terraform's >= constraint excludes
pre-release by default — pin exactly.
- field-name shape is positive (listed/public_logs/public_sysinfo),
not the CLI's --no-... shape.
Verdict: provider is good enough for stage 4. Open follow-ups
listed in RESULTS.md (encrypted_env behaviour, custom_app_id +
nonce determinism, failure-mode handling, AppAuth-shared-id
pattern via on-chain KMS).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two closeout pieces for the architecture-testing phase before
implementing stage 4 itself.
STAGE4_PLAN.md revision 2:
- Drops the per-peer phala_app pattern from rev-1 in favour of
"one phala_app per role with replicas: N", matching dstack's
native app->instance grain.
- Each instance reads its identity from a UUID file written to
its persisted disk on first boot. No PEER_ID env var; no
PEERS_JSON env var.
- Peer discovery via Consul: each instance registers itself with
role + ordinal + identity-port set as service tags. Adding a
peer is a `replicas` bump.
- bootstrap-secrets init container is the keystone: derives all
cluster-wide secrets (gossip, TURN, Connect-CA seed) via
getKey() AND manages per-instance UUID + ordinal claim via
Consul KV CAS.
- Rolling updates without per-instance Terraform resources: a
rollout.sh that calls workload-aware drain verbs (consul
operator raft transfer-leader, etc.) gates each replica.
Once phala-cloud#243 lands `update_policy`, most of this
collapses into HCL.
- Updated migration notes: stages 0-3b stay frozen as historical
reference; stage 4 is the integrated product.
stage4-experiments/disk-persistence/ — empirical verification of
THE keystone assumption: docker named volumes survive in-place
phala_app compose updates.
Test: deploy a CVM, write UUID 90ce33e5... to a named volume, bump
a tfvar that flips the compose body, terraform apply. After ~3 min
in-place update (same app_id, same primary_cvm_id), curl the
volume-served file -> identical UUID. Disk persisted. ✅
Caveats noted in RESULTS.md: didn't test under replica scaling
or image bumps. Will exercise both inline during stage-4 build.
Live state cleanup: stage3b cluster (4 CVMs at $0.058/hr each)
torn down. coturn + signaling on 155.138.146.255 still up
(dirt cheap, useful as TURN fallback for any future test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng the actual repo
Verified the Go SDK at github.com/Dstack-TEE/dstack/sdk/go/dstack —
two corrections to the previous draft:
1. Per-instance identity comes from client.Info(ctx).InstanceID
directly. The plan's "write UUID to /var/lib/dstack/instance-id
on first boot, read it back on subsequent boots" was redundant —
dstack already exposes a stable per-CVM ID through the SDK,
rooted in the platform rather than a file we wrote. Drop the
on-disk UUID dance.
2. GetKey signature is (path, purpose, algorithm) returning a
hex-encoded secp256k1 (or other) key, decoded via .DecodeKey()
to 32 bytes. Pseudo-call shape gossipKey = GetKey("...:gossip")
was wrong; real shape is
seed, _ := client.GetKey(ctx, "dstack-mesh/gossip", "cluster", "secp256k1")
gossipBytes, _ := seed.DecodeKey()
The 32-byte output is fine to use as the gossip key directly,
or to HKDF for multiple sub-keys.
bootstrap-secrets simplifies as a result: no on-disk UUID
write/read logic, just GetKey() + Info() into tmpfs. ~80 LoC.
Bonus finding: same SDK exposes Sign(), Verify(), GetQuote() — so
the deferred "attestation-gated mesh join" work (originally Stage 2
in the plan) now fits cleanly into stage 4 with no new tooling.
Each peer signs its mesh-conn auth message, the coordinator
verifies before letting it onto the overlay. Noted as a bonus
add-on, not a stage-4 requirement.
Open items list updated: disk-persistence ✅, SDK existence ✅;
container ordering + Consul CAS-vs-hash for ordinal claim still
TBD inline during build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The keystone of the stage-4 design: the only piece that holds
plaintext cluster secrets, and it does so entirely inside the TEE.
What it does (one-shot, ~250 LoC):
1. Connects to /var/run/dstack.sock via the official Go SDK
(github.com/Dstack-TEE/dstack/sdk/go/dstack).
2. client.Info(ctx) -> self identity (AppID, InstanceID,
ComposeHash). Per-CVM identity comes from the SDK directly;
no on-disk UUID write/read.
3. client.GetKey(ctx, path, purpose, "secp256k1") for each of:
- dstack-mesh/gossip
- dstack-mesh/turn
- dstack-mesh/connect-ca
Same path/purpose/algorithm tuple yields the same 32 bytes on
every replica that shares an app_id (which all replicas of
one phala_app do). No secret material ever transits the
deploy host.
4. Workers claim a stable ordinal (0..N-1) via Consul KV CAS on
`cluster/<name>/slots/<i>`. InstanceID is the slot's permanent
owner so restarts re-find their own slot. Coordinator skips
this — it's always ordinal 0 (chicken-and-egg: it IS Consul).
5. Computes per-protocol ports from PROTOCOL_BASES env +
ordinal.
6. Writes secrets (hex-encoded, mode 0400) to /run/secrets/* on
a tmpfs volume. Writes /run/instance/info.json with identity
+ ports for sibling services to read.
7. Exits cleanly so docker-compose `depends_on` with
`condition: service_completed_successfully` releases consul,
mesh-conn, sidecar, etc.
Required env:
CLUSTER_NAME, ROLE, PROTOCOL_BASES (JSON).
Workers also need CONSUL_HTTP_ADDR (the local agent).
Compile chain:
- Go module pinned to dstack/sdk/go @5cfd7db (2026-03-19; latest
commit on master at the time of writing).
- SDK requires Go >= 1.24; the local toolchain auto-upgrades via
GOTOOLCHAIN=auto.
- Multi-stage Dockerfile produces a ~11MB static binary on
alpine.
Note on stale slots: when an instance is permanently retired (vs
restarted), its slot's KV entry stays. Cleanup is an operator
task today; production version would key the KV entry with a
Consul Session that has a TTL so stale slots auto-clear. Flagged
in code comments.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The whole cluster is now defined in one HCL file and brought up with
one `terraform apply`.
What's new under stage4/:
- mesh-conn/: clone of stage1 with two small additions —
self identity loaded from /run/instance/info.json (written by
bootstrap-secrets from the dstack SDK's Info()), and TURN secret
loaded from /run/secrets/turn (also bootstrap-secrets-derived).
PEERS_JSON still env-passed; cluster.tf computes it from the
`replicas` count so adding a peer is a `replicas` bump.
- compose/coordinator.yaml + compose/worker.yaml: frozen
templates that wire bootstrap-secrets + mesh-conn + consul +
{coturn,signaling} (coord) or {webdemo,sidecar} (worker), all on
network_mode: host, with a tmpfs volume for /run/secrets and
/run/instance so derived state never touches the persistent disk.
- cluster-example/cluster.tf: the user-facing surface. Two
phala_app resources (coordinator replicas:1, worker
replicas:N), shared protocol_bases, computed peers_json. Adding
a peer = `worker_replicas` bump + apply.
- cluster-example/rollout.sh: workload-aware rolling update
driver. Snapshots Consul, applies one app at a time via
-target, waits for cluster green between steps. Replaces the
update_policy block we'd want once phala-cloud#243 lands.
- stage4/README.md: how a deploy works, how to add a peer,
how to update images, what was deferred.
Boot sequence end-to-end:
1. terraform apply provisions both phala_apps; CVMs come up.
2. bootstrap-secrets (init container) calls dstack SDK
Info()+GetKey(), writes /run/secrets/{gossip,turn,ca-seed} +
/run/instance/info.json (identity + ordinal + ports), exits.
3. consul + mesh-conn + sidecar + workload start in dependency
order via `depends_on: { bootstrap-secrets: { condition:
service_completed_successfully } }`. They read their config
from the tmpfs files written in step 2.
4. mesh-conn opens ICE+yamux per peer-pair; consul forms its
cluster through the overlay; Connect mTLS works between
workers via Envoy sidecars.
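The ordering in step 3 can be expressed in compose roughly as follows; this is a sketch, with service names assumed to match the templates:

```yaml
services:
  bootstrap-secrets:
    # one-shot init: writes /run/secrets/* and /run/instance/info.json,
    # then exits 0 so dependents are released
    restart: "no"
  consul:
    depends_on:
      bootstrap-secrets:
        condition: service_completed_successfully
  mesh-conn:
    depends_on:
      bootstrap-secrets:
        condition: service_completed_successfully
```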
Three properties this delivers vs the per-stage scripts we had
before:
- Single source of truth (cluster.tf), no per-peer env-var matrix
duplicated across deploys.
- Secrets never seen by the deploy host — bootstrap-secrets is the
only piece that holds plaintext keys, and it does so entirely
inside the TEE.
- Disk volumes preserved across in-place updates (verified in
stage4-experiments/disk-persistence/RESULTS.md), so Consul Raft
state, KV, and any future Patroni WAL survive image bumps and
config changes.
Carry-overs to next iteration: stale-slot cleanup needs Consul
Sessions with TTL (not unconditional CAS-claim); multi-server Consul
HA is a one-line `replicas: 3` change but pulls that question
forward. README spells out what's deferred and why.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iterating toward the first green stage-4 deploy. Six substantive
fixes and additions came out of the smoke test:
1. Worker compose was missing WORKER_ORDINAL in bootstrap-secrets's
environment block. Cluster.tf passed it via --env to the CVM,
bootstrap-secrets's Go code read it from env, but compose
never plumbed it into the container. Result: bootstrap-secrets
fell into the Consul-CAS-claim path, found no Consul (it's on
the unreachable coordinator), exited 1, and dstack tore the
whole CVM down with `service "bootstrap-secrets" didn't
complete successfully: exit 1`. One missing line, ~3 hours of
serial-log archaeology to find.
2. Workers now declared as N separate phala_app resources via
for_each (not one app with replicas:N). Each gets its own
WORKER_ORDINAL env so bootstrap-secrets can compute the ports
without Consul-side coordination. The replicas-N path requires
per-instance env which phala_app doesn't expose today (filed as
phala-cloud#243).
3. bootstrap-secrets now picks an ordinal source explicitly:
a. WORKER_ORDINAL env (preferred when present)
b. ROLE=coordinator → ordinal 0
c. Consul KV CAS (fallback for the eventual replicas:N path)
This breaks the chicken-and-egg between bootstrap-secrets and
Consul that the worker hit.
4. Gossip/turn/ca-seed each emitted in a format the consumer can
actually use: gossip is base64 (consul -encrypt), turn is hex
(coturn --static-auth-secret), ca-seed is hex (HKDF-friendly
bytes). Previously everything was hex which made consul reject
the gossip key.
5. Compose templates now use bind-mounts to /tmp/dstack-runtime
instead of named docker volumes — initially debugged thinking
named volumes didn't share on dstack (filed phala-cloud#245
then closed as user error after retesting cleanly). Bind
mounts work fine and the comment notes it's for "secrets are
re-derived from getKey() each boot anyway, so /tmp ephemerality
is fine".
6. Added compose/worker-debug.yaml — minimal worker (just
bootstrap-secrets + a no-depends sleeper) for diagnosing
future boot-sequence regressions in isolation.
Coordinator still needs Phala admin to enable UDP ingress on its
app to make embedded mode (coturn + signaling on the same CVM)
fully functional. Next iteration: fall back to external
coordinator (the existing Vultr coturn+signaling) so we can land
end-to-end smoke without that gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke now passes — `/all` on each worker fans out across all
3 webdemo instances via Consul Connect mTLS over the mesh-conn UDP
path. Three independent issues surfaced during the smoke and are fixed:
1. mesh-conn ICE wedge after first failure. pion/ice's
   `agent.Dial`/`Accept` blocks indefinitely once ICE transitions to
   Failed, so the outer `runPeerLink` retry loop never fires and the
   peer slot stays dead until the container is bounced. Cancel the
   dial context from the state callback (Failed/Closed) and add a 60s
   belt-and-suspenders timeout. Tighten the auth wait from 10 min to
   60 s for the same reason — the long timeout was the only reason a
   retry was even *theoretically* possible, and it left a 10-minute
   window where the slot looked silent. Also call `agent.Close()` on
   every error path so a stuck attempt doesn't hold pion goroutines.
2. webdemo + sidecar entrypoints needed jq. Both compose entrypoints
   parse `/run/instance/info.json` with jq; the alpine/envoy base
   images don't ship it. Add jq to both Dockerfiles. The missing
   binary fast-failed the workload service, which is what kept
   webdemo + sidecar in a restart loop on every smoke until now.
3. Coordinator-internal coturn/signaling were unreachable from
   workers. dstack-gateway is TCP-only and doesn't surface arbitrary
   CVM ports, so `SIGNALING_URL=http://<coord-app-id>...:7000` and
   TURN against the coordinator's own coturn never worked. Switch
   both coordinator and worker mesh-conn to the external (Vultr)
   signaling+coturn that workers were already using; the
   coordinator's embedded copies still run but are unused. Wire the
   new paths through cluster.tf as `external_*` variables.
Drop `-encrypt` from the consul launch — we'd already removed gossip
encryption to unstick the cluster — and the now-unused
TURN_SHARED_SECRET-from-/run/secrets path is replaced by env-first
resolution in mesh-conn.
Add two follow-up notes to stage4/README.md based on what the smoke turned up: shared TEE-derived secrets across separate phala_apps need a shared AppAuth contract (gating for stage 2 attestation-gated join), and mesh-conn's ICE recovery is now in-process but the signaling broker should also age out stale auth/candidate entries. Cross-link the new terraform-provider-phala#6 env-drift issue.
The provider issue (#6) was a downstream symptom; root cause likely lives in the API surface, so the bug was moved to phala-cloud#246.
Adds the original-goal Patroni service to the worker compose. Each
worker runs:
- patroni 4.0 + PostgreSQL 16 (single image, ~250MB)
- entrypoint.sh renders /etc/patroni.yml from /run/instance/info.json
(ordinal, postgres + patroni_rest ports) plus CLUSTER_NAME
- data dir lives in the named docker volume `patroni-pgdata` so
it survives container restarts (CVM reboots wipe it; persistence
across reboots is a future stage4-experiments topic)
Cluster wiring:
- cluster.tf grows two new protocol slots: postgres=18700 and
patroni_rest=18800. Adds `var.patroni_image` + threads
PATRONI_IMAGE through worker env.
- bootstrap-secrets derives two more cluster-wide secrets via
getKey() — patroni-superuser and patroni-replication. They're
identical on every replica because all peers derive against the
same path + ClusterName, so any peer can bootstrap as leader
without out-of-band secret distribution.
- All Patroni instances point at 127.0.0.1:<own_consul_http>; cross-
peer replication uses 127.0.0.1:<peer_postgres_port>, which the
mesh-conn UDP forwarder maps to the right CVM transparently.
Patroni's own leader election runs through Consul KV — no separate
DCS needed. With three workers we get fault tolerance of one (1
leader + 2 replicas).
…, not the first

After a peer bounce, multiple auths from that peer can reach pollLoop
in a single batch. The original `select case authCh <- ... default`
kept the FIRST auth and silently dropped every later one. dialICE
then consumed the stale auth, called `agent.Dial` against the wrong
ufrag/pwd, and ICE Failed.

The earlier ICE-state cancel fix correctly aborts and retries — but
on retry pollLoop has no fresh auth in the queue (already drained),
so dialICE waits 60s and retries again, while the *peer* in turn
publishes a NEW auth that pollLoop also drops because the channel is
still buffered with the original stale auth. Both sides repeat
forever and the link never re-establishes.

Drain-then-push so the channel always holds the most-recent auth. The
channel is buffered to 1 and only one goroutine writes (pollLoop), so
there is no contention and the drain is safe.
Coordinator goes from a single phala_app with replicas:1 to a
for_each over `var.coordinator_replicas` (default 3), giving an
actual Raft-replicated 3-server Consul cluster instead of
bootstrap-expect=1.

Per-instance ordinal is passed in via env (`COORDINATOR_ORDINAL`),
mirroring the worker pattern, since bootstrap-secrets needs to know
its own ordinal before Consul KV is reachable (we can't ask Consul KV
for the ordinal because Consul is *on* the coordinators we're trying
to bootstrap). The KV-CAS claim path stays as a fallback for the
eventual replicas:N future once phala-cloud#243 lands.

Worker ordinals shift by `coordinator_replicas` so the peer ID space
stays contiguous (coordinators 0..C-1, workers C..C+W-1). Workers
retry-join *every* coordinator's serf port (mesh-conn forwards each
one), and pick any coordinator's HTTP port for KV calls.
Coordinator's consul launches with `-server -bootstrap-expect=N` and
loops over COORDINATOR_SERF_PORTS to retry-join its server peers
(skipping its own).

What this gets us: fault tolerance of 1 (3-server quorum) with the
Consul UI/API still served from any coordinator. Patroni's DCS now
sits on top of a real HA Consul, not a single point of failure.
…es on new auth

The mailbox previously kept appending forever — and because mesh-conn
republishes auth+candidates on every dialICE retry, a recipient would
drain a long backlog where the FIRST auth was the oldest. After my
recent mesh-conn pollLoop fix that backlog became less catastrophic
(the latest auth wins in the buffered channel), but the candidates in
between are still added to the new ICE agent. pion then dials against
addresses whose UDP sockets are gone, ICE Fails, and the loop repeats
forever for a peer that bounced.

Drop all stale messages from a sender when a NEW auth from that
sender lands in the recipient's queue. Auth marks the start of a
fresh epoch — mesh-conn always publishes auth BEFORE its candidates
(candidates come from OnCandidate AFTER GatherCandidates, which
happens after the auth publish), so anything in queue from before
this auth is by definition stale.

This is the signaling-broker mate of the mesh-conn drain-then-push
fix from 4c36c76 — the broker now actively reaps the backlog instead
of relying on the consumer to do it correctly.

Note: the same mailbox impl is used by the stage4 signaling image
(which is built from this phase0 source). Deploying this requires
rebuilding + pushing the signaling image and restarting it on the
Vultr coordinator host.
Concurrent phala_app creates against the same workspace return 400 'parameters not compatible'. Workaround: terraform apply -parallelism=1. Track upstream fix for the misleading error code.
mesh-conn computes its self_id as `role-ordinal` from
/run/instance/info.json, then looks for that ID in PEERS_JSON. The
multi-coord change shifted worker ordinals to start at C
(coordinator_replicas), but the peer-list IDs were still using slot
(`worker-1`, `worker-2`, `worker-3`) — so e.g. worker-1's mesh-conn
saw self_id="worker-3" but PEERS_JSON only had "worker-1", and
exited with `PEER_ID "worker-3" not in PEERS_JSON`.
Use ordinal in the peer ID. The phala_app name still uses the
1-based slot for human-friendly CVM names ("stage4-worker-1"), but
the peer-id and the in-CVM identifier are now consistent.
worker↔worker instability under load

Adds MESH_CONN_RELAY_ONLY env (default off) that restricts pion's ICE
candidate gathering to Relay only — useful as an escape hatch when
direct (host/srflx/prflx) candidates establish but flap.

Tested on the live stage4 cluster: relay-only made things WORSE for
this dstack worker NAT pattern (pion's relay-relay pair selection
isn't reliable, observable as TURN allocation churn on coturn). Left
the flag in as a debug switch but documented it as not-the-fix in
README.

The actual symptom — `srflx <-> prflx` link goes Connected, yamux
throws `accept: short buffer` 5–60s later, pg_basebackup keeps
failing — is captured in the new "Known limitation" section with a
concrete next-steps list (instrumentation, MaxStreamWindowSize cap,
QUIC, WireGuard).
The instrumentation pass added byte counters per-link, yamux's own
log output (was io.Discard), full ICE selected-pair addresses (not
just types), and a 10s telemetry tick. That trace pinpointed two bugs
that were previously silent:
1. ice.Conn.Read returned io.ErrShortBuffer because pion is
   packet-oriented — when the caller's buffer is smaller than the
   next UDP datagram, pion truncates. yamux's 4096-byte bufio.Reader
   was too small for TURN-encapsulated datagrams. Fixed by a
   65535-byte packetizing adapter (countingConn) that always reads
   full datagrams and re-serves them to yamux as a stream.
2. My own attempted 5s yamux keepalive killed the link under load
   when a pg_basebackup burst delayed a keepalive past the timeout.
   Reverted to 30s/10s defaults.
Adds two debug env switches that didn't pan out for our specific NAT
environment but are kept as escape hatches:
- MESH_CONN_RELAY_ONLY=1: only Relay candidates. Made things worse on
  dstack (relay-relay pair selection unreliable).
- MESH_CONN_TCP_ONLY=1: TCP NetworkTypes + filter URLs to Proto=TCP.
  pion still picks `relay (proto=udp)` because relay transport is the
  *relayed* leg, always UDP unless RFC 6062 TCP allocation is
  requested (pion's TURN client doesn't).
End state for stage 4: Consul (3-server Raft + 6 members) and Patroni
leader election are solid. Patroni replication still requires
sustained worker↔worker bulk transfer, which hits the
yamux-on-lossy-UDP wall documented in the README "Known limitation"
section. Real fix needs a different transport (QUIC, WireGuard, or
TCP-relay end-to-end).
Captures the live cluster's app IDs, SSH command pattern, terraform.tfvars image tags, the 60-second reproducer for the open worker↔worker mesh-conn drop, what was already tried (so the next session doesn't re-walk the same paths), and open hypotheses to investigate with fresh eyes — deliberately without committing to a fix direction.
…working tree

Working-tree mesh-conn/main.go has been swapped from yamux to quic-go
on top of the same pion/ice packet conn, plus a sibling
stage4/quic-on-ice/ experimental module. Neither is committed and the
live cluster still runs the previous yamux image. RESUME now flags
the discrepancy so tomorrow's session sees it on first read.
yamux assumes a reliable byte-stream underlay, but pion/ice.Conn is
UDP and the path between dstack worker CVMs is extremely lossy (~99%
direction-asymmetric loss when same-NAT hairpinning, ~78% on the
coturn-relay path). The "keepalive timeout" / "recv window exceeded"
errors we kept seeing were yamux's reliability invariants firing on
dropped packets, not yamux bugs.

Replace yamux with quic-go on the same pion/ice.Conn (wrapped as a
net.PacketConn). QUIC has built-in loss recovery + stream
multiplexing, so a lossy UDP underlay is exactly what it expects. TLS
uses a self-signed cert because mesh peer trust is established
out-of-band by the dstack TEE layer + TURN HMAC. The 3-byte (tag,
port) stream header convention is unchanged; runAcceptLoop and the
TCP/UDP pumps are line-for-line near-equivalents on *quic.Stream.
Same hairpin path that killed yamux at 3 KB now sustains 25-28 MB/s
for pg_basebackup. Both replicas (worker-4, worker-5) bootstrap and
stream cleanly from leader worker-3.

Also drops the old packetizing read-buffer in countingConn (no longer
needed — quic-go reads through the PacketConn shim, which preserves
datagram boundaries) and introduces a sibling smoke-test module
stage4/quic-on-ice/ that proves QUIC over pion/ice.Conn end to end
(10 MB worker↔worker hairpin in ~1s). RESUME.md rewritten as a "done"
note with the QUIC story and verification recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Soft-kill leader-failover walkthrough verified end-to-end on the live cluster: Patroni elects via Consul KV, worker-4 promotes, writes resume, worker-3 rejoins as a streaming replica without pg_basebackup. Measured RTO ~24s (kill → first successful write on new leader), well within Patroni's default ttl=30s. Captures the reproducible recipe, a measured timeline, knobs for the RTO/availability tradeoff, and what's still untested (hard CVM kill, network partition, disk-loss rejoin). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed RTO

Extends FAILOVER.md with the whole-userspace failure scenario: kill
all containers on the leader CVM simultaneously, then bring them back
via `docker compose up -d`. Measured RTO ~33s (9s longer than
soft-kill due to Consul gossip-failure detection on top of Patroni's
TTL). Also confirms best-replica selection under uneven replica lag,
QUIC mesh-conn ICE redial after a peer's userspace evaporates, and
cheap rejoin via local WAL replay (no pg_basebackup).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the MESH_CONN_TCP_ONLY env knob entirely (from dialICE, both
compose templates, and reportLinkStats's tick cadence). The flag was
investigated as a yamux-era escape hatch and proven non-helpful — pion
still selects relay-UDP candidates regardless because the relay
candidate's transport comes from the TURN allocation's relayed leg
(always UDP unless RFC 6062 TCP-allocation requested), not from the
client→TURN leg. With the QUIC switch, the underlying loss is handled
by the transport layer, so the knob has no remaining purpose and was
becoming misleading.
Also quiets reportLinkStats: tick 10s → 60s and skip the log line
entirely when bytes haven't moved since the last tick. Idle peer pairs
no longer spam every 10 seconds. Final-stats line on stop is unchanged
so postmortems still get a summary regardless of activity.
Drops the unused *quic.Conn parameter from reportLinkStats, refreshes
the stale "log every 10s" banner, and tightens the MESH_CONN_RELAY_ONLY
comment in worker.yaml so the rationale ("flip on if worker-to-worker
direct pairs fail") doesn't contradict itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…venance

Adds .github/workflows/consul-postgres-ha-publish.yml — a matrix
build that builds and pushes the six stage-4 images (mesh-conn,
bootstrap-secrets, signaling, webdemo, sidecar, patroni) to
ghcr.io/dstack-tee/dstack-examples/consul-postgres-ha-* on push to
main, tagged with both the long-form commit SHA and `latest`. PRs
build to verify but do not push.

Each push is signed with a Sigstore-backed GitHub Build Provenance
attestation via actions/attest-build-provenance@v2 — the workflow's
GitHub OIDC token gets a short-lived Sigstore cert, no keys we
manage. Consumers verify with
`gh attestation verify oci://...@<digest> --repo Dstack-TEE/dstack-examples`,
which proves the image came from this commit of this workflow.

Replaces ttl.sh references in terraform.tfvars.example with the GHCR
ones, fills in the previously-missing patroni_image and
coordinator_replicas lines, and adds inline docs on pinning to a
sha-tag for prod stability and on running the verification command.

PUBLISHING.md walks through the three paths a stage-4 user actually
hits: the CI publish (steady state), manual one-off ttl.sh /
personal-GHCR builds for dev iteration, and the on-CVM hot-patch flow
that sidesteps phala-cloud#246 when iterating on a running cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…patch

terraform-provider-phala#8 fixed the env-block in-place-update bug
(phala-cloud#246) and shipped as v0.2.0-beta.3, so:
- cluster.tf required_providers now pins ">= 0.2.0-beta.3" with a
  comment explaining why earlier versions are unusable for this
  stack.
- PUBLISHING.md's hot-patch section reframes its motivation: the
  per-CVM hot-patch path remains useful as a dev shortcut and as the
  only option on clusters still running 0.2.0-beta.2, but it is no
  longer the workaround for env updates not landing — terraform apply
  works correctly now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rejoin

Two small follow-ups after verifying the v0.2.0-beta.3 env-update
path against the live cluster:
1. Provider pin in cluster.tf changed from `>= 0.2.0-beta.3` to
   `0.2.0-beta.3` exactly. Terraform's `>=` operator does NOT include
   later prerelease versions, so `>= 0.2.0-beta.3` only matches
   stable `>= 0.2.0` — `terraform init` failed with "no available
   releases match the given constraints". Pin exactly until we hit a
   stable.
2. FAILOVER.md gains a disk-loss rejoin section: stop patroni, wipe
   the patroni-pgdata volume, restart, watch Patroni's bootstrap path
   pull a full pg_basebackup from the leader over mesh-conn's QUIC
   tunnel. Measured 5.2 MB / 7s end-to-end on the live cluster
   (handshake-dominated for a small dataset; the real throughput
   number remains the ~25 MB/s pg_basebackup observed during the
   soft-kill section). Closes the last "What this demo does NOT
   cover" item.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lines

Discovered while verifying terraform's beta.3 env-update path against
the live cluster on 2026-05-04: when terraform recreates a CVM, the
peers' QUIC links to it die, but the *redial* path can hang.

Specifically: dialICE returns a Connected ice.Conn, dialAndPump
enters quic.Dial, ICE later goes Failed (peer went away again,
hairpin lost, etc.). quic.Dial's context times out and quic-go calls
SetReadDeadline(past) to interrupt the blocked ReadFrom in our
iceConnPacketConn shim. The shim was returning nil from
SetReadDeadline, so the call had no effect on the underlying
ice.Conn.Read, and the goroutine hung forever. The surrounding
runPeerLink retry loop never got to retry, leaving the peer slot
permanently dead until the entire mesh-conn process was restarted.

Fix: delegate SetDeadline / SetReadDeadline / SetWriteDeadline to the
underlying conn (pion/ice.Conn implements net.Conn deadlines
properly). Same fix applied to the stage4/quic-on-ice smoke test so
future debugging stays trustworthy. Adds a regression test using
net.Pipe (which honors deadlines) that asserts ReadFrom returns a
Timeout-flagged net.Error within ~50ms of SetReadDeadline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The directory was an engineering log — phase0/, stage1/, stage2/,
stage3a/, stage3b/, stage4/, stage4-experiments/ — useful while
building, useless to a user landing here cold who just wants to
deploy HA Postgres on dstack-TEE.
Promote stage4/ contents up one level to consul-postgres-ha/ as the
canonical, opinionated shape. Rename phase0/icetest → signaling/.
Move stage3b/{webdemo,sidecar} up. Drop the predecessor stage*/ +
phase0/ + stage4-experiments/ + deploy/ (historical results) +
STAGE4_PLAN.md. Git history preserves everything.
Final layout:
consul-postgres-ha/
├── README.md / ARCHITECTURE.md / FAILOVER.md / PUBLISHING.md / ROBUSTNESS.md
├── cluster-example/ one cluster.tf
├── compose/ coordinator.yaml + worker.yaml templates
├── coordinator/ external-coordinator docker-compose
├── mesh-conn/ QUIC-over-pion/ICE overlay
├── bootstrap-secrets/ TEE-derives per-CVM secrets
├── patroni/ Patroni + Postgres
├── webdemo/ sidecar/ example workload + Envoy bootstrapper
├── signaling/ HTTP /publish + /poll broker for ICE rendezvous
└── quic-on-ice/ standalone smoke test for the QUIC-over-ICE transport
Updates beyond the moves:
- README.md rewritten as a deploy-first story; old stage-4-internal
README's "Known limitation" + punch-list (yamux + worker-pair
instability) is obsolete since the QUIC swap and isn't preserved.
- ARCHITECTURE.md: 4-CVM topology (ctrl+w1/w2/w3) → 6-CVM (3+3),
yamux deep-dive section replaced with a tight QUIC summary that
matches the actual code.
- ROBUSTNESS.md: yamux → QUIC mentions, "single Consul server SPOF"
section updated to reflect the 3-server quorum that's been live
since `17f4642`, "real registry" recommended-fix moved to "already
shipped" since GHCR + Sigstore is now the publish path.
- All Go module paths bumped: github.com/Dstack-TEE/dstack-examples
/consul-postgres-ha/<name> (no stage4/ or phase0/ infix).
- CI workflow path filters + matrix `context:` paths updated.
- .gitignore rewritten to match the new layout.
- Builds + tests pass on all 5 Go modules under the new paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion admission

Two open architectural gaps surfaced during the consolidation pass on
PR #95 — both deserve their own focused implementation passes rather
than being squeezed into the mega PR. Capturing them as design docs
so a future agent (or a future-me) can pick up either one cold and
start.
- design/single-sidecar.md: collapse the 5 platform-plumbing
  containers (keepalive, bootstrap-secrets, mesh-conn, consul,
  sidecar/Envoy) into one image with a shell-init multi-process
  supervisor. Per-CVM container count goes 8 → 3.
- design/attestation-admission.md: replace the TURN-HMAC-only mesh
  admission with dstack TEE attestation as the credential. Phased
  plan: per-app-id check first (Phase 1, smallest delta, no
  rolling-upgrade pain), Consul-KV-rooted policy doc later (Phase 2).
  Recommends the post-QUIC-handshake-stream insertion point over the
  public signaling broker for privacy.
Both docs include current state, approach, risks, open questions, and
explicit hand-off instructions. Each is ~250-350 lines, written to be
self-contained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…decar image

Per-CVM container count drops from 7 → 3 on workers (sidecar +
patroni + webdemo) and from 6 → 1 on coordinators (sidecar). The new
sidecar image bundles bootstrap-secrets, mesh-conn, consul, and
(workers only) envoy behind a tini-wrapped shell init that dispatches
on ROLE; the old keepalive placeholder, the four-image lockstep, and
the vestigial on-CVM signaling/coturn that had been documented as
unused all drop.

CI matrix: 6 → 4 (sidecar, patroni, webdemo, signaling). The sidecar
build uses the parent consul-postgres-ha/ as docker context so its
multi-stage Dockerfile can pull bootstrap-secrets/ and mesh-conn/ Go
sources from sibling subdirs. cluster.tf: BOOTSTRAP_SECRETS_IMAGE,
MESH_CONN_IMAGE, SIGNALING_IMAGE (coordinator) and the matching
tfvars all collapse into SIDECAR_IMAGE.

Smoke-tested against a fresh terraform apply on dstack-pha-prod5
(2026-05-04). Soft-kill RTO 27s, hard-kill RTO 33s, cheap rejoin
verified, disk-loss rejoin 26s — all within noise of the pre-Gap-2
baselines on the previous multi-container cluster.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
refactor(consul-postgres-ha): collapse platform plumbing to single mesh-sidecar
Summary
Adds stage 4 of the `consul-postgres-ha` example: a 3-coordinator +
3-worker dstack cluster running highly-available Postgres via
Patroni, with leader election driven by Consul KV, all replication
and Consul gossip carried over a custom userspace mesh (mesh-conn)
that hole-punches between TEE CVMs via pion/ICE and multiplexes
streams over QUIC.

The PR is large (43 commits, ~9.5k LoC) because it brings the full
example end-to-end — phase-0 ICE feasibility through stage-3 Consul
Connect through stage-4 Patroni with TEE-derived secrets and in-place
env updates. Each commit is independently reviewable and the
conventional-commits structure makes the chronological narrative
trivial to follow.
What ships in stage 4
- `mesh-conn` — userspace UDP+TCP port-forwarder over pion/ICE +
  quic-go. Each peer multiplexes 8 identity ports per pair; the QUIC
  layer provides loss recovery + stream multiplexing on top of pion's
  lossy UDP underlay.
- `bootstrap-secrets` — init container that calls the dstack SDK's
  `getKey()` and writes per-CVM TEE-derived secrets to
  `/run/secrets/`, plus the per-CVM identity (role, ordinal, port
  table) to `/run/instance/info.json`.
- `patroni` — Postgres + Patroni baked together; reads identity from
  `/run/instance/info.json`, joins Consul via mesh-conn, participates
  in leader election.
- `cluster-example/cluster.tf` — one `terraform apply` brings up 3
  coordinators + 3 workers across CVMs, propagates `PEERS_JSON` via
  in-place env updates, preserves disks across topology changes
  (`storage_fs = "zfs"`).
- `consul-postgres-ha-publish.yml` workflow builds and publishes all
  six images (mesh-conn, bootstrap-secrets, signaling, webdemo,
  sidecar, patroni) to GHCR with Sigstore-backed GitHub Build
  Provenance attestations on every push to main. Consumers verify
  with `gh attestation verify oci://...@<digest> --repo Dstack-TEE/dstack-examples`.

What's verified end-to-end on a live cluster
Full reproducible recipes in
`consul-postgres-ha/stage4/{FAILOVER,PUBLISHING}.md`, plus diagnostic
artifacts. In-place env updates verified via `terraform apply`
against `phala-network/phala 0.2.0-beta.3` (see
Phala-Network/terraform-provider-phala#8).

Notable transport-layer story (the highlight reel)
The mesh-conn started life on yamux. Yamux assumes a reliable
byte-stream underlay; pion/ice.Conn is UDP. Between dstack worker
CVMs the UDP path is brutally lossy (~99% one direction on hairpin,
~78% on coturn relay), and yamux's keepalive/recv-window invariants
tripped under any sustained load, manifesting as "keepalive timeout"
/ "recv window exceeded" — but the real cause was dropped packets
violating yamux's reliability assumptions.
Swapped yamux → quic-go with a `net.PacketConn` shim around
`ice.Conn`. QUIC has loss recovery + stream multiplexing built in —
exactly what an unreliable datagram underlay needs. Same hairpin path
that killed yamux at 3 KB now sustains 25–28 MB/s for pg_basebackup.
See `consul-postgres-ha/stage4/RESUME.md` for the full diagnosis and
`consul-postgres-ha/stage4/quic-on-ice/` for the standalone smoke
test.

Drafting why
Marking draft because:
Test plan
- `go test ./...` clean across all stage-4 modules (`mesh-conn`,
  `bootstrap-secrets`, `quic-on-ice`)
- `docker build` + push to ttl.sh + live deploy
- `terraform apply` env-update propagates without CVM
  destroy/recreate (verified against `phala-network/phala
  0.2.0-beta.3`)
- CI workflow (`consul-postgres-ha-publish.yml`) runs cleanly on
  first push and produces verifiable attestations on GHCR — pending
  this PR's first run

Known follow-ups (not blocking merge)
- `consul-postgres-ha/stage4/RESUME.md` is now obsolete (originally a
  session-bridging doc; superseded by `README.md` + `FAILOVER.md` +
  `PUBLISHING.md`). Will delete once the PR is approaching
  ready-for-review, unless reviewers want to keep it as engineering
  narrative.
- Once `phala-network/phala` ships a stable `0.2.0` release, the
  `cluster.tf` provider pin can move from exact (`0.2.0-beta.3`) to
  `~> 0.2`.

🤖 Generated with Claude Code