Add Hetzner server provisioning and management to ops repo #51

@OAGr

Context

We have an existing Hetzner Ubuntu server running Claude Code for AI-assisted development work. Currently it's manually managed. We want the ops repo to handle provisioning, configuration, and ongoing management — similar to how we manage our DigitalOcean K8s cluster.

The server's role is to:

  • Run Claude Code sessions (using a Claude subscription, not API calls)
  • Poll GitHub for issues/PRs labeled groundskeeper-autofix and work on them autonomously
  • Run the Discord bot with full code review/fix capabilities (see below)
  • Be SSH-accessible for manual use alongside automation

What needs to happen

1. Terraform stack (terraform/stacks/hetzner/)

Import the existing server into Terraform state and manage:

  • Server resource (hcloud_server) with lifecycle { ignore_changes = [user_data, image] } to prevent accidental replacement
  • Firewall rules (hcloud_firewall) — SSH access, any other needed ports
  • SSH keys (hcloud_ssh_key)
  • DNS records if needed
  • Hetzner API token stored in 1Password, pulled via the 1Password Terraform provider (consistent with existing pattern)

Gotchas to watch for:

  • Changing user_data or image on an imported server forces replacement — must use ignore_changes
  • Use delete_protection = true on the server resource (the hcloud provider requires rebuild_protection to be set to the same value)
  • Pin the hetznercloud/hcloud provider version
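As a safety net for the replacement gotcha above, a pre-apply guard could scan the plan for anything that would replace or destroy a resource. This is a minimal sketch, assuming the human-readable plan output is acceptable to match on; the function name plan_is_safe and the tfplan file name are hypothetical, and parsing terraform show -json would be more robust.

```shell
#!/bin/sh
# Sketch: refuse to apply when a saved plan would replace or destroy anything.
# Reads human-readable plan text on stdin and matches the phrases Terraform
# prints for replacement and destruction. plan_is_safe is a hypothetical helper.
plan_is_safe() {
  if grep -Eq 'must be replaced|forces replacement|will be destroyed'; then
    return 1
  fi
  return 0
}

# Hypothetical usage inside terraform/stacks/hetzner/:
#   terraform plan -out=tfplan
#   terraform show -no-color tfplan | plan_is_safe || { echo "plan would replace or destroy a resource" >&2; exit 1; }
#   terraform apply tfplan
```

The guard only looks at text, so it complements ignore_changes and delete_protection rather than replacing them.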

2. Server configuration management

The server needs git, the gh CLI, Node.js, pnpm, and the Claude Code CLI installed; the longterm-wiki and ops repos cloned; and git auth configured for the QURI Bot account.

Tool options (in order of recommendation)

Option A: Ansible (recommended)

  • Industry-standard pairing with Terraform: "Terraform provisions, Ansible configures"
  • Native 1Password integration via the community.general.onepassword lookup plugin (or the community.general.onepassword_info module), or via op run
  • Idempotent re-runs — safe to run repeatedly to update configuration
  • Playbook would be ~100-200 lines for our use case
  • Supports ongoing config management, not just one-time setup
  • Moderate learning curve but extremely well-documented
  • An ansible/ directory in the ops repo with a playbook and inventory generated from Terraform outputs
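For the last bullet, generating the inventory from Terraform outputs could be as small as the sketch below. It assumes the stack exposes the server address as an output; the output name server_ipv4, the group name hetzner, the inventory path, and the bot user are all hypothetical.

```shell
#!/bin/sh
# Sketch: render a one-host Ansible INI inventory from a server address.
# The group name "hetzner" is a hypothetical choice.
render_inventory() {
  ip="$1"
  user="$2"
  printf '[hetzner]\n%s ansible_user=%s\n' "$ip" "$user"
}

# Hypothetical usage (assumes the stack defines an output named server_ipv4):
#   render_inventory "$(terraform -chdir=terraform/stacks/hetzner output -raw server_ipv4)" bot \
#     > ansible/inventory.ini
```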

Option B: Shell script (scripts/setup-hetzner.sh)

  • Simplest approach, no new tools
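  • Run via op run -- ssh user@server 'bash -s' < scripts/setup-hetzner.sh (note: op run sets secrets in the local ssh process's environment only; they reach the remote shell only if forwarded explicitly, e.g. via SendEnv/AcceptEnv)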
  • Must manually write idempotency guards (check if things exist before installing)
  • Good enough for initial setup, fragile for ongoing management
  • Risk: becomes unmaintainable as requirements grow
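To make the "idempotency guards" point concrete, here is a minimal sketch of the kind of helpers scripts/setup-hetzner.sh would need. The helper names are hypothetical, and the install command in the usage comment is just one possible choice.

```shell
#!/bin/sh
# Sketch of idempotency guards for a setup script. All helper names are hypothetical.

# Succeeds (exit 0) when the command is missing and an install step should run.
needs_install() {
  ! command -v "$1" >/dev/null 2>&1
}

# Appends a line to a file only if that exact line is not already present.
ensure_line() {
  line="$1"
  file="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Clones a repo only when the target directory is not already a checkout.
ensure_clone() {
  url="$1"
  dir="$2"
  [ -d "$dir/.git" ] || git clone "$url" "$dir"
}

# Hypothetical usage:
#   if needs_install pnpm; then npm install -g pnpm; fi
#   ensure_clone "https://github.com/<org>/longterm-wiki.git" /home/bot/repos/longterm-wiki
```

Every new requirement needs another guard like these, which is where the maintainability risk comes from.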

Option C: Pyinfra

  • Python-based alternative to Ansible — faster, real Python instead of YAML
  • Built-in idempotent operations
  • Smaller community and ecosystem than Ansible
  • No native 1Password module (would use op CLI)
  • Good middle ground if Ansible feels heavyweight

Not recommended:

  • Terraform provisioners (remote-exec) — HashiCorp themselves say to avoid these; not tracked in state, no re-run capability
  • Packer — designed for immutable images, poor fit for a mutable dev server people SSH into
  • NixOS — powerful but very steep learning curve, requires OS change from Ubuntu
  • Cloud-init alone — first-boot only, no ongoing management capability

3. Secrets management

Secrets needed on the server:

  • GitHub PAT (QURI Bot account) for gh auth and git operations
  • Claude Code authentication (subscription login or API key)
  • SSH keys for repo access
  • Any wiki-server API keys needed by the polling daemon
  • Discord bot token (for running the Discord bot)
  • Claude Code OAuth token (for /ask command via Agent SDK)

Approach: Store secrets in 1Password (consistent with existing infra). Options:

  • Install op CLI on the server so it can pull secrets at runtime
  • Inject secrets via Ansible from 1Password during playbook runs
  • Use op run when running setup scripts to pass secrets as env vars
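Whichever option is chosen, secrets that land on disk should be readable only by the service user. A minimal sketch, assuming the op CLI is installed on the server; the helper name, the target path, and the op:// reference are placeholders.

```shell
#!/bin/sh
# Sketch: write a secret from stdin to a file readable only by its owner.
# write_secret_file is a hypothetical helper.
write_secret_file() {
  target="$1"
  (
    umask 077            # files created in this subshell get mode 600
    cat > "$target"
  )
  chmod 600 "$target"    # also tighten a file that already existed
}

# Hypothetical usage on the server (vault, item, field and path are placeholders):
#   op read "op://<vault>/<item>/<field>" | write_secret_file /etc/hetzner-bot/discord-token
```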

4. Polling daemon

A service on the Hetzner server that replaces the groundskeeper's issue-responder:

  • Polls GitHub for issues/PRs with groundskeeper-autofix label (or /groundskeeper comments)
  • Spawns Claude Code sessions with full repo access and shell
  • Reports results back to GitHub (comments, commits, PR updates)
  • Runs as a systemd service for reliability
  • The daemon code itself should live in the longterm-wiki repo; ops manages its deployment and configuration
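The real daemon would live in longterm-wiki and is not specified here, but one polling pass could look roughly like the sketch below. It assumes gh is authenticated and the claude CLI is on PATH; the state file, the prompt text, and the function names are hypothetical, and it covers issues only (PRs and /groundskeeper comments would need their own queries).

```shell
#!/bin/sh
# Sketch of one polling pass. STATE_FILE, the prompt and all function names are hypothetical.
STATE_FILE="${STATE_FILE:-/tmp/groundskeeper-seen}"

# Reads issue numbers on stdin and prints only those not yet recorded in STATE_FILE.
filter_unseen() {
  touch "$STATE_FILE"
  while read -r n; do
    [ -n "$n" ] || continue
    grep -qxF -- "$n" "$STATE_FILE" || printf '%s\n' "$n"
  done
}

# Records an issue number so later passes skip it.
mark_seen() {
  printf '%s\n' "$1" >> "$STATE_FILE"
}

# One pass: list labeled issues, hand each new one to Claude Code, record successes.
poll_once() {
  repo="$1"   # e.g. <org>/longterm-wiki
  gh issue list --repo "$repo" --label groundskeeper-autofix --state open \
      --json number --jq '.[].number' |
    filter_unseen |
    while read -r n; do
      # stdin is redirected so claude cannot consume the remaining issue numbers
      if claude -p "Work on issue #$n in $repo and report back on the issue." < /dev/null; then
        mark_seen "$n"
      fi
    done
}
```

A failed session is not marked as seen, so it is retried on the next pass; a real daemon would likely want a retry limit.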

5. Discord bot on Hetzner

Migrate the Discord bot from K8s to Hetzner to unlock full code capabilities:

Currently the Discord bot runs in K8s with limited capabilities — it can answer wiki Q&A questions (@mention) and do read-only research (/ask), but can't review or fix code because the K8s pod lacks terminal access, git, and full repo checkouts.

Running on Hetzner would enable:

  • Code review: Bot can read full source, run linters/tests, provide substantive PR reviews
  • Code fixes: Bot can edit files, create branches, open PRs in response to Discord requests
  • Full Claude Code access: Agent SDK with Bash, Edit, Write tools — not just Read/Glob/Grep
  • Persistent repo state: Git repos stay cloned and up-to-date, no init containers needed
  • Terminal access: Can run builds, tests, type-checks as part of answering questions

Implementation:

  • Run the Discord bot as a systemd service alongside the polling daemon
  • Full repo checkouts at a known path (e.g., /home/bot/repos/longterm-wiki)
  • WIKI_REPO_PATH points to the actual repo instead of a stripped-down content copy
  • Bot has access to gh CLI for creating PRs, commenting on issues
  • Can share the Claude Code OAuth token with the polling daemon

This effectively consolidates the groundskeeper, issue-responder, and Discord bot into one well-provisioned server.
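For the systemd item in the implementation list, a unit file could be rendered from a small template like this sketch. The unit directives are standard systemd; the bot user, the EnvironmentFile path, the service name, and the start command are hypothetical.

```shell
#!/bin/sh
# Sketch: print a systemd unit for a long-running bot process.
# User, EnvironmentFile path and the start command are hypothetical.
render_unit() {
  description="$1"
  workdir="$2"
  exec_start="$3"
  cat <<EOF
[Unit]
Description=$description
After=network-online.target
Wants=network-online.target

[Service]
User=bot
WorkingDirectory=$workdir
EnvironmentFile=/etc/hetzner-bot/env
ExecStart=$exec_start
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
}

# Hypothetical usage:
#   render_unit "Discord bot" /home/bot/repos/longterm-wiki "/usr/bin/pnpm run <bot-script>" \
#     | sudo tee /etc/systemd/system/discord-bot.service >/dev/null
#   sudo systemctl daemon-reload && sudo systemctl enable --now discord-bot.service
```

The same template would work for the polling daemon with a different description and start command.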

6. Monitoring

  • Groundskeeper health check or simple uptime monitor that pings the Hetzner server
  • Alert via Discord webhook if the server is unreachable
  • Could be as simple as adding the server to the groundskeeper's health-check task
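A standalone version of this check might look like the sketch below, assuming nc and curl are available wherever the check runs. Function names are hypothetical; the payload shape ({"content": "..."}) is what Discord webhooks accept for a plain message, and the escaping only handles single-line messages.

```shell
#!/bin/sh
# Sketch: alert a Discord webhook when the server's SSH port is unreachable.
# All function names are hypothetical.

# Builds the JSON body for a plain-text Discord webhook message.
alert_payload() {
  # Escape backslashes and double quotes so the message stays valid JSON.
  msg=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"content":"%s"}' "$msg"
}

# Returns 0 if a TCP connection to port 22 succeeds within 5 seconds.
ssh_reachable() {
  nc -z -w 5 "$1" 22 >/dev/null 2>&1
}

# Posts an alert only when the reachability check fails.
check_and_alert() {
  host="$1"
  webhook_url="$2"
  ssh_reachable "$host" && return 0
  curl -fsS -H 'Content-Type: application/json' \
       -d "$(alert_payload "Hetzner server $host is unreachable on port 22")" \
       "$webhook_url"
}

# Hypothetical usage from cron or the groundskeeper's health-check task:
#   check_and_alert <server-address> "$DISCORD_WEBHOOK_URL"
```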

Architecture diagram

┌──────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   GitHub     │◄────│   Hetzner        │────►│  Wiki Server    │
│  (issues,    │     │   Server         │     │  (K8s pod)      │
│   PRs,       │     │                  │     │                 │
│   labels)    │     │ - Claude Code    │     │ - Report results│
│              │     │ - Poll daemon    │     │ - Agent sessions│
│              │     │ - Discord bot    │     │                 │
└──────────────┘     │ - Full repos     │     └─────────────────┘
                     │ - Shell/git/gh   │
      ┌──────────┐   └──────────────────┘
      │ Discord  │        ▲      ▲
      │ (users)  │────────┘      │ SSH
      └──────────┘               │
                           ┌─────┴──────┐
                           │  Operator  │
                           │  (manual)  │
                           └────────────┘

Open questions

  • Should the groundskeeper's issue-responder be disabled once the Hetzner polling daemon is running? Or should they coexist (groundskeeper for simple tasks, Hetzner for complex ones)?
  • Should the Hetzner server also run the groundskeeper itself (replacing the K8s pod)?
  • Do we want multiple repos cloned, or just longterm-wiki?
  • What's the budget/size constraint for the Hetzner server? (affects server_type choice)
  • Should the Discord bot be fully migrated to Hetzner, or should we keep a lightweight K8s version for basic Q&A and only delegate code tasks to Hetzner?
