This document is for goal setting and tracking over time. This content is initially taken from the Memorandum of Understanding with NLnet. The semantics of Epics and Milestones are also inherited from the MoU.
This phase aims to build robust and user-friendly device management tooling, specifically tailored for asynchronously managing a fleet of devices that are capable of and intended to run NixOS. These requirements were identified as unlocking NixOS adoption in small and medium businesses and educational organizations.
Upon successful completion of this first phase, the tooling will provide a centralized management system offering access control, fleet oversight, streamlined machine enrollment, and clear feedback on deployment status.
Epic 1: Architecture Decisions - Coordinator API, Data structures, Forward & Backward Compatibility considerations; Proof-of-Technology
In this epic I want to do the due diligence to evaluate the detailed technical requirements and make informed architectural decisions. I will draw from experience in the Holo-Host and NITS projects and incorporate real-world pain points from Numtide customers. I'm going to develop Proof-of-Technology level code to validate the decisions.
Within this Epic the items below shall be answered architecturally and proven by code.
Outcome: Design of the overall architecture and component internals, issue definitions, working components based on PoT code
[100 %] Milestone A: Network connectivity and protocols for synchronous/asynchronous messaging and routed/NAT'ed connections
- AC1: The Coordinator can provide a directly addressable network identity and interface so that Admins and Agents can be configured to connect with a specific Coordinator and effectively form a complete network.
- AC2: Connectivity support for standalone WAN and non-WAN deployments.
- AC3: Components can pass custom protocol messages over the network.
- AC4: There's an extensible mechanism by which a component can submit messages to the Coordinator, which caches them to guarantee eventual delivery, regardless of the online status of any component at the time of original message creation.
- AC5: The message delivery cache persists across component restarts.
Solving AC1: Iroh for node connectivity
At its core Iroh is a P2P framework that provides resilient connectivity between nodes. Using iroh as the connectivity framework provides flexibility for any network topology among the component instances.
The echo_completes_admin_to_coordinator test uses the custom Echo protocol to ensure the sent data is in fact transmitted and echoed correctly.
This test is quite comprehensive: it sets up a Coordinator and an Admin and runs the Echo protocol between them. More on this follows in subsequent ACs.
Iroh natively uses asymmetric ed25519 key pairs to address nodes. As a nice to have side-note: this allows reusing existing SSH keys where desired.
The public key is used as the identifier of a node and is resolved to a network address via a configurable and customizable discovery mechanism. If no direct or hole-punched connection is possible, Iroh falls back to a relayed connection if a relay server is configured and available.
Iroh provides open-source reference implementations for discovery and relay servers. These can be integrated in-process with any component – the Coordinator is the best fit for the requirements – or run in a dedicated process.
The echo_completes_admin_to_coordinator exercises an in-process relay and discovery stack between a Coordinator and an Admin. The test relies on the discoverability of the Coordinator's PublicKey, as the AdminArgs use the node_id to address the other side with no further network information.
I'm going to evaluate the low-level native Iroh SDK and irpc for sending custom data between the Admins, Coordinator, and Agents.
irpc is a higher-level SDK add-on which provides an abstraction for building RPC APIs with Iroh.
The implementation of an Echo protocol was a mild surprise in the amount of low-level code it requires, specifically for dealing with the boundaries of data across the stream, i.e. message framing, which Iroh does not provide natively.
Besides that, the SDK's facilities for custom protocols and their routing with the help of Application-Layer Protocol Negotiation (ALPN) are plenty sufficient.
The echo_completes_admin_to_coordinator test exercises a custom Echo protocol built with the native SDK using bi-directional streaming.
irpc comes with a convention of structuring a framed message protocol on top of Iroh.
I wrote a two-variant EchoRpc protocol with irpc; the result has a subjectively clearer structure than the implementation with the lower-level SDK.
To understand the overhead I wrote a deliberately unoptimized benchmark.
Unoptimized refers to the initialization costs not being factored out before the benchmark's iteration loop.
The result is an approximate 20% overhead in the overall benchmark performance for the irpc implementation.
At the current phase I'm concluding this to be acceptable and will opt for irpc to benefit from the higher-level building blocks. I'm optimistic that sufficient optimization is possible if the protocol transfer speeds turn out to be a bottleneck.
irpc does not provide caching, and the remote endpoint needs to be online for a successful message transmission.
Retries would need to be implemented manually between Admins and the Coordinator, as well as between the Coordinator and the Agents.
This is in-line with the central role of the Coordinator in this phase.
The goal here is to identify the highest-level SDK and libraries for eventual, i.e. asynchronous, message delivery.
There's a collection of existing protocols to build on top of.
Here I'm highlighting the following three that are maintained by the core team and seem related to the problem at hand:
- iroh-blobs: Provides blob and blob sequence transfer support for iroh. It implements a simple request-response protocol based on BLAKE3 verified streaming.
- iroh-gossip: Gossip protocol based on epidemic broadcast trees to disseminate messages among a swarm of nodes interested in a topic.
- iroh-docs: Composes iroh-blobs and iroh-gossip to enable multi-dimensional key-value documents with an eventual consistency synchronization protocol. It supports in-memory and persistent storage for documents.
Given its eventual consistency properties, iroh-docs is a viable candidate to provide runtime caching for retries out of the box.
Here I evaluate a solution that combines synchronous calls via a custom irpc protocol and asynchronous data exchange using iroh-docs. To avoid the complexity that multiple distributed writers would bring, I focus on single-writer documents until a later stage in the project.
The read and write capabilities in iroh-docs are based on asymmetric ed25519 key pairs, as are the author identities used to create and optionally sign document entries. For operational simplicity I've chosen to derive these key pairs from the node's main identity key pair.
As the evaluation scenario for this pattern I use a simple version of synchronous Agent enrollment with the Coordinator via irpc, and the continuous delivery of system facts via iroh-docs from the Agent to the Coordinator. The admin_can_get_subscriber_facts_via_coordinator test asserts the functionality of this pattern. The following sequence diagram visualizes the tested workflow:
```mermaid
sequenceDiagram
    participant Coordinator
    participant EnrollmentServiceAPI
    participant EnrollmentServiceActor
    participant Agent
    participant EnrollmentAgentAPI
    participant EnrollmentAgentActor
    %% Coordinator Startup Sequence
    Coordinator->>Coordinator: Starts up, initializes endpoint with secret key and relay mode
    Coordinator->>Coordinator: Sets up blob store, gossip, and docs protocols
    Coordinator->>EnrollmentServiceAPI: Spawns EnrollmentServiceAPI with secret key, blobs, and docs
    EnrollmentServiceAPI->>EnrollmentServiceActor: Creates EnrollmentServiceActor
    EnrollmentServiceActor->>EnrollmentServiceActor: Derives default author from secret key
    EnrollmentServiceActor->>EnrollmentServiceActor: Initializes node root document (ensures doc and author exist)
    EnrollmentServiceAPI->>Coordinator: Router accepts EnrollmentServiceAPI protocol and exposes it
    Coordinator->>Coordinator: Router fully set up, listening for connections
    %% Agent Startup Sequence
    Agent->>Agent: Starts up, initializes endpoint with secret key and relay mode
    Agent->>Agent: Sets up blob store, gossip, and docs protocols
    Agent->>EnrollmentAgentAPI: Spawns EnrollmentAgentAPI with secret key, endpoint, blobs, docs, and agent args (e.g., coordinator pubkey)
    EnrollmentAgentAPI->>EnrollmentAgentActor: Creates EnrollmentAgentActor
    EnrollmentAgentActor->>EnrollmentAgentActor: Derives default author from secret key
    EnrollmentAgentActor->>EnrollmentAgentActor: Initializes node root document
    EnrollmentAgentActor->>EnrollmentAgentActor: Initializes facts document (separate doc for sharing facts)
    EnrollmentAgentActor->>EnrollmentAgentActor: Spawns background facts update loop (updates facts every ~60s)
    EnrollmentAgentActor->>EnrollmentAgentActor: Spawns background subscription reconcile loop (reconciles subscriptions every ~10s)
    EnrollmentAgentAPI->>Agent: Router accepts EnrollmentAgentAPI protocol and exposes it
    Agent->>Agent: Router fully set up, agent is active
    %% Agent Subscription Loop (first run)
    EnrollmentAgentActor->>EnrollmentAgentActor: Subscription reconcile loop starts (first iteration)
    EnrollmentAgentActor->>Coordinator: Connects to Coordinator's EnrollmentServiceAPI
    EnrollmentAgentActor->>Coordinator: Sends subscribe request with agent's pubkey and facts document ticket
    EnrollmentAgentActor->>EnrollmentAgentActor: Updates local subscription state (e.g., last successful check timestamp)
    %% Coordinator Response to Subscribe
    Coordinator->>EnrollmentServiceActor: Receives subscribe request via EnrollmentServiceAPI
    EnrollmentServiceActor->>EnrollmentServiceActor: Validates and processes request
    EnrollmentServiceActor->>EnrollmentServiceActor: Imports the facts document ticket into docs
    EnrollmentServiceActor->>EnrollmentServiceActor: Starts syncing facts doc with agent's pubkey (background sync begins)
    EnrollmentServiceActor->>EnrollmentServiceActor: Updates enrolled subscribers map (adds agent with timestamp)
    EnrollmentServiceActor->>EnrollmentServiceActor: Persists updated subscribers list to node root document
    EnrollmentServiceActor->>Coordinator: Responds to agent with SubscribeResponse (success)
    EnrollmentAgentActor->>EnrollmentAgentActor: Marks subscription reconcile loop iteration as complete (waits for next interval)
```
In conclusion this pattern is promising and I'm going to try it out for subsequent features like synchronous update submission and asynchronous distribution.
iroh-docs supports persisting its data via the underlying iroh-blobs filesystem store.
I've been able to confirm that persistence works as expected: the admin_can_get_subscriber_facts_via_coordinator_after_coordinator_restart test confirms that the Coordinator remembers previously enrolled Agents after a restart of the Coordinator.
I conclude that the evidence supports continuing with this setup and using it to implement subsequent features.
[0 %] Milestone B: Authentication and Authorization Model, credential bootstrap flow for initial Admin, Coordinator, and Agent nodes
[0 %] Milestone C: Trust and validation model: from signed configuration changes to validation at activation time on Agent nodes.
[0 %] Milestone G: Artifact Evaluation, Build, Persistence, Delivery: from nix build to the equivalent of nix copy
- AC1: All components can be started interactively for development purposes and are fit for NixOS VM tests.
The main binary flt has an Applet enumeration for each component.
There's a subcommand structure for each of them that allows running each component separately.
The Justfile has commands to run all components that wrap cargo run appropriately and contain defaults for a local development setup.
These recipes can be discovered using just --list.
The rust-workspace Nix package provides the flt binary, so it can be used by the subsequent NixOS VM tests.
The project will require NixOS VM tests. At this point I want to connect the repository to a Nix-native CI (e.g. buildbot-nix, Hydra), preferably an existing instance.
Outcome: Repository on a publicly reachable Forge with Packages, Nix development shell definition, Nix-native CI, Binary Cache
- AC1: The repository provides a Nix Flake that exposes a package to build the Rust workspace binaries and run Rust tests.
- AC2: The Nix Flake exposes a devShell that provides dependencies to work on all components and run tests.
- AC3: The flake exposes a devShell with all the release tooling.
Here we can heavily rely on existing frameworks.
Using blueprint for the repository layout because it imposes a simple and sufficient structure for Nix-related files. Using the nix subdirectory prefix to keep the top level clean.
See flake.nix and nix/
Using crane for the rust-workspace package as it's a proven library that wraps around nixpkgs library functions and cargo itself.
The rust devShell uses the Rust specific craneLib.devShell.
It inherits its build-time and runtime dependencies from the previously described rust-workspace package to avoid duplication and keep maintenance easier.
- AC1: Ergonomic CLI args, file and environment configuration options based on insights from the first set of NixOS VM tests.
- AC2: Hierarchical configuration parser that merges CLI arguments, configuration files and environment variables
[100 %] Milestone C: Workflows and developer documentation for local and CI testing for Rust and Nix.
- AC1: Choose Rust test harness and linters
- AC2: nix flake check runs all Rust workflows
- AC3: Document native Rust and Nix-wrapped workflow execution
Using cargo-nextest to run the Rust test suite because it runs each test case in a separate process, supports flaky test heuristics, and provides test grouping/partitioning.
According to the blueprint convention, all Nix derivations that are exposed by a package's passthru.tests attribute are automatically exposed as flake checks.
This is implemented in the rust-workspace package with Nix derivations wrapping the cargo workflows clippy, deny, doc, and nextest.
Solving AC3: README#Contributing
I want to strike a balance in how much publicly available information is duplicated in the docs here. Hopefully it's beginner-friendly enough to encourage contributors, or to empower them to ask questions about missing information.
[100 %] Milestone D: Nix build infrastructure for x86_64-linux and aarch64-linux and signed Nix Binary cache
- AC1: Pull requests and direct branch pushes run all checks exposed by the Nix Flake
- AC2: CI artifacts can be retrieved via a documented Nix binary cache
Numtide runs a Buildbot-Nix instance that has been configured to respond to this project's pushes and pull requests. It has connected builders for x86_64-linux and aarch64-linux, and can run VM tests for the former as well.
See the project page on Numtide's Buildbot instance for a live view on build status for this project.
The Buildbot-Nix instance is set up to sign and upload builds to a publicly available cache.
I added the Nix Binary Cache section in the README with more context and instructions for end-users and contributors.
Building NixOS VM integration tests for the PoT that exists at this stage will allow efficient iteration towards production-grade code.