rack validation: skeleton os image, agent stub and ci defs #206
maxo-nv wants to merge 1 commit into NVIDIA:main
Rack-level health assessment requires in-band access to each node's hardware, similar to what Scout OS does. Extending the Scout agent, OS, and surrounding modules to support rack validation (RV) seems like a reasonable idea at first glance. However, detailed planning has shown that such an approach clashes with the original idea of keeping Scout OS and the corresponding agent as minimal as possible.

As a result of the discussion, it was decided to move away from Scout for rack-related validation and instead provide a basic OS image with test dependencies (drivers, software, etc.) installed on demand.

What exactly should be installed is outside the responsibility of the OS or the agent. Eventually, Carbide should decide on that.

The separate entity that does the install on behalf of Carbide needed a name. This is how the "Ranger" name came about, to gently mock the existing "Scout" name. In Claude Opus 4.5's words: "Scout scouts, and Ranger ranges."

As noted in the title of the patch, this is still a skeleton, and further patches will bring more meat to the feature.

Signed-off-by: Max Olender <molender@nvidia.com>
2365f2f to b018170
Pull request overview
Introduces initial “Ranger” rack-validation OS images and a stub Ranger agent, along with mkosi build profiles and CI jobs to produce and publish the artifacts.
Changes:
- Added mkosi profiles for `ranger` (cpio rootfs) and `ranger-loader` (EFI loader) for x86_64 and aarch64.
- Added a minimal Rust `ranger-agent` binary plus packaging assets (systemd unit, postinst).
- Wired new build/package/copy tasks into `cargo-make` and added CI jobs to build Ranger artifacts.
Reviewed changes
Copilot reviewed 32 out of 33 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pxe/mkosi.profiles/ranger-x86_64/mkosi.postinst.chroot | Installs scout/ranger debs, enables services, writes pre-start script, does basic verification |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.extra/opt/ranger/.gitignore | Keeps /opt/ranger directory present in image without committing contents |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.extra/etc/systemd/system/ranger-agent.service | Systemd unit to run ranger-agent on boot |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.extra/etc/ssh/sshd_config | SSH server configuration for the Ranger image |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.extra/etc/hostname | Hostname for Ranger image |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.conf | mkosi config for x86_64 Ranger rootfs image |
| pxe/mkosi.profiles/ranger-loader-x86_64/mkosi.finalize.chroot | Strips loader image for size (docs/locales/journal/kernel files) |
| pxe/mkosi.profiles/ranger-loader-x86_64/mkosi.extra/etc/modprobe.d/blacklist.conf | Blacklists selected kernel modules in loader |
| pxe/mkosi.profiles/ranger-loader-x86_64/mkosi.extra/etc/hostname | Hostname for x86_64 loader image |
| pxe/mkosi.profiles/ranger-loader-x86_64/mkosi.extra/etc/.gitignore | Ensures only intended /etc files are included in loader profile |
| pxe/mkosi.profiles/ranger-loader-x86_64/mkosi.conf | mkosi config for x86_64 Ranger loader (bootable EFI) |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.postinst.chroot | Decompresses kernel so UKI is valid on arm64 |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.finalize.chroot | Strips arm64 loader image for size (docs/locales/journal/firmware/modules) |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.extra/etc/modprobe.d/blacklist.conf | Blacklists selected kernel modules in arm64 loader |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.extra/etc/hostname | Hostname for arm64 loader image |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.extra/etc/.gitignore | Ensures only intended /etc files are included in arm64 loader profile |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.conf | mkosi config for arm64 Ranger loader (bootable EFI) |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.postinst.chroot | Same postinst flow as x86_64 for arm64 Ranger rootfs |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.extra/opt/ranger/.gitignore | Keeps /opt/ranger directory present in arm64 image |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.extra/etc/systemd/system/ranger-agent.service | Systemd unit to run ranger-agent on boot (arm64 image) |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.extra/etc/ssh/sshd_config | SSH server configuration for arm64 Ranger image |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.extra/etc/hostname | Hostname for arm64 Ranger image |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.conf | mkosi config for arm64 Ranger rootfs image |
| pxe/common_files/ranger-loader-rclocal | rc.local script to download rootfs, measure into TPM PCR, and soft-reboot into it |
| pxe/Makefile.toml | Adds cargo-make tasks to build/copy Ranger images + stage loader rc.local |
| crates/ranger/src/main.rs | Adds stub ranger-agent implementation (CLI + heartbeat loop) |
| crates/ranger/misc/ranger-agent.service | Service file used for packaging/install |
| crates/ranger/misc/DEBIAN/postinst | Debian postinst to create env file and enable systemd service |
| crates/ranger/build.rs | Build script to embed version info |
| crates/ranger/Cargo.toml | New carbide-ranger crate definition |
| Makefile.toml | Adds build/package tasks for Ranger (local + container/cross variants) |
| .github/workflows/ci.yaml | Adds CI jobs to build Ranger images for x86_64 and aarch64 |
    # ============================================================================
    if [ -f /build-output/forge-scout.deb ]; then
        dpkg -i /build-output/forge-scout.deb
        echo "Scout agent installed (for TLS provisioning)"

`/opt/ranger` is created later in the script, so this copy will always fail during image build (and is then silently ignored) unless something else has already created the directory. Create `/opt/ranger` before attempting the copy, or move the copy below the `mkdir -p /opt/ranger` section, so that the TLS root CA actually lands in the intended location.

Suggested change:

    -    echo "Scout agent installed (for TLS provisioning)"
    +    echo "Scout agent installed (for TLS provisioning)"
    +    # Ensure ranger directory exists before copying TLS root CA
    +    mkdir -p /opt/ranger
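
The ordering fix can be sketched as a small helper. This is only an illustration under assumptions: the function name `install_root_ca`, the file name `root-ca.pem`, and the temporary demo root are all hypothetical and do not come from the patch. The point is that the destination directory is created before the copy, and a failed copy is reported rather than swallowed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# install_root_ca SRC DEST_DIR  (hypothetical helper, not part of the patch)
# Creates DEST_DIR first, then copies SRC into it. A missing source or a
# failed copy is returned to the caller instead of being silently ignored.
install_root_ca() {
    local src="$1" dest_dir="$2"
    if [ ! -f "$src" ]; then
        echo "root CA not found: $src" >&2
        return 1
    fi
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir/"
}

# Demo against a throwaway directory so the sketch can be run anywhere.
demo_root="$(mktemp -d)"
printf 'dummy certificate\n' > "$demo_root/root-ca.pem"
install_root_ca "$demo_root/root-ca.pem" "$demo_root/opt/ranger"
ls "$demo_root/opt/ranger"
```

In an image build script, letting the helper's non-zero status abort the build (rather than appending `|| true`) would surface a missing CA at build time instead of at first boot.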
    # Generate SSH host keys if they don't exist (they will be regenerated on first boot)
    ssh-keygen -A 2>/dev/null || true

Generating SSH host keys at image build time risks shipping identical host keys to all machines that boot this image (they will not be "regenerated on first boot" unless explicitly removed). Prefer leaving host key generation to first boot (or delete any generated keys in an image-finalize step) so that each node gets unique host keys.

Suggested change:

    -# Generate SSH host keys if they don't exist (they will be regenerated on first boot)
    -ssh-keygen -A 2>/dev/null || true
    +# Do NOT generate SSH host keys at image build time to avoid identical keys
    +# on all machines created from this image. Ensure no host keys are present so
    +# that they will be generated on first boot by the SSH service.
    +rm -f /etc/ssh/ssh_host_*_key /etc/ssh/ssh_host_*_key.pub || true
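
A finalize-step variant of this suggestion might look like the sketch below. The helper name `purge_host_keys` and the demo directory are hypothetical; the sketch deletes the keys and then checks that none survived, so a packaging step that quietly regenerates them would be caught.

```shell
#!/usr/bin/env bash
set -euo pipefail

# purge_host_keys ROOT  (hypothetical helper, not part of the patch)
# Deletes SSH host keys under ROOT/etc/ssh and fails if any remain.
purge_host_keys() {
    local root="$1"
    rm -f "$root"/etc/ssh/ssh_host_*_key "$root"/etc/ssh/ssh_host_*_key.pub
    # compgen -G succeeds only if the glob still matches at least one path.
    if compgen -G "$root/etc/ssh/ssh_host_*" > /dev/null; then
        echo "host keys still present under $root/etc/ssh" >&2
        return 1
    fi
}

# Demo against a throwaway root populated with placeholder key files.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/etc/ssh"
touch "$demo_root/etc/ssh/ssh_host_ed25519_key" \
      "$demo_root/etc/ssh/ssh_host_ed25519_key.pub" \
      "$demo_root/etc/ssh/sshd_config"
purge_host_keys "$demo_root"
ls "$demo_root/etc/ssh"
```

Whether keys are then regenerated at first boot depends on the distribution's SSH packaging; if nothing does so, a first-boot step that runs `ssh-keygen -A` before sshd starts would be needed.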
    [Service]
    Type=simple
    EnvironmentFile=-/opt/ranger/ranger-agent.env

Because the `EnvironmentFile` is optional (`-`), `${RANGER_API_URI}` / `${RANGER_MACHINE_ID}` may expand to empty strings. That produces arguments like `--api=` (overriding clap defaults with an empty value) and `--machine-interface-id=` (likely setting an empty ID), which can break startup. Options: (1) make the env file required (remove `-`) and ensure it's always present in the image, or (2) remove these CLI flags and rely on clap's env/default_value handling, or (3) split into an `ExecStart` wrapper that conditionally appends flags only when set.

Suggested change:

    -EnvironmentFile=-/opt/ranger/ranger-agent.env
    +EnvironmentFile=/opt/ranger/ranger-agent.env
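
Option (3) could be sketched as a wrapper like the one below. The flag names `--api` and `--machine-interface-id` and the variable names are taken from the review comment; the function name `build_agent_args` and the binary path in the final comment are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -eo pipefail

# build_agent_args  (hypothetical helper)
# Prints one argument per line, emitting a flag only when the matching
# variable is set and non-empty, so the agent never sees "--api=".
build_agent_args() {
    if [ -n "${RANGER_API_URI:-}" ]; then
        printf '%s\n' "--api=${RANGER_API_URI}"
    fi
    if [ -n "${RANGER_MACHINE_ID:-}" ]; then
        printf '%s\n' "--machine-interface-id=${RANGER_MACHINE_ID}"
    fi
}

# Collect the arguments into an array, one element per emitted line.
mapfile -t agent_args < <(build_agent_args)
echo "would exec: ranger-agent ${agent_args[*]}"
# A real wrapper would end with something like (path is an assumption):
#   exec /usr/bin/ranger-agent "${agent_args[@]}"
```

With neither variable set, the agent would be started with no flags at all and fall back to whatever defaults it defines.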
    PermitRootLogin yes
    PubkeyAuthentication yes
    PasswordAuthentication yes

This enables remote root login via password, which is a high-risk default (especially combined with a static root password in the mkosi configs). If interactive SSH is required, prefer disabling password auth (`PasswordAuthentication no`) and using keys; also consider `PermitRootLogin prohibit-password` (or `no`) to narrow exposure.

Suggested change:

    -PermitRootLogin yes
    -PubkeyAuthentication yes
    -PasswordAuthentication yes
    +PermitRootLogin prohibit-password
    +PubkeyAuthentication yes
    +PasswordAuthentication no
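
A build-time guard against regressing to the insecure settings might look like the following sketch. The function name `audit_sshd_config` is hypothetical, and the check is deliberately simple: it only looks at active top-level lines in a single file and does not follow `Include` directives or `Match` blocks.

```shell
#!/usr/bin/env bash
set -euo pipefail

# audit_sshd_config FILE  (hypothetical helper, not part of the patch)
# Fails if FILE actively enables password logins or unrestricted root login.
# Commented-out lines are ignored; keyword matching is case-insensitive.
audit_sshd_config() {
    local file="$1" status=0
    if grep -Eiq '^[[:space:]]*PasswordAuthentication[[:space:]]+yes([[:space:]]|$)' "$file"; then
        echo "$file: PasswordAuthentication yes" >&2
        status=1
    fi
    if grep -Eiq '^[[:space:]]*PermitRootLogin[[:space:]]+yes([[:space:]]|$)' "$file"; then
        echo "$file: PermitRootLogin yes" >&2
        status=1
    fi
    return "$status"
}

# Demo: the hardened settings from the suggestion pass the audit.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/sshd_config" <<'EOF'
PermitRootLogin prohibit-password
PubkeyAuthentication yes
PasswordAuthentication no
EOF
audit_sshd_config "$demo_dir/sshd_config" && echo "sshd_config passes the audit"
```

For the effective configuration on a running system, `sshd -T` prints the fully resolved settings and is a more reliable source than grepping one file.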
    RootPassword=password

Hard-coding a known root password into an image is unsafe and becomes especially problematic with `PasswordAuthentication yes` and `PermitRootLogin yes`. Prefer removing `RootPassword` entirely, generating a unique password at provisioning time, or disabling password-based SSH/authentication by default and relying on injected SSH keys / console access for break-glass.

Suggested change:

    -RootPassword=password
    RestartPreventExitStatus=255
    StandardOutput=journal+console
    StandardError=journal+console
    LogLevel=info

`LogLevel=` is not a valid directive for a generic systemd service unit (it will be ignored and may emit warnings). If the intention is to control the agent log verbosity, pass `--log-level=info`, set an environment variable consumed by the agent, or use systemd logging controls like `LogLevelMax=` (if supported in your target systemd) / journald configuration.

Suggested change:

    -LogLevel=info
        newrootfsurl=$(cat /proc/cmdline | sed -e 's/.*newrootfs=//' -e 's/ .*//')
    else
        arch=$(uname -m)
        newrootfsurl="http://carbide-static-pxe.forge/public/blobs/internal/${arch}/ranger.cpio.zst"

The loader fetches the next-stage rootfs over plain HTTP, which is vulnerable to on-path tampering. Even though you measure the downloaded blob into TPM PCRs, that does not prevent booting a tampered image unless there is a policy that blocks it. Prefer HTTPS and/or add explicit integrity verification (e.g., pinned hash/signature verification) before extracting and soft-rebooting.

Suggested change:

    -    newrootfsurl="http://carbide-static-pxe.forge/public/blobs/internal/${arch}/ranger.cpio.zst"
    +    newrootfsurl="https://carbide-static-pxe.forge/public/blobs/internal/${arch}/ranger.cpio.zst"
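
The pinned-hash idea could be sketched as follows. Both helpers (`parse_newrootfs`, `verify_sha256`) are hypothetical names, the URL in the demo is a placeholder, and where the pinned digest would actually come from (for example, baked into the loader image) is left open as an assumption. The parser also reads `newrootfs=` as a whole kernel parameter rather than cutting at the first matching substring.

```shell
#!/usr/bin/env bash
set -euo pipefail

# parse_newrootfs CMDLINE  (hypothetical helper)
# Prints the value of the first newrootfs= parameter, or nothing if absent.
parse_newrootfs() {
    local -a words
    local word
    read -r -a words <<< "$1"
    for word in "${words[@]}"; do
        case "$word" in
            newrootfs=*) printf '%s\n' "${word#newrootfs=}"; return 0 ;;
        esac
    done
}

# verify_sha256 FILE EXPECTED_HEX  (hypothetical helper)
# Succeeds only if the SHA-256 digest of FILE equals the pinned value.
verify_sha256() {
    local file="$1" expected="$2" actual
    actual="$(sha256sum "$file" | cut -d' ' -f1)"
    [ "$actual" = "$expected" ]
}

# Demo with a placeholder command line and a stand-in blob.
cmdline='console=ttyS0 newrootfs=https://example.invalid/ranger.cpio.zst quiet'
url="$(parse_newrootfs "$cmdline")"
echo "rootfs URL: $url"

blob="$(mktemp)"
printf 'pretend rootfs\n' > "$blob"
# Stands in for a digest shipped with the loader; computed here for the demo.
pinned="$(sha256sum "$blob" | cut -d' ' -f1)"
if verify_sha256 "$blob" "$pinned"; then
    echo "digest ok, safe to extract"
fi
```

In a loader script, a failed `verify_sha256` would be the point to stop before extraction and soft-reboot, which is the blocking policy the TPM measurement alone does not provide.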
Is there a Jira or doc link about the scope of the rack validation intended via this image? |