rack validation: skeleton os image, agent stub and ci defs #206

Draft
maxo-nv wants to merge 1 commit into NVIDIA:main from maxo-nv:feature/mkosi_profile_testos_clean
maxo-nv commented Feb 9, 2026

Rack-level health assessment requires in-band access to each node's hardware, similar to what Scout OS provides. Extending the Scout agent, OS, and surrounding modules to support rack validation (RV) seems reasonable at first glance. However, detailed planning showed that this approach clashes with the original goal of keeping Scout OS and its agent as minimal as possible.

After discussion, it was decided to move away from Scout for rack validation and instead provide a basic OS image onto which test dependencies (drivers, software, etc.) are installed on demand.

What exactly gets installed is outside the responsibility of the OS or the agent. Eventually, Carbide should decide that.

The separate entity that does the install on behalf of Carbide needed a name. This is how the "Ranger" name came about, to gently mock the existing "Scout" name.

In Claude Opus 4.5's words: "Scout scouts, and Ranger ranges."

As the title notes, this patch is still a skeleton; further patches will flesh out the feature.

Description

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

maxo-nv requested review from a team, huaweic-nv and lachen-nv as code owners February 9, 2026 04:59
Copilot AI review requested due to automatic review settings February 9, 2026 04:59
copy-pr-bot bot commented Feb 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Signed-off-by: Max Olender <molender@nvidia.com>
maxo-nv force-pushed the feature/mkosi_profile_testos_clean branch from 2365f2f to b018170 February 9, 2026 05:00
Copilot AI left a comment

Pull request overview

Introduces initial “Ranger” rack-validation OS images and a stub Ranger agent, along with mkosi build profiles and CI jobs to produce and publish the artifacts.

Changes:

  • Added mkosi profiles for ranger (cpio rootfs) and ranger-loader (EFI loader) for x86_64 and aarch64.
  • Added a minimal Rust ranger-agent binary plus packaging assets (systemd unit, postinst).
  • Wired new build/package/copy tasks into cargo-make plus added CI jobs to build Ranger artifacts.

Reviewed changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
| --- | --- |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.postinst.chroot | Installs scout/ranger debs, enables services, writes pre-start script, does basic verification |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.extra/opt/ranger/.gitignore | Keeps /opt/ranger directory present in image without committing contents |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.extra/etc/systemd/system/ranger-agent.service | Systemd unit to run ranger-agent on boot |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.extra/etc/ssh/sshd_config | SSH server configuration for the Ranger image |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.extra/etc/hostname | Hostname for Ranger image |
| pxe/mkosi.profiles/ranger-x86_64/mkosi.conf | mkosi config for x86_64 Ranger rootfs image |
| pxe/mkosi.profiles/ranger-loader-x86_64/mkosi.finalize.chroot | Strips loader image for size (docs/locales/journal/kernel files) |
| pxe/mkosi.profiles/ranger-loader-x86_64/mkosi.extra/etc/modprobe.d/blacklist.conf | Blacklists selected kernel modules in loader |
| pxe/mkosi.profiles/ranger-loader-x86_64/mkosi.extra/etc/hostname | Hostname for x86_64 loader image |
| pxe/mkosi.profiles/ranger-loader-x86_64/mkosi.extra/etc/.gitignore | Ensures only intended /etc files are included in loader profile |
| pxe/mkosi.profiles/ranger-loader-x86_64/mkosi.conf | mkosi config for x86_64 Ranger loader (bootable EFI) |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.postinst.chroot | Decompresses kernel so UKI is valid on arm64 |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.finalize.chroot | Strips arm64 loader image for size (docs/locales/journal/firmware/modules) |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.extra/etc/modprobe.d/blacklist.conf | Blacklists selected kernel modules in arm64 loader |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.extra/etc/hostname | Hostname for arm64 loader image |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.extra/etc/.gitignore | Ensures only intended /etc files are included in arm64 loader profile |
| pxe/mkosi.profiles/ranger-loader-aarch64/mkosi.conf | mkosi config for arm64 Ranger loader (bootable EFI) |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.postinst.chroot | Same postinst flow as x86_64 for arm64 Ranger rootfs |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.extra/opt/ranger/.gitignore | Keeps /opt/ranger directory present in arm64 image |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.extra/etc/systemd/system/ranger-agent.service | Systemd unit to run ranger-agent on boot (arm64 image) |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.extra/etc/ssh/sshd_config | SSH server configuration for arm64 Ranger image |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.extra/etc/hostname | Hostname for arm64 Ranger image |
| pxe/mkosi.profiles/ranger-aarch64/mkosi.conf | mkosi config for arm64 Ranger rootfs image |
| pxe/common_files/ranger-loader-rclocal | rc.local script to download rootfs, measure into TPM PCR, and soft-reboot into it |
| pxe/Makefile.toml | Adds cargo-make tasks to build/copy Ranger images + stage loader rc.local |
| crates/ranger/src/main.rs | Adds stub ranger-agent implementation (CLI + heartbeat loop) |
| crates/ranger/misc/ranger-agent.service | Service file used for packaging/install |
| crates/ranger/misc/DEBIAN/postinst | Debian postinst to create env file and enable systemd service |
| crates/ranger/build.rs | Build script to embed version info |
| crates/ranger/Cargo.toml | New carbide-ranger crate definition |
| Makefile.toml | Adds build/package tasks for Ranger (local + container/cross variants) |
| .github/workflows/ci.yaml | Adds CI jobs to build Ranger images for x86_64 and aarch64 |


# ============================================================================
if [ -f /build-output/forge-scout.deb ]; then
dpkg -i /build-output/forge-scout.deb
echo "Scout agent installed (for TLS provisioning)"
Copilot AI Feb 9, 2026

/opt/ranger is created later in the script, so this copy will always fail during image build (and is then silently ignored) unless something else already created the directory. Create /opt/ranger before attempting the copy, or move the copy below the mkdir -p /opt/ranger section so the TLS root CA actually lands in the intended location.

Suggested change
- echo "Scout agent installed (for TLS provisioning)"
+ echo "Scout agent installed (for TLS provisioning)"
+ # Ensure ranger directory exists before copying TLS root CA
+ mkdir -p /opt/ranger

Comment on lines +83 to +84
# Generate SSH host keys if they don't exist (they will be regenerated on first boot)
ssh-keygen -A 2>/dev/null || true
Copilot AI Feb 9, 2026

Generating SSH host keys at image build time risks shipping identical host keys to all machines that boot this image (they will not be "regenerated on first boot" unless explicitly removed). Prefer leaving host key generation to first boot (or delete any generated keys in an image-finalize step) so each node gets unique host keys.

Suggested change
- # Generate SSH host keys if they don't exist (they will be regenerated on first boot)
- ssh-keygen -A 2>/dev/null || true
+ # Do NOT generate SSH host keys at image build time to avoid identical keys
+ # on all machines created from this image. Ensure no host keys are present so
+ # that they will be generated on first boot by the SSH service.
+ rm -f /etc/ssh/ssh_host_*_key /etc/ssh/ssh_host_*_key.pub || true


[Service]
Type=simple
EnvironmentFile=-/opt/ranger/ranger-agent.env
Copilot AI Feb 9, 2026

Because the EnvironmentFile is optional (-), ${RANGER_API_URI} / ${RANGER_MACHINE_ID} may expand to empty strings. That produces arguments like --api= (overriding clap defaults with an empty value) and --machine-interface-id= (likely setting an empty ID), which can break startup. Options: (1) make the env file required (remove -) and ensure it’s always present in the image, or (2) remove these CLI flags and rely on clap’s env/default_value handling, or (3) split into an ExecStart wrapper that conditionally appends flags only when set.

Suggested change
- EnvironmentFile=-/opt/ranger/ranger-agent.env
+ EnvironmentFile=/opt/ranger/ranger-agent.env
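
Option (3) from the comment above could look roughly like the wrapper below. This is only a sketch under stated assumptions: the function name `run_ranger_agent` and the `RANGER_AGENT_BIN` override are hypothetical and not part of this PR, while the flags and variable names are the ones quoted in the comment.

```shell
#!/bin/sh
# Sketch of an ExecStart wrapper: pass a flag only when its variable is non-empty.
# run_ranger_agent and RANGER_AGENT_BIN are hypothetical names used for illustration.

run_ranger_agent() {
    # Start from an empty argument list (local to this function call).
    set --
    if [ -n "${RANGER_API_URI:-}" ]; then
        set -- "$@" "--api=${RANGER_API_URI}"
    fi
    if [ -n "${RANGER_MACHINE_ID:-}" ]; then
        set -- "$@" "--machine-interface-id=${RANGER_MACHINE_ID}"
    fi
    # With both variables empty the agent runs without these flags,
    # so its own env/default handling stays in charge.
    "${RANGER_AGENT_BIN:-ranger-agent}" "$@"
}
```

A real wrapper would end by calling the function (ideally via `exec` on the binary so systemd tracks the agent as the main process) and would be referenced from `ExecStart=` in the unit.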

Comment on lines +11 to +13
PermitRootLogin yes
PubkeyAuthentication yes
PasswordAuthentication yes
Copilot AI Feb 9, 2026

This enables remote root login via password, which is a high-risk default (especially combined with a static root password in mkosi configs). If interactive SSH is required, prefer disabling password auth (PasswordAuthentication no) and using keys; also consider PermitRootLogin prohibit-password (or no) to narrow exposure.

Suggested change
- PermitRootLogin yes
- PubkeyAuthentication yes
- PasswordAuthentication yes
+ PermitRootLogin prohibit-password
+ PubkeyAuthentication yes
+ PasswordAuthentication no

Comment on lines +63 to +64

RootPassword=password
Copilot AI Feb 9, 2026

Hard-coding a known root password into an image is unsafe and becomes especially problematic with PasswordAuthentication yes and PermitRootLogin yes. Prefer removing RootPassword entirely, generating a unique password at provisioning time, or disabling password-based SSH/authentication by default and relying on injected SSH keys / console access for break-glass.

Suggested change
- RootPassword=password

RestartPreventExitStatus=255
StandardOutput=journal+console
StandardError=journal+console
LogLevel=info
Copilot AI Feb 9, 2026

LogLevel= is not a valid directive for a generic systemd service unit (it will be ignored and may emit warnings). If the intention is to control the agent log verbosity, pass --log-level=info, set an Environment variable consumed by the agent, or use systemd logging controls like LogLevelMax= (if supported in your target systemd) / journald configuration.

Suggested change
- LogLevel=info

newrootfsurl=$(cat /proc/cmdline | sed -e 's/.*newrootfs=//' -e 's/ .*//')
else
arch=$(uname -m)
newrootfsurl="http://carbide-static-pxe.forge/public/blobs/internal/${arch}/ranger.cpio.zst"
Copilot AI Feb 9, 2026

The loader fetches the next-stage rootfs over plain HTTP, which is vulnerable to on-path tampering. Even though you measure the downloaded blob into TPM PCRs, that does not prevent booting a tampered image unless there is a policy that blocks it. Prefer HTTPS and/or add explicit integrity verification (e.g., pinned hash/signature verification) before extracting and soft-rebooting.

Suggested change
- newrootfsurl="http://carbide-static-pxe.forge/public/blobs/internal/${arch}/ranger.cpio.zst"
+ newrootfsurl="https://carbide-static-pxe.forge/public/blobs/internal/${arch}/ranger.cpio.zst"
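
One way to add the explicit integrity check mentioned above is to compare the downloaded blob against a pinned SHA-256 digest before extracting it. The following is a minimal sketch, assuming `sha256sum` is available in the loader image and that the expected digest reaches the loader through some trusted channel (this PR does not specify one). `verify_blob` is a hypothetical helper name.

```shell
#!/bin/sh
# verify_blob FILE EXPECTED_SHA256
# Succeeds only if FILE hashes to the pinned digest (hypothetical helper).

verify_blob() {
    file=$1
    expected=$2
    # Refuse to "verify" against an empty digest.
    [ -n "$expected" ] || return 1
    # sha256sum prints "<digest>  <name>"; keep only the digest.
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
}
```

In the loader script, a check like this would sit between the download and the extraction step, so that a mismatch aborts before the soft-reboot rather than only being recorded in a TPM PCR.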

maxo-nv marked this pull request as draft February 9, 2026 05:01
@zhaozhongn

Is there a Jira or doc link about the scope of the rack validation intended via this image?

