kcover - Kubernetes Coverage for Fault Awareness and Recovery

Welcome to kcover, a Kubernetes solution designed to enhance the reliability and resilience of large-scale AI workloads by providing fault awareness and robust instant recovery mechanisms.

Features

Fault Awareness: Detect and respond to hardware, network, and software failures dynamically.
Instant Recovery: Quickly restore operations without manual intervention, minimizing downtime and ensuring continuous training and service availability.
Scalability: Designed for large-scale environments, handling complexities of distributed AI workloads.

Getting Started

Prerequisites

You need a running Kubernetes cluster and the Helm CLI configured to access it. kcover is compatible with Kubernetes versions 1.19 and above.

Installation

Install kcover using Helm:

helm repo add baizeai https://baizeai.github.io/charts
helm install kcover baizeai/kcover --namespace kcover-system --create-namespace

Configuration

Configure kcover to monitor specific Kubernetes resources by labeling them:

kubectl label pytorchjobs <job-name> kcover.io/cascading-recovery=true
kubectl label pytorchjobs <job-name> kcover.io/need-recovery=true

Both kcover and the agent read the current node name from the NODE_NAME environment variable. The Helm templates inject it automatically from spec.nodeName. The legacy FAST_RECOVERY_NODE_NAME variable is still read for backward compatibility during migration, but new deployments should set only NODE_NAME.
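
The node-name lookup can be illustrated with a short Python sketch. It is not the project's code: the helper name resolve_node_name is hypothetical, and preferring NODE_NAME over the legacy variable is an assumption drawn from the migration note rather than from the kcover source.

```python
import os
from typing import Mapping, Optional


def resolve_node_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the node name from the environment.

    Hypothetical helper. Assumed precedence: NODE_NAME first, then the
    legacy FAST_RECOVERY_NODE_NAME as a fallback.
    """
    env = os.environ if env is None else env
    for key in ("NODE_NAME", "FAST_RECOVERY_NODE_NAME"):
        value = env.get(key, "").strip()
        if value:
            return value
    raise RuntimeError("NODE_NAME is not set")
```

Passing an explicit mapping instead of reading os.environ directly keeps the helper easy to exercise outside a cluster.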

Agent Config

The agent supports loading its runtime configuration from a YAML file mounted from a ConfigMap. The Helm chart creates a default ConfigMap automatically, and you can also point the agent to an existing user-managed ConfigMap.

The only runtime flag kept by the agent is --config, which points to the mounted configuration file. All other settings, such as interval, vendor, and the metaX thresholds, are now read from the config file only.

Default chart-managed config:

agent:
  config:
    data:
      vendor: 1
      interval: 5
      metaX:
        hcaIDs:
          - mlx5_0
          - mlx5_1
        day2CheckHour: 10
        gpuNum: 8
        temperature: 85
        eccMaxCount: 64
        ntpMaxOffsetMillis: 10

If metaX.hcaIDs is set, the agent runs ibv_devinfo and requires every listed hca_id to have state: PORT_ACTIVE (...).
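
As a rough illustration of the check described above, the following Python sketch parses ibv_devinfo-style text and reports which required HCAs are not fully active. It is a sketch under stated assumptions, not the agent's implementation: the function names are hypothetical, the sample output is abbreviated, and treating a missing HCA or any non-active port as a failure is an assumption.

```python
import re
from typing import Dict, Iterable, List

# Abbreviated, illustrative ibv_devinfo output (not captured from a real host).
SAMPLE_OUTPUT = """\
hca_id: mlx5_0
        transport:                      InfiniBand (0)
                port:   1
                        state:                  PORT_ACTIVE (4)
hca_id: mlx5_1
        transport:                      InfiniBand (0)
                port:   1
                        state:                  PORT_DOWN (1)
"""


def port_states(ibv_devinfo_output: str) -> Dict[str, List[str]]:
    """Map each hca_id to the list of port states printed beneath it."""
    states: Dict[str, List[str]] = {}
    current = None
    for line in ibv_devinfo_output.splitlines():
        hca = re.match(r"\s*hca_id:\s*(\S+)", line)
        if hca:
            current = hca.group(1)
            states[current] = []
            continue
        # Anchored match, so a "phys_state:" line is not picked up here.
        state = re.match(r"\s*state:\s*(\S+)", line)
        if state and current is not None:
            states[current].append(state.group(1))
    return states


def inactive_hcas(output: str, required: Iterable[str]) -> List[str]:
    """Return required HCAs that are missing or have a port not PORT_ACTIVE."""
    states = port_states(output)
    bad = []
    for hca in required:
        ports = states.get(hca)
        if not ports or any(s != "PORT_ACTIVE" for s in ports):
            bad.append(hca)
    return bad
```

With the sample text, inactive_hcas(SAMPLE_OUTPUT, ["mlx5_0", "mlx5_1"]) returns ["mlx5_1"], because only mlx5_0 reports PORT_ACTIVE.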

Use a user-defined ConfigMap:

agent:
  config:
    existingConfigMap: my-agent-config
    path: /etc/kcover-agent/config.yaml

Usage

Once installed, kcover will automatically monitor the labeled resources for any signs of failures and perform recovery actions as specified in the configuration.

Preflight Slow Node Detection

The collector expects one preflight report per node.
workload_size is required in the report so the manager can determine the expected report count and batch count.
Each report must contain exactly min(workload_size - 1, 5) logical batch slots, although fail-fast nodes may skip pairwise batch parsing entirely.
For the common 16-node topology, this usually means 16 reports and 15 possible pairings, but the current manager-side aggregation only consumes up to 5 batches per report.
Nodes that fail gpu_check or storage_check are marked abnormal directly and excluded from pairwise slow-node intersection.
Pairwise slow-node detection marks a node as slow only when its node IP appears in failed observations across every effective batch considered by the aggregation logic.
Agent-side node events carry a compacted preflight payload rather than the raw host report. The compacted payload keeps only manager-required fields: report identity plus per-batch batch_idx, pair, self_ip, status, and performance fields needed for bus-bandwidth threshold evaluation.
Incomplete report collections no longer wait forever. The controller expires stale job aggregations once the duration set by the controller flag --preflight-report-collection-timeout has elapsed, and emits a warning event describing how many reports were received.
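
The batch-slot count and the intersection rule described above can be sketched in Python. This is a simplified model, not the manager's code: the function names and the input shape (one set of failed node IPs per effective batch) are hypothetical, and removing abnormal nodes from the final result is only one possible reading of "excluded from pairwise slow-node intersection".

```python
from typing import AbstractSet, List, Set

MAX_BATCHES = 5  # the manager consumes at most 5 batches per report


def expected_batch_slots(workload_size: int) -> int:
    """Number of logical batch slots per report: min(workload_size - 1, 5)."""
    if workload_size < 1:
        raise ValueError("workload_size must be at least 1")
    return min(workload_size - 1, MAX_BATCHES)


def slow_nodes(
    failed_by_batch: List[Set[str]],
    abnormal: AbstractSet[str] = frozenset(),
) -> Set[str]:
    """Return node IPs that appear in failed observations in every batch.

    failed_by_batch[i] holds the node IPs seen in failed observations of
    effective batch i. Nodes already marked abnormal (failed gpu_check or
    storage_check) are dropped from the result; this is an assumption.
    """
    if not failed_by_batch:
        return set()
    return set.intersection(*failed_by_batch) - set(abnormal)
```

For a 16-node workload, expected_batch_slots(16) is 5. A node that fails in only some of the batches is not flagged, because the intersection requires it to appear in every batch.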

Supported compacted report threshold field:

node_check_busbw_threshold_gbps: "5"

Controller timeout example:

controller:
  args:
    - --preflight-report-collection-timeout=30m

Controller leader election can also be toggled from chart values. Keep it enabled for multi-replica or HA deployments. Disable it only when you want a single controller instance to bypass Lease lock acquisition.

controller:
  leaderElection:
    enabled: false

Image Build Notes

The MetaX utility mx-smi is extracted into a dedicated image so that the agent image no longer needs to reference the full maca-pytorch runtime directly.

Extracted image: ghcr.io/baizeai/mx-smi:v0.1
Agent base runtime: ubuntu:24.04
Agent build arg: MX_SMI_IMAGE=ghcr.io/baizeai/mx-smi:v0.1

Build and push the extracted mx-smi image:

make image-mx-smi

Build and push the agent image with the extracted mx-smi image injected:

make image-agent

If you need to build manually, use:

docker build -f docker/mx-smi.Dockerfile -t ghcr.io/baizeai/mx-smi:v0.1 .
docker build -f docker/agent.Dockerfile --build-arg MX_SMI_IMAGE=ghcr.io/baizeai/mx-smi:v0.1 -t ghcr.io/baizeai/kcover-agent:<tag> .
