8 changes: 7 additions & 1 deletion docs/source/getting-started/index.rst
@@ -55,7 +55,7 @@ Here's how simple it is to train a model:
Next Steps
----------

.. grid:: 2
.. grid:: 3
   :gutter: 3

   .. grid-item-card:: Installation
@@ -69,3 +69,9 @@ Next Steps
      :link-type: doc

      Train your first model step-by-step.

   .. grid-item-card:: Local Development
      :link: local-development
      :link-type: doc

      Set up your local dev environment.
197 changes: 197 additions & 0 deletions docs/source/getting-started/local-development.rst
@@ -0,0 +1,197 @@
Local Development with SDK Backends
====================================

This guide explains how to run Kubeflow training jobs locally using the SDK's
different backends, helping you iterate faster before deploying to a Kubernetes
cluster.

Overview
--------

The Kubeflow Trainer SDK provides three backends for running training jobs:

.. list-table:: Backend Comparison
   :header-rows: 1
   :widths: 20 35 45

   * - Backend
     - Best For
     - Requirements
   * - **Local Process**
     - Quick prototyping, single-node testing
     - Python 3.9+
   * - **Container**
     - Multi-node training, reproducibility
     - Docker or Podman installed
   * - **Kubernetes**
     - Production deployments
     - K8s cluster with Trainer operator

All backends use the same ``TrainerClient`` interface; only the configuration
object differs. This means you can develop locally and deploy to production
with minimal code changes.

Local Process Backend
---------------------

The fastest option for quick testing. Runs training directly as Python processes.

**When to use:**

- Rapid prototyping and debugging
- Testing training logic without container overhead
- Environments without Docker/Podman

**Example:**

.. code-block:: python

   from kubeflow.trainer import TrainerClient, LocalProcessBackendConfig
   from kubeflow.trainer import CustomTrainer

   # Configure local process backend
   backend_config = LocalProcessBackendConfig()
   client = TrainerClient(backend_config=backend_config)

   # Define your training function
   def train_model():
       import torch
       device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
       print(f"Training on device: {device}")
       # Your training logic here

   # Create trainer and run
   trainer = CustomTrainer(func=train_model)
   job_name = client.train(trainer=trainer)

   # View logs
   client.get_job_logs(name=job_name, follow=True)

**Limitations:**

- Single-node only (no distributed training)
- No container isolation

Container Backend (Docker/Podman)
---------------------------------

Run training in isolated containers with full multi-node distributed training support.

**When to use:**

- Distributed training with multiple workers
- Reproducible containerized environments
- Testing production-like setups locally

**Example with Docker:**

.. code-block:: python

   from kubeflow.trainer import TrainerClient, ContainerBackendConfig
   from kubeflow.trainer import CustomTrainer

   # Configure Docker backend
   backend_config = ContainerBackendConfig(
       container_runtime="docker",  # or "podman"
   )
   client = TrainerClient(backend_config=backend_config)

   # The same trainer works here, now with multi-node support
   trainer = CustomTrainer(
       func=train_model,
       num_nodes=4,  # Distributed across 4 containers
   )
   job_name = client.train(trainer=trainer)
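
Inside a multi-node job, each worker typically discovers its position through
environment variables. The sketch below assumes torchrun-convention ``RANK``
and ``WORLD_SIZE`` variables (an assumption about the launcher, not something
guaranteed by every setup); the defaults keep it safe for single-process runs:

.. code-block:: python

   import os

   def train_model():
       # torchrun-convention env vars; defaults cover single-process runs
       rank = int(os.environ.get("RANK", "0"))
       world_size = int(os.environ.get("WORLD_SIZE", "1"))
       print(f"worker {rank} of {world_size} starting")
       # Your distributed training logic here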

**Choosing Docker vs Podman:**

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Runtime
     - Recommended For
   * - Docker
     - General use, especially on macOS/Windows
   * - Podman
     - Linux servers, rootless/security-focused environments

Switching Between Backends
--------------------------

The key benefit of the SDK is seamless backend switching. Your training code
stays the same - only the backend configuration changes:

.. code-block:: python

   # Development: Use local process for fast iteration
   from kubeflow.trainer import LocalProcessBackendConfig
   backend_config = LocalProcessBackendConfig()

   # Testing: Switch to Docker for distributed testing
   from kubeflow.trainer import ContainerBackendConfig
   backend_config = ContainerBackendConfig(container_runtime="docker")

   # Production: Deploy to Kubernetes
   from kubeflow.trainer import KubernetesBackendConfig
   backend_config = KubernetesBackendConfig(namespace="kubeflow")

   # The same client and trainer code works with all backends
   client = TrainerClient(backend_config=backend_config)
   job_name = client.train(trainer=trainer)
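
Since only the configuration object differs, backend selection can be collapsed
into a single helper. This is a sketch, not part of the SDK; the
``make_backend_config`` name is hypothetical, and the imports are deferred so
only the chosen backend's config class is loaded:

.. code-block:: python

   def make_backend_config(mode: str):
       """Build a backend config by name (hypothetical helper, sketch only)."""
       if mode == "local":
           from kubeflow.trainer import LocalProcessBackendConfig
           return LocalProcessBackendConfig()
       if mode == "container":
           from kubeflow.trainer import ContainerBackendConfig
           return ContainerBackendConfig(container_runtime="docker")
       if mode == "kubernetes":
           from kubeflow.trainer import KubernetesBackendConfig
           return KubernetesBackendConfig(namespace="kubeflow")
       raise ValueError(f"unknown backend mode: {mode}")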

Common Operations
-----------------

These operations work identically across all backends:

**List Jobs:**

.. code-block:: python

   jobs = client.list_jobs()
   for job in jobs:
       print(f"{job.name}: {job.status}")

**View Logs:**

.. code-block:: python

   # Follow logs in real-time
   for log_line in client.get_job_logs(name=job_name, follow=True):
       print(log_line)

**Wait for Completion:**

.. code-block:: python

   job = client.wait_for_job_status(
       name=job_name,
       timeout=3600,  # 1 hour timeout
   )

**Delete Jobs:**

.. code-block:: python

   client.delete_job(name=job_name)

Troubleshooting
---------------

**Local Process Backend:**

- ``ModuleNotFoundError``: Ensure dependencies are installed in the current Python environment
- Training hangs: Check for infinite loops or blocking calls in your training function

**Container Backend:**

- ``Cannot connect to Docker daemon``: Start Docker/Podman service
- Image pull errors: Check network connectivity and image registry access
- Permission denied: For Podman, ensure rootless mode is configured
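
Before launching container jobs, it can save time to confirm the runtime is
actually reachable. The helper below is a sketch (not part of the SDK) that
shells out to ``docker info`` / ``podman info``, which exit non-zero when the
daemon is unreachable:

.. code-block:: python

   import shutil
   import subprocess

   def runtime_available(runtime: str = "docker") -> bool:
       """Return True if the runtime CLI exists and its daemon responds."""
       if shutil.which(runtime) is None:
           return False
       result = subprocess.run([runtime, "info"], capture_output=True)
       return result.returncode == 0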

Next Steps
----------

- `Custom Training <../train/custom-training.html>`_ - Define your trainers
- `Distributed Training <../train/distributed.html>`_ - Scale across nodes
- `Kubeflow Trainer Docs <https://www.kubeflow.org/docs/components/trainer/>`_ - Full documentation