8 changes: 7 additions & 1 deletion docs/source/getting-started/index.rst
@@ -55,7 +55,7 @@ Here's how simple it is to train a model:
Next Steps
----------

.. grid:: 2
.. grid:: 3
   :gutter: 3

   .. grid-item-card:: Installation
@@ -69,3 +69,9 @@ Next Steps
      :link-type: doc

      Train your first model step-by-step.

   .. grid-item-card:: Local Development
      :link: local-development
      :link-type: doc

      Set up your local dev environment.
197 changes: 197 additions & 0 deletions docs/source/getting-started/local-development.rst
@@ -0,0 +1,197 @@
Local Development with SDK Backends
====================================

This guide explains how to run Kubeflow training jobs locally using the SDK's
different backends, helping you iterate faster before deploying to a Kubernetes
cluster.

Overview
--------

The Kubeflow Trainer SDK provides three backends for running training jobs:

.. list-table:: Backend Comparison
   :header-rows: 1
   :widths: 20 35 45

   * - Backend
     - Best For
     - Requirements
   * - **Local Process**
     - Quick prototyping, single-node testing
     - Python 3.9+
   * - **Container**
     - Multi-node training, reproducibility
     - Docker or Podman installed
   * - **Kubernetes**
     - Production deployments
     - K8s cluster with Trainer operator

All backends use the same ``TrainerClient`` interface; only the configuration
object differs. This means you can develop locally and deploy to production
with minimal code changes.

Local Process Backend
---------------------

The fastest option for quick testing. Runs training directly as Python processes.

**When to use:**

- Rapid prototyping and debugging
- Testing training logic without container overhead
- Environments without Docker/Podman

**Example:**

.. code-block:: python

   from kubeflow.trainer import TrainerClient, LocalProcessBackendConfig
   from kubeflow.trainer import CustomTrainer

   # Configure local process backend
   backend_config = LocalProcessBackendConfig()
   client = TrainerClient(backend_config=backend_config)

   # Define your training function
   def train_model():
       import torch
       device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
       print(f"Training on device: {device}")
       # Your training logic here

   # Create trainer and run
   trainer = CustomTrainer(func=train_model)
   job_name = client.train(trainer=trainer)

   # View logs
   client.get_job_logs(name=job_name, follow=True)

**Limitations:**

- Single-node only (no distributed training)
- No container isolation

Container Backend (Docker/Podman)
---------------------------------

Run training in isolated containers with full multi-node distributed training support.

**When to use:**

- Distributed training with multiple workers
- Reproducible containerized environments
- Testing production-like setups locally

**Example with Docker:**

.. code-block:: python

   from kubeflow.trainer import TrainerClient, ContainerBackendConfig
   from kubeflow.trainer import CustomTrainer

   # Configure Docker backend
   backend_config = ContainerBackendConfig(
       container_runtime="docker",  # or "podman"
   )
   client = TrainerClient(backend_config=backend_config)

   # The same trainer works here, now with multi-node support
   trainer = CustomTrainer(
       func=train_model,
       num_nodes=4,  # Distributed across 4 containers
   )
   job_name = client.train(trainer=trainer)
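
Inside a multi-node job, each worker typically discovers its position through
environment variables. The sketch below assumes torchrun-convention ``RANK``
and ``WORLD_SIZE`` variables (an assumption about the launcher, not something
guaranteed by every setup); the defaults keep it safe for single-process runs:

.. code-block:: python

   import os

   def train_model():
       # torchrun-convention env vars; defaults cover single-process runs
       rank = int(os.environ.get("RANK", "0"))
       world_size = int(os.environ.get("WORLD_SIZE", "1"))
       print(f"worker {rank} of {world_size} starting")
       # Your distributed training logic here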

**Choosing Docker vs Podman:**

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Runtime
     - Recommended For
   * - Docker
     - General use, especially on macOS/Windows
   * - Podman
     - Linux servers, rootless/security-focused environments

Switching Between Backends
--------------------------

The key benefit of the SDK is seamless backend switching. Your training code
stays the same - only the backend configuration changes:

.. code-block:: python

   # Development: Use local process for fast iteration
   from kubeflow.trainer import LocalProcessBackendConfig
   backend_config = LocalProcessBackendConfig()

   # Testing: Switch to Docker for distributed testing
   from kubeflow.trainer import ContainerBackendConfig
   backend_config = ContainerBackendConfig(container_runtime="docker")

   # Production: Deploy to Kubernetes
   from kubeflow.trainer import KubernetesBackendConfig
   backend_config = KubernetesBackendConfig(namespace="kubeflow")

   # The same client and trainer code works with all backends
   client = TrainerClient(backend_config=backend_config)
   job_name = client.train(trainer=trainer)
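
Since only the configuration object differs, backend selection can be collapsed
into a single helper. This is a sketch, not part of the SDK; the
``make_backend_config`` name is hypothetical, and the imports are deferred so
only the chosen backend's config class is loaded:

.. code-block:: python

   def make_backend_config(mode: str):
       """Build a backend config by name (hypothetical helper, sketch only)."""
       if mode == "local":
           from kubeflow.trainer import LocalProcessBackendConfig
           return LocalProcessBackendConfig()
       if mode == "container":
           from kubeflow.trainer import ContainerBackendConfig
           return ContainerBackendConfig(container_runtime="docker")
       if mode == "kubernetes":
           from kubeflow.trainer import KubernetesBackendConfig
           return KubernetesBackendConfig(namespace="kubeflow")
       raise ValueError(f"unknown backend mode: {mode}")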

Common Operations
-----------------

These operations work identically across all backends:

**List Jobs:**

.. code-block:: python

   jobs = client.list_jobs()
   for job in jobs:
       print(f"{job.name}: {job.status}")

**View Logs:**

.. code-block:: python

   # Follow logs in real-time
   for log_line in client.get_job_logs(name=job_name, follow=True):
       print(log_line)

**Wait for Completion:**

.. code-block:: python

   job = client.wait_for_job_status(
       name=job_name,
       timeout=3600,  # 1 hour timeout
   )

**Delete Jobs:**

.. code-block:: python

   client.delete_job(name=job_name)

Troubleshooting
---------------

**Local Process Backend:**

- ``ModuleNotFoundError``: Ensure dependencies are installed in the current Python environment
- Training hangs: Check for infinite loops or blocking calls in your training function

**Container Backend:**

- ``Cannot connect to Docker daemon``: Start Docker/Podman service
- Image pull errors: Check network connectivity and image registry access
- Permission denied: For Podman, ensure rootless mode is configured
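
Before launching container jobs, it can save time to confirm the runtime is
actually reachable. The helper below is a sketch (not part of the SDK) that
shells out to ``docker info`` / ``podman info``, which exit non-zero when the
daemon is unreachable:

.. code-block:: python

   import shutil
   import subprocess

   def runtime_available(runtime: str = "docker") -> bool:
       """Return True if the runtime CLI exists and its daemon responds."""
       if shutil.which(runtime) is None:
           return False
       result = subprocess.run([runtime, "info"], capture_output=True)
       return result.returncode == 0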

Next Steps
----------

- `Custom Training <../train/custom-training.html>`_ - Define your trainers
- `Distributed Training <../train/distributed.html>`_ - Scale across nodes
- `Kubeflow Trainer Docs <https://www.kubeflow.org/docs/components/trainer/>`_ - Full documentation