Add guaranteed instance termination on all exit scenarios by tomtao57 · Pull Request #1 · fortyfive-labs/jaynes

tomtao57 · 2026-01-08T02:56:05Z

Problem

Previously, EC2/GCE instances would only terminate if the launch script completed successfully. If any error occurred during setup, upload, training, or post-processing, the script would exit early and never reach the termination commands, leaving instances running indefinitely and incurring unnecessary costs.

Solution

Implemented bash trap handlers that guarantee instance termination regardless of how the script exits.

How It Works

Added a cleanup() function containing the termination commands and set a trap that calls it on bash's EXIT and ERR conditions and on the INT and TERM signals:

# Define cleanup function for guaranteed termination
cleanup() {
    EXIT_CODE=$?
    echo "Cleanup triggered (exit code: $EXIT_CODE)"
    # Termination commands (with configured delay)
    aws ec2 terminate-instances --instance-ids "$EC2_INSTANCE_ID" --region "$REGION"
}
# Set trap to call cleanup on EXIT, ERR, INT, and TERM
trap cleanup EXIT ERR INT TERM

Termination Guaranteed On

✅ Normal completion (exit 0)
✅ Command failures (exit 1)
✅ Command not found errors (exit 127)
✅ Python/training script errors
✅ Missing dependencies
✅ File not found errors
✅ Script interruptions (Ctrl+C)
✅ Kill signals (SIGTERM)
✅ Any other exit scenario
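
The scenarios in this list can be tried locally without touching a cloud instance. The following is a minimal sketch, not the code generated by jaynes: `run_with_cleanup` and `terminate_instance` are hypothetical names, and the real `aws ec2 terminate-instances` call is replaced by an `echo`. It traps only EXIT for the cleanup and turns INT and TERM into ordinary exits, which is one common way to make the handler run once and see the exit code.

```bash
#!/bin/bash

# Hypothetical stand-in for the real termination command, so the sketch is safe to run.
terminate_instance() {
    echo "terminate requested (exit code: $1)"
}

# Run a command in a subshell that has the cleanup trap installed.
# The parentheses make the function body a subshell, so the trap stays local to it.
run_with_cleanup() (
    cleanup() {
        local exit_code=$?   # status of whatever caused the exit
        terminate_instance "$exit_code"
    }
    trap cleanup EXIT        # fires on every exit path bash can intercept
    trap 'exit 130' INT      # Ctrl+C: exit, which in turn fires the EXIT trap
    trap 'exit 143' TERM     # SIGTERM: same idea
    "$@"
    exit $?
)
```

For example, `run_with_cleanup false` reports exit code 1 and `run_with_cleanup sh -c 'exit 127'` reports 127, while the caller still receives the original exit status.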

Changes

`jaynes/launchers/base_launcher.py`

- Modified make_launch_script() to generate a cleanup trap when terminate_after=True
- Trap is only added when termination is configured
- Supports both EC2 and GCE termination
- Cleanup includes the configured delay before termination
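
The conditional generation described here can be illustrated in shell. This is a hedged sketch only: the real logic lives in Python inside make_launch_script() and is not shown in this PR text, `emit_trap_block` and `termination_command` are hypothetical names, and the `gcloud compute instances delete` form for GCE is an assumption (only the EC2 command appears in the description).

```bash
#!/bin/bash

# Print the command that would terminate an instance on the given provider.
# The gcloud form is an assumption; the PR text only shows the EC2 command.
termination_command() {
    case "$1" in
        ec2) echo 'aws ec2 terminate-instances --instance-ids "$EC2_INSTANCE_ID" --region "$REGION"' ;;
        gce) echo 'gcloud compute instances delete "$GCE_INSTANCE_NAME" --zone "$GCE_ZONE" --quiet' ;;
        *)   echo "unknown provider: $1" >&2; return 1 ;;
    esac
}

# Emit a trap block only when termination is enabled.
# Arguments: terminate_after (true|false), delay in seconds, provider (ec2|gce).
emit_trap_block() {
    local terminate_after=$1 delay=$2 provider=$3 cmd
    [ "$terminate_after" = "true" ] || return 0   # no trap when termination is off
    cmd=$(termination_command "$provider") || return 1
    printf 'cleanup() {\n    sleep %s\n    %s\n}\ntrap cleanup EXIT ERR INT TERM\n' "$delay" "$cmd"
}
```

Calling `emit_trap_block true 60 ec2` prints a cleanup function with a 60-second delay followed by the trap line, and `emit_trap_block false 60 ec2` prints nothing.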

Testing

Unit Tests: `test_termination_trap.py`

✅ EC2 termination trap generation
✅ GCE termination trap generation
✅ No trap when terminate_after=False

Integration Tests: `test_error_scenarios.sh`

Tests various failure scenarios to verify termination:

✅ Normal success termination
✅ Setup failure termination
✅ Command not found termination
✅ Python error termination
✅ Missing dependency termination
✅ File not found termination

All tests pass! 🎉

Benefits

Cost Safety

Prevents runaway costs from instances stuck due to:

- Configuration errors
- Missing dependencies
- Training script failures
- Network issues during setup
- User interruptions

Reliability

Ensures instances always terminate, even when things go wrong, providing peace of mind for long-running jobs and large-scale deployments.

Example Generated Script

Example EC2 launch script with trap:

#!/bin/bash
set +o posix

mkdir -p /tmp/jaynes-launch
JAYNES_LAUNCH_DIR=/tmp/jaynes-launch

# Define cleanup function for guaranteed termination
cleanup() {
    EXIT_CODE=$?
    echo "Cleanup triggered (exit code: $EXIT_CODE)"
    sleep 60
    die() { status=$1; shift; echo "FATAL: $*"; exit $status; }
    echo "Now terminate this instance"
    export REGION="$(wget -q -O - http://169.254.169.254/latest/meta-data/placement/region)"
    export EC2_INSTANCE_ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id || die "wget instance-id has failed: $?"`"
    aws ec2 terminate-instances --instance-ids $EC2_INSTANCE_ID --region $REGION
}
# Set trap to call cleanup on EXIT, ERR, INT, and TERM
trap cleanup EXIT ERR INT TERM

{
    # launch.setup script
    # ... (user scripts here)
    # run script
    python train.py
    # post script
    # ... (post-processing here)
} > >(tee -a /tmp/jaynes-launch/jaynes-launch.log) 2> >(tee -a /tmp/jaynes-launch/jaynes-launch.err.log >&2)

Backward Compatibility

✅ Fully backward compatible: only instances with `terminate_after: true` configured in `.jaynes.yml` are affected. Existing configurations without auto-termination are unchanged.

🤖 Generated with Claude Code

Problem: Previously, EC2/GCE instances would only terminate if the launch script completed successfully. If any error occurred during setup, upload, training, or post-processing, the script would exit early and never reach the termination commands, leaving instances running indefinitely and incurring unnecessary costs.

Solution: Implemented bash trap handlers that guarantee instance termination regardless of how the script exits:
- Added cleanup() function containing termination commands
- Set trap to call cleanup on EXIT, ERR, INT, and TERM signals
- Ensures termination happens on:
  * Normal completion (exit 0)
  * Command failures (exit 1)
  * Command not found errors (exit 127)
  * Script interruptions (Ctrl+C)
  * Kill signals (SIGTERM)
  * Python/training errors
  * Missing dependencies
  * Any other exit scenario

Changes:
- jaynes/launchers/base_launcher.py:
  * Modified make_launch_script() to generate cleanup trap
  * Trap is only added when terminate_after=True
  * Supports both EC2 and GCE termination
  * Cleanup includes configured delay before termination

Testing:
- Added test_termination_trap.py: Unit tests for script generation
- Added test_error_scenarios.sh: Integration tests for error scenarios
- All tests pass, confirming termination in all scenarios

This fix ensures cost safety by preventing runaway instances from errors during launch, setup, or training phases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add auto-termination settings (terminate_after: true, delay: 60)
- Update to use G-series instances instead of P3 (g5.xlarge, g4dn.12xlarge)
- Add simple_gpu_runner and docker_cpu_runner configurations
- Improve documentation with setup instructions and quick reference
- Add cost estimates and usage examples for all modes
- Update region to us-east-1 and AMI to Deep Learning AMI GPU PyTorch 2.7
- Replace hardcoded values with placeholders (YOUR-BUCKET-NAME, etc.)
- Add comprehensive comments explaining each configuration section

This brings the example file in sync with the working .jaynes.yml configuration used in testing.

- Replace argparse with @proto.cli decorator
- Add sweep parameter for loading YAML config files
- Add load_sweep_config() function to parse sweep YAML files
- Support three launch modes: sweep file, grid search, or simple jobs
- Maintain all existing functionality with cleaner syntax

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Remove unnecessary main() wrapper functions
- Move @proto.cli decorators to module level
- Import params_proto at top of file
- Correct pattern: decorator at module level, call in if __name__

This follows the standard params-proto pattern for cleaner CLI definitions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add sweep.yaml: comprehensive 12-experiment sweep configuration
- Add sweep_simple.yaml: simple 3-experiment test sweep
- Add SWEEP_USAGE.md: complete guide for using sweep feature
- Update README.md: document Python 3.10+ requirement
- Fix Python version issue with params-proto

Python 3.10+ is required because params-proto uses modern type hint syntax (type | None). Users with Python 3.9 should use python3.11 or python3.12 explicitly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Bug fixes:
1. Changed sweep parameter from Optional[Path] to str
   - params-proto doesn't handle Optional[Path] type hints properly
   - Convert to Path inside load_sweep_config function instead
2. Added 'debug' mode to Literal type hint
   - Allows testing sweep functionality without GPU instances
   - Matches available modes in .jaynes.yml
3. Removed unused imports (Optional, Path from top level)

Tested:
- All three launch modes work (simple, grid-search, sweep)
- YAML config loading works correctly
- CLI argument parsing works for all parameter combinations
- Both sweep.yaml and sweep_simple.yaml validate successfully

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Exclude Claude Code configuration directory from version control.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Changed gpu_large mode from on-demand to spot instances:
- Set spot_price: "1.50" (max bid price)
- Updated cost estimate: ~$0.30/hr spot (was ~$1.01/hr on-demand)
- Provides ~70% cost savings

All GPU modes now use spot instances:
- gpu: g4dn.xlarge spot (max $0.40/hr)
- gpu_large: g5.xlarge spot (max $1.50/hr)
- multi_gpu: g4dn.12xlarge spot (max $2.50/hr)
- debug: t3.medium on-demand (for reliability)

Auto-termination ensures instances don't run indefinitely even if spot interrupted.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

tomtao57 requested a review from geyang January 8, 2026 02:57

tomtao57 and others added 10 commits January 8, 2026 11:08

jaynes-aws-demo

1fb15af

Add .claude/ to .gitignore

d62f007

Add documentation comment for spot request tagging

1ecb388

bump version to 0.9.14

b19e6d5

tomtao57 wants to merge 11 commits into main from fix/auto-terminate-on-error
