
Add guaranteed instance termination on all exit scenarios #1

Open
tomtao57 wants to merge 11 commits into main from fix/auto-terminate-on-error

@tomtao57 tomtao57 commented Jan 8, 2026

Problem

Previously, EC2/GCE instances would only terminate if the launch script completed successfully. If any error occurred during setup, upload, training, or post-processing, the script would exit early and never reach the termination commands, leaving instances running indefinitely and incurring unnecessary costs.

Solution

Implemented bash trap handlers that guarantee instance termination regardless of how the script exits, for every exit path bash can trap (SIGKILL and host-level failures cannot be trapped).

How It Works

Added a cleanup() function containing the termination commands and set a trap to call it on the EXIT and ERR pseudo-signals and on the INT and TERM signals:

# Define cleanup function for guaranteed termination
cleanup() {
    EXIT_CODE=$?
    echo "Cleanup triggered (exit code: $EXIT_CODE)"
    # Termination commands (with configured delay)
    aws ec2 terminate-instances --instance-ids "$EC2_INSTANCE_ID" --region "$REGION"
}
# Set trap to call cleanup on EXIT, ERR, INT, and TERM
trap cleanup EXIT ERR INT TERM
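
One property worth checking in a trap like this is how many times cleanup() runs: because both ERR and EXIT are trapped, a failing command can trigger the handler once for ERR and again when the shell exits. The sketch below is not part of this PR. It uses a counter file as a stand-in for the termination command and compares the trap as written with a hypothetical guarded variant that disarms the traps and exits inside cleanup(). The names PLAIN, GUARDED, and count_cleanup_runs are illustrative, and the sketch assumes bash is on PATH.

```python
import os
import subprocess
import tempfile
from pathlib import Path

# Trap as in the snippet above; appending to $COUNTER stands in for the
# real termination command.
PLAIN = """
cleanup() {
    EXIT_CODE=$?
    echo run >> "$COUNTER"
}
trap cleanup EXIT ERR INT TERM
"""

# Hypothetical guarded variant (an assumption, not the PR's code):
# disarm the traps first, then exit with the original status.
GUARDED = """
cleanup() {
    EXIT_CODE=$?
    trap - EXIT ERR INT TERM
    echo run >> "$COUNTER"
    exit "$EXIT_CODE"
}
trap cleanup EXIT ERR INT TERM
"""


def count_cleanup_runs(prelude: str, body: str) -> int:
    """Run prelude + body under bash and count how often cleanup() fired."""
    with tempfile.TemporaryDirectory() as tmp:
        counter = Path(tmp) / "counter"
        script = Path(tmp) / "launch.sh"
        script.write_text(prelude + body + "\n")
        subprocess.run(
            ["bash", str(script)],
            env={**os.environ, "COUNTER": str(counter)},
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # a non-zero exit is expected for failing bodies
        )
        if not counter.exists():
            return 0
        return len(counter.read_text().splitlines())
```

In this sketch, `count_cleanup_runs(PLAIN, "false")` counts one run for ERR and one for EXIT, while `count_cleanup_runs(GUARDED, "false")` counts a single run. With a configured delay, a double run would mostly mean waiting twice before termination.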

Termination Guaranteed On

  • ✅ Normal completion (exit 0)
  • ✅ Command failures (exit 1)
  • ✅ Command not found errors (exit 127)
  • ✅ Python/training script errors
  • ✅ Missing dependencies
  • ✅ File not found errors
  • ✅ Script interruptions (Ctrl+C)
  • ✅ Kill signals (SIGTERM)
  • ✅ Any other exit scenario

Changes

jaynes/launchers/base_launcher.py

  • Modified make_launch_script() to generate cleanup trap when terminate_after=True
  • Trap is only added when termination is configured
  • Supports both EC2 and GCE termination
  • Cleanup includes configured delay before termination
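
As a rough illustration of the generation step listed above, here is a minimal sketch of how a launcher could emit the trap preamble. The real make_launch_script() signature is not shown in this PR description, so make_cleanup_trap, its parameters, and the TERMINATE_COMMANDS table are hypothetical. The EC2 command is copied from the description; the GCE command is an assumed gcloud invocation.

```python
# Hypothetical per-provider termination commands. The EC2 line comes from
# the PR description; the GCE line is an assumption for illustration.
TERMINATE_COMMANDS = {
    "ec2": "aws ec2 terminate-instances --instance-ids $EC2_INSTANCE_ID --region $REGION",
    "gce": 'gcloud compute instances delete "$(hostname)" --zone "$ZONE" --quiet',
}


def make_cleanup_trap(terminate_after: bool, provider: str = "ec2", delay: int = 0) -> str:
    """Return the bash trap preamble, or an empty string when termination is off."""
    if not terminate_after:
        return ""  # no trap unless termination is configured
    lines = [
        "EXIT_CODE=$?",
        'echo "Cleanup triggered (exit code: $EXIT_CODE)"',
    ]
    if delay > 0:
        lines.append(f"sleep {delay}")  # configured delay before termination
    lines.append(TERMINATE_COMMANDS[provider])
    body = "\n".join("    " + line for line in lines)
    return (
        "# Define cleanup function for guaranteed termination\n"
        "cleanup() {\n"
        f"{body}\n"
        "}\n"
        "# Set trap to call cleanup on EXIT, ERR, INT, and TERM\n"
        "trap cleanup EXIT ERR INT TERM\n"
    )
```

Calling `make_cleanup_trap(True, provider="ec2", delay=60)` yields a preamble shaped like the one in the example script, and `make_cleanup_trap(False)` yields nothing, which is what keeps existing configurations unchanged.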

Testing

Unit Tests: test_termination_trap.py

  • ✅ EC2 termination trap generation
  • ✅ GCE termination trap generation
  • ✅ No trap when terminate_after=False

Integration Tests: test_error_scenarios.sh

Tests various failure scenarios to verify termination:

  • ✅ Normal success termination
  • ✅ Setup failure termination
  • ✅ Command not found termination
  • ✅ Python error termination
  • ✅ Missing dependency termination
  • ✅ File not found termination
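
The scenarios above can be approximated locally without touching a cloud instance. The sketch below is not the PR's test_error_scenarios.sh; it is a hypothetical harness that replaces the termination command with a write to a marker file and checks that the marker appears for each failure mode. The scenario bodies and the names TRAP, SCENARIOS, cleanup_ran, and run_all are assumptions, and bash is assumed to be on PATH.

```python
import os
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path

# Trap preamble with a marker-file write standing in for instance termination.
TRAP = """
cleanup() {
    EXIT_CODE=$?
    echo "terminated (exit code: $EXIT_CODE)" >> "$MARKER"
}
trap cleanup EXIT ERR INT TERM
"""

PY = shlex.quote(sys.executable)

# Hypothetical script bodies, one per scenario in the list above.
SCENARIOS = {
    "normal success": "true",
    "setup failure": "false",
    "command not found": "jaynes_no_such_command",
    "python error": f"{PY} -c 'raise RuntimeError(\"boom\")'",
    "missing dependency": f"{PY} -c 'import jaynes_no_such_module'",
    "file not found": "cat /nonexistent/jaynes-input",
}


def cleanup_ran(body: str) -> bool:
    """Run the trap preamble plus body under bash; report whether cleanup fired."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "terminated"
        script = Path(tmp) / "launch.sh"
        script.write_text(TRAP + body + "\n")
        subprocess.run(
            ["bash", str(script)],
            env={**os.environ, "MARKER": str(marker)},
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # failing scenarios exit non-zero by design
        )
        return marker.exists()


def run_all() -> dict:
    """Map each scenario name to whether the stand-in termination ran."""
    return {name: cleanup_ran(body) for name, body in SCENARIOS.items()}
```

`run_all()` returns a dictionary keyed by scenario name; any False value would point to an exit path where the trap did not fire.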

All tests pass! 🎉

Benefits

Cost Safety

Prevents runaway costs from instances stuck due to:

  • Configuration errors
  • Missing dependencies
  • Training script failures
  • Network issues during setup
  • User interruptions

Reliability

Ensures instances always terminate, even when things go wrong, providing peace of mind for long-running jobs and large-scale deployments.

Example Generated Script

Example EC2 launch script with trap:
#!/bin/bash
set +o posix

mkdir -p /tmp/jaynes-launch
JAYNES_LAUNCH_DIR=/tmp/jaynes-launch

# Define cleanup function for guaranteed termination
cleanup() {
    EXIT_CODE=$?
    echo "Cleanup triggered (exit code: $EXIT_CODE)"
    sleep 60
    die() { status=$1; shift; echo "FATAL: $*"; exit $status; }
    echo "Now terminate this instance"
    export REGION="$(wget -q -O - http://169.254.169.254/latest/meta-data/placement/region)"
    export EC2_INSTANCE_ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id || die "wget instance-id has failed: $?"`"
    aws ec2 terminate-instances --instance-ids $EC2_INSTANCE_ID --region $REGION
}
# Set trap to call cleanup on EXIT, ERR, INT, and TERM
trap cleanup EXIT ERR INT TERM

{
    # launch.setup script
    # ... (user scripts here)
    # run script
    python train.py
    # post script
    # ... (post-processing here)
} > >(tee -a /tmp/jaynes-launch/jaynes-launch.log) 2> >(tee -a /tmp/jaynes-launch/jaynes-launch.err.log >&2)

Backward Compatibility

Fully backward compatible: this only affects instances with terminate_after: true configured in .jaynes.yml. Existing configurations without auto-termination are unchanged.

🤖 Generated with Claude Code

Problem:
Previously, EC2/GCE instances would only terminate if the launch script
completed successfully. If any error occurred during setup, upload,
training, or post-processing, the script would exit early and never
reach the termination commands, leaving instances running indefinitely
and incurring unnecessary costs.

Solution:
Implemented bash trap handlers that guarantee instance termination
regardless of how the script exits:

- Added cleanup() function containing termination commands
- Set trap to call cleanup on EXIT, ERR, INT, and TERM signals
- Ensures termination happens on:
  * Normal completion (exit 0)
  * Command failures (exit 1)
  * Command not found errors (exit 127)
  * Script interruptions (Ctrl+C)
  * Kill signals (SIGTERM)
  * Python/training errors
  * Missing dependencies
  * Any other exit scenario

Changes:
- jaynes/launchers/base_launcher.py:
  * Modified make_launch_script() to generate cleanup trap
  * Trap is only added when terminate_after=True
  * Supports both EC2 and GCE termination
  * Cleanup includes configured delay before termination

Testing:
- Added test_termination_trap.py: Unit tests for script generation
- Added test_error_scenarios.sh: Integration tests for error scenarios
- All tests pass, confirming termination in all scenarios

This fix ensures cost safety by preventing runaway instances from
errors during launch, setup, or training phases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@tomtao57 tomtao57 requested a review from geyang January 8, 2026 02:57
tomtao57 and others added 10 commits January 8, 2026 11:08
- Add auto-termination settings (terminate_after: true, delay: 60)
- Update to use G-series instances instead of P3 (g5.xlarge, g4dn.12xlarge)
- Add simple_gpu_runner and docker_cpu_runner configurations
- Improve documentation with setup instructions and quick reference
- Add cost estimates and usage examples for all modes
- Update region to us-east-1 and AMI to Deep Learning AMI GPU PyTorch 2.7
- Replace hardcoded values with placeholders (YOUR-BUCKET-NAME, etc.)
- Add comprehensive comments explaining each configuration section

This brings the example file in sync with the working .jaynes.yml
configuration used in testing.

- Replace argparse with @proto.cli decorator
- Add sweep parameter for loading YAML config files
- Add load_sweep_config() function to parse sweep YAML files
- Support three launch modes: sweep file, grid search, or simple jobs
- Maintain all existing functionality with cleaner syntax

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Remove unnecessary main() wrapper functions
- Move @proto.cli decorators to module level
- Import params_proto at top of file
- Correct pattern: decorator at module level, call in if __name__

This follows the standard params-proto pattern for cleaner CLI definitions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add sweep.yaml: comprehensive 12-experiment sweep configuration
- Add sweep_simple.yaml: simple 3-experiment test sweep
- Add SWEEP_USAGE.md: complete guide for using sweep feature
- Update README.md: document Python 3.10+ requirement
- Fix Python version issue with params-proto

Python 3.10+ is required because params-proto uses modern type hint
syntax (type | None). Users with Python 3.9 should use python3.11 or
python3.12 explicitly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Bug fixes:
1. Changed sweep parameter from Optional[Path] to str
   - params-proto doesn't handle Optional[Path] type hints properly
   - Convert to Path inside load_sweep_config function instead

2. Added 'debug' mode to Literal type hint
   - Allows testing sweep functionality without GPU instances
   - Matches available modes in .jaynes.yml

3. Removed unused imports (Optional, Path from top level)
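
The first fix above (accept a plain str on the CLI and convert to Path inside the loader) could look roughly like the sketch below. The helper name resolve_sweep_path and the use of an empty string as the "no sweep file" sentinel are assumptions; the actual load_sweep_config() body is not shown in this commit message.

```python
from pathlib import Path
from typing import Optional


def resolve_sweep_path(sweep: str = "") -> Optional[Path]:
    """Convert the CLI string into a Path, or None when no sweep file is given.

    Keeping the CLI parameter a plain str sidesteps the Optional[Path]
    type-hint handling described above; the conversion happens here instead.
    """
    if not sweep:
        return None  # assumed sentinel: empty string means "no sweep file"
    return Path(sweep).expanduser()
```

A loader would call this first and fall back to grid search or simple jobs when it returns None.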

Tested:
- All three launch modes work (simple, grid-search, sweep)
- YAML config loading works correctly
- CLI argument parsing works for all parameter combinations
- Both sweep.yaml and sweep_simple.yaml validate successfully

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Exclude Claude Code configuration directory from version control.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Changed gpu_large mode from on-demand to spot instances:
- Set spot_price: "1.50" (max bid price)
- Updated cost estimate: ~$0.30/hr spot (was ~$1.01/hr on-demand)
- Provides ~70% cost savings
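
The savings figure follows directly from the two hourly estimates above. A small worked check, using only the numbers quoted in this commit message (the function name savings_fraction is illustrative):

```python
ON_DEMAND_PER_HOUR = 1.01  # USD, on-demand estimate quoted above
SPOT_PER_HOUR = 0.30       # USD, spot estimate quoted above


def savings_fraction(on_demand: float, spot: float) -> float:
    """Fraction of the on-demand price saved by paying the spot price."""
    return 1 - spot / on_demand


savings = savings_fraction(ON_DEMAND_PER_HOUR, SPOT_PER_HOUR)
print(f"Estimated savings: {savings:.0%}")
```

The spot_price of "1.50" is the maximum bid, so it acts as a ceiling and does not enter this estimate.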

All GPU modes now use spot instances:
- gpu: g4dn.xlarge spot (max $0.40/hr)
- gpu_large: g5.xlarge spot (max $1.50/hr)
- multi_gpu: g4dn.12xlarge spot (max $2.50/hr)
- debug: t3.medium on-demand (for reliability)

Auto-termination ensures instances don't run indefinitely, even if a
spot instance is interrupted.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>