Add guaranteed instance termination on all exit scenarios#1
Open
Problem: Previously, EC2/GCE instances would only terminate if the launch script completed successfully. If any error occurred during setup, upload, training, or post-processing, the script would exit early and never reach the termination commands, leaving instances running indefinitely and incurring unnecessary costs.

Solution: Implemented bash trap handlers that guarantee instance termination regardless of how the script exits:
- Added cleanup() function containing termination commands
- Set trap to call cleanup on EXIT, ERR, INT, and TERM signals
- Ensures termination happens on:
  * Normal completion (exit 0)
  * Command failures (exit 1)
  * Command not found errors (exit 127)
  * Script interruptions (Ctrl+C)
  * Kill signals (SIGTERM)
  * Python/training errors
  * Missing dependencies
  * Any other exit scenario

Changes:
- jaynes/launchers/base_launcher.py:
  * Modified make_launch_script() to generate the cleanup trap
  * Trap is only added when terminate_after=True
  * Supports both EC2 and GCE termination
  * Cleanup includes the configured delay before termination

Testing:
- Added test_termination_trap.py: unit tests for script generation
- Added test_error_scenarios.sh: integration tests for error scenarios
- All tests pass, confirming termination in all scenarios

This fix ensures cost safety by preventing runaway instances from errors during launch, setup, or training phases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add auto-termination settings (terminate_after: true, delay: 60)
- Update to use G-series instances instead of P3 (g5.xlarge, g4dn.12xlarge)
- Add simple_gpu_runner and docker_cpu_runner configurations
- Improve documentation with setup instructions and a quick reference
- Add cost estimates and usage examples for all modes
- Update region to us-east-1 and AMI to Deep Learning AMI GPU PyTorch 2.7
- Replace hardcoded values with placeholders (YOUR-BUCKET-NAME, etc.)
- Add comprehensive comments explaining each configuration section

This brings the example file in sync with the working .jaynes.yml configuration used in testing.
- Replace argparse with the @proto.cli decorator
- Add sweep parameter for loading YAML config files
- Add load_sweep_config() function to parse sweep YAML files
- Support three launch modes: sweep file, grid search, or simple jobs
- Maintain all existing functionality with cleaner syntax
- Remove unnecessary main() wrapper functions
- Move @proto.cli decorators to module level
- Import params_proto at the top of the file
- Correct pattern: decorator at module level, call in the if __name__ == "__main__" block

This follows the standard params-proto pattern for cleaner CLI definitions.
- Add sweep.yaml: comprehensive 12-experiment sweep configuration
- Add sweep_simple.yaml: simple 3-experiment test sweep
- Add SWEEP_USAGE.md: complete guide for using the sweep feature
- Update README.md: document the Python 3.10+ requirement
- Fix Python version issue with params-proto

Python 3.10+ is required because params-proto uses modern type hint syntax (type | None). Users on Python 3.9 should invoke python3.11 or python3.12 explicitly.
Bug fixes:
1. Changed the sweep parameter from Optional[Path] to str
   - params-proto doesn't handle Optional[Path] type hints properly
   - Convert to Path inside load_sweep_config() instead
2. Added 'debug' mode to the Literal type hint
   - Allows testing sweep functionality without GPU instances
   - Matches the available modes in .jaynes.yml
3. Removed unused imports (Optional, Path) from the top level

Tested:
- All three launch modes work (simple, grid-search, sweep)
- YAML config loading works correctly
- CLI argument parsing works for all parameter combinations
- Both sweep.yaml and sweep_simple.yaml validate successfully
Exclude Claude Code configuration directory from version control.
Changed gpu_large mode from on-demand to spot instances:
- Set spot_price: "1.50" (max bid price)
- Updated cost estimate: ~$0.30/hr spot (was ~$1.01/hr on-demand)
- Provides ~70% cost savings

All GPU modes now use spot instances:
- gpu: g4dn.xlarge spot (max $0.40/hr)
- gpu_large: g5.xlarge spot (max $1.50/hr)
- multi_gpu: g4dn.12xlarge spot (max $2.50/hr)
- debug: t3.medium on-demand (for reliability)

Auto-termination ensures instances don't run indefinitely even if the spot instance is interrupted.
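A sketch of what the gpu_large spot settings could look like in `.jaynes.yml`. The key names here are illustrative assumptions, not the verified jaynes schema; only `spot_price`, `terminate_after`, and `delay` are taken from the commits above.

```yaml
# Hypothetical .jaynes.yml fragment -- exact key names may differ from the
# real jaynes schema; check the example config file for the actual layout.
modes:
  gpu_large:
    launch:
      instance_type: g5.xlarge
      region: us-east-1
      spot_price: "1.50"     # max bid; typical spot cost ~$0.30/hr
      terminate_after: true  # enables the cleanup trap in the launch script
      delay: 60              # seconds to wait before termination
```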
Problem
Previously, EC2/GCE instances would only terminate if the launch script completed successfully. If any error occurred during setup, upload, training, or post-processing, the script would exit early and never reach the termination commands, leaving instances running indefinitely and incurring unnecessary costs.
Solution
Implemented bash trap handlers that guarantee instance termination regardless of how the script exits.
How It Works
Added a `cleanup()` function containing the termination commands, and set a trap to call it on the EXIT, ERR, INT, and TERM signals.

Termination Guaranteed On
- Normal completion (exit 0)
- Command failures (exit 1)
- Command not found errors (exit 127)
- Script interruptions (Ctrl+C)
- Kill signals (SIGTERM)
- Python/training errors
- Missing dependencies
- Any other exit scenario
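In shell terms, the mechanism can be sketched as follows. The echo stands in for the real termination command so the sketch runs anywhere; the actual script calls the aws or gcloud CLI instead.

```shell
#!/usr/bin/env bash
# Minimal sketch of the trap pattern. The demo script is written to a file
# and run so we can observe cleanup firing even when a command fails.
cat > /tmp/trap_demo.sh <<'EOF'
set -euo pipefail

cleanup() {
    trap - EXIT ERR INT TERM   # reset traps so cleanup runs only once
    # the real script would call `aws ec2 terminate-instances` or
    # `gcloud compute instances delete` here
    echo "cleanup: terminating instance"
}
trap cleanup EXIT ERR INT TERM

echo "setup and training run here"
false                          # simulated failure: cleanup still fires
EOF

bash /tmp/trap_demo.sh > /tmp/trap_demo.out 2>&1
echo "exit code: $?"           # non-zero, yet cleanup ran
cat /tmp/trap_demo.out
```

Resetting the traps inside `cleanup()` matters: with both ERR and EXIT trapped, a failing command under `set -e` would otherwise run the handler twice.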
Changes
- `jaynes/launchers/base_launcher.py`: modified `make_launch_script()` to generate the cleanup trap when `terminate_after=True`; supports both EC2 and GCE termination, with the configured delay before terminating

Testing
Unit Tests: `test_termination_trap.py` covers script generation.

Integration Tests: `test_error_scenarios.sh` exercises various failure scenarios to verify termination.
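The contents of `test_error_scenarios.sh` are not shown on this page; a hypothetical scenario in its spirit checks that a command-not-found error (exit 127) still fires the cleanup trap:

```shell
#!/usr/bin/env bash
# Hypothetical integration check (NOT the actual test_error_scenarios.sh):
# a missing command exits 127 under `set -e`, and cleanup must still run.
cat > /tmp/scenario_127.sh <<'EOF'
set -euo pipefail
cleanup() { trap - EXIT ERR INT TERM; echo "TERMINATED"; }
trap cleanup EXIT ERR INT TERM
definitely_not_a_real_command_xyz   # triggers exit code 127
EOF

bash /tmp/scenario_127.sh > /tmp/scenario_127.out 2>&1
code=$?
echo "exit code: $code"
grep -q TERMINATED /tmp/scenario_127.out && echo "scenario 127: cleanup ran"
```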
All tests pass! 🎉
Benefits
Cost Safety
Prevents runaway costs from instances stuck due to:
- Errors during launch, setup, upload, or training
- Python/training failures
- Missing dependencies
Reliability
Ensures instances always terminate, even when things go wrong, providing peace of mind for long-running jobs and large-scale deployments.
Example Generated Script
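The collapsed details block did not survive this page; below is a hedged reconstruction of what the generated EC2 script plausibly looks like. It is not the literal output of `make_launch_script()`; the aws call and the sleep are commented out so the sketch runs standalone.

```shell
#!/usr/bin/env bash
# Reconstruction of a generated EC2 launch script; names and step ordering
# are assumptions. Written to a file and executed so the output is visible.
cat > /tmp/ec2_launch_demo.sh <<'EOF'
set -euo pipefail
TERMINATION_DELAY=60                    # configured delay before termination

cleanup() {
    trap - EXIT ERR INT TERM            # avoid re-entry via the EXIT trap
    echo "waiting ${TERMINATION_DELAY}s before termination"
    # sleep "$TERMINATION_DELAY"        # commented so the demo is fast
    # EC2 branch (the GCE branch would call gcloud instead):
    # aws ec2 terminate-instances --instance-ids \
    #   "$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
    echo "instance termination requested"
}
trap cleanup EXIT ERR INT TERM

echo "== setup =="
echo "== upload =="
echo "== training =="
EOF

bash /tmp/ec2_launch_demo.sh > /tmp/ec2_launch_demo.out 2>&1
cat /tmp/ec2_launch_demo.out
```

Here the script completes normally, so the EXIT trap fires after the last step; any failure along the way would reach the same handler via ERR.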
Backward Compatibility
✅ Fully backward compatible - only affects instances with `terminate_after: true` configured in `.jaynes.yml`. Existing configurations without auto-termination are unchanged.