# Baselines for the Global Chess Challenge 2025
Supervised fine-tuning (SFT) for chess using special tokens to encode board positions. This approach trains a language model to predict moves given structured chess position representations.
Traditional chess representations like FEN strings have tokenization issues:
- ❌ FEN strings: Characters get merged unpredictably by BPE tokenizers (e.g., "rnbq" might become one token), making it hard for the model to learn piece-by-piece understanding
- ❌ ASCII board representations: No tokenization issues, but inefficient (~350-400+ tokens per position)
✨ Special token encoding solves both problems:
- ✅ Each piece and square is a single, unambiguous token (e.g., `<a1><White_Rook>`, `<e5><Black_Pawn>`)
- ✅ Compact representation (~130-170 tokens per position including legal moves)
- 🚀 Results: Significantly fewer illegal moves during training and faster convergence due to reduced sequence length
```
<chess_position>
<a1><White_Rook><b1><White_Knight>...<h8><Black_Rook>
|White|KQkq|-|0|1|
<e2><e4> <g1><f3> <b1><c3> ...
</chess_position>
```

Structure: `[64 square-piece pairs] | [turn] | [castling] | [en passant] | [halfmove] | [fullmove] | [legal moves]`
💡 See `data_preparation/encode_with_special_tokens.ipynb` for the detailed implementation.
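The notebook covers the full scheme; as a rough illustration, a pure-Python FEN-to-special-token encoder might look like the sketch below. Note the `<Empty>` token and the function name are assumptions (the README only specifies "64 square-piece pairs"), not the repository's actual implementation:

```python
# Sketch of encoding a FEN string into the special-token layout shown above.
# Assumption: empty squares get a hypothetical <Empty> token so that all
# 64 squares appear; token names follow the <a1><White_Rook> style.

FILES = "abcdefgh"
PIECES = {
    "P": "White_Pawn", "N": "White_Knight", "B": "White_Bishop",
    "R": "White_Rook", "Q": "White_Queen", "K": "White_King",
    "p": "Black_Pawn", "n": "Black_Knight", "b": "Black_Bishop",
    "r": "Black_Rook", "q": "Black_Queen", "k": "Black_King",
}

def encode_position(fen: str) -> str:
    board, turn, castling, ep, halfmove, fullmove = fen.split()
    tokens = []
    # FEN lists rank 8 first; reverse so squares run a1..h8 as in the example.
    for rank_idx, rank in enumerate(reversed(board.split("/"))):
        file_idx = 0
        for ch in rank:
            if ch.isdigit():  # a digit encodes a run of empty squares
                for _ in range(int(ch)):
                    tokens.append(f"<{FILES[file_idx]}{rank_idx + 1}><Empty>")
                    file_idx += 1
            else:
                tokens.append(f"<{FILES[file_idx]}{rank_idx + 1}><{PIECES[ch]}>")
                file_idx += 1
    side = "White" if turn == "w" else "Black"
    meta = f"|{side}|{castling}|{ep}|{halfmove}|{fullmove}|"
    return "".join(tokens) + "\n" + meta
```

Every piece lands on exactly one `<square><piece>` pair, so the sequence length is fixed and the tokenizer never has to split a square name.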
- Download the dataset:

  ```shell
  cd data && bash download_data.sh
  ```

- Train the model on a Trainium trn1.2xlarge instance:

  ```shell
  docker build -t neuronx-training-py310:latest .
  bash start_docker_trainium.sh
  # Run this inside the container
  torchrun --nproc_per_node 2 train.py --output-dir ./trained_models/your_model_name
  ```

- Train the model on an NVIDIA GPU:

  ```shell
  pip install -r requirements_nvidia.txt
  python train_nvidia.py --output-dir ./trained_models/your_model_name
  ```

Modify encoding schemes or add custom prompting:

- `prepare_tokenizer.ipynb` - Create a tokenizer with added special tokens for chess
- `encode_with_special_tokens.ipynb` - Encode chess positions using special tokens
The script assumes a trn1.2xlarge instance. Change the settings according to your instance.

- `NEURON_COMPILE_CACHE_URL` - Sets the directory path for caching compiled Neuron models to avoid recompilation
- `NEURON_CC_FLAGS` - Configures the Neuron compiler with the transformer model type and disables automatic type casting
- `XLA_USE_BF16` - Enables bfloat16 precision for XLA operations to improve performance on supported hardware
- `NEURON_FUSE_SOFTMAX` - Enables softmax fusion optimization for better performance
- `XLA_DOWNCAST_BF16` - Automatically downcasts operations to bfloat16 precision
- `NEURON_CC_PIPELINE_SIZE` - Sets the pipeline parallelism degree for model compilation
- `--nproc_per_node` - Set to 2 for a trn1.2xlarge instance (one Trainium chip exposes two NeuronCores)
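Put together, the environment setup inside the container might look like the following sketch (the cache path and flag values are assumptions; check your instance type and the AWS Neuron documentation before copying):

```shell
# Assumed values -- adapt to your instance; variable names match the list above.
export NEURON_COMPILE_CACHE_URL="./neuron_cache"   # persistent compile cache dir
export NEURON_CC_FLAGS="--model-type transformer --auto-cast none"
export XLA_USE_BF16=1                              # bf16 for XLA operations
export NEURON_FUSE_SOFTMAX=1                       # fuse softmax kernels
# Then launch training on both NeuronCores of a trn1.2xlarge:
# torchrun --nproc_per_node 2 train.py --output-dir ./trained_models/your_model_name
```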
The training script (`train.py`) fine-tunes Qwen3-0.6B on chess positions with a custom tokenizer that includes special tokens for pieces and squares.
Training Pipeline:
- Loads a custom tokenizer with added chess special tokens (`chess_tokenizer_qwen3/`)
- Resizes model embeddings to accommodate the new tokens
- Tokenizes dataset with causal language modeling objective
- Trains with gradient accumulation and bfloat16 precision
- Evaluates during training via `ChessLLMEvaluationCallback` (puzzles, vs random, vs Stockfish)
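The embedding-resize step in the pipeline above can be sketched in plain PyTorch. This mirrors what `resize_token_embeddings` does in `transformers`; the mean-initialization of the new rows is one common choice, not necessarily what `train.py` uses:

```python
import torch
import torch.nn as nn

def resize_embeddings(emb: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Grow an embedding table: keep old rows, init new rows to the old mean."""
    old_vocab_size, dim = emb.weight.shape
    new_emb = nn.Embedding(new_vocab_size, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab_size] = emb.weight
        # New special-token rows start at the mean of the existing embeddings,
        # which tends to be more stable than random init for fine-tuning.
        new_emb.weight[old_vocab_size:] = emb.weight.mean(dim=0)
    return new_emb
```

After resizing, the new `<square>`/`<piece>` rows are the only embeddings with no pretrained signal, so they are learned entirely from the chess data.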
Evaluation During Training:
The `ChessLLMEvaluationCallback` automatically evaluates your model at regular intervals (e.g., every 3000 steps). The callback:
- Saves a temporary checkpoint of the current model
- Moves the model to CPU and frees GPU memory
- Starts a vLLM server with the checkpoint for fast inference
- Runs the full evaluation suite (puzzle solving, games vs random player, games vs Stockfish)
- Logs metrics to WandB and shuts down vLLM
- Restores the model to GPU and resumes training
You can modify this callback to add any other metrics you want to track.
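A minimal custom callback, stripped down to the step-gating logic, could look like this sketch (the class name and `eval_fn` are hypothetical stand-ins; the real `ChessLLMEvaluationCallback` additionally checkpoints, launches vLLM, and logs to WandB):

```python
from transformers import TrainerCallback

class SimpleEvalCallback(TrainerCallback):
    """Hypothetical stripped-down analogue of ChessLLMEvaluationCallback."""

    def __init__(self, eval_fn, every_n_steps=3000):
        self.eval_fn = eval_fn            # e.g. lambda: {"puzzle_solve_rate": 0.4}
        self.every_n_steps = every_n_steps
        self.history = []                 # list of (step, metrics) pairs

    def on_step_end(self, args, state, control, **kwargs):
        # The Trainer calls this hook after every optimizer step.
        if state.global_step > 0 and state.global_step % self.every_n_steps == 0:
            self.history.append((state.global_step, self.eval_fn()))
        return control
```

Passing an instance via `Trainer(callbacks=[...])` is enough to wire it in; anything heavier (spawning vLLM, moving the model to CPU) belongs inside the gated branch.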
Key Hyperparameters:
- `MAX_LENGTH` (512): Maximum sequence length. Chess positions with special tokens are compact (~130-170 tokens), so 512 is sufficient.
- `BATCH_SIZE` (4) + `GRAD_ACCUM_STEPS` (2): Effective batch size of 8. Increase `BATCH_SIZE` if you have more GPU memory (24GB+ allows batch size 8-16). Adjust `GRAD_ACCUM_STEPS` to maintain the effective batch size.
- `LEARNING_RATE` (1e-4): Standard for SFT on small models. If training is unstable, reduce to 5e-5. If convergence is slow, try 2e-4.
- `WARMUP_STEPS` (500): Gradual learning rate warmup to stabilize early training. Typically 5-10% of total steps.
- `WEIGHT_DECAY` (0.001): L2 regularization to prevent overfitting on the training set.
- `NUM_LINES_TO_LOAD` (1M): Number of training examples from the 2.5M dataset. Start with fewer for faster iteration (100k-500k).
- `EVAL_STEPS` (3000): How often to run the full evaluation (puzzles + games). Evaluation takes ~10-15 minutes, so balance frequent feedback against training speed.
Run `python run_evaluation.py` after starting a vLLM server with your trained model. The evaluation suite measures chess playing strength across three dimensions.
Setup:

- Start the vLLM server in a separate terminal: `bash run_vllm.sh` (edit the script to point to your model checkpoint)
- Run the evaluation: `python run_evaluation.py -v`
- Results are saved to the `eval_results/` directory as JSON metrics and PGN game files
Evaluation Metrics (configured in `evaluation_helpers/eval_config.py`):

- 🧩 Puzzle Solving (`n_puzzles=200`, `puzzle_max_elo=600`)
  - Tests tactical pattern recognition on chess puzzles
  - Measures: solve rate, average moves to solution
  - Easy puzzles (Elo ≤ 600) test basic tactics and checkmates
- 🎲 vs Random Player (`n_random_games=5`)
  - Sanity check that the model can beat random move selection
  - Should achieve a 100% win rate quickly
  - Measures: win rate, average game length, illegal move rate, ACPL
- 🤖 vs Stockfish (`n_stockfish_games=50`, `stockfish_depth=1`, `stockfish_skill_level=0`)
  - Tests against a weak but coherent opponent
  - Stockfish at depth 1, skill level 0 plays at roughly 1000-1200 Elo
  - Measures: win/draw/loss rates, illegal moves, ACPL
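The knobs above live in `evaluation_helpers/eval_config.py`; a plausible shape for that config is sketched below (only the values are taken from this README — the dict layout is an assumption):

```python
# Hypothetical shape of evaluation_helpers/eval_config.py; the key names and
# values come from the metric list above, the dict structure is a guess.
EVAL_CONFIG = {
    "n_puzzles": 200,            # puzzle-solving suite size
    "puzzle_max_elo": 600,       # only easy tactical puzzles
    "n_random_games": 5,         # sanity-check games vs random mover
    "n_stockfish_games": 50,     # games vs weakened Stockfish
    "stockfish_depth": 1,
    "stockfish_skill_level": 0,
}
```

Shrinking `n_stockfish_games` (e.g. to 10) is the quickest way to speed up iteration, at the cost of noisier win-rate estimates.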
LLM Inference Settings:

- `temperature=0.0`: Deterministic; always picks the highest-probability move
- `max_retries=3`: If the model generates an illegal move, retry with the same or a slightly higher temperature
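The retry behavior can be sketched as follows (`generate`, the 0.1 temperature bump, and the random-legal-move fallback are assumptions about the harness, not the repository's exact logic):

```python
import random

def pick_move(generate, legal_moves, temperature=0.0, max_retries=3):
    """Ask the model for a move; on an illegal move, retry slightly hotter."""
    t = temperature
    for _ in range(1 + max_retries):
        move = generate(temperature=t)
        if move in legal_moves:
            return move
        t += 0.1  # escape a deterministic illegal pick by adding randomness
    # All retries produced illegal moves: fall back to a random legal move.
    return random.choice(sorted(legal_moves))
```

The temperature bump matters because at `temperature=0.0` a retry with identical settings would deterministically reproduce the same illegal move.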