# Baselines for the Global Chess Challenge 2025
Supervised fine-tuning (SFT) for chess using special tokens to encode board positions. This approach trains a language model to predict moves given structured chess position representations.
Traditional chess representations like FEN strings have tokenization issues:
- ❌ FEN strings: Characters get merged unpredictably by BPE tokenizers (e.g., "rnbq" might become one token), making it hard for the model to learn piece-by-piece understanding
- ❌ ASCII board representations: No tokenization issues, but inefficient (~350-400+ tokens per position)
✨ Special token encoding solves both problems:
- ✅ Each piece and square is a single, unambiguous token (e.g., `<a1><White_Rook>`, `<e5><Black_Pawn>`)
- ✅ Compact representation (~130-170 tokens per position including legal moves)
- 🚀 Results: Significantly fewer illegal moves during training and faster convergence due to reduced sequence length
```
<chess_position>
<a1><White_Rook><b1><White_Knight>...<h8><Black_Rook>
|White|KQkq|-|0|1|
<e2><e4> <g1><f3> <b1><c3> ...
</chess_position>
```

Structure: `[64 square-piece pairs] | [turn] | [castling] | [en passant] | [halfmove] | [fullmove] | [legal moves]`
💡 See `data_preparation/encode_with_special_tokens.ipynb` for the detailed implementation.
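The notebook covers the full scheme; as a rough illustration, a pure-Python FEN-to-special-token encoder might look like the sketch below. Note the `<Empty>` token and the function name are assumptions (the README only specifies "64 square-piece pairs"), not the repository's actual implementation:

```python
# Sketch of encoding a FEN string into the special-token layout shown above.
# Assumption: empty squares get a hypothetical <Empty> token so that all
# 64 squares appear; token names follow the <a1><White_Rook> style.

FILES = "abcdefgh"
PIECES = {
    "P": "White_Pawn", "N": "White_Knight", "B": "White_Bishop",
    "R": "White_Rook", "Q": "White_Queen", "K": "White_King",
    "p": "Black_Pawn", "n": "Black_Knight", "b": "Black_Bishop",
    "r": "Black_Rook", "q": "Black_Queen", "k": "Black_King",
}

def encode_position(fen: str) -> str:
    board, turn, castling, ep, halfmove, fullmove = fen.split()
    tokens = []
    # FEN lists rank 8 first; reverse so squares run a1..h8 as in the example.
    for rank_idx, rank in enumerate(reversed(board.split("/"))):
        file_idx = 0
        for ch in rank:
            if ch.isdigit():  # a digit encodes a run of empty squares
                for _ in range(int(ch)):
                    tokens.append(f"<{FILES[file_idx]}{rank_idx + 1}><Empty>")
                    file_idx += 1
            else:
                tokens.append(f"<{FILES[file_idx]}{rank_idx + 1}><{PIECES[ch]}>")
                file_idx += 1
    side = "White" if turn == "w" else "Black"
    meta = f"|{side}|{castling}|{ep}|{halfmove}|{fullmove}|"
    return "".join(tokens) + "\n" + meta
```

Every piece lands on exactly one `<square><piece>` pair, so the sequence length is fixed and the tokenizer never has to split a square name.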
- Download the dataset:

  ```shell
  cd data && bash download_data.sh
  ```

- Train the model on a Trainium trn1.2xlarge instance:

  ```shell
  docker build -t neuronx-training-py310:latest .
  bash start_docker_trainium.sh
  # Run this inside the container
  torchrun --nproc_per_node 2 train.py --output-dir ./trained_models/your_model_name
  ```

- Train the model on an NVIDIA GPU:

  ```shell
  pip install -r requirements_nvidia.txt
  python train_nvidia.py --output-dir ./trained_models/your_model_name
  ```

Modify encoding schemes or add custom prompting:

- `prepare_tokenizer.ipynb` - Create a tokenizer with added special tokens for chess
- `encode_with_special_tokens.ipynb` - Encode chess positions using special tokens
The script assumes a trn1.2xlarge instance. Change the settings according to your instance.

- `NEURON_COMPILE_CACHE_URL` - Sets the directory path for caching compiled Neuron models to avoid recompilation
- `NEURON_CC_FLAGS` - Configures the Neuron compiler with the transformer model type and disables automatic type casting
- `XLA_USE_BF16` - Enables bfloat16 precision for XLA operations to improve performance on supported hardware
- `NEURON_FUSE_SOFTMAX` - Enables softmax fusion optimization for better performance
- `XLA_DOWNCAST_BF16` - Automatically downcasts operations to bfloat16 precision
- `NEURON_CC_PIPELINE_SIZE` - Sets the pipeline parallelism degree for model compilation
- `--nproc_per_node` - Set to 2 for a trn1.2xlarge instance (one Trainium chip exposes two NeuronCores)
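Put together, the environment setup inside the container might look like the following sketch (the cache path and flag values are assumptions; check your instance type and the AWS Neuron documentation before copying):

```shell
# Assumed values -- adapt to your instance; variable names match the list above.
export NEURON_COMPILE_CACHE_URL="./neuron_cache"   # persistent compile cache dir
export NEURON_CC_FLAGS="--model-type transformer --auto-cast none"
export XLA_USE_BF16=1                              # bf16 for XLA operations
export NEURON_FUSE_SOFTMAX=1                       # fuse softmax kernels
# Then launch training on both NeuronCores of a trn1.2xlarge:
# torchrun --nproc_per_node 2 train.py --output-dir ./trained_models/your_model_name
```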
The training script (`train.py`) fine-tunes Qwen3-0.6B on chess positions with a custom tokenizer that includes special tokens for pieces and squares.
Training Pipeline:
- Loads a custom tokenizer with added chess special tokens (`chess_tokenizer_qwen3/`)
- Resizes model embeddings to accommodate the new tokens
- Tokenizes dataset with causal language modeling objective
- Trains with gradient accumulation and bfloat16 precision
- Evaluates during training via `ChessLLMEvaluationCallback` (puzzles, vs random, vs Stockfish)
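The embedding-resize step in the pipeline above can be sketched in plain PyTorch. This mirrors what `resize_token_embeddings` does in `transformers`; the mean-initialization of the new rows is one common choice, not necessarily what `train.py` uses:

```python
import torch
import torch.nn as nn

def resize_embeddings(emb: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Grow an embedding table: keep old rows, init new rows to the old mean."""
    old_vocab_size, dim = emb.weight.shape
    new_emb = nn.Embedding(new_vocab_size, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab_size] = emb.weight
        # New special-token rows start at the mean of the existing embeddings,
        # which tends to be more stable than random init for fine-tuning.
        new_emb.weight[old_vocab_size:] = emb.weight.mean(dim=0)
    return new_emb
```

After resizing, the new `<square>`/`<piece>` rows are the only embeddings with no pretrained signal, so they are learned entirely from the chess data.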
Evaluation During Training:
The `ChessLLMEvaluationCallback` automatically evaluates your model at regular intervals (e.g., every 3000 steps). The callback:
- Saves a temporary checkpoint of the current model
- Moves the model to CPU and frees GPU memory
- Starts a vLLM server with the checkpoint for fast inference
- Runs the full evaluation suite (puzzle solving, games vs random player, games vs Stockfish)
- Logs metrics to WandB and shuts down vLLM
- Restores the model to GPU and resumes training
You can modify this callback to add any other metrics you want to track.
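A minimal custom callback, stripped down to the step-gating logic, could look like this sketch (the class name and `eval_fn` are hypothetical stand-ins; the real `ChessLLMEvaluationCallback` additionally checkpoints, launches vLLM, and logs to WandB):

```python
from transformers import TrainerCallback

class SimpleEvalCallback(TrainerCallback):
    """Hypothetical stripped-down analogue of ChessLLMEvaluationCallback."""

    def __init__(self, eval_fn, every_n_steps=3000):
        self.eval_fn = eval_fn            # e.g. lambda: {"puzzle_solve_rate": 0.4}
        self.every_n_steps = every_n_steps
        self.history = []                 # list of (step, metrics) pairs

    def on_step_end(self, args, state, control, **kwargs):
        # The Trainer calls this hook after every optimizer step.
        if state.global_step > 0 and state.global_step % self.every_n_steps == 0:
            self.history.append((state.global_step, self.eval_fn()))
        return control
```

Passing an instance via `Trainer(callbacks=[...])` is enough to wire it in; anything heavier (spawning vLLM, moving the model to CPU) belongs inside the gated branch.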
Key Hyperparameters:
- `MAX_LENGTH` (512): Maximum sequence length. Chess positions with special tokens are compact (~130-170 tokens), so 512 is sufficient.
- `BATCH_SIZE` (4) + `GRAD_ACCUM_STEPS` (2): Effective batch size of 8. Increase `BATCH_SIZE` if you have more GPU memory (24GB+ allows batch size 8-16). Adjust `GRAD_ACCUM_STEPS` to maintain the effective batch size.
- `LEARNING_RATE` (1e-4): Standard for SFT on small models. If training is unstable, reduce to 5e-5. If convergence is slow, try 2e-4.
- `WARMUP_STEPS` (500): Gradual learning rate warmup to stabilize early training. Typically 5-10% of total steps.
- `WEIGHT_DECAY` (0.001): L2 regularization to prevent overfitting on the training set.
- `NUM_LINES_TO_LOAD` (1M): Number of training examples from the 2.5M dataset. Start with fewer for faster iteration (100k-500k).
- `EVAL_STEPS` (3000): How often to run the full evaluation (puzzles + games). Evaluation takes ~10-15 minutes, so balance frequent feedback against training speed.
Run `python run_evaluation.py` after starting a vLLM server with your trained model. The evaluation suite measures chess playing strength across three dimensions.
Setup:

- Start the vLLM server in a separate terminal: `bash run_vllm.sh` (edit the script to point to your model checkpoint)
- Run the evaluation: `python run_evaluation.py -v`
- Results are saved to the `eval_results/` directory as JSON metrics and PGN game files
Evaluation Metrics (configured in `evaluation_helpers/eval_config.py`):

- 🧩 Puzzle Solving (`n_puzzles=200`, `puzzle_max_elo=600`)
  - Tests tactical pattern recognition on chess puzzles
  - Measures: solve rate, average moves to solution
  - Easy puzzles (Elo ≤ 600) test basic tactics and checkmates
- 🎲 vs Random Player (`n_random_games=5`)
  - Sanity check that the model can beat random move selection
  - Should achieve a 100% win rate quickly
  - Measures: win rate, average game length, illegal move rate, ACPL
- 🤖 vs Stockfish (`n_stockfish_games=50`, `stockfish_depth=1`, `stockfish_skill_level=0`)
  - Tests against a weak but coherent opponent
  - Stockfish at depth 1, skill level 0 plays at roughly 1000-1200 Elo
  - Measures: win/draw/loss rates, illegal moves, ACPL
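The knobs above live in `evaluation_helpers/eval_config.py`; a plausible shape for that config is sketched below (only the values are taken from this README — the dict layout is an assumption):

```python
# Hypothetical shape of evaluation_helpers/eval_config.py; the key names and
# values come from the metric list above, the dict structure is a guess.
EVAL_CONFIG = {
    "n_puzzles": 200,            # puzzle-solving suite size
    "puzzle_max_elo": 600,       # only easy tactical puzzles
    "n_random_games": 5,         # sanity-check games vs random mover
    "n_stockfish_games": 50,     # games vs weakened Stockfish
    "stockfish_depth": 1,
    "stockfish_skill_level": 0,
}
```

Shrinking `n_stockfish_games` (e.g. to 10) is the quickest way to speed up iteration, at the cost of noisier win-rate estimates.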
LLM Inference Settings:

- `temperature=0.0`: Deterministic; always picks the highest-probability move
- `max_retries=3`: If the model generates an illegal move, retry with the same or a slightly higher temperature
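The retry behavior can be sketched as follows (`generate`, the 0.1 temperature bump, and the random-legal-move fallback are assumptions about the harness, not the repository's exact logic):

```python
import random

def pick_move(generate, legal_moves, temperature=0.0, max_retries=3):
    """Ask the model for a move; on an illegal move, retry slightly hotter."""
    t = temperature
    for _ in range(1 + max_retries):
        move = generate(temperature=t)
        if move in legal_moves:
            return move
        t += 0.1  # escape a deterministic illegal pick by adding randomness
    # All retries produced illegal moves: fall back to a random legal move.
    return random.choice(sorted(legal_moves))
```

The temperature bump matters because at `temperature=0.0` a retry with identical settings would deterministically reproduce the same illegal move.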