🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-13 09:45:36 __init__.py:207] Automatically detected platform cuda.
==((====))== Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.7.3.
\\ /| NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.999 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = True]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit with actual GPU utilization = 56.82%
Unsloth: Your GPU has CUDA compute capability 8.6 with VRAM = 24.0 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 256.
Unsloth: vLLM's KV Cache can use up to 12.37 GB. Also swap space = 4 GB.
INFO 03-13 09:45:44 config.py:549] This model supports multiple tasks: {'embed', 'classify', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['lm_head', 'multi_modal_projector', 'merger', 'modality_projection', 'model.layers.0.self_attn', 'model.layers.1.mlp', 'model.layers.2.mlp', 'model.layers.3.mlp', 'model.layers.7.mlp', 'model.layers.24.mlp', 'model.layers.26.mlp', 'model.layers.15.self_attn'], 'llm_int8_threshold': 6.0}
INFO 03-13 09:45:44 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
WARNING 03-13 09:45:45 interface.py:304] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 03-13 09:45:45 cuda.py:229] Using Flash Attention backend.
INFO 03-13 09:45:47 model_runner.py:1110] Starting to load model unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit...
[W313 09:45:46.584979306 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 03-13 09:45:47 loader.py:1089] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 03-13 09:45:47 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.20s/it]
INFO 03-13 09:45:53 model_runner.py:1115] Loading model weights took 1.4331 GB
INFO 03-13 09:45:53 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-13 09:45:56 worker.py:267] Memory profiling takes 2.27 seconds
INFO 03-13 09:45:56 worker.py:267] the current vLLM instance can use total_gpu_memory (24.00GiB) x gpu_memory_utilization (0.57) = 13.64GiB
INFO 03-13 09:45:56 worker.py:267] model weights take 1.43GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.40GiB; the rest of the memory reserved for KV Cache is 10.74GiB.
INFO 03-13 09:45:56 executor_base.py:111] # cuda blocks: 25142, # CPU blocks: 9362
INFO 03-13 09:45:56 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 392.84x
INFO 03-13 09:45:56 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 35/35 [00:23<00:00, 1.47it/s]
INFO 03-13 09:46:20 model_runner.py:1562] Graph capturing finished in 24 secs, took 0.60 GiB
INFO 03-13 09:46:20 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 27.23 seconds
I am running on Windows 10 under WSL, by the way.

I was trying the script for training a reasoning model with Qwen2.5-1.5B-Instruct. I can run the script, but the VRAM usage is pretty weird: based on the blog I should only need about 7 GB of VRAM for training, yet the actual usage is 16 GB.

Loading the model with `AutoModelForCausalLM` looks normal, with VRAM usage of only 3.1 GB. But when I load the model with `FastLanguageModel`, the VRAM spikes to 16 GB.
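For reference, this is a minimal sketch of the two loading paths I am comparing. The exact arguments are assumptions reconstructed from the log above (max_seq_len=1024, 4-bit load, roughly 57% GPU utilization); my actual script may differ slightly:

```python
import torch
from transformers import AutoModelForCausalLM
from unsloth import FastLanguageModel

# Path 1: plain Transformers load -- VRAM stays around 3.1 GB.
model_hf = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Path 2: Unsloth load with the vLLM engine enabled (as in the log) -- VRAM jumps to ~16 GB.
# Arguments below are assumed values, not copied from my script.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,          # spins up the vLLM engine seen in the log
    gpu_memory_utilization=0.6,   # log reports actual utilization = 56.82%
)
```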