🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-13 09:45:36 __init__.py:207] Automatically detected platform cuda.
==((====))== Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.7.3.
\\ /| NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.999 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = True]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit with actual GPU utilization = 56.82%
Unsloth: Your GPU has CUDA compute capability 8.6 with VRAM = 24.0 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 256.
Unsloth: vLLM's KV Cache can use up to 12.37 GB. Also swap space = 4 GB.
INFO 03-13 09:45:44 config.py:549] This model supports multiple tasks: {'embed', 'classify', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['lm_head', 'multi_modal_projector', 'merger', 'modality_projection', 'model.layers.0.self_attn', 'model.layers.1.mlp', 'model.layers.2.mlp', 'model.layers.3.mlp', 'model.layers.7.mlp', 'model.layers.24.mlp', 'model.layers.26.mlp', 'model.layers.15.self_attn'], 'llm_int8_threshold': 6.0}
INFO 03-13 09:45:44 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
WARNING 03-13 09:45:45 interface.py:304] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 03-13 09:45:45 cuda.py:229] Using Flash Attention backend.
INFO 03-13 09:45:47 model_runner.py:1110] Starting to load model unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit...
[W313 09:45:46.584979306 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 03-13 09:45:47 loader.py:1089] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 03-13 09:45:47 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.20s/it]
INFO 03-13 09:45:53 model_runner.py:1115] Loading model weights took 1.4331 GB
INFO 03-13 09:45:53 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-13 09:45:56 worker.py:267] Memory profiling takes 2.27 seconds
INFO 03-13 09:45:56 worker.py:267] the current vLLM instance can use total_gpu_memory (24.00GiB) x gpu_memory_utilization (0.57) = 13.64GiB
INFO 03-13 09:45:56 worker.py:267] model weights take 1.43GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.40GiB; the rest of the memory reserved for KV Cache is 10.74GiB.
INFO 03-13 09:45:56 executor_base.py:111] # cuda blocks: 25142, # CPU blocks: 9362
INFO 03-13 09:45:56 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 392.84x
INFO 03-13 09:45:56 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 35/35 [00:23<00:00, 1.47it/s]
INFO 03-13 09:46:20 model_runner.py:1562] Graph capturing finished in 24 secs, took 0.60 GiB
INFO 03-13 09:46:20 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 27.23 seconds
I am running on Windows 10 under WSL, by the way.

I was trying the script for training a reasoning model with Qwen2.5-1.5B-Instruct. I can run the script, but the VRAM usage is pretty weird: based on the blog I should only need about 7 GB of VRAM for training, yet the actual usage is 16 GB.

Loading the model with `AutoModelForCausalLM` looks normal, with VRAM usage of only 3.1 GB. But when I load the model with `FastLanguageModel`, the VRAM spikes to 16 GB.
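For reference, this is a minimal sketch of the two loading paths I am comparing. The exact arguments are assumptions reconstructed from the log above (max_seq_len=1024, 4-bit load, roughly 57% GPU utilization); my actual script may differ slightly:

```python
import torch
from transformers import AutoModelForCausalLM
from unsloth import FastLanguageModel

# Path 1: plain Transformers load -- VRAM stays around 3.1 GB.
model_hf = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Path 2: Unsloth load with the vLLM engine enabled (as in the log) -- VRAM jumps to ~16 GB.
# Arguments below are assumed values, not copied from my script.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,          # spins up the vLLM engine seen in the log
    gpu_memory_utilization=0.6,   # log reports actual utilization = 56.82%
)
```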