🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by OpenVINO to accelerate end-to-end pipelines on Intel architectures.
OpenVINO is an open-source toolkit that enables high-performance inference on Intel CPUs, GPUs, and dedicated DL inference accelerators (see the full list of supported devices). It comes with a set of tools to optimize your models with compression techniques such as quantization, pruning, and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format, and run inference with OpenVINO Runtime.
To install the latest release of 🤗 Optimum Intel with the corresponding required dependencies, you can use pip as follows:
python -m pip install -U "optimum-intel[openvino]"

Optimum Intel is a fast-moving project with regular additions of new model support, so you may want to install from source with the following command:
python -m pip install "optimum-intel"@git+https://github.com/huggingface/optimum-intel.gitDeprecation Notice: The extras for openvino (e.g., pip install optimum-intel[openvino,nncf]), nncf, neural-compressor, ipex are deprecated and will be removed in a future release.
To export your model to the OpenVINO IR format, use the optimum-cli tool. Below is an example of exporting the TinyLlama/TinyLlama_v1.1 model:
optimum-cli export openvino --model TinyLlama/TinyLlama_v1.1 ov_TinyLlama_v1_1

Additional information on exporting models is available in the documentation.
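The export can also be triggered from Python if you prefer to skip the CLI; the sketch below relies on the export=True argument of the OV model classes and writes the converted model to the same ov_TinyLlama_v1_1 directory used above:

from optimum.intel import OVModelForCausalLM

# export=True converts the original Transformers checkpoint to OpenVINO IR on the fly.
model = OVModelForCausalLM.from_pretrained("TinyLlama/TinyLlama_v1.1", export=True)
model.save_pretrained("ov_TinyLlama_v1_1")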
To load an exported model and run inference using Optimum Intel, use the corresponding OVModelForXxx class instead of AutoModelForXxx:
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline
model_id = "ov_TinyLlama_v1_1"
model = OVModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
results = pipe("Hey, how are you doing today?", max_new_tokens=100)

For more details on Optimum Intel inference, refer to the documentation.
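If you do not need the pipeline helper, the same generation can be driven through the standard generate() API; this is just an equivalent sketch of the example above:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "ov_TinyLlama_v1_1"
model = OVModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Same prompt as above, using generate() directly instead of a pipeline.
inputs = tokenizer("Hey, how are you doing today?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))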
Note: Alternatively, an exported model can also be run with the OpenVINO GenAI framework, which provides optimized execution methods for highly performant Generative AI.
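For illustration, running the exported directory through OpenVINO GenAI looks roughly like the sketch below; note that the openvino_genai package and its LLMPipeline class belong to OpenVINO GenAI rather than Optimum Intel, and the tokenizer must have been exported alongside the model (the optimum-cli export shown above does this by default):

import openvino_genai

# Point the GenAI pipeline at the exported IR directory and pick a target device.
pipe = openvino_genai.LLMPipeline("ov_TinyLlama_v1_1", "CPU")
print(pipe.generate("Hey, how are you doing today?", max_new_tokens=100))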
Post-training static quantization can also be applied. Here is an example of how to apply static quantization to a Whisper model, using the LibriSpeech dataset for the calibration step.
from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig
model_id = "openai/whisper-tiny"
q_config = OVQuantizationConfig(dtype="int8", dataset="librispeech", num_samples=50)
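# Passing the config to from_pretrained triggers calibration on the dataset and quantizes the model at load time.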
q_model = OVModelForSpeechSeq2Seq.from_pretrained(model_id, quantization_config=q_config)
# The directory where the quantized model will be saved
save_dir = "nncf_results"
q_model.save_pretrained(save_dir)

You can find more information in the documentation.
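Once saved, the quantized model can be reloaded from save_dir just like any other exported model, e.g.:

from optimum.intel import OVModelForSpeechSeq2Seq

# Reload the quantized model from the directory it was saved to.
q_model = OVModelForSpeechSeq2Seq.from_pretrained("nncf_results")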
Check out the notebooks directory to see how 🤗 Optimum Intel can be used to optimize models and accelerate inference.
Do not forget to install the requirements for each example:
cd <example-folder>
pip install -r requirements.txt