
Commit 2a6a29f

Support vLLM XFT LLM microservice (#174)
* Support vLLM XFT serving
  Signed-off-by: lvliang-intel <liang1.lv@intel.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
* fix access vllm issue
  Signed-off-by: lvliang-intel <liang1.lv@intel.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
* add permission for run.sh
  Signed-off-by: lvliang-intel <liang1.lv@intel.com>
* add readme
  Signed-off-by: lvliang-intel <liang1.lv@intel.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
* fix proxy issue
  Signed-off-by: lvliang-intel <liang1.lv@intel.com>

---------

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent b516758 commit 2a6a29f

5 files changed

Lines changed: 230 additions & 0 deletions


comps/llms/text-generation/vllm-xft/README.md
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
vLLM-xFT is a fork of vLLM that integrates the xFasterTransformer backend while maintaining compatibility with most of official vLLM's features.
For details on using vllm-xft, refer to [xFasterTransformer/vllm-xft](https://github.com/intel/xFasterTransformer/blob/main/serving/vllm-xft.md).

# 🚀 Start Microservice with Docker

## 1. Build Docker Image

```bash
cd ../../../
docker build -t opea/llm-vllm-xft:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/vllm-xft/docker/Dockerfile .
```

## 2. Run Docker with CLI

```bash
docker run -it -p 9000:9000 -v /home/sdp/Qwen2-7B-Instruct/:/Qwen2-7B-Instruct/ -e vLLM_LLM_ENDPOINT="http://localhost:18688" -e HF_DATASET_DIR="/Qwen2-7B-Instruct/" -e OUTPUT_DIR="./output" -e TOKEN_PATH="/Qwen2-7B-Instruct/" -e https_proxy=$https_proxy -e http_proxy=$http_proxy --ipc=host opea/llm-vllm-xft:latest
```
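
The host path `/home/sdp/Qwen2-7B-Instruct/` in the command above is only an example; mount the directory that holds the model you want to serve. The environment variables are read by `run.sh` and `llm.py` inside the container:

- `HF_DATASET_DIR`: path to the original Hugging Face checkpoint that `run.sh` converts to xFasterTransformer format at startup.
- `OUTPUT_DIR`: directory where the converted model is written; it is passed to vLLM as `--model`.
- `TOKEN_PATH`: tokenizer path, passed to vLLM as `--tokenizer`.
- `vLLM_LLM_ENDPOINT`: address of the vLLM OpenAI-compatible server that the microservice wrapper queries (defaults to `http://localhost:18688`).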

# 🚀 3. Consume LLM Service

## 3.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```

## 3.2 Consume LLM Service

You can adjust model parameters such as `max_new_tokens` and `streaming` to fit your needs.

The `streaming` parameter determines the format of the data returned by the API: with `streaming=false` the service returns a complete text string, and with `streaming=true` it returns a text stream.

```bash
# non-streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
  -H 'Content-Type: application/json'

# streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```
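
For programmatic access, the same endpoint can be called from Python. The following is a minimal client sketch (illustrative, not part of this commit) that assumes the service is reachable at `localhost:9000`; the streaming branch reads the `data: ...` lines emitted by the microservice and stops at `data: [DONE]`.

```python
# Minimal client sketch (assumes the microservice listens on localhost:9000).
import requests

url = "http://localhost:9000/v1/chat/completions"
payload = {"query": "What is Deep Learning?", "max_new_tokens": 17, "streaming": False}

# Non-streaming: the service returns a JSON document containing the generated text.
response = requests.post(url, json=payload)
print(response.json())

# Streaming: the service emits server-sent events ("data: ..." lines, terminated by "data: [DONE]").
payload["streaming"] = True
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        chunk = line[len("data: "):]
        if chunk == "[DONE]":
            break
        print(chunk)
```
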
comps/llms/text-generation/vllm-xft/docker/Dockerfile
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM ubuntu:22.04

ARG TAG=main

RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y --no-install-recommends \
    gcc-12 \
    g++-12 \
    make \
    wget \
    libnuma-dev \
    numactl \
    git \
    pkg-config \
    software-properties-common \
    zlib1g-dev \
    libssl-dev \
    libffi-dev \
    libbz2-dev \
    libsqlite3-dev \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 60 \
    && update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 60 \
    && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/*

# Install Python 3.8 from source
WORKDIR /tmp
RUN wget -q https://www.python.org/ftp/python/3.8.10/Python-3.8.10.tgz \
    && tar -xzvf Python-3.8.10.tgz
WORKDIR /tmp/Python-3.8.10
RUN ./configure --prefix=/usr/bin/python3.8 --enable-optimizations \
    && make -j \
    && make install \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.8/bin/python3.8 60 \
    && update-alternatives --install /usr/bin/pip pip /usr/bin/python3.8/bin/pip3 60 \
    && python -m pip install --no-cache-dir --upgrade pip setuptools \
    && pip install --no-cache-dir wheel \
    && rm -rf /tmp/* \
    && echo "export PATH=/usr/bin/python3.8:\$PATH" >> ~/.bashrc

RUN pip install --no-cache-dir torch==2.3.0+cpu --index-url https://download.pytorch.org/whl/cpu
RUN pip install --no-cache-dir cmake==3.26.1 transformers==4.41.2 sentencepiece==0.1.99 accelerate==0.23.0 protobuf tiktoken transformers-stream-generator einops \
    && ln -s /usr/bin/python3.8/lib/python3.8/site-packages/cmake/data/bin/cmake /usr/bin/cmake

# Install oneCCL
RUN git clone https://github.com/oneapi-src/oneCCL.git /tmp/oneCCL
WORKDIR /tmp/oneCCL
RUN git checkout 2021.10 \
    && sed -i 's/cpu_gpu_dpcpp/./g' cmake/templates/oneCCLConfig.cmake.in \
    && mkdir build
WORKDIR /tmp/oneCCL/build
RUN cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local/oneCCL \
    && make -j install

RUN echo "source /usr/local/oneCCL/env/setvars.sh" >> ~/.bashrc

WORKDIR /root/
RUN rm -rf /tmp/oneCCL

# Build and install xFasterTransformer from source at the requested TAG
RUN git clone https://github.com/intel/xFasterTransformer.git

SHELL ["/bin/bash", "-c"]
WORKDIR /root/xFasterTransformer
RUN git checkout ${TAG} \
    && export "LD_LIBRARY_PATH=/usr/local/mklml_lnx_2019.0.5.20190502/lib:$LD_LIBRARY_PATH" \
    && export "PATH=/usr/bin/python3.8:$PATH" \
    && echo "source /usr/local/oneCCL/env/setvars.sh" >> ~/.bash_profile \
    && source ~/.bash_profile \
    && python setup.py build \
    && python setup.py egg_info bdist_wheel --verbose \
    && pip install --no-cache-dir dist/*

RUN mkdir -p /usr/local/xft/lib \
    && cp /root/xFasterTransformer/build/libxfastertransformer.so /usr/local/xft/lib \
    && cp /root/xFasterTransformer/build/libxft_comm_helper.so /usr/local/xft/lib \
    && cp -r /root/xFasterTransformer/include /usr/local/xft/ \
    && mkdir -p /usr/local/include/xft/ \
    && ln -s /usr/local/xft/include /usr/local/include/xft/include

RUN echo "export \$(python -c 'import xfastertransformer as xft; print(xft.get_env())')" >> ~/.bashrc

# Set up the OPEA LLM microservice wrapper
COPY comps /root/comps

RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r /root/comps/llms/text-generation/vllm-xft/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/root

RUN chmod +x /root/comps/llms/text-generation/vllm-xft/run.sh

WORKDIR /root/comps/llms/text-generation/vllm-xft/

ENTRYPOINT ["/root/comps/llms/text-generation/vllm-xft/run.sh"]
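
The `TAG` build argument (default `main`) selects which xFasterTransformer branch or tag is checked out and built inside the image; it can be overridden with `--build-arg TAG=<branch-or-tag>` when building. At container start, the `ENTRYPOINT` script `run.sh` converts the mounted model and launches both the vLLM backend and the microservice wrapper.
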
comps/llms/text-generation/vllm-xft/llm.py
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

from fastapi.responses import StreamingResponse
from langchain_community.llms import VLLMOpenAI
from langsmith import traceable

from comps import GeneratedDoc, LLMParamsDoc, ServiceType, opea_microservices, register_microservice


@register_microservice(
    name="opea_service@llm_vllm_xft",
    service_type=ServiceType.LLM,
    endpoint="/v1/chat/completions",
    host="0.0.0.0",
    port=9000,
)
@traceable(run_type="llm")
def llm_generate(input: LLMParamsDoc):
    llm_endpoint = os.getenv("vLLM_LLM_ENDPOINT", "http://localhost:18688")
    llm = VLLMOpenAI(
        openai_api_key="EMPTY",
        openai_api_base=llm_endpoint + "/v1",
        max_tokens=input.max_new_tokens,
        model_name="xft",
        top_p=input.top_p,
        temperature=input.temperature,
        presence_penalty=input.repetition_penalty,
        streaming=input.streaming,
    )

    if input.streaming:

        def stream_generator():
            chat_response = ""
            for text in llm.stream(input.query):
                chat_response += text
                chunk_repr = repr(text.encode("utf-8"))
                print(f"[llm - chat_stream] chunk:{chunk_repr}")
                yield f"data: {chunk_repr}\n\n"
            print(f"[llm - chat_stream] stream response: {chat_response}")
            yield "data: [DONE]\n\n"

        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        response = llm.invoke(input.query)
        return GeneratedDoc(text=response, prompt=input.query)


if __name__ == "__main__":
    opea_microservices["opea_service@llm_vllm_xft"].start()
comps/llms/text-generation/vllm-xft/requirements.txt
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
docarray[full]
fastapi
langchain==0.1.16
langsmith
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
shortuuid
vllm-xft
comps/llms/text-generation/vllm-xft/run.sh
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
#!/bin/sh

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# convert the model to xFasterTransformer format
python -c 'import os; import xfastertransformer as xft; xft.Qwen2Convert().convert(os.environ["HF_DATASET_DIR"], os.environ["OUTPUT_DIR"])'

unset http_proxy

# serve the converted model with vLLM (OpenAI-compatible API on port 18688)
python -m vllm.entrypoints.openai.api_server \
    --model ${OUTPUT_DIR} \
    --tokenizer ${TOKEN_PATH} \
    --dtype bf16 \
    --kv-cache-dtype fp16 \
    --served-model-name xft \
    --host localhost \
    --port 18688 \
    --trust-remote-code &

# run the LLM microservice wrapper
python llm.py
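
For reference, the backend launched above serves vLLM's OpenAI-compatible API on port 18688 inside the container, and `llm.py` reaches it through LangChain's `VLLMOpenAI` wrapper. A minimal sketch of that interaction (illustrative prompt and settings, run inside the container):

```python
# Sketch: query the vLLM-xFT backend started by run.sh directly, the same way llm.py does.
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",  # placeholder key, as in llm.py
    openai_api_base="http://localhost:18688/v1",
    model_name="xft",  # matches --served-model-name in run.sh
    max_tokens=32,
)
print(llm.invoke("What is Deep Learning?"))
```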
