ChatQnA/README.md
This ChatQnA use case performs RAG using LangChain, a Redis vector database, and Text Generation Inference (TGI) on Intel Gaudi2. The Intel Gaudi2 accelerator supports both training and inference for deep learning models, in particular LLMs. Please visit [Habana AI products](https://habana.ai/products) for more details.
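At a high level, RAG retrieves the stored documents most similar to the query embedding and uses them to ground the LLM's answer. A minimal stdlib-only sketch of that retrieval step (the documents and embedding values below are made up for illustration; the actual pipeline uses LangChain and Redis, not this code):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "vector store": document text -> embedding (hypothetical values)
store = {
    "Gaudi2 supports LLM training and inference.": [0.9, 0.1, 0.0],
    "Redis can serve as a vector database.": [0.1, 0.9, 0.2],
}
query_vec = [0.85, 0.15, 0.05]  # pretend embedding of the user's question

# Retrieve the most similar document; RAG prepends it to the LLM prompt
best_doc = max(store, key=lambda d: cosine(store[d], query_vec))
print(best_doc)
```

In the real service, the embeddings come from an embedding model and the nearest-neighbor search runs inside Redis rather than in Python.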
# Environment Setup
To use [🤗 text-generation-inference](https://github.com/huggingface/text-generation-inference) on Habana Gaudi/Gaudi2, please follow these steps:
### Launch a local server instance on 1 Gaudi card:
```bash
bash ./serving/tgi_gaudi/launch_tgi_service.sh
```
For gated models such as `LLAMA-2`, you will have to pass `-e HUGGING_FACE_HUB_TOKEN=<token>` to the `docker run` command above with a valid Hugging Face Hub read token.
Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get an access token, and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.
```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```
### Launch a local server instance on 8 Gaudi cards:
```bash
bash ./serving/tgi_gaudi/launch_tgi_service.sh 8
```
### Customize TGI Gaudi Service
The `./serving/tgi_gaudi/launch_tgi_service.sh` script accepts three parameters:
- `num_cards`: The number of Gaudi cards to be utilized, ranging from 1 to 8. The default is set to 1.
- `port_number`: The port number assigned to the TGI Gaudi endpoint, with the default being 8080.
- `model_name`: The model name used for the LLM, with the default set to "Intel/neural-chat-7b-v3-3".
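For example, assuming the script takes these parameters positionally in the order listed above, a launch on 4 cards with a custom port could look like this (the port value is illustrative):

```bash
bash ./serving/tgi_gaudi/launch_tgi_service.sh 4 8085 "Intel/neural-chat-7b-v3-3"
```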
You have the flexibility to customize these parameters according to your specific needs. Additionally, you can set the TGI Gaudi endpoint by exporting the environment variable `TGI_ENDPOINT`:
```bash
export TGI_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
```
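Downstream code can read `TGI_ENDPOINT` and POST to the TGI `/generate` route. A minimal sketch that builds (but does not send) such a request; the payload question and `max_new_tokens` value are illustrative, not from the repo:

```python
import json
import os
import urllib.request

# Fall back to localhost if TGI_ENDPOINT is not exported (assumption for this sketch)
endpoint = os.environ.get("TGI_ENDPOINT", "http://localhost:8080")

# TGI's /generate route expects {"inputs": ..., "parameters": {...}}
payload = {"inputs": "What is RAG?", "parameters": {"max_new_tokens": 64}}
request = urllib.request.Request(
    endpoint.rstrip("/") + "/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(request.full_url)  # e.g. http://localhost:8080/generate
# urllib.request.urlopen(request) would send it once the server is up
```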
## Enable TGI Gaudi FP8 for higher throughput
TGI Gaudi uses BFLOAT16 optimization as the default setting. To achieve higher throughput, you can enable FP8 quantization on TGI Gaudi. According to our test results, FP8 quantization yields approximately a 1.8x performance gain compared to BFLOAT16. Please follow the steps below to enable FP8 quantization.

The benchmark client script parses its target URL and request payload from the command line:
```python
import argparse

parser = argparse.ArgumentParser(description="Concurrent client to send POST requests")
parser.add_argument("--url", type=str, default="http://localhost:12345", help="URL to send requests to")
parser.add_argument(
    "--json_data",
    type=str,
    default='{"inputs":"Which NFL team won the Super Bowl in the 2010 season?","parameters":{"do_sample": true}}',
    help="JSON data to send",
)
```
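To illustrate how the arguments resolve, the snippet below rebuilds the same parser and parses a sample command line (the URL value is illustrative); the default `--json_data` string decodes as valid JSON:

```python
import argparse
import json

# Rebuild the benchmark client's parser as shown above
parser = argparse.ArgumentParser(description="Concurrent client to send POST requests")
parser.add_argument("--url", type=str, default="http://localhost:12345", help="URL to send requests to")
parser.add_argument(
    "--json_data",
    type=str,
    default='{"inputs":"Which NFL team won the Super Bowl in the 2010 season?","parameters":{"do_sample": true}}',
    help="JSON data to send",
)

# Parse a sample command line overriding only --url
args = parser.parse_args(["--url", "http://localhost:8080"])
payload = json.loads(args.json_data)  # default payload is valid JSON
print(args.url)                       # http://localhost:8080
print(payload["parameters"]["do_sample"])  # True
```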