
Commit d8da203

Additional changes
1 parent b8b6b56 commit d8da203

File tree

1 file changed: +52 -49 lines changed
  • docs/guides/kubernetes/ai-chatbot-and-rag-pipeline-for-inference-on-lke


docs/guides/kubernetes/ai-chatbot-and-rag-pipeline-for-inference-on-lke/index.md

Lines changed: 52 additions & 49 deletions
@@ -45,15 +45,19 @@ This tutorial requires you to have access to a few different services and local
 - You should have both [kubectl](https://kubernetes.io/docs/reference/kubectl/) and [Helm](https://helm.sh/) installed on your local machine. These apps are used for managing your LKE cluster and installing applications to your cluster.
 - A **custom dataset** is needed, preferably in Markdown format, though you can use other types of data if you modify the LlamaIndex configuration provided in this tutorial. This dataset should contain all of the information you want used by the Llama 3 LLM. This tutorial uses a Markdown dataset containing all of the Linode Docs.
 
-# Set up infrastructure
-
-The first step is to provision the infrastructure needed for this tutorial and configure it with kubectl, so that you can manage it locally and install software through helm. As part of this process, we’ll also need to install the NVIDIA GPU operator at this step so that the NVIDIA cards within the GPU worker nodes can be used on Kubernetes.
+{{< note type="warning" title="Production workloads" >}}
+These instructions are intended as a proof of concept for testing and demonstration purposes. They are not designed as a complete production reference architecture.
+{{< /note >}}
 
 {{< note type="warning" title="Security notice" >}}
 The configuration instructions in this document are expected to not expose any services to the Internet. Instead, they run on the Kubernetes cluster's internal network, and to access the services it’s necessary to forward their ports locally first. This configuration is restricted by design to avoid accidentally exposing those services before they can be properly secured. Additionally, some services will run with no authentication or default credentials configured.
 It’s not part of the scope of this document to cover the setup required to secure this configuration for a production deployment.
 {{< /note >}}
 
+# Set up infrastructure
+
+The first step is to provision the infrastructure needed for this tutorial and configure it with kubectl, so that you can manage it locally and install software through helm. As part of this process, we’ll also need to install the NVIDIA GPU operator at this step so that the NVIDIA cards within the GPU worker nodes can be used on Kubernetes.
+
 1. **Provision an LKE cluster.** We recommend using at least two **RTX4000 Ada x2 Medium** GPU plans (plan ID: `g2-gpu-rtx4000a2-m`), though you can adjust this as needed. For reference, Kubeflow recommends 32 GB of RAM and 16 CPU cores. This tutorial has been tested using Kubernetes v1.31, though other versions should also work. To learn more about provisioning a cluster, see the [Create a cluster](https://techdocs.akamai.com/cloud-computing/docs/create-a-cluster) guide.
 
 {{< note noTitle=true >}}
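
The GPU operator install mentioned above is typically done with Helm. As a minimal sketch (assuming NVIDIA's standard Helm repository and chart with default values, which this hunk does not show):

```command
# Add NVIDIA's Helm repository and install the GPU operator into its own namespace
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace
```

Once the operator's pods are running, the GPU worker nodes advertise `nvidia.com/gpu` resources that the Llama 3 deployment can request.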
@@ -114,7 +118,7 @@ Next, let’s deploy Kubeflow on the LKE cluster. These instructions deploy all
 
 After Kubeflow has been installed, we can now deploy the Llama 3 LLM to KServe. This tutorial uses HuggingFace (a platform that provides pre-trained AI models) to deploy Llama 3 to the LKE cluster. Specifically, these instructions use the [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model.
 
-1. Create a Hugging Face token to use for this project. See the Hugging Face user documentation on [User access tokens](https://huggingface.co/docs/hub/en/security-tokens) for instructions.
+1. Create a Hugging Face token with **READ** access to use for this project. See the Hugging Face user documentation on [User access tokens](https://huggingface.co/docs/hub/en/security-tokens) for instructions.
 
 1. Create the manifest file for the [Kubernetes secret](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-config-file/). You can use the following as a template:
 
@@ -131,7 +135,7 @@ After Kubeflow has been installed, we can now deploy the Llama 3 LLM to KServe.
 1. Then, create the secret on your cluster by applying the manifest file:
 
 ```command
-kubectl apply -f hf-secret.yaml
+kubectl apply -f ./hf-secret.yaml
 ```
 
 1. Create a config file for deploying the Llama 3 model on your cluster.
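
The secret template itself falls outside this hunk. One way to produce an equivalent manifest is to let kubectl generate it; the secret name `hf-secret` and key `HF_TOKEN` below are illustrative assumptions, not values confirmed by the hunk:

```command
# Generate a Secret manifest containing the Hugging Face token, then apply it
# (the name hf-secret and key HF_TOKEN are assumed for illustration)
kubectl create secret generic hf-secret \
  --from-literal=HF_TOKEN=<your Hugging Face token> \
  --dry-run=client -o yaml > hf-secret.yaml
kubectl apply -f ./hf-secret.yaml
```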
@@ -174,7 +178,11 @@ After Kubeflow has been installed, we can now deploy the Llama 3 LLM to KServe.
 kubectl apply -f model.yaml
 ```
 
-Once the configuration applies, Llama 3 will be running on your LKE cluster.
+1. Verify that the new Llama 3 pod is ready before continuing.
+
+```command
+kubectl get pods -A
+```
 
 ### Install Milvus
 
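
Pod status is one signal; since the model is served through KServe, the InferenceService resource also reports readiness. A quick check might look like the following (resource names depend on what model.yaml defines, so none are assumed here):

```command
# Confirm the KServe InferenceService reports READY as True
kubectl get inferenceservices -A

# Watch the predictor pods in the default namespace until they are Running and Ready
kubectl get pods -n default -w
```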
@@ -300,6 +308,8 @@ This tutorial employs a Python script to create the YAML file used within Kubefl
 
 This creates a file called pipeline.yaml, which you will upload to Kubeflow in the following section.
 
+1. Run `deactivate` to exit the Python virtual environment.
+
 ### Run the pipeline workflow
 
 1. Configure port forwarding on your cluster through kubectl so that you can access the Kubeflow interface from your local computer.
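
The exact port-forward command sits outside this hunk. With a default Kubeflow manifests install, the dashboard is usually reached through the Istio ingress gateway; a rough sketch (service and namespace names are assumptions about that default layout):

```command
# Forward the Kubeflow ingress gateway to http://localhost:8080
# (service and namespace names assume a default Kubeflow manifests install)
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```

Leave this command running while working in the Kubeflow interface at http://localhost:8080.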
@@ -310,6 +320,10 @@ This tutorial employs a Python script to create the YAML file used within Kubefl
 
 1. Open a web browser and navigate to the Kubeflow interface at http://localhost:8080. A login screen should appear.
 
+{{< note type="warning" noTitle=true >}}
+If the browser instead shows the error `Jwks doesn't have key to match kid or alg from Jwt`, there may be a previous JWT session that is interfering. Opening this URL in your browser's private or incognito mode should resolve this.
+{{< /note >}}
+
 1. Log in with the username `user@example.com` and use the password that you created in a previous step.
 
 1. Navigate to the Pipelines > Experiments page and click the button to create a new experiment. Enter a name and description for the experiment and click **Next**.
@@ -359,60 +373,49 @@ Despite the naming, these RAG pipeline files are not related to the Kubeflow pip
 
 class Pipeline:
 
-    def __init__(self):
-        self.name = "RAG Pipeline"
-        self.index = None
-        pass
+    def __init__(self):
+        self.name = "RAG Pipeline"
+        self.index = None
+        pass
 
 
-    async def on_startup(self):
-        # This function is called when the server is started.
-        from llama_index.embeddings.huggingface import HuggingFaceEmbedding
-        from llama_index.core import Settings, VectorStoreIndex
-        from llama_index.llms.openai_like import OpenAILike
-        from llama_index.vector_stores.milvus import MilvusVectorStore
+    async def on_startup(self):
+        from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+        from llama_index.core import Settings, VectorStoreIndex
+        from llama_index.llms.openai_like import OpenAILike
+        from llama_index.vector_stores.milvus import MilvusVectorStore
 
-        print(f"on_startup:{__name__}")
+        print(f"on_startup:{__name__}")
 
-        Settings.embed_model = HuggingFaceEmbedding(
-            model_name="BAAI/bge-large-en-v1.5"
-        )
-
-        llm = OpenAILike(
-            model="llama3",
-            api_base="http://huggingface-llama3-predictor-00001.default.svc.cluster.local/openai/v1",
-            api_key = "EMPTY",
-            max_tokens = 512)
-
-        Settings.llm = llm
-
-        vector_store = MilvusVectorStore(uri="http://my-release-milvus.default.svc.cluster.local:19530", collection="linode_docs", dim=1024, overwrite=False)
-        self.index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
+        Settings.embed_model = HuggingFaceEmbedding(
+            model_name="BAAI/bge-large-en-v1.5"
+        )
 
-    async def on_shutdown(self):
-        # This function is called when the server is stopped.
-        print(f"on_shutdown:{__name__}")
-        pass
+        llm = OpenAILike(
+            model="llama3",
+            api_base="http://huggingface-llama3-predictor-00001.default.svc.cluster.local/openai/v1",
+            api_key = "EMPTY",
+            max_tokens = 512)
 
+        Settings.llm = llm
 
-    def pipe(
-        self, user_message: str, model_id: str, messages: List[dict], body: dict
-    ) -> Union[str, Generator, Iterator]:
-        # This is where you can add your custom RAG pipeline.
-        # Typically, you would retrieve relevant information from your knowledge base and synthesize it to generate a response.
+        vector_store = MilvusVectorStore(uri="http://my-release-milvus.default.svc.cluster.local:19530", collection="linode_docs", dim=1024, overwrite=False)
+        self.index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
 
-        print(f"pipe:{__name__}")
-        print(messages)
-        print(user_message)
+    async def on_shutdown(self):
+        print(f"on_shutdown:{__name__}")
+        pass
 
-        query_engine = self.index.as_query_engine(streaming=True, similarity_top_k=5)
-        response = query_engine.query(user_message)
 
-        print(f"rag_response:{response}")
+    def pipe(
+        self, user_message: str, model_id: str, messages: List[dict], body: dict
+    ) -> Union[str, Generator, Iterator]:
+        print(f"pipe:{__name__}")
 
-        # return response.response_gen
-        # return f"RAG says: {response}"
-        return f"{response}"
+        query_engine = self.index.as_query_engine(streaming=True, similarity_top_k=5)
+        response = query_engine.query(user_message)
+        print(f"rag_response:{response}")
+        return f"{response}"
 ```
 
 Both of these files are used in the next section.
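
The pipeline's `on_startup` above points the LLM at an OpenAI-compatible endpoint through `api_base`. A rough way to smoke-test that endpoint before wiring up the pipeline, assuming the completions path implied by `api_base` is actually served, is to run a one-off curl pod against the same in-cluster URL:

```command
# One-off pod that calls the Llama 3 endpoint used by the pipeline
# (URL copied from the api_base above; the payload fields are illustrative)
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://huggingface-llama3-predictor-00001.default.svc.cluster.local/openai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "prompt": "What is LKE?", "max_tokens": 64}'
```

A JSON completion in the response indicates that the serving path the pipeline depends on is reachable.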
