
Commit 769ce21

add docs for rag-llm-cpu pattern
1 parent 0f4f4d5 commit 769ce21

File tree: 7 files changed (+455 -0 lines changed)

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
---
title: RAG LLM Chatbot on CPU
date: 2025-10-24
tier: sandbox
summary: This pattern deploys a CPU-based LLM, your choice of several RAG DB providers, and a simple chatbot UI that exposes the configuration and results of the RAG queries.
rh_products:
  - Red Hat OpenShift Container Platform
  - Red Hat OpenShift GitOps
  - Red Hat OpenShift AI
partners:
  - Microsoft
industries:
  - General
aliases: /rag-llm-cpu/
links:
  github: https://github.com/validatedpatterns-sandbox/rag-llm-cpu
  install: getting-started
  bugs: https://github.com/validatedpatterns-sandbox/rag-llm-cpu/issues
  feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
---

# **CPU-based RAG LLM Chatbot**

## **Introduction**

This Validated Pattern deploys a Retrieval-Augmented Generation (RAG) chatbot on Red Hat OpenShift using Red Hat OpenShift AI. The pattern is designed to run entirely on CPU nodes without requiring GPU hardware, making it a cost-effective and accessible solution for environments where GPU resources are limited or unavailable.
It provides a secure, flexible, and production-ready starting point for building and deploying on-premise generative AI applications.

## **Target Audience**

This pattern is designed for:

- **Developers & Data Scientists** looking to build and experiment with RAG-based LLM applications.
- **MLOps & DevOps Engineers** responsible for deploying and managing AI/ML workloads on OpenShift.
- **Architects** evaluating cost-effective methods for delivering generative AI capabilities on-premise.

## **Why Use This Pattern?**

- **Cost-Effective:** Runs entirely on CPU, removing the need for expensive and often scarce GPU resources.
- **Flexible:** Supports multiple vector database backends (Elasticsearch, PGVector, MS SQL Server) to integrate with your existing data infrastructure.
- **Transparent:** The Gradio frontend is designed to expose the internals of the RAG query and LLM prompts, giving you clear insight into the generation process.
- **Extensible:** Built on open-source standards (KServe, OpenAI-compatible API) to serve as a robust foundation for more complex applications.

## **Architecture Overview**

At a high level, the components work together as follows:

1. A user enters a query into the **Gradio UI**.
2. The backend application, using **LangChain**, first queries a configured **Vector Database** to retrieve relevant documents (the "R" in RAG).
3. These documents are combined with the user's original query into a prompt.
4. The prompt is sent to the **KServe-deployed LLM** (running via llama.cpp on a CPU node).
5. The LLM generates a response, which is streamed back to the Gradio UI for the user.
6. **Vault** securely provides the necessary credentials for the vector database and HuggingFace token at runtime.

![Overview](/images/rag-llm-cpu/rag-augmented-query.png)

_Figure 1. Overview of RAG Query from User's perspective._
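
To make the flow concrete, here is a minimal, illustrative Python sketch of steps 1-5. It is not the pattern's frontend code: the in-cluster service names, Qdrant collection name, embedding model, and LLM model name are assumptions based on the defaults used in this pattern's configuration guide.

```python
# Minimal sketch of the RAG query flow above -- not the pattern's frontend code.
# Service names, collection name, and model names are assumptions taken from the
# defaults used elsewhere in this pattern's documentation.
from openai import OpenAI
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
vector_db = QdrantClient(url="http://qdrant-service:6333")
llm = OpenAI(base_url="http://cpu-inference-service-predictor/v1", api_key="not-used")

question = "How do I install a validated pattern?"

# Steps 1-2: embed the question and retrieve the closest document chunks (the "R" in RAG).
hits = vector_db.search(
    collection_name="docs",
    query_vector=embedder.encode(question).tolist(),
    limit=4,
)
context = "\n\n".join((hit.payload or {}).get("page_content", "") for hit in hits)

# Steps 3-5: combine context and question into a prompt and stream the LLM's answer.
stream = llm.chat.completions.create(
    model="mistral-7b-instruct-v0.2",  # model name as exposed by the runtime (assumption)
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```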

## **Prerequisites**

Before you begin, ensure you have access to the following:

- A Red Hat OpenShift cluster (version 4.x); at least 2 `m5.4xlarge` nodes are recommended.
- A HuggingFace API token.
- Command-line tools: `podman`.

## **What This Pattern Provides**

- A [kserve](https://github.com/kserve/kserve)-based LLM deployed to [RHOAI](https://www.redhat.com/en/products/ai/openshift-ai) that runs entirely on a CPU node with a [llama.cpp](https://github.com/ggml-org/llama.cpp) runtime.
- A choice of one (or multiple) vector DB providers to serve as a RAG backend, with configurable web-based or git repo-based sources. Vector embedding and document retrieval are implemented with [LangChain](https://docs.langchain.com/oss/python/langchain/overview).
- [Vault](https://developer.hashicorp.com/vault)-based secret management for the HuggingFace API token and credentials for supported databases ([Elasticsearch](https://www.elastic.co/docs/solutions/search/vector), [PGVector](https://github.com/pgvector/pgvector), [Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/sql-server/ai/vectors?view=sql-server-ver17)).
- A [gradio](https://www.gradio.app/)-based frontend for connecting to multiple [OpenAI API-compatible](https://github.com/openai/openai-openapi) LLMs that exposes the internals of the RAG query and LLM prompts, giving users better insight into what is running.
Lines changed: 297 additions & 0 deletions
@@ -0,0 +1,297 @@
---
title: Configuring this Pattern
weight: 20
aliases: /rag-llm-cpu/configure/
---

# **Configuring this Pattern**

This guide covers common customizations, such as changing the default LLM, adding new models, and configuring RAG data sources.
We assume you have already completed the [Getting Started](/rag-llm-cpu/getting-started/) guide.

## **How Configuration Works (Overview)**

This pattern is managed by ArgoCD (GitOps). All application configurations are defined in `values-prod.yaml`.
To customize a component, you will typically:

1. **Enable an Override:** In `values-prod.yaml`, find the application you want to change (e.g., `llm-inference-service`) and add an `extraValueFiles:` entry pointing to a new override file (e.g., `$patternref/overrides/llm-inference-service.yaml`).
2. **Create the Override File:** Create the new `.yaml` file inside the `/overrides` directory.
3. **Add Your Settings:** Add _only_ the specific values you want to change into this new file.
4. **Commit & Sync:** Commit your changes and let ArgoCD sync the application.

## **Task: Change the Default LLM**

By default, the pattern deploys the `mistral-7b-instruct-v0.2.Q5_0.gguf` model. You might want to change this to a different model (e.g., a different quantization) or adjust its resource usage.
You can do this by creating an override file for the _existing_ `llm-inference-service` application.

1. **Enable the Override**:
In `values-prod.yaml`, update the `llm-inference-service` application to use an override file:

```yaml
clusterGroup:
  # ...
  applications:
    # ...
    llm-inference-service:
      name: llm-inference-service
      namespace: rag-llm-cpu
      chart: llm-inference-service
      chartVersion: 0.3.*
      extraValueFiles: # <-- ADD THIS BLOCK
        - $patternref/overrides/llm-inference-service.yaml
```

2. **Create the Override File:**
Create a new file `overrides/llm-inference-service.yaml`. Here is an example that switches to a different model file (Q8_0) and increases the CPU/memory requests:

```yaml
inferenceService:
  resources: # <-- Increased allocated resources
    requests:
      cpu: "8"
      memory: 12Gi
    limits:
      cpu: "12"
      memory: 24Gi

servingRuntime:
  args:
    - --model
    - /models/mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed model file

model:
  repository: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
  files:
    - mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed file to download
```

## **Task: Add a Second LLM**

You can also deploy an entirely separate, second LLM and add it to the demo UI. This example deploys a different runtime (HuggingFace TGI) instead of llama.cpp.
This is a two-step process: (1) Deploy the new LLM, and (2) Tell the frontend UI about it.

### **Step 1: Deploy the New LLM Service**

1. **Define the New Application:**
In `values-prod.yaml`, add a new application to the applications list. We'll call it `another-llm-inference-service`.

```yaml
clusterGroup:
  # ...
  applications:
    # ...
    another-llm-inference-service: # <-- ADD THIS NEW APPLICATION
      name: another-llm-inference-service
      namespace: rag-llm-cpu
      chart: llm-inference-service
      chartVersion: 0.3.*
      extraValueFiles:
        - $patternref/overrides/another-llm-inference-service.yaml
```

2. **Create the Override File:**
Create the new file `overrides/another-llm-inference-service.yaml`. This file needs to define the new model and disable resource creation (like secrets) that the first LLM already created.

```yaml
dsc:
  initialize: false
externalSecret:
  create: false

# Define the new InferenceService
inferenceService:
  name: hf-inference-service # <-- New service name
  minReplicas: 1
  maxReplicas: 1
  resources:
    requests:
      cpu: "8"
      memory: 32Gi
    limits:
      cpu: "12"
      memory: 32Gi

# Define the new runtime (HuggingFace TGI)
servingRuntime:
  name: hf-runtime
  port: 8080
  image: docker.io/kserve/huggingfaceserver:latest
  modelFormat: huggingface
  args:
    - --model_dir
    - /models
    - --model_name
    - /models/Mistral-7B-Instruct-v0.3
    - --http_port
    - "8080"

# Define the new model to download
model:
  repository: mistralai/Mistral-7B-Instruct-v0.3
  files:
    - generation_config.json
    - config.json
    - model.safetensors.index.json
    - model-00001-of-00003.safetensors
    - model-00002-of-00003.safetensors
    - model-00003-of-00003.safetensors
    - tokenizer.model
    - tokenizer.json
    - tokenizer_config.json
```

**Warning:** There is currently a bug in the model-downloading container that requires you to explicitly list _all_ files you wish to download from the HuggingFace repository. Make sure you list every file needed for the model to run.
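
If you are unsure which files a repository contains, a small helper script can generate the list for you. The following is an illustrative sketch using the `huggingface_hub` client; it is not part of the pattern, and the token value is a placeholder (gated repositories such as `mistralai/Mistral-7B-Instruct-v0.3` require a valid HuggingFace token).

```python
# Sketch: enumerate every file in a HuggingFace repo so the override's
# model.files list can be copied verbatim. Not part of the pattern itself.
# Assumes the huggingface_hub package is installed; replace the token placeholder.
from huggingface_hub import list_repo_files

files = list_repo_files("mistralai/Mistral-7B-Instruct-v0.3", token="hf_...")
for f in files:
    print(f"    - {f}")  # ready to paste under model.files in the override
```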

### **Step 2: Add the New LLM to the Demo UI**

Now, tell the frontend that this new LLM exists.

1. **Edit the Frontend Overrides**:
Open `overrides/rag-llm-frontend-values.yaml` (this file should already exist from the initial setup).
2. **Update LLM_URLS:**
Add the URL of your new service to the `LLM_URLS` environment variable. The URL follows the format `http://<service-name>-predictor/v1` (or `http://<service-name>-predictor/openai/v1` for the HF runtime). A quick way to sanity-check these endpoints is sketched after the example below.

In `overrides/rag-llm-frontend-values.yaml`:

```yaml
env:
  # ...
  - name: LLM_URLS
    value: '["http://cpu-inference-service-predictor/v1","http://hf-inference-service-predictor/openai/v1"]'
```
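
To confirm that both endpoints respond before the UI depends on them, a minimal check like the following can be run from a pod inside the `rag-llm-cpu` namespace. This is an illustrative sketch using the `openai` Python client; the URLs are the in-cluster service names from the example above, and the API key is a dummy value because the client simply requires a non-empty string.

```python
# Sketch: verify that the OpenAI-compatible endpoints from LLM_URLS respond.
# Run from a pod inside the cluster (the URLs are in-cluster service names).
from openai import OpenAI

urls = [
    "http://cpu-inference-service-predictor/v1",
    "http://hf-inference-service-predictor/openai/v1",
]

for base_url in urls:
    client = OpenAI(base_url=base_url, api_key="not-used")
    models = client.models.list()  # each runtime reports the model(s) it serves
    print(base_url, [m.id for m in models.data])
```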

## **Task: Customize RAG Data Sources**

By default, the pattern ingests data from the Validated Patterns documentation. You can change this to point to your own public git repositories or web pages.

1. **Edit the Vector DB Overrides:**
Open `overrides/vector-db-values.yaml` (this file should already exist).
2. **Update Sources:**
Modify the `repoSources` and `webSources` keys. You can add any publicly available Git repo (using globs to filter files) or public web URLs. The job will also process PDFs listed under `webSources`.

In `overrides/vector-db-values.yaml`:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true

vectorEmbedJob:
  repoSources:
    - repo: https://github.com/your-org/your-docs.git # <-- Your repo
      globs:
        - "**/*.md"
  webSources:
    - https://your-company.com/product-manual.pdf # <-- Your PDF
  chunking:
    size: 4096
```
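
Conceptually, the embed job loads each source, splits it into chunks of the configured size, embeds the chunks, and writes the vectors into every enabled provider. The following is a rough, illustrative sketch of that flow using LangChain; it is not the pattern's actual job code, and the loader, splitter, and source URL shown are assumptions based on the configuration above.

```python
# Rough sketch of what the vector-embed job does conceptually: load a source,
# chunk it, and embed the chunks. Not the pattern's actual implementation.
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# An HTML web source as an example; PDF sources would use a PDF-capable loader.
docs = WebBaseLoader("https://validatedpatterns.io/").load()

# chunking.size: 4096 from vector-db-values.yaml
splitter = RecursiveCharacterTextSplitter(chunk_size=4096, chunk_overlap=0)
chunks = splitter.split_documents(docs)

# The same embedding model is configured for the frontend providers below.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vectors = embeddings.embed_documents([c.page_content for c in chunks])
print(f"{len(chunks)} chunks, {len(vectors[0])} dimensions each")
```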

## **Task: Add a New RAG Database Provider**

By default, the pattern enables _qdrant_ and _mssql_. You can also enable _redis_, _pgvector_ (Postgres), or _elastic_ (Elasticsearch).
This is a three-step process: (1) Add secrets, (2) Enable the DB, and (3) Tell the frontend UI.

### **Step 1: Update Your Secrets File**

If your new DB requires credentials (like _pgvector_ or _elastic_), add them to your main secrets file:

```sh
vim ~/values-secret-rag-llm-cpu.yaml
```

Add the necessary credentials. For example:

```yaml
secrets:
  # ...
  - name: pgvector
    fields:
      - name: user
        value: user # <-- Update the user
      - name: password
        value: password # <-- Update the password
      - name: db
        value: db # <-- Update the db
```

**Note:** Refer to the file [`values-secret.yaml.template`](https://github.com/validatedpatterns-sandbox/rag-llm-cpu/blob/main/values-secret.yaml.template) to see which values are expected.

### **Step 2: Enable the Provider in the Vector DB Chart**

Edit `overrides/vector-db-values.yaml` and set `enabled: true` for the provider(s) you want to add.

In `overrides/vector-db-values.yaml`:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true
  pgvector: # <-- ADD THIS
    enabled: true
  elastic: # <-- OR THIS
    enabled: true
```

### **Step 3: Add the Provider to the Demo UI**

Finally, edit `overrides/rag-llm-frontend-values.yaml` to configure the UI. You must:

1. Add the new provider's secrets to the `dbProvidersSecret.vault` list.
2. Add the new provider's connection details to the `dbProvidersSecret.providers` list.

Below is a complete example showing configuration for all of the supported RAG DB providers:

In `overrides/rag-llm-frontend-values.yaml`:

```yaml
dbProvidersSecret:
  vault:
    - key: mssql
      field: sapassword
    - key: pgvector # <-- Add this block
      field: user
    - key: pgvector
      field: password
    - key: pgvector
      field: db
    - key: elastic # <-- Add this block
      field: user
    - key: elastic
      field: password
  providers:
    - type: qdrant # <-- Example for Qdrant
      collection: docs
      url: http://qdrant-service:6333
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: mssql # <-- Example for MSSQL
      table: docs
      connection_string: >-
        Driver={ODBC Driver 18 for SQL Server};
        Server=mssql-service,1433;
        Database=embeddings;
        UID=sa;
        PWD={{ .mssql_sapassword }};
        TrustServerCertificate=yes;
        Encrypt=no;
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: redis # <-- Example for Redis
      index: docs
      url: redis://redis-service:6379
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: elastic # <-- Example for Elastic
      index: docs
      url: http://elastic-service:9200
      user: "{{ .elastic_user }}"
      password: "{{ .elastic_password }}"
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: pgvector # <-- Example for PGVector
      collection: docs
      url: >-
        postgresql+psycopg://{{ .pgvector_user }}:{{ .pgvector_password }}@pgvector-service:5432/{{ .pgvector_db }}
      embedding_model: sentence-transformers/all-mpnet-base-v2
```
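
The Vault placeholders (for example `{{ .pgvector_user }}`) are filled in from the secrets you created in Step 1. If you want to confirm the resulting connection details work before the frontend picks them up, a quick check such as the following can be run from a pod in the cluster. This is an illustrative sketch, not part of the pattern; it assumes Python with `sqlalchemy` and the `psycopg` driver installed, and uses placeholder credentials matching the secrets example above.

```python
# Sketch: verify a rendered PGVector connection URL before the frontend uses it.
# Run from a pod inside the cluster; user/password/db are the values you put in
# values-secret-rag-llm-cpu.yaml (Vault templates them into the real secret).
from sqlalchemy import create_engine, text

url = "postgresql+psycopg://user:password@pgvector-service:5432/db"  # placeholders
engine = create_engine(url)

with engine.connect() as conn:
    # The pgvector extension must be present in the target database.
    print(conn.execute(text("SELECT extname FROM pg_extension")).fetchall())
```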
