
Commit 769ce21

add docs for rag-llm-cpu pattern
1 parent 0f4f4d5 commit 769ce21

File tree: 7 files changed (+455 -0 lines changed)

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
---
title: RAG LLM Chatbot on CPU
date: 2025-10-24
tier: sandbox
summary: This pattern deploys a CPU-based LLM, your choice of several RAG DB providers, and a simple chatbot UI that exposes the configuration and results of the RAG queries.
rh_products:
  - Red Hat OpenShift Container Platform
  - Red Hat OpenShift GitOps
  - Red Hat OpenShift AI
partners:
  - Microsoft
industries:
  - General
aliases: /rag-llm-cpu/
links:
  github: https://github.com/validatedpatterns-sandbox/rag-llm-cpu
  install: getting-started
  bugs: https://github.com/validatedpatterns-sandbox/rag-llm-cpu/issues
  feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
---

# **CPU-based RAG LLM Chatbot**

## **Introduction**

This Validated Pattern deploys a Retrieval-Augmented Generation (RAG) chatbot on Red Hat OpenShift using Red Hat OpenShift AI. The pattern is designed to run entirely on CPU nodes without requiring GPU hardware, making it a cost-effective and accessible solution for environments where GPU resources are limited or unavailable.
It provides a secure, flexible, and production-ready starting point for building and deploying on-premise generative AI applications.

## **Target Audience**

This pattern is designed for:

- **Developers & Data Scientists** looking to build and experiment with RAG-based LLM applications.
- **MLOps & DevOps Engineers** responsible for deploying and managing AI/ML workloads on OpenShift.
- **Architects** evaluating cost-effective methods for delivering generative AI capabilities on-premise.

## **Why Use This Pattern?**

- **Cost-Effective:** Runs entirely on CPU, removing the need for expensive and often scarce GPU resources.
- **Flexible:** Supports multiple vector database backends (Elasticsearch, PGVector, MS SQL Server) to integrate with your existing data infrastructure.
- **Transparent:** The Gradio frontend is designed to expose the internals of the RAG query and LLM prompts, giving you clear insight into the generation process.
- **Extensible:** Built on open-source standards (KServe, OpenAI-compatible API) to serve as a robust foundation for more complex applications.

## **Architecture Overview**

At a high level, the components work together as follows:

1. A user enters a query into the **Gradio UI**.
2. The backend application, using **LangChain**, first queries a configured **Vector Database** to retrieve relevant documents (the "R" in RAG).
3. These documents are combined with the user's original query into a prompt.
4. The prompt is sent to the **KServe-deployed LLM** (running via llama.cpp on a CPU node).
5. The LLM generates a response, which is streamed back to the Gradio UI for the user.
6. **Vault** securely provides the necessary credentials for the vector database and HuggingFace token at runtime.

![Overview](/images/rag-llm-cpu/rag-augmented-query.png)

_Figure 1. Overview of RAG Query from User's perspective._
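
To make the flow concrete, here is a minimal, illustrative Python sketch of steps 1-5. It is not the pattern's frontend code: the in-cluster service names, Qdrant collection name, embedding model, and LLM model name are assumptions based on the defaults used in this pattern's configuration guide.

```python
# Minimal sketch of the RAG query flow above -- not the pattern's frontend code.
# Service names, collection name, and model names are assumptions taken from the
# defaults used elsewhere in this pattern's documentation.
from openai import OpenAI
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
vector_db = QdrantClient(url="http://qdrant-service:6333")
llm = OpenAI(base_url="http://cpu-inference-service-predictor/v1", api_key="not-used")

question = "How do I install a validated pattern?"

# Steps 1-2: embed the question and retrieve the closest document chunks (the "R" in RAG).
hits = vector_db.search(
    collection_name="docs",
    query_vector=embedder.encode(question).tolist(),
    limit=4,
)
context = "\n\n".join((hit.payload or {}).get("page_content", "") for hit in hits)

# Steps 3-5: combine context and question into a prompt and stream the LLM's answer.
stream = llm.chat.completions.create(
    model="mistral-7b-instruct-v0.2",  # model name as exposed by the runtime (assumption)
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```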

## **Prerequisites**

Before you begin, ensure you have access to the following:

- A Red Hat OpenShift cluster (version 4.x); at least 2 `m5.4xlarge` nodes are recommended.
- A HuggingFace API token.
- Command-line tools: `podman`.

## **What This Pattern Provides**

- A [kserve](https://github.com/kserve/kserve)-based LLM deployed to [RHOAI](https://www.redhat.com/en/products/ai/openshift-ai) that runs entirely on a CPU node with a [llama.cpp](https://github.com/ggml-org/llama.cpp) runtime.
- A choice of one (or multiple) vector DB providers to serve as a RAG backend, with configurable web-based or git repo-based sources. Vector embedding and document retrieval are implemented with [LangChain](https://docs.langchain.com/oss/python/langchain/overview).
- [Vault](https://developer.hashicorp.com/vault)-based secret management for the HuggingFace API token and credentials for supported databases ([Elasticsearch](https://www.elastic.co/docs/solutions/search/vector), [PGVector](https://github.com/pgvector/pgvector), [Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/sql-server/ai/vectors?view=sql-server-ver17)).
- A [gradio](https://www.gradio.app/)-based frontend for connecting to multiple [OpenAI API-compatible](https://github.com/openai/openai-openapi) LLMs that exposes the internals of the RAG query and LLM prompts, giving users better insight into what is running.
Lines changed: 297 additions & 0 deletions
@@ -0,0 +1,297 @@
---
title: Configuring this Pattern
weight: 20
aliases: /rag-llm-cpu/configure/
---

# **Configuring this Pattern**

This guide covers common customizations, such as changing the default LLM, adding new models, and configuring RAG data sources.
We assume you have already completed the [Getting Started](/rag-llm-cpu/getting-started/) guide.

## **How Configuration Works (Overview)**

This pattern is managed by ArgoCD (GitOps). All application configurations are defined in `values-prod.yaml`.
To customize a component, you will typically:

1. **Enable an Override:** In `values-prod.yaml`, find the application you want to change (e.g., `llm-inference-service`) and add an `extraValueFiles:` entry pointing to a new override file (e.g., `$patternref/overrides/llm-inference-service.yaml`).
2. **Create the Override File:** Create the new `.yaml` file inside the `/overrides` directory.
3. **Add Your Settings:** Add _only_ the specific values you want to change into this new file.
4. **Commit & Sync:** Commit your changes and let ArgoCD sync the application.

## **Task: Change the Default LLM**

By default, the pattern deploys the `mistral-7b-instruct-v0.2.Q5_0.gguf` model. You might want to change this to a different model (e.g., a different quantization) or adjust its resource usage.
You can do this by creating an override file for the _existing_ `llm-inference-service` application.

1. **Enable the Override**:
In `values-prod.yaml`, update the `llm-inference-service` application to use an override file:

```yaml
clusterGroup:
  # ...
  applications:
    # ...
    llm-inference-service:
      name: llm-inference-service
      namespace: rag-llm-cpu
      chart: llm-inference-service
      chartVersion: 0.3.*
      extraValueFiles: # <-- ADD THIS BLOCK
        - $patternref/overrides/llm-inference-service.yaml
```

2. **Create the Override File:**
Create a new file `overrides/llm-inference-service.yaml`. Here is an example that switches to a different model file (Q8_0) and increases the CPU/memory requests:

```yaml
inferenceService:
  resources: # <-- Increased allocated resources
    requests:
      cpu: "8"
      memory: 12Gi
    limits:
      cpu: "12"
      memory: 24Gi

servingRuntime:
  args:
    - --model
    - /models/mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed model file

model:
  repository: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
  files:
    - mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed file to download
```

## **Task: Add a Second LLM**

You can also deploy an entirely separate, second LLM and add it to the demo UI. This example deploys a different runtime (HuggingFace TGI) instead of llama.cpp.
This is a two-step process: (1) Deploy the new LLM, and (2) Tell the frontend UI about it.

### **Step 1: Deploy the New LLM Service**

1. **Define the New Application:**
In `values-prod.yaml`, add a new application to the applications list. We'll call it `another-llm-inference-service`.

```yaml
clusterGroup:
  # ...
  applications:
    # ...
    another-llm-inference-service: # <-- ADD THIS NEW APPLICATION
      name: another-llm-inference-service
      namespace: rag-llm-cpu
      chart: llm-inference-service
      chartVersion: 0.3.*
      extraValueFiles:
        - $patternref/overrides/another-llm-inference-service.yaml
```

2. **Create the Override File:**
Create the new file `overrides/another-llm-inference-service.yaml`. This file needs to define the new model and disable resource creation (like secrets) that the first LLM already created.

```yaml
dsc:
  initialize: false
externalSecret:
  create: false

# Define the new InferenceService
inferenceService:
  name: hf-inference-service # <-- New service name
  minReplicas: 1
  maxReplicas: 1
  resources:
    requests:
      cpu: "8"
      memory: 32Gi
    limits:
      cpu: "12"
      memory: 32Gi

# Define the new runtime (HuggingFace TGI)
servingRuntime:
  name: hf-runtime
  port: 8080
  image: docker.io/kserve/huggingfaceserver:latest
  modelFormat: huggingface
  args:
    - --model_dir
    - /models
    - --model_name
    - /models/Mistral-7B-Instruct-v0.3
    - --http_port
    - "8080"

# Define the new model to download
model:
  repository: mistralai/Mistral-7B-Instruct-v0.3
  files:
    - generation_config.json
    - config.json
    - model.safetensors.index.json
    - model-00001-of-00003.safetensors
    - model-00002-of-00003.safetensors
    - model-00003-of-00003.safetensors
    - tokenizer.model
    - tokenizer.json
    - tokenizer_config.json
```

**Warning:** There is currently a bug in the model-downloading container that requires you to explicitly list _all_ files you wish to download from the HuggingFace repository. Make sure you list every file needed for the model to run.
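
If you are unsure which files a repository contains, a small helper script can generate the list for you. The following is an illustrative sketch using the `huggingface_hub` client; it is not part of the pattern, and the token value is a placeholder (gated repositories such as `mistralai/Mistral-7B-Instruct-v0.3` require a valid HuggingFace token).

```python
# Sketch: enumerate every file in a HuggingFace repo so the override's
# model.files list can be copied verbatim. Not part of the pattern itself.
# Assumes the huggingface_hub package is installed; replace the token placeholder.
from huggingface_hub import list_repo_files

files = list_repo_files("mistralai/Mistral-7B-Instruct-v0.3", token="hf_...")
for f in files:
    print(f"    - {f}")  # ready to paste under model.files in the override
```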

### **Step 2: Add the New LLM to the Demo UI**

Now, tell the frontend that this new LLM exists.

1. **Edit the Frontend Overrides**:
Open `overrides/rag-llm-frontend-values.yaml` (this file should already exist from the initial setup).
2. **Update LLM_URLS:**
Add the URL of your new service to the `LLM_URLS` environment variable. The URL follows the format `http://<service-name>-predictor/v1` (or `http://<service-name>-predictor/openai/v1` for the HF runtime). A quick way to sanity-check these endpoints is sketched after the example below.

In `overrides/rag-llm-frontend-values.yaml`:

```yaml
env:
  # ...
  - name: LLM_URLS
    value: '["http://cpu-inference-service-predictor/v1","http://hf-inference-service-predictor/openai/v1"]'
```
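
To confirm that both endpoints respond before the UI depends on them, a minimal check like the following can be run from a pod inside the `rag-llm-cpu` namespace. This is an illustrative sketch using the `openai` Python client; the URLs are the in-cluster service names from the example above, and the API key is a dummy value because the client simply requires a non-empty string.

```python
# Sketch: verify that the OpenAI-compatible endpoints from LLM_URLS respond.
# Run from a pod inside the cluster (the URLs are in-cluster service names).
from openai import OpenAI

urls = [
    "http://cpu-inference-service-predictor/v1",
    "http://hf-inference-service-predictor/openai/v1",
]

for base_url in urls:
    client = OpenAI(base_url=base_url, api_key="not-used")
    models = client.models.list()  # each runtime reports the model(s) it serves
    print(base_url, [m.id for m in models.data])
```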

## **Task: Customize RAG Data Sources**

By default, the pattern ingests data from the Validated Patterns documentation. You can change this to point to your own public git repositories or web pages.

1. **Edit the Vector DB Overrides:**
Open `overrides/vector-db-values.yaml` (this file should already exist).
2. **Update Sources:**
Modify the `repoSources` and `webSources` keys. You can add any publicly available Git repo (using globs to filter files) or public web URLs. The job will also process PDFs listed under `webSources`.

In `overrides/vector-db-values.yaml`:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true

vectorEmbedJob:
  repoSources:
    - repo: https://github.com/your-org/your-docs.git # <-- Your repo
      globs:
        - "**/*.md"
  webSources:
    - https://your-company.com/product-manual.pdf # <-- Your PDF
  chunking:
    size: 4096
```
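
Conceptually, the embed job loads each source, splits it into chunks of the configured size, embeds the chunks, and writes the vectors into every enabled provider. The following is a rough, illustrative sketch of that flow using LangChain; it is not the pattern's actual job code, and the loader, splitter, and source URL shown are assumptions based on the configuration above.

```python
# Rough sketch of what the vector-embed job does conceptually: load a source,
# chunk it, and embed the chunks. Not the pattern's actual implementation.
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# An HTML web source as an example; PDF sources would use a PDF-capable loader.
docs = WebBaseLoader("https://validatedpatterns.io/").load()

# chunking.size: 4096 from vector-db-values.yaml
splitter = RecursiveCharacterTextSplitter(chunk_size=4096, chunk_overlap=0)
chunks = splitter.split_documents(docs)

# The same embedding model is configured for the frontend providers below.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vectors = embeddings.embed_documents([c.page_content for c in chunks])
print(f"{len(chunks)} chunks, {len(vectors[0])} dimensions each")
```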

## **Task: Add a New RAG Database Provider**

By default, the pattern enables _qdrant_ and _mssql_. You can also enable _redis_, _pgvector_ (Postgres), or _elastic_ (Elasticsearch).
This is a three-step process: (1) Add secrets, (2) Enable the DB, and (3) Tell the frontend UI.

### **Step 1: Update Your Secrets File**

If your new DB requires credentials (like _pgvector_ or _elastic_), add them to your main secrets file:

```sh
vim ~/values-secret-rag-llm-cpu.yaml
```

Add the necessary credentials. For example:

```yaml
secrets:
  # ...
  - name: pgvector
    fields:
      - name: user
        value: user # <-- Update the user
      - name: password
        value: password # <-- Update the password
      - name: db
        value: db # <-- Update the db
```

**Note:** Refer to the file [`values-secret.yaml.template`](https://github.com/validatedpatterns-sandbox/rag-llm-cpu/blob/main/values-secret.yaml.template) to see which values are expected.

### **Step 2: Enable the Provider in the Vector DB Chart**

Edit `overrides/vector-db-values.yaml` and set `enabled: true` for the provider(s) you want to add.

In `overrides/vector-db-values.yaml`:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true
  pgvector: # <-- ADD THIS
    enabled: true
  elastic: # <-- OR THIS
    enabled: true
```

### **Step 3: Add the Provider to the Demo UI**

Finally, edit `overrides/rag-llm-frontend-values.yaml` to configure the UI. You must:

1. Add the new provider's secrets to the `dbProvidersSecret.vault` list.
2. Add the new provider's connection details to the `dbProvidersSecret.providers` list.

Below is a complete example showing configuration for all of the supported RAG DB providers:

In `overrides/rag-llm-frontend-values.yaml`:

```yaml
dbProvidersSecret:
  vault:
    - key: mssql
      field: sapassword
    - key: pgvector # <-- Add this block
      field: user
    - key: pgvector
      field: password
    - key: pgvector
      field: db
    - key: elastic # <-- Add this block
      field: user
    - key: elastic
      field: password
  providers:
    - type: qdrant # <-- Example for Qdrant
      collection: docs
      url: http://qdrant-service:6333
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: mssql # <-- Example for MSSQL
      table: docs
      connection_string: >-
        Driver={ODBC Driver 18 for SQL Server};
        Server=mssql-service,1433;
        Database=embeddings;
        UID=sa;
        PWD={{ .mssql_sapassword }};
        TrustServerCertificate=yes;
        Encrypt=no;
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: redis # <-- Example for Redis
      index: docs
      url: redis://redis-service:6379
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: elastic # <-- Example for Elastic
      index: docs
      url: http://elastic-service:9200
      user: "{{ .elastic_user }}"
      password: "{{ .elastic_password }}"
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: pgvector # <-- Example for PGVector
      collection: docs
      url: >-
        postgresql+psycopg://{{ .pgvector_user }}:{{ .pgvector_password }}@pgvector-service:5432/{{ .pgvector_db }}
      embedding_model: sentence-transformers/all-mpnet-base-v2
```
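
The Vault placeholders (for example `{{ .pgvector_user }}`) are filled in from the secrets you created in Step 1. If you want to confirm the resulting connection details work before the frontend picks them up, a quick check such as the following can be run from a pod in the cluster. This is an illustrative sketch, not part of the pattern; it assumes Python with `sqlalchemy` and the `psycopg` driver installed, and uses placeholder credentials matching the secrets example above.

```python
# Sketch: verify a rendered PGVector connection URL before the frontend uses it.
# Run from a pod inside the cluster; user/password/db are the values you put in
# values-secret-rag-llm-cpu.yaml (Vault templates them into the real secret).
from sqlalchemy import create_engine, text

url = "postgresql+psycopg://user:password@pgvector-service:5432/db"  # placeholders
engine = create_engine(url)

with engine.connect() as conn:
    # The pgvector extension must be present in the target database.
    print(conn.execute(text("SELECT extname FROM pg_extension")).fetchall())
```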
