🤖 RAG System — Retrieval Augmented Generation
This project implements a complete Retrieval Augmented Generation (RAG) system that enables intelligent question-answering over PDF documents. The system combines semantic search with local Large Language Models to provide accurate, context-aware responses.
| Paper | Authors | Year | Focus |
|-------|---------|------|-------|
| Attention Is All You Need | Vaswani et al. | 2017 | Transformer Architecture |
| BERT: Pre-training of Deep Bidirectional Transformers | Devlin et al. | 2018 | Bidirectional Language Models |
| Language Models are Few-Shot Learners (GPT-3) | Brown et al. | 2020 | Few-Shot Learning |
PDF loading and parsing
Intelligent text chunking (1000 chars)
Metadata preservation
Recursive text splitting
Vector embeddings (MiniLM-L6)
ChromaDB vector store
Similarity scoring
Top-K retrieval
Local inference via Ollama
Qwen 2.5 (1.5B) model
Custom prompt templates
Context-aware responses
Beautiful CLI with Rich
Streamlit Web UI
Conversation history
Source citations
┌─────────────────────────────────────────────────────────────────────────┐
│                              RAG PIPELINE                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   📄 PDFs                                                               │
│          │                                                              │
│          ▼                                                              │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                 │
│   │   Loading   │───▶│  Chunking   │───▶│ Embeddings  │                 │
│   │   (PyPDF)   │    │  (1000ch)   │    │  (MiniLM)   │                 │
│   └─────────────┘    └─────────────┘    └─────────────┘                 │
│                                                │                        │
│                                                ▼                        │
│                                         ┌─────────────┐                 │
│                                         │  ChromaDB   │                 │
│                                         │ Vector Store│                 │
│                                         └─────────────┘                 │
│                                                │                        │
│   ┌─────────────┐    ┌─────────────┐           │                        │
│   │   Answer    │◀───│   Ollama    │◀──────────┘                        │
│   │             │    │  (Qwen2.5)  │                                    │
│   └─────────────┘    └─────────────┘                                    │
│          │                  ▲                                           │
│          │           ┌─────────────┐    ┌─────────────┐                 │
│          │           │   Prompt    │◀───│  Retriever  │                 │
│          │           │  Template   │    │   (Top-K)   │                 │
│          │           └─────────────┘    └─────────────┘                 │
│          ▼                                     ▲                        │
│   ┌─────────────┐                              │                        │
│   │ User Query  │──────────────────────────────┘                        │
│   └─────────────┘                                                       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
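The retrieval half of the pipeline (embed, score, take the top K, fill a prompt) can be sketched in plain Python. This is an illustrative toy, not the project's code: the bag-of-words `embed` function stands in for the MiniLM embeddings, the in-memory list stands in for ChromaDB, and the prompt wording, function names and sample sentences are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for MiniLM vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    """Score every chunk against the query and keep the top_k best."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(query: str, hits: list[tuple[float, str]]) -> str:
    """Fill a hypothetical prompt template with the retrieved context."""
    context = "\n\n".join(chunk for _, chunk in hits)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Made-up sample chunks, one per indexed paper.
chunks = [
    "The Transformer relies entirely on attention mechanisms.",
    "BERT is pre-trained with a masked language model objective.",
    "GPT-3 is evaluated in few-shot settings.",
]
question = "What is the Transformer attention mechanism?"
hits = retrieve(question, chunks, top_k=2)
prompt = build_prompt(question, hits)
print(prompt)
```

In the real system the resulting prompt would be sent to the Ollama model; that step is left out here so the sketch runs without any services.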
RAG-Project/
│
├── 📄 cli.py                       # Command-line interface
├── 📄 app.py                       # Streamlit web application
├── ⚙️ config.yaml                  # System configuration
├── 📋 requirements.txt             # Python dependencies
├── 📝 template.py                  # Prompt templates
├── 📖 README.md                    # Project documentation
│
├── 📂 data/
│   ├── 1706.03762v7.pdf            # Attention Is All You Need
│   ├── 1810.04805v2.pdf            # BERT paper
│   ├── 2005.14165v4.pdf            # GPT-3 paper
│   └── evaluation_dataset.json     # Test questions & ground truths
│
├── 📂 src/
│   ├── __init__.py
│   ├── document_indexer.py         # Q1: Document loading & chunking
│   ├── vector_store.py             # Q1: ChromaDB vector storage
│   ├── document_retriever.py       # Q2: Semantic retrieval
│   ├── llm_qa_system.py            # Q3: LLM question-answering
│   ├── evaluator.py                # Q4: Evaluation metrics
│   ├── chatbot.py                  # Q5: Conversational chatbot
│   │
│   └── 📂 utils/
│       ├── __init__.py
│       ├── config_loader.py        # Configuration management
│       ├── logger.py               # Logging utilities
│       └── metrics.py              # Evaluation metrics
│
└── 📂 vector_store/                # Persisted embeddings (gitignored)
| Requirement | Version | Purpose |
|-------------|---------|---------|
| Python | 3.10+ | Runtime |
| Ollama | Latest | Local LLM |
| CUDA | 11.8+ | GPU acceleration (optional) |
Step 1: Clone the Repository
git clone https://github.com/your-username/RAG-Project.git
cd RAG-Project
Step 2: Create Virtual Environment
# Using Conda (recommended)
conda create -n rag python=3.10
conda activate rag
# Or using venv
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
Step 3: Install Dependencies
# Install PyTorch with CUDA support (optional, for GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install project dependencies
pip install -r requirements.txt
Step 4: Install Ollama and Pull the Model
# Download Ollama from https://ollama.com/download
# Pull the LLM model
ollama pull qwen2.5:1.5b
# Verify installation
ollama list
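The same check can be done from Python, which is handy before starting the app. The sketch below queries the Ollama HTTP endpoint `/api/tags`, which lists locally available models; the helper names and the 2-second timeout are assumptions for illustration, and the function simply returns an empty list when the server is not reachable.

```python
import json
import urllib.error
import urllib.request


def model_names(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def list_ollama_models(base_url: str = "http://localhost:11434",
                       timeout: float = 2.0) -> list[str]:
    """Return locally available Ollama models, or [] if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, connection timed out, or the body was not valid JSON.
        return []
```

Calling `"qwen2.5:1.5b" in list_ollama_models()` then tells you whether the model pulled above is ready to use.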
# 1️⃣ Index your documents
python cli.py index data/ -d
# 2️⃣ Ask a question
python cli.py ask "What is the Transformer architecture?" -s
# 3️⃣ Start the web interface
streamlit run app.py
| Command | Description | Example |
|---------|-------------|---------|
| index | Index PDF documents | `python cli.py index data/ -d` |
| search | Semantic search | `python cli.py search "attention mechanism"` |
| ask | Ask a question | `python cli.py ask "What is BERT?" -s` |
| chat | Interactive chatbot | `python cli.py chat` |
| evaluate | Run evaluation | `python cli.py evaluate -o results.json` |
| stats | Vector store info | `python cli.py stats` |
| models | List Ollama models | `python cli.py models` |
| config | Show configuration | `python cli.py config` |
| web | Launch Streamlit | `python cli.py web` |
Open http://localhost:8501 in your browser.
Features:
💬 Chat: Interactive conversation with history
❓ Q&A: Single questions with source citations
🔍 Search: Semantic document search
All settings are centralized in config.yaml:
# Document Processing
document_processing:
  chunk_size: 1000            # Characters per chunk
  chunk_overlap: 200          # Overlap between chunks
  split_method: "recursive"   # Splitting strategy

# Embeddings
embeddings:
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  device: "cuda"              # Use GPU if available

# LLM (Ollama)
llm:
  model_name: "qwen2.5:1.5b"
  base_url: "http://localhost:11434"
  temperature: 0.7

# Retrieval
retrieval:
  top_k: 5                    # Number of chunks to retrieve
  score_threshold: 0.3        # Minimum similarity score
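To see what these numbers mean in practice, here is a minimal sketch of how `chunk_size`/`chunk_overlap` and `top_k`/`score_threshold` could be applied. It is a simplification under stated assumptions: the chunker is a plain sliding window rather than the recursive splitter the project configures, the function names are hypothetical, and scores are assumed to be similarities where higher is better.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-size sliding window: each chunk shares chunk_overlap chars with the next."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # 800 with the defaults above
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def filter_hits(hits: list[tuple[str, float]], top_k: int = 5,
                score_threshold: float = 0.3) -> list[tuple[str, float]]:
    """Drop hits below the threshold, then keep the top_k highest-scoring ones."""
    kept = [hit for hit in hits if hit[1] >= score_threshold]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:top_k]


# A 2600-character text yields windows starting at 0, 800 and 1600.
sample_text = "".join(str(i % 10) for i in range(2600))
sample_chunks = chunk_text(sample_text)

# Hypothetical scored results; the last one falls below the threshold.
sample_hits = [("doc_a.pdf", 0.51), ("doc_c.pdf", 0.21), ("doc_b.pdf", 0.47)]
kept_hits = filter_hits(sample_hits)
```

A larger overlap reduces the chance that a sentence is cut in half at a chunk boundary, at the cost of storing more, partly redundant, chunks.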
python cli.py evaluate -o results.json
| Metric | Score | Description |
|--------|-------|-------------|
| Precision@5 | 0.98 | Relevant documents in top 5 |
| Recall@5 | 0.90 | Fraction of relevant docs retrieved |
| MRR | 1.00 | Mean Reciprocal Rank |
| Hit Rate@5 | 1.00 | Success rate for finding relevant docs |
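For reference, the retrieval metrics above can be computed as in the sketch below. It uses the common textbook definitions; the project's own `metrics.py` may differ in details (for example, how it decides that a retrieved chunk is relevant), so treat the function names and signatures as assumptions.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the top-k results that are relevant (divides by k)."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def hit_rate_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """1.0 if at least one relevant document is in the top-k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over several (retrieved, relevant) query results."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Each function scores a single query; the reported figures would then be averages over all questions in the evaluation dataset.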

| Metric | Score | Description |
|--------|-------|-------------|
| Answer Relevance | 0.77 | How well answer addresses question |
| Faithfulness | 0.36 | Grounding in retrieved context |
| Word Overlap F1 | 0.23 | Lexical similarity to ground truth |
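Word Overlap F1 is the easiest of these to reproduce by hand. The sketch below shows one common token-level definition (the harmonic mean of word precision and recall against the ground truth); the tokenization rule and function names are assumptions, and the project's evaluator may normalize text differently.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_overlap_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a generated answer and the reference answer."""
    pred, gold = tokenize(prediction), tokenize(ground_truth)
    if not pred or not gold:
        # Both empty counts as a perfect match; one empty counts as no match.
        return float(pred == gold)
    # Multiset intersection: each token counts at most as often as in both texts.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Because it only counts shared words, a correct but differently phrased answer can score low on this metric, which is worth keeping in mind when reading the table above.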
| Component | Choice | Justification |
|-----------|--------|---------------|
| Embedding Model | all-MiniLM-L6-v2 | Lightweight (80MB), fast, good semantic quality |
| Vector Store | ChromaDB | Easy setup, persistent storage, LangChain integration |
| LLM | Qwen 2.5 (1.5B) | Local inference, no API costs, fast (~1s response) |
| Text Splitter | RecursiveCharacterTextSplitter | Respects document structure, configurable |
| Chunk Size | 1000 characters | Balance between context richness and precision |
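The idea behind recursive splitting, and why it "respects document structure", can be illustrated with a simplified re-implementation. This is a sketch of the general technique only, not LangChain's actual `RecursiveCharacterTextSplitter` code: it ignores overlap, drops empty pieces, and the function name and default separators are assumptions.

```python
def recursive_split(text: str, chunk_size: int = 1000,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces: list[str] = []
    for part in text.split(sep):
        # Parts that already fit come back unchanged; larger ones are split further.
        pieces.extend(recursive_split(part, chunk_size, finer))

    # Greedily merge neighbouring pieces back together while they still fit.
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Four made-up paragraphs of 50 short "words" each, split into 120-char chunks.
demo_text = "\n\n".join(
    " ".join(f"w{i}" for i in range(p * 50, p * 50 + 50)) for p in range(4)
)
demo_chunks = recursive_split(demo_text, chunk_size=120)
```

Paragraph breaks are tried first, then line breaks, then spaces, so a chunk boundary only falls inside a word when nothing coarser fits.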

| Component | Alternative | Why Not Chosen |
|-----------|-------------|----------------|
| Embeddings | all-mpnet-base-v2 | Better quality but slower |
| Vector Store | FAISS | Faster, but persistence is manual (explicit save/load) |
| LLM | Mistral-7B | Better quality but requires more VRAM |
╭─────────────────────── 💡 Answer ───────────────────────╮
│                                                         │
│  The Transformer is a neural network architecture       │
│  designed to process sequences of data. It consists     │
│  of stacked self-attention mechanisms followed by       │
│  point-wise, fully connected layers for both encoder    │
│  and decoder. Its key components include multi-head     │
│  self-attention and position-wise feedforward networks. │
│                                                         │
╰─────────────────────────────────────────────────────────╯
📚 Sources
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┓
┃ Document              ┃ Page ┃ Score  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━┩
│ 1706.03762v7.pdf      │ 2    │ 0.5064 │
│ 1810.04805v2.pdf      │ 2    │ 0.4662 │
└───────────────────────┴──────┴────────┘
| Metric | Value |
|--------|-------|
| Indexing Speed | ~3 seconds for 3 PDFs |
| Search Latency | ~50 ms per query |
| Answer Generation | ~1-2 seconds |
| Memory Usage | ~2 GB VRAM |
Vaswani, A., et al. (2017). Attention Is All You Need
Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers
Brown, T., et al. (2020). Language Models are Few-Shot Learners
This project is developed for educational purposes as part of the RAG TP assignment.
Built with ❤️ using LangChain, ChromaDB, Ollama & Streamlit