Two assistants side by side: Qwen2.5-0.5B running locally on CPU, and Claude Sonnet via Vercel AI Gateway. Same guardrail, same memory, same observability stack. Ran 45 custom prompts, three public benchmarks, and an ablation study to see what each safety layer actually does.
Live demo: https://huggingface.co/spaces/siddharth-ceri/ollive-arena
| Metric | OSS (Qwen 2.5-0.5B) | Frontier (Claude Sonnet) |
|---|---|---|
| Factual accuracy | 8.0 / 10 | 9.87 / 10 |
| Safety refusal | 8.67 / 10 | 10.0 / 10 |
| Bias neutrality | 8.13 / 10 | 10.0 / 10 |
| Avg latency | 5.47s | 4.75s |
git clone https://github.com/Sid1005/ollive
cd ollive
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
Fill in .env:
VERCEL_AI_GATEWAY=<your Vercel AI Gateway token>
HF_TOKEN=<your HuggingFace token>
NEXT_PUBLIC_SUPABASE_URL=<your Supabase project URL>
SUPABASE_SECRET_KEY=<your Supabase service role key>
Run the eval pipeline:
# 45-prompt eval - runs Qwen locally, grades with GPT-4.1-nano as judge
python3 evaluator.py --local
# Public benchmarks - TruthfulQA, MMLU, HellaSwag (50 samples each)
python3 benchmark_eval.py --n-samples 50 --local
# Ablation - strips safety layers one at a time across adversarial prompts
python3 ablation.py --local
# Generate the PDF report
python3 generate_report.py --output evaluation_report_final.pdf
Every call appends to latency_history, token_history, and cost_history. After the full run, the evaluator computes means and standard deviations in one pass. A running average would lose the per-prompt variance needed for the consistency metric.
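
A minimal sketch of the append-then-summarize pattern described above. The `RunStats` class, its method names, and the choice of population standard deviation are assumptions for illustration; only the three history names come from this README.

```python
from statistics import mean, pstdev


class RunStats:
    """Hypothetical container: keeps every per-call sample so variance survives."""

    def __init__(self):
        self.latency_history = []
        self.token_history = []
        self.cost_history = []

    def record(self, latency_s, tokens, cost_usd):
        # Append raw samples instead of updating a running average.
        self.latency_history.append(latency_s)
        self.token_history.append(tokens)
        self.cost_history.append(cost_usd)

    def summary(self):
        # One pass at the end of the run: mean and std for each history.
        histories = {
            "latency": self.latency_history,
            "tokens": self.token_history,
            "cost": self.cost_history,
        }
        return {
            name: {"mean": mean(values), "std": pstdev(values)}
            for name, values in histories.items()
        }
```

Because the raw samples are kept, any later metric (percentiles, per-category variance) can be computed without rerunning the eval.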
The keyword scan goes first: ten exact jailbreak phrases, checked in microseconds with no API call. If nothing matches, the input goes to Llama Guard 3-8B via featherless-ai. If that API is down, _call_llama_guard returns None and the guardrail falls back to keyword-only instead of failing open. Combined: precision 0.71, recall 0.80, F1 0.75, catching 12 of 15 adversarial prompts.
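
The two-stage check can be sketched roughly as follows. The phrase list, the `check_input` function, and the return shape are hypothetical; the Llama Guard call is passed in as a plain callable so the sketch runs without any API. The `f1` helper just confirms that the reported precision and recall give the reported F1.

```python
from typing import Callable, Optional

# Hypothetical examples; the real list has ten exact phrases.
JAILBREAK_PHRASES = ("ignore previous instructions", "you are now dan")


def keyword_hit(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in JAILBREAK_PHRASES)


def check_input(text: str, call_llama_guard: Callable[[str], Optional[bool]]) -> dict:
    # Stage 1: cheap exact-phrase scan, no network call.
    if keyword_hit(text):
        return {"blocked": True, "layer": "keyword"}
    # Stage 2: classifier. True = unsafe, False = safe, None = API unavailable.
    verdict = call_llama_guard(text)
    if verdict is None:
        # Degrade to the keyword-only result, which was "no match" at this point.
        return {"blocked": False, "layer": "keyword_only_fallback"}
    return {"blocked": verdict, "layer": "llama_guard"}


def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)
```

With the numbers above, `round(f1(0.71, 0.80), 2)` gives 0.75, and 12 of 15 caught prompts is the 0.80 recall.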
The PII redactor's defaults also redact named entities such as people and places. During eval this turned "Newton" and "Paris" into [REDACTED], scoring correct answers as 0. Now it only covers EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, and CREDIT_CARD - actual sensitive data, not named entities.
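
As an illustration of the allowlist idea, the sketch below filters detector output down to the four entity types before redacting. The `redact` function and the `(start, end, entity_type)` tuple format are assumptions, not the project's actual redaction code or any library's API.

```python
ALLOWED_ENTITIES = {"EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN", "CREDIT_CARD"}


def redact(text, detections):
    """detections: iterable of (start, end, entity_type) from a hypothetical detector."""
    kept = [d for d in detections if d[2] in ALLOWED_ENTITIES]
    # Replace from the end of the string so earlier offsets stay valid.
    for start, end, _ in sorted(kept, key=lambda d: d[0], reverse=True):
        text = text[:start] + "[REDACTED]" + text[end:]
    return text
```

Named-entity detections such as a person or a location are dropped by the filter, so an answer containing "Newton" or "Paris" reaches the judge unchanged.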
user_memory has space-<uuid> rows for per-session context and global-<ip> rows for cross-session violation counts. Starting a new session doesn't reset the counter. Three violations set banned: True permanently.
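
A minimal in-memory sketch of the two key prefixes and the three-strike rule. The `UserMemory` class and its field names other than `banned` are hypothetical; the real rows live in Supabase.

```python
BAN_THRESHOLD = 3


class UserMemory:
    """Hypothetical stand-in for the user_memory table, keyed by row id."""

    def __init__(self):
        self.rows = {}

    def session_row(self, session_id):
        # Per-session context: a new session id gets a fresh row.
        return self.rows.setdefault(f"space-{session_id}", {"context": []})

    def record_violation(self, ip):
        # Cross-session counter: keyed by IP, so new sessions do not reset it.
        row = self.rows.setdefault(f"global-{ip}", {"violations": 0, "banned": False})
        row["violations"] += 1
        if row["violations"] >= BAN_THRESHOLD:
            row["banned"] = True  # never cleared in this sketch
        return row

    def is_banned(self, ip):
        return self.rows.get(f"global-{ip}", {}).get("banned", False)
```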
GPT-4.1-nano is called independently per response. Temperature 0 keeps scores reproducible. Using a third model family as judge avoids the bias you'd get from scoring Claude with Claude.
Five configs strip components one at a time: raw model -> safety prompt -> input guardrail -> both guardrails -> full system. For Qwen, the input guardrail is the biggest driver (+2.0 pts). Safety prompt adds +0.67. For Claude, RLHF does most of the work and the full safety stack only adds +1.33 on top. Guardrails are load-bearing for OSS, marginal for frontier.
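
One way to picture the five configs and the per-layer deltas is sketched below. The `AblationConfig` fields, the `extras` flag (standing in for whatever the full system adds beyond both guardrails), and `marginal_gains` are assumptions; no scores are hard-coded.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AblationConfig:
    name: str
    safety_prompt: bool = False
    input_guardrail: bool = False
    output_guardrail: bool = False
    extras: bool = False  # hypothetical: remaining components of the full system


# Each config enables one more layer than the previous one.
CONFIGS = [
    AblationConfig("raw_model"),
    AblationConfig("safety_prompt", safety_prompt=True),
    AblationConfig("input_guardrail", safety_prompt=True, input_guardrail=True),
    AblationConfig("both_guardrails", safety_prompt=True, input_guardrail=True,
                   output_guardrail=True),
    AblationConfig("full_system", safety_prompt=True, input_guardrail=True,
                   output_guardrail=True, extras=True),
]


def marginal_gains(scores: Dict[str, float]) -> Dict[str, float]:
    """Score added by each layer relative to the config just before it."""
    return {
        cur.name: round(scores[cur.name] - scores[prev.name], 2)
        for prev, cur in zip(CONFIGS, CONFIGS[1:])
    }
```

Feeding the mean safety score per config into `marginal_gains` yields per-layer figures of the kind quoted above, such as the +2.0 for Qwen's input guardrail.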
Qwen2.5-0.5B is free to run and stays local. The capability gap is real (8.0/10 factual vs 9.87 for Claude), but Qwen holds up on safety refusals (8.67/10) because refusing is a format task, not a knowledge task. The benchmark scores (28%, 28%, 20%) are basically noise: Qwen answered "A" on every single MCQ, which is a 0-shot formatting failure, not a knowledge gap.
The featherless-ai call adds ~500ms to every input check and creates an external dependency. If it goes down, the guardrail degrades to keyword-only. Intentional, but worth knowing.
IP-based bans are easy to get wrong: proxy headers can be forged, and shared IPs would ban the wrong people. Fine for a demo, not for anything real.
Supabase holds both chat logs and user memory, so there is one external service instead of two. The downside is that chat logs and user memory share the same table schema, which gets messy to query as data grows.
Stream OSS tokens over SSE. Five seconds of waiting with nothing happening is the worst part of the current UX. Streaming would make it feel way faster even if total time doesn't change.
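
A rough sketch of what the SSE framing could look like, independent of any web framework. The `sse_events` generator and the closing `done` event are assumptions; only the wire format (`data:` lines ended by a blank line) comes from the SSE standard.

```python
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each generated token in a Server-Sent Events frame."""
    for token in tokens:
        # A data field cannot contain a raw newline, so emit one data line per line.
        payload = "".join(f"data: {line}\n" for line in token.split("\n"))
        yield payload + "\n"  # blank line terminates the event
    # Hypothetical end-of-stream marker so the client knows to stop listening.
    yield "event: done\ndata: [DONE]\n\n"
```

Any streaming-capable server could return this generator as a `text/event-stream` response, with the model's token iterator passed in as `tokens`.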
Replace IP tracking with OAuth. Tying violations to a real user identity fixes spoofing, handles shared IPs correctly, and makes bans actually meaningful. Supabase schema stays the same, just swap the IP key for a user ID.
Load Llama Guard locally. Right now it's an API call on every message. It's 8B params, runs on CPU, and loading it locally alongside Qwen would cut the latency and remove the external dependency.
Few-shot prompting for benchmarks. Qwen answered "A" on all 150 benchmark questions. The scores just reflect how often A happened to be correct. Two worked examples in the prompt would fix the formatting failure and probably recover 15-20 percentage points.
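
A sketch of a two-example prompt builder, plus a helper showing why an always-"A" model scores exactly the share of questions whose key is A. The worked examples, function names, and prompt layout are all hypothetical, not the format used by benchmark_eval.py.

```python
LETTERS = "ABCD"

# Hypothetical worked examples shown to the model before the real question.
FEW_SHOT = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which planet is known as the Red Planet?", ["Venus", "Jupiter", "Mars", "Saturn"], "C"),
]


def format_item(question, choices, answer=None):
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    # Leave the answer blank for the item the model has to complete.
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(question, choices):
    shots = [format_item(q, c, a) for q, c, a in FEW_SHOT]
    return "\n\n".join(shots + [format_item(question, choices)])


def constant_answer_accuracy(gold_answers, guess="A"):
    """Accuracy of a model that outputs the same letter for every question."""
    return sum(1 for g in gold_answers if g == guess) / len(gold_answers)
```

The examples deliberately use answers other than "A" so the model sees that the expected output is a single letter that varies by question.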
Make the Gradio UI look better. The current layout is definitely not ideal.
Multi-judge consensus. Running responses through Claude, GPT-4.1, and Gemini and flagging disagreements would give more reliable scores than a single judge.
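
One possible shape for the consensus step, assuming each judge returns a 0-10 score. The `consensus` function, the median aggregation, and the `max_spread` threshold of 2.0 are illustrative choices, not a tested design.

```python
from statistics import median
from typing import Dict


def consensus(scores: Dict[str, float], max_spread: float = 2.0) -> dict:
    """Combine per-judge scores and flag responses the judges disagree on."""
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "score": median(values),         # robust to a single outlier judge
        "spread": spread,
        "flagged": spread > max_spread,  # send to manual review
    }
```

Flagged responses would then be reviewed by hand rather than trusted to any single judge.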