AI Assistant Arena

Two assistants side by side: Qwen2.5-0.5B running locally on CPU, and Claude Sonnet via Vercel AI Gateway. Both share the same guardrails, memory, and observability stack. I ran 45 custom prompts, three public benchmarks, and an ablation study to see what each safety layer actually does.

Live demo: https://huggingface.co/spaces/siddharth-ceri/ollive-arena

Results at a glance

| Metric | OSS (Qwen2.5-0.5B) | Frontier (Claude Sonnet) |
| --- | --- | --- |
| Factual accuracy | 8.0 / 10 | 9.87 / 10 |
| Safety refusal | 8.67 / 10 | 10.0 / 10 |
| Bias neutrality | 8.13 / 10 | 10.0 / 10 |
| Avg latency | 5.47 s | 4.75 s |

Setup

git clone https://github.com/Sid1005/ollive
cd ollive
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

Fill in .env:

VERCEL_AI_GATEWAY=<your Vercel AI Gateway token>
HF_TOKEN=<your HuggingFace token>
NEXT_PUBLIC_SUPABASE_URL=<your Supabase project URL>
SUPABASE_SECRET_KEY=<your Supabase service role key>

Run the eval pipeline:

# 45-prompt eval - runs Qwen locally, grades with GPT-4.1-nano as judge
python3 evaluator.py --local

# Public benchmarks - TruthfulQA, MMLU, HellaSwag (50 samples each)
python3 benchmark_eval.py --n-samples 50 --local

# Ablation - strips safety layers one at a time across adversarial prompts
python3 ablation.py --local

# Generate the PDF report
python3 generate_report.py --output evaluation_report_final.pdf

Architecture decisions

`BaseAssistant` stores latency, tokens, and cost as lists, not running averages

Every call appends to latency_history, token_history, and cost_history. After the full run, the evaluator computes means and standard deviations in one pass. A running average would lose the per-prompt variance needed for the consistency metric.
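A minimal sketch of the idea, assuming a `record` helper and a `summarize` function (both hypothetical names; only `BaseAssistant` and the three history attributes come from this README). It uses the sample standard deviation, which is also an assumption. The sample values are made up.

```python
from dataclasses import dataclass, field
from statistics import mean, stdev


@dataclass
class BaseAssistant:
    # One entry per call; nothing is aggregated until the run is over.
    latency_history: list[float] = field(default_factory=list)
    token_history: list[int] = field(default_factory=list)
    cost_history: list[float] = field(default_factory=list)

    def record(self, latency_s: float, tokens: int, cost_usd: float) -> None:
        """Append one call's metrics (hypothetical helper name)."""
        self.latency_history.append(latency_s)
        self.token_history.append(tokens)
        self.cost_history.append(cost_usd)


def summarize(history: list[float]) -> dict[str, float]:
    """Mean and sample standard deviation; stdev needs at least two points."""
    return {
        "mean": mean(history),
        "std": stdev(history) if len(history) > 1 else 0.0,
    }


assistant = BaseAssistant()
# Illustrative values only, not measurements from the eval run.
for latency, tokens, cost in [(4.0, 100, 0.0), (6.0, 140, 0.0), (5.0, 120, 0.0)]:
    assistant.record(latency, tokens, cost)

latency_stats = summarize(assistant.latency_history)
```

Because the raw lists survive, the same history can later feed a percentile or a per-category breakdown without rerunning the prompts.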

The input guardrail is a two-layer cascade: keywords first, then Llama Guard

The keyword scan goes first: ten exact jailbreak phrases, checked in microseconds with no API call. If nothing matches, the prompt goes to Llama Guard 3-8B via featherless-ai. If that API is down, _call_llama_guard returns None and the guardrail falls back to keyword-only instead of failing open. Combined, the two layers reach precision 0.71, recall 0.80, and F1 0.75, catching 12 of 15 adversarial prompts.
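The control flow of the cascade can be sketched roughly like this. It is not the code in guardrails.py: the phrase list is a placeholder, the case-insensitive matching is an assumption, and the classifier is passed in as a plain callable so the sketch runs without any network access.

```python
from typing import Callable, Optional

# Placeholder phrases; the project's actual ten-phrase list is not shown here.
JAILBREAK_PHRASES = (
    "ignore previous instructions",
    "pretend you have no rules",
)


def keyword_hit(prompt: str) -> bool:
    """Layer 1: substring scan, no network call (case-insensitivity is assumed)."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in JAILBREAK_PHRASES)


def check_input(prompt: str, call_llama_guard: Callable[[str], Optional[bool]]) -> dict:
    """Return a verdict plus the layer that produced it.

    call_llama_guard should return True (unsafe), False (safe),
    or None when the hosted classifier is unreachable.
    """
    if keyword_hit(prompt):
        return {"blocked": True, "layer": "keyword"}

    verdict = call_llama_guard(prompt)
    if verdict is None:
        # Classifier unavailable: the keyword result (no match) stands.
        return {"blocked": False, "layer": "keyword_only_fallback"}
    return {"blocked": verdict, "layer": "llama_guard"}
```

Returning the layer name alongside the verdict makes it easy to log how often the fallback path is taken.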

Presidio's entity list excludes `PERSON` and `LOCATION`

The defaults redact both. During eval this turned "Newton" and "Paris" into [REDACTED], scoring correct answers as 0. Now it only covers EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, and CREDIT_CARD - actual sensitive data, not named entities.
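The effect of the narrower entity list can be illustrated without Presidio itself. The sketch below is a stdlib stand-in: the detections are written by hand, and `Detection` and `redact` are hypothetical names. In Presidio, the equivalent restriction would presumably be made by passing the list as the `entities` argument of `AnalyzerEngine.analyze`.

```python
from typing import NamedTuple

# Entity types the README says are still redacted.
ALLOWED_ENTITIES = {"EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN", "CREDIT_CARD"}


class Detection(NamedTuple):
    entity_type: str
    start: int
    end: int


def redact(text: str, detections: list[Detection]) -> str:
    """Replace only allow-listed spans, right to left so offsets stay valid."""
    kept = [d for d in detections if d.entity_type in ALLOWED_ENTITIES]
    for d in sorted(kept, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + "[REDACTED]" + text[d.end:]
    return text


answer = "Newton was born in 1643. Contact: isaac@example.com"
email = "isaac@example.com"
email_start = answer.index(email)

# Hand-written detections standing in for analyzer output.
detections = [
    Detection("PERSON", 0, len("Newton")),
    Detection("EMAIL_ADDRESS", email_start, email_start + len(email)),
]
redacted = redact(answer, detections)
```

Here the `PERSON` span is detected but ignored, so "Newton" reaches the judge intact while the email address is still masked.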

Violation tracking is keyed by client IP, not session ID

user_memory has space-<uuid> rows for per-session context and global-<ip> rows for cross-session violation counts. Starting a new session doesn't reset the counter. Three violations set banned: True permanently.
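A rough sketch of the keying scheme, with an in-memory dict standing in for the Supabase table. The `violations` and `context` field names and all three function names are assumptions; only the `space-<uuid>` and `global-<ip>` key shapes, the three-violation threshold, and the `banned` flag come from the description above.

```python
BAN_THRESHOLD = 3  # "Three violations" per the description above.

# In-memory stand-in for the user_memory table: key -> row.
user_memory: dict[str, dict] = {}


def record_violation(client_ip: str) -> dict:
    """Increment the cross-session counter stored under global-<ip>."""
    row = user_memory.setdefault(
        f"global-{client_ip}", {"violations": 0, "banned": False}
    )
    row["violations"] += 1
    if row["violations"] >= BAN_THRESHOLD:
        row["banned"] = True  # Never reset anywhere in this sketch.
    return row


def start_session(session_uuid: str) -> dict:
    """Create a fresh per-session row; the global-<ip> row is untouched."""
    return user_memory.setdefault(f"space-{session_uuid}", {"context": []})


def is_banned(client_ip: str) -> bool:
    return user_memory.get(f"global-{client_ip}", {}).get("banned", False)
```

Since `start_session` only ever touches `space-` keys, opening a new session cannot lower the count held under the `global-` key.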

The judge runs at temperature 0.0 as a separate API call

GPT-4.1-nano is called independently per response. Temperature 0 keeps scores reproducible. Using a third model family as judge avoids the bias you'd get from scoring Claude with Claude.

The ablation isolates each layer's marginal contribution

Five configs strip components one at a time: raw model -> safety prompt -> input guardrail -> both guardrails -> full system. For Qwen, the input guardrail is the biggest driver (+2.0 pts). Safety prompt adds +0.67. For Claude, RLHF does most of the work and the full safety stack only adds +1.33 on top. Guardrails are load-bearing for OSS, marginal for frontier.
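The marginal-contribution arithmetic is just a difference between neighbouring configs. A minimal sketch follows, assuming the five configs are scored on the same scale. The config labels are hypothetical names for the stages listed above, and the scores are placeholders, not the measured results.

```python
CONFIG_ORDER = [
    "raw_model",
    "safety_prompt",
    "input_guardrail",
    "both_guardrails",
    "full_system",
]


def marginal_gains(scores: dict[str, float]) -> dict[str, float]:
    """Score of each config minus the score of the config before it."""
    gains = {}
    for previous, current in zip(CONFIG_ORDER, CONFIG_ORDER[1:]):
        gains[current] = round(scores[current] - scores[previous], 2)
    return gains


# Placeholder scores for illustration only; these are NOT the measured results.
example_scores = {
    "raw_model": 5.0,
    "safety_prompt": 5.5,
    "input_guardrail": 7.0,
    "both_guardrails": 7.25,
    "full_system": 7.5,
}
example_gains = marginal_gains(example_scores)
```

The gains sum to the gap between the raw model and the full system, which is a quick sanity check on any ablation table.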

Tradeoffs

0.5B model size

Free to run and stays local. The capability gap is real (8.0/10 factual vs 9.87 for Claude) but it holds up on safety refusals (8.67/10) because refusing is a format task, not a knowledge task. The benchmark scores (28%, 28%, 20%) are basically noise - Qwen answered "A" on every single MCQ, which is a 0-shot formatting failure, not a knowledge gap.
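To see why an all-"A" run says nothing about knowledge, it helps to compute what a constant-answer baseline scores. The sketch below uses synthetic gold labels, not benchmark data, and both function names are hypothetical.

```python
def constant_answer_accuracy(gold_labels: list[str], answer: str = "A") -> float:
    """Accuracy of a model that outputs the same letter for every question."""
    return sum(label == answer for label in gold_labels) / len(gold_labels)


def is_degenerate(predictions: list[str]) -> bool:
    """True when every prediction is the same letter."""
    return len(set(predictions)) == 1


# Synthetic labels for illustration; not drawn from any benchmark.
gold = ["A", "C", "B", "A", "D", "B", "C", "D", "A", "B"]
preds = ["A"] * len(gold)

baseline = constant_answer_accuracy(gold)
```

When a model's accuracy equals this baseline and `is_degenerate` is true, the score measures the answer-key distribution rather than the model.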

Llama Guard runs via a hosted API

The featherless-ai call adds ~500ms to every input check and creates an external dependency. If it goes down, the guardrail degrades to keyword-only. Intentional, but worth knowing.

IP-based violation tracking is spoofable

Proxy headers can be forged. Shared IPs would ban the wrong people. Fine for a demo, not for anything real.

Supabase for both observability and memory

One external service instead of two. Downside is chat logs and user memory share the same table schema, which gets messy to query as data grows.

What I would improve with more time

Stream OSS tokens over SSE. Five seconds of waiting with nothing happening is the worst part of the current UX. Streaming would make it feel way faster even if total time doesn't change.
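As a rough idea of what the streaming side could look like, the generator below wraps tokens in Server-Sent Events frames. The function name and the `[DONE]` sentinel are assumptions, and how the frames get served (web framework, Gradio integration) is left out.

```python
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a Server-Sent Events frame.

    Each frame is one or more "data:" lines followed by a blank line.
    The final "[DONE]" sentinel is a convention assumed here, not part
    of the SSE format itself.
    """
    for token in tokens:
        # A token containing newlines needs one "data:" line per line.
        payload = "\n".join(f"data: {line}" for line in token.split("\n"))
        yield payload + "\n\n"
    yield "data: [DONE]\n\n"


frames = list(sse_events(["Hel", "lo", " world"]))
```

Because the generator is lazy, the first frame can be sent as soon as the model emits its first token instead of after the full response.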

Replace IP tracking with OAuth. Tying violations to a real user identity fixes spoofing, handles shared IPs correctly, and makes bans actually meaningful. Supabase schema stays the same, just swap the IP key for a user ID.

Load Llama Guard locally. Right now it's an API call on every message. It's 8B params, runs on CPU, and loading it locally alongside Qwen would cut the latency and remove the external dependency.

Few-shot prompting for benchmarks. Qwen answered "A" on all 150 benchmark questions. The scores just reflect how often A happened to be correct. Two worked examples in the prompt would fix the formatting failure and probably recover 15-20 percentage points.
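A two-shot prompt along those lines might be built as follows. This is a sketch, not what benchmark_eval.py does: the worked examples are made up, the four-choice layout is an assumption, and the examples deliberately use answers other than "A".

```python
# Two worked examples; the questions are made up for illustration.
FEW_SHOT_EXAMPLES = [
    {
        "question": "Which planet is closest to the Sun?",
        "choices": ["Venus", "Mercury", "Earth", "Mars"],
        "answer": "B",
    },
    {
        "question": "What is 7 * 6?",
        "choices": ["36", "48", "40", "42"],
        "answer": "D",
    },
]

LETTERS = "ABCD"  # Assumes four choices per question.


def format_question(question: str, choices: list[str]) -> str:
    """Render one question with lettered choices and an open answer slot."""
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(question: str, choices: list[str]) -> str:
    """Prepend solved examples so the model sees the one-letter answer format."""
    shots = [
        format_question(ex["question"], ex["choices"]) + " " + ex["answer"]
        for ex in FEW_SHOT_EXAMPLES
    ]
    return "\n\n".join(shots + [format_question(question, choices)])


prompt = build_prompt(
    "Which gas do plants absorb?",
    ["Oxygen", "Carbon dioxide", "Helium", "Neon"],
)
```

The prompt ends on an open `Answer:` line, so a single generated token is enough to read off the model's choice.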

Make the Gradio UI look better. The current one is far from ideal.

Multi-judge consensus. Running responses through Claude, GPT-4.1, and Gemini and flagging disagreements would give more reliable scores than a single judge.
