Train your own LLM from scratch — with every frontier innovation
GPT-4 • DeepSeek V3 • Gemma 2 • Mistral • LLaMA 3 — Zero abstraction. Pure PyTorch.
Quick Start • Architecture • Presets • Generation • LoRA • Alignment • Export • Tutorials
superGPT is a from-scratch LLM training framework implementing every major innovation from GPT-4 through DeepSeek V3, Gemma 2, and Mistral — in readable PyTorch. Train on any text, scale from laptop to GPU cluster, fine-tune with LoRA, align with DPO, export to GGUF.
| Innovation | What It Does | Origin |
|---|---|---|
| 🧠 Multi-head Latent Attention (MLA) | Compresses KV into low-rank latent — ~10x smaller cache | DeepSeek V3 |
| 🔄 Grouped Query Attention (GQA) | Fewer KV heads → faster, less memory | GPT-4, LLaMA |
| 🪟 Sliding Window Attention | O(n·w) attention — handles very long sequences | Mistral |
| 🔄 Alternating Global/Local Layers | Even=full attention, odd=windowed — best of both | Gemma 2 |
| 🛡️ Logit Soft-Capping | Prevents attention logit explosion | Gemma 2 |
| ⚡ Flash Attention | 2-4x faster via PyTorch SDPA backend | FlashAttention-2 |
| 🧩 DeepSeekMoE | Shared + routed experts with sigmoid gating | DeepSeek V3 |
| ⚖️ Aux-Loss-Free Routing | Dynamic bias replaces aux loss | DeepSeek V3 |
| 🔮 Multi-Token Prediction | Predicts N+1, N+2... — denser gradients | DeepSeek V3 |
| 📐 Decoupled RoPE | Separates position from content attention | DeepSeek V3 |
| 🌐 YaRN Context Extension | Extend context window without retraining | LLaMA 3.1, Qwen |
| 🔥 SwiGLU + RMSNorm | Modern FFN + stable normalization | GPT-4, LLaMA |
| 💾 KV-Cache | O(1) per token incremental decoding | Universal |
| 🎯 DPO Alignment | Align with preferences — no reward model | LLaMA 3, Zephyr |
| 🔧 LoRA Fine-tuning | 100x fewer params to train | Microsoft |
| 🏎️ Speculative Decoding | 2-3x faster inference with draft model | Google/DeepMind |
| 🎲 Top-p / Min-p / Rep Penalty | Advanced sampling strategies | All frontier models |
| ✅ Gradient Checkpointing | ~60% memory reduction | Universal |
| 📈 WSD LR Schedule | Warmup-Stable-Decay for better convergence | DeepSeek V3 |
| 📦 GGUF Export | Run your model in llama.cpp / Ollama | llama.cpp |
| 🧬 Knowledge Distillation | Transfer knowledge from large to small model | DeepSeek R1, Qwen |
| 🌐 FSDP + 3D Parallelism | Tensor + Pipeline + Data parallel training | Megatron-LM |
| 🔢 QLoRA (4-bit Training) | Fine-tune 7B models on 8GB VRAM | QLoRA |
| 🎮 PPO / GRPO | Full RLHF — PPO or DeepSeek R1-style GRPO | DeepSeek R1, OpenAI |
| 🚀 Inference Server | Continuous batching + PagedAttention + OpenAI API | vLLM, TGI |
| 📊 Streaming Data | Sharded datasets, HF streaming, cloud-ready | Mosaic, WebDataset |
| 📝 Evaluation Harness | MMLU, HellaSwag, ARC, GSM8K, HumanEval | lm-eval-harness |
| 🔥 DAPO Alignment | Clip-Higher + Dynamic Sampling + Token-Level PG — state-of-the-art RL | ByteDance 2025 |
| ✨ RLVR | RL with auto-verifiable rewards — emergent reasoning, no labels | DeepSeek R1 2025 |
| ⚡ Native Sparse Attention | 3-branch (compress + top-k + window) — 9x faster attention | DeepSeek 2025 |
| 🧬 White-Box KD (CKA) | Match hidden states across different dimensions with CKA | ICLR 2025 |
| 🎯 Mix Distillation | Multi-teacher blending + curriculum learning for small models | arXiv Nov 2025 |
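Many of these are small, self-contained ideas. Gemma 2's logit soft-capping, for example, is a single smooth squashing function; a minimal sketch (the cap of 50 is the value reported for Gemma 2's attention logits, and may differ in this codebase):

```python
import math

def soft_cap(logit: float, cap: float = 50.0) -> float:
    """Smoothly squash a logit into (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)

print(round(soft_cap(1.0), 4))    # near-identity for small logits
print(round(soft_cap(500.0), 1))  # saturates near the cap, so logits cannot explode
```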
```bash
# Clone and setup
git clone https://github.com/viralcode/superGPT.git
cd superGPT
pip install torch numpy

# Prepare data (included Shakespeare dataset, or use your own)
python data/prepare_data.py

# Train a small model (works on CPU/laptop)
python train.py --preset small

# Generate text
python generate.py --prompt "To be or not to be" --interactive
```

Trained on Shakespeare (~1MB of text) with the small preset on a MacBook:
```
$ python generate.py --prompt "To be or not to be" --top-p 0.9
To be or not to be ta'en of the tomb:
I'll pay not to see your honour's love.
LADY CAPULET:
You would have you sorrow to my heart did lie.
Nurse:
And that's the prince still tell you have said
And you for your mistre
```

```
$ python generate.py --prompt "ROMEO:" --top-p 0.9 --min-p 0.05 --rep-penalty 1.1
ROMEO:
Ay, so much lengthen'd with such a happy great father.
JULIET:
I would you call thee that he is so,
So many some content to the balm of Edward;
That had not fly thee to shake the noble duke.
```
10.6M params • val loss 1.479 • 49 tokens/sec on CPU • trained in ~45 min
The small preset was trained on the included Shakespeare dataset (data/input.txt, ~1MB) on a MacBook (CPU only):
| Metric | Value |
|---|---|
| Model | small preset — 6 layers, 6 heads, 384 dim |
| Parameters | 10.6M |
| Training data | Tiny Shakespeare (1.1MB, ~300K tokens) |
| Tokenizer | Character-level (vocab_size=65) |
| Batch size | 32 |
| Max iterations | 5,000 |
| Best val loss | 1.479 (at iteration 1,500) |
| Training time | ~45 min on CPU (Apple M-series) |
| Inference speed | 49 tokens/sec with KV-cache |
The model learns Shakespeare's writing style, character names (ROMEO, JULIET, LADY CAPULET), dialogue structure, and poetic phrasing — all from just ~1MB of text.
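The 10.6M figure can be sanity-checked with the standard transformer parameter rule of thumb. This is a back-of-envelope estimate, not the repo's exact count: it ignores norm weights, assumes a tied output head, and uses the usual 4d² attention plus 8d² FFN split per layer (which a SwiGLU FFN with a 2/3-scaled hidden width also lands on):

```python
# small preset: 6 layers, 384-dim model, character vocab of 65
n_layer, d_model, vocab_size = 6, 384, 65

block_params = 12 * n_layer * d_model ** 2  # attention (4d^2) + FFN (8d^2) per layer
embed_params = vocab_size * d_model         # token embedding (tied output head)

total = block_params + embed_params
print(f"~{total / 1e6:.1f}M parameters")    # matches the table's 10.6M
```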
```bash
# Basic training
python train.py --preset small --max-iters 5000

# Memory-efficient (saves ~60% VRAM)
python train.py --preset large --gradient-checkpointing

# DeepSeek V3's learning rate schedule
python train.py --preset medium --lr-schedule wsd

# Custom learning rate and batch size
python train.py --preset medium --lr 1e-4 --batch-size 128

# Resume from checkpoint
python train.py --preset small --resume checkpoints/latest.pt

# Multi-GPU with FSDP
torchrun --nproc_per_node=4 train.py --preset xl --distributed

# Compile for maximum speed (PyTorch 2.0+)
python train.py --preset medium --compile
```

Train on your own text:

```bash
python data/prepare_data.py --input your_textfile.txt
python train.py --preset medium
```

| Preset | Params | Attention | MoE | Special | Best For |
|---|---|---|---|---|---|
| `small` | ~35M | MHA | — | — | CPU / laptop |
| `medium` | ~125M | GQA 12Q/4KV | — | — | Single GPU |
| `large` | ~333M | GQA 16Q/4KV | — | — | A100/4090 |
| `xl` | ~1.3B | GQA 16Q/8KV | — | — | Multi-GPU |
| `gpt4` | ~100B | GQA 32Q/8KV | 8×top-2 | — | GPU cluster |
| `deepseek` | variable | MLA | 64×top-6+2shared | aux-free, MTP | GPU cluster |
| `mistral` | ~7B | GQA 32Q/8KV | — | sliding window 4K | GPU cluster |
| `gemma2` | ~2.7B | GQA 16Q/4KV | — | alternating layers, logit cap | GPU cluster |
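In the Attention column, nQ/mKV means n query heads share m KV heads. The practical win is KV-cache size, which scales with the number of KV heads rather than query heads; a sketch with illustrative dimensions (not measured from these presets):

```python
def kv_cache_bytes(n_layer, n_kv_heads, head_dim, seq_len, bytes_per_el=2):
    """KV-cache size: 2 tensors (K and V) per layer, per KV head, in fp16."""
    return 2 * n_layer * n_kv_heads * head_dim * seq_len * bytes_per_el

# Hypothetical 12-layer model, head_dim 64, 4K context, fp16 cache
mha = kv_cache_bytes(n_layer=12, n_kv_heads=12, head_dim=64, seq_len=4096)
gqa = kv_cache_bytes(n_layer=12, n_kv_heads=4,  head_dim=64, seq_len=4096)
print(mha // 2**20, "MiB (MHA) vs", gqa // 2**20, "MiB (GQA)")  # 3x smaller cache
```

MLA goes further by caching a low-rank latent instead of full K/V, which is where the ~10x figure in the feature table comes from.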
```bash
# Scale up as your hardware allows
python train.py --preset small                           # Laptop
python train.py --preset medium                          # 1× GPU
python train.py --preset large --gradient-checkpointing  # Memory-efficient

# Training options
python train.py --preset medium --lr-schedule wsd                  # DeepSeek V3 LR schedule
python train.py --preset large --gradient-checkpointing --compile  # Max efficiency

# Multi-GPU with FSDP
torchrun --nproc_per_node=4 train.py --preset xl --distributed
torchrun --nproc_per_node=8 train.py --preset deepseek --distributed
```

```bash
# Standard generation
python generate.py --prompt "Once upon a time" --interactive

# Advanced sampling
python generate.py --prompt "Once" --top-p 0.9 --min-p 0.05 --rep-penalty 1.2

# Speculative decoding (2-3x faster!)
# Train a small draft model first, then:
python generate.py --draft-checkpoint checkpoints/small.pt --spec-k 5
```

| Strategy | Flag | Description |
|---|---|---|
| Top-k | `--top-k 50` | Keep the k highest-probability tokens |
| Top-p (nucleus) | `--top-p 0.9` | Keep tokens until cumulative probability reaches p |
| Min-p | `--min-p 0.05` | Filter tokens below 5% of the max probability |
| Repetition penalty | `--rep-penalty 1.2` | Reduce the probability of repeated tokens |
| Temperature | `--temperature 0.8` | Control randomness (lower = more deterministic, higher = more diverse) |
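How these filters compose can be sketched in a few lines of plain Python. This is a simplified illustration, not the repo's generate.py; a real sampler draws a token from the surviving set rather than returning it:

```python
import math

def sample_filter(logits, temperature=0.8, top_p=0.9, min_p=0.05):
    """Return the token ids that survive temperature + min-p + top-p filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Min-p: drop tokens below min_p * max probability
    floor = min_p * max(probs)
    kept = [(p, i) for i, p in enumerate(probs) if p >= floor]

    # Top-p (nucleus): keep highest-prob tokens until cumulative mass >= top_p
    kept.sort(reverse=True)
    out, cum = [], 0.0
    for p, i in kept:
        out.append(i)
        cum += p
        if cum >= top_p:
            break
    return out

print(sample_filter([4.0, 3.0, 1.0, 0.5]))  # → [0, 1]
```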
Fine-tune with only ~1-3% trainable parameters:
```bash
# Fine-tune a pre-trained model
python finetune.py --checkpoint checkpoints/best.pt --data data/ --lora-rank 16

# Custom LoRA settings
python finetune.py --checkpoint best.pt --data data/ --lora-rank 32 --lora-alpha 64

# Generate with fine-tuned model
python generate.py --checkpoint checkpoints/finetuned_merged.pt --interactive
```

Align your model with human preferences using DPO:
```bash
# Create preference data (JSONL):
# {"prompt": "...", "chosen": "good response", "rejected": "bad response"}
python align.py --checkpoint checkpoints/best.pt --data preferences.jsonl
python generate.py --checkpoint checkpoints/aligned.pt --interactive
```

Export to GGUF format for use with llama.cpp, Ollama, or LM Studio:
```bash
# FP16 (full quality)
python export.py --checkpoint best.pt --output model-fp16.gguf

# Q8_0 (8-bit quantized, good quality, smaller)
python export.py --checkpoint best.pt --output model-q8.gguf --quantize q8_0

# Q4_0 (4-bit quantized, smallest, fastest)
python export.py --checkpoint best.pt --output model-q4.gguf --quantize q4_0
```

Extend your model's context window at inference time without retraining:
```python
from config import GPTConfig

config = GPTConfig(
    ...,
    rope_scaling_type="yarn",   # or "linear"
    rope_scaling_factor=4.0,    # 4x context: 4K → 16K
)
```

Transfer knowledge from a large teacher model to a smaller student model. Supports both HuggingFace models (Qwen, LLaMA, Mistral) and superGPT checkpoints.
```bash
# Distill from Qwen (requires: pip install transformers)
python distill.py --hf-teacher Qwen/Qwen2.5-0.5B --student-preset small --data data/

# Distill from LLaMA
python distill.py --hf-teacher meta-llama/Llama-3.2-1B --student-preset medium

# Distill from a larger superGPT model
python distill.py --teacher checkpoints/large.pt --student-preset small --data data/

# Custom temperature and balance
python distill.py --hf-teacher Qwen/Qwen2.5-0.5B --temperature 3.0 --alpha 0.7

# Generate with the distilled model
python generate.py --checkpoint checkpoints/distilled_best.pt --interactive
```

Recommended HuggingFace teachers:
| Model | Size | Best For |
|---|---|---|
| `Qwen/Qwen2.5-0.5B` | 500M | Quick experiments, CPU-friendly |
| `Qwen/Qwen2.5-1.5B` | 1.5B | Good quality, single GPU |
| `meta-llama/Llama-3.2-1B` | 1B | Strong baseline |
| `mistralai/Mistral-7B-v0.3` | 7B | High quality, needs GPU |
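The --temperature and --alpha flags above suggest the classic distillation recipe: blend a temperature-softened teacher-matching term with ordinary cross-entropy on the labels. A minimal sketch of that loss (an illustration of the idea, not necessarily distill.py's exact formulation):

```python
import math

def softmax(logits, t=1.0):
    m = max(logits)
    exps = [math.exp((l - m) / t) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, target, temperature=3.0, alpha=0.7):
    """alpha * soft (teacher-matching) loss + (1 - alpha) * hard CE loss."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes comparable
    soft = temperature ** 2 * sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    hard = -math.log(softmax(student_logits)[target])  # cross-entropy on the label
    return alpha * soft + (1 - alpha) * hard

loss = distill_loss([2.0, 1.0, 0.1], [2.5, 0.8, 0.2], target=0)
print(round(loss, 4))
```

A higher temperature spreads the teacher's probability mass over more tokens, exposing the "dark knowledge" in its near-miss predictions.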
Align your model with reinforcement learning from human feedback:
```bash
# Train a reward model from preference data
python rlhf.py reward --checkpoint best.pt --data preferences.jsonl

# GRPO alignment (DeepSeek R1 style, no value model needed)
python rlhf.py grpo --checkpoint best.pt --reward-model reward.pt

# GRPO with rule-based rewards (no reward model needed)
python rlhf.py grpo --checkpoint best.pt --rule-reward length
```

Fine-tune large models on consumer GPUs with 4-bit quantized LoRA:
```python
from lora import apply_qlora

model = GPT(config)
apply_qlora(model, rank=16)  # Base weights -> NF4 (4-bit), LoRA in FP16
# Fine-tune 7B models on 8GB VRAM
```

Serve your model with an OpenAI-compatible API:
```bash
python serve.py --checkpoint best.pt --port 8000

# Query it
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "To be or not to be", "max_tokens": 100, "stream": true}'
```

Features: continuous batching, PagedAttention, SSE streaming.
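PagedAttention's core trick is a block table: each sequence's KV cache lives in fixed-size blocks allocated on demand, so a short request never reserves max-context memory. A toy allocator sketch (illustrative only, not serve.py's implementation):

```python
class PagedKVCache:
    """Toy block-table allocator: logical token positions map to fixed-size
    physical blocks, allocated lazily instead of as one contiguous slab."""

    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> tokens written so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(n_blocks=64, block_size=16)
for _ in range(40):  # a 40-token sequence needs ceil(40/16) = 3 blocks
    cache.append_token("req-1")
print(len(cache.tables["req-1"]), "blocks,", len(cache.free), "free")
```

Continuous batching then amounts to admitting a new request whenever free blocks exist, rather than waiting for the whole batch to finish.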
Train massive models across GPU clusters:
```bash
# 8 GPUs: 2-way tensor parallel × 4-way pipeline parallel
torchrun --nproc_per_node=8 train.py --preset xl \
    --tensor-parallel 2 --pipeline-parallel 4
```

Train on multi-terabyte datasets without loading them into memory:
```bash
# Shard a dataset
python streaming.py shard --input data/train.bin --n-shards 64 --output data/shards/

# Stream from HuggingFace
python train.py --hf-dataset HuggingFaceFW/fineweb --streaming
```

Benchmark your model on standard LLM evaluations:
```bash
# Run all benchmarks (MMLU, HellaSwag, ARC, GSM8K, TruthfulQA, HumanEval)
python evaluate.py --checkpoint best.pt

# Specific benchmarks with few-shot
python evaluate.py --checkpoint best.pt --benchmarks mmlu gsm8k --n-shot 5 --output results.json
```

```
superGPT/
├── model.py         # MLA, GQA, sliding window, Flash Attn, MoE, MTP, KV-cache,
│                    # RoPE+YaRN, SwiGLU, speculative decoding, grad checkpointing
├── config.py        # All hyperparameters + presets (small → gemma2)
├── train.py         # Training (AdamW, cosine/WSD LR, FSDP, grad ckpt, mixed prec)
├── generate.py      # Generation (top-k/p, min-p, rep penalty, speculative decoding)
├── align.py         # DPO alignment from preference pairs
├── distill.py       # Knowledge distillation (teacher → student, HuggingFace support)
├── lora.py          # LoRA + QLoRA (4-bit NF4 quantized training)
├── finetune.py      # LoRA / QLoRA fine-tuning script
├── export.py        # GGUF export (FP16, Q8_0, Q4_0)
├── serve.py         # HTTP inference server (continuous batching, PagedAttention)
├── parallel.py      # 3D Parallelism (tensor + pipeline parallel)
├── streaming.py     # Streaming data pipelines (sharded, HuggingFace, text glob)
├── rlhf.py          # RLHF: PPO + GRPO (DeepSeek R1 style)
├── evaluate.py      # Benchmark harness (MMLU, HellaSwag, ARC, GSM8K, HumanEval)
├── data/
│   └── prepare_data.py  # Tokenization (tiktoken BPE or character-level)
└── requirements.txt
```
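The character-level option in data/prepare_data.py boils down to a bijection between characters and integer ids (a sketch of the idea; the repo's script may differ in details). Applied to the full Shakespeare file, the same construction yields the vocab_size=65 shown earlier:

```python
# Build a character vocabulary from a (tiny) corpus
text = "To be or not to be"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

assert decode(encode("to be")) == "to be"  # round-trips exactly
print(len(chars), "symbol vocabulary")
```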
**This is:** a from-scratch LLM framework implementing every major innovation from GPT-4 through the latest frontier models. Every feature is written in readable PyTorch — no hidden abstractions.

**This isn't:** a pretrained model. The architecture is frontier-level, but producing a ChatGPT-quality model requires trillions of tokens and thousands of GPUs. This gives you the complete blueprint; you provide the compute.
- DeepSeek-V3 Technical Report — MLA, DeepSeekMoE, MTP, WSD schedule
- Gemma 2 Technical Report — Alternating attention, logit soft-capping
- Mistral 7B — Sliding window attention
- GPT-4 Technical Report — MoE, GQA
- LLaMA 2 — GQA, SwiGLU, RMSNorm, RoPE
- YaRN — Context extension via RoPE scaling
- LoRA — Low-rank adaptation
- DPO — Direct Preference Optimization
- Speculative Decoding — Draft-verify acceleration
- QLoRA — 4-bit quantized fine-tuning
- PPO — Proximal Policy Optimization
- GRPO — Group Relative Policy Optimization (DeepSeek R1)
- PagedAttention — Efficient KV-cache management
- Megatron-LM — 3D parallelism
- nanoGPT — Inspiration
📚 In-depth guides for training frontier LLMs:
| Tutorial | Description |
|---|---|
| Getting Started | Complete guide to superGPT — installation, architecture, all model presets, data preparation, training, text generation, LoRA fine-tuning, distillation, multi-GPU FSDP, and troubleshooting. |
| Training Data Guide | How to prepare training data from scratch — web crawling, text extraction, quality filtering, deduplication, cleaning, custom data from GitHub/Google/PDFs, synthetic data generation (Magpie, Evol-Instruct), tokenization, data mixing, and curriculum learning. |
| Instruction Tuning & Chat | Turn a base model into ChatGPT — the complete 4-stage pipeline: SFT with LoRA, DPO alignment, RLHF/GRPO, RLVR (DeepSeek-R1 style). Includes 20+ instruction datasets, chat templates, OpenAI-compatible serving, and reasoning model training. |
| Deploy on RunPod | Step-by-step guide to renting cloud GPUs on RunPod and training superGPT models — GPU selection, SSH setup, background training, monitoring, downloading checkpoints, multi-GPU, and cost optimization. |
MIT