Train your own LLM from scratch — with every frontier innovation
GPT-4 • DeepSeek V3 • Gemma 2 • Mistral • LLaMA 3 — Zero abstraction. Pure PyTorch.
Quick Start • Architecture • Presets • Generation • LoRA • Alignment • Export • Tutorials
superGPT is a from-scratch LLM training framework implementing every major innovation from GPT-4 through DeepSeek V3, Gemma 2, and Mistral — in readable PyTorch. Train on any text, scale from laptop to GPU cluster, fine-tune with LoRA, align with DPO, export to GGUF.
| Innovation | What It Does | Origin |
|---|---|---|
| 🧠 Multi-head Latent Attention (MLA) | Compresses KV into low-rank latent — ~10x smaller cache | DeepSeek V3 |
| 🔄 Grouped Query Attention (GQA) | Fewer KV heads → faster, less memory | GPT-4, LLaMA |
| 🪟 Sliding Window Attention | O(n·w) attention — handles very long sequences | Mistral |
| 🔄 Alternating Global/Local Layers | Even=full attention, odd=windowed — best of both | Gemma 2 |
| 🛡️ Logit Soft-Capping | Prevents attention logit explosion | Gemma 2 |
| ⚡ Flash Attention | 2-4x faster via PyTorch SDPA backend | FlashAttention-2 |
| 🧩 DeepSeekMoE | Shared + routed experts with sigmoid gating | DeepSeek V3 |
| ⚖️ Aux-Loss-Free Routing | Dynamic bias replaces aux loss | DeepSeek V3 |
| 🔮 Multi-Token Prediction | Predicts N+1, N+2... — denser gradients | DeepSeek V3 |
| 📐 Decoupled RoPE | Separates position from content attention | DeepSeek V3 |
| 🌐 YaRN Context Extension | Extend context window without retraining | LLaMA 3.1, Qwen |
| 🔥 SwiGLU + RMSNorm | Modern FFN + stable normalization | GPT-4, LLaMA |
| 💾 KV-Cache | O(1) per token incremental decoding | Universal |
| 🎯 DPO Alignment | Align with preferences — no reward model | LLaMA 3, Zephyr |
| 🔧 LoRA Fine-tuning | 100x fewer params to train | Microsoft |
| 🏎️ Speculative Decoding | 2-3x faster inference with draft model | Google/DeepMind |
| 🎲 Top-p / Min-p / Rep Penalty | Advanced sampling strategies | All frontier models |
| ✅ Gradient Checkpointing | ~60% memory reduction | Universal |
| 📈 WSD LR Schedule | Warmup-Stable-Decay for better convergence | DeepSeek V3 |
| 📦 GGUF Export | Run your model in llama.cpp / Ollama | llama.cpp |
| 🧬 Knowledge Distillation | Transfer knowledge from large to small model | DeepSeek R1, Qwen |
| 🌐 FSDP + 3D Parallelism | Tensor + Pipeline + Data parallel training | Megatron-LM |
| 🔢 QLoRA (4-bit Training) | Fine-tune 7B models on 8GB VRAM | QLoRA |
| 🎮 PPO / GRPO | Full RLHF — PPO or DeepSeek R1-style GRPO | DeepSeek R1, OpenAI |
| 🚀 Inference Server | Continuous batching + PagedAttention + OpenAI API | vLLM, TGI |
| 📊 Streaming Data | Sharded datasets, HF streaming, cloud-ready | Mosaic, WebDataset |
| 📝 Evaluation Harness | MMLU, HellaSwag, ARC, GSM8K, HumanEval | lm-eval-harness |
| 🔥 DAPO Alignment | Clip-Higher + Dynamic Sampling + Token-Level PG — state-of-the-art RL | ByteDance 2025 |
| ✨ RLVR | RL with auto-verifiable rewards — emergent reasoning, no labels | DeepSeek R1 2025 |
| ⚡ Native Sparse Attention | 3-branch (compress + top-k + window) — 9x faster attention | DeepSeek 2025 |
| 🧬 White-Box KD (CKA) | Match hidden states across different dimensions with CKA | ICLR 2025 |
| 🎯 Mix Distillation | Multi-teacher blending + curriculum learning for small models | arXiv Nov 2025 |
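Many of these are small, self-contained ideas. Gemma 2's logit soft-capping, for example, is a single smooth squashing function; a minimal sketch (the cap of 50 is the value reported for Gemma 2's attention logits, and may differ in this codebase):

```python
import math

def soft_cap(logit: float, cap: float = 50.0) -> float:
    """Smoothly squash a logit into (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)

print(round(soft_cap(1.0), 4))    # near-identity for small logits
print(round(soft_cap(500.0), 1))  # saturates near the cap, so logits cannot explode
```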
```bash
# Clone and setup
git clone https://github.com/viralcode/superGPT.git
cd superGPT
pip install torch numpy

# Prepare data (included Shakespeare dataset, or use your own)
python data/prepare_data.py

# Train a small model (works on CPU/laptop)
python train.py --preset small

# Generate text
python generate.py --prompt "To be or not to be" --interactive
```

Trained on Shakespeare (~1MB of text) with the small preset on a MacBook:
```
$ python generate.py --prompt "To be or not to be" --top-p 0.9
To be or not to be ta'en of the tomb:
I'll pay not to see your honour's love.
LADY CAPULET:
You would have you sorrow to my heart did lie.
Nurse:
And that's the prince still tell you have said
And you for your mistre
```

```
$ python generate.py --prompt "ROMEO:" --top-p 0.9 --min-p 0.05 --rep-penalty 1.1
ROMEO:
Ay, so much lengthen'd with such a happy great father.
JULIET:
I would you call thee that he is so,
So many some content to the balm of Edward;
That had not fly thee to shake the noble duke.
```
10.6M params • val loss 1.479 • 49 tokens/sec on CPU • trained in ~45 min
The small preset was trained on the included Shakespeare dataset (data/input.txt, ~1MB) on a MacBook (CPU only):
| Metric | Value |
|---|---|
| Model | small preset — 6 layers, 6 heads, 384 dim |
| Parameters | 10.6M |
| Training data | Tiny Shakespeare (1.1MB, ~300K tokens) |
| Tokenizer | Character-level (vocab_size=65) |
| Batch size | 32 |
| Max iterations | 5,000 |
| Best val loss | 1.479 (at iteration 1,500) |
| Training time | ~45 min on CPU (Apple M-series) |
| Inference speed | 49 tokens/sec with KV-cache |
The model learns Shakespeare's writing style, character names (ROMEO, JULIET, LADY CAPULET), dialogue structure, and poetic phrasing — all from just ~1MB of text.
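The 10.6M figure can be sanity-checked with the standard transformer parameter rule of thumb. This is a back-of-envelope estimate, not the repo's exact count: it ignores norm weights, assumes a tied output head, and uses the usual 4d² attention plus 8d² FFN split per layer (which a SwiGLU FFN with a 2/3-scaled hidden width also lands on):

```python
# small preset: 6 layers, 384-dim model, character vocab of 65
n_layer, d_model, vocab_size = 6, 384, 65

block_params = 12 * n_layer * d_model ** 2  # attention (4d^2) + FFN (8d^2) per layer
embed_params = vocab_size * d_model         # token embedding (tied output head)

total = block_params + embed_params
print(f"~{total / 1e6:.1f}M parameters")    # matches the table's 10.6M
```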
```bash
# Basic training
python train.py --preset small --max-iters 5000

# Memory-efficient (saves ~60% VRAM)
python train.py --preset large --gradient-checkpointing

# DeepSeek V3's learning rate schedule
python train.py --preset medium --lr-schedule wsd

# Custom learning rate and batch size
python train.py --preset medium --lr 1e-4 --batch-size 128

# Resume from checkpoint
python train.py --preset small --resume checkpoints/latest.pt

# Multi-GPU with FSDP
torchrun --nproc_per_node=4 train.py --preset xl --distributed

# Compile for maximum speed (PyTorch 2.0+)
python train.py --preset medium --compile
```

Train on your own text:

```bash
python data/prepare_data.py --input your_textfile.txt
python train.py --preset medium
```

| Preset | Params | Attention | MoE | Special | Best For |
|---|---|---|---|---|---|
| `small` | ~35M | MHA | — | — | CPU / laptop |
| `medium` | ~125M | GQA 12Q/4KV | — | — | Single GPU |
| `large` | ~333M | GQA 16Q/4KV | — | — | A100/4090 |
| `xl` | ~1.3B | GQA 16Q/8KV | — | — | Multi-GPU |
| `gpt4` | ~100B | GQA 32Q/8KV | 8×top-2 | — | GPU cluster |
| `deepseek` | variable | MLA | 64×top-6+2shared | aux-free, MTP | GPU cluster |
| `mistral` | ~7B | GQA 32Q/8KV | — | sliding window 4K | GPU cluster |
| `gemma2` | ~2.7B | GQA 16Q/4KV | — | alternating layers, logit cap | GPU cluster |
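In the Attention column, nQ/mKV means n query heads share m KV heads. The practical win is KV-cache size, which scales with the number of KV heads rather than query heads; a sketch with illustrative dimensions (not measured from these presets):

```python
def kv_cache_bytes(n_layer, n_kv_heads, head_dim, seq_len, bytes_per_el=2):
    """KV-cache size: 2 tensors (K and V) per layer, per KV head, in fp16."""
    return 2 * n_layer * n_kv_heads * head_dim * seq_len * bytes_per_el

# Hypothetical 12-layer model, head_dim 64, 4K context, fp16 cache
mha = kv_cache_bytes(n_layer=12, n_kv_heads=12, head_dim=64, seq_len=4096)
gqa = kv_cache_bytes(n_layer=12, n_kv_heads=4,  head_dim=64, seq_len=4096)
print(mha // 2**20, "MiB (MHA) vs", gqa // 2**20, "MiB (GQA)")  # 3x smaller cache
```

MLA goes further by caching a low-rank latent instead of full K/V, which is where the ~10x figure in the feature table comes from.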
```bash
# Scale up as your hardware allows
python train.py --preset small                           # Laptop
python train.py --preset medium                          # 1× GPU
python train.py --preset large --gradient-checkpointing  # Memory-efficient

# Training options
python train.py --preset medium --lr-schedule wsd                  # DeepSeek V3 LR schedule
python train.py --preset large --gradient-checkpointing --compile  # Max efficiency

# Multi-GPU with FSDP
torchrun --nproc_per_node=4 train.py --preset xl --distributed
torchrun --nproc_per_node=8 train.py --preset deepseek --distributed
```

```bash
# Standard generation
python generate.py --prompt "Once upon a time" --interactive

# Advanced sampling
python generate.py --prompt "Once" --top-p 0.9 --min-p 0.05 --rep-penalty 1.2

# Speculative decoding (2-3x faster!)
# Train a small draft model first, then:
python generate.py --draft-checkpoint checkpoints/small.pt --spec-k 5
```

| Strategy | Flag | Description |
|---|---|---|
| Top-k | `--top-k 50` | Keep the k highest-probability tokens |
| Top-p (nucleus) | `--top-p 0.9` | Keep tokens until cumulative probability reaches p |
| Min-p | `--min-p 0.05` | Filter tokens below 5% of the max probability |
| Repetition penalty | `--rep-penalty 1.2` | Reduce the probability of repeated tokens |
| Temperature | `--temperature 0.8` | Control randomness (lower = more deterministic, higher = more diverse) |
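How these filters compose can be sketched in a few lines of plain Python. This is a simplified illustration, not the repo's generate.py; a real sampler draws a token from the surviving set rather than returning it:

```python
import math

def sample_filter(logits, temperature=0.8, top_p=0.9, min_p=0.05):
    """Return the token ids that survive temperature + min-p + top-p filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Min-p: drop tokens below min_p * max probability
    floor = min_p * max(probs)
    kept = [(p, i) for i, p in enumerate(probs) if p >= floor]

    # Top-p (nucleus): keep highest-prob tokens until cumulative mass >= top_p
    kept.sort(reverse=True)
    out, cum = [], 0.0
    for p, i in kept:
        out.append(i)
        cum += p
        if cum >= top_p:
            break
    return out

print(sample_filter([4.0, 3.0, 1.0, 0.5]))  # → [0, 1]
```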
Fine-tune with only ~1-3% trainable parameters:
```bash
# Fine-tune a pre-trained model
python finetune.py --checkpoint checkpoints/best.pt --data data/ --lora-rank 16

# Custom LoRA settings
python finetune.py --checkpoint best.pt --data data/ --lora-rank 32 --lora-alpha 64

# Generate with fine-tuned model
python generate.py --checkpoint checkpoints/finetuned_merged.pt --interactive
```

Align your model with human preferences using DPO:
```bash
# Create preference data (JSONL):
# {"prompt": "...", "chosen": "good response", "rejected": "bad response"}
python align.py --checkpoint checkpoints/best.pt --data preferences.jsonl
python generate.py --checkpoint checkpoints/aligned.pt --interactive
```

Export to GGUF format for use with llama.cpp, Ollama, or LM Studio:
```bash
# FP16 (full quality)
python export.py --checkpoint best.pt --output model-fp16.gguf

# Q8_0 (8-bit quantized, good quality, smaller)
python export.py --checkpoint best.pt --output model-q8.gguf --quantize q8_0

# Q4_0 (4-bit quantized, smallest, fastest)
python export.py --checkpoint best.pt --output model-q4.gguf --quantize q4_0
```

Extend your model's context window at inference time without retraining:
```python
from config import GPTConfig

config = GPTConfig(
    ...,
    rope_scaling_type="yarn",   # or "linear"
    rope_scaling_factor=4.0,    # 4x context: 4K → 16K
)
```

Transfer knowledge from a large teacher model to a smaller student model. Supports both HuggingFace models (Qwen, LLaMA, Mistral) and superGPT checkpoints.
```bash
# Distill from Qwen (requires: pip install transformers)
python distill.py --hf-teacher Qwen/Qwen2.5-0.5B --student-preset small --data data/

# Distill from LLaMA
python distill.py --hf-teacher meta-llama/Llama-3.2-1B --student-preset medium

# Distill from a larger superGPT model
python distill.py --teacher checkpoints/large.pt --student-preset small --data data/

# Custom temperature and balance
python distill.py --hf-teacher Qwen/Qwen2.5-0.5B --temperature 3.0 --alpha 0.7

# Generate with the distilled model
python generate.py --checkpoint checkpoints/distilled_best.pt --interactive
```

Recommended HuggingFace teachers:
| Model | Size | Best For |
|---|---|---|
| `Qwen/Qwen2.5-0.5B` | 500M | Quick experiments, CPU-friendly |
| `Qwen/Qwen2.5-1.5B` | 1.5B | Good quality, single GPU |
| `meta-llama/Llama-3.2-1B` | 1B | Strong baseline |
| `mistralai/Mistral-7B-v0.3` | 7B | High quality, needs GPU |
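The --temperature and --alpha flags above suggest the classic distillation recipe: blend a temperature-softened teacher-matching term with ordinary cross-entropy on the labels. A minimal sketch of that loss (an illustration of the idea, not necessarily distill.py's exact formulation):

```python
import math

def softmax(logits, t=1.0):
    m = max(logits)
    exps = [math.exp((l - m) / t) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, target, temperature=3.0, alpha=0.7):
    """alpha * soft (teacher-matching) loss + (1 - alpha) * hard CE loss."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes comparable
    soft = temperature ** 2 * sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    hard = -math.log(softmax(student_logits)[target])  # cross-entropy on the label
    return alpha * soft + (1 - alpha) * hard

loss = distill_loss([2.0, 1.0, 0.1], [2.5, 0.8, 0.2], target=0)
print(round(loss, 4))
```

A higher temperature spreads the teacher's probability mass over more tokens, exposing the "dark knowledge" in its near-miss predictions.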
Align your model with reinforcement learning from human feedback:
```bash
# Train a reward model from preference data
python rlhf.py reward --checkpoint best.pt --data preferences.jsonl

# GRPO alignment (DeepSeek R1 style, no value model needed)
python rlhf.py grpo --checkpoint best.pt --reward-model reward.pt

# GRPO with rule-based rewards (no reward model needed)
python rlhf.py grpo --checkpoint best.pt --rule-reward length
```

Fine-tune large models on consumer GPUs with 4-bit quantized LoRA:
```python
from lora import apply_qlora

model = GPT(config)
apply_qlora(model, rank=16)  # Base weights -> NF4 (4-bit), LoRA in FP16
# Fine-tune 7B models on 8GB VRAM
```

Serve your model with an OpenAI-compatible API:
```bash
python serve.py --checkpoint best.pt --port 8000

# Query it
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "To be or not to be", "max_tokens": 100, "stream": true}'
```

Features: continuous batching, PagedAttention, SSE streaming.
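PagedAttention's core trick is a block table: each sequence's KV cache lives in fixed-size blocks allocated on demand, so a short request never reserves max-context memory. A toy allocator sketch (illustrative only, not serve.py's implementation):

```python
class PagedKVCache:
    """Toy block-table allocator: logical token positions map to fixed-size
    physical blocks, allocated lazily instead of as one contiguous slab."""

    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> tokens written so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(n_blocks=64, block_size=16)
for _ in range(40):  # a 40-token sequence needs ceil(40/16) = 3 blocks
    cache.append_token("req-1")
print(len(cache.tables["req-1"]), "blocks,", len(cache.free), "free")
```

Continuous batching then amounts to admitting a new request whenever free blocks exist, rather than waiting for the whole batch to finish.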
Train massive models across GPU clusters:
```bash
# 8 GPUs: 2-way tensor parallel × 4-way pipeline parallel
torchrun --nproc_per_node=8 train.py --preset xl \
    --tensor-parallel 2 --pipeline-parallel 4
```

Train on multi-terabyte datasets without loading them into memory:
```bash
# Shard a dataset
python streaming.py shard --input data/train.bin --n-shards 64 --output data/shards/

# Stream from HuggingFace
python train.py --hf-dataset HuggingFaceFW/fineweb --streaming
```

Benchmark your model on standard LLM evaluations:
```bash
# Run all benchmarks (MMLU, HellaSwag, ARC, GSM8K, TruthfulQA, HumanEval)
python evaluate.py --checkpoint best.pt

# Specific benchmarks with few-shot
python evaluate.py --checkpoint best.pt --benchmarks mmlu gsm8k --n-shot 5 --output results.json
```

```
superGPT/
├── model.py         # MLA, GQA, sliding window, Flash Attn, MoE, MTP, KV-cache,
│                    # RoPE+YaRN, SwiGLU, speculative decoding, grad checkpointing
├── config.py        # All hyperparameters + presets (small → gemma2)
├── train.py         # Training (AdamW, cosine/WSD LR, FSDP, grad ckpt, mixed prec)
├── generate.py      # Generation (top-k/p, min-p, rep penalty, speculative decoding)
├── align.py         # DPO alignment from preference pairs
├── distill.py       # Knowledge distillation (teacher → student, HuggingFace support)
├── lora.py          # LoRA + QLoRA (4-bit NF4 quantized training)
├── finetune.py      # LoRA / QLoRA fine-tuning script
├── export.py        # GGUF export (FP16, Q8_0, Q4_0)
├── serve.py         # HTTP inference server (continuous batching, PagedAttention)
├── parallel.py      # 3D Parallelism (tensor + pipeline parallel)
├── streaming.py     # Streaming data pipelines (sharded, HuggingFace, text glob)
├── rlhf.py          # RLHF: PPO + GRPO (DeepSeek R1 style)
├── evaluate.py      # Benchmark harness (MMLU, HellaSwag, ARC, GSM8K, HumanEval)
├── data/
│   └── prepare_data.py  # Tokenization (tiktoken BPE or character-level)
└── requirements.txt
```
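The character-level option in data/prepare_data.py boils down to a bijection between characters and integer ids (a sketch of the idea; the repo's script may differ in details). Applied to the full Shakespeare file, the same construction yields the vocab_size=65 shown earlier:

```python
# Build a character vocabulary from a (tiny) corpus
text = "To be or not to be"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

assert decode(encode("to be")) == "to be"  # round-trips exactly
print(len(chars), "symbol vocabulary")
```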
**This is:** a from-scratch LLM framework implementing every major innovation from GPT-4 through the latest frontier models. Every feature is written in readable PyTorch — no hidden abstractions.

**This isn't:** a pretrained model. The architecture is frontier-level, but producing a ChatGPT-quality model requires trillions of tokens and thousands of GPUs. This gives you the complete blueprint; you provide the compute.
- DeepSeek-V3 Technical Report — MLA, DeepSeekMoE, MTP, WSD schedule
- Gemma 2 Technical Report — Alternating attention, logit soft-capping
- Mistral 7B — Sliding window attention
- GPT-4 Technical Report — MoE, GQA
- LLaMA 2 — GQA, SwiGLU, RMSNorm, RoPE
- YaRN — Context extension via RoPE scaling
- LoRA — Low-rank adaptation
- DPO — Direct Preference Optimization
- Speculative Decoding — Draft-verify acceleration
- QLoRA — 4-bit quantized fine-tuning
- PPO — Proximal Policy Optimization
- GRPO — Group Relative Policy Optimization (DeepSeek R1)
- PagedAttention — Efficient KV-cache management
- Megatron-LM — 3D parallelism
- nanoGPT — Inspiration
📚 In-depth guides for training frontier LLMs:
| Tutorial | Description |
|---|---|
| Getting Started | Complete guide to superGPT — installation, architecture, all model presets, data preparation, training, text generation, LoRA fine-tuning, distillation, multi-GPU FSDP, and troubleshooting. |
| Training Data Guide | How to prepare training data from scratch — web crawling, text extraction, quality filtering, deduplication, cleaning, custom data from GitHub/Google/PDFs, synthetic data generation (Magpie, Evol-Instruct), tokenization, data mixing, and curriculum learning. |
| Instruction Tuning & Chat | Turn a base model into ChatGPT — the complete 4-stage pipeline: SFT with LoRA, DPO alignment, RLHF/GRPO, RLVR (DeepSeek-R1 style). Includes 20+ instruction datasets, chat templates, OpenAI-compatible serving, and reasoning model training. |
| Deploy on RunPod | Step-by-step guide to renting cloud GPUs on RunPod and training superGPT models — GPU selection, SSH setup, background training, monitoring, downloading checkpoints, multi-GPU, and cost optimization. |
MIT