Implementation of concepts from the paper "Solving a Million-Step LLM Task with Zero Errors" using Python and LiteLLM.
This project implements the MAKER system, which enables LLMs to complete tasks requiring over one million steps without errors through:
- Maximal Agentic Decomposition (MAD): Breaking tasks into single-step subtasks
- First-to-Ahead-by-k Voting: Error correction through multi-agent consensus
- Red-Flagging: Anomaly detection to discard unreliable responses
Instead of using a single sophisticated model, MAKER uses many focused microagents that each handle one simple decision. This reduces cumulative error propagation.
Multiple agents vote on each step. The system keeps sampling until one candidate leads every other candidate by at least k votes. The required margin k grows logarithmically with the number of steps s: Θ(ln s).
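The stopping rule described above can be sketched in a few lines. This is a minimal illustration, not the code in maker.py: the function name `first_to_ahead_by_k`, the `sample` callback, and the `max_samples` safety cap are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def first_to_ahead_by_k(sample: Callable[[], Hashable], k: int,
                        max_samples: int = 1000) -> Tuple[Hashable, int]:
    """Draw votes until the leader is at least k votes ahead of every other candidate."""
    votes: Counter = Counter()
    for _ in range(max_samples):
        votes[sample()] += 1
        ranked = votes.most_common(2)
        leader, lead_count = ranked[0]
        runner_up_count = ranked[1][1] if len(ranked) > 1 else 0
        if lead_count - runner_up_count >= k:
            return leader, sum(votes.values())
    raise RuntimeError("no candidate reached the required margin")


# Deterministic stand-in for LLM samples (hypothetical answers).
answers = iter(["A", "B", "A", "A", "A"])
winner, samples_used = first_to_ahead_by_k(lambda: next(answers), k=3)
# winner == "A" after 5 samples: the single "B" vote delays the decision.
```

In a real run, `sample` would wrap one microagent call and return its parsed answer; a dissenting vote simply forces extra samples before the margin is reached.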
Responses are checked for anomalies:
- Overly long or short outputs
- Malformed responses
- Failure patterns ("I cannot", "I don't know", etc.)
- Missing expected format
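The checks in this list can be combined into a single predicate. The sketch below is illustrative only: `is_red_flagged`, the pattern tuple, and the `A->C` move format are hypothetical, while the length defaults mirror the `min_response_length` and `max_response_length` defaults documented further down.

```python
import re

# Example failure phrases taken from the list above; extend as needed.
FAILURE_PATTERNS = ("i cannot", "i don't know")


def is_red_flagged(response: str, expected_format: str,
                   min_length: int = 1, max_length: int = 200) -> bool:
    """Return True if the response should be discarded and resampled."""
    text = response.strip()
    # Overly long or short outputs.
    if not (min_length <= len(text) <= max_length):
        return True
    # Known failure phrases.
    lowered = text.lower()
    if any(pattern in lowered for pattern in FAILURE_PATTERNS):
        return True
    # Malformed output or missing expected format.
    return re.fullmatch(expected_format, text) is None


MOVE_FORMAT = r"[ABC]->[ABC]"  # hypothetical answer format for a Hanoi move
print(is_red_flagged("A->C", MOVE_FORMAT))
print(is_red_flagged("I cannot determine the move", MOVE_FORMAT))
```

A flagged response is dropped before voting, so it never counts toward any candidate.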
- MAKER_CONCEPTS.md - Comprehensive knowledge extraction from the paper
- towers_of_hanoi.py - Towers of Hanoi game implementation (benchmark task)
- maker.py - MAKER system implementation for Towers of Hanoi
- test_maker.py - Comprehensive test suite
- demo.py - Simple demonstration script
- requirements.txt - Python dependencies

- MAKER_GENERALIZATION.md - How to apply MAKER to ANY sequential task
- maker_base.py - Generalized MAKER implementation for any task
- .claude/skills/maker-methodology/ - Claude Skill for using MAKER
  - SKILL.md - Main skill instructions
  - TASK_TEMPLATE.py - Template for creating new MAKER tasks
  - EXAMPLES.md - Concrete examples for different task types
Basic Examples:
- example_sudoku.py - Sudoku solver using generalized MAKER
Real-World Scenarios:
- scenario1_dependency_resolution.py - Build order/dependency resolution
- scenario2_infrastructure_provisioning.py - Cloud infrastructure provisioning
- scenario3_interview_scheduling.py - Interview scheduling with constraints
- REAL_WORLD_SCENARIOS.md - Complete breakdown of scenarios 1-3
Complex Scenarios (Advanced):
- scenario4_api_test_execution.py - API integration test suite with dependencies
- scenario5_database_migration.py - Production database migration with data preservation
- scenario6_distributed_deployment.py - Distributed system rolling deployment
- COMPLEX_SCENARIOS.md - Complete breakdown of scenarios 4-6
Search & Exploration:
- rubiks_cube.py - Complete 3×3×3 Rubik's Cube implementation
- rubiks_cube_maker_solver.py - MAKER-based cube solver with heuristics
- scenario7_rubiks_cube_solver.py - Rubik's Cube solver scenario
META-MAKER (Requirements Definition):
- requirements_definer_maker.py - MAKER for project requirements definition
- scenario8_requirements_definition.py - Demonstrations of preventing "LLM spill"
- META_MAKER.md - Complete guide to using MAKER for requirements
pip install -r requirements.txt
The implementation uses LiteLLM, which supports multiple LLM providers. For OpenAI:
export OPENAI_API_KEY='your-api-key-here'
For other providers, see LiteLLM documentation.
python towers_of_hanoi.py
This should run the basic Towers of Hanoi implementation without requiring an API key.
python demo.py
python test_maker.py
Tests include:
- Basic functionality (3 disks)
- Scaling tests (3, 4, 5 disks)
- Voting margin impact (k=1, 2, 3)
- Red-flagging effectiveness
- Solution verification
from maker import MAKER, MAKERConfig
# Configure MAKER
config = MAKERConfig(
    model="gpt-4o-mini",  # Model to use
    k=3,                  # Voting margin
    temperature=0.7,      # Sampling temperature
    verbose=True,         # Print progress
)
# Create MAKER instance
maker = MAKER(config)
# Solve Towers of Hanoi
num_disks = 4
success, moves, stats = maker.solve_towers_of_hanoi(num_disks)
print(f"Success: {success}")
print(f"Moves: {len(moves)}")
print(f"Expected: {2**num_disks - 1}")
The MAKER approach can be applied to any sequential task! The framework has been generalized to work with:
- Constraint satisfaction problems (Sudoku, N-Queens, scheduling)
- Sequential planning (route planning, workflow orchestration)
- Code generation (multi-file refactoring, test generation)
- Mathematical reasoning (proof construction, equation solving)
- Data pipelines (ETL workflows, data cleaning)
from maker_base import GeneralizedMAKER, MAKERConfig, DecomposableTask
# 1. Define your task by implementing DecomposableTask
class YourTask(DecomposableTask):
    def get_possible_actions(self):
        # Return list of valid actions from current state
        pass

    def apply_action(self, action):
        # Apply action and update state
        pass

    def is_complete(self):
        # Check if task is done
        pass

    def format_for_agent(self, step_num):
        # Format state as prompt for voting agents
        pass

    # ... implement other required methods
# 2. Create task instance
task = YourTask(problem_instance)
# 3. Configure and solve with MAKER
config = MAKERConfig(model="gpt-4o-mini", task_type="your_task")
maker = GeneralizedMAKER(config, task)
success, actions, stats = maker.solve()
- MAKER_GENERALIZATION.md - Complete guide to generalizing MAKER
- TASK_TEMPLATE.py - Copy-paste template for new tasks
- EXAMPLES.md - Concrete examples for different domains
- example_sudoku.py - Working Sudoku solver example
- .claude/skills/maker-methodology/ - Claude Skill that teaches MAKER methodology
A Claude Code skill is included that teaches Claude how to apply MAKER to any task:
# The skill is in .claude/skills/maker-methodology/
# Claude will automatically use it when you ask about:
# - Solving multi-step problems
# - Sequential planning tasks
# - Tasks requiring many decisions
# - Constraint satisfaction problems
When activated, Claude will:
- Identify if your task is MAKER-compatible
- Help you define the task interface
- Set up voting and red-flagging
- Generate the implementation
- Guide you through testing
from maker_base import GeneralizedMAKER, MAKERConfig
from example_sudoku import SudokuTask, create_easy_sudoku
# Create puzzle
puzzle = create_easy_sudoku()
task = SudokuTask(puzzle)
# Solve with MAKER
config = MAKERConfig(model="gpt-4o-mini", verbose=True)
maker = GeneralizedMAKER(config, task)
success, actions, stats = maker.solve()
# Sudoku solved with zero errors!
Three additional complete implementations demonstrate MAKER's versatility:
Problem: Determine build order for software project with module dependencies
Use Cases: npm/pip dependencies, Make/Gradle builds, Terraform resources, Kubernetes deployments
Complexity: 12-module project has ~1,000 valid build orders
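For a build-order task, the single-step decision is "which module to build next", and the valid actions are the modules whose dependencies are already built. The sketch below shows only that decomposition; the names `ready_modules` and `build_order` and the three-module graph are invented for illustration and are not taken from scenario1_dependency_resolution.py. Where MAKER would run a vote, the sketch just picks the first candidate.

```python
from typing import Dict, List, Set


def ready_modules(dependencies: Dict[str, Set[str]], built: Set[str]) -> List[str]:
    """Modules not yet built whose dependencies have all been built."""
    return sorted(
        module for module, deps in dependencies.items()
        if module not in built and deps <= built
    )


def build_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    built: Set[str] = set()
    order: List[str] = []
    while len(order) < len(dependencies):
        candidates = ready_modules(dependencies, built)
        if not candidates:
            raise ValueError("dependency cycle detected")
        choice = candidates[0]  # placeholder for the voted decision
        order.append(choice)
        built.add(choice)
    return order


# Hypothetical project: app needs core and utils, core needs utils.
example = {"app": {"core", "utils"}, "core": {"utils"}, "utils": set()}
print(build_order(example))  # ['utils', 'core', 'app']
```

Because every candidate returned by `ready_modules` is valid by construction, a wrong vote can only affect which valid order is chosen, not whether the order is valid.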
python scenario1_dependency_resolution.py
Problem: Provision cloud resources in correct order with parallel limits
Use Cases: Terraform/CloudFormation, Kubernetes setup, multi-region deployments, CI/CD environments
Complexity: 20-resource infrastructure has ~10^6 valid provisioning orders
Features:
- Parallel provisioning (max 3 simultaneous)
- Cost optimization ($654/hour total)
- Resource dependencies (VPC → Subnet → Database → App)
python scenario2_infrastructure_provisioning.py
Problem: Schedule interviews satisfying interviewer availability, room capacity, and candidate preferences
Use Cases: Technical interviews, medical appointments, meeting rooms, university courses, court hearings
Complexity: 5 interviews with 8 time slots each = ~32,000 possible schedules
Features:
- Hard constraints (availability, capacity)
- Soft constraints (candidate preferences)
- Multi-person panel requirements
- Optimization (maximize preference satisfaction)
python scenario3_interview_scheduling.py
See REAL_WORLD_SCENARIOS.md for complete breakdowns of all three scenarios with:
- Detailed problem analysis
- Step-by-step MAKER decomposition
- Agent prompt examples
- Complexity analysis
- Cost comparisons
- Implementation guides
Three highly complex scenarios demonstrating MAKER at production scale:
Problem: Execute comprehensive API test suite with dependencies, data sharing, and parallel execution
Complexity: 13 tests, 23 dependencies, parallel execution (max 3), retry logic, shared state
Features:
- Test dependencies (test B needs test A's output)
- Data sharing between tests (user ID from create used in update)
- Parallel execution constraints
- Flaky test retry logic
- Critical test failure handling
- Setup/teardown management
python scenario4_api_test_execution.py
Problem: Migrate production database while preserving all data and maintaining zero downtime
Complexity: 16 steps, 1.5M rows affected, backup points, rollback capability, risk levels 1-5
Features:
- Zero data loss requirement
- Multi-stage data transformation
- Foreign key dependency management
- Backup before risky operations
- Rollback capability at any point
- Continuous data validation
- 10-minute downtime limit
python scenario5_database_migration.py
Problem: Deploy 5 microservices with rolling updates, health checks, and automatic rollback
Complexity: 25+ steps, 5 services, 19 instances, canary deployments, multi-stage health checks
Features:
- Multi-service dependencies
- Rolling updates (gradual instance replacement)
- Canary deployments for critical services
- Health checks after each stage
- Database migration coordination
- Load balancer reconfiguration
- Automatic rollback on failure
- Zero downtime requirement
python scenario6_distributed_deployment.py
See COMPLEX_SCENARIOS.md for complete analysis of these advanced scenarios with:
- Detailed complexity breakdown
- Multi-dimensional dependency graphs
- Rollback orchestration strategies
- Health check coordination
- Real-world impact metrics
- Comparison tables
- Implementation guide for similar complex tasks
Problem: Solve a scrambled Rubik's Cube using heuristic-guided exploration
Complexity: 43 quintillion possible states, 18 moves per state, heuristic-guided search
Features:
- State space exploration (not just dependency resolution!)
- Heuristic evaluation (score each move)
- Voting-based move selection
- Loop avoidance (track visited states)
- Progressive solving approach
- Solutions in 30-100 moves (an optimal solution never needs more than 20 moves)
Why this is interesting:
- Demonstrates MAKER on SEARCH problems
- Shows heuristic guidance + voting
- Compares with dedicated solvers (Kociemba's algorithm)
- Shows MAKER's versatility beyond dependency graphs
# Full cube implementation
python rubiks_cube.py
# MAKER solver
python rubiks_cube_maker_solver.py
# Complete scenario with tests
python scenario7_rubiks_cube_solver.py
Files:
- rubiks_cube.py - Complete 3×3×3 cube implementation with all 18 moves
- rubiks_cube_maker_solver.py - MAKER-based solver with heuristics
- scenario7_rubiks_cube_solver.py - Tests, comparisons, demonstrations
Note: MAKER finds solutions, not necessarily optimal ones. For short solutions (around 20 moves; every position can be solved in at most 20), use a dedicated solver such as Kociemba's algorithm. MAKER demonstrates voting-based search in large state spaces.
Problem: LLM coding agents "spill" unnecessary features because requirements are vague or misaligned
The "LLM Spill" Problem:
User: "Build a simple task management API"
LLM builds:
- Full OAuth2 with social login
- Email notifications
- Webhooks system
- Advanced search with Elasticsearch
- Real-time WebSocket updates
- Export to PDF, CSV, Excel
- 15-level user permission system
- API rate limiting with Redis
... 3,000 lines of unnecessary code!
Solution: Use MAKER to define requirements before writing code
This is META-MAKER - using MAKER to generate the foundation (requirements) that guides all downstream work (coding).
Key Innovation - Explicit Non-Goals:
Core Purpose: "Build REST API for authenticated users to CRUD tasks"
NON-GOALS (Do NOT Build):
- ✗ Email notifications
- ✗ Webhooks
- ✗ Real-time updates
- ✗ Export functionality
- ✗ Advanced search
- ✗ Complex permissions
- ✗ Rate limiting (not needed for MVP)
... prevents "spill"!
Results:
- 10x reduction in code size (3,000 → 250 lines)
- 10x reduction in dev time (3-4 weeks → 2-3 days)
- 10x reduction in technical debt
- Higher user satisfaction (exactly what they need)
How It Works:
- Voting on core purpose (foundation)
- Voting on non-goals (boundaries)
- Voting on features (within boundaries)
- Quality gates: clear, testable, minimal
- Export requirements for coding agent
# See comparison: with vs without MAKER
python scenario8_requirements_definition.py
# Real implementation
python requirements_definer_maker.py
Files:
- requirements_definer_maker.py - MAKER for requirements definition
- scenario8_requirements_definition.py - Demonstrations and comparisons
- META_MAKER.md - Complete guide to preventing "LLM spill"
Why This Matters:
- Most effective anti-spill mechanism
- Prevents problems at root cause
- 10,000x ROI (cost to define requirements vs. cost of unnecessary features)
- The best code is code you never had to write
See META_MAKER.md for complete guide including:
- Before/after examples
- Why voting prevents over-engineering
- Quality gates that catch ambiguity
- Real-world cost savings
- Integration with coding agents
MAKER works best for tasks that are:
- ✅ Sequential (steps must happen in order)
- ✅ Decomposable (can break into single-step decisions)
- ✅ Verifiable (can check if each step is valid)
- ✅ Long (>10 steps where errors accumulate)
MAKER is NOT ideal for:
- ❌ Creative/open-ended generation
- ❌ Tasks requiring holistic understanding
- ❌ Very short tasks (<10 steps)
- ❌ Continuous optimization problems
- model (str): LLM model to use (default: "gpt-4o-mini")
  - Paper finding: Smaller, cheaper models work better with voting
- k (int): Voting margin (default: 3)
  - Grows logarithmically with task steps
  - Use MAKERConfig.compute_k_for_steps(n) for automatic calculation
- max_response_length (int): Maximum response length for red-flagging (default: 200)
- min_response_length (int): Minimum response length for red-flagging (default: 1)
- temperature (float): Sampling temperature (default: 0.7)
- max_resamples (int): Maximum resamples if red-flagged (default: 5)
- verbose (bool): Print progress information (default: True)
The paper shows that k should grow logarithmically with the number of steps:
- ≤10 steps: k=2
- ≤100 steps: k=3
- ≤1000 steps: k=4
- >1000 steps: k = max(3, ln(steps) + 1)
Use MAKERConfig.compute_k_for_steps(num_steps) for automatic calculation.
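The thresholds listed above translate directly into a small function. This is a standalone sketch, not the body of MAKERConfig.compute_k_for_steps: the name `suggest_k` is hypothetical, and rounding ln(steps) + 1 up to the next integer is an assumption, since the list does not say how the value is rounded.

```python
import math


def suggest_k(num_steps: int) -> int:
    """Voting margin for a task of num_steps steps, following the thresholds above."""
    if num_steps <= 10:
        return 2
    if num_steps <= 100:
        return 3
    if num_steps <= 1000:
        return 4
    # Assumption: round ln(steps) + 1 up to an integer.
    return max(3, math.ceil(math.log(num_steps) + 1))


for steps in (7, 31, 1_000, 1_023, 1_048_575):
    print(steps, suggest_k(steps))
```

Under this rounding assumption the margin jumps from 4 at 1,000 steps to 8 at 1,023 steps, and reaches 15 for the 1,048,575-step case.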
The Towers of Hanoi puzzle is an ideal benchmark because:
- Scalable: 2^D - 1 steps for D disks
- Verifiable: Easy to check correctness
- Single-step operations: Each move is atomic
| Disks | Steps Required | Time Estimate (k=3) |
|---|---|---|
| 3 | 7 | ~30 seconds |
| 4 | 15 | ~1 minute |
| 5 | 31 | ~2 minutes |
| 10 | 1,023 | ~20 minutes |
| 20 | 1,048,575 | ~200 hours* |
*Based on paper results. Actual time depends on model, API rate limits, and voting margin.
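The step counts in the table, and the claim that solutions are easy to verify, can both be checked with a short script. This is an independent sketch, assuming pegs named A, B, and C with all disks starting on A; it does not use or mirror the API of towers_of_hanoi.py.

```python
from typing import List, Tuple

Move = Tuple[str, str]  # (source peg, target peg)


def hanoi_moves(num_disks: int, source: str = "A", target: str = "C",
                spare: str = "B") -> List[Move]:
    """Classic recursive solution; produces 2**num_disks - 1 moves."""
    if num_disks == 0:
        return []
    return (hanoi_moves(num_disks - 1, source, spare, target)
            + [(source, target)]
            + hanoi_moves(num_disks - 1, spare, target, source))


def is_valid_solution(num_disks: int, moves: List[Move]) -> bool:
    """Replay the moves, rejecting any that place a larger disk on a smaller one."""
    pegs = {"A": list(range(num_disks, 0, -1)), "B": [], "C": []}
    for source, target in moves:
        if not pegs[source]:
            return False
        if pegs[target] and pegs[target][-1] < pegs[source][-1]:
            return False
        pegs[target].append(pegs[source].pop())
    return pegs["C"] == list(range(num_disks, 0, -1))


for disks in (3, 4, 5, 10):
    moves = hanoi_moves(disks)
    print(disks, len(moves), is_valid_solution(disks, moves))
```

Each move is checked against the current state alone, which is what makes every step of the benchmark independently verifiable.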
- Decomposition > Sophistication: Many simple agents outperform single sophisticated agents
- Voting is Critical: Multi-agent consensus prevents error propagation
- Red-flagging is Essential: Anomaly detection reduces correlated errors
- Cost-Effective Scaling: Cheaper models with voting beat expensive reasoning models
- Stable Performance: Error rates don't increase with task length
Expected cost scales as Θ(s ln s) where s is the number of steps:
- Each step requires multiple agent calls (voting)
- The voting margin k grows logarithmically
- Total API calls ≈ s × (average agents per vote)
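To get a feel for the "average agents per vote" term, one can use a simplified model: assume only two competing answers per step, with each sample independently correct with probability p > 0.5. Under that assumption (which is ours, not a claim about the paper's exact analysis), the expected number of samples before one answer leads by k follows from a standard random-walk argument. The function names below are hypothetical, and red-flag resamples are ignored.

```python
def expected_samples_per_step(k: int, p_correct: float) -> float:
    """Expected samples until one of two answers leads by k votes.

    Simplified two-candidate model; requires 0.5 < p_correct <= 1.
    """
    if not 0.5 < p_correct <= 1.0:
        raise ValueError("this model needs 0.5 < p_correct <= 1")
    ratio = (1.0 - p_correct) / p_correct   # odds of a wrong vote vs a right one
    drift = 2.0 * p_correct - 1.0           # average net lead gained per sample
    return k * (1.0 - ratio**k) / (drift * (1.0 + ratio**k))


def estimate_total_calls(num_steps: int, k: int, p_correct: float) -> float:
    """Total API calls is roughly steps times expected samples per step."""
    return num_steps * expected_samples_per_step(k, p_correct)


print(estimate_total_calls(1_023, k=3, p_correct=0.9))
```

With perfectly reliable samples the estimate reduces to exactly k calls per step, which is the lower bound for any margin-k vote; lower per-sample accuracy pushes the count up.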
For cost optimization:
- Use cheaper models (gpt-4o-mini performs well)
- Tune k based on task length
- Implement caching for repeated states (not in this demo)
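Since caching is not part of this demo, the following is only a sketch of one way it could look. It assumes the prompt for a step depends solely on a hashable state key, so a decision voted once can be reused whenever the same state recurs. `VoteCache` and its methods are hypothetical names.

```python
from typing import Callable, Dict, Hashable


class VoteCache:
    """Memoize the voted decision for each distinct state key."""

    def __init__(self, decide: Callable[[Hashable], str]):
        self._decide = decide  # e.g. a function that runs a full vote for a state
        self._decisions: Dict[Hashable, str] = {}
        self.misses = 0  # number of times a real vote had to be run

    def decision_for(self, state_key: Hashable) -> str:
        if state_key not in self._decisions:
            self.misses += 1
            self._decisions[state_key] = self._decide(state_key)
        return self._decisions[state_key]


# Stand-in decision function; a real one would call the voting loop.
cache = VoteCache(lambda state: f"decision-for-{state}")
cache.decision_for("state-1")
cache.decision_for("state-1")
print(cache.misses)  # 1
```

The cache only saves calls when states actually repeat, for example across repeated runs of the same puzzle, and it would also replay a wrong decision if one were ever cached.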
The MAKER approach can be applied to any task that:
- Can be decomposed into steps
- Has verifiable or votable intermediate states
- Allows context extraction for each step
Examples:
- Multi-step reasoning problems
- Code generation with dependencies
- Mathematical proofs
- Sequential planning tasks
To adapt MAKER:
- Implement your task's state representation
- Define single-step operations
- Create prompts for microagents
- Implement validation logic
- Adjust red-flagging criteria
- API Costs: Large tasks require many API calls
- Time: Voting adds latency
- Task Suitability: Only works for decomposable tasks
- Model Dependence: Requires models that can handle single-step decisions
- Implement caching to avoid redundant votes
- Add parallel agent calls for faster voting
- Support for other benchmark tasks
- Adaptive k selection based on confidence
- Cost tracking and optimization
- Support for more LLM providers
Paper: "Solving a Million-Step LLM Task with Zero Errors"
- arXiv: https://arxiv.org/html/2511.09030v1
- Key result: Successfully solved 20-disk Towers of Hanoi (1,048,575 steps) with zero errors
MIT License - See LICENSE file for details
Contributions are welcome! Areas for improvement:
- Additional benchmark tasks
- Performance optimizations
- Better red-flagging heuristics
- Cost optimization strategies
- Documentation and examples