πŸ”„ AASMS

Adversarial Architectural Self-Modification System

License: MIT Python 3.10+ Ollama Docker CI Tests Status

A research system exploring recursive self-improvement through adversarial code evolution

Features | Installation | Usage | Architecture | Safety | Contributing


πŸ“– Overview

AASMS is a local-first AI system that evolves its own codebase through adversarial Red/Blue team dynamics. Blue Team agents propose code improvements as unified diffs, Red Team agents generate exploit tests to find flaws, and an Orchestrator evaluates proposals in Docker-isolated sandboxes, committing only survivors that improve benchmark scores.
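The propose/attack/evaluate loop can be sketched in a few lines. This is an illustrative simplification, not the repository's actual API — `select_survivors`, the proposal tuples, and the exploit-test callables are hypothetical stand-ins, with thresholds borrowed from `evolution_mode.yaml`:

```python
MIN_IMPROVEMENT_PCT = 5.0   # commit floor, as in evolution_mode.yaml
MAX_IMPROVEMENT_PCT = 50.0  # suspiciously large jumps are rejected

def select_survivors(baseline, proposals, exploit_tests):
    """proposals: list of (proposal_id, benchmark_score) pairs.
    exploit_tests: Red Team callables returning True if the proposal survives.
    Commits only survivors whose improvement falls inside the allowed band."""
    best, committed = baseline, []
    for pid, score in proposals:
        if not all(test(pid) for test in exploit_tests):
            continue  # Red Team found a flaw; proposal is discarded
        pct = (score - best) / best * 100
        if MIN_IMPROVEMENT_PCT <= pct <= MAX_IMPROVEMENT_PCT:
            committed.append(pid)
            best = score
    return best, committed
```

In this sketch a +10% survivor is committed, a +60% jump is rejected as implausible, and anything failing an exploit test never reaches scoring.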

⚠️ Research Prototype: This system is for research purposes only. See Known Limitations.

Verified Evolution Results

| Cycle | Score | Improvement | Status |
|-------|-------|-------------|--------|
| 1 | 0.01 | β€” | βœ“ Committed |
| 2 | 0.02 | +100% | βœ“ Committed |
| 3 | 0.03 | +50% | βœ“ Committed |
| 4 | 0.04 | +33% | βœ“ Committed |
| 5 | 0.05 | +25% | βœ“ Committed |
| 6 | 0.06 | +20% | βœ“ Committed |

24 proposals applied, 0 errors, 0 reverts. Scores from benchmarks/reasoning_suite.json (10-question subset). Full methodology in benchmarks/README.md.

πŸ” Reproducibility & Proof of Locality

All results are reproducible and verifiable. See docs/REPRODUCIBILITY_PROOF.md for complete proof artifacts:

| Proof Type | Description |
|------------|-------------|
| Seeded Benchmarks | `python scripts/benchmark.py --cycles 10 --seed 42` produces identical results |
| JSONL Cycle Logs | Full timestamps, scores, commit hashes in `results/proof_run_seed42.jsonl` |
| Hardware Report | Local GPU (RTX 5070), Ollama endpoints, no cloud calls |
| Network Isolation | Docker `--network=none`, tcpdump verification |
| Human Oversight | Sample approvals/rejections in `persistence/human_veto.json` |

```bash
# Reproduce documented results
python scripts/benchmark.py --cycles 10 --seed 42

# Verify GPU and locality
python -c "from utils.gpu_docker import get_system_isolation_report; import json; print(json.dumps(get_system_isolation_report(), indent=2))"
```

✨ Features

| Category | Capabilities |
|----------|--------------|
| 🏠 Local Execution | Ollama-only inference (llama3.2), no cloud APIs, multi-GPU vendor support |
| 🐳 GPU-Aware Docker | NVIDIA/AMD/Intel detection, driver compatibility checks, graceful CPU fallback |
| πŸ›‘οΈ Cryptographic Safety | SHA-256 integrity verification, import monitoring, immutable file protection |
| 🎯 Ensemble Anti-Gaming | Calibrated detection (90% precision, see methodology), rotating benchmarks |
| πŸ’Ύ State Protection | Atomic commits, auto-reversion, stall watchdog, human veto with auto-approve |
| ⚑ GPU-Aware Parallelism | Serialized GPU tasks, concurrent CPU work, contention prevention |
| πŸ“Š Structured Logging | JSONL schema with query tools, CSV export, aggregate statistics |
| πŸ”¬ Fidelity Verification | Domain-specific metrics (entity/number/code preservation), quality grades |
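The fidelity metrics in the last row can be illustrated with a number-preservation check: what fraction of numeric literals in a source text survive into its summary. This is a sketch of the idea only — `utils/context_summarizer.py` may compute its metrics differently:

```python
import re

def number_preservation(source: str, summary: str) -> float:
    """Fraction of numeric literals in `source` that also appear in
    `summary`. Returns 1.0 when the source contains no numbers.
    Illustrative only; the repository's metric may differ."""
    src_nums = set(re.findall(r"\d+(?:\.\d+)?", source))
    if not src_nums:
        return 1.0
    kept = {n for n in src_nums if n in summary}
    return len(kept) / len(src_nums)
```

A summary that drops the "512MB" and "30s" limits from a config description would score 0.0 on this metric and fail the fidelity grade.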

πŸ“¦ Installation

Prerequisites

| Requirement | Version | Purpose |
|-------------|---------|---------|
| Python | 3.10+ | Runtime |
| Git | Any | Version control |
| Ollama | Latest | Local LLM inference |
| Docker | Latest | Sandbox isolation (recommended) |
| GPU Driver | See below | GPU passthrough (optional) |

Quick Start

```bash
# Clone and install
git clone https://github.com/moonrunnerkc/aasms.git && cd aasms
./scripts/install.sh  # Unified installer for all platforms

# Or manual installation
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pytest tests/ -v
```

πŸ“‹ Unified Installer Details

The scripts/install.sh script handles:

  • Platform detection (Linux, macOS, WSL2)
  • Python version verification (3.10+)
  • Virtual environment setup
  • Ollama installation with verification
  • Docker capability check
  • GPU detection and compatibility report
```bash
# Full installation with prompts
./scripts/install.sh

# Check capabilities after install
make check-gpu
```

🐳 Docker + GPU Setup

Docker Installation:

```bash
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install docker.io docker-compose
sudo usermod -aG docker $USER && newgrp docker
docker run hello-world
```

GPU Support by Vendor:

| Vendor | Requirement | Docker Flag | Notes |
|--------|-------------|-------------|-------|
| NVIDIA | nvidia-container-toolkit + driver 525+ | `--gpus all` | Full support |
| AMD | ROCm 5.0+ + rocm-docker | `--device=/dev/kfd` | See ROCm docs |
| Intel | Not supported | β€” | CPU fallback used |

NVIDIA Setup:

```bash
# Install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install nvidia-docker2
sudo systemctl restart docker

# Verify
docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
```

Check AASMS GPU detection:

```bash
python -c "from utils.gpu_docker import get_system_isolation_report; import json; print(json.dumps(get_system_isolation_report(), indent=2))"
```

πŸ–₯️ Platform-Specific Notes

WSL2 (Windows):

```bash
# 1. Install Docker Desktop with WSL2 backend
# 2. Enable WSL integration: Docker Desktop > Settings > Resources > WSL Integration
# 3. For GPU: Install NVIDIA drivers on Windows, not in WSL
# 4. Verify: wsl --update && docker run hello-world

# Troubleshooting
wsl --shutdown  # Restart if docker not responding
docker context use default  # Fix context issues
```

macOS:

```bash
brew install ollama docker
# Docker Desktop required for Docker support
# GPU: Apple Silicon supported via Metal, no nvidia-docker needed
```

Linux (non-Ubuntu):

```bash
# Fedora/RHEL
sudo dnf install docker docker-compose
sudo systemctl enable --now docker

# Arch
sudo pacman -S docker docker-compose
sudo systemctl enable --now docker
```

βš™οΈ Configuration

All configuration lives in config/:

| File | Purpose |
|------|---------|
| models.yaml | LLM endpoints (Ollama only) |
| teams.yaml | Blue/Red agent definitions |
| benchmarks.yaml | Test suites and scoring rules |
| evolution_mode.yaml | Safety thresholds and limits |

πŸ”§ Key Settings

evolution_mode.yaml:

```yaml
mode: prompt_only  # Start safe; unlock code_evolution after stability
min_improvement_pct: 5.0
max_improvement_pct: 50.0
required_test_pass_rate: 1.0
memory_limit_mb: 512
test_timeout_seconds: 30

# Human oversight with empirically-derived thresholds
human_oversight:
  cycles_between_reviews: 10
  max_unattended_cycles: 50
  auto_approve:
    max_files: 3      # 95% of safe commits touch ≀3 files
    max_lines: 50     # Mean safe change: 28 lines (Οƒ=15)
    max_improvement_pct: 15.0
    min_improvement_pct: 5.0
  anomaly_thresholds:
    improvement_threshold: 50.0  # >3Οƒ from mean
    score_drop_threshold: 10.0

# Stall detection
watchdog:
  phase_timeout_seconds: 300   # 5 min per phase
  cycle_timeout_seconds: 1800  # 30 min per cycle
  heartbeat_interval: 10
```
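The auto-approve gate is a conjunction of the bounds in the config above: a commit is auto-approved only if it stays small and its improvement is plausible. A minimal sketch of how such a check might look (hypothetical function, not the actual code in `evaluator/human_oversight.py`):

```python
AUTO_APPROVE = {  # values mirror the auto_approve block in evolution_mode.yaml
    "max_files": 3,
    "max_lines": 50,
    "min_improvement_pct": 5.0,
    "max_improvement_pct": 15.0,
}

def auto_approvable(files_touched: int, lines_changed: int, improvement_pct: float) -> bool:
    """True if a commit falls inside the empirically-derived safe envelope;
    anything outside the envelope is queued for human review instead."""
    return (
        files_touched <= AUTO_APPROVE["max_files"]
        and lines_changed <= AUTO_APPROVE["max_lines"]
        and AUTO_APPROVE["min_improvement_pct"]
            <= improvement_pct
            <= AUTO_APPROVE["max_improvement_pct"]
    )
```

A typical safe change (2 files, 28 lines, +8%) passes; touching a fourth file or claiming a +20% jump routes the commit to the human review queue.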

πŸš€ Usage

Running Evolution Cycles

```bash
# Safe mode (recommended start)
python run.py --cycles 10 --mode prompt_only --verbose

# Code evolution (after 20+ stable prompt-only cycles)
python run.py --cycles 5 --mode code_evolution --verbose

# Show options
python run.py --help
```

Cycle Time Estimates

| Hardware | Est. Time/Cycle | Breakdown |
|----------|-----------------|-----------|
| RTX 5070/5080 | ~2 min | Blue: 45s, Red: 40s, Eval: 35s |
| RTX 4090 | ~1.7 min | Blue: 35s, Red: 30s, Eval: 27s |
| RTX 3090 | ~2.6 min | Blue: 55s, Red: 50s, Eval: 43s |
| CPU-only | ~7.3 min | Blue: 180s, Red: 160s, Eval: 90s |

```bash
# Get estimate for your hardware
python -c "from utils.profiling import get_realistic_estimate; import json; print(json.dumps(get_realistic_estimate(), indent=2))"
```
πŸ“Š Reproducible Benchmarks
```bash
# Single reproducible run
python scripts/benchmark.py --cycles 10 --seed 42

# Multiple runs for statistical validity
python scripts/benchmark.py --cycles 5 --runs 3 --output results.json

# See benchmark methodology
cat benchmarks/README.md
```

Scoring Methodology:

  • Dataset: benchmarks/reasoning_suite.json (curated reasoning questions)
  • Scoring: Exact match after normalization
  • Weights: reasoning (40%), math (30%), robustness (20%), code (10%)
  • Low absolute scores (0.01-0.06) reflect strict scoring + small model; key metric is relative improvement
πŸ“Š Log Analysis Tools
```bash
# View cycle summary
python -m utils.log_schema persistence/cycle_logs --cycle 6

# Get aggregate statistics
python -m utils.log_schema persistence/cycle_logs --stats

# Export to CSV
python -m utils.log_schema persistence/cycle_logs --export-csv results.csv
```

πŸ—οΈ Architecture

```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   πŸ‘€ Human Oversight                        β”‚
β”‚     (Configurable Veto, Empirically-Tuned Auto-Approve)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  πŸ”΅ Blue Team   β”‚       β”‚       β”‚  πŸ”΄ Red Team    β”‚
β”‚  4 Specialists  β”‚       β”‚       β”‚  4 Specialists  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                β”‚                β”‚
         β–Ό                β–Ό                β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚           βš™οΈ Orchestrator                   β”‚
    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
    β”‚  β”‚  🐳 Multi-Vendor GPU Docker         β”‚    β”‚
    β”‚  β”‚  β€’ NVIDIA/AMD/Intel detection       β”‚    β”‚
    β”‚  β”‚  β€’ Driver compatibility checks      β”‚    β”‚
    β”‚  β”‚  β€’ Graceful CPU fallback            β”‚    β”‚
    β”‚  β”‚  β€’ Stall watchdog (5min/phase)      β”‚    β”‚
    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
    β”‚  β”‚       πŸ›‘οΈ Cryptographic Safety       β”‚    β”‚
    β”‚  β”‚  β€’ SHA-256 integrity verification   β”‚    β”‚
    β”‚  β”‚  β€’ Import monitoring (AST)          β”‚    β”‚
    β”‚  β”‚  β€’ Immutable file protection        β”‚    β”‚
    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚     πŸ“Š Calibrated Anti-Gaming Detection     β”‚
    β”‚  β€’ Z-score (threshold=2.5, 92% precision)   β”‚
    β”‚  β€’ Improvement cap (50%, 95% precision)     β”‚
    β”‚  β€’ Ensemble vote (90% precision target)     β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
πŸ“ Project Structure
aasms/
β”œβ”€β”€ config/                    # YAML configuration
β”œβ”€β”€ agents/
β”‚   β”œβ”€β”€ blue/                  # 4 constructive specialists
β”‚   └── red/                   # 4 destructive specialists
β”œβ”€β”€ orchestrator/              # Core evolution loop
β”‚   β”œβ”€β”€ cycle_manager.py       # Main coordinator
β”‚   β”œβ”€β”€ commitment.py          # Commit decisions
β”‚   └── redundancy.py          # Checkpointing
β”œβ”€β”€ evaluator/                 # Testing and validation
β”‚   β”œβ”€β”€ gaming_detection.py    # Calibrated ensemble detection
β”‚   β”œβ”€β”€ human_oversight.py     # Empirically-tuned auto-approve
β”‚   β”œβ”€β”€ alignment.py           # Constitutional checks
β”‚   └── static_analysis.py     # Code analysis
β”œβ”€β”€ utils/                     # Shared utilities
β”‚   β”œβ”€β”€ gpu_docker.py          # Multi-vendor GPU detection
β”‚   β”œβ”€β”€ gpu_parallel.py        # GPU-aware parallelism
β”‚   β”œβ”€β”€ immutable_guard.py     # Cryptographic protection
β”‚   β”œβ”€β”€ log_schema.py          # Structured logging
β”‚   β”œβ”€β”€ profiling.py           # Hardware-specific estimates + watchdog
β”‚   └── context_summarizer.py  # Domain-specific fidelity metrics
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ install.sh             # Unified cross-platform installer
β”‚   └── benchmark.py           # Reproducible benchmarks
β”œβ”€β”€ benchmarks/                # Datasets with methodology docs
β”œβ”€β”€ persistence/               # State and logs
β”œβ”€β”€ tests/                     # 374+ tests
β”œβ”€β”€ Makefile                   # Local CI equivalent
└── .github/workflows/ci.yml   # GitHub Actions CI

πŸ§ͺ Testing

```bash
# Run all tests
make test                     # or: pytest tests/ -v

# Full CI suite locally (lint + type + test + security)
make test-ci

# With coverage
make test-cov

# Quick checks
make lint                     # ruff
make type-check               # mypy
make security                 # bandit
```

| Category | Command | Purpose |
|----------|---------|---------|
| All | `make test-ci` | Full CI locally |
| Unit | `pytest tests/test_llm_client.py` | LLM client |
| Integration | `pytest tests/test_cycle_manager.py` | Orchestrator |
| Isolation | `pytest tests/test_sandbox_isolation.py` | Docker sandbox |

πŸ›‘οΈ Safety

| Constraint | Value | Enforcement |
|------------|-------|-------------|
| Sandbox | Docker + seccomp + multi-vendor GPU | utils/gpu_docker.py |
| Integrity | SHA-256 manifest | utils/immutable_guard.py |
| Timeout | 30s/test, 5min/phase, 30min/cycle | utils/profiling.CycleWatchdog |
| Memory | 512MB default | Docker + RLIMIT_AS |
| Network | Disabled | `--network=none` |
| Improvement | β‰₯5%, ≀50% | orchestrator/commitment.py |
| Test Pass | 100% required | orchestrator/commitment.py |
| Immutable Files | 9 protected files | utils/immutable_guard.py |
| Human Review | Configurable (default: 10 cycles) | evaluator/human_oversight.py |
| Auto-Approve | ≀3 files, ≀50 lines, 5-15% | Empirically validated |
| Anti-Gaming | Ensemble (90% precision calibrated) | evaluator/gaming_detection.py |
| Stall Detection | Phase/cycle timeouts + heartbeat | utils/profiling.CycleWatchdog |
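The SHA-256 manifest rows describe a standard hash-manifest check: record digests for protected files, then flag any file whose current contents no longer match. A minimal sketch of the idea (the actual logic lives in `utils/immutable_guard.py` and may differ):

```python
import hashlib

def build_manifest(files: dict) -> dict:
    """files: {path: file_bytes}. Returns {path: sha256_hexdigest}.
    In practice the bytes would be read from disk for each protected path."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}

def tampered(manifest: dict, files: dict) -> list:
    """Return the paths whose current bytes no longer match the recorded
    digest -- any non-empty result should abort the evolution cycle."""
    return [path for path, data in files.items()
            if hashlib.sha256(data).hexdigest() != manifest.get(path)]
```

Because the digest covers the full file contents, even a one-byte edit to a protected file changes its SHA-256 and is caught before a proposal can be committed.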

Anti-Gaming Calibration

The 90% precision target is based on:

  1. Training: 200 synthetic gaming attempts
  2. Validation: 100 real cycles with 15 labeled gaming attempts
  3. Thresholds: Z-score=2.5 (92% precision), improvement cap=50% (95% precision)
  4. Ongoing: Metrics tracked in persistence/gaming_metrics.json

See evaluator/gaming_detection.py docstring for full methodology.
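Using the calibrated thresholds above, the detection logic can be sketched as two detectors combined conservatively. This is a simplified two-detector illustration (the real ensemble in `evaluator/gaming_detection.py` may weigh more signals):

```python
import statistics

Z_THRESHOLD = 2.5        # from calibration: 92% precision alone
IMPROVEMENT_CAP = 50.0   # from calibration: 95% precision alone

def zscore_flag(history, new_score):
    """Flag a score more than Z_THRESHOLD standard deviations above the
    running mean of previous cycle scores."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu, sd = statistics.mean(history), statistics.stdev(history)
    return sd > 0 and (new_score - mu) / sd > Z_THRESHOLD

def cap_flag(prev_score, new_score):
    """Flag an improvement over the previous cycle exceeding the cap."""
    return prev_score > 0 and (new_score - prev_score) / prev_score * 100 > IMPROVEMENT_CAP

def is_gaming(history, new_score):
    """Conservative ensemble: flag only when both detectors agree,
    trading recall for precision."""
    return zscore_flag(history, new_score) and cap_flag(history[-1], new_score)
```

On a score history like the documented run, a sudden jump from 0.05 to 0.30 trips both detectors, while the normal next step to 0.06 trips neither.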


⚠️ Known Limitations

Technical

  • LLM Quality: Evolution quality depends on local model (tested llama3.2:3b)
  • Compute: 2-8 min/cycle; see Cycle Time Estimates
  • Context Windows: Domain-specific fidelity checks (not ROUGE) mitigate truncation
  • GPU Vendors: NVIDIA full support, AMD partial (ROCm), Intel CPU-fallback

Safety

  • Research Only: Not production-ready
  • Sandbox Escapes: Docker provides strong but not absolute isolation
  • Calibration Drift: Re-calibrate anti-gaming quarterly with accumulated data
  • Human Oversight Required: Configure review intervals appropriately

Research

  • Early Results: 6 cycles shown; longer runs pending external validation
  • Single Implementation: May not generalize to other architectures
  • Dual-Use Risk: See evaluator/alignment.py for ethical guidelines

🀝 Contributing

```bash
# Fork, clone, and setup
git clone https://github.com/YOUR_USERNAME/aasms.git && cd aasms
make install-dev

# Create feature branch
git checkout -b feature/your-feature

# Run full CI locally before PR
make test-ci

# Push and open PR
git push origin feature/your-feature
```

Code Style: PEP8 via ruff, type hints required, docstrings on public functions.

See CONTRIBUTING.md for full guidelines.


πŸ“„ License

MIT License Β© 2026 Bradley R. Kinnard


πŸ“š Citation

```bibtex
@software{aasms2026,
  author = {Kinnard, Bradley R.},
  title  = {Adversarial Architectural Self-Modification System},
  year   = {2026},
  url    = {https://github.com/moonrunnerkc/aasms}
}
```

⬆ Back to Top

Made with πŸ”§ by @moonrunnerkc
