Failure-First: Adversarial Evaluation for Embodied and Agentic AI

Failure is not an edge case. It is the primary object of study.

failurefirst.org · Daily Paper Series · Research Blog · Security Policy

The Project

Failure-First is a red-teaming and benchmarking framework that studies how AI systems fail under adversarial pressure. We focus on embodied AI (robots, tool-using agents, multi-agent systems) where failures have physical consequences.

The core research question: when safety mechanisms are tested systematically across hundreds of models and thousands of attack techniques, what patterns emerge?

Key Findings

227 models tested. 141,561 adversarial prompts. 133,646 graded results. 337 attack techniques.

Classifier unreliability is pervasive. Keyword-based jailbreak classifiers agree with LLM-graded ground truth at Cohen's kappa = 0.126. Heuristic compliance labels carry a roughly 80% false positive rate. Most published ASR numbers are likely inflated.
Hallucinated refusals are functionally dangerous. Models that appear to refuse harmful requests sometimes generate the harmful content anyway, wrapped in safety-sounding framing. This "hallucination refusal" pattern adds 11.9 percentage points to the attack success rate on non-abliterated models.
Format-lock attacks exploit structured output compliance. Requesting harmful content formatted as JSON, YAML, or code achieves 24--42% success rates against frontier models. The structured output training objective conflicts with safety training.
Multi-turn escalation disproportionately affects reasoning models. Crescendo-style attacks achieve 65--85% success against extended-reasoning models, whose chain-of-thought tracking makes them susceptible to gradual context manipulation.
Safety mechanism effectiveness varies by 57x across providers. Identical prompts tested across providers reveal that safety investment, not model capability, determines vulnerability.

Methodology

All results use LLM-graded classification (the FLIP protocol) with documented grader reliability audits. We report three-tier ASR (strict, broad, functionally dangerous) with Wilson confidence intervals. Statistical comparisons use chi-square tests with Bonferroni correction. Full methodology is described in our CCS 2026 submission.

Grading methodology matters: always check whether cited ASR numbers use LLM-only, heuristic-only, or coalesced verdicts.

The Site

failurefirst.org hosts 740+ pages including research blog posts, a daily paper analysis series covering recent adversarial ML literature, policy reports, and multimedia overviews (audio, video, infographics via NotebookLM).

Repository Structure

This public repository contains:

Pattern-level findings and methodology descriptions
MANIFEST.json listing dataset structure (no adversarial content)
Design charter and research ethics documentation
Site source for failurefirst.org

Full datasets, traces, and evaluation infrastructure are maintained in a private research repository. Access is available under NDA for AI safety researchers at accredited institutions, government safety bodies, and frontier lab security teams. Open a GitHub issue with institutional affiliation.

Citation

@software{failure_first_2026,
  title   = {Failure-First: Adversarial Evaluation Framework for Embodied AI},
  author  = {Wedd, Adrian},
  year    = {2026},
  url     = {https://failurefirst.org},
  note    = {227 models, 141{,}561 prompts, 337 attack techniques}
}

A CCS 2026 paper is in submission preparation. Citation details will be updated upon acceptance.

Contributing

See CONTRIBUTING.md. This is a research project -- contributions are welcome via issue reports, citations, red-team collaboration proposals, and dataset contributions.

Security

See SECURITY.md for our coordinated vulnerability disclosure process. We currently have 5 pending responsible disclosures with model providers.

License

MIT

Study failures to build better defenses.

Name		Name	Last commit message	Last commit date
Latest commit History 533 Commits
docs		docs
scripts		scripts
site		site
tools		tools
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
DESIGN_CHARTER.md		DESIGN_CHARTER.md
MANIFEST.json		MANIFEST.json
README.md		README.md
SECURITY.md		SECURITY.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Failure-First: Adversarial Evaluation for Embodied and Agentic AI

The Project

Key Findings

Methodology

The Site

Repository Structure

Citation

Contributing

Security

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Failure-First: Adversarial Evaluation for Embodied and Agentic AI

The Project

Key Findings

Methodology

The Site

Repository Structure

Citation

Contributing

Security

License

About

Resources

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages