Autoresearch Ecosystem Index
A curated index of autonomous research loops, tools, and benchmarks built on the keep-or-revert primitive introduced by Andrej Karpathy.
Reference guide
Loop Reference & Community Outcomes
Quick reference on what autoresearch is, how the loop operates, and documented community results.
Core Concepts
Autoresearch was introduced by Andrej Karpathy as a natural-language instruction document (program.md). A coding agent reads it, proposes one change, runs a time-bounded training session, and evaluates validation bits-per-byte.
Each cycle modifies exactly one file. If the measured metric improves, the change is committed. If not, the file is restored with a Git hard reset. Because every rejected change is fully undone, regressions cannot compound over hundreds of cycles.
The loop transfers to any domain with a measurable scalar fitness function. Documented uses include ML training loss, GPU kernel MFU, software build times, trading strategy Sharpe ratios, ancient document ink detection, and static analysis metrics.
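The keep-or-revert cycle described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the upstream implementation: the function name `keep_or_revert`, the `propose` and `evaluate` callbacks, and the toy metric are all hypothetical, and the sketch assumes a lower metric is better (as with validation bits-per-byte). In the loop as described, "keep" corresponds to a Git commit and "revert" to a Git hard reset; here both are modelled by holding on to the last accepted state.

```python
import random
from typing import Callable

State = dict[str, float]


def keep_or_revert(
    params: State,
    propose: Callable[[State], State],
    evaluate: Callable[[State], float],
    cycles: int,
) -> tuple[State, float, list[bool]]:
    """Run a fixed number of keep-or-revert cycles (lower metric is better)."""
    best = dict(params)           # last accepted state (the "committed" version)
    best_score = evaluate(best)   # baseline measurement before any change
    history: list[bool] = []      # True = change kept, False = change reverted
    for _ in range(cycles):
        # Propose exactly one change against a copy of the last accepted state.
        candidate = propose(dict(best))
        score = evaluate(candidate)
        if score < best_score:
            # Keep: the candidate becomes the new accepted state.
            best, best_score = candidate, score
            history.append(True)
        else:
            # Revert: discard the candidate; `best` is left untouched.
            history.append(False)
    return best, best_score, history


# Toy stand-ins (assumptions for illustration only).
def toy_metric(p: State) -> float:
    return (p["lr"] - 3.0) ** 2   # minimised at lr == 3.0


rng = random.Random(0)            # seeded so the run is repeatable


def toy_propose(p: State) -> State:
    p["lr"] += rng.uniform(-0.5, 0.5)   # one small random edit
    return p


start = {"lr": 0.0}
best, best_score, history = keep_or_revert(start, toy_propose, toy_metric, cycles=200)
```

Because a candidate is only ever accepted when it strictly improves the metric, the accepted score never gets worse from one cycle to the next, which is the property that stops regressions from compounding. Any scalar metric can be substituted for `toy_metric`, provided the comparison direction is adjusted for metrics where higher is better.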
Loop Steps
Community Results
Karpathy's overnight run completed 37 validation cycles.
David Cortés achieved 65% faster CI builds; Tobi Lütke contributed, leading to pi-autoresearch.
Driveline optimized XGBoost models predicting pitch velocity from sensor data.
Self-supervised multi-agent loops optimized scroll ink generalisation.
Ecosystem Case Studies & Writeups
Deep dives, optimization reports, and technical guides from teams running the keep-or-revert loop in production.
Shopify CI Build Optimisation
David Cortés adapted autoresearch to optimise CI build times, achieving 65% faster builds. Tobi Lütke contributed multi-metric support and auto-commits, leading to the open-sourced pi-autoresearch (3,600+ stars).
Tennis XGBoost + Reward Hacking
Autoresearch-inspired loop for tennis match prediction — and an honest account of where the optimisation setup went wrong (reward hacking).
Vesuvius Challenge Ink Detection Swarm
Multi-agent experimental loop applied to ancient-scroll ink detection, with a writeup on cross-scroll generalisation improvements.
Earth System Model Optimisation
Hybrid workflow where an LLM proposes equation structures and a search process tunes parameters, extending autoresearch into scientific modelling.
The Agentic Researcher
A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning. Cites autoresearch as the canonical example of automated ML experiment pipelines.
Scaling Autoresearch to GPU Clusters
Running autoresearch on H100/H200 clusters with cloud orchestration. Covers distributed experiment management and cost control.
Self-Improving Coding Agents
Practical guide to setting up self-improving agent loops with Claude Code. Covers the key primitives and common failure modes.
autoresearch@home: Distributed AI Research
The SETI@home model applied to autoresearch — contribute GPU time to collective model optimisation.
Claude Code + AutoResearch for Self-Improving Skills
Building self-improving AI skills using Claude Code with autoresearch patterns. Step-by-step implementation guide.
100 ML Experiments Overnight
Technical breakdown of the autoresearch loop with domain-agnostic fork applications and reproducible results.
PM's Guide to Autoresearch
Product manager's guide covering setup, community forks, and real-world applications of the autoresearch loop.
Autoresearch 101 Builder's Playbook
Deep-dive on applying autoresearch patterns to prompts, agents, and workflows with concrete examples.
Fortune Feature
Business and industry context on why autoresearch matters for the future of autonomous AI agents.
Forks, adaptations, writeups, and benchmarks
recursive-improve
Recursive self-improvement framework where agents capture execution traces, analyse failure patterns, and apply targeted fixes with keep-or-revert evaluation.
auto-research
Docs-only control plane for an open autonomous AI research lab — file-based operating model for human direction and agent execution.
autoresearch
Claude Code skill that generalises autoresearch into a reusable loop for software, docs, security, shipping, debugging, and other measurable goals.
codex-autoresearch
Codex-native autoresearch skill with resume support, lessons across runs, optional parallel experiments, and mode-specific workflows.
Thoth
Dashboard-first Claude Code and Codex runtime for autoresearch with durable runs, locked work items, visible ledgers, and reviewable verdicts.
gemini-autoresearch
Gemini CLI skill that generalises autoresearch to any measurable goal. Uses Google Search grounding as a live verification source inside the loop.
pi-autoresearch
Extension plus dashboard for persistent experiment loops, live metrics, confidence tracking, and resumable autoresearch sessions.
autoresearch-claude-code
Claude Code plugin port of pi-autoresearch with a clean experiment-loop workflow and a concrete biomechanics case study.
autocontext
Closed-loop control plane for repeated agent improvement, with evaluation, persistent knowledge, staged validation, and optional distillation into cheaper local runtimes.
ax
Local retro loop for AI coding agents: captures session traces, turns repeated friction into proposals, and tracks accepted fixes as experiments.
goal-md
Generalises autoresearch into a GOAL.md pattern for repos where the agent must first construct a measurable fitness function before it can optimise.
lazy-developer
Claude Code skill that orchestrates autoresearch across a prioritised sequence of optimisation goals using GOAL.md as the engine.
autoresearch-at-home
Collaborative fork that adds experiment claiming, shared best-config syncing, hypothesis exchange, and swarm-style coordination across many single-GPU agents.
autoresearch-anything
Generalises autoresearch to any measurable metric — system prompts, API performance, landing pages, test suites, config tuning, SQL queries.
autoresearch-everywhere
Cross-platform expansion that auto-detects hardware config and starts the loop. The glue-and-generalisation half of autoresearch.
ADAS
Automated Design of Agentic Systems — meta-agents that invent novel agent architectures by programming them in code.
self_improving_coding_agent
Self-Improving Coding Agent that edits its own codebase. Workshop paper demonstrating scaffold-level self-improvement on coding benchmarks.
self-improving-agent
Alternative self-improving agent architecture with reflection and meta-learning cycles.
HGM
Huxley-Gödel Machine for coding agents — applies self-improvement to SWE-bench performance via meta-level optimisation.
gepa
Genetic-Pareto reflective prompt evolution that outperforms RL (GRPO) on benchmarks. Optimises any textual parameters against any metric using natural language reflection.
EvoSkill
Automated skill discovery for coding agents: evolves reusable skills and prompts from failed trajectories against benchmarks.
autoevolve
GEPA-inspired autoresearch for self-play: mutate code strategies, evaluate head-to-head, rate with Elo, branch from the Pareto front.
ClawTeam
Agent swarm intelligence for autoresearch — spawns parallel GPU research directions, distributes work across agents, aggregates results.
AI-Research-SKILLs
Comprehensive skill library including autoresearch orchestration with two-loop architecture (inner optimisation + outer synthesis).
aideml
Tree-search ML engineering agent that autonomously improves model performance via iterative code generation and evaluation.
weco.ai
Cloud platform for AIDE with observability, experiment tracking, and managed runs — brings the autoresearch loop into production.
AutoResearchClaw
End-to-end research pipeline that turns a topic into literature review, experiments, analysis, peer review, and paper drafts.
NanoResearch
End-to-end autonomous research engine that plans experiments, generates code, runs jobs locally or on SLURM, analyses real results, and writes papers.
ARK
Automatic Research Kit: idea + venue → paper pipeline orchestrating 6 agents — proposal analysis, literature search, Slurm experiments, LaTeX drafting, iterative peer review.
Auto-claude-code-research-in-sleep
Markdown-first research workflows for Claude Code and other agents, centred on autonomous literature review, experiments, paper iteration, and cross-model critique.
AutoSci
Wiki-centric full-lifecycle research platform built on Claude Code. 20+ skills cover the full loop: ingest → ideate → novelty check → experiment → paper writing.
AutoResearch-SibylSystem
Fully autonomous AI scientist built on Claude Code, with multi-agent research iteration, GPU experiment execution, and a self-evolving outer loop.
autoresearcher
Early open-source package for automating scientific workflows, centred on literature-review generation with an ambition toward broader autonomous research.
agi
Distributed, peer-to-peer research network where autonomous agents run experiments, gossip findings, maintain CRDT leaderboards, and archive results to GitHub.
CORAL
Autonomous multi-agent evolution for open-ended discovery. Long-running agents with shared persistent memory, asynchronous execution, and heartbeat-based interventions. SOTA on 10 math/algorithmic/systems tasks.
AI-Scientist
First comprehensive system for fully automatic scientific discovery. From idea generation to paper writing with minimal human supervision.
AI-Scientist-v2
Workshop-level automated scientific discovery via agentic tree search. Removes template dependency from v1, generalises across research domains.
AiScientist
Long-horizon ML research lab with hierarchical orchestration and File-as-Bus coordination. Drives autonomous paper-reproduction (PaperBench) and MLE-Bench iteration loops.
AI-Researcher
Full end-to-end research automation: hypothesis → experiments → manuscript → peer review. Production version at novix.science.
Auto-Research
Orchestrates a team of AI agents across the full research lifecycle — lit review, hypothesis generation, experiments, manuscript writing, and peer review.
AgentLaboratory
End-to-end autonomous research workflow: idea → literature review → experiments → report. Supports both autonomous and co-pilot modes.
agentrxiv.github.io
Collaborative autonomous research framework where agent laboratories share a preprint server to build on each other's work iteratively.
ResearchAgent
Iterative research idea generation over scientific literature with LLMs. Multi-agent review and feedback loops.
MLR-Copilot
Autonomous ML research framework — generates ideas, implements experiments, analyses results.
ML-Agent
Reinforcing LLM agents for autonomous ML engineering. Learns from trial and error to improve model performance.
LatteReview
Low-code Python package for automated systematic literature reviews via AI-powered agents.
LitLLM
AI-powered literature review assistant using RAG for accurate, well-structured related-work sections in academic writing.
agentlaboratory.github.io
Three-phase research pipeline: Literature Review → Experimentation → Report Writing, with specialised agents for each phase.
openclaw-autoresearch
OpenClaw port of pi-autoresearch; autonomous experiment loop for any optimisation target with statistical confidence scoring.
autoresearch-macos
Widely adopted macOS fork that adapts upstream autoresearch for Apple Silicon / MPS while preserving the original loop shape.
autoresearch-mlx
MLX-native Apple Silicon port that keeps the upstream fixed-budget val_bpb loop while removing the PyTorch/CUDA dependency entirely.
autoresearch-win-rtx
Windows-native RTX fork focused on consumer NVIDIA GPUs, with explicit VRAM floors and a practical desktop setup path.
n-autoresearch
Multi-GPU autoresearch infrastructure with structured experiment tracking, adaptive search strategy, crash recovery, and queryable orchestration.
autoresearch-webgpu
Browser/WebGPU port that lets agents generate training code, run experiments in-browser, and feed results back into the loop without a Python setup.
autoresearch-engram
Fork with persistent cognitive memory — frequency-weighted retrieval of cross-session knowledge for improved experiment continuity.
karpathy/autoresearch#208
Adapts autoresearch for free T4 GPUs (Google Colab / Kaggle) with zero cost and zero local setup. Replaces Flash Attention 3 with PyTorch SDPA.
autoautoresearch
Jetson AGX Orin port with a director — a Go binary that injects novelty (arxiv papers + DeepSeek Reasoner) into the loop to escape local minima.
autoresearch-genealogy
Applies the autoresearch pattern to genealogy, using structured prompts, archive guides, source checks, and vault workflows to iteratively expand family-history research.
autovoiceevals
Uses adversarial callers plus keep-or-revert prompt edits to harden voice AI agents across Vapi, Smallest AI, and ElevenLabs.
atlas-gic
Applies the autoresearch keep-or-revert loop to trading agents, optimising prompts and portfolio orchestration against rolling Sharpe ratio instead of model loss.
autokernel
Applies the autoresearch loop to GPU kernel optimisation: profile bottlenecks, edit one kernel, benchmark, keep or revert, repeat.
autozyme
Multi-agent framework applying the autoresearch loop to CPU-side scientific software: profile a target function, generate one optimisation candidate, benchmark, keep or revert.
autoresearch-growth
Applies autoresearch to landing-page positioning and A/B test candidates, using analytics snapshots and measured experiment results.
autoresearch-sudoku
An AI agent iteratively rewrites and benchmarks a Rust sudoku solver, ultimately beating leading human-built solvers on hard benchmark sets.
autospec
Reads natural-language business rules and autonomously builds a Spring Boot service with tests via the keep-or-revert loop.
tpu_performance_autoresearch_wiki
Applies the autoresearch loop to TPU model performance (MFU / tokens-per-sec) on v6e hardware. Includes Llama3-8B and Qwen3-8B case studies across JAX and torchax lanes.
MLAgentBench
Benchmark suite for evaluating AI agents on ML experimentation tasks. 13 tasks from CIFAR-10 to BabyLM.
mle-bench
OpenAI's benchmark for measuring how well AI agents perform at ML engineering.
mlrbench
Evaluates AI agents on open-ended ML research. 201 tasks from NeurIPS/ICLR/ICML workshops.
ML-Bench
Evaluates LLMs and agents for ML tasks on repository-level code.
AgentBench
Comprehensive benchmark for LLM-as-Agent evaluation across 8 distinct environments.
awesome-deep-research-agent
Curated list of deep research agent papers and systems.
LLM-Agent-Optimization
Papers on LLM agent optimisation methods.
awesome-ai-agent-papers
Curated AI agent papers from 2026 — agent engineering, memory, evaluation, workflows, and autonomous systems.
ai-agent-papers
AI agent research papers updated biweekly via automated arxiv search with curated selection.
Autonomous-Agents
Autonomous agents research papers, updated daily.
Awesome-LLM-Scientific-Discovery
Survey on LLMs in scientific discovery.
Awesome-AI-Scientist-Papers
Collection of AI Scientist / Robot Scientist papers.
agenticscience.github.io
Survey: "From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery."
dspy.ai/GEPA
DSPy integration of GEPA reflective prompt optimiser for compound AI systems.
developers.openai.com
Cookbook for autonomous agent retraining using GEPA-style reflective evolution.
awesome-autoresearch
Curated list of AutoResearch use cases with verifiable traces and progress charts, organised by domain.
Built something with autoresearch?
This index is maintained as an open list on GitHub. If you've built a fork, adaptation, benchmark, or writeup — open a pull request and we'll add it.
webfuse-com/awesome-autoresearch · CC0 license