Autonomous AI Research

Autoresearch Ecosystem Index

A curated index of autonomous research loops, tools, and benchmarks built on the keep-or-revert primitive introduced by Andrej Karpathy.

82+ Forks & Adaptations
6 Ecosystem Categories
100% Unattended Runs
Cycle 38 · READING
01. READ_STATE
Read State: nanoGPT/train.py
Loss ledger loaded
02. PROPOSE
Propose Change: Batch size: 64
Patch generated
03. SANDBOX
Run Experiment: train.py --device=cuda
GPU 94% | 82°C
04. MEASURE
Measure Metric: val_bpb: 1.556
Diff: -0.028
05. LOG_OUTCOME
Log Outcome: results.tsv
Ledger synced
06. KEEP_REVERT
Keep / Revert: Verdict: KEEP
Git committed

Reference guide

Loop Reference & Community Outcomes

Quick reference on what autoresearch is, how the loop operates, and documented community results.

Core Concepts

Background

Autoresearch was introduced by Andrej Karpathy as a natural-language instruction document (program.md). A coding agent reads it, proposes one change, runs a time-bounded training session, and evaluates validation bits-per-byte (val_bpb).
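
For readers unfamiliar with the metric, here is a minimal sketch of bits-per-byte under its usual definition: total negative log-likelihood converted from nats to bits, divided by the number of bytes in the evaluated text. The function name and arguments are illustrative assumptions, not taken from the autoresearch codebase.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Hypothetical helper: assumes `total_nll_nats` is the sum of per-token
    losses over the validation text and `total_bytes` is that text's size
    in bytes.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by bytes normalises.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalising by bytes rather than tokens keeps the score comparable across runs that change the tokenizer, which matters when the agent is free to edit the training code.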

The Keep-or-Revert Primitive

Each cycle modifies exactly one file. If the measured metric improves, the change is committed. If not, the file is restored via Git hard reset. This reversible constraint prevents regressions from compounding over hundreds of cycles.
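
The primitive described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the upstream implementation: the names `decide`, `git_commands`, and `apply_verdict` are hypothetical, a tie is treated as "no improvement", and the Git invocations are the plain `git add`, `git commit -m`, and `git reset --hard HEAD` commands.

```python
import subprocess
from typing import Callable, List


def decide(new_metric: float, best_metric: float, lower_is_better: bool = True) -> str:
    """Return "keep" only on a strict improvement over the best metric so far."""
    if lower_is_better:
        improved = new_metric < best_metric
    else:
        improved = new_metric > best_metric
    return "keep" if improved else "revert"


def git_commands(verdict: str, target: str, message: str) -> List[List[str]]:
    """Git invocations for a verdict; `target` is the single edited file."""
    if verdict == "keep":
        return [["git", "add", target], ["git", "commit", "-m", message]]
    # Discard every uncommitted change, restoring the last kept state.
    return [["git", "reset", "--hard", "HEAD"]]


def apply_verdict(
    verdict: str,
    target: str,
    message: str,
    run: Callable[..., object] = subprocess.run,
) -> None:
    """Execute the commands; `run` is injectable so the logic can be dry-run."""
    for cmd in git_commands(verdict, target, message):
        run(cmd, check=True)
```

With the values from the example cycle, `decide(1.556, 1.584)` returns `"keep"`. Because a revert always returns the working tree to the last committed state, a bad cycle costs one time budget and nothing more.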

Applicability

The loop transfers to any domain with a measurable scalar fitness function. Documented uses include ML training loss, GPU kernel MFU, software build times, trading strategy Sharpe ratios, ancient document ink detection, and static analysis metrics.

Loop Steps

01 Read State: Reads baseline code and previous logs.
02 Select Change: Proposes code edits based on history.
03 Edit Target: Applies exactly one change to target file.
04 Run Experiment: Runs codebase under a sandbox time budget.
05 Read Metric: Parses evaluation output (e.g. val_bpb).
06 Keep/Revert: Commits on metric success; else reverts.
07 Log Outcome: Appends data to the results ledger.

Community Results

37 cycles
NanoGPT Training

Karpathy's overnight run completed 37 validation cycles.

65%
Shopify CI Builds

David Cortés achieved 65% faster CI builds. Tobi Lütke contributed multi-metric support and auto-commits, leading to the open-sourced pi-autoresearch.

Pitch Prediction

Driveline optimised XGBoost models that predict pitch velocity from sensor data.

Ancient Ink Detection

Self-supervised multi-agent loops improved cross-scroll generalisation of ink detection.

Case studies

Ecosystem Case Studies & Writeups

Deep dives, optimization reports, and technical guides from teams running the keep-or-revert loop in production.

shopify.engineering
Case Study · Shopify Engineering

Shopify CI Build Optimisation

David Cortés adapted autoresearch to optimise CI build times, achieving 65% faster builds. Tobi Lütke contributed multi-metric support and auto-commits, leading to the open-sourced pi-autoresearch (3,600+ stars).

nickoak.com
Case Study · Nick Oak

Tennis XGBoost + Reward Hacking

Autoresearch-inspired loop for tennis match prediction — and an honest account of where the optimisation setup went wrong (reward hacking).

scrollprize.substack.com
Case Study · Scroll Prize

Vesuvius Challenge Ink Detection Swarm

Multi-agent experimental loop applied to ancient-scroll ink detection, with a writeup on cross-scroll generalisation improvements.

paragiri.com
Case Study · Para Giri

Earth System Model Optimisation

Hybrid workflow where an LLM proposes equation structures and a search process tunes parameters, extending autoresearch into scientific modelling.

arxiv.org
Paper · arxiv.org

The Agentic Researcher

A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning. Cites autoresearch as the canonical example of automated ML experiment pipelines.

blog.skypilot.co
Case Study · SkyPilot

Scaling Autoresearch to GPU Clusters

Running autoresearch on H100/H200 clusters with cloud orchestration. Covers distributed experiment management and cost control.

addyosmani.com
Case Study · Addy Osmani

Self-Improving Coding Agents

Practical guide to setting up self-improving agent loops with Claude Code. Covers the key primitives and common failure modes.

ensue.dev
Case Study · Ensue Dev

autoresearch@home: Distributed AI Research

The SETI@home model applied to autoresearch — contribute GPU time to collective model optimisation.

mindstudio.ai
Case Study · MindStudio

Claude Code + AutoResearch for Self-Improving Skills

Building self-improving AI skills using Claude Code with autoresearch patterns. Step-by-step implementation guide.

particula.tech
Case Study · Particula

100 ML Experiments Overnight

Technical breakdown of the autoresearch loop with domain-agnostic fork applications and reproducible results.

news.aakashg.com
Case Study · Aakash Gupta

PM's Guide to Autoresearch

Product manager's guide covering setup, community forks, and real-world applications of the autoresearch loop.

sidsaladi.substack.com
Case Study · Sid Saladi

Autoresearch 101 Builder's Playbook

Deep-dive on applying autoresearch patterns to prompts, agents, and workflows with concrete examples.

fortune.com
Case Study · Fortune

Fortune Feature

Business and industry context on why autoresearch matters for the future of autonomous AI agents.

Ecosystem Index · 82 implementations

Forks, adaptations, writeups, and benchmarks

kayba-ai

recursive-improve

Recursive self-improvement framework where agents capture execution traces, analyse failure patterns, and apply targeted fixes with keep-or-revert evaluation.

github.com
vukrosic

auto-research

Docs-only control plane for an open autonomous AI research lab — file-based operating model for human direction and agent execution.

github.com
uditgoenka

autoresearch

Claude Code skill that generalises autoresearch into a reusable loop for software, docs, security, shipping, debugging, and other measurable goals.

github.com
leo-lilinxiao

codex-autoresearch

Codex-native autoresearch skill with resume support, lessons across runs, optional parallel experiments, and mode-specific workflows.

github.com
SeeleAI

Thoth

Dashboard-first Claude Code and Codex runtime for autoresearch with durable runs, locked work items, visible ledgers, and reviewable verdicts.

github.com
supratikpm

gemini-autoresearch

Gemini CLI skill that generalises autoresearch to any measurable goal. Uses Google Search grounding as a live verification source inside the loop.

github.com
davebcn87

pi-autoresearch

Extension plus dashboard for persistent experiment loops, live metrics, confidence tracking, and resumable autoresearch sessions.

github.com
drivelineresearch

autoresearch-claude-code

Claude Code plugin port of pi-autoresearch with a clean experiment-loop workflow and a concrete biomechanics case study.

github.com
greyhaven-ai

autocontext

Closed-loop control plane for repeated agent improvement, with evaluation, persistent knowledge, staged validation, and optional distillation into cheaper local runtimes.

github.com
Necmttn

ax

Local retro loop for AI coding agents: captures session traces, turns repeated friction into proposals, and tracks accepted fixes as experiments.

github.com
jmilinovich

goal-md

Generalises autoresearch into a GOAL.md pattern for repos where the agent must first construct a measurable fitness function before it can optimise.

github.com
james-s-tayler

lazy-developer

Claude Code skill that orchestrates autoresearch across a prioritised sequence of optimisation goals using GOAL.md as the engine.

github.com
mutable-state-inc

autoresearch-at-home

Collaborative fork that adds experiment claiming, shared best-config syncing, hypothesis exchange, and swarm-style coordination across many single-GPU agents.

github.com
zkarimi22

autoresearch-anything

Generalises autoresearch to any measurable metric — system prompts, API performance, landing pages, test suites, config tuning, SQL queries.

github.com
Entrpi

autoresearch-everywhere

Cross-platform expansion that auto-detects hardware config and starts the loop. The glue-and-generalisation half of autoresearch.

github.com
ShengranHu

ADAS

ICLR 2025

Automated Design of Agentic Systems — meta-agents that invent novel agent architectures by programming them in code.

github.com
MaximeRobeyns

self_improving_coding_agent

ICLR 2025

Self-Improving Coding Agent that edits its own codebase. Workshop paper demonstrating scaffold-level self-improvement on coding benchmarks.

github.com
peterskoett

self-improving-agent

Alternative self-improving agent architecture with reflection and meta-learning cycles.

github.com
metauto-ai

HGM

Huxley-Gödel Machine for coding agents — applies self-improvement to SWE-bench performance via meta-level optimisation.

github.com
gepa-ai

gepa

ICLR 2026 Oral

Genetic-Pareto reflective prompt evolution that outperforms RL (GRPO) on benchmarks. Optimises any textual parameters against any metric using natural language reflection.

github.com
sentient-agi

EvoSkill

Automated skill discovery for coding agents: evolves reusable skills and prompts from failed trajectories against benchmarks.

github.com
MrTsepa

autoevolve

GEPA-inspired autoresearch for self-play: mutate code strategies, evaluate head-to-head, rate with Elo, branch from the Pareto front.

github.com
HKUDS

ClawTeam

Agent swarm intelligence for autoresearch — spawns parallel GPU research directions, distributes work across agents, aggregates results.

github.com
Orchestra-Research

AI-Research-SKILLs

Comprehensive skill library including autoresearch orchestration with two-loop architecture (inner optimisation + outer synthesis).

github.com
WecoAI

aideml

Tree-search ML engineering agent that autonomously improves model performance via iterative code generation and evaluation.

github.com
Site

weco.ai

Cloud platform for AIDE with observability, experiment tracking, and managed runs — brings the autoresearch loop into production.

weco.ai
aiming-lab

AutoResearchClaw

End-to-end research pipeline that turns a topic into literature review, experiments, analysis, peer review, and paper drafts.

github.com
OpenRaiser

NanoResearch

End-to-end autonomous research engine that plans experiments, generates code, runs jobs locally or on SLURM, analyses real results, and writes papers.

github.com
kaust-ark

ARK

Automatic Research Kit: idea + venue → paper pipeline orchestrating 6 agents — proposal analysis, literature search, Slurm experiments, LaTeX drafting, iterative peer review.

github.com
wanshuiyin

Auto-claude-code-research-in-sleep

Markdown-first research workflows for Claude Code and other agents, centred on autonomous literature review, experiments, paper iteration, and cross-model critique.

github.com
skyllwt

AutoSci

Wiki-centric full-lifecycle research platform built on Claude Code. 20+ skills cover the full loop: ingest → ideate → novelty check → experiment → paper writing.

github.com
Sibyl-Research-Team

AutoResearch-SibylSystem

Fully autonomous AI scientist built on Claude Code, with multi-agent research iteration, GPU experiment execution, and a self-evolving outer loop.

github.com
eimenhmdt

autoresearcher

Early open-source package for automating scientific workflows, centred on literature-review generation with an ambition toward broader autonomous research.

github.com
hyperspaceai

agi

Distributed, peer-to-peer research network where autonomous agents run experiments, gossip findings, maintain CRDT leaderboards, and archive results to GitHub.

github.com
Human-Agent-Society

CORAL

Autonomous multi-agent evolution for open-ended discovery. Long-running agents with shared persistent memory, asynchronous execution, and heartbeat-based interventions. SOTA on 10 math/algorithmic/systems tasks.

View paper
github.com
SakanaAI

AI-Scientist

First comprehensive system for fully automatic scientific discovery. From idea generation to paper writing with minimal human supervision.

github.com
SakanaAI

AI-Scientist-v2

Workshop-level automated scientific discovery via agentic tree search. Removes template dependency from v1, generalises across research domains.

github.com
AweAI-Team

AiScientist

Long-horizon ML research lab with hierarchical orchestration and File-as-Bus coordination. Drives autonomous paper-reproduction (PaperBench) and MLE-Bench iteration loops.

View paper
github.com
HKUDS

AI-Researcher

NeurIPS 2025

Full end-to-end research automation: hypothesis → experiments → manuscript → peer review. Production version at novix.science.

github.com
openags

Auto-Research

Orchestrates a team of AI agents across the full research lifecycle — lit review, hypothesis generation, experiments, manuscript writing, and peer review.

github.com
SamuelSchmidgall

AgentLaboratory

End-to-end autonomous research workflow: idea → literature review → experiments → report. Supports both autonomous and co-pilot modes.

github.com
Site

agentrxiv.github.io

Collaborative autonomous research framework where agent laboratories share a preprint server to build on each other's work iteratively.

agentrxiv.github.io
JinheonBaek

ResearchAgent

Iterative research idea generation over scientific literature with LLMs. Multi-agent review and feedback loops.

github.com
du-nlp-lab

MLR-Copilot

Autonomous ML research framework — generates ideas, implements experiments, analyses results.

github.com
MASWorks

ML-Agent

Reinforcing LLM agents for autonomous ML engineering. Learns from trial and error to improve model performance.

github.com
PouriaRouzrokh

LatteReview

Low-code Python package for automated systematic literature reviews via AI-powered agents.

github.com
LitLLM

LitLLM

AI-powered literature review assistant using RAG for accurate, well-structured related-work sections in academic writing.

github.com
Site

agentlaboratory.github.io

Three-phase research pipeline: Literature Review → Experimentation → Report Writing, with specialised agents for each phase.

agentlaboratory.github.io
gianfrancopiana

openclaw-autoresearch

OpenClaw port of pi-autoresearch; autonomous experiment loop for any optimisation target with statistical confidence scoring.

github.com
miolini

autoresearch-macos

Widely adopted macOS fork that adapts upstream autoresearch for Apple Silicon / MPS while preserving the original loop shape.

github.com
trevin-creator

autoresearch-mlx

MLX-native Apple Silicon port that keeps the upstream fixed-budget val_bpb loop while removing the PyTorch/CUDA dependency entirely.

github.com
jsegov

autoresearch-win-rtx

Windows-native RTX fork focused on consumer NVIDIA GPUs, with explicit VRAM floors and a practical desktop setup path.

github.com
iii-hq

n-autoresearch

Multi-GPU autoresearch infrastructure with structured experiment tracking, adaptive search strategy, crash recovery, and queryable orchestration.

github.com
lucasgelfond

autoresearch-webgpu

Browser/WebGPU port that lets agents generate training code, run experiments in-browser, and feed results back into the loop without a Python setup.

github.com
tonitangpotato

autoresearch-engram

Fork with persistent cognitive memory — frequency-weighted retrieval of cross-session knowledge for improved experiment continuity.

github.com
GitHub

karpathy/autoresearch#208

Adapts autoresearch for free T4 GPUs (Google Colab / Kaggle) with zero cost and zero local setup. Replaces Flash Attention 3 with PyTorch SDPA.

github.com
ArmanJR-Lab

autoautoresearch

Jetson AGX Orin port with a director — a Go binary that injects novelty (arxiv papers + DeepSeek Reasoner) into the loop to escape local minima.

github.com
mattprusak

autoresearch-genealogy

Applies the autoresearch pattern to genealogy, using structured prompts, archive guides, source checks, and vault workflows to iteratively expand family-history research.

github.com
ArchishmanSengupta

autovoiceevals

Uses adversarial callers plus keep-or-revert prompt edits to harden voice AI agents across Vapi, Smallest AI, and ElevenLabs.

github.com
chrisworsey55

atlas-gic

Applies the autoresearch keep-or-revert loop to trading agents, optimising prompts and portfolio orchestration against rolling Sharpe ratio instead of model loss.

github.com
RightNow-AI

autokernel

Applies the autoresearch loop to GPU kernel optimisation: profile bottlenecks, edit one kernel, benchmark, keep or revert, repeat.

github.com
ElliotXie

autozyme

Multi-agent framework applying the autoresearch loop to CPU-side scientific software: profile a target function, generate one optimisation candidate, benchmark, keep or revert.

github.com
Agent-Analytics

autoresearch-growth

Applies autoresearch to landing-page positioning and A/B test candidates, using analytics snapshots and measured experiment results.

github.com
Rkcr7

autoresearch-sudoku

An AI agent iteratively rewrites and benchmarks a Rust sudoku solver, ultimately beating leading human-built solvers on hard benchmark sets.

github.com
jeongph

autospec

Reads natural-language business rules and autonomously builds a Spring Boot service with tests via the keep-or-revert loop.

github.com
vlasenkoalexey

tpu_performance_autoresearch_wiki

Applies the autoresearch loop to TPU model performance (MFU / tokens-per-sec) on v6e hardware. Includes Llama3-8B and Qwen3-8B case studies across JAX and torchax lanes.

github.com
snap-stanford

MLAgentBench

Benchmark suite for evaluating AI agents on ML experimentation tasks. 13 tasks from CIFAR-10 to BabyLM.

github.com
openai

mle-bench

OpenAI's benchmark for measuring how well AI agents perform at ML engineering.

github.com
chchenhui

mlrbench

Evaluates AI agents on open-ended ML research. 201 tasks from NeurIPS/ICLR/ICML workshops.

github.com
gersteinlab

ML-Bench

Evaluates LLMs and agents for ML tasks on repository-level code.

github.com
THUDM

AgentBench

ICLR 2024

Comprehensive benchmark for LLM-as-Agent evaluation across 8 distinct environments.

github.com
ai-agents-2030

awesome-deep-research-agent

Curated list of deep research agent papers and systems.

github.com
YoungDubbyDu

LLM-Agent-Optimization

Papers on LLM agent optimisation methods.

github.com
VoltAgent

awesome-ai-agent-papers

Curated AI agent papers from 2026 — agent engineering, memory, evaluation, workflows, and autonomous systems.

github.com
masamasa59

ai-agent-papers

AI agent research papers updated biweekly via automated arxiv search with curated selection.

github.com
tmgthb

Autonomous-Agents

Autonomous agents research papers, updated daily.

github.com
HKUST-KnowComp

Awesome-LLM-Scientific-Discovery

EMNLP 2025

Survey on LLMs in scientific discovery.

github.com
openags

Awesome-AI-Scientist-Papers

Collection of AI Scientist / Robot Scientist papers.

github.com
Site

agenticscience.github.io

Survey: "From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery."

agenticscience.github.io
Site

dspy.ai/GEPA

DSPy integration of GEPA reflective prompt optimiser for compound AI systems.

dspy.ai
Site

developers.openai.com

Cookbook for autonomous agent retraining using GEPA-style reflective evolution.

developers.openai.com
WecoAI

awesome-autoresearch

Curated list of AutoResearch use cases with verifiable traces and progress charts, organised by domain.

github.com

Submit via pull request

Built something with autoresearch?

This index is maintained as an open list on GitHub. If you've built a fork, adaptation, benchmark, or writeup — open a pull request and we'll add it.


webfuse-com/awesome-autoresearch · CC0 license