```{mermaid}
flowchart TD
    L0["Lesson 0<br>The question<br>What if AI could research<br>its own code?"]
    L1["Lesson 1<br>The core loop<br>propose → run → measure → keep/revert"]
    L2["Lesson 2<br>program.md<br>Encoding research taste"]
    L3["Lesson 3<br>prepare.py<br>Fixed evaluation, TIME_BUDGET = 300"]
    L4["Lesson 4<br>train.py<br>The agent's canvas"]
    L5["Lesson 5<br>Human/agent split<br>Who controls what"]
    L6["Lesson 6<br>Trace one experiment<br>Proposal → diff → metric → decision"]
    L7["Lesson 7<br>Why naive loops fail<br>The taste problem"]
    HO["Hands-on<br>Build your own<br>autoresearch loop"]
    L8["Lesson 8<br>Community extensions<br>4 directions"]
    L9["Lesson 9<br>The pattern beyond ML<br>Generalized loop"]
    L10["Lesson 10<br>Why hype cooled<br>Perception vs reality"]
    L0 --> L1
    L1 --> L2
    L1 --> L3
    L1 --> L4
    L2 --> L5
    L3 --> L5
    L4 --> L5
    L5 --> L6
    L6 --> L7
    L7 --> HO
    HO --> L8
    L8 --> L9
    L9 --> L10
```
Autoresearch: An Autonomous Research Loop — A Design Journey
First Break AI — Project Watch
This is a Project Watch deep dive. We study real, shipping AI projects and reverse-engineer the engineering decisions. The goal of this post is to trace the design journey: understand the autoresearch loop, the human/agent split, and why this pattern generalizes. Connects to: Step 3 (inference engines), Step 4 (training), Step 5 (building AI products).
The project source code discussed in this post.
Table of contents
- The big picture
- Lesson 0: The question that started it
- Lesson 1: The core loop
- Lesson 2: Read program.md
- Lesson 3: Read prepare.py
- Lesson 4: Read train.py
- Lesson 5: The human/agent split
- Lesson 6: Trace one experiment
- Lesson 7: Why the naive loop fails
- Hands-on: Build your own autoresearch
- Lesson 8: What the community built
- Lesson 9: The pattern beyond ML
- Lesson 10: Why hype cooled but the project didn’t stall
- The complete picture
- Learning plan
How Autoresearch Works
By the end of this post you will have:
- Understood the autoresearch core loop: propose, run, measure, keep or revert
- Read every key file in the repo (`program.md`, `prepare.py`, `train.py`)
- Understood the human/agent split — who controls what, and why
- Traced one real experiment from proposal through measurement to commit or revert
- Understood why autoresearch is not AutoML (code diffs vs parameter grids)
- Built your own mini autoresearch loop and experienced its failure modes firsthand
- Seen how the community generalized this loop far beyond ML training
This is a design journey. You will trace the decisions in the autoresearch repository. At each step we ask: what problem was being solved, and how?
The Big Picture
Here is the complete journey. Each box is a lesson.
Lesson 0: The Question That Started It
Before we look at any code, let us understand the question that motivated this project.
The old way: AutoML
For years, the standard approach to automating research was AutoML — automated machine learning. You define a search space of hyperparameters:
```
learning_rate: [1e-4, 3e-4, 1e-3]
batch_size: [32, 64, 128]
depth: [6, 8, 12]
```
A search algorithm (grid, Bayesian, random) picks points from this grid, runs experiments, and finds the best configuration. The search object is a point in a human-defined space.
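The grid approach can be sketched in a few lines. This is an illustrative reconstruction, not code from any AutoML library; `run_experiment` is an invented stand-in for a real training run that returns a validation metric (lower is better):

```python
from itertools import product

# The same hypothetical grid as above.
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64, 128],
    "depth": [6, 8, 12],
}

def run_experiment(config):
    # Toy objective: pretend lr=3e-4 and depth=12 happen to be optimal.
    return abs(config["learning_rate"] - 3e-4) * 1000 + 1.0 / config["depth"]

# Exhaustive grid search: evaluate every point, keep the best.
best = min(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=run_experiment,
)
print(best)
```

Whatever the search algorithm, the winner is always a point inside the human-defined grid; nothing outside it is reachable.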
The problem: these spaces are rigid and brittle. They can only explore what a human thought to parameterize. If the right answer involves “restructure the attention mechanism” or “add gradient clipping and change the learning rate schedule together” — AutoML cannot find it, because those ideas are not in the grid.
The key insight: search over code, not parameters
What if the agent could edit the actual training code?
- "Change the window pattern from [128, 256] to [64, 128, 256, 512]"
- "Add gradient clipping at 1.0"
- "Replace AdamW with a custom optimizer that..."
The search object becomes a code diff. The space is open-ended — the agent can propose any valid Python change. This is closer to how human researchers actually work: they do not pick points from a grid. They read code, form hypotheses, make changes, and measure results.
```{mermaid}
flowchart LR
    subgraph automl ["Classical AutoML"]
        P["Human-defined<br>parameter grid<br>lr, batch_size, depth"] --> S["Search algorithm<br>grid, Bayesian, random"]
        S --> R["Best config<br>a point in the grid"]
    end
    subgraph autoresearch ["Autoresearch"]
        C["Agent reads<br>existing code"] --> D["Proposes a<br>code diff"]
        D --> E["Runs experiment<br>measures metric"]
        E --> F["Keeps or reverts<br>git commit/revert"]
        F --> C
    end
```
As tensor argued: “AutoML methods operated on human-parameterized search spaces… This changes with models, which can operate as ‘softly’ as humans can on the search space.” The key word is softly — not rigid parameter sweeps but flexible, context-aware code edits.
Check your understanding
- What is the search object in classical AutoML?
- What is the search object in autoresearch?
- Why can an agent find improvements that a parameter grid cannot?
Lesson 1: The Core Loop
The answer to “what if AI could research its own code?” is remarkably minimal. The entire system is a single loop.
The five steps
1. Agent reads `program.md` — the rules and constraints
2. Agent proposes a code change to `train.py` — a diff to the training script
3. Run `uv run train.py` — execute the experiment under a 5-minute time budget
4. Measure `val_bpb` — the fixed evaluation metric (validation bits-per-byte)
5. Decision: if `val_bpb` improved, `git commit`. If not, `git revert`. Go to step 2. Never stop.
```{mermaid}
flowchart TD
    A["Agent reads program.md<br>(rules and constraints)"] --> B["Agent proposes code change<br>to train.py"]
    B --> C["Run: uv run train.py<br>(5-minute budget)"]
    C --> D["Measure val_bpb<br>(fixed evaluation from prepare.py)"]
    D --> E{Improved?}
    E -- Yes --> F["git commit<br>(keep the change)"]
    E -- No --> G["git revert<br>(discard the change)"]
    F --> B
    G --> B
```
That is the entire system. ~32k stars, ~4k forks, ~100 open PRs — all built on this loop.
Three design decisions that matter
Fixed time budget (5 minutes). Every experiment gets exactly 5 minutes. This makes results comparable across experiments and prevents the agent from running one experiment for hours to overfit.
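The budget rule can be sketched as follows. `run_with_budget` is a hypothetical helper, not code from the repo; how the real script enforces `TIME_BUDGET` may differ:

```python
import time

def run_with_budget(step_fn, budget_s):
    """Every experiment gets the same wall-clock window: do as many
    units of work as fit, then stop and evaluate."""
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()  # one unit of work, e.g. one optimizer step
        steps += 1
    return steps
```

Because every candidate gets the same window, "more steps per second" and "better loss per step" compete on equal terms.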
Git as memory. Every change is a commit, every failure is a revert. The entire history of experiments is in the git log. The agent can look back at what worked and what did not. This is better than “just overwrite the file” because nothing is ever lost.
Indefinite loop. The agent never stops. It runs experiments 24/7. This is the key difference from human research — a human sleeps, the agent does not.
Check your understanding
- What are the five steps of the autoresearch loop?
- Why does every experiment get exactly 5 minutes?
- Why is git better than “just overwrite the file” for tracking experiments?
Lesson 2: Read program.md — Encoding Research Taste
Now let us open the actual files. Each of the next three lessons reads one file from the repository.
What program.md contains
The file defines a strict autonomous experimentation loop:
- Create a branch
- Run `uv run train.py` repeatedly under the fixed budget
- Grep metrics from stdout
- Keep commits that improve `val_bpb`, revert those that do not
- Never stop
But it also contains something more subtle: domain-specific guidance. The file includes instructions about what kinds of changes to try, what to avoid, and how to interpret results. This is where the human encodes the research strategy.
The insight: who iterates on what
The human iterates on program.md — refining the prompt, the constraints, the strategy. The agent iterates on train.py — making code changes and measuring results.
This means “research taste” currently lives in the human’s prompt, not in the agent itself. The agent follows instructions; the human decides what good research looks like. This is a critical limitation — and the reason the community is trying to build “taste” into agents (Lesson 7).
Check your understanding
- What is the purpose of `program.md`?
- Where does “research taste” live — in the agent or in `program.md`?
- What happens if the human writes bad instructions in `program.md`?
Lesson 3: Read prepare.py — The Fixed Evaluation
The key constants
```python
TIME_BUDGET = 300  # seconds (5 minutes)
```
The function evaluate_bpb computes validation bits-per-byte on a fixed dataset, with fixed splits, every time. The evaluation is deterministic — same data, same splits, same metric.
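For intuition, the metric itself is simple to state. The sketch below uses the standard definition of bits-per-byte; the repo's `evaluate_bpb` may differ in implementation detail:

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Convert summed cross-entropy (in nats) over a byte-level
    validation set into bits per byte: total information the model
    failed to predict, normalized per byte. Lower is better."""
    return total_loss_nats / (total_bytes * math.log(2))
```

A model that needs exactly one bit per byte would score 1.0; a model that has perfectly memorized the validation set would score 0.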
Why the evaluation is fixed
This is a deliberate design decision to prevent Goodharting — the principle that “when a measure becomes a target, it ceases to be a good measure.”
If the agent could edit `prepare.py`, it could:

- Change the evaluation dataset to one where the model already scores well
- Modify the metric calculation to produce artificially lower `val_bpb`
- Reduce the dataset size so evaluation is easier
By making prepare.py untouchable, the design forces the agent to genuinely improve the training code. The only way to improve val_bpb is to write better train.py.
```{mermaid}
flowchart LR
    subgraph editable ["Editable zone (agent controls)"]
        T["train.py<br>Architecture, hyperparams,<br>optimizer, data loading"]
    end
    subgraph fixed ["Fixed zone (human controls)"]
        P["prepare.py<br>eval metric, dataset,<br>TIME_BUDGET = 300"]
        PM["program.md<br>rules, constraints,<br>research strategy"]
    end
    T -- "agent proposes changes" --> T
    P -- "cannot be modified" --> T
    PM -- "guides agent behavior" --> T
```
Check your understanding
- What is `val_bpb` and why is it the ground-truth metric?
- What is Goodharting, and how does the fixed evaluation prevent it?
- What would happen if the agent could modify `prepare.py`?
Lesson 4: Read train.py — The Agent’s Canvas
What train.py contains
A small training script for a character-level language model. If you completed Step 2, you will recognize the core components:
- Embedding layer — maps character IDs to vectors (like `token_embedding_table` in `run.c`)
- Attention mechanism — Q, K, V projections, multi-head attention (like the attention loop in `run.c`)
- FFN — feed-forward network with gated activation (like the SwiGLU FFN in `run.c`)
- Training loop — forward pass, loss computation, backward pass, optimizer step
The model is intentionally small. The file is ~200 lines. This is a deliberate design choice:
The file must fit in an LLM’s context window. If train.py were thousands of lines, the agent could not reason about it effectively. Keeping it small means the agent can read the entire file, understand the architecture, and propose meaningful changes.
```{mermaid}
flowchart TD
    subgraph trainpy ["train.py (~200 lines)"]
        EMB["Embedding<br>char → vector"]
        ATT["Attention<br>Q, K, V, multi-head"]
        FFN["FFN<br>gated activation"]
        LOSS["Loss<br>cross-entropy"]
        OPT["Optimizer<br>AdamW"]
    end
    EMB --> ATT --> FFN --> LOSS --> OPT
    OPT -- "backward + step" --> EMB
```
What the agent can change
Everything. Architecture (depth, width, head count). Hyperparameters (learning rate, batch size, weight decay). Optimization strategy (optimizer, scheduler, clipping). Data loading (sequence length, sampling). Regularization (dropout, weight initialization).
The constraint is not what can be changed, but how it is evaluated: every change must improve val_bpb within 5 minutes, or it gets reverted.
Check your understanding
- Why is `train.py` intentionally kept to ~200 lines?
- What components of the model can the agent change?
- How does the 5-minute time budget constrain what changes are practical?
Lesson 5: The Human/Agent Split
Now that you have read all three files, the architecture becomes clear. Autoresearch works because responsibilities are strictly separated.
The split
```{mermaid}
flowchart TD
    subgraph human ["Human responsibility"]
        PM["program.md<br>Rules, constraints,<br>research strategy<br>(iterates on the prompt)"]
        PP["prepare.py<br>Fixed evaluation,<br>val_bpb metric<br>(defines ground truth)"]
    end
    subgraph agent ["Agent responsibility"]
        TP["train.py<br>Training code<br>(iterates on the code)"]
        GIT["git<br>Commit improvements,<br>revert failures<br>(memory)"]
    end
    PM -- "guides" --> agent
    PP -- "evaluates" --> agent
```
The human controls the objective and constraints:

- `program.md` — what the agent should try, what it should avoid, how to interpret results
- `prepare.py` — the metric, the dataset, the time budget

The agent controls the experiments:

- `train.py` — the code being optimized
- git — the history of all experiments
Why this split matters
Safety: The agent cannot change what “success” means. It cannot game the evaluation. It can only try to genuinely improve the training code.
Reliability: The evaluation is deterministic. Same code → same metric. No randomness in whether an improvement is “real.”
Scalability: The human can refine the strategy (edit program.md) without touching the code. The agent can run experiments 24/7 without human supervision.
Debugging: When something goes wrong, the boundary is clear. Is the problem in the agent’s code changes? Check train.py diffs. Is the problem in the evaluation? Check prepare.py. Is the problem in the strategy? Check program.md.
The deeper insight
The human writes the “meta-prompt” — instructions about how to do research. The agent writes the code — the actual research artifacts. This is a new division of labor that does not exist in traditional software engineering or traditional research.
Check your understanding
- What does the human control in autoresearch?
- What does the agent control?
- Why can the agent not game the evaluation metric?
Lesson 6: Trace One Experiment
Let us follow one iteration of the loop from start to finish. This makes the abstract pattern concrete.
What a real experiment looks like
The session reports (linked from the repo README) show real agent behavior. Here is the pattern of a typical iteration:
1. Agent reads current state: the agent reads `train.py` and `program.md`. It sees the current architecture, hyperparameters, and recent git history.
2. Agent proposes a change: for example, “increase batch size from 32 to 64 and reduce learning rate from 3e-4 to 1e-4.”
3. Change is applied: the diff modifies two lines in `train.py`.
4. Experiment runs: `uv run train.py` executes for 5 minutes. The training loop runs, and at the end, `val_bpb` is computed.
5. Decision: if `val_bpb` decreased (lower is better for bits-per-byte), the change is committed. If not, it is reverted.
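Concretely, such a proposal boils down to a two-line diff (an illustrative reconstruction, not an actual commit from the repo):

```diff
--- a/train.py
+++ b/train.py
@@
-batch_size = 32
-learning_rate = 3e-4
+batch_size = 64
+learning_rate = 1e-4
```

The search object really is this small: a diff, a 5-minute run, one number, one commit-or-revert decision.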
```{mermaid}
flowchart LR
    READ["Agent reads<br>train.py + program.md<br>+ git log"] --> PROPOSE["Proposes:<br>batch_size 32→64<br>lr 3e-4→1e-4"]
    PROPOSE --> DIFF["Diff applied<br>2 lines changed<br>in train.py"]
    DIFF --> RUN["uv run train.py<br>5 minutes"]
    RUN --> METRIC["val_bpb measured<br>1.42 → 1.38"]
    METRIC --> DECISION{"Improved?"}
    DECISION -- "Yes (1.38 < 1.42)" --> COMMIT["git commit<br>'Increase batch size,<br>reduce lr'"]
    DECISION -- "No" --> REVERT["git revert"]
```
What the session reports show
Real gains come from concrete, specific changes:
- Batch size adjustments — finding the sweet spot for the 5-minute budget
- Depth changes — adding or removing transformer layers
- Window pattern tuning — changing attention window sizes
- RoPE parameter tuning — adjusting rotary position encoding settings
- Weight decay and initialization changes — small regularization tweaks
The agent finds improvements that a human researcher might also find — but it runs 24/7 and tries far more combinations than a human would.
Check your understanding
- What information does the agent have when proposing a change?
- How does the agent decide whether to keep or revert a change?
- Why does the git log matter for future experiments?
Lesson 7: Why the Naive Loop Fails
The core loop works — the agent finds real improvements. But it also hits walls. Understanding these failures is the key to understanding everything the community is building.
The search-policy wall
After running for many iterations, the autoresearch loop often gets stuck. Four problems emerge:
1. Hallucinated code. The agent sometimes proposes changes that do not compile or produce runtime errors. A 5-minute experiment wasted on a syntax error is 5 minutes lost.
2. Depth-first search only. The agent tends to make small incremental changes — “increase batch size by 8,” “decrease learning rate slightly.” It rarely makes bold structural changes like “replace the optimizer” or “restructure the attention mechanism.” This is the “depth-first” critique: the agent explores one direction deeply but does not explore broadly.
3. No memory. The basic loop has no mechanism for the agent to remember why previous experiments succeeded or failed. Each iteration starts fresh — the agent reads the current code and proposes a change. It does not have a “research journal” of insights.
4. No transferability. An improvement on one hardware setup may not transfer to another. An improvement at one model scale may not transfer to a larger scale. The loop does not test for generalization.
```{mermaid}
flowchart TD
    subgraph failures ["Why the naive loop fails"]
        F1["Hallucinated code<br>Syntax errors, runtime crashes<br>→ wasted 5-minute runs"]
        F2["Depth-first only<br>Small incremental changes<br>→ misses bold structural improvements"]
        F3["No memory<br>Agent forgets why things worked<br>→ repeats failed experiments"]
        F4["No transferability<br>Improvements may not generalize<br>→ overfitting to one setup"]
    end
```
This motivates everything
These are not abstract problems. The community hit all four of them. And the response — memory agents, guidance systems, verification tooling, search-policy improvements — is exactly what Lesson 8 is about.
Check your understanding
- What is the “depth-first search” problem in autoresearch?
- Why does lack of memory lead to repeated failed experiments?
- How could you test whether an improvement generalizes beyond the current setup?
Hands-on: Build Your Own Autoresearch
You have seen the design. Now build a toy version yourself. This is not a detour — it is the fastest way to make Lessons 0-7 stick. You will experience the failures from Lesson 7 with your own hands.
Step 1: The editable artifact — my_train.py
Every autoresearch loop needs an editable artifact — a file that the agent modifies. In the real version, this is train.py. We start with something simpler.
Create a file called my_train.py:
```python
import math

def train():
    """A tiny 'model' that predicts sin(x) using a polynomial."""
    coefficients = [0.0, 1.0, 0.0, -0.1]  # initial guess: x - 0.1*x^3

    def predict(x):
        return sum(c * x**i for i, c in enumerate(coefficients))

    test_points = [i * 0.1 for i in range(-30, 31)]
    errors = [(predict(x) - math.sin(x))**2 for x in test_points]
    mse = sum(errors) / len(errors)
    print(f"METRIC:mse={mse:.6f}")
    return mse

if __name__ == "__main__":
    train()
```

Run it:

```
python my_train.py
# Output: METRIC:mse=0.847532
```

That MSE is the number the agent will try to minimize. The file is small enough to fit in any LLM’s context window — this matters because the agent needs to read the full file to propose changes.
Step 2: The fixed evaluation — my_eval.py
The evaluation is the ground truth. It must be fixed — the agent cannot change it. This is what prevents Goodharting (Lesson 3).
Create a file called my_eval.py:
```python
import subprocess
import re
import sys

def evaluate():
    """Run my_train.py and extract the metric. Returns mse or None on failure."""
    try:
        result = subprocess.run(
            [sys.executable, "my_train.py"],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode != 0:
            print(f"EVAL_ERROR: script failed\n{result.stderr}")
            return None
        match = re.search(r"METRIC:mse=([\d.]+)", result.stdout)
        if not match:
            print("EVAL_ERROR: no METRIC line found in output")
            return None
        mse = float(match.group(1))
        print(f"EVAL_RESULT: mse={mse:.6f}")
        return mse
    except subprocess.TimeoutExpired:
        print("EVAL_ERROR: timeout (10s)")
        return None
    except Exception as e:
        print(f"EVAL_ERROR: {e}")
        return None

if __name__ == "__main__":
    evaluate()
```

Key design decisions — compare these to `prepare.py` (Lesson 3):

- Subprocess isolation — `my_train.py` runs in a separate process. If it crashes, the evaluation catches it.
- Timeout — 10 seconds. If the agent proposes code that runs forever, it gets killed.
- Structured output — the metric is extracted from a specific `METRIC:mse=` line.
- Error handling — anything that goes wrong returns `None`, which the loop treats as a failure.
Step 3: The loop — my_loop.py
Now the core — the agent loop that ties everything together.
Create my_loop.py:
````python
import subprocess
import sys
import re

def read_file(path):
    with open(path) as f:
        return f.read()

def write_file(path, content):
    with open(path, "w") as f:
        f.write(content)

def git(cmd):
    result = subprocess.run(
        ["git"] + cmd.split(),
        capture_output=True, text=True
    )
    return result.stdout.strip()

def evaluate():
    """Run my_eval.py and return the mse, or None on failure."""
    result = subprocess.run(
        [sys.executable, "my_eval.py"],
        capture_output=True, text=True, timeout=30
    )
    match = re.search(r"EVAL_RESULT: mse=([\d.]+)", result.stdout)
    if match:
        return float(match.group(1))
    return None

def ask_agent(current_code, current_mse, history):
    """Ask an LLM to propose a change to my_train.py.
    Replace the body of this function with your preferred LLM API.
    """
    prompt = f"""You are an AI research agent. Your goal is to minimize the MSE
of a polynomial approximation to sin(x).

Here is the current code in my_train.py:

```python
{current_code}
```

Current MSE: {current_mse:.6f}

Previous attempts: {history if history else 'None yet.'}

Propose a SINGLE concrete change to my_train.py that will reduce the MSE.
Return the COMPLETE new file content wrapped in ```python ... ``` markers.
Only change the coefficients list or add more terms. Do not change the
evaluation logic (test_points, the METRIC print line)."""
    # --- REPLACE THIS with your LLM API call ---
    # Example with OpenAI:
    #   from openai import OpenAI
    #   client = OpenAI()
    #   resp = client.chat.completions.create(
    #       model="gpt-4o-mini",
    #       messages=[{"role": "user", "content": prompt}]
    #   )
    #   return resp.choices[0].message.content
    #
    # Example with a local model via ollama:
    #   result = subprocess.run(
    #       ["ollama", "run", "qwen3:0.6b", prompt],
    #       capture_output=True, text=True
    #   )
    #   return result.stdout
    raise NotImplementedError(
        "Replace ask_agent() with your LLM API call. "
        "See comments in the function for examples."
    )

def extract_code(response):
    """Extract Python code from LLM response."""
    match = re.search(r"```python\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

def run_loop(n_iterations=10):
    git("init")
    git("add my_train.py my_eval.py")
    git("commit -m initial-commit")

    baseline_mse = evaluate()
    if baseline_mse is None:
        print("ERROR: baseline evaluation failed")
        return
    print(f"=== BASELINE MSE: {baseline_mse:.6f} ===\n")

    best_mse = baseline_mse
    history = []
    for i in range(n_iterations):
        print(f"--- Iteration {i+1}/{n_iterations} ---")
        current_code = read_file("my_train.py")
        response = ask_agent(current_code, best_mse, "\n".join(history[-5:]))
        new_code = extract_code(response)
        if new_code is None:
            print(" Agent returned no valid code. Skipping.")
            history.append(f"Iter {i+1}: agent returned invalid response")
            continue
        write_file("my_train.py", new_code)
        new_mse = evaluate()
        if new_mse is None:
            print(" Evaluation failed. Reverting.")
            git("checkout -- my_train.py")
            history.append(f"Iter {i+1}: evaluation failed (crash/timeout)")
            continue
        if new_mse < best_mse:
            improvement = best_mse - new_mse
            print(f" IMPROVED: {best_mse:.6f} -> {new_mse:.6f} "
                  f"(delta={improvement:.6f})")
            git("add my_train.py")
            git(f"commit -m improved-mse-{new_mse:.6f}")
            best_mse = new_mse
            history.append(
                f"Iter {i+1}: KEPT. mse {best_mse+improvement:.6f} -> "
                f"{new_mse:.6f}"
            )
        else:
            print(f" REVERTED: {best_mse:.6f} -> {new_mse:.6f} (worse)")
            git("checkout -- my_train.py")
            history.append(
                f"Iter {i+1}: REVERTED. mse went to {new_mse:.6f}"
            )
        print()

    print(f"=== FINAL MSE: {best_mse:.6f} "
          f"(started at {baseline_mse:.6f}) ===")
    print(f"=== Improvement: {baseline_mse - best_mse:.6f} ===")

if __name__ == "__main__":
    run_loop()
````
### Walk through the loop
Read `run_loop()` step by step:
1. **Initialize git** — `git init`, commit the initial files. This is the ledger.
2. **Establish baseline** — run `my_eval.py` to get the starting MSE.
3. **For each iteration:**
- Read the current `my_train.py`
- Ask the agent to propose a change (pass it the current code, current metric, and recent history)
- Extract the new code from the agent's response
- Write the new code to `my_train.py`
- Run the evaluation
- If MSE improved: `git commit` (keep the change)
- If MSE worsened or evaluation failed: `git checkout` (revert)
4. **Report** — final MSE vs. baseline
This is exactly the five-step loop from Lesson 1, implemented in Python. Compare: `program.md` → your prompt, `prepare.py` → `my_eval.py`, `train.py` → `my_train.py`, git → git.
```{mermaid}
flowchart TD
A["Read my_train.py<br>(current best version)"] --> B["Ask LLM:<br>propose a code change"]
B --> C["Write new code<br>to my_train.py"]
C --> D["Run my_eval.py<br>(10s timeout)"]
D --> E{MSE improved?}
E -- Yes --> F["git commit<br>keep the change"]
E -- No --> G["git checkout<br>revert to last good"]
E -- Error --> G
F --> A
G --> A
```
Step 4: Run it and watch
Before running, implement ask_agent() with your preferred LLM API:
Option A: OpenAI API (if you have an API key):
```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
return resp.choices[0].message.content
```

Option B: Local model via Ollama (if you completed Step 2):
```python
result = subprocess.run(
    ["ollama", "run", "qwen3:0.6b", prompt],
    capture_output=True, text=True, timeout=60
)
return result.stdout
```

Option C: Any API — Claude, Gemini, Groq, or any model that accepts a text prompt and returns text.
Once ask_agent() is implemented, create a fresh directory and run:
```
mkdir my-autoresearch && cd my-autoresearch
cp ../my_train.py ../my_eval.py ../my_loop.py .
python my_loop.py
```

Watch what happens across 10 iterations. Then check the git log:

```
git log --oneline
```

What you experienced
After running your loop, compare what you saw to the failures in Lesson 7:
| Lesson 7 failure | What you likely experienced |
|---|---|
| Hallucinated code | Agent proposed code that didn’t run — syntax errors, undefined variables |
| Depth-first only | Agent kept tweaking coefficients instead of trying fundamentally different approaches |
| No memory | Agent repeated similar proposals it already tried |
| No transferability | Your polynomial improved on test_points but might not generalize to other ranges |
These are not bugs in your code. These are the fundamental problems of agentic research — the same ones the community is responding to in Lesson 8.
Now you understand why the real version has program.md (research taste), a fixed evaluation (anti-Goodharting), and git as memory. You built the naive loop; the community is building the smart one.
Lesson 8: What the Community Built
The autoresearch repo has ~32k stars, ~4k forks, and ~100 open PRs. Analyzing the PR and issue backlog reveals that the community is treating autoresearch as four different things simultaneously — each one a response to the failures in Lesson 7.
Direction 1: Research orchestration
Problem it solves: The naive loop has no coordination, no memory, no dashboards.
What the community is building:
- Guidance agents (agents that steer other agents)
- Long-term memory and semantic knowledge banks
- Worker/function/trigger primitives
- Checkpoint and queue systems
- Multi-agent swarms with shared state
- Dashboards and experiment visualization
This is not “AutoML for one file.” This is agent-native research operations — a platform where agents coordinate, remember, and verify.
Direction 2: Hardware portability
Problem it solves: The original demo ran on an H100. Most people do not have H100s.
What the community is building:
- Apple Silicon / MLX support
- Consumer NVIDIA GPU support (RTX 3090, 4090)
- DGX Spark / GB10 support
- Multi-GPU / DDP setups
- Google Colab / Kaggle support
- SDPA fallback for non-Hopper GPUs
Users are treating autoresearch as a portable benchmark harness: “can an agent improve a training run on my hardware in 5 minutes?”
Direction 3: Research taste and verification
Problem it solves: The naive loop has no “taste” — it tries everything, including garbage.
What the community is building:
- Pre-verification (reject obviously bad ideas before spending a full run)
- Anti-overfitting policies
- Early stopping detection
- Bayesian sweeps and diversity-aware search
- Interpretability and transfer tests
- Deterministic controls
This is the community recognizing the core bottleneck: raw iteration is not enough. Agents need taste, memory, transferability, and verification.
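The cheapest form of pre-verification can be sketched in a few lines: reject a candidate file that is not even valid Python before spending a full timed run on it. This is an illustration of the idea only, not code from any community PR:

```python
import ast

def pre_verify(candidate_source):
    """Gate a proposed train.py rewrite: if it does not even parse,
    do not waste a 5-minute experiment on it."""
    try:
        ast.parse(candidate_source)
        return True
    except SyntaxError:
        return False
```

Real pre-verification goes further than a syntax check, but even this trivially cuts the "hallucinated code" failure mode from Lesson 7.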
Direction 4: Pattern generalization
Problem it solves: The loop works for ML training. Does it work for other domains?
What the community is building: Applications of the autoresearch pattern to sorting algorithms, prompt optimization, ranking systems, trading strategies, and more. We cover this in Lesson 9.
```{mermaid}
flowchart TD
    subgraph core ["Core loop"]
        LOOP["propose → run → measure<br>→ keep/revert"]
    end
    subgraph community ["Community extensions"]
        ORCH["Direction 1: Orchestration<br>multi-agent, memory,<br>dashboards, queues"]
        PORT["Direction 2: Portability<br>MLX, RTX, Colab,<br>multi-GPU, DDP"]
        TASTE["Direction 3: Taste<br>verification, triage,<br>search policy, early stop"]
        GEN["Direction 4: Generalization<br>beyond ML training"]
    end
    core --> ORCH
    core --> PORT
    core --> TASTE
    core --> GEN
```
PR clustering data
From the PR/issue backlog, the community’s work clusters like this:
| Category | PR count | Issue count | What it signals |
|---|---|---|---|
| Agent orchestration & ResearchOps | 15 | 5 | Multi-agent coordination, dashboards, buses |
| Platform support & performance | 10 | 3 | Non-H100 hardware (Mac, Windows, RTX, Colab) |
| Security & supply-chain hardening | 7 | 2 | Autonomous code execution creates safety concerns |
| Search strategy & experiment design | 6 | 4 | Taste, diversity, pre-verification, early stopping |
| Notable forks & ecosystem | 5 | 2 | The repo is becoming a hub, not a single implementation |
| Evaluation & interpretability | 4 | 4 | Better measurement, logging, transfer tests |
| Documentation & hygiene | 9 | 1 | Typical for early viral open source |
Check your understanding
- What are the four directions the community is building in?
- Which direction responds to the “depth-first only” problem?
- Why is security a concern in autoresearch?
Lesson 9: The Pattern Beyond ML
The most important signal from the community is this: the value is the loop, not the specific training target.
The generalized pattern
Any autoresearch-style system has four components:
- An editable artifact — the thing the agent is allowed to change (like `train.py`)
- A fixed evaluation — a deterministic metric the agent cannot game (like `val_bpb`)
- A time budget — every experiment gets the same fixed time
- A git ledger — every change is committed or reverted, creating memory
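The four components reduce to a few lines of domain-agnostic logic. All names here are illustrative, not from the repo; "lower is better" is assumed, as with `val_bpb` or MSE, and a `None` score means the experiment crashed or timed out:

```python
def autoresearch_step(artifact, propose, evaluate, best_score):
    """One tick of the generalized loop: returns the (possibly new)
    artifact and the (possibly improved) best score."""
    candidate = propose(artifact)   # edit the artifact
    score = evaluate(candidate)     # fixed eval under a fixed budget
    if score is not None and score < best_score:
        return candidate, score     # keep: git commit
    return artifact, best_score     # revert: git revert
```

Swap in a different `artifact` and `evaluate` and the same step works for prompts, sorting code, or trading strategies.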
```{mermaid}
flowchart LR
    ARTIFACT["Editable artifact<br>(code, config, strategy)"] --> LOOP["Tight loop<br>propose → run → measure"]
    LOOP --> EVAL["Fixed evaluation<br>(any deterministic metric)"]
    EVAL --> DECISION{"Improved?"}
    DECISION -- "Yes" --> COMMIT["git commit<br>(keep + remember)"]
    DECISION -- "No" --> REVERT["git revert<br>(discard + remember)"]
    COMMIT --> ARTIFACT
    REVERT --> ARTIFACT
```
Beyond ML training
People are already applying this pattern to domains far from training loops:
“Autocontext” — a recursive self-improving harness for any text task. The agent generates a rubric, evaluates outputs against it, and iteratively improves. The editable artifact is a prompt template. The fixed evaluation is a rubric score.
Distributed agent networks — connecting multiple autoresearch agents in peer-to-peer networks. Each agent explores a different branch. Improvements cross-pollinate between agents. The README describes the next step as “asynchronous, massively collaborative agents” — a SETI@home for research.
Skill factories — using the propose/test/keep loop to create libraries of verified agent skills. The editable artifact is a skill definition. The fixed evaluation is a task completion rate.
Quant strategy evolution — applying the same loop to trading strategy optimization. The editable artifact is a strategy script. The fixed evaluation is backtest performance on historical data.
Design exercise: your own loop
Pick any domain. Define the four components:
| Component | ML training | Sorting algorithm | Prompt optimization |
|---|---|---|---|
| Editable artifact | `train.py` | `sort.py` | `prompt.txt` |
| Fixed evaluation | `val_bpb` | sort time on fixed input | accuracy on fixed test set |
| Time budget | 5 minutes | 10 seconds | 30 seconds |
| Git ledger | commit/revert | commit/revert | commit/revert |
If you can fill in all four columns for your domain, you have an autoresearch-style loop.
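One way to make the exercise concrete is to write your four columns down as a small spec object. This is not from the repo — the class and field names are hypothetical — but it turns the "sorting algorithm" column of the table into runnable code:

```python
import random
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class LoopSpec:
    """The four components of an autoresearch-style loop (hypothetical names)."""
    editable_artifact: str        # the file the agent may change
    fixed_evaluation: Callable    # deterministic-as-possible metric
    time_budget_s: float          # same wall-clock budget per experiment
    ledger: str = "git commit/revert"  # how keep/revert decisions are remembered

def sort_time_on_fixed_input(sort_fn, n=5000, seed=42):
    """Time a sort function on the SAME fixed input every run.

    Fixing the seed fixes the input; wall-clock timing still has some
    noise, so a real harness would average over several runs.
    """
    random.seed(seed)
    data = [random.random() for _ in range(n)]
    t0 = time.perf_counter()
    sort_fn(list(data))
    return time.perf_counter() - t0

# The "sorting algorithm" column of the table, as a spec:
spec = LoopSpec(
    editable_artifact="sort.py",
    fixed_evaluation=sort_time_on_fixed_input,
    time_budget_s=10.0,
)
```

If any of the four fields is hard to fill in — usually the fixed evaluation — that is a sign the domain needs more design work before the loop can run.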
Check your understanding
- What are the four components of any autoresearch-style system?
- Why does the pattern work for domains beyond ML training?
- What would a sorting algorithm autoresearch loop look like?
Lesson 10: Why Hype Cooled but the Project Didn’t Stall
If you look at Twitter/X, it might feel like autoresearch peaked and faded. The PR/issue data tells a different story.
Five factors explain the gap
1. The novelty phase ended fast. Launch-week discourse was about the meme: “AI doing research on itself.” After that, the hard questions took over — does it transfer? Is it just shallow local search? That naturally produces fewer viral takes and more infrastructure PRs.
2. The base repo is intentionally minimal. It was designed to stay tiny and reviewable — one editable file, one context file, one GPU, one metric. That means ambitious extensions end up in forks and PR backlog, not in the main branch. Visible momentum shifts outward.
3. The community hit the search-policy wall. Current behavior is too narrow — “only does depth-first search.” Contributors are pushing for Bayesian sweeps, memory agents, diversity-aware exploration. This is the transition from “does the loop work?” to “how do we make the loop smart?”
4. Infrastructure PRs are not viral. Porting to MLX, securing tokenizer caches, adding checkpoints, fixing notebooks — these are important but not “demo material.” They signal maturation, not collapse.
5. Central repo activity is no longer the whole story. Once the pattern generalizes into forks, custom backends, and orchestration layers, the right metric is ecosystem activity, not commits-to-main.
flowchart LR
subgraph perception ["What it looks like"]
HYPE["Tweet volume drops"]
FEWER["Fewer viral demos"]
end
subgraph reality ["What is actually happening"]
INFRA["Infrastructure PRs<br>MLX, security, checkpoints"]
FORKS["Activity moves to forks<br>Custom backends, orchestration"]
HARD["Hard problems tackled<br>Search policy, memory, verification"]
end
perception -- "people think<br>'it stalled'" --> reality
The accurate take
The hype cycle cooled, but the project shifted from novelty to engineering reality. The work is now about memory, orchestration, verification, portability, and better search policies. That is what maturation looks like in open source.
Check your understanding
- Why is “commits-to-main” the wrong metric for autoresearch activity?
- What is the “search-policy wall” the community hit?
- Why do infrastructure PRs signal maturation, not collapse?
The Complete Picture
Here is the full journey from the original question to the community’s response:
flowchart TD
subgraph question ["The question (Lesson 0)"]
Q["What if AI could<br>research its own code?"]
end
subgraph design ["The design (Lessons 1-5)"]
LOOP["Core loop:<br>propose → run → measure → keep/revert"]
FILES["Three files:<br>program.md + prepare.py + train.py"]
SPLIT["Human/agent split:<br>human controls eval + strategy<br>agent controls code + experiments"]
end
subgraph practice ["In practice (Lessons 6-7)"]
WORKS["Real improvements found:<br>batch size, depth, RoPE, init"]
FAILS["But also fails:<br>hallucinations, depth-first,<br>no memory, no transfer"]
end
subgraph community ["Community response (Lessons 8-10)"]
ORCH["Orchestration"]
PORT["Portability"]
TASTE["Taste + verification"]
GEN["Pattern generalization"]
end
question --> design
design --> practice
practice --> community
Connection to your learning
| What you learned in the roadmap | How autoresearch connects |
|---|---|
| Step 2: `val_bpb` is a metric like any other eval | Autoresearch uses `val_bpb` as its single ground truth |
| Step 2: attention, KV cache, forward pass | train.py implements a small transformer — the agent modifies its architecture |
| Step 3: quantization, serving, benchmarking | Portability PRs are “benchmark this harness on my hardware” |
| Step 4: training loops, loss, optimization | The entire system is a training loop — but the “optimizer” is an agent editing code |
| Step 5: building AI products, agents, tool use | Autoresearch is a live case study of agentic product design |
The deepest lesson: autoresearch is not a new model or a new training technique. It is a design pattern — a way to structure the relationship between humans, agents, code, and evaluation. That pattern is transferable far beyond ML.
Exercises
Exercise 1: Read program.md
Open program.md. Answer: What constraints does the agent operate under? What is off-limits? Where does “research taste” live?
Exercise 2: Read prepare.py
Open prepare.py. Find TIME_BUDGET and evaluate_bpb. Why is the evaluation deterministic? Could the agent game this metric?
Exercise 3: Trace an experiment
Look at one of the session reports (linked from README). For one change: what did the agent propose? Did val_bpb improve? Was the commit kept or reverted?
Exercise 4: Design your own loop
Pick a domain outside ML training. Define: the editable artifact, the fixed evaluation, the time budget, and the program.md constraints. Write it as a one-page design doc.
Exercise 5: Fork and run
If you have GPU access: fork the repo, run one 5-minute experiment, analyze the result. Write a one-paragraph analysis: was the change smart?
Learning Plan for First Break AI
Concept-to-code mapping
| Concept | Where to find it | Lesson |
|---|---|---|
| AutoML vs code diffs | Lesson 0 diagrams | 0 |
| Core loop | Core loop diagram, `program.md` | 1 |
| Research taste | `program.md` in the repo | 2 |
| Fixed evaluation | `prepare.py` in the repo | 3 |
| Agent's canvas | `train.py` in the repo | 4 |
| Human/agent split | Architecture diagram | 5 |
| Real experiment trace | Session reports | 6 |
| Failure modes | Community PRs/issues | 7 |
| Build your own loop | `my_train.py`, `my_eval.py`, `my_loop.py` | Hands-on |
| Community extensions | PR clustering table | 8 |
| Generalized pattern | Pattern diagram | 9 |
Phase 1: Understand the design (Day 1)
Theory:
Practice:
Verification:
Phase 2: See it in action (Day 2)
Theory:
Practice:
Verification:
Phase 3: See the bigger picture (Day 3)
Theory:
Practice:
Verification:
Progress tracker
Copy and paste this into your notes:
Autoresearch Design Journey Progress
======================================
[ ] Lesson 0: The question that started it
[ ] Lesson 1: The core loop
[ ] Lesson 2: Read program.md
[ ] Lesson 3: Read prepare.py
[ ] Lesson 4: Read train.py
[ ] Lesson 5: The human/agent split
[ ] Lesson 6: Trace one experiment
[ ] Lesson 7: Why the naive loop fails
[ ] Hands-on: Build your own autoresearch
[ ] Lesson 8: What the community built
[ ] Lesson 9: The pattern beyond ML
[ ] Lesson 10: Why hype cooled
Exercises:
[ ] Exercise 1: Read program.md
[ ] Exercise 2: Read prepare.py
[ ] Exercise 3: Trace an experiment
[ ] Exercise 4: Design your own loop
[ ] Exercise 5: Fork and run
Summary
| Concept | What it is | Where to find it |
|---|---|---|
| The core loop | propose → run → measure → keep/revert | program.md defines it |
| AutoML vs autoresearch | Parameter grids vs code diffs | Lesson 0 |
| Human/agent split | Human controls eval + strategy, agent controls code | program.md + prepare.py vs train.py |
| Fixed evaluation | Deterministic metric the agent cannot game | prepare.py, val_bpb |
| Goodharting prevention | Agent cannot modify the evaluation | prepare.py is untouchable |
| The taste problem | Agents need memory, diversity, verification | Community PRs, Lesson 7 |
| The generalized pattern | Editable artifact + fixed eval + time budget + git | Lesson 9 |