Comparative analysis of Claude Sonnet 4, Sonnet 4.6, Opus 4.6, and GPT-4.1 as fully autonomous ML researchers.
Based on karpathy/autoresearch
Overview
An open-source framework that gives AI models full autonomy to design, implement, and evaluate ML experiments. Each model operates as a complete research agent -- reading data, writing code, debugging failures, and iterating on results -- with no human in the loop.
- Models design ML experiments, write training scripts, handle errors, and evaluate results independently.
- Claude Sonnet 4, Sonnet 4.6, Opus 4.6, and GPT-4.1 evaluated on identical benchmarks.
- Every experiment is logged, versioned, and reproducible. Setup scripts and structured output included.
- Per-experiment and per-model cost tracking. Know exactly what you spent.
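Cost tracking of this kind reduces to pricing token usage per experiment and aggregating per model. A minimal sketch; the rates, field names, and model keys below are illustrative, not the framework's actual schema:

```python
# Hypothetical per-experiment cost tracking: price token usage per
# experiment, then aggregate per model. Rates are illustrative USD
# per 1M tokens, not real pricing.
from collections import defaultdict

RATES = {"sonnet-4.6": {"in": 3.00, "out": 15.00}}  # illustrative

def cost_usd(model, tokens_in, tokens_out):
    r = RATES[model]
    return (tokens_in * r["in"] + tokens_out * r["out"]) / 1_000_000

totals = defaultdict(float)
for exp in [{"model": "sonnet-4.6", "in": 40_000, "out": 6_000},
            {"model": "sonnet-4.6", "in": 55_000, "out": 9_000}]:
    totals[exp["model"]] += cost_usd(exp["model"], exp["in"], exp["out"])

print(round(totals["sonnet-4.6"], 4))
```

Logging the token counts alongside each experiment record is what makes the per-experiment breakdown possible after the fact.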
Phase 1 Results
362 experiments across three Anthropic models. Each model built on the previous model's best result -- different baselines, so improvement percentages are not directly comparable.
| Model | Exps | Kept | Disc. | Crashed | Skipped | Keep % | Crash % | Skip % | Baseline | Best | Improv. | Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sonnet 4 | 147 | 5 | 45 | 50 | 47 | 5.2% | 34% | 32% | 0.948083 | 0.936221 | 1.25% | ~$15 |
| Sonnet 4.6 | 104 | 21 | 70 | 9 | 4 | 22.1% | 8.7% | 4% | 1.245301 | 0.955865 | 23.24% | ~$11 |
| Opus 4.6 (32K) | 111 | 8 | 64 | 21 | 18 | 8.9% | 18.9% | 16% | 0.955747 | 0.901860 | 5.64% | ~$55 |
Keep rate is computed over non-crashed experiments. Skipped = the LLM returned an unparseable response. Baselines differ per model (sequential design).
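The derived columns can be reproduced from the raw counts. A small sketch using the Sonnet 4 row as a worked example, with keep rate computed over non-crashed experiments as stated above:

```python
# Derived metrics from the Phase 1 table, worked for the Sonnet 4 row.
# Keep rate uses non-crashed experiments as its denominator; crash and
# skip rates use the full experiment count.
def metrics(total, kept, crashed, skipped):
    non_crashed = total - crashed
    return {
        "keep_pct": 100 * kept / non_crashed,
        "crash_pct": 100 * crashed / total,
        "skip_pct": 100 * skipped / total,
    }

m = metrics(total=147, kept=5, crashed=50, skipped=47)
print(round(m["keep_pct"], 1), round(m["crash_pct"], 1), round(m["skip_pct"], 1))
```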
Key Finding
Opus 4.6 with a 32K extended-thinking budget discovered an optimization chain that required understanding second-order effects -- something the models with shorter thinking budgets could not do.
1. **Removed gradient accumulation.** Changed `grad_accum` from 2 to 1, doubling optimizer steps within the 5-minute training window.
2. **Recognized doubled weight decay.** More optimizer steps means more weight-decay applications per run; the model predicted this side effect.
3. **Halved weight decay to compensate.** Adjusted weight decay down to offset the increased application frequency.
4. **Increased matrix learning rate.** With more frequent but smaller effective steps, raised the Muon learning rate for matrix parameters.
5. **Added exponential x0_lambda gradient.** Introduced exponential scheduling for initial-parameter regularization (depth-aware skip connections).
Each step depends on the previous one. Without extended thinking, the model would need multiple experiment rounds to discover each compensation independently. Opus proposed steps 1-4 as a single coherent change that succeeded on the first attempt.
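The coupled changes can be sketched as one config transformation. The names (`grad_accum`, `weight_decay`, `muon_lr`) mirror the prose but are assumptions about the actual training script, and the learning-rate factor is illustrative:

```python
# Illustrative sketch of the coupled chain as a single config change.
# Hyperparameter names and the LR factor are assumptions, not the
# framework's real training script.
base = {"grad_accum": 2, "weight_decay": 0.10, "muon_lr": 0.02}

def apply_chain(cfg):
    new = dict(cfg)
    new["grad_accum"] = 1                                  # 2x optimizer steps
    steps_ratio = cfg["grad_accum"] / new["grad_accum"]    # = 2.0
    new["weight_decay"] = cfg["weight_decay"] / steps_ratio  # halve per-step decay
    new["muon_lr"] = cfg["muon_lr"] * 1.5                  # raise matrix LR (illustrative)
    return new

print(apply_chain(base))
```

The point of the compensation: halving per-step weight decay while doubling the step count keeps total decay applied per run roughly constant, which is exactly the second-order effect the model anticipated before running the experiment.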
In Progress
Phase 1 had different baselines per model, making direct comparison unreliable. Phase 2 fixes this -- all models start from byte-identical code with the same unoptimized baseline.
Results will be published in v2 of the paper.
| Runs | Model | API | Status | Est. Cost |
|---|---|---|---|---|
| 1-3 | GPT-4.1 | Azure OpenAI | In progress | $0 |
| 4-6 | Sonnet 4.6 | Anthropic | Planned | $18 |
| 7-9 | Sonnet 4 | Anthropic | Planned | $18 |
| 10-12 | Opus 4.6 | Anthropic | Planned | $150 |
| Total (12 runs x 100 experiments) | | | | $186 |
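One way to check the byte-identical starting point across Phase 2 runs is to hash each run's source tree and compare digests. A self-contained sketch; the framework's actual verification, if any, may differ:

```python
# Hash file paths + contents so two source trees compare equal iff they
# are byte-identical. Paths in the demo are temporary, for illustration.
import hashlib
import tempfile
from pathlib import Path

def tree_digest(root: Path) -> str:
    h = hashlib.sha256()
    for p in sorted(root.rglob("*")):
        if p.is_file():
            h.update(p.relative_to(root).as_posix().encode())
            h.update(p.read_bytes())
    return h.hexdigest()

# Demo: two copies of the same file produce the same digest.
a, b = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
(a / "train.py").write_text("print('hi')\n")
(b / "train.py").write_text("print('hi')\n")
print(tree_digest(a) == tree_digest(b))
```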
Getting Started
Get up and running in under five minutes. Copy each block directly.
```bash
# Clone and set up
git clone https://github.com/bmdhodl/fullautoresearch.git
cd fullautoresearch
bash scripts/setup.sh
```

```bash
# Run pre-flight tests
bash scripts/test.sh
```

```bash
# Run with Sonnet 4.6 (recommended)
AUTORESEARCH_DEPTH=8 AUTORESEARCH_BATCH_SIZE=16 \
  bash scripts/run_forever.sh --dataset pubmed --tag my-run
```

```bash
# Run with Opus 4.6
bash scripts/run_forever.sh --opus --dataset pubmed --tag my-opus-run
```

```bash
# Run with GPT-4.1 via Azure
bash scripts/run_forever.sh --azure gpt-4.1 --dataset pubmed --tag my-azure-run
```
Cost Estimation
API cost per experiment (March 2026 pricing). GPU electricity not included -- see Table 5 in the paper for full breakdown.
Estimated API cost: $6.00 per run of 100 experiments ($0.06/experiment, Sonnet 4.6).
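The estimate is straightforward arithmetic, worth keeping explicit when scaling the run count:

```python
# Per-run API cost estimate: experiments x cost per experiment.
# The per-experiment rate is the Sonnet 4.6 figure quoted above.
n_experiments = 100
cost_per_exp = 0.06  # USD, Sonnet 4.6
print(f"${n_experiments * cost_per_exp:.2f}")
```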
Acknowledgments
This project builds on Andrej Karpathy's original autoresearch framework.
Built with PyTorch, Triton, HuggingFace, and Rich. Model APIs from Anthropic and Microsoft Azure / OpenAI.
Thanks to Renato Umeton, Ph.D. for publication guidance and championing open-source ML research, Dave Graham from ML Commons for parallel research collaboration and accountability, and the LinkedIn ML/NLP community for feedback.