How I Let an AI Agent Run 100 ML Experiments Overnight on a $500 GPU
I built an autonomous research agent that proposes neural network changes, trains models, evaluates results, and iterates — all while I sleep. Here's what happened.
Last week I let an AI agent run 100 machine learning experiments overnight on my RTX 3070. I woke up to a 25% model improvement. Here's exactly how it works.
The Setup
The agent is built on Karpathy's autoresearch concept, powered by Claude Sonnet. It runs in a loop:
- Propose — The agent analyzes current model performance and proposes a specific code change
- Implement — It writes the actual Python code to modify the neural network
- Train — The modified model trains on PubMed medical text data
- Evaluate — Loss metrics are compared against the baseline
- Decide — If improvement > threshold, keep the change. Otherwise, revert.
- Repeat — Go back to step 1 with updated context
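The loop above can be sketched in a few lines. This is a minimal illustration, not my actual code: `propose_change` and `train_and_eval` are stand-ins for what are really a Claude API call and a full PyTorch training run, and the 0.005 threshold is an assumed value.

```python
import random

THRESHOLD = 0.005  # assumed: minimum loss reduction required to keep a change

def propose_change(history):
    # Stand-in for the LLM call that reads past experiments and proposes a patch.
    return {"id": len(history), "description": "tweak a hyperparameter"}

def train_and_eval(change, current_loss):
    # Stand-in for a real training run; here, just a noisy random walk on the loss.
    return current_loss + random.uniform(-0.02, 0.02)

def research_loop(baseline_loss, n_experiments=100, seed=0):
    random.seed(seed)
    best_loss, history = baseline_loss, []
    for _ in range(n_experiments):
        change = propose_change(history)              # 1. propose
        loss = train_and_eval(change, best_loss)      # 2-4. implement, train, evaluate
        kept = loss < best_loss - THRESHOLD           # 5. decide: keep only clear wins
        if kept:
            best_loss = loss
        # otherwise the change is reverted (in the real agent, via git)
        history.append({"change": change, "loss": loss, "kept": kept})
    return best_loss, history                         # 6. repeat with updated context
```

The key design choice is that `best_loss` only ever goes down: a failed experiment costs nothing but time, which is exactly why a 93% failure rate is acceptable.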
The Results
Out of 100 experiments:
- 93 failed — proposed changes made the model worse or had no effect
- 7 succeeded — measurable improvements that the agent kept
- Net result — 25% improvement in model performance
The 7% hit rate sounds low, but that's the point. Research is mostly failure. The agent runs experiments I'd never have time to try manually.
What the Agent Discovered
The 7 successful experiments included:
- Learning rate scheduling changes I wouldn't have tried
- A specific attention head configuration that improved convergence
- Batch size adjustments that were counterintuitive but worked
- Layer normalization placement that contradicted my assumptions
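To give a flavor of the first category: a cosine learning-rate schedule is the kind of change the agent can propose and test cheaply. This is a generic textbook formula for illustration, not the specific schedule the agent found, and the `max_lr`/`min_lr` defaults are assumptions.

```python
import math

def cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5):
    """Decay the learning rate from max_lr to min_lr along a cosine curve."""
    progress = step / total_steps               # 0.0 at the start, 1.0 at the end
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Because the schedule is one pure function, the agent can swap it out, train, and compare losses without touching the rest of the training code.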
The Hardware
This runs on consumer hardware:
- GPU: NVIDIA RTX 3070 (8GB VRAM) — ~$500
- CPU: Standard desktop AMD Ryzen
- RAM: 32GB
- Storage: 1TB NVMe SSD
Total cost for the overnight run: about $0.50 in electricity, plus Claude API costs.
Why This Matters
The traditional ML research loop is: human thinks of experiment → human implements it → human waits for training → human evaluates → human thinks of next experiment.
Each cycle takes hours or days of human attention. My agent does it in minutes and runs 24/7.
The Code
The agent is ~300 lines of Python orchestrating:
- Claude Sonnet for reasoning and code generation
- PyTorch for training
- A simple SQLite database tracking all experiments
- Git for version control of each experiment
It's not magic. It's a loop with good prompts and clear evaluation criteria.
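The experiment-tracking piece is the least glamorous part but the one that makes the loop auditable. Here's a hypothetical sketch of what that SQLite log could look like; the table and column names are illustrative, not lifted from my code.

```python
import sqlite3

# Illustrative schema: one row per experiment, linked to a git commit so any
# run can be reproduced or reverted. Use a file path instead of :memory: to persist.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS experiments (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        description   TEXT NOT NULL,   -- the proposed change, in plain English
        git_commit    TEXT,            -- hash of the commit holding this experiment's code
        baseline_loss REAL,
        final_loss    REAL,
        kept          INTEGER DEFAULT 0  -- 1 if the change beat the threshold
    )
""")
conn.execute(
    "INSERT INTO experiments (description, git_commit, baseline_loss, final_loss, kept) "
    "VALUES (?, ?, ?, ?, ?)",
    ("cosine LR schedule", "abc1234", 3.12, 2.98, 1),
)
conn.commit()
rows = conn.execute("SELECT description, kept FROM experiments").fetchall()
```

With every experiment pinned to a commit hash, "revert" is just a `git checkout`, and the morning review is a single SQL query over the kept rows.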
What I Learned
- Autonomy requires clear metrics — the agent needs an unambiguous way to measure success
- Failure is the feature — 93% failure rate is fine when experiments are cheap
- Consumer hardware is enough — you don't need cloud GPUs for meaningful research
- Overnight is the killer use case — run experiments while you sleep, review results over coffee
Try It Yourself
You need:
- A GPU (even a 3060 works)
- An API key for Claude or GPT
- A clear metric to optimize
- Patience to debug the loop
The hardest part isn't the code — it's defining what "better" means for your specific model.
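One concrete way to pin down "better" is relative improvement in validation loss, with a minimum margin so run-to-run noise doesn't count as a win. The function name and the 0.5% default margin here are my illustrative choices, not a prescribed standard.

```python
def is_better(baseline_loss: float, new_loss: float, min_rel_gain: float = 0.005) -> bool:
    """Return True only if new_loss beats baseline_loss by at least min_rel_gain (relative)."""
    return (baseline_loss - new_loss) / baseline_loss >= min_rel_gain
```

However you define it, the test must be unambiguous and automated: if a human has to eyeball the result, the agent can't run 100 experiments while you sleep.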