/* ------------------------------------------------ */

BMD

// build. measure. deploy.

*/

pat@5090: ~/desk

A real shell running in your browser. Type desk to size your GPU, or tap a command. The bench replay streams at the measured rate. It is a replay, not a live model.

pat@5090:~$ desk

See what your GPU can actually run.

Pick the GPU you own. Get the local models that fit it, the quant to use, expected tokens per second, and a copy-paste run command. Measured on real hardware where we have it, estimated everywhere else.
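A rough sketch of the kind of fit check a sizing tool could run. This is not the desk's actual method: the function names, the 4.85 bits-per-weight figure used for Q4_K_M, and the flat 2 GB overhead for KV cache and runtime buffers are all assumptions chosen for illustration.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 2.0) -> tuple[bool, float]:
    """Return (fits?, estimated GB needed).

    overhead_gb is a placeholder for KV cache and runtime buffers (assumed);
    real usage grows with context length.
    """
    needed = estimate_weight_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb, needed


if __name__ == "__main__":
    # 4.85 bits/weight is an assumed ballpark for Q4_K_M, not a measured value.
    ok, needed = fits(params_billion=8.0, bits_per_weight=4.85, vram_gb=32.0)
    print(f"fits={ok} needed~{needed:.1f} GB")
```

An estimate like this only narrows the candidates. The measured peak VRAM in the reports is the number to trust where a row exists.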

[ Open the sizing desk ][ Subscribe to the journal ]

pat@5090:~$ cat latest_sweep.md

THE 5090 REPORTS // the lab notes behind the desk

I run open models on my own GPUs and publish the results.

Each report shows the model, quant, prompt, hardware, speed, VRAM use, and what broke. AgentGuard handles spend limits, loops, timeouts, and rate limits.

Read the 5090 Reports

pat@5090:~$ column -t fig1_sweep.tsv

Headline run 01 / 06

Model: llama3.1:8b
Quant: Q4_K_M
Generation: 228.9 TOK/S
Peak VRAM: 7.2 GB


Measured benchmark rows from the 2026-07-09 sweep
Model	Quant	Workload	Gen tok/s	Prompt tok/s	Peak VRAM
llama3.1:8b	Q4_K_M	short-gen-256	228.9	679	7.2 GB
llama3.1:8b	Q4_K_M	long-context-summarize	206.7	12,109	7.8 GB
llama3.1:8b	Q4_K_M	agent-code-task-512	227.8	1,028	7.8 GB
gemma4:26b	Q4_K_M	short-gen-256	198.8	15 (cold)	19.9 GB
gemma4:26b	Q4_K_M	long-context-summarize	180.2	6,179	20.2 GB
gemma4:26b	Q4_K_M	agent-code-task-512	207.4	197	20.2 GB

Fig. 1. Six runs on the RTX 5090, Ollama 0.31.1, temperature 0. llama3.1:8b Q4_K_M reached 228.9 tok/s generation, measured 2026-07-09. [ Raw artifact ]
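The page does not say how the two tok/s columns were computed. One plausible way, sketched here as an assumption, is to divide the token counts by the durations that Ollama returns with a non-streaming response (`eval_count`, `eval_duration`, `prompt_eval_count`, `prompt_eval_duration`, with durations in nanoseconds). The sample response below uses made-up numbers, not values from the sweep.

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Tokens divided by elapsed seconds; 0.0 when no duration was reported."""
    if duration_ns <= 0:
        return 0.0
    return count * NS_PER_S / duration_ns


def summarize(resp: dict) -> dict:
    """Derive generation and prompt rates from an Ollama-style response dict."""
    return {
        "gen_tok_s": tokens_per_second(
            resp.get("eval_count", 0), resp.get("eval_duration", 0)
        ),
        "prompt_tok_s": tokens_per_second(
            resp.get("prompt_eval_count", 0), resp.get("prompt_eval_duration", 0)
        ),
    }


if __name__ == "__main__":
    # Illustrative numbers only (hypothetical response, not sweep data).
    sample = {
        "eval_count": 256,
        "eval_duration": 2_000_000_000,
        "prompt_eval_count": 100,
        "prompt_eval_duration": 500_000_000,
    }
    print(summarize(sample))
```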

pat@5090:~$ tail -n 1 stderr.log

[WARN] stderr

Changing num_ctx between requests forces a full model reload. On the 26B model, that costs 140 seconds per swap. Pin your context size.
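A minimal sketch of pinning the context size, assuming requests go to Ollama's generate endpoint, which accepts `num_ctx` inside `options`. The constant `PINNED_NUM_CTX`, its value of 8192, and the helper `build_request` are hypothetical names for illustration. The point is that every request in a run carries the same `num_ctx`.

```python
import json

PINNED_NUM_CTX = 8192  # assumed value; pick one size and keep it for the whole run


def build_request(model: str, prompt: str, num_ctx: int = PINNED_NUM_CTX) -> dict:
    """Build a generate payload that always carries the same num_ctx."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": 0},
    }


if __name__ == "__main__":
    prompts = ["short prompt", "a much longer prompt that still fits the pinned window"]
    payloads = [build_request("gemma4:26b", p) for p in prompts]
    # Every payload shares one context size, so no request asks for a different one.
    assert len({p["options"]["num_ctx"] for p in payloads}) == 1
    print(json.dumps(payloads[0], indent=2))
```

Each payload would be sent as the JSON body of a POST to the local Ollama server's `/api/generate` route. Routing all requests through one builder makes it hard to vary the context size by accident.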

Field note, 2026-07-09

pat@5090:~$ agentguard --status

I use AgentGuard to stop runaway spend, loops, timeouts, and rate-limit failures before a test touches real work.

$ pip install agentguard47

[ Open AgentGuard docs ][ Read the 5090 Reports ]

pat@5090:~$ ls tools/

§ 003 / TOOLS // 12 live tools

Local AI tools + 12 live tools.

12 live / 1 beta from 13 public tools. Open what helps.

#	Tool	How	Status
01	VRAM Calculator (web)	free+pro / browser	LIVE
02	Model Picker (web)	free+pro / browser	LIVE
03	Quantization Compare (web)	free+pro / browser	LIVE
04	Local LLM Toolkit Pro (7-day trial)	$7/mo	LIVE
05	AgentGuard v1.2.13	free / Python ($ pip install agentguard47)	LIVE
06	Agent Roadmap Scanner (web)	free / browser	LIVE

pat@5090:~$ ps aux | grep agents

§ 004 / OPERATING LOOP // ship, measure, package

One person. Small tools. Agent-assisted ops.

01 / Run
Run the model on the target GPU.

02 / Measure
Record tokens, latency, VRAM, cost, and failure mode.

03 / Publish
Write up the result or ship the tool it required.
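The Measure step can be sketched as one record per run, written out as tab-separated rows like the sweep table. The `RunRecord` class and its column names are hypothetical, not the author's schema. The example row reuses the measured llama3.1:8b short-gen-256 values from the sweep.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass
class RunRecord:
    """One benchmark run; field names are illustrative, not the real schema."""
    model: str
    quant: str
    workload: str
    gen_tok_s: float
    prompt_tok_s: float
    peak_vram_gb: float
    failure_mode: str = ""  # empty when nothing broke


def to_tsv(records: list[RunRecord]) -> str:
    """Serialize records as TSV with a header row taken from the field names."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([f.name for f in fields(RunRecord)])
    for record in records:
        writer.writerow(list(astuple(record)))
    return buf.getvalue()


if __name__ == "__main__":
    rows = [RunRecord("llama3.1:8b", "Q4_K_M", "short-gen-256", 228.9, 679.0, 7.2)]
    print(to_tsv(rows), end="")
```

Keeping the failure mode in the same row as the speed numbers means a broken run still produces a record instead of disappearing from the table.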

pat@5090:~$ tail blog.log

§ 005 / BUILD NOTES [ ALL POSTS ] ->

Build notes.

Aug 2, 2026 / 5 min read

My 5090 benchmark was missing the field I needed most

Aug 1, 2026 / 6 min read

Search Old Results Before Publishing an LLM Test

Jul 31, 2026 / 6 min read

Build Local LLM Eval Data From Real Failures

Jul 30, 2026 / 6 min read

How I Keep LLM Results Valid After a Driver Update

[bmdpat] 0:desk* // VRAM 7.8/32.0 GB (llama3.1:8b sweep)