Builder notes on the AI tools I run on real deals.
One idea per post. Architecture decisions, things I tried first and abandoned, what I'd do differently.
- 007
The dashboard was wrong for two weeks. Every health check was green.
Model evals don't catch a broken data layer. The fix is the same discipline pointed at the database — checks on the numbers people actually act on.
AI2026 · 06 · 10 - 006
The feature that tested seven points better and changed almost nothing
How the eval caught a leaked metric — and then told me how to make the feature actually useful.
AI2026 · 06 · 08 - 005
An accuracy number is meaningless without the cost of being wrong
Not all classification errors cost the same. The eval reports two numbers and ranks every failure by business impact.
AI2026 · 06 · 05 - 004
You can't grade a language model against its own guesses
The fastest way to build a useless eval is to use the model's own output as the answer key.
AI2026 · 06 · 02 - 003
The one place the LLM lives in my data pipeline
Dozens of Python tools, exactly one LLM call per document. The boundary is the design decision.
AI2026 · 05 · 25 - 002
Why I moved my rent-benchmark and income-statement workflows out of Claude chat into Skills
Three problems running spreadsheet workflows in Claude chat that pushed me to build dedicated skills.
AI2026 · 05 · 22 - 001
What this site is
An introduction to the site — who I am, what I build, and what to expect here.
2026 · 05 · 17