Presentations

17 Feb, 2026
The Eval Flywheel: From "Works on My Laptop" to Systematic Quality

Most teams shipping GenAI products have no evaluation system. They have vibes, a few saved prompts, and hope. This talk starts with observability: if you can't see what's happening in production, nothing else matters. From there, we build the eval flywheel — a practical pattern where production observability feeds error analysis, error analysis generates eval cases, and eval cases prevent recurrence.
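The flywheel loop can be sketched in a few lines. This is a minimal illustration, not code from the talk: `Trace`, `EvalCase`, and the function names are all hypothetical, and the pass/fail check is deliberately crude.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """A logged production interaction (step 1: observability)."""
    prompt: str
    output: str
    failed: bool = False  # marked during human review

@dataclass
class EvalCase:
    """A regression case distilled from an observed failure."""
    prompt: str
    must_not_contain: str

def error_analysis(traces):
    """Step 2: review production traces and keep the failures."""
    return [t for t in traces if t.failed]

def to_eval_cases(failures):
    """Step 3: turn each observed failure into a concrete eval case."""
    return [EvalCase(prompt=f.prompt, must_not_contain=f.output) for f in failures]

def run_evals(cases, model):
    """Step 4: re-run the model; a pass means the failure did not recur."""
    return {c.prompt: c.must_not_contain not in model(c.prompt) for c in cases}
```

Each new production failure feeds the next turn of the wheel: it becomes a trace, then an eval case, then a guard against recurrence.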

11 Feb, 2026
Evals, Benchmarks, and Guardrails: A Pythonista's Guide to Not Mixing Them Up

"I'll just write pytest tests for my LLM"—but should you? This talk untangles benchmarks, evals, and guardrails: three concepts that sound similar but map to different Python patterns. Learn why pytest CAN work for evals (with the right mindset), why guardrails aren't tests at all, and a grounded theory approach to defining what "good" actually means for your task.

03 Feb, 2026
When (& How) to Start Writing Evals

Most teams approach LLM evaluation like test-driven development: write tests first, then build. But LLMs have an infinite failure surface — you can't predict what will break. This talk argues for a different approach: deploy first, observe failures, then build evals for the patterns you've actually discovered.
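The "observe failures first" step can be sketched as bucketing real user reports into failure patterns, then writing evals only for the patterns that actually occur. A minimal illustration with hypothetical keyword buckets; real error analysis is a human-in-the-loop pass, not string matching.

```python
from collections import Counter

def classify_failure(user_report: str) -> str:
    """Crudely bucket a failure report into a pattern name."""
    report = user_report.lower()
    if "made up" in report or "wrong fact" in report:
        return "hallucination"
    if "cut off" in report or "incomplete" in report:
        return "truncation"
    return "other"

def failure_patterns(reports):
    """Count observed failure modes so evals target what breaks most often."""
    return Counter(classify_failure(r) for r in reports)
```

Only once a pattern shows up repeatedly does it justify a dedicated eval suite, which is the inverse of the write-tests-first workflow.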