Title TBD
Upcoming talk at CERNA.AI festival.
Related talks.
2026
Cognitive Exhaust Fumes: What Read-Only AI Sees That You Can't · ai.engineer/europe (Online)
The Eval Flywheel: From "Works on My Laptop" to Systematic Quality · GenAI Meetup (Prague, Czech Republic)
Evals, Benchmarks, and Guardrails: A Pythonista's Guide to Not Mixing Them Up · PyData (Prague, Czech Republic)
When (& How) to Start Writing Evals · Evals.cz Meetup #1 (Prague, Czech Republic)
What happens when AI systems passively observe information without modifying it? This talk explores the patterns and insights that read-only AI reveals: the cognitive byproducts humans overlook.
Most teams shipping GenAI products have no evaluation system. They have vibes, a few saved prompts, and hope. This talk starts with observability: if you can't see what's happening in production, nothing else matters. From there, we build the eval flywheel, a practical pattern where production observability feeds error analysis, error analysis generates eval cases, and eval cases prevent recurrence.
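As a rough illustration of the flywheel described above, the sketch below turns labeled production failures into regression eval cases and scores a model against them. Everything here is an assumption for illustration, not material from the talk: the `Trace` and `EvalCase` shapes, the `cases_from_traces` and `run_evals` helpers, and the "must not contain" check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Trace:
    """One logged production interaction (hypothetical shape)."""
    prompt: str
    output: str
    failure_label: Optional[str] = None  # filled in during error analysis


@dataclass
class EvalCase:
    """A regression case derived from an observed failure."""
    prompt: str
    must_not_contain: str


def cases_from_traces(traces: List[Trace], banned_by_label: Dict[str, str]) -> List[EvalCase]:
    """Error analysis -> eval cases: keep only traces with a known failure label."""
    return [
        EvalCase(t.prompt, banned_by_label[t.failure_label])
        for t in traces
        if t.failure_label in banned_by_label
    ]


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Eval cases -> recurrence check: fraction of cases the model passes."""
    if not cases:
        return 1.0
    passed = sum(c.must_not_contain not in model(c.prompt) for c in cases)
    return passed / len(cases)
```

In this sketch, each new failure pattern found in production adds an entry to `banned_by_label`, so the eval set grows only from failures that were actually observed.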
"I'll just write pytest tests for my LLM"—but should you? This talk untangles benchmarks, evals, and guardrails: three concepts that sound similar but map to different Python patterns. Learn why pytest CAN work for evals (with the right mindset), why guardrails aren't tests at all, and a grounded theory approach to defining what "good" actually means for your task.
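One way to picture the eval versus guardrail distinction from the abstract above is the minimal sketch below. It is an assumption-laden illustration, not code from the talk: `fake_llm`, `DATASET`, `accuracy`, and `guard` are hypothetical names, and the length check stands in for whatever runtime policy a real system would enforce.

```python
def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical)."""
    return "Paris" if "France" in prompt else "I don't know"


# Eval: runs offline over a dataset and asserts on an aggregate score.
DATASET = [("Capital of France?", "Paris"), ("Capital of Peru?", "Lima")]


def accuracy(model, dataset) -> float:
    """Fraction of questions whose expected answer appears in the output."""
    hits = sum(expected.lower() in model(q).lower() for q, expected in dataset)
    return hits / len(dataset)


def test_capital_accuracy():
    # pytest would collect this function; it checks a threshold,
    # not an exact output for every single case.
    assert accuracy(fake_llm, DATASET) >= 0.5


# Guardrail: runs in production on every response and changes what the
# user sees; it is not a test and never fails a build.
def guard(response: str, max_chars: int = 200) -> str:
    if len(response) > max_chars:
        return "Sorry, I can't answer that right now."
    return response
```

The eval lives in the test suite and reports quality over many inputs, while the guardrail sits in the request path and acts on a single response.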
Most teams approach LLM evaluation like test-driven development: write tests first, then build. But LLMs have an infinite failure surface, so you can't predict what will break. This talk argues for a different approach: deploy first, observe failures, then build evals for the patterns you've actually discovered.