Cognitive Exhaust Fumes: What Read-Only AI Sees That You Can't
What happens when AI systems passively observe information without modifying it? This talk explores the patterns and insights that read-only AI reveals: the cognitive byproducts humans overlook.
Most teams shipping GenAI products have no evaluation system. They have vibes, a few saved prompts, and hope. This talk starts with observability: if you can't see what's happening in production, nothing else matters. From there, we build the eval flywheel — a practical pattern where production observability feeds error analysis, error analysis generates eval cases, and eval cases prevent recurrence.
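As a rough illustration of that flywheel (not the speaker's implementation), here is a minimal Python sketch. The JSONL trace format, the `flag_failures` heuristic, and the `run_model` callable are hypothetical stand-ins.

```python
# Minimal sketch of the flywheel, under assumed formats: production traces in
# JSONL, failures tagged via user feedback, evals as simple regression cases.
import json
from pathlib import Path


def load_traces(path: Path) -> list[dict]:
    """Step 1, observability: pull logged production traces (prompt, output, metadata)."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]


def flag_failures(traces: list[dict]) -> list[dict]:
    """Step 2, error analysis: a crude stand-in heuristic; in practice, a human review pass."""
    return [t for t in traces if t.get("user_feedback") == "thumbs_down"]


def to_eval_cases(failures: list[dict]) -> list[dict]:
    """Step 3: each analyzed failure becomes a pinned eval case."""
    return [{"input": f["prompt"], "must_not_repeat": f["output"]} for f in failures]


def eval_pass_rate(cases: list[dict], run_model) -> float:
    """Step 4, prevent recurrence: rerun the cases against the current model version."""
    passed = sum(run_model(c["input"]) != c["must_not_repeat"] for c in cases)
    return passed / len(cases) if cases else 1.0
```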
"I'll just write pytest tests for my LLM"—but should you? This talk untangles benchmarks, evals, and guardrails: three concepts that sound similar but map to different Python patterns. Learn why pytest CAN work for evals (with the right mindset), why guardrails aren't tests at all, and a grounded theory approach to defining what "good" actually means for your task.
Most teams approach LLM evaluation like test-driven development: write tests first, then build. But LLMs have an infinite failure surface — you can't predict what will break. This talk argues for a different approach: deploy first, observe failures, then build evals for the patterns you've actually discovered.
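A small hypothetical sketch of that loop: a reviewed production failure gets pinned as a regression eval case, so the suite only grows from patterns actually observed. The file layout and record fields are assumptions, not the speaker's tooling.

```python
# Hypothetical "observe first, then write the eval" loop: a reviewed production
# failure is appended to a regression suite, so evals cover only failure
# patterns actually seen. File layout and record fields are assumptions.
import json
from datetime import datetime, timezone
from pathlib import Path

EVAL_FILE = Path("evals/regressions.jsonl")


def record_failure(prompt: str, bad_output: str, note: str) -> None:
    """Called during error analysis when a reviewer tags a bad production output."""
    case = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "input": prompt,
        "bad_output": bad_output,
        "failure_note": note,  # e.g. "cited a refund policy that doesn't exist"
    }
    EVAL_FILE.parent.mkdir(parents=True, exist_ok=True)
    with EVAL_FILE.open("a") as f:
        f.write(json.dumps(case) + "\n")


def load_regression_cases() -> list[dict]:
    """The eval suite grows only from failures observed in production."""
    if not EVAL_FILE.exists():
        return []
    return [json.loads(line) for line in EVAL_FILE.read_text().splitlines() if line.strip()]
```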