Don't Write Evals for Fast-Moving Systems
- Evals
- LLMs
When developing Clobsidian, an obvious question came up: as an evals person (Head of AI, even, to hear me tell it), should I be evaluating its capabilities more than I am? Which is to say, at all?
The answer is: not yet, but I know when I’ll start.
Scope
The title is a bit of clickbait. Here's the checklist to see whether this applies to your system:
- You’re developing an LLM-powered system.
- It’s moving fast.
- You’re the only user.
- You’re the only one who can fix the errors.
- You’re the only one who knows what success looks like.
Example: the draft-doctor skill
I have a skill that helps me develop half-baked ideas that I’ve written down but haven’t developed into a full draft. (It’s emphatically not a slop maker — rather, it’s a tool to help me get unstuck when I’m stuck, which mostly involves a lot of questions and some brainstorming.)
An excerpt from the skill:
Phase 2: Research Context
Before asking questions, gather context:
- URLs in draft: Use WebFetch to read linked articles/pages
- Previous posts: Check 02-Areas/Drafts/Posted/ for related content
- People/concepts: Note anything that might inform your questions
Phase 3: Ask Clarifying Questions (Socratic, Not Strategic)
Use AskUserQuestion to help the author think through the subject matter itself — not meta questions about format, audience, or packaging.
Ask questions that:
- Probe the core claim (“What exactly bothers you about this?”)
- Surface hidden assumptions (“Is there a case where this would be fine?”)
- Find the personal angle (“Have you done this yourself?”)
- Sharpen the argument (“What’s the strongest counterargument?”)
- Uncover the real insight (“What do you know that most people don’t?”)
Don’t ask:
- “Who is this for?” (meta)
- “What tone do you want?” (meta)
- “Should this be short or long?” (meta)
- “What do you want readers to do?” (meta)
Guidelines:
- Ask 1-2 questions at a time
- Ask about the IDEAS, not the presentation
- Follow the thread — let their answers reveal what they actually think
- Stop when THEY have clarity (you’ll see it in their answers)
- Match the language of the draft (English or Czech)
Even in this partial excerpt, there’s already quite a bit that can go wrong.
- The questions could be irrelevant.
- Early questions could overdetermine the later ones.
- The questions could be too narrow or too broad.
Should I have written evals for this skill, then? No. Not yet.
The reality: fix on fail, commit, move on
In reality, I’ll test the skill with myself as the first user. If it fails, I’ll fix it and commit the fix. If it works, I’ll commit the change with an explanation of why I made it, and move on to the next issue.
Okay, Simon, but surely you’re at least logging the errors? Annotating them, perhaps? Not if I’m the only user, and not if I’m fixing the errors myself right away.
If the error never recurs, this will be the last time I think about it.
Trigger 1: Regression
If the error comes back, though? That’s when an eval is a good idea.
It indicates that I didn't actually know how to fix the error the first time. Alternatively, it indicates that the error is in some way sticky, and will recur if allowed to.
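To make that concrete, here is a rough sketch of what a tiny regression eval for the draft-doctor skill could look like: a check that the skill's questions haven't drifted back into meta territory. Everything in it (the marker list, the hardcoded sample questions) is a stand-in for illustration; in a real eval the questions would come from running the skill on a fixture draft.

```python
# A minimal sketch of a regression eval for the draft-doctor skill.
# The failure mode being pinned down: the skill drifting back into
# "meta" questions about format/audience instead of the ideas.

META_MARKERS = [
    "who is this for",
    "what tone",
    "short or long",
    "want readers to do",
]

def is_meta(question: str) -> bool:
    """Heuristic: does the question probe packaging rather than ideas?"""
    q = question.lower()
    return any(marker in q for marker in META_MARKERS)

def check_questions(questions: list[str]) -> list[str]:
    """Return the questions that violate the 'no meta questions' rule."""
    return [q for q in questions if is_meta(q)]

if __name__ == "__main__":
    # In a real eval these would come from running the skill on a
    # fixture draft; here they're hardcoded to keep the sketch runnable.
    sample = [
        "What exactly bothers you about this?",
        "Who is this for?",  # should be flagged
    ]
    violations = check_questions(sample)
    assert violations == ["Who is this for?"], violations
    print("regression eval passed")
```

The point isn't the heuristic itself; it's that the check encodes the fix, so the same failure can't silently creep back in.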
Trigger 2: Onboarding
The other reason to write evals is to define the standard of success for others. Because if they don't know what success looks like, they can't fix the failure when they encounter it.
Trigger commonality
In both cases, you’re removing yourself from the critical path. That’s a necessity… at some point. But if your heart and soul are still in the system, there’s no need to remove yourself just yet.
What this means for instrumentation
Note that I’m saying you shouldn’t write evals too early. You absolutely should be logging what’s going on in the system, though.
(This is a bit tricky in a Claude Agent-ish setup — the simplest way that I’ve found is to have a separate sub-skill that’s explicitly invoked by the main skill.)
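For a sense of what that sub-skill can reduce to, here's a sketch of the kind of script it might shell out to: append one JSON line per event, so later analysis is just reading a file. The log path, field names, and CLI shape are my assumptions for illustration, not anything specific to Clobsidian.

```python
#!/usr/bin/env python3
# Sketch of a minimal interaction logger a logging sub-skill could invoke.
# Appends one JSON object per event, so analysis later is just reading JSONL.
# The log path and field names are assumptions, not a fixed convention.

import json
import sys
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path.home() / ".clobsidian" / "skill-log.jsonl"

def log_event(skill: str, phase: str, note: str) -> None:
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "skill": skill,
        "phase": phase,
        "note": note,
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    # Example: python log_event.py draft-doctor phase-3 "asked 2 questions"
    skill, phase, note = sys.argv[1:4]
    log_event(skill, phase, note)
```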
Conclusion
Evals have a cost. Pay it when it’s worth it.
Feel free to contact me
If you're looking for a data-driven AI consultant or simply want to have a chat.