Writing
Notes while building.
Mostly working through what actually holds up when you put AI in front of real users: retrieval, evaluation, and the judgment calls a metric will not make for you. A few pieces are listed below — still writing them.
What a RAG Demo Hides: The Metrics That Actually Matter
A retrieval demo answering one good question proves almost nothing. The interesting part is everything the happy path never shows you.
Designing Evaluation Loops for AI Products
Evaluation is not a QA step at the end. For AI products it is part of the product, and treating it that way changes what you build.
Let AI Answer, or Keep It Human? A Decision I Got Wrong Once
At GMATWhiz we automated a piece of mentorship that should have stayed human. Here is the framework I wish I had used.