yet-another-applied-llm-benchmark. Nicholas Carlini introduced this personal LLM benchmark suite back in February as a collection of over 100 automated tests he runs against new LLM models to evaluate their performance against the kinds of tasks he uses them for.
There are two defining features of this benchmark that make it interesting. Most importantly, I've implemented a simple dataflow domain specific language to make it easy for me (or anyone else!) to add new tests that realistically evaluate model capabilities. This DSL allows for specifying both how the question should be asked and also how the answer should be evaluated. [...] And then, directly as a result of this, I've written nearly 100 tests for different situations I've actually encountered when working with LLMs as assistants
The DSL he's using is fascinating. Here's an example:
"Write a C program that draws an american flag to stdout." >> LLMRun() >> CRun() >> \
VisionLLMRun("What flag is shown in this image?") >> \
(SubstringEvaluator("United States") | SubstringEvaluator("USA")))
This triggers an LLM to execute the prompt asking for a C program that renders an American Flag, runs that through a C compiler and interpreter (executed in a Docker container), then passes the output of that to a vision model to guess the flag and checks that it returns a string containing "United States" or "USA".
The DSL itself is implemented entirely in Python, using the __rshift__
magic method for >>
and __rrshift__
to enable strings to be piped into a custom object using "command to run" >> LLMRunNode
.
Recent articles
- Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!" - 10th August 2025
- My Lethal Trifecta talk at the Bay Area AI Security Meetup - 9th August 2025
- The surprise deprecation of GPT-4o for ChatGPT consumers - 8th August 2025