Where is the Credulity Benchmarking?
Here's a recent prompt I threw at Claude Sonnet 4:
> to `.me.bashrc`, add a alias or fcn `dictate` that:
> runs
> - sudo systemctl restart dictation.service
> - systemctl --user restart dication_tray.service
> if either restart fails, (does restart fail if something
> is not already running?), try again with a plain `start`
> (ignore this if not necessary)
> suggestion alternatives if bundling these doesn't
> work w/ the sudo pw prompt
Small task. Sonnet 4 wrote a nice little function to my bashrc, which I glanced over and then tested out.
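Think something in this spirit (a minimal sketch of what the prompt asks for, not the verbatim function Claude produced; the only names used are the ones from my prompt, typo included):

```bash
# Sketch of the requested helper -- not Claude's verbatim output.
# The typo'd unit name is carried straight through from the prompt.
dictate() {
    # systemctl restart also starts a unit that isn't running, so the
    # plain-start fallback the prompt asks about is mostly redundant.
    sudo systemctl restart dictation.service \
        || sudo systemctl start dictation.service

    # User-level tray service, typo ("dication") and all
    systemctl --user restart dication_tray.service \
        || systemctl --user start dication_tray.service
}
```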
Runtime failure!
> Failed to restart dication_tray.service:
> Unit dication_tray.service not found.
Do you see it? My original prompt had a typo: `dication` instead of `dictation` for the secondary service in the restart. Claude ran with it, even though the broader context makes the typo glaringly obvious.
Credulity is something LLM assistants struggle with: they take information from context too literally and assign it too much confidence.
It is of course technically possible that I have a service called `dication_tray` on my machine, and that I want it restarted together with the dictation service, but come on. At least ask about it!
For me, the desired behaviour here is something like:
> Quick check: you wrote `dication_tray.service` in the instructions here but I assume this is a typo and you meant `dictation_tray.service` (see the t after the c in dictation). Is this correct?
For the record, newer Claudes do better on this. Both Opus 4.1 and Sonnet 4.5 assumed `dication` was a typo, wrote functions targeting the correct service, and mentioned the correction in their responses.
All sample sizes here (for Sonnet 4, Opus 4.1, and Sonnet 4.5) are 1.
The general questions:
- how resistant is a given model to incorrect information?
- how actively do models surface and seek to clarify contradictions in the running context?
- are these questions addressed in existing benchmarks?
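Even a crude harness would start to answer the first two. A minimal sketch, where `ask_model` is a hypothetical stand-in for whatever wrapper prints a single model completion to stdout, and each file under `prompts/` (also hypothetical) holds a task with one planted, contextually obvious error like my `dication`:

```bash
#!/usr/bin/env bash
# Minimal credulity probe -- a sketch, not an existing benchmark.
# `ask_model` is a hypothetical command that prints one model completion
# to stdout; each prompts/*.txt contains a task with one planted error
# that the surrounding context makes obviously wrong.

flagged=0
total=0
for prompt in prompts/*.txt; do
    total=$((total + 1))
    response=$(ask_model < "$prompt")
    # Crude surface check: did the model question the planted error at all?
    if grep -Eiq 'typo|did you mean|assum(e|ing) you meant' <<<"$response"; then
        flagged=$((flagged + 1))
    fi
done

echo "Planted error flagged in $flagged/$total prompts"
```

A real benchmark would want a better grader than a keyword grep, but even this would separate models that run with the planted error from models that pause to ask.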