There is no cognitive core
Simon Willison asks:
What's the latest research on how much baked-in knowledge an LLM needs in order to be useful?
If I want a specialist coding model can I trim the size of that model down by stripping out detailed knowledge of human history, geography etc?
Can we even do that?
Andrej Karpathy describes work toward such a system:
The race for LLM "cognitive core" - a few billion param model that maximally sacrifices encyclopedic knowledge for capability.
... It doesn't know that William the Conqueror's reign ended in September 9 1087, but it vaguely recognizes the name and can look up the date.
I'm pretty skeptical of this direction for threeish reasons.
1.
In humans, external tools have been used as a justification to avoid the real work of learning a skill, always to ill effect. Why learn your multiplication tables if you can just tape a reference table to your desk? Einstein himself is famously quoted as refusing to memorize anything he could look up.
It turns out that learning both background knowledge and subroutine skills cold is very valuable. It reduces working memory requirements (by automating the subroutine or placing it on a reliable stack) and enables the vaunted higher-order thinking. This is a "the things you carry" issue. On a walk in the forest, trees look different to you depending on whether you carry a chainsaw or a sketchbook or a hammock.
Karpathy's own quote included the example that the cognitive core model "can't recite the SHA-256 of empty string as e3b0c442..., but it can calculate it quickly". But being able to calculate it isn't the point - knowing the string cold turns it from random debug noise into a red flag: something that probably should have been data was not actually set before getting hashed. A model that doesn't know this, and thousands of other magic strings, is severely handicapped as a debugger.
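A minimal Python sketch of that failure mode (the `fingerprint` helper and its warning are my invention, not from either post):

```python
import hashlib

# SHA-256 of the empty string -- worth knowing cold.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()
# 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'

def fingerprint(payload: bytes) -> str:
    """Hash a payload, flagging the tell-tale empty-input digest."""
    digest = hashlib.sha256(payload).hexdigest()
    if digest == EMPTY_SHA256:
        # Red flag: we hashed nothing. The payload was probably
        # never populated before it reached this point.
        print("warning: hashed an empty payload -- upstream data missing?")
    return digest

print(fingerprint(b""))  # triggers the warning
```

Spotting `e3b0c442...` in a log and doing this check by eye, without running anything, is exactly the kind of move that requires the string to be in your head already.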
Your education (or a model's knowledge base) is something you carry as an observer or actor into each situation.
2.
I almost never write code about code.
I almost always write code that is related to some real-world domain knowledge. Modeling X. Managing Y. Right now I'm working on a phonics-based edtech app and the domain knowledge of models is continuously put to use.
I think a model where the almost-always case puts us in the external-tools-and-information-lookup mode is a bad bet. I'm much happier approaching domain problems with an assistant that knows not to paint the sky green. People may say "give the model basic knowledge, and prune the esoteric stuff", but there is no such line.
3ish.
Warning: extreme hand wave from layperson.
I doubt the feasibility of big-model capabilities in small-model geometries. "Pruning knowledge," as I understand it, relies on a monosemantic mental model of how LLMs work. Find the history-encoding parameters, cut them out. Find the syllogism-encoding parameters, leave them in. But knowledge encoding in these models is diffuse, and individual parameters are shared across many concepts.
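For concreteness, here is a toy NumPy sketch of that monosemantic picture, with invented per-row concept labels. The whole point is that real model weights carry no such labels, and each parameter participates in many concepts at once:

```python
import numpy as np

# Caricature of the "monosemantic" mental model: pretend each parameter row
# belongs to exactly one concept, so pruning "history" is just zeroing the
# rows tagged with it. (Hypothetical tags -- real LLM weights have none.)
rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 4))
concept_of_row = np.array(["history", "syntax", "history", "arithmetic",
                           "geography", "syntax", "history", "logic"])

pruned = weights.copy()
pruned[concept_of_row == "history"] = 0.0  # "cut out" the history rows

print(f"zeroed {np.sum(concept_of_row == 'history')} of {len(concept_of_row)} rows")
```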
Very roughly: I think there are model capabilities that require large geometries to emerge. Given large geometries, it's possible to pack them densely with lots of information. If you want the capabilities, there may be no extra cost to packing in the information.
[ Recent success in distilled smaller models like Haiku 4.5 and Gemini Flash 3 seems to disagree with this! ]