Think Toggles are Dumb
Claude 3.7 was released a few days ago. Its code output is yet another impressive jump from the already excellent Claude 3.5 and 3.5 (new), a.k.a. 3.5.1, a.k.a. 3.6.
Another major addition is the Extended Thinking mode, where Claude does a Chain-of-Thought style deliberation before providing a final response to the query. It's accessible via the web interface for the mere cost of two clicks per interaction:
(Or five keystrokes, if you're that way inclined: tab, down, down, down, enter.)
Vs Human Interaction
In real life, I explicitly prompt the humans around me to stop and think more often than the average person does. I was a teacher, and I currently have two young kids. But for most people, in most situations, this is a very unnatural thing to think about.
If I ask a Spaniard what the capital of Spain is, I would expect, and be happy with, an off-the-cuff response. If I ask the same Spaniard for a summary of treatment options for a rare disease, and also hand them a pile of diagnostic materials and family history, I would expect a more deliberate response.
Importantly: I wouldn't have to tell them the difference.
All this to say: a manual think toggle betrays some stupidity in the overall mechanism. It also severely dampens the ability of conversations to wander naturally between periods of lighter and deeper substance.
A Modest Proposal
A more natural way to interact is for the assistant, like a person, to exercise some judgement about how much effort to put into a given response.
The basic structure is simple:
- The user asks a question / provides some prompt.
- The model assigns a complexity score to it.
- Below a certain threshold, the model responds immediately.
- Above the threshold, the model allocates a thinking budget proportional to the complexity score, and then responds.
It's so simple that Claude 3.7 (almost) produced a PoC from one prompt. I rounded some edges and the PoC is now live at http://nilock.github.io/autothink. This is a Bring-your-own-API-Key affair, and the browser communicates directly with the Anthropic API.
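For context, the browser-side setup is nothing exotic. Here's a minimal sketch, assuming the official @anthropic-ai/sdk and a user-supplied key (userProvidedKey is a hypothetical stand-in for however the key gets collected):

import Anthropic from "@anthropic-ai/sdk";

// Sketch: the user pastes their own key, and the SDK calls the Anthropic
// API straight from the page, no backend involved.
const anthropic = new Anthropic({
  apiKey: userProvidedKey, // hypothetical variable holding the pasted key
  dangerouslyAllowBrowser: true, // the SDK's explicit opt-in for client-side use
});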
Note below that the simple query provides no option to examine the thinking tokens, because there are none. Thinking is conditioned on the user's inputs. The progress bars embedded in Claude's responses show the perceived complexity of each prompt, rated from 0 to 100 (complexity 10 is the thinking threshold).
The mechanics here are as simple as the bullet points above. We pre-fire the user query for the complexity estimate:
const response = await anthropic.messages.create({
  model: "claude-3-7-sonnet-20250219",
  max_tokens: 50,
  system:
    "You are an AI assistant that analyzes query complexity and returns ONLY a number from 0-100.",
  messages: [
    // [ ] todo: incorporate the entire conversation context.
    {
      role: "user",
      content: `
Please analyze the following user query and rate its
complexity on a scale from 0-100, where 0 means a simple,
straightforward question requiring minimal reasoning, and 100
means an extremely complex problem requiring intensive
step-by-step analysis. Provide ONLY a number between 0-100
with no explanation.

User query: "${query}"`,
    },
  ],
});

const score = parseInt(response.content[0].text.trim(), 10);
I used 3.7 Sonnet as the analyst, which is probably overkill. I expect that a Haiku-class model would give good-enough results for the purpose.
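Whichever model plays analyst, it might occasionally return something other than a bare number, so a parse-and-clamp guard is cheap insurance. A sketch of hypothetical hardening, not part of the PoC as shown:

// Hypothetical hardening: default to a mid-range score if parsing fails,
// and clamp the result to the expected 0-100 range.
const parsed = parseInt(response.content[0].text.trim(), 10);
const safeScore = Number.isNaN(parsed) ? 50 : Math.min(100, Math.max(0, parsed));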
Once we get a complexity score, we run it through a helper to scale the thinking token budget:
const calculateThinkingBudget = (complexityScore) => {
  if (complexityScore < 10) return 0; // No extended thinking for simple queries

  // Scale from the minimum thinking budget (1024 tokens) to the
  // maximum (32000) based on complexity.
  const minBudget = 1024;
  const maxBudget = 32000;
  return Math.round(
    minBudget + (maxBudget - minBudget) * (complexityScore / 100),
  );
};
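A few sample values make the scaling concrete:

calculateThinkingBudget(5);   // 0     - below the threshold, no extended thinking
calculateThinkingBudget(10);  // 4122  - right at the threshold
calculateThinkingBudget(50);  // 16512
calculateThinkingBudget(100); // 32000 - full budget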
and finally resend the original query with the allocated thinking budget:
await anthropic.messages.create({
  model: "claude-3-7-sonnet-20250219",
  max_tokens: budget + 4000, // 4000 here for the actual response
  system: "You are Claude, a helpful AI assistant.",
  messages: newMessages.map((msg) => ({
    role: msg.role,
    content: msg.content,
  })),
  thinking:
    budget > 0
      ? {
          type: "enabled",
          budget_tokens: budget,
        }
      : {
          type: "disabled",
        },
});
In practice, the thinking threshold and budgeting would be tuned for cost/benefit, adjusted according to load, and so on. But there you go: a Claude that thinks when it expects thinking might help, and answers off the cuff where appropriate, automatically, in about a dozen lines of functional code.
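For the curious, the glue is roughly this. A sketch only: assessComplexity and respondWithBudget are hypothetical wrappers around the two API calls shown above.

// Sketch of the full loop. assessComplexity wraps the analyst call and
// respondWithBudget wraps the final call; both names are stand-ins.
const autoThink = async (priorMessages, query) => {
  const score = await assessComplexity(query);    // 0-100 from the analyst
  const budget = calculateThinkingBudget(score);  // 0, or 1024-32000 tokens
  const newMessages = [...priorMessages, { role: "user", content: query }];
  return respondWithBudget(newMessages, budget);
};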
Developers Developers Developers De... Users?
Developers love knobs, and frontier AI research labs seem like developer-heavy organizations. But consumer tools can't have a lot of knobs.
The Extended Thinking toggle presents people with two options:
- forget that it exists
- think about it every time they interact with the product
But both are bad! The first because it's hard to do long-term, and whenever you are reminded, you suffer the FOMO of having misused a powerful tool. The second because it's just a pain in the butt and an overall quality-of-life reduction.
From 3.0 onward, part of Claude's stickiness advantage has been how consistently pleasant it is to interact with. The knobbiness of the UI chips away at that advantage.