
Join Neptune to save, like, and publish prompts.
By signing in, you agree to our Terms of Service and Privacy Policy.

Join Neptune to save, like, and publish prompts.
By signing in, you agree to our Terms of Service and Privacy Policy.
Guide + templates for o1/o3/Claude thinking/Gemini — what to do, what NOT to do, effort control (2026)
Reasoning / Extended Thinking Model Prompting Guide & Templates
Sources: Anthropic Claude Prompting Best Practices (platform.claude.com, 2025-2026),
OpenAI o3/o4-mini Function Calling Guide (developers.openai.com, 2025),
Helicone thinking model guide, OpenAI Reasoning Best Practices docs
------------------------------------------------------------------
WHAT MAKES REASONING MODELS DIFFERENT:
Reasoning models (o1, o3, o4-mini, Claude 3.7/Sonnet 4.6 adaptive thinking,
Gemini 2.0 Flash Thinking) run an internal chain-of-thought BEFORE producing
their response. This internal CoT is hidden — you never see it directly.
Implications:
- They self-plan. You do not need to tell them how to think.
- Explicit CoT instructions ("think step by step") interfere with their
native reasoning and degrade performance.
- They excel on 5+ step problems. They are overkill for simple queries.
- They are slower and more expensive per token. Use them when it matters.
- Few-shot examples often hurt rather than help (they constrain the
model's natural reasoning path).
Mental model: treat them as a senior colleague who figures out the approach
themselves when you clearly state the goal — not a junior who needs
step-by-step instructions.
------------------------------------------------------------------
THE GOLDEN RULES:
DO:
- State the goal clearly and completely
- Provide all relevant context upfront (long context at TOP of prompt)
- Use XML tags or markdown sections to separate distinct inputs
- Set explicit output format constraints (what the final answer should look like)
- Use "think thoroughly" or "reason carefully" over prescribed plans
- For Claude: control depth via the effort parameter (low / medium / high / max)
- For o3/o4: use developer messages to set role + boundaries + tool rules
- Ask for self-verification: "Before finalizing, verify your answer against [criteria]"
- Break multi-shot problems into well-defined sub-tasks
DON'T:
- "Think step by step" — redundant, often harmful
- "Let's think about this carefully before answering" — same problem
- "First do X, then do Y, then do Z" — prescribing the reasoning path
- Provide many few-shot examples — use zero-shot first
- Use structured output requests (JSON, tables) — these models perform worse;
use standard LLMs for format-heavy tasks or add strict schema enforcement
- Give vague goals and expect inference — be explicit
- Overload context with irrelevant detail — quality over quantity
- Use these models for tasks with fewer than 3 reasoning steps — wasteful
------------------------------------------------------------------
WHEN TO USE REASONING MODELS vs STANDARD MODELS:
Use reasoning models for:
Complex multi-step problems (5+ steps)
Math, logic, code debugging, legal/medical analysis
Agentic tasks with tool use across many steps
Problems where accuracy matters more than speed/cost
Ambiguous tasks requiring judgment about the right approach
Use standard models (GPT-4.1, Claude Haiku, Gemini Flash) for:
Simple queries (< 3 reasoning steps)
Structured output generation (JSON, CSV, templates)
Tasks where few-shot examples are critical
High-volume, latency-sensitive workloads
Basic summarization, translation, extraction
------------------------------------------------------------------
CLAUDE ADAPTIVE THINKING (Sonnet 4.6 / Opus 4.6) — SPECIFIC GUIDE:
How it works: Claude dynamically decides when and how much to think based on
the `effort` parameter and query complexity. On simple queries it skips thinking
entirely. On complex queries it reasons deeply.
API configuration:
thinking={"type": "adaptive"}
output_config={"effort": "high"} # low | medium | high | max
Effort settings:
low → high-volume, latency-sensitive tasks, simple queries
medium → most production applications (recommended default)
high → agentic coding, multi-step tool use, complex reasoning
max → hardest long-horizon problems (cost: significant)
Prompt guidance for effort calibration:
To increase thinking: "Take your time and reason carefully about this."
To reduce thinking: "Extended thinking adds latency and should only be used
when it meaningfully improves answer quality. When in
doubt, respond directly."
To prevent over-exploration: "Choose an approach and commit to it. Avoid
revisiting decisions unless you encounter new
information that directly contradicts your reasoning."
Interleaved thinking (after tool use):
"After receiving tool results, carefully reflect on their quality and
determine optimal next steps before proceeding."
Self-check pattern:
"Before you finish, verify your answer against [specific test criteria]."
Note on extended thinking with budget_tokens: still functional on Sonnet/Opus 4.6
but deprecated. Prefer adaptive thinking + effort parameter.
------------------------------------------------------------------
OPENAI o3 / o4-MINI — SPECIFIC GUIDE:
How it works: o-series models convert system prompts to "developer messages"
automatically. They reason internally before every response and every tool call.
Do NOT instruct them to plan before calling tools — they already do this.
System prompt → developer message (auto-converted). Same format, different semantic:
Purpose: set role, define available actions, establish tool-calling order,
set usage boundaries.
Developer message structure:
1. Role definition + scope of available actions
2. Function-call ordering rules (if multi-step tool use)
3. Tool usage boundaries (when to use which tool, when NOT to)
4. Proactiveness / confirmation guidance
Function description rules:
- Clarify invocation criteria explicitly ("Only call if directory exists")
- Add argument construction rules upfront ("Do not overwrite without using
file_delete or file_update first")
- Use few-shot examples only for complex argument formats
- Flat argument schemas outperform deeply nested ones
- Up to ~100 tools and ~20 arguments per tool stays within training distribution
Anti-patterns specific to o3/o4:
- "Plan before calling tools" → redundant, degrades performance
- Lazy behavior accumulation → start fresh conversations for unrelated topics;
discard irrelevant past tool calls; summarize instead
- Hallucinating future tool calls → "Do NOT promise function calls later.
Only call a function when you are ready to execute it."
Structured output enforcement: use `strict: true` in tool schema for validation.
Persist reasoning between tool calls (Responses API):
include=["reasoning.encrypted_content"]
This maintains CoT context across calls, improving tool-selection decisions.
------------------------------------------------------------------
REUSABLE SYSTEM PROMPT TEMPLATE (Claude adaptive thinking):
<system>
You are [role description].
<task_scope>
[Clear description of what you are responsible for and what is out of scope.]
</task_scope>
<output_requirements>
[Exact format, length, structure of the expected output. Be specific.]
</output_requirements>
<constraints>
[Hard limits: what you must never do, what data you must not access, etc.]
</constraints>
<quality_standard>
Before finalizing any response, verify it against:
- [Criterion 1]
- [Criterion 2]
Only respond when you are confident your answer meets these criteria.
</quality_standard>
</system>
------------------------------------------------------------------
REUSABLE DEVELOPER MESSAGE TEMPLATE (OpenAI o3/o4-mini):
You are [role]. You can help users with: [action 1], [action 2], [action 3].
Tool usage rules:
- Use [tool A] for [specific purpose].
- Use [tool B] only when [condition].
- If both [tool A] and [tool B] could apply, prefer [tool A] for [reason].
- Do NOT call any tool unless you are ready to execute it now.
- Do NOT promise future tool calls.
Execution order for [complex workflow]:
1. Call [tool A] to [retrieve/validate X]
2. Only if X is confirmed, call [tool B] with the result
3. Summarize outcome to user
When uncertain about user intent, ask one clarifying question before acting.
------------------------------------------------------------------
PROMPT PATTERNS BY TASK TYPE:
Complex analysis / reasoning:
"Analyze [problem]. Consider [dimension 1], [dimension 2], [dimension 3].
Provide a final recommendation with your confidence level and the
main uncertainty you could not resolve."
Code debugging (Claude):
"Debug this function. After your analysis, verify the fix by mentally
tracing through [specific test case]. Only return the corrected code."
Multi-step research (o3/o4):
"Research [topic]. For each finding, note your confidence level.
Identify the key uncertainty that most affects the final answer.
Do not speculate beyond available evidence."
Decision under ambiguity:
"I need a decision on [X]. Constraints: [list]. If you need to make an
assumption, state it explicitly and flag it so I can correct it.
Give me your recommendation and the single most important caveat."
Agentic coding (Claude high effort):
"Complete [task]. Use tests to verify correctness at each step.
Do not remove or skip tests to make them pass — fix the implementation.
After completing, summarize what you changed and why."
------------------------------------------------------------------
THINKING BUDGET / COST MANAGEMENT:
Claude (adaptive):
Low effort: fastest, cheapest — use for high-volume chat, simple queries
Medium effort: best default — balances quality and cost
High effort: agentic coding, tool chains, complex reasoning
Max effort: large-scale migrations, deep research, long-horizon agents
Prevent runaway cost:
"Extended thinking should only be used when it meaningfully improves
answer quality. When in doubt, respond directly without extended reasoning."
OpenAI o3/o4:
Use o4-mini-high for most reasoning tasks (cost-efficient)
Use o3 for the hardest problems where o4-mini falls short
Do not use reasoning models for classification, extraction, or templating —
those are standard model tasks
------------------------------------------------------------------
COMMON MISTAKES AND THEIR SYMPTOMS:
Mistake Symptom
------- -------
"Think step by step" Verbose, constrained reasoning; worse answers
Too many few-shot examples Model mimics example format instead of reasoning
Prescribing the reasoning path Model gets locked into suboptimal approach
Using for simple tasks Slow, expensive, often worse than GPT-4.1 / Haiku
Structured output request Inconsistent, malformed JSON/table output
Vague goal statement Confident-sounding but wrong answer
Over-prompting tool use Tool overtriggering; irrelevant calls accumulate
"Explain your reasoning" Adds output tokens without improving answer quality
------------------------------------------------------------------
QUICK REFERENCE — DO / DON'T:
DO DON'T
-- -----
State goals clearly and completely Prescribe reasoning steps
Use XML tags to separate input sections Use "think step by step"
Set explicit output format requirements Add many few-shot examples
Use effort / budget parameters Request structured output (JSON)
Ask for self-verification at the end Use for simple < 3-step tasks
Zero-shot first, few-shot only if needed Over-explain the reasoning process