Claude vs GPT vs Gemini in Production

The benchmark wars are over and they did not predict which model you should ship. Public benchmarks measure narrow tasks under controlled conditions. Production traffic is messy. The model that wins MMLU does not necessarily win "user pasted a 30-page PDF and wants a structured summary in French at 11pm on a Sunday". We run all three frontier model families across our SaaS portfolio. Claude vs GPT vs Gemini, in our actual production traffic, looks very different from the leaderboard.

This is what we measured, where each one wins, and the specific routing rules we use today.

What we run, where

Quick context for credibility. We use:

Claude Sonnet as the default for almost everything: Carriva's RAG assistant, the lesson-planning generation pipeline, content drafting for the studio site, code generation via Claude Code.
GPT-4 family (GPT-4o and the newer GPT-4.x line) for specific tasks where it has outperformed the others in our tests: certain structured output formats and one-shot vision tasks.
Gemini (1.5 Pro and 2.x Flash) for cheap large-context summarization and as a reranker in some RAG pipelines.

These choices were not theoretical. We tested all three on each task. The choices reflect what we measured.

The dimensions that actually matter

Forget MMLU. Forget HellaSwag. The dimensions that determine which model lives in production are these.

1. Instruction following under pressure

When the prompt is 8k tokens, has 14 constraints, and the user gave you a half-formed question, which model produces output that obeys the constraints?

Claude wins here for us, consistently. GPT-4 is close but more often skips a stated constraint. Gemini is the most likely to produce well-written output that quietly ignores one of your rules.

2. Structured output reliability

When you ask for JSON matching a schema, does the model produce valid JSON 99.9% of the time? 99% is not enough. At 99%, one response in every 100 is invalid JSON, and each of those is a potential 500 error in your app.

GPT-4 has the strongest structured-output discipline, especially with the JSON mode and function-calling APIs. Claude is excellent and improving. Gemini is the weakest of the three for strict JSON, in our experience.
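Whichever model you pick, the invalid responses have to be caught before they reach the rest of the app. A minimal sketch of that guard, assuming a hypothetical `call_model` callable that returns raw text; the required keys and the retry count are illustrative, not what we run in production:

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"title", "summary"}  # hypothetical schema, for illustration only


def parse_response(raw: str) -> Optional[dict]:
    """Return the parsed object if it is valid JSON with the required keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def generate_json(call_model: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call the model until it returns a valid object or the attempts run out."""
    for _ in range(max_attempts):
        parsed = parse_response(call_model())
        if parsed is not None:
            return parsed
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```

Every retry costs tokens and latency, which is why the first-attempt validity rate matters so much.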

3. Hallucination rate on specific facts

When asked a factual question that the model does not know, does it admit it or invent a plausible answer?

Claude is the most willing to say "I do not know" or "I cannot verify". GPT-4 is in the middle. Gemini is the most confident hallucinator on niche topics, which is a serious risk for regulated work.

4. Latency at the 95th percentile

Median latency is a vanity metric. P95 is what your users experience.

GPT-4o and Gemini Flash are notably faster than Claude Sonnet at P95 for short generations. For longer generations (over 1k tokens), Claude is competitive and sometimes faster because it streams more steadily.
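To see why the median hides the problem, here is a small sketch using the nearest-rank percentile definition. The latency samples are invented to show a slow tail; they are not our measurements.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [400.0] * 18 + [2500.0, 6000.0]
p50 = percentile(latencies_ms, 50)  # 400.0
p95 = percentile(latencies_ms, 95)  # 2500.0
```

On these samples the median looks healthy at 400 ms while the P95 is over six times slower, and that tail is what a user on a bad request actually sees.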

5. Cost per task, not per token

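Token pricing alone is misleading. The right comparison is cost per task that actually completes correctly. A model 30% cheaper per token that produces invalid output 5% of the time gives some of that discount back in retries, and it can end up more expensive in practice once you count the added latency, the fallback calls, and the failures that reach users.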
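A rough way to put numbers on this, assuming failures are independent and every failure triggers a retry. The prices and the per-failure overhead below are illustrative assumptions, not our measurements.

```python
def cost_per_success(cost_per_attempt: float, failure_rate: float,
                     cost_per_failure: float = 0.0) -> float:
    """Expected spend to get one valid result when every failure triggers a retry.

    With independent attempts, the expected number of attempts is 1 / (1 - failure_rate)
    and the expected number of failures per success is failure_rate / (1 - failure_rate).
    cost_per_failure is any extra overhead per failed attempt (hypothetical examples:
    a fallback call to a pricier model, support time, a user-facing error).
    """
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    expected_attempts = 1 / (1 - failure_rate)
    expected_failures = failure_rate / (1 - failure_rate)
    return cost_per_attempt * expected_attempts + cost_per_failure * expected_failures


# Illustrative numbers only: model B is 30% cheaper per attempt but fails 5% of the time.
model_a = cost_per_success(cost_per_attempt=1.00, failure_rate=0.0)
model_b_retries_only = cost_per_success(cost_per_attempt=0.70, failure_rate=0.05)
model_b_with_overhead = cost_per_success(0.70, 0.05, cost_per_failure=6.00)
```

With retries alone, model B still comes out ahead at roughly 0.74 per success. Once each failure also carries overhead (here an assumed 6.00 per failed attempt), B costs about 1.05 per success and the cheaper model is no longer the cheaper choice.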

The honest comparison table

Approximate, late 2026 numbers for our workloads. Not benchmarks; observed production behavior.

Dimension	Claude Sonnet	GPT-4 family	Gemini Pro
Instruction following	Excellent	Very good	Good
Structured output	Very good	Excellent	Adequate
Hallucination control	Best of three	Good	Weakest of three
P95 latency (short)	Medium	Fast	Fast
P95 latency (long)	Fast	Medium	Medium
Code generation	Best of three	Very good	Good
Vision (PDF, image)	Very good	Excellent for some niches	Very good, large context
Cost per 1M input tokens	~3 USD	~2.50 to 5 USD	~1.25 USD
Tool use reliability	Excellent	Excellent	Good
Long context (over 200k tokens)	Strong	Variable	Strongest

A few elaborations.

Where GPT-4 still wins

Two tasks where GPT-4 is the right choice for us:

Strict OpenAPI schema generation. The function-calling and JSON mode are mature and very reliable.
Specific vision tasks like reading scanned forms or low-quality images. GPT-4o has had the edge on certain image preprocessing for us.

Where Gemini wins

Cheap, fast summarization of very large inputs. When we feed Gemini Flash 200 pages of context for a summary, the cost-per-result is unmatched.
As a reranker in a RAG pipeline. A cheap, fast model rescoring 20 candidates from the vector retrieval is a pattern that works well with Gemini Flash. We covered the broader topic in our RAG vs fine-tuning decision framework.
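The reranking pattern itself is small. A sketch, assuming a hypothetical `score` callable that wraps a cheap model call and returns a relevance number for one query and one candidate:

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Rescore retrieved candidates and keep the top_k most relevant."""
    ranked = sorted(candidates, key=lambda text: score(query, text), reverse=True)
    return ranked[:top_k]


def word_overlap(query: str, text: str) -> float:
    """Toy stand-in for the model call: count words shared with the query."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))
```

In a real pipeline `word_overlap` would be replaced by the model-backed scorer; the vector search supplies the candidates and only the top few survivors reach the expensive generation step.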

Where Claude wins for us

Anything that requires nuanced reasoning over instructions. The "follows what I asked" rate is the single best in the category for us.
Code generation in agentic flows. Claude Code is the coding agent we rely on for engineering. We documented the workflow in our why Claude Code writeup.
Long, structured writing. Brand-voice consistency, factual carefulness, willingness to say "I do not know".

Routing rules we actually use

Our production traffic does not pick one winner. It routes per task. The simplified routing logic:

User-facing chat or assistant: Claude Sonnet.
Strict JSON schema response: GPT-4 family.
PDF or image extraction with low-resolution input: GPT-4o.
Long-document summarization (over 50k tokens): Gemini Flash.
RAG reranking step: Gemini Flash.
Code generation in our agent workflow: Claude Sonnet (or Opus for harder tasks).
Bulk content drafting in our brand voice: Claude Sonnet with prompt caching.
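The rules above can be sketched as a lookup table plus two special cases. The task names and model labels are placeholders for model families, not exact API identifiers, and the fallback for short documents is our assumption for the sketch.

```python
from enum import Enum


class Task(Enum):
    CHAT = "chat"
    STRICT_JSON = "strict_json"
    LOW_RES_EXTRACTION = "low_res_extraction"
    LONG_SUMMARY = "long_summary"
    RERANK = "rerank"
    CODE = "code"
    BRAND_DRAFT = "brand_draft"


# Labels are placeholders for model families, not exact API model identifiers.
ROUTES = {
    Task.CHAT: "claude-sonnet",
    Task.STRICT_JSON: "gpt-4-family",
    Task.LOW_RES_EXTRACTION: "gpt-4o",
    Task.LONG_SUMMARY: "gemini-flash",
    Task.RERANK: "gemini-flash",
    Task.CODE: "claude-sonnet",
    Task.BRAND_DRAFT: "claude-sonnet",
}

LONG_DOC_THRESHOLD = 50_000  # tokens, from the summarization rule


def route(task: Task, input_tokens: int = 0, hard: bool = False) -> str:
    """Pick a model family label for a task."""
    if task is Task.CODE and hard:
        return "claude-opus"
    if task is Task.LONG_SUMMARY and input_tokens <= LONG_DOC_THRESHOLD:
        # Assumption: documents under the threshold go to the default model.
        return ROUTES[Task.CHAT]
    return ROUTES[task]
```

Keeping the table in one place makes the quarterly re-evaluation a one-line change per task instead of a hunt through call sites.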

The fact that we run all three is itself a signal. If one model dominated every dimension, we would not be paying for three sets of API keys and three sets of rate-limit dashboards.

Prompt caching changes the math

Anthropic, OpenAI, and Google all support some form of prompt caching now. This matters more than people realize.

Our content pipeline drops 12,000 tokens of brand voice and prior-article context into every generation. With prompt caching enabled, the cached portion is roughly 90% cheaper after the first request. That single optimization moved Claude Sonnet from "borderline expensive" to "cheaper than the alternatives" for our workload.
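A back-of-the-envelope version of that math. The 12,000-token prefix and the roughly 90% discount come from the paragraph above; the 500 fresh tokens per request and the 3 USD per 1M input price are assumptions for the example. The sketch ignores any cache-write surcharge and cache expiry, both of which vary by provider.

```python
def input_cost_usd(prefix_tokens: int, fresh_tokens: int, price_per_million: float,
                   cached: bool, cache_discount: float = 0.90) -> float:
    """Input cost of one request, with the shared prefix optionally billed at a discount."""
    prefix_price = price_per_million * (1 - cache_discount) if cached else price_per_million
    return (prefix_tokens * prefix_price + fresh_tokens * price_per_million) / 1_000_000


# 12,000-token shared prefix from the text; 500 fresh tokens and 3 USD per 1M are assumptions.
uncached = input_cost_usd(12_000, 500, 3.0, cached=False)  # about 0.0375 USD
cached = input_cost_usd(12_000, 500, 3.0, cached=True)     # about 0.0051 USD
savings = 1 - cached / uncached                            # about 0.86
```

Because the prefix dominates the request, a 90% discount on the prefix turns into roughly an 86% cut in input cost per request under these assumptions.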

If you compared model costs without prompt caching last year, your numbers are stale. Re-run the math.

How to actually evaluate

A real evaluation methodology, not a vibe check:

Build a held-out test set. 50 to 200 real prompts from your actual traffic.
Run all three models on the test set. Record outputs.
Score the outputs. Use a rubric that captures the failure modes that matter (instruction adherence, factual accuracy, format compliance). LLM-as-judge with a strong model is acceptable for triage; use manual review for the final calibration.
Measure cost and latency in the same run.
Pick by Pareto frontier, not by single metric.
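Picking by Pareto frontier means discarding any model that another model beats or ties on every axis. A minimal sketch with three axes; the model names and numbers are made up for illustration and are not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model: str
    quality: float      # rubric score, higher is better
    cost_usd: float     # per task, lower is better
    p95_seconds: float  # lower is better


def dominates(a: EvalResult, b: EvalResult) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.quality >= b.quality and a.cost_usd <= b.cost_usd
                and a.p95_seconds <= b.p95_seconds)
    better = (a.quality > b.quality or a.cost_usd < b.cost_usd
              or a.p95_seconds < b.p95_seconds)
    return no_worse and better


def pareto_frontier(results: list[EvalResult]) -> list[EvalResult]:
    """Keep only results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Made-up numbers for illustration; these are not measurements.
results = [
    EvalResult("model-a", quality=0.92, cost_usd=0.020, p95_seconds=4.0),
    EvalResult("model-b", quality=0.88, cost_usd=0.008, p95_seconds=2.5),
    EvalResult("model-c", quality=0.85, cost_usd=0.015, p95_seconds=3.0),
]
frontier = [r.model for r in pareto_frontier(results)]  # ["model-a", "model-b"]
```

Here model-c drops out because model-b is better on all three axes. The choice between model-a and model-b is a product decision about how much quality is worth, which is exactly the decision a single-metric ranking hides.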

A surprising number of teams pick the model that scored highest on a public benchmark and never run their own eval. They are surprised three months later when the model that "should win" produces unusable output on their specific tasks.

Public benchmarks tell you what the model can do under ideal conditions. Your eval set tells you what it actually does on your data. Always trust the latter.

When the model providers disagree with their own marketing

A few patterns we have observed.

"Claude is bad at math"

Outdated. Claude Sonnet is fine at structured arithmetic and reasoning. The "Claude is bad at math" framing was true for older versions and persisted in folklore. Test it on your task.

"GPT-4 follows JSON schemas"

True if you use the structured-output API correctly. Less reliable if you ask for JSON in plain prompts. The schema enforcement is a system feature, not a model trait.

"Gemini has the best long context"

True for raw context length. Less true for "the model actually used the information from page 47 in its answer". Long context is necessary but not sufficient. Test the recall, not just the input limit.
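One way to test recall rather than the input limit is to plant a known fact at different depths and check whether the answer contains it. A sketch, assuming a hypothetical `ask_model(document, question)` callable that wraps whichever API is under test:

```python
from typing import Callable


def build_haystack(filler: str, needle: str, pages: int, needle_page: int) -> str:
    """Build a document of `pages` filler pages with the needle placed on one page."""
    parts = []
    for page in range(1, pages + 1):
        body = needle if page == needle_page else filler
        parts.append(f"[page {page}]\n{body}")
    return "\n\n".join(parts)


def recall_rate(ask_model: Callable[[str, str], str], question: str, expected: str,
                filler: str, needle: str, pages: int, positions: list[int]) -> float:
    """Fraction of needle positions for which the answer contains the expected string."""
    hits = 0
    for position in positions:
        document = build_haystack(filler, needle, pages, position)
        answer = ask_model(document, question)
        hits += expected.lower() in answer.lower()
    return hits / len(positions)
```

A substring match is a crude scorer and a single planted fact is easier than real multi-hop questions, so treat a high score here as a floor, not as proof that the model uses the whole context.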

"Open-weight models are catching up"

For some tasks, yes. For frontier-level reasoning and instruction following, the gap is still meaningful as of late 2026. Open-weight models are excellent for fine-tuning and for cost-sensitive workloads. We use them in some pipelines. They have not displaced the frontier models for the hardest tasks.

Tool use and agent loops

If you are building agent workflows (the model decides which tools to call), the relevant comparison is tool-use reliability, not raw model intelligence.

Claude and GPT-4 are both excellent at tool use, with mature function-calling APIs. We have built typed prompt libraries (the topic of our typed prompt library piece) that work cleanly across both. Gemini's tool use is improving but we have observed more variability in tool selection.

For our agent pipelines, we default to Claude. For very specific structured tool calls where the schema is rigid, we sometimes route to GPT-4 for that one step.

What we would test first

If you are evaluating Claude vs GPT vs Gemini for a new feature:

Pick the single most representative task. Not the easiest, not the hardest. The one closest to your average user request.
Run all three under identical conditions: same prompt, same temperature, same max tokens.
Score 30 outputs by hand. Yes, by hand. The first 30 tell you everything.
Pick the obvious winner if there is one. If results are within 10%, pick by cost or latency.
Re-evaluate every 3 months. The frontier moves.
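The selection rule can be written down in a few lines. This sketch reads "within 10%" as within 10 percentage points of the best pass rate, which is an assumption; the counts and costs are illustrative, not measurements.

```python
def pick_model(pass_rates: dict[str, float], costs: dict[str, float],
               tolerance: float = 0.10) -> str:
    """Pick the top scorer, or the cheapest model within `tolerance` of the top score.

    "Within 10%" is read here as within 10 percentage points of the best pass rate;
    that interpretation is an assumption.
    """
    best = max(pass_rates.values())
    contenders = [m for m, rate in pass_rates.items() if best - rate <= tolerance]
    return min(contenders, key=lambda m: costs[m])


# Hand-scored results on 30 outputs per model (illustrative counts, not measurements).
pass_rates = {"model-a": 27 / 30, "model-b": 25 / 30, "model-c": 19 / 30}
costs = {"model-a": 0.020, "model-b": 0.008, "model-c": 0.004}
choice = pick_model(pass_rates, costs)  # "model-b"
```

Here model-b wins: it trails the top score by under seven points and costs less than half as much, while model-c is cheap but too far behind to be a contender. Swap the cost dictionary for P95 latency if latency is the tiebreaker that matters for the feature.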

The mistake we see most often is teams running a single test on a single example and picking based on which output "felt better". Sample of 1 is not data. Sample of 30 is.

TL;DR

Claude is our default for reasoning, code, and brand-voice work. GPT-4 owns strict structured output and some vision niches. Gemini Flash is our cheap workhorse for long-context summarization and RAG reranking. The honest answer to Claude vs GPT vs Gemini is that all three deserve a slot in any serious production stack. The art is in the routing.

The frontier moves every quarter. Run your own eval. Trust your numbers, not the leaderboard.
