
LLM API Inference Cost Calculator – GPT, Claude, Gemini, DeepSeek

Estimate API costs for GPT, Claude, Gemini, DeepSeek and Qwen LLMs with input/output token pricing, prompt caching discounts and USD-CNY conversion. Forecast daily, monthly and yearly bills to plan your AI product budget with confidence.

Overview

LLM APIs bill by tokens, with input and output priced separately and prompt caching offering further discounts. This tool ships with 2026 public pricing for OpenAI, Anthropic, Google, DeepSeek and Qwen, plus fields for custom rates, cache hit rate, call volume and FX conversion. See per-call, daily, monthly and yearly costs instantly, with an input-vs-output share breakdown so you know which side of the pipe to optimise first.

How to use

  1. Pick a model provider and model — official pricing will auto-fill.
  2. Enter the input tokens and output tokens for a single call.
  3. Set the expected calls per day (usually DAU × calls per user).
  4. If you use Anthropic or OpenAI prompt caching, set the cache hit rate.
  5. Override input/output unit prices when needed (self-hosted, batch discounts or newer models).
  6. Adjust the USD-CNY FX rate. Per-call, daily, monthly and yearly costs update instantly with an input/output share breakdown.

Formula

Per call:

inputCostUsd = inputTokens × (1 − cacheHitRate × (1 − cacheMultiplier)) × inputPrice ÷ 1,000,000
outputCostUsd = outputTokens × outputPrice ÷ 1,000,000
costPerCall = inputCostUsd + outputCostUsd

Projections:

dailyCost = costPerCall × callsPerDay
monthlyCost = dailyCost × 30
yearlyCost = dailyCost × 365
cnyCost = usdCost × fxRate

Anthropic prompt caching applies a 0.1× multiplier on cached input (90% savings); OpenAI cached input uses 0.5× (50% savings); Google, DeepSeek and Qwen apply no cache discount by default.
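The formula can be sketched as a small Python function (a standalone sketch of the calculation above; prices are USD per million tokens and all inputs are caller-supplied, not live vendor pricing):

```python
def llm_cost(input_tokens, output_tokens, input_price, output_price,
             calls_per_day, cache_hit_rate=0.0, cache_multiplier=1.0,
             fx_rate=7.2):
    """Per-call, daily, monthly and yearly USD costs, plus CNY monthly.

    cache_multiplier is the price paid on a cache hit
    (0.1 for Anthropic, 0.5 for OpenAI, 1.0 = no discount).
    """
    # Discount the cached share of input tokens
    effective_input = input_tokens * (1 - cache_hit_rate * (1 - cache_multiplier))
    input_cost = effective_input * input_price / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    per_call = input_cost + output_cost
    daily = per_call * calls_per_day
    return {
        "per_call_usd": per_call,
        "daily_usd": daily,
        "monthly_usd": daily * 30,
        "yearly_usd": daily * 365,
        "monthly_cny": daily * 30 * fx_rate,
        "input_share": input_cost / per_call if per_call else 0.0,
    }
```

Note this mirrors the formula exactly, so like the formula it ignores any one-time cache-write surcharge on misses.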

Common scenarios

Scenario 1 · Chat assistant on GPT-4o mini

10,000 calls/day, 500 input + 800 output tokens each. GPT-4o mini: $0.15/M input, $0.60/M output. Per call ≈ $0.000555, monthly ≈ $166.5 (≈ ¥1,199 at 7.2 FX). Output share ≈ 86%, so terser answers save the most.
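The Scenario 1 arithmetic can be checked directly (a standalone sketch using the $0.15/$0.60 per-million rates quoted above):

```python
input_cost = 500 * 0.15 / 1_000_000    # $0.000075 per call
output_cost = 800 * 0.60 / 1_000_000   # $0.00048 per call
per_call = input_cost + output_cost    # $0.000555 per call
monthly = per_call * 10_000 * 30       # $166.5 per month
output_share = output_cost / per_call  # ~0.865, i.e. output dominates
```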

Scenario 2 · RAG Q&A on Claude 3.5 Sonnet with 70% cache

50,000 input + 500 output tokens per call, 1,000 calls/day. Without caching: ≈ $4,725/month. Enable Anthropic prompt caching at 70% hit rate: ≈ $1,958/month, a 58.6% saving.

Scenario 3 · Swap to DeepSeek V3

Same 1,000 input + 500 output × 10,000 calls/day on DeepSeek V3 costs ≈ $246/month (≈ ¥1,770), about 70% cheaper than GPT-4o ($825/month). For tasks that do not require frontier capability, domestic models dramatically cut cost.

FAQ

What exactly is a token? How many Chinese characters or English words does 1,000 tokens cover?

Tokens are the billing unit LLMs use after tokenising text. Rule of thumb: in English, 1 token ≈ 4 characters (~0.75 words); in Chinese, 1 character ≈ 1.5 tokens. So 1,000 tokens is roughly 750 English words or 500-650 Chinese characters. Different models tokenise slightly differently; for exact counts use tiktoken for OpenAI or each vendor's tokenizer.
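The rules of thumb above can be coded as a quick estimator (a crude heuristic for back-of-envelope budgeting only, not a real tokenizer — use tiktoken or the vendor's tokenizer for exact counts):

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: CJK chars ~1.5 tokens each, other chars ~0.25 tokens each."""
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    other = len(text) - cjk
    # English ~4 chars per token; Chinese ~1.5 tokens per character
    return round(cjk * 1.5 + other / 4)
```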

Why do Chinese and English texts of the same length differ so much in token count?

BPE/SentencePiece tokenisers are trained on corpora dominated by English, so common English words compress to one token, while many Chinese characters fall outside the vocabulary and get split into 2-3 subword tokens. For equal-length text, Chinese usually uses 1.5-2× the tokens. Claude 3.5 and Gemini have better Chinese coverage than GPT-3.5 era models, but Chinese remains more expensive overall.

Why are input and output priced differently, with output typically 3-5× more expensive?

Output tokens are generated autoregressively — one forward pass per token, and decoding is memory-bandwidth-bound with poor GPU utilisation. Input tokens are processed in a single parallel prefill, which is far cheaper per token. Typical ratios: GPT-4o 1:4, Claude 3.5 Sonnet 1:5, DeepSeek V3 1:4. So RAG (long input / short output) is dominated by input cost, while Chat / Agent workloads are dominated by output cost.

How much do Anthropic prompt caching and OpenAI cached input actually save, and when are they worth it?

Anthropic bills cached hits at 10% of the normal rate (90% saving) but charges 1.25× on cache write. OpenAI bills cached hits at 50% (50% saving) with no extra write charge. Good fits: (1) fixed system prompts or role/instruction blocks reused across many calls; (2) RAG retrieving the same documents repeatedly; (3) agents sending identical tool schemas every turn. Rule of thumb: cached segment ≥ 1,024 tokens and hit rate > 30% almost always pays off.
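Whether caching pays off follows from the hit and write multipliers above. A sketch of the expected price factor on the cached segment (this assumes, pessimistically, that every miss incurs the cache-write surcharge):

```python
def effective_input_multiplier(hit_rate, hit_mult, write_mult=1.0):
    """Expected price factor on the cached input segment.

    hit_mult: price paid on a cache hit (0.1 Anthropic, 0.5 OpenAI).
    write_mult: price paid on a miss that rewrites the cache
                (1.25 Anthropic, 1.0 OpenAI).
    """
    return hit_rate * hit_mult + (1 - hit_rate) * write_mult

# Anthropic at 70% hit rate: 0.7*0.1 + 0.3*1.25 = 0.445 -> ~55% saving
# OpenAI at 70% hit rate:    0.7*0.5 + 0.3*1.0  = 0.65  -> 35% saving
```

Under this model the Anthropic break-even hit rate is about 22%, which is why the >30% rule of thumb is comfortably on the safe side.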

How do I forecast monthly API spend for a product that has not launched yet, and what buffer should I budget?

Three-step forecast: (1) estimate DAU × calls per user per day (Chat apps 5-15, copilots 30-100, agents 100-500); (2) sample tokens per call by running tiktoken over 50 realistic logs and taking the median; (3) plug into this calculator for a monthly base, then multiply by 1.5-2× as buffer because real traffic typically overshoots by 50-100% (long prompts, retries, streaming anomalies). After launch, reconcile against the real bill weekly — investigate any deviation above 15%.
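The three-step forecast can be sketched as a single function (every input here is an illustrative assumption you would replace with your own measurements):

```python
def forecast_monthly_usd(dau, calls_per_user, tokens_in, tokens_out,
                         input_price, output_price, buffer=1.5):
    """Monthly USD forecast with a safety buffer (prices per 1M tokens)."""
    per_call = (tokens_in * input_price + tokens_out * output_price) / 1_000_000
    base = per_call * dau * calls_per_user * 30
    return base * buffer

# e.g. a chat app: 5,000 DAU x 10 calls/day, 800 in / 400 out tokens,
# at assumed $0.15/M input and $0.60/M output rates, 1.5x buffer
estimate = forecast_monthly_usd(5000, 10, 800, 400, 0.15, 0.60)
```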

How should I choose between domestic (DeepSeek, Qwen, Doubao, Kimi) and overseas (GPT, Claude, Gemini) models on cost?

Domestic models (DeepSeek V3, Qwen Plus, Doubao Pro, Kimi) are typically 3-10× cheaper than GPT-4o and 5-15× cheaper than Claude 3.5 Sonnet. Guidelines: (1) Chinese-centric summarisation, classification, support bots, RAG — domestic wins on price-performance; (2) complex reasoning, code, multimodal, very long context — Claude 3.5 Sonnet or GPT-4o remain first choice; (3) strict data-residency requirements — domestic is mandatory; (4) global products — overseas APIs have lower latency to users. A hybrid routing strategy (easy tasks domestic, hard tasks overseas) often wins; estimate both sides monthly with this tool and A/B.

Are bulk API calls cheaper? How do OpenAI Batch API and other vendors price batch workloads?

OpenAI Batch API cuts both input and output to 50% (24-hour SLA, not real-time). Anthropic Batch API is also 50% with the same window. Google Gemini 1.5 Pro batch is 50% of interactive pricing. DeepSeek and Qwen offer 40-60% discounts on offline batch workloads via direct sales engagement. Batch fits well for data cleaning, log summarisation, embedding generation and offline evaluation. Caveats: limited throughput and occasional partial failures — pilot small first, and in this tool just halve the unit prices to forecast batch monthly cost.
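Forecasting a batch workload in this calculator is just the halved-price substitution described above (a trivial sketch; the 50% discount and the $0.15/$0.60 interactive rates are the figures quoted in this document, not a live price feed):

```python
def batch_price(interactive_price, discount=0.5):
    """Batch unit price as a fraction of the interactive price."""
    return interactive_price * discount

# e.g. GPT-4o mini via Batch API: plug these into the calculator instead
batch_in = batch_price(0.15)   # input, USD per 1M tokens
batch_out = batch_price(0.60)  # output, USD per 1M tokens
```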

Why do my calculator estimates differ from the real bill, and what hidden costs am I missing?

Hidden costs commonly missed: (1) retries on network errors or timeouts inflate tokens 1.2-2×; (2) streaming clients disconnecting while the server still generates tokens — you pay for all of them; (3) intermediate tokens from tool / function calling; (4) system prompts and role blocks often excluded from back-of-envelope math; (5) truncated inputs getting resent; (6) multimodal requests billed by image resolution or audio seconds separately; (7) logging / evaluation pipelines re-sending prompts for QA. Instrument production to log prompt_tokens and completion_tokens from response.usage (every major API returns these), reconcile weekly against this tool's estimate, and investigate any gap above 15%.
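The weekly reconciliation step can be sketched as a one-line check (assumes you have already summed the logged prompt_tokens/completion_tokens into a billed total):

```python
def reconcile(estimated_usd, billed_usd, threshold=0.15):
    """Compare forecast against the real bill.

    Returns (deviation_ratio, needs_investigation): the relative gap
    and whether it exceeds the investigation threshold (default 15%).
    """
    deviation = abs(billed_usd - estimated_usd) / estimated_usd
    return deviation, deviation > threshold

# e.g. forecast $500, actual bill $610 -> 22% over, investigate
```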
