
LLM API Inference Cost Calculator – GPT, Claude, Gemini, DeepSeek

Estimate API costs for GPT, Claude, Gemini, DeepSeek and Qwen LLMs with input/output token pricing, prompt caching discounts and USD-CNY conversion. Forecast daily, monthly and yearly bills to plan your AI product budget with confidence.

Overview

LLM APIs bill by tokens, with input and output priced separately and prompt caching offering further discounts. This tool ships with 2026 public pricing for OpenAI, Anthropic, Google, DeepSeek and Qwen, plus fields for custom rates, cache hit rate, call volume and FX conversion. See per-call, daily, monthly and yearly costs instantly, with an input-vs-output share breakdown so you know which side of the pipe to optimise first.

How to use

  1. Pick a model provider and model — official pricing will auto-fill.
  2. Enter the input tokens and output tokens for a single call.
  3. Set the expected calls per day (usually DAU × calls per user).
  4. If you use Anthropic or OpenAI prompt caching, set the cache hit rate.
  5. Override input/output unit prices when needed (self-hosted, batch discounts or newer models).
  6. Adjust the USD-CNY FX rate. Per-call, daily, monthly and yearly costs update instantly with an input/output share breakdown.

Formula

Per call:

inputCostUsd = inputTokens × (1 − cacheHitRate × (1 − cacheMultiplier)) × inputPrice ÷ 1,000,000
outputCostUsd = outputTokens × outputPrice ÷ 1,000,000
costPerCall = inputCostUsd + outputCostUsd

Projections:

dailyCost = costPerCall × callsPerDay
monthlyCost = dailyCost × 30
yearlyCost = dailyCost × 365
cnyCost = usdCost × fxRate

Anthropic prompt caching applies a 0.1× multiplier on cached input (90% savings); OpenAI cached input uses 0.5× (50% savings); Google, DeepSeek and Qwen apply no cache discount by default.
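The formula can be sketched as a small Python function (a standalone sketch of the calculation above; prices are USD per million tokens and all inputs are caller-supplied, not live vendor pricing):

```python
def llm_cost(input_tokens, output_tokens, input_price, output_price,
             calls_per_day, cache_hit_rate=0.0, cache_multiplier=1.0,
             fx_rate=7.2):
    """Per-call, daily, monthly and yearly USD costs, plus CNY monthly.

    cache_multiplier is the price paid on a cache hit
    (0.1 for Anthropic, 0.5 for OpenAI, 1.0 = no discount).
    """
    # Discount the cached share of input tokens
    effective_input = input_tokens * (1 - cache_hit_rate * (1 - cache_multiplier))
    input_cost = effective_input * input_price / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    per_call = input_cost + output_cost
    daily = per_call * calls_per_day
    return {
        "per_call_usd": per_call,
        "daily_usd": daily,
        "monthly_usd": daily * 30,
        "yearly_usd": daily * 365,
        "monthly_cny": daily * 30 * fx_rate,
        "input_share": input_cost / per_call if per_call else 0.0,
    }
```

Note this mirrors the formula exactly, so like the formula it ignores any one-time cache-write surcharge on misses.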

Common scenarios

Scenario 1 · Chat assistant on GPT-4o mini

10,000 calls/day, 500 input + 800 output tokens each. GPT-4o mini: $0.15/M input, $0.60/M output. Per call ≈ $0.000555, monthly ≈ $166.5 (≈ ¥1,199 at 7.2 FX). Output share ≈ 86%, so terser answers save the most.
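The Scenario 1 arithmetic can be checked directly (a standalone sketch using the $0.15/$0.60 per-million rates quoted above):

```python
input_cost = 500 * 0.15 / 1_000_000    # $0.000075 per call
output_cost = 800 * 0.60 / 1_000_000   # $0.00048 per call
per_call = input_cost + output_cost    # $0.000555 per call
monthly = per_call * 10_000 * 30       # $166.5 per month
output_share = output_cost / per_call  # ~0.865, i.e. output dominates
```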

Scenario 2 · RAG Q&A on Claude 3.5 Sonnet with 70% cache

50,000 input + 500 output tokens per call, 1,000 calls/day. Without caching: ≈ $4,725/month. Enable Anthropic prompt caching at 70% hit rate: ≈ $1,958/month, a 58.6% saving.

Scenario 3 · Swap to DeepSeek V3

Same 1,000 input + 500 output × 10,000 calls/day on DeepSeek V3 costs ≈ $246/month (≈ ¥1,770), about 70% cheaper than GPT-4o ($825/month). For tasks that do not require frontier capability, domestic models dramatically cut cost.

FAQ

What exactly is a token? How many Chinese characters or English words does 1,000 tokens cover?

Tokens are the billing unit LLMs use after tokenising text. Rule of thumb: in English, 1 token ≈ 4 characters (~0.75 words); in Chinese, 1 character ≈ 1.5 tokens. So 1,000 tokens is roughly 750 English words or 500-650 Chinese characters. Different models tokenise slightly differently; for exact counts use tiktoken for OpenAI or each vendor's tokenizer.
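The rules of thumb above can be coded as a quick estimator (a crude heuristic for back-of-envelope budgeting only, not a real tokenizer — use tiktoken or the vendor's tokenizer for exact counts):

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: CJK chars ~1.5 tokens each, other chars ~0.25 tokens each."""
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    other = len(text) - cjk
    # English ~4 chars per token; Chinese ~1.5 tokens per character
    return round(cjk * 1.5 + other / 4)
```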

Why do Chinese and English texts of the same length differ so much in token count?

BPE/SentencePiece tokenisers are trained on corpora dominated by English, so common English words compress to one token, while many Chinese characters fall outside the vocabulary and get split into 2-3 subword tokens. For equal-length text, Chinese usually uses 1.5-2× the tokens. Claude 3.5 and Gemini have better Chinese coverage than GPT-3.5 era models, but Chinese remains more expensive overall.

Why are input and output priced differently, with output typically 3-5× more expensive?

Output tokens are generated autoregressively — one forward pass per token, and decoding is memory-bandwidth-bound with poor GPU utilisation. Input tokens are processed in a single parallel prefill, which is far cheaper per token. Typical ratios: GPT-4o 1:4, Claude 3.5 Sonnet 1:5, DeepSeek V3 1:4. So RAG (long input / short output) is dominated by input cost, while Chat / Agent workloads are dominated by output cost.

How much do Anthropic prompt caching and OpenAI cached input actually save, and when are they worth it?

Anthropic bills cached hits at 10% of the normal rate (90% saving) but charges 1.25× on cache write. OpenAI bills cached hits at 50% (50% saving) with no extra write charge. Good fits: (1) fixed system prompts or role/instruction blocks reused across many calls; (2) RAG retrieving the same documents repeatedly; (3) agents sending identical tool schemas every turn. Rule of thumb: cached segment ≥ 1,024 tokens and hit rate > 30% almost always pays off.
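Whether caching pays off follows from the hit and write multipliers above. A sketch of the expected price factor on the cached segment (this assumes, pessimistically, that every miss incurs the cache-write surcharge):

```python
def effective_input_multiplier(hit_rate, hit_mult, write_mult=1.0):
    """Expected price factor on the cached input segment.

    hit_mult: price paid on a cache hit (0.1 Anthropic, 0.5 OpenAI).
    write_mult: price paid on a miss that rewrites the cache
                (1.25 Anthropic, 1.0 OpenAI).
    """
    return hit_rate * hit_mult + (1 - hit_rate) * write_mult

# Anthropic at 70% hit rate: 0.7*0.1 + 0.3*1.25 = 0.445 -> ~55% saving
# OpenAI at 70% hit rate:    0.7*0.5 + 0.3*1.0  = 0.65  -> 35% saving
```

Under this model the Anthropic break-even hit rate is about 22%, which is why the >30% rule of thumb is comfortably on the safe side.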

How do I forecast monthly API spend for a product that has not launched yet, and what buffer should I budget?

Three-step forecast: (1) estimate DAU × calls per user per day (Chat apps 5-15, copilots 30-100, agents 100-500); (2) sample tokens per call by running tiktoken over 50 realistic logs and taking the median; (3) plug into this calculator for a monthly base, then multiply by 1.5-2× as buffer because real traffic typically overshoots by 50-100% (long prompts, retries, streaming anomalies). After launch, reconcile against the real bill weekly — investigate any deviation above 15%.
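The three-step forecast can be sketched as a single function (every input here is an illustrative assumption you would replace with your own measurements):

```python
def forecast_monthly_usd(dau, calls_per_user, tokens_in, tokens_out,
                         input_price, output_price, buffer=1.5):
    """Monthly USD forecast with a safety buffer (prices per 1M tokens)."""
    per_call = (tokens_in * input_price + tokens_out * output_price) / 1_000_000
    base = per_call * dau * calls_per_user * 30
    return base * buffer

# e.g. a chat app: 5,000 DAU x 10 calls/day, 800 in / 400 out tokens,
# at assumed $0.15/M input and $0.60/M output rates, 1.5x buffer
estimate = forecast_monthly_usd(5000, 10, 800, 400, 0.15, 0.60)
```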

How should I choose between domestic (DeepSeek, Qwen, Doubao, Kimi) and overseas (GPT, Claude, Gemini) models on cost?

Domestic models (DeepSeek V3, Qwen Plus, Doubao Pro, Kimi) are typically 3-10× cheaper than GPT-4o and 5-15× cheaper than Claude 3.5 Sonnet. Guidelines: (1) Chinese-centric summarisation, classification, support bots, RAG — domestic wins on price-performance; (2) complex reasoning, code, multimodal, very long context — Claude 3.5 Sonnet or GPT-4o remain first choice; (3) strict data-residency requirements — domestic is mandatory; (4) global products — overseas APIs have lower latency to users. A hybrid routing strategy (easy tasks domestic, hard tasks overseas) often wins; estimate both sides monthly with this tool and A/B.

Are bulk API calls cheaper? How do OpenAI Batch API and other vendors price batch workloads?

OpenAI Batch API cuts both input and output to 50% (24-hour SLA, not real-time). Anthropic Batch API is also 50% with the same window. Google Gemini 1.5 Pro batch is 50% of interactive pricing. DeepSeek and Qwen offer 40-60% discounts on offline batch workloads via direct sales engagement. Batch fits well for data cleaning, log summarisation, embedding generation and offline evaluation. Caveats: limited throughput and occasional partial failures — pilot small first, and in this tool just halve the unit prices to forecast batch monthly cost.
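Forecasting a batch workload in this calculator is just the halved-price substitution described above (a trivial sketch; the 50% discount and the $0.15/$0.60 interactive rates are the figures quoted in this document, not a live price feed):

```python
def batch_price(interactive_price, discount=0.5):
    """Batch unit price as a fraction of the interactive price."""
    return interactive_price * discount

# e.g. GPT-4o mini via Batch API: plug these into the calculator instead
batch_in = batch_price(0.15)   # input, USD per 1M tokens
batch_out = batch_price(0.60)  # output, USD per 1M tokens
```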

Why do my calculator estimates differ from the real bill, and what hidden costs am I missing?

Hidden costs commonly missed: (1) retries on network errors or timeouts inflate tokens 1.2-2×; (2) streaming clients disconnecting while the server still generates tokens — you pay for all of them; (3) intermediate tokens from tool / function calling; (4) system prompts and role blocks often excluded from back-of-envelope math; (5) truncated inputs getting resent; (6) multimodal requests billed by image resolution or audio seconds separately; (7) logging / evaluation pipelines re-sending prompts for QA. Instrument production to log prompt_tokens and completion_tokens from response.usage (every major API returns these), reconcile weekly against this tool's estimate, and investigate any gap above 15%.
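The weekly reconciliation step can be sketched as a one-line check (assumes you have already summed the logged prompt_tokens/completion_tokens into a billed total):

```python
def reconcile(estimated_usd, billed_usd, threshold=0.15):
    """Compare forecast against the real bill.

    Returns (deviation_ratio, needs_investigation): the relative gap
    and whether it exceeds the investigation threshold (default 15%).
    """
    deviation = abs(billed_usd - estimated_usd) / estimated_usd
    return deviation, deviation > threshold

# e.g. forecast $500, actual bill $610 -> 22% over, investigate
```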
