## The problem with public benchmarks
Public translation benchmarks mostly measure literary or news text. Shopify product copy is short, branded, and dense with domain-specific terms – exactly the regime where public scores correlate poorly with what merchants actually want.
## Our harness
For every new model that lands in our gateway, we run four layers of evaluation:
- **Automated metrics** – BLEU and chrF against a frozen reference set of 5,000 anonymized Shopify strings.
- **Glossary adherence** – we check whether brand terms in our test glossary survive untouched.
- **Format preservation** – HTML tags, Liquid placeholders, emoji and markdown must round-trip exactly.
- **Human ratings** – three native speakers per locale rate fluency and adequacy on a 1–5 scale, blind to model identity.
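The glossary and format checks are the easiest layers to automate. A minimal sketch of both, assuming simplified representations (a plain set of brand terms, and a regex that covers only HTML tags and Liquid placeholders – the real harness also handles emoji and markdown):

```python
import re

# Markup that must round-trip exactly: HTML tags plus Liquid
# output ({{ ... }}) and tag ({% ... %}) placeholders.
# (Illustrative pattern only; emoji and markdown are omitted.)
MARKUP = re.compile(r"</?[a-zA-Z][^>]*>|\{\{.*?\}\}|\{%.*?%\}")

def glossary_adherence(source: str, translation: str, glossary: set[str]) -> bool:
    """Every glossary term that appears in the source must survive
    verbatim in the translation."""
    return all(term in translation for term in glossary if term in source)

def format_preserved(source: str, translation: str) -> bool:
    """HTML tags and Liquid placeholders must match exactly, in order."""
    return MARKUP.findall(source) == MARKUP.findall(translation)

src  = "Buy the <b>Acme</b> mug for {{ product.price }}!"
good = "Achetez le mug <b>Acme</b> pour {{ product.price }} !"
bad  = "Achetez le mug <b>ACME</b> pour {{ prix }} !"

print(glossary_adherence(src, good, {"Acme"}))  # True
print(format_preserved(src, good))              # True
print(glossary_adherence(src, bad, {"Acme"}))   # False: brand term altered
print(format_preserved(src, bad))               # False: placeholder rewritten
```

Comparing the *ordered list* of markup matches, rather than a set, also catches models that translate correctly but reorder or duplicate placeholders.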
## What we've learned
- DeepL still wins on European pairs for short product copy.
- GPT and Claude pull ahead the moment context (brand voice, glossary, surrounding copy) is in play.
- Gemini Flash is the price-performance king for bulk catalog translation.
That's why Datora is multi-provider by default – no single model wins everything.