2026-04-22 · 11 min read · Datora Engineering

How we benchmark AI translation quality

Inside the eval harness we run on every model release – BLEU, chrF, human ratings and the brand-specific tests that actually matter.

The problem with public benchmarks

Public translation benchmarks are built on literary and news text. Shopify product copy is short, branded and full of domain-specific terms – exactly the regime where public scores correlate poorly with what merchants actually want.

Our harness

For every new model that lands in our gateway, we run four layers of evaluation, each illustrated with a short sketch after the list:

  1. **Automated metrics** – BLEU and chrF against a frozen reference set of 5,000 anonymized Shopify strings.
  2. **Glossary adherence** – we check whether brand terms in our test glossary survive untouched.
  3. **Format preservation** – HTML tags, Liquid placeholders, emoji and markdown must round-trip exactly.
  4. **Human ratings** – three native speakers per locale rate fluency and adequacy on a 1–5 scale, blind to model identity.
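
To make layer 1 concrete, here's a minimal sketch using the sacrebleu library. The function and variable names are ours for illustration, not the production harness:

```python
# pip install sacrebleu
from sacrebleu.metrics import BLEU, CHRF

def score_model(hypotheses: list[str], references: list[str]) -> dict:
    """Corpus-level BLEU and chrF against one frozen reference per string.

    `hypotheses` are the model's translations; `references` is the frozen
    reference set (illustrative names, not our real data layout).
    """
    bleu = BLEU()
    chrf = CHRF()
    # sacrebleu expects a list of reference *streams*, hence the extra nesting.
    return {
        "bleu": bleu.corpus_score(hypotheses, [references]).score,
        "chrf": chrf.corpus_score(hypotheses, [references]).score,
    }
```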
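
Layer 2 boils down to string containment: any glossary term present in the source must appear verbatim in the translation. A simplified sketch – the brand terms below are made up, and a real check would also handle word boundaries and casing:

```python
def glossary_violations(pairs: list[tuple[str, str]],
                        glossary: set[str]) -> list[tuple[str, str]]:
    """Return (term, source) pairs where a brand term did not survive untouched."""
    violations = []
    for source, translation in pairs:
        for term in glossary:
            if term in source and term not in translation:
                violations.append((term, source))
    return violations

# Hypothetical brand terms, for illustration only.
GLOSSARY = {"Datora", "FlexFit", "AeroKnit"}
```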
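
Layer 3 compares the structural tokens on each side of the translation. This sketch covers HTML tags and Liquid placeholders; emoji and markdown checks follow the same pattern, and the regexes are illustrative rather than exhaustive:

```python
import re
from collections import Counter

# Illustrative patterns: HTML tags plus Liquid output/tag placeholders.
STRUCTURAL = re.compile(r"</?[a-zA-Z][^>]*>|\{\{.*?\}\}|\{%.*?%\}")

def formats_preserved(source: str, translation: str) -> bool:
    """True if every tag and placeholder round-trips exactly (order ignored)."""
    return Counter(STRUCTURAL.findall(source)) == Counter(STRUCTURAL.findall(translation))

assert formats_preserved(
    "<b>{{ product.title }}</b> ships free",
    "<b>{{ product.title }}</b> wird gratis versandt",
)
```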
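
The mechanic that matters in layer 4 is blinding: raters see outputs in shuffled order with model names stripped, and we unblind only after the 1–5 ratings come back. A hypothetical sketch of how such a batch could be assembled:

```python
import random

def blind_rating_batch(outputs: dict[str, list[str]],
                       seed: int = 42) -> tuple[list[str], dict[int, str]]:
    """Flatten model outputs into an anonymized, shuffled list for raters.

    Returns the shuffled texts plus a private index -> model key kept
    server-side, so ratings can be attributed after the fact.
    """
    items = [(model, text) for model, texts in outputs.items() for text in texts]
    random.Random(seed).shuffle(items)
    key = {i: model for i, (model, _) in enumerate(items)}
    return [text for _, text in items], key
```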

What we've learned

  • DeepL still wins on European pairs for short product copy.
  • GPT and Claude pull ahead the moment context (brand voice, glossary, surrounding copy) is in play.
  • Gemini Flash is the price-performance king for bulk catalog translation.

That's why Datora is multi-provider by default – no single model wins everything.
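
To make that concrete, here is a toy router in the spirit of those findings. It is purely illustrative – the provider names, language sets and decision order are not Datora's actual routing logic:

```python
def pick_provider(source_lang: str, target_lang: str,
                  has_brand_context: bool, bulk_job: bool) -> str:
    """Toy router reflecting the findings above; names and rules are illustrative."""
    if bulk_job:
        return "gemini-flash"  # best price-performance for bulk catalogs
    if has_brand_context:
        return "claude"        # GPT/Claude lead once voice + glossary are in play
    european = {"de", "fr", "es", "it", "nl"}
    if source_lang == "en" and target_lang in european:
        return "deepl"         # still wins short European product copy
    return "gpt"
```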