2026-04-22 · 11 min read · Datora Engineering

How we benchmark AI translation quality

Inside the eval harness we run on every model release – BLEU, chrF, human ratings and the brand-specific tests that actually matter.

The problem with public benchmarks

Public translation benchmarks are built on literary and news text. Shopify product copy is short, branded and full of domain-specific terms – exactly the regime where public scores correlate poorly with what merchants actually want.

Our harness

For every new model that lands in our gateway, we run four layers of evaluation, each illustrated with a short sketch after the list:

  1. **Automated metrics** – BLEU and chrF against a frozen reference set of 5,000 anonymized Shopify strings.
  2. **Glossary adherence** – we check whether brand terms in our test glossary survive untouched.
  3. **Format preservation** – HTML tags, Liquid placeholders, emoji and markdown must round-trip exactly.
  4. **Human ratings** – three native speakers per locale rate fluency and adequacy on a 1–5 scale, blind to model identity.
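
To make layer 1 concrete, here's a minimal sketch using the sacrebleu library. The function and variable names are ours for illustration, not the production harness:

```python
# pip install sacrebleu
from sacrebleu.metrics import BLEU, CHRF

def score_model(hypotheses: list[str], references: list[str]) -> dict:
    """Corpus-level BLEU and chrF against one frozen reference per string.

    `hypotheses` are the model's translations; `references` is the frozen
    reference set (illustrative names, not our real data layout).
    """
    bleu = BLEU()
    chrf = CHRF()
    # sacrebleu expects a list of reference *streams*, hence the extra nesting.
    return {
        "bleu": bleu.corpus_score(hypotheses, [references]).score,
        "chrf": chrf.corpus_score(hypotheses, [references]).score,
    }
```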
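
Layer 2 boils down to string containment: any glossary term present in the source must appear verbatim in the translation. A simplified sketch – the brand terms below are made up, and a real check would also handle word boundaries and casing:

```python
def glossary_violations(pairs: list[tuple[str, str]],
                        glossary: set[str]) -> list[tuple[str, str]]:
    """Return (term, source) pairs where a brand term did not survive untouched."""
    violations = []
    for source, translation in pairs:
        for term in glossary:
            if term in source and term not in translation:
                violations.append((term, source))
    return violations

# Hypothetical brand terms, for illustration only.
GLOSSARY = {"Datora", "FlexFit", "AeroKnit"}
```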
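
Layer 3 compares the structural tokens on each side of the translation. This sketch covers HTML tags and Liquid placeholders; emoji and markdown checks follow the same pattern, and the regexes are illustrative rather than exhaustive:

```python
import re
from collections import Counter

# Illustrative patterns: HTML tags plus Liquid output/tag placeholders.
STRUCTURAL = re.compile(r"</?[a-zA-Z][^>]*>|\{\{.*?\}\}|\{%.*?%\}")

def formats_preserved(source: str, translation: str) -> bool:
    """True if every tag and placeholder round-trips exactly (order ignored)."""
    return Counter(STRUCTURAL.findall(source)) == Counter(STRUCTURAL.findall(translation))

assert formats_preserved(
    "<b>{{ product.title }}</b> ships free",
    "<b>{{ product.title }}</b> wird gratis versandt",
)
```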
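
The mechanic that matters in layer 4 is blinding: raters see outputs in shuffled order with model names stripped, and we unblind only after the 1–5 ratings come back. A hypothetical sketch of how such a batch could be assembled:

```python
import random

def blind_rating_batch(outputs: dict[str, list[str]],
                       seed: int = 42) -> tuple[list[str], dict[int, str]]:
    """Flatten model outputs into an anonymized, shuffled list for raters.

    Returns the shuffled texts plus a private index -> model key kept
    server-side, so ratings can be attributed after the fact.
    """
    items = [(model, text) for model, texts in outputs.items() for text in texts]
    random.Random(seed).shuffle(items)
    key = {i: model for i, (model, _) in enumerate(items)}
    return [text for _, text in items], key
```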

What we've learned

  • DeepL still wins on European pairs for short product copy.
  • GPT and Claude pull ahead the moment context (brand voice, glossary, surrounding copy) is in play.
  • Gemini Flash is the price-performance king for bulk catalog translation.

That's why Datora is multi-provider by default – no single model wins everything.
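
To make that concrete, here is a toy router in the spirit of those findings. It is purely illustrative – the provider names, language sets and decision order are not Datora's actual routing logic:

```python
def pick_provider(source_lang: str, target_lang: str,
                  has_brand_context: bool, bulk_job: bool) -> str:
    """Toy router reflecting the findings above; names and rules are illustrative."""
    if bulk_job:
        return "gemini-flash"  # best price-performance for bulk catalogs
    if has_brand_context:
        return "claude"        # GPT/Claude lead once voice + glossary are in play
    european = {"de", "fr", "es", "it", "nl"}
    if source_lang == "en" and target_lang in european:
        return "deepl"         # still wins short European product copy
    return "gpt"
```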