On February 5, 2025, a team of researchers from Grenoble Alpes University and Lingua Custodia, a France-based company specializing in AI and natural language processing (NLP) for the finance sector, introduced DOLFIN, a new test set designed to evaluate document-level machine translation (MT) in the financial domain.
The researchers say that the financial domain presents unique challenges for MT due to its reliance on precise terminology and strict formatting rules. They describe it as “an interesting use-case for MT” since key terms often shift meaning depending on context.
For example, the French word *couverture* means "blanket" in a general context but "hedge" in financial texts. Such nuances are difficult to capture without larger translation units.
Despite strong research interest in document-level MT, specialized test sets remain scarce, the researchers note. Most datasets focus on general topics rather than domains such as legal and financial translation.
Given that many financial documents “contain an explicit definition of terms used for the mentioned entities that must be respected throughout the document,” they argue that document-level evaluation is essential.
DOLFIN allows researchers to assess how well MT models translate longer texts while maintaining context.
Unlike traditional test sets that rely on sentence-level alignment, DOLFIN structures data into aligned sections, enabling the evaluation of broader linguistic challenges, such as information reorganization, terminology consistency, and formatting accuracy.
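One of the document-level phenomena mentioned above, terminology consistency, can be illustrated with a minimal sketch. The function below is hypothetical, not part of DOLFIN's actual evaluation code: it simply measures what fraction of aligned segments containing a given source term use the expected target-language translation.

```python
# Hypothetical sketch: checking that a source term is translated
# consistently across all segments of an aligned document section.

def terminology_consistency(segments, source_term, expected_translation):
    """Return the fraction of segments containing `source_term` whose
    translation uses `expected_translation`."""
    relevant = [(src, tgt) for src, tgt in segments
                if source_term in src.lower()]
    if not relevant:
        return 1.0  # nothing to check
    hits = sum(1 for _, tgt in relevant
               if expected_translation in tgt.lower())
    return hits / len(relevant)

# Toy example: "hedge" should consistently map to French "couverture".
section = [
    ("The fund uses hedge instruments.",
     "Le fonds utilise des instruments de couverture."),
    ("Hedge costs are charged to the fund.",
     "Les frais de couverture sont imputés au fonds."),
    ("The hedge ratio is reviewed monthly.",
     "Le ratio de protection est revu chaque mois."),  # inconsistent term
]
score = terminology_consistency(section, "hedge", "couverture")
print(f"{score:.2f}")  # → 0.67
```

A sentence-level test set cannot detect this kind of error, since each segment in the toy example is individually plausible; only scoring across the whole section reveals the inconsistency.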
**Context-Sensitive**
To build the dataset, they sourced parallel documents from Fundinfo, a provider of investment fund data, and extracted and aligned financial sections rather than individual sentences. The dataset covers English-French, English-German, English-Spanish, English-Italian, and French-Spanish, with an average of 1,950 segments per language pair.
The goal, according to the researchers, was to develop “a test set rich in context-sensitive phenomena to challenge MT models.”
To assess the usefulness of DOLFIN, the researchers evaluated large language models (LLMs) including GPT-4o, Llama-3-70b, and their smaller counterparts. They tested these models in two settings: translating sentence by sentence versus translating full document sections.
They found that DOLFIN effectively distinguishes between context-aware and context-agnostic models, while also exposing model weaknesses in financial translation.
Larger models benefited from more context, producing more accurate and consistent translations, while smaller models often struggled. “For some segments, the generation enters a downhill, and with every token, the model’s predictions get worse,” the researchers observed, describing how smaller LLMs failed to maintain coherence over longer passages.
DOLFIN also reveals persistent weaknesses in financial MT, particularly in formatting and terminology consistency. Many models failed to properly localize currency formats, defaulting to English-style notation instead of adapting to European conventions.
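The currency-localization failure described above is mechanical in nature: English uses a comma as the thousands separator and a period as the decimal mark, while continental European conventions invert the two. A minimal sketch (not taken from the paper) of the expected conversion:

```python
def to_european_notation(amount: str) -> str:
    """Convert an English-formatted number (e.g. 1,234.56) to the
    continental European convention (1.234,56) by swapping separators."""
    # Swap via a placeholder so "," and "." don't collide mid-replace.
    return (amount.replace(",", "\x00")
                  .replace(".", ",")
                  .replace("\x00", "."))

print(to_european_notation("1,234,567.89"))  # → 1.234.567,89
```

Models that default to English-style notation leave amounts like "1,234.56" untouched in French or German output, which is exactly the formatting error a section-level test set can surface systematically.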
The dataset is publicly available on Hugging Face.
Authors: Mariam Nakhlé, Marco Dinarelli, Raheel Qader, Emmanuelle Esperança-Rodier, and Hervé Blanchon