Large-scale data (‘big data’) collection and analysis have improved the performance of machine learning algorithms, especially in the field of natural language processing. The sheer volume of available data creates new opportunities and new scenarios, according to Locaria’s Nicola Pegorano, senior director for content and technology. He explains that, while linking together quality data is essential to training machine learning models to understand context and generate reliable translations, this does not come without ever-growing challenges.
Nicola notes that the quality of the data plays a vital role in AI-generated translation. Volume is but one factor in the quest for reliable output – diversity and accuracy are crucial too. This is more of an issue for less common language combinations, though the evolution of neural network technology is mitigating this problem. There are also ethical concerns around the use of big data, namely privacy and bias.
LBB spoke to Nicola to learn more about the way big data is shaping the future of AI translation, whether we should worry about running out of data in the next five years, and why the role of the human language expert will be more important than ever to mitigate bias and refine AI-generated output.
LBB> Can you please define what ‘big data’ means in this context?
Nicola> Big data refers to enormous and complex datasets that exceed the capacity of traditional data processing techniques. These extensive datasets are used to uncover patterns, trends, and correlations, especially in business, healthcare, and science. The insights derived from big data can drive innovation and shape decision-making in many areas.
LBB> What role does large-scale language data collection play in improving the efficiency and accuracy of AI translation?
Nicola> Data is the fuel that powers the machine learning engine. Both machine translation (MT) engines and large language models (LLMs) are built and trained with data. The more closely the data aligns with the content needing translation, the better the translation outcomes will be.
Because AI harnesses machine learning to generate original content, guided by patterns derived from big data, it is clear that large-scale language data collection is essential to run and support AI translation models.
LBB> What are the key challenges in collecting and processing large amounts of multilingual data for AI translation models?
Nicola> The effectiveness of an AI translation model is fundamentally linked to the quality of the training data it receives. In this field, the saying “rubbish in, rubbish out” is particularly relevant.
High-quality data is not just about large volumes, but also (or more importantly!) about having rich, diverse, and accurate information. For AI to translate languages proficiently, it needs comprehensive exposure to different linguistic forms, from formal registers and regional dialects to colloquial language and technical lingo.
The key challenge is to guarantee that the data is not only varied but also meticulously annotated and representative of diverse cultural contexts. Language is not merely a collection of words and grammar; it is an evolving human activity that differs from one community or region to another.
LBB> How does the quality and diversity of language data impact the performance of AI translation systems across different languages and dialects? Is there enough data from less common languages to ensure accurate translation?
Nicola> The amount and quality of data is what ultimately determines the quality of the translated output. Of the 4,000 or so written languages in existence today, only 80-90 are typically available on mainstream translation platforms. Traditionally, quality issues have plagued output for less common language combinations.
More recently though, the evolution of neural network technology has mitigated this problem. This is a form of artificial intelligence that mimics some aspects of human thinking. Instead of just memorising words and sentences, it can learn their meaning and their usual correlations. Just as humans do not need to read trillions of words before attempting to speak a new language, neither do neural networks. Data is still needed of course, but on a smaller scale.
Monolingual data (which tends to be the only available type of digital data for a lot of the rarer languages and dialects) is increasingly being ingested and made sense of by neural networks, which are less reliant on bilingual training material.
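One widely used way of putting such monolingual data to work is back-translation: an existing model translates target-language text back into the source language, yielding synthetic parallel pairs for training. It is one common technique for exploiting monolingual data rather than necessarily the one Nicola has in mind, and the sketch below uses a hypothetical reverse_translate() placeholder standing in for any real target-to-source MT model.

```python
# Toy back-translation sketch: turning monolingual target-language text into
# synthetic parallel training pairs. `reverse_translate` is a hypothetical
# stand-in for any real target-to-source MT model.

def reverse_translate(sentence: str) -> str:
    """Placeholder for a real target->source MT call (hypothetical)."""
    return f"<machine-generated source for: {sentence}>"

def back_translate(monolingual_target: list[str]) -> list[tuple[str, str]]:
    """Pair each real target-language sentence with a synthetic source sentence."""
    return [(reverse_translate(t), t) for t in monolingual_target]

corpus = ["Ein Beispielsatz in der Zielsprache."]
print(back_translate(corpus))
# The resulting (synthetic source, real target) pairs are then mixed with
# genuine bilingual data when training or fine-tuning a translation model.
```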
Another recent improvement is called ‘zero-shot’ translation, meaning that models can translate between language pairs without prior direct training on those specific pairings. These models (including ‘transformer’ technology like GPT) learn a generalised understanding of language structures, semantics, and contextual nuances. This broad knowledge enables them to infer translations even between language pairs they have not been explicitly taught.
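To make the multilingual, zero-shot idea more tangible, here is a minimal sketch of requesting an arbitrary language pair from a single multilingual model. It assumes the Hugging Face transformers library and the publicly available NLLB-200 checkpoint as an illustrative choice; whether any particular pairing was seen directly during that model's training is not something this example asserts.

```python
# Minimal sketch: asking one multilingual model for an arbitrary language pair.
# Assumes the Hugging Face `transformers` library and the public
# facebook/nllb-200-distilled-600M checkpoint (an illustrative choice only).
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="ita_Latn",   # Italian
    tgt_lang="gle_Latn",   # Irish: a pairing unlikely to dominate the training data
)

result = translator("La qualità dei dati determina la qualità della traduzione.")
print(result[0]["translation_text"])
```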
LBB> Are there any advancements in data analysis techniques that are driving the evolution of AI translation? How are these making translation better?
Nicola> I would mention ‘transfer learning’ and ‘data augmentation’. Transfer learning enables AI models to utilise the knowledge acquired from one task or language to boost their performance in a different task or language. This approach is particularly useful for refining models on specialised language pairs or domains where data is scarce, ultimately improving translation precision.
Sophisticated data augmentation methods, like paraphrasing and generating synthetic data, produce extra training examples by altering existing datasets. These techniques enhance the robustness and generalisation capabilities of translation models.
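By way of illustration, here is a deliberately simplified augmentation sketch in Python: it derives extra synthetic parallel pairs from existing bilingual examples by substituting aligned glossary terms on both sides. Real pipelines rely on far richer paraphrasing and synthetic generation, but the principle of manufacturing new training examples from existing ones is the same.

```python
# Toy data augmentation sketch: derive extra synthetic parallel pairs by
# substituting aligned glossary terms on both sides of existing examples.
# Deliberately simplified; real pipelines use paraphrasing models and
# large-scale synthetic generation.

seed_pairs = [
    ("The invoice is due on Friday.", "La facture est due vendredi."),
]

# Aligned substitutions: (en_term, fr_term) -> list of (en_replacement, fr_replacement)
glossary = {
    ("invoice", "facture"): [("delivery", "livraison"), ("order", "commande")],
}

def augment(pairs, glossary):
    synthetic = []
    for en, fr in pairs:
        for (en_old, fr_old), swaps in glossary.items():
            if en_old in en and fr_old in fr:
                for en_new, fr_new in swaps:
                    synthetic.append((en.replace(en_old, en_new),
                                      fr.replace(fr_old, fr_new)))
    return synthetic

print(augment(seed_pairs, glossary))
# [('The delivery is due on Friday.', 'La livraison est due vendredi.'),
#  ('The order is due on Friday.', 'La commande est due vendredi.')]
```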
LBB> Are there any ethical concerns around the use of big data, like biases, that need to be addressed?
Nicola> Two common concerns are data privacy and translation bias. Generative AI systems have the potential to store or archive translated data for ongoing training and improvement purposes. This can be a significant concern for users in fields such as healthcare, legal, marketing or finance, where stringent privacy laws exist, and the data is highly sensitive or confidential. In the translation field, it's essential to choose a language provider that can demonstrate (and not just mention in passing!) a focus on data security when working with AI-powered translation tools.
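One practical expression of that focus on data security is stripping obviously sensitive fields before any text reaches an external AI tool. The toy sketch below is purely illustrative (regex-based redaction of emails and phone numbers); production workflows rely on dedicated PII-detection and anonymisation tooling.

```python
# Toy illustration of redacting obviously sensitive fields before text is
# sent to an external AI translation tool. Regex-based and deliberately
# simplistic; production systems use dedicated PII-detection tooling.
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("Contact Jane at jane.doe@example.com or +44 20 7946 0958."))
# Contact Jane at [EMAIL] or [PHONE].
```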
On bias, it is important to understand that generative AI translation models are never neutral: they are directly influenced by the data they are trained on or exposed to. When that data carries biases, whether related to gender, religion, geography, or ethnicity, the AI will unintentionally reflect and perpetuate them in translation. This can lead to inappropriate language and misunderstandings, particularly in global communication, where such issues can be highly problematic.
Having professional, qualified human post-editors review the LLM output is essential in mitigating this risk. But let’s remember that bias, by its very nature, can be difficult to spot and eradicate. This is a concern that goes beyond the translation field, of course, and pertains to genAI in its broadest sense.
LBB> A recent study predicted that AI is likely to run out of data to be trained on by 2026-2032. Is this at all a concern on the language side of things?
Nicola> Anyone making even medium-term predictions in the field of AI risks making a fool of themselves. The advent of LLMs and their profound impact on the translation business model were hardly predicted, or fully understood, by many of the experts who today speak with confidence about the future of language tech.
Even if AI does consume all publicly available data within the next few years, the future of AI development remains a complex and uncertain matter. Experts differ in their views on how critical the data shortage really is.
A strain on data availability is inevitable, what with the ever-growing expansion of the AI field and proliferation of businesses that tap into it. Yes, the threat that available public textual data will be exhausted by 2032 is real. Yes, a lack of new data would limit AI’s ability to learn from evolving trends and contexts, reducing its effectiveness.
However, advancements in data efficiency, transfer learning, synthetic data generation, and the private data industry, all have the potential to mitigate the impact of a data shortage.
Should all of these fail to meet future demands, and no other breakthroughs emerge, it is conceivable that the technology as we know it today will reach a performance plateau.
LBB> Finally, what does the human translator’s role look like in the future?
Nicola> I frequently get asked whether professional translators will still be needed in the future. AI will certainly continue to impact and shape the language services industry. Far from disappearing, though, the role of the professional linguist will remain crucial. The profession will shift from generating translations from scratch to post-editing, refining, improving and sharpening AI-generated content.
Content production is increasing exponentially thanks to genAI, and the tech landscape is growing more fragmented by the day. Languages themselves, and the cultures underpinning them, will continue to change and evolve. Professional language experts, subject-matter language specialists, and skilled post-editors who can provide cultural awareness and local sensitivity, as well as innovative and creative input, will remain an essential part of the language services industry.