REAL

Machine Translation as an Underrated Ingredient? Solving Classification Tasks with Large Language Models for Comparative Research

Máté, Ákos and Sebők, Miklós and Wordliczek, Łukasz and Stolicki, Dariusz and Feldmann, Ádám (2023) Machine Translation as an Underrated Ingredient? Solving Classification Tasks with Large Language Models for Comparative Research. COMPUTATIONAL COMMUNICATION RESEARCH, 5 (2). pp. 1-34. ISSN 2665-9085

CCR2023.2.6.MATE.pdf
Available under License Creative Commons Attribution.


Abstract

While large language models have revolutionised computational text analysis methods, the field is still tilted towards English-language resources. Although pre-trained models exist for some "smaller" languages, the coverage is far from universal, and pre-training large language models is an expensive and complicated task. This uneven language coverage limits the geographical and linguistic scope of comparative social research. We propose a solution that sidesteps these issues by leveraging transfer learning and open-source machine translation. We use English as a bridge language between Hungarian and Polish bills and laws to solve a classification task related to the Comparative Agendas Project (CAP) coding scheme. Using the Hungarian corpus as training data for model fine-tuning, we categorise the Polish laws into 20 CAP categories. In doing so, we compare the performance of Transformer-based deep learning models (monolingual models such as BERT and multilingual models such as XLM-RoBERTa) and machine learning algorithms (e.g., SVM). Results show that the fine-tuned large language models outperform the traditional supervised learning benchmarks but are themselves surpassed by the machine translation approach. Overall, the proposed solution demonstrates a viable option for applying a transfer learning framework to low-resource languages and achieving state-of-the-art results without requiring expensive pre-training.
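The bridge-language idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word-level lexicon `TOY_LEXICON` is a hypothetical stand-in for an open-source machine translation system, the word-count classifier is a stand-in for a fine-tuned model, and the two labels and all example texts are invented for illustration. The only point is the data flow: Hungarian training texts and Polish target texts are both mapped into English, so a classifier trained on one can label the other.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real machine translation system.
TOY_LEXICON = {
    "hu": {"kórház": "hospital", "egészségügy": "health",
           "adó": "tax", "költségvetés": "budget"},
    "pl": {"szpital": "hospital", "zdrowie": "health",
           "podatek": "tax", "budżet": "budget"},
}


def translate_to_english(text, lang):
    """Map each token to English via the toy lexicon; keep unknown tokens."""
    lexicon = TOY_LEXICON[lang]
    return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())


def train(labelled_docs):
    """Count English tokens per label (a stand-in for model fine-tuning)."""
    counts = {}
    for text, label in labelled_docs:
        counts.setdefault(label, Counter()).update(text.split())
    return counts


def predict(model, text):
    """Return the label whose training tokens overlap most with the text."""
    tokens = text.split()
    # sorted() makes tie-breaking deterministic (alphabetical by label).
    return max(sorted(model), key=lambda label: sum(model[label][t] for t in tokens))


# Invented Hungarian training examples with illustrative topic labels.
hungarian_train = [
    ("kórház egészségügy", "health"),
    ("adó költségvetés", "macroeconomics"),
]

# Train on Hungarian texts translated into the bridge language (English).
model = train([(translate_to_english(t, "hu"), y) for t, y in hungarian_train])

# Classify invented Polish texts after translating them into the same bridge language.
polish_laws = ["ustawa szpital", "podatek budżet"]
predictions = [predict(model, translate_to_english(t, "pl")) for t in polish_laws]
```

In a realistic setting the lexicon lookup would be replaced by an actual translation model and the counter by a fine-tuned Transformer or an SVM; the sketch keeps only the step that makes cross-lingual transfer possible, namely that training and target texts share one language before classification.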

Item Type: Article
Uncontrolled Keywords: Machine learning, Deep learning, Natural language processing, Classification, Policy topics, Comparative Agendas Project
Subjects: Q Science / természettudomány > QA Mathematics / matematika > QA76.16-QA76.165 Communication networks, media, information society / kommunikációs hálózatok, média, információs társadalom
SWORD Depositor: MTMT SWORD
Depositing User: MTMT SWORD
Date Deposited: 14 Dec 2023 08:15
Last Modified: 14 Dec 2023 08:15
URI: http://real.mtak.hu/id/eprint/182544
