Máté, Ákos and Sebők, Miklós and Wordliczek, Łukasz and Stolicki, Dariusz and Feldmann, Ádám (2023) Machine Translation as an Underrated Ingredient? Solving Classification Tasks with Large Language Models for Comparative Research. COMPUTATIONAL COMMUNICATION RESEARCH, 5 (2). pp. 1-34. ISSN 2665-9085
Text
CCR2023.2.6.MATE.pdf Available under License Creative Commons Attribution.
Abstract
While large language models have revolutionised computational text analysis methods, the field is still tilted towards English-language resources. Although pre-trained models exist for some "smaller" languages, the coverage is far from universal, and pre-training large language models is an expensive and complicated task. This uneven language coverage limits comparative social research in terms of its geographical and linguistic scope. We propose a solution that sidesteps these issues by leveraging transfer learning and open-source machine translation. We use English as a bridge language between Hungarian and Polish bills and laws to solve a classification task related to the Comparative Agendas Project (CAP) coding scheme. Using the Hungarian corpus as training data for model fine-tuning, we categorise the Polish laws into 20 CAP categories. In doing so, we compare the performance of Transformer-based deep learning models (monolingual models such as BERT and multilingual models such as XLM-RoBERTa) and traditional machine learning algorithms (e.g., SVM). Results show that the fine-tuned large language models outperform the traditional supervised learning benchmarks but are themselves surpassed by the machine translation approach. Overall, the proposed solution demonstrates a viable option for applying a transfer learning framework to low-resource languages and achieving state-of-the-art results without requiring expensive pre-training.
Item Type: | Article |
---|---|
Uncontrolled Keywords: | Machine learning, Deep learning, Natural language processing, Classification, Policy topics, Comparative Agendas Project |
Subjects: | Q Science > QA Mathematics > QA76.16-QA76.165 Communication networks, media, information society |
SWORD Depositor: | MTMT SWORD |
Depositing User: | MTMT SWORD |
Date Deposited: | 14 Dec 2023 08:15 |
Last Modified: | 14 Dec 2023 08:15 |
URI: | http://real.mtak.hu/id/eprint/182544 |