Training Embedding Models for Hungarian

Hatvani, Péter and Yang, Zijian Győző (2024) Training Embedding Models for Hungarian. In: 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS) Proceedings. University of Debrecen, Debrecen, pp. 75-80. ISBN 9798350387889

86-91.pdf - Published Version

Abstract

Building Retrieval-Augmented Generation (RAG) systems for underrepresented languages, such as Hungarian, presents significant challenges due to the lack of high-quality embedding models. In this study, we address this gap by developing three state-of-the-art encoder-only language models specifically designed to enhance semantic similarity understanding for Hungarian. Utilizing a combination of public and internal datasets, including a 226-item corpus of news article titles and leads and a Hungarian version of the Semantic Textual Similarity (STS) dataset, we rigorously evaluate these models’ performance. Our models—xml roberta sentence hu, hubert sentence hu, and minilm sentence hu—demonstrate substantial improvements in semantic similarity tasks, with the hubert sentence hu model achieving the highest accuracy and F1-Score on the test corpus. These results underscore the potential of our models to significantly advance NLP capabilities for Hungarian, paving the way for their integration into more comprehensive RAG systems. Future work will focus on further refinement and application of these models in diverse contexts to enhance their performance and robustness.

Item Type: Book Section
Uncontrolled Keywords: Retrieval-Augmented Generation, Hungarian language models, semantic similarity, natural language processing, sentence embeddings, machine learning, NLP for underrepresented languages
Subjects: P Language and Literature / nyelvészet és irodalom > P0 Philology. Linguistics / filológia, nyelvészet
SWORD Depositor: MTMT SWORD
Depositing User: MTMT SWORD
Date Deposited: 14 Oct 2024 11:07
Last Modified: 14 Oct 2024 11:07
URI: https://real.mtak.hu/id/eprint/207276
