Margit, Antal (2025) Evaluation of Embedding Models for Hungarian Question-Answer Retrieval on Domain-Specific and Public Benchmarks. INFOCOMMUNICATIONS JOURNAL, 17 (4). pp. 2-11. ISSN 2061-2079
Text: InfocomJournal_2025_4_1.pdf (Published Version, 1MB)
Abstract
Embedding models have become a fundamental component of modern natural language processing, yet their performance in morphologically rich, low-resource languages such as Hungarian remains underexplored. In this paper, we present a systematic evaluation of state-of-the-art embedding models for Hungarian question–answer retrieval. We construct two complementary evaluation datasets: (i) a domain-specific corpus collected from company documentation, preprocessed into topical chunks with human-verified question–answer pairs, and (ii) the publicly available HuRTE benchmark. Using Chroma as the vector database, we compare eight multilingual and cross-lingual embedding models alongside a keyword-based search baseline. Performance is measured using Mean Reciprocal Rank (MRR) and Recall@k. Results show substantial variation across models and datasets, with notable differences between domain-specific and general-purpose retrieval tasks. BGE-M3 and XLM-ROBERTA achieved the highest accuracy (MRR: 0.90) on the Clearservice dataset, while GEMINI demonstrated superior performance on HuRTE (MRR: 0.99). We complement the evaluation with comprehensive error analysis, highlighting challenges posed by Hungarian domain-specific terminology, synonyms, and overlapping topics, and discuss trade-offs in efficiency through index build time and query latency measurements. Our findings provide a comparative study of embedding-based retrieval in Hungarian, offering practical guidance for downstream applications and setting a foundation for future research in Hungarian representation learning. The dataset and the corresponding evaluation code are publicly accessible at https://github.com/margitantal68/hungarian-embeddings.
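For readers unfamiliar with the two metrics named in the abstract, the following is a minimal Python sketch of how MRR and Recall@k are commonly computed. It is not taken from the paper or its repository. It assumes that each question has exactly one relevant chunk, and the function names and the chunk identifiers (`c1`, `c4`, and so on) are hypothetical.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1/rank of the relevant chunk, or 0.0 if it was not retrieved."""
    for position, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], relevant_ids: Sequence[str]
) -> float:
    """Average the reciprocal rank over all questions."""
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, gold) for ranked, gold in zip(rankings, relevant_ids)
    )
    return total / len(rankings)


def recall_at_k(
    rankings: Sequence[Sequence[str]], relevant_ids: Sequence[str], k: int
) -> float:
    """Fraction of questions whose relevant chunk appears in the top k results.

    Assumption: one relevant chunk per question, so per-question recall is 0 or 1.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(rankings, relevant_ids) if gold in ranked[:k]
    )
    return hits / len(rankings)


# Hypothetical retrieval output for three questions (illustrative IDs only).
rankings = [
    ["c1", "c2", "c3"],  # relevant chunk at rank 1
    ["c5", "c4", "c6"],  # relevant chunk at rank 2
    ["c7", "c8", "c9"],  # relevant chunk not retrieved
]
gold = ["c1", "c4", "c0"]

mrr = mean_reciprocal_rank(rankings, gold)
recall_1 = recall_at_k(rankings, gold, k=1)
recall_2 = recall_at_k(rankings, gold, k=2)
```

In this toy example the reciprocal ranks are 1, 1/2, and 0, so the MRR is 0.5, while Recall@1 is 1/3 and Recall@2 is 2/3. An MRR close to 1, as reported in the abstract for HuRTE, would mean that the relevant chunk is ranked first for almost every question.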
| Item Type: | Article |
|---|---|
| Subjects: | T Technology / alkalmazott, műszaki tudományok > T2 Technology (General) / műszaki tudományok általában |
| SWORD Depositor: | MTMT SWORD |
| Depositing User: | MTMT SWORD |
| Date Deposited: | 28 Jan 2026 15:24 |
| Last Modified: | 28 Jan 2026 15:24 |
| URI: | https://real.mtak.hu/id/eprint/232825 |