REAL

Further Keyword Generation Experiment in Hungarian with Fine-tuning PULI LlumiX 32K Model

Dodé, Réka and Yang, Zijian Győző (2024) Further Keyword Generation Experiment in Hungarian with Fine-tuning PULI LlumiX 32K Model. In: 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS) Proceedings. University of Debrecen, Debrecen, pp. 20-24. ISBN 9798350387889

31-35.pdf - Published Version


Abstract

This research continues our investigation into using neural models to generate and extract keywords from lengthy texts, based on documents from the REAL repository and their author-provided keywords. Previously, we tested three models: fastText for keyword extraction as a multi-label classification baseline, the fine-tuned Hungarian language model PULI GPT-3SX for keyword generation, and a further-trained Llama-2-7B-32K model. In this study, we fine-tuned a new model, PULI LlumiX 32K, on the same data; it combines Hungarian language knowledge with Llama-2-7B-32K’s 32,000-token input capacity. We evaluated the keywords generated by the models against the author-provided keywords and examined how many new, relevant keywords they produced, including keywords not present in the text. PULI LlumiX 32K outperformed both PULI GPT-3SX and Llama-2-7B-32K. About 20% of the keywords generated by PULI LlumiX 32K and Llama-2-7B-32K did not appear in the text, a proportion similar to that of the author-provided keywords, while for PULI GPT-3SX the proportion was higher, at about 30%. Some of the new keywords were relevant, while others were inaccurate due to erroneous phrases.
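
The proportion of keywords not present in the text, as reported in the abstract, can be illustrated with a minimal sketch. It assumes a simple case-insensitive substring match; the paper's actual matching procedure is not described on this page and may differ (for example, it may lemmatize Hungarian word forms). The function name and the sample data are hypothetical.

```python
def absent_keyword_ratio(keywords, text):
    """Return the fraction of keywords that do not occur verbatim in the text.

    Assumption: matching is a case-insensitive substring test. This is an
    illustrative choice, not necessarily the method used in the paper.
    """
    if not keywords:
        return 0.0
    haystack = text.casefold()
    absent = [kw for kw in keywords if kw.strip().casefold() not in haystack]
    return len(absent) / len(keywords)


# Hypothetical example data, not taken from the paper.
sample_text = "We fine-tuned a Hungarian language model for keyword generation."
sample_keywords = [
    "language model",
    "keyword generation",
    "fine-tuning",
    "Llama",
    "Hungarian",
]

ratio = absent_keyword_ratio(sample_keywords, sample_text)
print(f"Share of keywords absent from the text: {ratio:.0%}")
```

A plain substring test undercounts matches in a morphologically rich language such as Hungarian, where a keyword may appear in the text only in an inflected form, so a real evaluation would likely need normalization beyond case folding.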

Item Type: Book Section
Uncontrolled Keywords: PULI LlumiX 32K, generated keywords, finetuning, author-provided keywords, Llama-2-7B-32K, PULI GPT3SX, Hungarian language model
Subjects: P Language and Literature / nyelvészet és irodalom > P0 Philology. Linguistics / filológia, nyelvészet
P Language and Literature / nyelvészet és irodalom > PH Finno-Ugrian, Basque languages and literatures / finnugor és baszk nyelvek és irodalom > PH04 Hungarian language and literature / magyar nyelv és irodalom
SWORD Depositor: MTMT SWORD
Depositing User: MTMT SWORD
Date Deposited: 08 Oct 2024 12:26
Last Modified: 08 Oct 2024 12:26
URI: https://real.mtak.hu/id/eprint/207081
