Yang, Zijian Győző and Dodé, Réka and Ferenczi, Gergő and Hatvani, Péter and Héja, Enikő and Madarász, Gábor and Ligeti-Nagy, Noémi and Sárossy, Bence and Szaniszló, Zsófia and Váradi, Tamás and Verebélyi, Gábor and Prószéky, Gábor (2024) The First Instruct-Following Large Language Models for Hungarian. In: 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS) Proceedings. University of Debrecen, Debrecen, pp. 247-252. ISBN 9798350387889
Abstract
In recent months, large language models have gained significant attention, with companies striving to develop models capable of solving various natural language processing tasks through extensive data training. The release of ChatGPT by OpenAI demonstrated unprecedented capabilities achieved via a multi-step fine-tuning process. For Hungarian, pre-trained large language models include PULI GPT-3SX, PULI GPTrio and, more recently, SambaLingo. In our research, we pre-trained a new large language model based on Llama-2 and, inspired by ChatGPT, focused on fine-tuning with instruction-based prompts. We created a Hungarian prompt dataset and fine-tuned the PULI large language models into instruction-following models. We found that transfer learning allows the model to gain insights from other languages, and that continued pre-training lets the model leverage valuable knowledge from the originally pre-trained model. Additionally, we showed that a LLaMA model can be adapted to another language, such as Hungarian. Our PULI LlumiX models achieved significantly better performance on three Hungarian benchmarks. On both the HuSST and HuRTE zero-shot tasks, our instruction model improved accuracy by more than 10 points. Our further pre-trained Llama-2 model, PULI LlumiX 32K, and the fine-tuned PULI LlumiX 32K Instruct, became state-of-the-art models capable of solving various language technology problems.
Item Type: | Book Section |
---|---|
Uncontrolled Keywords: | PULI models, large language model, instruct model, Llama-2, pre-training, fine-tuning |
Subjects: | P Language and Literature / nyelvészet és irodalom > P0 Philology. Linguistics / filológia, nyelvészet P Language and Literature / nyelvészet és irodalom > PH Finno-Ugrian, Basque languages and literatures / finnugor és baszk nyelvek és irodalom > PH04 Hungarian language and literature / magyar nyelv és irodalom |
SWORD Depositor: | MTMT SWORD |
Depositing User: | MTMT SWORD |
Date Deposited: | 08 Oct 2024 12:08 |
Last Modified: | 08 Oct 2024 12:08 |
URI: | https://real.mtak.hu/id/eprint/207084 |