REAL

Optimizing Text Clustering Efficiency through Flexible Latent Dirichlet Allocation Method: Exploring the Impact of Data Features and Threshold Modification

Tóth, Erzsébet and Gál, Zoltán (2024) Optimizing Text Clustering Efficiency through Flexible Latent Dirichlet Allocation Method: Exploring the Impact of Data Features and Threshold Modification. INFOCOMMUNICATIONS JOURNAL, 16 (Specia). pp. 58-66. ISSN 2061-2079

[img]
Preview
Text
InfocomJournal_2024_SpecISS_CogInf_CogAspVR_7.pdf - Published Version

Download (2MB) | Preview

Abstract

A parallel corpus comprising Croatian EU legislative documents automatically translated into English spans 28 years and is enriched with metadata, including creation year and hierarchical classifier tags denoting descriptors, document types, and fields. However, nearly two-thirds of the approximately 1.5 thousand texts lack complete metadata, necessitating labor intensive manual efforts that pose challenges for human administration. This incompleteness issue can be observed in the case of official legal sites functioning as regular service provisioning databases. In response, this paper introduces an artificial cognitive and multilabel classification approach to expedite the tagging process with only a fraction of the manual effort. Leveraging the Latent Dirichlet Allocation (LDA) algorithm, our method assigns field values or tags to incompletely labeled documents. We implement a Flexible LDA variant, incorporating the influence of topics close to the most probable topic, regulated by a relative probability threshold (RPT). We evaluate the LDA prediction's dependence on document prefiltering and RPT values. Furthermore, we investigate the dependence of quantitative linguistic properties on the type and speciality of pre-processing tasks. Our algorithm, built on error-correcting optimizing codes, succesfully predicts a mixture of topic probabilities for these legal texts. This prediction is achieved by calculating the Hamming distance of binary feature vectors created using the legal fields of the EUROVOC multilingual thesaurus.

Item Type: Article
Subjects: Q Science / természettudomány > QA Mathematics / matematika > QA75 Electronic computers. Computer science / számítástechnika, számítógéptudomány
Q Science / természettudomány > QA Mathematics / matematika > QA76.16-QA76.165 Communication networks, media, information society / kommunikációs hálózatok, média, információs társadalom
Q Science / természettudomány > QA Mathematics / matematika > QA76.76 Software Design and Development / Szoftvertervezés és -fejlesztés
SWORD Depositor: MTMT SWORD
Depositing User: MTMT SWORD
Date Deposited: 15 Jul 2024 13:06
Last Modified: 15 Jul 2024 13:06
URI: https://real.mtak.hu/id/eprint/200174

Actions (login required)

Edit Item Edit Item