REAL

Hybrid Distance-based, CNN and Bi-LSTM System for Dictionary Expansion

Szakács, Béla Benedek and Mészáros, Tamás (2020) Hybrid Distance-based, CNN and Bi-LSTM System for Dictionary Expansion. INFOCOMMUNICATIONS JOURNAL, 12 (4). pp. 6-13. ISSN 2061-2079

Abstract

Dictionaries like WordNet can help in a variety of Natural Language Processing applications by providing additional morphological data. They can be used in Digital Humanities research, in building knowledge graphs, and in other applications. Creating dictionaries from large corpora of natural-language text has not been a primary focus of research, as other tasks (such as chatbots) have dominated the field, but such dictionaries can be very useful tools for analysing texts. Even for contemporary texts, categorizing words according to their dictionary entries is a complex task, and for less conventional texts (in old or less-researched languages) solving this problem automatically is even harder. Our task was to create software that helps expand a dictionary of word forms and tag unprocessed text. We used a manually created corpus for training and testing the model. We created a combination of Bidirectional Long Short-Term Memory networks, convolutional networks, and a distance-based solution that outperformed other existing solutions. While manual post-processing of the tagged text is still needed, the system significantly reduces the amount required.

Item Type: Article
Subjects: Q Science / természettudomány > QA Mathematics / matematika > QA75 Electronic computers. Computer science / számítástechnika, számítógéptudomány
SWORD Depositor: MTMT SWORD
Depositing User: MTMT SWORD
Date Deposited: 02 Feb 2021 12:43
Last Modified: 30 Jun 2021 23:24
URI: http://real.mtak.hu/id/eprint/120404
