Szakács, Béla Benedek and Mészáros, Tamás (2020) Hybrid Distance-based, CNN and Bi-LSTM System for Dictionary Expansion. INFOCOMMUNICATIONS JOURNAL, 12 (4). pp. 6-13. ISSN 2061-2079
Abstract
Dictionaries like WordNet can help in a variety of Natural Language Processing applications by providing additional morphological data. They can be used in Digital Humanities research, in building knowledge graphs, and in other applications. Creating dictionaries from large corpora of natural-language text has not been a primary focus of research, as other tasks (such as chatbots) have dominated the field, but such dictionaries can be very useful tools for analysing texts. Even for contemporary texts, categorizing words according to their dictionary entry is a complex task, and for less conventional texts (in old or less-researched languages) the problem is even harder to solve automatically. Our task was to create software that helps in expanding a dictionary containing word forms and in tagging unprocessed text. We used a manually created corpus for training and testing the model. We created a combination of Bidirectional Long Short-Term Memory networks, convolutional networks and a distance-based solution that outperformed other existing solutions. While manual post-processing of the tagged text is still needed, the system significantly reduces the amount required.
Item Type: | Article |
---|---|
Subjects: | Q Science / természettudomány > QA Mathematics / matematika > QA75 Electronic computers. Computer science / számítástechnika, számítógéptudomány |
SWORD Depositor: | MTMT SWORD |
Depositing User: | MTMT SWORD |
Date Deposited: | 02 Feb 2021 12:43 |
Last Modified: | 30 Jun 2021 23:24 |
URI: | http://real.mtak.hu/id/eprint/120404 |