Madarász, Gábor and Ligeti-Nagy, Noémi and Holl, András and Váradi, Tamás (2024) OCR Cleaning of Scientific Texts with LLMs. In: Natural Scientific Language Processing and Research Knowledge Graphs. First International Workshop, NSLP 2024, May 27, 2024, Hersonissos, Crete, Greece.
Text: Madarasz_et_al_OCR_Cleaning_with_LLMs.pdf (442kB). Available under License Creative Commons Attribution.
Abstract
Correcting Optical Character Recognition (OCR) errors is a major challenge in preprocessing datasets consisting of legacy PDF files. In this study, we develop Large Language Models specifically fine-tuned to correct OCR errors. We experimented with the mT5 model (in both the mT5-small and mT5-large configurations), a Text-to-Text Transfer Transformer-based machine translation model, for the post-correction of texts with OCR errors. We compiled a parallel corpus consisting of text corrupted with OCR errors and the corresponding clean data. Our findings suggest that the mT5 model can be successfully applied to OCR error correction with improved accuracy. The results affirm the mT5 model as an effective tool for OCR post-correction, with prospects for achieving greater efficiency in future research.
Item Type: | Conference or Workshop Item (Lecture) |
---|---|
Subjects: | P Language and Literature / nyelvészet és irodalom > P0 Philology. Linguistics / filológia, nyelvészet; Z Bibliography. Library Science. Information Resources / könyvtártudomány > Z665 Library Science. Information Science / könyvtártudomány, információtudomány |
SWORD Depositor: | MTMT SWORD |
Depositing User: | MTMT SWORD |
Date Deposited: | 14 Aug 2024 12:13 |
Last Modified: | 14 Aug 2024 12:13 |
URI: | https://real.mtak.hu/id/eprint/202543 |