Kalcsó, Gyula (2025) A magyar webtér aratásával kapcsolatos kurátori feladatok = Curatorial Tasks Related to the Harvesting of the Hungarian Web Domain. In: Oktatási, kutatási és közgyűjteményi infrastruktúrák és tartalmak: digitális transzformáció felsőfokon : NETWORKSHOP 2025 : 34. Országos Informatikai Konferencia : 2025. május 13–15. Széchenyi István Egyetem, Győr. Hungarnet Egyesület, Budapest, pp. 194-201. ISBN 978-615-6792-15-0
Text: 195_fejezet_NETWORKSHOP_2025.pdf - Published Version. Available under License Creative Commons Attribution.
Abstract
Under a government decree, an essential task of the national library is to harvest the Hungarian web domain as completely as possible twice a year and to keep a register of the known sites. This complex task is carried out by the web archiving team of the Digital Philology and Web Archiving Department of the Digital Humanities Centre of the Hungarian National Széchényi Library. This paper describes the most important curatorial activities related to this mandated task, illustrating the process of registering websites and the methodology for collecting seed URLs. Since the launch of the Hungarian Web Archive in 2017, the number of registered sites has grown significantly. New URLs have been identified from our own harvests, received as recommendations, and obtained through cooperation with the Internet Archive, which has been the main source of new URLs. The seed URL lists must be maintained before each of the two annual harvests, a complex process involving many steps. The first step is to extract the URLs from the previous captures and select those that are not yet known. For each of these, we automatically retrieve the HTTP status code to determine whether the site is live, then read the value of the title tag in the HTML head, and check whether the site has a robots.txt file. Based on the structure of the URLs and the information obtained, we classify the new URLs into the appropriate list. The status codes, title data, and robots.txt files are also checked for the previously harvested URLs; inactive sites are removed from the lists, and the remaining URLs are classified into the appropriate seed list.
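The checks described in the abstract can be sketched as follows. This is a minimal illustration using only the Python standard library, not the web archiving team's actual tooling, which the abstract does not describe. All function names, the rule of comparing URLs by host name, and the list labels `hu_domain`, `other_domain`, and `inactive` are assumptions made for the example.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlsplit


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._parts).split())


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def host_of(url):
    """Lower-cased host name without a leading 'www.' (assumed rule)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def unknown_urls(captured, known):
    """Keep the first captured URL of every host not yet registered."""
    known_hosts = {host_of(u) for u in known}
    seen, result = set(), []
    for url in captured:
        host = host_of(url)
        if host and host not in known_hosts and host not in seen:
            seen.add(host)
            result.append(url)
    return result


def classify(url, status):
    """Assign a hypothetical list label from the status and the URL."""
    if status is None or status >= 400:
        return "inactive"
    return "hu_domain" if host_of(url).endswith(".hu") else "other_domain"


def fetch(url, timeout=10):
    """Return (status, body); status is None if the site is unreachable.

    urlopen follows redirects, so the status is that of the final response.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read(65536).decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, ""
    except (urllib.error.URLError, OSError, ValueError):
        return None, ""


def probe(url):
    """Gather status code, title, and robots.txt presence for one URL."""
    status, body = fetch(url)
    parts = urlsplit(url)
    robots_status, _ = fetch(f"{parts.scheme}://{parts.netloc}/robots.txt")
    return {
        "url": url,
        "status": status,
        "title": extract_title(body),
        "has_robots": robots_status == 200,
        "list": classify(url, status),
    }
```

In this sketch, `unknown_urls` would be run over the URLs extracted from the previous captures, and `probe` over both the new candidates and the previously harvested seeds, so that entries labelled `inactive` can be dropped before the next harvest.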
| Item Type: | Book Section |
|---|---|
| Uncontrolled Keywords: | webarchiválás, born digital, a magyar web aratása, webkurátori feladatok, web archiving, harvesting of the Hungarian web, curatorial tasks in web archiving |
| Subjects: | Q Science / természettudomány > QA Mathematics / matematika > QA76.625 Internet Science / internettudomány; Z Bibliography. Library Science. Information Resources / könyvtártudomány > Z665 Library Science. Information Science / könyvtártudomány, információtudomány |
| SWORD Depositor: | MTMT SWORD |
| Depositing User: | MTMT SWORD |
| Date Deposited: | 24 Nov 2025 10:18 |
| Last Modified: | 24 Nov 2025 10:18 |
| URI: | https://real.mtak.hu/id/eprint/229680 |