A magyar webtér aratásával kapcsolatos kurátori feladatok = Curatorial Tasks Related to the Harvesting of the Hungarian Web Domain

Kalcsó, Gyula (2025) A magyar webtér aratásával kapcsolatos kurátori feladatok = Curatorial Tasks Related to the Harvesting of the Hungarian Web Domain. In: Oktatási, kutatási és közgyűjteményi infrastruktúrák és tartalmak: digitális transzformáció felsőfokon : NETWORKSHOP 2025 : 34. Országos Informatikai Konferencia : 2025. május 13–15. Széchenyi István Egyetem, Győr. Hungarnet Egyesület, Budapest, pp. 194-201. ISBN 978-615-6792-15-0

NETWORKSHOP_2025_Kalcso_v2.pdf - Published Version
Available under License Creative Commons Attribution.

Official URL: https://doi.org/10.31915/NWS.2025.21

Abstract

Under a government decree, an essential task of the national library is to harvest the Hungarian web domain as completely as possible twice a year and to keep a register of the known sites. This complex task is carried out by the web archiving team of the Digital Philology and Web Archiving Department of the Digital Humanities Centre of the Hungarian National Széchényi Library. This paper describes the most important curatorial activities related to this mandated task. It illustrates the process of registering websites and the methodology for collecting seed URLs. Since the launch of the Hungarian Web Archive in 2017, the number of registered sites has grown significantly. New URLs have been identified from our own harvests and from recommendations we have received, while cooperation with the Internet Archive has been the main source of new URLs. The seed URL lists need to be maintained before each of the two annual harvests, which is a complex, multi-step process. The first step is to extract the URLs from the previous captures and select those that are not yet known. We automatically retrieve the HTTP status code to determine which sites are live, then retrieve the content of the title element in the HTML head, and check whether the site has a robots.txt file. Based on the structure of the URLs and the information obtained, we classify the new URLs into the appropriate list. The status codes, the title data, and the robots.txt files are also checked for the previously harvested URLs; inactive sites are removed from the lists, and the remaining URLs are classified into the appropriate seed list.
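The maintenance steps described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the archive team's actual tooling. Every function name, the rule that treats a leading "www." as the same host, and the three list names ("inactive", "hu-domain", "non-hu") are assumptions made for the example; the abstract does not specify the real seed lists or classification rules.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit
import urllib.error
import urllib.request


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the whitespace-normalized page title, or None if absent."""
    parser = TitleParser()
    parser.feed(html)
    title = " ".join("".join(parser.parts).split())
    return title or None


def host_key(url):
    """Lower-cased host name; dropping 'www.' is an assumption of this sketch."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def unknown_urls(candidates, known):
    """Keep the first candidate URL for each host not yet in the register."""
    known_hosts = {host_key(u) for u in known}
    seen, result = set(), []
    for url in candidates:
        key = host_key(url)
        if key and key not in known_hosts and key not in seen:
            seen.add(key)
            result.append(url)
    return result


def fetch(url, timeout=10):
    """Return (status, body); status is None when no HTTP response arrives.

    urlopen follows redirects, so the status is that of the final response.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(200_000).decode("utf-8", errors="replace")
            return resp.status, body
    except urllib.error.HTTPError as err:
        return err.code, ""
    except (urllib.error.URLError, OSError, ValueError):
        return None, ""


def probe(url):
    """Gather status code, title, and robots.txt presence for one site."""
    status, body = fetch(url)
    parts = urlsplit(url)
    robots_status, _ = fetch(f"{parts.scheme}://{parts.netloc}/robots.txt")
    return {
        "url": url,
        "status": status,
        "title": extract_title(body) if body else None,
        "has_robots": robots_status == 200,
    }


def classify(record):
    """Assign a record to a hypothetical list (names are illustrative only)."""
    status = record.get("status")
    if status is None or status >= 400:
        return "inactive"
    if host_key(record["url"]).endswith(".hu"):
        return "hu-domain"
    return "non-hu"
```

In this sketch, `unknown_urls` would be run on the URLs extracted from the previous captures against the register, and `probe` followed by `classify` would be applied both to the new URLs and to the previously harvested ones, so that sites classified as inactive can be dropped from the seed lists.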

Item Type:	Book Section
Uncontrolled Keywords:	webarchiválás, born digital, a magyar web aratása, webkurátori feladatok, web archiving, harvesting of the Hungarian web, curatorial tasks in web archiving
Subjects:	Q Science / természettudomány > QA Mathematics / matematika > QA76.625 Internet Science / internettudomány; Z Bibliography. Library Science. Information Resources / könyvtártudomány > Z665 Library Science. Information Science / könyvtártudomány, információtudomány
SWORD Depositor:	MTMT SWORD
Depositing User:	MTMT SWORD
Date Deposited:	24 Nov 2025 10:18
Last Modified:	06 Dec 2025 12:05
URI:	https://real.mtak.hu/id/eprint/229680
