Multilingual Source Editions in Natural Language Processing: A Case Study of Selected Dietine Acts from the Grand Duchy of Lithuania

Abstract

The article discusses the challenges of computational analysis of multilingual historical texts, using sejmik (dietine) records from the Grand Duchy of Lithuania as an example. It presents the data preparation process: text extraction from PDFs, cleaning, and language annotation. Particular attention is given to problems stemming from modernized orthography, the multilingual character of the texts (Polish, Ruthenian, Latin), and the lack of standardized digital editions. The study used NLP tools such as Morfeusz (Korbeusz), Concraft, and Stanza. The article emphasizes the importance of adapting tools to the specifics of historical material and the need for further standardization of annotations within the Universal Dependencies framework.
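
The cleaning and language annotation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline, which relied on Morfeusz (Korbeusz), Concraft, and Stanza; it uses only the Python standard library. The function names, the cue-word lists, and the thresholds are hypothetical. The script test assumes that Ruthenian passages are printed in Cyrillic, which would not hold for an edition that transliterates them.

```python
import re
import unicodedata

# Hypothetical cue words, chosen for illustration only (not from the article).
POLISH_CUES = {"i", "w", "na", "się", "że", "nie", "z", "do", "jest", "oraz"}
LATIN_CUES = {"et", "in", "ad", "cum", "per", "est", "non", "ex", "quod", "ut"}
POLISH_DIACRITICS = set("ąćęłńóśźż")


def clean_extracted_text(raw: str) -> str:
    """Normalize text extracted from a PDF into a single clean line."""
    text = unicodedata.normalize("NFC", raw)
    text = text.replace("\u00ad", "")  # drop soft hyphens
    # Rejoin words hyphenated across a line break: "wileń-\nski" -> "wileński".
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    # Collapse remaining line breaks and runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


def guess_language(segment: str) -> str:
    """Return a rough language label for one text segment.

    Assumption: Cyrillic script signals Ruthenian; Latin-script segments are
    split between Polish and Latin by counting cue words and diacritics.
    """
    letters = [ch for ch in segment if ch.isalpha()]
    if not letters:
        return "unknown"
    cyrillic = sum("CYRILLIC" in unicodedata.name(ch, "") for ch in letters)
    if cyrillic / len(letters) > 0.5:
        return "ruthenian"
    lowered = segment.lower()
    tokens = re.findall(r"\w+", lowered)
    polish = sum(token in POLISH_CUES for token in tokens)
    latin = sum(token in LATIN_CUES for token in tokens)
    if any(ch in POLISH_DIACRITICS for ch in lowered):
        polish += 2  # arbitrary bonus for Polish-specific letters
    if polish == latin:
        return "unknown"
    return "polish" if polish > latin else "latin"


if __name__ == "__main__":
    sample = clean_extracted_text("Sejmik wileń-\nski  uchwalił\nlaudum")
    print(sample, "->", guess_language(sample))
```

A heuristic of this kind can only pre-sort segments for manual review. Mixed-language sentences and short formulaic phrases, both common in such records, will often come back as "unknown" or be mislabeled.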

Keywords:

natural language processing, dietine acts, source editions, multilingualism, Grand Duchy of Lithuania




Roczniki Humanistyczne · ISSN 0035-7707 | eISSN 2544-5200 | DOI: 10.18290/rh
© The Learned Society of the John Paul II Catholic University of Lublin & The John Paul II Catholic University of Lublin, Faculty of Humanities

Articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.