Dialogical Old Texts: From Digitalization of Historical Sources to an Operational Corpus

Abstract

This article shows what is specific to old dialogue texts in the context of their digitalization and how they can be used in corpora. It presents the stages of creating the Corpus of Polish Dramatic Texts (1772–1939) and highlights the differences between older and newer editions of the same texts. It also points out the shortcomings of OCR techniques in reading dialogue texts, including their inability to separate stage directions from the main text automatically. It then describes how the data were prepared so that digital tools would work correctly. Among other things, the collected texts required listing the names of the characters taking part in each scene and removing the full stops that end the sentences of stage directions. These procedures isolated two layers: the searchable part of the linguistic material and the text casing, which has a structural function. The texts were also annotated with specific values of three sociolinguistic determinants: the gender, age and social status of the characters. The solutions applied may be valuable to authors of other corpora and digital tools that use so-called speech-related texts.

Keywords:

digital tools, corpus linguistics, old Polish drama, dialogue texts, sociolinguistic annotation, open science




Roczniki Humanistyczne · ISSN 0035-7707 | eISSN 2544-5200 | DOI: 10.18290/rh
© The Learned Society of the John Paul II Catholic University of Lublin & The John Paul II Catholic University of Lublin, Faculty of Humanities

Articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0).