On Measuring Psalm Similarity: A Case for Word-Level n-Grams

Jerzy Wójcik

doi:10.18290/rh24725.9s

Published: 2024-12-31

Vol. 72 No. 5 Special Issue (2024)

https://orcid.org/0000-0001-5283-9017

Abstract

The article compares Tesserae, a text-reuse detection tool, with cosine similarity, used here as a measure of similarity between texts, and assesses their applicability to tracking textual affinities between different versions of historical texts. The assessment is based on Early Modern English versions of Psalm 6 found in publications printed between 1530 and 1557. Cosine similarity is shown to be the better tool for identifying and measuring the level of similarity between texts. At the same time, the article argues that cosine similarity measurements should be performed on texts represented as feature vectors consisting of n-grams.
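To make the central idea concrete, the following is a minimal sketch of cosine similarity computed over word-level n-gram count vectors. It is an illustration only, not the article's own implementation: the function names (`word_ngrams`, `cosine_similarity`, `ngram_similarity`), the whitespace tokenisation, the lowercasing, and the two sample sentences are all assumptions made here for demonstration, and the sample sentences are invented placeholders rather than psalm text.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n):
    """Return the word-level n-grams of a text as a list of tuples.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(count * vec_b[gram] for gram, count in vec_a.items())
    norm_a = sqrt(sum(c * c for c in vec_a.values()))
    norm_b = sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text too short to yield any n-gram shares nothing
    return dot / (norm_a * norm_b)


def ngram_similarity(text_a, text_b, n=2):
    """Cosine similarity of two texts represented as word n-gram counts."""
    return cosine_similarity(Counter(word_ngrams(text_a, n)),
                             Counter(word_ngrams(text_b, n)))


# Invented placeholder sentences (not psalm text): same words, different order.
text_a = "the cat sat on the mat"
text_b = "the mat sat on the cat"

unigram_score = ngram_similarity(text_a, text_b, n=1)
bigram_score = ngram_similarity(text_a, text_b, n=2)
print(round(unigram_score, 3), round(bigram_score, 3))
```

With single words as features (n = 1), the two placeholder sentences contain exactly the same words and so score as identical. With bigrams (n = 2), four of the five bigrams in each sentence are shared and the score falls to about 0.8. This illustrates one reason an n-gram representation can be preferable: it is sensitive to word order, which a bag of single words ignores.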

Keywords:

digital humanities, cosine similarity, n-grams, Tesserae, Psalm translations

Licence

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
