Open Stylometric System WebSty: Integrated Language Processing, Analysis and Visualisation
Piasecki Maciej 1*, Walkowiak Tomasz 2, Eder Maciej 3
1 Faculty of Computer Science and Management
Wrocław University of Science and Technology
2 Faculty of Electronics
Wrocław University of Science and Technology
3 Institute of Polish Language
Polish Academy of Sciences and Pedagogical University of Kraków
*E-mail: maciej.piasecki@pwr.wroc.pl
Received: 14 April 2017; revised: 28 December 2017; accepted: 16 January 2018; published online: 31 March 2018
DOI: 10.12921/cmst.2018.0000007
Abstract:
The paper presents WebSty, an open, web-based system for stylometric analysis that is part of the CLARIN-PL research infrastructure. WebSty requires no local installation, can be used from any web browser, offers rich configuration options, and runs on a computing cluster. We discuss the underlying ideas of the system, its architecture, a pipeline of language tools for processing Polish, and its integration with systems for clustering, for visualizing the clustering results, and for identifying the features with the strongest discriminative power. The techniques used for feature weighting and for measuring text similarity are also briefly outlined. In conclusion, we present a preliminary evaluation of WebSty on a corpus of 1000 literary works and report on the results of its first research applications. Although the system initially focused on processing Polish texts, we also briefly discuss its development towards a multilingual system, which already supports English, German and Hungarian.
Key words:
authorship attribution, CLARIN, language technology infrastructure, style analysis, stylometry, web application