Re-research.pl: Where Humanities Meet Computer Science
Dzienisiewicz Daniel 1, Borchmann Łukasz 1, Wierzchoń Piotr 1, Graliński Filip 2
1 Institute of Linguistics, Faculty of Modern Languages and Literatures, Adam Mickiewicz University, Poznań, Poland
al. Niepodległości 4, 61-874 Poznań
E-mail: dzienis@amu.edu.pl, borch@amu.edu.pl, wierzch@amu.edu.pl
2 Department of Natural Language Processing, Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poznań, Poland
ul. Umultowska 87, 61-614 Poznań
E-mail: filipg@amu.edu.pl
Received: 20 March 2017; revised: 24 November 2017; accepted: 16 January 2018; published online: 31 March 2018
DOI: 10.12921/cmst.2018.0000004
Abstract:
The article discusses selected projects from the field of digital humanities realised by the Re-research.pl group. The group consists of researchers from the Institute of Linguistics and the Department of Natural Language Processing at Adam Mickiewicz University, Poznań, Poland. The projects discussed include National Photocorpus of Polish; Discovermat; Korea, Koreans and ‘Koreanity’ in the digitised Polish press of the 20th century; Biography of the Nation; 100,000 ministories; Gonito.net; and 50,000 words. Domain and chronologisation index. The main focus of the article, however, is the interdisciplinary popular-science blog Re-research.pl. Its daily posts cover a variety of subjects, ranging from linguistics, history and folklore to computer science. Selected posts and categories of posts are discussed, such as chronologisation challenges, texts devoted to folklore and materials on the structure of text files. Apart from providing daily analyses, the blog promotes other projects and serves as a platform for dialogue between representatives of various fields.
Key words:
big data projects, corpus linguistics, digital humanities, e-lexicography, linguochronologisation, photodocumentation
References:
[1] L.F. Klein, M.K. Gold, Digital Humanities: The Expanded Field, [in:] M.K. Gold, L.K. Klein, Debates in the Digital Humanities, University of Minnesota Press. http://dhdebates.gc.cuny.edu/debates/2.
[2] P. Wierzchoń, Fotodokumentacja, chronologizacja, emendacja: teoria i praktyka weryfikacji materiału leksykalnego w badaniach lingwistycznych, Instytut Językoznawstwa Uniwersytetu im. Adama Mickiewicza, Poznań 2008.
[3] P. Wierzchoń, Dlaczego fotodokumentacja? dlaczego chronologizacja? dlaczego emendacja?: instalacja gazowa, parking podziemny i „odległość niezerowa”, Instytut Językoznawstwa Uniwersytetu im. Adama Mickiewicza, Poznań 2009.
[4] P. Wierzchoń, Depozytorium leksykalne języka polskiego. Nowe fotomateriały z lat 1901–2010. Tom I., Uniwersytet Warszawski, Instytut Lingwistyki Stosowanej – BEL Studio, Warszawa 2010.
[5] T. Ruokolainen, O. Kohonen, K. Sirts, S. Grönroos, M. Kurimo, S. Virpioja, A comparative study of minimally supervised morphological segmentation, Computational Linguistics 42(1), 91–120 (2016).
[6] J. Goldsmith, Unsupervised learning of the morphology of a natural language, Computational Linguistics 27(2), 153–198 (2001).
[7] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 1st edition, 2000.
[8] M. Creutz, K. Lagus, Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0, Technical Report A81, Helsinki 2005.
[9] F.F. Graliński, Polish digital libraries as a text corpus, [in:] Z. Vetulani, H. Uszkoreit (eds.) Proceedings of 6th Language & Technology Conference, Fundacja Uniwersytetu im. Adama Mickiewicza, pp. 509–513, 2013.
[10] F. Graliński, Folklorystyka 2.0, [in:] P. Grochowski (ed.) NETLOR. Wiedza cyfrowych tubylców, Wydawnictwo Naukowe Uniwersytetu Mikołaja Kopernika, pp. 119–132, 2013.
[11] D. Dzienisiewicz, P. Wierzchoń, On the Japaneseness of Polish: a Linguochronological Approach, [in:] M. Iwanowski (ed.) Opuscula Iaponica et Slavica 3, BEL Studio, pp. 53–76, 2016.
[12] P. Wierzchoń, Słownictwo lat 30. XX wieku w obrazach i liczbach, BEL Studio, Warszawa 2016.
[13] K. Ram, Git can facilitate greater reproducibility and increased transparency in science, Source Code for Biology and Medicine 8(1), 1–8 (2013).
[14] D. Spinellis, Version control systems, Software, IEEE 22(5), 108–109 (2005).
[15] R. Jaworski, Ł. Borchmann, P. Wierzchoń, Gonito.net – Open Platform for Research Competition, Cooperation and Reproducibility, [in:] B. António, N. Calzolari, K. Choukri (eds.) Proceedings of the 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language, pp. 13–20, 2016. http://4real.di.fc.ul.pt/wp-content/uploads/2016/04/4REALWorkshopProceedings.pdf.
[16] F. Graliński, Ł. Borchmann, P. Wierzchoń, ‘He Said She Said’ – a Male/Female Corpus of Polish, [in:] N. Calzolari, K. Choukri, T. Declerck, et al. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), 2016. http://www.lrec-conf.org/proceedings/lrec2016/pdf/905_Paper.pdf.
[17] F. Graliński, R. Jaworski, Ł. Borchmann, P. Wierzchoń, Vive la Petite Différence! Exploiting Small Differences for Gender Attribution of Short Texts, [in:] Lecture Notes in Artificial Intelligence, pp. 54–61, 2016.
[18] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, 2013. https://arxiv.org/pdf/1301.3781.pdf.
[19] Ł. Borchmann, F. Graliński, R. Jaworski, P. Wierzchoń, A semi-automatic method for thematic classification of documents in a large text corpus, [in:] F. Mambrini, M. Passarotti, C. Sporleder (eds.) Proceedings of the Workshop on Corpus-Based Research in the Humanities (CRH), 2015.
[20] C. Kay, J. Roberts, M. Samuels, I. Wotherspoon, Historical Thesaurus of the Oxford English Dictionary, Oxford University Press, Glasgow 2009.
[21] J. Wawrzyńczyk, 300 tysięcy czy milion(y)? O stanie zasobów słownictwa polskiego w dniu 31 grudnia 2000 r., Mila Hoshi, Warszawa 2015.
[22] P. Wierzchoń, F. Graliński, Z kart historii „parcia na” neologizmy, Poradnik Językowy 4, 110–129 (2016).