Paintball – Automated Wordnet Expansion Algorithm based on Distributional Semantics and Information Spreading

Piasecki Maciej

doi:10.12921/cmst.2018.0000051

Faculty of Computer Science and Management, G4.19 Research Group, Wrocław University of Science and Technology

Wyb. Wyspiańskiego 27, 50-370 Wrocław, Poland
E-mail: maciej.piasecki@pwr.wroc.pl

Received: 07 November 2018; revised: 30 December 2018; accepted: 03 January 2019; published online: 29 January 2019

Abstract:

plWordNet has been built consistently with a corpus-based wordnet development method. Because the construction of plWordNet started from scratch, it was necessary to find a way to reduce the amount of work required without reducing the quality. In this paper we discuss the experience gained in applying different tools based on Distributional Semantics methods to support the work of lexicographers. Special attention is given to the Paintball algorithm for semi-automated wordnet expansion and its application in the WordnetWeaver system.

Key words:

automated wordnet expansion, lexical semantic network, linguistic knowledge extraction, natural language engineering, plWordNet, wordnet

Volume 25 (1) 2019, 41–56
