Telfor Journal
2017, vol. 9, no. 2, pp. 104-109
paper language: English
paper type: unclassified
doi:10.5937/telfor1702104B


Sentiment classification of documents in Serbian: The effects of morphological normalization and word embeddings
University of Belgrade, School of Electrical Engineering

email: vuk.batanovic@student.etf.bg.ac.rs, nbosko@etf.bg.

Project

Development of digital technologies and networked services in systems with embedded electronic components (MPNTR - 44009)

Abstract

Two open issues in the sentiment classification of texts written in Serbian are the effect of different forms of morphological normalization and the usefulness of leveraging large amounts of unlabeled text. In this paper, we assess the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset. We also consider the effectiveness of using word embeddings, generated from a large unlabeled corpus, as classification features.
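
The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the suffix list, the `toy_stem` helper, and the two-dimensional vectors are hypothetical stand-ins chosen only to show how normalization merges inflected forms into one feature and how word vectors might be averaged into a document-level feature vector.

```python
from collections import Counter

# Hypothetical suffix list for illustration only; real Serbian stemmers
# rely on much larger, linguistically motivated rule sets.
TOY_SUFFIXES = ("ovima", "ama", "om", "a", "e", "i", "u")


def toy_stem(token, min_stem=3):
    """Strip the longest matching toy suffix, keeping at least min_stem characters."""
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def bag_of_words(text, normalize=None):
    """Count lowercase whitespace tokens, optionally normalizing each one first."""
    tokens = text.lower().split()
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    return Counter(tokens)


def mean_embedding(text, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in text.lower().split() if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Four inflected forms of "film" are four separate features without normalization
# and a single feature after toy stemming.
raw = bag_of_words("film filma filmu filmom")
stemmed = bag_of_words("film filma filmu filmom", normalize=toy_stem)

# Made-up 2-dimensional vectors standing in for embeddings trained on a large corpus.
toy_vectors = {"dobar": [1.0, 0.0], "film": [0.0, 1.0]}
doc_vector = mean_embedding("dobar film", toy_vectors, dim=2)
```

In a realistic setting the counts or the averaged vector would be passed to a classifier as features; the choice of classifier and of embedding model is left open here.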

Keywords

comparative evaluation; lemmatization; morphology; sentiment analysis; stemming; word embeddings
