A Comprehensive Analysis of Web-based Frequency in Multiword Expression Detection

Hande Aka Uymaz, Senem Kumova Metin
  • Hande Aka Uymaz
    İzmir University of Economics, Turkey


Multiword expressions (MWEs) are syntactic and/or semantic units in language in which the meaning of the whole is only loosely connected to the meanings of the constituent units. The most prominent property that distinguishes MWEs from random word combinations is recurrence, which is commonly measured by the occurrence frequencies of the MWE and of its constituent words. Although occurrence frequency measures are known to perform best in distinguishing MWEs from random combinations, their performance depends mainly on the quality and size of the data source from which the frequencies are obtained. The main goal of this study is to provide a detailed analysis of how the performance of frequency-based measures changes when the traditional frequency source, a corpus, is replaced with a massive and dynamic data source, the World Wide Web. To use the web as a frequency source, the constituent words and word combinations are submitted as queries to a popular search engine, and the number of results returned for each query is taken as the web-based frequency of the corresponding word or word combination.
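As a rough illustration of how result counts could feed a frequency-based measure, the sketch below computes pointwise mutual information (PMI) for a two-word candidate. PMI is a widely used association measure, but the abstract does not name the 20 metrics evaluated, so its use here is an assumption. The function name `pmi`, the constant `WEB_TOTAL`, and all counts are hypothetical and are not taken from the study.

```python
import math


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2) of a two-word candidate.

    All arguments are occurrence counts; `total` is the assumed size of
    the frequency source (tokens in a corpus, or indexed pages on the web).
    """
    if min(pair_count, w1_count, w2_count) <= 0:
        # No usable evidence: treat the pair as maximally unassociated.
        return float("-inf")
    return math.log2(pair_count * total / (w1_count * w2_count))


# Hypothetical counts for illustration only; not data from the study.
WEB_TOTAL = 1_000_000_000  # assumed number of indexed pages
web_hits = {"w1": 2_000_000, "w2": 500_000, "w1 w2": 120_000}

score = pmi(web_hits["w1 w2"], web_hits["w1"], web_hits["w2"], WEB_TOTAL)
print(f"web-based PMI: {score:.2f}")
```

The same function works with corpus counts if `total` is set to the corpus size, which is what makes a like-for-like comparison of the two frequency sources possible.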

In this study, web-based frequencies are employed in three different MWE detection experiments on a Turkish data set. In the first group of experiments, the individual performances of 20 well-known frequency metrics in ranking MWE candidates by their tendency to be an MWE are examined. Secondly, the most successful frequency metrics are determined by a feature selection method, filtering. Lastly, MWE detection is treated as a classification problem: eight supervised methods are applied in order to show the combined performance of the frequency metrics when the frequencies are obtained from the web. In all experiments, the performance of web-based frequencies in identifying MWEs is compared with that of traditional corpus-based frequencies. The experimental results show that using web-based frequencies to identify MWEs yields promising results.
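The ranking experiment can be pictured with a minimal sketch: candidates are sorted by a metric score and the top of the list is checked against annotated MWEs. The evaluation measure used in the study is not stated in the abstract, so precision at k is an assumption here, and the candidate names, scores, and gold labels are all hypothetical.

```python
def rank_candidates(scores):
    """Return candidates sorted from highest to lowest metric score."""
    return sorted(scores, key=scores.get, reverse=True)


def precision_at_k(ranking, gold, k):
    """Fraction of the top-k ranked candidates that are annotated MWEs."""
    top = ranking[:k]
    return sum(1 for cand in top if cand in gold) / k


# Hypothetical scores of one metric computed from two frequency sources.
corpus_scores = {"cand_a": 3.0, "cand_b": 1.0, "cand_c": 2.0, "cand_d": 0.5}
web_scores = {"cand_a": 2.5, "cand_b": 2.8, "cand_c": 0.9, "cand_d": 0.4}
gold_mwes = {"cand_a", "cand_b"}  # hypothetical annotation

for name, scores in (("corpus", corpus_scores), ("web", web_scores)):
    ranking = rank_candidates(scores)
    print(name, ranking, precision_at_k(ranking, gold_mwes, k=2))
```

Running the same evaluation on both score tables is one simple way to compare a corpus-based and a web-based version of the same metric.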


multiword expressions; occurrence frequency; web-based frequency; feature selection; supervised learning


Submitted: 2017-05-03 15:32:10
Published: 2017-09-29 16:13:25





Copyright (c) 2017 International Journal of Intelligent Systems and Applications in Engineering

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
© Prof.Dr. Ismail SARITAS 2013-2018     -    Address: Selcuk University, Faculty of Technology 42031 Selcuklu, Konya/TURKEY.