A survey on the state-of-the-art machine learning models in the context of NLP

Wahab Khan, Ali Daud, Jamal A. Nasir, Tehmina Amjad


Machine learning and statistical techniques are powerful analysis tools that have yet to be fully incorporated into the new multidisciplinary field variously termed natural language processing (NLP) or computational linguistics. Linguistic knowledge may be ambiguous; therefore, various NLP tasks are carried out to resolve ambiguity in speech and language processing. The currently prevailing techniques for addressing NLP tasks as supervised learning problems are hidden Markov models (HMM), conditional random fields (CRF), maximum entropy models (MaxEnt), support vector machines (SVM), Naive Bayes, and deep learning (DL). The goals of this survey are to highlight ambiguity in speech and language processing, to give a brief overview of the basic categories of linguistic knowledge, to discuss existing machine learning models and their classification into categories, and finally to provide a comprehensive review of state-of-the-art machine learning models, so that new researchers can examine these techniques and build on them to develop more advanced ones. In this survey we review how avant-garde machine learning models can help resolve this ambiguity.


Ambiguity; linguistic knowledge; machine learning; NLP; supervised learning.




