A survey on the state-of-the-art machine learning models in the context of NLP
Keywords:
Ambiguity, linguistic knowledge, machine learning, NLP, supervised learning.
Abstract
Machine learning and statistical techniques are powerful analysis tools that have yet to be incorporated into the new multidisciplinary field variously termed natural language processing (NLP) or computational linguistics. Linguistic knowledge may be ambiguous; therefore, various NLP tasks are carried out to resolve ambiguity in speech and language processing. The currently prevailing techniques for addressing NLP tasks as supervised learning problems are hidden Markov models (HMM), conditional random fields (CRF), maximum entropy models (MaxEnt), support vector machines (SVM), Naive Bayes, and deep learning (DL). The goals of this survey are to highlight ambiguity in speech and language processing, to provide a brief overview of the basic categories of linguistic knowledge, to discuss existing machine learning models and their classification into categories, and finally to provide a comprehensive review of state-of-the-art machine learning models, so that new researchers can examine these techniques and build on them to develop more advanced ones. In this survey we review how avant-garde machine learning models can help with this dilemma.