Trade-off between the number of index-terms and the information retrieval system’s performance
Keywords:
High-frequency, index term, information retrieval, low-frequency, term frequency.
Abstract
The performance of modern information retrieval (IR) systems depends on the index terms and their occurrence frequency, so even a small variation in the frequency of index terms alters the performance of an IR system. This article analyzes how the performance of IR systems varies with changes in the frequency of index terms. Based on occurrence frequency, we classified the index terms as 'low-frequency' and 'high-frequency' terms and recorded the performance of each class. Low-frequency terms tend to decrease the performance of IR systems, whereas high-frequency terms perform better: they yield a 10% performance improvement over low-frequency terms. By deleting the low-frequency index terms, we can save up to 65% of the index terms with at most 26% degradation in the performance of IR systems.
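The classification and deletion step described in the abstract can be illustrated with a minimal sketch. It assumes that "occurrence frequency" means the total count of a term across the collection and that a single cutoff separates the two classes; the abstract states neither, so the `threshold` parameter, the function names, and the toy documents below are hypothetical and are not the authors' method.

```python
from collections import Counter


def split_by_frequency(documents, threshold):
    """Split index terms into low- and high-frequency sets.

    Assumption: frequency is the collection-wide occurrence count, and
    terms occurring fewer than `threshold` times count as low-frequency.
    """
    counts = Counter(term for doc in documents for term in doc)
    low = {term for term, count in counts.items() if count < threshold}
    high = set(counts) - low
    return low, high


def prune_index(documents, low_terms):
    """Drop low-frequency terms from every tokenized document."""
    return [[term for term in doc if term not in low_terms] for doc in documents]


# Toy collection of already tokenized documents (illustrative only).
docs = [
    ["index", "term", "retrieval"],
    ["index", "term", "frequency"],
    ["index", "retrieval", "luhn"],
]

low, high = split_by_frequency(docs, threshold=2)
pruned = prune_index(docs, low)

# Fraction of distinct index terms removed by the pruning step.
saved = len(low) / (len(low) + len(high))
print(sorted(low), sorted(high), saved)
```

Measuring the effect on retrieval performance would additionally require a query set with relevance judgments, which this sketch does not model.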