Trade-off between the number of index-terms and the information retrieval system’s performance

Authors

  • Lakshmi Srinivasa Rengan
  • Sathyabhama Balasubramanian
  • Batri Krishnan

Keywords:

High-frequency, index term, information retrieval, low-frequency, term frequency.

Abstract

The performance of modern information retrieval (IR) systems depends on the index terms and their occurrence frequency, so even a small variation in the frequency of index terms alters the performance of an IR system. This article analyzes how the performance of IR systems varies with the frequency of index terms. Based on occurrence frequency, we classified the index terms as ‘Low’ and ‘High’ frequency terms and recorded the performance of each class. Low-frequency terms tend to decrease the performance of IR systems, whereas high-frequency terms perform better: they yield a 10% performance improvement over low-frequency terms. By deleting the low-frequency index terms, we can save up to 65% of index terms with a maximum of 26% degradation in the performance of IR systems.
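The classification described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the toy corpus, the function name `split_by_frequency`, and the cut-off `threshold=2` are all assumptions made for illustration, and the resulting numbers do not reproduce the figures reported above.

```python
from collections import Counter


def split_by_frequency(docs, threshold):
    """Count collection-wide term occurrences and split the vocabulary.

    Terms occurring fewer than `threshold` times are treated as
    low-frequency; the rest as high-frequency. The threshold is a
    hypothetical parameter, not a value taken from the article.
    """
    counts = Counter(term for doc in docs for term in doc.lower().split())
    low = {term for term, count in counts.items() if count < threshold}
    high = set(counts) - low
    return counts, low, high


# Toy corpus (assumed for illustration only).
docs = [
    "index terms drive retrieval",
    "retrieval systems rank index terms",
    "rare terms seldom help retrieval",
]

counts, low, high = split_by_frequency(docs, threshold=2)

# Fraction of index terms that would be removed by pruning the low class.
saved = len(low) / len(counts)
print(sorted(high))
print(f"{saved:.0%} of index terms pruned")
```

Measuring the effect on retrieval quality would additionally require queries and relevance judgments, which this sketch leaves out.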

Published

01-11-2017