Gene expression data classification: some distance-based methods

Olusola Samuel Makinde


A microarray dataset is a classical example of high-throughput data, characterized by more features (genes) than sample points (gene expression levels). A number of classification techniques have been proposed in the literature, but many of them are either computationally expensive or perform sub-optimally. In this paper, some distance functions are considered and classification rules based on them are formulated. The distance functions include the average distance measure, the distance to the component-wise median, and the distance to the mean. These methods are computationally simple and are expected to perform well for gene expression data. We also define a probabilistic approach to classification rules based on two of the distance measures. A gene selection technique based on shrunken centroids regularized discriminant analysis was applied to the small round blue cell tumour, colon cancer, lymphoma, prostate cancer and leukaemia data before applying the classification rules. Three simulation studies were performed to mimic gene expression data. The performance of the distance-based classification methods was compared with that of some known classification methods from the literature. The distance-based methods were also applied to the gene expression data. Their performance is competitive with that of some existing classification methods, and they are computationally simple and cheap to run.
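The three distance-based rules named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the Euclidean metric, the function names (`euclidean`, `class_centre`, `classify`) and the toy data are all assumptions made for illustration, and the abstract does not specify the exact form of the distance functions. Each rule assigns an observation to the class that minimises the chosen distance.

```python
import math
from statistics import mean, median


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def class_centre(samples, centre):
    """Apply `centre` (mean or median) to each feature column of a class."""
    return [centre(list(column)) for column in zip(*samples)]


def classify(x, classes, rule="mean"):
    """Assign x to the class with the smallest distance under `rule`.

    classes: dict mapping a class label to a list of training vectors.
    rule:    "mean"    -> distance to the class mean vector
             "median"  -> distance to the component-wise median
             "average" -> average distance to all training points of the class
    """
    scores = {}
    for label, samples in classes.items():
        if rule == "mean":
            scores[label] = euclidean(x, class_centre(samples, mean))
        elif rule == "median":
            scores[label] = euclidean(x, class_centre(samples, median))
        elif rule == "average":
            scores[label] = mean([euclidean(x, s) for s in samples])
        else:
            raise ValueError(f"unknown rule: {rule}")
    return min(scores, key=scores.get)


# Hypothetical toy data: two classes, three "genes" per sample.
train = {
    "A": [[0.0, 0.1, 0.2], [0.2, 0.0, 0.1], [0.1, 0.2, 0.0]],
    "B": [[1.0, 1.1, 0.9], [0.9, 1.0, 1.2], [1.1, 0.9, 1.0]],
}

for rule in ("mean", "median", "average"):
    print(rule, classify([0.3, 0.2, 0.1], train, rule))
```

Each rule needs only one pass over the training data per class, which is consistent with the abstract's claim of low computational cost. In a setting with many more genes than samples, a gene selection step would normally precede this.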




