A Novel Clustering Method Suitable for Clustering of Biological Signal Datasets Containing Batched Outliers

Selahaddin B. Akben

Abstract


During clustering analyses, batched outliers of one class that fall close to another class can be a significant problem. Such outliers may be assigned to the wrong class or lead to the identification of spurious classes, which in turn causes the cluster centers to be mislocated. Here we propose a novel method for the accurate classification of batched outliers in clustering analyses, aimed specifically at the type of outliers most often encountered in biological signals. The proposed divisive hierarchical clustering method is based on how much each element in the dataset is unwanted by the other elements. In this method, the reluctance vectors applied to each element by the other elements are first determined. According to the common features of the reluctance vectors (horizontal and vertical components), two initial classes are obtained from some of the elements. All remaining elements are then assigned to these classes according to their proximity to them. Then, using the reluctance vectors between the two established classes, classes that might be divided again are identified, and further classes are formed using the same splitting method. To validate this approach, which we named the selfish data clustering (SDC) method, a real dataset was analyzed using the SDC method and other commonly applied clustering methods. We found that our clustering method outperformed the conventional approaches by up to 12% (6% on average) in datasets with low silhouette values.
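
Since the comparison above is reported for datasets with low silhouette values, a minimal sketch of the standard silhouette computation may help make that condition concrete. It is not part of the SDC method itself. The function names and the toy points below are hypothetical and chosen only for illustration. The second dataset adds a small batch of class 1 points close to class 0, which is the situation the abstract describes.

```python
from math import dist


def silhouette_values(points, labels):
    """Return the silhouette value s(i) of every point.

    s(i) = (b - a) / max(a, b), where a is the mean distance from point i
    to the other members of its own cluster and b is the smallest mean
    distance from point i to the members of any other cluster.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            # Convention: a singleton cluster (or a single cluster) scores 0.
            values.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def mean_silhouette(points, labels):
    """Average silhouette value of the whole dataset."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


# Hypothetical toy data: two well-separated classes.
separated = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
separated_labels = [0, 0, 1, 1]

# Same data plus a batch of class 1 outliers lying close to class 0.
batched = separated + [(2.0, 0.0), (2.0, 1.0)]
batched_labels = separated_labels + [1, 1]

print(mean_silhouette(separated, separated_labels))
print(mean_silhouette(batched, batched_labels))
```

In this toy case the batched outliers receive negative silhouette values under their true labels, and the dataset average drops sharply. That is the sense in which such datasets have low silhouette values.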


Keywords


Batched outliers; clustering; data mining; force fields; sparse data.


