A Novel Clustering Method Suitable for Clustering of Biological Signal Datasets Containing Batched Outliers
Keywords:
Batched outliers, clustering, data mining, force fields, sparse data.
Abstract
In clustering analyses, batched outliers of one class that fall close to another class can be a significant problem. Such outliers may be absorbed into the wrong class or cause spurious classes to be identified, and either error can shift the estimated cluster centers. Here we propose a novel method for accurately classifying outliers in batched clustering analyses, aimed specifically at the type of outlier most often encountered in biological signals. The proposed divisive hierarchical clustering method is based on how much each element in the dataset is unwanted by the other elements. In this method, the reluctance vectors applied to each element by the other elements are first determined. Based on the common features of the reluctance vectors (their horizontal and vertical components), two initial classes are formed from a subset of the elements. All remaining elements are then assigned to one of these classes according to their proximity to it. Next, using the reluctance vectors that develop between the two established classes, classes that might be divided again are identified, and further classes are formed using the same splitting method. To validate this approach, which we named the selfish data clustering (SDC) method, a real dataset was analyzed using the SDC method and other commonly applied clustering methods. We found that our clustering method outperformed the conventional approaches by up to 12% (6% on average) in datasets with low silhouette values.
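For readers unfamiliar with the silhouette value mentioned above, the following minimal Python sketch computes it per element using the standard definition s(i) = (b - a) / max(a, b), where a is the mean distance from element i to the other members of its own class and b is the smallest mean distance from i to the members of any other class. It is an illustration only and is not part of the SDC method; the function name `silhouette_values`, the Euclidean distance, and the convention of returning 0 for an element with no valid a or b are assumptions made for this example.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette value s(i) of every point (illustrative helper)."""
    n = len(points)
    values = []
    for i in range(n):
        # Group the distances from point i to all other points by class label.
        dists = {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])  # Euclidean distance (assumed)
            dists.setdefault(labels[j], []).append(d)

        own = dists.pop(labels[i], None)
        if not own or not dists:
            # Assumed convention: s(i) = 0 when point i is alone in its
            # class or when no other class exists.
            values.append(0.0)
            continue

        a = sum(own) / len(own)                             # mean intra-class distance
        b = min(sum(d) / len(d) for d in dists.values())    # nearest other class
        values.append((b - a) / max(a, b))
    return values


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(silhouette_values(pts, [0, 0, 1, 1]))  # well-separated labelling
    print(silhouette_values(pts, [0, 1, 0, 1]))  # labelling that mixes the two groups
```

Values near 1 indicate elements that sit well inside their class, while values near 0 or below indicate elements lying between classes or assigned to the wrong one. A dataset with a low mean silhouette value therefore has poorly separated classes, which is the regime the abstract refers to.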