A-means: improving the cluster assignment phase of k-means for Big Data

  • Joaquín Pérez Ortega Tecnológico Nacional de México/CENIDET
  • Nelva Nely Almanza Ortega Tecnológico Nacional de México/CENIDET
  • Jorge A. Ruiz-Vanoye Universidad Autónoma del Estado de Hidalgo
  • Rodolfo A. Pazos R. Tecnológico Nacional de México/ITCM
  • Socorro Sáenz Sánchez Tecnológico Nacional de México/CENIDET
  • José María Rodríguez Lelis Tecnológico Nacional de México/CENIDET
  • Alicia Martínez Rebollar Tecnológico Nacional de México/CENIDET
Keywords: K-means, Big Data, NP-hard

Abstract

This paper proposes a new criterion for reducing the processing time of the assignment of data points to clusters for algorithms of the k-means family, when they are applied to instances where the number n of points is large. Our criterion allows a point to be classified in an early stage, excluding it from distance calculations to cluster centroids in subsequent iterations. The proposed criterion uses knowledge of the distance of a point to its two closest centroids and their shifts in the last two iterations. By computer experimentation using synthetic and real instances, we found that this criterion reduces execution time to approximately 2/100 of the time by k-means and generates solutions whose quality is approximately reduced by less than 3%. These findings suggest the usefulness of our criterion for problems like those found in Big Data. The NP-hardness of k-means motivates the use of this heuristics.

Published
2018-02-23
How to Cite
Pérez Ortega, J., Almanza Ortega, N., Ruiz-Vanoye, J., Pazos R., R., Sáenz Sánchez, S., Rodríguez Lelis, J., & Martínez Rebollar, A. (2018). A-means: improving the cluster assignment phase of k-means for Big Data. International Journal of Combinatorial Optimization Problems and Informatics, 9(2), 3-10. Retrieved from https://ijcopi.org/index.php/ojs/article/view/91
Section
Articles