Clustering is a type of core data processing technology, which has been applied widely in various applications. Traditional clustering algorithms, however, cannot guarantee the balance of clustering results, which is inconsistent with the needs of real-world applications. To address this problem, this paper proposes a K-Means based clustering algorithm with balanced constraints. The proposed clustering algorithm changes the data point assignment process, in each iteration, to achieve balanced data blocks. The maximum cluster size threshold can be customized to meet different division requirements. The algorithm is simple and efficient-only two parameters need to be specified:the number of clusters and the threshold for the maximum cluster size. A series of experiments carried on six real UCI (University of California Irvine) datasets show that the algorithm proposed in this paper has better clustering performance and efficiency compared to other balanced clustering algorithms.
TANG Hai-bo
,
LIN Yu-ming
,
LI You
. A K-Means based clustering algorithm with balanced constraints[J]. Journal of East China Normal University(Natural Science), 2018
, 2018(5)
: 164
-171
.
DOI: 10.3969/j.issn.1000-5641.2018.05.014
[1] AGGARWAL C C, REDDY C K. Data Clustering:Algorithms and Applications[M]. Raton:Chapman & Hall/CRC, 2013.
[2] VISWANATH P, PINKESH R. l-DBSCAN:A fast hybrid density based clustering method[C]//Proceedings of the 18th International Conference on Pattern Recognition. NJ:IEEE, 2006:912-915.
[3] PEIZHUANG W. Pattern recognition with fuzzy objective function algorithms (James C. Bezdek)[J]. SIAM Review, 1983, 25(3):442-442.
[4] MALINEN M I, FRÄNTI P. Balanced k-means for clustering[C]//Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Berlin:Springer, 2014:32-41.
[5] LIU H, HAN J, NIE F, et al. Balanced clustering with least square regression[C]//Proceedings of the 31th AAAI Conference on Artificial Intelligence. CA:AAAI, 2017:2231-2237.
[6] BRADLEY P S, BENNETT K P, DEMIRIZ A. Constrained k-means clustering[J]. Microsoft Research Redmond, 2000, 59(1):1-34.
[7] KUHN H W. The Hungarian method for the assignment problem[J]. Naval Research Logistics, 2005, 52(1):7-21.
[8] WU B, ZHOU Y, YUAN P, et al. Semstore:A semantic-preserving distributed rdf triple store[C]//Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. NY:ACM, 2014:509-518.
[9] BANERJEE A, GHOSH J. On scaling up balanced clustering algorithms[C]//Proceedings of the 2002 SIAM International Conference on Data Mining. PA:Society for Industrial and Applied Mathematics, 2002:333-349.
[10] BANERJEE A, GHOSH J. Frequency-sensitive competitive learning for scalable balanced clustering on highdimensional hyperspheres[J]. IEEE Transactions on Neural Networks, 2004, 15(3):702-719.
[11] NG A Y, JORDAN M I, WEISS Y. On spectral clustering:Analysis and an algorithm[C]//Proceedings of the neural information processing systems. Massachusetts:MIT Press, 2002:849-856.
[12] CAI D, HE X, HAN J. Document clustering using locality preserving indexing[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(12):1624-1637.
[13] STREHL A, CHOSH J. Knowledge reuse framework for combining multiple partitions[J]. Journal of Machine learning Research, 2002, 33(3):583-617.