Research on Density Peak Clustering Algorithm Based on Artificial Bee Colony Optimization

Abstract: This paper proposes a density peak clustering algorithm based on artificial bee colony optimization. The improved DPC algorithm better realizes the automatic identification and reasonable clustering of data points between clusters. It also reduces the difficulty of value selection in the original density peak clustering (DPC) algorithm and the limitation of its nearest-neighbor aggregation principle in low-density areas. The clustering effectiveness of the proposed algorithm is verified on some known data sets and custom data sets, and clustering comparison experiments with the K-Means, AP, and DPC algorithms are carried out on multiple classical data sets. The experimental results show that, compared with the DPC algorithm, the proposed algorithm automatically identifies and reasonably clusters data points between clusters, automatically identifies cluster centers and clusters, and handles randomly distributed data sets.


Introduction
In the era of Internet big data, the volume of data has exploded. Various industries have gradually adopted computer technology to manage data, greatly improving the ability to generate, collect, store and process data. Cluster analysis can analyze data without any prior information and identify the potential structure of the data space [1]. It is widely used in pattern recognition, data analysis, image processing, market research and many other fields. As a result, various clustering algorithms have emerged. In 1975 and 1979, Hartigan and Wong [2] proposed the simple, efficient and widely used K-means clustering algorithm. In 1998, Alsabti et al. proposed a K-means algorithm based on a cost function, which avoided determining the number of clusters in advance and partially reduced the limitations of K-means [3]. In 2007, Frey and Dueck proposed the affinity propagation (AP) algorithm and successfully applied it to image recognition [4].
In 2014, Rodriguez and Laio proposed a clustering algorithm based on density peaks (the DPC algorithm). This method can quickly find the density peak points of data sets of any shape, and can efficiently assign samples and reject outliers. A large number of experiments have confirmed the excellent performance of the DPC algorithm [5], but its ability to identify and classify data points between clusters still needs further verification.

Density peak clustering based on artificial bee colony optimization
The DPC algorithm can automatically find the cluster centers of a data set. The selected cluster centers have a higher density and are relatively far away from other cluster centers. Nevertheless, the algorithm has certain limitations.
The density peak clustering algorithm based on artificial bee colony optimization proposed in this paper adjusts the aggregation principle while fully inheriting the advantages of the DPC algorithm. It mainly improves the sensitivity of the DPC algorithm to data points between clusters, and proposes a more rational aggregation principle for cluster points. The proposed algorithm is divided into the following six steps: (1) calculate the density of the data points and generate a decision map; (2) perform initial clustering; (3) identify data points between clusters; (4) assign preliminary cluster labels to the data points between clusters; (5) determine the cluster labels of the data points between clusters; (6) complete the clustering.

Calculate the density of data points and generate decision maps
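The density of each data point is calculated in the same way as in the DPC algorithm:

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases}$$

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}$$

In the formulas, $\rho_i$ is the density value of the $i$-th data point, $d_{ij}$ represents the distance from the $i$-th data point to the $j$-th data point, $d_c$ is the cutoff distance, and $\chi(x) = 1$ when $x < 0$ and $\chi(x) = 0$ otherwise. $\delta_i$ represents the minimum of the distances from the $i$-th data point to all points with a higher density value. When the $i$-th data point has the maximum density value, the default is $\delta_i = \max_j (d_{ij})$.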
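The density and distance quantities used by DPC can be sketched as follows. This is a minimal illustration that assumes the cutoff-kernel form of the DPC density and Euclidean distance. The function names, the sample points and the value of `d_c` are hypothetical and are not taken from the paper.

```python
import math

def pairwise_distances(points):
    """Return the full Euclidean distance matrix d[i][j]."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def density_and_delta(points, d_c):
    """Compute the DPC local density rho_i and distance delta_i for every point."""
    dist = pairwise_distances(points)
    n = len(points)
    # rho_i: number of other points closer than the cutoff distance d_c
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to all points with a strictly higher density
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no higher-density point falls back to its largest distance
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical toy data: a small group near the origin and a distant pair
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
rho, delta = density_and_delta(points, d_c=1.2)
print(rho)    # [2, 1, 1, 1, 1]
print(delta)
```

Points with both a large `rho` and a large `delta` are the candidates for cluster centers on the decision map. Note that density ties (as for the distant pair here) are not treated specially in this sketch.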

Perform initial clustering
Before clustering each data point, the proposed algorithm considers not only the higher-density point nearest to it, but also the higher-density point that is second nearest to it. When these two points belong to the same class, the data point is considered to have a relatively clear clustering result: it is assigned to the same cluster as these two points and is placed in the set of data points that can be clearly classified. When the two points belong to different classes, the clustering result of the data point may be ambiguous because of its special position. It may belong to either of the two classes, so the data point is placed in the set of data points between clusters.
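The two-neighbor test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: cluster centers and densities are taken as given, density ties are broken by index, and an ambiguous point receives the label of its nearest higher-density point as a provisional label so that later points can still be processed. All names are hypothetical.

```python
import math

def initial_clustering(points, rho, centers):
    """Split points into a clearly classified set and a between-cluster set."""
    # Process points from high to low density (ties broken by index: an assumption)
    order = sorted(range(len(points)), key=lambda i: (-rho[i], i))
    labels, neighbors = {}, {}
    clear, between = [], []
    for pos, i in enumerate(order):
        if i in centers:
            labels[i] = centers.index(i)  # each center starts its own cluster
            clear.append(i)
            continue
        # Higher-density points, sorted by distance to point i
        earlier = sorted(order[:pos], key=lambda j: math.dist(points[i], points[j]))
        if not earlier:
            raise ValueError("the densest point must be a cluster center")
        nneigh1 = earlier[0]
        nneigh2 = earlier[1] if len(earlier) > 1 else nneigh1
        neighbors[i] = (nneigh1, nneigh2)
        labels[i] = labels[nneigh1]  # provisional label (assumption of this sketch)
        if labels[nneigh1] == labels[nneigh2]:
            clear.append(i)
        else:
            between.append(i)
    return labels, clear, between, neighbors

# Hypothetical 2D data with hand-picked densities; points 0 and 3 are the centers
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (4.0, 0.0), (2.4, 0.0)]
rho = [5, 4, 3, 2, 1]
labels, clear, between, neighbors = initial_clustering(points, rho, centers=[0, 3])
print(clear)    # [0, 1, 2, 3]
print(between)  # [4]
```

In this toy example the last point has its two nearest higher-density points in different clusters, so it is the only one placed in the between-cluster set.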

Identify data points between clusters
After the preliminary classification in 2.2, all data points are divided into two sets: the set of data points that can be clearly classified and the set of data points between clusters. First, according to the proposed clustering principle, the points in the clearly classifiable set are clustered. Second, the data points in the between-cluster set are reclassified using

$$\gamma_i = \left| Dist_{i,nneigh1} - Dist_{i,nneigh2} \right|$$

where $Dist_{i,nneigh1}$ represents the distance from the $i$-th data point to its nearest higher-density point nneigh1, $Dist_{i,nneigh2}$ represents the distance from the $i$-th data point to its second-nearest higher-density point nneigh2, and $\gamma_i$ represents the difference between the smallest and the second-smallest of these distances. Depending on the value of $\gamma_i$, one of the following two operations is performed on each data point in the between-cluster set:
(1) When $\gamma_i > d_c$, the distances from the $i$-th data point to nneigh1 and nneigh2 are considered significantly different. The data point then lies at the edge of the cluster of whichever of nneigh1 and nneigh2 is closer, so its cluster label should be the same as that of the closer point.
(2) When $\gamma_i \le d_c$, the distances from the $i$-th data point to nneigh1 and nneigh2 are considered not significantly different. The data point then lies in the middle of the two clusters to which nneigh1 and nneigh2 belong.
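The two cases can be written as a small decision function. This is a hedged sketch: the function name and return convention are hypothetical, and $\gamma_i$ is taken as the absolute difference of the two distances. In case (1) the function returns a single label. In case (2) it returns both candidate labels, leaving the final choice to the later optimization step.

```python
def gamma_rule(dist_n1, dist_n2, label_n1, label_n2, d_c):
    """Return (gamma, candidate_labels) for one between-cluster point."""
    gamma = abs(dist_n1 - dist_n2)
    if gamma > d_c:
        # Case (1): clearly closer to one neighbor, take that neighbor's label
        closer = label_n1 if dist_n1 <= dist_n2 else label_n2
        return gamma, [closer]
    # Case (2): roughly midway between the clusters, keep both labels as candidates
    return gamma, sorted({label_n1, label_n2})

# Hypothetical values: d_c = 0.5, neighbor labels 0 and 1
print(gamma_rule(1.0, 3.0, 0, 1, d_c=0.5)[1])  # [0]
print(gamma_rule(1.4, 1.6, 0, 1, d_c=0.5)[1])  # [0, 1]
```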

Preliminary cluster labeling of data points between clusters
Before the bee colony algorithm is used to find the optimal solution, the most likely cluster label of each data point between clusters must first be determined. For each between-cluster data point, its most likely cluster label can be determined from its known relationship with nneigh1 and nneigh2.

Determining the cluster labels of data points between clusters
Halkidi and Vazirgiannis proposed a clustering evaluation index (CDbw) based on the compactness of clusters and the separation between clusters [6]. It can effectively measure the classification quality of arbitrarily distributed data points. The proposed algorithm uses the bee colony algorithm with CDbw as the objective function. Each candidate clustering of the data points between clusters is scored by its CDbw value, and the candidate with the best value is taken as the optimal solution, that is, the best clustering result for the data points between clusters.
In 2006, the bee colony algorithm was proposed by Karaboga as a new optimization method that simulates the foraging behavior of honey bees [7]. In 2014, Karaboga et al. showed that the bee colony algorithm can be widely applied to feature selection [8], real parameter optimization, job scheduling [9], the traveling salesman problem [10] and so on.
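The search over candidate labels can be illustrated with a simplified, discrete bee colony loop. This is only a sketch under several assumptions. The objective below is a stand-in (negative squared distance to hypothetical cluster centroids) and is not CDbw. The parameters `n_sources`, `limit` and `n_iter`, as well as all function names, are illustrative and not taken from the paper.

```python
import math
import random

def abc_maximize(candidates, objective, n_sources=6, limit=5, n_iter=30, seed=0):
    """candidates[k] lists the allowed labels of the k-th between-cluster point."""
    rng = random.Random(seed)
    best = {"labels": None, "score": float("-inf")}

    def evaluate(source):
        score = objective(source)
        if score > best["score"]:  # remember the best solution ever evaluated
            best["labels"], best["score"] = list(source), score
        return score

    def new_source():
        return [rng.choice(c) for c in candidates]

    sources = [new_source() for _ in range(n_sources)]
    scores = [evaluate(s) for s in sources]
    trials = [0] * n_sources

    def try_improve(idx):
        # Local move: re-draw the label of one randomly chosen point
        trial = list(sources[idx])
        k = rng.randrange(len(trial))
        trial[k] = rng.choice(candidates[k])
        score = evaluate(trial)
        if score > scores[idx]:
            sources[idx], scores[idx], trials[idx] = trial, score, 0
        else:
            trials[idx] += 1

    for _ in range(n_iter):
        for idx in range(n_sources):  # employed bees
            try_improve(idx)
        low = min(scores)             # onlooker bees favor better sources
        weights = [s - low + 1e-9 for s in scores]
        for idx in rng.choices(range(n_sources), weights=weights, k=n_sources):
            try_improve(idx)
        for idx in range(n_sources):  # scout bees replace exhausted sources
            if trials[idx] > limit:
                sources[idx] = new_source()
                scores[idx] = evaluate(sources[idx])
                trials[idx] = 0
    return best["labels"], best["score"]

# Hypothetical data: two between-cluster points and two cluster centroids
centroids = {0: (0.5, 0.0), 1: (4.0, 0.0)}
between_points = [(1.8, 0.0), (2.6, 0.0)]

def stand_in_objective(labels):
    # Stand-in for CDbw: higher is better when points sit close to their centroid
    return -sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(between_points, labels))

labels, score = abc_maximize([[0, 1], [0, 1]], stand_in_objective)
print(labels, score)
```

Restricting each point to the labels of nneigh1 and nneigh2 keeps the search space small. Replacing the stand-in with a real CDbw implementation would change only the `objective` argument.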
The clustering result of the clearly classifiable data points from 2.2 and the clustering result of the data points between clusters from 2.4 are integrated into the final result, and the final clustering result is evaluated by the silhouette coefficient (Sil) and the F value (F-Measure).
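As an illustration of the Sil indicator, the mean silhouette coefficient can be computed as below. This is a minimal sketch assuming Euclidean distance and the common convention that points in singleton clusters contribute 0. The function name and sample data are hypothetical, and the F-Measure is not covered here.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    n = len(points)
    total = 0.0
    for i in range(n):
        # Group distances from point i to every other point by cluster label
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(points[i], points[j]))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            continue  # singleton cluster or a single cluster: contributes 0
        a = sum(own) / len(own)                                  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())    # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Hypothetical data: two well-separated pairs
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette(points, [0, 0, 1, 1]))  # close to 1: compact, well-separated clusters
print(silhouette(points, [0, 1, 0, 1]))  # negative: points grouped with the wrong neighbors
```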

Experimental results
In order to measure the validity and stability of the algorithm, some known data sets and custom 2D data sets are selected as input data, and the clustering results are compared with those of the K-Means, AP and DPC algorithms. The known data sets used are shown in Table 1.

Result analysis
The algorithm proposed in this paper is based on the assumptions of the DPC algorithm, but its clustering principle is completely different. The algorithm achieves reasonable clustering of data points between clusters. Its results were compared and analyzed experimentally against those of the three clustering methods K-Means, AP and DPC. The experiments show that, for arbitrarily distributed data sets, the proposed algorithm has obvious advantages in the identification and classification of data points between clusters.

Table 1: Known data sets to be tested

In order to evaluate the classification effect of the proposed clustering algorithm, the commonly used F-Measure and Sil evaluation indicators are adopted. The clustering results of three different clustering algorithms (K-means, AP, DPC) and of the proposed method on three different known data sets (Flame, Aggregation, R15) are given in Table 2 and Table 3. These results are compared to evaluate the clustering effect of the proposed algorithm.

Table 2
Values of clustering results for three different data sets

Table 3
Values of clustering results for three different data sets