A Fast Clustering Algorithm for Power Data

Energy conservation is an urgent global issue. An increasingly common approach to energy saving and emission reduction is the application of data mining techniques, including data clustering, to power systems. However, power data is large in volume, high-dimensional, discrete and complex, which leads to poor clustering results when common classic clustering algorithms are used. In this paper, we propose the D-CFSFDP algorithm, which is suitable for power data clustering. We compare it experimentally with the DBSCAN and K-means algorithms, and we demonstrate its effectiveness on power data from the Shanghai Energy Conservation Supervision Center.


Introduction
Energy is an important foundation for the survival and development of human society. At present, energy consumption is growing rapidly and demand for energy is increasing greatly, so improving energy efficiency and saving energy on a global scale is urgent. In 2016, the G20 Energy Efficiency Leading Programme (EELP), endorsed at the Hangzhou G20 Summit, stated that energy conservation and efficient energy consumption are among the best ways to rationalize the use of energy resources and the most important medium- and long-term measures for addressing climate change in every country. Energy saving and emission reduction in the power industry are fundamental to resource and environmental safety and are directly linked to the overall goal of achieving energy saving and emission reduction [1].
Recently, a new approach to energy saving and emission reduction has been the application of data mining technology to power systems [2]. We joined a project of the Shanghai Energy Conservation Supervision Center that aims to find ways to save energy by mining large amounts of power data. However, power data is large in volume, high-dimensional, discrete and complex, which leads to poor clustering results when common classic clustering algorithms are used. To solve this problem, we propose D-CFSFDP, a clustering algorithm suitable for power data, which is based on the algorithm proposed in 2014 by Alex Rodriguez and Alessandro Laio [3].
The rest of this paper is organized as follows. Section 2 discusses related work on the CFSFDP algorithm. Section 3 describes our proposed algorithm in detail. Section 4 presents the experiments and evaluation. Section 5 concludes the paper.

Related Work
Alex Rodriguez and Alessandro Laio [3] proposed a clustering algorithm in 2014. Their method, clustering by fast search and find of density peaks (CFSFDP), clusters data by finding density peaks. CFSFDP is based on two assumptions: a cluster center has a higher density than its surrounding neighbors, and it lies at a large distance from other cluster centers. Based on these assumptions, CFSFDP provides a heuristic tool, known as the decision graph, for selecting cluster centers manually. However, manual selection of cluster centers is a major limitation of CFSFDP in intelligent data analysis.
Rongfang Bie, Rashid Mehmood et al. [4] proposed a fuzzy-CFSFDP method for adaptively selecting the cluster centers. The fuzzy-CFSFDP method uses fuzzy rules based on the aforementioned assumptions to select cluster centers, and the resulting clusters were compared with those of state-of-the-art methods.
Zhang WenKai and Li Jing [5] proposed E_CFSFDP, an extension of CFSFDP inspired by the hierarchical clustering algorithm CHAMELEON, because CFSFDP does not perform well when one cluster has more than one density peak, a situation they call "no density peaks". They first use the original CFSFDP to generate initial clusters and then merge the sub-clusters in a second phase. They applied the algorithm to several data sets that exhibit "no density peaks".
Shihua Liu and Bingzhong Zhou [6] proposed the DPC_M algorithm, based on CFSFDP. The DPC algorithm constructs a decision graph by computing a local density and a relative distance in order to discover the cluster centers in a data set. The remaining data points are then allocated in a single pass to the cluster of the nearest cluster center. The key issue for the DPC algorithm on mixed data is how to define the distance measure between data points. The DPC_M algorithm, designed for clustering mixed data, is therefore constructed using a new unified dissimilarity metric between mixed data points.

Proposed Algorithm
We first introduce the background of the D-CFSFDP algorithm and then describe the algorithm in detail in the rest of this section.

Background
The CFSFDP algorithm [3] is based on the assumptions that cluster centers are surrounded by neighbours with lower local density and that they are at a relatively large distance from any point with a higher local density. For each data point $x_i$, we only need to compute two quantities: its local density $\rho_i$ and its distance $\delta_i$ from points of higher density. The local density of data point $x_i$ is defined as:
$$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & \text{otherwise} \end{cases}$$
where $d_{ij}$ is the distance between points $x_i$ and $x_j$, and $d_c$ is a cutoff distance. $\delta_i$ is measured by computing the minimum distance between point $x_i$ and any other point $x_j$ with higher density:
$$\delta_i = \min_{j:\, \rho_j > \rho_i} (d_{ij})$$
For the point with the highest density, we conventionally take
$$\delta_i = \max_{j} (d_{ij})$$
Generally, $\rho_i$ is equal to the number of points that are closer than $d_c$ to point $x_i$. The algorithm is sensitive only to the relative magnitude of $\rho_i$ at different points, so for large data sets the results of the analysis are robust with respect to the choice of $d_c$.
Data points with both a large local density $\rho$ and a large distance $\delta$ are then selected as cluster centers. To determine the number of cluster centers quantitatively, the authors of [3] define $\gamma_i = \rho_i \delta_i$. Data points with a higher value of $\gamma$ are more likely to be cluster centers, so the $\gamma$ values are sorted in descending order and the data points with relatively large $\gamma$ are chosen. This algorithm has many advantages and is suitable for power data clustering. However, it has problems in some special cases, and we make improvements to obtain a better power data clustering result.
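The quantities described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration of the original CFSFDP definitions, not the authors' implementation: the function names, the toy data, the cutoff value and the `n_centers` parameter are all assumptions made for the example, and points that tie for the highest density are all given the maximum distance.

```python
import numpy as np

def local_density(dist, d_c):
    """rho_i: number of other points closer than d_c to point i."""
    # The diagonal (distance 0 to itself) always passes the test, so subtract 1.
    return (dist < d_c).sum(axis=1) - 1

def higher_density_distance(dist, rho):
    """delta_i: minimum distance to any point of higher density.

    Points with no denser neighbour (the density maximum, or ties for it)
    take the maximum distance instead.
    """
    delta = np.empty(len(rho))
    for i in range(len(rho)):
        higher = rho > rho[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    return delta

# Toy 1-D data with two tight groups (illustrative values, not power data).
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist = np.abs(x[:, None] - x[None, :])  # pairwise distances d_ij

rho = local_density(dist, d_c=0.15)
delta = higher_density_distance(dist, rho)
gamma = rho * delta

# Sort gamma in descending order; n_centers is a hypothetical manual choice.
n_centers = 2
centers = np.argsort(-gamma, kind="stable")[:n_centers]
```

On this toy data the middle point of each group has the largest $\gamma$, so the two groups each contribute one candidate center. In practice the number of centers is read off from the sorted $\gamma$ values rather than fixed in advance.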

D-CFSFDP Algorithm
As mentioned above, the original algorithm has problems in some special cases. One arises when the local density of data point $x_i$ is very close to that of point $x_j$: the two points, although they belong to the same cluster, will be divided into two clusters by the distance formula. This case is discussed in detail in Step 5, where we extend the distance formula to handle it. We also apply z-score scaling to $\rho$ and $\delta$ separately before calculating $\gamma$, as described in detail in Step 6.
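The z-score scaling mentioned here can be sketched as follows. This shows only the standard standardization $z = (v - \mu) / \sigma$ applied to each quantity separately; the function name, the sample values and the use of the population standard deviation are assumptions for illustration, and how the scaled values are combined into $\gamma$ is left to Step 6.

```python
import numpy as np

def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values are identical: return zeros instead of dividing by zero.
        return np.zeros_like(values)
    return (values - values.mean()) / std

# Illustrative rho and delta values (not taken from the power data).
rho = np.array([1, 2, 1, 1, 2, 1])
delta = np.array([0.1, 5.1, 0.1, 0.1, 5.1, 0.1])

rho_z = z_score(rho)      # scaled separately from delta
delta_z = z_score(delta)  # both now share a common, unit-free scale
```

Scaling the two quantities separately puts them on a comparable scale, so that neither dominates the other merely because of its units or range.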
Step 1: Data Preprocessing
Step 2: Calculating distance $d_{ij}$
In data mining, including clustering analysis, the similarity between data points is generally measured by distance. The popular Minkowski distance is defined based on the $L_p$ norm:
$$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^p \right)^{1/p}$$
where $x_k$ and $y_k$ are the $k$-th components of the $n$-dimensional points $x$ and $y$. When $p$ is equal to 2, $d(x, y)$ is the Euclidean distance $L_2$. The Euclidean distance between two points is the length of the straight line segment connecting them. Euclidean distance is the most commonly used distance [8,9], and it is also the choice in our algorithm.
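The distance calculation of Step 2 can be sketched with NumPy broadcasting. The function name and the sample points below are assumptions for illustration; the function computes the full matrix of Minkowski distances and defaults to the Euclidean case $p = 2$.

```python
import numpy as np

def minkowski_distance_matrix(points, p=2):
    """Return the matrix of pairwise Minkowski distances d_ij.

    points: array of shape (n_points, n_dims); p=2 gives Euclidean distance.
    """
    points = np.asarray(points, dtype=float)
    # Absolute coordinate differences for every pair, shape (n, n, n_dims).
    diff = np.abs(points[:, None, :] - points[None, :, :])
    return (diff ** p).sum(axis=-1) ** (1.0 / p)

# Three illustrative 2-D points on a line through the origin.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = minkowski_distance_matrix(points, p=2)
```

This version holds an intermediate array of size n × n × n_dims in memory, so for large, high-dimensional data sets a blockwise computation would be preferable.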
Step 3: Selection of $d_c$
The choice of the parameter $d_c$ is very important: a value that is too large or too small may degrade the performance of the algorithm. If $d_c$ is too large, the local density of each data point will be larger than it should be, which may decrease the number of clusters. If it is too small, data points that originally belong to one cluster may be divided into several clusters, significantly increasing the number of clusters. The authors of [3] suggest choosing $d_c$ so that the average number of neighbours is around 1% to 2% of the total number of points in the data set. Our power data sets are large, and the choice for our project is about 2%.
Step Problems do exist with this formula in some special circumstances. For example, if i