Research on fast clustering privacy protection of mixed data based on blockchain

Abstract: In real production environments, many of the data sets that need protection contain attributes of more than one type, that is, they are mixed data sets. To address this problem, this paper proposes a blockchain-based fast clustering method for privacy protection of mixed data. The method targets complex data types in the cloud environment and uses a different measurement method for each data type. For a mixed data set, it calculates the neighborhood density and relative distance of each sample point, selects the k sample points with high density and large relative distance as the initial cluster centers, performs clustering, and uploads the completed results to the blockchain. For the generated clustering results, the numerical cluster centers are calculated and the set of attribute values of the non-numerical data is generated. This ensures that each user can correctly obtain the iterative process and the final cluster centers, and reduces the error rate of the data.


Introduction
As the era of big data develops, the amount of data grows day by day. Within the scope permitted by law, analyzing these data to extract the information hidden in them has become common practice in all walks of life. However, during data analysis all data are exposed to the network indiscriminately, which poses serious risks to the privacy of data owners. How to use data on the network reasonably while protecting personal privacy has therefore become an urgent problem in the cloud environment.
At the same time, as data volumes grow, data have changed from a single numerical type to mixed data of multiple types. Traditional privacy protection methods mostly focus on numerical data and ignore the influence of other data types, resulting in a low level of data privacy and a low security factor. To address these problems, this paper proposes a fast clustering method for mixed data.

Related Work
Yan et al. [1] mainly focused on processing general numerical data sets, using histograms to protect privacy. Fletcher et al. [2] proposed a differential privacy decision forest algorithm for single-type data, which shortens data query time and reduces the amount of noise added. Liu et al. [3] proposed a clustering-based differential privacy data publishing method, but this method is also applicable only to numerical data sets. The above methods can protect the privacy of only a single type of data source to a certain extent, whereas practical applications involve a large amount of data of other types. Soria et al. [4] proposed a micro-aggregation algorithm for mixed attribute data sets, which can effectively handle differential privacy protection for mixed data types. Li et al. [5] proposed a method combining k-anonymity and differential privacy to release structural data in micro clusters, but it is not applicable to large mixed data sets. Dan et al. [6] proposed a privacy-preserving k-means clustering algorithm that not only satisfies differential privacy but also has an approximation error guarantee, in which the approximation error is sublinear in the dimension of the data. Ni et al. [7] proposed a differentially private k-means clustering algorithm based on cluster merging (DP-KCCM).

Differential privacy protection
Definition 1. For a randomized algorithm $A$, let $R_A$ be the set of all possible outputs of $A$, and let $S$ be any subset of $R_A$. For any two adjacent data sets $D_0$ and $D_1$, if the algorithm satisfies the following formula [8]:
$$\Pr[A(D_0) \in S] \le e^{\varepsilon} \, \Pr[A(D_1) \in S]$$
then algorithm $A$ satisfies $\varepsilon$-differential privacy, where $\varepsilon$ is the privacy protection budget. The privacy protection intensity of algorithm $A$ is measured by $\varepsilon$: the smaller $\varepsilon$ is, the higher the privacy protection intensity; the larger $\varepsilon$ is, the lower the intensity.
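As a worked illustration of how the privacy budget bounds the ratio of output probabilities on adjacent data sets, the factor $e^{\varepsilon}$ can be evaluated for a few budgets. The values of $\varepsilon$ below are chosen only for illustration and are not taken from the experiments in this paper.

```latex
\begin{align*}
\varepsilon = 0.1 &: \quad e^{\varepsilon} \approx 1.105 \\
\varepsilon = 1   &: \quad e^{\varepsilon} \approx 2.718 \\
\varepsilon = 5   &: \quad e^{\varepsilon} \approx 148.4
\end{align*}
```

With $\varepsilon = 0.1$, any set of outputs is at most about 10.5% more likely on one data set than on its neighbor, whereas with $\varepsilon = 5$ the two probabilities may differ by a factor of almost 150, which is a much weaker guarantee.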

Laplacian mechanism
Theorem 1. For a data set $D$, let $f: D \to \mathbb{R}^d$ be a query function. If the algorithm $K$ satisfies:
$$K(D) = f(D) + \mathrm{Lap}\left(\frac{\Delta f}{\varepsilon}\right)$$
then algorithm $K$ provides $\varepsilon$-differential privacy protection, where $\Delta f$ represents the global sensitivity of $f$ and $\mathrm{Lap}(\Delta f / \varepsilon)$ represents the amount of noise added to the data set.
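The Laplacian mechanism can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this paper: the function names `laplace_noise` and `laplace_mechanism`, the counting query, and the parameter values are all assumptions made for the example. Noise with scale equal to the global sensitivity divided by the privacy budget is added independently to each coordinate of the query answer.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return f(D) + Lap(sensitivity / epsilon) for each coordinate of f(D).

    `true_answer` is the exact query result as a list of floats and
    `sensitivity` is the global (L1) sensitivity of the query.
    """
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [value + laplace_noise(scale, rng) for value in true_answer]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical counting query: adding or removing one record changes
    # the count by at most 1, so the global sensitivity is assumed to be 1.
    print(laplace_mechanism([120.0], sensitivity=1.0, epsilon=0.5, rng=rng))
```

A smaller `epsilon` gives a larger noise scale and therefore stronger protection at the cost of accuracy, which matches the role of the privacy budget in Definition 1.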

Measurement Method
For mixed data sets, this paper divides the data by attribute into two types: numerical data and non-numerical data. A different measurement method is designed for each type according to its characteristics.
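For numerical data, the Minkowski distance is adopted in this paper. For given samples $x_i = (x_{i1}, \ldots, x_{im})$ and $x_j = (x_{j1}, \ldots, x_{jm})$, $p = 1$ is taken, so the distance between samples is:
$$d_{num}(x_i, x_j) = \sum_{k=1}^{m} |x_{ik} - x_{jk}| \quad (3)$$
For non-numerical data, this paper measures the distance between samples by their degree of dissimilarity. The values $x_{ik}$ and $x_{jk}$ of a given attribute are compared by simple matching:
$$\delta(x_{ik}, x_{jk}) = \begin{cases} 0, & x_{ik} = x_{jk} \\ 1, & x_{ik} \ne x_{jk} \end{cases}$$
Then the distance between the two samples is defined as:
$$d_{cat}(x_i, x_j) = \sum_{k=m+1}^{q} \delta(x_{ik}, x_{jk})$$
According to the above method, for a mixed data set $X = \{x_1, x_2, \ldots, x_n\}$ that has $q$ attributes, that is, $r_1, \ldots, r_m, r_{m+1}, \ldots, r_q$, if a cluster center $c$ is randomly selected, then the distance between the sample $x_i$ and the cluster center is:
$$d(x_i, c) = \sum_{k=1}^{m} |x_{ik} - c_k| + \sum_{k=m+1}^{q} \delta(x_{ik}, c_k)$$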
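The two measurement methods can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: it assumes that the numerical attributes come first in each record, that the mixed distance is the plain unweighted sum of the numerical and non-numerical parts, and the function names and sample values are hypothetical.

```python
def numeric_distance(a, b):
    """Minkowski distance with p = 1 (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def categorical_distance(a, b):
    """Simple matching: count the attributes whose values differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def mixed_distance(sample, center, num_numeric):
    """Distance between a sample and a cluster center.

    Assumes the first `num_numeric` attributes are numerical and the
    remaining ones are non-numerical, and adds the two parts unweighted.
    """
    return (numeric_distance(sample[:num_numeric], center[:num_numeric])
            + categorical_distance(sample[num_numeric:], center[num_numeric:]))


if __name__ == "__main__":
    # Hypothetical records: two numerical and two non-numerical attributes.
    sample = (1.5, 2.0, "M", "a")
    center = (1.0, 4.0, "F", "a")
    print(mixed_distance(sample, center, num_numeric=2))  # prints 3.5
```

In practice the numerical attributes would usually be scaled to a common range first, so that a single mismatch on a non-numerical attribute is comparable in size to a difference on a numerical one.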

Initial cluster center
According to the neighborhood density calculation method, the initial cluster centers of the mixed data are obtained as follows:
1) Calculate the neighborhood density $N$ for each sample of the initial data set $X$;
2) Arrange the neighborhood densities {…}
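One way to select initial centers from neighborhood density and relative distance is sketched below. Because the full procedure is not available here, several choices are assumptions made for illustration: the neighborhood density is taken as the number of other samples within a fixed `radius`, the relative distance as the distance to the nearest sample of higher density, and the k samples with the largest product of the two are returned. The names `initial_centers`, `radius`, and `num_numeric` are hypothetical.

```python
def mixed_distance(a, b, num_numeric):
    """Manhattan distance on the numerical part plus simple-matching mismatches."""
    numeric = sum(abs(x - y) for x, y in zip(a[:num_numeric], b[:num_numeric]))
    mismatches = sum(1 for x, y in zip(a[num_numeric:], b[num_numeric:]) if x != y)
    return numeric + mismatches


def initial_centers(samples, k, radius, num_numeric):
    """Pick k samples that are dense and far from any denser sample."""
    n = len(samples)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of samples")
    dist = [[mixed_distance(samples[i], samples[j], num_numeric)
             for j in range(n)] for i in range(n)]

    # Neighborhood density: how many other samples lie within `radius`.
    density = [sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
               for i in range(n)]

    # Relative distance: distance to the nearest sample that ranks higher
    # in density (ties are broken by input order). The top-ranked sample
    # gets its largest distance to any sample instead.
    by_density = sorted(range(n), key=lambda i: -density[i])
    relative = [0.0] * n
    for rank, i in enumerate(by_density):
        if rank == 0:
            relative[i] = max(dist[i])
        else:
            relative[i] = min(dist[i][j] for j in by_density[:rank])

    # Rank samples by density * relative distance and keep the top k.
    ranked = sorted(range(n), key=lambda i: density[i] * relative[i], reverse=True)
    return [samples[i] for i in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical data: one numerical and one non-numerical attribute.
    data = [(0.0, "a"), (0.1, "a"), (0.2, "a"),
            (10.0, "b"), (10.1, "b"), (10.2, "b"), (10.3, "b")]
    print(initial_centers(data, k=2, radius=0.5, num_numeric=1))
```

Choosing centers this way avoids random initialization: each selected sample sits in a dense region, and the relative-distance term keeps two centers from being drawn from the same dense region.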

Data disturbance
The clustered data set should be perturbed to achieve differential privacy protection. In this paper, the Laplacian mechanism is adopted to perturb the numerical data, and the exponential mechanism is used to perturb the non-numerical data. The clustering then proceeds as follows:
1) Calculate the distance between each sample $x_i$ in the original data set $X$ and the cluster centers $M$;
2) Recalculate the center of each cluster separately for the numerical and the non-numerical attributes;
3) According to the recalculated cluster centers, judge whether the data in the original clusters have changed. If there is no change, the clustering ends and the clustered data set is obtained.
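The perturbation of a single cluster center can be sketched as follows. This is a hedged illustration, not the paper's exact procedure: it assumes that each numerical center is a noisy mean built from a Laplace-perturbed sum and count with the budget split evenly between them, and that each non-numerical center value is drawn by the exponential mechanism with the value's frequency in the cluster as its score. All function names, bounds, and parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_numeric_center(values, lower, upper, epsilon, rng):
    """Noisy mean of one numerical attribute of a cluster.

    Values are clipped to [lower, upper]. Half of epsilon is spent on the
    sum (sensitivity max(|lower|, |upper|)) and half on the count
    (sensitivity 1). This budget split is an assumption of the sketch.
    """
    if epsilon <= 0 or lower >= upper:
        raise ValueError("need epsilon > 0 and lower < upper")
    clipped = [min(max(v, lower), upper) for v in values]
    sum_sensitivity = max(abs(lower), abs(upper))
    noisy_sum = sum(clipped) + laplace_noise(sum_sensitivity / (epsilon / 2.0), rng)
    noisy_count = len(clipped) + laplace_noise(1.0 / (epsilon / 2.0), rng)
    # Guard against a non-positive noisy count in very small clusters.
    center = noisy_sum / max(noisy_count, 1.0)
    return min(max(center, lower), upper)


def noisy_categorical_center(values, domain, epsilon, rng):
    """Pick one attribute value with the exponential mechanism.

    The score of a candidate is its frequency in the cluster (sensitivity 1),
    so it is chosen with probability proportional to exp(epsilon * score / 2).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    counts = Counter(values)
    scores = [counts[v] for v in domain]
    top = max(scores)  # shift the scores for numerical stability
    weights = [math.exp(epsilon * (s - top) / 2.0) for s in scores]
    return rng.choices(domain, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical cluster: one numerical and one non-numerical attribute.
    ages = [9.0, 10.0, 11.0, 10.0, 12.0, 8.0]
    sexes = ["M", "M", "F", "M", "I", "M"]
    print(noisy_numeric_center(ages, lower=0.0, upper=30.0, epsilon=1.0, rng=rng))
    print(noisy_categorical_center(sexes, ["M", "F", "I"], epsilon=1.0, rng=rng))
```

The exponential mechanism is used for the non-numerical attribute because adding numeric noise to a category has no meaning; instead, frequent values are returned with higher probability while every value in the domain remains possible.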

Data set processing
In this paper, the abalone data set is selected for the experiments. First, invalid data and attributes are processed, and 4177 data records are retained. Four attributes are selected for experimental analysis: the numerical attributes age and time, and the non-numerical attributes sex and name.
The algorithm is analyzed by evaluating the accuracy of the clustering results, which is measured with the variance formula [9].

Experimental results and analysis
In the experiments, the proposed algorithm is compared with the traditional DP-KCCM algorithm and the MDAV algorithm [10]. The traditional DP-KCCM algorithm considers only the numerical data in the data set and ignores the non-numerical data. The MDAV algorithm provides differential privacy protection based on micro-clustering. Figure 1 shows the error rate of the clustering results of the three algorithms on the data set. Under the same $\varepsilon$ value, the error rate of the proposed algorithm is lower, and it tends to stabilize as $\varepsilon$ increases. Figure 2 shows how the NIV value changes for the three algorithms. As Figure 2 shows, the NIV value of the proposed algorithm is significantly lower than that of the other two algorithms, and much lower than that of DP-KCCM, indicating that the proposed algorithm clusters more effectively. The experimental results show that data processed by the proposed algorithm have a lower error rate and a more accurate clustering result, which makes the algorithm better suited to privacy protection of mixed data. However, its time complexity and space complexity still need to be optimized.

Conclusion
To address current privacy problems, this paper proposes a blockchain-based fast clustering algorithm for privacy protection of mixed data. Building on differential privacy and the transparency of blockchain, the algorithm clusters the data according to type and adds noise perturbation during processing, thereby protecting data privacy.