Privacy Enhancement with Perturbation Method for Multidimensional Grid

Abstract: With the development of technology, the use of big data is spreading at an increasing rate. The problems of storing, analysing and securing data have created a need for new methods. In this study, data privacy and data security are addressed by partially separating the data and processing it with a perturbation method combined with a blockchain approach. Within the scope of this study, data analysis was performed using normalization, geometric rotation, linear regression and scalar data multiplication, together with comparative classification in precision data mining.


Introduction
The system model proposed in this study allows big data to be analysed in a reliable environment. The perturbation approach [1,2] secures the data before the analysis while still allowing meaningful results to be obtained from it. The effectiveness of the proposed method was tested on the Titanic dataset obtained from kaggle.com, to which the proposed perturbation method, which provides data confidentiality, was applied. To assess the accuracy of the proposed method, the decision tree, logistic regression, Naive Bayes, K-Neighbours classifier, random forest, neural network, support vector machine and XGBoost methods were applied. The results demonstrated the superiority of the proposed random perturbation method in terms of both classification accuracy and data privacy, by protecting data privacy, predicting data states and I/O states, and applying independent component analysis. Attack types were predicted and countermeasures were taken using independent component analysis. Through data mining, the relationships between attributes in multivariate data structures were examined and meaningful results were drawn. The effect of the proposed method on data concentration was also analysed.
The remainder of this article is organized as follows. The second section presents the literature review, the third section the proposed methodology and the fourth section the analysis results, where the classification accuracy, attack situations and flexible structures, the time dimension and the scaling of the data are discussed. The fifth section evaluates the results obtained and outlines possible future work.

Literature Survey
The widespread use of big data together with the Internet of Things (IoT) brings the concept of data privacy to the fore in all sectors, from manufacturing to health and from justice to banking. Data mining is the process of discovering interesting and previously unknown information from big data. The need to ensure data confidentiality during data processing therefore also arises. In order to monitor the correlation and autocorrelation structure of multivariate flows efficiently and effectively, and to protect data security, measures are taken against attacks that target noise addition and principal components.
Protecting data privacy while carrying out data mining allows data to be analysed in a healthier and more effective environment. In this study, the perturbation method was preferred for ensuring data confidentiality, and a pre-analysis step was applied that takes the multidimensional structure of the data into account. Data privacy requires such a precautionary pre-processing step [3,5]. Data privacy has recently become an important issue as data have grown in size and in areas of use. Traditional techniques such as perturbation, generalization and sampling are used to ensure the confidentiality of data [4][5][6][7]. The aims are to clean the database, to protect the information in the data group handled in dynamic data analysis [8], and to avoid any deterioration in data quality during the confidentiality process [9]. Association rules are derived by considering loss of confidentiality, loss of information [10], cloaking error and database diversity [11,12]. Finally, the key value is updated in the proposed study, and suitability, success and failure scenarios are taken into account [13]. The use of software-centred and virtualized approaches in 5G networks also brings security and privacy problems. In recent years, researchers have aimed to advance data stream mining by designing efficient security protocols and algorithms. To protect the data during perturbation analysis, data corruption was prevented by using the correlation coefficient together with differential privacy [14][15][16]. Data analysis was also performed by evaluating the neighbourhood matrix [17]. Association rule mining is the process of obtaining rules from correlations in the data [18]. Time series are used in data flow analysis, and the correlation coefficient should be taken into account in the data analysis process.
Here, the short-term collection of event types and temporal correlation without loss of information in event streams are discussed [19][20][21][22][23]. The data analysis process consists of a pre-processing phase, an analysis phase and a noise-mining phase [24][25][26][27][28][29]. In this method, random noise drawn from a known distribution is added to the privacy-sensitive data before the analysis. An approximation to the original data distribution is then reconstructed from the distorted data, and the reconstructed distribution is used for the subsequent analysis. Because noise is added, perturbation-based approaches always involve a trade-off between loss of information and protection of confidentiality [30].
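The noise-addition idea described above can be illustrated with a minimal sketch. It assumes zero-mean Gaussian noise with a known standard deviation and recovers only the mean and variance of the original data, which is a simplification of full distribution reconstruction. The function names, the synthetic data and the noise level are illustrative assumptions and are not taken from the cited works.

```python
import random
import statistics


def add_noise(values, sigma, rng):
    """Return the values with zero-mean Gaussian noise (std sigma) added."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def estimate_original_moments(perturbed, sigma):
    """Estimate mean and variance of the original data from perturbed data.

    Assumes the noise is independent of the data, so that
    Var(perturbed) = Var(original) + sigma**2.
    """
    mean = statistics.fmean(perturbed)
    variance = max(statistics.pvariance(perturbed) - sigma ** 2, 0.0)
    return mean, variance


rng = random.Random(0)
ages = [rng.uniform(1, 80) for _ in range(5000)]  # synthetic values, not real records
noisy_ages = add_noise(ages, sigma=10.0, rng=rng)
est_mean, est_var = estimate_original_moments(noisy_ages, sigma=10.0)

print("true mean/variance:", statistics.fmean(ages), statistics.pvariance(ages))
print("estimated mean/variance:", est_mean, est_var)
```

Individual noisy values differ from the originals, while the aggregate statistics remain approximately recoverable. A larger sigma hides individual values better but makes the estimates less precise, which is the trade-off noted above.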

Perturbation Analysis for Multi-Dimensional Grid
In the data perturbation process, input and output perturbation, noise addition and rule hiding are obtained by adding noise to the data or multiplying the data by noise. Multidimensional perturbation techniques include condensation, random rotation, geometric perturbation and hybrid perturbation. One of the most basic purposes of the data privacy model is the protection of private information. In this process, the system may be vulnerable to minimality, composition and foreground knowledge attacks. For this reason, the study tried to prevent data leakage by applying the perturbation method within a local differential privacy approach.
Data mining can be defined as the process of extracting useful information from data in which it was previously unidentifiable. At the same time, data mining tasks require that the released dataset stays close to the original data in usability while the privacy of the data of individuals and institutions is protected (Fig. 1).

Figure 1: Decision support model considering data privacy analysis
In protecting data confidentiality during the analysis of dynamic data, a covariance matrix is created for each bundle (group) of the data. Geometric rotation is applied to these matrices, and the rotated bundles are then combined, randomly shuffled and released. Applying the perturbation process makes the data ready for analysis; after the process is completed, normalization is performed again, and the rules obtained from the analysis contribute to the decision support system. Grouping is used in particular to increase the effect of perturbation. A rotation equal to the unit (identity) matrix is a special case: the product of a vector and the identity matrix is the same vector, so it produces zero perturbation on the initial vector.
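The group-wise procedure can be sketched as follows for two-dimensional records. This is a simplified illustration under stated assumptions, not the exact procedure of the study: the covariance step is omitted, each group receives its own random angle, and the angle range, group size and function names are hypothetical.

```python
import math
import random


def rotation_matrix(theta):
    """2x2 rotation matrix for an angle theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def rotate(points, matrix):
    """Multiply each 2-D point by the given 2x2 matrix."""
    return [
        (matrix[0][0] * x + matrix[0][1] * y,
         matrix[1][0] * x + matrix[1][1] * y)
        for x, y in points
    ]


def perturb_in_groups(points, group_size, rng):
    """Rotate each group by its own random angle, merge the groups, shuffle."""
    released = []
    for start in range(0, len(points), group_size):
        group = points[start:start + group_size]
        # Assumed angle range that stays away from 0, i.e. from the identity.
        theta = rng.uniform(math.radians(10), math.radians(350))
        released.extend(rotate(group, rotation_matrix(theta)))
    rng.shuffle(released)
    return released


rng = random.Random(42)
records = [(rng.random(), rng.random()) for _ in range(12)]  # synthetic records
released = perturb_in_groups(records, group_size=4, rng=rng)

# The identity matrix (angle 0) leaves every record unchanged: zero perturbation.
unchanged = rotate(records, rotation_matrix(0.0))
```

Because a rotation about the origin preserves vector lengths, the released records keep the same norms as the originals even though their coordinates and order differ.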
Data perturbation (distortion) is a technique that protects the confidentiality of records while keeping the data values in the database meaningful and preserving the relationships between the variables. The applied perturbation method has several variants: total perturbation, random rotation, geometric perturbation, micro-aggregation and data condensation.
Applying the perturbation method before the data mining analysis aims to move the data to a more reliable environment. This change to the original data results in a more reliable environment and, after the analysis, in a result value closer to that of the original data. Data confidentiality is provided more effectively with batch processes. In its simplest form, the perturbation operation is expressed as adding a value to the data, scaled by a certain coefficient (1).
Here, the perturbed data are the original values with noise added. The multidimensional perturbation method is applied because the relationships between the data and the features are taken into account. In order to prevent loss of information and to analyse the data more accurately, the analysis is carried out on large structures and in groups. This process is also described as masking the dataset. It consists of normalizing the data, rotating the data by a certain angle and, at the same time, taking the projection of the data. The random rotation of the data provides a higher level of data security. In the mathematical representation of the rotation process, the transformation is expressed with a rotation matrix and the transformed features are denoted by g(X).
Data security can be ensured through transformations performed on the original data set. The proposed method can be categorized according to whether it is applied to centralized or distributed scenarios. It is also important to be able to transform the data back to the original after the process. Perturbation methods fall into two categories, one-dimensional and multi-dimensional. Since a one-dimensional perturbation method deals with only one attribute, it may cause data drift or a deterioration of the data properties. The random perturbation method, however, involves a four-step process.
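The normalization, rotation and back-transformation steps described above can be sketched for a pair of features. This is a minimal sketch assuming min-max normalization and a fixed 45-degree angle; the feature values are made up for illustration and the function names are hypothetical.

```python
import math


def min_max_normalise(column):
    """Scale a column to the range [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


def rotate_pair(xs, ys, theta):
    """Rotate the feature pair (xs, ys) by theta radians about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    new_x = [c * x - s * y for x, y in zip(xs, ys)]
    new_y = [s * x + c * y for x, y in zip(xs, ys)]
    return new_x, new_y


age = [20.0, 35.0, 50.0, 65.0, 80.0]    # made-up illustrative values
fare = [5.0, 10.0, 30.0, 60.0, 100.0]   # made-up illustrative values

age_n, fare_n = min_max_normalise(age), min_max_normalise(fare)

theta = math.radians(45)
age_p, fare_p = rotate_pair(age_n, fare_n, theta)

# Rotating by the negative angle (the transpose of the rotation matrix)
# transforms the perturbed data back to the normalized original.
age_back, fare_back = rotate_pair(age_p, fare_p, -theta)
```

Only a party that knows the angle can apply the inverse rotation, which is why the rotation parameters have to be kept secret in this kind of scheme.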

Evaluation Criteria
In this study, the evaluation criteria are applied after the data mining analysis, in the last step. In the final assessment, precision, the F1 score and the area under the ROC curve were used as evaluation criteria. The evaluation focuses on the degree of confidentiality of the data and on reducing the loss of data utility. The measurement criteria used here are the confidentiality (security) of the data, the value difference (DF) and the perturbation of the data matrix. Even after perturbation, some data items may keep their importance (rank) relative to other data (RP); the percentage of data items that keep their rank is expressed as RK. In short, RK tests whether an element preserves its rank during the perturbation process. Whether the mean value of each feature changes after perturbation is expressed as the change in the mean value of the features (CP), where the RAV function represents the rank of the mean value. CK, on the other hand, refers to the features that maintain the order of their mean value after perturbation.
CP represents changes in the average value of attributes.

Security
In measuring the degree of confidentiality, security compares how close each feature of the perturbed data is to the original data.

Value Difference (DF)
The value difference represents the change in the data values caused by perturbation.
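Since the formula for the value difference is not given here, the sketch below assumes one definition that is common in the data-distortion literature: the Frobenius norm of the difference between the perturbed and original matrices, divided by the Frobenius norm of the original matrix. Both the definition and the function names are assumptions made for illustration.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a matrix (list of rows)."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def value_difference(original, perturbed):
    """Relative value difference between original and perturbed matrices."""
    diff = [
        [p - o for o, p in zip(row_o, row_p)]
        for row_o, row_p in zip(original, perturbed)
    ]
    return frobenius_norm(diff) / frobenius_norm(original)


original = [[3.0, 4.0], [0.0, 0.0]]
perturbed = [[3.0, 4.0], [3.0, 4.0]]
df = value_difference(original, perturbed)
print(df)
```

Under this assumed definition, a value of 0 means the data are unchanged, and larger values indicate a stronger distortion of the data values.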

Perturbation (PA)
Perturbation (PA) measures how the order (rank) of the data changes in perturbation analysis. The variation in the mean order of the features is expressed in terms of rank-based values.
The rank of each element in the perturbed data matrix is compared with its rank in the original data matrix of order n × m.
The data set is considered as the matrix Dmn (7).

Perturbation Percentage Change (PS)
It refers to how the features are ordered after perturbation and shows whether an element maintains its rank during the perturbation process. After perturbation, each of the features may maintain its respective order. This is expressed as a percentage of the data.

Ranking Change in Mean Value After Perturbation (CP)
Shows the rank change in the mean value of the features after perturbation. RAV refers to the rank of the mean value. CK, on the other hand, indicates the features that preserve their respective order after perturbation.
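The rank-based criteria can be sketched as follows. The formulas are assumptions based on definitions commonly used in the data-distortion literature, since they are not reproduced here: RP is taken as the average absolute rank change of the data items, RK as the fraction of items that keep their rank, CP as the average rank change of the feature means (RAV), and CK as the fraction of features whose mean keeps its rank. The data are stored as a list of feature columns, and all names are illustrative.

```python
def ranks(values):
    """0-based rank of each value in ascending order; ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def column_means(columns):
    return [sum(col) / len(col) for col in columns]


def rank_metrics(original_cols, perturbed_cols):
    """Return the assumed RP, RK, CP and CK measures as a dictionary."""
    n_features = len(original_cols)
    total = n_features * len(original_cols[0])

    rank_shift, rank_kept = 0, 0
    for orig, pert in zip(original_cols, perturbed_cols):
        for r_o, r_p in zip(ranks(orig), ranks(pert)):
            rank_shift += abs(r_o - r_p)
            rank_kept += (r_o == r_p)

    rav_o = ranks(column_means(original_cols))   # rank of each feature mean
    rav_p = ranks(column_means(perturbed_cols))
    cp = sum(abs(a - b) for a, b in zip(rav_o, rav_p)) / n_features
    ck = sum(a == b for a, b in zip(rav_o, rav_p)) / n_features
    return {"RP": rank_shift / total, "RK": rank_kept / total, "CP": cp, "CK": ck}


original_cols = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
perturbed_cols = [[1.0, 3.0, 2.0], [11.0, 19.0, 31.0]]
metrics = rank_metrics(original_cols, perturbed_cols)
print(metrics)
```

In this toy example two items of the first feature swap ranks while the second feature keeps its order, so four of the six items keep their rank and the order of the feature means is unchanged.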

Case Study
The Titanic dataset chosen for the analysis tasks is multivariate and has different dimensions. Apart from the class attribute, the dataset used contains only numeric attributes. Table 1 shows a detailed overview of the dataset used. The Titanic dataset contains some of the characteristics of the passengers who survived and died during the voyage of the Titanic. It contains 1310 records, of which 1043 have no missing values; records with missing values were deleted before the data set was used in this study. The data set has 14 features, 7 numerical and 7 categorical: name, pclass, sex, age, ticket, sibsp, fare, parch, cabin, embarked, boat, body, home.dest and survived. A detailed presentation of the Titanic dataset is shown in Table X. The name, ticket, cabin and home.dest attributes were removed from the data set because they are purely descriptive (Table 1). The boat and body variables, which contain too much missing data, were also removed. Thus, the number of attributes to be protected and classified was reduced to 8: sex, age, pclass, sibsp, fare, parch, embarked and survived. The age and pclass variables were normalized. The results were compared in Figure 3 by applying cross-validation to the normalized and the non-normalized data. Changes in the data before and after perturbation were tested by applying various machine learning algorithms. The algorithm with the highest mean validity in this test was the support vector machine with a value of 0.757262, followed by logistic regression with 0.756456, the neural network with 0.751667 and XGBoost with 0.744844.
The methods with the lowest values are Naive Bayes and then the decision tree, with a value of 0.693031. In terms of the standard deviation of validity, the decision tree comes first, then logistic regression and then XGBoost with a value of 0.077289, followed by the K-Neighbours classifier and the support vector machine (Fig. 3). In the confusion matrix analysis, the negative-negative and positive-positive values are, respectively, 116 and 69 for the decision tree, 131 and 77 for logistic regression, and 124 and 78 for Naive Bayes. The K-Neighbours classifier gave 133 positive-positive and 64 negative-negative values, the random forest 135 and 68, the neural network 140 and 70, and the support vector machine 132 and 73. In this context, the number of correct values is highest, at 210, for the neural network and support vector machine methods, followed by logistic regression with 208, random forest with 203, Naive Bayes with 202 and the K-Neighbours classifier with 197 (Figure 4).
Figure 5 gives the classifier validity results and the cross-validation standard deviations for the decision tree, logistic regression, Naive Bayes, K-Neighbours, random forest and support vector machine methods, applied to 100 records of raw data, to data rotated by 45 degrees and to noisy data rotated by 45 degrees. Alongside the classification results, the degree of confidentiality is given for the 45-degree rotated data with and without noise, and the attack resistance of the data was determined by testing the similarity coefficients of the data before and after processing.
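The comparison of raw, rotated and noisy rotated data can be illustrated with a small, self-contained sketch. It does not reproduce the experiment of this study: it uses 100 synthetic two-class records, a 1-nearest-neighbour classifier evaluated by leave-one-out, and an assumed noise level, all chosen only for illustration.

```python
import math
import random


def rotate(points, theta):
    """Rotate 2-D points by theta radians about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def loo_1nn_accuracy(points, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, p in enumerate(points):
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(points)


rng = random.Random(7)
# Two synthetic Gaussian classes with 50 records each (100 records in total).
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
points += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(50)]
labels = [0] * 50 + [1] * 50

rotated = rotate(points, math.radians(45))
# Assumed noise level (std 0.1) added on top of the rotation.
noisy = [(x + rng.gauss(0, 0.1), y + rng.gauss(0, 0.1)) for x, y in rotated]

acc_raw = loo_1nn_accuracy(points, labels)
acc_rotated = loo_1nn_accuracy(rotated, labels)
acc_noisy = loo_1nn_accuracy(noisy, labels)
print(acc_raw, acc_rotated, acc_noisy)
```

A rotation preserves Euclidean distances, so a distance-based classifier is expected to behave essentially the same on raw and rotated data, whereas added noise can shift some neighbour relations and therefore the accuracy.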

Conclusion
Decomposition methods such as singular value decomposition and polar decomposition can be used to construct the rotation matrix of a particular group, as long as the decomposition yields a submatrix with the properties of a rotation matrix. Ensuring data privacy and security and examining the test parameters were discussed in this study.