Perturbation Methods for Protecting Data Privacy: A Review of Techniques and Applications

Abstract: Perturbation methods are mathematical techniques that add controlled noise or randomness to data to protect privacy while still allowing data analysis. Methods such as randomized response, differential privacy, secure multi-party computation, noise addition, and sampling and aggregation are used to protect sensitive information from disclosure or exploitation. These methods have been successfully applied in machine learning, statistics, and cryptography. However, their implementation must be carefully designed to avoid compromising data accuracy or introducing bias into the analysis. Overall, perturbation methods offer a promising approach to protecting data privacy. This review provides an overview of the perturbation methods used for this purpose in machine learning, statistics, and cryptography.


Introduction
The increasing amount of digital data generated by individuals and organizations has raised concerns about the privacy and security of sensitive information. Unauthorized access to personal data can lead to identity theft, financial fraud, and other malicious activities. Data privacy is a critical issue, particularly for sensitive data that contains personal or confidential information. Perturbation methods are a set of mathematical techniques that can be used to protect data privacy by adding controlled noise or randomness to data while still allowing data analysis. These methods can be applied to various fields, including machine learning, statistics, and cryptography, to prevent attackers from identifying individuals or sensitive information. This review provides an overview of perturbation methods used to protect data privacy, including their advantages and limitations, and the importance of careful implementation to ensure accuracy and prevent bias.
Data privacy has become a critical concern in today's information age, as organizations collect and store vast amounts of data about individuals. While data analysis can provide valuable insights and improve decision-making, it also poses a risk to individuals' privacy. Perturbation methods offer a promising approach to protecting data privacy while still allowing data analysis. These methods involve adding controlled noise or randomness to data to preserve privacy. In this review, we will provide an overview of perturbation methods used to protect data privacy, including randomized response, differential privacy, secure multi-party computation (SMC), noise addition, and sampling and aggregation. We will also discuss the benefits and limitations of these methods and their potential applications in various fields. By understanding the various perturbation methods available for data privacy protection, researchers and practitioners can make informed decisions about how to protect sensitive information while still allowing data analysis.
The remainder of this study is organized as follows. Section 2 reviews the literature on perturbation methods for data privacy, Section 3 presents the models and their analysis, and Section 4 concludes.

Literature Survey
A significant amount of research has been conducted on perturbation methods used to protect data privacy. The following literature survey highlights some of the key findings and contributions of various studies in this field:
1) Randomized Response: This technique adds randomness to the responses of individuals in a survey or questionnaire, making it difficult for an attacker to determine the true response [1,2]. Warner first introduced it in 1965 to protect individual privacy in surveys [3]. Since then, it has been widely used in various fields, including healthcare, social sciences, and marketing. Ghosh and Roth (2011) proposed a generalized randomized response method that provides better privacy guarantees than the original method [4,5].
2) Differential Privacy: This technique adds random noise to prevent attackers from identifying individuals. It can be applied to a range of data analysis techniques, such as machine learning, statistics, and data mining. Dwork et al. (2006) first introduced differential privacy [6]. Since then, it has become a popular method for protecting sensitive information. Wang et al. (2019) proposed a differential privacy algorithm for deep learning models that offers stronger privacy guarantees than existing methods [7]. The most common mechanisms for differential privacy are the Laplace, exponential, and Gaussian mechanisms. They work by adding noise to the original data entry or to the answer computed from it, and can be applied to both real-valued and categorical features. The Laplace distribution is a symmetric version of the exponential distribution, and the Laplace mechanism adds noise drawn from this symmetric continuous distribution to the true answer according to equation 1 [8]:
$$M_L(x, f, \varepsilon) = f(x) + \mathrm{Lap}\left(\frac{\Delta f}{\varepsilon}\right) \qquad (1)$$
where $f$ is the query, $\Delta f$ is its sensitivity, and $\varepsilon$ is the privacy parameter. The exponential mechanism, on the other hand, selects and outputs an element $r \in \mathcal{R}$ with probability proportional to equation 2:
$$\exp\left(\frac{\varepsilon \, u(x, r)}{2 \Delta u}\right) \qquad (2)$$
where $x$ is an input and $u$ is a utility function with generalized sensitivity $\Delta u$.
3) Secure Multi-Party Computation (SMC): SMC has been widely used in privacy-preserving data analysis, where data from different sources is combined to perform a joint analysis. Chaum et al. (1988) proposed a practical approach for secure computation of statistical functions using SMC [9]. Lindell and Pinkas (2000) proposed a practical SMC protocol that is widely used in various applications [10,11,12].
4) Noise Addition: This technique involves adding a small amount of random noise to the data before releasing it for analysis. Several researchers have proposed noise addition-based approaches for privacy-preserving principal component analysis that ensure data privacy while maintaining data utility [13,14,15,16].
5) Sampling and Aggregation: Sampling involves selecting a subset of data to analyse, while aggregation involves combining data from multiple sources to perform an analysis. These techniques can be used to reduce the risk of sensitive information being disclosed while still allowing for accurate data analysis [17,18]. The effectiveness of this method has been studied in various applications, including data mining and machine learning [2].
6) Privacy-Preserving Machine Learning: Studies in this area examine the privacy risks associated with machine learning algorithms and present various perturbation methods to protect data privacy in machine learning. The authors discuss the advantages and limitations of each method and highlight their applications in machine learning [19,20].
7) Privacy-Preserving Data Mining: Charu et al. provide an overview of various perturbation methods, including differential privacy, randomization, and secure multi-party computation [21].
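The randomized response, Laplace, and exponential mechanisms surveyed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not an implementation taken from the cited works: the function names, the truth probability `p`, and the use of Python's standard `random` module are choices made here, and the mechanisms are written in their commonly stated textbook forms.

```python
import math
import random


def randomized_response(truth, p, rng):
    """Warner-style design: report the truth with probability p, otherwise its negation."""
    return truth if rng.random() < p else not truth


def estimate_yes_rate(reports, p):
    """Estimate the true 'yes' rate from randomized reports (requires p != 0.5)."""
    observed = sum(reports) / len(reports)
    # P(report yes) = rate * p + (1 - rate) * (1 - p), solved for rate.
    return (observed - (1 - p)) / (2 * p - 1)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon to a numeric answer."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_answer + noise


def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng):
    """Pick a candidate with probability proportional to exp(eps * u / (2 * sensitivity))."""
    scores = [epsilon * utility(c) / (2 * sensitivity) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

No single randomized report reveals an individual's true answer with certainty, yet `estimate_yes_rate` recovers the population rate from many reports. In both noise mechanisms, a smaller `epsilon` gives stronger privacy at the cost of noisier output.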
In addition to these methods, other perturbation techniques have also been proposed, including data swapping, data masking, and k-anonymity. These techniques have been studied in various applications and have shown promising results for protecting data privacy.

Model and analysis
The choice of perturbation method depends on the specific data analysis task and the desired level of privacy protection. Differential privacy provides a strong privacy guarantee, but may be computationally expensive and result in reduced data accuracy. Randomized response and noise addition offer a tuneable trade-off between privacy and accuracy, but may not provide strong privacy protection against more sophisticated attacks. Sampling and aggregation are computationally efficient and can be easily applied to large data sets, but share the same weakness against sophisticated attacks. It is important to carefully design and implement these methods to ensure that they do not compromise data accuracy or introduce bias into the results (Figure 1).

Figure 1. Privacy Preserving Data Mining (PPDM) Techniques
This section describes the following methods: data perturbation and rotation perturbation; principal component analysis (PCA); projection perturbation; geometric data perturbation; data swapping; data randomization; heuristic methods to protect data privacy; k-anonymity; k-anonymity with l-diversity; k-anonymity with l-closeness; personalized privacy preserving; utility-based privacy preserving; cryptographic methods; secure multiparty computation; horizontally partitioned data; and vertically partitioned data.

Data Perturbation-Rotation Perturbation
Rotation perturbation in Principal Component Analysis (PCA) is a technique used to add noise or perturbation to the principal components while preserving the overall structure of the data. It involves rotating the principal components and perturbing the rotated components (Figure 2). The specific formulas depend on the chosen perturbation method and the desired level of perturbation. The goal is to introduce noise while preserving the overall structure and statistical properties of the data. It is important to select appropriate perturbation parameters and techniques to balance privacy protection and data utility in the perturbed data (Figure 3).
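As a minimal sketch of the rotation step only, assume two-dimensional records and a rotation about the origin by a secret angle. The function name `rotate_records` and the sample data are hypothetical. The sketch shows why rotation preserves structure: Euclidean distances between records are unchanged, so distance-based analyses still work on the perturbed data.

```python
import math
import random


def rotate_records(records, angle):
    """Rotate each 2D record (x, y) about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in records]


rng = random.Random(0)
secret_angle = rng.uniform(0.0, 2.0 * math.pi)  # kept private by the data owner
original = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]
perturbed = rotate_records(original, secret_angle)
```

In higher dimensions the same idea would use a random orthogonal matrix in place of the single angle.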

Rotation Perturbation Principal Component Analysis (PCA)
Projection perturbation is a technique used to add noise or perturbations to numerical data while preserving certain statistical properties. It involves projecting the data onto a lower-dimensional space and perturbing the projected values. The specific formula for projection perturbation depends on the perturbation method used.

Projection Perturbation
Data perturbation and projection perturbation are two techniques commonly used in data science and machine learning to protect data privacy. Data perturbation involves adding random noise to the data in order to protect the privacy of individual data points. The level of noise added can be controlled by a privacy budget, which balances privacy protection and data utility. Projection perturbation, on the other hand, involves projecting the data onto a lower-dimensional space while adding noise to the projection. This technique can help to remove identifying features of the data while preserving the overall structure and relationships between the data points. The choice of method depends on the specific application and privacy requirements. Additionally, the level of noise or dimensionality reduction used should be carefully chosen to balance privacy protection and data utility (Figure 4).
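The projection-plus-noise idea can be sketched as follows, assuming a random Gaussian projection matrix and additive Gaussian noise. The name `project_and_perturb` and the parameters `k` and `noise_std` are hypothetical, and the choice of a Gaussian matrix is an assumption made here rather than a method specified in the text.

```python
import math
import random


def project_and_perturb(records, k, noise_std, rng):
    """Project d-dimensional records to k dimensions, then add Gaussian noise."""
    d = len(records[0])
    # Random Gaussian matrix (d x k); the 1/sqrt(k) scaling keeps squared
    # lengths unchanged in expectation.
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)] for _ in range(d)]
    result = []
    for row in records:
        projected = [sum(row[i] * matrix[i][j] for i in range(d)) for j in range(k)]
        result.append([value + rng.gauss(0.0, noise_std) for value in projected])
    return result
```

A smaller `k` or a larger `noise_std` hides more of the original records but distorts the relationships between them more, which is the trade-off described above.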

Geometric Data Perturbation
Geometric data perturbation is a technique used to add noise or perturbations to geometric data in order to protect privacy while preserving the general shape or structure of the data. The specific formula for geometric data perturbation may vary depending on the perturbation method used (Figure 5).
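One formulation found in the literature combines a rotation, a translation, and a small random noise term. The sketch below assumes that formulation for two-dimensional records; the function name and all parameters are hypothetical and are not taken from this review.

```python
import math
import random


def geometric_perturbation(records, angle, shift, noise_std, rng):
    """Rotate 2D records by `angle`, translate by `shift`, then add Gaussian noise."""
    c, s = math.cos(angle), math.sin(angle)
    shift_x, shift_y = shift
    perturbed = []
    for x, y in records:
        new_x = c * x - s * y + shift_x + rng.gauss(0.0, noise_std)
        new_y = s * x + c * y + shift_y + rng.gauss(0.0, noise_std)
        perturbed.append((new_x, new_y))
    return perturbed
```

With `noise_std` set to zero the transformation is rigid, so the shape of the data is preserved exactly. The noise term trades some of that fidelity for resistance to attacks that try to undo the rotation and translation.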

Data Swapping
Data swapping is a privacy-preserving technique used to protect sensitive information while preserving the statistical properties of the data. It involves swapping or exchanging values between data records in a way that maintains the overall data distribution but obscures the original relationships between individual records. The specific formula for data swapping depends on the swapping method used (Figure 6). For example, k-anonymity-based swapping selects suitable records within a cluster and swaps the values of their sensitive attributes. The exact implementation may vary depending on the specific algorithm used for k-anonymity.
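A minimal sketch of swapping within clusters, assuming records are dictionaries and that a cluster is defined by a shared value of one grouping attribute. The names `swap_within_groups`, `attribute`, and `group_key` are hypothetical.

```python
import random
from collections import defaultdict


def swap_within_groups(records, attribute, group_key, rng):
    """Shuffle the values of `attribute` among records that share `group_key`."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[record[group_key]].append(index)

    swapped = [dict(record) for record in records]  # leave the input untouched
    for indices in groups.values():
        values = [records[i][attribute] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            swapped[i][attribute] = value
    return swapped
```

Because values only move between records of the same group, the distribution of the sensitive attribute within each group (and therefore overall) is unchanged, while the link between a particular record and its value is obscured.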

Data Randomization
Data randomization is a technique used to protect data privacy by introducing random noise or perturbation to the original data values. The specific formula for data randomization depends on the randomization method used (Figure 7).

Figure 7: Data Randomization Steps
The random noise is typically generated from a specific distribution, and the privacy parameter controls the level of privacy protection provided.
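As a sketch of additive randomization, assume zero-mean Gaussian noise whose standard deviation plays the role of the privacy parameter. The function name `randomize` and the sample data are hypothetical.

```python
import random


def randomize(values, noise_std, rng):
    """Return a copy of `values` with independent zero-mean Gaussian noise added."""
    return [value + rng.gauss(0.0, noise_std) for value in values]


def mean(values):
    return sum(values) / len(values)


rng = random.Random(4)
original = [float(i % 100) for i in range(10000)]
released = randomize(original, noise_std=5.0, rng=rng)
```

Individual released values differ from the originals, but because the noise has zero mean, aggregate statistics such as `mean(released)` stay close to those of the original data when the dataset is large.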

Heuristic Methods
Heuristic methods are commonly used in data science and machine learning to protect data privacy. These methods involve using general problem-solving techniques to develop strategies and rules for protecting sensitive data.
One example of a heuristic method for data privacy is k-anonymity, which is a technique used to ensure that each record in a dataset is indistinguishable from at least k-1 other records in the dataset. This involves grouping similar records together and removing any identifying information that could be used to link a record to a specific individual.
Another example is l-diversity, which is a technique used to ensure that each group of indistinguishable records contains at least l different values for the sensitive attribute. This helps to prevent attackers from linking sensitive attributes to specific individuals in the dataset. Other heuristic methods for data privacy include t-closeness, differential privacy, and machine learning-based techniques such as generative adversarial networks (GANs) and variational autoencoders (VAEs). While heuristic methods can be effective for protecting data privacy, it is important to carefully evaluate their effectiveness and to select appropriate methods and parameters based on the specific needs of the analysis and the privacy risks associated with the data.

k-Anonymity
k-Anonymity is a privacy-preserving technique that aims to protect individual identities in a dataset by ensuring that each record in the dataset is indistinguishable from at least k-1 other records with respect to certain identifying attributes. The k-Anonymity principle helps to prevent the re-identification of individuals by reducing the uniqueness of their identifying information. The basic idea behind k-Anonymity is to generalize or suppress the values of attributes in a way that groups of records become indistinguishable while maintaining the overall statistical properties of the data. The specific formula for achieving k-Anonymity depends on the chosen generalization or suppression method.
It's important to note that achieving k-Anonymity requires careful consideration of the chosen attributes, the level of generalization or suppression, and the desired level of privacy protection. Additionally, the effectiveness of k-Anonymity depends on the quality of the generalization or suppression techniques applied and the size of the anonymized groups. Striking a balance between privacy protection and data utility is crucial in implementing k-Anonymity to ensure both privacy preservation and meaningful analysis of the anonymized data.
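The generalization step and the k-anonymity check can be sketched as follows, assuming records are dictionaries and that age is generalized into fixed-width ranges. The helper names, the sample records, and the 10-year width are hypothetical.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a range such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())


raw = [
    {"age": 21, "zip": "130**", "disease": "flu"},
    {"age": 27, "zip": "130**", "disease": "asthma"},
    {"age": 34, "zip": "130**", "disease": "flu"},
    {"age": 38, "zip": "130**", "disease": "diabetes"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in raw]
```

Here the raw records are all unique on `(age, zip)`, while the generalized records form two groups of two, so they satisfy 2-anonymity at the cost of less precise ages.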

k-Anonymity l-Diversity
k-Anonymity aims to protect individual identities by ensuring that each record in a dataset is indistinguishable from at least k-1 other records. However, k-Anonymity alone may not be sufficient to prevent attribute disclosure. This is where l-diversity comes into play as an enhancement to k-Anonymity. l-diversity ensures that each group of indistinguishable records (based on k-Anonymity) contains at least l well-represented values for sensitive attributes.
The specific formula for achieving l-diversity depends on the chosen method and the definition of well-represented values (Figure 8). The key idea behind l-diversity is to ensure that within each group, the sensitive attribute(s) have a sufficient number of distinct, well-represented values to prevent attribute disclosure even if the group is still indistinguishable based on k-Anonymity.
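A minimal check can be sketched under the simplest reading of "well-represented", namely that each group must contain at least l distinct sensitive values. That reading, the function name `is_l_diverse`, and the record layout are assumptions made for illustration.

```python
from collections import defaultdict


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every quasi-identifier group holds at least l distinct sensitive values."""
    groups = defaultdict(set)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].add(record[sensitive])
    return all(len(values) >= l for values in groups.values())
```

A group in which every record shares the same diagnosis is k-anonymous for a suitable k but fails this check for any l greater than 1, which is exactly the attribute disclosure risk described above.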

k-Anonymity l-Closeness
k-Anonymity and l-Closeness are two complementary privacy protection techniques used to safeguard sensitive information in datasets. While k-Anonymity focuses on hiding the identity of individuals, l-Closeness (usually called t-closeness in the literature) addresses attribute disclosure by requiring that the distribution of sensitive attribute values within each group of records stays close to their distribution in the whole dataset (Figure 9). The exact formula for achieving l-Closeness within a group depends on the specific distance or similarity measure chosen and the selected data transformation technique. By combining the two techniques, privacy protection can be strengthened: k-Anonymity ensures the indistinguishability of records, while l-Closeness limits what an attacker can learn about sensitive attributes from group membership.
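A closeness requirement of this kind can be checked by comparing each group's distribution of sensitive values with the distribution in the whole dataset. The sketch below assumes total variation distance as the measure, which is one possible choice for categorical attributes; the function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def max_group_distance(records, quasi_identifiers, sensitive):
    """Largest total variation distance between any group's sensitive-value
    distribution and the distribution over the whole dataset."""
    overall = Counter(r[sensitive] for r in records)
    total = len(records)
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[q] for q in quasi_identifiers)].append(record[sensitive])

    worst = 0.0
    for members in groups.values():
        local = Counter(members)
        distance = 0.5 * sum(
            abs(local[value] / len(members) - overall[value] / total)
            for value in overall
        )
        worst = max(worst, distance)
    return worst


def satisfies_closeness(records, quasi_identifiers, sensitive, threshold):
    """True if no group deviates from the overall distribution by more than `threshold`."""
    return max_group_distance(records, quasi_identifiers, sensitive) <= threshold
```

A distance of 0 means that every group mirrors the overall distribution, so group membership reveals nothing extra about the sensitive attribute.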

Personalized Privacy Preserving
Personalized Privacy Preserving (P3) is a heuristic method that focuses on preserving the privacy of individuals rather than the privacy of the overall dataset. It tailors privacy protection mechanisms to the individual preferences and requirements of data subjects, giving users control over their personal information while still allowing them to benefit from data analysis and services (Figure 10). The goal of P3 is to allow data analysts to extract useful information from a dataset while minimizing the risk of disclosing sensitive information about individuals. To implement P3, each individual in the dataset is assigned a personalized privacy parameter that determines the level of privacy protection they receive. This parameter is based on the individual's risk of identity disclosure, which is calculated from the uniqueness of their attributes in the dataset. Individuals with highly unique attributes receive higher levels of privacy protection, while those with less unique attributes receive lower levels of protection.

Utility-Based Privacy Preserving
Utility-based privacy preserving is a heuristic method used to protect data privacy by balancing privacy protections with the utility (or usefulness) of the data. This approach recognizes that complete privacy protection may not always be feasible or desirable, particularly in situations where data is needed for research or other purposes. To implement utility-based privacy preserving, data is first assessed to determine the level of privacy protection required based on the sensitivity of the data and the risks associated with disclosure. Then, data is processed so that privacy protections are applied to the appropriate data elements while minimizing the impact on data utility. The focus is on optimizing the trade-off between data privacy and data utility (Figure 11).

Cryptographic Methods
Cryptographic methods protect data privacy by using encryption and decryption techniques to secure sensitive data. These methods use mathematical algorithms to encode data in such a way that it can only be accessed by authorized individuals or systems with the appropriate decryption keys. They can be broadly divided into two categories: symmetric-key cryptography and public-key cryptography. In symmetric-key encryption, the same key is used for both encryption and decryption: the data is encrypted using a secret key that is shared between the data owner and the authorized recipient. The data is then transmitted securely over a network or stored on a device, and can only be accessed by those with the appropriate decryption key. In public-key encryption, two different keys are used: a public key encrypts the data, while a private key decrypts it (Figure 12).
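The public-key idea can be illustrated with a toy RSA-style key pair built from two very small primes. This sketch is for illustration only and is not secure: real systems use vetted cryptographic libraries, large keys, and padding. The primes, the exponent, and the function names are assumptions chosen here.

```python
def make_toy_keypair(p=61, q=53, e=17):
    """Build a tiny RSA key pair from two small primes (insecure, illustration only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e; requires Python 3.8+
    return (n, e), (n, d)


def encrypt(message, public_key):
    """Encrypt an integer message (0 <= message < n) with the public key."""
    n, e = public_key
    return pow(message, e, n)


def decrypt(ciphertext, private_key):
    """Recover the integer message with the private key."""
    n, d = private_key
    return pow(ciphertext, d, n)
```

Anyone holding the public key can call `encrypt`, but only the holder of the private key can recover the message with `decrypt`.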

Secure Multiparty Computations (MPC)
In an MPC protocol, each party holds their own private data and wants to compute a function over the combined data without sharing their data with other parties. To achieve this, the parties interact with each other to perform the computation in a way that preserves privacy. The key idea behind MPC is that even though each party contributes their private input and partial computation, no party can determine the inputs of other parties or the intermediate results.
The protocol ensures privacy, confidentiality, and integrity of the inputs throughout the computation (Figure 13).
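A secure sum based on additive secret sharing is one of the simplest examples of such a protocol. The sketch below simulates all parties in one process and assumes honest participants, non-negative integer inputs, and a total below the modulus. The names `share`, `secure_sum`, and `MODULUS` are hypothetical.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this value


def share(value, n_parties):
    """Split `value` into n random shares that add up to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs):
    """Compute the sum of all inputs without any party seeing another's input."""
    n = len(private_inputs)
    # Each party splits its input and sends one share to every party.
    all_shares = [share(value, n) for value in private_inputs]
    # Party j only ever sees the j-th share from each party and publishes their sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Combining the published partial sums reveals only the total.
    return sum(partial_sums) % MODULUS
```

Each individual share is a uniformly random number, so a party that receives one share of another party's input learns nothing about that input; only the final total is revealed.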

Horizontally Partitioning Data
Horizontally partitioning data is a technique used in data privacy to distribute and store different records (rows) of a dataset across multiple data sources while preserving privacy. Each source holds the same attributes, but for a different subset of individuals.

Vertically Partitioning Data
Vertically partitioning data is a technique used in data privacy to split a dataset vertically into multiple subsets based on different attributes or columns. Each subset contains a subset of the original attributes, preserving the privacy of certain sensitive attributes. The specific formulas for vertically partitioning data depend on the privacy-preserving technique being used and the specific attributes and privacy requirements of the dataset. Various algorithms and methodologies can be employed to determine the optimal partitioning strategy and achieve the desired privacy guarantees.
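A minimal sketch of vertical partitioning, with row-wise (horizontal) splitting shown for contrast. It assumes records are dictionaries that share an identifier column used to re-link the vertical fragments; the function names, the `key` parameter, and the round-robin row assignment are hypothetical choices.

```python
def vertical_partition(records, column_groups, key):
    """Split records by columns: each fragment keeps `key` plus one group of columns."""
    return [
        [{key: record[key], **{col: record[col] for col in columns}} for record in records]
        for columns in column_groups
    ]


def horizontal_partition(records, n_sites):
    """Split records by rows: each site gets whole records, assigned round-robin."""
    return [records[site::n_sites] for site in range(n_sites)]


patients = [
    {"id": 1, "age": 34, "zip": "13053", "diagnosis": "flu"},
    {"id": 2, "age": 51, "zip": "13068", "diagnosis": "asthma"},
    {"id": 3, "age": 29, "zip": "13053", "diagnosis": "flu"},
]
demographics, clinical = vertical_partition(patients, [["age", "zip"], ["diagnosis"]], key="id")
site_a, site_b = horizontal_partition(patients, 2)
```

In the vertical split, the site holding `demographics` never sees diagnoses and the site holding `clinical` never sees ages or zip codes. In the horizontal split, every site sees all attributes but only for its own subset of patients.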

Conclusion
In conclusion, there are various perturbation methods that can be used to protect data privacy, each with its own strengths and weaknesses. Randomized response and noise addition are simple and effective perturbation methods, but they can be vulnerable to certain types of attacks and may require careful tuning of the noise level to balance privacy protection and data utility. Differential privacy provides a strong privacy guarantee, but can be computationally expensive. Secure multiparty computation provides strong privacy protection without requiring data to be perturbed or modified, but can also be computationally expensive.
The choice of perturbation method will depend on the specific application and the trade-off between privacy protection and data utility. Researchers continue to explore and develop new perturbation methods and optimization techniques to improve the privacy and utility of data analysis in various settings.