Review of Challenges and Solutions for Genomic Data Privacy-Preserving

The dramatic decrease in the cost of genome sequencing over the last two decades has led to an abundance of genomic data. This data has been used in research related to the discovery of genetic diseases and the production of medicines. At the same time, the large storage footprint of a single genome (2–3 GB) has made genomic data one of the most important sources of big data, which has prompted genetic research centers to take advantage of the cloud and its services for storing and managing this data. The cloud is a shared storage environment, which makes data stored in it vulnerable to unwanted tampering or disclosure. This raises serious concerns about securing such data against tampering and unauthorized searches, as well as about securing queries and computations performed on it. Cryptography, together with differential privacy and garbled circuits, is considered one of the important solutions to this problem. This paper introduces the most important challenges related to maintaining privacy and security and classifies each problem together with appropriate proposed or applied solutions, with the aim of fueling researchers' future interest in developing more effective privacy-preserving methods for genomic data.


Introduction
The official announcement of the completion of the Human Genome Project in 2003 drew attention to the importance and sensitivity of genomic data [1]. The tremendous development of gene sequencing technology has resulted in a massive amount of genomic information, which is considered the key to understanding many diseases [2]. With next-generation sequencing technologies, the growth of genomic data has become exponential, and the data volume is starting to reach petabytes [3]. It is appropriate for this huge amount of data to be stored in the cloud, provided that its security and privacy are guaranteed [4]. Cloud services provide dynamic and scalable services via the Internet and are in increasing demand as technology develops [5]. However, they are vulnerable to attacks, which can leak data to unwanted parties [6].
Storing, sharing, managing, and analyzing this data should be accomplished by secure means, which ensure that the data is never lost or exposed to misuse [7]. Revealing the genome sequence of individuals creates the possibility of genomic discrimination (even if it is prohibited), as well as the undesired revelation of sensitive information (biological family, medical history, or sensitive illness status) as a result of these breaches. Because relatives share much of their DNA, the harm may extend to offspring or relatives of the affected individuals. Furthermore, unlike user accounts and passwords (which are commonly compromised in breaches of information technology businesses), genetic information cannot be changed once it has been exposed [8]. Consequently, disclosure of genomic data to an untrusted third party has substantial privacy implications [9]. In this paper, we present a detailed study of the challenges of genomic data privacy preservation and of methods that preserve privacy through various techniques. The main contributions of the review are as follows:
• An overview of the prominent challenges facing privacy preservation for genomic data and a discussion of the important research in this area, with a classification and brief discussion of these challenges.
• A discussion of different methods for solving genomic data security and privacy problems from the perspective of collaboration among Homomorphic Encryption (HE), Garbled Circuit (GC), and Differential Privacy (DP).
In the rest of this review, the challenges of preserving the privacy of genomic data are described in Section 2, since the need for more effective solutions is coupled with an accurate understanding of the difficulties that must be addressed. Section 3 then reviews different techniques for preserving genomic data privacy. Finally, the major conclusions drawn from this review are presented.

Genomic data privacy-preserving: challenges
Processing digital genomic data makes it vulnerable to disclosure. The main operations that may violate the privacy of this data are sequence alignment, searching genomic databases, and querying private genomic data. If the necessary countermeasures are not taken, the privacy of personal data may be violated [1]. The distribution of the related works published in various journals from 2014 to 2021 is summarized in Figure 1. Figure 2 shows a general taxonomy of the different types of genomic data privacy challenges. The taxonomy aims to clarify the challenges in protecting genetic data, which represent different levels of difficulty. The following subsections present how each of these challenges adds to the difficulty of preserving genomic data privacy.

Genomic data sharing privacy
Genomic data sharing processes can be categorized into public and private, each with its own set of access controls and rules. Various identification attacks have heightened security and privacy concerns for genomic data, including data that is not shared publicly and carries no privacy guarantee. Existing assurances, however, fall short for a variety of reasons (differing adversary assumptions and different threat models and attacks) [6]. Data security and privacy have therefore become a crucial necessity for many enterprises [10]. Bos et al. [11] discussed the available scenarios for applications in this field. Their paper sheds light on homomorphic encryption and demonstrates its importance and effectiveness through a practical prediction service that works in the cloud on encrypted data. The application includes a cloud service that executes private predictive analysis jobs on health data encrypted using Somewhat Homomorphic Encryption (SWHE). The cloud service handles only the encrypted data and computes predictions without any knowledge of the secret medical data.

Lauter et al. [12] applied HE to Genome-Wide Association Studies (GWAS) and their basic algorithms so that they operate on encrypted data. They identified a variety of statistical algorithms that can be evaluated as low-degree polynomials, such as the Pearson goodness-of-fit (chi-square) test, which is used to check for deviation from Hardy-Weinberg equilibrium. Cheon et al. [13] devised an approach for computing the edit distance on encrypted genomic sequences. The approach outputs an encrypted value of the edit distance. They implemented their algorithm over encrypted genomic sequences of lengths n and m using an SWHE scheme. Their optimization reduces the depth of the edit distance computation for short sequences. Simmons et al. [14], on the other hand, proposed a privacy-preserving technique for sharing aggregate genomic data with a lower accuracy cost than the traditional differentially private method, trading off accuracy against privacy by varying the λ parameter of the Laplacian distribution.
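To make the Hardy-Weinberg check concrete, the following minimal sketch computes the Pearson chi-square statistic on plaintext genotype counts. It is not the encrypted protocol of [12]; the function name and the example counts are illustrative assumptions. Under HE, the numerators and denominators would typically be kept as separate low-degree polynomials in the counts, because division is costly on ciphertexts.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Pearson chi-square statistic for deviation from Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb are observed genotype counts (hypothetical plaintext inputs).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p                      # estimated frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Sum of (observed - expected)^2 / expected over the three genotype classes.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

For counts that match the expected proportions exactly (for example 25, 50, 25) the statistic is zero, and it grows as the observed genotype counts move away from the proportions implied by the allele frequencies.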
To facilitate privacy-preserving sharing of genomic data for GWAS, Zhang et al. [15] proposed a decentralized network using a privacy-preserving sharing protocol (PPS) and a data fragmentation algorithm, which was restricted to a limited number of fragments. Yang et al. [16] suggested a scheme for sharing medical data based on an attribute cryptosystem and blockchain technology. Encrypted medical data is stored in the cloud, while the storage address and medical-related information are stored in the blockchain, which ensures storage efficiency and permanently removes the opportunity to tamper with the data. The suggested technique integrates Attribute-Based Encryption (ABE) with Attribute-Based Signature (ABS), allowing medical data to be shared over many-to-many communications. Table 1 outlines the prominent methods for the privacy of genomic data sharing.

Access and storage privacy
Genomic data must be stored in a secure place that prevents exposure, tampering, or disclosure by untrusted parties. Huang et al. [17] proposed a method called Selective Retrieval on Encrypted and Compressed Reference-Oriented Alignment Map (SECRAM), which is used for compressing aligned data, storing and retrieving encrypted data, and improving efficiency during downstream analysis. Although this method is highly efficient in saving space and preserving privacy, its performance declines when the coverage is low. Meanwhile, Liu et al. [18] proposed VA-Store, an approach that uses k-mers (i.e., subsequences of length k) with various k values to address the substantial space required for repeated data in common genomic sequence analysis tasks. For one part of the input dataset, VA-Store maintains a physical store, while it supports several virtual stores for the other parts.
VA-Store translates a given query on a virtual store into one or more queries on the physical store, which can be executed by exploiting the intrinsic linkage among the repetitive data. Although it saves space and maintains privacy, its performance degrades when a sequence shorter than k0 (the first subsequence length) is encountered. On the other hand, Chen et al. [19] presented a framework for large-scale computations on genomic data outsourced to a third party (a public cloud server) for greater scalability and security. They used a tree structure to represent arbitrary genomic data for computational efficiency and combined homomorphic cryptography with the Garbled Circuit approach to ensure security. Although it significantly improves the run time of query execution, it incurs an additional cost in the presence of dishonest researchers and is exposed to leakage during the search operation. Mehmood et al. [20] proposed an index-based method to answer queries on pathways scattered over various distributed datasets. They offered a heuristic-based source selection method for determining which datasets are appropriate for a given path query, as well as a strategy for federating queries to the selected sources and assembling (merging) the paths obtained from those distant datasets. Table 2 illustrates the prominent methods for access and storage privacy.
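The notion of a k-mer used by VA-Store can be sketched as follows. This is only an illustration of k-mer extraction and counting on a plaintext string; it does not reproduce the physical and virtual stores of [18], and the function names are assumptions made for this example.

```python
from collections import Counter
from typing import List


def kmers(sequence: str, k: int) -> List[str]:
    """Return every contiguous subsequence of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty range and thus no k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(sequence, k))
```

Overlapping k-mers repeat most of their characters (consecutive k-mers share k - 1 of them), which illustrates why stores built for several values of k contain highly redundant data.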

Query privacy
Researchers who query genomic data prefer their queries to be secure and not disclosed to others (attackers or curious parties), since both the query and its output hold sensitive information about individuals. Ensuring the privacy of the query and the result is a challenge [21]. Alaziz et al. [22] adopted the Paillier cryptosystem and order-preserving encryption to execute count queries and ranked queries securely. Although the time needed to perform calculations on encrypted data with this method is close to the time taken by the same operations on unencrypted data, decryption is expensive. On the other hand, Sousa et al. [5] combined Private Information Retrieval (PIR) with HE to devise a hash-based solution. They modified the standard PIR protocol so that a specific variant can be accessed through its identifying parameters, such as chromosome, position, and reference allele, instead of its relative position in the Variant Call Format (VCF) file. Moreover, they used symmetric encryption to protect genomic data on the server side. This method has an error rate associated with its hashing scheme and is slow if the database is large and multiple variants or files are queried. Xu et al. [23] sought to guarantee the integrity of the query result and preserve the confidentiality of the data through authenticated aggregate queries over set-valued data. They suggested a privacy-preserving authentication framework for aggregate queries.
Mahboubi et al. [24] suggested a system called Secure Distributed TOPK (SD-TOP-K), in which the user data is encrypted and stored in a distributed system and can be evaluated by a top-k query processing algorithm that finds a set of encrypted data proven to contain the top-k data items. This is done without decrypting the data in the nodes where it is stored. Moreover, they included a robust filter in the algorithm that strips out as many false positives as possible without decrypting the data. Meanwhile, Quan et al. [25] suggested a method that reduces the privacy leakage of top-k queries compared to order-preserving encryption (OPE). The core of the method is Top Order Preserving Encryption (TOPE), which allows top-k searches on encrypted data by using partially ordered heap properties to balance privacy and search capabilities. Table 3 summarizes the prominent methods for query privacy.
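The additive homomorphism that makes Paillier suitable for count queries can be sketched with a toy implementation. This is a minimal illustration, not the scheme of [22]: the tiny primes, the use of Python's non-cryptographic random generator, the fixed generator g = n + 1, and the assumption that each record has already been reduced to an encrypted 0/1 match indicator are all simplifications introduced here.

```python
import math
import random


def keygen(p: int, q: int):
    """Toy Paillier key generation from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int, rng: random.Random) -> int:
    """Encrypt m as (n + 1)^m * r^n mod n^2 with a random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    """Recover m via L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) // n."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def encrypted_count(n: int, encrypted_bits) -> int:
    """Multiply ciphertexts, which adds the underlying plaintext indicators."""
    total = 1  # 1 is a trivial encryption of 0
    for c in encrypted_bits:
        total = (total * c) % (n * n)
    return total
```

In this sketch the server only multiplies ciphertexts and never sees the indicators; the holder of the private key decrypts the product to obtain the number of matching records.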

Outsourcing
The growing interest in outsourcing data management is due to faster implementation, flexible scalability, reduced costs, and improved latency and connectivity. However, consumers must trust cloud service providers, which raises privacy and security concerns when research data from patients or volunteers is involved. Several solutions have been proposed to address the security challenges, particularly in the area of data processing in the cloud [22]. Zhang et al. [26] introduced the Fully Outsourced secuRe gEnome Study basEd on homomorphic Encryption (FORESEE) architecture for computing chi-square statistics on the public cloud in a safe and completely outsourced manner. The so-called semi-honest adversary model assumes that the cloud properly follows the protocol but is curious about information contained in the received data.
The FORESEE framework provides secure division operations on homomorphically encrypted data and immediate release of research findings from the cloud. Although the cost is very high and the efficiency is reduced due to the large value of G, it is still efficient in supporting full cloud outsourcing while keeping the final result encrypted. Meanwhile, Wang et al. [27] suggested the HEALER framework for evaluating the P-value of exact logistic regression parameters on homomorphically encrypted data. It facilitates secure outsourcing and reduces the risk of analyzing sensitive data in untrustworthy cloud environments (e.g., Amazon EC2 or Microsoft Azure). A new rejection sampling technique, a secure integer comparison method, and a parallelizable mechanism were introduced to speed up the execution of this algorithm, making homomorphically encrypted exact logistic regression feasible. Furthermore, a compression strategy was used to lower the cost of storing and communicating homomorphically encrypted data. The cost of computation and storage is still significant, and the proposal distributions available in the encrypted domain are limited, which might lead to a low acceptance rate. The homomorphic division operation also remains a challenge.
Ghasaemi et al. [28] suggested a model for outsourcing data that employs the Paillier cryptosystem with permutation. The method supports count query and top-k operations with improved performance, but it is vulnerable to the Homer attack and to de-identification attacks. Ziegeldorf et al. [29] employed fully and partially homomorphic encryption with a Bloom filter (FHE-BLOOM and PHE-BLOOM). These approaches are efficient for genetic disease tests and allow the data owner to securely outsource storage and computation to an untrusted cloud. FHE-BLOOM provides full security in the semi-honest model, while PHE-BLOOM slightly relaxes the security guarantees in exchange for better performance. The approach provides flexible and efficient management of the outsourced data and may be extended to support further query types, but it still suffers from overhead in setting up the patient's database.
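The Bloom filter underlying these constructions can be illustrated in plaintext. The sketch below shows only the data structure, not the encrypted variants of [29]; the class name, the SHA-256 based hashing, and the parameters are assumptions chosen for the example.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Plaintext Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        if num_bits <= 0 or num_hashes <= 0:
            raise ValueError("num_bits and num_hashes must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes positions by prefixing the item with a hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] == 1 for pos in self._positions(item))
```

A patient's variants could be inserted as strings such as "snp1:A" (a hypothetical identifier format), and a disease test then reduces to membership checks against the filter. In the homomorphic setting the bit array would be encrypted so that the cloud can evaluate the checks without seeing the bits.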
Hasan et al. [30] introduced a new approach for outsourcing aggregate genome data that is both safe and efficient. The approach creates an index tree, which is subsequently outsourced to a third-party cloud server. The cloud server traverses the nodes of the tree and performs count query operations using a secure interactive protocol during the data processing phase as well as the query execution phase. This approach does not expose any crucial genomic data, but it provides neither privacy against inference attacks nor data access privacy, as it reveals the tree traversal pattern. Raisaro et al. [31] suggested and implemented a safe and efficient privacy-preserving approach for exploring genomic cohorts in a real-world setting at Lausanne University Hospital by employing HE and DP. It enables the exploration of large genomic datasets. Kim et al. [32] demonstrated a secure outsourced method for evaluating logistic regression models for quantitative traits and testing their genetic associations. They use a semi-parallel training strategy to fit a logistic regression model for the covariates and then run a one-step parallelizable regression on all single nucleotide polymorphisms (SNPs). They also improve the performance of the underlying approximate homomorphic encryption scheme.

Privacy-preserving techniques
Several solutions address the privacy and security challenges of genomic data. Homomorphic Encryption (HE), Garbled Circuit (GC), and Differential Privacy (DP) are considered the most significant privacy-preserving techniques [22]. For all works presented in this review, the distribution of the three existing privacy-preserving techniques is depicted in Figure 2.

Homomorphic encryption
Homomorphic encryption allows computation over encrypted data without the need to decrypt it. HE can be classified into fully, partially, and somewhat homomorphic encryption. Different schemes have been applied to preserve privacy during computation on genomic data [33]. Shimizu et al. [34] suggested a method that combines efficient string data structures with cryptographic techniques built on additive HE. They implemented an efficient algorithm for searching for sequences of SNPs in a large genome database. The server cannot learn the queried sequence.
On the other hand, Lain et al. [35] developed an Efficient Private Circular Query Protocol (EPCQP) with high accuracy and low computation and transmission costs. The Moore curve is employed to transform two-dimensional spatial data into a one-dimensional sequence, and the Brakerski-Gentry-Vaikuntanathan (BGV) HE scheme is used to protect the information about points of interest (POIs). In order to improve the storage efficiency of genome data sets as well as computation and communication costs, Singh et al. [36] proposed a secure [...]. Meanwhile, Wang et al. [37] proposed a novel scheme for healthcare queries on outsourced data called HeOC. Encrypted data is uploaded by trusted users into the cloud, and a perfect query about a particular disease is executed on the encrypted data. The operation uses a large number of sensors, which makes it expensive despite its efficiency. Zheng et al. [38] then proposed an efficient k-NN query method over outsourced encrypted e-healthcare data. The encryption is done with the Paillier cryptosystem. The method provides efficient storage of encrypted data in the cloud and privacy-preserving k-NN queries over that data, and it is efficient in terms of both privacy preservation and computational complexity.
Yan et al. [39] used blockchain-based edge computing to construct a key solution that ensures the efficiency of the blockchain and reduces the computational overhead for clients by employing the Paillier cryptosystem. They presented the advantages of blockchain and edge computing and constructed the key technological solutions for blockchain-based edge computing. They achieve security protection and integrity checking of cloud data and realize more extensive secure multiparty computation. Blatt et al. [40] proposed a solution for GWAS security that uses HE to keep all individual data encrypted during the association study. They presented a new Residue-Number-System (RNS) variant of the Cheon-Kim-Kim-Song (CKKS) HE scheme, new methods for switching between data encodings, and more than a dozen crypto-engineering optimizations. The solution can perform the full GWAS computation for 1000 individuals, 131,071 SNPs, and 3 covariates in about 10 minutes on a modern server computing node.
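The idea behind a residue number system can be shown on plain integers. The sketch below is not the RNS-CKKS scheme of [40]; it only illustrates, under assumed small moduli, how a large value is split into independent small residues that can be added and multiplied component-wise and then recombined with the Chinese Remainder Theorem.

```python
from math import prod
from typing import Tuple

# Pairwise coprime moduli (small primes chosen for illustration only).
MODULI = (97, 101, 103)
M = prod(MODULI)  # results are exact as long as they stay in [0, M)


def to_rns(x: int) -> Tuple[int, ...]:
    """Represent x by its residues modulo each modulus."""
    return tuple(x % m for m in MODULI)


def rns_add(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
    """Add component-wise; each channel works on small numbers independently."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
    """Multiply component-wise; no carries cross between channels."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues: Tuple[int, ...]) -> int:
    """Recombine residues into an integer modulo M via the CRT."""
    total = 0
    for r, m in zip(residues, MODULI):
        partial = M // m
        total += r * partial * pow(partial, -1, m)
    return total % M
```

Because each residue channel is independent, the channels can be processed in parallel with machine-word arithmetic, which is the general reason RNS variants of HE schemes are attractive for performance.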
Vizitiu et al. [41] presented a novel encryption strategy based on HE. The proposed scheme, MORE (Matrix Operation for Randomization or Encryption), allows calculations within a neural network model to be conducted directly on floating-point data at a reasonably low computational cost. Blatt et al. [42] proposed a statistical toolbox that uses HE to carry out large-scale GWAS on encrypted genetic and phenotype data in a non-interactive manner, in which no decryption is required. The method includes a reformulation of GWAS tests that makes use of encrypted data packing and parallel processing, the integration of highly efficient statistical computation, and the development of a dozen crypto-engineering optimizations.
Kuo et al. [43] presented three competition tracks, which covered blockchain-based genomic dataset access logging, secure GWAS using HE, and secure DNA segment searching. Kim et al. (2021) [44] used HE to provide mathematical results and security guarantees for protecting genotype data during imputation in a semi-trusted environment. Table 5 summarizes privacy-preserving techniques for genomic data using homomorphic encryption.

Garbled Circuit
A garbled circuit (GC) is a two-party secure computation protocol that can be used for any general-purpose function. The protocol allows two parties to jointly compute the outcome of a function without learning anything about the inputs or intermediate results of the other party [45]. Yao's garbled circuit protocol is the best known of the multi-party computation (MPC) techniques. It is commonly seen as the best-performing, and many of the protocols we cover build on Yao's GC [46]. The security of GC relies on both parties taking part in the protocol and interacting only through the computed function.
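The garbling of a single gate can be sketched as follows. This is a deliberately simplified illustration of Yao's idea for one AND gate, not a secure or complete protocol: oblivious transfer of the evaluator's input label is omitted, rows are found by trial decryption with an all-zero tag instead of point-and-permute, and all function names are assumptions made for this example.

```python
import hashlib
import secrets

LABEL_BYTES = 16
TAG = bytes(16)  # all-zero tag used to recognize the correctly decrypted row


def _pad(label_a: bytes, label_b: bytes) -> bytes:
    """Derive a 32-byte one-time pad from the two input wire labels."""
    return hashlib.sha256(label_a + label_b).digest()


def _xor(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))


def garble_and_gate():
    """Garbler: pick two random labels per wire and encrypt the truth table."""
    wires = {
        name: (secrets.token_bytes(LABEL_BYTES), secrets.token_bytes(LABEL_BYTES))
        for name in ("a", "b", "out")
    }
    table = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            out_label = wires["out"][bit_a & bit_b]
            pad = _pad(wires["a"][bit_a], wires["b"][bit_b])
            table.append(_xor(pad, out_label + TAG))
    secrets.SystemRandom().shuffle(table)  # hide which row belongs to which inputs
    return wires, table


def evaluate(table, label_a: bytes, label_b: bytes) -> bytes:
    """Evaluator: holding one label per input wire, recover one output label."""
    pad = _pad(label_a, label_b)
    for row in table:
        plain = _xor(pad, row)
        if plain[LABEL_BYTES:] == TAG:
            return plain[:LABEL_BYTES]
    raise ValueError("no row decrypted with the given labels")
```

The evaluator sees only random-looking labels, so it learns the output label without learning which bits the input labels encode; only a party that knows the mapping for the output wire can translate the result back into a bit.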
Another benefit of GC is that it keeps both parties' inputs secret, as the query frequently demands the same level of confidentiality as the data. As a result, GC is typically used in sequence similarity settings in which one party (the data owner) holds a data set of genomic sequences and the other party (the researcher) holds a sensitive query sequence. The researcher wishes to locate sequences that are similar to that specific query under a similarity metric, such as the Hamming or Levenshtein distance [33]. Al Aziz et al. [47] suggested approximation techniques for securely computing the edit distance between genomic sequences, using shingling, private set intersection, and a banded alignment algorithm, and implemented these methods with garbled circuits. The method is considered to be accurate and time-efficient. Mahdi et al. (2018) [48], on the other hand, suggested a paradigm based on an indexed prefix tree for similar patient queries. It ensures the privacy of the data, the query requests, and the query responses. The tree is encrypted with the AES algorithm to preserve privacy, and the encrypted and compressed tree is delivered to the cloud server, which carries out the query operations.
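The two similarity metrics named above can be stated precisely in plaintext. The sketch below shows the functions that a secure protocol would evaluate; it contains no garbling or encryption, and the function names are chosen for this example.

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(1 for a, b in zip(s, t) if a != b)


def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from s to t."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]
```

The dynamic program for the Levenshtein distance fills a table of size proportional to the product of the two sequence lengths, which is one reason secure versions tend to rely on approximations or on restricting the computation to a band around the diagonal.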
Researchers use GC to execute queries on aggregated data under semi-honest adversary models. Hasan et al. [30] proposed using distinct third parties to ensure the secure exchange and execution of count query operations on outsourced genomic data. The proposed method creates an index tree from the genomic data and then outsources it. The cloud server traverses the tree's nodes and performs the count query using a secure interactive protocol. The checking is done using Yao's GC over an interactive interface. Cheng et al. [2] proposed protocols for outsourcing Similar Sequence Queries (SSQs) using an approximation of the Edit Distance (ED). They proposed a group of security protocols based on secret sharing, garbled circuits, and partially homomorphic encryption to attain security, efficiency, and scalability.
Mahdi et al. [49] suggested a technique for securely executing count queries over genotype, phenotype, and numeric data by employing encryption and garbled circuits. Sotiraki et al. [50] developed a novel depth-optimized technique for computing set-maximal matches between a database of aligned genetic sequences and an individual's DNA while preserving the privacy of both the database owner and the individual. Table 6 summarizes the privacy-preserving techniques for genomic data via a garbled circuit.

Differential Privacy
Differential privacy is a privacy-preservation model that releases summary statistics about a dataset while limiting what anyone can learn about any individual record in it [51]. It is widely accepted as a rigorous model for privacy protection. Privacy-preserving algorithms that predate differential privacy, such as k-anonymity, remain problematic. Differential privacy [52] imposes stricter definitions and constraints and adds interference noise in order to protect the privacy of users' information in the published data.
An attacker cannot infer information about an individual even when already holding specific background information, which strongly limits the possibility of private information being disclosed from the data source [53]. He et al. [54] suggested a differential privacy method that protects genomic data release by running belief propagation on a factor graph. This method is capable of factorizing the distribution of genomic data into a group of local distributions. Wei et al. [55] suggested the differential privacy-based genetic matching (DPGM) scheme to attain efficient matching and secure privacy in genetics. Park et al. [56] suggested a secure system for genomic data management that combines blockchain and local differential privacy (LDP). The system uses two types of storage, private and semi-private: in semi-private storage, genes are irreversibly modified by LDP, while the original data is kept in private storage that is accessible to internal employees only. Table 7 summarizes the privacy-preserving methods using differential privacy.
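The noise addition at the heart of differential privacy can be sketched with the Laplace mechanism applied to a count query. This is a generic textbook illustration, not the method of any specific work cited above; the function names are assumptions, and Python's standard random generator is used only for demonstration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one individual changes a count by at most 1, so the
    sensitivity is 1 and the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of accuracy, which is the trade-off that the reviewed methods tune when releasing aggregate genomic statistics.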

Conclusions
Over the past two decades, the importance of genomic sequencing and the vital information it yields has been demonstrated by the increase in genetic testing, analysis, and diagnostics and by the spread of treatments based on individual genome sequencing. Because the cost of genomic sequencing has been dramatically reduced, people are being sequenced for a variety of reasons. Because genomic data is important and sensitive information, unauthorized and undesirable access to it violates the privacy of individuals.
Despite the benefits of sequencing, a lack of protection and privacy-preserving methods creates risks and problems that can outweigh them. In this review paper, we have presented the most important challenges in maintaining the privacy of genomic data and a classification of the most important solutions used to meet these challenges. The HE technique has produced remarkable results in terms of providing protection, privacy, and the ability to operate on data without decrypting it. Furthermore, several forms of hybridization among HE, GC, and DP can be used to preserve genomic data privacy. As genomic sequencing operations increase in the coming years, protecting this important and sensitive data in a manner that keeps pace with its rapid growth will be essential, and we hope this encourages researchers to focus on collaboration among HE, GC, and DP.

Figure 1: Distribution of the published papers in different journals between 2014-2021.

Figure 2: Genomic data challenges publications (2014-2021), by category: genomic data sharing privacy, access and storage privacy, query privacy, and outsourcing.

Figure 2: Distribution of preserving privacy for genomic data during the period 2016-2021 using HE (blue), GC (brown), and DP (gray) techniques.

Table 1: The prominent methods for the privacy of genomic data sharing

Table 2: The prominent methods for access and storage privacy

Table 3: The prominent methods for query privacy

Table 4: The prominent methods in the privacy of outsourcing

Table 5: Privacy-preserving techniques via homomorphic encryption

Table 6: Privacy-preserving techniques for genomic data using a garbled circuit

Table 7: Privacy-preserving techniques using differential privacy