Private (Big) Data Publication


Public datasets are used in a variety of applications spanning from genome and web usage analysis to location-based and recommendation systems. Publishing such datasets is important since they can help us analyzing and understanding interesting patterns. For example, mobility trajectories have become widely collected in recent years and have opened the possibility to improve our understanding of large-scale social networks by investigating how people exchange information, interact, and develop social interactions. With billion of handsets in use worldwide, the quantity of mobility data is gigantic. When aggregated, they can help understand complex processes, such as the spread of viruses, and build better transportation systems, prevent traffic congestion. While the benefits provided by these datasets are indisputable, they unfortunately pose a considerable threat to individual privacy. In fact, mobility trajectories might be used by a malicious attacker to discover potential sensitive information about a user, such as his habits, religion or relationships. Because privacy is so important to people, companies and researchers are reluctant to publish datasets by fear of being held responsible for potential privacy breaches. As a result, only very few of them are actually released and available. This limits our ability to analyze such data to derive information that could benefit the general public.

Recent Results:

1. Privacy-Preserving Sequential Data Publication:

Sequential data is being increasingly used in a variety of applications, spanning from genome and web usage analysis to location-based recommendation systems.

Publishing sequential data is of vital importance to the advancement of these applications since they can enable researchers to analyze and understand interesting sequential patterns.

However, as shown by the re-identification attacks on the AOL and Netflix datasets,

releasing sequential data may pose considerable threats to individual privacy. Recent research has indicated the failure of existing sanitization techniques to provide claimed privacy guarantees. It is therefore urgent to respond to this failure by developing new schemes with provable privacy guarantees. Differential privacy is one of the only models that can be used to provide such guarantees. Due to the inherent sequentiality and high-dimensionality, it is challenging to apply differential privacy to sequential data. In this paper, we address this challenge by employing a variable-length n-gram model, which extracts the essential information of a sequential database in terms of a set of variable-length n-grams. Our approach makes use of a carefully designed exploration tree structure and a set of novel techniques based on the Markov assumption in order to lower the magnitude of added noise. The published n-grams are useful for many purposes. Furthermore, we develop a solution for generating a synthetic database, which enables a wider spectrum of data analysis tasks. Extensive experiments on real-life datasets demonstrate that our approach substantially outperforms the state-of-the-art techniques.

The major contributions of this work are three-fold:

- For the first time, we introduce the n-gram model as an effective means of achieving differential privacy in the context of sequential data. To better suit differential privacy, we propose the use of a novel variable-length n-gram model, which balances the trade-off between information of the underlying database retained and the magnitude of Laplace noise added. The variable-length n-gram model intrinsically fits differential privacy in the sense that it retains the essential information of a sequential dataset in terms of a set of high-quality n-grams whose counts are large enough to resist Laplace noise.

-We develop a series of techniques to guarantee good utility under the variable-length n-gram model, including an adaptive privacy budget allocation scheme, a formal choice of a threshold value, and the enforcement of consistency constraints. These techniques make use of the inherent Markov assumption in a n-gram model. In addition, we develop an efficient method to generate a synthetic dataset from released n-grams, enabling a wider spectrum of data analysis tasks.

-We conduct an extensive experimental study on the variable-length n-gram model over real-life datasets, which provides important insights for future work. In particular, we demonstrate that our solution substantially outperforms the state-of-the-art techniques in terms of count query and frequent sequential pattern mining.

2. Private Histogram Publishing [ICDM2012]:

Differential privacy can be used to release different types of data, and, in particular, histograms, which provide useful summaries of a dataset. Several differentially private histogram releasing schemes have been proposed recently. However, most of them directly add noise to the histogram counts, resulting in undesirable accuracy.
In this work, we propose two sanitization techniques that exploit the inherent redundancy of real-life datasets in order to boost the accuracy of histograms. They lossily compress the data and sanitize the compressed data. Our first scheme is an optimization of the Fourier Perturbation Algorithm (FPA) presented in [13]. It improves the accuracy of the initial FPA by a factor of 10. The other scheme relies on clustering and exploits the redundancy between bins. Our extensive experimental evaluation over various real-life and synthetic datasets demonstrates that our techniques preserve very accurate distributions and considerably improve the accuracy of range queries over attributed histograms.


[CCS12] R. Chen, G. Ács, C. Castelluccia. Differentially Private Sequential Data Publication via Variable-Length N-Grams In 19th ACM Conference on Computer and Communications Security (CCS), 2012. (pdf)

[ICDM12] G. Ács, R. Chen, C. Castelluccia. Differentially Private Histogram Publishing through Lossy Compression, IEEE International Conference on Data Mining (ICDM), 2012. (pdf)

Code Release:

The script generates the set of noisy n-grams up to size n_max (get code).
This code includes the synthetic database reconstruction mechanism detailed in Section 4.3.5 of the paper [CCS2012].

The script implements the histogram sanitization algorithm described in [ICDM12] (get code).


This work was supported in part by a grant from the EIT ICT Labs to INRIA.