This tutorial provides an introduction to the Privacy Integrated Queries (PINQ) language for writing differentially-private data analyses. PINQ is modeled after the Language Integrated Queries (LINQ) language extensions to .NET, and the programming experience is intentionally quite similar, but there are many important differences, which we will cover.
Language Integrated Queries (LINQ) is a language extension to .NET supporting declarative data-parallel computation. The programming model introduces .NET objects which wrap data-parallel sources, with methods you would expect from SQL-style computations, like Select, Where, GroupBy, and Join. These methods return new objects wrapping sources that conceptually reflect the operations, but without necessarily invoking computation until the source is either enumerated or aggregated. This lets the LINQ provider choose the best execution plan given the entire query, including the option to transport the execution to a remote computer (perhaps a database, or a DryadLINQ cluster).
Privacy Integrated Queries (PINQ) is not a LINQ provider, but it looks very much like one. Like LINQ, PINQ provides .NET objects which wrap data sources and support declarative transformations and aggregations. Unlike LINQ, this is not done for efficiency or modularity, but rather to support the analysis of sensitive data without needing to provide the data itself to the analyst. The analyst describes an analysis they would like to perform on the data, at which point PINQ can take over, conduct the analysis, and return aggregate results that have had appropriate privacy sanitization applied to them. The language restrictions and careful (but non-standard) implementations of aggregations result in computations that provably provide the guarantees of Differential Privacy, even without privacy expertise on the part of the analyst.
This tutorial, the code examples in the download, and the discussion in the associated technical paper should be enough to get you started writing programs in PINQ. As you do, you will undoubtedly find that there are computations you would like to perform that are not obviously supported. In many cases, these computations have natural (but imperfect) analogs in PINQ, and the hope is that the collective wisdom of PINQ users can support many of the intended uses. This tutorial has many tips and tricks, and is intended to grow with time, but there is certainly more to be discovered.
The PINQ prototype is available from http://research.microsoft.com/PINQ where you can also find supporting documentation and code examples (and this tutorial).
1. Getting Started with PINQ
Getting started with PINQ is barely more complicated than starting a new Visual Studio project. Having created a new project, we want to include the PINQ.dll as a reference, and add a "using PINQ;" statement to the top of our program.
Our new project now has access to all that PINQ has to offer, which we now survey.
Creating a new PINQueryable
The central type in PINQ is the PINQueryable. This is the privacy-wrapped data set that an analyst has access to, and through which they learn about the underlying data. The PINQueryable guards the data, and does not respond to queries unless it can confirm that their privacy properties are within the bounds set up by the provider of the data set, but is otherwise meant to appear as if it is the source IQueryable. A PINQueryable supports many of the LINQ transformations like Where, Select, GroupBy, and Join, as well as analogues of the LINQ aggregations like Count, Sum, and Average.
To construct a PINQueryable, we will need two parameters for its constructor:
The IQueryable is the source data we want our PINQueryable to protect. Any object supporting the IQueryable interface will work here, including data read from files, tables exposed by databases, or even massive data sets held in DryadLINQ. We will see more examples as we continue, but for now we just use a simple list of the numbers from 1 to 1,000:
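For instance, any in-memory collection can serve as a source via LINQ's standard AsQueryable extension; the following sketch builds that list:

```csharp
using System.Linq;

// An IQueryable source: the numbers 1 through 1,000.
IQueryable<int> data = Enumerable.Range(1, 1000).AsQueryable();
```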
The PINQAgent is a very important part of PINQ, controlling how much accuracy the analyst should have access to. The agent controls a resource, the units of differential privacy, and is given the opportunity to reject any depletion of this resource. We'll use a very simple PINQAgent, included with PINQ, that manages a simple differential privacy budget:
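A sketch of constructing such an agent with a budget of 1.0 (PINQAgentBudget is the class named above; the single-argument constructor is an assumption about its API):

```csharp
// Permits privacy requests until 1.0 total units have been spent,
// and rejects any request that would exceed the budget.
var agent = new PINQAgentBudget(1.0);
```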
This agent will make sure that no more than 1.0 units of differential privacy are ever extracted from the PINQueryable object it is tasked with managing. Fortunately, the data provider only needs to specify this amount, and is not required to be on hand while PINQ manages the access.
We can now construct our first PINQueryable simply by creating a new object with the appropriate parameters passed to the constructor:
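Concretely, using the list and agent described above, the construction might look like the following sketch (the constructor argument order is an assumption):

```csharp
// The source data and the budget-tracking agent described above.
IQueryable<int> data = Enumerable.Range(1, 1000).AsQueryable();
var agent = new PINQAgentBudget(1.0);

// Wrap them in a privacy-protecting PINQueryable.
var protectedData = new PINQueryable<int>(data, agent);
```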
We are now ready to do some differentially-private data analysis using PINQ.
Counting something (our first analysis)
To start with a really simple program, let's count our data set a few different ways. To do this, we'll want to invoke the PINQueryable's NoisyCount method. This is just like a Count in LINQ, except that there is an accuracy parameter epsilon that corresponds to the privacy loss. When it is set to zero, there is no accuracy, and no privacy lost either. As the value increases, the count becomes more and more accurate (formally, the result is the true count plus a Laplace random variable with parameter 1.0/epsilon), but more and more privacy is also lost.
If we count the number of records in our data set (we expect 1,000, remember), we see various results reflecting the distribution of the noise PINQ incorporates.
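A sketch of such a program, issuing three counts of increasing accuracy against the PINQueryable constructed above (the NoisyCount signature follows the description in the text):

```csharp
// Three noisy counts. The first two fit within the 1.0 budget;
// the third would bring the total to 0.01 + 0.1 + 1.0 > 1.0,
// so the agent refuses it.
Console.WriteLine(protectedData.NoisyCount(0.01)); // roughly 1000 +/- 100
Console.WriteLine(protectedData.NoisyCount(0.1));  // roughly 1000 +/- 10
Console.WriteLine(protectedData.NoisyCount(1.0));  // PINQ intercedes here
```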
The output of this program starts as we might expect, giving the first two counts with accuracy that appears to reflect the requested amount (accurate roughly to 100, and 10, respectively). Before we can get to the third aggregate, though, PINQ intercedes, as the cumulative privacy cost would exceed the limit encoded in the PINQAgentBudget.
This is about as simple as it gets with PINQ, but there is much more it can do.
2. Developing an Analysis: k-Means Clustering
We'll start by writing a relatively simple data analysis: an implementation of the k-means clustering rule. The data are assumed to be points in some d-dimensional space; we'll treat them as double arrays with a known length d. The algorithm starts from a set of k points in this space, and repeatedly applies the update rule that replaces each center with the average of those data points closer to it than to any other. This example should show us how to do some simple data manipulation and analysis with PINQ, but also a few tricks for creating more advanced programs out of simpler parts.
We will focus just on writing the update rule; that is, a C# method whose signature is
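Concretely, a signature matching that description might read (the method and parameter names are illustrative):

```csharp
// One k-means update step: returns updated centers at accuracy epsilon.
static double[][] KMeansStep(PINQueryable<double[]> input,
                             double[][] centers,
                             double epsilon)
```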
The method expects a PINQueryable containing the relevant data, as well as a list of centers to update, and an accuracy parameter epsilon to use. The result is the new list of centers, updated in accordance with the update rule, with accuracy epsilon.
Updating Mean Vectors
The core logic for k-Means is to update each center to the average of those points closer to it than to any other center. For each of the k centers, we will want to use PINQ's Where method to restrict the data set appropriately. Let's imagine we have written a helpful Closest(vector, centers) method, which returns the element of centers nearest to vector. For each center in centers, such a method lets us restrict the data to those vectors closer to it than to any other center:
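Assuming such a Closest helper, the restriction for one particular center might read as follows (the equality test relies on Closest returning the actual element of centers):

```csharp
// Records for which this center is the nearest of all the centers.
var nearest = input.Where(point => Closest(point, centers) == center);
```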
To compute the mean of a data set, we will invoke PINQ's NoisyAverage method. Like NoisyCount, this method takes a parameter epsilon indicating the level of accuracy. Additionally, it takes a function that maps each record of the input type (here, double[]) to a double. PINQ will apply the function to each of the records, clamp each result to the [-1,+1] interval, and then average the results with a small amount of noise depending on epsilon. What results is a relatively accurate approximation to the average, assuming the data set contains enough records as compared to 1/epsilon.
Putting these two methods together, along with some control structure that loops through centers, and through each coordinate of each center (an average only happens one dimension at a time), we get:
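One sketch of that structure, forming the body of the update method (Closest and the NoisyAverage signature are as discussed above; variable names are illustrative):

```csharp
var newCenters = new double[centers.Length][];
for (int i = 0; i < centers.Length; i++)
{
    var center = centers[i];

    // Restrict to the records nearest this particular center.
    var nearest = input.Where(point => Closest(point, centers) == center);

    // Average each coordinate separately, with accuracy epsilon.
    newCenters[i] = new double[center.Length];
    for (int d = 0; d < center.Length; d++)
    {
        int dim = d;  // avoid capturing the mutating loop variable
        newCenters[i][dim] = nearest.NoisyAverage(epsilon, point => point[dim]);
    }
}
return newCenters;
```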
Our function is now complete, and properly returns updated centers as promised.
To demonstrate our implementation, we will rewrite Main to initialize and iterate on several centers. First, and for completeness, let's look at a simple method for producing random data in the [-1,+1] cube, in any number of dimensions.
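One such method might be the following sketch; any uniform sampler in the cube will do:

```csharp
// A point drawn uniformly at random from the [-1,+1]^d cube.
static double[] RandomVector(int d, Random random)
{
    var result = new double[d];
    for (int i = 0; i < d; i++)
        result[i] = 2.0 * random.NextDouble() - 1.0;
    return result;
}
```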
We will initialize our data set with 10,000 samples from this function, using two dimensions (so that our intuition about the data is best maintained). Using a privacy budget of 1.0 again, our Main function looks like:
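A sketch of such a Main; the iteration count and per-step epsilon are illustrative choices made so that the total cost, roughly k x d x epsilon per iteration with the Where-based implementation, stays under the 1.0 budget:

```csharp
static void Main(string[] args)
{
    var random = new Random();

    // 10,000 points drawn uniformly from [-1,+1]^2.
    var points = Enumerable.Range(0, 10000)
                           .Select(i => RandomVector(2, random))
                           .ToArray()
                           .AsQueryable();

    var input = new PINQueryable<double[]>(points, new PINQAgentBudget(1.0));

    // Four random initial centers, then a few update iterations.
    var centers = Enumerable.Range(0, 4)
                            .Select(i => RandomVector(2, random))
                            .ToArray();

    for (int iteration = 0; iteration < 4; iteration++)
    {
        // 4 centers x 2 dimensions x 0.03 = 0.24 units per iteration.
        centers = KMeansStep(input, centers, 0.03);
        foreach (var center in centers)
            Console.WriteLine("({0:F2}, {1:F2})", center[0], center[1]);
    }
}
```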
When run, we can see that the output means are converging roughly to the four quadrants of the [-1,+1] square, exactly as we would expect for a large data set drawn uniformly from that region:
Clever Tricks: The Partition Operation
If we watch our privacy meter carefully, we see that we pay a price that depends not only on epsilon, but also on the dimensionality of the data (d) as well as the number of centers (k). It turns out that this latter dependence is spurious; as each record contributes to the re-computation of at most one center, it can affect only d of the (d x k) outgoing values, not all of them. Attempting to optimize the privacy cost of our algorithm is important, because a more privacy-efficient algorithm can be run with higher levels of accuracy, for more iterations, or effectively on less data.
We can streamline the privacy performance of our k-means algorithm by making this independence explicit using PINQ's Partition operation. Using Partition is a lot like using multiple Where methods, in which each of the methods are intended to be disjoint (as in our case). To invoke Partition, the analyst must supply an array of keys, and a key selection function:
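In our setting, the centers themselves serve as the keys and Closest as the key selection function. A sketch of the invocation and its use (indexing the result array by key position is an assumption about the API):

```csharp
// One PINQueryable per center; each record falls in exactly one part.
var parts = input.Partition(centers, point => Closest(point, centers));

// Average each coordinate of each part; the privacy cost is now the
// maximum over the parts, rather than the sum.
for (int i = 0; i < centers.Length; i++)
    for (int d = 0; d < centers[i].Length; d++)
    {
        int dim = d;  // avoid capturing the mutating loop variable
        centers[i][dim] = parts[i].NoisyAverage(epsilon, point => point[dim]);
    }
```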
What results is a set of PINQueryable objects, one for each of the proposed keys, each containing those records that mapped to the key under the key selection function. PINQ's Where method can be viewed as shorthand for the case where the keys are true and false, and the key selection function is the predicate. However, we have a more interesting situation, where we want to split our incoming data set into parts, one for each center, using the nearest center as the key selection function.
This new subroutine will experience a reduced privacy cost, as the cost of a partitioned PINQueryable is only the maximum cost associated with each part, rather than the sum, as we had previously experienced using Where.
3. Transformations: Where, Select, GroupBy, Join
[Section in progress; please excuse the dust]
We'll now take a slightly more careful look at some of the transformations that PINQ supports. Transformations are the building blocks of PINQ programs; without them we would be unable to do anything with a protected data set other than directly aggregate it, which would quickly become boring. With transformations, we can apply a variety of interesting and useful operations to our data, as well as integrate multiple data sets (both protected and non).
There are many transformations in PINQ, and there is nothing that prevents further transformations from being added in the future. The main feature of a transformation, as opposed to an aggregation, is that it returns a new PINQueryable object, still wrapped around sensitive data, rather than an immediately useful aggregate. This means that a transformation must be able to contribute some operation to the data, but also that it is responsible for properly handling the privacy implications of the transformation. Some transformations may weaken differential privacy guarantees for the source data through extreme sensitivity, and it is the job of the transformation to alert PINQ to this sensitivity so that the appropriate bookkeeping can be done when its outcome is subjected to differentially-private analysis.
Concretely, all transformations in PINQ have what is called a "stability constant". This number is the factor by which the symmetric difference between two possible input data sets, A and B, can be increased after the transformation, to T(A) and T(B), respectively. This constant is critical for the quantitative mathematics underlying PINQ: if a differentially-private analysis with accuracy epsilon is applied to T(X), the differential privacy implications for X are as if it had been subjected to an analysis with accuracy epsilon *times* the stability constant of T. If the transformation has been written properly, PINQ will take care of all of these calculations, but it is important for the analyst to know what is going on, so that they understand where their privacy budget has run off to.
The Where transformation
Where is a simple transformation that takes a predicate as an input, and returns a PINQueryable reflecting the restriction of data set to those records satisfying the predicate.
The stability constant for Where is one.
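For example, using the searches data set that appears later in this section (the field names and literal here are illustrative):

```csharp
// Restrict to the searches issued by one hypothetical user.
var mySearches = searches.Where(search => search.userID == 42);
```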
The Select transformation
Select is also a simple transformation, taking a function that maps records of the source type T to some output type S. The result is the PINQueryable reflecting the application of the function to each of the records in the source data set.
The stability constant for Select is one.
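For example, projecting the searches data set down to its user identifiers (the field name is illustrative):

```csharp
// Project each search record down to just its userID field.
var userIDs = searches.Select(search => search.userID);
```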
The GroupBy transformation
GroupBy is a somewhat more challenging transformation. It takes as input only a key selection function, mapping each source record to a key value of some type K. GroupBy operates like its LINQ counterpart, resulting in a list of IGrouping objects, one for each observed key, each containing those records that mapped to the corresponding key under the supplied function.
One natural use of GroupBy is to take a set of records with unique identifier information in them, and to collect the records into groups based on this identifier:
var users = searches.GroupBy(search => search.userID);
The presence of GroupBy in PINQ is noteworthy in that its functionality has normally been viewed as a privacy attack, and largely incompatible with privacy-preserving data analysis. The vulnerability of the AOL data release was exposed just as above, by grouping the data by [anonymized] user identifiers. Despite the anonymization, the searches themselves were sufficiently informative as to allow the identification of the searcher who issued them. Nonetheless, GroupBy is available in PINQ because we are assured that its result will only be subjected to differentially-private computation.
The stability constant for GroupBy is two.
The Join transformation
Join is a rather difficult transformation, and we will see that we are unable to faithfully reproduce the semantics of the Join transformation from LINQ. In LINQ, the Join transformation takes a second data set, key selection functions (as in GroupBy) for each of the data sets, and then a reduction function from pairs of the two input types to a third, output, type. The intended semantics are that for every pair of records, one from each of the two input data sets, for which the keys match, the reduction function is applied to the pair and yields an output record. Unfortunately, the LINQ Join semantics result in a transformation with no bound on stability. Even one record in difference between inputs A and B could lead to an arbitrarily large number of records in difference in the output.
Instead, we adopt semantics closer to the GroupJoin transformation, in which the input data sets are first subjected to a GroupBy using the key selector functions. This ensures that the inputs to the Join have at most one record per key value. The result is not a list of pairs of records that match, but rather a pair of lists of records that match. The reduction function must then operate on pairs of lists rather than pairs, and will invariably see far fewer such pairs than in the traditional semantics, but in this form the Join does achieve a bound on stability.
While far fewer individual records are emitted from a Join, it is still very useful at linking information between data sets using unique, or approximately unique keys. Imagine the users PINQueryable from our GroupBy example, and let's Join the result with our protected map from userID to proper name, to see who is searching for themselves:
var users = searches.GroupBy(search => search.userID); // from above
var selfsearches = users.Join(userToName, user => user.Key, pair => pair.userID, (user, pair) => user.Where(query => query == pair.name));
Console.WriteLine(selfsearches.Where(list => list.Count() > 0).NoisyCount(0.1));
Using Join we are able to link these two data sets together, taking information from both and providing differential privacy guarantees for both, without assumption or requirement on the structure of the other.
The stability constant for Join is two, for each input data set.
4. Writing New PINQAgent Classes
[Section in progress; please excuse the dust]
Just as a data analyst has a large degree of freedom in their interaction with the PINQueryable class, a data provider is able to customize their side of the interaction with PINQ as well.
A data provider interfaces with PINQ by providing PINQAgent classes to constructed PINQueryable objects. PINQ ensures that these PINQAgents are consulted before any differentially-private aggregation is applied, supplying the agent with the intended privacy differential, epsilon. While a standard PINQAgent might track a pre-determined budget, subtracting issued differentials until exhausted, there are other interesting alternatives, and the possibility for the provider to write their own as they see fit.
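As a sketch of one such alternative, consider an agent that logs each requested differential before deferring to a standard budget. The virtual apply member and its signature are assumptions about the PINQAgent base class, labeled as such:

```csharp
// A hypothetical agent that logs requested differentials before
// applying a standard budget test. The overridable apply(epsilon)
// member is an assumption about the PINQAgent interface.
public class PINQAgentLogger : PINQAgent
{
    private PINQAgent budget;

    public PINQAgentLogger(double total)
    {
        budget = new PINQAgentBudget(total);
    }

    public override bool apply(double epsilon)
    {
        Console.WriteLine("requested differential: {0}", epsilon);
        return budget.apply(epsilon);
    }
}
```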
5. Writing New Aggregations and Transformations
[Section in progress; please excuse the dust]
PINQ is designed to be an extensible platform for differentially-private computation, and supports the extension of the basic aggregation and transformation primitives through subtyping of the PINQueryable class. PINQ provides the core differential privacy logic, and will ensure that PINQAgents continue to be called with appropriate values, but it does not ensure that newly added aggregations are differentially-private or that newly added transformations have the correct stability constant. New methods are new privacy risks, and they should be added with some care and supervision.
Each aggregation in PINQ follows a relatively simple structure. The aggregation first checks with the PINQAgent to confirm that the intended accuracy epsilon is acceptable, and if the agent accepts, the aggregation is conducted. Otherwise, the aggregations currently in PINQ throw exceptions, but this decision is up to the aggregation. The only requirement is that they not report any data-dependent information back to the analyst in this case.
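That structure might be sketched as follows for a count-like aggregation; the agent's apply method and the Laplace helper are assumptions, though the noise distribution matches the earlier description of NoisyCount:

```csharp
// Skeleton of an aggregation: consult the agent first, and reveal
// nothing data-dependent if the request is refused.
public double NoisyCountSketch(double epsilon)
{
    if (!agent.apply(epsilon))
        throw new Exception("PINQ access denied");

    // True count plus Laplace noise with parameter 1.0/epsilon.
    return source.Count() + Laplace(1.0 / epsilon);
}
```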
Each transformation in PINQ follows a relatively simple structure. The transformation produces a new IQueryable object containing the results of the transformation, and assembles a new PINQAgent containing the transformation-appropriate stability logic. As we discuss these details, we will work with an example "new transformation" SelectMany, actually present in PINQ but undiscussed until now. SelectMany is like Select but can output multiple records. SelectMany requires a second parameter k that bounds the number of records produced, and serves as the stability constant.
The transformation logic itself can be quite simple, as is the case with operations like Where, Select, and GroupBy; the LINQ analogue is invoked and simply used as the source data for the new PINQueryable. More complicated cases arise with more complicated transformations, though. Join, for example, requires that the two sources be grouped by their key functions; it is not enough to assume that the analyst has done this for us, so the transformation must perform the grouping itself. SelectMany's logic is non-trivial, and important to get correct:
var newdata = source.Select(selectFunction).Select(list => list.Take(k)).SelectMany(x => x);
Note that we did not trust the analyst to supply us with the correct value of k, or the selectFunction to restrict itself to only k elements; instead we forcibly correct any bad behavior on the analyst's part, ensuring that our requirements are met.
Assembling a new PINQAgent is often simple, typically involving a PINQAgentUnary, which scales all privacy requests up by a constant factor. Much more interesting and complicated PINQAgents exist; the one used in the Partition transformation is a good example of relatively complicated privacy logic, justified by the usefulness of the operator. The stability constant for SelectMany is the value of k supplied by the analyst (and enforced by our implementation).
var newagent = new PINQAgentUnary(this.agent, k);
return NewPINQueryable<S>(newdata, newagent);
It is especially important in transformations to ensure that the particular implementation of the transformation not enable new channels for the leakage of specific information. For example, the partition transformation returns a list of PINQueryable objects, and the order of this list is observable. Partition is implemented to order the results in the same order that the keys were received, rather than the order that GroupBy would produce (which would be sorted by the order the keys were first seen). Additionally, it is important that it populate the output with keys first, to ensure that even keys that we do not see in the actual data are properly placed; were they absent or otherwise suspicious, the analyst could derive information about the execution of the transformation.