New 

The Price of Selection in Differential Privacy
Mitali Bafna,
Jonathan Ullman
Manuscript
Links: arXiv
Abstract:
In the differentially private top$k$ selection problem, we are given a dataset $X \in \{\pm 1\}^{n \times d}$, in which each row belongs to an individual and each column corresponds to some binary attribute, and our goal is to find a set of $k \ll d$ columns whose means are approximately as large as possible. Differential privacy requires that our choice of these $k$ columns does not depend too much on any on individual's dataset. This problem can be solved using the well known exponential mechanism and composition properties of differential privacy. In the highaccuracy regime, where we require the error of the selection procedure to be to be smaller than the socalled sampling error $\alpha \approx \sqrt{\ln(d)/n}$, this procedure succeeds given a dataset of size $n \gtrsim k \ln(d)$.
We prove a matching lower bound, showing that a dataset of size $n \gtrsim k \ln(d)$ is necessary for private top$k$ selection in this highaccuracy regime. Our lower bound is the first to show that selecting the $k$ largest columns requires more data than simply estimating the value of those $k$ columns, which can be done using a dataset of size just $n \gtrsim k$.

Subgaussian Concentration via Stability Arguments
Thomas Steinke,
Jonathan Ullman
Manuscript
Links: arXiv
Abstract:
Sums of independent, bounded random variables concentrate around their expectation approximately as well a Gaussian of the same variance. Well known results of this form include the Bernstein, Hoeffding, and Chernoff inequalities and many others. We present an alternative proof of these tail bounds based on what we call a stability argument, which avoids bounding the moment generating function or higherorder moments of the distribution. Our stability argument is inspired by recent work on the generalization properties of differential privacy and their connection to adaptive data analysis (Bassily et al., STOC 2016).

Set Cover in Sublinear Time
Piotr Indyk,
Sepideh Mahabadi,
Ronitt Rubinfeld,
Jonathan Ullman,
Ali Vakilian,
Anak Yodpinyanee
Manuscript
Links: Coming Soon!
Abstract:
We study the classic set cover problem from the perspective of sublinear time algorithms. We consider both the integral and the fractional variants of this problem. Given access to a collection of $m$ sets over $n$ elements, for the integral variant, we show an adaptation of the streaming algorithm of HarPeled et al. to the sublinear query model, that returns an $\alpha$approximate cover using $\tilde{O}(m(n/k)^{1/(\alpha1)} + nk)$ queries to the input, where $k$ denotes the value of a minimum set cover. We complement this result by proving that for low values of $k$, the required number of queries is $\tilde{\Omega}(m(n/k)^{1/(2\alpha)})$. Moreover, we prove that even checking whether a given collection of sets covers all the elements would require $\Omega(nk)$ queries. These two lower bounds provide strong evidence that the upper bound is almost tight for certain values of the parameter $k$. On the other hand, we show that this bound is not optimal for larger values of the parameter $k$, as there exists a $(1+\varepsilon)$approximation algorithm with $\tilde{O}(mn/k)$ queries.
For the fractional variant of set cover, we show three sublinear time $(1+\varepsilon)$approximate algorithms. The first two algorithms are most efficient when $k$ is small. Specifically, one algorithm has $\tilde{O}({mk / \varepsilon^{5}}+{nk / \varepsilon^2})$ query complexity and runtime, which (for constant $\epsilon)$ improves over the best prior bound of $\tilde{O}( (m + n) k^2/\varepsilon^2)$ due to Koufogiannakis and Young by (roughly) a factor of $k$. Our second algorithm reduces the number of queries further by a factor of (roughly) $\sqrt{k}$ while increasing the runtime by the same factor. Our third algorithm is most efficient when $k$ is large, and uses $O(mn /k\varepsilon)$ queries and similar running time. We also show that, for constant values of $k$, the query complexity of any algorithm for fractional set cover must depend linearly on both $m$ and $n$.
The first two algorithms for fractional set cover use the Multiplicative Weight Update (MWU) framework, which iteratively identifies unsatisfied constraints and recalibrates their "importance". The first algorithm implements each constraint checking round separately by using random sampling. Our second algorithm reduces the number of queries by reusing the random samples in multiple rounds. Since the MWU algorithm updates the constraints in an adaptive fashion, we employ the adaptive data analysis framework of Dwork et al. and Bassily et al., which allows us to control the probabilistic dependencies. To the best of our knowledge this is the first application of the framework to the design of sublinear algorithms.
Our lowerbound results follow by carefully designing two distributions of instances that are hard to distinguish. In particular, our first lower bound involves a probabilistic construction of a certain set system with minimum set cover of size $\alpha k$, with the key property that a small number of "almost uniformly distributed" modifications can reduce the minimum set cover size down to $k$. Thus, these modifications are not detectable unless a large number of queries are asked. We believe that our probabilistic construction technique might find applications to lower bounds for other combinatorial optimization problems.
We remark that all of our algorithms report an actual set cover (not just its size), while our lower bounds hold even for estimating the cover size.

Multidimensional Dynamic Pricing for Welfare Maximization
Aaron Roth,
Aleksandrs Slivkins,
Jonathan Ullman,
Zhiwei Steven Wu
Manuscript
Links: arXiv
Abstract:
We study the problem of a seller dynamically pricing d distinct types of goods, when faced with the online arrival of buyers drawn independently from an unknown distribution. The seller observes only the bundle of goods purchased at each day, but nothing else about the buyer's valuation function. When buyers have strongly concave, Holder continuous valuation functions, we give a pricing scheme that finds a pricing that optimizes welfare (including the seller's cost of production) in time and number of rounds that are polynomial in d and the accuracy parameter. We are able to do this despite the fact that (i) welfare is a nonconcave function of the prices, and (ii) the welfare is not observable to the seller. We also extend our results to a limitedsupply setting.

2017 

Make Up Your Mind: The Price of Online Queries in Differential Privacy
Mark Bun,
Thomas Steinke,
Jonathan Ullman
SODA 2017
Links: arXiv
Abstract:
We consider the problem of answering queries about a sensitive dataset
subject to differential privacy. The queries may be chosen adversarially
from a larger set Q of allowable queries in one of three ways, which we
list in order from easiest to hardest to answer:
 Offline: The queries are chosen all at once and the differentially
private mechanism answers the queries in a single batch.
 Online: The queries are chosen all at once, but the mechanism only
receives the queries in a streaming fashion and must answer each query
before seeing the next query.
 Adaptive: The queries are chosen one at a time and the mechanism must
answer each query before the next query is chosen. In particular, each query
may depend on the answers given to previous queries.
Many differentially private mechanisms are just as effective in the
adaptive model as they are in the offline model. Meanwhile, most lower
bounds for differential privacy hold in the offline setting. This suggests
that the three models may be equivalent.
We prove that these models are all, in fact, distinct. Specifically, we
show that there is a family of statistical queries such that exponentially
more queries from this family can be answered in the offline model than in
the online model. We also exhibit a family of search queries such that
exponentially more queries from this family can be answered in the online
model than in the adaptive model. We also investigate whether such separations
might hold for simple queries like threshold queries over the real line.

2016 

Computing Marginals Using MapReduce
Foto Afrati,
Shantanu Sharma,
Jeffrey D. Ullman,
Jonathan Ullman
IDEAS 2016
Links: arXiv
Abstract:
We consider the problem of computing the datacube marginals of a
fixed order k (i.e., all marginals that aggregate over k dimensions),
using a single round of MapReduce. The focus is on the relationship
between the reducer size (number of inputs allowed at a single reducer)
and the replication rate (number of reducers to which an input is sent).
We show that the replication rate is minimized when the reducers receive
all the inputs necessary to compute one marginal of higher order. That
observation lets us view the problem as one of covering sets of k
dimensions with sets of a larger size m, a problem that has been
studied under the name "covering numbers." We offer a number of
constructions that, for different values of k and m meet or come
close to yielding the minimum possible replication rate for a given
reducer size.

Exposed! A Survey of Attacks on Private Data
Cynthia Dwork,
Adam Smith,
Thomas Steinke,
Jonathan Ullman
Annual Review of Statistics and its Applications
Links: Coming soon!
Abstract:
Privacypreserving statistical data analysis addresses the general question of protecting privacy when publicly releasing information about a sensitive dataset. A privacy attack takes seemingly innocuous released information and uses it to discern the private details of individuals, thus demonstrating that such information compromises privacy. For example, "reidentification attacks" have shown that it is easy to link supposedly deidentified records to the identity of the individual concerned. This survey focuses on attacking aggregate data, such as statistics about how many individuals have a certain disease, genetic trait, or combination thereof.
We consider two types of attacks: reconstruction attacks, which approximately determine a sensitive feature of all the individuals covered by the dataset, and tracing attacks, which determine whether or not a target individualâ€™s data is included in the dataset. We also discuss techniques from the differential privacy literature for releasing approximate aggregate statistics whilst provably thwarting any privacy attack.

Privacy Odometers and Filters: PayasyouGo Composition
Ryan Rogers,
Aaron Roth,
Jonathan Ullman,
Salil Vadhan
NIPS 2016
Links: arXiv
Abstract:
In this paper we initiate the study of adaptive composition in differential privacy when the length of the composition, and the privacy parameters themselves can be chosen adaptively, as a function of the outcome of previously run analyses. This case is much more delicate than the setting covered by existing composition theorems, in which the algorithms themselves can be chosen adaptively, but the privacy parameters must be fixed up front. Indeed, it isn't even clear how to define differential privacy in the adaptive parameter setting. We proceed by defining two objects which cover the two main use cases of composition theorems. A privacy filter is a stopping time rule that allows an analyst to halt a computation before his prespecified privacy budget is exceeded. A privacy odometer allows the analyst to track realized privacy loss as he goes, without needing to prespecify a privacy budget. We show that unlike the case in which privacy parameters are fixed, in the adaptive parameter setting, these two use cases are distinct. We show that there exist privacy filters with bounds comparable (up to constants) with existing privacy composition theorems. We also give a privacy odometer that nearly matches nonadaptive private composition theorems, but is sometimes worse by a small asymptotic factor. Moreover, we show that this is inherent, and that any valid privacy odometer in the adaptive parameter setting must lose this factor, which shows a formal separation between the filter and odometer usecases.

Strong Hardness of Privacy from Weak Traitor Tracing
Lucas Kowalczyk,
Tal Malkin,
Jonathan Ullman,
Mark Zhandry
TCC 2016B
Links: arXiv, ePrint
Abstract:
Despite much study, the computational complexity of differential
privacy remains poorly understood. In this paper we consider the
computational complexity of accurately answering a family $Q$
of statistical queries over a data universe $X$ under
differential privacy. A statistical query on a dataset $D \in X^n$
asks "what fraction of the elements of $D$ satisfy a given predicate
$p$ on $X$?" Dwork et al. (STOC'09) and Boneh and Zhandry
(CRYPTO'14) showed that if both $Q$ and $X$ are of polynomial size,
then there is an efficient differentially private algorithm that
accurately answers all the queries, and if both $Q$ and $X$ are
exponential size, then under a plausible assumption, no efficient
algorithm exists.
We show that, under the same assumption,
if either the number of queries or the data universe is of
exponential size, then there is no differentially private algorithm
that answers all the queries.
Specifically, we prove that if oneway functions and
indistinguishability obfuscation exist, then:
 For every $n$, there is a family $Q$ of $O(n^7)$ queries on a data universe $X$ of size $2^d$ such that no $\mathrm{poly}(n,d)$ time
differentially private algorithm takes a dataset $D \in X^n$ and outputs accurate answers to every query in $Q$.
 For every $n$, there is a family $Q$ of $2^d$ queries on a data universe $X$ of size $O(n^7)$ such that no $\mathrm{poly}(n,d)$ time
differentially private algorithm takes a dataset $D \in X^n$ and outputs accurate answers to every query in $Q$.
In both cases, the result is nearly quantitatively tight, since there
is an efficient differentially private algorithm that answers
$\tilde{\Omega}(n^2)$ queries on an exponential size data universe,
and one that answers exponentially many queries on a data universe of
size $\tilde{\Omega}(n^2)$.
Our proofs build on the connection between hardness results in
differential privacy and traitortracing schemes (Dwork et al.,
STOC'09; Ullman, STOC'13). We prove our hardness result for a
polynomial size query set (resp., data universe) by showing that they
follow from the existence of a special type of traitortracing scheme
with very short ciphertexts (resp., secret keys), but very weak
security guarantees, and then constructing such a scheme.

An AntiFolk Theorem for Large Repeated Games with Imperfect Monitoring
Mallesh Pai,
Aaron Roth,
Jonathan Ullman
Transactions on Economics and Computation
Links: arXiv
Abstract:
We study infinitely repeated games in settings of imperfect monitoring.
We first prove a family of theorems that show that when the signals observed
by the players satisfy a condition known as differential privacy, that the
folk theorem has little bite: for sufficiently small values of the privacy
parameters, for a fixed discount factor, any equilibrium of the repeated game
involve players playing approximate equilibria of the stage game in every
period. Next, we argue that in large games (n player games in which unilateral
deviations by single players have only a small impact on the utility of other
players), many monitoring settings naturally lead to signals that satisfy
differential privacy, for privacy parameters tending to zero as the number
of players n grows large. We conclude that in such settings, the set of
equilibria of the repeated game collapse to the set of equilibria of the
stage game.

Some Pairs Problems
Jeffrey D. Ullman,
Jonathan Ullman
BeyondMR 2016
Links: arXiv
Abstract:
A common form of MapReduce application involves discovering relationships
between certain pairs of inputs. Similarity joins serve as a good example
of this type of problem, which we call a "somepairs" problem. In the
framework of Afrati et al. (VLDB 2013), algorithms are measured by the
tradeoff between reducer size (maximum number of inputs a reducer can handle)
and the replication rate (average number of reducers to which an input must
be sent. There are two obvious approaches to solving somepairs problems
in general. We show that no generalpurpose MapReduce algorithm can beat
both of these two algorithms in the worst case. We then explore a recursive
algorithm for solving somepairs problems and heuristics for beating the
lower bound on common instances of the somepairs class of problems.

Space Lower Bounds for Itemset Frequency Sketches
Edo Liberty,
Michael Mitzenmacher,
Justin Thaler,
Jonathan Ullman
PODS 2016
Links: arXiv
Abstract:
Given a database, computing the fraction of rows that contain a query
itemset or determining whether this fraction is above some threshold are
fundamental operations in data mining. A uniform sample of rows is a good
sketch of the database in the sense that all sufficiently frequent itemsets
and their approximate frequencies are recoverable from the sample, and
the sketch size is independent of the number of rows in the original
database. For many seemingly similar problems there are better sketching
algorithms than uniform sampling. In this paper we show that for itemset
frequency sketching this is not the case. That is, we prove that there
exist classes of databases for which uniform sampling is nearly space
optimal.

Algorithmic Stability for Adaptive Data Analysis
Raef Bassily,
Kobbi Nissim,
Uri Stemmer,
Adam Smith,
Thomas Steinke,
Jonathan Ullman
STOC 2016; Invited to the special issue of SICOMP
This work subsumes the two previous manuscripts [BSSU15] and [NS15].
Links: arXiv
Abstract:
Adaptivity is an important feature of data analysisthe choice of
questions to ask about a dataset often depends on previous interactions
with the same dataset. However, statistical validity is typically
studied in a nonadaptive model, where all questions are specified
before the dataset is drawn. Recent work by Dwork et al. (STOC '15)
and Hardt and Ullman (FOCS '14) initiated the formal study of this
problem, and gave the first upper and lower bounds on the achievable
generalization error for adaptive data analysis.
Specifically, suppose there is an unknown distribution ${\bf P}$ and a
set of $n$ independent samples ${\bf x}$ is drawn from ${\bf P}$. We
seek an algorithm that, given ${\bf x}$ as input, accurately answers a
sequence of adaptively chosen "queries" about the unknown distribution
${\bf P}$. How many samples $n$ must we draw from the distribution, as
a function of the type of queries, the number of queries, and the desired
level of accuracy?
In this work we make two new contributions towards resolving this question:
 We give upper bounds on the number of samples $n$ that are
needed to answer statistical queries. The bounds improve and
simplify the work of Dwork et al. (STOC 2015), and have been applied in
subsequent work by those authors (Science, 2015; NIPS,
2015).
 We prove the first upper bounds on the number of samples required
to answer more general families of queries. These include arbitrary
lowsensitivity queries and an important class of optimization
queries (alternatively, risk minimization queries).
As in Dwork et al., our algorithms are based on a connection with
algorithmic stability in the form of differential privacy.
We extend their work by giving a quantitatively optimal, more general, and
simpler proof of their main theorem that stable algorithms of the kind guaranteed
by differential privacy imply low generalization error. We also show that weaker
stability guarantees such as bounded KL divergence and total variation distance
lead to correspondingly weaker generalization guarantees.

Watch and Learn: Optimizing from Revealed Preferences Feedback
Aaron Roth,
Jonathan Ullman,
Zhiwei Steven Wu
STOC 2016
Links: arXiv
Abstract:
A Stackelberg game is played between a leader and
a follower. The leader first chooses an action, and then the follower
plays his best response, and the goal of the leader is to pick the action that
will maximize his payoff given the follower's best response. Stackelberg games
capture, for example, the following interaction between a producer and a consumer.
The producer chooses the prices of the goods he produces, and then a consumer
chooses to buy a utilitymaximizing bundle of goods. The goal of the seller
here is to set prices to maximize his profithis revenue, minus the production
cost of the purchased bundle. It is quite natural that the seller in this example
should not know the buyer's utility function. However, he does have access to
revealed preference feedbackhe can set prices, and then observe the
purchased bundle and his own profit. We give algorithms for efficiently solving,
in terms of both computation and querycomplexity, a broad class of Stackelberg
games in which the follower's utility function is unknown, using only revealed
preference access to it. This class includes in particular the profit maximization
problem, as well as the optimal tolling problem in nonatomic congestion games,
when the latency functions are unknown. Surprisingly, we are able to solve these
problems even though the optimization problems are nonconvex in the leader's
actions.

2015 

When Can Limited Randomness Be Used in Repeated Games?
Pavel Hubacek,
Moni Naor,
Jonathan Ullman
SAGT 2015; TCS special issue for selected papers from SAGT 2014/15
Links: arXiv
Abstract:
The central result of classical game theory states that every finite normal form
game has a Nash equilibrium, provided that players are allowed to use randomized
(mixed) strategies. However, in practice, humans are known to be bad at
generating randomlike sequences, and true random bits may be unavailable.
Even if the players have access to enough random bits for a single instance
of the game their randomness might be insufficient if the game is played
many times.
In this work, we ask whether randomness is necessary for equilibria to exist in
finitely repeated games. We show that for a large class of games containing
arbitrary twoplayer zerosum games, approximate Nash equilibria of the $n$stage
repeated version of the game exist if and only if both players have $\Omega(n)$
random bits. In contrast, we show that there exists a class of games for which
no equilibrium exists in pure strategies, yet the $n$stage repeated version of
the game has an exact Nash equilibrium in which each player uses only a constant
number of random bits.
When the players are assumed to be computationally bounded, if cryptographic
pseudorandom generators (or, equivalently, oneway functions) exist, then the
players can base their strategies on "randomlike" sequences derived from only
a small number of truly random bits. We show that, in contrast, in repeated
twoplayer zerosum games, if pseudorandom generators do not exist, then
$\Omega(n)$ random bits remain necessary for equilibria to exist.

Robust Traceability from Trace Amounts
Cynthia Dwork,
Adam Smith,
Thomas Steinke,
Jonathan Ullman,
Salil Vadhan
FOCS 2015
Links: Harvard
Abstract:
The privacy risks inherent in the release of a large number of summary
statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who
considered the case of 1way marginals of SNP allele frequencies obtained
in a genomewide association study: Given a large number of minor allele
frequencies from a case group of individuals diagnosed with a particular
disease, together with the genomic data of a single target individual and
statistics from a sizable reference dataset independently drawn from the
same population, an attacker can determine with high confidence whether or
not the target is in the case group.
In this work we describe and analyze a simple attack that succeeds even if
the summary statistics are significantly distorted, whether due to measurement
error or noise intentionally introduced to protect privacy. Our attack only
requires that the vector of distorted summary statistics is close to the vector
of true marginals in $\ell_1$norm. Moreover, the reference pool required by
previous attacks can be replaced by a single sample drawn fromthe underlying population.
The new attack, which is not specific to genomics and which handles Gaussian
as well as Bernouilli data, significantly generalizes recent lower bounds on the
noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014;
Steinke and Ullman, 2015), obviating the need for the attacker to control the
exact distribution of the data.

Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery
Thomas Steinke,
Jonathan Ullman
COLT 2015
Links: arXiv
Abstract:
We show an essentially tight bound on the number of adaptively chosen
statistical queries that a computationally efficient algorithm can answer
accurately given $n$ samples from an unknown distribution. A statistical
query asks for the expectation of a predicate over the underlying distribution,
and an answer to a statistical query is accurate if it is "close" to the
correct expectation over the distribution. This question was recently
studied by Dwork et al. (STOC 2015), who showed how to answer
$\tilde{\Omega}(n^2)$ queries efficiently, and also by Hardt and Ullman
(FOCS 2014), who showed that answering $\tilde{O}(n^3)$ queries is hard.
We close the gap between the two bounds and show that, under a standard
hardness assumption, there is no computationally efficient algorithm that,
given $n$ samples from an unknown distribution, can give valid answers to
$O(n^2)$ adaptively chosen statistical queries. An implication of our
results is that computationally efficient algorithms for answering arbitrary,
adaptively chosen statistical queries may as well be differentially private.
We obtain our results using a new connection between the problem of
answering adaptively chosen statistical queries and a combinatorial
object called an interactive fingerprinting code (Fiat and Tassa).
In order to optimize our hardness result, we give a new Fourieranalytic
approach to analyzing fingerprinting codes that is simpler, more flexible,
and yields better parameters than previous constructions.

Inducing Approximately Optimal Flow Using Truthful Mediators
Ryan Rogers,
Aaron Roth,
Jonathan Ullman,
Zhiwei Steven Wu
EC 2015

Between Pure and Approximate Differential Privacy
Thomas Steinke,
Jonathan Ullman
TPDP 2015; Journal of Privacy and Confidentiality
Links: arXiv, Journal
Abstract:
We show a new lower bound on the sample complexity of $(\varepsilon, \delta)$differentially
private algorithms that accurately answer statistical queries on highdimensional
databases. The novelty of our bound is that it depends optimally on the parameter
$\delta$, which loosely corresponds to the probability that the algorithm fails
to be private, and is the first to smoothly interpolate between approximate
differential privacy ($\delta > 0$) and pure differential privacy ($\delta = 0$).
Specifically, we consider a database $D \in \{\pm 1\}^{n \times d}$ and its
oneway marginals, which are the $d$ queries of the form "What fraction
of individual records have the $i$th bit set to $+1$?" We show that
in order to answer all of these queries to within error $\alpha$ (on average)
while satisfying $(\varepsilon, \delta)$differential privacy, it is necessary
that
$$
n \geq \Omega\left(\frac{\sqrt{d \log(1/\delta)}}{\alpha \varepsilon} \right),
$$
which is optimal up to constant factors. To prove our lower bound, we build on
the connection between fingerprinting codes and lower bounds in differential
privacy (Bun, Ullman, and Vadhan, STOCâ€™14). In addition to our lower bound,
we give new purely and approximately differentially private algorithms for
answering arbitrary statistical queries that improve on the sample complexity
of the standard Laplace and Gaussian mechanisms for achieving worstcase
accuracy guarantees by a logarithmic factor.

Private Multiplicative Weights Beyond Linear Queries
Jonathan Ullman
PODS 2015
Links: arXiv
Abstract:
A wide variety of fundamental data analyses in machine learning, such as linear
and logistic regression, require minimizing a convex function defined by the data.
Since the data may contain sensitive information about individuals, and these
analyses can leak that sensitive information, it is important to be able to
solve convex minimization in a privacy preserving way.
A series of recent results show how to accurately solve a single convex
minimization problem in a differentially private manner. However, the same
data is often analyzed repeatedly, and little is known about solving multiple
convex minimization problems with differential privacy. For simpler data analyses,
such as linear queries, there are remarkable differentially private algorithms
such as the private multiplicative weights mechanism (Hardt and Rothblum, FOCS
2010) that accurately answer exponentially many distinct queries. In this work,
we extend these results to the case of convex minimization and show how to give
accurate and differentially private solutions to exponentially many convex
minimization problems on a sensitive dataset.

2014 

Preventing False Discovery in Interactive Data Analysis is Hard
Moritz Hardt,
Jonathan Ullman
FOCS 2014
Links: arXiv
Abstract:
We show that, under a standard hardness assumption, there is no computationally
efficient algorithm that given $n$ samples from an unknown distribution can give
valid answers to $n^{3+o(1)}$ adaptively chosen statistical queries. A
statistical query asks for the expectation of a predicate over the underlying
distribution, and an answer to a statistical query is valid if it is "close" to
the correct expectation over the distribution.
Our result stands in stark contrast to the well known fact that exponentially
many statistical queries can be answered validly and efficiently if the
queries are chosen nonadaptively (no query may depend on the answers to
previous queries). Moreover, Dwork et al., showed how to
accurately answer exponentially many adaptively chosen statistical queries via
a computationally inefficient algorithm. They also gave efficient algorithm
that can answer $\tilde{\Omega}(n^2)$ adaptively chosen queries, which shows
our result is almost quantitatively tight.
Conceptually, our result demonstrates that achieving statistical validity alone can
be a source of computational intractability in adaptive settings. For example, in the
modern large collaborative research environment, data analysts typically choose a
particular approach based on previous findings. False discovery occurs if a research finding
is supported by the data but not by the underlying distribution. While the study of
preventing false discovery in Statistics is decades old, to the best of our knowledge
our result is the first to demonstrate a computational barrier. In particular, our result
suggests that the perceived difficulty of preventing false discovery in today's collaborative
research environment may be inherent.

Privately Solving Linear Programs
Justin Hsu,
Aaron Roth,
Tim Roughgarden,
Jonathan Ullman
ICALP 2014 Track A
Links: arXiv
Abstract:
In this paper, we initiate the systematic study of solving linear programs under
differential privacy. The first step is simply to define the problem: to this end,
we introduce several natural classes of private linear programs that capture
different ways sensitive data can be incorporated into a linear program. For
each class of linear programs we give an efficient, differentially private solver
based on the multiplicative weights framework, or we give an impossibility result.

Fingerprinting Codes and the Price of Approximate Differential Privacy
Mark Bun,
Jonathan Ullman,
Salil Vadhan
STOC 2014; SICOMP special issue for selected papers from STOC 2014
Links: arXiv
Abstract:
We show new lower bounds on the sample complexity of
$(\varepsilon, \delta)$differentially private algorithms that accurately
answer large sets of counting queries. A counting query on a database
$D \in (\{0,1\}^d)^n$ has the form "What fraction of the individual records
in the database satisfy the property $q$?" We show that in order to answer
an arbitrary set $\mathcal{Q}$ of $\gg nd$ counting queries on $D$ to within
error $\pm \alpha$ it is necessary that
$$
n \geq \tilde{\Omega}\left( \frac{\sqrt{d} \log \mathcal{Q}}{\alpha^2 \varepsilon} \right).
$$
This bound is optimal up to polylogarithmic factors, as demonstrated by the
Private Multiplicative Weights algorithm (Hardt and Rothblum, FOCS'10).
In particular, our lower bound is the first to show that the sample complexity
required for accuracy and $(\varepsilon, \delta)$differential privacy is
asymptotically larger than what is required merely for accuracy, which is
$O(\log \mathcal{Q} / \alpha^2)$. In addition, we show that our lower bound
holds for the specific case of $k$way marginal queries
(where $\mathcal{Q} = 2^k \binom{d}{k}$) when $\alpha$ is a constant.
Our results rely on the existence of short fingerprinting codes
(Boneh and Shaw, CRYPTO'95; Tardos, STOC'03), which we show are closely connected
to the sample complexity of differentially private data release. We also give a
new method for combining certain types of sample complexity lower bounds into
stronger lower bounds.

Faster Private Release of Marginals on Small Databases
Karthekeyan Chandrasekaran,
Justin Thaler,
Jonathan Ullman,
Andrew Wan
ITCS 2014
Links: arXiv
Abstract:
We study the problem of answering $k$way marginal queries on a
database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The
answer to a $k$way marginal query is the fraction of the database's records
$x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$
columns. Marginal queries enable a rich class of statistical analyses on a
dataset, and designing efficient algorithms for privately answering marginal
queries has been identified as an important open problem in private data
analysis. For any $k$, we give a differentially private online algorithm that
runs in time $$\mathrm{poly}\left(n, 2^{o(d)} \right)$$ per query and answers
any sequence of $\mathrm{poly}(n)$ many $k$way marginal queries with error at
most $\pm 0.01$ on every query, provided $n \gtrsim d^{0.51} $. To the best of
our knowledge, this is the first algorithm capable of privately answering
marginal queries with a nontrivial worstcase accuracy guarantee for databases
containing $\mathrm{poly}(d, k)$ records in time $\exp(o(d))$. Our algorithm
runs the private multiplicative weights algorithm (Hardt and Rothblum, FOCS '10)
on a new approximate polynomial representation of the database.
We derive our representation for the database by approximating the OR
function restricted to low Hamming weight inputs using lowdegree polynomials
with coefficients of bounded $L_1$norm. In doing so, we show new upper and
lower bounds on the degree of such polynomials, which may be of independent
approximationtheoretic interest.

Robust Mediators in Large Games
Michael Kearns,
Mallesh Pai,
Ryan Rogers,
Aaron Roth,
Jonathan Ullman
ITCS 2014
This work subsumes the two previous papers [KPRU14] and [RR14]
Links: arXiv
Abstract:
A mediator is a mechanism that can only suggest actions to players,
as a function of all agents' reported types, in a given game of
incomplete information. We study what is achievable by two kinds of
mediators, "strong" and "weak." Players can choose to optout of using
a strong mediator but cannot misrepresent their type if they optin.
Such a mediator is "strong" because we can view it as having the ability
to verify player types. Weak mediators lack this ability players are
free to misrepresent their type to a weak mediator. We show a striking
resultin a priorfree setting, assuming only that the game is large
and players have private types, strong mediators can implement approximate
equilibria of the completeinformation game. If the game is a congestion
game, then the same result holds using only weak mediators. Our result
follows from a novel application of differential privacy, in particular,
a variant we propose called joint differential privacy.

2013 

Differential Privacy for the Analyst via Private Equilibrium Computation
Justin Hsu,
Aaron Roth,
Jonathan Ullman
STOC 2013
Links: arXiv
Abstract:
We give new mechanisms for answering exponentially many queries from
multiple analysts on a private database, while protecting differential
privacy both for the individuals in the database and for the analysts.
That is, our mechanism's answer to each query is nearly insensitive to
changes in the queries asked by other analysts. Our mechanism is the
first to offer differential privacy on the joint distribution over analysts'
answers, providing privacy for data analysts even if the other data
analysts collude or register multiple accounts. In some settings, we
are able to achieve nearly optimal error rates (even compared to mechanisms
which do not offer analyst privacy), and we are able to extend our
techniques to handle nonlinear queries. Our analysis is based on a novel
view of the private queryrelease problem as a twoplayer zerosum game,
which may be of independent interest.

Answering n^{2+o(1)} Counting Queries with Differential Privacy is Hard
Jonathan Ullman
STOC 2013; SICOMP special issue for selected papers from STOC 2013
Links: arXiv
Abstract:
A central problem in differentially private data analysis is how to
design efficient algorithms capable of answering large numbers of
counting queries on a sensitive database. Counting queries
are of the form "What fraction of individual records in the database
satisfy the property $q$?" We prove that if oneway functions exist,
then there is no algorithm that takes as input a database
$D \in (\{0,1\}^d)^n$, and $k = \tilde{\Theta}(n^2)$ arbitrary efficiently
computable counting queries, runs in time $\mathrm{poly}(d, n)$, and
returns an approximate answer to each query, while satisfying differential
privacy. We also consider the complexity of answering "simple" counting
queries, and make some progress in this direction by showing that the
above result holds even when we require that the queries are computable
by constantdepth ($AC^0$) circuits.
Our result is almost tight because it is known that $\tilde{\Omega}(n^2)$
counting queries can be answered efficiently while satisfying differential
privacy. Moreover, many more than $n^2$ queries (even exponential in $n$)
can be answered in exponential time.
We prove our results by extending the connection between differentially
private query release and cryptographic traitortracing schemes to the
setting where the queries are given to the algorithm as input, and by
constructing a traitortracing scheme that is secure in this setting.

2012 

Faster Algorithms for Privately Releasing Marginals
Justin Thaler,
Jonathan Ullman,
Salil Vadhan
ICALP 2012 Track A
Links: arXiv
Abstract:
We study the problem of releasing $k$way marginals of a
database $D \in (\{0,1\}^d)^n$, while preserving differential privacy.
The answer to a $k$way marginal query is the fraction of $D$'s records
$x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$
columns. Marginal queries enable a rich class of statistical analyses of
a dataset, and designing efficient algorithms for privately releasing
marginal queries has been identified as an important open problem in
private data analysis (cf.~Barak et.~al., PODS '07).
We give an algorithm that runs in time $d^{O(\sqrt{k})}$ and releases
a private summary capable of answering any $k$way marginal query with at
most $\pm .01$ error on every query as long as $n \geq d^{O(\sqrt{k})}$.
To our knowledge, ours is the first algorithm capable of privately releasing
marginal queries with nontrivial worstcase accuracy guarantees in time
substantially smaller than the number of $k$way marginal queries, which
is $d^{\Theta(k)}$ (for $k \ll d$).

Iterative Constructions and Private Data Release
Anupam Gupta,
Aaron Roth,
Jonathan Ullman
TCC 2012
Links: arXiv
Abstract:
In this paper we study the problem of approximately releasing
the cut function of a graph while preserving differential
privacy, and give new algorithms (and new analyses of existing algorithms)
in both the interactive and noninteractive settings.
Our algorithms in the interactive setting are achieved by revisiting
the problem of releasing differentially private, approximate answers to
a large number of queries on a database. We show that several algorithms
for this problem fall into the same basic framework, and are based on the
existence of objects which we call iterative database construction
algorithms. We give a new generic framework in which new (efficient) IDC
algorithms give rise to new (efficient) interactive private query release
mechanisms. Our modular analysis simplifies and tightens the analysis of
previous algorithms, leading to improved bounds. We then give a new IDC
algorithm (and therefore a new private, interactive query release mechanism)
based on the Frieze/Kannan lowrank matrix decomposition. This new release
mechanism gives an improvement on prior work in a range of parameters where
the size of the database is comparable to the size of the data universe
(such as releasing all cut queries on dense graphs).
We also give a noninteractive algorithm for efficiently releasing
private synthetic data for graph cuts with error $O(V^{1.5})$.
Our algorithm is based on randomized response and a nonprivate
implementation of the SDPbased, constantfactor approximation algorithm
for cutnorm due to Alon and Naor. Finally, we give a reduction based
on the IDC framework showing that an efficient, private algorithm for
computing sufficiently accurate rank1 matrix approximations would lead
to an improved efficient algorithm for releasing private synthetic data
for graph cuts. We leave finding such an algorithm as our main open problem.

2011 

On the ZeroError Capacity Threshold for Deletion Channels
Ian Kash,
Michael Mitzenmacher,
Justin Thaler,
Jonathan Ullman
ITA 2011
Links: arXiv
Abstract:
We consider the zeroerror capacity of deletion channels. Specifically,
we consider the setting where we choose a codebook C consisting of strings
of n bits, and our model of the channel corresponds to an adversary who may
delete up to pn of these bits for a constant p. Our goal is to decode
correctly without error regardless of the actions of the adversary. We
consider what values of p allow nonzero capacity in this setting. We
suggest multiple approaches, one of which makes use of the natural
connection between this problem and the problem of finding the expected
length of the longest common subsequence of two random sequences.

Privately Releasing Conjunctions and the Statistical Query Barrier
Anupam Gupta,
Moritz Hardt,
Aaron Roth,
Jonathan Ullman
STOC 2011; SICOMP
Links: arXiv
Abstract:
Suppose we would like to know all answers to a set of statistical
queries~$C$ on a data set up to small error, but we can only access the data
itself using statistical queries. A trivial solution is to exhaustively ask
all queries in~$C$. In this paper, we investigate how and when we can do
better than this naive approach. We show that the number of statistical
queries necessary and sufficient for this task isup to polynomial
factorsequal to the agnostic learning complexity of $C$ in Kearns'
statisticalquery (SQ) model. This gives a complete answer to the question
when running time is not a concern. We then show that the problem can be
solved efficiently (allowing arbitrary error on a small fraction of queries)
whenever the answers to $C$ can be described by a submodular function. This
includes many natural concept classes, such as graph cuts and Boolean
disjunctions and conjunctions.
These results are interesting not only from a learning theoretic point of view,
but also from the perspective of privacypreserving data analysis.
In this context, our second result leads to an algorithm that efficiently
releases differentially private answers to all Boolean conjunctions with
1$\%$ average additive error. This presents significant progress on a key
open problem in privacypreserving data analysis. Our first result on the
other hand gives unconditional lower bounds on any differentially private
algorithm that admits a (potentially nonprivacypreserving) implementation
using only statistical queries. Not only our algorithms, but also most
known private algorithms can be implemented using only statistical queries,
and hence are constrained by these lower bounds. Our result therefore
isolates the complexity of agnostic learning in the SQ model as a new
barrier in the design of differentially private algorithms.

PCPs and the Hardness of Generating Private Synthetic Data
Jonathan Ullman,
Salil Vadhan
TCC 2011; Invited to the Journal of Cryptology
Links: ECCC
Abstract:
Assuming the existence of oneway functions, we show that there is
no polynomialtime, differentially private algorithm $\mathcal{A}$
that takes a database $D\in (\{0,1\}^d)^n$ and outputs a "synthetic
database" $\hat{D}$ all of whose twoway marginals are approximately
equal to those of $D$. (A twoway marginal is the fraction of database
rows $x \in \{0,1\}^d$ with a given pair of values in a given pair of
columns.) This answers a question of Barak et al. (PODS `07), who gave
an algorithm running in time $\mathrm{poly}(n,2^d)$.
Our proof combines a construction of hardtosanitize databases based
on digital signatures (by Dwork et al., STOC `09) with encodings based
on the PCP Theorem.
We also present both negative and positive results for generating
"relaxed" synthetic data, where the fraction of rows in $D$ satisfying
a predicate $c$ are estimated by applying $c$ to each row of $\hat{D}$
and aggregating the results in some way.

2010 

Course Allocation by Proxy Auction
Scott Kominers,
Mike Ruberry,
Jonathan Ullman
WINE 2010
Abstract:
We propose a new proxy bidding mechanism to allocate courses to
students given students' reported preferences. Our mechanism is
motivated by a specific strategic downgrading manipulation observed
in the course allocation mechanism currently used at Harvard
Business School (HBS). The proxy bidding mechanism simplifes students'
decisions by directly incorporating downgrading into the mechanism.
Our proxy bidding mechanism is Pareto efficient with respect to
lexicographic preferences and can be extended to allow for a more
expressive preference language which allows complementarity, substitutability,
and indifference. Simulations suggest that the proxy bidding mechanism is
robust to the manipulations observed in the HBS mechanism and may yield
welfare improvements.

The Price of Privately Releasing Contingency Tables and the Spectra of Random Matrices with Correlated Rows
Shiva Kasiviswanathan,
Mark Rudelson,
Adam Smith,
Jonathan Ullman
STOC 2010
Abstract:
Marginal (contingency) tables are the method of choice for government
agencies releasing statistical summaries of categorical data. In this
paper, we derive lower bounds on how much distortion (noise) is necessary
in these tables to ensure the privacy of sensitive data. We extend a line
of recent work on impossibility results for private data analysis to a
natural and important class of functionalities. Consider a database
consisting of $n$ rows (one per individual), each row comprising $d$
binary attributes. For any subset of $T$ attributes of size $T = k$,
the marginal table for $T$ has $2^k$ entries; each entry counts how many
times in the database a particular setting of these attributes occurs.
We provide lower bounds for releasing all $\binom{d}{k}$ $k$attribute
marginal tables under several different notions of privacy.
We give efficient polynomial time attacks which allow an adversary to
reconstruct sensitive information given insufficiently perturbed marginal
table releases. In particular, for a constant $k$, we obtain a tight bound
of $\Omega(\min{n^{1/2}, d^{(k1)/2}})$ on the average distortion per entry
for any mechanism that releases all $k$attribute marginals while providing
attribute privacy (a weak notion implied by most privacy definitions).
Our reconstruction attacks require a new lower bound on the least
singular value of a random matrix with correlated rows. Let $M(k)$ be a
matrix with $\binom{d}{k}$ rows formed by taking all possible $k$way
entrywise products of an underlying set of $d$ random vectors from
$\{0,1\}^n$. For constant $k$, we show that the least singular value of
$M(k)$ is $\Omega(d^{k/2}) with high probability (the same asymptotic
bound as for independent rows).
We obtain stronger lower bounds for marginal tables satisfying
differential privacy. We give a lower bound of $\Omega(\min{n^{1/2}, d^{k/2}})$,
which is tight for $n = \Omega(d^k)$. We extend our analysis to obtain stronger
results for mechanisms that add instanceindependent noise and weaker results
when $k$ is superconstant.
