# Jonathan Ullman

Assistant Professor
College of Computer and Information Sciences
Northeastern University

260 West Village H
440 Huntington Avenue
Boston, MA, 02115

e-mail: jullman [at] ccs [dot] neu [dot] edu

I am broadly interested in theoretical computer science. Much of my work addresses questions like: when and how can we analyze a dataset while ensuring privacy for the individuals in the dataset, and how can we prevent false discovery in the empirical sciences? I study these questions and others using tools from cryptography, computational learning theory, algorithms, and game theory.

Since Fall 2015, I have been an assistant professor in the College of Computer and Information Sciences at Northeastern University, where I am affiliated with Northeastern's theory and security research groups. I am also a senior researcher on the Harvard University Privacy Tools Project.

Prior to joining Northeastern, I completed my PhD at Harvard University under the supervision of Salil Vadhan. I was also a postdoctoral fellow in the Center for Research on Computation and Society and a member of the inaugural class of junior fellows in the Simons Society of Fellows.

If you are a bright and motivated student interested in theoretical computer science, I highly encourage you to apply to Northeastern! See here for more information.

# Recent/Selected Research Papers   [All Papers]   [Selected Papers]   [Google Scholar]

### The Price of Selection in Differential Privacy *(New!)*

Mitali Bafna, Jonathan Ullman. Manuscript.
Links: arXiv

Abstract: In the differentially private top-$k$ selection problem, we are given a dataset $X \in \{\pm 1\}^{n \times d}$, in which each row belongs to an individual and each column corresponds to some binary attribute, and our goal is to find a set of $k \ll d$ columns whose means are approximately as large as possible. Differential privacy requires that our choice of these $k$ columns does not depend too much on any one individual's data. This problem can be solved using the well-known exponential mechanism and composition properties of differential privacy. In the high-accuracy regime, where we require the error of the selection procedure to be smaller than the so-called sampling error $\alpha \approx \sqrt{\ln(d)/n}$, this procedure succeeds given a dataset of size $n \gtrsim k \ln(d)$. We prove a matching lower bound, showing that a dataset of size $n \gtrsim k \ln(d)$ is necessary for private top-$k$ selection in this high-accuracy regime. Our lower bound is the first to show that selecting the $k$ largest columns requires more data than simply estimating the value of those $k$ columns, which can be done using a dataset of size just $n \gtrsim k$.

### Subgaussian Concentration via Stability Arguments

Thomas Steinke, Jonathan Ullman. Manuscript.
Links: arXiv

Abstract: Sums of independent, bounded random variables concentrate around their expectation approximately as well as a Gaussian of the same variance. Well-known results of this form include the Bernstein, Hoeffding, and Chernoff inequalities and many others. We present an alternative proof of these tail bounds based on what we call a stability argument, which avoids bounding the moment generating function or higher-order moments of the distribution. Our stability argument is inspired by recent work on the generalization properties of differential privacy and their connection to adaptive data analysis (Bassily et al., STOC 2016).

### Set Cover in Sublinear Time

Piotr Indyk, Sepideh Mahabadi, Ronitt Rubinfeld, Jonathan Ullman, Ali Vakilian, Anak Yodpinyanee. Manuscript.
Links: Coming soon!

Abstract: We study the classic set cover problem from the perspective of sublinear-time algorithms. We consider both the integral and the fractional variants of this problem. Given access to a collection of $m$ sets over $n$ elements, for the integral variant we show an adaptation of the streaming algorithm of Har-Peled et al. to the sublinear query model that returns an $\alpha$-approximate cover using $\tilde{O}(m(n/k)^{1/(\alpha-1)} + nk)$ queries to the input, where $k$ denotes the value of a minimum set cover. We complement this result by proving that for low values of $k$, the required number of queries is $\tilde{\Omega}(m(n/k)^{1/(2\alpha)})$. Moreover, we prove that even checking whether a given collection of sets covers all the elements would require $\Omega(nk)$ queries. These two lower bounds provide strong evidence that the upper bound is almost tight for certain values of the parameter $k$. On the other hand, we show that this bound is not optimal for larger values of the parameter $k$, as there exists a $(1+\varepsilon)$-approximation algorithm with $\tilde{O}(mn/k)$ queries.

For the fractional variant of set cover, we show three sublinear-time $(1+\varepsilon)$-approximate algorithms. The first two algorithms are most efficient when $k$ is small. Specifically, one algorithm has $\tilde{O}({mk / \varepsilon^{5}}+{nk / \varepsilon^2})$ query complexity and runtime, which (for constant $\varepsilon$) improves over the best prior bound of $\tilde{O}( (m + n) k^2/\varepsilon^2)$ due to Koufogiannakis and Young by (roughly) a factor of $k$. Our second algorithm reduces the number of queries further by a factor of (roughly) $\sqrt{k}$ while increasing the runtime by the same factor. Our third algorithm is most efficient when $k$ is large, and uses $O(mn /k\varepsilon)$ queries and similar running time. We also show that, for constant values of $k$, the query complexity of any algorithm for fractional set cover must depend linearly on both $m$ and $n$.

The first two algorithms for fractional set cover use the Multiplicative Weight Update (MWU) framework, which iteratively identifies unsatisfied constraints and re-calibrates their "importance". The first algorithm implements each constraint-checking round separately using random sampling. Our second algorithm reduces the number of queries by re-using the random samples in multiple rounds. Since the MWU algorithm updates the constraints in an adaptive fashion, we employ the adaptive data analysis framework of Dwork et al. and Bassily et al., which allows us to control the probabilistic dependencies. To the best of our knowledge, this is the first application of the framework to the design of sublinear algorithms. Our lower-bound results follow by carefully designing two distributions of instances that are hard to distinguish. In particular, our first lower bound involves a probabilistic construction of a certain set system with minimum set cover of size $\alpha k$, with the key property that a small number of "almost uniformly distributed" modifications can reduce the minimum set cover size down to $k$. Thus, these modifications are not detectable unless a large number of queries are asked. We believe that our probabilistic construction technique may find applications to lower bounds for other combinatorial optimization problems. We remark that all of our algorithms report an actual set cover (not just its size), while our lower bounds hold even for estimating the cover size.
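The baseline algorithm from the top-$k$ selection abstract above, repeatedly applying the exponential mechanism and relying on basic composition, can be sketched in a few lines. This is an illustrative sketch only, not code from the paper; the function name and the Gumbel-noise implementation of the exponential mechanism are my own choices.

```python
import numpy as np

def private_top_k(X, k, eps, rng=None):
    """Peeling exponential mechanism for differentially private top-k selection.

    X is an n x d array with entries in {-1, +1}; the goal is to pick k columns
    whose means are approximately largest. Each of the k selections runs the
    exponential mechanism with budget eps/k (implemented via the Gumbel-noise
    trick: argmax of score + Gumbel(2*sensitivity/eps') samples proportionally
    to exp(eps' * score / (2*sensitivity))). By basic composition, the whole
    procedure is eps-differentially private.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    means = X.mean(axis=0)       # utility of each column
    sensitivity = 2.0 / n        # changing one row moves a column mean by <= 2/n
    scale = 2 * sensitivity / (eps / k)
    chosen, remaining = [], list(range(d))
    for _ in range(k):
        noisy = means[remaining] + rng.gumbel(scale=scale, size=len(remaining))
        j = remaining[int(np.argmax(noisy))]
        chosen.append(j)
        remaining.remove(j)
    return chosen
```

With $n \gtrsim k \ln(d)$ samples the noise scale stays below the sampling error, which is exactly the regime the paper's lower bound shows is necessary.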
### Multidimensional Dynamic Pricing for Welfare Maximization

Aaron Roth, Aleksandrs Slivkins, Jonathan Ullman, Zhiwei Steven Wu. Manuscript.
Links: arXiv

Abstract: We study the problem of a seller dynamically pricing $d$ distinct types of goods, when faced with the online arrival of buyers drawn independently from an unknown distribution. The seller observes only the bundle of goods purchased each day, but nothing else about the buyer's valuation function. When buyers have strongly concave, Hölder-continuous valuation functions, we give a pricing scheme that finds a pricing that optimizes welfare (including the seller's cost of production) in time and number of rounds that are polynomial in $d$ and the accuracy parameter. We are able to do this despite the fact that (i) welfare is a non-concave function of the prices, and (ii) the welfare is not observable to the seller. We also extend our results to a limited-supply setting.

## 2017

### Make Up Your Mind: The Price of Online Queries in Differential Privacy

Mark Bun, Thomas Steinke, Jonathan Ullman. SODA 2017.
Links: arXiv

Abstract: We consider the problem of answering queries about a sensitive dataset subject to differential privacy. The queries may be chosen adversarially from a larger set $Q$ of allowable queries in one of three ways, which we list in order from easiest to hardest to answer:

- Offline: The queries are chosen all at once and the differentially private mechanism answers the queries in a single batch.
- Online: The queries are chosen all at once, but the mechanism only receives the queries in a streaming fashion and must answer each query before seeing the next query.
- Adaptive: The queries are chosen one at a time and the mechanism must answer each query before the next query is chosen. In particular, each query may depend on the answers given to previous queries.

Many differentially private mechanisms are just as effective in the adaptive model as they are in the offline model. Meanwhile, most lower bounds for differential privacy hold in the offline setting. This suggests that the three models may be equivalent. We prove that these models are all, in fact, distinct. Specifically, we show that there is a family of statistical queries such that exponentially more queries from this family can be answered in the offline model than in the online model. We also exhibit a family of search queries such that exponentially more queries from this family can be answered in the online model than in the adaptive model. We also investigate whether such separations might hold for simple queries like threshold queries over the real line.

## 2016

### Computing Marginals Using MapReduce

Foto Afrati, Shantanu Sharma, Jeffrey D. Ullman, Jonathan Ullman. IDEAS 2016.
Links: arXiv

Abstract: We consider the problem of computing the data-cube marginals of a fixed order $k$ (i.e., all marginals that aggregate over $k$ dimensions), using a single round of MapReduce. The focus is on the relationship between the reducer size (the number of inputs allowed at a single reducer) and the replication rate (the number of reducers to which an input is sent). We show that the replication rate is minimized when the reducers receive all the inputs necessary to compute one marginal of higher order. That observation lets us view the problem as one of covering sets of $k$ dimensions with sets of a larger size $m$, a problem that has been studied under the name "covering numbers." We offer a number of constructions that, for different values of $k$ and $m$, meet or come close to yielding the minimum possible replication rate for a given reducer size.

### Exposed! A Survey of Attacks on Private Data

Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman. Annual Review of Statistics and Its Application.
Links: Coming soon!

Abstract: Privacy-preserving statistical data analysis addresses the general question of protecting privacy when publicly releasing information about a sensitive dataset.
A privacy attack takes seemingly innocuous released information and uses it to discern the private details of individuals, thus demonstrating that such information compromises privacy. For example, "re-identification attacks" have shown that it is easy to link supposedly de-identified records to the identity of the individual concerned. This survey focuses on attacking aggregate data, such as statistics about how many individuals have a certain disease, genetic trait, or combination thereof. We consider two types of attacks: reconstruction attacks, which approximately determine a sensitive feature of all the individuals covered by the dataset, and tracing attacks, which determine whether or not a target individual's data is included in the dataset. We also discuss techniques from the differential privacy literature for releasing approximate aggregate statistics whilst provably thwarting any privacy attack.

### Privacy Odometers and Filters: Pay-as-you-Go Composition

Ryan Rogers, Aaron Roth, Jonathan Ullman, Salil Vadhan. NIPS 2016.
Links: arXiv

Abstract: In this paper we initiate the study of adaptive composition in differential privacy when the length of the composition, and the privacy parameters themselves, can be chosen adaptively as a function of the outcomes of previously run analyses. This case is much more delicate than the setting covered by existing composition theorems, in which the algorithms themselves can be chosen adaptively but the privacy parameters must be fixed up front. Indeed, it isn't even clear how to define differential privacy in the adaptive-parameter setting. We proceed by defining two objects that cover the two main use cases of composition theorems. A privacy filter is a stopping-time rule that allows an analyst to halt a computation before his pre-specified privacy budget is exceeded. A privacy odometer allows the analyst to track realized privacy loss as he goes, without needing to pre-specify a privacy budget. We show that, unlike the case in which privacy parameters are fixed, in the adaptive-parameter setting these two use cases are distinct. We show that there exist privacy filters with bounds comparable (up to constants) with existing privacy composition theorems. We also give a privacy odometer that nearly matches non-adaptive privacy composition theorems, but is sometimes worse by a small asymptotic factor. Moreover, we show that this is inherent: any valid privacy odometer in the adaptive-parameter setting must lose this factor, which gives a formal separation between the filter and odometer use cases.

### Strong Hardness of Privacy from Weak Traitor Tracing

Lucas Kowalczyk, Tal Malkin, Jonathan Ullman, Mark Zhandry. TCC 2016-B.
Links: arXiv, ePrint

Abstract: Despite much study, the computational complexity of differential privacy remains poorly understood. In this paper we consider the computational complexity of accurately answering a family $Q$ of statistical queries over a data universe $X$ under differential privacy. A statistical query on a dataset $D \in X^n$ asks "what fraction of the elements of $D$ satisfy a given predicate $p$ on $X$?" Dwork et al. (STOC '09) and Boneh and Zhandry (CRYPTO '14) showed that if both $Q$ and $X$ are of polynomial size, then there is an efficient differentially private algorithm that accurately answers all the queries, and if both $Q$ and $X$ are of exponential size, then under a plausible assumption no efficient algorithm exists. We show that, under the same assumption, if either the number of queries or the data universe is of exponential size, then there is no differentially private algorithm that answers all the queries.
Specifically, we prove that if one-way functions and indistinguishability obfuscation exist, then:

- For every $n$, there is a family $Q$ of $O(n^7)$ queries on a data universe $X$ of size $2^d$ such that no $\mathrm{poly}(n,d)$-time differentially private algorithm takes a dataset $D \in X^n$ and outputs accurate answers to every query in $Q$.
- For every $n$, there is a family $Q$ of $2^d$ queries on a data universe $X$ of size $O(n^7)$ such that no $\mathrm{poly}(n,d)$-time differentially private algorithm takes a dataset $D \in X^n$ and outputs accurate answers to every query in $Q$.

In both cases the result is nearly quantitatively tight, since there is an efficient differentially private algorithm that answers $\tilde{\Omega}(n^2)$ queries on an exponential-size data universe, and one that answers exponentially many queries on a data universe of size $\tilde{\Omega}(n^2)$. Our proofs build on the connection between hardness results in differential privacy and traitor-tracing schemes (Dwork et al., STOC '09; Ullman, STOC '13). We prove our hardness result for a polynomial-size query set (resp., data universe) by showing that it follows from the existence of a special type of traitor-tracing scheme with very short ciphertexts (resp., secret keys) but very weak security guarantees, and then constructing such a scheme.

### An Anti-Folk Theorem for Large Repeated Games with Imperfect Monitoring

Mallesh Pai, Aaron Roth, Jonathan Ullman. Transactions on Economics and Computation.
Links: arXiv

Abstract: We study infinitely repeated games in settings of imperfect monitoring. We first prove a family of theorems showing that when the signals observed by the players satisfy a condition known as differential privacy, the folk theorem has little bite: for sufficiently small values of the privacy parameters, and for a fixed discount factor, any equilibrium of the repeated game involves players playing approximate equilibria of the stage game in every period. Next, we argue that in large games ($n$-player games in which unilateral deviations by single players have only a small impact on the utility of other players), many monitoring settings naturally lead to signals that satisfy differential privacy, with privacy parameters tending to zero as the number of players $n$ grows large. We conclude that in such settings, the set of equilibria of the repeated game collapses to the set of equilibria of the stage game.

### Some Pairs Problems

Jeffrey D. Ullman, Jonathan Ullman. BeyondMR 2016.
Links: arXiv

Abstract: A common form of MapReduce application involves discovering relationships between certain pairs of inputs. Similarity joins serve as a good example of this type of problem, which we call a "some-pairs" problem. In the framework of Afrati et al. (VLDB 2013), algorithms are measured by the tradeoff between reducer size (the maximum number of inputs a reducer can handle) and the replication rate (the average number of reducers to which an input must be sent). There are two obvious approaches to solving some-pairs problems in general. We show that no general-purpose MapReduce algorithm can beat both of these two algorithms in the worst case. We then explore a recursive algorithm for solving some-pairs problems and heuristics for beating the lower bound on common instances of the some-pairs class of problems.

### Space Lower Bounds for Itemset Frequency Sketches

Edo Liberty, Michael Mitzenmacher, Justin Thaler, Jonathan Ullman. PODS 2016.
Links: arXiv

Abstract: Given a database, computing the fraction of rows that contain a query itemset, or determining whether this fraction is above some threshold, are fundamental operations in data mining. A uniform sample of rows is a good sketch of the database in the sense that all sufficiently frequent itemsets and their approximate frequencies are recoverable from the sample, and the sketch size is independent of the number of rows in the original database.
For many seemingly similar problems there are better sketching algorithms than uniform sampling. In this paper we show that for itemset frequency sketching this is not the case. That is, we prove that there exist classes of databases for which uniform sampling is nearly space optimal.

### Algorithmic Stability for Adaptive Data Analysis

Raef Bassily, Kobbi Nissim, Uri Stemmer, Adam Smith, Thomas Steinke, Jonathan Ullman. STOC 2016; invited to the special issue of SICOMP. This work subsumes the two previous manuscripts [BSSU15] and [NS15].
Links: arXiv

Abstract: Adaptivity is an important feature of data analysis: the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC '15) and Hardt and Ullman (FOCS '14) initiated the formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis. Specifically, suppose there is an unknown distribution ${\bf P}$ and a set of $n$ independent samples ${\bf x}$ is drawn from ${\bf P}$. We seek an algorithm that, given ${\bf x}$ as input, accurately answers a sequence of adaptively chosen "queries" about the unknown distribution ${\bf P}$. How many samples $n$ must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy? In this work we make two new contributions towards resolving this question:

- We give upper bounds on the number of samples $n$ that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC 2015), and have been applied in subsequent work by those authors (Science, 2015; NIPS, 2015).
- We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries (alternatively, risk minimization queries).

As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that stable algorithms of the kind guaranteed by differential privacy imply low generalization error. We also show that weaker stability guarantees, such as bounded KL divergence and total variation distance, lead to correspondingly weaker generalization guarantees.

### Watch and Learn: Optimizing from Revealed Preferences Feedback

Aaron Roth, Jonathan Ullman, Zhiwei Steven Wu. STOC 2016.
Links: arXiv

Abstract: A Stackelberg game is played between a leader and a follower. The leader first chooses an action, then the follower plays his best response, and the goal of the leader is to pick the action that will maximize his payoff given the follower's best response. Stackelberg games capture, for example, the following interaction between a producer and a consumer. The producer chooses the prices of the goods he produces, and then a consumer chooses to buy a utility-maximizing bundle of goods. The goal of the seller here is to set prices to maximize his profit: his revenue, minus the production cost of the purchased bundle. It is quite natural that the seller in this example should not know the buyer's utility function. However, he does have access to revealed-preference feedback: he can set prices, and then observe the purchased bundle and his own profit. We give algorithms for efficiently solving, in terms of both computation and query complexity, a broad class of Stackelberg games in which the follower's utility function is unknown, using only revealed-preference access to it.
This class includes, in particular, the profit maximization problem, as well as the optimal tolling problem in non-atomic congestion games when the latency functions are unknown. Surprisingly, we are able to solve these problems even though the optimization problems are non-convex in the leader's actions.

## 2015

### When Can Limited Randomness Be Used in Repeated Games?

Pavel Hubacek, Moni Naor, Jonathan Ullman. SAGT 2015; TCS special issue for selected papers from SAGT 2014/15.
Links: arXiv

Abstract: The central result of classical game theory states that every finite normal-form game has a Nash equilibrium, provided that players are allowed to use randomized (mixed) strategies. However, in practice, humans are known to be bad at generating random-like sequences, and true random bits may be unavailable. Even if the players have access to enough random bits for a single instance of the game, their randomness might be insufficient if the game is played many times. In this work, we ask whether randomness is necessary for equilibria to exist in finitely repeated games. We show that for a large class of games containing arbitrary two-player zero-sum games, approximate Nash equilibria of the $n$-stage repeated version of the game exist if and only if both players have $\Omega(n)$ random bits. In contrast, we show that there exists a class of games for which no equilibrium exists in pure strategies, yet the $n$-stage repeated version of the game has an exact Nash equilibrium in which each player uses only a constant number of random bits. When the players are assumed to be computationally bounded, if cryptographic pseudorandom generators (or, equivalently, one-way functions) exist, then the players can base their strategies on "random-like" sequences derived from only a small number of truly random bits. We show that, in contrast, in repeated two-player zero-sum games, if pseudorandom generators do not exist, then $\Omega(n)$ random bits remain necessary for equilibria to exist.
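The role randomness plays in the repeated-games result above can be seen already in the simplest zero-sum game. The snippet below is a small illustration of my own (not from the paper): matching pennies has no pure Nash equilibrium, and at its unique mixed equilibrium each player flips a fair coin, so playing the $n$-stage repetition at equilibrium naturally consumes on the order of $n$ random bits per player.

```python
import numpy as np

# Matching pennies: a two-player zero-sum game with no pure Nash equilibrium.
# A[i, j] is the row player's payoff; the column player receives -A[i, j].
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

p = np.array([0.5, 0.5])  # row player's mixed strategy (one fair coin flip)
q = np.array([0.5, 0.5])  # column player's mixed strategy

# At a mixed equilibrium, every pure strategy in the support is an equally
# good response, so neither player can gain by deviating deterministically.
row_payoffs = A @ q          # payoff of each pure row strategy against q
col_payoffs = -(p @ A)       # payoff of each pure column strategy against p
assert np.allclose(row_payoffs, row_payoffs[0])
assert np.allclose(col_payoffs, col_payoffs[0])
```

No pure strategy does better than any other against the uniform mix, which is exactly why a player with too few random bits becomes predictable and exploitable in the repeated game.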
### Robust Traceability from Trace Amounts

Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, Salil Vadhan. FOCS 2015.
Links: Harvard

Abstract: The privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group. In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack requires only that the vector of distorted summary statistics is close to the vector of true marginals in $\ell_1$-norm. Moreover, the reference pool required by previous attacks can be replaced by a single sample drawn from the underlying population. The new attack, which is not specific to genomics and which handles Gaussian as well as Bernoulli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.

### Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery

Thomas Steinke, Jonathan Ullman. COLT 2015.
Links: arXiv

Abstract: We show an essentially tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given $n$ samples from an unknown distribution.
A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is "close" to the correct expectation over the distribution. This question was recently studied by Dwork et al. (STOC 2015), who showed how to answer $\tilde{\Omega}(n^2)$ queries efficiently, and also by Hardt and Ullman (FOCS 2014), who showed that answering $\tilde{O}(n^3)$ queries is hard. We close the gap between the two bounds and show that, under a standard hardness assumption, there is no computationally efficient algorithm that, given $n$ samples from an unknown distribution, can give valid answers to $O(n^2)$ adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private. We obtain our results using a new connection between the problem of answering adaptively chosen statistical queries and a combinatorial object called an interactive fingerprinting code (Fiat and Tassa). In order to optimize our hardness result, we give a new Fourier-analytic approach to analyzing fingerprinting codes that is simpler, more flexible, and yields better parameters than previous constructions.

### Inducing Approximately Optimal Flow Using Truthful Mediators

Ryan Rogers, Aaron Roth, Jonathan Ullman, Zhiwei Steven Wu. EC 2015.
Links: arXiv

Abstract: We revisit a classic coordination problem from the perspective of mechanism design: how can we coordinate a social-welfare-maximizing flow in a network congestion game with selfish players? The classical approach, which computes tolls as a function of known demands, fails when the demands are unknown to the mechanism designer, and naively eliciting them does not necessarily yield a truthful mechanism. Instead, we introduce a weak mediator that can provide suggested routes to players and set tolls as a function of reported demands. However, players can choose to ignore or misreport their type to this mediator. Using techniques from differential privacy, we show how to design a weak mediator such that it is an asymptotic ex-post Nash equilibrium for all players to truthfully report their types to the mediator and faithfully follow its suggestions, and that when they do, they end up playing a nearly optimal flow. Notably, our solution works in settings of incomplete information even in the absence of a prior distribution on player types. Along the way, we develop new techniques for privately solving convex programs, which may be of independent interest.

### Between Pure and Approximate Differential Privacy

Thomas Steinke, Jonathan Ullman. TPDP 2015; Journal of Privacy and Confidentiality.
Links: arXiv, Journal

Abstract: We show a new lower bound on the sample complexity of $(\varepsilon, \delta)$-differentially private algorithms that accurately answer statistical queries on high-dimensional databases. The novelty of our bound is that it depends optimally on the parameter $\delta$, which loosely corresponds to the probability that the algorithm fails to be private, and is the first to smoothly interpolate between approximate differential privacy ($\delta > 0$) and pure differential privacy ($\delta = 0$). Specifically, we consider a database $D \in \{\pm 1\}^{n \times d}$ and its one-way marginals, which are the $d$ queries of the form "What fraction of individual records have the $i$-th bit set to $+1$?" We show that in order to answer all of these queries to within error $\alpha$ (on average) while satisfying $(\varepsilon, \delta)$-differential privacy, it is necessary that $$n \geq \Omega\left(\frac{\sqrt{d \log(1/\delta)}}{\alpha \varepsilon} \right),$$ which is optimal up to constant factors. To prove our lower bound, we build on the connection between fingerprinting codes and lower bounds in differential privacy (Bun, Ullman, and Vadhan, STOC '14).
In addition to our lower bound, we give new purely and approximately differentially private algorithms for answering arbitrary statistical queries that improve on the sample complexity of the standard Laplace and Gaussian mechanisms for achieving worst-case accuracy guarantees by a logarithmic factor. Private Multiplicative Weights Beyond Linear Queries Jonathan Ullman PODS 2015 Links: arXiv Abstract: A wide variety of fundamental data analyses in machine learning, such as linear and logistic regression, require minimizing a convex function defined by the data. Since the data may contain sensitive information about individuals, and these analyses can leak that sensitive information, it is important to be able to solve convex minimization in a privacy preserving way. A series of recent results show how to accurately solve a single convex minimization problem in a differentially private manner. However, the same data is often analyzed repeatedly, and little is known about solving multiple convex minimization problems with differential privacy. For simpler data analyses, such as linear queries, there are remarkable differentially private algorithms such as the private multiplicative weights mechanism (Hardt and Rothblum, FOCS 2010) that accurately answer exponentially many distinct queries. In this work, we extend these results to the case of convex minimization and show how to give accurate and differentially private solutions to exponentially many convex minimization problems on a sensitive dataset. 2014 Preventing False Discovery in Interactive Data Analysis is Hard Moritz Hardt, Jonathan Ullman FOCS 2014 Links: arXiv Abstract: We show that, under a standard hardness assumption, there is no computationally efficient algorithm that given $n$ samples from an unknown distribution can give valid answers to $n^{3+o(1)}$ adaptively chosen statistical queries. 
A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is valid if it is "close" to the correct expectation over the distribution. Our result stands in stark contrast to the well-known fact that exponentially many statistical queries can be answered validly and efficiently if the queries are chosen non-adaptively (no query may depend on the answers to previous queries). Moreover, Dwork et al. showed how to accurately answer exponentially many adaptively chosen statistical queries via a computationally inefficient algorithm. They also gave an efficient algorithm that can answer $\tilde{\Omega}(n^2)$ adaptively chosen queries, which shows our result is almost quantitatively tight. Conceptually, our result demonstrates that achieving statistical validity alone can be a source of computational intractability in adaptive settings. For example, in the modern large collaborative research environment, data analysts typically choose a particular approach based on previous findings. False discovery occurs if a research finding is supported by the data but not by the underlying distribution. While the study of preventing false discovery in statistics is decades old, to the best of our knowledge our result is the first to demonstrate a computational barrier. In particular, our result suggests that the perceived difficulty of preventing false discovery in today's collaborative research environment may be inherent. Privately Solving Linear Programs Justin Hsu, Aaron Roth, Tim Roughgarden, Jonathan Ullman ICALP 2014 Track A Links: arXiv Abstract: In this paper, we initiate the systematic study of solving linear programs under differential privacy. The first step is simply to define the problem: to this end, we introduce several natural classes of private linear programs that capture different ways sensitive data can be incorporated into a linear program.
For each class of linear programs we give an efficient, differentially private solver based on the multiplicative weights framework, or we give an impossibility result. Fingerprinting Codes and the Price of Approximate Differential Privacy Mark Bun, Jonathan Ullman, Salil Vadhan STOC 2014; SICOMP special issue for selected papers from STOC 2014 Links: arXiv Abstract: We show new lower bounds on the sample complexity of $(\varepsilon, \delta)$-differentially private algorithms that accurately answer large sets of counting queries. A counting query on a database $D \in (\{0,1\}^d)^n$ has the form "What fraction of the individual records in the database satisfy the property $q$?" We show that in order to answer an arbitrary set $\mathcal{Q}$ of $\gg nd$ counting queries on $D$ to within error $\pm \alpha$, it is necessary that $$n \geq \tilde{\Omega}\left( \frac{\sqrt{d} \log |\mathcal{Q}|}{\alpha^2 \varepsilon} \right).$$ This bound is optimal up to poly-logarithmic factors, as demonstrated by the Private Multiplicative Weights algorithm (Hardt and Rothblum, FOCS '10). In particular, our lower bound is the first to show that the sample complexity required for accuracy and $(\varepsilon, \delta)$-differential privacy is asymptotically larger than what is required merely for accuracy, which is $O(\log |\mathcal{Q}| / \alpha^2)$. In addition, we show that our lower bound holds for the specific case of $k$-way marginal queries (where $|\mathcal{Q}| = 2^k \binom{d}{k}$) when $\alpha$ is a constant. Our results rely on the existence of short fingerprinting codes (Boneh and Shaw, CRYPTO '95; Tardos, STOC '03), which we show are closely connected to the sample complexity of differentially private data release. We also give a new method for combining certain types of sample complexity lower bounds into stronger lower bounds.
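The counting-query setting above is easy to make concrete. Below is a minimal Python sketch (my own illustration, not any paper's algorithm) of the baseline Laplace mechanism for a batch of counting queries; its $k/(\varepsilon n)$ noise scale is the trivial baseline that mechanisms like Private Multiplicative Weights improve on, and the quantity that lower bounds of this kind constrain. All names and parameters are illustrative.

```python
import numpy as np

def laplace_counting_queries(D, queries, epsilon, rng):
    """Answer a batch of counting queries on the 0/1 database D (one
    row per individual) with (epsilon, 0)-differential privacy.

    Each query is a predicate on a record; its true answer is the
    fraction of records satisfying it, so it has sensitivity 1/n.
    By basic composition, Laplace noise of scale k/(epsilon * n)
    per answer suffices for k queries.
    """
    n, k = len(D), len(queries)
    true_answers = np.array([np.mean([q(x) for x in D]) for q in queries])
    return true_answers + rng.laplace(scale=k / (epsilon * n), size=k)

rng = np.random.default_rng(0)
D = rng.integers(0, 2, size=(1000, 5))             # n = 1000 records, d = 5 bits
queries = [lambda x, i=i: x[i] for i in range(5)]  # the 5 one-way marginals
answers = laplace_counting_queries(D, queries, epsilon=1.0, rng=rng)
```

With these parameters the added noise (scale $5/1000$) is far below the sampling error, which is the regime the lower bounds above address.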
Faster Private Release of Marginals on Small Databases Karthekeyan Chandrasekaran, Justin Thaler, Jonathan Ullman, Andrew Wan ITCS 2014 Links: arXiv Abstract: We study the problem of answering $k$-way marginal queries on a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of the database's records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses on a dataset, and designing efficient algorithms for privately answering marginal queries has been identified as an important open problem in private data analysis. For any $k$, we give a differentially private online algorithm that runs in time $$\mathrm{poly}\left(n, 2^{o(d)} \right)$$ per query and answers any sequence of $\mathrm{poly}(n)$ many $k$-way marginal queries with error at most $\pm 0.01$ on every query, provided $n \gtrsim d^{0.51}$. To the best of our knowledge, this is the first algorithm capable of privately answering marginal queries with a non-trivial worst-case accuracy guarantee for databases containing $\mathrm{poly}(d, k)$ records in time $\exp(o(d))$. Our algorithm runs the private multiplicative weights algorithm (Hardt and Rothblum, FOCS '10) on a new approximate polynomial representation of the database. We derive our representation for the database by approximating the OR function restricted to low Hamming weight inputs using low-degree polynomials with coefficients of bounded $L_1$-norm. In doing so, we show new upper and lower bounds on the degree of such polynomials, which may be of independent approximation-theoretic interest. 
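The $k$-way marginal queries in the two entries above are, separately from any private mechanism for answering them, just conditional counts over a 0/1 database. A minimal sketch, with illustrative names:

```python
import numpy as np

def marginal_query(D, columns, values):
    """Fraction of rows of the 0/1 database D whose entries in the
    given columns match the given values -- a |columns|-way marginal."""
    D = np.asarray(D)
    mask = np.all(D[:, columns] == np.asarray(values), axis=1)
    return mask.mean()

D = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 1, 0],
              [1, 1, 0]])
# 2-way marginal: fraction of records with column 0 = 1 and column 2 = 1
print(marginal_query(D, columns=[0, 2], values=[1, 1]))  # -> 0.5
```

The difficulty the papers address is not evaluating one such query but privately releasing answers to all $d^{\Theta(k)}$ of them at once.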
Robust Mediators in Large Games Michael Kearns, Mallesh Pai, Ryan Rogers, Aaron Roth, Jonathan Ullman ITCS 2014 This work subsumes the two previous papers [KPRU14] and [RR14] Links: arXiv Abstract: A mediator is a mechanism that can only suggest actions to players, as a function of all agents' reported types, in a given game of incomplete information. We study what is achievable by two kinds of mediators, "strong" and "weak." Players can choose to opt out of using a strong mediator but cannot misrepresent their type if they opt in. Such a mediator is "strong" because we can view it as having the ability to verify player types. Weak mediators lack this ability---players are free to misrepresent their type to a weak mediator. We show a striking result---in a prior-free setting, assuming only that the game is large and players have private types, strong mediators can implement approximate equilibria of the complete-information game. If the game is a congestion game, then the same result holds using only weak mediators. Our result follows from a novel application of differential privacy, in particular, a variant we propose called joint differential privacy. 2013 Differential Privacy for the Analyst via Private Equilibrium Computation Justin Hsu, Aaron Roth, Jonathan Ullman STOC 2013 Links: arXiv Abstract: We give new mechanisms for answering exponentially many queries from multiple analysts on a private database, while protecting differential privacy both for the individuals in the database and for the analysts. That is, our mechanism's answer to each query is nearly insensitive to changes in the queries asked by other analysts. Our mechanism is the first to offer differential privacy on the joint distribution over analysts' answers, providing privacy for data analysts even if the other data analysts collude or register multiple accounts.
In some settings, we are able to achieve nearly optimal error rates (even compared to mechanisms which do not offer analyst privacy), and we can extend our techniques to handle non-linear queries. Our analysis is based on a novel view of the private query-release problem as a two-player zero-sum game, which may be of independent interest. Answering $n^{2+o(1)}$ Counting Queries with Differential Privacy is Hard Jonathan Ullman STOC 2013; SICOMP special issue for selected papers from STOC 2013 Links: arXiv Abstract: A central problem in differentially private data analysis is how to design efficient algorithms capable of answering large numbers of counting queries on a sensitive database. Counting queries are of the form "What fraction of individual records in the database satisfy the property $q$?" We prove that if one-way functions exist, then there is no algorithm that takes as input a database $D \in (\{0,1\}^d)^n$, and $k = \tilde{\Theta}(n^2)$ arbitrary efficiently computable counting queries, runs in time $\mathrm{poly}(d, n)$, and returns an approximate answer to each query, while satisfying differential privacy. We also consider the complexity of answering "simple" counting queries, and make some progress in this direction by showing that the above result holds even when we require that the queries are computable by constant-depth ($AC^0$) circuits. Our result is almost tight because it is known that $\tilde{\Omega}(n^2)$ counting queries can be answered efficiently while satisfying differential privacy. Moreover, many more than $n^2$ queries (even exponentially many in $n$) can be answered in exponential time. We prove our results by extending the connection between differentially private query release and cryptographic traitor-tracing schemes to the setting where the queries are given to the algorithm as input, and by constructing a traitor-tracing scheme that is secure in this setting.
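The $\tilde{\Theta}(n^2)$ boundary in the hardness result above matches a folklore upper-bound calculation: answering $k$ counting queries of sensitivity $1/n$ with the Gaussian mechanism and advanced composition incurs per-query error on the order of $\sqrt{k \log(1/\delta)}/(\varepsilon n)$, which stays below a target error $\alpha$ only while $k \lesssim \alpha^2 \varepsilon^2 n^2 / \log(1/\delta)$. A back-of-the-envelope sketch, with constants omitted and all names mine:

```python
import math

def max_queries(n, alpha=0.1, eps=1.0, delta=1e-6):
    """Largest k with sqrt(k * log(1/delta)) / (eps * n) <= alpha,
    i.e. the rough query capacity of the Gaussian mechanism with
    advanced composition for counting queries of sensitivity 1/n
    (constants omitted)."""
    return int(alpha**2 * eps**2 * n**2 / math.log(1 / delta))

# the capacity grows quadratically in n, matching the ~n^2 threshold
print(max_queries(1_000), max_queries(10_000))
```

Doubling $n$ quadruples the capacity, so efficient mechanisms of this kind handle $\tilde{\Theta}(n^2)$ queries, and the hardness result shows this is essentially the best any efficient algorithm can do for arbitrary queries.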
2012 Faster Algorithms for Privately Releasing Marginals Justin Thaler, Jonathan Ullman, Salil Vadhan ICALP 2012 Track A Links: arXiv Abstract: We study the problem of releasing $k$-way marginals of a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of $D$'s records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et al., PODS '07). We give an algorithm that runs in time $d^{O(\sqrt{k})}$ and releases a private summary capable of answering any $k$-way marginal query with at most $\pm 0.01$ error on every query as long as $n \geq d^{O(\sqrt{k})}$. To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of $k$-way marginal queries, which is $d^{\Theta(k)}$ (for $k \ll d$). Iterative Constructions and Private Data Release Anupam Gupta, Aaron Roth, Jonathan Ullman TCC 2012 Links: arXiv Abstract: In this paper we study the problem of approximately releasing the cut function of a graph while preserving differential privacy, and give new algorithms (and new analyses of existing algorithms) in both the interactive and non-interactive settings. Our algorithms in the interactive setting are achieved by revisiting the problem of releasing differentially private, approximate answers to a large number of queries on a database. We show that several algorithms for this problem fall into the same basic framework, and are based on the existence of objects which we call iterative database construction (IDC) algorithms.
We give a new generic framework in which new (efficient) IDC algorithms give rise to new (efficient) interactive private query release mechanisms. Our modular analysis simplifies and tightens the analysis of previous algorithms, leading to improved bounds. We then give a new IDC algorithm (and therefore a new private, interactive query release mechanism) based on the Frieze/Kannan low-rank matrix decomposition. This new release mechanism gives an improvement on prior work in a range of parameters where the size of the database is comparable to the size of the data universe (such as releasing all cut queries on dense graphs). We also give a non-interactive algorithm for efficiently releasing private synthetic data for graph cuts with error $O(|V|^{1.5})$. Our algorithm is based on randomized response and a non-private implementation of the SDP-based, constant-factor approximation algorithm for cut-norm due to Alon and Naor. Finally, we give a reduction based on the IDC framework showing that an efficient, private algorithm for computing sufficiently accurate rank-1 matrix approximations would lead to an improved efficient algorithm for releasing private synthetic data for graph cuts. We leave finding such an algorithm as our main open problem. 2011 On the Zero-Error Capacity Threshold for Deletion Channels Ian Kash, Michael Mitzenmacher, Justin Thaler, Jonathan Ullman ITA 2011 Links: arXiv Abstract: We consider the zero-error capacity of deletion channels. Specifically, we consider the setting where we choose a codebook $C$ consisting of strings of $n$ bits, and our model of the channel corresponds to an adversary who may delete up to $pn$ of these bits for a constant $p$. Our goal is to decode correctly without error regardless of the actions of the adversary. We consider what values of $p$ allow non-zero capacity in this setting.
We suggest multiple approaches, one of which makes use of the natural connection between this problem and the problem of finding the expected length of the longest common subsequence of two random sequences. Privately Releasing Conjunctions and the Statistical Query Barrier Anupam Gupta, Moritz Hardt, Aaron Roth, Jonathan Ullman STOC 2011; SICOMP Links: arXiv Abstract: Suppose we would like to know all answers to a set of statistical queries $C$ on a data set up to small error, but we can only access the data itself using statistical queries. A trivial solution is to exhaustively ask all queries in $C$. In this paper, we investigate how and when we can do better than this naive approach. We show that the number of statistical queries necessary and sufficient for this task is---up to polynomial factors---equal to the agnostic learning complexity of $C$ in Kearns' statistical-query (SQ) model. This gives a complete answer to the question when running time is not a concern. We then show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to $C$ can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions. These results are interesting not only from a learning-theoretic point of view, but also from the perspective of privacy-preserving data analysis. In this context, our second result leads to an algorithm that efficiently releases differentially private answers to all Boolean conjunctions with $1\%$ average additive error. This represents significant progress on a key open problem in privacy-preserving data analysis. Our first result, on the other hand, gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries.
Not only our algorithms, but also most known private algorithms, can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ model as a new barrier in the design of differentially private algorithms. PCPs and the Hardness of Generating Private Synthetic Data Jonathan Ullman, Salil Vadhan TCC 2011; Invited to the Journal of Cryptology Links: ECCC Abstract: Assuming the existence of one-way functions, we show that there is no polynomial-time, differentially private algorithm $\mathcal{A}$ that takes a database $D \in (\{0,1\}^d)^n$ and outputs a "synthetic database" $\hat{D}$ all of whose two-way marginals are approximately equal to those of $D$. (A two-way marginal is the fraction of database rows $x \in \{0,1\}^d$ with a given pair of values in a given pair of columns.) This answers a question of Barak et al. (PODS '07), who gave an algorithm running in time $\mathrm{poly}(n, 2^d)$. Our proof combines a construction of hard-to-sanitize databases based on digital signatures (by Dwork et al., STOC '09) with encodings based on the PCP Theorem. We also present both negative and positive results for generating "relaxed" synthetic data, where the fraction of rows in $D$ satisfying a predicate $c$ is estimated by applying $c$ to each row of $\hat{D}$ and aggregating the results in some way. 2010 Course Allocation by Proxy Auction Scott Kominers, Mike Ruberry, Jonathan Ullman WINE 2010 Abstract: We propose a new proxy bidding mechanism to allocate courses to students given students' reported preferences. Our mechanism is motivated by a specific strategic downgrading manipulation observed in the course allocation mechanism currently used at Harvard Business School (HBS). The proxy bidding mechanism simplifies students' decisions by directly incorporating downgrading into the mechanism.
Our proxy bidding mechanism is Pareto efficient with respect to lexicographic preferences and can be extended to a more expressive preference language that permits complementarity, substitutability, and indifference. Simulations suggest that the proxy bidding mechanism is robust to the manipulations observed in the HBS mechanism and may yield welfare improvements. The Price of Privately Releasing Contingency Tables and the Spectra of Random Matrices with Correlated Rows Shiva Kasiviswanathan, Mark Rudelson, Adam Smith, Jonathan Ullman STOC 2010 Abstract: Marginal (contingency) tables are the method of choice for government agencies releasing statistical summaries of categorical data. In this paper, we derive lower bounds on how much distortion (noise) is necessary in these tables to ensure the privacy of sensitive data. We extend a line of recent work on impossibility results for private data analysis to a natural and important class of functionalities. Consider a database consisting of $n$ rows (one per individual), each row comprising $d$ binary attributes. For any subset $T$ of attributes of size $|T| = k$, the marginal table for $T$ has $2^k$ entries; each entry counts how many times in the database a particular setting of these attributes occurs. We provide lower bounds for releasing all $\binom{d}{k}$ $k$-attribute marginal tables under several different notions of privacy. We give efficient polynomial-time attacks which allow an adversary to reconstruct sensitive information given insufficiently perturbed marginal table releases. In particular, for a constant $k$, we obtain a tight bound of $\Omega(\min\{n^{1/2}, d^{(k-1)/2}\})$ on the average distortion per entry for any mechanism that releases all $k$-attribute marginals while providing attribute privacy (a weak notion implied by most privacy definitions). Our reconstruction attacks require a new lower bound on the least singular value of a random matrix with correlated rows.
Let $M(k)$ be a matrix with $\binom{d}{k}$ rows formed by taking all possible $k$-way entry-wise products of an underlying set of $d$ random vectors from $\{0,1\}^n$. For constant $k$, we show that the least singular value of $M(k)$ is $\Omega(d^{k/2})$ with high probability (the same asymptotic bound as for independent rows). We obtain stronger lower bounds for marginal tables satisfying differential privacy. We give a lower bound of $\Omega(\min\{n^{1/2}, d^{k/2}\})$, which is tight for $n = \Omega(d^k)$. We extend our analysis to obtain stronger results for mechanisms that add instance-independent noise, and weaker results when $k$ is super-constant.
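The matrix $M(k)$ described above is easy to generate and experiment with numerically. The following Python sketch (my own illustration, not from the paper) builds $M(k)$ for small $d$ and $k$ and computes its least singular value:

```python
import itertools
import numpy as np

def build_Mk(vectors, k):
    """Rows of M(k): all k-way entry-wise products of the given 0/1
    vectors (each of length n), one row per k-subset of the d vectors."""
    d = len(vectors)
    rows = [np.prod([vectors[i] for i in subset], axis=0)
            for subset in itertools.combinations(range(d), k)]
    return np.array(rows, dtype=float)

rng = np.random.default_rng(1)
n, d, k = 2000, 12, 2
vectors = list(rng.integers(0, 2, size=(d, n)))
M = build_Mk(vectors, k)                      # shape (C(12,2), 2000) = (66, 2000)
least_sv = np.linalg.svd(M, compute_uv=False)[-1]
print(M.shape, least_sv)
```

Even though the rows of $M(k)$ are highly correlated (each pair of underlying vectors shares rows with many others), the computed least singular value stays well away from zero, which is the phenomenon the $\Omega(d^{k/2})$ bound makes precise.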