## 1 Introduction

The increasing collection of data at large scale—medical records, location information from cell phones, internet browsing history—points to the importance of a deeper understanding of the tradeoffs between privacy and the utility of the data collected. Classical mechanisms for preserving privacy, such as permutation, small noise addition, releasing only mean information, or basic anonymization, are insufficient: notable privacy compromises with genomic data [30] and movie rating information [39] caused the NIH to temporarily stop releasing genetic information and Netflix to cancel a proposed competition for predicting movie ratings. Balancing the tension between utility and the risk of disclosing sensitive information is thus essential.

In response to these challenges, researchers in the statistics, databases,
and computer science communities have studied *differential
privacy* [54, 25, 24, 23, 27, 21] as a formalization of disclosure risk limitation,
providing strong privacy guarantees. This literature discusses two notions
of privacy: *local* privacy, in which data is privatized before it is
even shared with a data collector, and *central* privacy, where a
centralized curator maintains the sample and guarantees that any information
it releases is appropriately private. The local model is stronger, and
consequently it is more challenging to develop statistically efficient
algorithms. Yet the strong privacy protections local privacy provides
encourage its adoption. Whether for ease of compliance with regulatory strictures, such as European Union privacy rules [48]; for reasons of transparency and a belief in the importance of privacy; or to avoid the risks proximate to holding sensitive data, such as hacking or subpoena, because private data never leaves an individual’s device in the clear, major technology companies have adopted local differential privacy protections in their data collection and machine learning tools. Apple provides local differential privacy in many of its iPhone systems [3], and Google has built systems supplying central and local differential privacy [47, 1]. The broad impact of privacy protections in billions of devices suggests that we should carefully understand the fundamental limitations and possibilities of learning with local notions of privacy.

To address this challenge, we study the *local minimax complexity* of estimation and learning under local notions of privacy.
Worst-case notions of complexity may be too stringent for statistical
practice [21], and in real-world use, we wish to understand
how difficult the *actual* problem we have is, and whether we can adapt
to this problem difficulty, so that our procedures more efficiently solve
easy problems as opposed to being tuned to worst-case notions of
difficulty. Our adoption of local minimax complexity is thus driven by three
desiderata: we seek fundamental limits on estimation and learning that (i)
are instance specific, applying to the particular problem at hand, (ii) are
uniformly attainable, in that there exist adaptive procedures to achieve the
instance-specific difficulty, and (iii) have super-efficiency limitations, so
that if a procedure achieves *better* behavior than the lower bounds
suggest is possible, there should be problem instances in which the
procedure must have substantially worse behavior. In this paper, we provide
a characterization of the difficulty of estimation of one-dimensional
quantities under local privacy that satisfies these desiderata.

The celebrated Le Cam–Hájek local asymptotic minimax theory [26, 32, 34, 50, 35] cleanly delineates efficient from inefficient estimators in classical statistics and highlights the importance of local notions of optimality (making Fisher information bounds rigorous). As an example encapsulating the differences between global and local minimax complexity, consider the one-dimensional logistic regression problem of predicting from , with , where—taking our motivation from applications of machine learning [28, 3, 1]—we wish only to accurately estimate . This problem is easier the larger is, and a calculation shows that the maximum likelihood estimator has misclassification error decreasing exponentially in ; a fully minimax analysis provides a lower bound at , or random guessing, with a convergence lower bound independent of . For most applications of statistical learning, we hope our model substantially outperforms random guessing, so such (global worst-case) analyses are of limited utility in the design of (near-) optimal procedures. To that end, any practicable theory of optimal private estimation should encapsulate a *local* notion of problem difficulty.
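To make the exponential-versus-constant gap concrete, consider a simple instance of this setup (our own illustrative choice, not the paper's exact model): covariates x ∈ {−1, +1} and P(Y = 1 | x) = 1/(1 + e^{−θx}). The Bayes-optimal predictor sign(θx) misclassifies with probability 1/(1 + e^{|θ|}), which equals 1/2 at θ = 0 (random guessing) and decays exponentially in |θ|:

```python
import math

def misclassification_error(theta):
    """Error of the Bayes predictor sign(theta * x) in the two-point
    logistic model P(Y = 1 | x) = 1 / (1 + exp(-theta * x)), x in {-1, +1}.

    Conditional on x = +1 the predictor errs with probability
    1 / (1 + e^{|theta|}); by symmetry the same holds for x = -1.
    """
    return 1.0 / (1.0 + math.exp(abs(theta)))

errors = {theta: misclassification_error(theta) for theta in [0.0, 1.0, 2.0, 4.0, 8.0]}
```

A worst-case analysis over all θ is therefore stuck at the θ = 0 error of 1/2, while for any fixed θ > 0 the error is exponentially small—precisely the gap the local analysis is designed to capture.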

### 1.1 Contributions, outline, and related work

Our development of instance-specific (local) notions of problem complexity
under privacy constraints allows us to more precisely quantify the
statistical price of privacy. Identifying the tension here is of course of
substantial interest, and Duchi, Jordan, and Wainwright [21, 20] develop a
set of statistical and information-theoretic tools for understanding the
*minimax* risk in locally differentially private settings, providing
the point of departure for our work. To understand both their approach and ours, let us formally define our setting.

We have i.i.d. data
drawn according to a distribution on a space
. Instead of observing the original sample , however, the
statistician or learner sees only *privatized* data ,
where the data is drawn from a Markov kernel conditional on (following information-theoretic parlance, we
often call the privacy *channel* [13]; in the
privacy literature is the
*mechanism* [24]). In full generality, we allow the
channel to be *sequentially interactive* [21], meaning
that at observation , the channel may depend on the previous (private)
observations . That is, we have

(1)

This notion of interactivity is important for procedures, such as stochastic gradient methods [21] or the one-step-corrected estimators we develop in the sequel, which modify the mechanism after some number of observations to more accurately perform inference.

The statistical problems we consider are, abstractly, as follows. Let be a family of distributions, and let be a parameter belonging to a parameter space we wish to estimate, where denotes the estimand. Let

be a loss function measuring the loss of an estimated value

for the distribution , where we assume that for all distributions . As an example, we may consider the mean and squared error . Let be a collection of appropriately private channels, for example, -differentially private channels (which we define in the sequel). The *private minimax risk* [21] is

(2)

where denotes the marginal and drawn conditionally (1). Duchi et al. [21] provide upper and lower bounds on this quantity when is the collection of -locally differentially private channels, developing strong data processing inequalities to quantify the costs of privacy.

The worst-case nature of the formulation (2) may
yield lower bounds that are too pessimistic for practice, and
it does not allow a characterization of
problem-specific difficulty, which is important for a deeper understanding
of adaptive and optimal procedures. Accordingly, we adopt a *local
minimax approach*, which builds out of the classical statistical
literature on hardest one-dimensional alternatives that begins with
Stein [44, 4, 16, 17, 18, 9, 11]. In the same setting as the above,
we define the *local minimax risk* at
the distribution for the set of channels as

(3)

The quantity (3) measures the
difficulty of the loss minimization problem for a *particular
distribution* under the privacy constraints
characterizes, and at this distinguished distribution, we look for the
hardest alternative distribution .

To situate our contributions, let us first consider the non-private variant of the minimax complexity (2) and local minimax complexity (3), when (the identity mapping), and we use the squared error loss . Begin with a classical setting, in which we wish to estimate a linear function of a parameter in a parametric family with Fisher information matrix . The Fisher information bound [35] for the parameter is

More generally,
if we wish to estimate a functional of
a distribution , Donoho and Liu [16, 17, 18] show
how the *modulus of continuity* takes the place of the classical
information bound. Again considering the squared error, define the
modulus of continuity of over with respect
to Hellinger distance by

(4)

where . Then under mild regularity conditions,

which highlights that separation in Hellinger distance precisely governs problem difficulty in non-private classical statistical problems. In the local minimax case, similar characterizations via a local modulus of continuity are available in some problems, including estimation of the value of a convex function [9] and stochastic optimization [11].
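In notation of our own choosing (the surrounding displays may use different symbols, so we hedge; here written locally around a fixed distribution \(P_0\), whereas the global Donoho–Liu version takes a supremum over pairs), the Hellinger modulus and the risk characterization it yields take the form:

```latex
\omega_{\mathrm{hel}}(\delta; P_0)
  := \sup_{P \in \mathcal{P}}
     \bigl\{\, |\theta(P) - \theta(P_0)| \;:\; d_{\mathrm{hel}}(P, P_0) \le \delta \,\bigr\},
\qquad
\mathfrak{M}_n \asymp \omega_{\mathrm{hel}}\bigl(n^{-1/2}; P_0\bigr)^{2},
```

so that separation at the sampling resolution \(\delta = n^{-1/2}\) in Hellinger distance sets the attainable squared error.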

In contrast, the work of Duchi et al. [21, 20] suggests that
for -locally differentially private estimation, we should replace
the Hellinger distance by *variation distance*. In the case of
higher-dimensional problems, there are additional dimension-dependent
penalties in estimation that local differential privacy makes unavoidable,
at least in a minimax sense [21]. In work independent of and
contemporaneous to our own, Rohde and Steinberger [43] build off
of [21] to show that (non-local) minimax rates of convergence
under -local differential privacy are frequently governed by a
modulus of continuity (4), except that the
variation distance replaces
the Hellinger distance . Rohde and Steinberger also exhibit a
mechanism that is minimax optimal for “nearly” linear functionals based on
randomized response [54, 43, Sec. 4]. Thus, locally
differentially private procedures give rise to a different geometry than
classical statistical problems.

Now we are in a position for a high-level description of our results. Our results apply in a variety of locally private estimation settings, whose definitions we formalize in Section 2, but all of them consist of weakenings of -differential privacy (including concentrated and Rényi-differential privacy [24, 22, 8, 38]). We provide a precise characterization of the local minimax complexity (3) in these settings. If we define the local modulus of continuity (for the squared error) at by

then a consequence of our Theorem 3.1 is that for the squared loss and family of -locally private channels,

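Schematically, and in our own notation with numerical constants suppressed (Theorem 3.1 and Section 3 give the precise two-sided statement), the characterization pairs the variation-distance modulus with the effective sample size \(n\varepsilon^2\):

```latex
\omega_{\mathrm{TV}}(\delta; P_0)
  := \sup_{P \in \mathcal{P}}
     \bigl\{\, |\theta(P) - \theta(P_0)| \;:\; \|P - P_0\|_{\mathrm{TV}} \le \delta \,\bigr\},
\qquad
\mathfrak{M}_n^{\mathrm{loc}}(P_0)
  \asymp \omega_{\mathrm{TV}}\Bigl(\tfrac{1}{\sqrt{n\varepsilon^{2}}};\, P_0\Bigr)^{2}.
```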
We provide this characterization in more detail and for general losses in
Section 3. Moreover, we show a super-efficiency
result that any procedure that achieves risk better than the local minimax
complexity at a distribution *must* suffer higher risk at another
distribution , so that this characterization does indeed satisfy our
desiderata of an instance-specific complexity measure.

The departure of these risk bounds from the typical Hellinger-based moduli
of continuity (4) has consequences for
locally private estimation and adaptivity of estimators, which we address
via examples in Section 4. For instance,
instead of the classical Fisher information, an alternative we
term the *-information*

characterizes the complexity of locally private estimation: the classical Fisher information bounds are unobtainable. A challenging consequence of these results is that, for some parametric models (including Bernoulli estimation and binomial logistic regression), the local complexity (3) is *independent* of the underlying parameter: problems that are easy (in the classical Fisher information sense) are never easy under local privacy constraints. Our proofs, building off of those of Duchi et al. [21], rely on novel Markov contraction inequalities for divergence measures, which strengthen classical strong data processing inequalities [12, 14].

Developing adaptive procedures uniformly achieving the instance-specific
local minimax risk (3) is challenging, but we
show that such optimal design is possible in a number of cases in
Section 4, including well- and mis-specified exponential
family models, using an extension of classical one-step corrected
estimators. We compare these locally optimal procedures with the minimax
optimal procedures Duchi et al. [21] propose on a protein
expression-prediction problem in Section 6; the
experimental results suggest that the local minimax perspective indeed
outperforms the global minimax procedures, though the costs of privacy are
*still* nontrivial.

Lastly, because we consider weaker notions of privacy, one might ask whether
it is possible to improve the minimax bounds that Duchi et al. [21, 20] develop. Unfortunately, this appears impossible (see
Section 5). Duchi et al. show
only that *non-interactive privacy* mechanisms (i.e. the channel
may depend only on and not the past observations) must
suffer poor performance in high dimensions under stringent
differential privacy constraints. Our results show that this is unavoidable,
even with weaker notions of privacy and allowing interactive mechanisms. We
provide some additional discussion and perspective in
the closing of the paper in Section 7.

## 2 Definitions of privacy

Our starting point is a formalization of our notions of local privacy.
With the notion (1) of sequentially
interactive channels, where the th private observation
is drawn conditionally on the past as , we consider several notions of
privacy, going from the strongest to the weakest.
First is *local differential privacy*, which
Warner [54] first proposed (implicitly) in his 1965 work on
survey sampling and which Evfimievski et al. [25] and
Dwork et al. [24] later defined explicitly.

The channel is *-locally differentially private*
if for all , , and
, we have

The channel is *non-interactive* if

for all and .

Duchi et al. [21] consider this notion of privacy, developing its consequences for minimax optimal estimation. It is a satisfying definition from a privacy point of view, and an equivalent view is that an adversary knowing the data is either or cannot accurately test, even conditional on the output , whether the generating data was or ’. To mitigate the consequent difficulties for estimation and learning with differentially private procedures, researchers have proposed a number of weakenings of Definition 2, which we also consider.
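As a concrete instance of the definition, Warner's classical randomized response satisfies ε-local differential privacy for a single bit. The sketch below (our own minimal implementation, not code from the paper) privatizes bits and then debiases the empirical mean:

```python
import math
import random

def randomized_response(x, epsilon):
    # Warner's mechanism: report the true bit x in {0, 1} with probability
    # p = e^eps / (1 + e^eps); otherwise report the flipped bit. The
    # likelihood ratio of any output under inputs 0 vs. 1 is exactly e^{+-eps},
    # so epsilon-local differential privacy holds with equality.
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return x if random.random() < p_truth else 1 - x

def debias_mean(private_bits, epsilon):
    # When mu = P(X = 1), the private bits satisfy
    # E[Z] = p * mu + (1 - p) * (1 - mu), an affine map of mu we invert.
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    z_bar = sum(private_bits) / len(private_bits)
    return (z_bar - (1.0 - p)) / (2.0 * p - 1.0)
```

The statistical price of the privacy guarantee is visible in the variance inflation caused by the 1/(2p − 1) factor in the debiasing step.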

To that end, a second notion of privacy, which Dwork and Rothblum [22] propose and Bun and Steinke [8] develop, reposes on Rényi-divergences. For an , the Rényi-divergence of order is

where for one takes the downward limit as , yielding . We then have the following
definition.
The channel is *-zero-concentrated
locally differentially private* (zCDP) if for all
, , and , we have

An equivalent definition is that the log likelihood ratio has sub-Gaussian tails, that is, for , we have

for all (and ). Mironov [38] proposes
a natural relaxation of Definition 2,
suggesting that we require it hold only for a *single* fixed . This yields
The channel is *-Rényi locally
differentially private* if for all , and , we have

Perhaps the most salient point in Definition 2 is the choice , which will be important in our subsequent analysis. Consider a prior distribution on two points , represented by and , and then consider the posterior and after observing the private quantity . Then -Rényi privacy is equivalent [38, Sec. VII] to the condition that the prior and posterior odds ratios of against do not change much in expectation:

(5)

for all two-point priors , where the expectation is taken
over . (For -differential
privacy, the inequality holds for *all* without expectation).
Because Rényi divergences are monotonic in
(cf. [51, Thm. 3]), any
channel that is -Rényi private
is also -Rényi private for .

Our final notion of privacy is based on -divergences, which is related to Definition 2. Recall for a convex function with , the -divergence between distributions and is

which is non-negative and strictly positive when and is strictly convex at the point . We consider -divergences parameterized by of the form

For , the channel is
*--divergence locally private* if for all , , and , we have

When , this is the familiar
-divergence [37, 46], and it is equivalent to
-Rényi differential privacy. We
describe this special situation as *--privacy*.

The definitions provide varying levels of privacy. It is immediate that if a channel is -differentially private, then it is --divergence locally private. For , this implies --divergence local privacy. It is also clear that Definition 2 is stronger than 2, which is stronger than 2. We can quantify this as well: -differential privacy implies -zero-concentrated differential privacy. For , we also find that if the channel is -zCDP, then it is immediate that it satisfies --divergence privacy with , where we take in the definition of the Rényi-divergence. Our results all apply for -private channels, so that -privacy (Definition 2 with or Definition 2 with ) implies strong lower bounds on estimation.
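These orderings can be checked numerically on a concrete channel. For a binary ε-differentially private kernel (we use the two-output randomized-response channel as an illustration; the numbers below are ours, not the paper's), the log-likelihood ratio is bounded by ε pointwise, so every Rényi divergence is at most ε and the divergences are nondecreasing in the order α, matching the monotonicity cited above:

```python
import math

def renyi_divergence(p_dist, q_dist, alpha):
    # D_alpha(P || Q) = (alpha - 1)^{-1} log sum_z P(z)^alpha Q(z)^{1 - alpha}
    # for alpha > 1 and finitely supported distributions P, Q.
    s = sum(p ** alpha * q ** (1.0 - alpha) for p, q in zip(p_dist, q_dist))
    return math.log(s) / (alpha - 1.0)

epsilon = 1.0
p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
# Output distributions of a randomized-response channel under inputs 1 and 0.
channel_1 = [p, 1.0 - p]
channel_0 = [1.0 - p, p]
divs = [renyi_divergence(channel_1, channel_0, a) for a in (1.5, 2.0, 4.0, 16.0)]
```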

## 3 Local minimax complexity and private estimation

We turn to our main goal of establishing
localized minimax complexities for locally private estimation.
To that end, we begin in
Section 3.1 by defining the
*modulus of continuity* of estimands, showing how it provides a tight
lower bound on localized complexity for private estimation.
Section 3.2 continues the development of
Section 3.1 by establishing a
super-efficiency result, showing that any estimator achieving lower risk
than our localized modulus of continuity for some distribution must be
inefficient on other distributions . In
Section 3.3, we present the main
technical tools that underlie our results, providing new strong
data-processing inequalities showing precisely how locally private channels
degrade the information in statistical problems.

### 3.1 The modulus of continuity and local minimax complexities

Recall our setting, where we wish to estimate a parameter of a distribution , a collection of possible distributions, and we measure the performance of an estimate via a loss satisfying . We define the “distance” between distributions and for the loss by

which is always non-negative. As an example, if is 1-dimensional and we use the squared error ,

More generally, for any symmetric convex function , if ,

(6)

where . A similar result holds for general losses; if is non-decreasing and we measure the parameter error , then

(7)

as it is no loss of generality to assume that in the definition of the distance.

For a family of distributions ,
the *modulus of continuity*
associated with the loss at the distribution is

(8)

As we shall see, this modulus of continuity fairly precisely
characterizes the difficulty of locally private estimation of functionals.
The key in this definition is that the modulus of continuity is
defined with respect to *variation distance*.
This is in contrast to
classical results on optimal estimation, where the more familiar modulus of
continuity with respect to Hellinger distance characterizes problem
difficulty. Indeed, Le Cam’s theory of quadratic mean differentiability,
contiguity, local asymptotic normality, and local alternatives for testing
all reposes on closeness in Hellinger distance [34, 35, 42, 50], which justifies the use of Fisher Information
in classical statistical problems. In nonparametric problems, as we
mentioned briefly in the introduction, the modulus of continuity of the
parameter with respect to Hellinger distance also characterizes
minimax rates for estimation of functionals of distributions [4, 16, 17] (at least in a global minimax sense), and in some
instances, it governs local minimax guarantees as well [9].
These results all correspond to replacing the variation
distance in definition (8) with the
Hellinger distance between and . As we illustrate, the difference
between the classical Hellinger-based modulus of continuity and
ours (8) leads to substantially different behavior
for private and non-private estimation problems.

With this, we come to our first main result, which lower bounds the local minimax risk using the modulus (8). We defer the proof to Section 3.4.3; it relies on the strong data-processing inequalities to come. Let be the collection of --locally private channels. Then for any distribution , we have

Noting that the modulus of continuity is increasing in its first argument and that for all , we have the simplified lower bound

In (nearly) simultaneous independent work, Rohde and Steinberger [43] provide a global minimax lower bound, akin to (2), using a global modulus of continuity with respect to variation distance, extending [16, 17, 18] to the private case. Our focus here is on instance-specific bounds, with the hope that we may calculate practically useful quantities akin to classical information bounds [50, 35].

Achievability in the theorem is a somewhat more delicate argument;
demonstrating procedures that achieve the lower bound uniformly is typically
nontrivial. With that said, under two reasonable conditions on our loss,
distance, and growth of the modulus of continuity, we can show a converse to
Theorem 3.1, showing that the modulus
indeed *describes* the local minimax complexity to within
numerical constants.

[Reverse triangle inequality] There exists such that for ,

In the case that the loss is based on the error for nondecreasing, the inequality (7) shows that Condition 3.1 holds whenever for all . In addition, we sometimes use the following condition on the modulus of continuity. [Polynomial growth] At the distribution , there exist constants such that for all

Condition 3.1 is similar to the typical Hölder-type continuity properties assumed on the modulus of continuity for estimation problems [16, 17]. We give examples satisfying Condition 3.1 presently.

The conditions yield the following converse to Theorem 3.1, which shows that the modulus of continuity characterizes the local minimax complexity. See Appendix A.2 for a proof of the result. Let Conditions 3.1 and 3.1 on and hold. Let and , and let be the collection of non-interactive -differentially private channels (Definition 2). Then

The proposition as written is a bit unwieldy, so we unpack it slightly. For , we have , so that for a constant that may depend on , , and , for each there exists a non-interactive -differentially private channel and estimator such that

This matches the lower bound in Theorem 3.1 up to a numerical constant.

We briefly discuss one example satisfying Condition 3.1 to demonstrate that we typically expect it to hold, so that the modulus of continuity characterizes the local minimax rates. We show in Section 4.4, Lemma 4.4, that (potentially mis-specified) exponential family models also satisfy Condition 3.1.

[Modulus of continuity in (nonparametric) mean estimation] Consider loss where is nondecreasing. Inequality (7) implies that using the shorthand

we have

and assuming that the function itself satisfies for , to show satisfies Condition 3.1, it is sufficient to show that satisfies Condition 3.1.

Now consider the problem of estimating , where the unknown distribution belongs to for some compact set . Denote . We claim the following upper and lower bounds on :

(9)

which of course combine to imply Condition 3.1. To see the lower bound (9), for any , define , where denotes a point mass at . Then for all , so . The upper bound (9) is similarly straightforward: for all , we have

by the triangle inequality, which is our desired result.
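The perturbation in the lower-bound argument just given can be written out explicitly (in our own notation): mixing \(P_0\) with a point mass moves the mean linearly in \(\delta\) while moving at most \(\delta\) in variation distance,

```latex
P_{1} = (1 - \delta)\, P_{0} + \delta\, \mathbf{1}_{x},
\qquad
\|P_{1} - P_{0}\|_{\mathrm{TV}} \le \delta,
\qquad
\mu(P_{1}) - \mu(P_{0}) = \delta \bigl(x - \mu(P_{0})\bigr),
```

so taking \(x\) at an extreme point of the (compact) support yields the claimed linear-in-\(\delta\) lower bound on the modulus.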

### 3.2 Super-efficiency

To demonstrate that the local modulus of continuity is indeed the “correct” lower bound on estimation, we consider the third of the desiderata for a strong lower bound that we identify in the introduction: a super-efficiency result. We provide this via a constrained risk inequality [6, 19]. Our result applies in the typical setting in which the loss is for some increasing function , and we use the shorthand for the risk (expected loss) of the estimator under the distribution . The starting point for our development is an inequality extending Brown and Low [6, Thm. 1] showing that if has small risk for a parameter under a distribution , then its risk under a distribution close to may be large (see also [45, Thm. 6]). In the lemma and the remainder of this section, for measures and we define the -affinity

(10)

which measures the similarity between distributions and . With these definitions, we have the following constrained risk inequality.

[[19], Theorem 1] Let , , and define . If the estimator satisfies for some , then

The lemma shows that if an estimator has small risk under a distribution , then its risk for a nearby (in -divergence) distribution must be nearly as large as the distance between the associated parameters and .

With Lemma 3.2 in hand, we can prove a super-efficiency result, showing that improvement over our modulus of continuity lower bound at a point implies worse performance elsewhere. Let be a sequentially interactive --private channel (Defs. 2 or 2) with associated marginal distributions . Let Condition 3.1 hold with parameter . If for some the estimator satisfies

then for all there exists a distribution such that

See Section 3.4.4 for a proof.

The proposition depends on a number of constants, but roughly, it shows (for small enough , where we simplify by taking ) that if an estimator is super-efficient at , in that , then there exists a constant such that for some we have . In this sense, our local modulus of continuity bounds are sharp: no estimator can achieve much better risk than the local modulus of continuity at a distribution without paying nontrivial cost elsewhere.

### 3.3 Contractions of probability measures

The main technical tool underpinning our lower bounds is that our definitions of privacy imply strong contractions on the space of probability measures. Such contractive properties have been important in the study of information channels [12, 14], where one studies *strong data processing* inequalities, and in the mixing properties of Markov chains under so-called *strong mixing conditions*, such as the Dobrushin condition [15]. For , define the marginal distributions

The goal is then to provide upper bounds on the -divergence in terms of the channel ; the standard data-processing inequality [13, 37] guarantees . Dobrushin’s celebrated ergodic coefficient guarantees that for any -divergence (see [12, 14]),

(11)

Thus, as long as the Dobrushin coefficient is strictly positive, one obtains a strong data processing inequality. In our case, our privacy guarantees provide a stronger condition than the positivity of the Dobrushin coefficient. Consequently, we are able to provide substantially stronger data processing inequalities: we can even show that it is possible to modify the underlying -divergence.
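The contraction (11) is easy to verify numerically for a finite channel. The sketch below (an illustration of the classical Dobrushin inequality for variation distance, not of our stronger result; the kernel and inputs are arbitrary choices of ours) computes the Dobrushin coefficient of a small kernel and checks that variation distance contracts by at least that factor:

```python
def total_variation(p, q):
    # TV distance between two finitely supported distributions.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def dobrushin_coefficient(kernel):
    # alpha(Q) = max over input pairs (x, x') of || Q(. | x) - Q(. | x') ||_TV.
    return max(total_variation(r1, r2) for r1 in kernel for r2 in kernel)

def push_forward(dist, kernel):
    # Output marginal when the input has distribution `dist`.
    return [sum(dist[x] * kernel[x][z] for x in range(len(dist)))
            for z in range(len(kernel[0]))]

# A 3-input, 2-output channel; rows are conditional output distributions.
kernel = [[0.7, 0.3], [0.4, 0.6], [0.55, 0.45]]
p1, p2 = [0.5, 0.3, 0.2], [0.1, 0.2, 0.7]
alpha = dobrushin_coefficient(kernel)
contracted = total_variation(push_forward(p1, kernel), push_forward(p2, kernel))
bound = alpha * total_variation(p1, p2)
```

Since the rows of this kernel overlap substantially, α < 1 and the output marginals are strictly closer than the inputs, exactly the strong data-processing phenomenon the privacy constraints amplify.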

Thus, we reconsider the notions of privacy based on divergences between the channels and . We have the following proposition, which provides a strong data processing inequality for all channels satisfying the divergence-based notion of privacy (Definition 2).

Let for some , and let and be arbitrary distributions on a common space . Let be a Markov kernel from to satisfying

for all and . Then

Jensen’s inequality implies that , so Proposition 3.3 provides a stronger guarantee than the classical bound (11) for the specific divergence associated with . Because for all , it is possible that the -divergence is infinite, while the marginals are much closer together. It is this transfer from power divergence to variation distance, that is, to , that allows us to prove the strong localized lower bounds depending on variation distance such as Theorem 3.1.

As a corollary of Proposition 3.3, we may parallel the proof of [21, Theorem 1] to obtain a tensorization result. In this context, the most important divergence for us is the divergence corresponding to the case in Proposition 3.3, that is, , which also corresponds to Rényi differential privacy with (Def. 2), with its guarantee that prior and posterior odds of discovery do not change much (Eq. (5)). Recall our formulation (1), in which the channel may be defined sequentially as , and let

Now, let be product distributions on , where the distribution of follows either or , and define , noting that as is a product distribution. We have the following corollary.

Let be a sequentially interactive channel satisfying --divergence privacy, that is, for all and . Then

See Section 3.4.2 for a proof. An immediate consequence of Corollary 3.3 and the fact [46, Lemma 2.7] that yields

(12)

The tensorization (12) is the key to our results, as we see in the later sections.

### 3.4 Proofs

We collect the proofs of our main results in this section, as they are reasonably brief and (we hope) elucidating. We begin with the key contraction inequality in Proposition 3.3, as it underlies all subsequent results.

#### 3.4.1 Proof of Proposition 3.3

Let and be the densities of with respect to some base measure dominating . Without loss of generality, we may assume that is finite, as all -divergences are approximable by finite partitions [49]; we let denote the associated p.m.f. For , the function is convex on . Thus, applying Jensen’s inequality, we may bound by

(13)

It thus suffices to upper bound . To do so, we rewrite as

where we have used that . Now define the function

By Minkowski’s integral inequality, we have the upper bound

Now we compute the inner summation: we have that

Substituting this into our upper bound (3.4.1) on , we obtain that

as . Substitute this upper bound into inequality (13) to obtain the proposition.

#### 3.4.2 Proof of Corollary 3.3

We use an inductive argument. The base case in which follows immediately from Proposition 3.3. Now, suppose that Corollary 3.3 holds at ; we will show that the claim holds for . We use the shorthand for the density of the measure , and , which we may assume exists w.l.o.g. Then, by definition of -divergence, we have
