Bias Reduction for Sum Estimation

Will Rosenbaum (Amherst College)

Joint with: Talya Eden, Jakob Houen, Shyam Narayanan, and Jakub Tetek

Overview

  1. Motivation
  2. General Setting
  3. Bias Reducing Sum Estimation
  4. Lower Bounds

Motivation

Motivating Example

Population of $N$ voters deciding a proposition (yes/no)

Goal. Estimate the number of “yes” votes.

  • expected margin of victory is $\sim 3\%$

Typical Assumption (unfounded?). We can sample voters uniformly at random.

Easy Estimation

Procedure:

  • sample $m$ voters
  • $X_i \in \{0, 1\}$ is vote of $i$th sample
  • return

    \[\bar\mu = \frac{1}{m} \sum_{i = 1}^{m} N X_i.\]

Guarantee:

  • $\varepsilon N$ additive error with probability $2/3$ for $m = O(\varepsilon^{-2})$.
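As a concrete illustration, here is a minimal Python sketch of this procedure using the talk's toy numbers ($N = 108$, $\mu = 58$, $m = 20$); the helper name and the choice of sampling with replacement are my own, not from the talk.

```python
import random

def naive_estimate(x, m, rng=random):
    """Return (N/m) * (sum of m uniformly sampled values), sampling with replacement."""
    N = len(x)
    return N * sum(x[rng.randrange(N)] for _ in range(m)) / m

# Toy population from the talk: N = 108 voters, 58 "yes" votes, m = 20 samples.
x = [1] * 58 + [0] * 50
print(naive_estimate(x, m=20, rng=random.Random(0)))  # noisy for m = 20; concentrates near 58 as m grows
```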

Estimating Yes Votes

$N = 108$, $m = 20$, $\mu = 58$

  • $\bar\mu = \frac 1 m \sum_i N X_i = 59.4$

  • “Yes” predicted to win!

More Realistic Assumption?

Samples are not exactly uniform

  • there may be some underlying bias
    • bias may systematically favor “yes” or “no” votes
  • true distribution is not known
  • …but true distribution is close to uniform
    • e.g., each voter sampled with probability $(1 \pm 0.1) / N$
    • bias is $0.1 = 10\%$

“Yes” Bias

$N = 108$, $m = 20$, $\mu = 58$

“Yes” voters 10% more likely to be sampled

  • $\bar\mu = \frac 1 m \sum_i N X_i = 64.8$

  • “Yes” predicted to win!

“No” Bias

$N = 108$, $m = 20$, $\mu = 58$

“No” voters 10% more likely to be sampled

  • $\bar\mu = \frac 1 m \sum_i N X_i = 48.6$

  • “Yes” predicted to lose!
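To see the effect numerically, here is a small Monte Carlo sketch of the same naive estimator under biased sampling. The specific bias model (reweighting "yes" voters relative to "no" voters by $\pm 10\%$) and the trial count are illustrative assumptions, not the talk's exact model; the averaged estimates shift systematically away from the true count $\mu = 58$ in the direction of the bias.

```python
import random

def biased_naive_estimate(x, m, rel_bias, rng):
    """Naive estimate (N/m) * sum of samples, where each "yes" voter (x_i = 1)
    is drawn with weight 1 + rel_bias relative to weight 1 for each "no" voter."""
    N = len(x)
    weights = [1 + rel_bias if xi == 1 else 1 for xi in x]
    samples = rng.choices(x, weights=weights, k=m)
    return N * sum(samples) / m

rng = random.Random(0)
x = [1] * 58 + [0] * 50                      # N = 108, true count mu = 58
trials = 10_000
for rel_bias in (+0.10, -0.10):              # "yes" voters favored, then "no" voters favored
    avg = sum(biased_naive_estimate(x, 20, rel_bias, rng) for _ in range(trials)) / trials
    print(f"relative bias {rel_bias:+.2f}: average naive estimate {avg:.1f}")
```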

Can We Do Better?

Question. Can we estimate $\mu = \sum_i x_i$ to within additive error, say, $0.02N$ using a sub-linear number of samples?

  • naive estimates would have error proportional to the sample bias

Yes We Can!

Our Results (special case):

  1. If bias is $0.1$, $0.02 N$ additive error is achievable using $O(\sqrt{N})$ samples
  2. More generally, estimating $\sum x_i$ with $x_i \in \{0, 1\}$:
    • $\gamma$ is an upper bound on the sample bias
    • $\varepsilon$ is the desired approximation factor
    • $k = \lceil \log(\varepsilon) / \log(\gamma) \rceil$
    • $O(N^{1 - 1/k})$ samples are sufficient (see the small calculation sketched after this list)
  3. This sample complexity is tight
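As a quick check of the formula for $k$ in the special case of item 1 ($\gamma = 0.1$, $\varepsilon = 0.02$), a tiny sketch; the function name is made up.

```python
import math

def sample_exponent(eps, gamma):
    """k = ceil(log(eps) / log(gamma)); then O(N^(1 - 1/k)) samples suffice."""
    k = math.ceil(math.log(eps) / math.log(gamma))
    return k, 1 - 1 / k

print(sample_exponent(eps=0.02, gamma=0.1))   # (2, 0.5): O(sqrt(N)) samples, matching item 1
```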

Bigger Picture

In many contexts, it is assumed that samples can be generated exactly from some distribution:

  • statistics
  • sub-linear time algorithms
  • distribution testing

General Questions

  1. Under what circumstances are “noisy” samples sufficient?
  2. What is the algorithmic cost of coping with noise?

General Setting

Setup

  • universe of $N$ elements, $[N] = \{1, 2, \ldots, N\}$, item $i$ has associated value $x_i$
  • $\mathcal{P}$ a known probability distribution over $[N]$
  • $\mathcal{Q}$ is true sample distribution
  • $\mathcal{Q}$ is pointwise $\gamma$-close to $\mathcal{P}$:

    \[(1 - \gamma) \mathcal{P}(i) \leq \mathcal{Q}(i) \leq (1 + \gamma) \mathcal{P}(i) \qquad \forall i\]

Goal. Estimate $\mu = \sum_i x_i$ to within additive error $\varepsilon_1 \mu_+ + \varepsilon_2$, where

  • $\mu_+ = \sum_i \vert x_i \vert$
  • $\varepsilon_1 \ll \gamma$

Classical Setting ($\mathcal{P} = \mathcal{Q}$)

Hansen-Hurwitz Estimator

Sample $m$ elements with replacement

  • \[\mu_{HH} = \frac{1}{m} \sum_{j = 1}^m \mathcal{P}(i_j)^{-1} X_j\]
  • unbiased estimator when $\mathcal{P} = \mathcal{Q}$
  • bias is at most $\gamma \mu_+$ when $\mathcal{Q}$ is $\gamma$-close to $\mathcal{P}$
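A minimal sketch of the Hansen-Hurwitz estimator; the population, values, and distribution below are illustrative, not from the talk.

```python
import random

def hansen_hurwitz(x, P, m, rng=random):
    """mu_HH = (1/m) * sum over samples of x[i] / P[i],
    with indices drawn from P with replacement."""
    idx = rng.choices(range(len(x)), weights=P, k=m)
    return sum(x[i] / P[i] for i in idx) / m

# Illustrative: a non-uniform known distribution P and arbitrary values x.
x = [3.0, 1.0, 4.0, 1.0, 5.0]
P = [0.1, 0.2, 0.3, 0.2, 0.2]
print(hansen_hurwitz(x, P, m=10_000, rng=random.Random(1)))  # ~ sum(x) = 14
```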

Bias-Reducing Sum Estimation

Our Idea

Apply Hansen-Hurwitz to $h$-wise collision statistics

  • an $h$-wise collision is a set of $h$ samples whose sampled indices coincide: $i_1 = i_2 = \cdots = i_h$

Get $h$-wise collision estimators:

  • \[\xi_h = \frac{1}{\binom{m}{h}} \sum_{i = 1}^N \binom{Y_i}{h} (\mathcal{P}(i))^{-h} x_i\]
  • $Y_i = $ # of times index $i$ was sampled

Note. $\xi_1 = \mu_{HH}$.
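Here is a direct sketch of $\xi_h$ computed from the sample counts $Y_i$; the run below uses a uniform $\mathcal{P}$ with no sample bias ($\mathcal{Q} = \mathcal{P}$), purely for illustration.

```python
import random
from collections import Counter
from math import comb

def collision_estimator(x, P, samples, h):
    """xi_h = (1 / C(m, h)) * sum_i C(Y_i, h) * P(i)^(-h) * x_i,
    where Y_i counts how many of the m samples equal index i."""
    m, Y = len(samples), Counter(samples)
    return sum(comb(Y[i], h) * P[i] ** (-h) * x[i] for i in Y) / comb(m, h)

# Illustrative run with uniform P and unbiased samples (Q = P).
rng = random.Random(2)
N = 50
x = [1] * 30 + [0] * 20                          # true sum is 30
P = [1 / N] * N
samples = rng.choices(range(N), weights=P, k=500)
print(collision_estimator(x, P, samples, h=1))   # equals the Hansen-Hurwitz estimate on these samples
print(collision_estimator(x, P, samples, h=2))   # also estimates sum(x) = 30, with higher variance
```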

Letdown

$h$-wise collision estimators alone are no better than Hansen-Hurwitz…

  • bias is $O(\gamma h \mu_+)$
  • variance is also worse

…but an appropriate linear combination of these estimators causes the lower-order error terms (in $\gamma$) to cancel out…

Bias Reducing Estimator

We define the bias reducing estimator $\zeta_k$:

  • \[\zeta_k = \sum_{h = 1}^k (-1)^{h+1} \binom{k}{h} \xi_h\]

Main Result. $\zeta_k$ has bias at most $\gamma^k \mu_+$.

  • we also bound the variance of $\zeta_k$.
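A sketch putting the pieces together: $\zeta_k$ as the alternating combination of collision estimators, run on samples drawn from a $\mathcal{Q}$ that is pointwise $0.1$-close to uniform. The concrete sizes and seed are arbitrary.

```python
import random
from collections import Counter
from math import comb

def xi(x, P, samples, h):
    """h-wise collision estimator xi_h (previous slide)."""
    m, Y = len(samples), Counter(samples)
    return sum(comb(Y[i], h) * P[i] ** (-h) * x[i] for i in Y) / comb(m, h)

def zeta(x, P, samples, k):
    """zeta_k = sum_{h=1}^{k} (-1)^(h+1) * C(k, h) * xi_h."""
    return sum((-1) ** (h + 1) * comb(k, h) * xi(x, P, samples, h)
               for h in range(1, k + 1))

# Illustrative: P is uniform, but samples come from a Q with +/-10% pointwise bias.
rng = random.Random(3)
N, m, gamma = 200, 20_000, 0.1
x = [1] * 100 + [0] * 100                                # mu = 100
P = [1 / N] * N
Q = [(1 + gamma) / N] * 100 + [(1 - gamma) / N] * 100    # "yes" items over-sampled
samples = rng.choices(range(N), weights=Q, k=m)
print(zeta(x, P, samples, k=1))   # Hansen-Hurwitz: expectation 110 here (bias gamma * mu)
print(zeta(x, P, samples, k=2))   # expectation 99 (bias gamma^2 * mu), at the cost of higher variance
```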

A Consequence

Special Case: $\mathcal{P}$ is uniform, all values $x_i \in \{0, 1\}$

  • take

    \[m = O\left(\sqrt[k]{N^{k-1} \varepsilon_2^{-2} \mathrm{Var}_{HH}}\right)\]
  • then $m$ samples are sufficient to estimate $\mu$ to additive error $\gamma^k \mu + \varepsilon_2$ with probability $2/3$

In the original example:

  • $\gamma = 0.1$,
  • $\varepsilon_1 = 0.01 = \gamma^2$
  • $\varepsilon_2 = 0.01 N$

$\implies m = O(\sqrt{N})$ samples are sufficient

Lower Bounds

Setting of Original Example

  • $\mathcal{P} = $ uniform
  • $x_i \in \{0, 1\}$

Main Result (lower bound). Suppose algorithm $A$ guarantees:

  • for any desired $\varepsilon > 0$,
  • for every distribution $\mathcal{Q}$ that is point-wise $\gamma$-close to uniform
  • $A$ returns an $\varepsilon N$ additive approximation to $\mu$ with constant probability

Then: for every positive integer $k$, there exists a constant $c_k < 1$ such that for $\varepsilon \leq c_k \gamma^k$, $A$ requires $\Omega(N^{1 - 1 / (k + 1)})$ samples.

Bounds in Pictures

[Figure: plot of the upper and lower bounds on sample complexity]

Questions

  1. Tighten values of $c_k$

  2. Sample correction: can we reduce bias of individual samples? Can we do this in an amortized fashion?

  3. Can other algorithms/estimation tasks be made robust to sample bias?

Thank You!!!

Any questions?

Appendix 1: Estimator for $\mathcal{P} = $ uniform, $k = 2$

Bias Reducing Estimator, $k = 2$

General case:

  • $\zeta_k = \sum_{h = 1}^k (-1)^{h+1} \binom{k}{h} \xi_h$

Case $k = 2$:

  • $\zeta_2 = 2 \xi_1 - \xi_2$

Suppose $\mathcal{Q}(i) = (1 + \alpha_i) / N$, $\vert\alpha_i\vert \leq \gamma$

Then:

  • $E(\xi_1) = \sum_i (1 + \alpha_i) x_i$
  • $E(\xi_2) = \sum_i (1 + \alpha_i)^2 x_i$

So:

  • $E(2 \xi_1 - \xi_2) = \sum_i \left(2(1 + \alpha_i) - (1 + \alpha_i)^2\right) x_i = \sum_i (1 - \alpha_i^2) x_i = \mu - \sum_i \alpha_i^2 x_i$
  • so the bias has magnitude at most $\gamma^2 \mu_+$, matching the main result for $k = 2$
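A quick Monte Carlo sanity check of this identity (uniform $\mathcal{P}$, $\mathcal{Q}(i) = (1 + \alpha_i)/N$ with made-up $\alpha_i$); the empirical average of $\zeta_2$ should land near the predicted value $\mu - \sum_i \alpha_i^2 x_i$.

```python
import random
from collections import Counter
from math import comb

rng = random.Random(4)
N, m, gamma, trials = 100, 1_000, 0.1, 200
x = [1] * 50 + [0] * 50                        # mu = 50
alpha = [+gamma] * 50 + [-gamma] * 50          # |alpha_i| <= gamma, and Q sums to 1
Q = [(1 + a) / N for a in alpha]               # Q(i) = (1 + alpha_i) / N

def xi(samples, h):
    """Collision estimator xi_h for uniform P (so P(i)^(-h) = N^h)."""
    mm, Y = len(samples), Counter(samples)
    return sum(comb(Y[i], h) * N ** h * x[i] for i in Y) / comb(mm, h)

vals = []
for _ in range(trials):
    s = rng.choices(range(N), weights=Q, k=m)
    vals.append(2 * xi(s, 1) - xi(s, 2))       # zeta_2 on this batch of samples

predicted = sum(x) - sum(a * a * xv for a, xv in zip(alpha, x))
print(sum(vals) / trials, predicted)           # both close to 49.5
```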

Appendix 2: Lower Bound Techniques

Step 1: Distinguishing Distributions

Theorem (Raskhodnikova et al. 2009). If two distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ have the same $p$th frequency moments for $p = 1, 2,\ldots,k$, then $\Omega(N^{1 - 1/(k+1)})$ samples are required to distinguish $\mathcal{D}_1$ from $\mathcal{D}_2$.

We Construct. Distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ with:

  • support size $n_1 = (1 + O(\gamma^k)) n$ and $n_2 = n$ (respectively)
  • both pointwise $\gamma$-close to uniform
  • identical frequency moments $p = 1, 2, \ldots, k$

Step 2: Sum Estimation

Given $\mathcal{D}_1$ and $\mathcal{D}_2$, set $N = n_1 + n_2$, $\mathcal{Q} = \frac 1 2 \mathcal{D}_1 + \frac 1 2 \mathcal{D}_2$

  • Scenario 1:

    \[x_i = \begin{cases} 1 & i \in \mathrm{supp}(\mathcal{D}_1)\\ 0 & i \in \mathrm{supp}(\mathcal{D}_2) \end{cases}\]
  • Scenario 2:

    \[x_i = \begin{cases} 0 & i \in \mathrm{supp}(\mathcal{D}_1)\\ 1 & i \in \mathrm{supp}(\mathcal{D}_2) \end{cases}\]

Then estimating $\sum_i x_i$ to within additive error $c_k \gamma^k N$ (small enough to separate the two scenarios' sums $n_1$ and $n_2$)

$\implies$ distinguishing scenarios 1 and 2

$\implies$ distinguishing $\mathcal{D}_1$ and $\mathcal{D}_2$