# Bias Reduction for Sum Estimation

Will Rosenbaum (Amherst College)

Joint with: Talya Eden, Jakob Hauen, Shyam Narayanan, and Jakub Tetek

## Overview

1. Motivation
2. General Setting
3. Bias Reducing Sum Estimation
4. Lower Bounds

# Motivation

## Motivating Example

Population of $N$ voters deciding a proposition (yes/no)

Goal. Estimate the number of “yes” votes.

• expected margin of victory is $\sim 3\%$

Typical Assumption (unfounded?). We can sample voters uniformly at random.

## Easy Estimation

Procedure:

• sample $m$ voters
• $X_i \in \{0, 1\}$ is vote of $i$th sample
• return

$\bar\mu = \frac{1}{m} \sum_{i = 1}^{m} N X_i.$

Guarantee:

• $\varepsilon N$ additive error with probability $2/3$ for $m = O(\varepsilon^{-2})$.

$N = 108$, $m = 20$, $\mu = 58$

• $\bar\mu = \frac 1 m \sum_i N X_i = 59.4$

• “Yes” predicted to win!

## More Realistic Assumption?

Samples are not exactly uniform

• there may be some underlying bias
• bias may systematically favor “yes” or “no” votes
• true distribution is not known
• …but true distribution is close to uniform
• e.g., each voter sampled with probability $(1 \pm 0.1) / N$
• bias is $0.1 = 10\%$

## “Yes” Bias

$N = 108$, $m = 20$, $\mu = 58$

“Yes” voters 10% more likely to be sampled

• $\bar\mu = \frac 1 m \sum_i N X_i = 64.8$

• “Yes” predicted to win!

## “No” Bias

$N = 108$, $m = 20$, $\mu = 58$

“No” voters 10% more likely to be sampled

• $\bar\mu = \frac 1 m \sum_i N X_i = 48.6$

• “Yes” predicted to lose!

## Can We Do Better?

Question. Can we estimate $\mu = \sum_i x_i$ to within additive error, say, $0.02N$ using a sub-linear number of samples?

• naive estimates would have error proportional to the sample bias

## Yes We Can!

Our Results (special case):

1. If bias is $0.1$, $0.02 N$ additive error is achievable using $O(\sqrt{N})$ samples
2. More generally, estimating $\sum x_i$ with $x_i \in \{0, 1\}$:
• $\gamma$ is upper bound on sample bias
• $\varepsilon$ is desired approximation factor
• $k = \lceil \log(\varepsilon) / \log(\gamma) \rceil$
• $O(N^{1 - 1/k})$ samples are sufficeint
3. This sample complexity is tight

## Bigger Picture

In many contexts, it is assumed that samples can be generated exactly from some distribution:

• statistics
• sub-linear time algorithms
• distribution testing

General Questions

1. Under what circumstances are “noisy” samples sufficient?
2. What is the algorithmic cost of coping with noise?

# General Setting

## Setup

• universe of $N$ elements, $[N] = \{1, 2, \ldots, N\}$, item $i$ has associated value $x_i$
• $\mathcal{P}$ a known probability distribution over $[N]$
• $\mathcal{Q}$ is true sample distribution
• $\mathcal{Q}$ is pointwise $\gamma$-close to $\mathcal{P}$:

$(1 - \gamma) \mathcal{P}(i) \leq \mathcal{Q}(i) \leq (1 + \gamma) \mathcal{P}(i) \qquad \forall i$

Goal. Estimate $\mu = \sum_i x_i$ to within additive error $\varepsilon_1 \mu_+ + \varepsilon_2$, where

• $\mu_+ = \sum_i \vert x_i \vert$
• $\varepsilon_1 \ll \gamma$

## Classical Setting ($\mathcal{P} = \mathcal{Q}$)

Hansen-Hurwitz Estimator

Sample $m$ elements with replacement

• $\mu_{HH} = \frac{1}{m} \sum_{j = 1}^m \mathcal{P}(i_j)^{-1} X_j$
• unbiased estimator when $\mathcal{P} = \mathcal{Q}$
• bias is at most $\gamma \mu_+$ when $\mathcal{Q}$ is $\gamma$-close to $\mathcal{P}$

# Bias-Reducing Sum Estimation

## Our Idea

Apply Hansen-Hurwitz to $h$-wise collision statistics

• $h$-wise collision is a set of $h$ equal sampled indices $i_1 = i_2 = \cdots = i_h$

Get $h$-wise collision estimators:

• $\xi_h = \frac{1}{\binom{m}{h}} \sum_{i = 1}^N \binom{Y_i}{h} (\mathcal{P}(i))^{-h} x_i$
• $Y_i =$ # of times index $i$ was sampled

Note. $\xi_1 = \mu_{HH}$.

## Letdown

$h$-wise collision estimators alone are no better than Hansen-Hurwitz…

• bias is $O(\gamma h \mu_+)$
• variance is also worse

…but an appropriate linear combination of these estimators result in lower-order errors (in $\gamma$) to cancel out…

## Bias Reducing Estimator

We define the bias reducing estimator $\zeta_h$:

• $\zeta_k = \sum_{h = 1}^k (-1)^{h+1} \binom{k}{h} \xi_h$

Main Result. $\zeta_k$ has bias at most $\gamma^k \mu_+$.

• we also bound the variance of $\zeta_k$.

## A Consequence

Special Case: $\mathcal{P}$ is uniform, all values $x_i \in \{0, 1\}$

• take

$m = O\left(\sqrt[k]{n^{k-1} \varepsilon_2^{-2} \mathrm{Var}_{HH}}\right)$
• then $m$ samples are sufficient to estimate $\mu$ to additive error $\gamma^k \mu + \varepsilon_2$ with probability $2/3$

In original example

• $\gamma = 0.1$,
• $\varepsilon_1 = 0.01 = \gamma^2$
• $\varepsilon_2 = 0.01 N$

$\implies m = O(\sqrt{N})$ samples are sufficient

# Lower Bounds

## Setting of Original Example

• $\mathcal{P} =$ uniform
• $x_i \in \{0, 1\}$

Main Result (lower bound). Suppose algorithm $A$ guarantees:

• for any desired $\varepsilon > 0$,
• for every distribution $\mathcal{Q}$ that is point-wise $\gamma$-close to uniform
• $A$ returns an $\varepsilon N$ additive approximation to $\mu$ with constant probability

Then: for every positive integer $k$, there exists a constant $c_k < 1$ such that for $\varepsilon \leq c_k \gamma^k$, $A$ requires $\Omega(N^{1 - 1 / (k + 1)})$ samples.

## Questions

1. Tighten values of $c_k$

2. Sample correction: can we reduce bias of individual samples? Can we do this in an amortized fashion?

# Appendix 1: Estimator for $\mathcal{P} =$ uniform, $k = 2$

## Bias Reducing Estimator, $k = 2$

General case:

• $\zeta_k = \sum_{h = 1}^k (-1)^{h+1} \binom{k}{h} \xi_h$

Case $k = 2$:

• $\zeta_2 = 2 \xi_1 - \xi_2$

Suppose $\mathcal{Q}(i) = (1 + \alpha_i) / N$, $\vert\alpha_i\vert \leq \gamma$

Then:

• $E(\xi_1) = \sum_i (1 + \alpha_i) x_i$
• $E(\xi_2) = \sum_i (1 + \alpha_i)^2 x_i$

So:

• $E(2 \xi_1 - \xi_2) = \sum_i (2(1 + \alpha_i) - (1 + \alpha_i)^2) x_i = \mu + \sum_i \alpha_i^2 x_i$

# Appendix 2: Lower Bound Techniques

## Step 1: Distinguishing Distributions

Theorem (Raskhodnikova et al. 2009). If two distribution $\mathcal{D}_1$ and $\mathcal{D}_2$ have same $p$th frequency moments for $p = 1, 2,\ldots,k$, then $\Omega(N^{1 - 1/(k+1)})$ samples are required to distinguish $\mathcal{D}_1$ from $\mathcal{D}_2$.

We Construct. Distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ with:

• support size $n_1 = (1 + O(\gamma)^k) n$ and $n_2 = n$ (respectively)
• both pointwise $\gamma$-close to uniform
• identical frequency moments $p = 1, 2, \ldots, k$

## Step 2: Sum Estimation

Given $\mathcal{D}_1$ and $\mathcal{D}_2$, set $N = n_1 + n_2$, $\mathcal{Q} = \frac 1 2 \mathcal{D}_1 + \frac 1 2 \mathcal{D}_2$

• Scenario 1:

$x_i = \begin{cases} 1 & i \in \mathrm{supp}(\mathcal{D}_1)\\ 0 & i \in \mathrm{supp}(\mathcal{D}_2) \end{cases}$
• Scenario 2:

$x_i = \begin{cases} 0 & i \in \mathrm{supp}(\mathcal{D}_1)\\ 1 & i \in \mathrm{supp}(\mathcal{D}_2) \end{cases}$

Then estimating $\sum_i x_i$ with error $O(\gamma^k)$

$\implies$ distinguishing scenarios 1 and 2

$\implies$ distinguishing $\mathcal{D}_1$ and $\mathcal{D}_2$