# On Sampling Edges Almost Uniformly

## Introduction

Suppose $G = (V, E)$ is a very large graph

Access to $G$ is provided via queries:

• sample a (uniformly random) vertex $v \in V$
• degree query: return the degree of vertex, $d(v)$
• neighbor query: return the $i$-th neighbor of $v$
• pair query: return whether or not $(u, v) \in E$

Main Question: How many such queries are necessary and sufficient to sample an edge $e \in E$ from a(n almost) uniform distribution?

## Motivation

Query Access Model(s)

• introduced in the context of property testing
• standard for sublinear (time) graph algorithms

Random Edge Samples

• used in many sublinear graph algorithms (testing bipartiteness, counting triangles & cliques, estimating moments of the degree distribution)
• used as a basic query by Aliakbarpour et al. (2016) to estimate number of star subgraphs

## Main Results

Theorem 1. There exists an algorithm $\mathcal{A}$ that given query access to any graph $G = (V, E)$ with $n = |V|$ and $m = |E|$ returns a random edge $e \in E$ such that:

1. each edge is returned with probability $(1 \pm \varepsilon) / m$
2. $\mathcal{A}$ uses $O{\big(}n\, {\big /} \sqrt{\varepsilon m}\big)$ degree and neighbor queries in expectation.

Thoerem 2. Any algorithm that samples edges from an almost-uniform distribution requires $\Omega\big(n\, \big / \sqrt{m}\big)$ degree, neighbor, or pair queries.

Note. In this talk, we'll focus on proving Theorem 1.

## A simple (but inefficient) protocol

Suppose the max degree of $G$ is bounded by $\Delta$

1. pick vertex $v \in V$ uniformly at random
2. pick $i \in \{1, 2, \ldots, \Delta\}$ at random
3. if $u$ is $i$th neighbor of $v$, return $(v, u)$
4. otherwise, repeat 1–3

Notes:

• each $e$ is returned with probability $2 \big{/} \Delta\, n$
• success probability is $2 m \big{/} \Delta\,n$
• worst-case (expected) number of queries is $\Theta(n^2 / m)$

## Algorithm Idea

• Partition vertices (and edges) into sets of heavy and light elements according to degree.
• Sample edges from light vertices by using simple (but inefficient) protocol, $\Delta = \theta$
• To sample edges from heavy vertices:
1. pick a random edge $(u, v)$ where $u$ is light,
2. if $v$ is heavy, pick a random neighbor $w$ of $v$,
3. return $(v, w)$.
• Invoke subroutines for sampling light and heavy edges with equal probability. Repeat until an edge is returned.

## An Example Graph

replace edges with directed edges

## Formalities

Replace each edge $\{u, v\} \in E$ with a pair of directed edges $(u, v), (v, u)$

Set threshold $\theta = \sqrt{m / \varepsilon}$:

• $u$ is light if $d(u) \leq \theta$ and heavy otherwise
• $(u, v)$ is light (resp. heavy) if $u$ is light (resp. heavy)
• $E_L$ (resp. $E_H$) is the set of light (resp. heavy) edges

Threshold Lemma. If $u$ is heavy, then at most an $\varepsilon$—fraction of $u$'s neighbors are heavy.

Proof. There can be at most $\sqrt{\varepsilon m} = \varepsilon \theta$ nodes with degree at least $\sqrt{m / \varepsilon}$. $\Box$

$\implies$ at least $(1 - \varepsilon) d(u)$ of $u$'s neighbors are light.

## Sample Light Edge (SLE)

SLE algorithm

1. pick $u \in V$ uniformly at random
2. if $u$ is light, pick $i \in [\theta]$ uniformly (else, fail)
3. if $v$ is $u$'s $i$-th neighbor, return $(u, v)$ (else, fail)

SLE Lemma. SLE succeeds (i.e., returns an edge) with probability $|E_L| \big/ n \theta$. Further, for every light edge $e$, we have $$\Pr(\text{SLE returns } e) = \frac{1}{n \theta}$$

## Sample Heavy Edge (SHE)

SHE algorithm

1. use SLE to sample edge $(u, v)$ (or fail)
2. if $v$ is heavy, pick $i \in [d(v)]$ uniformly
3. return $(v, w)$ where $w$ is $v$'s $i$th neighbor

SHE Lemma. SHE succeeds with probability $p$ satsifying $$(1 - \varepsilon) \frac{|E_H|}{n \theta} \leq p \leq \frac{|E_H|}{n \theta}.$$ Further, for every heavy edge $e$ we have $$(1 - \varepsilon) \frac{1}{n \theta} \leq \Pr(\text{SHE returns } e) \leq \frac{1}{n \theta}.$$

## Proof of SHE Lemma

Suppose $v$ is a heavy vertex, $d_L(v)$ the number of its light neighbors.

By Threshold Lemma, $d_L(v) \geq (1 - \varepsilon) d(v)$

By SLE Lemma the probability that some $(u, v)$ is returned in Step 1 of SHE is $$\frac{d_L(v)}{n \theta} \geq (1 - \varepsilon) \frac{d(v)}{n \theta},$$ in which case SHE succeeds.

The probability that SHE returns $(v, w)$ is then $$\frac{d_L(v)}{n \theta} \cdot \frac{1}{d(v)} \geq (1 - \varepsilon) \frac{1}{n \theta}. \qquad\qquad \Box$$

## Main Algorithm

Sample Edge (SE) Repeat until an edge $e$ is returned

• with probability $1/2$ invoke SLE
• otherwise invoke SHE

Analysis follows from SLE/SHE lemmas:

• probability that SE succeeds is at least $$\frac{1}{2} \cdot \frac{|E_L|}{n\theta} + \frac{1}{2} \cdot (1 - \varepsilon) \frac{|E_H|}{n \theta} = \Theta\left(\frac{m}{n \theta}\right)$$
• the expected number of queries until success is $O(n \theta / m) = O(n \big/ \sqrt{\varepsilon m})$
• each edge is returned with roughly equal probability

## Final Remarks

1. If $m$ is not known in advance, it can be estimated using $O(n \big / \sqrt{m})$ queries [Feige 2005, Goldreich & Ron 2008]
2. Optimality of SE algorithm (Theorem 2)
• start with any graph on $n$ vertices, $m$ edges
• add a clique with $m$ edges ($\Theta(\sqrt{m})$ nodes)
• clique edge must be returned with probability $1/2 - \varepsilon$, but the probability of "hitting" the clique is $\Theta(\sqrt{m} / n)$
3. Lower bound proof can be formalized simply using communication complexity [Eden & Rosenbaum 2018]