Sampling Edges and More
Will Rosenbaum
Amherst College
Outline
- Introductory Remarks
- Sampling Edges: A Simple Algorithm
- Optimality?
- Parameterization by Arboricity
- Applications and Open Questions
About Me
Assistant Professor in CS at Amherst College
Previously:
- Postdocs at MPI for Informatics in Saarbrücken, Germany and Tel Aviv University
- Grad school at UCLA (Math)
Generally interested in CS Theory: algorithms and complexity
- role of communication and locality in algorithms
- local and distributed algorithms
- sub-linear algorithms
- algorithms with uncertainty about input
A Large(?) Graph
Goal: Sample a Random Edge
Problem: Don’t Initially Know Edges
Computational Model
- $G = (V, E)$ a very large graph
- access to $G$ provided via queries:
- vertex query: sample a uniformly random vertex $v \in V$
- degree query: return the degree $d(v)$ of a vertex $v$
- neighbor query: return the $i$-th neighbor of $v$ (or $\perp$ if $i > d(v)$)
- pair query: return whether or not $(u, v) \in E$
A.k.a. the general graph model in property testing
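To make the model concrete, here is a minimal Python sketch of the query interface; the class name and adjacency-list representation are illustrative assumptions, not part of the model:

```python
import random

class GraphOracle:
    """Query access to G = (V, E): the algorithm never sees the
    adjacency lists directly, only the four queries below."""

    def __init__(self, adj):          # adj: dict mapping vertex -> list of neighbors
        self.adj = adj
        self.vertices = list(adj)

    def random_vertex(self):          # vertex query: uniform random v in V
        return random.choice(self.vertices)

    def degree(self, v):              # degree query: d(v)
        return len(self.adj[v])

    def neighbor(self, v, i):         # neighbor query: i-th neighbor (1-indexed), or None
        return self.adj[v][i - 1] if 1 <= i <= len(self.adj[v]) else None

    def pair(self, u, v):             # pair query: is (u, v) in E?
        return v in self.adj[u]
```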
Problem Statement
Goal. Sample an edge $e \in E$ from an almost uniform distribution.
- use as few queries as possible
Point-wise Uniformity. Want every edge $e$ to be sampled with probability
- $\Pr(e \text{ is sampled}) = \frac{1 \pm \varepsilon}{m}$,
where $m$ is the number of edges.
2. Sampling Edges: A Simple Algorithm
Warm-Up
What if an upper bound $\Delta$ on maximum degree is known?
1. pick a vertex $v$ uniformly at random
2. pick a number $i$ from $\{1, 2, \ldots, \Delta\}$ uniformly at random
3. query the $i$th neighbor of $v$
4. if a vertex $w$ is found, return $e = (v, w)$; otherwise, go to step 1
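A minimal Python sketch of this rejection-sampling loop, assuming the hypothetical `GraphOracle` interface sketched above:

```python
import random

def sample_edge_known_delta(G, Delta):
    """Warm-up sampler: assumes a known upper bound Delta on the max degree.
    Conditioned on success, each edge is returned with equal probability."""
    while True:
        v = G.random_vertex()            # 1. uniform random vertex
        i = random.randint(1, Delta)     # 2. uniform index in {1, ..., Delta}
        w = G.neighbor(v, i)             # 3. i-th neighbor query (None if i > d(v))
        if w is not None:                # 4. success: return the edge
            return (v, w)                #    otherwise, retry
```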
A Graph with $\Delta = 4$
Pick a Random Vertex
Pick a Random Number $\leq \Delta$: 3
Failure. Try Again!
Pick a Random Vertex
Pick a Random Number $\leq \Delta$: 2
Return Selected Edge
Simple Algorithm
What if an upper bound $\Delta$ on maximum degree is known?
1. pick a vertex $v$ uniformly at random
2. pick a number $i$ from $\{1, 2, \ldots, \Delta\}$ uniformly at random
3. query the $i$th neighbor of $v$
4. if a vertex $w$ is found, return $e = (v, w)$; otherwise, go to step 1
Claim 1. Each trial returns any fixed edge with probability $\frac{2}{n \cdot \Delta}$ (the edge can be found via either endpoint), so conditioned on success the sample is uniform.
Claim 2. Expected number of queries until success is $O\left(\frac{n \cdot \Delta}{m}\right)$.
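To connect the two claims: summing the per-trial probability from Claim 1 over all $m$ edges gives
\[\Pr(\text{a trial succeeds}) = m \cdot \frac{2}{n \Delta} = \frac{2m}{n \Delta},\]
so the expected number of trials, each costing $O(1)$ queries, is $\frac{n \Delta}{2m} = O\left(\frac{n \Delta}{m}\right)$.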
Question 1
Claim 2. Expected number of queries until success is $O\left(\frac{n \cdot \Delta}{m}\right)$.
For what graphs is this algorithm efficient?
Question 2
Claim 2. Expected number of queries until success is $O\left(\frac{n \cdot \Delta}{m}\right)$.
What if there is no (known) upper bound on maximum degree?
More Questions
Can we do better…
…when no bound on $\Delta$ is known?
…when $n \Delta / m$ is large?
Main Question. For what (families of) graphs can we sample edges efficiently?
- What promises ensure efficient edge sampling?
An Illustrative Example
Observe
- $\Delta = \Theta(n)$
- $m = \Theta(n)$
- $\implies n \Delta / m = \Theta(n)$
- can always sample exactly uniformly with $\Theta(n)$ queries:
- query the degrees of all vertices, then sample directly
Leaf Edges are Easy to Sample
See Middle Edge w/ Prob $\Omega(1/n)$?
Access Middle Vertices via Neighbors
Access Middle Vertices via Neighbors
Generalizing the Procedure
- partition vertices according to degree
- use threshold $\theta$
- if $d(v) \leq \theta$, $v$ is light
- if $d(v) > \theta$, $v$ is heavy
- sample edges from light vertices as in warm-up
- to sample edges from heavy vertices:
- sample an edge $(u, v)$ with $u$ light
- if $v$ is heavy, pick a random neighbor $w$
- return $(v, w)$
Technicalities
Edges $\to$ Pairs of Oriented Edges
Partition Vertices into Heavy/Light
Partition Edges into Heavy/Light
Another View
To Sample Light Edge
1. Sample Vertex; Check Light
2. Pick Random $i \leq \theta$
3. Return $i$th Edge (or Fail)
To Sample Heavy Edge
1. Sample Light Edge $(u, v)$
2. If $v$ is Light, Fail
3. Otherwise, Pick Random Neighbor
Choosing Threshold $\theta$
Want:
- $\theta$ as small as possible
- want light edge sampling to succeed
- every heavy vertex has mostly light neighbors
- can reach every heavy vertex in one step from light
- hit each heavy vertex with prob. proportional to its degree
A Good $\theta$
Choose:
- $\theta = \sqrt{m / \varepsilon}$
- where $m$ is the number of oriented edges
Lemma. Heavy vertices have at most $\varepsilon$ fraction of heavy neighbors.
Proof. Suppose some heavy vertex has a $> \varepsilon$ fraction of heavy neighbors:
- it has $> \varepsilon \cdot \sqrt{m / \varepsilon} = \sqrt{\varepsilon m}$ heavy neighbors
- each heavy neighbor has degree $> \sqrt{m / \varepsilon}$
- total: $> \sqrt{\varepsilon m} \cdot \sqrt{m / \varepsilon} = m$ edges $\Rightarrow\!\Leftarrow$
An Issue?
We chose $\theta = \sqrt{m / \varepsilon}$
What is the problem with this? We don't know $m$ in advance!
Fortunately, we can get a good enough estimate of $m$ with $O(n / \sqrt{m})$ queries:
- Feige (2006)
- Goldreich & Ron (2008)
The General Algorithm
Until an edge is sampled, flip a coin:
- with probability 1/2, try to sample a light edge:
  - pick $u$ uniformly at random
  - if $u$ is light, pick $i \in \{1, 2, \ldots, \theta\}$ u.a.r.
  - if $i \leq d(u)$, return $(u, v)$, where $v$ is the $i$th neighbor of $u$
- with probability 1/2, try to sample a heavy edge:
  - try to sample a light edge $(u, v)$ as above
  - if $v$ is heavy, return a uniformly random incident edge $(v, w)$
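A Python sketch of the full procedure, again assuming the hypothetical `GraphOracle` from above; `theta` is assumed to be an integer threshold of order $\sqrt{m/\varepsilon}$, obtained, e.g., from an estimate of $m$:

```python
import random

def try_light_edge(G, theta):
    """One light-edge attempt: uniform vertex, then uniform slot in {1,...,theta}."""
    u = G.random_vertex()
    if G.degree(u) > theta:              # u is heavy: fail
        return None
    i = random.randint(1, theta)
    if i > G.degree(u):                  # empty slot: fail
        return None
    return (u, G.neighbor(u, i))         # oriented light edge (u, v)

def try_heavy_edge(G, theta):
    """One heavy-edge attempt: light edge into v, then a random neighbor of v."""
    e = try_light_edge(G, theta)
    if e is None:
        return None
    _, v = e
    if G.degree(v) <= theta:             # v is light: fail
        return None
    j = random.randint(1, G.degree(v))
    return (v, G.neighbor(v, j))

def sample_edge(G, theta):
    """Repeat fair-coin attempts until one of them returns an edge."""
    while True:
        attempt = try_light_edge if random.random() < 0.5 else try_heavy_edge
        e = attempt(G, theta)
        if e is not None:
            return e
```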
Analysis
Claim 1. Light edge procedure returns every light edge with probability $\frac{1}{n \cdot \theta}$. $\Box$
Claim 2. Heavy edge procedure returns every heavy edge $e$ with probability $p(e)$ satisfying
- $\frac{1 - \varepsilon}{n \cdot \theta} \leq p(e) \leq \frac{1}{n\cdot\theta}$.
Proof of Claim 2.
Suppose $v$ is a heavy vertex, and let $d_L(v)$ be the number of its light neighbors.
By the Lemma (and the choice of $\theta$), $d_L(v) \geq (1 - \varepsilon) d(v)$.
By Claim 1, the probability that the light edge procedure returns some edge $(u, v)$ pointing to $v$ is
- $\frac{d_L(v)}{n \theta} \geq (1 - \varepsilon) \frac{d(v)}{n \theta}$,
in which case the heavy edge procedure proceeds from $v$.
The probability that it returns a fixed edge $(v, w)$ is then
\[\frac{d_L(v)}{n \theta} \cdot \frac{1}{d(v)} \geq (1 - \varepsilon) \frac{1}{n \theta}.\]
The upper bound follows since $d_L(v) \leq d(v)$. $\Box$
Running Time
- probability of success is at least
- $\frac{(1 - \varepsilon) m}{n \cdot \theta} \geq \frac{\sqrt{\varepsilon m}}{2 n}$
- number of trials until success is $O\left(\frac{n}{\sqrt{\varepsilon m}}\right)$
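To see the first inequality, substitute $\theta = \sqrt{m / \varepsilon}$ (and use $1 - \varepsilon \geq 1/2$ for $\varepsilon \leq 1/2$):
\[\frac{(1 - \varepsilon) m}{n \theta} = (1 - \varepsilon) \cdot \frac{m}{n \sqrt{m / \varepsilon}} = (1 - \varepsilon) \cdot \frac{\sqrt{\varepsilon m}}{n} \geq \frac{\sqrt{\varepsilon m}}{2 n}.\]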
Main Result
Theorem. The general algorithm samples edges almost uniformly using $O\left(\frac{n}{\sqrt{\varepsilon m}}\right)$ queries in expectation.
3. Optimality?
Can We Do Better?
Question. Can we sample edges using $o\left(\frac{n}{\sqrt{m}}\right)$ queries?
No!
A Lower Bound
Theorem. Any algorithm that samples edges almost uniformly requires $\Omega\left(\frac{n}{\sqrt{m}}\right)$ queries.
Is Lower Bound Construction Reasonable?
Eh, maybe not?
Easy Instance ($n = 24, m = 44$)
Hard Instance ($n = 24, m = 44$)
What Makes an Instance Hard?
Having a large proportion of the edges hidden among a small proportion of the vertices!
- if no such dense subgraphs, maybe we can sample more efficiently…
4. Parameterization by Arboricity
What is Arboricity?
The arboricity of a graph $G$, denoted $\alpha(G)$, is:
- the maximum average degree of any subgraph of $G$
- the minimum number of spanning forests whose union is $G$
Theorem (Nash-Williams, 1964). The two definitions above are equivalent.
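Example. The star $K_{1, n-1}$ has $\Delta = n - 1$ but $\alpha = 1$: the star is itself a tree, so a single spanning forest suffices.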
High $\Delta$, Small $\alpha$
High $\Delta$, Small $\alpha$, 1
High $\Delta$, Small $\alpha$, 2
High $\Delta$, Small $\alpha$, 3
Does Knowing Arboricity Help?
Intuition. If $\alpha \approx \frac{m}{n}$, then no subgraph is much denser than the average.
But how can we use this fact?
- before, our hard lower-bound instance was a tree, with $\alpha = 1$…
Structural Lemma
Lemma. If $G = (V, E)$ has arboricity at most $\alpha$, then $V$ can be partitioned into layers $L_0, L_1, \ldots, L_\ell$ with $\ell \leq \log n$ such that:
- $v \in L_0 \iff v$ has degree at most $O(\alpha \log n)$,
- every $v \in L_i$ with $i \geq 1$ has a $1 - O(1/\log n)$ fraction of its neighbors in $L_0 \cup L_1 \cup \cdots \cup L_{i-1}$.
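For intuition only, here is a rough offline Python sketch of one natural greedy peeling that produces such layers; it assumes full access to adjacency lists and loose constants, and (as noted below) the sampling algorithm itself never computes it:

```python
import math

def layered_decomposition(adj, alpha, C=2.0):
    """Greedy peeling sketch: adj maps each vertex to its set of neighbors."""
    n = len(adj)
    log_n = max(math.log2(n), 1.0)
    placed = {v for v in adj if len(adj[v]) <= C * alpha * log_n}   # L0: low degree
    layers = [set(placed)]
    remaining = set(adj) - placed
    while remaining:
        # next layer: vertices with all but a ~1/log n fraction of
        # their neighbors already placed in lower layers
        nxt = {v for v in remaining
               if sum(1 for u in adj[v] if u in placed)
                  >= (1 - 1 / log_n) * len(adj[v])}
        if not nxt:
            break   # with arboricity <= alpha this shouldn't occur
        layers.append(nxt)
        placed |= nxt
        remaining -= nxt
    return layers
```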
Structural Lemma, Illustrated
Observation
Layered decomposition $\approx$ light/heavy decomposition from before
- can sample “light” edges (from $L_0$) easily
- to access heavy edges, start at $L_0$ and walk upwards
- all vertices reachable after $\leq \log n$ steps
An Algorithm
Repeat until successful:
- use rejection sampling to sample a vertex $u \in L_0$ with probability proportional to its degree
  - note $u \in L_0 \iff d(u) \leq C \alpha \log n$
- pick $k \in \{0, 1, \ldots, \ell\}$ uniformly at random
- perform a simple random walk of length $k$ starting from $u$
  - abort if another vertex in $L_0$ is encountered
- if the random walk ends at $v$, return $(v, w)$, where $w$ is a uniformly random neighbor of $v$
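A Python sketch of a single iteration, assuming the hypothetical `GraphOracle` interface from before and taking $\ell = \lceil \log_2 n \rceil$; the constant `C` and the exact bookkeeping are illustrative assumptions:

```python
import math
import random

def try_random_walk_edge(G, alpha, n, C=2.0):
    """One iteration of the random-walk sampler; returns an edge or None."""
    ell = math.ceil(math.log2(n))
    tau = C * alpha * ell                          # L0 = vertices with degree <= tau

    u = G.random_vertex()                          # rejection sampling: accept u with
    d = G.degree(u)                                # probability d(u)/tau, so u in L0
    if d > tau or random.random() >= d / tau:      # is kept proportionally to degree
        return None

    k = random.randint(0, ell)                     # uniform walk length in {0,...,ell}
    v = u
    for _ in range(k):
        v = G.neighbor(v, random.randint(1, G.degree(v)))  # one uniform random step
        if G.degree(v) <= tau:                     # walked back into L0: abort
            return None
    w = G.neighbor(v, random.randint(1, G.degree(v)))
    return (v, w)                                  # random edge incident to endpoint v
```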
Illustration
Pick Random $u$ in $L_0$ Prop. to $d(u)$
Pick Random $k$ in $\{0, 1, 2, 3\}$: 2
Random Walk Step $1$
Random Walk Step $2$
Didn’t Return to $L_0$
Return Random Incident Edge
Main Result
Theorem. A single iteration returns each edge $e \in E$ with probability $p(e)$ satisfying
- $(1 - \varepsilon) \frac{\varepsilon}{\alpha n \log^2 n} \leq p(e) \leq \frac{\varepsilon}{\alpha n \log^2 n}$.
Therefore, the procedure samples edges almost uniformly using $O^*\left(\frac{\alpha n}{m}\right)$ queries in expectation.
Note. The quantity $\frac{\alpha n}{m}$ is precisely the ratio of maximum average degree of a subgraph ($\alpha$) to overall average degree ($m / n$).
- this measures “how hidden” a large collection of edges can be
Interesting Feature
Observation. The layered decomposition doesn’t appear in the sampling algorithm!
- used to justify the algorithm
- provides intuition about the random walk process
- decomposition appears in the analysis of the algorithm
5. Applications and Open Questions
Application of Technique: Sampling Cliques
What if we want to sample bigger subgraphs…not just single edges?
- E.g., sampling random triangles, 4-cliques, $k$-cliques
Application of Technique: Sampling Cliques
- Define an auxiliary graph $H_k$:
  - vertices of $H_k$ are the $(k-1)$-cliques of $G$
  - two vertices of $H_k$ are adjacent when the corresponding $(k-1)$-cliques span a $k$-clique
  - sampling edges in $H_k$ thus corresponds to sampling $k$-cliques in $G$
- If $G$ has arboricity at most $\alpha$, then so does $H_k$
- Use the arboricity algorithm to sample edges of $H_k$
  - to get vertices of $H_k$, we must sample edges of $H_{k-1}$
  - so recursively call the procedure in $H_{k-1}$
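As a concrete offline illustration (not in the query model) for $k = 3$: the vertices of $H_3$ are the edges of $G$ (the $2$-cliques), and two of them are adjacent when they span a triangle. A Python sketch of one natural version of this construction, assuming comparable (e.g., integer) vertex labels; the paper's exact bookkeeping may differ:

```python
from itertools import combinations

def auxiliary_graph_H3(adj):
    """H_3 for k = 3: vertices are edges of G; two are adjacent exactly
    when their union spans a triangle of G."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    H = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        verts = set(e) | set(f)
        if len(verts) == 3:                        # e and f share one endpoint
            a, b, c = sorted(verts)
            if b in adj[a] and c in adj[a] and c in adj[b]:
                H[e].add(f)                        # union {a, b, c} is a triangle
                H[f].add(e)
    return H
```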
Application of Technique: Sampling Cliques
Theorem. Almost uniformly random $k$-cliques can be sampled using $O^*\left(\frac{n \alpha^{k-1}}{n_k}\right)$ queries in expectation, where $n_k$ is the number of $k$-cliques in the graph.
- Upper bound is tight for wide range of parameters
Note. For $k > 2$, the lower bound separates the complexity of sampling from that of approximate counting!
- In bounded arboricity graphs, triangles can be approximately counted in $O^*(1)$ queries
Black Box Applications
Previous works employ edge samples as a black-box primitive for other tasks.
Open Questions and Future Directions
- Can we sample subgraphs other than cliques efficiently in low-arboricity graphs?
- When do approximate queries suffice?
- e.g., almost uniformly random edges instead of exactly uniform
- What general relationships can we find between different query models?
- recent papers introduce more exotic queries: IS, BIS, Inner Product,…
- how can we understand the relative computational power of these primitives?
Thank You!
Any Questions?