Lecture 19: $k$-Means Clustering

COSC 225: Algorithms and Visualization

Spring, 2023

Today

the unreasonable effectiveness of simple procedures

  1. Generic problem: cluster analysis
  2. A simple procedure: k-means
  3. Implementation
  4. Demo and explanation

Motivating Problems

Problem 1

Given locations of houses in a sparsely populated county,

Find locations to put mail drop boxes to serve the community

  • fixed budget for number of drop boxes

Problem 2

Given a large body of texts (e.g., books),

Find collections of similar books (e.g., similar style or topic)

Problem 3

Given an image (e.g., digital photo),

Find a palette of representative colors similar to those in the image.

Commonalities?

Question. What do these problems have in common?

Generic Problems

Given:

  • a large collection of items

Find:

  • a partition of the collection into clusters containing similar items
  • a small collection of representative elements for the collection

Assume:

  • similarity measure between items
  • features of new items can be synthesized from existing items

Formalizing the Input

Representing items:

  • items can be represented as geometric points
  • different features correspond to different dimensions

Examples: houses, texts, colors

Representing distances:

  • $a$ has coordinates $(a_1, a_2, \ldots, a_d)$
  • $b$ has coordinates $(b_1, b_2, \ldots, b_d)$
  • $\mathrm{dist}(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_d - b_d)^2}$
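
The distance formula above translates directly into code. A minimal sketch, assuming each point is stored as an array of $d$ numbers (the function name dist and this array representation are illustrative choices, not taken from the demo code):

```javascript
// Euclidean distance between two points a and b,
// each given as an array of coordinates of the same length d.
function dist(a, b) {
  let sum = 0;
  for (let t = 0; t < a.length; t++) {
    sum += (a[t] - b[t]) ** 2; // squared difference in coordinate t
  }
  return Math.sqrt(sum);
}

console.log(dist([0, 0], [3, 4])); // 5
```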

Input parameter: $k$ = number of desired parts/clusters/elements

Desired Outputs

  1. Clusters $C_1, C_2, \ldots, C_k$ that partition the elements in the collection

  2. Points $m_1, m_2, \ldots, m_k$ that are representatives of the clusters

From representative points to clusters:

  • each point $m_i$ represents a cluster $C_i$
  • $C_i$ is the set of all data points $a$ such that $m_i$ is the closest representative point
    • $a$ in $C_i$ if $i = \arg\min_j \{\mathrm{dist}(a, m_j)\}$
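
The rule above can be sketched as a function that labels each data point with the index of its closest representative. This is an illustration under assumptions: points and means are arrays of coordinate arrays, ties go to the lowest index, and the names sqDist and assignClusters are hypothetical. Comparing squared distances gives the same arg min as comparing distances.

```javascript
// Squared Euclidean distance (same arg min as dist, no square root needed).
function sqDist(a, b) {
  let sum = 0;
  for (let t = 0; t < a.length; t++) {
    sum += (a[t] - b[t]) ** 2;
  }
  return sum;
}

// Returns an array clusters where clusters[i] is the index j of the
// representative means[j] closest to points[i] (ties go to the lowest j).
function assignClusters(points, means) {
  return points.map((a) => {
    let best = 0;
    for (let j = 1; j < means.length; j++) {
      if (sqDist(a, means[j]) < sqDist(a, means[best])) {
        best = j;
      }
    }
    return best;
  });
}

const labels = assignClusters([[0, 0], [1, 0], [10, 10]], [[0, 0], [10, 10]]);
console.log(labels); // [ 0, 0, 1 ]
```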

Question. How to choose good representative points?

From Clusters to Representative Points

Question. Given clusters, how could we choose appropriate representative points?

Idea. Choose the centroid (a.k.a. mean) of each cluster

  • centroid’s coordinates are averages of cluster’s points
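
A small sketch of the centroid computation, assuming a cluster is a non-empty array of coordinate arrays (the name centroid and this representation are illustrative):

```javascript
// Centroid of a non-empty array of points: average each coordinate separately.
function centroid(points) {
  const d = points[0].length;
  const sums = new Array(d).fill(0);
  for (const p of points) {
    for (let t = 0; t < d; t++) {
      sums[t] += p[t];
    }
  }
  return sums.map((s) => s / points.length);
}

console.log(centroid([[0, 0], [2, 0], [1, 3]])); // [ 1, 1 ]
```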

Optimization Problem

Input.

  • data points $a_1, a_2, \ldots, a_n$,
  • parameter $k$

Output.

  • means (points) $m_1, m_2, \ldots, m_k$
  • associated clusters $C_1, C_2, \ldots, C_k$

Optimality.

  • cost of $a$ in cluster $C_j$ is $\mathrm{dist}^2(a, m_j)$
  • minimize sum of all costs
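
Written out, the quantity to minimize is $\sum_{j=1}^{k} \sum_{a \in C_j} \mathrm{dist}^2(a, m_j)$. A sketch of evaluating it, assuming clusters[i] holds the index of the cluster containing points[i] (the function name totalCost and the array layout are assumptions for illustration):

```javascript
// Total cost: sum over all points of the squared distance
// from the point to the mean of its assigned cluster.
function totalCost(points, means, clusters) {
  let cost = 0;
  for (let i = 0; i < points.length; i++) {
    const m = means[clusters[i]];
    for (let t = 0; t < m.length; t++) {
      cost += (points[i][t] - m[t]) ** 2;
    }
  }
  return cost;
}

// Two points at distance 1 from a shared mean: cost 1 + 1 = 2.
console.log(totalCost([[0, 0], [2, 0]], [[1, 0]], [0, 0])); // 2
```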

A Problem with Optimization

Unfortunately. Finding $m_1, m_2, \ldots, m_k$ is computationally hard

  • no known efficient algorithm
  • widely believed that no efficient procedure exists
    • problem is NP-hard

A Heuristic Solution

  • start by selecting $m_1, m_2, \ldots, m_k$ randomly

  • compute associated clusters $C_1, C_2, \ldots, C_k$

    • then what?
  • then update means according to the clusters

    • then what?
  • then update the clusters

“Naive” $k$-Means Algorithm

  1. Choose $m_1, m_2, \ldots, m_k$ randomly
  2. Compute $C_1, C_2, \ldots, C_k$
    • $a \in C_i$ if $i = \arg\min_j \{\mathrm{dist}(a, m_j)\}$
  3. Update $m_1, m_2, \ldots, m_k$ to be the centroids of clusters
  4. Repeat steps 2 and 3 until the clusters and means no longer change
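
The four steps can be sketched as a single function. This is not the demo's implementation; it rests on several assumptions that the slides leave open: the random initial means are $k$ distinct data points (requires $k \le n$), a cluster that becomes empty keeps its previous mean, ties go to the lowest index, and a hypothetical maxIters cap guards against looping forever. The optional initialMeans argument exists only so that a run can be reproduced.

```javascript
function sqDist(a, b) {
  let sum = 0;
  for (let t = 0; t < a.length; t++) sum += (a[t] - b[t]) ** 2;
  return sum;
}

// Pick k distinct points at random (partial Fisher-Yates shuffle of a copy).
function randomMeans(points, k) {
  const copy = points.slice();
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(Math.random() * (copy.length - i));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, k).map((p) => p.slice());
}

function kMeans(points, k, initialMeans = null, maxIters = 100) {
  // Step 1: choose the initial means.
  const means = initialMeans
    ? initialMeans.map((m) => m.slice())
    : randomMeans(points, k);
  const clusters = new Array(points.length).fill(-1);
  const d = points[0].length;
  let changed = true;
  for (let iter = 0; iter < maxIters && changed; iter++) {
    changed = false;
    // Step 2: assign each point to its closest mean.
    for (let i = 0; i < points.length; i++) {
      let best = 0;
      for (let j = 1; j < means.length; j++) {
        if (sqDist(points[i], means[j]) < sqDist(points[i], means[best])) best = j;
      }
      if (best !== clusters[i]) {
        clusters[i] = best;
        changed = true;
      }
    }
    // Step 3: move each mean to the centroid of its cluster.
    const sums = means.map(() => new Array(d).fill(0));
    const counts = new Array(means.length).fill(0);
    for (let i = 0; i < points.length; i++) {
      counts[clusters[i]]++;
      for (let t = 0; t < d; t++) sums[clusters[i]][t] += points[i][t];
    }
    for (let j = 0; j < means.length; j++) {
      if (counts[j] > 0) means[j] = sums[j].map((s) => s / counts[j]);
    }
    // Step 4: the loop repeats until no assignment changed.
  }
  return { means, clusters };
}

const pts = [[0, 0], [0, 2], [10, 0], [10, 2]];
const result = kMeans(pts, 2, [[0, 0], [10, 0]]);
console.log(result.means);    // [ [ 0, 1 ], [ 10, 1 ] ]
console.log(result.clusters); // [ 0, 0, 1, 1 ]
```

Because the outcome depends on the initial means, different random starts can give different clusterings; the procedure is a heuristic and does not guarantee the minimum cost.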

Demo

  • k-means-clustering.zip

Implementation Details: Structure

  • KMeans represents an instance of $k$-means clustering problem
    • stores x- and y-coordinates of data points and means
    • stores associated clusters
    • methods to initialize and update means, update clusters
  • KMeansVisualizer
    • stores SVG element and KMeans instance
    • SVG elements for each point, mean
    • methods to add point, update groups, draw means
    • handles animation
    • responds to clicks

SVG Canvas That Fills Screen

#k-means {
    position: fixed;
    z-index: 0;
    top: 0px;
    left: 0px;
    width: 100svw;
    height: 100svh;
    background-color: rgb(220, 220, 255);
}

Box Sticks to Bottom

#root {
    width: 700px;
    height: 100svh;
    margin: 0px auto;
    display: flex;
    flex-direction: column;
    justify-content: space-between;
    overflow: clip;
}

Rounded Corners on Boxes

#root {
    overflow: clip;
}

.head, .foot {
    border-radius: 10px;
    position: relative;
}
.head {
    top: -10px;
    padding: 30px 20px 20px 20px;
}
.foot {
    top: 10px;
    padding: 20px 20px 30px 20px;
}

Generating Cluster Colors

for (let i = 0; i < this.k; i++) {
  let group = document.createElementNS(SVG_NS, "g");
  group.setAttributeNS(null, "fill", 
                       `hsl(${60 +  360 * i / this.k}, 
                        90%, ${50 - 20 * i / this.k}%)`);
  svg.appendChild(group);
  this.clusterGroups.push(group);
}

Coloring Clusters

Group element for each cluster, set fill attribute

this.clusterGroups = [];   // SVG groups for each cluster
for (let i = 0; i < this.k; i++) {
  let group = document.createElementNS(SVG_NS, "g");
  group.setAttributeNS(null, "fill", ...);
  svg.appendChild(group);
  this.clusterGroups.push(group);
}

Adding elements

this.updateGroups = function () {
  for (let i = 0; i < this.pointElts.length; i++) {
    let pt = this.pointElts[i];
    let group = this.kmeans.clusters[i];
    this.clusterGroups[group].prepend(pt);
  }
};

Show/Hide Boxes

CSS

.hidden {
    visibility: hidden;
}

JS

const btnHideBoxes = document.querySelector("#btn-hide-boxes");
btnHideBoxes.addEventListener("click", () => {
    for (let box of document.querySelectorAll(".bounding-box")) {
        box.classList.toggle("hidden");
    }
});

Animation was a Pain

let elt = document.createElementNS(SVG_NS, "polygon");
elt.setAttributeNS(null, "points", "10,0 0,10 -10,0 0,-10");
elt.classList.add('mean-point');
let animate = document.createElementNS(SVG_NS, "animateTransform");
animate.setAttributeNS(null, "attributeName", "transform");
animate.setAttributeNS(null, "attributeType", "XML");
animate.setAttributeNS(null, "type", "translate");
animate.setAttributeNS(null, "from", "0 0");
animate.setAttributeNS(null, "to", "0 0");
animate.setAttributeNS(null, "dur", "1s");
animate.setAttributeNS(null, "repeatCount", "1");
animate.setAttributeNS(null, "fill", "freeze");
animate.setAttributeNS(null, "begin", "indefinite");
elt.appendChild(animate);

Starting Mean Animation

this.updateMeans = function () {
  for (let i = 0; i < this.k; i++) {
    let elt = this.meanElts[i];
    let x = this.kmeans.xMeans[i];
    let y = this.kmeans.yMeans[i];
    let animate = elt.firstChild;
    let from = animate.getAttribute("to");
    from = (from === "0 0") ? `${x} ${y}` : from;
    animate.setAttributeNS(null, "from", from);
    animate.setAttributeNS(null, "to", `${x} ${y}`);
    animate.beginElement();
  }
};

Would Have Been Easier

CSS:

.mean-point {
  transition-property: transform;
  transition-duration: 1s;
}

Unfortunately, this applies the transition to the CSS transform property (style = "transform: ...;") and not to the transform SVG attribute.

  • maybe there is a simpler way?
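
One possibility, offered as an untested sketch rather than a fix verified against the demo: since the transition follows the CSS transform property, set that property from JavaScript instead of the SVG transform attribute. Current browsers generally apply CSS transforms to SVG elements, but behavior should be checked in the target browsers. The helper name moveMean is hypothetical, and the lengths need explicit px units in CSS.

```javascript
// Move a mean marker by setting the CSS transform property, so that the
// .mean-point transition rule can animate the change.
// elt is assumed to be an SVG element carrying the mean-point class.
function moveMean(elt, x, y) {
  elt.style.transform = `translate(${x}px, ${y}px)`;
}
```

With this approach the animateTransform child element would no longer be needed, and updating a mean would reduce to one moveMean call per mean.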

Improvements?

Next Week

Something a bit different

  • build a visualization collaboratively in class!
  • details to follow