# Lecture 19: $k$-Means Clustering

## Today

#### the unreasonable effectiveness of simple procedures

1. Generic problem: cluster analysis
2. A simple procedure: k-means
3. Implementation
4. Demo and explanation

# Motivating Problems

## Problem 1

Given locations of houses in a sparsely populated county,

Find locations to put mail drop boxes to serve the community

• fixed budget for number of drop boxes

## Problem 2

Given a large body of texts (e.g., books),

Find collections of similar books (e.g., similar style, topic, etc.)

## Problem 3

Given an image (e.g., digital photo),

Find a palette of representative colors similar to those in the image.

## Commonalities?

Question. What do these problems have in common?

## Generic Problems

Given:

• a large collection of items

Find:

• a partition of the collection into clusters containing similar items
• a small collection of representative elements for the collection

Assume:

• similarity measure between items
• features of new items can be synthesized from existing items

## Formalizing the Input

Representing items:

• items can be represented as geometric points
• different features correspond to different dimensions

Examples: houses, texts, colors

Representing distances:

• $a$ has coordinates $(a_1, a_2, \ldots, a_d)$
• $b$ has coordinates $(b_1, b_2, \ldots, b_d)$
• $\mathrm{dist}(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_d - b_d)^2}$
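The distance formula above translates directly into code. The following is a minimal sketch, assuming each point is stored as an array of $d$ numbers; the function name `dist` is chosen here to match the notation and is not taken from the demo code.

```javascript
// Euclidean distance between two points given as equal-length arrays of coordinates.
function dist(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

console.log(dist([0, 0], [3, 4])); // 5
```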

Output parameter: $k$ = # of desired parts/clusters/elements

## Desired Outputs

1. Clusters $C_1, C_2, \ldots, C_k$ that partition the elements in the collection

2. Points $m_1, m_2, \ldots, m_k$ that are representative of the clusters

From representative points to clusters:

• each point $m_i$ represents a cluster $C_i$
• $C_i$ is the set of all data points $a$ such that $m_i$ is the closest representative point
• $a$ in $C_i$ if $i = \arg\min_j \{\mathrm{dist}(a, m_j)\}$
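The arg-min rule above can be sketched as follows. This is an illustrative version, assuming points and representatives are arrays of coordinates; the helper names (`squaredDist`, `closestMean`, `assignClusters`) are hypothetical. It compares squared distances, which gives the same arg min as comparing distances, and breaks ties in favor of the smallest index (the slides do not specify a tie-breaking rule).

```javascript
// Squared Euclidean distance; the square root is not needed to find the closest point.
function squaredDist(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return sum;
}

// Index j of the representative closest to point a (ties go to the smallest index).
function closestMean(a, means) {
  let best = 0;
  for (let j = 1; j < means.length; j++) {
    if (squaredDist(a, means[j]) < squaredDist(a, means[best])) best = j;
  }
  return best;
}

// clusters[i] is the index of the cluster that contains points[i].
function assignClusters(points, means) {
  return points.map((a) => closestMean(a, means));
}

const clusters = assignClusters([[0, 0], [1, 0], [10, 10]], [[0, 0], [10, 10]]);
console.log(clusters); // [ 0, 0, 1 ]
```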

Question. How to choose good representative points?

## From Clusters to Representative Points

Question. Given clusters, how could we choose appropriate representative points?

Idea. Choose the centroid (a.k.a. mean) of each cluster

• centroid’s coordinates are averages of cluster’s points
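As a small illustration of the centroid idea, the sketch below averages each coordinate over the points of one cluster. It assumes the cluster is a non-empty array of equal-length coordinate arrays; the name `centroid` is chosen for this example.

```javascript
// Centroid of a non-empty list of points: the coordinate-wise average.
function centroid(points) {
  const d = points[0].length;
  const sums = new Array(d).fill(0);
  for (const p of points) {
    for (let i = 0; i < d; i++) sums[i] += p[i];
  }
  return sums.map((s) => s / points.length);
}

console.log(centroid([[0, 0], [2, 0], [1, 3]])); // [ 1, 1 ]
```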

## Optimization Problem

Input.

• data points $a_1, a_2, \ldots, a_n$
• parameter $k$

Output.

• means (points) $m_1, m_2, \ldots, m_k$
• associated clusters $C_1, C_2, \ldots, C_k$

Optimality.

• cost of $a$ in cluster $C_j$ is $\mathrm{dist}^2(a, m_j)$
• minimize sum of all costs
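The objective can be computed directly from the definition. The sketch below is one possible way to do it, assuming `clusters[i]` holds the index of the cluster containing `points[i]`; the name `totalCost` is hypothetical.

```javascript
// Sum over all points of the squared distance to the mean of the point's cluster.
function totalCost(points, means, clusters) {
  let cost = 0;
  for (let i = 0; i < points.length; i++) {
    const m = means[clusters[i]];
    for (let t = 0; t < m.length; t++) {
      cost += (points[i][t] - m[t]) ** 2;
    }
  }
  return cost;
}

// Two points at distance 1 from their mean, one point exactly on its mean.
console.log(totalCost([[0, 0], [2, 0], [10, 0]], [[1, 0], [10, 0]], [0, 0, 1])); // 2
```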

## A Problem with Optimization

Unfortunately. Finding optimal $m_1, m_2, \ldots, m_k$ is computationally hard

• no known efficient algorithm
• widely believed that no efficient procedure exists
• problem is NP-hard

## A Heuristic Solution

• start by selecting $m_1, m_2, \ldots, m_k$ randomly

• compute associated clusters $C_1, C_2, \ldots, C_k$

• then what?
• then update means according to the clusters

• then what?
• then update the clusters

## “Naive” $k$-Means Algorithm

1. Choose $m_1, m_2, \ldots, m_k$ randomly
2. Compute $C_1, C_2, \ldots, C_k$
   • $a \in C_i$ if $i = \arg\min_j \{\mathrm{dist}(a, m_j)\}$
3. Update $m_1, m_2, \ldots, m_k$ to be the centroids of clusters
4. Repeat 2 and 3 until the clusters and means no longer change
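The four steps can be put together as in the sketch below. It is not the demo's implementation, and several details are assumptions made for illustration: "randomly" is read as picking $k$ distinct data points, an empty cluster keeps its previous mean, and the loop stops when no point changes cluster. All function names are hypothetical. The optional third argument allows fixed starting means, which makes runs reproducible.

```javascript
function squaredDist(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return sum;
}

// Step 2: clusters[i] = index of the mean closest to points[i].
function assignClusters(points, means) {
  return points.map((a) => {
    let best = 0;
    for (let j = 1; j < means.length; j++) {
      if (squaredDist(a, means[j]) < squaredDist(a, means[best])) best = j;
    }
    return best;
  });
}

// Step 3: replace each mean by the centroid of its cluster.
function updateMeans(points, clusters, means) {
  return means.map((oldMean, j) => {
    const members = points.filter((_, i) => clusters[i] === j);
    if (members.length === 0) return oldMean; // assumption: empty cluster keeps its mean
    return oldMean.map((_, t) => members.reduce((s, p) => s + p[t], 0) / members.length);
  });
}

// Step 1 (assumption): choose k distinct data points as the initial means.
function randomMeans(points, k) {
  const pool = points.slice();
  const means = [];
  for (let j = 0; j < k; j++) {
    const r = Math.floor(Math.random() * pool.length);
    means.push(pool.splice(r, 1)[0].slice());
  }
  return means;
}

function kMeans(points, k, means = randomMeans(points, k)) {
  let clusters = assignClusters(points, means);
  while (true) {
    means = updateMeans(points, clusters, means);
    const next = assignClusters(points, means);
    const changed = next.some((c, i) => c !== clusters[i]);
    clusters = next;
    if (!changed) return { means, clusters }; // step 4: nothing changed, so stop
  }
}

const points = [[0, 0], [0, 1], [10, 10], [10, 11]];
console.log(kMeans(points, 2, [[0, 0], [10, 10]]));
// { means: [ [ 0, 0.5 ], [ 10, 10.5 ] ], clusters: [ 0, 0, 1, 1 ] }
```

Because the result depends on the starting means, different random starts can produce different clusterings for the same data.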

## Demo

• k-means-clustering.zip

## Implementation Details: Structure

• KMeans represents an instance of the $k$-means clustering problem
  • stores x- and y-coordinates of data points and means
  • stores associated clusters
  • methods to initialize and update means, update clusters
• KMeansVisualizer
  • stores SVG element and KMeans instance
  • SVG elements for each point, mean
  • methods to add point, update groups, draw means
  • handles animation
  • responds to clicks
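To make the KMeans half of this structure concrete, here is a rough sketch of what such a class could look like in two dimensions. It is not the code from the demo: the field names `k`, `xMeans`, `yMeans`, and `clusters` follow the snippets later in these notes, while `xs`, `ys`, `addPoint`, and the method names and signatures are assumptions made for this example, as is the choice to start means at random positions inside a given box.

```javascript
class KMeans {
  constructor(k) {
    this.k = k;
    this.xs = [];        // x-coordinates of data points (hypothetical name)
    this.ys = [];        // y-coordinates of data points (hypothetical name)
    this.xMeans = [];    // x-coordinates of the k means
    this.yMeans = [];    // y-coordinates of the k means
    this.clusters = [];  // clusters[i] = index of the cluster containing point i
  }

  addPoint(x, y) {
    this.xs.push(x);
    this.ys.push(y);
    this.clusters.push(0);
  }

  // Assumption: means start at uniformly random positions in a width x height box.
  initializeMeans(width, height) {
    for (let j = 0; j < this.k; j++) {
      this.xMeans[j] = Math.random() * width;
      this.yMeans[j] = Math.random() * height;
    }
  }

  // Assign every point to its closest mean.
  updateClusters() {
    for (let i = 0; i < this.xs.length; i++) {
      let best = 0;
      let bestDist = Infinity;
      for (let j = 0; j < this.k; j++) {
        const d = (this.xs[i] - this.xMeans[j]) ** 2 + (this.ys[i] - this.yMeans[j]) ** 2;
        if (d < bestDist) { best = j; bestDist = d; }
      }
      this.clusters[i] = best;
    }
  }

  // Move every mean to the centroid of its cluster (empty clusters are left alone).
  updateMeans() {
    for (let j = 0; j < this.k; j++) {
      let sumX = 0, sumY = 0, count = 0;
      for (let i = 0; i < this.xs.length; i++) {
        if (this.clusters[i] === j) { sumX += this.xs[i]; sumY += this.ys[i]; count++; }
      }
      if (count > 0) { this.xMeans[j] = sumX / count; this.yMeans[j] = sumY / count; }
    }
  }
}
```

Storing x- and y-coordinates in parallel arrays keeps the drawing code simple, since a visualizer can read `xMeans[i]` and `yMeans[i]` directly when positioning the SVG element for mean `i`.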

## SVG Canvas That Fills Screen

#k-means {
  position: fixed;
  z-index: 0;
  top: 0px;
  left: 0px;
  width: 100svw;
  height: 100svh;
  background-color: rgb(220, 220, 255);
}

## Box Sticks to Bottom

#root {
  width: 700px;
  height: 100svh;
  margin: 0px auto;
  display: flex;
  flex-direction: column;
  justify-content: space-between;
  overflow: clip;
}

## Rounded Corners on Boxes

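#root {
  overflow: clip;
}

/* selector missing in the source; assumed to cover both the header and footer boxes */
.head, .foot {
  position: relative;
}

/* selector missing in the source; .head is assumed by symmetry with .foot */
.head {
  top: -10px;
}

.foot {
  top: 10px;
}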

## Generating Cluster Colors

for (let i = 0; i < this.k; i++) {
  let group = document.createElementNS(SVG_NS, "g");
  // spread hues around the color wheel; later clusters are slightly darker
  group.setAttributeNS(null, "fill",
    `hsl(${60 + 360 * i / this.k}, 90%, ${50 - 20 * i / this.k}%)`);
  svg.appendChild(group);
  this.clusterGroups.push(group);
}
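The color formula in the loop above can be isolated as a pure function, which makes it easy to check outside the browser. This is a small sketch; the name `clusterColor` is hypothetical and does not appear in the demo.

```javascript
// Fill color for cluster i out of k: hue starts at 60 and advances by 360/k per cluster,
// lightness drops from 50% toward 30% as i grows.
function clusterColor(i, k) {
  const hue = 60 + 360 * i / k;
  const lightness = 50 - 20 * i / k;
  return `hsl(${hue}, 90%, ${lightness}%)`;
}

console.log(clusterColor(0, 4)); // hsl(60, 90%, 50%)
console.log(clusterColor(2, 4)); // hsl(240, 90%, 40%)
```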

## Coloring Clusters

Group element for each cluster, set fill attribute

this.clusterGroups = [];   // SVG groups for each cluster
for (let i = 0; i < this.k; i++) {
  let group = document.createElementNS(SVG_NS, "g");
  group.setAttributeNS(null, "fill", ...);
  svg.appendChild(group);
  this.clusterGroups.push(group);
}

this.updateGroups = function () {
  for (let i = 0; i < this.pointElts.length; i++) {
    let pt = this.pointElts[i];
    let group = this.kmeans.clusters[i];   // index of the cluster containing point i
    this.clusterGroups[group].prepend(pt); // moving pt into the group gives it the group's fill
  }
};

## Show/Hide Boxes

CSS

.hidden {
visibility: hidden;
}

JS

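const btnHideBoxes = document.querySelector("#btn-hide-boxes");
// The opening of the handler is missing in the source; a click listener is assumed.
btnHideBoxes.addEventListener("click", () => {
  for (let box of document.querySelectorAll(".bounding-box")) {
    box.classList.toggle("hidden");
  }
});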

## Animation was a Pain

let elt = document.createElementNS(SVG_NS, "polygon");
elt.setAttributeNS(null, "points", "10,0 0,10 -10,0 0,-10");
let animate = document.createElementNS(SVG_NS, "animateTransform");
animate.setAttributeNS(null, "attributeName", "transform");
animate.setAttributeNS(null, "attributeType", "XML");
animate.setAttributeNS(null, "type", "translate");
animate.setAttributeNS(null, "from", "0 0");
animate.setAttributeNS(null, "to", "0 0");
animate.setAttributeNS(null, "dur", "1s");
animate.setAttributeNS(null, "repeatCount", "1");
animate.setAttributeNS(null, "fill", "freeze");
animate.setAttributeNS(null, "begin", "indefinite");
elt.appendChild(animate);

## Starting Mean Animation

this.updateMeans = function () {
  for (let i = 0; i < this.k; i++) {
    let elt = this.meanElts[i];
    let x = this.kmeans.xMeans[i];
    let y = this.kmeans.yMeans[i];
    let animate = elt.firstChild;   // the animateTransform child
    // start from the previous target; on the first update ("0 0") start at (x, y) itself
    let from = animate.getAttribute("to");
    from = (from === "0 0") ? `${x} ${y}` : from;
    animate.setAttributeNS(null, "from", from);
    animate.setAttributeNS(null, "to", `${x} ${y}`);
    animate.beginElement();
  }
};

## Would Have Been Easier

CSS:

.mean-point {
transition-property: transform;
transition-duration: 1s;
}

Unfortunately, this applies the transition to the CSS transform property (style = "transform: ...;") and not to the transform SVG attribute.

• maybe there is a simpler way?
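One possible answer, offered as an untested sketch rather than a drop-in fix for the demo: since the transition applies to the CSS transform property, the mean could be moved by setting that property through the element's `style` instead of the SVG attribute. This assumes a browser that supports CSS transforms on SVG elements, and the helper name `moveMean` is hypothetical. Note that CSS `translate()` needs length units such as `px`, unlike the unitless SVG attribute.

```javascript
// Move an element by setting the CSS transform property (not the SVG attribute),
// so that a CSS transition on `transform` can animate the change.
function moveMean(elt, x, y) {
  elt.style.transform = `translate(${x}px, ${y}px)`;
}

// Stand-in object for trying the helper outside the browser; in the page,
// elt would be the SVG element with class "mean-point".
const fakeElt = { style: {} };
moveMean(fakeElt, 120, 80);
console.log(fakeElt.style.transform); // translate(120px, 80px)
```

With this approach the `animateTransform` child and the `from`/`to` bookkeeping would no longer be needed, at the cost of relying on CSS transform support for SVG elements.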

## Next Week

Something a bit different

• build a visualization collaboratively in class!
• details to follow