# K-Means
Given a set of $N$ data points $\left\{x_{1}, x_{2}, \ldots, x_{N}\right\}$, partition them into $K$ clusters. We define $\boldsymbol{\mu}_k$ as a prototype (here also the mean) of cluster $k$, and minimize the sum of squared distances from each data point to its closest prototype $\boldsymbol{\mu}_k$:
$
J=\sum_{n=1}^N \sum_{k=1}^K z_{n k}\left\|\boldsymbol{x}_n-\boldsymbol{\mu}_k\right\|^2
$
where $\boldsymbol{z}_n$ is a one-hot vector with $z_{n k}=1$ if $k$ is the closest cluster to $\boldsymbol{x}_n$, and $z_{n k}=0$ otherwise.
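The objective can be evaluated directly; a minimal NumPy sketch with made-up toy data and assignments (all values illustrative, not from the source):

```python
import numpy as np

# Toy data: 4 points in 2-D, 2 clusters (all values illustrative).
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
mu = np.array([[0.0, 0.5], [4.0, 0.5]])          # cluster means mu_k
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])   # one-hot assignments z_nk

# J = sum_n sum_k z_nk * ||x_n - mu_k||^2
sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
J = (Z * sq_dists).sum()
print(J)  # each point is 0.5 from its mean -> J = 4 * 0.25 = 1.0
```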
The objective is minimized with an algorithm of the expectation-maximization (EM) type:
First, the means $\boldsymbol{\mu}_k \in \mathbb{R}^D$ are initialized randomly.
Then repeat until convergence (no $\boldsymbol{\mu}_k$ or $z_{n k}$ changes for any $n$ and $k$):
**1. Expectation step**: Fix the means and minimize $J$ over the assignments. Since $J$ is linear in each $z_{n k}$, the optimal one-hot assignment simply picks the closest mean:
$
z_{n k}= \begin{cases}1 & \text { if } k=\arg \min _j\left\|\boldsymbol{x}_n-\boldsymbol{\mu}_j\right\|^2 \\ 0 & \text { otherwise }\end{cases}
$
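The assignment step can be sketched in NumPy (the function name `e_step` and the toy data are assumptions for illustration):

```python
import numpy as np

def e_step(X, mu):
    """Assign each point to its nearest mean; returns one-hot Z of shape (N, K)."""
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    Z = np.zeros_like(sq_dists)
    Z[np.arange(len(X)), sq_dists.argmin(axis=1)] = 1
    return Z

X = np.array([[0.0, 0.0], [5.0, 5.0]])
mu = np.array([[0.0, 1.0], [4.0, 4.0]])
Z = e_step(X, mu)  # point 0 -> cluster 0, point 1 -> cluster 1
```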
**2. Maximization step**: Fix the assignments and recompute the mean of each cluster:
$
\frac{\partial J}{\partial \boldsymbol{\mu}_k}=0 \Rightarrow \boldsymbol{\mu}_k=\frac{\sum_n z_{n k} \boldsymbol{x}_n}{\sum_n z_{n k}}
$
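The update above is just a per-cluster average; a NumPy sketch (the name `m_step` and the toy data are assumptions, and empty clusters are not handled):

```python
import numpy as np

def m_step(X, Z):
    """Recompute each mean as the average of the points assigned to it."""
    counts = Z.sum(axis=0)                # points per cluster, shape (K,)
    return (Z.T @ X) / counts[:, None]    # (K, D); assumes no cluster is empty

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
Z = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
mu = m_step(X, Z)  # cluster 0 mean (1, 0); cluster 1 mean (10, 10)
```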
Every update step in the K-means algorithm decreases the loss or leaves it unchanged, so the objective $J$ is monotonically non-increasing and the algorithm converges. However, it may converge to a local rather than the global minimum; in practice, perform multiple random restarts and keep the best solution found.
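Putting both steps together with random restarts gives the full procedure; a minimal sketch assuming NumPy (no k-means++ seeding, no empty-cluster handling, toy data invented for illustration):

```python
import numpy as np

def kmeans(X, K, n_restarts=5, max_iter=100, seed=0):
    """Plain K-means with random restarts; keeps the run with the lowest J."""
    rng = np.random.default_rng(seed)
    best_J, best_mu, best_assign = np.inf, None, None
    for _ in range(n_restarts):
        # Initialize means as K distinct data points chosen at random.
        mu = X[rng.choice(len(X), K, replace=False)]
        for _ in range(max_iter):
            # E-step: assign each point to its nearest mean.
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            assign = sq.argmin(axis=1)
            # M-step: recompute each mean (assumes no cluster goes empty).
            new_mu = np.array([X[assign == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_mu, mu):   # converged: means stopped moving
                break
            mu = new_mu
        J = sq[np.arange(len(X)), assign].sum()
        if J < best_J:
            best_J, best_mu, best_assign = J, mu, assign
    return best_mu, best_assign, best_J

# Two well-separated toy blobs; K-means should recover them.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
mu, assign, J = kmeans(X, K=2)
```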
**Pros:**
- Finds cluster centers that minimize variance (good representation of data).
- Simple to implement, widespread applications.
**Cons:**
- Assumes all clusters have a spherical distribution (isotropic, i.e. the same in all directions). The [[Gaussian Mixture Model]] generalizes K-means to address this.
- Hard membership/assignment (i.e. 1 or 0 membership). Soft variants are available.
- Prone to local minima.
- Need to choose $K$ in advance. Heuristics can be used: [[Clustering#Determining optimal number of clusters]]
- Can be slow for large datasets: each iteration is $O(KND)$ for $N$ points in $D$ dimensions.