# Learning Embedding Space for Clustering From Deep Representations
Authors: Paras Dahal
Date: 2018
Paper: https://ieeexplore.ieee.org/abstract/document/8622629
Code: https://github.com/parasdahal/deepclustering
Blog: https://www.parasdahal.com/deep-clustering
## The problem
- K-Means clustering is less effective for high-dimensional data.
- In high-dimensional data, Euclidean distance is not a good similarity measure because distances between points become relatively uniform.
- Why not simply cluster on AE representations?
- Reconstruction loss leads to compactly positioned representations whose natural clusters overlap severely.
- We need to separate out the natural clusters somehow, but that affects the reconstruction quality.
- DEC first trains an AE, then takes the encoder and fine-tunes it with a clustering loss.
- There is a tradeoff between preserving reconstruction capability and separating the clusters when manipulating the encoded representations.
- We want to preserve the AE latent space (i.e. keep good reconstruction) while also separating the clusters well.
## The solution
- Use a separate neural network (the representation network) that takes the learned AE representations (Z) and maps them to a separate clustering-friendly space (E); see the sketch after this list.
- Note that there is no dimensionality reduction, just a non-linear transformation.
- Objective: minimize the cross-entropy between the pairwise similarity distributions computed in the AE representation space Z and in the clustering space E.
- Convert pairwise distances in Z to probabilities using a Student's t-distribution kernel with degree-of-freedom parameter alpha = 2d (d being the dimension of Z), and distances in E using alpha = 1.
- This difference in degrees of freedom is key: with a low degree of freedom, similar points are pushed together and dissimilar points are pushed farther apart, which leads to dense but well-separated clusters.
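A minimal PyTorch sketch of the two components, assuming simple MLP encoders and illustrative layer sizes (the actual architecture in the paper and repo may differ): an autoencoder producing Z, and a representation network h mapping Z to E with the same dimensionality.

```python
# Sketch only, not the authors' code; layer sizes and dims are assumptions.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, z_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 500), nn.ReLU(),
                                     nn.Linear(500, z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, 500), nn.ReLU(),
                                     nn.Linear(500, in_dim))

    def forward(self, x):
        z = self.encoder(x)           # Z: AE representation space
        return z, self.decoder(z)     # reconstruction used for the AE loss

class RepresentationNetwork(nn.Module):
    """Non-linear map h: Z -> E, same dimensionality (no reduction)."""
    def __init__(self, z_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 500), nn.ReLU(),
                                 nn.Linear(500, z_dim))

    def forward(self, z):
        return self.net(z)            # E: clustering-friendly space
```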
## The details
![[architecture.png]]
First, the pairwise probability $p_{ij}$, denoting the probability of two points lying together in the encoded space Z, is calculated using a Student's t-distribution kernel:
$p_{ij} = \frac{(1+||f(x_i)-f(x_j)||^2/\alpha)^{-\frac{\alpha+1}{2}}}{\sum_{k\neq l }(1+||f(x_k)-f(x_l)||^2/\alpha)^{-\frac{\alpha+1}{2}}}$
The Student's t-distribution is used because it approximates a Gaussian distribution for higher degrees of freedom and has no kernel-width parameter; it also assigns stricter probabilities. The degree of freedom is taken as 2 times the dimension of Z, which gives more room to model the local structure of the representation space.
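A hedged PyTorch sketch of this kernel; the function name `pairwise_t_similarity` and the batch-wise normalization over all pairs are my assumptions, not the authors' implementation. The same function yields q below with alpha = 1.

```python
# Normalized Student's t kernel over all pairs in a batch (sketch, assumed API).
import torch

def pairwise_t_similarity(points, alpha):
    """points: (n, d) tensor; alpha: degrees of freedom.
    Returns an (n, n) matrix summing to 1 over all i != j pairs."""
    sq_dist = torch.cdist(points, points) ** 2                   # ||a_i - a_j||^2
    kernel = (1.0 + sq_dist / alpha) ** (-(alpha + 1) / 2)
    mask = 1.0 - torch.eye(points.size(0), device=points.device)
    kernel = kernel * mask                                        # exclude i == j
    return kernel / kernel.sum()                                  # normalize over pairs
```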
Similarly, the pairwise probability $q_{ij}$ denotes the probability of two points lying together in the embedding space E:
$q_{ij} = \frac{(1+||h(z_i)-h(z_j)||^2)^{-1}}{\sum_{k\neq l }(1+||h(z_k)-h(z_l)||^2)^{-1}}$
Here the degree of freedom is chosen as 1, which limits the freedom to model local structure; as q is pushed towards p by minimizing the divergence between them, this limitation creates a strong repulsive force between clusters.
The representation network is trained with the cross-entropy between p and q, which also has the effect of minimizing the entropy of the distribution p.
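A sketch of that training objective, reusing `pairwise_t_similarity` from above; pairing it with the reconstruction loss in the usage comment is an assumption about the end-to-end setup, not a transcription of the authors' code.

```python
# Clustering loss sketch: cross-entropy between p (on Z) and q (on E).
def clustering_loss(z, e, z_dim):
    p = pairwise_t_similarity(z, alpha=2 * z_dim)   # targets from the AE space
    q = pairwise_t_similarity(e, alpha=1.0)         # predictions in clustering space
    eps = 1e-12                                      # numerical stability
    # H(p, q); since p depends on the encoder, this also shrinks H(p)
    return -(p * torch.log(q + eps)).sum()

# Assumed usage: z, x_rec = autoencoder(x); e = rep_net(z)
# loss = reconstruction_loss(x, x_rec) + clustering_loss(z, e, z_dim=10)
```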
## The results
- A simpler model, trainable end-to-end, that outperforms other AE-based models while still being fairly competitive with trickier, more complex models.