# Clustering
Divide the dataset into subsets which maximize intra-subset similarity and inter-subset dissimilarity. The similarity measure is defined beforehand.
## Determining optimal number of clusters
### Elbow Method
### Gap Statistic
### Average Silhouette Score
http://www.sthda.com/english/wiki/print.php?id=239
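As a quick illustration, here is a minimal sketch (assuming scikit-learn and synthetic blob data) that prints the elbow-method inertia and the average silhouette score for a range of candidate k values; the gap statistic would need an additional reference-distribution step not shown here.
```python
# Minimal sketch (scikit-learn, synthetic blobs): print the elbow-method
# inertia and the average silhouette score for a range of candidate k values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia = km.inertia_                  # within-cluster sum of squares ("elbow" curve)
    sil = silhouette_score(X, km.labels_)  # mean silhouette coefficient
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")
```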
## Metrics for comparing clustering performance
### Paired Evaluation
For sample pairs (a, b), ask humans whether a and b should be in the same group; this is a cognitively easy task. Then let the system produce clusters for the dataset.
- **FN**: matching pairs (a, b) that end up in different clusters
- **FP**: non-matching pairs (a, b) that end up in the same cluster
From these pair-level counts, calculate AUC and F1 score.
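A minimal sketch of this pairwise evaluation; the human judgments and the system's cluster assignment below are made up, and only precision, recall, and F1 are computed.
```python
# Sketch of pairwise evaluation: count pair-level TP/FP/FN from made-up human
# judgments and a made-up cluster assignment, then derive precision/recall/F1.
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}            # system output
same_group = {("a", "b"): True, ("a", "c"): False,               # human judgments
              ("c", "d"): True, ("d", "e"): False, ("b", "e"): False}

tp = fp = fn = 0
for (x, y), should_match in same_group.items():
    predicted_match = cluster_of[x] == cluster_of[y]
    if should_match and predicted_match:
        tp += 1
    elif not should_match and predicted_match:
        fp += 1   # non-matching pair placed in the same cluster
    elif should_match and not predicted_match:
        fn += 1   # matching pair split across clusters

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(precision, recall, f1)
```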
### Rand Index
The Rand index is a measure of similarity between two clusterings. Given a set S of n elements and two partitions of S to compare, X and Y, define the following:
- a, the number of pairs of elements in S that are in the same subset in X and in the same subset in Y
- b, the number of pairs of elements in S that are in different subsets in X and in different subsets in Y
Then the Rand index is $R = \frac{a + b}{\binom{n}{2}}$
Rand index represents the *frequency of occurrence* of agreements over the total pairs, or the probability that X and Y will agree on a randomly chosen pair.
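A small sketch computing a, b, and the Rand index directly from the definition above, cross-checked against scikit-learn's `rand_score` (assuming a recent scikit-learn version that ships it).
```python
# Sketch: Rand index as (a + b) / C(n, 2), compared with sklearn's rand_score.
from itertools import combinations
from sklearn.metrics import rand_score

X_labels = [0, 0, 1, 1, 2, 2]   # partition X of the same 6 elements
Y_labels = [0, 0, 1, 2, 2, 2]   # partition Y

a = b = 0
for i, j in combinations(range(len(X_labels)), 2):
    same_x = X_labels[i] == X_labels[j]
    same_y = Y_labels[i] == Y_labels[j]
    if same_x and same_y:
        a += 1        # together in both partitions
    elif not same_x and not same_y:
        b += 1        # apart in both partitions

n_pairs = len(X_labels) * (len(X_labels) - 1) // 2
print((a + b) / n_pairs, rand_score(X_labels, Y_labels))  # the two values agree
```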
### Normalized Mutual Information
Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score that scales the result between 0 (no mutual information) and 1 (perfect correlation). A common choice is to normalize the mutual information by `sqrt(H(labels_true) * H(labels_pred))`, the geometric mean of the two label entropies.
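A short sketch requesting that geometric-mean normalization explicitly, assuming a scikit-learn version that exposes the `average_method` argument.
```python
# Sketch: NMI normalized by sqrt(H(labels_true) * H(labels_pred)).
from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 0, 2]
print(normalized_mutual_info_score(labels_true, labels_pred,
                                   average_method="geometric"))
```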
### Silhouette Score (Ground truth free method)
The Silhouette Coefficient is calculated using the mean intra-cluster distance (`a`) and the mean nearest-cluster distance (`b`) for each sample. The Silhouette Coefficient for a sample is `(b - a) / max(a, b)`. To clarify, `b` is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of.
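A short scikit-learn sketch of the per-sample coefficients; their mean is the overall silhouette score, and no ground-truth labels are involved.
```python
# Sketch: per-sample silhouette coefficients (b - a) / max(a, b); the mean over
# all samples is the score usually reported.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
per_sample = silhouette_samples(X, labels)  # one coefficient per sample
print(per_sample.mean())
```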
## Partition Clustering
Data is divided into non-overlapping subsets such that each data instance is assigned to exactly one subset. In k-means, the first step is to choose initial centroids; the algorithm then iteratively assigns every data point to its nearest centroid and recomputes each centroid as the mean of its assigned points (a minimal sketch of this loop follows the list below).
Advantages
- Simplicity and computational efficiency.
- Can be used to sample different types of data points in addition to using it for clustering.
Disadvantages
- Need to specify number of clusters.
- It is vulnerable to the initial centroid seeding: poorly chosen seeds can adversely affect the quality of the clusters. An improved version of k-means, called k-means++, introduces a new seeding approach: given the seeds chosen so far, the next seed is sampled with probability proportional to its squared distance from the nearest existing seed, so the seeds favour points far from one another and end up probabilistically dispersed.
- Another disadvantage is that, since it uses Euclidean distance, it does not work well in high-dimensional spaces.
- It also assumes that clusters are roughly spherical, which is often not true of real-world data.
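A minimal NumPy sketch of the assign/update loop described above, using plain random seeding instead of k-means++; the toy data and fixed iteration count are only for illustration.
```python
# Sketch of the k-means loop: assign each point to its nearest centroid, then
# recompute each centroid as the mean of its assigned points.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + 4.0 * rng.integers(0, 3, size=(200, 1))  # 3 rough blobs
k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]  # plain random seeding

for _ in range(20):
    # Assignment step: each point goes to its nearest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of the points assigned to it.
    centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])

print(np.round(centroids, 2))
```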
## Hierarchical Clustering
Clusters are formed by a top-down (divisive) or bottom-up (agglomerative) approach. A hierarchical tree connecting all data points is built, and a depth level is chosen at which to cut the tree into clusters.
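A small sketch of the bottom-up (agglomerative) variant with SciPy: the full merge tree is built, then cut into a fixed number of clusters.
```python
# Sketch: agglomerative hierarchical clustering; linkage builds the merge tree,
# fcluster cuts it into 3 clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=2)
Z = linkage(X, method="ward")                     # full hierarchical tree
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(np.bincount(labels))
```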
## Density-based Clustering
Data is clustered based on connectivity and density functions. DBSCAN uses density-reachability and density-connectedness to classify each data point as a core point, a border point, or noise. If a point is a core point, the cluster is expanded around it.
OPTICS can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the epsilon parameter with a maximum search radius.
Advantages
- No need to specify number of clusters.
- DBSCAN has been shown to be robust at discovering arbitrarily shaped clusters. It is known for correctly separating non-convex clusters in situations where centroid-based clustering performs very poorly.
Disadvantages
- Although DBSCAN does not need the number of clusters in advance, its two parameters (the neighbourhood radius eps and the minimum number of points minPts) need to be tuned for good performance (see the sketch after this list).
- DBSCAN also uses Euclidean distance, so it does not perform well on high-dimensional data.
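The sketch below (scikit-learn, two interleaving half-moons) illustrates both points: DBSCAN recovers the non-convex clusters without being given their number, but `eps` and `min_samples` are still hand-picked.
```python
# Sketch: DBSCAN on a non-convex shape where centroid-based methods typically fail.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # -1 marks noise points; the remaining ids are clusters
```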
## Spectral Clustering
Spectral Clustering is a modern clustering approach based on the leading eigenvectors of a Laplacian matrix derived from an affinity (similarity) matrix of the data points. The idea is to perform dimensionality reduction via this eigendecomposition and then run k-means in the lower-dimensional space.
This is a very promising approach, but computational scalability might prevent its use on large, high-dimensional real-world data.
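A short scikit-learn sketch on concentric circles: a nearest-neighbour affinity graph is built, its Laplacian eigenvectors give a low-dimensional embedding, and k-means runs in that space; plain k-means on the raw coordinates cannot separate the two rings.
```python
# Sketch: spectral clustering on concentric circles via scikit-learn.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)
labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            assign_labels="kmeans", random_state=0).fit_predict(X)
print(set(labels))
```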
## Other Paradigms
Correlation clustering, Gravitational Clustering, Herd Clustering, Grid Clustering.
![[cluster_comparison.png]]
## Deep Clustering
In modern deep-learning based approaches, a DNN is used to learn a mapping from the input features to a reduced-dimensional representation, and these representations are then used as input to a specific clustering method.
My work - [[Learning Embedding Space for Clustering From Deep Representations]]
### Architecture/Taxonomy
- **Neural Network Training Procedure**
- **Architecture**
- RNN
- CNN
- FNN
- **Set of deep features**
- Output of encoder (DEC)
- Output of several layers concatenated (Neural Clustering)
- **Losses**
- No non-clustering loss
- Autoencoder Reconstruction Loss (MSE)
- Possibly include churn indication to encourage meaningful feature extraction.
- **Clustering Loss** (for learning cluster-friendly representations)
- K-Means Loss
- Clustering Assignment Hardening Loss
- Locality Preserving Loss
- **Combination of Clustering and Non-Clustering Losses**
- $L(\theta) = \alpha L_c(\theta) + (1 - \alpha) L_n(\theta)$
- Alpha defines the weighting between the two loss functions (a minimal sketch of the combined loss follows this taxonomy). Strategies:
- Pre-training: Initially alpha=0, then later alpha=1
- Joint training: Use both losses, alpha~=0.5
- Variable/Scheduled: Slowly increase alpha
- **Cluster Updates**
- Jointly updated as network produces cluster probabilities
- Alternately updated
- Update cluster assignments after a certain number of iterations
- Update after every network parameter update
- **Post training clustering**
- Run cluster assignment again on the learned representations.
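A minimal PyTorch-style sketch of the combined loss $L(\theta) = \alpha L_c(\theta) + (1 - \alpha) L_n(\theta)$, pairing an autoencoder reconstruction loss ($L_n$) with a DEC-style clustering assignment hardening loss ($L_c$). The `encoder`, `decoder`, and `centroids` used here are assumed to exist elsewhere; this only illustrates how the two losses are weighted, not any particular published method.
```python
# Sketch of L = alpha * L_c + (1 - alpha) * L_n with a reconstruction loss (L_n)
# and a DEC-style assignment hardening KL loss (L_c). encoder, decoder, and
# centroids are assumed to be defined elsewhere.
import torch
import torch.nn.functional as F

def soft_assignments(z, centroids, t_dof=1.0):
    # Student's t kernel between embeddings and cluster centroids (soft q_ij).
    dist_sq = torch.cdist(z, centroids) ** 2
    q = (1.0 + dist_sq / t_dof) ** (-(t_dof + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # Sharpened targets p_ij that emphasise high-confidence assignments.
    p = q ** 2 / q.sum(dim=0)
    return p / p.sum(dim=1, keepdim=True)

def combined_loss(x, encoder, decoder, centroids, alpha=0.5):
    z = encoder(x)
    non_clustering_loss = F.mse_loss(decoder(z), x)                 # L_n: reconstruction
    q = soft_assignments(z, centroids)
    p = target_distribution(q).detach()
    clustering_loss = F.kl_div(q.log(), p, reduction="batchmean")   # L_c: KL(p || q)
    return alpha * clustering_loss + (1 - alpha) * non_clustering_loss
```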
### Existing Deep Learning Approaches
![[deepclustering.jpg]]
---
## References
1. https://en.wikipedia.org/wiki/Rand_index
2. http://ieeexplore.ieee.org/document/7414675/
3. http://www.dbs.ifi.lmu.de/~zimek/publications/PAKDD2013/TutorialPAKDD2013OutlierDetectionHighDim.pdf
4. https://web.stanford.edu/~hastie/Papers/gap.pdf
5. https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html