## t-SNE
t-SNE is a tool for data visualization. It reduces the dimensionality of data to 2 or 3 dimensions so that it can be plotted easily. Local similarities are preserved by this embedding.
t-SNE converts distances between data points in the original space to probabilities. First, we compute the conditional probabilities
$p_{j|i} = \dfrac{\exp\left(-d(x_i, x_j)/2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-d(x_i, x_k)/2\sigma_i^2\right)}, \qquad p_{i|i} = 0,$
which will be used to generate joint probabilities
$p_{ij} = \dfrac{p_{j|i} + p_{i|j}}{2N}.$
The $\sigma_i$ are determined automatically; this procedure can be influenced by setting the `perplexity` of the algorithm.
A heavy-tailed distribution is used to measure similarities in the embedded space,
$q_{ij} = \dfrac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},$
and then we minimize the Kullback-Leibler divergence
$KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$
between both distributions with gradient descent (and some tricks). Note that the cost function is not convex and multiple runs might yield different results.
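As a worked illustration of the formulas above, here is a minimal NumPy sketch (not the reference implementation) that computes $p_{ij}$, $q_{ij}$, and the KL cost, using a single fixed $\sigma$ for every point instead of the perplexity-driven search:

```python
# Minimal sketch of the cost above, with a fixed sigma for all points.
import numpy as np

def joint_p(X, sigma=1.0):
    """P from the Gaussian conditionals, symmetrized and normalized by 2N."""
    n = X.shape[0]
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # squared distances
    logits = -d2 / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)                        # p_{i|i} = 0
    p_cond = np.exp(logits)
    p_cond /= p_cond.sum(axis=1, keepdims=True)              # p_{j|i}
    return (p_cond + p_cond.T) / (2 * n)                     # p_{ij}

def joint_q(Y):
    """Q from the Student-t kernel with one degree of freedom."""
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    num = 1.0 / (1.0 + d2)
    np.fill_diagonal(num, 0.0)
    return num / num.sum()

def kl(P, Q, eps=1e-12):
    """KL(P || Q) summed over i != j."""
    mask = P > 0
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))

X = np.random.randn(100, 10)          # toy high-dimensional data
Y = np.random.randn(100, 2) * 1e-2    # random initial embedding
print(kl(joint_p(X), joint_q(Y)))     # cost that gradient descent would reduce
```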
http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
1. Tackles the crowding problem by using a Student-t distribution instead of a Gaussian in the low-dimensional space.
2. Has a simpler gradient of the cost function than the original SNE, so it is easier to optimize.
**How t-SNE works**
1. Choose two similarity measures between pairs of points: one for the high-dimensional data and one for the d-dimensional embedding.
2. For the high-dimensional data, use a Gaussian kernel; for the low-dimensional embedding, use a Student-t kernel with one degree of freedom.
3. Build a d-dimensional embedding that minimizes the KL divergence between the pairwise-similarity distributions of the original data and of the embedding.
4. Start with a randomly initialized embedding and make iterative gradient-descent updates to it (see the sketch below).
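A minimal usage sketch of these steps with scikit-learn's `TSNE` (the digits dataset and the parameter values are only illustrative; the gradient-descent loop of step 4 runs inside `fit_transform`):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 x 64 toy dataset
emb = TSNE(
    n_components=2,        # d-dimensional embedding, here d = 2
    perplexity=30,         # effective number of close neighbors (see below)
    init="random",         # step 4: random initial embedding
    random_state=0,        # cost is non-convex, so fix the seed for repeatability
).fit_transform(X)
print(emb.shape)           # (1797, 2)
```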
**Perplexity parameter**
1. **perplexity** is a measurement of how well a [probability distribution](https://en.wikipedia.org/wiki/Probability_distribution) or [probability model](https://en.wikipedia.org/wiki/Probability_model) predicts a sample: $2^{H(p)} = 2^{-\sum_x p(x)\log_2 p(x)}$. It is used to find the variance of each Gaussian via a binary search (a minimal sketch appears after this list).
2. Balances attention between global and local aspects of data.
3. The parameter is, in a sense, a guess about the number of close neighbors each point has.
4. Has complex effects on the resulting plots; typical values are 5-50.
5. Getting the most from t-SNE may mean analyzing multiple plots with different perplexities.
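A hedged sketch of the variance search mentioned in point 1: for a single point, binary-search $\sigma_i$ until $2^{H(p_{\cdot|i})}$ matches the requested perplexity. Real implementations typically search over the precision $\beta_i = 1/(2\sigma_i^2)$; the helper names here are made up for illustration.

```python
import numpy as np

def perplexity_of(d2_row, sigma):
    """Perplexity 2**H of the Gaussian conditional p_{.|i} for one point."""
    p = np.exp(-d2_row / (2 * sigma ** 2))
    p /= p.sum()
    h = -np.sum(p * np.log2(p + 1e-12))     # Shannon entropy in bits
    return 2.0 ** h

def find_sigma(d2_row, target=30.0, lo=1e-10, hi=1e4, iters=50):
    """Binary search: perplexity grows monotonically with sigma."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if perplexity_of(d2_row, mid) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# d2_row: squared distances from point i to all other points (i excluded)
d2_row = np.square(np.random.randn(200, 10) - np.random.randn(10)).sum(axis=1)
print(find_sigma(d2_row, target=30.0))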
**Interpreting t-SNE plots** https://distill.pub/2016/misread-tsne/
1. t-SNE adapts its notion of 'distance' to regional density variation in the dataset, so it expands dense clusters and contracts sparse ones.
2. A cluster's geometric size in the plot doesn't mean anything, so you cannot compare the relative sizes of clusters.
3. Distance between clusters might not mean anything.
4. Random noise input doesn't always look random in the resulting plot (see the sketch below).
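A small sketch of point 4: embed pure Gaussian noise at a few perplexities (the values are chosen arbitrarily) and look for spurious clumps.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(500, 100)   # pure Gaussian noise, no clusters
for perp in (2, 30, 100):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    # inspect emb (e.g. scatter-plot it); clumpiness at small perplexity is an
    # artifact of the method, not structure in the data
    print(perp, emb.std(axis=0))
```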
**Weaknesses of t-SNE:**
1. Has quadratic memory and runtime in the number of samples, so it scales poorly to large datasets. The Barnes-Hut variant reduces the runtime to O(N log N) to tackle this scalability issue: https://arxiv.org/abs/1301.3342. There is also parametric t-SNE: http://jmlr.csail.mit.edu/proceedings/papers/v5/maaten09a/maaten09a.pdf (a preprocessing + Barnes-Hut sketch appears after this list).
2. Not very useful for dimensionality reduction for latent dimensions greater than 3.
> The behavior of t-SNE when reducing data to two or three dimensions cannot readily be extrapolated to d > 3 dimensions because of the heavy tails of the Student-t distribution. **In high-dimensional spaces, the heavy tails comprise a relatively large portion of the probability mass under the Student-t distribution, which might lead to d-dimensional data representations that do not preserve the local structure of the data as well.** Thus for tasks in which the dimensionality of the data needs to be reduced to a dimensionality higher than three, Student t-distributions with more than one degree of freedom are likely to be more appropriate.
**Performing t-SNE on a data representation produced by, for example, an autoencoder is likely to improve the quality of the constructed visualizations, because autoencoders can identify highly-varying manifolds better than a local method such as t-SNE.** However, the reader should note that it is by definition impossible to fully represent the structure of intrinsically high-dimensional data in two or three dimensions.
3. The embedding is generally not suitable for [[Clustering]] with distance- or density-based methods.
> **The problem with t-SNE is that it does not preserve distances nor density. It only to some extent preserves nearest neighbors. The difference is subtle, but affects any density- or distance-based algorithm**. Use t-SNE for visualization (and try different parameters to get something visually pleasing!), but rather do not run [[Clustering]] afterwards, in particular do not use distance- or density-based algorithms, as this information was intentionally lost. https://stats.stackexchange.com/questions/263539/k-means-clustering-on-the-output-of-t-sne
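A hedged sketch addressing weaknesses 1-2: reduce the data to ~50 dimensions first (PCA here as a simple stand-in for the autoencoder suggested above), then run scikit-learn's Barnes-Hut variant, which scales as O(N log N) but only supports up to 3 output dimensions. The data shape is made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.randn(2000, 784)                 # stand-in for e.g. flattened images
X50 = PCA(n_components=50).fit_transform(X)    # cheap preprocessing step
emb = TSNE(
    n_components=2,
    method="barnes_hut",       # approximate gradients, O(N log N)
    perplexity=30,
    random_state=0,
).fit_transform(X50)
print(emb.shape)               # (2000, 2) -- use for visualization, not clustering
```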