Based on the “Statistical Consulting Cheatsheet” by Prof. Kris Sankaran
[UNDER CONSTRUCTION]
Essentially, the goal of unsupervised methods is data compression, either to facilitate human interpretation or to improve the quality of downstream analysis. The compressed representation should still contain the signal present in the data, but the noise should be filtered away. Unlike regression and classification (which are supervised), no single variable is of central interest. From another perspective, unsupervised methods can be thought of as inferring latent discrete (clustering) or continuous (factor analysis) “labels” — if they were available, the problem would reduce to a supervised one.
Unsupervised learning encompasses a variety of methods. We briefly go over the following essential directions:
Clustering methods group similar samples with one another. The usual products of a clustering analysis are:
Cluster assignments: Which samples belong to which clusters?
Cluster characterization: How can we interpret the resulting clusters? This can be achieved by looking at cluster centroids (the within-cluster averages) or medoids (a sample from within the cluster that is representative of it in some way).
Clustering techniques can roughly be divided up into those that are distance-based and those that are probabilistic.
Distance-based clustering has two main ingredients: (1) a clustering algorithm, and (2) a distance or choice of similarity. Depending on the application and the problem at hand, these are the main “customization” options. In this paragraph, we provide a few classical options. Note, however, that you can be creative and tailor your choices (in particular the choice of distance) to the problem at hand: the trick usually consists in defining “the right metric”, “the right measure of similarity”, etc.
Step 1: Selecting an Algorithm. Common distance-based methods include:
K-means: This builds K clusters in an iterative way, attempting to minimize the sum of within-cluster distances (so that the clusters are compact). Since all directions are treated symmetrically, the resulting clusters tend to be spherical. Similarly, since clusters are treated symmetrically, they tend to all be approximately the same size. A minimal sketch is given below.
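For concreteness, here is a minimal K-means sketch with scikit-learn; the simulated matrix X, the standardization step, and the choice of K = 3 are assumptions made only for this illustration.

```python
# Minimal K-means sketch (illustrative data and K).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(200, 5))   # stand-in for real data
X_std = StandardScaler().fit_transform(X)            # K-means is sensitive to feature scale

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)
assignments = km.labels_          # cluster assignment for each sample
centroids = km.cluster_centers_   # within-cluster averages, useful for characterization
```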
Hierarchical clustering: This builds clusters hierarchically. Specifically, it begins with each sample as its own singleton cluster. The pair of clusters found to be most similar is then merged, and the procedure continues until all clusters are joined into one large cluster. The nearness of a pair of clusters has to be defined somehow: common choices include “complete-link” merging (the distance between two clusters is the distance between the furthest pair of samples across the two clusters) or “average-link” (the distance is the average pairwise distance between samples in the two clusters).
An advantage of this approach is that it provides a full tree relating the samples: the more branch length a pair of samples shares, the more similar they are. In particular, this approach avoids the problem of choosing K, though clusters can still be obtained by cutting the tree at some horizontal level. The main drawback of this approach is that it does not scale as well as fixed-K methods. A short sketch follows.
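A short SciPy sketch of agglomerative clustering; the simulated X, the average-link choice, and the cut into 4 clusters are placeholders, not prescriptions from the text.

```python
# Hierarchical (agglomerative) clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(1).normal(size=(50, 4))     # stand-in for real data

Z = linkage(X, method="average")    # "average-link"; use method="complete" for complete-link
labels = fcluster(Z, t=4, criterion="maxclust")   # cut the tree into 4 clusters
# dendrogram(Z) draws the full tree relating the samples
```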
Spectral clustering: This method is “spectral” because it relies on the eigendecomposition of the graph Laplacian of the data, \(L = D - A\) (where \(A\) is a similarity matrix between data points and \(D\) is the associated degree matrix), to perform a dimensionality reduction of the data before clustering.
The main advantage of this method is that it is non-linear: it can recover clusters that are not spherical or convex.
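For illustration, a brief sketch with scikit-learn's SpectralClustering; the two-moons data, the RBF affinity, and the gamma value are assumptions made only for this example.

```python
# Spectral clustering on a non-convex example.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two interleaved half-moons

sc = SpectralClustering(n_clusters=2, affinity="rbf", gamma=10.0,
                        assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)   # k-means is run on the Laplacian eigenvectors, not the raw points
```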
Kernel k-means: This is a kernelized version of k-means, i.e., at a high level, k-means is applied to high-dimensional embeddings of the original dataset. This allows the use of a richer class of distances, as well as non-linear embeddings. A hand-rolled sketch is given below.
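Kernel k-means can be sketched by hand using only kernel evaluations, since squared distances to cluster means in the embedding space can be written entirely in terms of the kernel matrix. The RBF kernel and its gamma value below are illustrative choices, and the helper kernel_kmeans is ours, not a library function.

```python
# Hand-rolled kernel k-means: Lloyd-style updates expressed through the kernel matrix.
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

def kernel_kmeans(X, n_clusters, kernel="rbf", gamma=1.0, n_iter=50, seed=0):
    K = pairwise_kernels(X, metric=kernel, gamma=gamma)   # n x n similarity matrix
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=n)             # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((n, n_clusters))
        for c in range(n_clusters):
            members = labels == c
            m = members.sum()
            if m == 0:
                dist[:, c] = np.inf
                continue
            # ||phi(x_i) - mu_c||^2 = K_ii - (2/m) * sum_j K_ij + (1/m^2) * sum_{j,l} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m**2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```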
In these methods, decisions need to be made about (1) preprocessing and (2)
the distance to use. Preprocessing will vary on a case-by-case basis. Generally,
the goal of preprocessing is to ensure that the features included in clustering are
meaningful, which means we might drop, transform (e.g., standardization), or
derive features from (e.g., from counts to tf-idf scores) the original raw features.
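As a sketch of the two preprocessing examples mentioned above (standardization, and turning raw counts into tf-idf scores); the placeholder matrices X_continuous and X_counts stand in for real data.

```python
# Two common preprocessing steps before clustering.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfTransformer

X_continuous = np.random.default_rng(2).normal(loc=5.0, scale=3.0, size=(100, 4))
X_counts = np.random.default_rng(3).poisson(lam=2.0, size=(100, 20))

X_std = StandardScaler().fit_transform(X_continuous)    # mean 0, variance 1 per feature
X_tfidf = TfidfTransformer().fit_transform(X_counts)    # counts -> tf-idf weights (sparse output)
```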
Step 2: Selecting a distance. As far as distances are concerned, some useful ones to know are:
In contrast, probabilistic clustering techniques assume a latent cluster indicator \(z_i\) for each sample (e.g., \(z_i = k\) if sample \(i\) belongs to cluster \(k\)) and define a likelihood model (which must itself be fit) assuming these indicators are known. Inference of the unknown \(z_i\)’s provides the sample assignments, while the parameters fitted in the likelihood model can be used to characterize the clusters. Some of the most common probabilistic clustering models are: