Unsupervised Learning

Based on the “Statistical Consulting Cheatsheet” by Prof. Kris Sankaran

[UNDER CONSTRUCTION]

Essentially, the goal of unsupervised methods is data compression, either to facilitate human interpretation or to improve the quality of downstream analysis. The compressed representation should still contain the signal present in the data, while the noise should be filtered away. Unlike regression and classification (which are supervised), no single variable is of central interest. From another perspective, unsupervised methods can be thought of as inferring latent discrete (clustering) or continuous (factor analysis) “labels”: if those labels were available, the problem would reduce to a supervised one.

Unsupervised learning encompasses a variety of methods. We briefly cover the following essential directions:

1. Clustering

Clustering methods group similar samples with one another. The usual products of a clustering analysis are (1) an assignment of each sample to a cluster and (2) a characterization of each cluster, e.g., through a centroid or a few representative samples.

Clustering techniques can roughly be divided into those that are distance-based and those that are probabilistic.

1.1 Distance-based clustering

Distance-based clustering has two main ingredients: (1) a clustering algorithm and (2) a choice of distance or similarity measure. Depending on the application and the problem at hand, these are the main “customization” options. In this section, we provide a few classical options. Note, however, that you can be creative and tailor your choices (in particular the choice of distance) to the problem at hand: the trick usually consists in defining “the right metric” or “the right measure of similarity.”

Step 1: Selecting an algorithm. Common distance-based methods include:

- K-means: alternates between assigning each sample to its nearest centroid and recomputing the centroids; works best for compact, roughly spherical clusters.
- Hierarchical (agglomerative) clustering: repeatedly merges the closest pair of clusters, producing a dendrogram that encodes clusterings at every resolution.
- Spectral clustering: clusters samples using the eigenvectors of a similarity graph’s Laplacian, which can capture non-convex cluster shapes.

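As an illustration of the two-step recipe above, here is a minimal sketch of a distance-based clustering workflow. It uses scikit-learn’s K-means on synthetic data (the library choice and the data are assumptions for illustration, not part of the original notes):

```python
# Minimal distance-based clustering sketch: K-means on two synthetic
# Gaussian blobs (assumes scikit-learn is installed).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs in 2-D: 50 samples around (0, 0), 50 around (5, 5).
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_            # product (1): a cluster assignment per sample
centers = km.cluster_centers_  # product (2): a centroid characterizing each cluster
```

Note that the two products listed earlier (assignments and cluster characterizations) fall directly out of the fitted object.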

Step 2: Selecting a distance. As far as distances are concerned, some useful ones to know are:

- Euclidean distance: the default choice for continuous measurements on comparable scales.
- Manhattan (city-block) distance: more robust to outlying coordinates than the Euclidean distance.
- Cosine distance: compares the direction of two vectors regardless of their magnitude; common for text data.
- Correlation-based distances: treat two samples as close when their profiles move together; common in gene-expression analysis.
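To make the “same algorithm, different distance” idea concrete, here is a sketch using SciPy (an assumed library choice) in which hierarchical clustering is run on a precomputed distance matrix; swapping the `metric` argument changes the notion of similarity without touching the algorithm:

```python
# Sketch: average-linkage hierarchical clustering on a precomputed
# distance matrix (assumes SciPy is installed; data are synthetic).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated groups of 20 samples each, in 4 dimensions.
X = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 4)),
    rng.normal(3.0, 0.3, size=(20, 4)),
])

# The distance is the "customization" point: try "cityblock", "cosine",
# or "correlation" in place of "euclidean" depending on the problem.
D = pdist(X, metric="euclidean")       # condensed pairwise distance matrix
Z = linkage(D, method="average")       # the algorithm is unchanged
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram at 2 clusters
```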

1.2 Probabilistic clustering techniques

In contrast, probabilistic clustering techniques assume a latent cluster indicator \(z_i\) for each sample (e.g., \(z_i = k\) if sample \(i\) belongs to cluster \(k\)) and define a likelihood model (which must itself be fit) assuming these indicators are known. Inference of the unknown \(z_i\)’s provides the sample assignments, while the parameters fitted in the likelihood model can be used to characterize the clusters. Some of the most common probabilistic clustering models are:

- Gaussian mixture models, for continuous data.
- Multinomial mixture / latent class models, for categorical data.
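The latent-indicator view can be sketched with a Gaussian mixture model; scikit-learn is an assumed library choice here, and the data are synthetic. The inferred \(z_i\)’s are the hard assignments, the posterior responsibilities give soft assignments, and the fitted means characterize the clusters:

```python
# Sketch of probabilistic clustering: a two-component Gaussian mixture
# (assumes scikit-learn is installed; data are synthetic 1-D samples).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# 60 samples near -2 and 60 samples near +2.
X = np.vstack([
    rng.normal(-2.0, 0.5, size=(60, 1)),
    rng.normal(2.0, 0.5, size=(60, 1)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
z = gmm.predict(X)           # inferred indicators z_i (hard assignments)
resp = gmm.predict_proba(X)  # posterior P(z_i = k | x_i) (soft assignments)
means = gmm.means_           # fitted parameters characterize each cluster
```

Unlike a purely distance-based method, the soft assignments `resp` quantify how confidently each sample is placed in its cluster.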

2. Low-dimensional representations

3. Networks

4. Mixtures