Unsupervised learning is contrasted with supervised learning because it uses an unlabeled training set rather than a labeled one.
In other words, we don't have the vector y of expected results; we only have a dataset of features in which we can find structure.
Clustering is good for tasks such as market segmentation, social network analysis, organizing computer clusters, and astronomical data analysis.
The K-Means Algorithm is the most popular and widely used algorithm for automatically grouping data into coherent subsets.
Our main variables are K, the number of clusters, and the training set x(1), x(2), ..., x(m), where each example x(i) ∈ R^n.
Note that we will not use the x0 = 1 convention.
The algorithm:
Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K)
Repeat:
    for i = 1 to m:
        c(i) := index (from 1 to K) of cluster centroid closest to x(i)
    for k = 1 to K:
        mu(k) := average (mean) of points assigned to cluster k
The first for-loop is the 'Cluster Assignment' step. We make a vector c where c(i) represents the centroid assigned to example x(i).
We can write the operation of the Cluster Assignment step more mathematically as follows:
c(i) := argmin_k ||x(i) − μ_k||^2
That is, each c(i) is set to the index of the centroid with the smallest distance to x(i).
By convention, we square the right-hand side, which makes the function we are trying to minimize increase more sharply. It is mostly just a convention, but one that also reduces the computational load: the Euclidean distance requires a square root, squaring cancels it, and the minimizing centroid is the same either way.
Without the square: c(i) := argmin_k ||x(i) − μ_k||
With the square: c(i) := argmin_k ||x(i) − μ_k||^2
...so the square convention serves two purposes: the objective increases more sharply away from the minimum, and the square root never has to be computed.
The second for-loop is the 'Move Centroid' step where we move each centroid to the average of its group.
More formally, the equation for this loop is as follows:
μ_k := (1/|C_k|) ∑_{i ∈ C_k} x(i) ∈ R^n
where C_k is the set of training examples currently assigned to cluster k.
If you have a cluster centroid with 0 points assigned to it, you can randomly re-initialize that centroid to a new point. You can also simply eliminate that cluster group.
After a number of iterations the algorithm will converge, at which point further iterations no longer change the clusters.
Note on non-separated clusters: some datasets have no real inner separation or natural structure. K-means can still evenly segment your data into K subsets, so can still be useful in this case.
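Putting the two steps together, here is a minimal Octave sketch of one k-means run. The variable names (X for the m×n matrix of examples, K, max_iters) are assumptions for illustration, not part of the notes.

% Minimal k-means sketch. Assumes X is an m x n matrix of examples.
K = 3;                                    % assumed number of clusters
max_iters = 10;                           % assumed number of iterations
m = size(X, 1);
centroids = X(randperm(m, K), :);         % random initialization from the examples

for iter = 1:max_iters
  % Cluster Assignment step: c(i) = index of the closest centroid to x(i)
  idx = zeros(m, 1);
  for i = 1:m
    dists = sum((centroids - X(i, :)) .^ 2, 2);   % squared distances to every centroid
    [~, idx(i)] = min(dists);
  end

  % Move Centroid step: mu(k) = mean of the points assigned to cluster k
  for k = 1:K
    members = (idx == k);
    if any(members)
      centroids(k, :) = mean(X(members, :), 1);
    else
      centroids(k, :) = X(randi(m), :);   % re-initialize a centroid with no assigned points
    end
  end
end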
Recall some of the parameters we used in our algorithm:
c(i) = index of the cluster (1, 2, ..., K) to which example x(i) is currently assigned
μ_k = cluster centroid k (μ_k ∈ R^n)
μ_{c(i)} = cluster centroid of the cluster to which example x(i) has been assigned
Using these variables we can define our cost function:
J(c(1), ..., c(m), μ_1, ..., μ_K) = (1/m) ∑_{i=1}^m ||x(i) − μ_{c(i)}||^2
Our optimization objective is to minimize this cost function over all our parameters:
min_{c, μ} J(c(1), ..., c(m), μ_1, ..., μ_K)
That is, we are finding all the values in sets c, representing all our clusters, and μ, representing all our centroids, that will minimize the average of the distances of every training example to its corresponding cluster centroid.
The above cost function is often called the distortion of the training examples.
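As a sketch, the distortion can be computed in Octave as follows; it assumes X is m×n, idx holds the assignments c(i), and centroids is a K×n matrix of the μ_k (all names are assumptions for illustration).

% Average squared distance from each example to its assigned centroid.
function J = computeDistortion(X, idx, centroids)
  m = size(X, 1);
  diffs = X - centroids(idx, :);     % each row is x(i) - mu_{c(i)}
  J = sum(sum(diffs .^ 2)) / m;
end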
In the cluster assignment step, our goal is to:
Minimize J(…) with respect to c(1), ..., c(m), holding μ_1, ..., μ_K fixed.
In the move centroid step, our goal is to:
Minimize J(…) with respect to μ_1, ..., μ_K.
With k-means, it is not possible for the cost function to increase from one iteration to the next; it should always decrease or stay the same.
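A quick sanity check on an implementation, as a sketch: J_history is an assumed vector holding the distortion after each iteration.

% The distortion should be non-increasing from iteration to iteration.
assert(all(diff(J_history) <= 1e-12), "cost increased: likely a bug in the k-means implementation");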
There's one particular recommended method for randomly initializing your cluster centroids: make sure K < m (the number of training examples), randomly pick K training examples, and set μ_1, ..., μ_K equal to these K examples.
K-means can get stuck in local optima. To decrease the chance of this happening, you can run the algorithm on many different random initializations. In cases where K<10 it is strongly recommended to run a loop of random initializations.
for i = 1 to 100:
    randomly initialize k-means
    run k-means to get 'c' and 'm'
    compute the cost function (distortion) J(c,m)
pick the clustering that gave us the lowest cost
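A sketch of that loop in Octave, assuming hypothetical helpers runKMeans (which runs k-means from a given initialization and returns the assignments and centroids) and computeDistortion (as sketched earlier); neither is a built-in function, and X, K, max_iters are assumed to exist.

best_J = Inf;
for trial = 1:100
  init = X(randperm(size(X, 1), K), :);              % random centroids drawn from the data
  [idx, centroids] = runKMeans(X, init, max_iters);  % hypothetical helper
  J = computeDistortion(X, idx, centroids);
  if J < best_J                                      % keep the lowest-distortion clustering
    best_J = J;  best_idx = idx;  best_centroids = centroids;
  end
end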
Choosing K can be quite arbitrary and ambiguous.
The elbow method: plot the cost J and the number of clusters K. The cost function should reduce as we increase the number of clusters, and then flatten out. Choose K at the point where the cost function starts to flatten out.
However, fairly often, the curve is very gradual, so there's no clear elbow.
Note: J will always decrease as K is increased. The one exception is if k-means gets stuck at a bad local optimum.
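A sketch of the elbow plot in Octave; bestDistortionForK is a hypothetical helper that runs k-means (with several random initializations) and returns the lowest distortion found for a given K.

Ks = 1:10;
Js = zeros(size(Ks));
for K = Ks
  Js(K) = bestDistortionForK(X, K);   % hypothetical helper, not a built-in function
end
plot(Ks, Js, 'bo-');                  % look for the "elbow" in this curve
xlabel('K (number of clusters)');
ylabel('Cost J (distortion)');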
Another way to choose K is to observe how well k-means performs on a downstream purpose. In other words, you choose K that proves to be most useful for some goal you're trying to achieve from using these clusters.
The following discussion shows various situations in which k-means gives totally correct but unexpected results: http://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means
Motivation I: Data Compression
Doing dimensionality reduction will reduce the total data we have to store in computer memory and will speed up our learning algorithm.
Note: in dimensionality reduction, we are reducing our features rather than our number of examples. Our variable m will stay the same size; n, the number of features each example x(i) has, is what gets reduced.
It is not easy to visualize data that is more than three dimensions. We can reduce the dimensions of our data to 3 or less in order to plot it.
We need to find new features, z_1, z_2 (and perhaps more), that can effectively summarize all the other features.
Example: hundreds of features related to a country's economic system may all be combined into one feature that you call "Economic Activity."
The most popular dimensionality reduction algorithm is Principal Component Analysis (PCA)
Problem formulation
Given two features, x_1 and x_2, we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature.
The same can be done with three features, where we map them to a plane.
The goal of PCA is to minimize the average of the distances from every data point to the projection line. This is the projection error.
Reduce from 2d to 1d: find a direction (a vector u(1) ∈ R^n) onto which to project the data so as to minimize the projection error.
The more general case is as follows:
Reduce from n dimensions to k dimensions: find k vectors u(1), u(2), ..., u(k) onto which to project the data so as to minimize the projection error.
If we are converting from 3d to 2d, we will project our data onto two directions (a plane), so k will be 2.
PCA is not linear regression
In linear regression, we are minimizing the squared error from every point to our predictor line; these are vertical distances. In PCA, we are minimizing the shortest (orthogonal) distances from the points to the line.
More generally, in linear regression we are taking all our examples in x and applying the parameters in Θ to predict y.
In PCA, we are taking a number of features x_1, x_2, ..., x_n and finding the closest lower-dimensional surface onto which to project them. We aren't trying to predict any result y, and we aren't applying any Θ weights to the features.
Before we can apply PCA, there is a data pre-processing step we must perform:
Data preprocessing
Given the training set x(1), x(2), ..., x(m), we first subtract the mean of each feature from the original feature (mean normalization): replace each x_j(i) with x_j(i) − μ_j, where μ_j = (1/m) ∑_i x_j(i). Then, if the features are on very different scales, we also scale them: x_j(i) := (x_j(i) − μ_j) / s_j, where s_j is some measure of the range of feature j (such as its standard deviation).
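In Octave this preprocessing is roughly the following sketch, assuming X is the m×n data matrix (X_norm is just an illustrative name).

mu = mean(X);                  % 1 x n vector of feature means
sigma = std(X);                % 1 x n vector of feature standard deviations (one choice of s_j)
X_norm = (X - mu) ./ sigma;    % each column now has mean 0 and a comparable scale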
We can define specifically what it means to reduce from 2d to 1d data as follows:
x(1), x(2), ..., x(m) ∈ R^2  →  z(1), z(2), ..., z(m) ∈ R
The z values are all real numbers and are the projections of our examples onto u(1).
So, PCA has two tasks: figure out the directions u(1), ..., u(k), and also compute the new features z(1), ..., z(m).
The mathematical proof for the following procedure is complicated and beyond the scope of this course.
1. Compute the "covariance matrix":
Σ = (1/m) ∑_{i=1}^m (x(i)) (x(i))^T
This can be vectorized in Octave as:
Sigma = (1/m) * X' * X;
We denote the covariance matrix with a capital sigma Σ (which happens to be the same symbol used for summation; confusingly, the two represent entirely different things).
Note that x(i) is an n×1 vector, so (x(i))(x(i))^T is an n×n matrix and therefore Σ is n×n. In the vectorized version, X is an m×n matrix whose rows are the training examples, so X' * X is also n×n.
2. Compute "eigenvectors" of covariance matrix Σ
[U,S,V] = svd(Sigma);
svd() is the 'singular value decomposition', a built-in Octave function.
What we actually want out of svd() is the 'U' matrix of the Sigma covariance matrix: U is an n×n matrix whose columns u(1), u(2), ..., u(n) are exactly the directions we are looking for.
3. Take the first k columns of the U matrix and compute z
We'll assign the first k columns of U to a variable called 'Ureduce'. This will be an n×k matrix. We compute z with z(i) = Ureduce' * x(i), which gives a k×1 vector: k dimensions instead of n.
To summarize, the whole algorithm in octave is roughly:
Sigma = (1/m) * X' * X;   % compute the covariance matrix
[U,S,V] = svd(Sigma);     % compute our projected directions
Ureduce = U(:,1:k);       % take the first k directions
Z = X * Ureduce;          % compute the projected data points
If we use PCA to compress our data, how can we uncompress our data, or go back to our original number of features?
To go from 1 dimension back to 2d we do: z ∈ R → x ∈ R^2.
We can do this with the equation: x_approx(i) = Ureduce * z(i).
Note that we can only get approximations of our original data.
Note: It turns out that the U matrix has the special property that it is a Unitary Matrix. One of the special properties of a Unitary Matrix is that U^{-1} = U*, where U* is the conjugate transpose of U.
Since we are dealing with real numbers here, this is equivalent to U^{-1} = U^T, so we could compute the inverse and use that, but it would be a waste of energy and compute cycles.
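In Octave, using the Ureduce and Z computed above, the approximate reconstruction is simply the following sketch.

X_approx = Z * Ureduce';   % m x n: row i approximates the (normalized) x(i)
                           % equivalently, x_approx(i) = Ureduce * z(i) for each example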
How do we choose k, also called the number of principal components? Recall that k is the dimension we are reducing to.
One way to choose k is by using the following rule: choose the smallest value of k such that
( (1/m) ∑_{i=1}^m ||x(i) − x_approx(i)||^2 ) / ( (1/m) ∑_{i=1}^m ||x(i)||^2 ) ≤ 0.01
In other words, the squared projection error divided by the total variation should be less than one percent, so that 99% of the variance is retained.
Algorithm for choosing k
The straightforward procedure would be: try PCA with k = 1, then k = 2, and so on; compute Ureduce, z, and x_approx each time; and check whether 99% of the variance is retained, increasing k until it is. This would actually be horribly inefficient. In Octave, we instead call svd once:
[U,S,V] = svd(Sigma)
This gives us a matrix S. We can check for 99% of retained variance using the S matrix as follows: choose the smallest k for which (∑_{i=1}^k S_ii) / (∑_{i=1}^n S_ii) ≥ 0.99.
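A sketch of that check in Octave, picking the smallest k that retains at least 99% of the variance (S is the diagonal matrix returned by svd above).

s = diag(S);                      % the diagonal entries S11, S22, ..., Snn
retained = cumsum(s) / sum(s);    % fraction of variance retained for k = 1, 2, ..., n
k = find(retained >= 0.99, 1);    % smallest k meeting the 99% threshold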
The most common use of PCA is to speed up supervised learning.
Given a training set with a large number of features (e.g. each x(i) ∈ R^n with n very large), we can use PCA to map each example to a much lower-dimensional z(i) ∈ R^k and run the (now faster) learning algorithm on the z values.
Note that we should define the PCA reduction from x(i) to z(i) only on the training set, not on the cross-validation or test sets. Once defined, the same mapping can be applied to the cross-validation and test sets.
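A sketch of fitting the mapping on the training set and re-using it elsewhere; X_train, X_cv, X_test, and k are assumed variable names for illustration.

mu = mean(X_train);
X_train_norm = X_train - mu;               % normalize with training-set statistics only
Sigma = (1 / size(X_train_norm, 1)) * (X_train_norm' * X_train_norm);
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);
Z_train = X_train_norm * Ureduce;          % reduced training set
Z_cv    = (X_cv  - mu) * Ureduce;          % the same mapping applied to the CV set
Z_test  = (X_test - mu) * Ureduce;         % ...and to the test set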
Applications
- Compression: reduce the memory or disk space needed to store the data, and speed up a learning algorithm.
- Visualization: choose k = 2 or k = 3 so that the data can be plotted.
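For the visualization case, a short Octave sketch; it assumes X_norm (the mean-normalized data) and U from svd of its covariance matrix, as computed above.

Z2 = X_norm * U(:, 1:2);            % project onto the first two principal directions
scatter(Z2(:, 1), Z2(:, 2));        % 2d scatter plot of the compressed data
xlabel('z_1'); ylabel('z_2');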
Bad use of PCA: trying to prevent overfitting. We might think that reducing the number of features with PCA would be an effective way to address overfitting. It might work, but it is not recommended, because PCA does not take the values of our results y into account; using regularization alone will be at least as effective.
Don't assume you need to do PCA. Try your full machine learning algorithm without PCA first. Then use PCA if you find that you need it.