The Support Vector Machine (SVM) is yet another type of supervised machine learning algorithm. Compared to logistic regression and neural networks, it sometimes gives a cleaner and more powerful way of learning complex non-linear functions.
Recall that in logistic regression, we use the following rules:
if y=1, then we want $h_\theta(x) \approx 1$, which requires $\theta^T x \gg 0$
if y=0, then we want $h_\theta(x) \approx 0$, which requires $\theta^T x \ll 0$
Recall the cost function for (unregularized) logistic regression:

$$\text{cost}(h_\theta(x), y) = -y\,\log(h_\theta(x)) - (1 - y)\,\log(1 - h_\theta(x)), \qquad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
To make a support vector machine, we will modify the first term of the cost function, $-\log(h_\theta(x)) = -\log\frac{1}{1 + e^{-z}}$ (where $z = \theta^T x$), so that it is exactly 0 for $z \geq 1$ and a straight, decreasing line for $z < 1$.
Similarly, we modify the second term of the cost function, $-\log(1 - h_\theta(x)) = -\log\left(1 - \frac{1}{1 + e^{-z}}\right)$, so that it is exactly 0 for $z \leq -1$ and a straight, increasing line for $z > -1$.
We shall denote these as $\text{cost}_1(z)$ and $\text{cost}_0(z)$, respectively.
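For concreteness, here is a minimal NumPy sketch of the two costs, assuming the standard unit-slope hinge form (the exact slope of the straight-line portion is not important for the argument):

```python
import numpy as np

# Hinge-style replacements for the logistic costs (unit slope assumed).
# cost1(z) is exactly 0 for z >= 1; cost0(z) is exactly 0 for z <= -1.
def cost1(z):
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    return np.maximum(0.0, 1.0 + z)

print(cost1(2.0), cost1(0.0))   # 0.0 when z >= 1, positive otherwise
print(cost0(-2.0), cost0(0.0))  # 0.0 when z <= -1, positive otherwise
```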
Recall the full cost function from (regularized) logistic regression:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[ -y^{(i)}\,\log(h_\theta(x^{(i)})) - (1 - y^{(i)})\,\log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$$
Note that the negative sign has been distributed into the sum in the above equation.
We may transform this into the cost function for support vector machines by substituting $\text{cost}_1(z)$ and $\text{cost}_0(z)$:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)})\,\text{cost}_0(\theta^T x^{(i)}) \right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$$
We can optimize this a bit by multiplying it by m (thus removing the m factor in the denominators). Note that this does not affect our optimization, since we're simply multiplying our cost function by a positive constant (for example, minimizing $(u - 5)^2 + 1$ gives us $u = 5$, and so does minimizing $10\left[(u - 5)^2 + 1\right]$).
Furthermore, convention dictates that we regularize using a factor C, instead of λ, like so:

$$J(\theta) = C \sum_{i=1}^m \left[ y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)})\,\text{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^n \theta_j^2$$
This is equivalent to multiplying the equation by $C = \frac{1}{\lambda}$, so a large C acts like a small λ, and a small C acts like a large λ.
Finally, note that the hypothesis of the Support Vector Machine is not interpreted as the probability of y being 1 or 0 (as it is for the hypothesis of logistic regression). Instead, it outputs either 1 or 0: $h_\theta(x) = 1$ if $\theta^T x \geq 0$, and $h_\theta(x) = 0$ otherwise. (In technical terms, it is a discriminant function.)
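As a small illustration (a sketch, not any particular library's API), the prediction is just a threshold on $\theta^T x$:

```python
import numpy as np

# SVM hypothesis: a hard 0/1 prediction rather than a probability.
def svm_predict(theta, x):
    return 1 if theta @ x >= 0 else 0

theta = np.array([-1.0, 2.0])                    # hypothetical parameters
print(svm_predict(theta, np.array([1.0, 1.0])))  # 1, since theta'x =  1 >= 0
print(svm_predict(theta, np.array([3.0, 1.0])))  # 0, since theta'x = -1 <  0
```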
A useful way to think about Support Vector Machines is to think of them as Large Margin Classifiers.
If y=1, we want $\theta^T x \geq 1$ (not just $\geq 0$).
If y=0, we want $\theta^T x \leq -1$ (not just $< 0$).
Now when we set our constant C to a very large value (e.g. 100,000), our optimizing function will constrain Θ such that the first term (the summation of the cost of each example) equals 0. We impose the following constraints on Θ:

$\theta^T x^{(i)} \geq 1$ if $y^{(i)} = 1$, and $\theta^T x^{(i)} \leq -1$ if $y^{(i)} = 0$.

If C is very large, we must choose Θ parameters such that:

$$\sum_{i=1}^m \left[ y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)})\,\text{cost}_0(\theta^T x^{(i)}) \right] = 0$$

This reduces our cost function to:

$$J(\theta) = C \cdot 0 + \frac{1}{2} \sum_{j=1}^n \theta_j^2 = \frac{1}{2} \sum_{j=1}^n \theta_j^2$$
Recall the decision boundary from logistic regression (the line separating the positive and negative examples). In SVMs, the decision boundary has the special property that it is as far away as possible from both the positive and the negative examples.
The distance of the decision boundary to the nearest example is called the margin. Since SVMs maximize this margin, the SVM is often called a Large Margin Classifier.
The SVM will separate the negative and positive examples by a large margin.
This large margin is only achieved when C is very large.
Data is linearly separable when a straight line can separate the positive and negative examples.
If we have outlier examples that we don't want to affect the decision boundary, then we can reduce C.
Increasing and decreasing C is similar to respectively decreasing and increasing λ; reducing C (i.e., increasing λ) can simplify our decision boundary and make it less sensitive to outliers.
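A quick way to see this effect (a sketch assuming scikit-learn, which wraps libsvm; the toy blobs and the outlier are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated blobs plus one outlier labeled with class 0.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),
               rng.normal(2, 0.5, (20, 2)),
               [[0.3, 0.3]]])
y = np.array([0] * 20 + [1] * 20 + [0])

for C in (100000.0, 0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # With a very large C the boundary tends to bend toward the outlier;
    # with a small C the outlier is typically ignored and the margin stays wide.
    print(f"C={C}: outlier predicted as class {clf.predict([[0.3, 0.3]])[0]}")
```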
Say we have two vectors, u and v:

$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \qquad v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$$
The length of vector v is denoted $\|v\|$, and it describes the length of the line from the origin to the point $(v_1, v_2)$.
The length of vector v can be calculated with $\|v\| = \sqrt{v_1^2 + v_2^2}$.
The projection of vector v onto vector u is found by taking a right angle from u to the end of v, creating a right triangle.
Note that $p$ denotes the (signed) length of this projection, and that $u^T v = \|u\|\,\|v\|\cos\theta$ while $p = \|v\|\cos\theta$, where θ is the angle between u and v.
So the product $u^T v = p \cdot \|u\| = u_1 v_1 + u_2 v_2$.
In our example, since u and v have the same number of components, we also have $u^T v = v^T u$.
If the angle between the lines for v and u is greater than 90 degrees, then the projection p will be negative.
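A tiny numerical check of these facts (the vectors are hypothetical values chosen for illustration):

```python
import numpy as np

u = np.array([2.0, 2.0])
v = np.array([1.0, 3.0])

norm_v = np.sqrt(v[0] ** 2 + v[1] ** 2)   # ||v||, same as np.linalg.norm(v)
p = (u @ v) / np.linalg.norm(u)           # signed length of the projection of v onto u
print(u @ v)                              # u1*v1 + u2*v2 = 8.0
print(p * np.linalg.norm(u))              # p * ||u|| also equals 8.0
print(norm_v, np.linalg.norm(v))          # both sqrt(10)
```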
We can use the same rules to rewrite $\theta^T x^{(i)}$:

$$\theta^T x^{(i)} = p^{(i)} \cdot \|\theta\| = \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \cdots + \theta_n x_n^{(i)}$$

So we now have a new optimization objective by substituting $p^{(i)} \cdot \|\theta\|$ in for $\theta^T x^{(i)}$:
If y=1, we want $p^{(i)} \cdot \|\theta\| \geq 1$.
If y=0, we want $p^{(i)} \cdot \|\theta\| \leq -1$.
The reason this causes a "large margin" is because: the vector for Θ is perpendicular to the decision boundary. In order for our optimization objective (above) to hold true while $\frac{1}{2}\sum_j \theta_j^2$ (and hence $\|\theta\|$) stays small, we need the absolute value of our projections $p^{(i)}$ to be as large as possible, which is exactly a large margin.
If $\Theta_0 = 0$, then all our decision boundaries will pass through the origin; if $\Theta_0 \neq 0$, the support vector machine will still find a large margin for the decision boundary.
Kernels allow us to make complex, non-linear classifiers using Support Vector Machines.
Given x, we compute new features depending on its proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}, \ldots$
To do this, we find the "similarity" of x and some landmark $l^{(i)}$:

$$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$$
This "similarity" function is called a Gaussian Kernel. It is a specific example of a kernel.
The similarity function can also be written as follows:

$$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\sum_{j=1}^n (x_j - l_j^{(i)})^2}{2\sigma^2}\right)$$
There are a couple properties of the similarity function:
If $x \approx l^{(i)}$, then $f_i \approx \exp\left(-\frac{0}{2\sigma^2}\right) \approx 1$.
If x is far from $l^{(i)}$, then $f_i \approx \exp\left(-\frac{(\text{large number})^2}{2\sigma^2}\right) \approx 0$.
In other words, if x and the landmark are close, then the similarity will be close to 1, and if x and the landmark are far away from each other, the similarity will be close to 0.
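A minimal sketch of the Gaussian kernel as a function (the function name and values are illustrative):

```python
import numpy as np

# Gaussian (RBF) kernel: similarity between x and a landmark l.
# sigma controls how quickly the similarity decays with distance.
def gaussian_kernel(x, l, sigma=1.0):
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, np.array([1.0, 2.1])))    # near 1: x is close to the landmark
print(gaussian_kernel(x, np.array([10.0, -5.0])))  # near 0: x is far from the landmark
```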
Each landmark gives us a feature for our hypothesis; with three landmarks, for example, we predict y = 1 when $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \geq 0$.
One way to get the landmarks is to put them in the exact same locations as all the training examples. This gives us m landmarks, with one landmark per training example.
Given example x:

$$f_1 = \text{similarity}(x, l^{(1)}), \quad f_2 = \text{similarity}(x, l^{(2)}), \quad \ldots, \quad f_m = \text{similarity}(x, l^{(m)})$$

This gives us a "feature vector," $f = [f_0, f_1, \ldots, f_m]^T$, where $f_0 = 1$ to correspond with $\Theta_0$.
Now to get the parameters Θ we can use the SVM minimization algorithm, but with $f^{(i)}$ substituted in for $x^{(i)}$ (so the number of features n equals m):

$$\min_\Theta\ C \sum_{i=1}^m \left[ y^{(i)}\,\text{cost}_1(\Theta^T f^{(i)}) + (1 - y^{(i)})\,\text{cost}_0(\Theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^n \theta_j^2$$
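Sketching how the kernel features are built when the landmarks are the training examples themselves (NumPy; the function name and toy data are illustrative):

```python
import numpy as np

# F[i, j] = similarity(x^(i), l^(j)) with l^(j) = x^(j), so F is m x m.
def kernel_features(X, sigma=1.0):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])  # toy training set, m = 3
F = kernel_features(X)
print(F.shape)   # (3, 3); the diagonal is all 1s (each example vs. itself)
```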
Using kernels to generate the features $f^{(i)}$ is not exclusive to SVMs and may also be applied to logistic regression. However, because of computational optimizations specific to SVMs, kernels combined with SVMs run much faster than with other algorithms, so kernels are almost always found combined with SVMs.
Choosing C (recall that $C = \frac{1}{\lambda}$): if C is large, we get lower bias and higher variance; if C is small, we get higher bias and lower variance.
The other parameter we must choose is $\sigma^2$, from the Gaussian kernel.
With a large $\sigma^2$, the features $f_i$ vary more smoothly, giving higher bias and lower variance.
With a small $\sigma^2$, the features $f_i$ vary less smoothly, giving lower bias and higher variance.
Using An SVM
There are lots of good SVM libraries already written. A. Ng often uses 'liblinear' and 'libsvm'. In practical application, you should use one of these libraries rather than rewrite the functions.
In practical application, the choices you do need to make are the choice of parameter C and the choice of kernel (similarity function): no kernel (the "linear kernel") gives a standard linear classifier, while the Gaussian kernel (described above) also requires choosing $\sigma^2$.
The library may ask you to provide the kernel function.
Note: do perform feature scaling before using the Gaussian Kernel.
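For example (a sketch assuming scikit-learn), scaling can be bundled with the SVM in a pipeline so that no single large-scale feature dominates the kernel's distance computation; the toy data here is made up:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale to zero mean / unit variance, then apply the RBF (Gaussian) kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0], [4.0, 4000.0]])
y = np.array([0, 0, 1, 1])
model.fit(X, y)                       # without scaling, the second feature would dominate
print(model.predict([[2.5, 2500.0]]))
```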
Note: not all similarity functions are valid kernels. They must satisfy "Mercer's Theorem" which guarantees that the SVM package's optimizations run correctly and do not diverge.
You want to train C and the parameters for the kernel function using the training and cross-validation datasets.
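One common way to do this (a sketch assuming scikit-learn; note that scikit-learn parameterizes the Gaussian kernel with gamma, which plays the role of $\frac{1}{2\sigma^2}$, so a large $\sigma^2$ corresponds to a small gamma) is a grid search with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # toy data

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10, 100],     # C = 1/lambda
              "svc__gamma": [0.01, 0.1, 1]}    # gamma ~ 1/(2*sigma^2)
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)
```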
Many SVM libraries have multi-class classification built-in.
You can use the one-vs-all method just like we did for logistic regression, where we train K SVMs, one to distinguish each class from all the rest, and pick the class i with the largest $(\Theta^{(i)})^T x$.
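As a sketch (assuming scikit-learn), an explicit one-vs-all wrapper over binary SVMs looks like this; in practice many packages also handle multi-class directly:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes, so 3 binary SVMs are trained
ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)
print(ovr.predict(X[:5]))           # each prediction picks the class with the largest score
```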
If n (the number of features) is large relative to m (the number of training examples), then use logistic regression, or SVM without a kernel (the "linear kernel").
If n is small and m is intermediate, then use SVM with a Gaussian Kernel
If n is small and m is large, then manually create/add more features, then use logistic regression or SVM without a kernel.
In the first case, we don't have enough examples to need a complicated polynomial hypothesis. In the second case, we have enough examples that we may need a complex non-linear hypothesis. In the last case, we want to increase our features so that logistic regression (or a linear-kernel SVM) becomes applicable, since a Gaussian-kernel SVM would be slow to train with so many examples.
Note: a neural network is likely to work well for any of these situations, but may be slower to train.