You can troubleshoot errors in your predictions by:
- Getting more training examples
- Trying smaller sets of features
- Trying additional features
- Trying polynomial features
- Increasing or decreasing λ
Don't just pick one of these avenues at random. We'll explore diagnostic techniques for choosing one of the above solutions in the following sections.
A hypothesis may have low error for the training examples but still be inaccurate (because of overfitting).
Given a dataset of training examples, we can split the data into two sets: a training set and a test set.
The new procedure using these two sets is then:
1. Learn Θ and minimize J_train(Θ) using the training set.
2. Compute the test set error J_test(Θ).

For linear regression, the test set error is the average squared error over the test examples:

$$J_{test}(\Theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test}\right)^2$$
For classification, we can instead use the 0/1 misclassification error:

$$err(h_\Theta(x), y) = \begin{cases} 1 & \text{if } h_\Theta(x) \geq 0.5 \text{ and } y = 0, \text{ or } h_\Theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases}$$

This gives us a binary 0 or 1 error result based on a misclassification. The average test error for the test set is

$$\text{Test Error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})$$

This gives us the proportion of the test data that was misclassified.
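As a concrete illustration, here is a minimal Python sketch of these two error measures (the function names, and the assumption that h(x) returns a prediction, or a probability for classification, are mine rather than from the notes):

```python
import numpy as np

def regression_test_error(h, X_test, y_test):
    """Average squared error of the hypothesis h on the test set (J_test)."""
    predictions = np.array([h(x) for x in X_test])
    m_test = len(y_test)
    return np.sum((predictions - y_test) ** 2) / (2 * m_test)

def classification_test_error(h, X_test, y_test):
    """Proportion of test examples misclassified, thresholding h(x) at 0.5."""
    predictions = np.array([1 if h(x) >= 0.5 else 0 for x in X_test])
    return np.mean(predictions != y_test)
```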
In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.
Without the Validation Set (note: this is a bad method - do not use it)
1. Optimize the parameters in Θ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the test set.
3. Estimate the generalization error, also using the test set.
In this case, we have trained one variable, d (the degree of the polynomial), using the test set. The test-set error is therefore an optimistic estimate, and the error on any other set of data is likely to be greater.
Use of the CV set
To solve this, we can introduce a third set, the Cross Validation Set, to serve as an intermediate set that we can train d with. Then our test set will give us an accurate, non-optimistic error.
One example way to break down our dataset into the three sets is:
- Training set: 60%
- Cross validation set: 20%
- Test set: 20%
We can now calculate three separate error values for the three different sets.
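A minimal sketch of the 60/20/20 split described above, assuming X and y are NumPy arrays (the shuffling and the function name are my own choices, not part of the notes):

```python
import numpy as np

def split_60_20_20(X, y, seed=0):
    """Shuffle the data and split it into 60% train, 20% cross validation, 20% test."""
    m = len(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_train, n_cv = int(0.6 * m), int(0.8 * m)
    train, cv, test = idx[:n_train], idx[n_train:n_cv], idx[n_cv:]
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])
```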
With the Validation Set (note: this method presumes we do not also use the CV set for regularization)
1. Optimize the parameters in Θ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross validation set.
3. Estimate the generalization error using the test set.
This way, the degree of the polynomial d has not been trained using the test set.
(Mentor note: be aware that using the CV set to select 'd' means that we cannot also use it for the validation curve process of setting the lambda value).
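The "with the validation set" procedure might look roughly like this for a one-dimensional regression problem, using NumPy's polynomial fitting as a stand-in for minimizing the training cost (the helper names are assumptions):

```python
import numpy as np

def choose_degree(x_train, y_train, x_cv, y_cv, x_test, y_test, max_degree=10):
    """Fit one polynomial per degree on the training set, pick d by CV error,
    then report the generalization error on the untouched test set."""
    def squared_error(coeffs, x, y):
        return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2

    best_d, best_cv_err, best_coeffs = None, np.inf, None
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x_train, y_train, deg=d)   # minimize training error
        cv_err = squared_error(coeffs, x_cv, y_cv)     # evaluate on the CV set
        if cv_err < best_cv_err:
            best_d, best_cv_err, best_coeffs = d, cv_err, coeffs
    test_err = squared_error(best_coeffs, x_test, y_test)  # unbiased estimate
    return best_d, test_err
```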
In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.
The training error will tend to decrease as we increase the degree d of the polynomial.
At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.
High bias (underfitting): both J_train(Θ) and J_CV(Θ) will be high, with J_CV(Θ) ≈ J_train(Θ).
High variance (overfitting): J_train(Θ) will be low and J_CV(Θ) will be much greater than J_train(Θ).
This is represented in the figure below:
Instead of looking at the degree d contributing to bias/variance, now we will look at the regularization parameter λ.
A large lambda heavily penalizes all of the Θ parameters, which greatly simplifies the resulting function and so causes underfitting.
The relationship of λ to the training error and the cross validation error is as follows:
- Low λ: J_train(Θ) is low and J_CV(Θ) is high (high variance/overfitting).
- Intermediate λ: J_train(Θ) and J_CV(Θ) are somewhat low and J_train(Θ) ≈ J_CV(Θ).
- Large λ: both J_train(Θ) and J_CV(Θ) will be high (underfitting/high bias).
The figure below illustrates the relationship between lambda and the hypothesis:
In order to choose the model and the regularization λ, we need to:
1. Create a list of lambdas (e.g. λ ∈ {0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24}).
2. Create a set of models with different degrees or any other variants.
3. Iterate through the λs, and for each λ go through all the models to learn some Θ.
4. Compute the cross validation error using the learned Θ (computed with λ) on J_CV(Θ) without regularization (i.e. with λ = 0).
5. Select the combination that produces the lowest error on the cross validation set.
6. Using the best combination of Θ and λ, apply it to J_test(Θ) to see whether it generalizes well.
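A rough sketch of these steps for regularized linear regression, solved with the regularized normal equation; the lambda grid mirrors the example list above, and the code assumes the first column of X is the bias column of ones:

```python
import numpy as np

LAMBDAS = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]

def train_regularized(X, y, lam):
    """Regularized normal equation: theta = (X'X + lam*L)^-1 X'y."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0                      # do not regularize the bias term theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def cv_error(theta, X, y):
    """Cross validation error is computed WITHOUT the regularization term."""
    m = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * m)

def choose_lambda(X_train, y_train, X_cv, y_cv):
    """Learn one theta per lambda, then keep the lambda with the lowest CV error."""
    thetas = [train_regularized(X_train, y_train, lam) for lam in LAMBDAS]
    errors = [cv_error(theta, X_cv, y_cv) for theta in thetas]
    best = int(np.argmin(errors))
    return LAMBDAS[best], thetas[best]
```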
Training on just three examples will easily give 0 error, because we can always find a quadratic curve that passes exactly through three points. As the training set gets larger, the error for a quadratic function increases, and the error value will plateau out after a certain training set size m.
With high bias:
- Low training set size: causes J_train(Θ) to be low and J_CV(Θ) to be high.
- Large training set size: causes both J_train(Θ) and J_CV(Θ) to be high, with J_train(Θ) ≈ J_CV(Θ).
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
For high variance, we have the following relationships in terms of the training set size:
With high variance:
- Low training set size: J_train(Θ) will be low and J_CV(Θ) will be high.
- Large training set size: J_train(Θ) increases with training set size and J_CV(Θ) continues to decrease without leveling off; also, J_train(Θ) < J_CV(Θ), but the difference between them remains significant.
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
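A sketch that produces the numbers behind such a learning curve (plotting omitted; unregularized least squares is used here purely for brevity, and the function names are my own):

```python
import numpy as np

def learning_curve(X_train, y_train, X_cv, y_cv):
    """For each training set size m, fit on the first m examples only, then record
    the training error on those m examples and the CV error on the full CV set."""
    def fit(X, y):                      # plain least squares (normal equation)
        return np.linalg.lstsq(X, y, rcond=None)[0]

    def error(theta, X, y):
        return np.sum((X @ theta - y) ** 2) / (2 * len(y))

    sizes, train_errors, cv_errors = [], [], []
    for m in range(1, len(y_train) + 1):
        theta = fit(X_train[:m], y_train[:m])
        sizes.append(m)
        train_errors.append(error(theta, X_train[:m], y_train[:m]))
        cv_errors.append(error(theta, X_cv, y_cv))
    return sizes, train_errors, cv_errors
```

If both error curves converge to a high value, that points to high bias; a persistent gap between a low training error and a high cross validation error points to high variance.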
Our decision process can be broken down as follows:
- Getting more training examples: fixes high variance
- Trying smaller sets of features: fixes high variance
- Adding features: fixes high bias
- Adding polynomial features: fixes high bias
- Decreasing λ: fixes high bias
- Increasing λ: fixes high variance
Using a single hidden layer is a good starting default. You can train your neural network with different numbers of hidden layers using your cross validation set and then select the architecture that performs best, as in the sketch below.
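For example, a sketch of that selection using scikit-learn's MLPClassifier (the library, the candidate depths, and the fixed layer width of 25 units are my own assumptions, not something the course prescribes):

```python
from sklearn.neural_network import MLPClassifier

def choose_hidden_layers(X_train, y_train, X_cv, y_cv, max_layers=3, units=25):
    """Train networks with 1..max_layers hidden layers (of `units` units each) and
    keep the depth with the best accuracy on the cross validation set."""
    best_depth, best_score = 1, -1.0
    for depth in range(1, max_layers + 1):
        net = MLPClassifier(hidden_layer_sizes=(units,) * depth,
                            max_iter=1000, random_state=0)
        net.fit(X_train, y_train)
        score = net.score(X_cv, y_cv)   # evaluated on the CV set, not the test set
        if score > best_score:
            best_depth, best_score = depth, score
    return best_depth
```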
Choosing M, the order of the polynomial.
How can we tell which parameters Θ to leave in the model (known as "model selection")?
There are several ways to solve this problem:
- Get more data (very difficult).
- Choose the model which best fits the data without overfitting (very difficult).
- Reduce the opportunity for overfitting through regularization.
Bias: approximation error (Difference between expected value and optimal value)
Variance: estimation error due to finite data
Intuition for the bias-variance trade-off:
- Complex model => sensitive to data => much affected by changes in X => high variance, low bias.
- Simple model => more rigid => does not change as much with changes in X => low variance, high bias.
One of the most important goals in learning: finding a model that is just right in the bias-variance trade-off.
Regularization Effects:
- Small values of λ allow the model to become finely tuned to noise, leading to large variance => overfitting.
- Large values of λ pull the weight parameters toward zero, leading to large bias => underfitting.
Model Complexity Effects:
- Lower-order polynomials (low model complexity) have high bias and low variance; the model fits poorly but consistently.
- Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly; they have low bias on the training data but very high variance.
- In reality, we would want to choose a model somewhere in between, one that generalizes well but also fits the data reasonably well.
A typical rule of thumb when running diagnostics is:
- More training examples fix high variance but not high bias.
- Fewer features fix high variance but not high bias.
- Additional features fix high bias but not high variance.
- Adding polynomial and interaction features fixes high bias but not high variance.
- When using gradient descent, decreasing λ can fix high bias and increasing λ can fix high variance.
- When using neural networks, small networks are more prone to underfitting and large networks are prone to overfitting; cross-validation of network size is a way to choose among alternatives.
Different ways we can approach a machine learning problem include:
- Collecting lots of data (for example, a "honeypot" project, though this doesn't always work).
- Developing sophisticated features (for example, using email header data for spam classification).
- Developing algorithms to process your input in different ways (for example, recognizing misspellings in spam).
It is difficult to tell which of the options will be helpful.
The recommended approach to solving machine learning problems is to:
1. Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
2. Plot learning curves to decide whether more data, more features, etc. are likely to help.
3. Manually examine the errors on examples in the cross validation set (error analysis) and try to spot a trend in where most of the errors were made.
It's important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance.
You may need to process your input before it is useful. For example, if your input is a set of words, you may want to treat different forms of the same word (fail/failing/failed) as one word, so you would use "stemming software" to recognize them all as one.
It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm.
This usually happens with skewed classes; that is, when one class is very rare in the entire data set. Or, to say it another way, when we have a lot more examples from one class than from the other.
For this we can use Precision/Recall.
Precision: of all patients for whom we predicted y = 1, what fraction actually has cancer?

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Recall: of all the patients that actually have cancer, what fraction did we correctly detect as having cancer?

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
These two metrics give us a better sense of how our classifier is doing. We want both precision and recall to be high.
In the example at the beginning of the section, if we classify all patients as 0, then our recall will be 0 / (0 + number of false negatives) = 0, so despite having a lower error percentage, we can quickly see that such a classifier has worse recall.
Note 1: if an algorithm predicts only negatives, as it does in one of the exercises, the precision is undefined because it would require dividing by 0; the F1 score is then undefined as well.
Note 2: manually calculating precision and the other metrics is an error-prone process, but it is very easy to build an Excel sheet for it. Put a 2×2 table of all the necessary input values into it, label the cells "TruePositives", "FalsePositives", and so on, and in another cell add a formula like =SUM(TruePositive, FalsePositive, TrueNegative, FalseNegative), labeled "AllExamples". Then in a cell labeled "Accuracy" add the formula =SUM(TruePositive,TrueNegative)/AllExamples, and do the same for the other metrics. After 10 minutes you will have a spreadsheet that covers all the examples and questions. [Snapshot: https://share.coursera.org/wiki/index.php/File:Spreadsheetquiz6.GIF ]
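If you prefer code to a spreadsheet, here is a rough Python equivalent of that 2×2 table (the function and argument names are mine):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 score from the four cells of the 2x2 table.
    Precision (and hence F1) is undefined when no positives are predicted."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For example, with made-up counts, classification_metrics(tp=20, fp=5, tn=70, fn=5) gives accuracy 0.9 and precision, recall, and F1 all equal to 0.8.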
We might want a confident prediction of two classes using logistic regression. One way is to increase our threshold: predict 1 if h_Θ(x) ≥ 0.7 and predict 0 if h_Θ(x) < 0.7.
This way, we only predict cancer if the patient has at least a 70% chance of having it.
Doing this, we will have higher precision but lower recall (refer to the definitions in the previous section).
In the opposite case, we can lower our threshold: for example, predict 1 if h_Θ(x) ≥ 0.3 and predict 0 if h_Θ(x) < 0.3.
That way, we are much less likely to miss patients who actually have cancer. This will cause higher recall but lower precision.
The greater the threshold, the greater the precision and the lower the recall.
The lower the threshold, the greater the recall and the lower the precision.
In order to turn these two metrics into one single number, we can take the F value.
One way is to take the average: (P + R) / 2.
This does not work well. If we predict y = 1 only when we are extremely confident, the high precision will bring the average up despite having recall near 0; if we predict all examples as y = 1, the very high recall will bring up the average despite having very low precision.
A better way is to compute the F Score (or F1 score):

$$F_1 = 2\,\frac{PR}{P + R}$$
In order for the F Score to be large, both precision and recall must be large.
We want to tune the threshold using precision and recall on the cross validation set, so as not to bias our test set.
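One way to automate that choice is to sweep a grid of thresholds on the cross validation set and keep the one with the highest F1 score; a sketch (the grid and the names are assumptions):

```python
import numpy as np

def choose_threshold(probs_cv, y_cv, thresholds=np.arange(0.05, 1.0, 0.05)):
    """Try a grid of thresholds and keep the one with the highest F1 score on the CV set."""
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        pred = (probs_cv >= t).astype(int)
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:            # precision/recall degenerate; skip this threshold
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```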
How much data should we train on?
In certain cases, an "inferior algorithm," if given enough data, can outperform a superior algorithm with less data.
We must choose our features to have enough information. A useful test is: Given input x, would a human expert be able to confidently predict y?
Rationale for large data: if we have a low bias algorithm (many features or hidden units making a very complex function), then the larger the training set we use, the less we will have overfitting (and the more accurate the algorithm will be on the test set).
When the quiz instructions tell you to enter a value to "two decimal digits", what it really means is "two significant digits". So, just for example, the value 0.0123 should be entered as "0.012", not "0.01".