Linear regression with multiple variables is also known as "multivariate linear regression".
We now introduce notation for equations where we can have any number of input variables:

- xⱼ⁽ⁱ⁾ = value of feature j in the i-th training example
- x⁽ⁱ⁾ = the input (features) of the i-th training example
- m = the number of training examples
- n = the number of features
Now define the multivariable form of the hypothesis function as follows, accommodating these multiple features:

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ
In order to develop intuition about this function, we can think of θ₀ as the base price of a house, θ₁ as the price per square meter, θ₂ as the price per floor, etc., where x₁ is the number of square meters in the house, x₂ the number of floors, and so on.
Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

hθ(x) = [θ₀ θ₁ … θₙ] · [x₀ ; x₁ ; … ; xₙ] = θᵀx
This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.
Remark: Note that, for convenience in this course, Prof. Ng assumes x₀⁽ⁱ⁾ = 1 for all i ∈ 1, …, m. This is so that we can do matrix operations with θ and x: it makes the two vectors θ and x⁽ⁱ⁾ match each other element-wise (that is, both have n + 1 elements).
The training examples are stored in X row-wise, like such (with x₀⁽ⁱ⁾ = 1 for every example):

X = [ x₀⁽¹⁾ x₁⁽¹⁾ ; x₀⁽²⁾ x₁⁽²⁾ ; x₀⁽³⁾ x₁⁽³⁾ ],   θ = [ θ₀ ; θ₁ ]
You can calculate the hypothesis as a column vector of size (m × 1) with:

hθ(X) = Xθ
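For instance, here is a minimal Octave sketch of this vectorized hypothesis (the data values and variable names are illustrative, not from the course):

```octave
% Three training examples, one feature, with the x0 = 1 column prepended
X = [1 2104;
     1 1416;
     1 1534];       % m x (n+1) design matrix
theta = [50; 0.1];  % (n+1) x 1 parameter vector (illustrative values)

h = X * theta;      % m x 1 column vector of predictions
```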
For the rest of these notes, and other lecture notes, X will represent a matrix of training examples x⁽ⁱ⁾ stored row-wise.
For the parameter vector θ (of type ℝⁿ⁺¹ or, in matrix form, (n+1) × 1), the cost function is:

J(θ) = 1/(2m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

The vectorized version is:

J(θ) = 1/(2m) · (Xθ − y)ᵀ(Xθ − y)

where y denotes the vector of all y values.
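As a sketch in Octave (the function name computeCost is our own; it assumes X already contains the x₀ = 1 column):

```octave
function J = computeCost(X, y, theta)
% Vectorized cost for linear regression.
% X: m x (n+1) design matrix, y: m x 1 targets, theta: (n+1) x 1 parameters.
m = length(y);           % number of training examples
errors = X * theta - y;  % m x 1 vector of prediction errors
J = (1 / (2 * m)) * (errors' * errors);
end
```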
The gradient descent equation itself is generally the same form; we just have to repeat it for our n features:

repeat until convergence: {
  θⱼ := θⱼ − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾      for j := 0, …, n
}
In other words:

repeat until convergence: {
  θ₀ := θ₀ − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₀⁽ⁱ⁾
  θ₁ := θ₁ − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₁⁽ⁱ⁾
  θ₂ := θ₂ − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₂⁽ⁱ⁾
  ⋯
}
The Gradient Descent rule can be expressed as:

θ := θ − α∇J(θ)

where ∇J(θ) is a column vector of the form:

∇J(θ) = [ ∂J(θ)/∂θ₀ ; ∂J(θ)/∂θ₁ ; ⋯ ; ∂J(θ)/∂θₙ ]

The j-th component of the gradient is the summation of the product of two terms:

∂J(θ)/∂θⱼ = (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾

Sometimes, the summation of the product of two terms can be expressed as the product of two vectors. Here, xⱼ⁽ⁱ⁾, for i = 1, …, m, represents the m elements of the j-th column, Xⱼ, of the training set X. The other term, (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾), is the vector of deviations between the predictions hθ(x⁽ⁱ⁾) and the true values y⁽ⁱ⁾. Re-writing ∂J(θ)/∂θⱼ, we therefore have:

∂J(θ)/∂θⱼ = (1/m) · Xⱼᵀ (Xθ − y)

Finally, the matrix notation (vectorized) of the Gradient Descent rule is:

θ := θ − (α/m) · Xᵀ (Xθ − y)
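A minimal Octave sketch of this vectorized update rule (the function name gradientDescent and the choice to pass num_iters are our own; X is assumed to include the x₀ = 1 column):

```octave
function theta = gradientDescent(X, y, theta, alpha, num_iters)
% Run num_iters steps of the vectorized gradient descent update.
m = length(y);
for iter = 1:num_iters
    % theta := theta - (alpha/m) * X' * (X*theta - y)
    theta = theta - (alpha / m) * (X' * (X * theta - y));
end
end
```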
We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:
−1 ≤ xᵢ ≤ 1

or

−0.5 ≤ xᵢ ≤ 0.5
These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.
Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:
xᵢ := (xᵢ − μᵢ) / sᵢ

where μᵢ is the average of all the values for feature (i) and sᵢ is either the range of values (max − min) or the standard deviation.
Note that dividing by the range and dividing by the standard deviation give different results. The quizzes in this course use the range; the programming exercises use the standard deviation.
Example: if xᵢ represents housing prices with a range of 100 to 2000 and a mean value of 1000, then xᵢ := (price − 1000) / 1900.
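A short Octave sketch of mean normalization combined with scaling by the standard deviation, in the spirit of the programming exercises (the function name featureNormalize is illustrative; X here holds only the features, without the column of ones):

```octave
function [X_norm, mu, sigma] = featureNormalize(X)
% Subtract the mean of each feature and divide by its standard deviation.
mu = mean(X);                % 1 x n row vector of feature means
sigma = std(X);              % 1 x n row vector of standard deviations
X_norm = (X - mu) ./ sigma;  % implicit broadcasting expands mu and sigma across rows
end
```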
Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now plot the cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.
Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 10⁻³. However in practice it's difficult to choose this threshold value.
It has been proven that if the learning rate α is sufficiently small, then J(θ) will decrease on every iteration. Andrew Ng recommends adjusting α by factors of about 3 (e.g. …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …).
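A brief Octave sketch of the debugging plot described above (assuming X, y, theta, alpha, and m are already defined, and computeCost is the sketch from earlier):

```octave
% Record the cost at every iteration so convergence can be inspected.
num_iters = 400;
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta - (alpha / m) * (X' * (X * theta - y));
    J_history(iter) = computeCost(X, y, theta);
end

plot(1:num_iters, J_history);
xlabel('number of iterations');
ylabel('J(\theta)');  % J should decrease on every iteration if alpha is small enough
```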
We can improve our features and the form of our hypothesis function in a couple different ways.
We can combine multiple features into one. For example, we can combine x₁ and x₂ into a new feature x₃ by taking x₁ · x₂.
Our hypothesis function need not be linear (a straight line) if that does not fit the data well.
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
For example, if our hypothesis function is hθ(x) = θ₀ + θ₁x₁, then we can create additional features based on x₁ to get the quadratic function

hθ(x) = θ₀ + θ₁x₁ + θ₂x₁²

or the cubic function

hθ(x) = θ₀ + θ₁x₁ + θ₂x₁² + θ₃x₁³

In the cubic version, we have created new features x₂ and x₃, where x₂ = x₁² and x₃ = x₁³.

To make it a square root function, we could do:

hθ(x) = θ₀ + θ₁x₁ + θ₂√x₁
Note that at 2:52 and through 6:22 in the "Features and Polynomial Regression" video, the curve that Prof Ng says "doesn't ever come back down" is the hypothesis function that uses the sqrt() function (shown by the solid purple line), not the quadratic one that uses size² (which does come back down).
One important thing to keep in mind is that if you choose your features this way, then feature scaling becomes very important. E.g. if x₁ has range 1–1000, then the range of x₁² becomes 1–1,000,000 and that of x₁³ becomes 1–1,000,000,000. A sketch of this in Octave follows.
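A small Octave sketch of building polynomial features and then normalizing them (illustrative values; featureNormalize is the sketch from the feature-scaling section):

```octave
x1 = [1; 2; 3; 1000];         % a single raw feature (illustrative values)
X_poly = [x1, x1.^2, x1.^3];  % columns: x1, x1^2, x1^3 - wildly different ranges
[X_poly, mu, sigma] = featureNormalize(X_poly);  % scaling is essential here
X = [ones(length(x1), 1), X_poly];  % prepend the x0 = 1 column
```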
The "Normal Equation" is a method of finding the optimum theta without iteration.
There is no need to do feature scaling with the normal equation.
Mathematical proof of the Normal equation requires knowledge of linear algebra and is fairly involved, so you do not need to worry about the details.
Proofs are available at these links for those who are interested:
https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)
http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression
The following is a comparison of gradient descent and the normal equation:
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| O(kn²) | O(n³), need to calculate inverse of XᵀX |
| Works well when n is large | Slow if n is very large |
With the normal equation, computing the inversion has complexity O(n³). So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds around 10,000 it might be a good time to go from the normal equation to an iterative process.
When implementing the normal equation in Octave we want to use the 'pinv' function rather than 'inv'; 'pinv' will give a value of θ even if XᵀX is noninvertible. XᵀX may be noninvertible when there are redundant features (i.e. some features are linearly dependent) or when there are too many features (e.g. m ≤ n). Solutions to these problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many.
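In Octave the normal equation is a one-liner (assuming X already includes the x₀ = 1 column):

```octave
% Closed-form solution; pinv handles a noninvertible X'*X gracefully.
theta = pinv(X' * X) * X' * y;
```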
```octave
%% Change Octave prompt
PS1('>> ');

%% Change working directory in windows example:
cd 'c:/path/to/desired/directory name'
%% Note that it uses normal slashes and does not use escape characters for the empty spaces.

%% elementary operations
5+6
3-2
5*8
1/2
2^6
1 == 2  % false
1 ~= 2  % true. note, not "!="
1 && 0
1 || 0
xor(1,0)

%% variable assignment
a = 3;  % semicolon suppresses output
b = 'hi';
c = 3>=1;

% Displaying them:
a = pi
disp(a)
disp(sprintf('2 decimals: %0.2f', a))
disp(sprintf('6 decimals: %0.6f', a))
format long
a
format short
a

%% vectors and matrices
A = [1 2; 3 4; 5 6]
v = [1 2 3]
v = [1; 2; 3]
v = 1:0.1:2      % from 1 to 2, with stepsize of 0.1. Useful for plot axes
v = 1:6          % from 1 to 6, assumes stepsize of 1 (row vector)
C = 2*ones(2,3)  % same as C = [2 2 2; 2 2 2]
w = ones(1,3)    % 1x3 vector of ones
w = zeros(1,3)
w = rand(1,3)    % drawn from a uniform distribution
w = randn(1,3)   % drawn from a normal distribution (mean=0, var=1)
w = -6 + sqrt(10)*(randn(1,10000)); % (mean = -6, var = 10) - note: add the semicolon
hist(w)     % plot histogram using 10 bins (default)
hist(w,50)  % plot histogram using 50 bins
% note: if hist() crashes, try "graphics_toolkit('gnuplot')"

I = eye(4)  % 4x4 identity matrix

% help function
help eye
help rand
help help
```
Data files used in this section: featuresX.dat, priceY.dat
```octave
%% dimensions
sz = size(A)  % 1x2 matrix: [(number of rows) (number of columns)]
size(A,1)     % number of rows
size(A,2)     % number of cols
length(v)     % size of longest dimension

%% loading data
pwd  % show current directory (current path)
cd 'C:\Users\ang\Octave files'  % change directory
ls   % list files in current directory
load q1y.dat  % alternatively, load('q1y.dat')
load q1x.dat
who   % list variables in workspace
whos  % list variables in workspace (detailed view)
clear q1y  % clear command without any args clears all vars
v = q1x(1:10);  % first 10 elements of q1x (counts down the columns)
save hello.mat v;         % save variable v into file hello.mat
save hello.txt v -ascii;  % save as ascii
% fopen, fread, fprintf, fscanf also work [[not needed in class]]

%% indexing
A(3,2)  % indexing is (row,col)
A(2,:)  % get the 2nd row
        % ":" means every element along that dimension
A(:,2)  % get the 2nd col
A([1 3],:)  % print all the elements of rows 1 and 3
A(:,2) = [10; 11; 12]      % change second column
A = [A, [100; 101; 102]];  % append column vec
A(:)    % select all elements as a column vector

% Putting data together
A = [1 2; 3 4; 5 6]
B = [11 12; 13 14; 15 16]  % same dims as A
C = [A B]   % concatenating A and B matrices side by side
C = [A, B]  % concatenating A and B matrices side by side
C = [A; B]  % concatenating A and B top and bottom
```
```octave
%% initialize variables
A = [1 2; 3 4; 5 6]
B = [11 12; 13 14; 15 16]
C = [1 1; 2 2]
v = [1; 2; 3]

%% matrix operations
A * C   % matrix multiplication
A .* B  % element-wise multiplication
% A .* C or A * B gives error - wrong dimensions
A .^ 2  % element-wise square of each element in A
1./v    % element-wise reciprocal
log(v)  % functions like this operate element-wise on vecs or matrices
exp(v)
abs(v)
-v      % -1*v
v + ones(length(v), 1)
% v + 1  % same
A'      % matrix transpose

%% misc useful functions
% max (or min)
a = [1 15 2 0.5]
val = max(a)
[val,ind] = max(a)  % val is the maximum element of a; ind is the index where it occurs
val = max(A)        % if A is a matrix, returns the max from each column

% compare values in a matrix & find
a < 3         % checks which values in a are less than 3
find(a < 3)   % gives location of elements less than 3
A = magic(3)  % generates a magic matrix - not much used in ML algorithms
[r,c] = find(A>=7)  % row, column indices for values matching comparison

% sum, prod
sum(a)
prod(a)
floor(a)  % or ceil(a)
max(rand(3),rand(3))
max(A,[],1)  % maximum along columns (the default: same as max(A))
max(A,[],2)  % maximum along rows
A = magic(9)
sum(A,1)
sum(A,2)
sum(sum( A .* eye(9) ))          % sum of main-diagonal elements
sum(sum( A .* flipud(eye(9)) ))  % sum of anti-diagonal elements

% Matrix inverse (pseudo-inverse)
pinv(A)  % inv(A'*A)*A'
```
```octave
%% plotting
t = [0:0.01:0.98];
y1 = sin(2*pi*4*t);
plot(t,y1);

y2 = cos(2*pi*4*t);
hold on;  % "hold off" to turn off
plot(t,y2,'r');
xlabel('time');
ylabel('value');
legend('sin','cos');
title('my plot');
print -dpng 'myPlot.png'
close;  % or, "close all" to close all figs

figure(1); plot(t, y1);
figure(2); plot(t, y2);
figure(2), clf;  % can specify the figure number
subplot(1,2,1);  % Divide plot into 1x2 grid, access 1st element
plot(t,y1);
subplot(1,2,2);  % Divide plot into 1x2 grid, access 2nd element
plot(t,y2);
axis([0.5 1 -1 1]);  % change axis scale

%% display a matrix (or image)
figure;
imagesc(magic(15)), colorbar, colormap gray;  % comma-chaining function calls
a=1,b=2,c=3
a=1;b=2;c=3;
```
```octave
v = zeros(10,1);
for i=1:10,
    v(i) = 2^i;
end;
% Can also use "break" and "continue" inside for and while loops to control execution.

i = 1;
while i <= 5,
    v(i) = 100;
    i = i+1;
end

i = 1;
while true,
    v(i) = 999;
    i = i+1;
    if i == 6,
        break;
    end;
end

if v(1)==1,
    disp('The value is one!');
elseif v(1)==2,
    disp('The value is two!');
else
    disp('The value is not one or two!');
end
```
To create a function, type the function code in a text editor (e.g. gedit or notepad), and save the file as "functionName.m"
Example function:
```octave
function y = squareThisNumber(x)
y = x^2;
```
To call the function in Octave, do either:
1) Navigate to the directory of the functionName.m file and call the function:
```octave
% Navigate to directory:
cd /path/to/function

% Call the function:
functionName(args)
```
2) Add the directory of the function to the load path and save it. (Note: you should not use addpath/savepath for any of the assignments in this course; instead use 'cd' to change the current working directory. Watch the video on submitting assignments in week 2 for instructions.)
```octave
% To add the path for the current session of Octave:
addpath('/path/to/function/')

% To remember the path for future sessions of Octave, after executing
% addpath above, also do:
savepath
```
Octave's functions can return more than one value:
```octave
function [y1, y2] = squareandCubeThisNo(x)
y1 = x^2
y2 = x^3
```
Call the above function this way:
```octave
[a,b] = squareandCubeThisNo(x)
```
Vectorization is the process of taking code that relies on loops and converting it into matrix operations. It is more efficient, more elegant, and more concise.
As an example, let's compute our prediction from a hypothesis. θ is the vector of parameters for the hypothesis and x is a vector of input features.
With loops:
```octave
prediction = 0.0;
for j = 1:n+1,
    prediction += theta(j) * x(j);
end;
```
With vectorization:
```octave
prediction = theta' * x;
```
If you recall the definition of vector multiplication, you'll see that this one operation performs both the element-wise multiplication and the overall sum in very concise notation.
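A self-contained Octave sketch contrasting the two approaches (the values are illustrative):

```octave
n = 2;
theta = [1; 2; 3];  % (n+1) x 1 parameter vector
x = [1; 4; 5];      % (n+1) x 1 feature vector, with x(1) = x0 = 1

% Loop version
prediction = 0.0;
for j = 1:n+1,
    prediction += theta(j) * x(j);
end;

% Vectorized version - same result in one matrix operation
prediction_vec = theta' * x;
disp(prediction - prediction_vec)  % prints 0
```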
Basic operations

0:00 Introduction
3:15 Elementary and Logical operations
5:12 Variables
7:38 Matrices
8:30 Vectors
11:53 Histograms
12:44 Identity matrices
13:14 Help command
Moving data around

0:24 The size command
1:39 The length command
2:18 File system commands
2:25 File handling
4:50 Who, whos, and clear
6:50 Saving data
8:35 Manipulating data
12:10 Unrolling a matrix
12:35 Examples
14:50 Summary
Computing on data

0:00 Matrix operations
0:57 Element-wise operations
4:28 Min and max
5:10 Element-wise comparisons
5:43 The find command
6:00 Various commands and operations
Plotting data
0:00 Introduction
0:54 Basic plotting
2:04 Superimposing plots and colors
3:15 Saving a plot to an image
4:19 Clearing a plot and multiple figures
4:59 Subplots
6:15 The axis command
6:39 Color square plots
8:35 Wrapping up
Control statements

0:10 For loops
1:33 While loops
3:35 If statements
4:54 Functions
6:15 Search paths
7:40 Multiple return values
8:59 Cost function example (machine learning)
12:24 Summary
Vectorization
0:00 Why vectorize?
1:30 Example
4:22 C++ example
5:40 Vectorization applied to gradient descent
9:45 Python