CS170A -- HW#4 -- assignment and solution form -- Matlab¶

Your name: Jerry Liu

Your UID: 404474229

Please upload only this notebook to CCLE by the deadline.

Policy for late submission of solutions: We will use Paul Eggert's Late Policy: $N$ days late $\Leftrightarrow$ $2^N$ points deducted} The number of days late is $N=0$ for the first 24 hrs, $N=1$ for the next 24 hrs, etc., and if you submit an assignment $H$ hours late, $2^{\lfloor \,H\,/\,24\, \rfloor}$ points are deducted.

Problem 1: Income Tax and Benford's Law¶

For every presidential candidate (except Donald Trump), income tax returns have been filed prior to an election: http://www.taxhistory.org/www/website.nsf/Web/PresidentialTaxReturns?OpenDocument.

Some of these returns are impressive: Mitt Romney's 2011 tax return was 379 pages long!
Hillary Clinton's income was the highest of the 2016 candidates (except perhaps Trump); see her 2015 income tax return.
Bernie Sanders' reported income is barely enough for his family to survive in Los Angeles...

1.1 Hillary Clinton's 2014 and 2015 Income Tax returns¶

In this assignment, take the files HR_Clinton_2014_tax_return_numbers.txt and HR_Clinton_2015_tax_return_numbers.txt listing numbers in Hillary's tax returns for the last 2 years.

For each of these two files, determine (using the method in the Course Reader) whether the unique numbers in this file (please omit duplicates) violate Benford's Law.

% cat HR_Clinton_2014_tax_return_numbers.txt | sed 's/,//g' > HR_Clinton_2014_tax_return_numbers_nocomma.txt
H14 = unique(dlmread('HR_Clinton_2014_tax_return_numbers_nocomma.txt','\n'));
H15 = unique(dlmread('HR_Clinton_2015_tax_return_numbers.txt','\n'));

firstdigit = @(x) floor(x ./ (10 .^ floor(log10(x))));

number_of_bins = 9;
nu = number_of_bins - 1;

Hillary14Histogram = hist(firstdigit(H14), 1 : 9)
Hillary15Histogram = hist(firstdigit(H15), 1 : 9)

BenfordProbabilities = diff(log10(1:10));

% 2014

figure
hist(firstdigit(H14), 1 : 9)

N14 = length(H14);
Benford14Histogram = N14 * BenfordProbabilities;
hold on;
plot( 1 : 9, Benford14Histogram, 'r');
title('Histogram of firstst digits in Hillary 2014 tax return, with Benford expected values', 'FontSize', 6);
hold off;

ChiSquareStatistic14 = sum((Hillary14Histogram - Benford14Histogram) .^ 2 ./ Benford14Histogram)
ChiSquareProbability14 = cdf('Chisquare', ChiSquareStatistic14, nu)

% 2015

figure
hist(firstdigit(H15), 1 : 9)

N15 = length(H15);
Benford15Histogram = N15 * BenfordProbabilities;
hold on;
plot( 1 : 9, Benford15Histogram, 'r');
title('Histogram of firstst digits in Hillary 2015 tax return, with Benford expected values', 'FontSize', 6);
hold off;

ChiSquareStatistic15 = sum((Hillary15Histogram - Benford15Histogram) .^ 2 ./ Benford15Histogram)
ChiSquareProbability15 = cdf('Chisquare', ChiSquareStatistic15, nu)
fprintf(['2014 tax return from Hillary Clinton could violate the Benford Law\n', ...
         'as the test statistics is a relatively high value: %f.\n', ...
         'The 2015 tax return is more normal\n', ...
         'since the test statistics is %f, a relatively low value.'], ...
         ChiSquareProbability14, ChiSquareProbability15);

Hillary14Histogram =

    38    38    17    11     8     8     4    10     6


Hillary15Histogram =

    34    20    13    15     5     4     2     8     5


ChiSquareStatistic14 =

   12.3526


ChiSquareProbability14 =

    0.8639


ChiSquareStatistic15 =

    9.0871


ChiSquareProbability15 =

    0.6650

2014 tax return from Hillary Clinton could violate the Benford Law
as the test statistics is a relatively high value: 0.863850.
The 2015 tax return is more normal
since the test statistics is 0.665001, a relatively low value.

1.2 Donald Trump's 1995 Income Tax return¶

3 pages of Donald Trump's 1995 Income Tax form have been published by the New York Times.

The file DJ_Trump_1995_tax_return_numbers.txt lists numbers in this (partial) income tax return.

Determine whether the unique numbers in this file violate Benford's Law.

TrumpTaxReturn = unique(dlmread('DJ_Trump_1995_tax_return_numbers.txt','\n'));

firstdigit = @(x) floor(x ./ (10 .^ floor(log10(x))));

number_of_bins = 9;
nu = number_of_bins - 1;

TrumpHistogram = hist(firstdigit(TrumpTaxReturn), 1 : 9)

BenfordProbabilities = diff(log10(1:10));

hist(firstdigit(TrumpTaxReturn), 1 : 9)

N = length(TrumpTaxReturn);
BenfordHistogram = N * BenfordProbabilities;
hold on;
plot( 1 : 9, BenfordHistogram, 'r');
title('Histogram of firstst digits in Trump 1995 tax return, with Benford expected values', 'FontSize', 6);
hold off;

ChiSquareStatistic = sum((TrumpHistogram - BenfordHistogram) .^ 2 ./ BenfordHistogram)
ChiSquareProbability = cdf('Chisquare', ChiSquareStatistic, nu)

fprintf(['1995 tax return from Donald Trump violates the Benford Law\n', ...
         'as the test statistics is\n', ...
         'so high that it is almost impossible to exceed it: %f.\n'], ..., ...
         ChiSquareProbability);

TrumpHistogram =

     8     2     3     5     0     2     1     2     7


ChiSquareStatistic =

   28.7864


ChiSquareProbability =

    0.9997

1995 tax return from Donald Trump violates the Benford Law
as the test statistics is
so high that it is almost impossible to exceed it: 0.999654.

Problem 2: Earthquakes¶

For this assignment we want you to test whether earthquakes in a given region occur with uniform frequency throughout the year. Some geologists claim earthquakes are completely unpredictable, occurring randomly.

Specifically, we want you to compare the histograms of months in which the earthquakes compare with an expected uniform histogram.

To do this, given a dataset of 20,000 earthquakes, we want you to create a histogram with 12 bins giving the number of earthquakes that occurred within each of the 12 months of the year. This is the observed histogram $O$. Since there are 20,000 earthquakes, the total of all bins in $O$ will be 20,000.

Next, create an expected histogram $E$ of 12 bins whose bins have values

E = ([ 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 ] / 365) * 20000

giving a uniform expected number of earthquakes in each month. (Simply ignore leap years for this assignment.)
We want you to use the $\chi^2$ test to determine whether the earthquakes occur uniformly or not. In Matlab you can use the cdf() function to compute probabilities of $\chi^2$ values.

help cdf

ChiSquareProbability = cdf( 'chisquare', ChiSquareStatistic, nu )

Each dataset is a .csv file obtained from a server at http://www.iris.edu/seismon. Earthquake times are reported in UTC (Coordinated Universal Time), which is essentially the same thing as GMT (Greenwich Mean Time). However, all we will need is the month in which each earthquake occurred.

2.1 Do large earthquakes in the area around New Zealand happen uniformly around the year?¶

On Sunday November 13 there was a 7.8 earthquake in New Zealand.

The file NewZealandQuakes.csv is a list of the 20,000 largest earthquakes since 1970 in the Tonga/Kermadec Islands region near New Zealand. This region, where the Australian continent and Pacific Ocean meet, is very highly active earthquake-wise. It is an amazing geological formation, a wall that is over 400 miles deep.

Does this distribution of month values of earthquakes differ significantly from a uniform distribution?

E = ([ 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 ] / 365.0) * 20000;

NZQuakes = csvread('NewZealandQuakes.csv', 1, 0);

MonthData = NZQuakes(:, 2);

number_of_bins = 12;
nu = number_of_bins - 1;

O = hist(MonthData, 1 : number_of_bins);  % ... histogram of column 'Month' in the NewZealandQuakes.csv file ...
hist(MonthData, 1 : number_of_bins)

hold on;
plot(1 : number_of_bins, E, 'r');
title('Histogram of Earthquake months in New Zealand, with Expected value', 'FontSize', 7);
hold off;

ChiSquareStatistic = sum((O - E) .^ 2 ./ E)
ChiSquareProbability = cdf('Chisquare', ChiSquareStatistic, nu)

fprintf('The ChiSquareStatistic is so large that it is unlikely the distribution is uniform.');

ChiSquareStatistic =

   68.3314


ChiSquareProbability =

    1.0000

The ChiSquareStatistic is so large that it is unlikely the distribution is uniform.

2.2 Do large earthquakes in the area around Japan happen uniformly around the year?¶

The file JapanQuakes.csv is a list of the 20,000 largest earthquakes since 1970 in the region around Japan. A magnitude 9.0 earthquake hit this area in 2011.

Does the distribution of month values of earthquakes in this region differ significantly from a uniform distribution?

E = ([ 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 ] / 365.0) * 20000;

JPQuakes = csvread('JapanQuakes.csv', 1, 0);

MonthData = JPQuakes(:, 2);

number_of_bins = 12;
nu = number_of_bins - 1;

O = hist(MonthData, 1 : number_of_bins); % ... histogram of column 'Month' in the JapanQuakes.csv file ...
hist(MonthData, 1 : number_of_bins)

hold on;
plot(1 : number_of_bins, E, 'r');
title('Histogram of Earthquake months in Japan, with Expected value', 'FontSize', 7);
hold off;

ChiSquareStatistic = sum((O - E) .^ 2 ./ E)
ChiSquareProbability = cdf('Chisquare', ChiSquareStatistic, nu)

fprintf('The ChiSquareStatistic is so large that it is unlikely the distribution is uniform.');

ChiSquareStatistic =

  746.8763


ChiSquareProbability =

     1

The ChiSquareStatistic is so large that it is unlikely the distribution is uniform.

Problem 3: Using Newton's Method on Matrices¶

The Course Reader explains that Newton's method for computing inverses $b^{-1}$ works by defining $f(x) = (1/x - b)$, giving a Newton iteration $x_{n+1} \,=\, g(x_n)$, where:

$$ g(x) ~ ~=~ ~ x ~-~ f(x)/f'(x) ~ ~=~ ~ x ~-~ \frac{1/x - b}{-1/x^2} ~ ~=~ ~ x ~ \left( 2 ~-~ bx \right) . $$

A similar iteration obtained from $f(X) \,=\, (X^{-1} \,-\, B)$ can be used on matrices:

$$ g(X) ~=~ X ~-~ (\nabla f(X))^{-1} ~ f(X) ~=~ X ~-~ ({-X^{2}}) \; ({X^{-1} - B}) ~=~ X ~ \left( 2\,I ~-~ B\,X \right) . $$

Use the iteration to obtain $B^{-1}$ for each of the Hilbert matrices $B$:

B = hilb(4)
B = hilb(8)
B = invhilb(4)
B = invhilb(8)

For B = hilb(...), you can start with initial value $X=I$, the identity matrix.

For B = invhilb(...), you will need to find a good starting value yourself.

For each such matrix $B$, determine how many iterations are needed for convergence, and measure the relative error $||X-B^{-1}|| ~/~ ||B^{-1}||$.

The function invhilb() can be used to obtain the exact value of the inverse of hilb().

B = hilb(4)
Binv = invhilb(4)
X = eye(4)  % starting value for the iteration
I = eye(4);

g = @(x) x * (2 * I - B * x);
relative_error = norm(X - Binv) / norm(Binv);
eps = 2^-42;  %  Matrices are different, this seems to be the best I can do, credit to Xin Min

i = 0;

while (relative_error > eps)
    i = i + 1;
    oldX = X;
    X = g(oldX);
    relative_error = norm(X - Binv) / norm(Binv);
end

fprintf('Resulting X:')

X

fprintf('We need %d iterations for convergence with relative error %16.15f\n', i, relative_error);

B =

    1.0000    0.5000    0.3333    0.2500
    0.5000    0.3333    0.2500    0.2000
    0.3333    0.2500    0.2000    0.1667
    0.2500    0.2000    0.1667    0.1429


Binv =

          16        -120         240        -140
        -120        1200       -2700        1680
         240       -2700        6480       -4200
        -140        1680       -4200        2800


X =

     1     0     0     0
     0     1     0     0
     0     0     1     0
     0     0     0     1

Resulting X:
X =

   1.0e+03 *

    0.0160   -0.1200    0.2400   -0.1400
   -0.1200    1.2000   -2.7000    1.6800
    0.2400   -2.7000    6.4800   -4.2000
   -0.1400    1.6800   -4.2000    2.8000

We need 19 iterations for convergence with relative error 0.000000000000212

B = hilb(8)
Binv = invhilb(8)
X = eye(8)  % starting value for the iteration
I = eye(8);

g = @(x) x * (2 * I - B * x);
relative_error = norm(X - Binv) / norm(Binv);
eps = 2^-26;  %  Matrices are different, this seems to be the best I can do, credit to Xin Min

i = 0;

while (relative_error > eps)
    i = i + 1;
    oldX = X;
    X = g(oldX);
    relative_error = norm(X - Binv) / norm(Binv);
end

fprintf('Resulting X:')

X

fprintf('We need %d iterations for convergence with relative error %16.15f\n', i, relative_error);

B =

  Columns 1 through 7

    1.0000    0.5000    0.3333    0.2500    0.2000    0.1667    0.1429
    0.5000    0.3333    0.2500    0.2000    0.1667    0.1429    0.1250
    0.3333    0.2500    0.2000    0.1667    0.1429    0.1250    0.1111
    0.2500    0.2000    0.1667    0.1429    0.1250    0.1111    0.1000
    0.2000    0.1667    0.1429    0.1250    0.1111    0.1000    0.0909
    0.1667    0.1429    0.1250    0.1111    0.1000    0.0909    0.0833
    0.1429    0.1250    0.1111    0.1000    0.0909    0.0833    0.0769
    0.1250    0.1111    0.1000    0.0909    0.0833    0.0769    0.0714

  Column 8

    0.1250
    0.1111
    0.1000
    0.0909
    0.0833
    0.0769
    0.0714
    0.0667


Binv =

   1.0e+09 *

  Columns 1 through 7

    0.0000   -0.0000    0.0000   -0.0001    0.0002   -0.0003    0.0002
   -0.0000    0.0001   -0.0010    0.0047   -0.0116    0.0156   -0.0106
    0.0000   -0.0010    0.0114   -0.0582    0.1497   -0.2043    0.1413
   -0.0001    0.0047   -0.0582    0.3049   -0.8004    1.1099   -0.7769
    0.0002   -0.0116    0.1497   -0.8004    2.1344   -2.9968    2.1189
   -0.0003    0.0156   -0.2043    1.1099   -2.9968    4.2499   -3.0301
    0.0002   -0.0106    0.1413   -0.7769    2.1189   -3.0301    2.1754
   -0.0001    0.0029   -0.0389    0.2162   -0.5946    0.8562   -0.6184

  Column 8

   -0.0001
    0.0029
   -0.0389
    0.2162
   -0.5946
    0.8562
   -0.6184
    0.1767


X =

     1     0     0     0     0     0     0     0
     0     1     0     0     0     0     0     0
     0     0     1     0     0     0     0     0
     0     0     0     1     0     0     0     0
     0     0     0     0     1     0     0     0
     0     0     0     0     0     1     0     0
     0     0     0     0     0     0     1     0
     0     0     0     0     0     0     0     1

Resulting X:
X =

   1.0e+09 *

  Columns 1 through 7

    0.0000   -0.0000    0.0000   -0.0001    0.0002   -0.0003    0.0002
   -0.0000    0.0001   -0.0010    0.0047   -0.0116    0.0156   -0.0106
    0.0000   -0.0010    0.0114   -0.0582    0.1497   -0.2043    0.1413
   -0.0001    0.0047   -0.0582    0.3049   -0.8004    1.1099   -0.7769
    0.0002   -0.0116    0.1497   -0.8004    2.1344   -2.9968    2.1189
   -0.0003    0.0156   -0.2043    1.1099   -2.9968    4.2499   -3.0301
    0.0002   -0.0106    0.1413   -0.7769    2.1189   -3.0301    2.1754
   -0.0001    0.0029   -0.0389    0.2162   -0.5946    0.8562   -0.6184

  Column 8

   -0.0001
    0.0029
   -0.0389
    0.2162
   -0.5946
    0.8562
   -0.6184
    0.1767

We need 39 iterations for convergence with relative error 0.000000012505087

B = invhilb(4)
Binv = hilb(4)

% https://piazza.com/class/itsan0ubeb91eh?cid=198
X = B' / (norm(B, 1) * norm(B, Inf));
I = eye(4);

g = @(x) x * (2 * I - B * x);
relative_error = norm(X - Binv) / norm(Binv);
eps = 2^-42;  %  Matrices are different, this seems to be the best I can do, credit to Xin Min

i = 0;

while (relative_error > eps)
    i = i + 1;
    oldX = X;
    X = g(oldX);
    relative_error = norm(X - Binv) / norm(Binv);
end

fprintf('Resulting X:')

X

fprintf('We need %d iterations for convergence with relative error %16.15f\n', i, relative_error);

B =

          16        -120         240        -140
        -120        1200       -2700        1680
         240       -2700        6480       -4200
        -140        1680       -4200        2800


Binv =

    1.0000    0.5000    0.3333    0.2500
    0.5000    0.3333    0.2500    0.2000
    0.3333    0.2500    0.2000    0.1667
    0.2500    0.2000    0.1667    0.1429

Resulting X:
X =

    1.0000    0.5000    0.3333    0.2500
    0.5000    0.3333    0.2500    0.2000
    0.3333    0.2500    0.2000    0.1667
    0.2500    0.2000    0.1667    0.1429

We need 34 iterations for convergence with relative error 0.000000000000084

B = invhilb(8)
Binv = hilb(8)

% https://piazza.com/class/itsan0ubeb91eh?cid=198
X = B' / (norm(B, 1) * norm(B, Inf));
I = eye(8);

g = @(x) x * (2 * I - B * x);
relative_error = norm(X - Binv) / norm(Binv);
eps = 2^-26;  %  Matrices are different, this seems to be the best I can do, credit to Xin Min

i = 0;

while (relative_error > eps)
    i = i + 1;
    oldX = X;
    X = g(oldX);
    relative_error = norm(X - Binv) / norm(Binv);
end

fprintf('Resulting X:')

X

fprintf('We need %d iterations for convergence with relative error %16.15f\n', i, relative_error);

B =

   1.0e+09 *

  Columns 1 through 7

    0.0000   -0.0000    0.0000   -0.0001    0.0002   -0.0003    0.0002
   -0.0000    0.0001   -0.0010    0.0047   -0.0116    0.0156   -0.0106
    0.0000   -0.0010    0.0114   -0.0582    0.1497   -0.2043    0.1413
   -0.0001    0.0047   -0.0582    0.3049   -0.8004    1.1099   -0.7769
    0.0002   -0.0116    0.1497   -0.8004    2.1344   -2.9968    2.1189
   -0.0003    0.0156   -0.2043    1.1099   -2.9968    4.2499   -3.0301
    0.0002   -0.0106    0.1413   -0.7769    2.1189   -3.0301    2.1754
   -0.0001    0.0029   -0.0389    0.2162   -0.5946    0.8562   -0.6184

  Column 8

   -0.0001
    0.0029
   -0.0389
    0.2162
   -0.5946
    0.8562
   -0.6184
    0.1767


Binv =

  Columns 1 through 7

    1.0000    0.5000    0.3333    0.2500    0.2000    0.1667    0.1429
    0.5000    0.3333    0.2500    0.2000    0.1667    0.1429    0.1250
    0.3333    0.2500    0.2000    0.1667    0.1429    0.1250    0.1111
    0.2500    0.2000    0.1667    0.1429    0.1250    0.1111    0.1000
    0.2000    0.1667    0.1429    0.1250    0.1111    0.1000    0.0909
    0.1667    0.1429    0.1250    0.1111    0.1000    0.0909    0.0833
    0.1429    0.1250    0.1111    0.1000    0.0909    0.0833    0.0769
    0.1250    0.1111    0.1000    0.0909    0.0833    0.0769    0.0714

  Column 8

    0.1250
    0.1111
    0.1000
    0.0909
    0.0833
    0.0769
    0.0714
    0.0667

Resulting X:
X =

  Columns 1 through 7

    1.0000    0.5000    0.3333    0.2500    0.2000    0.1667    0.1429
    0.5000    0.3333    0.2500    0.2000    0.1667    0.1429    0.1250
    0.3333    0.2500    0.2000    0.1667    0.1429    0.1250    0.1111
    0.2500    0.2000    0.1667    0.1429    0.1250    0.1111    0.1000
    0.2000    0.1667    0.1429    0.1250    0.1111    0.1000    0.0909
    0.1667    0.1429    0.1250    0.1111    0.1000    0.0909    0.0833
    0.1429    0.1250    0.1111    0.1000    0.0909    0.0833    0.0769
    0.1250    0.1111    0.1000    0.0909    0.0833    0.0769    0.0714

  Column 8

    0.1250
    0.1111
    0.1000
    0.0909
    0.0833
    0.0769
    0.0714
    0.0667

We need 73 iterations for convergence with relative error 0.000000010568690

Problem 4: Regularized Optimization in Ridge Regression¶

In the traditional least squares model ${\bf y} \,=\, X {\bf c}$, the vector of coefficients ${\bf c}$ is often chosen to minimize the least squared error

$$ \epsilon({\bf c}) ~~=~~ ||{{\bf y} ~-~ X {\bf c}}||^2 . $$

We can add a constraint that requires $||{{\bf c}}||^2$ to be small. This is often implemented by adding a regularization term $R({\bf c})$ to the error that is very large for values of ${\bf c}$ that violate the constraints. In this case the iteration seeks to minimize the regularized problem

$$ \epsilon({\bf c}) ~+~ R({\bf c}) ~~=~~ ||{{\bf y} ~-~ X {\bf c}}||^2~+~ R({\bf c}) . $$

4.0 Get the data¶

The dataset for this problem is called prostate.csv. Read in this file. The columns are lcavol,lweight,age,lbph,svi,lcp,gleason,pgg45,lpsa.

The regression problem is to predict lpsa from the other 8 variables.

ProstateData = csvread('prostate.csv', 1, 0);

%y = lpsa;
y = ProstateData(:, 9);

%X = [ lcavol lweight age lbph svi lcp gleason pgg45 ];
X = ProstateData(:, 1 : 8);

4.1 Tikhonov regularization¶

Tikhonov regularization $R({\bf c}) = ||{T\,{\bf c}}||^2$ uses a covariance matrix $T$ that is chosen to scale the $\bf{c}$ values properly. In this case we want to solve the Tikhonov regularized least squares problem, minimizing

$$ L({\bf c},T) ~=~ \epsilon({\bf c}) ~+~ R({\bf c}) ~ ~=~ ~ ||{{\bf y} ~-~ X {\bf c}}||^2 ~+~ ||{T\,{\bf c}}||^2 ~ ~=~ ~ \left|\left|{ \left({\!\! \begin{array}{c} {\bf y} \\ {\bf 0} \end{array} \!\!}\right) \;-\; \left({\!\! \begin{array}{c} X \\ T \end{array} \!\!}\right) \, {\bf c} }\right|\right|^2 . $$

To minimize the expression on the right, we need to find the least squares solution of:

$$ \left({ \begin{array}{c} X \\ T \end{array} }\right) ~ {\bf c} ~ ~=~ ~ \left({ \begin{array}{c} {\bf y} \\ {\bf 0} \end{array} }\right) . $$

What are the normal equations for this equation that we can use to obtain the least squares solution?

My answer (to be filled in with Markdown and Equations):

$$ (X' X + T' T) {\bf c} ~=~ X' \mathbf{y} $$

From this equation give a formula for the least squares solution ${\bf c}$:

My answer (to be filled in with Markdown and Equations):

$$ {\bf c} ~=~ (X' X + T' T)^{-1}X' \mathbf{y} $$

4.2 Ridge Regression¶

Ridge Regression is a popular form of this least squares problem when $T'\,T \,=\, \lambda\,I$ for some $\lambda \geq 0$ (so that $T'\,T$ has a 'ridge' down the diagonal). As $\lambda$ decreases to 0, the problem reduces to the ordinary least squares, while as $\lambda$ grows large, $||{{\bf c}}||$ becomes increasingly emphasized and minimization reduces it to zero.

Assume that the singular values of $X$ are $\sigma_1, \ldots, \sigma_p$. What are the singular values of $(X'\,X ~+~ T'\,T)$?

My answer (to be filled in with Markdown and Equations):

We are assuming that $X$ has SVD $U S V'$, where $S$ is a diagonal matrix whose $i$-th singular value is $\sigma_i$, so $(X' X)$ has SVD $V S'S V'$, with $i$-th singular value $\sigma_i^2$.

Therefore, since $T'T = \lambda I$, and $V D V' = D$ for any constant diagonal matrix $D = \lambda I$, the $i$-th singular value of $(X' X + T' T)$ is:

Since $$ X = USV',\ (X'X + T'T) = (V S'S V' + \lambda I) = S'S + \lambda I $$ Thus, the $i$-th singular value of $(X' X + T' T)$ is: $$ \sigma_i^2 + \lambda $$

Also give the singular values of the hat matrix $$ H_{\lambda} \,=\, X\, (X'\,X \,+\, \lambda \,I)^{-1}\,X' . $$

My answer (to be filled in with Markdown and Equations):

Again, we are assuming that $X$ has SVD $U S V'$, where $S$ has $i$-th singular value $\sigma_i$, and $(X' X) = V S' S V'$ has $i$-th singular value $\sigma_i^2$.

Therefore, the $i$-th singular value of the hat matrix $H_{\lambda} = X (X' X + \lambda I)^{-1} X'$ is

Since $(X' X + \lambda I)$ has $i$-th singular value $\sigma_i^2 + \lambda$, the $i$-th entry of $(X' X + \lambda I)^{-1}$ has singualr value $1 / (\sigma_i^2 + \lambda)$. Thus we have $$ H_{\lambda} = X (X' X + \lambda I)^{-1} X' = X \mathbf{\sigma^{-1}} I X' = \mathbf{\sigma^{-1}}X X' = \mathbf{\sigma^{-1}}USV'(VS'U') = \mathbf{\sigma^{-1}}USS'U', \mathbf{\sigma^{-1}} = (\frac{1} {\sigma_1^2 + \lambda}, \dots, \frac{1}{\sigma_p^2 + \lambda})' $$ Then the $i$-th entry of singular values is: $$ \frac{\sigma_i^2 }{ \sigma_i^2 + \lambda} $$

4.3 Degrees of Freedom¶

Define degrees of freedom $\mbox{df}(\lambda) \,=\, \mbox{trace}(\,H_{\lambda}\,)$, where $H_{\lambda}$ is the hat matrix.

Give a formula for $\mbox{df}(\lambda)$ in terms of $\lambda$ and the singular values of $X$:

My answer (to be filled in with Markdown and Equations):

Assuming the $i$-th singular value of the hat matrix $H_{\lambda} = X (X' X + \lambda I)^{-1} X'$ is as in the previous answer, $$ df(\lambda) = \sum_{i = 1}^p \frac{\sigma_i^2 }{ \sigma_i^2 + \lambda} $$

Also give a formula for $\mbox{df}(0)$:

My answer (to be filled in with Markdown and Equations):

$$ df(0) = \sum_{i = 1}^p \frac{\sigma_i^2 }{ \sigma_i^2 + 0} = p $$

4.4 Actually computing regression coefficients for various values of $\lambda$¶

Using the prostate dataset, let ${\bf y}$ be the vector lcavol, and let $X$ be the matrix with the other variables.

Also normalize $X$ and ${\bf y}$ (i.e., replace them by their z-scores), so that all columns have the same scale.

Plot the value of the ridge regression coefficients ${\bf c}$ for all values of $\lambda$ from 0 to $2 \, ||{X' \, X}||$.

% COMPLETE THIS CODE SO THAT IT PLOTS THE COEFFICIENTS c FOR VARIOUS VALUES OF lambda
% AND INCLUDE THE PLOT IN THE OUTPUT YOU SUBMIT.

NX = zscore(X);
NY = zscore(y);

max_lambda = round( 2 * norm(NX' * NX) );
N = 100;
p = min(size(NX));
lambda_values = linspace( 0, max_lambda, N );
I = eye(p);
c_values = zeros(p,N);
colors = { 'r' 'g' 'b' 'c' 'm' 'y' 'k' 'b' };
varnames = { 'lcavol' 'lweight' 'age' 'lbph' 'svi' 'lcp' 'gleason' 'pgg45' };

% c = (X'X + lambda I) X' y
XAdjointX = NX' * NX;
XAdjointTimesY = NX' * NY;

for j=1:N
   c_values(:,j) = pinv(XAdjointX + lambda_values(j) * I) * XAdjointTimesY;
end

plot( lambda_values, c_values(1,:), colors{1})
hold on
for i = 2 : p
   plot( lambda_values,  c_values(i, :), colors{i} )
end
xlabel('lambda')
ylabel('coefficient value')
title('Ridge regression coefficients for various values of lambda', 'FontSize', 9)
legend(varnames)   %  add a legend showing variable names
hold off

4.5 Plotting the Generalized Cross-Validation function¶

An 'optimal' value of $\lambda$ will yield coefficients ${\bf c}$ that balances the residual sum of squares:

$$ RSS = ||{{\bf y} \,-\, \widehat{{\bf y}}}||^2 \,=\, ||{{\bf y} \,-\, H_{\lambda}\,{\bf y}}||^2 $$

Note $\widehat{{\bf y}} \,=\, H_{\lambda}\,{\bf y}$ depends on $\lambda$ and the regularization error $||{{\bf c}}||^2$.

Define the Generalized Cross-Validation (GCV) measure

$$ \mbox{GCV}(\lambda) ~=~ \frac{n}{(n - \mbox{df}(\lambda))^2} ~ ||{\,{\bf y} \,-\, \widehat{{\bf y}} \,}||^2 . $$

Ridge Regression is often implemented by finding a value of $\lambda$ that minimizes $\mbox{GCV}(\lambda)$.

For the values of $\lambda$ that you used above, compute $\mbox{GCV}(\lambda)$.

% COMPLETE THIS CODE SO THAT IT PLOTS GCV(lambda) FOR VARIOUS VALUES OF lambda
% AND INCLUDE THE PLOT IN THE OUTPUT YOU SUBMIT.

max_lambda = round( 2 * norm(NX' * NX) );
N = 100;
n = max(size(NX));
p = min(size(NX));
I = eye(p);
lambda_values = linspace( 0, max_lambda, N );
GCV = zeros(N,1);

[U, S, V] = svd(NX);
sigma = diag(S);

DF = zeros(N);
for i = 1 : N
    DF(i) = sum(sigma .^ 2 ./ (sigma .^ 2 + lambda_values(i)));
end

XAdjointX = NX' * NX;
HHAT = @(i) NX * pinv(XAdjointX + lambda_values(i) * I) * NX';

for j = 1 : N
    yHat = HHAT(j) * NY;
    GCV(j) = n / (n - DF(j))^2 * norm(NY - yHat)^2;
end

minimal_lambda_position = find( GCV == min(GCV) )
minimal_lambda = lambda_values( minimal_lambda_position )

plot( lambda_values, GCV, 'b' )
xlabel('lambda')
ylabel('GCV(lambda)')
title( sprintf('GCV(lambda) is minimal at lambda = %7.3f', minimal_lambda) )
hold on
plot( [minimal_lambda], [min(GCV)], 'r.' )
hold off

minimal_lambda_position =

     2


minimal_lambda =

    6.5152

4.6 Actually Optimizing Generalized Cross-Validation¶

First, develop a function that computes $\mbox{GCV}(\lambda)$.

Then, use an optimization routine to find a value $\lambda$ in the interval $[\, 0,\,2 \, ||{X' \, X}|| \,]$ for which $\mbox{GCV}(\lambda)$ is minimal.

%  optimization functions in Octave / Matlab

help fzero    % find a root of a univariate function

help fminbnd  % minimization of a univariate function

help fminsearch % minimization of a multivariate function

help fminunc % minimization of a multivariate function (with gradients)

help lsqlin   % linear least squares

help quadprog % quadratic programming

help lsqnonlin % nonlinear least squares

help fsolve   % solve set of nonlinear equations (vector-valued function)

% COMPLETE THIS CODE SO THAT IT FINDS A VALUE OF lambda THAT MINIMIZES GCV(lambda).

max_lambda = round( 2 * norm(NX' * NX) );
n = max(size(NX));

DF = @(lambda) sum(sigma .^ 2 ./ (sigma .^ 2 + lambda));

HHAT = @(lambda) NX * pinv(XAdjointX + lambda * I) * NX';

GCV_function = @(lambda) n / (n - DF(lambda))^2 * norm(NY - HHAT(lambda) * NY)^2;

minimal_lambda =  fminbnd(GCV_function, 0, max_lambda)

minimal_lambda =

    6.6671