Your name: Jerry Liu
Your UID: 404474229
Please upload only this notebook to CCLE by the deadline.
Policy for late submission of solutions: We will use Paul Eggert's Late Policy: $N$ days late $\Leftrightarrow$ $2^N$ points deducted} The number of days late is $N=0$ for the first 24 hrs, $N=1$ for the next 24 hrs, etc., and if you submit an assignment $H$ hours late, $2^{\lfloor \,H\,/\,24\, \rfloor}$ points are deducted.
For every presidential candidate (except Donald Trump), income tax returns have been filed prior to an election: http://www.taxhistory.org/www/website.nsf/Web/PresidentialTaxReturns?OpenDocument.
Some of these returns are impressive: Mitt Romney's 2011 tax return was 379 pages long!
Hillary Clinton's income was the highest of the 2016 candidates (except perhaps Trump); see
her 2015 income tax return.
Bernie Sanders' reported income is barely enough for his family to survive in Los Angeles...
In this assignment, take the files HR_Clinton_2014_tax_return_numbers.txt and HR_Clinton_2015_tax_return_numbers.txt listing numbers in Hillary's tax returns for the last 2 years.
For each of these two files, determine (using the method in the Course Reader) whether the unique numbers in this file (please omit duplicates) violate Benford's Law.
% cat HR_Clinton_2014_tax_return_numbers.txt | sed 's/,//g' > HR_Clinton_2014_tax_return_numbers_nocomma.txt
H14 = unique(dlmread('HR_Clinton_2014_tax_return_numbers_nocomma.txt','\n'));
H15 = unique(dlmread('HR_Clinton_2015_tax_return_numbers.txt','\n'));
firstdigit = @(x) floor(x ./ (10 .^ floor(log10(x))));
number_of_bins = 9;
nu = number_of_bins - 1;
Hillary14Histogram = hist(firstdigit(H14), 1 : 9)
Hillary15Histogram = hist(firstdigit(H15), 1 : 9)
BenfordProbabilities = diff(log10(1:10));
% 2014
figure
hist(firstdigit(H14), 1 : 9)
N14 = length(H14);
Benford14Histogram = N14 * BenfordProbabilities;
hold on;
plot( 1 : 9, Benford14Histogram, 'r');
title('Histogram of firstst digits in Hillary 2014 tax return, with Benford expected values', 'FontSize', 6);
hold off;
ChiSquareStatistic14 = sum((Hillary14Histogram - Benford14Histogram) .^ 2 ./ Benford14Histogram)
ChiSquareProbability14 = cdf('Chisquare', ChiSquareStatistic14, nu)
% 2015
figure
hist(firstdigit(H15), 1 : 9)
N15 = length(H15);
Benford15Histogram = N15 * BenfordProbabilities;
hold on;
plot( 1 : 9, Benford15Histogram, 'r');
title('Histogram of firstst digits in Hillary 2015 tax return, with Benford expected values', 'FontSize', 6);
hold off;
ChiSquareStatistic15 = sum((Hillary15Histogram - Benford15Histogram) .^ 2 ./ Benford15Histogram)
ChiSquareProbability15 = cdf('Chisquare', ChiSquareStatistic15, nu)
fprintf(['2014 tax return from Hillary Clinton could violate the Benford Law\n', ...
'as the test statistics is a relatively high value: %f.\n', ...
'The 2015 tax return is more normal\n', ...
'since the test statistics is %f, a relatively low value.'], ...
ChiSquareProbability14, ChiSquareProbability15);
3 pages of Donald Trump's 1995 Income Tax form have been published by the New York Times.
The file DJ_Trump_1995_tax_return_numbers.txt lists numbers in this (partial) income tax return.
Determine whether the unique numbers in this file violate Benford's Law.
TrumpTaxReturn = unique(dlmread('DJ_Trump_1995_tax_return_numbers.txt','\n'));
firstdigit = @(x) floor(x ./ (10 .^ floor(log10(x))));
number_of_bins = 9;
nu = number_of_bins - 1;
TrumpHistogram = hist(firstdigit(TrumpTaxReturn), 1 : 9)
BenfordProbabilities = diff(log10(1:10));
hist(firstdigit(TrumpTaxReturn), 1 : 9)
N = length(TrumpTaxReturn);
BenfordHistogram = N * BenfordProbabilities;
hold on;
plot( 1 : 9, BenfordHistogram, 'r');
title('Histogram of firstst digits in Trump 1995 tax return, with Benford expected values', 'FontSize', 6);
hold off;
ChiSquareStatistic = sum((TrumpHistogram - BenfordHistogram) .^ 2 ./ BenfordHistogram)
ChiSquareProbability = cdf('Chisquare', ChiSquareStatistic, nu)
fprintf(['1995 tax return from Donald Trump violates the Benford Law\n', ...
'as the test statistics is\n', ...
'so high that it is almost impossible to exceed it: %f.\n'], ..., ...
ChiSquareProbability);
For this assignment we want you to test whether earthquakes in a given region occur with uniform frequency throughout the year. Some geologists claim earthquakes are completely unpredictable, occurring randomly.
Specifically, we want you to compare the histograms of months in which the earthquakes compare with an expected uniform histogram.
To do this, given a dataset of 20,000 earthquakes, we want you to create a histogram with 12 bins giving the number of earthquakes that occurred within each of the 12 months of the year. This is the observed histogram $O$. Since there are 20,000 earthquakes, the total of all bins in $O$ will be 20,000.
Next, create an expected histogram $E$ of 12 bins whose bins have values
E = ([ 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 ] / 365) * 20000giving a uniform expected number of earthquakes in each month. (Simply ignore leap years for this assignment.)
help cdf ChiSquareProbability = cdf( 'chisquare', ChiSquareStatistic, nu )
Each dataset is a .csv file obtained from a server at http://www.iris.edu/seismon. Earthquake times are reported in UTC (Coordinated Universal Time), which is essentially the same thing as GMT (Greenwich Mean Time). However, all we will need is the month in which each earthquake occurred.
On Sunday November 13 there was a 7.8 earthquake in New Zealand.
The file NewZealandQuakes.csv is a list of the 20,000 largest earthquakes since 1970 in the Tonga/Kermadec Islands region near New Zealand. This region, where the Australian continent and Pacific Ocean meet, is very highly active earthquake-wise. It is an amazing geological formation, a wall that is over 400 miles deep.
Does this distribution of month values of earthquakes differ significantly from a uniform distribution?
E = ([ 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 ] / 365.0) * 20000;
NZQuakes = csvread('NewZealandQuakes.csv', 1, 0);
MonthData = NZQuakes(:, 2);
number_of_bins = 12;
nu = number_of_bins - 1;
O = hist(MonthData, 1 : number_of_bins); % ... histogram of column 'Month' in the NewZealandQuakes.csv file ...
hist(MonthData, 1 : number_of_bins)
hold on;
plot(1 : number_of_bins, E, 'r');
title('Histogram of Earthquake months in New Zealand, with Expected value', 'FontSize', 7);
hold off;
ChiSquareStatistic = sum((O - E) .^ 2 ./ E)
ChiSquareProbability = cdf('Chisquare', ChiSquareStatistic, nu)
fprintf('The ChiSquareStatistic is so large that it is unlikely the distribution is uniform.');
The file JapanQuakes.csv is a list of the 20,000 largest earthquakes since 1970 in the region around Japan. A magnitude 9.0 earthquake hit this area in 2011.
Does the distribution of month values of earthquakes in this region differ significantly from a uniform distribution?
E = ([ 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 ] / 365.0) * 20000;
JPQuakes = csvread('JapanQuakes.csv', 1, 0);
MonthData = JPQuakes(:, 2);
number_of_bins = 12;
nu = number_of_bins - 1;
O = hist(MonthData, 1 : number_of_bins); % ... histogram of column 'Month' in the JapanQuakes.csv file ...
hist(MonthData, 1 : number_of_bins)
hold on;
plot(1 : number_of_bins, E, 'r');
title('Histogram of Earthquake months in Japan, with Expected value', 'FontSize', 7);
hold off;
ChiSquareStatistic = sum((O - E) .^ 2 ./ E)
ChiSquareProbability = cdf('Chisquare', ChiSquareStatistic, nu)
fprintf('The ChiSquareStatistic is so large that it is unlikely the distribution is uniform.');
The Course Reader explains that Newton's method for computing inverses $b^{-1}$ works by defining $f(x) = (1/x - b)$, giving a Newton iteration $x_{n+1} \,=\, g(x_n)$, where:
$$ g(x) ~ ~=~ ~ x ~-~ f(x)/f'(x) ~ ~=~ ~ x ~-~ \frac{1/x - b}{-1/x^2} ~ ~=~ ~ x ~ \left( 2 ~-~ bx \right) . $$A similar iteration obtained from $f(X) \,=\, (X^{-1} \,-\, B)$ can be used on matrices:
$$ g(X) ~=~ X ~-~ (\nabla f(X))^{-1} ~ f(X) ~=~ X ~-~ ({-X^{2}}) \; ({X^{-1} - B}) ~=~ X ~ \left( 2\,I ~-~ B\,X \right) . $$Use the iteration to obtain $B^{-1}$ for each of the Hilbert matrices $B$:
For B = hilb(...), you can start with initial value $X=I$, the identity matrix.
For B = invhilb(...), you will need to find a good starting value yourself.
For each such matrix $B$, determine how many iterations are needed for convergence, and measure the relative error $||X-B^{-1}|| ~/~ ||B^{-1}||$.
The function invhilb() can be used to obtain the exact value of the inverse of hilb().
B = hilb(4)
Binv = invhilb(4)
X = eye(4) % starting value for the iteration
I = eye(4);
g = @(x) x * (2 * I - B * x);
relative_error = norm(X - Binv) / norm(Binv);
eps = 2^-42; % Matrices are different, this seems to be the best I can do, credit to Xin Min
i = 0;
while (relative_error > eps)
i = i + 1;
oldX = X;
X = g(oldX);
relative_error = norm(X - Binv) / norm(Binv);
end
fprintf('Resulting X:')
X
fprintf('We need %d iterations for convergence with relative error %16.15f\n', i, relative_error);
B = hilb(8)
Binv = invhilb(8)
X = eye(8) % starting value for the iteration
I = eye(8);
g = @(x) x * (2 * I - B * x);
relative_error = norm(X - Binv) / norm(Binv);
eps = 2^-26; % Matrices are different, this seems to be the best I can do, credit to Xin Min
i = 0;
while (relative_error > eps)
i = i + 1;
oldX = X;
X = g(oldX);
relative_error = norm(X - Binv) / norm(Binv);
end
fprintf('Resulting X:')
X
fprintf('We need %d iterations for convergence with relative error %16.15f\n', i, relative_error);
B = invhilb(4)
Binv = hilb(4)
% https://piazza.com/class/itsan0ubeb91eh?cid=198
X = B' / (norm(B, 1) * norm(B, Inf));
I = eye(4);
g = @(x) x * (2 * I - B * x);
relative_error = norm(X - Binv) / norm(Binv);
eps = 2^-42; % Matrices are different, this seems to be the best I can do, credit to Xin Min
i = 0;
while (relative_error > eps)
i = i + 1;
oldX = X;
X = g(oldX);
relative_error = norm(X - Binv) / norm(Binv);
end
fprintf('Resulting X:')
X
fprintf('We need %d iterations for convergence with relative error %16.15f\n', i, relative_error);
B = invhilb(8)
Binv = hilb(8)
% https://piazza.com/class/itsan0ubeb91eh?cid=198
X = B' / (norm(B, 1) * norm(B, Inf));
I = eye(8);
g = @(x) x * (2 * I - B * x);
relative_error = norm(X - Binv) / norm(Binv);
eps = 2^-26; % Matrices are different, this seems to be the best I can do, credit to Xin Min
i = 0;
while (relative_error > eps)
i = i + 1;
oldX = X;
X = g(oldX);
relative_error = norm(X - Binv) / norm(Binv);
end
fprintf('Resulting X:')
X
fprintf('We need %d iterations for convergence with relative error %16.15f\n', i, relative_error);
In the traditional least squares model ${\bf y} \,=\, X {\bf c}$, the vector of coefficients ${\bf c}$ is often chosen to minimize the least squared error
$$ \epsilon({\bf c}) ~~=~~ ||{{\bf y} ~-~ X {\bf c}}||^2 . $$We can add a constraint that requires $||{{\bf c}}||^2$ to be small. This is often implemented by adding a regularization term $R({\bf c})$ to the error that is very large for values of ${\bf c}$ that violate the constraints. In this case the iteration seeks to minimize the regularized problem
$$ \epsilon({\bf c}) ~+~ R({\bf c}) ~~=~~ ||{{\bf y} ~-~ X {\bf c}}||^2~+~ R({\bf c}) . $$The dataset for this problem is called prostate.csv. Read in this file. The columns are lcavol,lweight,age,lbph,svi,lcp,gleason,pgg45,lpsa.
The regression problem is to predict lpsa from the other 8 variables.
ProstateData = csvread('prostate.csv', 1, 0);
%y = lpsa;
y = ProstateData(:, 9);
%X = [ lcavol lweight age lbph svi lcp gleason pgg45 ];
X = ProstateData(:, 1 : 8);
Tikhonov regularization $R({\bf c}) = ||{T\,{\bf c}}||^2$ uses a covariance matrix $T$ that is chosen to scale the $\bf{c}$ values properly. In this case we want to solve the Tikhonov regularized least squares problem, minimizing
$$ L({\bf c},T) ~=~ \epsilon({\bf c}) ~+~ R({\bf c}) ~ ~=~ ~ ||{{\bf y} ~-~ X {\bf c}}||^2 ~+~ ||{T\,{\bf c}}||^2 ~ ~=~ ~ \left|\left|{ \left({\!\! \begin{array}{c} {\bf y} \\ {\bf 0} \end{array} \!\!}\right) \;-\; \left({\!\! \begin{array}{c} X \\ T \end{array} \!\!}\right) \, {\bf c} }\right|\right|^2 . $$To minimize the expression on the right, we need to find the least squares solution of:
$$ \left({ \begin{array}{c} X \\ T \end{array} }\right) ~ {\bf c} ~ ~=~ ~ \left({ \begin{array}{c} {\bf y} \\ {\bf 0} \end{array} }\right) . $$What are the normal equations for this equation that we can use to obtain the least squares solution?
My answer (to be filled in with Markdown and Equations):
$$ (X' X + T' T) {\bf c} ~=~ X' \mathbf{y} $$From this equation give a formula for the least squares solution ${\bf c}$:
My answer (to be filled in with Markdown and Equations):
$$ {\bf c} ~=~ (X' X + T' T)^{-1}X' \mathbf{y} $$Ridge Regression is a popular form of this least squares problem when $T'\,T \,=\, \lambda\,I$ for some $\lambda \geq 0$ (so that $T'\,T$ has a 'ridge' down the diagonal). As $\lambda$ decreases to 0, the problem reduces to the ordinary least squares, while as $\lambda$ grows large, $||{{\bf c}}||$ becomes increasingly emphasized and minimization reduces it to zero.
Assume that the singular values of $X$ are $\sigma_1, \ldots, \sigma_p$. What are the singular values of $(X'\,X ~+~ T'\,T)$?
My answer (to be filled in with Markdown and Equations):
We are assuming that $X$ has SVD $U S V'$, where $S$ is a diagonal matrix whose $i$-th singular value is $\sigma_i$, so $(X' X)$ has SVD $V S'S V'$, with $i$-th singular value $\sigma_i^2$.
Therefore, since $T'T = \lambda I$, and $V D V' = D$ for any constant diagonal matrix $D = \lambda I$, the $i$-th singular value of $(X' X + T' T)$ is:
Since $$ X = USV',\ (X'X + T'T) = (V S'S V' + \lambda I) = S'S + \lambda I $$ Thus, the $i$-th singular value of $(X' X + T' T)$ is: $$ \sigma_i^2 + \lambda $$
Also give the singular values of the hat matrix $$ H_{\lambda} \,=\, X\, (X'\,X \,+\, \lambda \,I)^{-1}\,X' . $$
My answer (to be filled in with Markdown and Equations):
Again, we are assuming that $X$ has SVD $U S V'$, where $S$ has $i$-th singular value $\sigma_i$, and $(X' X) = V S' S V'$ has $i$-th singular value $\sigma_i^2$.
Therefore, the $i$-th singular value of the hat matrix $H_{\lambda} = X (X' X + \lambda I)^{-1} X'$ is
Since $(X' X + \lambda I)$ has $i$-th singular value $\sigma_i^2 + \lambda$, the $i$-th entry of $(X' X + \lambda I)^{-1}$ has singualr value $1 / (\sigma_i^2 + \lambda)$. Thus we have $$ H_{\lambda} = X (X' X + \lambda I)^{-1} X' = X \mathbf{\sigma^{-1}} I X' = \mathbf{\sigma^{-1}}X X' = \mathbf{\sigma^{-1}}USV'(VS'U') = \mathbf{\sigma^{-1}}USS'U', \mathbf{\sigma^{-1}} = (\frac{1} {\sigma_1^2 + \lambda}, \dots, \frac{1}{\sigma_p^2 + \lambda})' $$ Then the $i$-th entry of singular values is: $$ \frac{\sigma_i^2 }{ \sigma_i^2 + \lambda} $$
Define degrees of freedom $\mbox{df}(\lambda) \,=\, \mbox{trace}(\,H_{\lambda}\,)$, where $H_{\lambda}$ is the hat matrix.
Give a formula for $\mbox{df}(\lambda)$ in terms of $\lambda$ and the singular values of $X$:
My answer (to be filled in with Markdown and Equations):
Assuming the $i$-th singular value of the hat matrix $H_{\lambda} = X (X' X + \lambda I)^{-1} X'$ is as in the previous answer, $$ df(\lambda) = \sum_{i = 1}^p \frac{\sigma_i^2 }{ \sigma_i^2 + \lambda} $$
Also give a formula for $\mbox{df}(0)$:
My answer (to be filled in with Markdown and Equations):
$$ df(0) = \sum_{i = 1}^p \frac{\sigma_i^2 }{ \sigma_i^2 + 0} = p $$Using the prostate dataset, let ${\bf y}$ be the vector lcavol, and let $X$ be the matrix with the other variables.
Also normalize $X$ and ${\bf y}$ (i.e., replace them by their z-scores), so that all columns have the same scale.
Plot the value of the ridge regression coefficients ${\bf c}$ for all values of $\lambda$ from 0 to $2 \, ||{X' \, X}||$.
% COMPLETE THIS CODE SO THAT IT PLOTS THE COEFFICIENTS c FOR VARIOUS VALUES OF lambda
% AND INCLUDE THE PLOT IN THE OUTPUT YOU SUBMIT.
NX = zscore(X);
NY = zscore(y);
max_lambda = round( 2 * norm(NX' * NX) );
N = 100;
p = min(size(NX));
lambda_values = linspace( 0, max_lambda, N );
I = eye(p);
c_values = zeros(p,N);
colors = { 'r' 'g' 'b' 'c' 'm' 'y' 'k' 'b' };
varnames = { 'lcavol' 'lweight' 'age' 'lbph' 'svi' 'lcp' 'gleason' 'pgg45' };
% c = (X'X + lambda I) X' y
XAdjointX = NX' * NX;
XAdjointTimesY = NX' * NY;
for j=1:N
c_values(:,j) = pinv(XAdjointX + lambda_values(j) * I) * XAdjointTimesY;
end
plot( lambda_values, c_values(1,:), colors{1})
hold on
for i = 2 : p
plot( lambda_values, c_values(i, :), colors{i} )
end
xlabel('lambda')
ylabel('coefficient value')
title('Ridge regression coefficients for various values of lambda', 'FontSize', 9)
legend(varnames) % add a legend showing variable names
hold off
An 'optimal' value of $\lambda$ will yield coefficients ${\bf c}$ that balances the residual sum of squares:
$$ RSS = ||{{\bf y} \,-\, \widehat{{\bf y}}}||^2 \,=\, ||{{\bf y} \,-\, H_{\lambda}\,{\bf y}}||^2 $$Note $\widehat{{\bf y}} \,=\, H_{\lambda}\,{\bf y}$ depends on $\lambda$ and the regularization error $||{{\bf c}}||^2$.
Define the Generalized Cross-Validation (GCV) measure
$$ \mbox{GCV}(\lambda) ~=~ \frac{n}{(n - \mbox{df}(\lambda))^2} ~ ||{\,{\bf y} \,-\, \widehat{{\bf y}} \,}||^2 . $$Ridge Regression is often implemented by finding a value of $\lambda$ that minimizes $\mbox{GCV}(\lambda)$.
For the values of $\lambda$ that you used above, compute $\mbox{GCV}(\lambda)$.
% COMPLETE THIS CODE SO THAT IT PLOTS GCV(lambda) FOR VARIOUS VALUES OF lambda
% AND INCLUDE THE PLOT IN THE OUTPUT YOU SUBMIT.
max_lambda = round( 2 * norm(NX' * NX) );
N = 100;
n = max(size(NX));
p = min(size(NX));
I = eye(p);
lambda_values = linspace( 0, max_lambda, N );
GCV = zeros(N,1);
[U, S, V] = svd(NX);
sigma = diag(S);
DF = zeros(N);
for i = 1 : N
DF(i) = sum(sigma .^ 2 ./ (sigma .^ 2 + lambda_values(i)));
end
XAdjointX = NX' * NX;
HHAT = @(i) NX * pinv(XAdjointX + lambda_values(i) * I) * NX';
for j = 1 : N
yHat = HHAT(j) * NY;
GCV(j) = n / (n - DF(j))^2 * norm(NY - yHat)^2;
end
minimal_lambda_position = find( GCV == min(GCV) )
minimal_lambda = lambda_values( minimal_lambda_position )
plot( lambda_values, GCV, 'b' )
xlabel('lambda')
ylabel('GCV(lambda)')
title( sprintf('GCV(lambda) is minimal at lambda = %7.3f', minimal_lambda) )
hold on
plot( [minimal_lambda], [min(GCV)], 'r.' )
hold off
First, develop a function that computes $\mbox{GCV}(\lambda)$.
Then, use an optimization routine to find a value $\lambda$ in the interval $[\, 0,\,2 \, ||{X' \, X}|| \,]$ for which $\mbox{GCV}(\lambda)$ is minimal.
% optimization functions in Octave / Matlab
help fzero % find a root of a univariate function
help fminbnd % minimization of a univariate function
help fminsearch % minimization of a multivariate function
help fminunc % minimization of a multivariate function (with gradients)
help lsqlin % linear least squares
help quadprog % quadratic programming
help lsqnonlin % nonlinear least squares
help fsolve % solve set of nonlinear equations (vector-valued function)
% COMPLETE THIS CODE SO THAT IT FINDS A VALUE OF lambda THAT MINIMIZES GCV(lambda).
max_lambda = round( 2 * norm(NX' * NX) );
n = max(size(NX));
DF = @(lambda) sum(sigma .^ 2 ./ (sigma .^ 2 + lambda));
HHAT = @(lambda) NX * pinv(XAdjointX + lambda * I) * NX';
GCV_function = @(lambda) n / (n - DF(lambda))^2 * norm(NY - HHAT(lambda) * NY)^2;
minimal_lambda = fminbnd(GCV_function, 0, max_lambda)