
CS170A -- HW#1 -- assignment and solution form -- Matlab

Your name: Jerry Liu

Your UID: 404474229

Please upload only this notebook to CCLE by the deadline.

Policy for late submission of solutions: We will use Paul Eggert's Late Policy: $N$ days late $\Leftrightarrow$ $2^N$ points deducted. The number of days late is $N=0$ for the first 24 hrs, $N=1$ for the next 24 hrs, etc., and if you submit an assignment $H$ hours late, $2^{\lfloor H/24\rfloor}$ points are deducted.

NOTE: In this assignment we provide pseudocode to get you started.

In later assignments we will not do this.

Problem 1: SVD k-th order approximations (30 points)

If $A$ is a matrix that has SVD $A = U\,S\,V'$, the rank-$k$ approximation of $A$ is obtained by keeping only the first $k$ singular values (equivalently, the first $k$ columns of the SVD factors).

Specifically, given an $n \times p$ matrix $A$ with SVD $A = U\,S\,V'$, if $k \leq n$ and $k \leq p$, the rank-$k$ approximation of $A$ is $$ A^{(k)} ~~=~~ U ~ S^{(k)} ~ V' $$ where $S^{(k)}$ is the result of setting all diagonal elements to zero after the first $k$ entries $(1 \leq k \leq p)$. If $U^{(k)}$ and $V^{(k)}$ are like $U$ and $V$ but with all columns zero after the first $k$, then $$ A^{(k)} ~~=~~ U ~ S^{(k)} ~ V' ~~=~~ U^{(k)} ~ S^{(k)} ~ V^{(k)'} . $$
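For concreteness, here is a minimal sketch of forming $A^{(k)}$ from the truncated factors (it assumes a matrix A and a rank k are already in scope; it is not part of the assignment template):

In [ ]:
[U, S, V] = svd(A);                        % full SVD
A_k = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % same as U * S^(k) * V' with diagonal entries after k zeroed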

In class, we saw a demo of the attached Matlab script imagesvdgui.m --- and the effectiveness of this approximation in retaining information about an image.

The goal of this problem is to implement this approximation for black-and-white (grayscale) images.

In [6]:
load mandrill
Mandrill = ind2rgb(X, map);

A = mean( Mandrill, 3 );  % grayscale image -- size 480 x 500.
size(A)

imwrite(A, 'GrayMandrill.bmp')  % Write the Mandrill to a bitmap image file

% The matrix A now contains the Mandrill image (in grayscale)
ans =

   480   500

Display the bitmap image file using an HTML img tag:

<img src="resources/GrayMandrill.bmp">

1.(a): Plot Singular Values of the Rank-$k$ Approximation of an Image

As in HW0, construct a grayscale version of the Mandrill image, and take one of the 3 color planes as a 480 x 500 matrix. This is our "black and white" image $A$. You are to analyze the rank-$k$ approximation of the image.

Compute the SVD of $A$, and plot the singular values $\sigma_1$, $\sigma_2$, ...

In [8]:
[U, S, V] = svd(A);     % U, S, V are now the SVD of A

norm( A - U * S * V' )  %  A should match the product  U * S * V'

singular_values = diag(S);
plot(singular_values);
ans =

   1.7837e-12

1.(b): Optimal Rank-$k$ Approximation of an Image

Find the value of $k$ that minimizes $\mid\mid{A \; - \; A^{(k)}}\mid\mid^2_{F} ~+~ k$. (By part 1.(c) below, this objective equals $\sum_{i>k} \sigma_i^2 + k$, so it can be computed directly from the singular values.)

In [24]:
[n, p] = size(A);
[U, S, V] = svd(A);
maximum_possible_k = min(n, p);
min_err = Inf;                       % so the first candidate k always wins
squared_singular_vals = diag(S).^2;
minK = 0;

for k = 1 : maximum_possible_k
    % by 1.(c),  ||A - A^(k)||_F^2  =  sum_{i>k} sigma_i^2
    err_sum = sum(squared_singular_vals((k + 1) : maximum_possible_k)) + k;
    if err_sum < min_err
        min_err = err_sum;
        minK = k;
    end
end
minK
minK =

   265

1.(c): The Rank-$k$ Approximation is a Good Approximation

In the chapter on the SVD, the course reader presents a derivation for $A \, - \, A^{(k)}$: $$\begin{eqnarray*} A \; - \; A^{(k)} & = & U \; S \; V' ~ - ~ U^{(k)} ~ S^{(k)} ~ V^{(k)'} \\ & = & U \; S \; V' ~ - ~ U \; S^{(k)} \; V' \\ & = & U \; (S ~ - ~ S^{(k)}) \; V ' \\ \end{eqnarray*}$$

Prove the following: $$ \mid\mid{A \; - \; A^{(k)}}\mid\mid^2_{F} ~=~ \sum_{i>k} \sigma_i^2 . $$

Proof

Recall that $\mid\mid M \mid\mid^2_F = \text{tr}(M M')$ for any real matrix $M$, that the trace is invariant under cyclic permutations, and that $U$ and $V$ are orthogonal, so $U'U = I$ and $V'V = I$.

Because $A \; - \; A^{(k)} ~=~ U \, (S - S^{(k)}) \, V'$,

$$\begin{eqnarray*} \mid\mid{A \; - \; A^{(k)}}\mid\mid^2_{F} & = & \text{tr}\left( U (S - S^{(k)}) V' \, \left( U (S - S^{(k)}) V' \right)' \right) \\ & = & \text{tr}\left( U (S - S^{(k)}) V' \, V (S - S^{(k)})' U' \right) \\ & = & \text{tr}\left( U (S - S^{(k)}) (S - S^{(k)})' U' \right) \\ & = & \text{tr}\left( (S - S^{(k)}) (S - S^{(k)})' U' U \right) \\ & = & \text{tr}\left( (S - S^{(k)}) (S - S^{(k)})' \right) . \end{eqnarray*}$$

The matrix $S - S^{(k)}$ is diagonal, with zeros in its first $k$ entries and $\sigma_{k+1}, \sigma_{k+2}, \ldots$ after them, so this trace is $\sum_{i>k} \sigma_i^2$.

Therefore $$ \mid\mid{A \; - \; A^{(k)}}\mid\mid^2_{F} ~=~ \sum_{i>k} \sigma_i^2 . $$
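A quick numeric sanity check of this identity (a sketch; it assumes the Mandrill matrix $A$ from above is still in scope, and the test rank k = 40 is arbitrary):

In [ ]:
[U, S, V] = svd(A);
k = 40;                                          % arbitrary test rank
A_k = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
sigma = diag(S);
norm(A - A_k, 'fro')^2 - sum(sigma(k+1:end).^2)  % should be ~0, up to round-off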

Problem 2: Baseball Visualization (40 points)

For this dataset you are given a matrix of statistics for Baseball players. You are to perform two kinds of analysis on this matrix.

Read in the Baseball Statistics

Statistics of top players after the last regular season game, obtained from MLB.com, October 2016.

In [1]:
%%% Stats = csvread('Baseball_Players_Stats_2016.csv', 1, 0);  % skip the header (= row 0)
%%% Names = csvread('Baseball_Players_Names_2016.csv', 1, 0);

Baseball_Players_2016   %% execute Baseball_Players_2016.m  to load in the data needed here

In [5]:
StatNames{1:3}

size(StatNames)

StatNames{:}
ans = Rank
ans = G
ans = AB
ans =

    1   17

ans = Rank
ans = G
ans = AB
ans = R
ans = H
ans = 2B
ans = 3B
ans = HR
ans = RBI
ans = BB
ans = SO
ans = SB
ans = CS
ans = AVG
ans = OBP
ans = SLG
ans = OPS
In [6]:
size(Stats)
Stats(1:3, :)
ans =

   146    17

ans =

 Columns 1 through 6:

     1.00000   146.00000   552.00000   104.00000   192.00000    32.00000
     2.00000   142.00000   531.00000    88.00000   184.00000    47.00000
     3.00000   161.00000   640.00000   108.00000   216.00000    42.00000

 Columns 7 through 12:

     8.00000    11.00000    66.00000    66.00000    80.00000    11.00000
     5.00000    25.00000   104.00000    35.00000    57.00000     5.00000
     5.00000    24.00000    96.00000    60.00000    70.00000    30.00000

 Columns 13 through 17:

     7.00000     0.34800     0.41600     0.49500     0.91100
     3.00000     0.34700     0.39000     0.59500     0.98500
    10.00000     0.33800     0.39600     0.53100     0.92800

In [7]:
size(PlayerNames)
PlayerNames{1:3}
ans =

   146     1

ans =   LeMahieu D
ans =   Murphy D
ans =   Altuve J

Compute a "scaled" version of the Stats matrix

We scale each column of values ${\bf x}$ in Stats to be ${\bf z} = ({\bf x}-\mu)/\sigma$ in ScaledStats, where $\mu$ is the mean of the ${\bf x}$ values, and $\sigma$ is their standard deviation.

In Octave/Matlab, the function mean() computes column means, and std() computes standard deviations. The function zscore() computes both, and uses them to "scale" each column in this way.

This scaling is also called normalization and standardization. The z-scores ${\bf z} = ({\bf x}-\mu)/\sigma$ are also called the standardized or normalized values for ${\bf x}$.
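Note that zscore() ships with the Statistics Toolbox in Matlab and the statistics package in Octave. If it is unavailable, a minimal sketch of the same column-wise scaling is:

In [ ]:
mu    = mean(Stats);                  % 1 x p row vector of column means
sigma = std(Stats);                   % 1 x p row vector of column standard deviations
ScaledStats = (Stats - mu) ./ sigma;  % implicit expansion (Matlab R2016b+ / Octave);
                                      % on older Matlab use bsxfun(@minus, ...) and bsxfun(@rdivide, ...)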

In [2]:
ScaledStats  =  zscore(Stats);   %  z  =  (x-mu)/sigma

mean(ScaledStats)  %  the means of each column after normalization should be 0
std(ScaledStats)   %  the standard deviations of each column after normalization should be 1
ans =

   1.0e-14 *

  Columns 1 through 7

   -0.0032    0.0508   -0.1114   -0.0168    0.0262    0.0225   -0.0046

  Columns 8 through 14

   -0.0047    0.0236   -0.0156   -0.0131   -0.0036   -0.0025    0.4593

  Columns 15 through 17

    0.6185   -0.1468    0.7886


ans =

  Columns 1 through 7

    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000

  Columns 8 through 14

    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000

  Columns 15 through 17

    1.0000    1.0000    1.0000

2 (a): Random Projections

A fundamental problem in data science is that a dataset with many features cannot be visualized directly. Given an $n \times p$ dataset (matrix) $A$ in which the number of features $p$ is large, there is no obvious way to plot the data.

Dimensionality reduction algorithms attempt to find a dataset with a lower value of $p$ that approximates $A$ in some way. Although there are sophisticated algorithms, a competitive approach is to compute a random projection of $A$ into a few dimensions. When the projection is into 2 or 3 dimensions, the result can be visualized.

A random $k$-D projection of an $n \times p$ dataset (matrix) $A$ is the result $(A \, P)$ of multiplying $A$ on the right by a $p \times k$ matrix $P$ of random values.

The result is an $n \times k$ matrix; for $k=2$, it assigns each row in $A$ a new pair of values $(x,y)$, and these can be interpreted as positions in a 2D plot.

In [ ]:
% plotting 2D values
Iris = csvread('iris.csv', 1,0);  % skip over the header line
A = Iris(:, 1:4);  % just the measurement columns
[n p] = size(A);
P = rand(p,2);
disp('random projection weights:')
disp(P)
XY = A * P;
plot(XY(:,1), XY(:,2), 'b.')
title('random projection of the iris data')
species = {'  s', '  v', '  V'};  % one label per iris class in column 5
text(XY(:,1), XY(:,2), species(Iris(:,5)), 'FontSize', 10 )

Problem: write a function random_projection(A,k) that, given an input matrix $A$ of size $n \times p$ and an integer $k>0$, produces a random $k$-D projection.

Please use uniform random values in the matrix $P$.

Then: plot the result of 3 random 2D projections of the data.

In each plot, identify the greatest outlier -- the player with $(x,y)$ values that have the largest total $x+y$. Print the row in the dataset whose projection is this outlier.

In [6]:
random_projection = @(A,k) A * rand(size(A, 2), k);  % P is p x k with uniform random values, so use size(A,2)

for i = 1 : 3
    XY = random_projection( ScaledStats, 2 );
    subplot(3, 1, i);
    if i == 1
        xcolor = 'rx';
    elseif i == 2
        xcolor = 'bx';
    else
        xcolor = 'gx';
    end
    plot(XY(:,1), XY(:,2), xcolor)
    if i == 1
        title( 'Random 2D Projection of Scaled Stats Matrix' )
        legend('1st run', 'Location', 'northwest', 'Orientation', 'Horizontal');
    elseif i == 2
        legend('2nd run', 'Location', 'northwest', 'Orientation', 'Horizontal');
    else
        legend('3rd run', 'Location', 'northwest', 'Orientation', 'Horizontal');
    end

    XplusY = sum(XY, 2);
    [maxXplusY, greatest_outlier] = max(XplusY);
    fprintf('The outlier row for run %d is row %d: \n', i, greatest_outlier);
    ScaledStats(greatest_outlier, :)
    names = strcat(PlayerNames(greatest_outlier), ', ');
    ranks = PlayerRanks(greatest_outlier);
    textStr = strcat(names, ranks);
    text( XY(greatest_outlier, 1), XY(greatest_outlier, 2), textStr, 'FontSize', 5 );
end
The outlier row for run 1 is row 23: 

ans =

  Columns 1 through 7

   -1.1941    0.9517    0.7176    1.2764    1.1280    1.8599    1.2291

  Columns 8 through 14

    1.1771    0.6463    1.6874    1.6587   -0.2862   -0.7504    1.0713

  Columns 15 through 17

    1.8459    1.9062    2.1042

The outlier row for run 2 is row 10: 

ans =

  Columns 1 through 7

   -1.5015    1.0485   -0.0394    2.4622    0.9197    0.3406    0.8175

  Columns 8 through 14

    0.6824    1.0661    2.9758    0.6352    2.1244    1.1605    1.5517

  Columns 15 through 17

    3.1168    1.5840    2.3872

The outlier row for run 3 is row 50: 

ans =

  Columns 1 through 7

   -0.5557    0.7579    0.7176    0.7117    0.7114    1.1693   -0.0056

  Columns 8 through 14

   -0.3070   -0.6597    1.2102    1.7490    5.3386    4.6639    0.4430

  Columns 15 through 17

    0.8849    0.0067    0.3569

2 (b): Latent Semantic Analysis

The course reader describes Latent Semantic Indexing for a matrix of values measuring association between X terms and Y terms.

The classic example is a "term/document matrix" for Keywords vs. Books, shown below. The code shown produces an LSI plot for the data.

Your job is to produce an analogous LSI plot for the table of Baseball players.

Components are computed as in: Berry, M. W., Dumais, S. T., and O'Brien, G. W. "Using linear algebra for intelligent information retrieval." SIAM Review, 37(4), 1995, 573-595.

Some LSI references: lsi.research.telcordia.com/lsi/LSIpapers.html
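To see the construction in isolation before applying it to the baseball table, here is a sketch on a tiny term/document count matrix (the numbers are made up, not from any dataset):

In [ ]:
TD = [2 0 1 0;        % term 1 counts in documents 1..4  (hypothetical)
      0 3 0 1;        % term 2
      1 0 2 0];       % term 3
[U, S, V] = svd(TD);
term_xy = U(:,1:2);   % 2D coordinates for terms (rows of TD)
doc_xy  = V(:,1:2);   % 2D coordinates for documents (columns of TD)
plot(term_xy(:,1), term_xy(:,2), 'r*', doc_xy(:,1), doc_xy(:,2), 'b*')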

In [59]:
nX = size(StatNames, 2)

nY = size(PlayerNames, 1)

% ScaledStats  has  rows = Players (1..146)  and  columns = Stats Attr (1..17);
% ScaledStats' has  rows = Stats Attr (Rank, G, AB, ...)  and  columns = Players (1..146).

[U,S,V] = svd(ScaledStats');

%  We can look at a plot of the singular values (diagonal of S):

plot( diag(S), 'b' )

Xfactor = U(:,1:2);
Yfactor = V(:,1:2);

% plot the 2D projection of the data

figure                        % new figure, so the singular-value plot above is kept
text_offset = 0.01;
plot( Xfactor(:,1), Xfactor(:,2), 'r*' )
hold on
plot( Yfactor(:,1), Yfactor(:,2), 'b*' )

for i = (1:nX)
     text( Xfactor(i,1)+text_offset, Xfactor(i,2), StatNames(i), 'FontSize', 4)
end

for i = (1:nY)
     text( Yfactor(i,1)+text_offset, Yfactor(i,2), PlayerNames(i), 'FontSize', 4)
end

title( 'Latent Semantic Indexing:  Players vs. Stats Names' )
xlabel( '1st component' )
ylabel( '2nd component' )
axis([-0.3, 0.4, -0.45,0.38]);

zoom on
hold off
nX =

    17


nY =

   146
In [ ]:

Problem 3: Global Warming again (30 points)

In HW0, you plotted the average (non-missing-value) temperature anomaly over the entire grid, for every year from 1916 to 2015.

In this problem we want you to fit linear models through the data.

In [2]:
% set up everything as in HW#0:

GHCN = csvread('ghcn.csv');

%   The data was artificially shifted to [0, 4500];
%     its range should be [-2500, +2000]/100 = [-25,+20], in degrees Centigrade.
%     Since our focus here is on warming, we ignore temperatures below -5.
%   We omit the year and month in columns 1:2 before scaling:

GHCN_in_centigrade  = (GHCN(:,3:74) - 2500) / 100;

temperature_anomaly = reshape( GHCN_in_centigrade, [36, 12, 137, 72] );   % convert to a 4D matrix, so we can use slices

missing_values = (temperature_anomaly == -25);
number_of_missing_values = sum(sum(sum(sum( missing_values ))));

maximum_anomaly_value = max(max(max(max( temperature_anomaly ))));
minimum_anomaly_value = min(min(min(min( temperature_anomaly .*  (~ missing_values) ))));  %  '~' is 'not' in MATLAB

US_latitude  = 9:12
US_longitude = 15:20
my_years = 1916:2015;
my_slice = temperature_anomaly( US_latitude, :, my_years - 1880 + 1, US_longitude );

total_number_of_grid_squares = length(US_latitude) * length(US_longitude) * 12;
N = total_number_of_grid_squares;

average_US_anomaly_by_year = reshape( sum(sum(sum( my_slice, 4),2),1), [length(my_years) 1] ) / N;

plot( my_years, average_US_anomaly_by_year )
xlabel('year')
ylabel('avg anomaly (Centigrade) in the US')
title('average US temperature anomaly by year')
US_latitude =

     9    10    11    12


US_longitude =

    15    16    17    18    19    20

(a) Global Average Temperature Anomaly: Linear Model (Least Squares)

Problem: fit a line through the data, using Least Squares.
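As a reminder of the mechanics (a sketch on made-up data, separate from the solution below): the backslash operator solves the least squares problem directly, and agrees with the normal-equations solution.

In [ ]:
x = (1:10)';
y = 0.5 * x + randn(10, 1);        % hypothetical noisy line
X = [x, ones(size(x))];            % design matrix with an intercept column
beta = X \ y                       % least squares coefficients [slope; intercept]
beta_check = (X' * X) \ (X' * y)   % same solution via the normal equations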

In [3]:
my_years = 1916:2015;
temperature_nonmissing_anomaly = temperature_anomaly .* (~ missing_values);  % missing cells become 0
my_slice = temperature_nonmissing_anomaly( :, :, my_years - 1880 + 1, : );

total_number_of_grid_squares = 36 * 72 * 12;
N = total_number_of_grid_squares;

average_global_anomaly_by_year = reshape( sum(sum(sum( my_slice, 4),2),1), [length(my_years) 1] ) / N;
nRows = size(my_years, 2);
my_years_A = [my_years', ones(nRows, 1)];   % design matrix:  [year, 1]

linear_model = my_years_A \ average_global_anomaly_by_year  % least squares fit:  anomaly ~ slope*year + intercept

plot( my_years, average_global_anomaly_by_year );
hold on;
X = [linspace(1916, 2015, 10000)' ones(10000, 1)];
Y = X * linear_model;
plot(X(:,1), Y);                            % plot the fitted line against the year column only
axis([1900,2020,-0.3,0.2]);
xlabel('year')
ylabel('avg global anomaly (Centigrade)')
title('average global temperature anomaly by year')
linear_model =

    0.0017
   -3.5005

(b) Global Average Temperature Anomaly: Piecewise Linear Model (Least Squares)

Problem: fit a 2-segment piecewise linear model through the data, using Least Squares.

Specifically, find a pair of least squares models, one from 1916 up to year Y, and one from year Y+1 to 2015, such that the SSE (sum of squared errors) is minimized.

In [4]:
minimum_SSE = Inf;                 % so any candidate split year will beat this
linear_model_1 = ones(2, 1);
linear_model_2 = ones(2, 1);
YSelected = 1917;

for Y = 1917:2013
    year_up_to_Y = 1916 : Y;
    year_up_to_Y = [year_up_to_Y', ones(size(year_up_to_Y, 2), 1)];
    average_global_anomaly_by_year_up_to_Y = average_global_anomaly_by_year(1 : (Y - 1915));
    average_global_anomaly_by_year_after_Y = average_global_anomaly_by_year((Y - 1914) : 100);
    year_after_Y = (Y + 1) : 2015;
    year_after_Y = [year_after_Y', ones(size(year_after_Y, 2), 1)];
    linear_model_up_to_Y = year_up_to_Y \ average_global_anomaly_by_year_up_to_Y;
    linear_model_after_Y = year_after_Y \ average_global_anomaly_by_year_after_Y;
    Y_year_up_to_Y = year_up_to_Y * linear_model_up_to_Y;
    SSE_up_to_Y = (average_global_anomaly_by_year_up_to_Y - Y_year_up_to_Y)' * (average_global_anomaly_by_year_up_to_Y - Y_year_up_to_Y);
    Y_year_after_Y = year_after_Y * linear_model_after_Y;
    SSE_after_Y = (average_global_anomaly_by_year_after_Y - Y_year_after_Y)' * (average_global_anomaly_by_year_after_Y - Y_year_after_Y);
    total_SSE = SSE_up_to_Y + SSE_after_Y;
    if (total_SSE < minimum_SSE)
        linear_model_1 = linear_model_up_to_Y;
        linear_model_2 = linear_model_after_Y;
        YSelected = Y;
        minimum_SSE = total_SSE;
    end
end
plot( my_years, average_global_anomaly_by_year );
hold on;
X_1 = [linspace(1916, YSelected, 10000)' ones(10000, 1)];
Y_1 = X_1 * linear_model_1;
plot(X_1(:,1), Y_1);
X_2 = [linspace(YSelected + 1, 2015, 10000)' ones(10000, 1)];
Y_2 = X_2 * linear_model_2;
plot(X_2(:,1), Y_2);
axis([1900,2020,-0.3,0.2]);
xlabel('year')
ylabel('avg global anomaly (Centigrade)')
title('average global temperature anomaly by year')
minimum_SSE
YSelected
linear_model_1
linear_model_2
minimum_SSE =

    0.0870


YSelected =

        1963


linear_model_1 =

   -0.0004
    0.6252


linear_model_2 =

    0.0054
  -10.8261
In [ ]: