CS170A -- HW#3 -- assignment and solution form -- Matlab¶

Your name: Jerry Liu

Your UID: 404474229

Please upload only this notebook to CCLE by the deadline.

Policy for late submission of solutions: We will use Paul Eggert's Late Policy: $N$ days late $\Leftrightarrow$ $2^N$ points deducted} The number of days late is $N=0$ for the first 24 hrs, $N=1$ for the next 24 hrs, etc., and if you submit an assignment $H$ hours late, $2^{\lfloor \,H\,/\,24\, \rfloor}$ points are deducted.

Problem 1: Fourier Music¶

Remember the $k$-th note in the the equal-tempered scale has frequency $c^k$ times the frequency of C, where $c = 2^{1/12} = 1.059463\ldots$:

note	frequency (Hz)	frequency/C	interval
Middle C	261.63	$c^0 = 1.000$	~
C#	277.18	$c^1 = 1.059$	minor 2nd
D	293.66	$c^2 = 1.122$	major 2nd
D#	311.13	$c^3 = 1.189$	minor 3rd
E	329.63	$c^4 = 1.260$	major 3rd
F	349.23	$c^5 = 1.334$	4th
F#	369.99	$c^6 = 1.414$	diminished 5th
G	392.00	$c^7 = 1.498$	5th
G#	415.30	$c^8 = 1.587$	minor 6th
A	440	$c^9 = 1.682$	major 6th
A#	466.16	$c^{10} = 1.782$	minor 7th
B	493.88	$c^{11} = 1.888$	major 7th
High C	523.25	$c^{12} = 2.000$	octave/8th

The equal-tempered scale can be extended to higher or lower octaves by multiplying the frequencies in it by a power of 2. (For example, the note 'A' occurs at frequencies 220, 440, 880, etc., because 440 = 2 $\times$ 220, 880 = 4 $\times$ 220, etc.

However, it is not right to call the frequency 660 = 3 $\times$ 220 an 'A'; the frequency 660 is an 'E', because 600 = 2 $\times$ 330, and 330 is the frequency for 'E'.) Only the frequencies $2^k \times 440$ (for integer $k$) are called 'A'.

Remember: if a signal is sampled at frequency Fs, then the Nyquist frequency at the end of this first half is Fs/2.

1.0 Read in the Mystery Tune¶

Read in the file mystery.wav, creating a vector y that contains a sound track (audio sequence) for the first few seconds of a mystery tune.

As discussed in class, if Fs is an integer sampling frequency. In Octave and Matlab, executing sound(y, Fs) will play y at the frequency Fs.

Change y to contain only the first of the 2 stereo audio tracks.

[y, Fs] = audioread('mystery.wav');

sound(y, Fs)   %  play the mystery tune

y = y(:,1);   %  only keep the first track for this assignment

n = length(y)

n =

      860000

1.1 The Fast Fourier Transform¶

The length of y is $n = 86000$. Find the integer factors of $n$ by using the Matlab function factor().

Then find out how much cpu time it takes to compute fft(y);

(To do this, run it 100 times and compute the average time required, using cputime.)

factor(n)

time = zeros(100, 1);

for i = 1 : 100
    start = cputime;
    fft(y);
    time(i) = cputime - start;
end

fprintf('Mean CPU time for fft(y): ')
mean(time)

ans =

     2     2     2     2     2     5     5     5     5    43

Mean CPU time for fft(y): 
ans =

    0.0235

1.2 The Fast Fourier Transform?¶

First, define a new vector z that contains y(1:(n-13)). Find the integer factors of $(n-13)$.

Find out how much cpu time it takes to compute fft(z); using the method above.

Explain the difference in timing; this is discussed on pp.235-236 of the Course Reader.

z = y(1:(n-13));
factor(n - 13)

time = zeros(100, 1);

for i = 1 : 100
    start = cputime;
    fft(z);
    time(i) = cputime - start;
end

fprintf('Mean CPU time for fft(z): ')
mean(time)

ans =

      859987

Mean CPU time for fft(z): 
ans =

    0.1556

%% Explanation:
% Since (n - 13) is a prime number, we cannot use fast fourier transform on this number.
% On n, however, we have many small integer factors 
% which permit divide-and-conquer algorithm to faster execution.
% This is why it takes much longer to compute.

1.3 Plot the First Half of the Power Spectrum¶

Plot the frequency spectrum for y by plotting the power of the 'first half' of the squared power of its Fourier transform, using Fs as the sampling frequency.

Assume for this assignment that the power of a complex value $z$ is its complex absolute value: i.e., $\mbox{power}(z) = |z| = \sqrt{z \, \bar{z}}$.

Here by "first half" of a sequence $s =[s_0, \ldots, s_{n-1}]$ we actually mean $[s_1, \ldots, s_m]$ where $m = \lfloor n/2 \rfloor$. For example, the "first half" of $[0,1,2,3,4,5,6,7]$ is $[1,2,3,4]$.

To increase the information displayed by your plot, omit the very first element in the power vector (which is just the sum of the data).

firstHalf = 2 : (n / 2 + 1);
Y = fft(y);
power = abs(Y(firstHalf)) .^ 2; % the squared power of its Fourier transform

NyquistFrequency = Fs / 2;
frequencies = linspace(1, NyquistFrequency, n / 2);

plot(frequencies, power);

1.4 Write a script to find the top 10 Notes (corresponding to Spikes)¶

There are altogether about $(5 \times 12)+1 = 61$ notes in the equal-tempered scale with frequencies between low C (with frequency near 131) and the very high C with frequency near $131 \times 2^5 = 4192$.

Suppose we also define dividing lines between notes. For example, C has frequency 261.63 and C# has frequency 271.18. The dividing line between these two notes is $\Delta = (261.63+271.18)/2 = 266.40$.

Just as every note in the equal-tempered scale has frequency $c^k$ times the frequency of C, we can define equal tempered dividing lines between all adjacent notes as having frequency $c^k$ times the frequency $\Delta$.

Write a Matlab function note_power(power,Fs) that distills a power spectrum vector power into an $61 \times 1$ vector that gives, for each equal-tempered note, the maximum value among all entries in the vector within the dividing lines above and below this note. Find the top 10 notes (the 10 notes with highest power).

For the mystery clip, use your function to print the power value for each of these 10 notes.

function notes = note_power(power, Fs)
    cVal = 2 ^ (1 / 12);
    % I assume 131 is the lowest note, and compute the divide line between low C and low C# as the base delta
    delta = (131 + 131 * cVal) / 2;
    numNodes = 61;
    C = ones(numNodes, 1) * cVal;
    for i = 2 : numNodes
        C(i) = C(i - 1) * cVal;
    end
    Delta = [1; delta * C];

    len = length(power);
    notes = zeros(numNodes, 1);
    p = zeros(numNodes, 1);
    
    for i = 1 : numNodes
       start = ceil(Delta(i));
       divLine = floor(Delta(i + 1));
       searchInterval = start : divLine;
       maxLocation = find(power(searchInterval) == max(power(searchInterval)));
       notes(i) = power(maxLocation + start);
    end
end

notes = note_power(power, Fs);
[sp, I] = sort(notes, 'descend');
top10notes = I(1:10) % ith note, 1st (low C), 2nd, ...
highestPower = sp(1:10)

top10notes =

    36
    49
    43
    48
    41
    47
    55
    37
    51
    53


highestPower =

   1.0e+06 *

    2.6379
    2.3466
    1.6629
    1.3973
    1.1926
    1.1285
    0.9805
    0.9789
    0.7155
    0.4994

1.5 Identify the Key, and find the relationship of the top 10 Notes to the Key¶

Find the note whose spike in the power spectrum is largest; this is called the key.

For each of the top 10 notes, compute the ratio of their frequency to that of the key. Each ratio will be a power $c^k$; for each note, give the value of $k$.

A major fifth has a ratio $c^7$, which is approximately 3/2 -- the tonic/dominant ratio.

A minor fifth (diminished fifth) has a ratio $c^6$, which is approximately 1.4.

A major third has a ratio $c^4$, which is approximately 5/4 -- the ratio of harmony.

c = 2^(1/12);

% major_third = c^4
% minor_fifth = c^6
% major_fifth = c^7
lowCFreq = 131;
Notes = ['C' 'C#' 'D' 'D#' 'E' 'F' 'F#' 'G' 'G#' 'A' 'A#' 'B' 'B#'];

[val, key] = max(power);
k = round(mod(log(key / lowCFreq) / log(c), 12)) + 1;
fprintf('The key is: %s', Notes(k));
ratio = zeros(10, 1);
for i = 1 : 10
    ratio(i) = find(power == highestPower(i)) / lowCFreq;
end
ratio
k = log(ratio) / log(c)

c =

    1.0595

The key is: C
ratio =

    8.1679
   16.5344
   12.3130
   16.3817
   11.0000
   15.5267
   24.2290
    8.3359
   19.5115
   21.9466


k =

   36.3597
   48.5687
   43.4653
   48.4081
   41.5132
   47.4802
   55.1840
   36.7120
   51.4350
   53.4711

Problem 2: Sunspots¶

Sunspots have effects on power grids and communications on earth, and have had high intensity recently. They often appear in the news, and there is a sunspot observatory at Big Bear near Los Angeles.

In class we studied a classic example of Fourier analysis: determining the periodicity of sunspot data. The plot of the data shows it is clearly periodic.

% sunspot activity from 1700 to 1987.

years = 1700 : 1987;
sunspots = [ ...
     5    11    16    23    36    58    29    20    10     8     3     0     0     2    11    27 ...
    47    63    60    39    28    26    22    11    21    40    78   122   103    73    47    35 ...
    11     5    16    34    70    81   111   101    73    40    20    16     5    11    22    40 ...
    60    81    83    48    48    31    12    10    10    32    48    54    63    86    61    45 ...
    36    21    11    38    70   106   101    82    67    35    31     7    20    93   154   126 ...
    85    68    39    23    10    24    83   132   131   118    90    67    60    47    41    21 ...
    16     6     4     7    15    34    45    43    48    42    28    10     8     3     0     1 ...
     5    12    14    35    46    41    30    24    16     7     4     2     9    17    36    50 ...
    64    67    71    48    28     9    13    57   122   138   103    86    65    37    24    11 ...
    15    40    62    99   125    96    67    65    54    39    21     7     4    23    55    94 ...
    96    77    59    44    47    31    16     7    38    74   139   111   102    66    45    17 ...
    11    12     3     6    32    54    60    64    64    52    25    13     7     6     7    36 ...
    73    85    78    64    42    26    27    12    10     3     5    24    42    64    54    62 ...
    49    44    19     6     4     1    10    47    57   104    81    64    38    26    14     6 ...
    17    44    64    69    78    65    36    21    11     6     9    36    80   114   110    89 ...
    68    48    31    16    10    33    93   152   136   135    84    69    32    14     4    38 ...
   142   190   185   159   112    54    38    28    10    15    47    94   106   106   105    67 ...
    69    38    35    16    13    28    93   155   155   140   116    67    46    18    13    29 ...
];

plot( years, sunspots, 'b' )
title('Sunspot intensity from 1700 to 1987')
xlabel('year')

We can use Fourier analysis to find the frequency of the cycles shown in the plot above. Based on the script below, sunspots go through a cycle that has frequency (0.091/year) which is (1/(11 years)). Thus the period of sunspot activity is about 11 years long.

figure
plot( years, sunspots, 'b' )
title('Sunspot intensity from 1700 to 1987')
xlabel('year')

power_spectrum = abs(fft(sunspots));
n = length(sunspots);
sampling_frequency = 1;  %  one sample / year
frequencies = linspace( 0, sampling_frequency, n );

figure
plot( frequencies, power_spectrum, 'b' )
text( sampling_frequency/5, max(power_spectrum)*0.80, 'Note the symmetry (always obtained for real data)', 'fontsize', 6 )
text( sampling_frequency/5, max(power_spectrum)*0.70, 'Sampling frequency = once per year = (1 / year)', 'fontsize', 6 )
title( 'Fourier power spectrum of the sunspot data' )
xlabel('frequency (1/year)')

n_over_2 = floor(n/2);
figure
plot( frequencies(2:n_over_2), power_spectrum(2:n_over_2) )
title( 'Spectrum up to the Nyquist frequency' )
xlabel('frequency (1/year)')
% text( sampling_frequency/8, max(power_spectrum)*0.90, 'Nyquist frequency = sampling frequency / 2', 'fontsize', 6 )
% text( sampling_frequency/8, max(power_spectrum)*0.70, 'We ignore the 1st Fourier coefficient here;', 'fontsize', 6 )
% text( sampling_frequency/8, max(power_spectrum)*0.60, 'it is always just the sum of the input data.', 'fontsize', 6 )

search_interval = 2:floor(n/4);  %  skip the first coefficient, which is always the sum of the input data
spike_location = 1+find( power_spectrum(search_interval) == max(power_spectrum(search_interval)) );
spike_frequency = frequencies(spike_location);
period_of_sunspot_cycle = 1/spike_frequency;

figure
plot( frequencies(search_interval), power_spectrum(search_interval) )
text( spike_frequency*1.05, power_spectrum(spike_location)*1.00, ...
      sprintf('Frequency f of largest spike (units: 1/year) = %6.3f',spike_frequency), 'fontsize', 5 )
text( spike_frequency*1.05, power_spectrum(spike_location)*0.90, ...
      sprintf('1/f = period of sunspot cycle (units:  year) = %5.2f',period_of_sunspot_cycle), 'fontsize', 5 )
title( 'Spike represents sunspot cycle frequency: once per 11 years' )
xlabel('frequency (1/year)')
hold on
plot( [spike_frequency spike_frequency], [0 power_spectrum(spike_location)], 'r--' )
hold off

Sunspot data since 1700 is available from the Belgian Royal Observatory at http://www.sidc.be/silso/datafiles. Daily, monthly, and yearly sunspot measures are available.

yearly data is in: http://www.sidc.be/silso/DATA/SN_y_tot_V2.0.csv
monthly data is in: http://www.sidc.be/silso/DATA/SN_m_tot_V2.0.csv

2.0 Get the yearly and monthly data¶

Define a function to read in a .csv file across the network using urlread(), writefile(), and csvread().

Use your function to read in the yearly and monthly data.

% There is no writefile() in MATLAB 2016b
function [yearly, monthly] = urlcsv(yearurl, monthurl)
    year = urlread(yearurl);
    tmp = textscan(year, repmat('%f',1,5),'delimiter',';');
    year = horzcat(tmp{1:end});
    dlmwrite('y.csv', year);
    yearly = csvread('y.csv');
    
    month = urlread(monthurl);
    tmp = textscan(month, repmat('%f',1,5),'delimiter',';');
    month = horzcat(tmp{1:end});
    dlmwrite('m.csv', month);
    monthly = csvread('m.csv');
end

year = 'http://www.sidc.be/silso/DATA/SN_y_tot_V2.0.csv';
month = 'http://www.sidc.be/silso/DATA/SN_m_tot_V2.0.csv';
[yearly, monthly] = urlcsv(year, month);

2.1 Compare the Belgian yearly sunspot data with the classical sunspot data¶

Is the yearly data close to the sunspots vector analyzed above? (up to 1987) Find the L2-distance between the two vectors.

filtered_data = yearly(:, 2)';
sunspots_belgia = filtered_data(yearly(:, 1) < 1988); % up to 1987
norm(sunspots_belgia - sunspots)

plot( years, sunspots_belgia, 'b' )
title('Belgian Sunspot intensity from 1700 to 1987')
xlabel('year')

ans =

  645.7781

2.2 Analyze the yearly data¶

Find the period of yearly sunspot intensity, using analysis like that shown above. Do the two values of the period agree?

power_spectrum = abs(fft(sunspots_belgia));
n = length(sunspots_belgia);
sampling_frequency = 1;  %  one sample / year
frequencies = linspace( 0, sampling_frequency, n );

search_interval = 2:floor(n/4);  %  skip the first coefficient, which is always the sum of the input data
spike_location = 1+find( power_spectrum(search_interval) == max(power_spectrum(search_interval)) );
spike_frequency = frequencies(spike_location);
period_of_sunspot_cycle = 1/spike_frequency;

figure
plot( frequencies(search_interval), power_spectrum(search_interval) )
text( spike_frequency*1.05, power_spectrum(spike_location)*1.00, ...
      sprintf('Frequency f of largest spike (units: 1/year) = %6.3f',spike_frequency), 'fontsize', 5 )
text( spike_frequency*1.05, power_spectrum(spike_location)*0.90, ...
      sprintf('1/f = period of sunspot cycle (units:  year) = %5.2f',period_of_sunspot_cycle), 'fontsize', 5 )
title( 'Spike represents sunspot cycle freq: once / 11 years' )
xlabel('frequency (1/year)')
hold on
plot( [spike_frequency spike_frequency], [0 power_spectrum(spike_location)], 'r--' )
hold off
fprintf('The two values of the period agree.')

The two values of the period agree.

2.3 Analyze the monthly data¶

Find the period of sunspot intensity using the monthly data. At which frequencies are there spikes?

Remember, the sampling frequency here is (1/month) = (12/year).

Is there a period at a frequency close to the period for the yearly data?

filtered_data_m = monthly(:, 4);
sunspots_belgia_m = filtered_data_m(monthly(:, 1) < 1988)'; % up to 1987

power_spectrum = abs(fft(sunspots_belgia_m));
n = length(sunspots_belgia_m);
sampling_frequency = 12;  %  twelve sample / year
frequencies = linspace( 0, sampling_frequency, n );

search_interval = 2:floor(n/4);  %  skip the first coefficient, which is always the sum of the input data
spike_location = 1+find( power_spectrum(search_interval) == max(power_spectrum(search_interval)) );
spike_frequency = frequencies(spike_location);
period_of_sunspot_cycle = 1/spike_frequency;

figure
plot( frequencies(search_interval), power_spectrum(search_interval) )
text( spike_frequency*1.05, power_spectrum(spike_location)*1.00, ...
      sprintf('Frequency f of largest spike (units: 1/year) = %6.3f',spike_frequency), 'fontsize', 5 )
text( spike_frequency*1.05, power_spectrum(spike_location)*0.90, ...
      sprintf('1/f = period of sunspot cycle (units:  year) = %5.2f',period_of_sunspot_cycle), 'fontsize', 5 )
title( 'Spike represents sunspot cycle freq: once / 11 years' )
xlabel('frequency (1/year)')
hold on
plot( [spike_frequency spike_frequency], [0 power_spectrum(spike_location)], 'r--' )
hold off
fprintf('The period is pretty close to that of the yearly data.')

The period is pretty close to that of the yearly data.

Problem 3: A Photoshop Detector¶

A simple way to check if a $m \times n$ RGB image has been faked (with Photoshop, say) is to use the reshape function to convert it into a $(m\,n) \times 3$ matrix, compute the 3 principal components of its $3 \times 3$ covariance matrix, project the reshaped image on the second principal component, and then reshape the result back to a $m \times n$ grayscale image. If this grayscale image has bright or dark spots, it is possible that the color distribution in that area (and perhaps also that part of the image) has been altered.

For example, the image LIFE_projected_on_2nd_PC.jpg shows the result of projecting the image LIFE.jpg on the 2nd PC, and treating the result as a grayscale image. Notice the flag on Aldrin's shoulder and some buttons on his suit are bright, suggesting that they have been retouched. The buttons are bright red in LIFE.jpg.

LIFE magazine cover

LIFE cover projected on 2nd PC

3.1 Write a Photoshop detector¶

Write a function that takes the filename of an input image, and returns its 2nd PC image.

As a test, show the output of your program on LIFE.jpg.

function res = secondPCImage(filename)
    image = double(imread(filename)) / 255;
    [m, n, h] = size(image);

    im = reshape(image, m * n, 3);
    [coeff, score] = pca(im);

    res = reshape(score(:, 2), m, n);
    
%   Same as follows, but shorter    
%    imCov = cov(im);
%    [U, S, V] = svd(imCov);
%    PC2 = U(:, 2);

%    res = reshape(im * PC2, m, n);
end

figure
colormap('gray');
imagesc(secondPCImage('LIFE.jpg'));

(With Octave, you can include the output image in this notebook with a <img src="..."> HTML tag.)

3.2 Jennifer in Paradise¶

Show the output of your program on the photo Jennifer_in_Paradise.jpg included in the assignment .zip file. This image is allegedly the first image ever actually photoshopped -- a photo of the wife of the original developer of Photoshop, famous as a photoshopped image.

figure
colormap('gray');
imagesc(secondPCImage('Jennifer_in_Paradise.jpg'));

3.3 Find and comment on an image of Hillary Clinton or Donald Trump that has been photoshopped¶

Using e.g. Google image search, find an image of Hillary or Donald whose 2nd PC has clearly highlighted areas. Comment on the highlighting.

figure
colormap('gray');
imagesc(secondPCImage('hrc.jpg'));

display( 'We see some part of her hair, her left eyebrow, her eye, etc. have clear highlighted areas.\nThis probably mean that this picture has been photoshopped.');

We see some part of her hair, her left eyebrow, her eye, etc. have clear highlighted areas.\nThis probably mean that this picture has been photoshopped.

Problem 4: Graphs as Matrices¶

The goal of this problem is to analyze the graph (actually, a numeric matrix $H$). You can read it from one of the attached files hero_social_network_50.tsv, hero_social_network_50.m, hero_social_network_50.py.

This $H$ matrix represent a core part of Marvel's Hero Social Network (a graph of social connections between superheroes. The entry $h_{ij}$ of $H$ is the strength of edges, so higher values mean that heroes $i$ and $j$ appear together more often.

The entire Marvel set of heroes is amazingly large; this dataset includes 206 heroes in the graph, obtained from the subset who appear in at least 50 comics. Despite their frequent appearance, this matrix is still pretty sparse -- apparently superheroes don't have hundreds of friends.

For example, you can read the file with dlmread() in Matlab; in Python either ad hoc parsing or perhaps the csv module.

Notice that the network is undirected, so $H$ is real symmetric, and all entries of $H$ are nonnegative.

load('hero_social_network_50_names.mat');
load('hero_social_network_50_network.mat');

4.1 Eigenvalues¶

Compute the eigenvalues of $H$ and plot them -- so that the $i$-th eigenvalue $\lambda_i$ is plotted at position $(x,y) = (i, \lambda_i)$.

[V, D] = eig(Hero_network);
eigvals = diag(D);
n = length(eigvals);
x = linspace(1, n, n);
plot(x, eigvals);

4.2 Largest Eigenvalue¶

Find the largest eigenvalue $r$ of $H$. Is it equal to the spectral norm $||H||_2$?

max(eigvals)
norm(Hero_network)

fprintf('Yes, they are equal');

ans =

    0.0058


ans =

    0.0058

Yes, they are equal

4.3 Dominant Eigenvector¶

Find the eigenvector e of $H$ that has this largest eigenvalue. Plot the sequence of entries of the eigenvector, showing that they are nonnegative. Which entry $i$ has the largest value? (Who is the $i$-th hero?)

[val, ind] = max(eigvals);
eigvec = V(:, ind);
plot(eigvec, '.');
[val, i] = max(eigvec);
fprintf('Entry %d has the largest value, corresponds to name %s', i, string(names(i)));

Entry 97 has the largest value, corresponds to name CAPTAIN AMERICA

4.4 Node Degree Sequence¶

The degree sequence is a vector $f d$ of integer values whose $i$-th entry $d_i$ is the degree of the $i$-th node. That is, $d_i$ is the number of other superheroes to which the $i$-th hero has nonzero edges. Plot the sequence of degrees. What is the name of the hero $i$ who has the largest value?

adjacency = Hero_network ~= 0;
deg = sum(adjacency, 2);
[val, ind] = max(deg);
names(ind)

ans =

  cell

    'CAPTAIN AMERICA'

4.5 Graph Laplacian -- and Connected Components¶

Compute the graph laplacian $L = D - A$, where $D = diag({\bf d})$ is the diagonal matrix determined by the degree sequence, and $A$ is the zero-one adjacency matrix for $H$ (The entries of $A$ are zero wherever they are zero in $H$, and are one wherever nonzero in $H$.)

What is the number of zero eigenvalues of the graph laplacian $L$, if we assume that all eigenvalues whose absolute value is below 1e-10 are zero? (This is supposed to be the number of connected components in the graph.)

D = diag(deg);
L = D - adjacency;
eigval = eig(L);
sum((eigval < 10^(-10)))

ans =

    13