# PCA: a non-technical introduction

## Contents

# Copyright (C) 2018-2019 Juan Pablo Carbajal
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com>

pkg load geometry
pkg load statistics
close all

## Trees in the backyard

We have measured the position of trees in a rectangular parcel. We took as origin of the coordinates the door to the parcel. The axis are parallel to the front and the side of the parcel. The trees are planted at the sides of a small stone path that crosses the parcel.

First we will create fictitious data. You can skip this part, it is not needed for understanding PCA

width = 40;                                    % width of parcel in meters
depth = 75;                                    % depth of parcel in meters
parcel = [-width/2 width/2 0 depth];          % box defining the parcel

ntrees = 25;                                   % number of trees in the parcel
x = linspace (0, 1, ntrees);                   % front coordinate of the trees
y = 0.05 * (2 * randi ([0 1], 1, ntrees) - 1); % side coordinate of the trees
ang = 55;
X = ([cosd(ang) -sind(ang); sind(ang) cosd(ang)] * [x; y]).';
X = parcel([2 4]) .* (X - min (X)) + [-3 3];

The data is stored in the matrix X. The first column is the component along the front of the parcel. The second column is the component along the side of the parcel.

The exercise is to look at the plot and propose another coordinate system (i.e. origin and axes) that might be more natural to describe this data. The plot shows the position of the trees and the boundary of the parcel

Give a little time to think about an answer before moving to the next section.

figure (), clf
hold on
drawBox (parcel);
plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3);
axis equal
xlabel ('along front [m]')
ylabel ('along side [m]')
hold off ## PCA on the tree data

We now apply PCA computed via the Singular Value Decomposition (SVD). Then we add the plot the the tree data to show what are the axes proposed by PCA.

meanxy     = mean (X, 1);                % mean value along columns
[U S AXES] = svd (X - meanxy, 1);        % principal axes

% This is just aesthetic: make the first axis point in the positive x-direction
% and the second axis positive along the y-direction
Asign  = [sign(AXES(1,1)) sign(AXES(2,2))];
AXES   = Asign .* AXES;
COMP   = U * (S .* Asign); % Principal components
lambda = diag (S).^2 / ntrees;       % Eigenvalues of the covariance matrix


The plot shows the data and axes that would be proposed by a PCA algorithm. The axes are scaled indicating their relevance for describing the data, e.g. how much is lost if we ignore a direction.

figure (), clf
hold on
drawBox (parcel);
plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3);
hpc = plot_pca2d (AXES, lambda, meanxy);
set (hpc(end), 'visible', 'off');
hold off
axis equal
xlabel ('along front [m]')
ylabel ('along side [m]')
hold off The next plot shows the trees in the principal axes AXES. That is, we plot the principal components.

parcel_pc = (reshape (parcel([1 3 2 3 2 4 1 4]), 2, 4).' - meanxy) * AXES;
figure (), clf
hold on
drawPolygon (parcel_pc);
plot (COMP(:,1), COMP(:,2), 'xg;Tree;', 'linewidth', 3)
hold off
xlabel ('along 1st principal axis [m]')
ylabel ('along 2nd principal axis [m]')
axis equal ## Trees projected on principal axis

If we lost the second column of the original data X, we would have a lot of trouble reconstructing the position of the trees, indeed a lot of information is lost together with the component along the side of the parcel.

If we loose the second column of the principal components, this is not that critical. The principal axes are built such that the firsts components capture most of the variability of the data (actually most of the variance). Hence the first axis is always along the direction in which the data changes the most.

To build the projected tree position we use only the first principal axis

X_ = COMP(:,1) * AXES(:,1).' + meanxy; % Projected trees

figure (), clf
hold on
drawBox (parcel);
plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3);
hpc = plot_pca2d (AXES, lambda, meanxy);
color = get (hpc(1), 'color');
set (hpc(end), 'visible', 'off');
plot (X_(:,1), X_(:,2), 'o;PCA Tree;', 'color', color);
hold off
axis equal
xlabel ('along front [m]')
ylabel ('along side [m]')
hold off ## Abstract 2D point clouds

Now that you have completed an intuitive example, could you draw the principal axes of the following point clouds? Remember that the first axis is placed along the direction in which the data varies the most, i.e. the direction of maximum variance.

Do not look into the next section before you try to solve this on your own.

% Correlated Guassian variables
n = 100;
K      = [1 0.5; 0.5 0.5];
X_ncor = mvnrnd ([0 0], K, n);
% Calculate PCA
meanxy_ncor     = mean (X_ncor, 1);              % mean along columns
[U S AXES_ncor] = svd (X_ncor - meanxy_ncor, 1); % principal axes
COMP_ncor       = U * S;                         % principal components
lambda_ncor     = diag (S).^2 / n;               % eigenvalues of cov matrix

% Several anti-correlated subsets with global positive correlation, i.e.
% Simpson's paradox.
K      = [1 0; 0 0.8]; % Covariance matrix
m      = [1 1];
X_simp = [];
ns     = 3;
covxy  = linspace (-0.1, -0.8, ns);
for i = 1:ns
K([2 3]) = covxy(i);
X_simp   = [X_simp; mvnrnd((i-1)*3*m, K, n)];   % Samples
endfor
n = size (X_simp, 1);
% Calculate PCA
meanxy_simp     = mean (X_simp, 1);              % mean along columns
[U S AXES_simp] = svd (X_simp - meanxy_simp, 1); % principal axes
COMP_simp       = U * S;                         % principal components
lambda_simp = diag (S).^2 / n;                   % eigenvalues of cov matrix

figure (), clf()
subplot (1, 2, 1)
hp = plot_pointcloud (X_ncor); set(hp, 'color', 'b');
xticks ([]); yticks ([])
axis equal

subplot (1, 2, 2)
hp = plot_pointcloud (X_simp); set(hp, 'color', 'b');
xticks ([]); yticks ([])
axis equal Compare the axes you proposed with the ones provided by PCA

figure (), clf()
subplot (1, 2, 1)
hp = plot_pointcloud (X_ncor); set(hp, 'color', 'b');
hold on
plot_pca2d (AXES_ncor, lambda_ncor, meanxy_ncor);
hold off
xticks ([]); yticks ([])
axis equal

subplot (1, 2, 2)
hp = plot_pointcloud (X_simp); set(hp, 'color', 'b');
hold on
plot_pca2d (AXES_simp, lambda_simp, meanxy_simp);
hold off
xticks ([]); yticks ([])
axis equal 