PCA: a non-technical introduction

Contents

# Copyright (C) 2018 - Juan Pablo Carbajal
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com>

pkg load geometry
pkg load statistics
close all

Trees in the backyard

We have measured the position of trees in a rectangular parcell. We took as origin of the cordinates the door to the terrain. The axis are parallel to the front and the side of the parcell. The trees are planted at the sides of a small stone path that crosses the parcell.

First we will create ficticious data. You can skip this part, it is not needed for understanding PCA

width = 40;                     % length of the front of the terrain in meters
depth = 75;                     % length of the side  of the terrain in meters
parcell = [-width/2 width/2 0 depth]; % box defining the parcell

ntrees = 25;                          % numer of trees in the terrain
x = linspace (0, 1, ntrees); % front coordinate of the trees
y = 0.05 * (2 * randi ([0 1], 1, ntrees) - 1); % side coordiante of the trees
ang = 55;
X = ([cosd(ang) -sind(ang); sind(ang) cosd(ang)] * [x; y]).';
X = parcell([2 4]) .* (X - min (X)) + [-3 3];

The data is stored in the matrix X. The first column is the component along the front of the parcell. The second column is the component along the side of the parcell.

The exercise is to look at the plot and propose another coordinate system (i.e. origin and axes) that might be more natural to describe this data. The plot shows the poition of the trees and the boundary of the parcell

Give a little time to think about an answer before movign to the next section.

figure (), clf
  hold on
  drawBox (parcell);
  plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3);
  axis equal
  xlabel ('along front [m]')
  ylabel ('along side [m]')
  hold off
s_intuitive_pca_publish-1.png

PCA on the tree data

We now apply PCA computed via the Singular Value Decomposition (SVD). Then we add the plot the the tree data to show what are the axes proposed by PCA.

meanxy     = mean (X, 1);                % mean value along columns
[U S AXES] = svd (X - meanxy, 1);        % Princial axes
% This is just aesthetic: make the first axis point in the positive x-direction
% and the second axis positive along the y-direction
Asign  = [sign(AXES(1,1)) sign(AXES(2,2))];
AXES   = Asign .* AXES;
COMP   = U * (S .* Asign); % Principal components
lambda = diag (S).^2 / ntrees;       % Eigenvalues of the covariance matrix

The plot shows the data and the axes that will be proposed by a PCA algorithm. The axes are scaled indicating their relevance for describing the data, e.g. how much is lost if we ignore a direction.

figure (), clf
  hold on
  drawBox (parcell);
  plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3);
  hpc = plot_pca2d (AXES, lambda, meanxy);
  set (hpc(end), 'visible', 'off');
  hold off
  axis equal
  xlabel ('along front [m]')
  ylabel ('along side [m]')
  hold off
s_intuitive_pca_publish-2.png

The next plot shows the trees in the principal axes AXES. That is we plot the principal components.

parcell_pc = (reshape (parcell([1 3 2 3 2 4 1 4]), 2, 4).' - meanxy) * AXES;
figure (), clf
  hold on
  drawPolygon (parcell_pc);
  plot (COMP(:,1), COMP(:,2), 'xg;Tree;', 'linewidth', 3)
  hold off
  xlabel ('along 1st principal axis [m]')
  ylabel ('along 2nd principal axis [m]')
  axis equal
s_intuitive_pca_publish-3.png

Trees projected on principal axis

If we lost the second column of the original data X, we would have a lot of trouble reconstructing the position of the trees, indeed a lot of information is lost together with the component along th eside of the parcell.

If we loose the second column of the principal components, thi sis not that critical. The principal axes are built such that the firsts components capture most of the variablity of the data (actually most of the variance). Hence the first axis si always along the direction where the data changes the most.

To buidl the projected tree positon we use only the first principal axis

X_ = COMP(:,1) * AXES(:,1).' + meanxy; % Projected trees

figure (), clf
  hold on
  drawBox (parcell);
  plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3);
  hpc = plot_pca2d (AXES, lambda, meanxy);
  color = get (hpc(1), 'color');
  set (hpc(end), 'visible', 'off');
  plot (X_(:,1), X_(:,2), 'o;PCA Tree;', 'color', color);
  hold off
  axis equal
  xlabel ('along front [m]')
  ylabel ('along side [m]')
  hold off
s_intuitive_pca_publish-4.png

Abstract 2D point clouds

Now that you have completed an intuitive example, could you draw the principal axes of the following point clouds? Remember that the first axes is placed along the direction in which the data varies the most, i.e. the direction of maxium variance.

Do not look into the next section befroe you try to solve this on your own.

% Correlated Guassina variables
n = 100;
K      = [1 0.5; 0.5 0.5];
X_ncor = mvnrnd ([0 0], K, n);
% Calcualte PCA
meanxy_ncor     = mean (X_ncor, 1);              % mean along columns
[U S AXES_ncor] = svd (X_ncor - meanxy_ncor, 1); % principal axes
COMP_ncor       = U * S;                         % principal components
lambda_ncor = diag (S).^2 / n;                   % Eigenvalues of cov matrix

% Several anti-correlated subsets with global positive correlation, i.e.
% Simpson's paradox.
K      = [1 0; 0 0.8]; % Covariance matrix
m      = [1 1];
X_simp = [];
ns     = 3;
covxy  = linspace (-0.1, -0.8, ns);
for i =1:ns
  K([2 3]) = covxy(i);
  X_simp   = [X_simp; mvnrnd((i-1)*3*m, K, n)];   % Samples
endfor
n = size (X_simp, 1);
% Calcualte PCA
meanxy_simp     = mean (X_simp, 1);              % mean along columns
[U S AXES_simp] = svd (X_simp - meanxy_simp, 1); % principal axes
COMP_simp       = U * S;                         % principal components
lambda_simp = diag (S).^2 / n;                   % Eigenvalues of cov matrix

figure (), clf()
  subplot (1, 2, 1)
  hp = plot_pointcloud (X_ncor); set(hp, 'color', 'b');
  xticks ([]); yticks ([])
  axis equal

  subplot (1, 2, 2)
  hp = plot_pointcloud (X_simp); set(hp, 'color', 'b');
  xticks ([]); yticks ([])
  axis equal
s_intuitive_pca_publish-5.png

Compare the axes you proposed with the ones provided by PCA

figure (), clf()
  subplot (1, 2, 1)
  hp = plot_pointcloud (X_ncor); set(hp, 'color', 'b');
  hold on
  plot_pca2d (AXES_ncor, lambda_ncor, meanxy_ncor);
  hold off
  xticks ([]); yticks ([])
  axis equal

  subplot (1, 2, 2)
  hp = plot_pointcloud (X_simp); set(hp, 'color', 'b');
  hold on
  plot_pca2d (AXES_simp, lambda_simp, meanxy_simp);
  hold off
  xticks ([]); yticks ([])
  axis equal
s_intuitive_pca_publish-6.png