Section1_PCA_complete_publish

Contents

% Copyright 2017-2018 Kris Villez
%
% This file is part of the principal component analysis and regression
% (PCAR) toolbox for Matlab/Octave.
%
% The PCAR Toolbox is free software: you can redistribute it and/or
% modify it under the terms of the GNU General Public License as published
% by the Free Software Foundation, either version 3 of the License, or (at
% your option) any later version.
%
% The PCAR Toolbox is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
% Public License for more details.
%
% You should have received a copy of the GNU General Public License along
% with the PCAR Toolbox. If not, see <http://www.gnu.org/licenses/>.

% Author: Kris Villez <Kris.Villez@eawag.ch>
% Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com>
% Last version: 25.06.2018

PCA example: section 1

This script illustrates the use of PCA.

%clc
%clear all
close all

Load the data

This example uses a data set that can be found in the folder data, one level above the current folder. The name of the original file was EawagData.csv.

FileName = fullfile ('..',  'data',  'EawagData.csv');
tic
disp ('Reading file')
data     = dlmread (FileName);
disp ('Done reading file')
toc

% legend format for all plots
lg_fmt = {'Location', 'NorthOutside', 'Orientation', 'Horizontal'};
Reading file
Done reading file
Elapsed time is 0.0119469 seconds.

Organize data

Separate the data into a matrix X and a vector Y. At the end of this section, we plot the data.

N    = size (data, 1);      % Number of samples
dimX = size (data, 2) - 1;  % Dimension of X
dimY = 1;                   % Dimension of Y
X = data(:, 1:dimX);
Y = data(:, end) ;

figure (1);
  idx = (1:dimX).';
  plot (idx, X(1:5:end, :).');
  xlabel ('dim X')
  ylabel ('X');
  axis tight
  clear idx

figure (2);
  plot (X(:, 1), Y, 'k+')
  xlabel ('Abs X_1 [%/m]')
  ylabel ('NO_2^{-}-N [mg/L]')
  Ylim    = get (gca, 'Ylim');
  Ylim(1) = 0;
  set(gca, 'Ylim', Ylim);
  Xlim    = get (gca, 'Xlim');
  Xlim(1) = 0;
  set (gca, 'Xlim', Xlim);
Section1_PCA_complete_publish-1.png
Section1_PCA_complete_publish-2.png

Linear regression

We perform standard linear regression between the first dimension of X and the values in Y. The regression is done on the z-scored data.

$$x = \frac{X_1 - \bar{X_1}}{\sigma_{X_1}}$$ $$y = \frac{Y - \bar{Y}}{\sigma_{Y}}$$

[x, mean_X1, std_X1] = zscore (X(:, 1)) ;
[y, mean_Y, std_Y]   = zscore (Y);

Input: x. Output: y.

beta_xy = calibrateLR (x, y);

figure (3), hold on,
  xs1 = sort (x);
  ys1 = xs1 * beta_xy(2) + beta_xy(1);
  y_  = x * beta_xy(2) + beta_xy(1);

  h    = plot (x, y, 'k+');
  h(2) = plot (xs1, ys1, 'b.:', 'linewidth', 3);
  plot ([x x].', [y y_].', 'b-')
  axis equal
  xlabel ('x');
  ylabel ('y')
  legend (h, {'data', 'y(x)'}, lg_fmt{:})
Section1_PCA_complete_publish-3.png
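
For reference, the following is a minimal sketch of the ordinary least-squares fit that calibrateLR is assumed to perform above, with the intercept in beta(1) and the slope in beta(2); the toolbox implementation may differ in detail.

% Sketch of ordinary least squares with an intercept (assumption: same
% parameterization as calibrateLR, i.e. beta(1) = intercept, beta(2) = slope)
X_design   = [ones(numel (x), 1), x(:)];  % design matrix with an intercept column
beta_check = X_design \ y(:);             % least-squares solution via backslash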

Input: y. Output: x.

beta_yx = calibrateLR (y, x);

figure (4), hold on,
  ys2 = sort (y);
  xs2 = ys2 * beta_yx(2) + beta_yx(1);
  x_  = y * beta_yx(2) + beta_yx(1);
  h    = plot (x, y, 'k+');
  h(2) = plot (xs2, ys2, 'r.:', 'linewidth', 3);
  plot ([x x_].', [y y].', 'r-')
  axis equal
  xlabel ('x');
  ylabel ('y')
  legend (h, {'data', 'x(y)'}, lg_fmt{:})
Section1_PCA_complete_publish-4.png

Compare regressions

The two fits differ because y(x) minimizes the vertical residuals while x(y) minimizes the horizontal residuals, so neither treats the two variables symmetrically; the next section shows a symmetric alternative.

figure (5), hold on,
  h    = plot (x, y, 'k+');
  h(2) = plot (xs1, ys1, 'b.:', 'linewidth', 3);
  h(3) = plot (xs2, ys2, 'r.:', 'linewidth', 3);
  axis equal
  xlabel ('x');
  ylabel ('y')
  legend (h, {'data', 'y(x)', 'x(y)'}, lg_fmt{:})
Section1_PCA_complete_publish-5.png

Total least squares

This section exemplifies the use of PCA regression for total least squares.
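
The first principal component of the paired data [x y] points along the direction that minimizes the orthogonal distances to the samples, which is exactly the total least-squares line. As a hedged cross-check (assuming calibratePCA extracts the leading eigenvector of the sample covariance), that direction can also be computed directly:

% Leading eigenvector of the sample covariance = total least-squares direction
% (assumption about what calibratePCA does internally)
C            = cov ([x y]);        % 2x2 sample covariance of the paired data
[V, D]       = eig (C);            % eigenvectors (columns) and eigenvalues
[dmax, imax] = max (diag (D));     % locate the largest eigenvalue
tls_dir      = V(:, imax);         % direction of the total least-squares line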

[T, par_mean, P, S] = calibratePCA ([x y], 1);

x_recon = T * P(1) + par_mean(1);
y_recon = T * P(2) + par_mean(2);

figure (6), hold on,
  h    = plot (x, y, 'k+');
  h(2) = plot (x_recon, y_recon, 'k.-');
  plot([x x_recon].', [y y_recon].', 'k:')
  axis equal
  xlabel ('x')
  ylabel ('y')
  legend (h, {'data', 'PCR'}, lg_fmt{:})

figure (7), hold on,
  h    = plot (x, y, 'k+');
  h(2) = plot (xs1, ys1, 'b-');
  h(3) = plot (xs2, ys2, 'r-');
  h(4) = plot (x_recon, y_recon, 'k-');
  axis equal
  xlabel ('x')
  ylabel ('y')
  legend (h, {'data', 'y(x)', 'x(y)', 'PCR'}, lg_fmt{:})

clear xs* ys* y_ x_
Section1_PCA_complete_publish-6.png
Section1_PCA_complete_publish-7.png

Interpreting PCA

Interpret the 2-PC model as a new coordinate system

[T, par_mean, P, S, lambda] = calibratePCA ([x y], 2);

PCstd = std (T, 1) ;

figure (8), hold on,
    h    = plot (x, y, 'k+');
    h(2) = plot (x_recon, y_recon, 'k-');
    htmp = addarrows (par_mean, P, 'b');
    h    = [h(:); htmp(1)];
    axis equal
    xlabel ('x')
    ylabel ('y')
    legend (h, {'data', 'PCR', 'PCR basis'}, lg_fmt{:})

figure (9), hold on,
    h    = plot (x, y, 'k+');
    h(2) = plot (x_recon, y_recon, 'k-');
    htmp = addarrows (par_mean, P, 'b');
    h    = [h(:); htmp(1)];
    htmp = addarrows (par_mean, P * diag (PCstd), 'r');
    h    = [h(:); htmp(1)];
    axis equal
    xlabel ('x')
    ylabel ('y')
    legend (h, {'data', 'PCR', 'PCR basis', 'scaled basis'}, lg_fmt{:})
Section1_PCA_complete_publish-8.png
Section1_PCA_complete_publish-9.png
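
As a quick numerical check of the coordinate-system interpretation, and assuming calibratePCA returns orthonormal loadings in the columns of P and scores T equal to the mean-centered data projected onto P, the following errors should be close to zero:

% Orthonormal basis and exact reconstruction with all components retained
% (assumptions about the calibratePCA conventions; adjust if the toolbox
% orients P or T differently)
orth_err  = norm (P' * P - eye (2));                          % ~0 if the loadings are orthonormal
recon     = T * P' + repmat (par_mean(:)', size (T, 1), 1);   % scores mapped back to data space
recon_err = norm ([x y] - recon);                             % ~0 when both components are kept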

Interpret the 2-PC model as a rotation

figure (10), hold on,
    h    = plot(T(:, 1), T(:, 2), 'kx');
    htmp = addarrows(zeros(2, 1), eye(2), 'b');
    h    = [h(:); htmp(1)];
    axis equal
    xlabel('PC 1')
    ylabel('PC 2')
    legend (h, {'scores', 'PCA basis'}, lg_fmt{:})
Section1_PCA_complete_publish-10.png
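
Under the same assumptions about the loading matrix, the rotation can be summarized by the angle between the first principal component and the original x-axis:

% Angle of the first loading vector relative to the x-axis
angle_rad = atan2 (P(2, 1), P(1, 1));   % angle in radians
angle_deg = angle_rad * 180 / pi;       % same angle in degrees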

Selecting dimension

The eigenvalues returned by calibratePCA equal the variance captured along each principal component. Plotting them, together with the cumulative fraction of explained variance, helps to decide how many components to retain: components with small eigenvalues add little explained variance and can usually be dropped.
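
As a hedged illustration, a common numerical rule is to keep the smallest number of components whose cumulative explained variance exceeds a chosen threshold; the 95 % value below is only an example.

% Example selection rule (hypothetical 95 % threshold)
explained = cumsum (lambda) / sum (lambda);        % cumulative fraction of explained variance
n_pc      = find (explained >= 0.95, 1, 'first');  % smallest model reaching the threshold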

figure (11)
  subplot(2,1,1)
  h = bar (1:length(lambda), lambda);
  set (gca, 'Xtick', 1:length(lambda))
  ylabel ('eigenvalue (variance)')

  subplot(2,1,2)
  lambda0 = [0 ; lambda];
  idx     = 0:length(lambda);
  h       = bar (idx, cumsum (lambda0) ./ sum (lambda0));
  set (gca, 'Xtick', idx)
  xlabel ('dimension')
  ylabel ('explained variance')

clear idx
Section1_PCA_complete_publish-11.png