# Copyright (C) 2018 - Juan Pablo Carbajal # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. # Author: Juan Pablo Carbajal <firstname.lastname@example.org> pkg load geometry pkg load statistics close all
We have measured the position of trees in a rectangular parcell. We took as origin of the cordinates the door to the terrain. The axis are parallel to the front and the side of the parcell. The trees are planted at the sides of a small stone path that crosses the parcell.
First we will create ficticious data. You can skip this part, it is not needed for understanding PCA
width = 40; % length of the front of the terrain in meters depth = 75; % length of the side of the terrain in meters parcell = [-width/2 width/2 0 depth]; % box defining the parcell ntrees = 25; % numer of trees in the terrain x = linspace (0, 1, ntrees); % front coordinate of the trees y = 0.05 * (2 * randi ([0 1], 1, ntrees) - 1); % side coordiante of the trees ang = 55; X = ([cosd(ang) -sind(ang); sind(ang) cosd(ang)] * [x; y]).'; X = parcell([2 4]) .* (X - min (X)) + [-3 3];
The data is stored in the matrix X. The first column is the component along the front of the parcell. The second column is the component along the side of the parcell.
The exercise is to look at the plot and propose another coordinate system (i.e. origin and axes) that might be more natural to describe this data. The plot shows the poition of the trees and the boundary of the parcell
Give a little time to think about an answer before movign to the next section.
figure (), clf hold on drawBox (parcell); plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3); axis equal xlabel ('along front [m]') ylabel ('along side [m]') hold off
We now apply PCA computed via the Singular Value Decomposition (SVD). Then we add the plot the the tree data to show what are the axes proposed by PCA.
meanxy = mean (X, 1); % mean value along columns [U S AXES] = svd (X - meanxy, 1); % Princial axes % This is just aesthetic: make the first axis point in the positive x-direction % and the second axis positive along the y-direction Asign = [sign(AXES(1,1)) sign(AXES(2,2))]; AXES = Asign .* AXES; COMP = U * (S .* Asign); % Principal components lambda = diag (S).^2 / ntrees; % Eigenvalues of the covariance matrix
The plot shows the data and the axes that will be proposed by a PCA algorithm. The axes are scaled indicating their relevance for describing the data, e.g. how much is lost if we ignore a direction.
figure (), clf hold on drawBox (parcell); plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3); hpc = plot_pca2d (AXES, lambda, meanxy); set (hpc(end), 'visible', 'off'); hold off axis equal xlabel ('along front [m]') ylabel ('along side [m]') hold off
The next plot shows the trees in the principal axes AXES. That is we plot the principal components.
parcell_pc = (reshape (parcell([1 3 2 3 2 4 1 4]), 2, 4).' - meanxy) * AXES; figure (), clf hold on drawPolygon (parcell_pc); plot (COMP(:,1), COMP(:,2), 'xg;Tree;', 'linewidth', 3) hold off xlabel ('along 1st principal axis [m]') ylabel ('along 2nd principal axis [m]') axis equal
If we lost the second column of the original data X, we would have a lot of trouble reconstructing the position of the trees, indeed a lot of information is lost together with the component along th eside of the parcell.
If we loose the second column of the principal components, thi sis not that critical. The principal axes are built such that the firsts components capture most of the variablity of the data (actually most of the variance). Hence the first axis si always along the direction where the data changes the most.
To buidl the projected tree positon we use only the first principal axis
X_ = COMP(:,1) * AXES(:,1).' + meanxy; % Projected trees figure (), clf hold on drawBox (parcell); plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3); hpc = plot_pca2d (AXES, lambda, meanxy); color = get (hpc(1), 'color'); set (hpc(end), 'visible', 'off'); plot (X_(:,1), X_(:,2), 'o;PCA Tree;', 'color', color); hold off axis equal xlabel ('along front [m]') ylabel ('along side [m]') hold off
Now that you have completed an intuitive example, could you draw the principal axes of the following point clouds? Remember that the first axes is placed along the direction in which the data varies the most, i.e. the direction of maxium variance.
Do not look into the next section befroe you try to solve this on your own.
% Correlated Guassina variables n = 100; K = [1 0.5; 0.5 0.5]; X_ncor = mvnrnd ([0 0], K, n); % Calcualte PCA meanxy_ncor = mean (X_ncor, 1); % mean along columns [U S AXES_ncor] = svd (X_ncor - meanxy_ncor, 1); % principal axes COMP_ncor = U * S; % principal components lambda_ncor = diag (S).^2 / n; % Eigenvalues of cov matrix % Several anti-correlated subsets with global positive correlation, i.e. % Simpson's paradox. K = [1 0; 0 0.8]; % Covariance matrix m = [1 1]; X_simp = ; ns = 3; covxy = linspace (-0.1, -0.8, ns); for i =1:ns K([2 3]) = covxy(i); X_simp = [X_simp; mvnrnd((i-1)*3*m, K, n)]; % Samples endfor n = size (X_simp, 1); % Calcualte PCA meanxy_simp = mean (X_simp, 1); % mean along columns [U S AXES_simp] = svd (X_simp - meanxy_simp, 1); % principal axes COMP_simp = U * S; % principal components lambda_simp = diag (S).^2 / n; % Eigenvalues of cov matrix figure (), clf() subplot (1, 2, 1) hp = plot_pointcloud (X_ncor); set(hp, 'color', 'b'); xticks (); yticks () axis equal subplot (1, 2, 2) hp = plot_pointcloud (X_simp); set(hp, 'color', 'b'); xticks (); yticks () axis equal
Compare the axes you proposed with the ones provided by PCA
figure (), clf() subplot (1, 2, 1) hp = plot_pointcloud (X_ncor); set(hp, 'color', 'b'); hold on plot_pca2d (AXES_ncor, lambda_ncor, meanxy_ncor); hold off xticks (); yticks () axis equal subplot (1, 2, 2) hp = plot_pointcloud (X_simp); set(hp, 'color', 'b'); hold on plot_pca2d (AXES_simp, lambda_simp, meanxy_simp); hold off xticks (); yticks () axis equal