- Trees in the backyard
- PCA on the tree data
- Trees projected on principal axis
- Abstract 2D point clouds

# Copyright (C) 2018-2019 Juan Pablo Carbajal # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. # Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com> pkg load geometry pkg load statistics close all

We have measured the position of trees in a rectangular parcel. We took as origin of the coordinates the door to the parcel. The axis are parallel to the front and the side of the parcel. The trees are planted at the sides of a small stone path that crosses the parcel.

First we will create fictitious data. You can skip this part, it is not needed for understanding PCA

width = 40; % width of parcel in meters depth = 75; % depth of parcel in meters parcel = [-width/2 width/2 0 depth]; % box defining the parcel ntrees = 25; % number of trees in the parcel x = linspace (0, 1, ntrees); % front coordinate of the trees y = 0.05 * (2 * randi ([0 1], 1, ntrees) - 1); % side coordinate of the trees ang = 55; X = ([cosd(ang) -sind(ang); sind(ang) cosd(ang)] * [x; y]).'; X = parcel([2 4]) .* (X - min (X)) + [-3 3];

The data is stored in the matrix `X`

.
The first column is the component along the front of the parcel.
The second column is the component along the side of the parcel.

The exercise is to look at the plot and propose another coordinate system (i.e. origin and axes) that might be more natural to describe this data. The plot shows the position of the trees and the boundary of the parcel

Give a little time to think about an answer before moving to the next section.

figure (), clf hold on drawBox (parcel); plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3); axis equal xlabel ('along front [m]') ylabel ('along side [m]') hold off

We now apply PCA computed via the Singular Value Decomposition (SVD). Then we add the plot the the tree data to show what are the axes proposed by PCA.

meanxy = mean (X, 1); % mean value along columns [U S AXES] = svd (X - meanxy, 1); % principal axes % This is just aesthetic: make the first axis point in the positive x-direction % and the second axis positive along the y-direction Asign = [sign(AXES(1,1)) sign(AXES(2,2))]; AXES = Asign .* AXES; COMP = U * (S .* Asign); % Principal components lambda = diag (S).^2 / ntrees; % Eigenvalues of the covariance matrix

The plot shows the data and axes that would be proposed by a PCA algorithm. The axes are scaled indicating their relevance for describing the data, e.g. how much is lost if we ignore a direction.

figure (), clf hold on drawBox (parcel); plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3); hpc = plot_pca2d (AXES, lambda, meanxy); set (hpc(end), 'visible', 'off'); hold off axis equal xlabel ('along front [m]') ylabel ('along side [m]') hold off

The next plot shows the trees in the principal axes `AXES`

.
That is, we plot the principal components.

parcel_pc = (reshape (parcel([1 3 2 3 2 4 1 4]), 2, 4).' - meanxy) * AXES; figure (), clf hold on drawPolygon (parcel_pc); plot (COMP(:,1), COMP(:,2), 'xg;Tree;', 'linewidth', 3) hold off xlabel ('along 1st principal axis [m]') ylabel ('along 2nd principal axis [m]') axis equal

If we lost the second column of the original data `X`

, we would have a lot of
trouble reconstructing the position of the trees, indeed a lot of information
is lost together with the component along the side of the parcel.

If we loose the second column of the principal components, this is not that critical. The principal axes are built such that the firsts components capture most of the variability of the data (actually most of the variance). Hence the first axis is always along the direction in which the data changes the most.

To build the projected tree position we use only the first principal axis

X_ = COMP(:,1) * AXES(:,1).' + meanxy; % Projected trees figure (), clf hold on drawBox (parcel); plot (X(:,1), X(:,2), 'xg;Tree;', 'linewidth', 3); hpc = plot_pca2d (AXES, lambda, meanxy); color = get (hpc(1), 'color'); set (hpc(end), 'visible', 'off'); plot (X_(:,1), X_(:,2), 'o;PCA Tree;', 'color', color); hold off axis equal xlabel ('along front [m]') ylabel ('along side [m]') hold off

Now that you have completed an intuitive example, could you draw the principal axes of the following point clouds? Remember that the first axis is placed along the direction in which the data varies the most, i.e. the direction of maximum variance.

Do not look into the next section before you try to solve this on your own.

% Correlated Guassian variables n = 100; K = [1 0.5; 0.5 0.5]; X_ncor = mvnrnd ([0 0], K, n); % Calculate PCA meanxy_ncor = mean (X_ncor, 1); % mean along columns [U S AXES_ncor] = svd (X_ncor - meanxy_ncor, 1); % principal axes COMP_ncor = U * S; % principal components lambda_ncor = diag (S).^2 / n; % eigenvalues of cov matrix % Several anti-correlated subsets with global positive correlation, i.e. % Simpson's paradox. K = [1 0; 0 0.8]; % Covariance matrix m = [1 1]; X_simp = []; ns = 3; covxy = linspace (-0.1, -0.8, ns); for i = 1:ns K([2 3]) = covxy(i); X_simp = [X_simp; mvnrnd((i-1)*3*m, K, n)]; % Samples endfor n = size (X_simp, 1); % Calculate PCA meanxy_simp = mean (X_simp, 1); % mean along columns [U S AXES_simp] = svd (X_simp - meanxy_simp, 1); % principal axes COMP_simp = U * S; % principal components lambda_simp = diag (S).^2 / n; % eigenvalues of cov matrix figure (), clf() subplot (1, 2, 1) hp = plot_pointcloud (X_ncor); set(hp, 'color', 'b'); xticks ([]); yticks ([]) axis equal subplot (1, 2, 2) hp = plot_pointcloud (X_simp); set(hp, 'color', 'b'); xticks ([]); yticks ([]) axis equal

Compare the axes you proposed with the ones provided by PCA

figure (), clf() subplot (1, 2, 1) hp = plot_pointcloud (X_ncor); set(hp, 'color', 'b'); hold on plot_pca2d (AXES_ncor, lambda_ncor, meanxy_ncor); hold off xticks ([]); yticks ([]) axis equal subplot (1, 2, 2) hp = plot_pointcloud (X_simp); set(hp, 'color', 'b'); hold on plot_pca2d (AXES_simp, lambda_simp, meanxy_simp); hold off xticks ([]); yticks ([]) axis equal