Rotated Principal Components. Interpretability in the Eye of the Beholder

In the sciences, it is increasingly common to deal with high-dimensional datasets; in some fields, like meteorology, this was always the case. Many tools have been released to perform “dimensionality reduction”, which enables us to interpret and handle these huge datasets more easily. One of the most popular methods is Principal Components Analysis (PCA) (we have talked about its many guises), but it has been criticized. Here we make a short and partial review of “Rotation of principal components” by M. B. Richman, a highly cited article from the ’80s popular in the earth sciences …

PCA of a multivariate Gaussian distribution
Credit: CC BY 4.0, Nicoguaro at Wikimedia Commons

In his article, Richman 1 reviewed the uses and applications of matrix factorizations (he called them principal components decompositions). Therein he criticized properties of the method and suggested that “meaningful” decompositions are obtained via linear (affine) transformations of the components. His article has been cited more than 1600 times, and I think it was quite influential in the earth sciences. Others have also published their opinions; for example, 2 discusses several issues in Richman’s criticism. From today’s standpoint, Richman’s criticism should be seen more as a desiderata on the properties of the extracted components than as a critique of the method. As such, one could discuss whether the desiderata are actually relevant or not, but the method remains unscathed.

Richman mentioned several properties of principal components decompositions as undesired, e.g. that the decomposition is only unique after a normalization is provided, that the method depends on local properties of the data (subdomain stability), etc.; check the article for the whole thing.

In this post I go through the aspects that I find most relevant for the EmuMore project.

The basics

Let’s say we measured a time-dependent magnitude on a mesh in 3D space during a time span of duration $T$. We then have a dataset associating each point $\mathbf{x}$ in 3D space with a real signal in time, i.e. $\mathbf{x} \mapsto u_{\mathbf{x}}(t)$, or equivalently $u_{\mathbf{x}} : [0, T] \to \mathbb{R}$. We could also represent this data as a function (if we know there is a functional relationship), $u : \mathbb{R}^3 \times [0, T] \to \mathbb{R}$.

Here comes the idea: can we represent the function as a combination of other functions? (a core question in Hilbert’s 13th problem, 1900). In particular, we look for a factorization of the form

$$u(\mathbf{x}, t) = \sum_{k=1}^{K} a_k(t)\, \phi_k(\mathbf{x}),$$

which allows us to think of the data as composed of some spatial modes $\phi_k$ that are combined over time with weights $a_k(t)$. If we manage to find this factorization with a small number of modes $K$, then we will be able to understand our data better by just understanding the way the modes are combined (compare with normal modes in physics or eigenfunctions in mathematics).

If we have discretized space and time (a mesh of measurements every so many seconds), we can then refer to the points by the indexes of the elements in the mesh instead of their values. The data can then be stored as a matrix $U \in \mathbb{R}^{N \times T}$, where $N = N_x N_y N_z$ with $N_x, N_y, N_z$ the number of points in each direction of the mesh, and $T$ now denotes the number of time instants. With this view, the data becomes a matrix and the factorization above turns, at each time instant, into a matrix-vector product:

$$u_j = \Phi\, a_j, \qquad j = 1, \dots, T, \qquad \text{i.e.} \qquad U = \Phi A,$$

where the columns of $\Phi \in \mathbb{R}^{N \times K}$ are the spatial modes and the columns of $A \in \mathbb{R}^{K \times T}$ collect the time weights.

In the article about factorization algorithms we showed that SVD provides

$$U = W \Sigma V^\top,$$

with $W^\top W = I$ and $V^\top V = I$, which is what the factorization needs if we define $\Phi = W$ and $A = \Sigma V^\top$. The columns of the matrix $W$ form a basis in which the data is decomposed, and the rows of $A$ are the components of the data in that basis.
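As a minimal sketch of this step (with toy data made up here, not taken from the article), the decomposition in MATLAB looks like this; the names `U`, `Phi`, `A` mirror the notation above:

```matlab
% Toy space-time dataset: N mesh points observed at T time instants.
N = 200; T = 50;
x = linspace(0, 1, N)';          % flattened mesh coordinates
t = linspace(0, 2*pi, T);
U = sin(2*pi*x) * cos(t) + 0.3 * sin(6*pi*x) * sin(3*t);  % rank-2 data

[W, S, V] = svd(U, 'econ');      % U = W*S*V'
Phi = W;                         % spatial modes (basis)
A   = S * V';                    % components of the data in that basis

% Truncating to k modes gives the reduced representation:
k = 2;
Uk = Phi(:, 1:k) * A(1:k, :);
fprintf('relative error with %d modes: %g\n', k, ...
        norm(U - Uk, 'fro') / norm(U, 'fro'));
```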

If we apply an invertible linear transformation $R$ to the basis, $\tilde{\Phi} = \Phi R$, the data is projected with new components $\tilde{A} = R^{-1} A$,

$$U = \Phi A = (\Phi R)\,(R^{-1} A) = \tilde{\Phi} \tilde{A},$$

meaning that we can always transform the basis (without losing information) and obtain new components.
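Continuing the toy sketch above, we can verify this numerically with an arbitrary invertible `R` (the particular matrix is made up for illustration):

```matlab
% Any invertible R yields a new basis Phi*R with components R\A;
% the reconstruction of the data is unchanged.
R = [1 1; 0 2];                          % arbitrary invertible transformation
Phi_new = Phi(:, 1:k) * R;
A_new   = R \ A(1:k, :);
norm(Phi(:, 1:k) * A(1:k, :) - Phi_new * A_new, 'fro')   % ~ 1e-15
```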

The set of all invertible transformations is infinite, and this made Richman 1 talk about an infinite level of degeneracy in the decomposition. This is a fact from linear algebra: the set of bases of a linear vector space is infinite, and all of them are mathematically equivalent. However, if we add desired properties to the components or the basis, we can pick one out of the infinite set. For example, PCA picks the basis whose components sequentially maximize variance.

In the last equation, if we require that the decomposition produces a basis which is orthonormal with respect to a weighted scalar product $\langle u, v \rangle_M = u^\top M v$ (instead of the usual scalar product $u^\top v$), then we get the condition $\tilde{\Phi}^\top M \tilde{\Phi} = I$. If we can decompose the data with this constraint (the so-called normalization constraint in Richman’s article) we obtain the decomposition in the transformed basis.

Enter POD …

Transformations as scalar products

When the matrix $M$ has only non-negative (respectively, positive) eigenvalues, i.e. it is positive (semi-)definite, it can be interpreted as a Gramian matrix, and hence it defines a scalar product. This is how PCA is implemented in many fields that deal with differential equations and their numerical solutions 3; the given name of the method in these fields is usually Proper Orthogonal Decomposition (POD). In this view, the condition of orthonormality can be implemented directly into the SVD decomposition: factoring $M = L L^\top$ (e.g. with a Cholesky decomposition) and computing

$$L^\top U = W \Sigma V^\top,$$

which means that applying SVD to the weighted data $L^\top U$ and then solving $L^\top \Phi = W$ will produce the desired basis, since $\Phi^\top M \Phi = W^\top W = I$.

This generalized PCA is implemented in the function pod.m in the repository.
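For reference, a minimal sketch of such a weighted decomposition could look as follows. This is an illustrative reimplementation under the assumptions above ($M$ symmetric positive definite), not the repository’s pod.m:

```matlab
function [Phi, A] = pod_sketch(U, M)
  % Weighted POD: basis orthonormal w.r.t. <u,v>_M = u'*M*v.
  L = chol(M, 'lower');             % factor M = L*L'
  [W, S, V] = svd(L' * U, 'econ');  % SVD of the weighted data L'*U
  Phi = L' \ W;                     % solve L'*Phi = W, so Phi'*M*Phi = I
  A   = S * V';                     % components: U = Phi*A
end
```

With `M = eye(size(U, 1))` this reduces to the plain SVD decomposition from before.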

To transform or not to transform

Bases of linear vector spaces are equivalent; citing Richman, “…[referring to two bases] have the same properties and are indistinguishable”. So there is no intrinsically “better” basis, despite what the discourse in his article seems to indicate.

The goodness of each representation lies in the researcher’s use of the decomposition. In his article he argues for decompositions that are aligned with how explanations of phenomena are constructed in his field, i.e. he favors components and bases that allow him to generate explanations with the usual jargon and concepts. Hence he struggles (and reviews somebody else’s struggles) to express this in terms of linear transformations and normalizations of the components and bases.

If the objective is just dimensionality reduction, then any basis would do; you can stick to the output of SVD. If the objective also includes interpretability of the results with widespread concepts, or if the basis should embody some prior knowledge we have about the phenomena generating the data, then transforming the basis and components provided by SVD is highly desirable (probably necessary). Transformations are also useful if further algorithms applied to the components or basis have properties that suit certain types of data. In that case it is desirable to choose a decomposition that matches the properties of those algorithms, e.g. if we want to parallelize further analyses involving the components we might want to extract sparse representations or statistically independent components.
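To make this concrete: the rotations discussed by Richman are typically chosen to optimize a “simplicity” criterion of the loadings. Here is a hedged sketch of varimax (the classic criterion in that literature), using the common SVD-based iteration; the function name and stopping parameters are made up for illustration:

```matlab
function [Phi_rot, R] = varimax_sketch(Phi, tol, maxit)
  % Orthogonal rotation of the basis Phi (n x k) that maximizes the
  % varimax criterion: the variance of squared loadings per column.
  [n, k] = size(Phi);
  R = eye(k); d = 0;
  for i = 1:maxit
    Lam = Phi * R;
    % Gradient of the varimax objective w.r.t. the rotation
    G = Phi' * (Lam.^3 - Lam * diag(sum(Lam.^2, 1)) / n);
    [W, S, V] = svd(G);
    R = W * V';                  % project the gradient step onto rotations
    d_old = d; d = sum(diag(S));
    if d_old > 0 && (d - d_old) / d < tol, break; end
  end
  Phi_rot = Phi * R;             % rotated ("simpler") basis
end
```

Because `R` is orthogonal, the rotated basis stays orthonormal and the components transform as `R' * A`, so no information is lost; only the attribution of variance to individual modes changes.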

Subdomain stability

One of the concerns raised in Richman’s article is that the extraction of components is not stable if one performs the analysis on a subdomain of the original domain. Attributing subdomain stability to the extracted components reflects prior information that the analyst has about the process generating the data. It is by no means a deficiency of the decomposition. Indeed, subdomain stability is expected only for specific types of data. Let’s briefly describe types of data with and without subdomain stability:

  • Scale-free data When the subdomain is built by zooming in on the original domain, stability will occur if the properties of the data are the same (or very similar) at different scales. Here is an excellent video demonstrating these properties in critical Ising models

    For scale-free data, the extracted components in the subdomain will be compatible with the ones in the original domain. Many types of noise (and chaotic signals) are also self-similar or scale invariant.

  • Bifurcation data Many dynamical systems show extreme changes in their behavior when parameters change. In a spatially distributed scenario, even if the same dynamical process is generating the data, parameters (like topography, temperature, humidity, etc.) depend on the position in the domain. If the variation of these parameters is not uniform, then components of data from a subdomain will reflect this local variability, e.g. the density of bees is higher in agricultural or green areas but not everywhere, as in deserts or cities; hence a process regulated by the bee population will produce different data in these subdomains, especially if it undergoes a bifurcation.

    Here is an example of the behavior of a car-trailer system when the mass distribution is changed:

    Similarly, systems that undergo drastic changes in behavior based on their states, known as hybrid/switching systems, will show different components in different regions of the observed data.

  • Data with local variation If the characteristics of the signal are defined by its position in the domain (as in bifurcations), the components are not expected to be subdomain invariant. I have written the script s_pc_subdomain.m (here its html report) to illustrate this based on signals that look like this:

Signals with local characteristics
Signals used in the demonstration script (see text). The black line marks the division between the two regions in which the signals have different characteristic length scales.

The results show that the subdomain components are very often considerably different from the ones in the whole domain (80 out of 100 runs have a relative $L_2$ difference higher than 0.5).
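A rough, self-contained version of that experiment (not the actual s_pc_subdomain.m; the signal construction is my guess at “different characteristic length scales in two regions”):

```matlab
% Signals with a slow length scale on the left half of the domain
% and a fast one on the right half.
N = 200; T = 100;
x = linspace(0, 1, N)';
left = x < 0.5;
U = zeros(N, T);
for j = 1:T
  a = randn(1, 4);
  U(left,  j) = a(1) * sin(2*pi*x(left))   + a(2) * cos(2*pi*x(left));
  U(~left, j) = a(3) * sin(20*pi*x(~left)) + a(4) * cos(20*pi*x(~left));
end

[Wfull, ~, ~] = svd(U, 'econ');           % components on the whole domain
[Wsub,  ~, ~] = svd(U(left, :), 'econ');  % components on the left subdomain

% Compare the leading subdomain mode with the restriction of the
% leading full-domain mode (up to sign).
w_full = Wfull(left, 1); w_full = w_full / norm(w_full);
w_sub  = Wsub(:, 1);
rel_diff = min(norm(w_sub - w_full), norm(w_sub + w_full));
fprintf('relative L2 difference of leading modes: %g\n', rel_diff);
```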

References

  1. Richman, M. B. (1986). Rotation of principal components. Journal of Climatology, 6(3), 293–335. http://doi.org/10.1002/joc.3370060305

  2. Jolliffe, I. T. (1987). Rotation of principal components: Some comments. Journal of Climatology, 7(5), 507–510. http://doi.org/10.1002/joc.3370070506 

  3. Volkwein, S. (2013). Proper orthogonal decomposition: Theory and reduced-order modelling. Lecture Notes, University of Konstanz. http://www.math.uni-konstanz.de/numerik/personen/volkwein/teaching/POD-Book.pdf