Structural PCA
Principal Component Analysis (PCA) is a powerful data reduction technique developed in 1901 by the statistician Karl Pearson. While the method is now more than a century old, it is increasingly used as a first-line analytic approach for modern datasets containing ultra-high-dimensional samples of images or curves. In conjunction with the closely related singular value decomposition (SVD), PCA provides a practical approach to data analysis by focusing on smaller linear spaces that explain almost all of the variability in the observed data.
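For a centered data matrix, the principal components can be read off directly from the SVD. The following is a minimal sketch of that connection, assuming rows are samples and columns are variables; the function name and variable layout are illustrative only.

```python
import numpy as np

def pca_via_svd(Y, k):
    """PCA of an n-by-p data matrix Y via the SVD, keeping k components.

    Illustrative sketch: rows are samples, columns are variables.
    """
    Yc = Y - Y.mean(axis=0)                  # center each column
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    scores = U[:, :k] * s[:k]                # principal component scores
    loadings = Vt[:k].T                      # principal directions (eigenvectors)
    var_explained = s[:k] ** 2 / np.sum(s ** 2)  # proportion of variance per component
    return scores, loadings, var_explained
```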
Our group has been faced with studies where some information about the design of the experiment or the data structure is known. For example, in a brain imaging replication experiment, brain images are taken one day apart on tens or hundreds of subjects to check the reliability of the resulting images. Another example is hip accelerometry data measured on the same subject continuously for a week and then repeatedly every month for one year. A direct PCA would ignore this structure altogether and would not be able to estimate the different levels of variability, especially if one level dominates the others. Our approaches have focused on explicitly modeling the data using high-dimensional latent processes and then studying the covariance partitioning implied by the known structure. We started by developing Multilevel Functional Principal Component Analysis (MFPCA) for replication/exchangeable designs, Longitudinal Functional Principal Component Analysis (LFPCA) for longitudinal designs, and Structural Functional Principal Component Analysis (SFPCA) for general-purpose designs.
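To make the covariance partitioning concrete, consider a replication design with two visits per subject, Y_ij(t) = mu(t) + X_i(t) + U_ij(t), where X_i carries between-subject and U_ij within-subject variability. The sketch below illustrates one method-of-moments way to separate the two covariance levels before level-specific PCA; it is an assumed illustration under this two-visit setup, not the published MFPCA implementation, and all names are hypothetical.

```python
import numpy as np

def two_level_covariances(Y1, Y2):
    """Method-of-moments split of variability into between- and within-subject
    covariances for a replication design with exactly two visits per subject.

    Y1, Y2 : (n_subjects, p) arrays of visit-1 and visit-2 data
             (vectorized, registered images or curves).
    """
    Y1c = Y1 - Y1.mean(axis=0)
    Y2c = Y2 - Y2.mean(axis=0)
    n = Y1.shape[0]
    # Between-subject covariance: cross-covariance of the two visits
    K_between = Y1c.T @ Y2c / n
    K_between = (K_between + K_between.T) / 2      # symmetrize
    # Within-subject covariance: half the covariance of visit differences
    D = Y1c - Y2c
    K_within = D.T @ D / (2 * n)
    return K_between, K_within

# Eigendecomposing each covariance yields level-specific principal components;
# forming the p-by-p matrices is feasible only for moderate p.
```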
Originally, these methods were developed for the case when the basic observational unit was a function or image of moderate size (p<1000). The problem with larger objects is that it becomes hard to smooth their covariance operators, and truly high-dimensional covariance operators (p>30,000) are difficult to store and diagonalize. Thus, we deployed new approaches designed for high-dimensional data (HD-MFPCA, HD-LFPCA, HD-SFPCA). The main idea is simple: represent the data from a high-dimensional function or image as a high-dimensional column vector, Y. If a linear model holds for Y, then the exact same linear model holds for the low-dimensional vector AY, where A is a q by p matrix with q<<p. Thus, inference can be conducted in the much smaller space of AY. An important observation is that V'Y is a loss-less high-to-low-dimensional data transformation, where V is the matrix of left singular vectors obtained from the SVD of the matrix formed by column-binding all subject-specific data matrices. As this SVD can be computed in linear time with respect to the high dimension, p, all calculations remain linear in p. This provides a powerful exploratory tool for imaging and functional data that are spatially and/or temporally registered.
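The projection step can be sketched as follows: obtain the left singular vectors of the column-bound data matrix by eigendecomposing the small N-by-N inner-product matrix, so the cost stays linear in p, and then project the data onto them. This is only a hedged illustration of the idea described above, not the authors' released code; the function and variable names are hypothetical.

```python
import numpy as np

def hd_projection(Y_all):
    """Loss-less high-to-low-dimensional projection for HD functional PCA.

    Y_all : (p, N) matrix formed by column-binding all subject-specific data
    matrices; each column is a vectorized image or curve with p >> N.
    """
    p, N = Y_all.shape
    G = Y_all.T @ Y_all                      # N x N inner-product matrix, O(p * N^2)
    eigvals, U = np.linalg.eigh(G)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, U = eigvals[order], U[:, order]
    keep = eigvals > 1e-10 * eigvals[0]      # drop numerically null directions
    s = np.sqrt(eigvals[keep])               # singular values
    V = Y_all @ (U[:, keep] / s)             # p x q left singular vectors
    Y_low = V.T @ Y_all                      # q x N low-dimensional representation
    return V, Y_low

# Multilevel/longitudinal PCA is then run on the q-dimensional columns of
# Y_low, and the resulting low-dimensional eigenvectors are mapped back to
# the original space via V.
```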