As one often encounters datasets with more than a few variables, multivariate statistical techniques are needed to extract the information contained in these datasets effectively. In the environmental sciences, examples of multivariate datasets are ubiquitous – the air temperatures recorded by all the weather stations around the globe, the satellite infrared images composed of numerous small pixels, the gridded output from a general circulation model, etc. The number of variables or time series in these datasets ranges from thousands to millions. Without a mastery of multivariate techniques, one is overwhelmed by these gigantic datasets. In this chapter, we review the principal component analysis method and its many variants, and the canonical correlation analysis method. These methods, which use standard matrix techniques such as singular value decomposition, are relatively easy to use but suffer from being linear, a limitation that will be lifted with neural network and kernel methods in later chapters.
Principal component analysis (PCA)
Geometric approach to PCA
We have a dataset with variables y1, …, ym. These variables have been sampled n times. In many situations, the m variables are m time series each containing n observations in time. For instance, one may have a dataset containing the monthly air temperature measured at m stations over n months.
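To make this data layout concrete, the following is a minimal sketch in Python with NumPy. It assumes the data are stored as an array with one row per variable (station) and one column per observation time (month). The function name `make_station_data`, the default sizes, and the synthetic seasonal-cycle-plus-noise temperatures are illustrative assumptions only, not real measurements.

```python
import numpy as np


def make_station_data(m=5, n=24, seed=0):
    """Return a synthetic (m, n) array: m stations, each with n monthly values.

    All values are made up for illustration: a shared annual cycle,
    a station-specific mean temperature, and random noise.
    """
    rng = np.random.default_rng(seed)
    months = np.arange(n)
    seasonal = 10.0 * np.sin(2.0 * np.pi * months / 12.0)  # shape (n,)
    station_mean = rng.normal(15.0, 5.0, size=(m, 1))      # shape (m, 1)
    noise = rng.normal(0.0, 1.0, size=(m, n))              # shape (m, n)
    # Broadcasting combines the three terms into an (m, n) array.
    return station_mean + seasonal + noise


Y = make_station_data()
m, n = Y.shape          # m variables, each sampled n times
first_series = Y[0]     # time series of the first variable (length n)
first_sample = Y[:, 0]  # all m variables at the first observation time
```

Here each row of `Y` is one time series and each column is one sample of all m variables. Storing the transpose, with observations as rows, would serve equally well provided the convention is applied consistently.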