Principal component analysis

Principal component analysis is a versatile statistical method for reducing a cases-by-variables data table to its essential features, called principal components. Principal components are a few linear combinations of the original variables that maximally explain the variance of all the variables. In the process, the method provides an approximation of the original data table using only these few major components. This Primer presents a comprehensive review of the method’s definition and geometry, as well as the interpretation of its numerical and graphical results. The main graphical result is often in the form of a biplot, using the major components to map the cases and adding the original variables to support the distance interpretation of the cases’ positions. Variants of the method are also treated, such as the analysis of grouped data, as well as the analysis of categorical data, known as correspondence analysis. Also described and illustrated are the latest innovative applications of principal component analysis: for estimating missing values in huge data matrices, sparse component estimation, and the analysis of images, shapes and functions. Supplementary material includes video animations and computer scripts in the R environment.
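
The definition above can be illustrated with a minimal sketch, written here in Python with NumPy rather than the R environment of the supplementary scripts. It computes principal components through the singular value decomposition of the centred (and optionally standardized) data table, then rebuilds a low-rank approximation of the table from the first few components. The function name `pca`, its parameters and the simulated cases-by-variables table are illustrative assumptions and are not taken from the Primer's own scripts.

```python
import numpy as np


def pca(X, n_components=2, standardize=True):
    """Sketch of PCA on a cases-by-variables table using the SVD.

    Returns the case scores, the variable loadings (coefficients of the
    linear combinations), the proportion of variance explained by each
    retained component, and the low-rank approximation of X.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Z = X - mean                                   # centre each variable
    if standardize:
        scale = Z.std(axis=0, ddof=1)              # unit-variance scaling
    else:
        scale = np.ones(X.shape[1])
    Z = Z / scale

    # Z = U diag(s) Vt; columns of V define the principal directions
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)

    k = n_components
    scores = U[:, :k] * s[:k]                      # coordinates of the cases
    loadings = Vt[:k].T                            # one column per component
    explained = (s ** 2 / np.sum(s ** 2))[:k]      # share of total variance

    # Rank-k approximation, mapped back to the original units
    approx = (scores @ loadings.T) * scale + mean
    return scores, loadings, explained, approx


# Synthetic example (assumed data): 50 cases, 5 variables driven by
# 2 latent factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2))
mixing = np.array([[1.0, 0.5, -1.0, 0.8, 0.3],
                   [0.2, 1.0, 0.5, -0.7, 1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(50, 5))

scores, loadings, explained, approx = pca(X, n_components=2)
print("Proportion of variance explained:", np.round(explained, 3))
print("Relative approximation error:",
      np.linalg.norm(X - approx) / np.linalg.norm(X - X.mean(axis=0)))
```

Plotting the two columns of `scores` as points and the rows of `loadings` as arrows would give a basic biplot of the kind described above; the scaling conventions for such a display vary and are not fixed by this sketch.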

Code availability

Several datasets and the R scripts that produce certain results in this Primer can be found on GitHub at: https://github.com/michaelgreenacre/PCA.

Acknowledgements

This review is dedicated to the memory of Professor Cas Troskie, who was the head of the Department of Statistics at the University of Cape Town, both teacher and mentor to M.G. and T.H., and who planted the seeds of principal component analysis in them at an early age. T.H. was partially supported by grants DMS2013736 and IIS1837931 from the National Science Foundation, and grant 5R01 EB001988-21 from the National Institutes of Health. E.T. was supported by the Stanford Data Science Institute.

Author information

Authors and Affiliations

  1. Department of Economics and Business, Universitat Pompeu Fabra and Barcelona School of Management, Barcelona, Spain (Michael Greenacre)
  2. Econometric Institute, Erasmus School of Economics, Erasmus University Rotterdam, Rotterdam, Netherlands (Patrick J. F. Groenen)
  3. Departments of Statistics and Biomedical Science, Stanford University, Stanford, CA, USA (Trevor Hastie)
  4. Department of Political Sciences, University of Naples Federico II, Naples, Italy (Alfonso Iodice D’Enza)
  5. Department of Primary Education, Democritus University of Thrace, Alexandroupolis, Greece (Angelos Markos)
  6. Department of Statistics, Stanford University, Stanford, CA, USA (Elena Tuzhilina)