Technical Report 2019-009

Evaluation of Methods Handling Missing Data in PCA on Genotype Data: Applications for Ancient DNA

Kristiina Ausmees

October 2019

Principal Component Analysis (PCA) is a method of projecting data onto a basis that maximizes its variance, possibly revealing previously unseen patterns or features. PCA can be used to reduce the dimensionality of multivariate data, and is widely applied in visualization of genetic information. In the field of ancient DNA, it is common to use PCA to show genetic affinities of ancient samples in the context of modern variation. Due to the low quality and sequence coverage often exhibited by ancient samples, such analysis is not straightforward, particularly when performing joint visualization of multiple individuals with non-overlapping sequence data. The PCA transform is based on variances of allele frequencies among pairs of individuals, and discrepancies in overlap may therefore have large effects on scores. As the relative distances between scores are used to infer genetic similarity, it is important to distinguish between the effects of the particular set of markers used and actual genetic affinities. This work addresses the problem of using an existing PCA model to estimate scores of new observations with missing data. We address the particular application of visualizing genotype data, and evaluate approaches commonly used in population genetic analyses as well as other methods from the literature. The methods considered are that of trimmed scores, projection to the model plane, performing PCA individually on samples and subsequently merging them using Procrustes transformation, as well as the two least-squares based methods trimmed score regression and known data regression. Using empirical ancient data, we demonstrate the use of the different methods, and show that discrepancies in the set of loci considered for different samples can have pronounced effects on estimated scores. We also present an evaluation of the methods based on modern data with varying levels of simulated sparsity, concluding that their relative performance is highly data-dependent.

Available as PDF (339 kB, no cover)

Download BibTeX entry.