Monday, December 16, 2013

Estimating global ancestral components with 1000 Genomes

Following the last two posts, here I will show you how to use the PCA results after merging your dataset with the 1000 Genomes project dataset to estimate global ancestral components for the four main ancestries around the world, that is, European, African, Native American, and Asian.

Since there are no Native Americans in the 1000 Genomes project, we need the global ancestral estimates for the Latinos in the project which will be necessary to estimate this component. These estimates are available as part of the project, so we just need to download them:


mkdir -p $anc

# download global proportions table
for pop in ASW CLM MXL PUR; do
  wget -O-${pop}_phase1_globalproportions_and_unknown.txt
done) > $anc/gap

The 1000 Genomes project individuals can then be used to train a linear model that estimates global ancestral components. If you followed the directory structure for the PCA as discussed in the previous post, then the following R script will take the three tables $kgp/$set.kgp.pop kgp/$set.kgp.pca $anc/gap as input and will output a fourth table, say kgp/$set.anc, with the global ancestry estimates.

args <- commandArgs(trailingOnly = TRUE)
# load and merge input tables
df1 <- read.table(args[1],header=TRUE)
df2 <- read.table(args[2],header=TRUE)
df3 <- read.table(args[3],header=TRUE)
data <- merge(df1,df2,all=TRUE,incomparables=NaN)
data <- merge(data,df3,all=TRUE,incomparables=NaN)

data$ASN <- NaN

# European populations
eur <- data$POP=="CEU" | data$POP=="FIN" | data$POP=="GBR" | data$POP=="IBS" | data$POP=="TSI"
data[which(eur),"EUR"] <- 1; data[which(eur),"AFR"] <- 0; data[which(eur),"NAT"] <- 0; data[which(eur),"ASN"] <- 0

# African populations
afr <- data$POP=="YRI"
data[which(afr),"EUR"] <- 0; data[which(afr),"AFR"] <- 1; data[which(afr),"NAT"] <- 0; data[which(afr),"ASN"] <- 0

# Asian populations
asn <- data$POP=="CDX" | data$POP=="CHB" | data$POP=="CHD" | data$POP=="CHS" | data$POP=="JPT"
data[which(asn),"EUR"] <- 0; data[which(asn),"AFR"] <- 0; data[which(asn),"NAT"] <- 0; data[which(asn),"ASN"] <- 1

# Admixed populations
mix <- data$POP=="ASW" | data$POP=="CLM" | data$POP=="MXL" | data$POP=="PUR"
for (anc in c("EUR","AFR","ASN")) {
  data[which(mix),anc] <- data[which(mix),anc] * 1/(1-data[which(mix),"UNK"])
data[which(mix),"ASN"] <- 0

# African American samples to be excluded
exclude = c("NA20299","NA20314","NA20414","NA19625","NA19921")
for (anc in c("EUR","AFR","NAT","ASN")) {
  data[is.element(data$IID,exclude),anc] <- NaN

# predict global ancestral proportions
for (anc in c("EUR","AFR","NAT","ASN")) {
  tmp <- lm(as.formula(paste0(anc," ~ PCA1 + PCA2 + PCA3")), data)
  data[[,anc]),anc] = predict.lm(tmp,data[[,anc]),])

# output data to disk

If you look into the code, you will see that some African American samples are excluded from the computation. This is due to the fact that a few African American individuals in the 1000 Genomes project have a significant amount of Native American ancestry, but this does not show up in the global ancestry proportion tables as it was not modeled. It is therefore better to exclude those individuals from these computations.

Also, you might notice that, due to the inherent noise in the way the global ancestral components are modeled, it is possible that some ancestral components will be negative. This is to be expected, as these are noisy estimates. Make sure any further downstream analysis takes care of this eventuality. Overall, estimating ancestral components from principal components is a more noisy business than estimating ancestral components from local ancestry deconvolutions. However, for many purposes these estimates should work just fine.

No comments:

Post a Comment