Since there are no Native American populations in the 1000 Genomes project, we will rely on the global ancestry estimates for the admixed samples in the project to estimate the Native American component. These estimates are available as part of the project, so we just need to download them:
anc="%DIRECTORY WITH ANCESTRAL COMPONENTS%" mkdir -p $anc # download global proportions table (echo "IID EUR AFR NAT UNK"; for pop in ASW CLM MXL PUR; do wget -O- ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/ancestry_deconvolution/${pop}_phase1_globalproportions_and_unknown.txt done) > $anc/gap
The 1000 Genomes project individuals can then be used to train a linear model that estimates global ancestral components. If you followed the directory structure for the PCA as discussed in the previous post, the following R script will take the three tables $kgp/$set.kgp.pop, $kgp/$set.kgp.pca, and $anc/gap as input and will output a fourth table, say $kgp/$set.anc, with the global ancestry estimates.
#!/usr/bin/Rscript
args <- commandArgs(trailingOnly = TRUE)
# load and merge input tables
df1 <- read.table(args[1], header = TRUE)
df2 <- read.table(args[2], header = TRUE)
df3 <- read.table(args[3], header = TRUE)
data <- merge(df1, df2, all = TRUE, incomparables = NaN)
data <- merge(data, df3, all = TRUE, incomparables = NaN)
data$ASN <- NaN

# European populations
eur <- data$POP=="CEU" | data$POP=="FIN" | data$POP=="GBR" | data$POP=="IBS" | data$POP=="TSI"
data[which(eur),"EUR"] <- 1; data[which(eur),"AFR"] <- 0; data[which(eur),"NAT"] <- 0; data[which(eur),"ASN"] <- 0

# African populations
afr <- data$POP=="YRI"
data[which(afr),"EUR"] <- 0; data[which(afr),"AFR"] <- 1; data[which(afr),"NAT"] <- 0; data[which(afr),"ASN"] <- 0

# Asian populations
asn <- data$POP=="CDX" | data$POP=="CHB" | data$POP=="CHD" | data$POP=="CHS" | data$POP=="JPT"
data[which(asn),"EUR"] <- 0; data[which(asn),"AFR"] <- 0; data[which(asn),"NAT"] <- 0; data[which(asn),"ASN"] <- 1

# Admixed populations: rescale the known components so they sum to one
# after dropping the unknown component
mix <- data$POP=="ASW" | data$POP=="CLM" | data$POP=="MXL" | data$POP=="PUR"
for (anc in c("EUR","AFR","NAT")) {
  data[which(mix),anc] <- data[which(mix),anc] / (1 - data[which(mix),"UNK"])
}
data[which(mix),"ASN"] <- 0

# African American samples to be excluded from the training set
exclude <- c("NA20299","NA20314","NA20414","NA19625","NA19921")
for (anc in c("EUR","AFR","NAT","ASN")) {
  data[is.element(data$IID, exclude), anc] <- NaN
}

# predict global ancestral proportions from the first three principal components
for (anc in c("EUR","AFR","NAT","ASN")) {
  tmp <- lm(as.formula(paste0(anc, " ~ PCA1 + PCA2 + PCA3")), data)
  data[is.na(data[,anc]), anc] <- predict.lm(tmp, data[is.na(data[,anc]),])
}

# output data to disk
write.table(data[,c("FID","IID","EUR","AFR","NAT","ASN")], args[4], row.names = FALSE, quote = FALSE)
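Assuming the script above is saved, for example, as anc.R (a name chosen here for illustration), it can be invoked with the four tables in the order described above:

Rscript anc.R $kgp/$set.kgp.pop $kgp/$set.kgp.pca $anc/gap $kgp/$set.anc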
If you look into the code, you will see that some African American samples are excluded from the computation. A few African American individuals in the 1000 Genomes project carry a significant amount of Native American ancestry, but this does not show up in the global ancestry proportion tables, as that component was not modeled there. It is therefore better to exclude those individuals from these computations.
Also, you might notice that, due to the inherent noise in the way the global ancestral components are modeled, some estimated components can turn out negative. This is to be expected with noisy estimates, so make sure any further downstream analysis takes care of this eventuality (one minimal way to do so is sketched below). Overall, estimating ancestral components from principal components is noisier than estimating them from local ancestry deconvolutions, but for many purposes these estimates should work just fine.
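As a minimal sketch of such handling, assuming the output table produced by the script above, you could clamp negative estimates to zero and renormalize each individual's components to sum to one:

# read the table produced by the script above
data <- read.table("%TABLE WITH GLOBAL ANCESTRY ESTIMATES%", header = TRUE)
anc <- c("EUR", "AFR", "NAT", "ASN")
# clamp negative estimates to zero
data[anc][data[anc] < 0] <- 0
# renormalize each individual's components so they sum to one
# (an individual with all components clamped to zero would yield NaN)
data[anc] <- data[anc] / rowSums(data[anc])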