Monday, December 16, 2013

Estimating global ancestral components with 1000 Genomes

Following the last two posts, here I will show you how to use the PCA results obtained after merging your dataset with the 1000 Genomes project dataset to estimate global ancestral components for the four main continental ancestries: European, African, Native American, and Asian.

Since there are no Native American populations in the 1000 Genomes project, we need the global ancestry estimates for the Latino samples in the project to estimate this component. These estimates are available as part of the project, so we just need to download them:

anc="%DIRECTORY WITH ANCESTRAL COMPONENTS%"

mkdir -p $anc

# download global proportions table
(echo "IID EUR AFR NAT UNK";
for pop in ASW CLM MXL PUR; do
  wget -O- ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/ancestry_deconvolution/${pop}_phase1_globalproportions_and_unknown.txt
done) > $anc/gap
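
As a quick sanity check, every line of the downloaded table should have exactly five columns (IID EUR AFR NAT UNK); the following prints any malformed lines, so no output means the table looks consistent:

# print lines of the table that do not have exactly five columns
awk 'NF!=5 {print "line " NR " has " NF " columns"}' $anc/gap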

The 1000 Genomes project individuals can then be used to train a linear model that estimates global ancestral components. If you followed the directory structure for the PCA as discussed in the previous post, the following R script will take the three tables kgp/$set.kgp.pop, kgp/$set.kgp.pca, and $anc/gap as input and will output a fourth table, say kgp/$set.anc, with the global ancestry estimates.

#!/usr/bin/Rscript
args <- commandArgs(trailingOnly = TRUE)
# load and merge input tables
df1 <- read.table(args[1],header=TRUE)
df2 <- read.table(args[2],header=TRUE)
df3 <- read.table(args[3],header=TRUE)
data <- merge(df1,df2,all=TRUE,incomparables=NaN)
data <- merge(data,df3,all=TRUE,incomparables=NaN)

data$ASN <- NaN

# European populations
eur <- data$POP=="CEU" | data$POP=="FIN" | data$POP=="GBR" | data$POP=="IBS" | data$POP=="TSI"
data[which(eur),"EUR"] <- 1; data[which(eur),"AFR"] <- 0; data[which(eur),"NAT"] <- 0; data[which(eur),"ASN"] <- 0

# African populations
afr <- data$POP=="YRI"
data[which(afr),"EUR"] <- 0; data[which(afr),"AFR"] <- 1; data[which(afr),"NAT"] <- 0; data[which(afr),"ASN"] <- 0

# Asian populations
asn <- data$POP=="CDX" | data$POP=="CHB" | data$POP=="CHD" | data$POP=="CHS" | data$POP=="JPT"
data[which(asn),"EUR"] <- 0; data[which(asn),"AFR"] <- 0; data[which(asn),"NAT"] <- 0; data[which(asn),"ASN"] <- 1

# Admixed populations (renormalize the known components so they sum to one)
mix <- data$POP=="ASW" | data$POP=="CLM" | data$POP=="MXL" | data$POP=="PUR"
for (anc in c("EUR","AFR","NAT")) {
  data[which(mix),anc] <- data[which(mix),anc] / (1-data[which(mix),"UNK"])
}
data[which(mix),"ASN"] <- 0

# African American samples to be excluded
exclude <- c("NA20299","NA20314","NA20414","NA19625","NA19921")
for (anc in c("EUR","AFR","NAT","ASN")) {
  data[is.element(data$IID,exclude),anc] <- NaN
}

# train a linear model on the samples with known ancestral proportions and
# use it to predict global ancestral proportions for the remaining samples
for (anc in c("EUR","AFR","NAT","ASN")) {
  tmp <- lm(as.formula(paste0(anc," ~ PCA1 + PCA2 + PCA3")), data)
  data[is.na(data[,anc]),anc] <- predict.lm(tmp,data[is.na(data[,anc]),])
}

# output data to disk
write.table(data[,c("FID","IID","EUR","AFR","NAT","ASN")],args[4],row.names=FALSE,quote=FALSE)
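
To run the script, save it to a file, make it executable, and pass the three input tables followed by the output file name. A hypothetical invocation, assuming the script above was saved as anc.R:

# run the ancestry estimation script (anc.R is just an example name)
chmod a+x anc.R
./anc.R kgp/$set.kgp.pop kgp/$set.kgp.pca $anc/gap kgp/$set.anc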

If you look at the code, you will notice that some African American samples are excluded from the computation. This is because a few African American individuals in the 1000 Genomes project have a significant amount of Native American ancestry, which does not show up in the global ancestry proportion tables as it was not modeled there. It is therefore better to exclude those individuals from these computations.

Also, due to the inherent noise in the way the global ancestral components are modeled, some of the estimated components may turn out negative. This is to be expected, and you should make sure any downstream analysis handles this eventuality. Overall, estimating ancestral components from principal components is a noisier business than estimating them from local ancestry deconvolutions, but for many purposes these estimates should work just fine.
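
One simple way to deal with negative estimates, sketched below under the assumption that you want the components truncated at zero and renormalized so they sum to one, is the following:

# truncate negative ancestral components at zero and renormalize each row
# (columns 3-6 of the output table are EUR AFR NAT ASN)
awk 'NR==1 {print} NR>1 {s=0; for (i=3; i<=6; i++) {if ($i<0) $i=0; s+=$i}; if (s>0) for (i=3; i<=6; i++) $i/=s; print}' kgp/$set.anc > kgp/$set.anc.trunc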

Friday, December 13, 2013

PCA of your plink dataset with 1000 Genomes

Estimating global ancestral components of your samples should really not be harder than running a principal component analysis (PCA). In this post, I will show you how to do exactly that, using the 1000 Genomes project samples. You will first need your dataset and the 1000 Genomes dataset in plink format, as explained in the previous post.

As usual, you will need a few variables defined:
set="%PLINK DATASET PREFIX%"
pth="%PATH TO THE 1000 GENOMES PLINK DATASET%"
gct="%DIRECTORY WITH GCTA%"

It is a good idea to make sure that your dataset is properly cleaned and pruned before merging it with the 1000 Genomes project data. It is also important that SNPs be coded on the forward strand, as this is how the 1000 Genomes project data are coded. The following code will identify SNPs that are shared and compatible between the two datasets, merge the individuals over these SNPs, and compute the genetic relationship matrix necessary for the PCA:

mkdir -p kgp

# identify SNPs compatible with the merge
awk 'NR==FNR {chr[$2]=$1; pos[$2]=$4; a1[$2]=$5; a2[$2]=$6} NR>FNR && ($2 in chr) && chr[$2]==$1 && pos[$2]==$4 && (a1[$2]==$5 && a2[$2]==$6 || a1[$2]==$6 && a2[$2]==$5) {print $2}' $set.bim <(cat $(seq -f $pth/kgp.chr%g.bim 1 22) $pth/kgp.chrX.bim) > kgp/extract.txt

# extract SNPs from the 1000 Genomes dataset
for chr in {1..22} X; do plink --noweb --bfile $pth/kgp.chr$chr --extract kgp/extract.txt --make-bed --out kgp/kgp.chr$chr; done
# join the per chromosome files into a single dataset; the binary genotype
# files can be concatenated after stripping the 3-byte header from each,
# which is then prepended once (magic number \x6C\x1B + SNP-major mode \x01)
/bin/cp kgp/kgp.chr1.fam kgp/kgp.fam
for chr in {1..22} X; do cat kgp/kgp.chr$chr.bim; done > kgp/kgp.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp/kgp.chr$chr.bed; done) > kgp/kgp.bed
for chr in {1..22} X; do /bin/rm kgp/kgp.chr$chr.bed kgp/kgp.chr$chr.bim kgp/kgp.chr$chr.fam kgp/kgp.chr$chr.log; done

# merge two datasets
plink --noweb --bfile $set --bmerge kgp/kgp.bed kgp/kgp.bim kgp/kgp.fam --make-bed --out kgp/$set.kgp

# create population categorical data
awk 'BEGIN {print "FID","IID","POP"} NR==FNR {pop[$2]=$7} !($2 in pop) {pop[$2]="SET"} NR>FNR {print $1,$2,pop[$2]}' $pth/20130606_g1k.ped kgp/$set.kgp.fam > kgp/$set.kgp.pop

# use GCTA to compute the genetic relationship matrix
$gct/gcta64 --bfile kgp/$set.kgp --autosome --make-grm-bin --out kgp/$set.kgp --thread-num 10

At this point we are ready to compute the principal components:

# compute principal components using GCTA
$gct/gcta64 --grm-bin kgp/$set.kgp --pca 20 --out kgp/$set.kgp --thread-num 10

# create PCA matrix for use with R data.frame (each component is scaled by the square root of its eigenvalue)
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp/$set.kgp.eigenval | tr '\n' ' '; echo; cat kgp/$set.kgp.eigenvec) |
    awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
    NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp/$set.kgp.pca

The previous code should have created two files, kgp/$set.kgp.pop and kgp/$set.kgp.pca, which can be easily loaded into R with the following script, which takes the two files as parameters:

#!/usr/bin/Rscript
suppressPackageStartupMessages(library(ggplot2))
args <- commandArgs(trailingOnly = TRUE)
df1 <- read.table(args[1], header = TRUE)
df2 <- read.table(args[2], header = TRUE)
data <- merge(df1,df2)
# plot the first two principal components, colored by population
ggplot(data,aes(x=PCA1,y=PCA2,colour=POP,shape=POP)) + geom_point(alpha=1/3) + scale_colour_discrete() + scale_shape_manual(values=1:length(levels(data[,"POP"])))

Notice that you will need R and ggplot2 already installed to run this script.
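
A hypothetical invocation, assuming the script above was saved as pca.R; when run through Rscript, the plot is written to Rplots.pdf by default:

# run the PCA plotting script (pca.R is just an example name)
chmod a+x pca.R
./pca.R kgp/$set.kgp.pop kgp/$set.kgp.pca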

Thursday, December 12, 2013

1000 Genomes PCA analysis

The easiest way to run a PCA analysis with the 1000 Genomes samples is to download the data, convert it to plink format, and use GCTA to perform the bulk of the computation. The dataset provided on the Beagle website is likely the easiest to start with. Here is some code that can get the work done.

First we need to setup a few variables:
gct="%DIRECTORY WITH GCTA%"
vcf="%DIRECTORY WITH KGP VCF FILES%"

And then we can download all necessary files:
# install plink and utilities to convert to plink format
sudo apt-get install plink
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tfam.sh
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tped.sh
chmod a+x vcf2tfam.sh vcf2tped.sh
# download the 1000 Genomes files
pfx="http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/chr"
sfx=".1kg.ref.phase1_release_v3.20101123.vcf.gz"
for chr in {1..22} X; do wget $pfx$chr$sfx $pfx$chr$sfx.tbi -P $vcf/; done
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

# install GCTA
wget http://www.complextraitgenomics.com/software/gcta/gcta_1.22.zip
mkdir -p $gct
unzip gcta_1.22.zip gcta64 -d $gct/

For each chromosome we now need to create a plink binary file (note that the Debian plink package installs the binary as p-link to avoid a name clash with the PuTTY plink):
# convert each VCF file to a plink binary file with sex information included
pfx="$vcf/chr"
for chr in {1..22} X; do
  tabix -H $pfx$chr$sfx | ./vcf2tfam.sh /dev/stdin > kgp.chr$chr.tfam
  tabix $pfx$chr$sfx $chr | ./vcf2tped.sh /dev/stdin > kgp.chr$chr.tped
  p-link --noweb --tfile kgp.chr$chr --make-bed --out kgp.chr$chr
  /bin/rm kgp.chr$chr.tfam kgp.chr$chr.tped kgp.chr$chr.nosex kgp.chr$chr.log
  awk 'NR==FNR {sex[$2]=$5} NR>FNR {$5=sex[$2]; print}' 20130606_g1k.ped kgp.chr$chr.fam > kgp.chr$chr.fam.tmp && /bin/mv kgp.chr$chr.fam.tmp kgp.chr$chr.fam
done

At this point we can use plink to prune each chromosome and then join the individual datasets into a single large dataset as discussed in a previous post:
# prune each chromosome
for chr in {1..22} X; do
  p-link --noweb --bfile kgp.chr$chr --maf .01 --indep 50 5 2 --out kgp.chr$chr
  p-link --noweb --bfile kgp.chr$chr --extract kgp.chr$chr.prune.in --make-bed --out kgp.chr$chr.prune
done

# join pruned datasets into single dataset
/bin/cp kgp.chr1.prune.fam kgp.prune.fam
for chr in {1..22} X; do cat kgp.chr$chr.prune.bim; done > kgp.prune.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp.chr$chr.prune.bed; done) > kgp.prune.bed

We can now run GCTA to compute the genetic relationship matrix and then the main principal components:
# perform PCA analysis of the whole dataset
$gct/gcta64 --bfile kgp.prune --autosome --make-grm-bin --out kgp --thread-num 10
$gct/gcta64 --grm-bin kgp --pca 20 --out kgp --thread-num 10

# create PCA matrix to be loaded into R
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp.eigenval | tr '\n' ' '; echo; cat kgp.eigenvec) |
  awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
  NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp.pca

This will generate a table that can be easily loaded into R for further analyses.
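
For example, you can quickly check that the table loads as expected with a one-liner:

# load the PCA table into an R data.frame and peek at the first rows
Rscript -e 'pca <- read.table("kgp.pca", header=TRUE); head(pca[,1:6])'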