Friday, December 13, 2013

PCA of your plink dataset with 1000 Genomes

Estimating global ancestral components of your samples should really not be harder than running a principal component analysis (PCA). Here in this post, I will show you how you can do exactly that, using the 1000 Genomes project samples. You will first need your dataset and the 1000 Genomes dataset in plink format, as explained in the previous post

As usual, you will need a few variables defined:
set="%PLINK DATASET PREFIX%"
pth="%PATH TO THE 1000 GENOMES PLINK DATASET%"
gct="%DIRECTORY WITH GCTA%"

It is a good idea to make sure that you dataset is properly cleaned and pruned before merging the data with the 1000 Genomes project data. It is also important that SNPs be coded in the forward strand, as this is how the 1000 Genomes project is coded as. The following code will identify SNPs that are shared and compatible between the datasets, merge the individuals over these SNPs, and compute the genetic relationship matrix necessary for the PCA analysis:

mkdir -p kgp

# identify SNPs compatible with the merge
awk 'NR==FNR {chr[$2]=$1; pos[$2]=$4; a1[$2]=$5; a2[$2]=$6} NR>FNR && ($2 in chr) && chr[$2]==$1 && pos[$2]==$4 && (a1[$2]==$5 && a2[$2]==$6 || a1[$2]==$6 && a2[$2]==$5) {print $2}' $set.bim <(cat $(seq -f $pth/kgp.chr%g.bim 1 22) $pth/kgp.chrX.bim) > kgp/extract.txt

# extract SNPs from the 1000 Genomes dataset
for chr in {1..22} X; do plink --noweb --bfile $pth/kgp.chr$chr --extract kgp/extract.txt --make-bed --out kgp/kgp.chr$chr; done
/bin/cp kgp/kgp.chr1.fam kgp/kgp.fam
for chr in {1..22} X; do cat kgp/kgp.chr$chr.bim; done > kgp/kgp.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp/kgp.chr$chr.bed; done) > kgp/kgp.bed
for chr in {1..22} X; do /bin/rm kgp/kgp.chr$chr.bed kgp/kgp.chr$chr.bim kgp/kgp.chr$chr.fam kgp/kgp.chr$chr.log; done

# merge two datasets
plink --noweb --bfile $set --bmerge kgp/kgp.bed kgp/kgp.bim kgp/kgp.fam --make-bed --out kgp/$set.kgp

# create population categorical data
awk 'BEGIN {print "FID","IID","POP"} NR==FNR {pop[$2]=$7} !($2 in pop) {pop[$2]="SET"} NR>FNR {print $1,$2,pop[$2]}' $pth/20130606_g1k.ped kgp/$set.kgp.fam > kgp/$set.kgp.pop

# use GCTA to compute the genetic relationship matrix
gcta64 --bfile kgp/$set.kgp --autosome --make-grm-bin --out kgp/$set.kgp --thread-num 10
At this point we are ready to compute principal components.
# compute principal components using GCTA
gcta64 --grm-bin kgp/$set.kgp --pca 20 --out kgp/$set.kgp --thread-num 10

# create PCA matrix for use with R data.frame
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp/$set.kgp.eigenval | tr '\n' ' '; echo; cat kgp/$set.kgp.eigenvec) |
    awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
    NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp/$set.kgp.pca

The previous code should have created two files, kgp/$set.kgp.pop and kgp/$set.kgp.pca, which can be easily loaded into R with the following script which takes the two files as paramters:

#!/usr/bin/Rscript
suppressPackageStartupMessages(library(ggplot2))
args <- commandArgs(trailingOnly = TRUE)
df1 <- read.table(args[1], header = TRUE)
df2 <- read.table(args[2], header = TRUE)
data <- merge(df1,df2)
ggplot(data,aes(x=PCA1,y=PCA2,colour=POP,shape=POP)) + geom_point(alpha=1/3) + scale_colour_discrete() + scale_shape_manual(values=1:length(levels(data[,"POP"])))

Notice that you will need R and ggplot2 already installed to run this script.

1 comment:

  1. Great blog! I am in the pilot phase of a sequencing project and am interested in comparing my study population to the 1000 genomes data (via PCA). Do you know of an easy way to merge my data with the 1000 genomes data? Thank you!

    ReplyDelete