As usual, you will need a few variables defined:
set="%PLINK DATASET PREFIX%" pth="%PATH TO THE 1000 GENOMES PLINK DATASET%" gct="%DIRECTORY WITH GCTA%"
It is a good idea to make sure that you dataset is properly cleaned and pruned before merging the data with the 1000 Genomes project data. It is also important that SNPs be coded in the forward strand, as this is how the 1000 Genomes project is coded as. The following code will identify SNPs that are shared and compatible between the datasets, merge the individuals over these SNPs, and compute the genetic relationship matrix necessary for the PCA analysis:
mkdir -p kgp # identify SNPs compatible with the merge awk 'NR==FNR {chr[$2]=$1; pos[$2]=$4; a1[$2]=$5; a2[$2]=$6} NR>FNR && ($2 in chr) && chr[$2]==$1 && pos[$2]==$4 && (a1[$2]==$5 && a2[$2]==$6 || a1[$2]==$6 && a2[$2]==$5) {print $2}' $set.bim <(cat $(seq -f $pth/kgp.chr%g.bim 1 22) $pth/kgp.chrX.bim) > kgp/extract.txt # extract SNPs from the 1000 Genomes dataset for chr in {1..22} X; do plink --noweb --bfile $pth/kgp.chr$chr --extract kgp/extract.txt --make-bed --out kgp/kgp.chr$chr; done /bin/cp kgp/kgp.chr1.fam kgp/kgp.fam for chr in {1..22} X; do cat kgp/kgp.chr$chr.bim; done > kgp/kgp.bim (echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp/kgp.chr$chr.bed; done) > kgp/kgp.bed for chr in {1..22} X; do /bin/rm kgp/kgp.chr$chr.bed kgp/kgp.chr$chr.bim kgp/kgp.chr$chr.fam kgp/kgp.chr$chr.log; done # merge two datasets plink --noweb --bfile $set --bmerge kgp/kgp.bed kgp/kgp.bim kgp/kgp.fam --make-bed --out kgp/$set.kgp # create population categorical data awk 'BEGIN {print "FID","IID","POP"} NR==FNR {pop[$2]=$7} !($2 in pop) {pop[$2]="SET"} NR>FNR {print $1,$2,pop[$2]}' $pth/20130606_g1k.ped kgp/$set.kgp.fam > kgp/$set.kgp.pop # use GCTA to compute the genetic relationship matrix gcta64 --bfile kgp/$set.kgp --autosome --make-grm-bin --out kgp/$set.kgp --thread-num 10At this point we are ready to compute principal components.
# compute principal components using GCTA gcta64 --grm-bin kgp/$set.kgp --pca 20 --out kgp/$set.kgp --thread-num 10 # create PCA matrix for use with R data.frame (echo "FID IID "`seq -f PCA%.0f -s" " 1 20`; (echo -n "#eigvals: "; head -n20 kgp/$set.kgp.eigenval | tr '\n' ' '; echo; cat kgp/$set.kgp.eigenvec) | awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)} NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp/$set.kgp.pca
The previous code should have created two files, kgp/$set.kgp.pop and kgp/$set.kgp.pca, which can be easily loaded into R with the following script which takes the two files as paramters:
#!/usr/bin/Rscript suppressPackageStartupMessages(library(ggplot2)) args <- commandArgs(trailingOnly = TRUE) df1 <- read.table(args[1], header = TRUE) df2 <- read.table(args[2], header = TRUE) data <- merge(df1,df2) ggplot(data,aes(x=PCA1,y=PCA2,colour=POP,shape=POP)) + geom_point(alpha=1/3) + scale_colour_discrete() + scale_shape_manual(values=1:length(levels(data[,"POP"])))
Notice that you will need R and ggplot2 already installed to run this script.
Great blog! I am in the pilot phase of a sequencing project and am interested in comparing my study population to the 1000 genomes data (via PCA). Do you know of an easy way to merge my data with the 1000 genomes data? Thank you!
ReplyDeleteThanks for a wonderful exercise to do the PCA with 1000 genome. I wonder what is the $pth/20130606_g1k.ped for, the one under "# create population categorical data" category. Thanks very much.
ReplyDeleteAh, I just realized it is from the 1kg, nevermind. ha.
Deleteftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped
Hi All, After running gwas of binary pheno, i got qqplot deflated right start from 2 onward. My population seems homogenious however, someone suggested me for pca. I do not know how to run pca and what software should i use? i am more easy with plink, if plink can do?
ReplyDeleteCheers
Nabi