Monday, December 16, 2013

Estimating global ancestral components with 1000 Genomes

Following the last two posts, here I will show you how to use the PCA results after merging your dataset with the 1000 Genomes project dataset to estimate global ancestral components for the four main ancestries around the world, that is, European, African, Native American, and Asian.

Since there are no Native Americans in the 1000 Genomes project, we need the global ancestral estimates for the Latinos in the project which will be necessary to estimate this component. These estimates are available as part of the project, so we just need to download them:


mkdir -p $anc

# download global proportions table
for pop in ASW CLM MXL PUR; do
  wget -O-${pop}_phase1_globalproportions_and_unknown.txt
done) > $anc/gap

The 1000 Genomes project individuals can then be used to train a linear model that estimates global ancestral components. If you followed the directory structure for the PCA as discussed in the previous post, then the following R script will take the three tables $kgp/$set.kgp.pop kgp/$set.kgp.pca $anc/gap as input and will output a fourth table, say kgp/$set.anc, with the global ancestry estimates.

args <- commandArgs(trailingOnly = TRUE)
# load and merge input tables
df1 <- read.table(args[1],header=TRUE)
df2 <- read.table(args[2],header=TRUE)
df3 <- read.table(args[3],header=TRUE)
data <- merge(df1,df2,all=TRUE,incomparables=NaN)
data <- merge(data,df3,all=TRUE,incomparables=NaN)

data$ASN <- NaN

# European populations
eur <- data$POP=="CEU" | data$POP=="FIN" | data$POP=="GBR" | data$POP=="IBS" | data$POP=="TSI"
data[which(eur),"EUR"] <- 1; data[which(eur),"AFR"] <- 0; data[which(eur),"NAT"] <- 0; data[which(eur),"ASN"] <- 0

# African populations
afr <- data$POP=="YRI"
data[which(afr),"EUR"] <- 0; data[which(afr),"AFR"] <- 1; data[which(afr),"NAT"] <- 0; data[which(afr),"ASN"] <- 0

# Asian populations
asn <- data$POP=="CDX" | data$POP=="CHB" | data$POP=="CHD" | data$POP=="CHS" | data$POP=="JPT"
data[which(asn),"EUR"] <- 0; data[which(asn),"AFR"] <- 0; data[which(asn),"NAT"] <- 0; data[which(asn),"ASN"] <- 1

# Admixed populations
mix <- data$POP=="ASW" | data$POP=="CLM" | data$POP=="MXL" | data$POP=="PUR"
for (anc in c("EUR","AFR","ASN")) {
  data[which(mix),anc] <- data[which(mix),anc] * 1/(1-data[which(mix),"UNK"])
data[which(mix),"ASN"] <- 0

# African American samples to be excluded
exclude = c("NA20299","NA20314","NA20414","NA19625","NA19921")
for (anc in c("EUR","AFR","NAT","ASN")) {
  data[is.element(data$IID,exclude),anc] <- NaN

# predict global ancestral proportions
for (anc in c("EUR","AFR","NAT","ASN")) {
  tmp <- lm(as.formula(paste0(anc," ~ PCA1 + PCA2 + PCA3")), data)
  data[[,anc]),anc] = predict.lm(tmp,data[[,anc]),])

# output data to disk

If you look into the code, you will see that some African American samples are excluded from the computation. This is due to the fact that a few African American individuals in the 1000 Genomes project have a significant amount of Native American ancestry, but this does not show up in the global ancestry proportion tables as it was not modeled. It is therefore better to exclude those individuals from these computations.

Also, you might notice that, due to the inherent noise in the way the global ancestral components are modeled, it is possible that some ancestral components will be negative. This is to be expected, as these are noisy estimates. Make sure any further downstream analysis takes care of this eventuality. Overall, estimating ancestral components from principal components is a more noisy business than estimating ancestral components from local ancestry deconvolutions. However, for many purposes these estimates should work just fine.

Friday, December 13, 2013

PCA of your plink dataset with 1000 Genomes

Estimating global ancestral components of your samples should really not be harder than running a principal component analysis (PCA). Here in this post, I will show you how you can do exactly that, using the 1000 Genomes project samples. You will first need your dataset and the 1000 Genomes dataset in plink format, as explained in the previous post

As usual, you will need a few variables defined:

It is a good idea to make sure that you dataset is properly cleaned and pruned before merging the data with the 1000 Genomes project data. It is also important that SNPs be coded in the forward strand, as this is how the 1000 Genomes project is coded as. The following code will identify SNPs that are shared and compatible between the datasets, merge the individuals over these SNPs, and compute the genetic relationship matrix necessary for the PCA analysis:

mkdir -p kgp

# identify SNPs compatible with the merge
awk 'NR==FNR {chr[$2]=$1; pos[$2]=$4; a1[$2]=$5; a2[$2]=$6} NR>FNR && ($2 in chr) && chr[$2]==$1 && pos[$2]==$4 && (a1[$2]==$5 && a2[$2]==$6 || a1[$2]==$6 && a2[$2]==$5) {print $2}' $set.bim <(cat $(seq -f $pth/kgp.chr%g.bim 1 22) $pth/kgp.chrX.bim) > kgp/extract.txt

# extract SNPs from the 1000 Genomes dataset
for chr in {1..22} X; do plink --noweb --bfile $pth/kgp.chr$chr --extract kgp/extract.txt --make-bed --out kgp/kgp.chr$chr; done
/bin/cp kgp/kgp.chr1.fam kgp/kgp.fam
for chr in {1..22} X; do cat kgp/kgp.chr$chr.bim; done > kgp/kgp.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp/kgp.chr$chr.bed; done) > kgp/kgp.bed
for chr in {1..22} X; do /bin/rm kgp/kgp.chr$chr.bed kgp/kgp.chr$chr.bim kgp/kgp.chr$chr.fam kgp/kgp.chr$chr.log; done

# merge two datasets
plink --noweb --bfile $set --bmerge kgp/kgp.bed kgp/kgp.bim kgp/kgp.fam --make-bed --out kgp/$set.kgp

# create population categorical data
awk 'BEGIN {print "FID","IID","POP"} NR==FNR {pop[$2]=$7} !($2 in pop) {pop[$2]="SET"} NR>FNR {print $1,$2,pop[$2]}' $pth/20130606_g1k.ped kgp/$set.kgp.fam > kgp/$set.kgp.pop

# use GCTA to compute the genetic relationship matrix
gcta64 --bfile kgp/$set.kgp --autosome --make-grm-bin --out kgp/$set.kgp --thread-num 10
At this point we are ready to compute principal components.
# compute principal components using GCTA
gcta64 --grm-bin kgp/$set.kgp --pca 20 --out kgp/$set.kgp --thread-num 10

# create PCA matrix for use with R data.frame
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp/$set.kgp.eigenval | tr '\n' ' '; echo; cat kgp/$set.kgp.eigenvec) |
    awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
    NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp/$set.kgp.pca

The previous code should have created two files, kgp/$set.kgp.pop and kgp/$set.kgp.pca, which can be easily loaded into R with the following script which takes the two files as paramters:

args <- commandArgs(trailingOnly = TRUE)
df1 <- read.table(args[1], header = TRUE)
df2 <- read.table(args[2], header = TRUE)
data <- merge(df1,df2)
ggplot(data,aes(x=PCA1,y=PCA2,colour=POP,shape=POP)) + geom_point(alpha=1/3) + scale_colour_discrete() + scale_shape_manual(values=1:length(levels(data[,"POP"])))

Notice that you will need R and ggplot2 already installed to run this script.

Thursday, December 12, 2013

1000 Genomes PCA analysis

The easiest way run a PCA analysis with the 1000 Genomes samples is to download the data, convert it to plink format, and use GCTA to perform the bulk of the computation. The dataset provided on the beagle website is likely the easiest to start with. Here some code that can get the work done.

First we need to setup a few variables:

And then we can download all necessary files:
# install plink and utilities to convert to plink format
sudo apt-get install plink
chmod a+x
# download the 1000 Genomes files
for chr in {1..22} X; do wget $pfx$chr$sfx $pfx$chr$sfx.tbi -P $vcf/; done

# install GCTA
mkdir -p $gct
unzip gcta64 -d $gct/

For each chromosome we now need to create a plink binary file:
# convert each VCF file to a plink binary file with sex information included
for chr in {1..22} X; do
  tabix -H $pfx$chr$sfx | ./ /dev/stdin > kgp.chr$chr.tfam
  tabix $pfx$chr$sfx $chr | ./ /dev/stdin > kgp.chr$chr.tped
  p-link --noweb --tfile kgp.chr$chr --make-bed --out kgp.chr$chr
  /bin/rm kgp.chr$chr.tfam kgp.chr$chr.tped kgp.chr$chr.nosex kgp.chr$chr.log
  awk 'NR==FNR {sex[$2]=$5} NR>FNR {$5=sex[$2]; print}' 20130606_g1k.ped kgp.chr$chr.fam > kgp.chr$chr.fam.tmp && /bin/mv kgp.chr$chr.fam.tmp kgp.chr$chr.fam

At this point we can use plink to prune each chromosome and then join the individual datasets into a single large dataset as discussed in a previous post:
# prune each chromosome
for chr in {1..22} X; do
  p-link --noweb --bfile kgp.chr$chr --maf .01 --indep 50 5 2 --out kgp.chr$chr
  p-link --noweb --bfile kgp.chr$chr --extract kgp.chr$ --make-bed --out kgp.chr$chr.prune

# join pruned datasets into single dataset
cat kgp.prune.chr1.fam > kgp.prune.fam
for chr in {1..22} X; do cat kgp.chr$chr.prune.bim; done > kgp.prune.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp.chr$chr.prune.bed; done) > kgp.prune.bed

We can now run GCTA to compute the genetic relationship matrix and then the main principal components:
# perform PCA analysis of the whole dataset
$gct/gcta64 --bfile kgp.prune --autosome --make-grm-bin --out kgp --thread-num 10
$gct/gcta64 --grm-bin kgp --pca 20 --out kgp --thread-num 10

# create PCA matrix to be loaded into R
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp.eigenval | tr '\n' ' '; echo; cat kgp.eigenvec) |
  awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
  NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp.pca

This will generate a table that can be easily loaded into R for further analyses.

Monday, November 18, 2013

IBD pipeline

There are a lot of interesting analyses that segments inherited identical-by-descent (IBD) can be informative for. However, it takes some steps to setup a fully functional IBD pipeline. Here is a basic one that will work with any dataset in binary plink format. It will first phase the data with hapi-ur, then detect IBD segments using beagle, and finally detect clusters of IBD individuals using dash.

To begin with, you need to setup a few variables:

And have a couple of programs pre-installed on your Linux machine:
# install plink
sudo apt-get install plink

# install hapi-ur
tar xzvf hapi-ur-1.01.tgz
mkdir -p $hpr
/bin/cp hapi-ur-1.01/hapi-ur hapi-ur-1.01/ $hpr/

# install genetic maps
tar xzvf genetic_maps_b37.tgz
mkdir -p $gmp
for i in {1..22} X_PAR1 X_nonPAR X_PAR2; do gzip -c genetic_maps_b37/genetic_map_chr${i}_combined_b37.txt > $gmp/chr$i.gmap.gz; done

# install beagle
mkdir -p $bgl
wget -P $bgl/

# install dash
tar xzvf dash-1-1.tar.gz
mkdir -p $dsh
/bin/cp dash-1-1/bin/dash_cc $dsh/

The following pipeline will run the analysis independently on each autosome:
# perform IBD and DASH analysis separately for each autosome
for i in {1..22}; do
  # create a directory
  mkdir -p chr$i

  # extract autosome of interest
  p-link --noweb --bfile $set --chr $i --make-bed --out chr$i/$set

  # generate marker file with genetic distance
  zcat $gmp/chr$i.gmap.gz | $hpr/ chr$i/$set.bim - | awk -v OFS="\t" \
    'length($5)>1 || $6=="-" {$5="I"; $6="D"} $5=="-" || length($6)>1 {$5="D"; $6="I"} {print $1,$2,$3,$4,$5,$6}' > chr$i/$set.gmap.bim

  # clean up just in case
  /bin/rm chr$i/$set.sample chr$i/$set.haps

  # phase haplotypes using hapi-ur
  $hpr/hapi-ur -g chr$i/$set.bed -s chr$i/$set.gmap.bim -i chr$i/$set.fam -w 73 --impute -o chr$i/$set

  # convert hapi-ur output for use with beagle
  (echo "##fileformat=VCFv4.1";
  echo "##FORMAT=";
  awk 'NR>2 {printf $1"."$2"\t"}' chr$i/$set.sample | sed 's/\t$/\n/g';
  awk -v c=$i '{printf c"\t"$3"\t"$2"\t"$4"\t"$5"\t.\tPASS\t.\tGT";
    for (i=6; i<NF; i+=2) printf "\t"$i"|"$(i+1); printf "\n"}' chr$i/$set.haps) |
    java -Xmx3g -jar $bgl/b4.r1185.jar ibd=true gt=/dev/stdin usephase=true burnin-its=0 phase-its=0 out=chr$i/$set

  # convert beagle output for use with DASH
  awk '{sub("\\\."," ",$1); sub("\\\."," ",$3); print $1"."$2-1,$3"."$4-1,$6,$7}' chr$i/$set.hbd chr$i/$set.ibd |
    $dsh/dash_cc chr$i/$set.fam chr$i/$set

  # create map file with IBD clusters
  cut -f1-5 chr$i/$set.clst | awk -v c=$i '{ print c,c"_"$1,$2,$3,$4,$5 }' > chr$i/$set.cmap

  # create map file with IBD clusters for use with plink
  awk -v c=$i '{ print c,$2,0,int(($3+$4)/2) }' chr$i/$set.cmap > chr$i/$

  # convert output to plink binary format
  p-link --maf 0.001 --file chr$i/$set --make-bed --recode --out chr$i/$set

You can then join the results in a single plink binary file using the code discussed in a previous post:
# join separate autosome outputs into a single plink binary file
for i in {1..22}; do cat chr$i/$set.cmap; done > $set.dash.cmap
cp $set.fam $set.dash.fam
for i in {1..22}; do cat chr$i/$set.bim; done > $set.dash.bim
(echo -en "\x6C\x1B\x01"; for i in {1..22}; do tail -c +4 chr$i/$set.bed; done) > $set.dash.bed

Wednesday, October 30, 2013

Longest homopolymer in coding sequence in the human genome

It is expected that most homopolymer sequences be negatively selected in coding sequence, due to the higher mutation rate. So what is the longest homopolymer in coding sequence in the human genome? Here a short script to answer this question.

First you will need a few programs:
sudo apt-get install wget samtools bedtools

And you will need a copy of the hg19 genome:
tar xfO chromFa.tar.gz > hg19.fa
samtools faidx hg19.fa

Finally, generate a bed interval file with the coordinates of the coding exons and look for homopolymers within these:
zcat refGene.txt.gz | awk '{split($10,a,","); split($11,b,","); for (i=1; i<length(a); i++) {from=$7>a[i]?$7:a[i]; to=$8<b[i]?$8:b[i]; if (to>from) print $3"\t"from"\t"to}}' | bedtools sort | bedtools merge > refGene.bed
bedtools getfasta -fi hg19.fa -bed refGene.bed -fo refGene.fa
for bp in A C G T; do hp=$(for i in {1..13}; do echo -n $bp; done); grep -B1 $hp refGene.fa; done

It turns out that the longest homopolymer in coding sequence is 13bp long and is within the CAMKK2 gene.

Friday, October 11, 2013

Merging plink binary files

If you have a bunch of plink binary files for the same set of people split over different chromosomes, how do you merge them? A simple answer would be to use the --merge-list option in plink. However, this will require some setting up to do and it turns out there is a much easier way to perform this.

Assuming that the order of the samples within the plink files is the same (i.e. the fam files are all identical) and that each plink binary file is in SNP-major format (this is important! see here: then the following code will just work:

cat plink.chr1.fam > plink.fam
for chr in {1..22} X Y; do cat plink.chr$chr.bim; done > plink.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X Y; do tail -c +4 plink.chr$chr.bed; done) > plink.bed

This assumes your plink files you want to merge have names like plink.chr1.bed, plink.chr2.bed, ..., plink.chrY.bed and each file corresponds to a different chromosome for your dataset. This is a typical scenario especially if you are converting a big sequencing project delivered in multiple VCF files into plink.

A plink binary file in SNP-major format is just a big table preceded by three bytes that identify the file type and the order (SNP-major or individual-major). By default plink will generate SNP-major format binary files, which are ordered by SNPs, like VCF files, so you can just concatenate them after removing the three bytes headers.

Tuesday, October 8, 2013

Convert VCF files to PLINK binary format

How many times have you needed to convert a VCF file to PLINK binary format? The 1000 Genomes project has a recommended tool ( but it only works when converting small regions. Another easy to use tool is SnpSift (

Overall, it really isn't that complicated to convert a VCF file. The idea is to first convert it to tped format and then let plink do the job of converting that to binary format.

sudo apt-get install plink tabix
chmod a+x

The tools are fairly simple and they are supposed to be flexible. Now, let's suppose your VCF file is bgzip-comrpessed. A good idea might be to split it in chromosomes and generate one plink file for each chromosome. Here some code that will do that:

for chr in {1..22} X Y MT; do
  tabix -H $vcf | ./ /dev/stdin > gpc.chr$chr.tfam
  tabix $vcf $chr | ./ /dev/stdin > gpc.chr$chr.tped
  plink --tfile gpc.chr$chr --make-bed --out gpc.chr$chr
  /bin/rm gpc.chr$chr.tfam gpc.chr$chr.tped gpc.chr$chr.nosex gpc.chr$chr.log

Wednesday, August 28, 2013

Impute APOE and APOL1 with 23andMe

If you have paid to get your genome genotyped by 23andMe, you probably know that there are a lot of variants that are not directly genotyped by 23andMe. To obviate this problem, large datasets like the 1000 Genomes Project can be used to compare your genotypes with the genotypes of other individuals and guess what your missing genotypes are. This process is called imputation and is routinely practiced by many researchers in my area. Although, little is available online with regard to imputation of your own 23andMe data. The best link I could find was here but the instructions provided to perform the imputation still require a fair amount of knowledge on how to use some specific tools.

I think this is unacceptable and I decided to come up with my own scripts and provide a couple of examples to show how to impute your own rs429358 APOE genotype (the famous Alzheimer variant, which you will have genotyped only if you bought the more recent 23andMe v3 chip) and also how to impute the rs73885319 rs60910145 rs71785313 APOL1 genotypes (those associated to non-diabetic kidney disease in people of African descent, as shown in my previous paper) which unfortunately are still not tested by 23andMe.

I wrote a few simple scripts that will allow you to impute these genotypes in a matter of minutes (or seconds, depending on how fast you can cut and paste the code below). The following instructions should work on any recent Linux Ubuntu box. You should be able to use this code on a Mac as well, provided you have the necessary basic tools already installed.

The first preliminary step is to download your raw 23andMe data. Go to your account and download the "All DNA" data set.

The second preliminary step is to install a few UNIX-friendly programs to manipulate genetic datasets, the imputation software Beagle, a couple of scripts I wrote, and a template VCF file for the variants tested by 23andMe. The following commands will do:

sudo apt-get install bedtools dos2unix openjdk-8-jre tabix unzip wget
chmod a+x

The 23andme.vcf.gz file is a VCF template I pre-built for 23andMe v2 and 23andMe v3 chips. If you want to build it yourself (optional, you don't need to do this!), from a directory where you have stored your 23andMe raw files, run:

./ 00-All.vcf.gz genome_*_Full_*.zip

This step takes time as the whole dbSNP dataset (~1.4GB) will need to be downloaded and processed.

Now the first step is to convert the raw downloaded data to a more standard format, the VCF format:

./ 23andme.vcf.gz genome_*_Full_*.zip

Where genome_*_Full_*.zip should be substituted with the name of the zip file you downloaded from 23andMe. The second and last step is to create a list of variants (from a single locus) to be imputed and pass the genomic locations of these variants to a script which will run the imputation.

The script provided will automatically download a piece of the 1000 Genomes Project reference panel around the variants of interest, subset this genotype panel to the variants to be imputed and the variants available in your VCF file, and run the imputation with this minimal reference panel. This way the imputation process will be immediate and you will have your results right away.

For APOE, we are only interested in the rs429358 variant and the rs7412 variant (which is already genotyped both in the 23andMe v2 and the 23andMe v3 chip), so the code will look like this:

echo -e "19\t45411940\t45411941\n19\t45412078\t45412079" > apoe.bed
./ beagle.16Jun16.7e4.jar apoe.bed genome_*_Full_*.vcf.gz

Where genome_*_Full_*.vcf.gz should be substituted with the name of the compressed VCF file generated in the previous step. For APOL1 we are interested instead in three variants:

echo -e "22\t36661905\t36661906\n22\t36662033\t36662034\n22\t36662040\t36662047" > apol1.bed
./ beagle.16Jun16.7e4.jar apol1.bed genome_*_Full_*.vcf.gz

The results will be displayed on screen in VCF file format with best guess genotypes and individual genotype likelihoods. Of course, the results are only probabilistic, so you will need to interpret them accordingly. For the biological interpretation instead, snpedia has a good summary, both for APOE and for APOL1.

If you have any comments, complaint, or suggestions, please send me an email! No perl code was used for any of the scripts.