Tuesday, May 13, 2014

How do I identify all the homopolymer run indels in a VCF file?

With sequencing projects, we analysts are usually left to deal with genotype data in VCF format. These files can contain genotypes for a variety of markers, from SNPs to structural variants. Small structural variants, commonly known as indels, are usually harder to call from sequencing data, and we tend to be wary of their genotypes. However, not all indels are born the same: there are several kinds, and depending on your project you might be more or less interested in some of them. A class of indels that is particularly difficult to call is homopolymer run indels, that is, deletions or insertions within repeats of the same nucleotide. A famous homopolymer run indel is the one causing medullary cystic kidney disease type 1 by disrupting the coding sequence within a VNTR of MUC1.

It can be important to identify which indels in your VCF file are homopolymer run indels, but this is not so straightforward. This information cannot be extracted from the VCF file alone without the reference, so the information from two files needs to be integrated. Here is a little hack to do this without writing any serious amount of code, using only bcftools and bedtools. The following bash script should do it:
#!/bin/bash
input="$1" # input VCF file in any format accepted by bcftools
ref="$2" # reference genome in fasta format 
bcftools view -HG $input | cut -f1-5 | awk '$4!~"^[ACGT]($|,)" || $5!~"^[ACGT]($|,)"' |
  awk '{print $1"\t"$2+length($4)-1"\t"$2+length($4)+5"\t"$3}' |
  bedtools getfasta -name -fi $ref -bed /dev/stdin -fo /dev/stdout |
  awk 'NR%2==1 {printf substr($0,2)"\t"} NR%2==0' |
  grep "AAAAAA$\|CCCCCC$\|GGGGGG$\|TTTTTT$" | cut -f1 | sort
The script takes the VCF file and the reference genome as input and outputs the names of all markers identified as homopolymer run indels; a hypothetical invocation is shown below. Notice that it requires each marker in your VCF file to have its own unique ID. Also, the definition of homopolymer run indel used here is arbitrary: the six reference bases following the indel must all be the same nucleotide. It should be easy to tweak the code if you want a different run length.
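
For reference, a hypothetical invocation could look like this, assuming the script above was saved as hprun.sh and made executable; the VCF and fasta file names are placeholders for your own files:
# hypothetical invocation: input.vcf.gz and ref.fasta are placeholders
./hprun.sh input.vcf.gz ref.fasta > homopolymer_indels.txt
If you want a different run length, change the +5 in the interval computation and the six-base patterns in the grep to match.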

Monday, December 16, 2013

Estimating global ancestral components with 1000 Genomes

Following the last two posts, here I will show you how to use the PCA results obtained after merging your dataset with the 1000 Genomes project dataset to estimate global ancestral components for four main continental ancestries: European, African, Native American, and Asian.

Since there are no Native American populations in the 1000 Genomes project, we need the published global ancestry estimates for the admixed samples in the project (ASW, CLM, MXL, PUR) to anchor the Native American component. These estimates are available as part of the project, so we just need to download them:

anc="%DIRECTORY WITH ANCESTRAL COMPONENTS%"

mkdir -p $anc

# download global proportions table
(echo "IID EUR AFR NAT UNK";
for pop in ASW CLM MXL PUR; do
  wget -O- ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/ancestry_deconvolution/${pop}_phase1_globalproportions_and_unknown.txt
done) > $anc/gap

The 1000 Genomes project individuals can then be used to train a linear model that estimates global ancestral components. If you followed the directory structure for the PCA discussed in the previous post, the following R script will take the three tables kgp/$set.kgp.pop, kgp/$set.kgp.pca, and $anc/gap as input and will output a fourth table, say kgp/$set.anc, with the global ancestry estimates.

#!/usr/bin/Rscript
args <- commandArgs(trailingOnly = TRUE)
# load and merge input tables
df1 <- read.table(args[1],header=TRUE)
df2 <- read.table(args[2],header=TRUE)
df3 <- read.table(args[3],header=TRUE)
data <- merge(df1,df2,all=TRUE,incomparables=NaN)
data <- merge(data,df3,all=TRUE,incomparables=NaN)

data$ASN <- NaN

# European populations
eur <- data$POP=="CEU" | data$POP=="FIN" | data$POP=="GBR" | data$POP=="IBS" | data$POP=="TSI"
data[which(eur),"EUR"] <- 1; data[which(eur),"AFR"] <- 0; data[which(eur),"NAT"] <- 0; data[which(eur),"ASN"] <- 0

# African populations
afr <- data$POP=="YRI"
data[which(afr),"EUR"] <- 0; data[which(afr),"AFR"] <- 1; data[which(afr),"NAT"] <- 0; data[which(afr),"ASN"] <- 0

# Asian populations
asn <- data$POP=="CDX" | data$POP=="CHB" | data$POP=="CHD" | data$POP=="CHS" | data$POP=="JPT"
data[which(asn),"EUR"] <- 0; data[which(asn),"AFR"] <- 0; data[which(asn),"NAT"] <- 0; data[which(asn),"ASN"] <- 1

# Admixed populations
mix <- data$POP=="ASW" | data$POP=="CLM" | data$POP=="MXL" | data$POP=="PUR"
for (anc in c("EUR","AFR","ASN")) {
  data[which(mix),anc] <- data[which(mix),anc] * 1/(1-data[which(mix),"UNK"])
}
data[which(mix),"ASN"] <- 0

# African American samples to be excluded
exclude = c("NA20299","NA20314","NA20414","NA19625","NA19921")
for (anc in c("EUR","AFR","NAT","ASN")) {
  data[is.element(data$IID,exclude),anc] <- NaN
}

# predict global ancestral proportions
for (anc in c("EUR","AFR","NAT","ASN")) {
  tmp <- lm(as.formula(paste0(anc," ~ PCA1 + PCA2 + PCA3")), data)
  data[is.na(data[,anc]),anc] = predict.lm(tmp,data[is.na(data[,anc]),])
}

# output data to disk
write.table(data[,c("FID","IID","EUR","AFR","NAT","ASN")],args[4],row.names=FALSE,quote=FALSE)
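
To run it, a hypothetical invocation from the shell could look like this, assuming the script above was saved as anc.R and made executable (the script name is a placeholder):
# hypothetical invocation: the fourth argument is the output table with the ancestry estimates
./anc.R kgp/$set.kgp.pop kgp/$set.kgp.pca $anc/gap kgp/$set.anc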

If you look into the code, you will see that some African American samples are excluded from the computation. This is because a few African American individuals in the 1000 Genomes project have a significant amount of Native American ancestry, which does not show up in the global ancestry proportion tables since it was not modeled there. It is therefore better to exclude those individuals from these computations.

Also, due to the noise inherent in the way the global ancestral components are modeled, some of the estimated components can turn out negative. This is to be expected from noisy estimates; just make sure any downstream analysis takes care of this eventuality. Overall, estimating ancestral components from principal components is a noisier business than estimating them from local ancestry deconvolutions, but for many purposes these estimates should work just fine.
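
If you need a quick way to deal with negative estimates, here is a minimal sketch, not part of the original pipeline, that clamps negative components to zero and rescales each row of the output table so the four components sum to one; kgp/$set.anc.fix is an arbitrary output name:
# minimal sketch (assumes the output table produced by the script above): clamp negative
# ancestry components to zero and rescale columns 3-6 (EUR AFR NAT ASN) to sum to one
awk 'NR==1 {print; next}
  {s=0; for (i=3; i<=6; i++) {if ($i<0) $i=0; s+=$i}
   if (s>0) for (i=3; i<=6; i++) $i/=s; print}' kgp/$set.anc > kgp/$set.anc.fix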

Friday, December 13, 2013

PCA of your plink dataset with 1000 Genomes

Estimating global ancestral components of your samples should really not be harder than running a principal component analysis (PCA). In this post, I will show you how to do exactly that, using the 1000 Genomes project samples. You will first need your dataset and the 1000 Genomes dataset in plink format, as explained in the previous post.

As usual, you will need a few variables defined:
set="%PLINK DATASET PREFIX%"
pth="%PATH TO THE 1000 GENOMES PLINK DATASET%"
gct="%DIRECTORY WITH GCTA%"

It is a good idea to make sure that your dataset is properly cleaned and pruned before merging it with the 1000 Genomes project data. It is also important that SNPs be coded on the forward strand, as this is how the 1000 Genomes project data are coded. The following code will identify SNPs that are shared and compatible between the two datasets, merge the individuals over these SNPs, and compute the genetic relationship matrix necessary for the PCA analysis:

mkdir -p kgp

# identify SNPs compatible with the merge
awk 'NR==FNR {chr[$2]=$1; pos[$2]=$4; a1[$2]=$5; a2[$2]=$6} NR>FNR && ($2 in chr) && chr[$2]==$1 && pos[$2]==$4 && (a1[$2]==$5 && a2[$2]==$6 || a1[$2]==$6 && a2[$2]==$5) {print $2}' $set.bim <(cat $(seq -f $pth/kgp.chr%g.bim 1 22) $pth/kgp.chrX.bim) > kgp/extract.txt

# extract SNPs from the 1000 Genomes dataset
for chr in {1..22} X; do plink --noweb --bfile $pth/kgp.chr$chr --extract kgp/extract.txt --make-bed --out kgp/kgp.chr$chr; done
/bin/cp kgp/kgp.chr1.fam kgp/kgp.fam
for chr in {1..22} X; do cat kgp/kgp.chr$chr.bim; done > kgp/kgp.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp/kgp.chr$chr.bed; done) > kgp/kgp.bed
for chr in {1..22} X; do /bin/rm kgp/kgp.chr$chr.bed kgp/kgp.chr$chr.bim kgp/kgp.chr$chr.fam kgp/kgp.chr$chr.log; done

# merge two datasets
plink --noweb --bfile $set --bmerge kgp/kgp.bed kgp/kgp.bim kgp/kgp.fam --make-bed --out kgp/$set.kgp

# create population categorical data
awk 'BEGIN {print "FID","IID","POP"} NR==FNR {pop[$2]=$7} !($2 in pop) {pop[$2]="SET"} NR>FNR {print $1,$2,pop[$2]}' $pth/20130606_g1k.ped kgp/$set.kgp.fam > kgp/$set.kgp.pop

# use GCTA to compute the genetic relationship matrix
$gct/gcta64 --bfile kgp/$set.kgp --autosome --make-grm-bin --out kgp/$set.kgp --thread-num 10
At this point we are ready to compute principal components.
# compute principal components using GCTA
$gct/gcta64 --grm-bin kgp/$set.kgp --pca 20 --out kgp/$set.kgp --thread-num 10

# create PCA matrix for use with R data.frame
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp/$set.kgp.eigenval | tr '\n' ' '; echo; cat kgp/$set.kgp.eigenvec) |
    awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
    NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp/$set.kgp.pca

The previous code should have created two files, kgp/$set.kgp.pop and kgp/$set.kgp.pca, which can easily be loaded into R with the following script, which takes the two files as parameters:

#!/usr/bin/Rscript
suppressPackageStartupMessages(library(ggplot2))
args <- commandArgs(trailingOnly = TRUE)
df1 <- read.table(args[1], header = TRUE)
df2 <- read.table(args[2], header = TRUE)
data <- merge(df1,df2)
ggplot(data,aes(x=PCA1,y=PCA2,colour=POP,shape=POP)) + geom_point(alpha=1/3) + scale_colour_discrete() + scale_shape_manual(values=1:length(levels(data[,"POP"])))

Notice that you will need R and ggplot2 already installed to run this script.
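
A hypothetical invocation, assuming the script above was saved as plot_pca.R and made executable (the script name is a placeholder), could look like this; when run non-interactively the plot typically ends up in Rplots.pdf unless you add a ggsave() call to the script:
# hypothetical invocation of the plotting script
./plot_pca.R kgp/$set.kgp.pop kgp/$set.kgp.pca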

Thursday, December 12, 2013

1000 Genomes PCA analysis

The easiest way to run a PCA analysis with the 1000 Genomes samples is to download the data, convert it to plink format, and use GCTA to perform the bulk of the computation. The dataset provided on the beagle website is likely the easiest to start with. Here is some code that can get the work done.

First we need to setup a few variables:
gct="%DIRECTORY WITH GCTA%"
vcf="%DIRECTORY WITH KGP VCF FILES%"

And then we can download all necessary files:
# install plink and utilities to convert to plink format
sudo apt-get install plink
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tfam.sh
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tped.sh
chmod a+x vcf2tfam.sh vcf2tped.sh
# download the 1000 Genomes files
pfx="http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/chr"
sfx=".1kg.ref.phase1_release_v3.20101123.vcf.gz"
for chr in {1..22} X; do wget $pfx$chr$sfx $pfx$chr$sfx.tbi -P $vcf/; done
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

# install GCTA
wget http://www.complextraitgenomics.com/software/gcta/gcta_1.22.zip
mkdir -p $gct
unzip gcta_1.22.zip gcta64 -d $gct/

For each chromosome we now need to create a plink binary file:
# convert each VCF file to a plink binary file with sex information included
pfx="$vcf/chr"
for chr in {1..22} X; do
  tabix -H $pfx$chr$sfx | ./vcf2tfam.sh /dev/stdin > kgp.chr$chr.tfam
  tabix $pfx$chr$sfx $chr | ./vcf2tped.sh /dev/stdin > kgp.chr$chr.tped
  p-link --noweb --tfile kgp.chr$chr --make-bed --out kgp.chr$chr
  /bin/rm kgp.chr$chr.tfam kgp.chr$chr.tped kgp.chr$chr.nosex kgp.chr$chr.log
  awk 'NR==FNR {sex[$2]=$5} NR>FNR {$5=sex[$2]; print}' 20130606_g1k.ped kgp.chr$chr.fam > kgp.chr$chr.fam.tmp && /bin/mv kgp.chr$chr.fam.tmp kgp.chr$chr.fam
done

At this point we can use plink to prune each chromosome and then join the individual datasets into a single large dataset as discussed in a previous post:
# prune each chromosome
for chr in {1..22} X; do
  p-link --noweb --bfile kgp.chr$chr --maf .01 --indep 50 5 2 --out kgp.chr$chr
  p-link --noweb --bfile kgp.chr$chr --extract kgp.chr$chr.prune.in --make-bed --out kgp.chr$chr.prune
done

# join pruned datasets into single dataset
cat kgp.chr1.prune.fam > kgp.prune.fam
for chr in {1..22} X; do cat kgp.chr$chr.prune.bim; done > kgp.prune.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp.chr$chr.prune.bed; done) > kgp.prune.bed

We can now run GCTA to compute the genetic relationship matrix and then the main principal components:
# perform PCA analysis of the whole dataset
$gct/gcta64 --bfile kgp.prune --autosome --make-grm-bin --out kgp --thread-num 10
$gct/gcta64 --grm-bin kgp --pca 20 --out kgp --thread-num 10

# create PCA matrix to be loaded into R
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp.eigenval | tr '\n' ' '; echo; cat kgp.eigenvec) |
  awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
  NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp.pca

This will generate a table that can be easily loaded into R for further analyses.
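
As a quick sanity check, here is a minimal sketch that loads the table into R directly from the shell and summarizes the first principal component:
# minimal sketch: load the PCA table into R from the shell and summarize PCA1
Rscript -e 'pca <- read.table("kgp.pca", header=TRUE); summary(pca$PCA1)'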

Monday, November 18, 2013

IBD pipeline

Segments inherited identical by descent (IBD) can be informative for a lot of interesting analyses. However, it takes some steps to set up a fully functional IBD pipeline. Here is a basic one that will work with any dataset in binary plink format. It first phases the data with hapi-ur, then detects IBD segments using beagle, and finally detects clusters of IBD individuals using dash.

To begin with, you need to setup a few variables:
hpr="%DIRECTORY WITH HAPI-UR BINARY FILES%"
gmp="%DIRECTORY WITH GENETIC MAP FILES%"
bgl="%DIRECTORY WITH BEAGLE BINARY FILE%"
dsh="%DIRECTORY WITH DASH BINARY FILE%"
set="%PLINK DATASET PREFIX%"

And have a couple of programs pre-installed on your Linux machine:
# install plink
sudo apt-get install plink

# install hapi-ur
wget https://hapi-ur.googlecode.com/files/hapi-ur-1.01.tgz
tar xzvf hapi-ur-1.01.tgz
mkdir -p $hpr
/bin/cp hapi-ur-1.01/hapi-ur hapi-ur-1.01/insert-map.pl $hpr/

# install genetic maps
wget http://mathgen.stats.ox.ac.uk/impute/genetic_maps_b37.tgz
tar xzvf genetic_maps_b37.tgz
mkdir -p $gmp
for i in {1..22} X_PAR1 X_nonPAR X_PAR2; do gzip -c genetic_maps_b37/genetic_map_chr${i}_combined_b37.txt > $gmp/chr$i.gmap.gz; done

# install beagle
mkdir -p $bgl
wget http://faculty.washington.edu/browning/beagle/b4.r1196.jar -P $bgl/

# install dash
wget http://www1.cs.columbia.edu/~gusev/dash/dash-1-1.tar.gz
tar xzvf dash-1-1.tar.gz
mkdir -p $dsh
/bin/cp dash-1-1/bin/dash_cc $dsh/

The following pipeline will run the analysis independently on each autosome:
# perform IBD and DASH analysis separately for each autosome
for i in {1..22}; do
  # create a directory
  mkdir -p chr$i

  # extract autosome of interest
  p-link --noweb --bfile $set --chr $i --make-bed --out chr$i/$set

  # generate marker file with genetic distance
  zcat $gmp/chr$i.gmap.gz | $hpr/insert-map.pl chr$i/$set.bim - | awk -v OFS="\t" \
    'length($5)>1 || $6=="-" {$5="I"; $6="D"} $5=="-" || length($6)>1 {$5="D"; $6="I"} {print $1,$2,$3,$4,$5,$6}' > chr$i/$set.gmap.bim

  # clean up just in case
  /bin/rm -f chr$i/$set.sample chr$i/$set.haps

  # phase haplotypes using hapi-ur
  $hpr/hapi-ur -g chr$i/$set.bed -s chr$i/$set.gmap.bim -i chr$i/$set.fam -w 73 --impute -o chr$i/$set

  # convert hapi-ur output for use with beagle
  (echo "##fileformat=VCFv4.1";
  echo "##FORMAT=";
  echo -en "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t";
  awk 'NR>2 {printf $1"."$2"\t"}' chr$i/$set.sample | sed 's/\t$/\n/g';
  awk -v c=$i '{printf c"\t"$3"\t"$2"\t"$4"\t"$5"\t.\tPASS\t.\tGT";
    for (i=6; i<NF; i+=2) printf "\t"$i"|"$(i+1); printf "\n"}' chr$i/$set.haps) |
    java -Xmx3g -jar $bgl/b4.r1196.jar ibd=true gt=/dev/stdin usephase=true burnin-its=0 phase-its=0 out=chr$i/$set

  # convert beagle output for use with DASH
  awk '{sub("\\\."," ",$1); sub("\\\."," ",$3); print $1"."$2-1,$3"."$4-1,$6,$7}' chr$i/$set.hbd chr$i/$set.ibd |
    $dsh/dash_cc chr$i/$set.fam chr$i/$set

  # create map file with IBD clusters
  cut -f1-5 chr$i/$set.clst | awk -v c=$i '{ print c,c"_"$1,$2,$3,$4,$5 }' > chr$i/$set.cmap

  # create map file with IBD clusters for use with plink
  awk -v c=$i '{ print c,$2,0,int(($3+$4)/2) }' chr$i/$set.cmap > chr$i/$set.map

  # convert output to plink binary format
  p-link --maf 0.001 --file chr$i/$set --make-bed --recode --out chr$i/$set
done

You can then join the results in a single plink binary file using the code discussed in a previous post:
# join separate autosome outputs into a single plink binary file
for i in {1..22}; do cat chr$i/$set.cmap; done > $set.dash.cmap
cp $set.fam $set.dash.fam
for i in {1..22}; do cat chr$i/$set.bim; done > $set.dash.bim
(echo -en "\x6C\x1B\x01"; for i in {1..22}; do tail -c +4 chr$i/$set.bed; done) > $set.dash.bed

Wednesday, October 30, 2013

Longest homopolymer in coding sequence in the human genome

Most homopolymer sequences are expected to be selected against in coding sequence, due to their higher mutation rate. So what is the longest homopolymer in coding sequence in the human genome? Here is a short script to answer this question.

First you will need a few programs:
sudo apt-get install wget samtools bedtools

And you will need a copy of the hg19 genome:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar xfO chromFa.tar.gz > hg19.fa
samtools faidx hg19.fa

Finally, generate a bed interval file with the coordinates of the coding exons and look for homopolymers within these:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
zcat refGene.txt.gz | awk '{split($10,a,","); split($11,b,","); for (i=1; i<length(a); i++) {from=$7>a[i]?$7:a[i]; to=$8<b[i]?$8:b[i]; if (to>from) print $3"\t"from"\t"to}}' | bedtools sort | bedtools merge > refGene.bed
bedtools getfasta -fi hg19.fa -bed refGene.bed -fo refGene.fa
for bp in A C G T; do hp=$(for i in {1..13}; do echo -n $bp; done); grep -B1 $hp refGene.fa; done

It turns out that the longest homopolymer in coding sequence is 13bp long and is within the CAMKK2 gene.

Friday, October 11, 2013

Merging plink binary files

If you have a bunch of plink binary files for the same set of people split over different chromosomes, how do you merge them? A simple answer would be to use the --merge-list option in plink. However, that requires some setup, and it turns out there is a much easier way.

Assuming that the order of the samples within the plink files is the same (i.e. the fam files are all identical) and that each plink binary file is in SNP-major format (this is important! see here: http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml) then the following code will just work:

cat plink.chr1.fam > plink.fam
for chr in {1..22} X Y; do cat plink.chr$chr.bim; done > plink.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X Y; do tail -c +4 plink.chr$chr.bed; done) > plink.bed

This assumes the plink files you want to merge have names like plink.chr1.bed, plink.chr2.bed, ..., plink.chrY.bed, and that each file corresponds to a different chromosome of your dataset. This is a typical scenario, especially if you are converting a big sequencing project delivered in multiple VCF files to plink format.

A plink binary file in SNP-major format is just a big table preceded by three bytes that identify the file type and the ordering (SNP-major or individual-major). By default plink generates SNP-major binary files, which are ordered by SNP, like VCF files, so you can simply concatenate them after removing the three-byte headers.
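
If you want to be extra careful, here is a minimal sanity check, using the same file names as in the example above and requiring xxd, that verifies each file starts with the SNP-major magic bytes before you concatenate:
# minimal sanity check (requires xxd): verify the three-byte SNP-major header 6c 1b 01
for chr in {1..22} X Y; do
  [ "$(head -c 3 plink.chr$chr.bed | xxd -p)" = "6c1b01" ] ||
    echo "plink.chr$chr.bed is not in SNP-major format" >&2
done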