Tuesday, November 25, 2014

Best practice for converting VCF files to plink format

Converting VCF files to plink format has never been easier. However, there are a few issues stemming from intrinsic limitations of the plink format. The first is that variants in a plink file are bi-allelic only, while variants in a VCF file can be multi-allelic. The second is that the plink format makes indel definitions ambiguous. Here is an example: is the following variant an insertion or a deletion compared to the GRCh37 reference?

20 31022441 A AG

There is no way to tell, as the plink format does not record this information.

Keeping this in mind, we are going to need two pieces of software for the conversion, bcftools and plink2. Here is how to install them:

# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
cp bcftools/bcftools ~/bin/

# install latest version of plink2
ver=$(wget -O- https://www.cog-genomics.org/plink2/ | grep plink_linux_x86_64.zip | cut -d/ -f4)
wget -qO plink_linux_x86_64.zip https://www.cog-genomics.org/static/bin/$ver/plink_linux_x86_64.zip
unzip -o plink_linux_x86_64.zip
cp plink ~/bin/
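
Before moving on, it is worth checking that both tools ended up in your PATH (this assumes ~/bin is in your PATH):

# check that the freshly installed binaries are the ones being picked up
bcftools --version | head -n1
plink --version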

We are also going to need a copy of the GRCh37 reference. If you don't have this, it will take a while to download, but it can be done with the following commands:

wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta

The following command will take the VCF file, strip the variant IDs, split multi-allelic sites into bi-allelic sites, assign names to make sure indels will not become ambiguous, and finally convert to plink format:

bcftools annotate -Ob -x ID input.vcf.gz |
  bcftools norm -Ob -m -any |
  bcftools norm -f human_g1k_v37.fasta |
  awk -F"\t" -v OFS="\t" '$3=="." {$3="chr"$1"_"$2"_b37_"$4"_"$5} {print}' |
  plink --vcf /dev/stdin --double-id --allow-extra-chr --split-x b37 --make-bed --out output

The first command removes any ID currently present in the VCF file. The second command will split multi-allelic sites into multiple bi-allelic records, so that a variant like the following:

20 31022441 AG AGG,A

will become two variants as follows:

20 31022441 AG AGG
20 31022441 AG A

The third command will make sure that, after having been split, indels are normalized so as to have a unique representation. The above multi-allelic variant would then become:

20 31022441 A AG
20 31022441 AG A

Notice that plink will have a very hard time distinguishing between the two variants above, as they look alike once you lose track of which allele is the reference allele. The fourth command will assign a unique name to each bi-allelic variant:

20 31022441 chr20_31022441_b37_A_AG A AG
20 31022441 chr20_31022441_b37_AG_A AG A

The fifth command will convert the final output to plink binary format. At this point, you will have a plink binary file with unique names for each variant. You can test that this is the case with the following command:

cut -f2 output.bim | sort | uniq -c | awk '$1>1'

If the above command returns any output, then you still have duplicate IDs, which means your VCF file must have contained duplicate records to begin with. Plink will not cope well with multiple variants sharing the same ID.
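
If you do end up with duplicates and prefer to drop them rather than fix the input VCF, something along these lines should work (output.dup and output.nodup are just hypothetical names):

# collect IDs that occur more than once and exclude every variant carrying them
cut -f2 output.bim | sort | uniq -d > output.dup
plink --bfile output --exclude output.dup --make-bed --out output.nodup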

You should now be able to use the file you have generated, and to merge it with other plink files generated from VCF files in the same way.
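
For example, assuming a second dataset converted the same way with prefix other (other and merged are hypothetical names here), the merge boils down to:

plink --bfile output --bmerge other.bed other.bim other.fam --make-bed --out merged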

Your last worry is individuals' sex, as the VCF format, unlike the plink format, does not encode this information. If your VCF file contains enough genotypes on the X chromosome, you should be able to assign sex straight from the genotypes. This is a command that might do the trick:

plink --bfile output --impute-sex .3 .4 --make-bed --out output.with.sex

However, the two thresholds you should use (.3 and .4 in the above example) to separate males from females depend on the particular type of data you are using (e.g. exome or whole genome) and might have to be selected ad hoc. See the plink documentation for --check-sex and --impute-sex for additional information.
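
One way to pick the thresholds is to first look at the distribution of X chromosome inbreeding coefficients computed by plink, and only then impute sex; a rough sketch:

# inspect the F column of the .sexcheck output to see where males and females separate
plink --bfile output --check-sex --out output
sort -k6,6g output.sexcheck | less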

Tuesday, May 13, 2014

How do I identify all the homopolymer run indels in a VCF file?

With sequencing projects, we analysts are usually left to deal with genotype data in VCF format. These files can contain genotypes for a variety of markers, from SNPs to structural variants. Small structural variants, commonly known as indels, are usually harder to call from sequencing data, and we tend to be wary of their genotypes. However, not all indels are born the same: there are many different kinds, and depending on your project, you might be more or less interested in each. A particularly difficult kind of indel to call is the homopolymer run indel, that is, a deletion or insertion within a repeat of the same nucleotide. A very famous homopolymer run indel is the one causing medullary cystic kidney disease type 1 by disrupting the coding sequence within a VNTR of MUC1.

It might be important to identify which indels in your VCF file are homopolymer run indels, but this is not so straightforward. Also, this information cannot be extracted from the VCF file alone without the reference, so the information from two files needs to be integrated. I thought of a little hack to do this hastily, without writing any serious amount of code, using only bcftools and bedtools. The following bash script should do it:
#!/bin/bash
input="$1" # input VCF file in any format accepted by bcftools
ref="$2" # reference genome in fasta format 
bcftools view -HG $input | cut -f1-5 | awk '$4!~"^[ACGT]($|,)" || $5!~"^[ACGT]($|,)"' | # keep only records where REF or ALT is not a single base, i.e. indels
  awk '{print $1"\t"$2+length($4)-1"\t"$2+length($4)+5"\t"$3}' | # 0-based BED interval spanning the six bases after the REF allele
  bedtools getfasta -name -fi $ref -bed /dev/stdin -fo /dev/stdout | # extract the corresponding reference sequence, named after the marker ID
  awk 'NR%2==1 {printf substr($0,2)"\t"} NR%2==0' | # join each fasta header with its sequence on one line
  grep "AAAAAA$\|CCCCCC$\|GGGGGG$\|TTTTTT$" | cut -f1 | sort # keep markers whose following six bases are a homopolymer
The script takes the VCF file and the reference as input and will yield the names of all markers identified as homopolymer run indels. Notice that it requires each marker in your VCF file to have its own unique ID. Also, the definition of homopolymer run indel used here is arbitrary, requiring a run of six identical base pairs. It should be easy to change the code if you want to tweak that.
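
Saved as, say, homopolymer.sh (the file names below are arbitrary), it can be run against the VCF file and reference used above:

chmod +x homopolymer.sh
./homopolymer.sh input.vcf.gz human_g1k_v37.fasta > homopolymer_indels.txt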

Monday, December 16, 2013

Estimating global ancestral components with 1000 Genomes

Following the last two posts, here I will show you how to use the PCA results after merging your dataset with the 1000 Genomes project dataset to estimate global ancestral components for the four main ancestries around the world, that is, European, African, Native American, and Asian.

Since there are no Native American samples in the 1000 Genomes project, we need the published global ancestry estimates for the admixed samples in the project (ASW, CLM, MXL, and PUR), which will be necessary to estimate this component. These estimates are available as part of the project, so we just need to download them:

anc="%DIRECTORY WITH ANCESTRAL COMPONENTS%"

mkdir -p $anc

# download global proportions table
(echo "IID EUR AFR NAT UNK";
for pop in ASW CLM MXL PUR; do
  wget -O- ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/ancestry_deconvolution/${pop}_phase1_globalproportions_and_unknown.txt
done) > $anc/gap
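
A quick look at the first few lines of the resulting table does not hurt before feeding it to the script below:

head -n3 $anc/gap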

The 1000 Genomes project individuals can then be used to train a linear model that estimates global ancestral components. If you followed the directory structure for the PCA as discussed in the previous post, then the following R script will take the three tables kgp/$set.kgp.pop, kgp/$set.kgp.pca, and $anc/gap as input and will output a fourth table, say kgp/$set.anc, with the global ancestry estimates.

#!/usr/bin/Rscript
args <- commandArgs(trailingOnly = TRUE)
# load and merge input tables
df1 <- read.table(args[1],header=TRUE)
df2 <- read.table(args[2],header=TRUE)
df3 <- read.table(args[3],header=TRUE)
data <- merge(df1,df2,all=TRUE,incomparables=NaN)
data <- merge(data,df3,all=TRUE,incomparables=NaN)

data$ASN <- NaN

# European populations
eur <- data$POP=="CEU" | data$POP=="FIN" | data$POP=="GBR" | data$POP=="IBS" | data$POP=="TSI"
data[which(eur),"EUR"] <- 1; data[which(eur),"AFR"] <- 0; data[which(eur),"NAT"] <- 0; data[which(eur),"ASN"] <- 0

# African populations
afr <- data$POP=="YRI"
data[which(afr),"EUR"] <- 0; data[which(afr),"AFR"] <- 1; data[which(afr),"NAT"] <- 0; data[which(afr),"ASN"] <- 0

# Asian populations
asn <- data$POP=="CDX" | data$POP=="CHB" | data$POP=="CHD" | data$POP=="CHS" | data$POP=="JPT"
data[which(asn),"EUR"] <- 0; data[which(asn),"AFR"] <- 0; data[which(asn),"NAT"] <- 0; data[which(asn),"ASN"] <- 1

# Admixed populations: redistribute the unknown component so that EUR, AFR, and NAT sum to 1
mix <- data$POP=="ASW" | data$POP=="CLM" | data$POP=="MXL" | data$POP=="PUR"
for (anc in c("EUR","AFR","NAT")) {
  data[which(mix),anc] <- data[which(mix),anc] * 1/(1-data[which(mix),"UNK"])
}
data[which(mix),"ASN"] <- 0

# African American samples to be excluded
exclude = c("NA20299","NA20314","NA20414","NA19625","NA19921")
for (anc in c("EUR","AFR","NAT","ASN")) {
  data[is.element(data$IID,exclude),anc] <- NaN
}

# predict global ancestral proportions
for (anc in c("EUR","AFR","NAT","ASN")) {
  tmp <- lm(as.formula(paste0(anc," ~ PCA1 + PCA2 + PCA3")), data)
  data[is.na(data[,anc]),anc] = predict.lm(tmp,data[is.na(data[,anc]),])
}

# output data to disk
write.table(data[,c("FID","IID","EUR","AFR","NAT","ASN")],args[4],row.names=FALSE,quote=FALSE)
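
Saved as, say, estimate_anc.R (an arbitrary name), the script can be run as:

Rscript estimate_anc.R kgp/$set.kgp.pop kgp/$set.kgp.pca $anc/gap kgp/$set.anc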

If you look into the code, you will see that some African American samples are excluded from the computation. This is because a few African American individuals in the 1000 Genomes project have a significant amount of Native American ancestry, but this does not show up in the global ancestry proportion tables, as it was not modeled there. It is therefore better to exclude those individuals from these computations.

Also, you might notice that, due to the inherent noise in the way the global ancestral components are modeled, some of the estimated components may come out negative. This is to be expected, as these are noisy estimates; make sure any further downstream analysis takes care of this eventuality. Overall, estimating ancestral components from principal components is a noisier business than estimating them from local ancestry deconvolutions, but for many purposes these estimates should work just fine.
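
If a downstream tool cannot handle negative values, a quick and dirty fix is to clamp them at zero, keeping in mind that the four components will then no longer sum exactly to one (kgp/$set.anc is the output table from the script above, kgp/$set.anc.clamped an arbitrary output name):

# clamp negative ancestry estimates to zero, leaving the header untouched
awk 'NR==1 {print; next} {for (i=3; i<=NF; i++) if ($i<0) $i=0; print}' kgp/$set.anc > kgp/$set.anc.clamped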

Friday, December 13, 2013

PCA of your plink dataset with 1000 Genomes

Estimating global ancestral components of your samples should really not be harder than running a principal component analysis (PCA). In this post, I will show you how you can do exactly that, using the 1000 Genomes project samples. You will first need your dataset and the 1000 Genomes dataset in plink format, as explained in the previous post.

As usual, you will need a few variables defined:
set="%PLINK DATASET PREFIX%"
pth="%PATH TO THE 1000 GENOMES PLINK DATASET%"
gct="%DIRECTORY WITH GCTA%"

It is a good idea to make sure that your dataset is properly cleaned and pruned before merging it with the 1000 Genomes project data. It is also important that SNPs be coded on the forward strand, as this is how the 1000 Genomes project data is coded. The following code will identify SNPs that are shared and compatible between the datasets, merge the individuals over these SNPs, and compute the genetic relationship matrix necessary for the PCA analysis:

mkdir -p kgp

# identify SNPs compatible with the merge
awk 'NR==FNR {chr[$2]=$1; pos[$2]=$4; a1[$2]=$5; a2[$2]=$6} NR>FNR && ($2 in chr) && chr[$2]==$1 && pos[$2]==$4 && (a1[$2]==$5 && a2[$2]==$6 || a1[$2]==$6 && a2[$2]==$5) {print $2}' $set.bim <(cat $(seq -f $pth/kgp.chr%g.bim 1 22) $pth/kgp.chrX.bim) > kgp/extract.txt

# extract SNPs from the 1000 Genomes dataset
for chr in {1..22} X; do plink --noweb --bfile $pth/kgp.chr$chr --extract kgp/extract.txt --make-bed --out kgp/kgp.chr$chr; done
/bin/cp kgp/kgp.chr1.fam kgp/kgp.fam
for chr in {1..22} X; do cat kgp/kgp.chr$chr.bim; done > kgp/kgp.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp/kgp.chr$chr.bed; done) > kgp/kgp.bed
for chr in {1..22} X; do /bin/rm kgp/kgp.chr$chr.bed kgp/kgp.chr$chr.bim kgp/kgp.chr$chr.fam kgp/kgp.chr$chr.log; done

# merge two datasets
plink --noweb --bfile $set --bmerge kgp/kgp.bed kgp/kgp.bim kgp/kgp.fam --make-bed --out kgp/$set.kgp

# create population categorical data
awk 'BEGIN {print "FID","IID","POP"} NR==FNR {pop[$2]=$7} !($2 in pop) {pop[$2]="SET"} NR>FNR {print $1,$2,pop[$2]}' $pth/20130606_g1k.ped kgp/$set.kgp.fam > kgp/$set.kgp.pop

# use GCTA to compute the genetic relationship matrix
$gct/gcta64 --bfile kgp/$set.kgp --autosome --make-grm-bin --out kgp/$set.kgp --thread-num 10

At this point we are ready to compute principal components.

# compute principal components using GCTA
$gct/gcta64 --grm-bin kgp/$set.kgp --pca 20 --out kgp/$set.kgp --thread-num 10

# create PCA matrix for use with R data.frame
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp/$set.kgp.eigenval | tr '\n' ' '; echo; cat kgp/$set.kgp.eigenvec) |
    awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
    NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp/$set.kgp.pca

The previous code should have created two files, kgp/$set.kgp.pop and kgp/$set.kgp.pca, which can be easily loaded into R with the following script, which takes the two files as parameters:

#!/usr/bin/Rscript
suppressPackageStartupMessages(library(ggplot2))
args <- commandArgs(trailingOnly = TRUE)
df1 <- read.table(args[1], header = TRUE)
df2 <- read.table(args[2], header = TRUE)
data <- merge(df1,df2)
# when run non-interactively the plot is written to Rplots.pdf; use ggsave() if you prefer a specific file name
ggplot(data,aes(x=PCA1,y=PCA2,colour=POP,shape=POP)) + geom_point(alpha=1/3) + scale_colour_discrete() + scale_shape_manual(values=1:length(levels(data[,"POP"])))

Notice that you will need R and ggplot2 already installed to run this script.
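
If the script above is saved as, say, plot_pca.R (an arbitrary name), it can be run on the two tables as follows, and the plot will end up in Rplots.pdf unless you add a ggsave() call:

Rscript plot_pca.R kgp/$set.kgp.pop kgp/$set.kgp.pca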

Thursday, December 12, 2013

1000 Genomes PCA analysis

The easiest way to run a PCA analysis with the 1000 Genomes samples is to download the data, convert it to plink format, and use GCTA to perform the bulk of the computation. The dataset provided on the beagle website is likely the easiest to start with. Here is some code that can get the work done.

First we need to setup a few variables:
gct="%DIRECTORY WITH GCTA%"
vcf="%DIRECTORY WITH KGP VCF FILES%"

And then we can download all necessary files:
# install plink and utilities to convert to plink format
sudo apt-get install plink
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tfam.sh
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tped.sh
chmod a+x vcf2tfam.sh vcf2tped.sh
# download the 1000 Genomes files
pfx="http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/chr"
sfx=".1kg.ref.phase1_release_v3.20101123.vcf.gz"
for chr in {1..22} X; do wget $pfx$chr$sfx $pfx$chr$sfx.tbi -P $vcf/; done
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

# install GCTA
wget http://www.complextraitgenomics.com/software/gcta/gcta_1.22.zip
mkdir -p $gct
unzip gcta_1.22.zip gcta64 -d $gct/

For each chromosome we now need to create a plink binary file:
# convert each VCF file to a plink binary file with sex information included
pfx="$vcf/chr"
for chr in {1..22} X; do
  tabix -H $pfx$chr$sfx | ./vcf2tfam.sh /dev/stdin > kgp.chr$chr.tfam
  tabix $pfx$chr$sfx $chr | ./vcf2tped.sh /dev/stdin > kgp.chr$chr.tped
  p-link --noweb --tfile kgp.chr$chr --make-bed --out kgp.chr$chr
  /bin/rm kgp.chr$chr.tfam kgp.chr$chr.tped kgp.chr$chr.nosex kgp.chr$chr.log
  awk 'NR==FNR {sex[$2]=$5} NR>FNR {$5=sex[$2]; print}' 20130606_g1k.ped kgp.chr$chr.fam > kgp.chr$chr.fam.tmp && /bin/mv kgp.chr$chr.fam.tmp kgp.chr$chr.fam
done

At this point we can use plink to prune each chromosome and then join the individual datasets into a single large dataset as discussed in a previous post:
# prune each chromosome
for chr in {1..22} X; do
  p-link --noweb --bfile kgp.chr$chr --maf .01 --indep 50 5 2 --out kgp.chr$chr
  p-link --noweb --bfile kgp.chr$chr --extract kgp.chr$chr.prune.in --make-bed --out kgp.chr$chr.prune
done

# join pruned datasets into single dataset
cat kgp.chr1.prune.fam > kgp.prune.fam
for chr in {1..22} X; do cat kgp.chr$chr.prune.bim; done > kgp.prune.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp.chr$chr.prune.bed; done) > kgp.prune.bed

We can now run GCTA to compute the genetic relationship matrix and then the main principal components:
# perform PCA analysis of the whole dataset
$gct/gcta64 --bfile kgp.prune --autosome --make-grm-bin --out kgp --thread-num 10
$gct/gcta64 --grm-bin kgp --pca 20 --out kgp --thread-num 10

# create PCA matrix to be loaded into R
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp.eigenval | tr '\n' ' '; echo; cat kgp.eigenvec) |
  awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
  NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp.pca

This will generate a table that can be easily loaded into R for further analyses.
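
As a quick sanity check, the table should contain one header line plus one line per sample in the pruned dataset:

wc -l kgp.prune.fam kgp.pca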

Monday, November 18, 2013

IBD pipeline

Segments inherited identical-by-descent (IBD) can be informative for a lot of interesting analyses. However, it takes some steps to set up a fully functional IBD pipeline. Here is a basic one that will work with any dataset in binary plink format. It will first phase the data with hapi-ur, then detect IBD segments using beagle, and finally detect clusters of IBD individuals using dash.

To begin with, you need to setup a few variables:
hpr="%DIRECTORY WITH HAPI-UR BINARY FILES%"
gmp="%DIRECTORY WITH GENETIC MAP FILES%"
bgl="%DIRECTORY WITH BEAGLE BINARY FILE%"
dsh="%DIRECTORY WITH DASH BINARY FILE%"
set="%PLINK DATASET PREFIX%"

And you will need a few programs pre-installed on your Linux machine:
# install plink
sudo apt-get install plink

# install hapi-ur
wget https://hapi-ur.googlecode.com/files/hapi-ur-1.01.tgz
tar xzvf hapi-ur-1.01.tgz
mkdir -p $hpr
/bin/cp hapi-ur-1.01/hapi-ur hapi-ur-1.01/insert-map.pl $hpr/

# install genetic maps
wget http://mathgen.stats.ox.ac.uk/impute/genetic_maps_b37.tgz
tar xzvf genetic_maps_b37.tgz
mkdir -p $gmp
for i in {1..22} X_PAR1 X_nonPAR X_PAR2; do gzip -c genetic_maps_b37/genetic_map_chr${i}_combined_b37.txt > $gmp/chr$i.gmap.gz; done

# install beagle
mkdir -p $bgl
wget http://faculty.washington.edu/browning/beagle/b4.r1196.jar -P $bgl/

# install dash
wget http://www1.cs.columbia.edu/~gusev/dash/dash-1-1.tar.gz
tar xzvf dash-1-1.tar.gz
mkdir -p $dsh
/bin/cp dash-1-1/bin/dash_cc $dsh/

The following pipeline will run the analysis independently on each autosome:
# perform IBD and DASH analysis separately for each autosome
for i in {1..22}; do
  # create a directory
  mkdir -p chr$i

  # extract autosome of interest
  p-link --noweb --bfile $set --chr $i --make-bed --out chr$i/$set

  # generate marker file with genetic distance
  zcat $gmp/chr$i.gmap.gz | $hpr/insert-map.pl chr$i/$set.bim - | awk -v OFS="\t" \
    'length($5)>1 || $6=="-" {$5="I"; $6="D"} $5=="-" || length($6)>1 {$5="D"; $6="I"} {print $1,$2,$3,$4,$5,$6}' > chr$i/$set.gmap.bim

  # clean up just in case
  /bin/rm -f chr$i/$set.sample chr$i/$set.haps

  # phase haplotypes using hapi-ur
  $hpr/hapi-ur -g chr$i/$set.bed -s chr$i/$set.gmap.bim -i chr$i/$set.fam -w 73 --impute -o chr$i/$set

  # convert hapi-ur output for use with beagle
  (echo "##fileformat=VCFv4.1";
  echo "##FORMAT=";
  echo -en "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t";
  awk 'NR>2 {printf $1"."$2"\t"}' chr$i/$set.sample | sed 's/\t$/\n/g';
  awk -v c=$i '{printf c"\t"$3"\t"$2"\t"$4"\t"$5"\t.\tPASS\t.\tGT";
    for (i=6; i<NF; i+=2) printf "\t"$i"|"$(i+1); printf "\n"}' chr$i/$set.haps) |
    java -Xmx3g -jar $bgl/b4.r1196.jar ibd=true gt=/dev/stdin usephase=true burnin-its=0 phase-its=0 out=chr$i/$set

  # convert beagle output for use with DASH
  awk '{sub("\\\."," ",$1); sub("\\\."," ",$3); print $1"."$2-1,$3"."$4-1,$6,$7}' chr$i/$set.hbd chr$i/$set.ibd |
    $dsh/dash_cc chr$i/$set.fam chr$i/$set

  # create map file with IBD clusters
  cut -f1-5 chr$i/$set.clst | awk -v c=$i '{ print c,c"_"$1,$2,$3,$4,$5 }' > chr$i/$set.cmap

  # create map file with IBD clusters for use with plink
  awk -v c=$i '{ print c,$2,0,int(($3+$4)/2) }' chr$i/$set.cmap > chr$i/$set.map

  # convert output to plink binary format
  p-link --maf 0.001 --file chr$i/$set --make-bed --recode --out chr$i/$set
done

You can then join the results in a single plink binary file using the code discussed in a previous post:
# join separate autosome outputs into a single plink binary file
for i in {1..22}; do cat chr$i/$set.cmap; done > $set.dash.cmap
cp $set.fam $set.dash.fam
for i in {1..22}; do cat chr$i/$set.bim; done > $set.dash.bim
(echo -en "\x6C\x1B\x01"; for i in {1..22}; do tail -c +4 chr$i/$set.bed; done) > $set.dash.bed
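
As one example of the kind of downstream analysis these clusters allow, the joined file can be fed directly to plink for a simple association test over IBD clusters (pheno.txt is a hypothetical phenotype file here):

p-link --noweb --bfile $set.dash --pheno pheno.txt --assoc --out $set.dash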

Wednesday, October 30, 2013

Longest homopolymer in coding sequence in the human genome

It is expected that most homopolymer sequences are negatively selected in coding sequence, due to their higher mutation rate. So what is the longest homopolymer in coding sequence in the human genome? Here is a short script to answer this question.

First you will need a few programs:
sudo apt-get install wget samtools bedtools

And you will need a copy of the hg19 genome:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar xfO chromFa.tar.gz > hg19.fa
samtools faidx hg19.fa

Finally, generate a bed interval file with the coordinates of the coding exons and look for homopolymers within these:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
zcat refGene.txt.gz | awk '{split($10,a,","); split($11,b,","); for (i=1; i<length(a); i++) {from=$7>a[i]?$7:a[i]; to=$8<b[i]?$8:b[i]; if (to>from) print $3"\t"from"\t"to}}' | bedtools sort | bedtools merge > refGene.bed
bedtools getfasta -fi hg19.fa -bed refGene.bed -fo refGene.fa
for bp in A C G T; do hp=$(for i in {1..13}; do echo -n $bp; done); grep -B1 $hp refGene.fa; done
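
If instead of testing a fixed 13 bp run you want to compute the length of the longest run directly, a small awk scan over the same fasta file can do it (like the grep above, it is case sensitive, so runs interrupted by soft-masked lowercase bases are not counted):

# track the length of the current run of identical bases and report the longest one seen
awk '!/^>/ {n=1; for (i=2; i<=length($0); i++) {if (substr($0,i,1)==substr($0,i-1,1)) n++; else n=1; if (n>max) max=n}} END {print max}' refGene.fa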

It turns out that the longest homopolymer in coding sequence is 13bp long and is within the CAMKK2 gene.