Tuesday, November 25, 2014

Best practice for converting VCF files to plink format

Converting VCF files to plink format has never been easier. However, there are a few issues related to some intrinsic limitations of the plink format. The first is related to the fact that variants in a plink file are bi-allelic only, while variants in a VCF file can be multi-allelic. The second is related to an intrinsic limitation of plink which makes indel definitions ambiguous. Here is an example: is the following variant an insertion or a deletion compared to the GRCh37 reference?

20 31022441 A AG

There is no way to tell, as the plink format does not record this information.

Keeping this in mind, we are going to need two pieces of software for the conversion, bcftools and plink2. Here how to install them:

# create your personal binary directory
mkdir -p ~/bin/

# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/

# install latest version of plink2
wget https://www.cog-genomics.org/static/bin/plink/plink_linux_x86_64.zip
unzip -od ~/bin/ plink_linux_x86_64.zip plink

We are also going to need a copy of the GRCh37 reference. If you don't have this, it will take a while to download, but it can be done with the following commands:

wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta

The following command will take the VCF file, strip the variant IDs, split multi-allelic sites into bi-allelic sites, assign names to make sure indels will not become ambiguous, and finally convert to plink format:

bcftools norm -Ou -m -any input.vcf.gz |
  bcftools norm -Ou -f human_g1k_v37.fasta |
  bcftools annotate -Ob -x ID \
    -I +'%CHROM:%POS:%REF:%ALT' |
  plink --bcf /dev/stdin \
    --keep-allele-order \
    --vcf-idspace-to _ \
    --const-fid \
    --allow-extra-chr 0 \
    --split-x b37 no-fail \
    --make-bed \
    --out output

Usually dbSNP IDs are used for variants in VCF files, but in my opinion this is really a bad practice especially after splitting multi-allelic sites into bi-allelic sites (see here for a discussion).

The first command will split multi-allelic alleles so that a variant like the following:

20 31022441 AG AGG,A

Will become two variants as follows:

20 31022441 AG AGG
20 31022441 AG A

The second command will make sure that after having been split, indels are normalized so to have a unique representation. The above multi-allelic variant would then become:

20 31022441 A AG
20 31022441 AG A

Notice that plink will have a very hard time to distinguish between the two variants above, as they look alike once you forget which allele is the reference allele. The third command will assign a unique name to each bi-allelic variant:

20 31022441 20:31022441:A:AG A AG
20 31022441 20:31022441:AG:A AG A

The fourth command will convert the final output to plink binary format. At this point, you will have a plink binary file with unique names for each variant. You can test that this is the case with the following command:

cut -f2 output.bim | sort | uniq -c | awk '$1>1'

If the above command returns any output, then you still have duplicate IDs but this means your VCF file must be flawed to begin with. Plink will not like having multiple variants with the same ID.

You should now be able to use and to merge the file you have generated with other plink files generated from VCF files in the same way.

Your last worry is individuals' sex, as the VCF format, contrary to plink format, does not encode this information. If your VCF file contains enough information about the X chromosome, you should be able to assign the sex straight from genotype. This is a command that might do the trick:


plink --bfile output --impute-sex .3 .3 --make-bed --out output && rm output.{bed,bim,fam}~


However, the two thresholds you should use (.3 .3 in the above example) to separate males from females are going to be dependent on the particular type of data you are using (e.g. exome or whole genome) and might have to be selected ad hoc. See here for additional information.
A python wrapper to perform the above operations is available online (here).

Tuesday, May 13, 2014

How do I identify all the homopolymer run indels in a VCF file?

With sequencing projects we analysts usually are left to deal with genotype data in VCF format. These files can contain genotypes for a variety of markers, from SNPs to structural variants. Small structural variants, commonly known as indels, are usually harder to call from sequencing data, and we tend to be wary of their genotypes. However, not all indels are born the same. Among indels there are a variety of different kind of indels, and depending on your project, you might be more or less interested in these. A form of particularly difficult indels to call are homopolymer run indels, that is, deletion or insertions within repeats of the same nucleotide. A very famous homopolymer run indel is the one causing medullary cystic kidney disease type 1 by disrupting the coding sequence within a VNTR of MUC1.

It might be important to identify in your VCF file which indel are homopolymer run indels, but this is not so straightforward. Also, this information cannot be extracted from the VCF file alone without the reference, so the information from two files needs to be integrated. I thought of a little hack to do this hastily without writing any serious amount of code using bedtools only. The following bash script should do it:
#!/bin/bash
input="$1" # input VCF file in any format accepted by bcftools
ref="$2" # reference genome in fasta format 
bcftools view -HG $input | cut -f1-5 | awk '$4!~"^[ACGT]($|,)" || $5!~"^[ACGT]($|,)"' |
  awk '{print $1"\t"$2+length($4)-1"\t"$2+length($4)+5"\t"$3}' |
  bedtools getfasta -name -fi $ref -bed /dev/stdin -fo /dev/stdout |
  awk 'NR%2==1 {printf substr($0,2)"\t"} NR%2==0' |
  grep "AAAAAA$\|CCCCCC$\|GGGGGG$\|TTTTTT$" | cut -f1 | sort
The script will take the VCF file and the reference as output and will yield the names of all markers identified as homopolymer run indels. Notice that it requires each marker in your VCF file to have its own unique ID. Also, the definition of homopolymer run indel is arbitrary here, consisting of the repeat of six base pairs. It should be easy to change the code if you want to tweak that.