Tuesday, May 13, 2014

How do I identify all the homopolymer run indels in a VCF file?

With sequencing projects we analysts usually are left to deal with genotype data in VCF format. These files can contain genotypes for a variety of markers, from SNPs to structural variants. Small structural variants, commonly known as indels, are usually harder to call from sequencing data, and we tend to be wary of their genotypes. However, not all indels are born the same. Among indels there are a variety of different kind of indels, and depending on your project, you might be more or less interested in these. A form of particularly difficult indels to call are homopolymer run indels, that is, deletion or insertions within repeats of the same nucleotide. A very famous homopolymer run indel is the one causing medullary cystic kidney disease type 1 by disrupting the coding sequence within a VNTR of MUC1.

It might be important to identify in your VCF file which indel are homopolymer run indels, but this is not so straightforward. Also, this information cannot be extracted from the VCF file alone without the reference, so the information from two files needs to be integrated. I thought of a little hack to do this hastily without writing any serious amount of code using bedtools only. The following bash script should do it:
#!/bin/bash
input="$1" # input VCF file in any format accepted by bcftools
ref="$2" # reference genome in fasta format 
bcftools view -HG $input | cut -f1-5 | awk '$4!~"^[ACGT]($|,)" || $5!~"^[ACGT]($|,)"' |
  awk '{print $1"\t"$2+length($4)-1"\t"$2+length($4)+5"\t"$3}' |
  bedtools getfasta -name -fi $ref -bed /dev/stdin -fo /dev/stdout |
  awk 'NR%2==1 {printf substr($0,2)"\t"} NR%2==0' |
  grep "AAAAAA$\|CCCCCC$\|GGGGGG$\|TTTTTT$" | cut -f1 | sort
The script will take the VCF file and the reference as output and will yield the names of all markers identified as homopolymer run indels. Notice that it requires each marker in your VCF file to have its own unique ID. Also, the definition of homopolymer run indel is arbitrary here, consisting of the repeat of six base pairs. It should be easy to change the code if you want to tweak that.