We definitely need the reference genome sequence of the organism the VCF file refers to. However there is no tool to my knowledge that will take a VCF file and a fasta sequence and yield as output an annotated VCF file indicating which mutations are CpG mutations (it would be nice if someone wrote a bcftools pluging for this purporse). But there are tools that will annotate a VCF file given the presence of a mutation in another VCF file. Therefore one solution would be to have a VCF file with all possible CpG mutations. Here is how to do that:
# create your personal binary directory
mkdir -p ~/bin/
# download human genome reference (GRCh37)
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta &&
samtools faidx human_g1k_v37.fasta
# install latest version of bedtools
git clone https://github.com/arq5x/bedtools2.git &&
cd bedtools2 && make && cd .. &&
cp bedtools2/bin/bedtools ~/bin/
# create a VCF file with all CpG mutations (it takes several hours)
awk '{for (i=1; i<$2; i++)
print $1"\t"i-1"\t"i+1"\t"$1" "i}' human_g1k_v37.fasta.fai |
bedtools getfasta -name -fi human_g1k_v37.fasta \
-bed /dev/stdin -fo /dev/stdout |
tr '\n' ' ' | tr '>' '\n' | grep "CG $" |
awk -v OFS="\t" 'BEGIN {print "##fileformat=VCFv4.1";
print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"}
{print $1"\t"$2"\t.\tC\tT\t.\t.\t.";
print $1"\t"$2+1"\t.\tG\tA\t.\t.\t."}' |
bgzip > cpg.vcf.gz &&
tabix -f cpg.vcf.gz
To be completely thorough these mutations will be CpG mutations assuming that the reference sequence is the ancestral sequence which is not always the case. But for the majority of rare mutations it is a fair assumption. Furthermore, this code does not take into account whether a CpG mutation is located within a CpG island and this is also information that might be important as not all CpG sites are equally mutable.Now that we have our VCF file with all CpG mutations, we can use SnpSift to annotate our input VCF file. I selected SnpSift because it is quite fast and flexible compared to other tools like bcftools (see here). This can be achieved as follows:
# install latest version of snpEff/SnpSift
wget http://downloads.sourceforge.net/project/snpeff/snpEff_latest_core.zip &&
unzip snpEff_latest_core.zip
# annotate your VCF file with SnpSift
java -jar snpEff/SnpSift.jar annotate -exists CPG cpg.vcf.gz input.vcf.gz | bgzip > output.vcf.gz &&
tabix -f output.vcf.gz
If you want to extract from the VCF file only the variants which are CpG mutations, the following code will work:
# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/
# extract from VCF file all CpG mutations
bcftools view -Oz -i "CPG==1" output.vcf.gz -o output.cpg.vcf.gz &&
tabix -f output.cpg.vcf.gz