Tuesday, November 25, 2014

Best practice for converting VCF files to plink format

Converting VCF files to plink format has never been easier. However, there are a few issues related to some intrinsic limitations of the plink format. The first is related to the fact that variants in a plink file are bi-allelic only, while variants in a VCF file can be multi-allelic. The second is related to an intrinsic limitation of plink which makes indel definitions ambiguous. Here is an example: is the following variant an insertion or a deletion compared to the GRCh37 reference?

20 31022441 A AG

There is no way to tell, as the plink format does not record this information.

Keeping this in mind, we are going to need two pieces of software for the conversion, bcftools and plink2. Here how to install them:

# create your personal binary directory
mkdir -p ~/bin/

# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/

# install latest version of plink2
wget https://www.cog-genomics.org/static/bin/plink/plink_linux_x86_64.zip
unzip -od ~/bin/ plink_linux_x86_64.zip plink

We are also going to need a copy of the GRCh37 reference. If you don't have this, it will take a while to download, but it can be done with the following commands:

wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta

The following command will take the VCF file, strip the variant IDs, split multi-allelic sites into bi-allelic sites, assign names to make sure indels will not become ambiguous, and finally convert to plink format:

bcftools norm -Ou -m -any input.vcf.gz |
  bcftools norm -Ou -f human_g1k_v37.fasta |
  bcftools annotate -Ob -x ID \
    -I +'%CHROM:%POS:%REF:%ALT' |
  plink --bcf /dev/stdin \
    --keep-allele-order \
    --vcf-idspace-to _ \
    --const-fid \
    --allow-extra-chr 0 \
    --split-x b37 no-fail \
    --make-bed \
    --out output

Usually dbSNP IDs are used for variants in VCF files, but in my opinion this is really a bad practice especially after splitting multi-allelic sites into bi-allelic sites (see here for a discussion).

The first command will split multi-allelic alleles so that a variant like the following:

20 31022441 AG AGG,A

Will become two variants as follows:

20 31022441 AG AGG
20 31022441 AG A

The second command will make sure that after having been split, indels are normalized so to have a unique representation. The above multi-allelic variant would then become:

20 31022441 A AG
20 31022441 AG A

Notice that plink will have a very hard time to distinguish between the two variants above, as they look alike once you forget which allele is the reference allele. The third command will assign a unique name to each bi-allelic variant:

20 31022441 20:31022441:A:AG A AG
20 31022441 20:31022441:AG:A AG A

The fourth command will convert the final output to plink binary format. At this point, you will have a plink binary file with unique names for each variant. You can test that this is the case with the following command:

cut -f2 output.bim | sort | uniq -c | awk '$1>1'

If the above command returns any output, then you still have duplicate IDs but this means your VCF file must be flawed to begin with. Plink will not like having multiple variants with the same ID.

You should now be able to use and to merge the file you have generated with other plink files generated from VCF files in the same way.

Your last worry is individuals' sex, as the VCF format, contrary to plink format, does not encode this information. If your VCF file contains enough information about the X chromosome, you should be able to assign the sex straight from genotype. This is a command that might do the trick:

plink --bfile output --impute-sex .3 .3 --make-bed --out output && rm output.{bed,bim,fam}~

However, the two thresholds you should use (.3 .3 in the above example) to separate males from females are going to be dependent on the particular type of data you are using (e.g. exome or whole genome) and might have to be selected ad hoc. See here for additional information.
A python wrapper to perform the above operations is available online (here).


  1. Thank you!! this is a great post freeseek! Incredibly needed nowadays!!
    I have not been able to solve the error I get in the last lines of your code here:

    bcftools annotate -Ob -x ID chr21.test.vcf

    which says:
    [W::vcf_parse] INFO 'NS' is not defined in the header, assuming Type=String
    Encountered error, cannot proceed. Please check the error output above.

    If you have any insight on how to go past this please let me know, thanks again!!

    1. It seems like your VCF file is malformed. You must have some "NS=..." in the INFO field of one of the variants but without having NS defined in the header of the VCF file. I believe bcftools will refuse to work with malformed VCF files.

  2. Ok thanks... I downloaded it from the MACH website here: http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G.2012-03-14.html
    Maybe this is an older version...?

    1. Their VCF files are malformed. You should report this issue with the maintainers of those files (not here) as other people might have the same problem.

  3. hi buddy i am getting this error
    --bcf: 481k variants complete.Error: Excessively long typed string in .bcf file.

  4. Hi freeseek,

    Thanks so much for the post. I tried the pipelined command, and got an error, seemingly related to indels, and wondered if you could comment:

    bcftools norm -Ou -m -any exomes1.vcf.gz > exomes1.norm.bcf
    Error: wrong number of fields in INFO/AF at 1:24180806, expected 2, found 3


    bcftools view -H exomes1.vcf.gz 1:24180806 | cut -c 1-100
    1 24180806 rs55699941 TA TAA,T 7970.41 PASS AC=21,18;AF=0.005,0.1238,0.1733;AN=136;BaseQRankSum=19.5

    Clearly, the error message is legit -- this is a badly formatted line. There are indeed 2 ALT alleles and 3 AF fields. Any idea why one would see this discrepancy, and how to fix it?



  5. I suppose you could change "AF,Number=A" to "AF,Number=." in the header. A (slow) way to do that could be: bcftools view exomes1.vcf.gz|sed 's/AF,Number=A/AF,Number=./'|bcftools norm -Ob -m -any -o exomes1.norm.bcf. But the correct thing to do here is to figure what software generated the VCF and report the bug upstream so that it gets fixed and other people don't have the same issue.

    1. Thanks for the hack! After further inspection, the simplest explanation is that the ALT field has two values but should have three. This is clear from the fact that there are 4 AD values, 10 PL values, and 3 AF values. Unfortunately the header isn't much help in tracing the origin of these files. At least, this problem only occurs in indel type calls. I'm not very familiar with the VCF format or the various tools so it'll likely take me awhile to figure this out. Anyway, thanks again for the help!

  6. Hi Giulio,

    I'm working with a vcf file generated from the Beagle IBD software, it consists of the phased genotypes across 6 samples and all columns are identical to for example to the 1000Genomes phased genotype vcfs. Though when I run the:

    "~/bin/bcftools norm -Ou -m -any phased_fbp.vcf.gz | ~/bin/bcftools norm -Ou -f /stornext/snfs5/ruichen/dmhg/fgi/evanj/PCA_Data/1KG/human_g1k_v37.fasta | ~/bin/bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' | ~/bin/plink --bcf /dev/stdin --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out fbp" command

    I get a error saying my .bcf file is improperly formatted:
    PLINK v1.90b3.42 64-bit (20 Sep 2016) https://www.cog-genomics.org/plink2
    (C) 2005-2016 Shaun Purcell, Christopher Chang GNU General Public License v3
    Logging to fbp.log.
    Options in effect:
    --allow-extra-chr 0
    --bcf /dev/stdin
    --out fbp
    --split-x b37 no-fail
    --vcf-idspace-to _

    7872 MB RAM detected; reserving 3936 MB for main workspace.
    Reference allele mismatch at 1:752566 .. REF_SEQ:'G' vs VCF:'A'
    Failed to open -: unknown file type
    Error: Improperly formatted .bcf file.
    Any idea on what I may have wrong? Does this mean my vcf file is somehow improperly formatted? Thanks

    1. This comment has been removed by the author.

    2. ##fileformat=VCFv4.2
      ##source="beagle.27Jul16.86a.jar (version 4.1)"
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT southAmerica_BR91 southAmerica_BR96 southAmerica_BR128 southAmerica_FBP500 southAmerica_FBP428 southAmerica_FBP335
      1 752566 rs3094315 A G . PASS . GT 1|0 0|0 0|0 0|0 1|0 0|0
      1 752721 rs3131972 G A . PASS . GT 1|0 0|0 1|0 0|0 1|0 0|0
      1 762320 exm2268640 G A . PASS . GT 0|0 0|0 1|0 0|0 0|0 0|0
      1 798959 rs11240777 G A . PASS . GT 1|1 0|0 0|1 0|1 1|0 0|0
      1 838555 rs4970383 C A . PASS . GT 0|0 1|0 0|0 0|0 0|1 1|1

  7. Evan, your VCF file is malformed. At position chr1:752,566 you have the reference allele as letter A, but in the /stornext/snfs5/ruichen/dmhg/fgi/evanj/PCA_Data/1KG/human_g1k_v37.fasta file you have the letter G at the same position. The bcftools norm command cannot run with such inconsistencies. Somehow you messed up the reference/alternate allele when you generated the VCF file. Malformed VCF files are not going to work. Use bcftools norm --check-ref to verify whether the VCF file is properly formatted with respect to the reference genome.