genetics for fun: Best practice for converting VCF files to plink format

Tuesday, November 25, 2014

Converting VCF files to plink format has never been easier. However, two intrinsic limitations of the plink format cause trouble. First, variants in a plink file are bi-allelic only, while variants in a VCF file can be multi-allelic. Second, the plink format does not record which allele is the reference, which makes indel definitions ambiguous. Here is an example: is the following variant an insertion or a deletion compared to the GRCh37 reference?


20 31022441 A AG

There is no way to tell, as the plink format does not record this information.
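
To make the ambiguity concrete, here is a minimal shell sketch. The helper name classify_indel is made up for this illustration and is not part of plink or bcftools. It classifies a variant from the lengths of the two alleles, which only works when you are told which allele is the reference, as a VCF line does:

```shell
# classify_indel REF ALT
# Hypothetical helper: compares allele lengths to name the variant type.
classify_indel() {
  ref=$1
  alt=$2
  if [ "${#alt}" -gt "${#ref}" ]; then
    echo insertion
  elif [ "${#alt}" -lt "${#ref}" ]; then
    echo deletion
  else
    echo substitution
  fi
}

classify_indel A AG   # reference A, alternate AG: prints "insertion"
classify_indel AG A   # same two alleles, roles swapped: prints "deletion"
```

The same pair of alleles gives opposite answers depending on which one is treated as the reference, and that is exactly the piece of information that gets lost.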

Keeping this in mind, we are going to need two pieces of software for the conversion, bcftools and plink2 (that is, PLINK 1.9, which is what the commands below were written for). Here is how to install them:


# create your personal binary directory
mkdir -p ~/bin/

# install latest version of bcftools
# GitHub no longer serves the unauthenticated git:// protocol, so clone over https
git clone --branch=develop https://github.com/samtools/bcftools.git
git clone --branch=develop https://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/

# install latest version of plink2
wget https://www.cog-genomics.org/static/bin/plink/plink_linux_x86_64.zip
unzip -od ~/bin/ plink_linux_x86_64.zip plink

We are also going to need a copy of the GRCh37 reference. If you don't have one, it will take a while to download, but it can be done with the following commands (the indexing step uses samtools, which is not installed by the commands above):


wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta

The following command will take the VCF file, split multi-allelic sites into bi-allelic sites, normalize indels against the reference, replace the variant IDs with unique names so that indels do not become ambiguous, and finally convert to plink format:


bcftools norm -Ou -m -any input.vcf.gz |
  bcftools norm -Ou -f human_g1k_v37.fasta |
  bcftools annotate -Ob -x ID \
    -I +'%CHROM:%POS:%REF:%ALT' |
  plink --bcf /dev/stdin \
    --keep-allele-order \
    --vcf-idspace-to _ \
    --const-fid \
    --allow-extra-chr 0 \
    --split-x b37 no-fail \
    --make-bed \
    --out output

Variants in VCF files usually carry dbSNP IDs, but in my opinion this is bad practice, especially after splitting multi-allelic sites into bi-allelic sites, because every record split from the same site inherits the same ID (see here for a discussion).

The first command will split multi-allelic sites, so that a variant like the following:


20 31022441 AG AGG,A

will become two variants as follows:


20 31022441 AG AGG
20 31022441 AG A
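
As a toy illustration of what the split does to the allele columns, here is a sketch only: the function name split_multiallelic is invented here, it reads bare CHROM POS REF ALT lines, and unlike bcftools norm it does not touch genotypes or INFO fields.

```shell
# split_multiallelic: reads whitespace-separated "CHROM POS REF ALT" lines on
# stdin and prints one line per ALT allele. Hypothetical helper, not bcftools.
split_multiallelic() {
  awk '{
    n = split($4, alts, ",")        # ALT alleles are comma-separated
    for (i = 1; i <= n; i++)
      print $1, $2, $3, alts[i]     # same site, one ALT allele per line
  }'
}

printf '20 31022441 AG AGG,A\n' | split_multiallelic
```

Run on the example site, this prints the two bi-allelic lines shown above.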

The second command will make sure that, after the split, indels are normalized so as to have a unique representation. The two variants above would then become:


20 31022441 A AG
20 31022441 AG A
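
The trimming part of this step can be sketched in a few lines of awk. This is only an approximation under a loud assumption: it drops bases shared at the right end of REF and ALT and nothing more, whereas bcftools norm also left-aligns indels using the reference sequence. The function name trim_shared_suffix is made up for the example.

```shell
# trim_shared_suffix: reads "CHROM POS REF ALT" lines on stdin and removes
# trailing bases common to REF and ALT. Hypothetical helper; it does NOT
# left-align against a reference genome the way bcftools norm -f does.
trim_shared_suffix() {
  awk '{
    ref = $3; alt = $4
    # drop the last base while both alleles keep at least one base and end alike
    while (length(ref) > 1 && length(alt) > 1 &&
           substr(ref, length(ref)) == substr(alt, length(alt))) {
      ref = substr(ref, 1, length(ref) - 1)
      alt = substr(alt, 1, length(alt) - 1)
    }
    print $1, $2, ref, alt
  }'
}

printf '20 31022441 AG AGG\n20 31022441 AG A\n' | trim_shared_suffix
```

On the two split variants this reproduces the pair listed above: the first loses its shared trailing G, the second is already minimal.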

Notice that plink will have a very hard time distinguishing between the two variants above, as they look alike once you forget which allele is the reference allele. The third command will assign a unique name to each bi-allelic variant:


20 31022441 20:31022441:A:AG A AG
20 31022441 20:31022441:AG:A AG A
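
The naming scheme itself is simple enough to reproduce by hand. A sketch, again on bare CHROM POS REF ALT lines and with a made-up function name (add_variant_id):

```shell
# add_variant_id: reads "CHROM POS REF ALT" lines on stdin and inserts a
# CHROM:POS:REF:ALT identifier as the third column. Hypothetical helper.
add_variant_id() {
  awk '{ print $1, $2, $1 ":" $2 ":" $3 ":" $4, $3, $4 }'
}

printf '20 31022441 A AG\n20 31022441 AG A\n' | add_variant_id
```

Because the reference allele always comes first in the name, the insertion and the deletion end up with different IDs even though they involve the same two alleles.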

The fourth command will convert the final output to plink binary format. At this point, you will have a plink binary file with unique names for each variant. You can test that this is the case with the following command:


cut -f2 output.bim | sort | uniq -c | awk '$1>1'

If the above command returns any output, then you still have duplicate IDs, which means your VCF file was flawed to begin with (for example, it listed the same variant more than once). Plink will not like having multiple variants with the same ID.
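
If the check does report duplicates, it helps to see the full records rather than just the counts. A sketch that prints every line of a .bim file whose ID (column 2) occurs more than once; the function name list_duplicate_ids is an invention for this example:

```shell
# list_duplicate_ids FILE.bim
# Hypothetical helper: reads the file twice. The first pass counts each
# variant ID, the second pass prints the lines whose ID was seen repeatedly.
list_duplicate_ids() {
  awk 'NR == FNR { count[$2]++; next } count[$2] > 1' "$1" "$1"
}

# usage: list_duplicate_ids output.bim
```

The printed positions and alleles should make it easier to track the offending records down in the original VCF file.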

You should now be able to use the file you have generated and to merge it with other plink files generated from VCF files in the same way.

Your last worry is the sex of the individuals, since the VCF format, unlike the plink format, does not encode this information. If your VCF file contains enough information about the X chromosome, you should be able to assign sex directly from the genotypes. This is a command that might do the trick:

plink --bfile output --impute-sex .3 .3 --make-bed --out output && rm output.{bed,bim,fam}~

However, the two thresholds you should use (.3 .3 in the above example) to separate males from females depend on the type of data you are using (e.g. exome or whole genome) and might have to be selected ad hoc. See here for additional information.
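
One way to pick the thresholds is to look at the F estimates before imputing anything. The sketch below assumes you have first run plink --bfile output --check-sex --out output, which writes output.sexcheck with the X chromosome inbreeding coefficient F in the sixth column. The function name f_summary and its two cutoff arguments are made up for this example, and it assumes an F that could not be estimated is written as nan.

```shell
# f_summary FILE.sexcheck FEMALE_MAX MALE_MIN
# Hypothetical helper: counts how many individuals fall below the first
# cutoff, above the second cutoff, or in neither group.
f_summary() {
  awk -v lo="$2" -v hi="$3" '
    NR == 1     { next }                  # skip the header line
    $6 == "nan" { unknown++; next }       # assumed marker for a missing F
    $6 < lo     { female++; next }
    $6 > hi     { male++; next }
                { between++ }
    END {
      printf "female=%d male=%d between=%d unknown=%d\n",
             female, male, between, unknown
    }' "$1"
}

# usage: f_summary output.sexcheck .3 .3
```

If many individuals land in the between or unknown groups, the cutoffs (or the amount of X chromosome data) probably need a second look before trusting the imputed sex.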
A Python wrapper to perform the above operations is available online (here).

21 comments:

francy, December 11, 2014 at 4:14 AM
Thank you!! this is a great post freeseek! Incredibly needed nowadays!!
I have not been able to solve the error I get in the last lines of your code here:

bcftools annotate -Ob -x ID chr21.test.vcf

which says:
[W::vcf_parse] INFO 'NS' is not defined in the header, assuming Type=String
Encountered error, cannot proceed. Please check the error output above.

If you have any insight on how to go past this please let me know, thanks again!!
francy, December 11, 2014 at 8:23 AM
Ok thanks... I downloaded it from the MACH website here: http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G.2012-03-14.html
Maybe this is an older version...?
Unknown, February 29, 2016 at 2:12 AM
Hi buddy, I am getting this error:
--bcf: 481k variants complete.
Error: Excessively long typed string in .bcf file.
Henry, June 22, 2016 at 11:54 AM
Hi freeseek,

Thanks so much for the post. I tried the pipelined command, and got an error, seemingly related to indels, and wondered if you could comment:

bcftools norm -Ou -m -any exomes1.vcf.gz > exomes1.norm.bcf
Error: wrong number of fields in INFO/AF at 1:24180806, expected 2, found 3

with:

bcftools view -H exomes1.vcf.gz 1:24180806 | cut -c 1-100
1 24180806 rs55699941 TA TAA,T 7970.41 PASS AC=21,18;AF=0.005,0.1238,0.1733;AN=136;BaseQRankSum=19.5

Clearly, the error message is legit -- this is a badly formatted line. There are indeed 2 ALT alleles and 3 AF fields. Any idea why one would see this discrepancy, and how to fix it?

Thanks,

Henry
freeseek, June 22, 2016 at 1:19 PM
I suppose you could change "AF,Number=A" to "AF,Number=." in the header. A (slow) way to do that could be: bcftools view exomes1.vcf.gz|sed 's/AF,Number=A/AF,Number=./'|bcftools norm -Ob -m -any -o exomes1.norm.bcf. But the correct thing to do here is to figure what software generated the VCF and report the bug upstream so that it gets fixed and other people don't have the same issue.
Anonymous, October 12, 2016 at 5:07 PM
Hi Giulio,

I'm working with a VCF file generated by the Beagle IBD software. It consists of the phased genotypes across 6 samples, and all columns are identical to those of, for example, the 1000 Genomes phased genotype VCFs. However, when I run the command:

~/bin/bcftools norm -Ou -m -any phased_fbp.vcf.gz | ~/bin/bcftools norm -Ou -f /stornext/snfs5/ruichen/dmhg/fgi/evanj/PCA_Data/1KG/human_g1k_v37.fasta | ~/bin/bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' | ~/bin/plink --bcf /dev/stdin --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out fbp

I get an error saying my .bcf file is improperly formatted:
###############
PLINK v1.90b3.42 64-bit (20 Sep 2016) https://www.cog-genomics.org/plink2
(C) 2005-2016 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to fbp.log.
Options in effect:
--allow-extra-chr 0
--bcf /dev/stdin
--const-fid
--keep-allele-order
--make-bed
--out fbp
--split-x b37 no-fail
--vcf-idspace-to _

7872 MB RAM detected; reserving 3936 MB for main workspace.
Reference allele mismatch at 1:752566 .. REF_SEQ:'G' vs VCF:'A'
Failed to open -: unknown file type
Error: Improperly formatted .bcf file.
##########
Any idea what I may have done wrong? Does this mean my VCF file is somehow improperly formatted? Thanks
freeseek, October 13, 2016 at 7:36 AM
Evan, your VCF file is malformed. At position chr1:752,566 you have the reference allele as letter A, but in the /stornext/snfs5/ruichen/dmhg/fgi/evanj/PCA_Data/1KG/human_g1k_v37.fasta file you have the letter G at the same position. The bcftools norm command cannot run with such inconsistencies. Somehow you messed up the reference/alternate allele when you generated the VCF file. Malformed VCF files are not going to work. Use bcftools norm --check-ref to verify whether the VCF file is properly formatted with respect to the reference genome.
Aadhira, September 5, 2017 at 2:42 AM
Hi freeseek,
I am using the same files as you have used. But bcftools norm -Ou -m -any input.vcf.gz |
bcftools norm -Ou -f human_g1k_v37.fasta |
bcftools annotate -Ob -x ID \
-I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out output

doesn't work for me and I get the error: Improperly formatted bcf. I am new to genomics and I find it difficult to get the principal components for 1000 Genomes.
freeseek, September 5, 2017 at 8:14 AM
It seems like your BCF file is malformed. I advise you to discuss your problem with the person that gave you the file. Another good way to start is to get familiar with "bcftools norm --check-ref" to make sure your VCF file is good. I have seen many examples of VCF files not following the standard.
s, January 8, 2018 at 4:07 AM
"# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/"
Doesn't give the most recent version of bcftools, which led to an error for me.
BCFtools 1.6 is available for download here: http://www.htslib.org/download/
Lisa, July 12, 2020 at 2:59 PM
Hi, I used some of your code for converting to plink; however, the formatting is strange. Do you know how I could re-format the plink file so each thing is in its own column and space-delimited? Thanks!

22 22:51237071:G:A;rs188271599 0 51237071 A G
22 22:51237136:G:A;rs570845638 0 51237136 A G
22 22:51237137:A:C;rs534739169 0 51237137 C A
22 22:51237486:G:C;rs149820726 0 51237486 C G
22 22:51238275:C:T;rs202031343 0 51238275 T C
22 22:51238307:A:G;rs577028785 0 51238307 G A
Lisa, July 12, 2020 at 3:00 PM
Code used:
java -jar snpEff/SnpSift.jar annotate All_20170710.vcf.gz chr22.dose.vcf.gz | plink --vcf /dev/stdin --keep-allele-order --double-id --make-bed --out output22