20 31022441
A AG
There is no way to tell, as the plink format does not record this information.
Keeping this in mind, we are going to need two pieces of software for the conversion, bcftools and plink2. Here how to install them:
# create your personal binary directory
mkdir -p ~/bin/
# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/
# install latest version of plink2
wget https://www.cog-genomics.org/static/bin/plink/plink_linux_x86_64.zip
unzip -od ~/bin/ plink_linux_x86_64.zip plink
We are also going to need a copy of the GRCh37 reference. If you don't have this, it will take a while to download, but it can be done with the following commands:
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta
The following command will take the VCF file, strip the variant IDs, split multi-allelic sites into bi-allelic sites, assign names to make sure indels will not become ambiguous, and finally convert to plink format:
bcftools norm -Ou -m -any input.vcf.gz |
bcftools norm -Ou -f human_g1k_v37.fasta |
bcftools annotate -Ob -x ID \
-I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out output
Usually dbSNP IDs are used for variants in VCF files, but in my opinion this is really a bad practice especially after splitting multi-allelic sites into bi-allelic sites (see here for a discussion).
The first command will split multi-allelic alleles so that a variant like the following:
20 31022441 AG AGG,A
Will become two variants as follows:
20 31022441 AG AGG
20 31022441 AG A
The second command will make sure that after having been split, indels are normalized so to have a unique representation. The above multi-allelic variant would then become:
20 31022441 A AG
20 31022441 AG A
Notice that plink will have a very hard time to distinguish between the two variants above, as they look alike once you forget which allele is the reference allele. The third command will assign a unique name to each bi-allelic variant:
20 31022441 20:31022441:A:AG A AG
20 31022441 20:31022441:AG:A AG A
The fourth command will convert the final output to plink binary format. At this point, you will have a plink binary file with unique names for each variant. You can test that this is the case with the following command:
cut -f2 output.bim | sort | uniq -c | awk '$1>1'
If the above command returns any output, then you still have duplicate IDs but this means your VCF file must be flawed to begin with. Plink will not like having multiple variants with the same ID.
You should now be able to use and to merge the file you have generated with other plink files generated from VCF files in the same way.
Your last worry is individuals' sex, as the VCF format, contrary to plink format, does not encode this information. If your VCF file contains enough information about the X chromosome, you should be able to assign the sex straight from genotype. This is a command that might do the trick:
plink --bfile output --impute-sex .3 .3 --make-bed --out output && rm output.{bed,bim,fam}~
However, the two thresholds you should use (.3 .3 in the above example) to separate males from females are going to be dependent on the particular type of data you are using (e.g. exome or whole genome) and might have to be selected ad hoc. See here for additional information.
A python wrapper to perform the above operations is available online (here).
Thank you!! this is a great post freeseek! Incredibly needed nowadays!!
ReplyDeleteI have not been able to solve the error I get in the last lines of your code here:
bcftools annotate -Ob -x ID chr21.test.vcf
which says:
[W::vcf_parse] INFO 'NS' is not defined in the header, assuming Type=String
Encountered error, cannot proceed. Please check the error output above.
If you have any insight on how to go past this please let me know, thanks again!!
It seems like your VCF file is malformed. You must have some "NS=..." in the INFO field of one of the variants but without having NS defined in the header of the VCF file. I believe bcftools will refuse to work with malformed VCF files.
DeleteOk thanks... I downloaded it from the MACH website here: http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G.2012-03-14.html
ReplyDeleteMaybe this is an older version...?
Their VCF files are malformed. You should report this issue with the maintainers of those files (not here) as other people might have the same problem.
Deletehi buddy i am getting this error
ReplyDelete--bcf: 481k variants complete.Error: Excessively long typed string in .bcf file.
Hi freeseek,
ReplyDeleteThanks so much for the post. I tried the pipelined command, and got an error, seemingly related to indels, and wondered if you could comment:
bcftools norm -Ou -m -any exomes1.vcf.gz > exomes1.norm.bcf
Error: wrong number of fields in INFO/AF at 1:24180806, expected 2, found 3
with:
bcftools view -H exomes1.vcf.gz 1:24180806 | cut -c 1-100
1 24180806 rs55699941 TA TAA,T 7970.41 PASS AC=21,18;AF=0.005,0.1238,0.1733;AN=136;BaseQRankSum=19.5
Clearly, the error message is legit -- this is a badly formatted line. There are indeed 2 ALT alleles and 3 AF fields. Any idea why one would see this discrepancy, and how to fix it?
Thanks,
Henry
I suppose you could change "AF,Number=A" to "AF,Number=." in the header. A (slow) way to do that could be: bcftools view exomes1.vcf.gz|sed 's/AF,Number=A/AF,Number=./'|bcftools norm -Ob -m -any -o exomes1.norm.bcf. But the correct thing to do here is to figure what software generated the VCF and report the bug upstream so that it gets fixed and other people don't have the same issue.
ReplyDeleteThanks for the hack! After further inspection, the simplest explanation is that the ALT field has two values but should have three. This is clear from the fact that there are 4 AD values, 10 PL values, and 3 AF values. Unfortunately the header isn't much help in tracing the origin of these files. At least, this problem only occurs in indel type calls. I'm not very familiar with the VCF format or the various tools so it'll likely take me awhile to figure this out. Anyway, thanks again for the help!
DeleteHi Giulio,
ReplyDeleteI'm working with a vcf file generated from the Beagle IBD software, it consists of the phased genotypes across 6 samples and all columns are identical to for example to the 1000Genomes phased genotype vcfs. Though when I run the:
"~/bin/bcftools norm -Ou -m -any phased_fbp.vcf.gz | ~/bin/bcftools norm -Ou -f /stornext/snfs5/ruichen/dmhg/fgi/evanj/PCA_Data/1KG/human_g1k_v37.fasta | ~/bin/bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' | ~/bin/plink --bcf /dev/stdin --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out fbp" command
I get a error saying my .bcf file is improperly formatted:
###############
PLINK v1.90b3.42 64-bit (20 Sep 2016) https://www.cog-genomics.org/plink2
(C) 2005-2016 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to fbp.log.
Options in effect:
--allow-extra-chr 0
--bcf /dev/stdin
--const-fid
--keep-allele-order
--make-bed
--out fbp
--split-x b37 no-fail
--vcf-idspace-to _
7872 MB RAM detected; reserving 3936 MB for main workspace.
Reference allele mismatch at 1:752566 .. REF_SEQ:'G' vs VCF:'A'
Failed to open -: unknown file type
Error: Improperly formatted .bcf file.
##########
Any idea on what I may have wrong? Does this mean my vcf file is somehow improperly formatted? Thanks
This comment has been removed by the author.
Delete##fileformat=VCFv4.2
Delete##filedate=20160822
##source="beagle.27Jul16.86a.jar (version 4.1)"
##INFO=
##INFO=
##INFO=
##INFO=
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT southAmerica_BR91 southAmerica_BR96 southAmerica_BR128 southAmerica_FBP500 southAmerica_FBP428 southAmerica_FBP335
1 752566 rs3094315 A G . PASS . GT 1|0 0|0 0|0 0|0 1|0 0|0
1 752721 rs3131972 G A . PASS . GT 1|0 0|0 1|0 0|0 1|0 0|0
1 762320 exm2268640 G A . PASS . GT 0|0 0|0 1|0 0|0 0|0 0|0
1 798959 rs11240777 G A . PASS . GT 1|1 0|0 0|1 0|1 1|0 0|0
1 838555 rs4970383 C A . PASS . GT 0|0 1|0 0|0 0|0 0|1 1|1
Evan, your VCF file is malformed. At position chr1:752,566 you have the reference allele as letter A, but in the /stornext/snfs5/ruichen/dmhg/fgi/evanj/PCA_Data/1KG/human_g1k_v37.fasta file you have the letter G at the same position. The bcftools norm command cannot run with such inconsistencies. Somehow you messed up the reference/alternate allele when you generated the VCF file. Malformed VCF files are not going to work. Use bcftools norm --check-ref to verify whether the VCF file is properly formatted with respect to the reference genome.
ReplyDeleteHi freeseek,
ReplyDeleteI am using the same files as you have used. But , bcftools norm -Ou -m -any input.vcf.gz |
bcftools norm -Ou -f human_g1k_v37.fasta |
bcftools annotate -Ob -x ID \
-I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out output
Doesn't work for me and I get the error : Improperly formatted bcf. I am new to genomics and I find difficult to get the principal components for 1000 genomes
It seems like your BCF file is malformed. I advise you to discuss your problem with the person that gave you the file. Another good way to start is to get familiar with "bcftools norm --check-ref" to make sure your VCF file is good. I have seen many examples of VCF files not following the standard.
ReplyDelete"# install latest version of bcftools
ReplyDeletegit clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/"
Doesn't give the most recent version of bcftools, which led to an error for me.
BCFtools 1.6 is available for download here: http://www.htslib.org/download/
Hi, I used some of your code for converting to plink however the formatting is strange. Do you know how I could re-format the plink file so each thing is in its own column and space-delimited? Thanks!
ReplyDelete22 22:51237071:G:A;rs188271599 0 51237071 A G
22 22:51237136:G:A;rs570845638 0 51237136 A G
22 22:51237137:A:C;rs534739169 0 51237137 C A
22 22:51237486:G:C;rs149820726 0 51237486 C G
22 22:51238275:C:T;rs202031343 0 51238275 T C
22 22:51238307:A:G;rs577028785 0 51238307 G A
I am not sure what you mean. Variants IDs in plink are not allowed to contain spaces
Deleteso the chromosome:position:ref allelele:alternate allele:rsid are all one column... Do you know it can be separated out so each thing is a different column? Thanks!
DeleteI tried your code with bcftools using the files I downloaded from the Michigan imputation server. I had performed quality control before submitting them to be imputed. These are the .dose.vcf.gz files. Is there a step needed before using it as input in the bcftools as I have followed your code and it says that Failed to open -: not compressed with bgzip. I had bgzipped it before submitting it to the server? I would really appreciate your advice! Thank you.
DeleteThe plink bim files specification requires 6 columns. If you had more columns that would break the specification.a If bcftools is telling you that the file is not compressed with bgzip, then it means it was not. This must be a problem with the Michigan imputation server
Deletecode used
ReplyDeletejava -jar snpEff/SnpSift.jar annotate All_20170710.vcf.gz chr22.dose.vcf.gz | plink --vcf /dev/stdin --keep-allele-order --double-id --make-bed --out output22