Comments on genetics for fun: Best practice for converting VCF files to plink format
21 comments, feed updated 2023-04-26T06:22:39.785-07:00

freeseek (2020-07-13T08:26:19.556-07:00):
The plink bim file specification requires 6 columns. If you had more columns, that would break the specification. If bcftools is telling you that the file is not compressed with bgzip, then it means it was not. This must be a problem with the Michigan imputation server.

Lisa (2020-07-13T07:49:57.510-07:00):
I tried your code with bcftools using the files I downloaded from the Michigan imputation server. I had performed quality control before submitting them to be imputed. These are the .dose.vcf.gz files. Is there a step needed before using them as input to bcftools? I followed your code and it says "Failed to open -: not compressed with bgzip". I had bgzipped the files before submitting them to the server. I would really appreciate your advice! Thank you.

Lisa (2020-07-13T03:29:36.853-07:00):
So the chromosome:position:ref allele:alternate allele:rsid are all in one column... Do you know how it can be separated out so each thing is a different column?
Thanks!

freeseek (2020-07-12T17:06:55.181-07:00):
I am not sure what you mean. Variant IDs in plink are not allowed to contain spaces.

Lisa (2020-07-12T15:00:44.416-07:00):
Code used:

    java -jar snpEff/SnpSift.jar annotate All_20170710.vcf.gz chr22.dose.vcf.gz | plink --vcf /dev/stdin --keep-allele-order --double-id --make-bed --out output22

Lisa (2020-07-12T14:59:54.363-07:00):
Hi, I used some of your code for converting to plink; however, the formatting is strange. Do you know how I could re-format the plink file so each thing is in its own column and space-delimited? Thanks!

    22 22:51237071:G:A;rs188271599 0 51237071 A G
    22 22:51237136:G:A;rs570845638 0 51237136 A G
    22 22:51237137:A:C;rs534739169 0 51237137 C A
    22 22:51237486:G:C;rs149820726 0 51237486 C G
    22 22:51238275:C:T;rs202031343 0 51238275 T C
    22 22:51238307:A:G;rs577028785 0 51238307 G A

s (2018-01-08T04:07:47.007-08:00):
"# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/"
This doesn't give the most recent version of bcftools, which led to an error for me.
BCFtools 1.6 is available for download here: http://www.htslib.org/download/

freeseek (2017-09-05T08:14:57.882-07:00):
It seems like your BCF file is malformed. I advise you to discuss your problem with the person who gave you the file. Another good way to start is to get familiar with "bcftools norm --check-ref" to make sure your VCF file is good. I have seen many examples of VCF files not following the standard.

Aadhira (2017-09-05T02:42:36.085-07:00):
Hi freeseek,
I am using the same files as you have used. But this:

    bcftools norm -Ou -m -any input.vcf.gz |
      bcftools norm -Ou -f human_g1k_v37.fasta |
      bcftools annotate -Ob -x ID \
        -I +'%CHROM:%POS:%REF:%ALT' |
      plink --bcf /dev/stdin \
        --keep-allele-order \
        --vcf-idspace-to _ \
        --const-fid \
        --allow-extra-chr 0 \
        --split-x b37 no-fail \
        --make-bed \
        --out output

doesn't work for me and I get the error: Improperly formatted bcf. I am new to genomics and I find it difficult to get the principal components for 1000 Genomes.

freeseek (2016-10-13T07:36:18.136-07:00):
Evan, your VCF file is malformed. At position chr1:752,566 you have the reference allele as letter A, but in the /stornext/snfs5/ruichen/dmhg/fgi/evanj/PCA_Data/1KG/human_g1k_v37.fasta file you have the letter G at the same position. The bcftools norm command cannot run with such inconsistencies. Somehow you messed up the reference/alternate allele when you generated the VCF file. Malformed VCF files are not going to work. Use bcftools norm --check-ref to verify whether the VCF file is properly formatted with respect to the reference genome.

Anonymous (2016-10-12T17:21:05.338-07:00):

    ##fileformat=VCFv4.2
    ##filedate=20160822
    ##source="beagle.27Jul16.86a.jar (version 4.1)"
    ##INFO=
    ##INFO=
    ##INFO=
    ##INFO=
    ##FORMAT=
    ##FORMAT=
    ##FORMAT=
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT southAmerica_BR91 southAmerica_BR96 southAmerica_BR128 southAmerica_FBP500 southAmerica_FBP428 southAmerica_FBP335
    1 752566 rs3094315 A G . PASS . GT 1|0 0|0 0|0 0|0 1|0 0|0
    1 752721 rs3131972 G A . PASS . GT 1|0 0|0 1|0 0|0 1|0 0|0
    1 762320 exm2268640 G A . PASS . GT 0|0 0|0 1|0 0|0 0|0 0|0
    1 798959 rs11240777 G A . PASS . GT 1|1 0|0 0|1 0|1 1|0 0|0
    1 838555 rs4970383 C A . PASS . GT 0|0 1|0 0|0 0|0 0|1 1|1

Anonymous (2016-10-12T17:19:18.352-07:00):
This comment has been removed by the author.

Anonymous (2016-10-12T17:07:18.826-07:00):
Hi Giulio,

I'm working with a VCF file generated by the Beagle IBD software. It consists of the phased genotypes across 6 samples, and all columns are identical to, for example, the 1000 Genomes phased genotype VCFs. However, when I run the command:

"~/bin/bcftools norm -Ou -m -any phased_fbp.vcf.gz | ~/bin/bcftools norm -Ou -f /stornext/snfs5/ruichen/dmhg/fgi/evanj/PCA_Data/1KG/human_g1k_v37.fasta | ~/bin/bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' | ~/bin/plink --bcf /dev/stdin --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out fbp"

I get an error saying my .bcf file is improperly formatted:
###############
PLINK v1.90b3.42 64-bit (20 Sep 2016) https://www.cog-genomics.org/plink2
(C) 2005-2016 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to fbp.log.
Options in effect:
 --allow-extra-chr 0
 --bcf /dev/stdin
 --const-fid
 --keep-allele-order
 --make-bed
 --out fbp
 --split-x b37 no-fail
 --vcf-idspace-to _

7872 MB RAM detected; reserving 3936 MB for main workspace.
Reference allele mismatch at 1:752566 .. REF_SEQ:'G' vs VCF:'A'
Failed to open -: unknown file type
Error: Improperly formatted .bcf file.
##########
Any idea what I may have wrong? Does this mean my VCF file is somehow improperly formatted? Thanks

Henry (2016-06-22T15:36:30.978-07:00):
Thanks for the hack! After further inspection, the simplest explanation is that the ALT field has two values but should have three. This is clear from the fact that there are 4 AD values, 10 PL values, and 3 AF values.
Unfortunately the header isn't much help in tracing the origin of these files. At least this problem only occurs in indel-type calls. I'm not very familiar with the VCF format or the various tools, so it'll likely take me a while to figure this out. Anyway, thanks again for the help!

freeseek (2016-06-22T13:19:16.951-07:00):
I suppose you could change "AF,Number=A" to "AF,Number=." in the header. A (slow) way to do that could be: bcftools view exomes1.vcf.gz|sed 's/AF,Number=A/AF,Number=./'|bcftools norm -Ob -m -any -o exomes1.norm.bcf. But the correct thing to do here is to figure out what software generated the VCF and report the bug upstream so that it gets fixed and other people don't have the same issue.

Henry (2016-06-22T11:54:10.597-07:00):
Hi freeseek,

Thanks so much for the post. I tried the pipelined command and got an error, seemingly related to indels, and wondered if you could comment:

    bcftools norm -Ou -m -any exomes1.vcf.gz > exomes1.norm.bcf
    Error: wrong number of fields in INFO/AF at 1:24180806, expected 2, found 3

with:

    bcftools view -H exomes1.vcf.gz 1:24180806 | cut -c 1-100
    1 24180806 rs55699941 TA TAA,T 7970.41 PASS AC=21,18;AF=0.005,0.1238,0.1733;AN=136;BaseQRankSum=19.5

Clearly, the error message is legit: this is a badly formatted line. There are indeed 2 ALT alleles and 3 AF fields. Any idea why one would see this discrepancy, and how to fix it?

Thanks,

Henry

Anonymous (2016-02-29T02:12:33.592-08:00):
Hi buddy, I am getting this error:
--bcf: 481k variants complete.Error: Excessively long typed string in .bcf file.

freeseek (2014-12-11T11:36:52.742-08:00):
Their VCF files are malformed. You should report this issue to the maintainers of those files (not here), as other people might have the same problem.

francy (2014-12-11T08:23:26.984-08:00):
Ok thanks... I downloaded it from the MACH website here: http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G.2012-03-14.html
Maybe this is an older version...?

freeseek (2014-12-11T06:42:31.540-08:00):
It seems like your VCF file is malformed. You must have some "NS=..." in the INFO field of one of the variants but without having NS defined in the header of the VCF file. I believe bcftools will refuse to work with malformed VCF files.

francy (2014-12-11T04:14:35.716-08:00):
Thank you!! This is a great post, freeseek! Incredibly needed nowadays!!
I have not been able to solve the error I get in the last lines of your code here:

    bcftools annotate -Ob -x ID chr21.test.vcf

which says:
 [W::vcf_parse] INFO 'NS' is not defined in the header, assuming Type=String
 Encountered error, cannot proceed. Please check the error output above.

If you have any insight on how to get past this, please let me know. Thanks again!!
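For the "INFO 'NS' is not defined in the header" warning discussed above, one way to see which INFO keys a VCF uses without declaring them is to compare the ##INFO header lines against the INFO column of the records. The following is only a minimal Python sketch and not part of the original post. The function name `undeclared_info_keys` and the example record are made up for illustration, and it assumes an uncompressed, tab-delimited VCF read as lines of text.

```python
import re


def undeclared_info_keys(vcf_lines):
    """Return INFO keys used in records but never declared in a ##INFO header line."""
    declared = set()
    used = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO=<"):
            # Header declaration, e.g. ##INFO=<ID=AF,Number=A,...>
            match = re.search(r"ID=([^,>]+)", line)
            if match:
                declared.add(match.group(1))
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            # Column 8 (index 7) is INFO; "." means no INFO entries.
            if len(fields) < 8 or fields[7] == ".":
                continue
            for entry in fields[7].split(";"):
                if entry:
                    used.add(entry.split("=", 1)[0])
    return sorted(used - declared)


# Hypothetical example: AF is declared in the header, NS is not.
example = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "21\t9411239\t.\tG\tA\t.\tPASS\tAF=0.01;NS=100",
]
print(undeclared_info_keys(example))
```

Any key this reports would need a matching ##INFO line in the header, or a fix from whoever maintains the file, before bcftools is likely to accept it.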
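The "wrong number of fields in INFO/AF ... expected 2, found 3" error reported earlier in the thread comes down to a count: a field declared with Number=A should carry one value per ALT allele. Below is a rough Python sketch of that check, not taken from the original post. The function name `number_a_mismatches` is hypothetical, the record is Henry's line cut off after AN and assumed to be tab-delimited, and treating AC as Number=A is an assumption (the thread only confirms it for AF).

```python
def number_a_mismatches(record, keys=("AF", "AC")):
    """Return (key, expected, found) for listed INFO keys whose value count differs from the ALT count."""
    fields = record.rstrip("\n").split("\t")
    n_alt = len(fields[4].split(","))  # column 5 (index 4) is ALT
    problems = []
    for entry in fields[7].split(";"):  # column 8 (index 7) is INFO
        key, sep, value = entry.partition("=")
        if sep and key in keys:
            found = len(value.split(","))
            if found != n_alt:
                problems.append((key, n_alt, found))
    return problems


# Record from the thread, truncated after AN; tab delimiters are assumed.
record = "1\t24180806\trs55699941\tTA\tTAA,T\t7970.41\tPASS\tAC=21,18;AF=0.005,0.1238,0.1733;AN=136"
print(number_a_mismatches(record))
```

A check like this only locates the inconsistent records. As noted in the thread, the proper fix is to find the software that wrote the file and report the bug upstream.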