Wednesday, October 30, 2013

Longest homopolymer in coding sequence in the human genome

Most homopolymer sequences are expected to be negatively selected in coding sequence because of their higher mutation rate. So what is the longest homopolymer in coding sequence in the human genome? Here is a short script to answer this question.

First you will need a few programs:
sudo apt-get install wget samtools bedtools

And you will need a copy of the hg19 genome:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar xfO chromFa.tar.gz > hg19.fa
samtools faidx hg19.fa

Finally, generate a bed interval file with the coordinates of the coding exons and look for homopolymers within these:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
zcat refGene.txt.gz | awk '{split($10,a,","); split($11,b,","); for (i=1; i<length(a); i++) {from=$7>a[i]?$7:a[i]; to=$8<b[i]?$8:b[i]; if (to>from) print $3"\t"from"\t"to}}' | bedtools sort | bedtools merge > refGene.bed
bedtools getfasta -fi hg19.fa -bed refGene.bed -fo refGene.fa
for bp in A C G T; do hp=$(for i in {1..13}; do echo -n $bp; done); grep -B1 $hp refGene.fa; done
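
To make the awk step easier to follow, here is a minimal sketch of the same clipping logic on a single made-up refGene record. It assumes the standard refGene.txt column layout (chromosome in column 3, cdsStart and cdsEnd in columns 7 and 8, comma-terminated exonStarts and exonEnds in columns 10 and 11). The function name clip_exons_to_cds and the NM_TOY record are hypothetical and only serve the illustration.

```shell
# Clip every exon of each refGene record to the coding region
# [cdsStart, cdsEnd) and print the surviving pieces as BED intervals.
clip_exons_to_cds() {
  awk -F'\t' '{
    n = split($10, starts, ",")
    split($11, ends, ",")
    for (i = 1; i <= n; i++) {
      if (starts[i] == "") continue   # empty element after the trailing comma
      from = ($7 + 0 > starts[i] + 0) ? $7 + 0 : starts[i] + 0
      to   = ($8 + 0 < ends[i] + 0)   ? $8 + 0 : ends[i] + 0
      if (to > from) print $3 "\t" from "\t" to
    }
  }'
}

# Toy transcript: exons 100-300, 400-500, 700-1000; coding region 250-750.
printf '0\tNM_TOY\tchr1\t+\t100\t1000\t250\t750\t3\t100,400,700,\t300,500,1000,\n' |
  clip_exons_to_cds
# prints:
# chr1	250	300
# chr1	400	500
# chr1	700	750
```

Non-coding transcripts, where cdsStart equals cdsEnd, produce no intervals at all because the clipped end never exceeds the clipped start.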

It turns out that the longest homopolymer in coding sequence is 13 bp long and lies within the CAMKK2 gene.
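
If you would rather not guess the length to grep for, a possible alternative is to let awk report the longest run directly. This is only a sketch, not the method used above: the function name longest_homopolymer is made up, it assumes one sequence line per record (which is how bedtools getfasta writes its output), and unlike the grep above it upper-cases the sequence, so soft-masked lowercase bases are counted too and the answer may differ.

```shell
# Report the longest single-base run across all records of a FASTA file.
# Output: run length, base, record name (tab-separated).
longest_homopolymer() {
  awk '
    /^>/ { name = substr($0, 2); next }
    {
      seq = toupper($0)               # assumption: count soft-masked bases too
      run = 0; prev = ""
      for (i = 1; i <= length(seq); i++) {
        base = substr(seq, i, 1)
        if (base !~ /[ACGT]/) { run = 0; prev = ""; continue }  # N etc. breaks a run
        run = (base == prev) ? run + 1 : 1
        prev = base
        if (run > best) { best = run; best_base = base; best_name = name }
      }
    }
    END { if (best > 0) print best "\t" best_base "\t" best_name }
  ' "$@"
}

# Toy input with made-up interval names.
printf '>chr1:0-12\nACGTTTTTACGA\n>chr2:5-20\nGGGaaaaaaaCCTGA\n' | longest_homopolymer
# prints: 7	A	chr2:5-20
```

On the real data this would be run as longest_homopolymer refGene.fa. Ties are resolved in favour of the first record encountered.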

Friday, October 11, 2013

Merging plink binary files

If you have a bunch of plink binary files for the same set of people split over different chromosomes, how do you merge them? A simple answer would be to use the --merge-list option in plink. However, this requires some setup, and it turns out there is a much easier way to do it.

Assuming that the order of the samples within the plink files is the same (i.e. the fam files are all identical) and that each plink binary file is in SNP-major format (this is important! see here: http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml) then the following code will just work:

cat plink.chr1.fam > plink.fam
for chr in {1..22} X Y; do cat plink.chr$chr.bim; done > plink.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X Y; do tail -c +4 plink.chr$chr.bed; done) > plink.bed

This assumes the plink files you want to merge have names like plink.chr1.bed, plink.chr2.bed, ..., plink.chrY.bed and that each file corresponds to a different chromosome of your dataset. This is a typical scenario, especially if you are converting a big sequencing project delivered in multiple VCF files into plink.
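
Since the whole trick depends on the fam files being identical, it may be worth checking that before concatenating anything. A minimal sketch, with a hypothetical helper name (check_same_fam) and toy fam files created in a temporary directory:

```shell
# Return non-zero as soon as one fam file differs from the first one given.
check_same_fam() {
  fam_ref=$1
  shift
  for fam_file in "$@"; do
    if ! cmp -s "$fam_ref" "$fam_file"; then
      echo "fam mismatch: $fam_file differs from $fam_ref" >&2
      return 1
    fi
  done
}

# Toy demonstration with two identical two-sample fam files.
tmp=$(mktemp -d)
printf 'F1 I1 0 0 1 -9\nF2 I2 0 0 2 -9\n' > "$tmp/plink.chr1.fam"
cp "$tmp/plink.chr1.fam" "$tmp/plink.chr2.fam"
check_same_fam "$tmp"/plink.chr*.fam && echo "fam files match"
```

With real data the call would be check_same_fam plink.chr*.fam, run before building the merged files.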

A plink binary file in SNP-major format is just a big table preceded by three bytes that identify the file type and the order (SNP-major or individual-major). By default plink generates SNP-major binary files, which are ordered by SNP, like VCF files, so you can just concatenate them after removing the three-byte headers.
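
The header logic can be sketched on toy files as follows. The helper names (bed_header, is_snp_major, merge_snp_major_beds) are made up for this example, and the payload bytes are arbitrary rather than real genotypes. The sketch refuses to merge any file whose first three bytes are not 0x6C 0x1B 0x01, which is the SNP-major header written by the echo above.

```shell
# Print the first three bytes of a file as lowercase hex, e.g. 6c1b01.
bed_header() {
  od -An -tx1 -N3 "$1" | tr -d ' \n'
}

# True only if the file starts with the SNP-major header 0x6C 0x1B 0x01.
is_snp_major() {
  [ "$(bed_header "$1")" = "6c1b01" ]
}

# Write one header followed by the header-less payload of every input file.
merge_snp_major_beds() {
  for bed_file in "$@"; do
    is_snp_major "$bed_file" || { echo "not SNP-major: $bed_file" >&2; return 1; }
  done
  printf '\154\033\001'               # octal for 0x6C 0x1B 0x01
  for bed_file in "$@"; do
    tail -c +4 "$bed_file"            # skip the three header bytes
  done
}

# Toy files: header plus arbitrary payload bytes (0xAA 0xBB and 0xCC).
tmp=$(mktemp -d)
printf '\154\033\001\252\273' > "$tmp/a.bed"
printf '\154\033\001\314' > "$tmp/b.bed"
merge_snp_major_beds "$tmp/a.bed" "$tmp/b.bed" > "$tmp/merged.bed"
od -An -tx1 "$tmp/merged.bed" | tr -d ' \n'; echo
# prints: 6c1b01aabbcc
```

An individual-major file (third byte 0x00) makes merge_snp_major_beds stop with an error instead of silently producing a corrupt table.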

Tuesday, October 8, 2013

Convert VCF files to PLINK binary format

How many times have you needed to convert a VCF file to PLINK binary format? The 1000 Genomes project has a recommended tool (http://www.1000genomes.org/vcf-ped-converter) but it only works when converting small regions. Another easy-to-use tool is SnpSift (http://snpeff.sourceforge.net/SnpSift.html#vcf2tped).

Overall, it really isn't that complicated to convert a VCF file. The idea is to first convert it to tped format and then let plink do the job of converting that to binary format.
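
To give an idea of what the tped step involves, here is a rough sketch of such a conversion in awk. It is not the content of the vcf2tped.sh script downloaded below; the function name vcf_to_tped_sketch is made up, and it assumes biallelic sites, diploid genotypes, and GT as the first FORMAT field. Each tped line holds the chromosome, a variant identifier, a genetic distance (set to 0 here), the position, and then two alleles per sample, with 0 0 for missing genotypes.

```shell
# Convert VCF records on stdin to tped lines on stdout (illustrative only).
vcf_to_tped_sketch() {
  awk -F'\t' '
    /^#/ { next }                     # skip header lines
    $5 ~ /,/ { next }                 # assumption: drop multiallelic sites
    {
      id = ($3 == ".") ? ($1 ":" $2) : $3
      line = $1 " " id " 0 " $2
      allele[0] = $4; allele[1] = $5
      for (i = 10; i <= NF; i++) {
        split($i, fmt, ":")           # GT assumed to be the first subfield
        n = split(fmt[1], gt, "[/|]")
        a = (n == 2 && (gt[1] in allele)) ? allele[gt[1]] : "0"
        b = (n == 2 && (gt[2] in allele)) ? allele[gt[2]] : "0"
        if (a == "0" || b == "0") { a = "0"; b = "0" }  # missing or haploid call
        line = line " " a " " b
      }
      print line
    }'
}

# Toy VCF with three made-up samples.
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
  printf '1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0/0\t0|1\t./.\n'
  printf '1\t2000\t.\tC\tT\t.\tPASS\t.\tGT:DP\t1/1:12\t0/1:8\t0/0:5\n'
} | vcf_to_tped_sketch
# prints:
# 1 rs1 0 1000 A A A G 0 0
# 1 1:2000 0 2000 T T C T C C
```

The matching tfam file would list one line per sample in the same order as the VCF header columns.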

sudo apt-get install plink tabix
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tfam.sh
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tped.sh
chmod a+x vcf2tfam.sh vcf2tped.sh

The tools are fairly simple and they are supposed to be flexible. Now, let's suppose your VCF file is bgzip-compressed and indexed with tabix, with its path stored in the shell variable vcf. A good idea might be to split it by chromosome and generate one plink file for each chromosome. Here is some code that will do that:

for chr in {1..22} X Y MT; do
  tabix -H $vcf | ./vcf2tfam.sh /dev/stdin > gpc.chr$chr.tfam
  tabix $vcf $chr | ./vcf2tped.sh /dev/stdin > gpc.chr$chr.tped
  plink --tfile gpc.chr$chr --make-bed --out gpc.chr$chr
  /bin/rm gpc.chr$chr.tfam gpc.chr$chr.tped gpc.chr$chr.nosex gpc.chr$chr.log
done