First you will need a few programs:
sudo apt-get install wget samtools bedtools
And you will need a copy of the hg19 genome:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz tar xfO chromFa.tar.gz > hg19.fa samtools faidx hg19.fa
Finally, generate a bed interval file with the coordinates of the coding exons and look for homopolymers within these:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz zcat refGene.txt.gz | awk '{split($10,a,","); split($11,b,","); for (i=1; i<length(a); i++) {from=$7>a[i]?$7:a[i]; to=$8<b[i]?$8:b[i]; if (to>from) print $3"\t"from"\t"to}}' | bedtools sort | bedtools merge > refGene.bed bedtools getfasta -fi hg19.fa -bed refGene.bed -fo refGene.fa for bp in A C G T; do hp=$(for i in {1..13}; do echo -n $bp; done); grep -B1 $hp refGene.fa; done
It turns out that the longest homopolymer in coding sequence is 13bp long and is within the CAMKK2 gene.
No comments:
Post a Comment