Wednesday, August 28, 2013

Impute APOE and APOL1 with 23andMe

If you have paid to have your genome genotyped by 23andMe, you probably know that many variants are not directly genotyped by their chip. To work around this, large datasets like the 1000 Genomes Project can be used to compare your genotypes with those of other individuals and infer what your missing genotypes are. This process is called imputation and is routinely practiced by many researchers in my area. However, little is available online about imputing your own 23andMe data. The best link I could find was here, but the instructions it provides still require a fair amount of knowledge of some specific tools.

I think this is unacceptable, so I decided to write my own scripts and provide a couple of examples. The first shows how to impute your own rs429358 APOE genotype (the famous Alzheimer variant, which is directly genotyped only if you bought the more recent 23andMe v3 chip). The second shows how to impute the rs73885319, rs60910145, and rs71785313 APOL1 genotypes (those associated with non-diabetic kidney disease in people of African descent, as shown in my previous paper), which unfortunately are still not tested by 23andMe.

I wrote a few simple scripts that will allow you to impute these genotypes in a matter of minutes (or seconds, depending on how fast you can cut and paste the code below). The following instructions should work on any recent Ubuntu Linux box. You should be able to use this code on a Mac as well, provided you already have the necessary basic tools installed.

The first preliminary step is to download your raw 23andMe data. Go to your account and download the "All DNA" data set.
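Before going any further, it can be worth checking which of the variants of interest are already in your raw data. The sketch below assumes the usual 23andMe raw layout (tab-separated rsid, chromosome, position, and genotype columns, with comment lines starting with #). The function name lookup_rsids is mine and is not part of the scripts provided in this post.

```shell
# lookup_rsids RSID...
# Reads a 23andMe raw data file on stdin and prints the rows for the
# requested rsIDs. Any rsID that never appears is reported on its own line.
# Assumes tab-separated columns: rsid, chromosome, position, genotype.
lookup_rsids() {
  tr -d '\r' | awk -F'\t' -v ids="$*" '
    BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
    /^#/ { next }                            # skip header comments
    ($1 in want) { print; seen[$1] = 1 }     # row for a requested rsID
    END {
      for (i = 1; i <= n; i++)
        if (!(a[i] in seen)) print a[i] "\tnot on this chip"
    }
  '
}

# Usage (replace the zip name with your own download):
#   unzip -p genome_*_Full_*.zip | lookup_rsids rs429358 rs7412
```

An rsID reported as missing is a candidate for imputation. A row with a genotype of -- is a variant that is on the chip but was not called.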

The second preliminary step is to install a few UNIX-friendly programs to manipulate genetic datasets, the imputation software Beagle, a couple of scripts I wrote, and a template VCF file for the variants tested by 23andMe. The following commands will do:

sudo apt-get install bedtools dos2unix tabix unzip wget
wget http://faculty.washington.edu/browning/beagle/b4.r1196.jar
wget http://broadinstitute.org/~giulio/23andme/make_vcf_template.sh
wget http://broadinstitute.org/~giulio/23andme/23andme_to_vcf.sh
wget http://broadinstitute.org/~giulio/23andme/vcf_impute.sh
wget http://broadinstitute.org/~giulio/23andme/23andme.vcf.gz
chmod a+x make_vcf_template.sh 23andme_to_vcf.sh vcf_impute.sh

The 23andme.vcf.gz file is a VCF template I pre-built for 23andMe v2 and 23andMe v3 chips. If you want to build it yourself (optional, you don't need to do this!), from a directory where you have stored your 23andMe raw files, run:

wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
./make_vcf_template.sh 00-All.vcf.gz genome_*_Full_*.zip

This step takes time as the whole dbSNP dataset (~1.4GB) will need to be downloaded and processed.

Now the first step is to convert the raw downloaded data to a more standard format, the VCF format:

./23andme_to_vcf.sh 23andme.vcf.gz genome_*_Full_*.zip

Here genome_*_Full_*.zip should be replaced with the name of the zip file you downloaded from 23andMe. The second and last step is to create a list of variants (from a single locus) to be imputed and pass the genomic locations of these variants to a script that will run the imputation.

The script provided will automatically download a piece of the 1000 Genomes Project reference panel around the variants of interest, subset this genotype panel to the variants to be imputed and the variants available in your VCF file, and run the imputation with this minimal reference panel. This way the imputation process will be immediate and you will have your results right away.
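To give a feel for what "a piece of the reference panel around the variants" means, here is a small sketch that turns a BED file into a single padded region string of the form chrom:start-end, the kind of region a tool like tabix accepts. This only illustrates the windowing idea and is not taken from vcf_impute.sh. The function name bed_to_region and the default padding of 100000 bases are my own assumptions.

```shell
# bed_to_region [FLANK]
# Reads BED intervals (0-based, half-open) for a single chromosome on stdin
# and prints one 1-based inclusive region "chrom:start-end" that covers all
# of them, padded by FLANK bases on each side.
# The default FLANK is an arbitrary illustration value.
bed_to_region() {
  awk -F'\t' -v flank="${1:-100000}" '
    NR == 1 { chrom = $1; lo = $2; hi = $3 }
    $1 != chrom {
      print "error: BED spans more than one chromosome" > "/dev/stderr"
      bad = 1; exit
    }
    { if ($2 < lo) lo = $2; if ($3 > hi) hi = $3 }
    END {
      if (bad || NR == 0) exit 1
      start = lo + 1 - flank              # BED start is 0-based
      if (start < 1) start = 1            # do not run off the chromosome
      printf "%s:%d-%d\n", chrom, start, hi + flank
    }
  '
}

# Usage:
#   bed_to_region 50000 < apoe.bed
```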

For APOE, we are only interested in the rs429358 variant and the rs7412 variant (which is already genotyped both in the 23andMe v2 and the 23andMe v3 chip), so the code will look like this:

echo -e "19\t45411940\t45411941\n19\t45412078\t45412079" > apoe.bed
./vcf_impute.sh apoe.bed genome_*_Full_*.vcf.gz

Here genome_*_Full_*.vcf.gz should be replaced with the name of the compressed VCF file generated in the previous step. For APOL1 we are interested instead in three variants:

echo -e "22\t36661905\t36661906\n22\t36662033\t36662034\n22\t36662040\t36662047" > apol1.bed
./vcf_impute.sh apol1.bed genome_*_Full_*.vcf.gz
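The intervals in apoe.bed and apol1.bed follow the BED convention of 0-based, half-open coordinates: a single base at 1-based position P is written as P-1 to P. If you want to try other variants, a small helper like the one below can do the conversion for you. The name pos_to_bed is hypothetical and is not one of the scripts downloaded above.

```shell
# pos_to_bed CHROM POS [LENGTH]
# Converts a 1-based position (as listed in dbSNP or a VCF file) into one
# BED line. LENGTH is the number of reference bases covered (default 1).
pos_to_bed() {
  chrom=$1
  pos=$2
  len=${3:-1}
  printf '%s\t%d\t%d\n' "$chrom" "$((pos - 1))" "$((pos - 1 + len))"
}

# Usage, reproducing lines from the two BED files above:
#   pos_to_bed 19 45411941       # 19  45411940  45411941
#   pos_to_bed 22 36662041 7     # 22  36662040  36662047
```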

The results will be displayed on screen in VCF format, with best-guess genotypes and individual genotype likelihoods. Of course, the results are only probabilistic, so you will need to interpret them accordingly. For the biological interpretation, SNPedia has a good summary for both APOE and APOL1.
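If the raw VCF lines are hard to read, something like the following can condense each record to the variant ID, position, best-guess alleles, and the per-genotype probabilities. This is a sketch based on the generic VCF layout (FORMAT in column 9, a single sample in column 10). I am assuming the probabilities are stored under a key such as GP; pass a different key as the first argument if your output uses another one. The function name summarize_calls is my own.

```shell
# summarize_calls [FIELD]
# Reads VCF on stdin and prints: ID, POS, best-guess alleles, FIELD value.
# FIELD defaults to GP (an assumption about the output; "." if absent).
summarize_calls() {
  awk -F'\t' -v OFS='\t' -v field="${1:-GP}" '
    /^#/ { next }                                  # skip VCF headers
    {
      n = split($9, key, ":"); split($10, val, ":")
      gt = "."; prob = "."
      for (i = 1; i <= n; i++) {
        if (key[i] == "GT") gt = val[i]
        if (key[i] == field) prob = val[i]
      }
      split($4 "," $5, allele, ",")                # allele[1] is REF
      m = split(gt, g, "[/|]")                     # phased or unphased GT
      call = ""
      for (i = 1; i <= m; i++) {
        a = (g[i] == ".") ? "." : allele[g[i] + 1]
        call = (i == 1) ? a : call "/" a
      }
      print $3, $2, call, prob
    }
  '
}

# Usage:
#   ./vcf_impute.sh apoe.bed genome_*_Full_*.vcf.gz | summarize_calls
```

The probability column lists one value per possible genotype, so a best guess backed by a value close to 1 deserves more trust than one where the probabilities are spread out.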

If you have any comments, complaints, or suggestions, please send me an email! No perl code was used for any of the scripts.

1 comment:

  1. Seems like you haven't made this any less complicated, unless you happen to use a minority OS like Linux or Mac.
