Friday, October 14, 2016

1000 Genomes project phase 3 principal component analysis

The 1000 Genomes project phase 3 genotype data has been available since 2014, but I have not seen any detailed instructions for how to generate a principal component analysis plot of the 2,504 individuals for which genotype data is available. Let's fix that. First of all, you will need software to deal with the formats the data is being distributed as:
# create your personal binary directory
mkdir -p ~/bin/

# install software needed for basic commands
sudo apt-get install git wget unzip samtools

# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
/bin/mv bcftools/bcftools ~/bin/

# install latest version of plink2
wget https://www.cog-genomics.org/static/bin/plink/plink_linux_x86_64_dev.zip
unzip -od ~/bin/ plink_linux_x86_64_dev.zip plink

# install latest version of GCTA
wget http://cnsgenomics.com/software/gcta/gcta_1.25.2.zip
unzip -od ~/bin/ gcta_1.25.2.zip gcta64
Now that we have all necessary software, we need to download a copy of the human genome reference and a copy of the genotype data. These operations might take several hours, so they might be best run on a good internet connection and possibly overnight:
# download a version of human genome reference GRCh37
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta

# download 1000 Genomes reference panel version 5a
mkdir -p chrs/
wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/chr{1-4,5-9,10-15,16-22X}.1kg.phase3.v5a.vcf.zip
for chrs in 1-4 5-9 10-15 16-22X; do unzip chr$chrs.1kg.phase3.v5a.vcf.zip -d chrs/; done
The next commands are a bit technical. What we are going to is: (i) convert the downloaded VCF files into plink format using previously described best practices; (ii) remove duplicate markers shamefully present in the genotype data and a few long indels that would make plink2 memory hungry; (iii) join all chromosome files into a single plink file.
# download and convert 1000 Genomes project phase 3 reference
for chr in {1..22} X; do
  tabix -f chrs/chr$chr.1kg.phase3.v5a.vcf.gz
  bcftools norm -Ou -m -any chrs/chr$chr.1kg.phase3.v5a.vcf.gz |
    bcftools norm -Ou -f human_g1k_v37.fasta |
    bcftools annotate -Ob -x ID \
      -I +'%CHROM:%POS:%REF:%ALT' |
    plink --bcf /dev/stdin \
      --keep-allele-order \
      --vcf-idspace-to _ \
      --const-fid \
      --allow-extra-chr 0 \
      --split-x b37 no-fail \
      --make-bed \
      --out chrs/kgp.chr$chr
done

# impute sex using chromosome X
plink --bfile chrs/kgp.chrX \
  --keep-allele-order \
  --impute-sex .95 .95 \
  --make-bed \
  --out chrs/kgp.chrX && \
  /bin/rm chrs/kgp.chrX.{bed,bim,fam}~

# check for duplicate markers (there are 11,943 such markers, mostly on the X chromosome, unfortunately)
for chr in {{1..22},X,Y,MT}; do cut -f2 chrs/kgp.chr$chr.bim | sort | uniq -c | awk '$1>=2 {print $2}'; done > kgp.dups

# check for very long indels (there are 46 of these)
cut -f2 chrs/kgp.chr{{1..22},X,Y,MT}.bim | awk 'length($1)>=150' | sort | uniq > kgp.longindels

# generate version of each chromosome without duplicate variants
for chr in {1..22} X; do
  cat kgp.{dups,longindels} |
    plink --bfile chrs/kgp.chr$chr \
      --keep-allele-order \
      --exclude /dev/stdin \
      --make-bed \
      --out chrs/kgp.clean.chr$chr
done

# join all chromosomes into one
cat kgp.clean.chrX.fam > kgp.fam
cat kgp.clean.chr{{1..22},X}.bim > kgp.bim
(echo -en "\x6C\x1B\x01"; tail -qc +4 kgp.clean.chr{{1..22},X}.bed) > kgp.bed
If you have managed to follow up to this point, the difficult part is over. We now are going to download population information and compute the principal components using plink2 and GCTA:
# download population information
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
awk 'BEGIN {print "FID\tIID\tPOP"} NR>1 {print "0\t"$1"\t"$2}' integrated_call_samples_v3.20130502.ALL.panel > kgp.pop

# compute principal component analysis weights
plink --bfile kgp --maf .01 --indep 50 5 2 --out kgp
plink --bfile kgp --extract kgp.prune.in --make-grm-bin --out kgp
gcta64 --grm-bin kgp --pca 20 --out kgp --thread-num 10
(echo FID IID PC{1..20}; cat kgp.eigenvec) > kgp.pca
The last command will generate a file which contains the first 20 principal components for the 2,504 individuals in the 1000 Genomes project phase 3. Now, we are only left to plot the result. There are many programs that can do this (like Excel) but I will show some R code (requiring ggplot2). If you are familiar with R, you can use the following script:
# generate principal component plot
suppressPackageStartupMessages(library(ggplot2))
pca <- read.table('kgp.pca', header = TRUE)
pop <- read.table('kgp.pop', header = TRUE)
df <- merge(pca, pop)
pdf('kgp.pdf')
p <- list()
p[[1]] <- ggplot(df, aes(x=PC1, y=PC2))
p[[2]] <- ggplot(df, aes(x=PC1, y=PC3))
p[[3]] <- ggplot(df, aes(x=PC2, y=PC3))
p[[4]] <- ggplot(df, aes(x=PC1, y=PC4))
p[[5]] <- ggplot(df, aes(x=PC2, y=PC4))
p[[6]] <- ggplot(df, aes(x=PC3, y=PC4))
for (i in 1:6) {
  print(p[[i]] + geom_point(aes(color=POP, shape=POP), alpha=1/3) +
  scale_color_discrete() +
  scale_shape_manual(values=0:(length(levels(df[,'POP']))-1)%%25+1) +
  theme_bw())
}
dev.off()
The R code will generate six images, the first being the following (see here for a full legend):
You might notice that the European populations on the top left, the African populations on the right, the East Asian populations in the bottom left, and the American and South Asian populations in the left, in between Europeans and East Asians. However, this plot does not make justice to American and South Asian populations. A three-dimensional visualization would provide a much better understanding of how these five super-populations relate to each other. I will not share the ugly Matlab code I have used, but I will share the result: As you might now notice, the South Asian populations have their own non-overlapping location in space. You might also appreciate many other previously hiding details. The BEB population (Bengali from Bangladesh) now clearly leans towards the East Asian population, as historical accounts would expect. You will now notice very clearly other smaller details as the five ASW (African American) individuals with significant amount of Native American ancestry and one PEL (Peruvians from Lima) individual with significant amounts of East Asian ancestry. This one individual is HG01944 and he clearly deserves some extra attention. We will talk about him again in the next post. ;-)

2 comments:

  1. This is maybe a silly question, but I keep getting this error and I'm not sure what to do.


    Code works great until there

    [W::vcf_parse] contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
    [vcf.c:1274 bcf_write] Unchecked error (1), exiting.

    ReplyDelete
  2. I have added a line that generates the tabix index.

    ReplyDelete