Some time ago I became interested in chromosome Y. It all started from my maternal grandfather. He was born in Lauria, a small town in the south of Italy. After the civil records for Lauria became available online, it was pretty easy to reconstruct his family tree all the way to the 18th century. However, as shown in the following diagram, one small piece was missing (at the top right):
My grandfather's great-grandfather on the direct paternal line was listed as a child of unknown father. His mother got married in 1812 to someone who subsequently disappeared from the record, and gave birth in 1832 to a child as a single mother. While at first this was puzzling, and unusual for the times, it did seem to fit a particular historical context. As a child she had lost her father in 1806 during the massacre of Lauria that saw the population of the town slaughtered by the French soldiers under general André Masséna, as a punishment for having supported the Bourbon kings during the Napoleon invasion. The French presence brought an increase of secular behaviours among which a significant increase of single mothers (ref).
Chances are that my 4th great-grandfather on the direct paternal line of my grandfather was another local person from Lauria, who decided to never recognize his own biological child who died in 1911 still with the father reported as unknown. Whatever happened, it seemed one of those maybe irrelevant mysteries doomed to be forgotten in time. But then I thought that there was a chance, albeit slight, to shed some light over what happened. Whoever this unknown father who conceived my 3rd great grandfather in 1831 was, he must have carried the same Y chromosome my maternal grandfather carries.
Now, my maternal grandfather does not have sons, does not have brothers, and all of his paternal uncles either left no sons or they moved to Colombia in the 20th century, eventually losing touch with my family. He truly is the only person in my family that carries the Y chromosome that could unlock the mystery. Last year I struck a deal with him promising that, in exchange of some of his DNA, one day I will figure out where his patrilineal line came from and therefore what his real family name should have been if it indeed followed the patrilineal line. How does this work?
While AncestryDNA and 23andMe offer the most comprehensive autosomal tests, there is only one company, FamilyTreeDNA, which performs in depth Y chromosome analyses. The Y chromosome is a special molecule. It never recombines and it gets passed father to son as a single block. Excluding the tips, it is truly the largest molecule to withstand the test of time almost unscathed. But every Y chromosome is unique. In fact, it is so large, that we expect to find, more often than not, a single point mutation difference between the Y chromosome of a son and the Y chromosome of a father in the 21.3 megabases of DNA sequence of the chromosome Y that we can currently assay (see here). And that holds only for point mutations.
Before next-generation sequencing became as mainstream as it is now, most analyses of the Y chromosome relied on measurements of Y chromosome short tandem repeats (STRs). STRs are small segments of repetitive DNA that, due to their repetitive nature, are prone to slippage when copied and this results in their length to change. Although we can almost say that STR analyses are falling out of fashion a bit, in the last 20 years tens of thousands of Y chromosomes have had some of their STRs analyzed. There are 100s of STRs on the Y chromosome but those that have been analyzed before the advent of next-generation sequencing have the most value when you want to compare to available databases. Currently, the cheapest way to get the largest panel is to go for the Y111 DNA test with FamilyTreeDNA. This panel of STRs is such that you can observe almost with 30% chance at least one of them to change length in one father-son transmission. This means that Y chromosomes of direct relatives might actually look alike with a Y111 test, but other than that it is almost guaranteed that every Y chromosome will get its own unique fingerprint with the test and the number of differences might give an idea, albeit vague, of how distantly related two Y chromosomes are.
So I got my grandpa's Y chromosome tested. The results are just a list of numbers, measuring the length of each STR tested, but they can tell a lot. The closest match to my grandpa's Y chromosome matches 60 out of 67 markers, which corresponds to a shared ancestor within the last 20 generations with 90% probability. This is not very helpful, but it does make sense. The match originates from Mercato San Severino, which is less than 100 miles from Lauria:
Furthermore the match's family name is Loria, a variant spelling for Lauria. Less interestingly, a second match, which matches at 54 out of 67 markers, originated from Mezzojuso in Sicily. So what have I learned? I didn't get any clue about the true family name my grandpa should have received, but I have a pretty good guess that his Y chromosome has been around the area for at least 500 years, and most likely my 4th great grandfather was someone local rather than a distant stranger passing by in town. ;-)
Although ultimately I did not find the person that passed the Y chromosome to my 3rd great grandfather, this match tells me that my grandfather's results check out and they were not the result of some weird sample swap on FamilyTreeDNA's side. Maybe one day, within my lifetime, a better closer match will show up and maybe the mystery will be solved. There is a little bit of magic involved in thinking that the answer to this riddle might be out there waiting to come out and maybe one day it will make for a unique story.
I was so impressed from what could be inferred that I decided to take the Y111 test myself. Of course, my maternal grandfather and I don't share the same Y chromosome, so in this case I was investigating a completely different paternal line. Furthermore, I am a Genovese, and it would be nice to get proof that I am not actually related to every Genovese out there. When my results came in, disappointingly, I didn't get any good matches as my grandfather did. The best matches though did make some sense. My paternal grandfather was born in Valguarnera Caropepe in Sicily, and I was able to trace his patrilineal line back to an individual named Francesco Genovese born in Piazza Armerina from Tommaso Genovese and who got married in 1784 in Valguarnera Caropepe. The four best matches I have found (1, 2, 3, 4) originate from Palermo, Trapani, and Caltagirone, three cities in Sicily, the last one being less than 20 miles from Piazza Armerina:
However, these best matches still mismatch my Y chromosome at so many STRs that the time to the most recent common ancestor is difficult to pinpoint. It seems very possible that he might have lived more than 2,000 years ago. At least that tells me that my Y chromosome has been in Sicily for a very long time and it is cool that I can infer that.
What happens if we want to know about a more distant history of our Y chromosomes? When and from where did my grandfather Y chromosome and my own Y chromosome get to Italy? It turns out that there is quite some literature on the topic. Whole genome sequence analysis of 1,000 Genomes project individuals paints a picture of a sudden expansions of Y chromosome lineages during the last 10,000 years ago, together with an expansion of lineages around 50,000 years, that is, at the time when the first humans migrated out of Africa (ref):
My Y chromosome belongs to the J2-M172 lineage (or more specifically J2a-L70) and my grandfather's Y chromosome belongs to the R1b-M343 lineage (or more specifically R1b-M269), which means that they don't share an evolutionary past in the last 50,000 years.
We know a lot about the history of Y chromosomes up to the definition of each lineage. From the Y chromosome that gave origin to all Y chromosomes in people alive today, we can map many of the migrations that established the geography of nowadays geographical distribution:
Understanding what happened within each lineage, that is, assigning a specific subclade to a Y chromosome, can be extremely challenging as the coming of agriculture brought an explosion of each lineage that survived and this resulted in some level of saturation of the STR space. Part of the reason is the massive bottleneck Y chromosomes experienced at the onset of agricolture. Around 8-4 thousands years ago the effective population size of Y chromosomes plummeted in non-African populations, possibly indicating a change in social structures that increased male variance in offspring number (ref):
Whatever the reason, the consequences are quite evident in today's Y chromosome pool. Comparing your own Y chromosome to many other Y chromosomes out there alone can make this effect quite evident. Anybody can do this. In fact, FamilyTreeDNA hosts a large number of projects attempting to cluster Y chromosomes, and these projects include thousands of Y chromosome STR haplotypes available for download, provided you have a FamilyTreeDNA account and can run a script that will download all the publicly available data. What happens if I compare the number of mismatched STRs (excluding multi-copy and fast-changing STRs, leaving 86 STRs) between my Y chromosome and every other Y chromosome analyzed with the Y111 kit? Here is the figure:
While FamilyTreeDNA simply assigns my Y chromosome to lineage J2-M172, which formed about 27,900 years ago, the two STR values DYS445=6 and DYS391=9 clearly assign it to J2a-L70, a subclade that formed approximately 6,900 years ago and is now spread all over Europe. This is a subclade of clade J2a-L24, which formed around 13,800 years ago. The presence of my Y chromosome in Sicily might be explained by the colonization of the Magna Graecia by the Greeks, but a few other scenarios are also possible. The same picture for my grandfather's Y chromosome looks substantially different:
FamilyTreeDNA assigns my granpa's Y chromosome to lineage R1b-M269, which formed about 6,500 years ago. While understanding subclades within this lineage is challenging, as the number of subclades somewhat saturates the STR space, a plausible candidate is R1b-P312, a subclade that formed approximately 4,600 years ago and is now common in Italy. In this second figure we also observe a second mode mostly corresponding to Y chromosome haplotypes from the R1a-M417 cluster, a sister lineage that formed around 5,500 years ago.
Finally, in the previous post I mentioned that 1,000 Genomes project individual HG01944 showed up as an outlier. While claiming that four of his grandparents were born in Peru, his autosomal DNA projected more or less in the middle of East Asians (CHB, JPT, CHS, CDX, KHV) and Peruvians (PEL):
What about his Y chromosome? It belongs to Q-M120 (a subclade of lineage Q-M242), a subclade separate from lineage Q1a-M3 (found in more than 90% of Native American Y chromosomes). Instead, the subclade of HG01944 Y chromosome Q-M120 is found to some extent in East Asia, indicating that most likely the paternal grandfather of HG01944 was indeed East Asian rather than Peruvian. Another way to say it is that, while HG01944 might not be the best individual for a Peruvian reference panel, he is indeed the result of what we could call love without boundaries. ;-)
Sunday, December 18, 2016
Friday, October 14, 2016
1000 Genomes project phase 3 principal component analysis
The 1000 Genomes project phase 3 genotype data has been available since 2014, but I have not seen any detailed instructions for how to generate a principal component analysis plot of the 2,504 individuals for which genotype data is available. Let's fix that. First of all, you will need software to deal with the formats the data is being distributed as:
# create your personal binary directory mkdir -p ~/bin/ # install software needed for basic commands sudo apt-get install git wget unzip samtools # install latest version of bcftools git clone --branch=develop git://github.com/samtools/bcftools.git git clone --branch=develop git://github.com/samtools/htslib.git cd htslib && make && cd .. cd bcftools && make && cd .. /bin/mv bcftools/bcftools ~/bin/ # install latest version of plink2 wget https://www.cog-genomics.org/static/bin/plink/plink_linux_x86_64_dev.zip unzip -od ~/bin/ plink_linux_x86_64_dev.zip plink # install latest version of GCTA wget http://cnsgenomics.com/software/gcta/gcta_1.25.2.zip unzip -od ~/bin/ gcta_1.25.2.zip gcta64Now that we have all necessary software, we need to download a copy of the human genome reference and a copy of the genotype data. These operations might take several hours, so they might be best run on a good internet connection and possibly overnight:
# download a version of human genome reference GRCh37 wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta samtools faidx human_g1k_v37.fasta # download 1000 Genomes reference panel version 5a mkdir -p chrs/ wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/chr{1-4,5-9,10-15,16-22X}.1kg.phase3.v5a.vcf.zip for chrs in 1-4 5-9 10-15 16-22X; do unzip chr$chrs.1kg.phase3.v5a.vcf.zip -d chrs/; doneThe next commands are a bit technical. What we are going to is: (i) convert the downloaded VCF files into plink format using previously described best practices; (ii) remove duplicate markers shamefully present in the genotype data and a few long indels that would make plink2 memory hungry; (iii) join all chromosome files into a single plink file.
# download and convert 1000 Genomes project phase 3 reference for chr in {1..22} X; do tabix -f chrs/chr$chr.1kg.phase3.v5a.vcf.gz bcftools norm -Ou -m -any chrs/chr$chr.1kg.phase3.v5a.vcf.gz | bcftools norm -Ou -f human_g1k_v37.fasta | bcftools annotate -Ob -x ID \ -I +'%CHROM:%POS:%REF:%ALT' | plink --bcf /dev/stdin \ --keep-allele-order \ --vcf-idspace-to _ \ --const-fid \ --allow-extra-chr 0 \ --split-x b37 no-fail \ --make-bed \ --out chrs/kgp.chr$chr done # impute sex using chromosome X plink --bfile chrs/kgp.chrX \ --keep-allele-order \ --impute-sex .95 .95 \ --make-bed \ --out chrs/kgp.chrX && \ /bin/rm chrs/kgp.chrX.{bed,bim,fam}~ # check for duplicate markers (there are 11,943 such markers, mostly on the X chromosome, unfortunately) for chr in {{1..22},X,Y,MT}; do cut -f2 chrs/kgp.chr$chr.bim | sort | uniq -c | awk '$1>=2 {print $2}'; done > kgp.dups # check for very long indels (there are 46 of these) cut -f2 chrs/kgp.chr{{1..22},X,Y,MT}.bim | awk 'length($1)>=150' | sort | uniq > kgp.longindels # generate version of each chromosome without duplicate variants for chr in {1..22} X; do cat kgp.{dups,longindels} | plink --bfile chrs/kgp.chr$chr \ --keep-allele-order \ --exclude /dev/stdin \ --make-bed \ --out chrs/kgp.clean.chr$chr done # join all chromosomes into one cat kgp.clean.chrX.fam > kgp.fam cat kgp.clean.chr{{1..22},X}.bim > kgp.bim (echo -en "\x6C\x1B\x01"; tail -qc +4 kgp.clean.chr{{1..22},X}.bed) > kgp.bedIf you have managed to follow up to this point, the difficult part is over. We now are going to download population information and compute the principal components using plink2 and GCTA:
# download population information wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel awk 'BEGIN {print "FID\tIID\tPOP"} NR>1 {print "0\t"$1"\t"$2}' integrated_call_samples_v3.20130502.ALL.panel > kgp.pop # compute principal component analysis weights plink --bfile kgp --maf .01 --indep 50 5 2 --out kgp plink --bfile kgp --extract kgp.prune.in --make-grm-bin --out kgp gcta64 --grm-bin kgp --pca 20 --out kgp --thread-num 10 (echo FID IID PC{1..20}; cat kgp.eigenvec) > kgp.pcaThe last command will generate a file which contains the first 20 principal components for the 2,504 individuals in the 1000 Genomes project phase 3. Now, we are only left to plot the result. There are many programs that can do this (like Excel) but I will show some R code (requiring ggplot2). If you are familiar with R, you can use the following script:
# generate principal component plot suppressPackageStartupMessages(library(ggplot2)) pca <- read.table('kgp.pca', header = TRUE) pop <- read.table('kgp.pop', header = TRUE) df <- merge(pca, pop) pdf('kgp.pdf') p <- list() p[[1]] <- ggplot(df, aes(x=PC1, y=PC2)) p[[2]] <- ggplot(df, aes(x=PC1, y=PC3)) p[[3]] <- ggplot(df, aes(x=PC2, y=PC3)) p[[4]] <- ggplot(df, aes(x=PC1, y=PC4)) p[[5]] <- ggplot(df, aes(x=PC2, y=PC4)) p[[6]] <- ggplot(df, aes(x=PC3, y=PC4)) for (i in 1:6) { print(p[[i]] + geom_point(aes(color=POP, shape=POP), alpha=1/3) + scale_color_discrete() + scale_shape_manual(values=0:(length(levels(df[,'POP']))-1)%%25+1) + theme_bw()) } dev.off()The R code will generate six images, the first being the following (see here for a full legend): You might notice that the European populations on the top left, the African populations on the right, the East Asian populations in the bottom left, and the American and South Asian populations in the left, in between Europeans and East Asians. However, this plot does not make justice to American and South Asian populations. A three-dimensional visualization would provide a much better understanding of how these five super-populations relate to each other. I will not share the ugly Matlab code I have used, but I will share the result: As you might now notice, the South Asian populations have their own non-overlapping location in space. You might also appreciate many other previously hiding details. The BEB population (Bengali from Bangladesh) now clearly leans towards the East Asian population, as historical accounts would expect. You will now notice very clearly other smaller details as the five ASW (African American) individuals with significant amount of Native American ancestry and one PEL (Peruvians from Lima) individual with significant amounts of East Asian ancestry. This one individual is HG01944 and he clearly deserves some extra attention. We will talk about him again in the next post. ;-)
Subscribe to:
Posts (Atom)