genetics for fun
<br />
<b>What can we learn from one Y chromosome?</b> (2016-12-18)
<br />
Some time ago I became interested in chromosome Y. It all started with my maternal grandfather. He was born in Lauria, a small town in the south of Italy. After the civil records for Lauria became available online, it was pretty easy to reconstruct his family tree all the way back to the 18th century. However, as shown in the following diagram, one small piece was missing (at the top right):
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinKNQK0SbhvLshXZ0wuA_InKIgMJkCIa-_06dJLKoWq3Aa_UqWzDM-9_YazkXp_t1wtu9ugKqeF1zGOHUknP7awxiSLA8XLvERLjU8Zee2zZMgNLKcN95Z_eF_emsFwmoG-NhRkoNcYdg/s1600/grandpa.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinKNQK0SbhvLshXZ0wuA_InKIgMJkCIa-_06dJLKoWq3Aa_UqWzDM-9_YazkXp_t1wtu9ugKqeF1zGOHUknP7awxiSLA8XLvERLjU8Zee2zZMgNLKcN95Z_eF_emsFwmoG-NhRkoNcYdg/s1600/grandpa.png" width="560" /> </a>
<br />
My grandfather's great-grandfather on the direct paternal line was listed as a child of unknown father. His mother got <a href="http://antenati.san.beniculturali.it/v/Archivio+di+Stato+di+Potenza/Stato+civile+napoleonico/Lauria/Matrimoni/1812/007694080_01373.jpg.html">married in 1812</a> to someone who subsequently disappeared from the records, and <a href="http://antenati.san.beniculturali.it/v/Archivio+di+Stato+di+Potenza/Stato+civile+della+restaurazione/Lauria/Nati/1832/007693350_01063.jpg.html">gave birth in 1832</a> to a child as a single mother. While at first this was puzzling, and unusual for the times, it did fit a particular historical context. As a child she had lost her father in 1806 during the <a href="https://it.wikipedia.org/wiki/Massacro_di_Lauria">massacre of Lauria</a>, when the population of the town was slaughtered by French soldiers under general André Masséna as punishment for having supported the Bourbon kings during the Napoleonic invasion. The French presence brought an increase in secular behaviours, among them a significant rise in the number of single mothers (<a href="https://books.google.com/books?id=3jWq7N9_cmMC&pg=PA105">ref</a>).
<br />
<br />
Chances are that my 4th great-grandfather on the direct paternal line of my grandfather was another local person from Lauria, who decided never to recognize his own biological child; that child died in 1911 with his father still reported as unknown. Whatever happened, it seemed one of those perhaps irrelevant mysteries doomed to be forgotten in time. But then I thought that there was a chance, albeit slight, to shed some light over what happened. Whoever the unknown father who conceived my 3rd great-grandfather in 1831 was, he must have carried the same Y chromosome my maternal grandfather carries.
<br />
<br />
Now, my maternal grandfather has no sons and no brothers, and all of his paternal uncles either left no sons or moved to Colombia in the 20th century, eventually losing touch with my family. He truly is the only person in my family who carries the Y chromosome that could unlock the mystery. Last year I struck a deal with him, promising that, in exchange for some of his DNA, one day I will figure out where his patrilineal line came from and therefore what his real family name should have been, had it indeed followed the patrilineal line. How does this work?
<br />
<br />
While AncestryDNA and 23andMe offer the most comprehensive autosomal tests, only one company, FamilyTreeDNA, performs in-depth Y chromosome analyses. The Y chromosome is a special molecule. It never recombines and it gets passed from father to son as a single block. Excluding the tips, it is truly the largest molecule to withstand the test of time almost unscathed. Yet every Y chromosome is unique. In fact, it is so large that, more often than not, we expect to find at least one point mutation difference between the Y chromosome of a father and that of his son within the 21.3 megabases of chromosome Y sequence that we can currently assay (see <a href="http://dx.doi.org/10.1038/ng.3171">here</a>). And that holds for point mutations alone.
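As a back-of-envelope sanity check, the expected number of new point mutations per transmission can be computed directly. The per-base rate used below is an assumed round ballpark figure, not a value taken from the cited paper:

```python
import math

# Back-of-envelope sketch: expected new point mutations per father-son
# transmission. The per-base, per-generation rate of 3.0e-8 is an assumed
# ballpark value (hypothetical, for illustration only).
mutation_rate = 3.0e-8    # per base per generation (assumption)
assayable_bases = 21.3e6  # the ~21.3 Mb of chromosome Y currently assayable

expected_mutations = mutation_rate * assayable_bases
# under a Poisson model, the chance of at least one new mutation
p_at_least_one = 1 - math.exp(-expected_mutations)
print(f"expected mutations per transmission: {expected_mutations:.2f}")
print(f"P(at least one): {p_at_least_one:.0%}")
```

With these assumed numbers the expectation comes out a bit above 0.6 mutations per transmission.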
<br />
<br />
Before next-generation sequencing became as mainstream as it is now, most analyses of the Y chromosome relied on measurements of Y chromosome short tandem repeats (STRs). STRs are small segments of repetitive DNA that, due to their repetitive nature, are prone to slippage when copied, which causes their length to change. Although STR analyses are slowly falling out of fashion, in the last 20 years tens of thousands of Y chromosomes have had some of their STRs analyzed. There are hundreds of STRs on the Y chromosome, but those analyzed before the advent of next-generation sequencing are the most valuable when you want to compare against available databases. Currently, the cheapest way to get the largest panel is the Y111 DNA test from FamilyTreeDNA. With this panel there is almost a 30% chance that at least one STR changes length in a single father-son transmission. This means that the Y chromosomes of close relatives might look alike in a Y111 test, but otherwise it is almost guaranteed that every Y chromosome will get its own unique fingerprint, and the number of differences can give an idea, albeit vague, of how distantly related two Y chromosomes are.
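The ~30% figure can be sanity-checked in one line; the per-locus rate of 0.003 used here is an assumed average (real Y-STR mutation rates vary by more than an order of magnitude between loci):

```python
# Sketch: probability that at least one of the 111 STRs in the panel changes
# length in a single father-son transmission, assuming (hypothetically) an
# average per-locus mutation rate of 0.003 per generation.
per_locus_rate = 0.003  # assumed average rate, for illustration
n_loci = 111            # markers in the Y111 panel

p_at_least_one = 1 - (1 - per_locus_rate) ** n_loci
print(f"P(at least one STR changes) = {p_at_least_one:.1%}")
```

Under these assumptions the probability works out to roughly 28%, consistent with the "almost 30%" quoted above.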
<br />
<br />
So I got my grandpa's Y chromosome tested. The results are just a <a href="http://www.ysearch.org/search_view.asp?viewuid=6FD3N">list of numbers</a>, measuring the length of each STR tested, but they can tell a lot. The <a href="http://www.ysearch.org/search_view.asp?viewuid=DN4GP">closest match</a> to my grandpa's Y chromosome matches 60 out of 67 markers, which corresponds to a shared ancestor within the last 20 generations with 90% probability. This is not very helpful, but it does make sense. The match originates from Mercato San Severino, which is less than 100 miles from Lauria:
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6JVy7WusMEX7pTT9ljWG_i-7EFcfUjX8U4Sc2La7AGKm3MFQFeFs5YIkdKrhesWJKVKDUmzeWoDcIUkfBsqxZjFc1kemVvaqylnhIoJWnPc72CiOmWDlJe5t6wCvz7H4rbEE7HncMEJk/s1600/loria.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6JVy7WusMEX7pTT9ljWG_i-7EFcfUjX8U4Sc2La7AGKm3MFQFeFs5YIkdKrhesWJKVKDUmzeWoDcIUkfBsqxZjFc1kemVvaqylnhIoJWnPc72CiOmWDlJe5t6wCvz7H4rbEE7HncMEJk/s1600/loria.png" width="560" /></a>
<br />
Furthermore, the match's family name is Loria, a variant spelling of Lauria. Less interestingly, a <a href="http://www.ysearch.org/search_view.asp?viewuid=BG849">second match</a>, matching at 54 out of 67 markers, originates from Mezzojuso in Sicily. So what have I learned? I didn't get any clue about the true family name my grandpa should have received, but I can make a pretty good guess that his Y chromosome has been around the area for at least 500 years, and that most likely my 4th great-grandfather was someone local rather than a distant stranger passing through town. ;-)
<br />
<br />
Although ultimately I did not find the person that passed the Y chromosome to my 3rd great-grandfather, this match tells me that my grandfather's results check out and were not the result of some weird sample swap on FamilyTreeDNA's side. Maybe one day, within my lifetime, a closer match will show up and the mystery will be solved. There is a little bit of magic in thinking that the answer to this riddle might be out there waiting to come out, and maybe one day it will make for a unique story.
<br />
<br />
I was so impressed by what could be inferred that I decided to take the Y111 test myself. Of course, my maternal grandfather and I don't share the same Y chromosome, so in this case I was investigating a completely different paternal line. Furthermore, I am a Genovese, and it would be nice to get proof that I am not actually related to every <a href="https://en.wikipedia.org/wiki/Genovese_crime_family">Genovese</a> out there. When my <a href="http://www.ysearch.org/search_view.asp?viewuid=7KEC3">results</a> came in, disappointingly, I didn't get any matches as good as my grandfather's. The best matches, though, did make some sense. My paternal grandfather was born in Valguarnera Caropepe in Sicily, and I was able to trace his patrilineal line back to an individual named Francesco Genovese, son of Tommaso Genovese, who was born in Piazza Armerina and got <a href="http://familysearch.org/pal:/MM9.3.1/TH-266-11119-159465-60">married in 1784</a> in Valguarnera Caropepe. The four best matches I have found (<a href="http://www.ysearch.org/search_view.asp?viewuid=VEP8S">1</a>, <a href="http://www.ysearch.org/search_view.asp?viewuid=FMTK3">2</a>, <a href="http://www.ysearch.org/search_view.asp?viewuid=U6QBJ">3</a>, <a href="http://www.ysearch.org/search_view.asp?viewuid=EBTAM">4</a>) originate from Palermo, Trapani, and Caltagirone, three cities in Sicily, the last one less than 20 miles from Piazza Armerina:
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4YM5OYJTci2ZEwzKm56MNWZTjQb4k2RBXiylA909DiTZkdrh4uMV79Rb67ntqdtorQbw0WmMkqgn-URcF5wSmTDY9jrF0se4VMvvzQsLkNq5x68MhFsdZBGs4xLQRfogiUxoqNbJe-Zk/s1600/genovese.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4YM5OYJTci2ZEwzKm56MNWZTjQb4k2RBXiylA909DiTZkdrh4uMV79Rb67ntqdtorQbw0WmMkqgn-URcF5wSmTDY9jrF0se4VMvvzQsLkNq5x68MhFsdZBGs4xLQRfogiUxoqNbJe-Zk/s1600/genovese.png" width="560" /></a>
<br />
However, these best matches still mismatch my Y chromosome at so many STRs that the time to the most recent common ancestor is difficult to pinpoint. It seems very possible that he lived more than 2,000 years ago. At least that tells me that my Y chromosome has been in Sicily for a very long time, and it is cool that I can infer that.
<br />
<br />
What if we want to know about the more distant history of our Y chromosomes? When, and from where, did my grandfather's Y chromosome and my own arrive in Italy? It turns out that there is quite some literature on the topic. Whole genome sequence analysis of 1000 Genomes project individuals paints a picture of sudden expansions of Y chromosome lineages within the last 10,000 years, together with an expansion of lineages around 50,000 years ago, that is, at the time when the first humans migrated out of Africa (<a href="http://dx.doi.org/10.1038/ng.3559">ref</a>):
<br />
<a href="http://www.nature.com/ng/journal/v48/n6/fig_tab/ng.3559_F2.html" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3JE6NllFLyMTCFW1H28r2Dy-9K2-7_7KAQZtRHB91TnmwP7Q1-VqipDq0DechAc8mIjPONHLgCmmbZ46n5tfPvrD1fuv4BoA0gnKV5NYYhKYJo25BhCPmSrLOw9g5X4Z8WgEmm3A5sAo/s1600/ng.3559-F2.jpg" width="560<br" /> </a>
<br />
My Y chromosome belongs to the <a href="https://en.wikipedia.org/wiki/Haplogroup_J-M172">J2-M172</a> lineage (or more specifically J2a-L70) and my grandfather's Y chromosome belongs to the <a href="https://en.wikipedia.org/wiki/Haplogroup_R1b">R1b-M343</a> lineage (or more specifically <a href="https://en.wikipedia.org/wiki/Haplogroup_R1b#R1b1a1a2_.28R-M269.29">R1b-M269</a>), which means that they don't share an evolutionary past in the last 50,000 years.
<br />
<br />
We know a lot about the history of Y chromosomes down to the definition of each lineage. Starting from the Y chromosome that gave rise to all Y chromosomes carried by people alive today, we can map many of the migrations that established their present-day geographical distribution:
<br />
<br />
<a href="https://upload.wikimedia.org/wikipedia/commons/c/ca/World_Map_of_Y-DNA_Haplogroups.png" imageanchor="1"><img border="0" src="https://upload.wikimedia.org/wikipedia/commons/c/ca/World_Map_of_Y-DNA_Haplogroups.png" width="560" /></a>
<br />
<br />
Understanding what happened within each lineage, that is, assigning a specific subclade to a Y chromosome, can be extremely challenging, as the coming of agriculture brought an explosion of each lineage that survived, and this resulted in some level of saturation of the STR space. Part of the reason is the massive bottleneck Y chromosomes experienced at the onset of agriculture. Around 8,000-4,000 years ago the effective population size of Y chromosomes plummeted in non-African populations, possibly indicating a change in social structures that increased male variance in offspring number (<a href="http://dx.doi.org/10.1101/gr.186684.114">ref</a>):
<br />
<a href="http://genome.cshlp.org/content/25/4/459/F2.expansion.html" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjA0OuJLleTVDA-pmqDFqfVXJWI7l893PMNvkqrigStknTepSOvXATIC6ZJ5hqpCn2bqPewyn7ug0SShebPyDEGg5PG9eQbx9dX-zhR-Fm378Vlir7rmyyiibOz3W4R8MfTVyblGrlI4bo/s1600/F2.large.jpg" width="560" /></a>
<br />
Whatever the reason, the consequences are quite evident in today's Y chromosome pool, and anybody can observe them by comparing one Y chromosome against the many others out there. In fact, FamilyTreeDNA hosts a large number of projects attempting to cluster Y chromosomes, and these projects include thousands of Y chromosome STR haplotypes available for download, provided you have a FamilyTreeDNA account and can run a <a href="https://github.com/freeseek/getftdna">script</a> that downloads all the publicly available data. What happens if I compare the number of mismatched STRs (excluding multi-copy and fast-changing STRs, leaving 86 STRs) between my Y chromosome and every other Y chromosome analyzed with the Y111 kit? Here is the figure:
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaVwvEGBw3NovCt8l5rlQORCajlVfFVv-bzgfW5WwbrLY67OZekQGNa3mZY9RaYijm7Hhq1JKsdXu-2g6kN8tg5PPqL9JuHlgJeYefGjIQOAfk_QhRl-Gy8dJ01PH-oGCqf4uxOXpCRMo/s1600/333499.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaVwvEGBw3NovCt8l5rlQORCajlVfFVv-bzgfW5WwbrLY67OZekQGNa3mZY9RaYijm7Hhq1JKsdXu-2g6kN8tg5PPqL9JuHlgJeYefGjIQOAfk_QhRl-Gy8dJ01PH-oGCqf4uxOXpCRMo/s1600/333499.png" width="560" /></a>
<br />
While FamilyTreeDNA simply assigns my Y chromosome to lineage <a href="https://yfull.com/tree/J-M172/">J2-M172</a>, which formed about 27,900 years ago, the two STR values DYS445=6 and DYS391=9 clearly assign it to <a href="https://yfull.com/tree/J-L70/">J2a-L70</a>, a subclade that formed approximately 6,900 years ago and is now spread <a href="https://www.google.com/maps/d/u/0/viewer?mid=1XMZkpib63UEQMfyOLuiGo0wkO_E">all over Europe</a>. This is a subclade of clade <a href="https://yfull.com/tree/J-L24/">J2a-L24</a>, which formed around 13,800 years ago. The presence of my Y chromosome in Sicily might be explained by the colonization of the Magna Graecia by the Greeks, but a few other scenarios are also possible. The same picture for my grandfather's Y chromosome looks substantially different:
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCdD8hD9_KkdNLyONyz-LcIO-wKkJaDIa9RQ3sa75inZOkcdjgJgl7oth8BLlgef_xckiTBxB4a9Ekuc6ymc9Ek3y0VNPQp0ihT-V1xNcVwRbW-vBjdJQY-DYdSo8mXGSrfplhwMFbIJo/s1600/456438.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCdD8hD9_KkdNLyONyz-LcIO-wKkJaDIa9RQ3sa75inZOkcdjgJgl7oth8BLlgef_xckiTBxB4a9Ekuc6ymc9Ek3y0VNPQp0ihT-V1xNcVwRbW-vBjdJQY-DYdSo8mXGSrfplhwMFbIJo/s1600/456438.png" width="560" /> </a>
<br />
FamilyTreeDNA assigns my grandpa's Y chromosome to lineage <a href="https://yfull.com/tree/R-M269/">R1b-M269</a>, which formed about 6,500 years ago. While understanding subclades within this lineage is challenging, as the number of subclades somewhat saturates the STR space, a plausible candidate is <a href="https://yfull.com/tree/R-P312/">R1b-P312</a>, a subclade that formed approximately 4,600 years ago and is now common in Italy. In this second figure we also observe a second mode, mostly corresponding to Y chromosome haplotypes from the <a href="https://yfull.com/tree/R-M417/">R1a-M417</a> cluster, a sister lineage that formed around 5,500 years ago.
<br />
<br />
Finally, in the <a href="http://apol1.blogspot.com/2016/10/1000-genomes-project-phase-3-principal.html">previous post</a> I mentioned that 1000 Genomes project individual <a href="http://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=HG01944">HG01944</a> showed up as an outlier. Although all four of his grandparents were reportedly born in Peru, his autosomal DNA projects more or less in the middle between East Asians (CHB, JPT, CHS, CDX, KHV) and Peruvians (PEL):
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsCRyNEducJ8NMIwGOPPYKevxIJurHk0KpSBOB8pQvAUh9BbOX-OAbFReOl-zJnLjn51L8a8BsYZ1Mfb68fFxEcqb1da_LHKkHJapNC79LKHN8yv2OKEXFo_uvL2H15lg2UCKVsyhTofw/s1600/kgp.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsCRyNEducJ8NMIwGOPPYKevxIJurHk0KpSBOB8pQvAUh9BbOX-OAbFReOl-zJnLjn51L8a8BsYZ1Mfb68fFxEcqb1da_LHKkHJapNC79LKHN8yv2OKEXFo_uvL2H15lg2UCKVsyhTofw/s1600/kgp.png" width="560" /> </a>
<br />
What about his Y chromosome? It belongs to <a href="https://en.wikipedia.org/wiki/Haplogroup_Q-M120">Q-M120</a>, a subclade of lineage <a href="https://en.wikipedia.org/wiki/Haplogroup_Q-M242">Q-M242</a> separate from <a href="https://en.wikipedia.org/wiki/Haplogroup_Q-M3">Q1a-M3</a>, the subclade found in more than 90% of Native American Y chromosomes. <a href="https://www.yfull.com/tree/Q-M120/">Q-M120</a> is instead found to some extent in East Asia, indicating that the paternal grandfather of HG01944 was most likely East Asian rather than Peruvian. Another way to say it is that, while HG01944 might not be the best individual for a Peruvian reference panel, he is indeed the result of what we could call love without boundaries. ;-)
<br />
<br />
<b>1000 Genomes project phase 3 principal component analysis</b> (2016-10-14)
<br />
The 1000 Genomes project phase 3 genotype data has been available since 2014, but I have not seen any detailed instructions for how to generate a principal component analysis plot of the 2,504 individuals for which genotype data is available. Let's fix that. First of all, you will need software to deal with the formats the data is distributed in:
<pre>
# create your personal binary directory
mkdir -p ~/bin/
# install software needed for basic commands
sudo apt-get install git wget unzip samtools
# install latest version of bcftools
git clone --branch=develop https://github.com/samtools/bcftools.git
git clone --branch=develop https://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
/bin/mv bcftools/bcftools ~/bin/
# install latest version of plink2
wget https://www.cog-genomics.org/static/bin/plink/plink_linux_x86_64_dev.zip
unzip -od ~/bin/ plink_linux_x86_64_dev.zip plink
# install latest version of GCTA
wget http://cnsgenomics.com/software/gcta/gcta_1.25.2.zip
unzip -od ~/bin/ gcta_1.25.2.zip gcta64
</pre>
Now that we have all the necessary software, we need to download a copy of the human genome reference and a copy of the genotype data. These operations might take several hours, so they are best run on a good internet connection, possibly overnight:
<pre>
# download a version of human genome reference GRCh37
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta
# download 1000 Genomes reference panel version 5a
mkdir -p chrs/
wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/chr{1-4,5-9,10-15,16-22X}.1kg.phase3.v5a.vcf.zip
for chrs in 1-4 5-9 10-15 16-22X; do unzip chr$chrs.1kg.phase3.v5a.vcf.zip -d chrs/; done
</pre>
The next commands are a bit technical. What we are going to do is: (i) convert the downloaded VCF files into plink format using previously described <a href="http://apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html">best practices</a>; (ii) remove duplicate markers shamefully present in the genotype data and a few long indels that would make plink2 memory hungry; (iii) join all chromosome files into a single plink file.
<pre>
# download and convert 1000 Genomes project phase 3 reference
for chr in {1..22} X; do
tabix -f chrs/chr$chr.1kg.phase3.v5a.vcf.gz
bcftools norm -Ou -m -any chrs/chr$chr.1kg.phase3.v5a.vcf.gz |
bcftools norm -Ou -f human_g1k_v37.fasta |
bcftools annotate -Ob -x ID \
-I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out chrs/kgp.chr$chr
done
# impute sex using chromosome X
plink --bfile chrs/kgp.chrX \
--keep-allele-order \
--impute-sex .95 .95 \
--make-bed \
--out chrs/kgp.chrX && \
/bin/rm chrs/kgp.chrX.{bed,bim,fam}~
# check for duplicate markers (there are 11,943 such markers, mostly on the X chromosome, unfortunately)
# only chromosomes 1-22 and X were converted above, so loop over those
for chr in {1..22} X; do cut -f2 chrs/kgp.chr$chr.bim | sort | uniq -c | awk '$1>=2 {print $2}'; done > kgp.dups
# check for very long indels (there are 46 of these)
cut -f2 chrs/kgp.chr{{1..22},X}.bim | awk 'length($1)>=150' | sort | uniq > kgp.longindels
# generate version of each chromosome without duplicate variants
for chr in {1..22} X; do
cat kgp.{dups,longindels} |
plink --bfile chrs/kgp.chr$chr \
--keep-allele-order \
--exclude /dev/stdin \
--make-bed \
--out chrs/kgp.clean.chr$chr
done
# join all chromosomes into one
cat chrs/kgp.clean.chrX.fam > kgp.fam
cat chrs/kgp.clean.chr{{1..22},X}.bim > kgp.bim
(echo -en "\x6C\x1B\x01"; tail -qc +4 chrs/kgp.clean.chr{{1..22},X}.bed) > kgp.bed
</pre>
If you have managed to follow up to this point, the difficult part is over. We are now going to download population information and compute the principal components using plink2 and GCTA:
<pre>
# download population information
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
awk 'BEGIN {print "FID\tIID\tPOP"} NR>1 {print "0\t"$1"\t"$2}' integrated_call_samples_v3.20130502.ALL.panel > kgp.pop
# compute principal component analysis weights
plink --bfile kgp --maf .01 --indep 50 5 2 --out kgp
plink --bfile kgp --extract kgp.prune.in --make-grm-bin --out kgp
gcta64 --grm-bin kgp --pca 20 --out kgp --thread-num 10
(echo FID IID PC{1..20}; cat kgp.eigenvec) > kgp.pca
</pre>
The last command generates a file containing the first 20 principal components for the 2,504 individuals in the 1000 Genomes project phase 3. Now we only need to plot the result. There are many programs that can do this (even Excel), but I will show some R code (requiring ggplot2). If you are familiar with R, you can use the following script:
<pre>
# generate principal component plot
suppressPackageStartupMessages(library(ggplot2))
pca <- read.table('kgp.pca', header = TRUE)
# read POP as a factor (required by scale_shape_manual below, as R >= 4.0 no longer does this by default)
pop <- read.table('kgp.pop', header = TRUE, stringsAsFactors = TRUE)
df <- merge(pca, pop)
pdf('kgp.pdf')
p <- list()
p[[1]] <- ggplot(df, aes(x=PC1, y=PC2))
p[[2]] <- ggplot(df, aes(x=PC1, y=PC3))
p[[3]] <- ggplot(df, aes(x=PC2, y=PC3))
p[[4]] <- ggplot(df, aes(x=PC1, y=PC4))
p[[5]] <- ggplot(df, aes(x=PC2, y=PC4))
p[[6]] <- ggplot(df, aes(x=PC3, y=PC4))
for (i in 1:6) {
print(p[[i]] + geom_point(aes(color=POP, shape=POP), alpha=1/3) +
scale_color_discrete() +
scale_shape_manual(values=0:(length(levels(df[,'POP']))-1)%%25+1) +
theme_bw())
}
dev.off()
</pre>
The R code will generate six images, the first being the following (see <a href="http://www.internationalgenome.org/category/population/">here</a> for a full legend):
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioSu0GuXI4vAwrgY0xoOFrndOZisiKDYuRV2mdctM4an7XTl6BUBdxigt8bCOVf8JD1bldpjUsnWxARLB0om4U859uVjPhOTEAlGOC6mMaMZvsHU204fKsGOo08e65uxpkbO2bsAGSxZI/s1600/kgp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioSu0GuXI4vAwrgY0xoOFrndOZisiKDYuRV2mdctM4an7XTl6BUBdxigt8bCOVf8JD1bldpjUsnWxARLB0om4U859uVjPhOTEAlGOC6mMaMZvsHU204fKsGOo08e65uxpkbO2bsAGSxZI/s1600/kgp.png" /></a></div>
You might notice the European populations on the top left, the African populations on the right, the East Asian populations on the bottom left, and the American and South Asian populations on the left, in between Europeans and East Asians. However, this plot does not do justice to the American and South Asian populations.
A three-dimensional visualization would provide a much better understanding of how these five super-populations relate to each other. I will not share the ugly Matlab code I have used, but I will share the result:
<iframe width="560" height="315" src="https://youtube.com/embed/XkCCE8jC1F4?autoplay=1&loop=1&playlist=XkCCE8jC1F4" frameborder="0" allowfullscreen></iframe>
As you might now notice, the South Asian populations have their own non-overlapping location in space. You might also appreciate many previously hidden details. The BEB population (Bengali from Bangladesh) now clearly leans towards the East Asian populations, as historical accounts would lead us to expect. You will also notice very clearly smaller details, such as the five ASW (African American) individuals with a significant amount of Native American ancestry and one PEL (Peruvians from Lima) individual with a significant amount of East Asian ancestry. This one individual is <a href="http://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=HG01944">HG01944</a> and he clearly deserves some extra attention. We will talk about him again in the next post. ;-)
<br />
<br />
<b>Distinguishing parents from children from genotyping array data</b> (2015-10-24)
<br />
From genotype data for a trio (father, mother, child), it is straightforward to identify who the parents are: it suffices to find the pair whose genotypes, combined with those of the third person as the putative child, minimize the Mendelian error rate. From genotype data for a duo (parent, child), it is straightforward to establish that the two individuals are immediately related and that the relationship is a parent-child one. However, determining who is the parent and who is the child is a challenge of a different nature.
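The trio logic of minimizing Mendelian errors can be sketched in a few lines of python; genotypes are coded as alternate-allele counts (0/1/2) and all the data below is made up for illustration:

```python
from itertools import permutations

def mendel_errors(father, mother, child):
    """Count genotypes inconsistent with Mendelian inheritance (0/1/2 coding)."""
    errors = 0
    for f, m, c in zip(father, mother, child):
        # alleles the child can receive: one from each parent
        paternal = {0, 1} if f == 1 else {f // 2}
        maternal = {0, 1} if m == 1 else {m // 2}
        possible = {p + q for p in paternal for q in maternal}
        if c not in possible:
            errors += 1
    return errors

# toy genotypes for three individuals at 6 markers (invented data)
genos = {
    "A": [0, 2, 1, 0, 2, 1],
    "B": [2, 0, 1, 1, 2, 0],
    "C": [1, 1, 1, 0, 2, 1],  # consistent child of A and B
}
# the ordered assignment (parent, parent, child) minimizing the error count
best = min(permutations("ABC"),
           key=lambda t: mendel_errors(genos[t[0]], genos[t[1]], genos[t[2]]))
print("inferred child:", best[2])
```

With real array data one would use thousands of markers and tolerate a small error rate from genotyping noise rather than expect exactly zero.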
<br />
<br />
Without phasing information, sharing between a parent and a child is a rather symmetric relationship. One piece of information that could give directionality is the identification of de-novo mutations, that is, mutations present only in the child and not in the parent. From genotyping array data, however, we cannot assay de-novo mutations.
<br />
<br />
The other piece of information that can give directionality is more closely related to linkage analysis. A distant relative might share some DNA with the parent and possibly the same amount of DNA with the child, or occasionally only part of that amount with the child. The reverse is also possible but quite unlikely.
<br />
<br />
If you have DNA information for a duo either in 23andMe or in AncestryDNA, you can test whether this information is actionable by first downloading your genotype data using the instructions from my previous <a href="http://apol1.blogspot.com/2015/09/visualize-dna-matches-graph.html">post</a> and then visualizing estimated sharing between members of the duo and shared distant relatives with an additional python <a href="https://github.com/freeseek/getmydnamatches">script</a>.
<br />
<br />
This is what I have obtained using my own 23andMe and AncestryDNA accounts:
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh51JM4KZrlVCLiLl1zbAmGxDLawHnN1vJ65AP02QyvAvxAb60FOH3FjR3y5THDbFZ96KDj6NGws-65rmACYsRTnOwiahjSPm3S6q4SInxAc42F7Yz2AQ9dr2dKu-_MLiGwcOvm9aoPQeI/s1600/share.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh51JM4KZrlVCLiLl1zbAmGxDLawHnN1vJ65AP02QyvAvxAb60FOH3FjR3y5THDbFZ96KDj6NGws-65rmACYsRTnOwiahjSPm3S6q4SInxAc42F7Yz2AQ9dr2dKu-_MLiGwcOvm9aoPQeI/s1600/share.png" width=560 /></a>
Red markers indicate individuals for whom sharing differs with the individuals of the duo by an amount larger than 0.1% for 23andMe, or 5 centiMorgans for AncestryDNA. As you can see from the figure, there are always more red markers on the parent's side of the diagonal.
<br />
<br />
There are two explanations for this observation:<br />
1) A distant relative shares more than one segment with the parent and a smaller number of segments with the child<br />
2) A distant relative shares one large segment of DNA with the parent and a smaller chunk of that segment with the child due to a recombination event which split the segment in two parts, of which only one was passed to the child<br />
<br />
Due to the large number of distant cousins sharing only one segment of DNA, I believe case (2) is the more common, although the portion passed to the child still needs to be at least 5 centiMorgans long in order to be detected and reported.
<br />
<br />
Notice also that occasionally the child might share more with the distant cousin than the parent does (red markers under the diagonal). It is possible that statistical noise causes a shared DNA segment to be reported in the child but not in the parent. It is also occasionally possible that the distant relative is related to the child both through the mother and through the father, though in my experience this circumstance seems to be rare.
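The asymmetry in the figure suggests a simple decision rule, sketched below with invented centiMorgan values: across mutual matches, the duo member who more often shares noticeably more DNA with a match than the other member does is likely the parent.

```python
def likely_parent(duo_sharing, threshold=5.0):
    """duo_sharing: list of (cM shared with A, cM shared with B), one per mutual match.

    Returns 'A' or 'B' for the likely parent, or None if there is no clear call.
    The 5 cM threshold mirrors the assumed minimum reportable segment length.
    """
    a_excess = sum(1 for a, b in duo_sharing if a - b > threshold)
    b_excess = sum(1 for a, b in duo_sharing if b - a > threshold)
    if a_excess == b_excess:
        return None  # too symmetric to call
    return "A" if a_excess > b_excess else "B"

# invented sharing values: A tends to share more with the mutual matches
matches = [(32.0, 18.0), (14.5, 7.0), (9.0, 9.0), (25.0, 11.0), (6.0, 12.0)]
print("likely parent:", likely_parent(matches))
```

This is only a hypothetical heuristic in the spirit of the observation above, not the logic of the linked script, and it would need many mutual matches to make a confident call.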
<br />
<br />
<b>Limitations of 23andMe</b>
<br />
<br />
While the Relative Finder section of 23andMe shows a long list of DNA matches for each profile, the script used in this post cannot pair matches showing up as anonymous. This will most likely exclude 60-70% of DNA matches, depending on the profile anonymity chosen by your distant cousins.
<br />
<br />
<b>Visualize DNA matches graph</b> (2015-09-27)
<br />
Through high-throughput SNP genotyping it has become possible to identify shared DNA with our fifth degree cousins. Two companies, <a href="https://23andme.com/">23andMe</a> and <a href="http://dna.ancestry.com/">AncestryDNA</a>, each currently hold a database of dense genotype data for more than one million people. Each service reports a list of all individuals who share DNA segments at least 5 centimorgans long, giving us the possibility to identify hundreds of distant DNA cousins.
<br />
<br />
DNA cousins from 1st up to 4th degree will usually share several such DNA segments making them stand out from the crowd of more distantly related DNA cousins. For the large list of individuals that only share with us one or two DNA segments, it is difficult to distinguish between a match as close as a fifth degree cousin and someone much more distantly related who just happens to share a long DNA segment which through luck stayed intact across several centuries.
<br />
<br />
Is there any information, other than the number of shared DNA segments and their length in centimorgans, that we can leverage to prioritize our DNA matches? Since <a href="http://blogs.ancestry.com/ancestry/2015/08/26/see-your-dna-matches-in-a-whole-new-way/">August 26th 2015</a>, AncestryDNA offers a feature whereby, for a given match, we can further learn whether it in turn shares DNA with others of our DNA matches, with some limitations. We can do the same for DNA matches that have agreed to share genomes with us on 23andMe, though it is a little more cumbersome.
<br />
<br />
We can therefore define a <i>DNA matches graph</i> where each node is one of our DNA matches and two nodes are connected by an edge if the two corresponding DNA matches share DNA with each other. What can we learn from this graph? What does the topology look like? Does the graph contain connected components? For example, if our parents were to be from very different places, it would be quite unlikely that our DNA matches from our mother's side would be connected to our DNA matches from our father's side. The same holds if our grandparents are from different places, and so on.
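The connected components mentioned above can be found with a simple breadth-first search over the match graph. Here is a minimal pure-Python sketch; the node names and edges are made up for illustration:

```python
from collections import deque

def connected_components(nodes, edges):
    """Return the connected components of an undirected graph."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # breadth-first search from an unvisited node
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            queue.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components

# hypothetical matches: two on the mother's side, two on the father's side,
# and one match not connected to anyone else
nodes = ["M1", "M2", "P1", "P2", "X1"]
edges = [("M1", "M2"), ("P1", "P2")]
print(len(connected_components(nodes, edges)))  # → 3
```

If the two sides of the family come from well-separated populations, each side's matches should end up in separate components, as in this toy example.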
<br />
<br />
To be able to look at our <i>DNA matches graph</i>, we need to access at once all the information that either AncestryDNA or 23andMe provides us. Retrieving this information manually is just impractical. To overcome this hurdle, I have put together a few python <a href="https://github.com/freeseek/getmydnamatches">scripts</a> which will do the job for us.
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2O6uzx_GibIsaTRYrkTJP6ho_n68zMh6eOQ7kG40t0eE9vR6lXZM8Q9_Pz8hGXfj9zlMaNS-7gI2hyphenhyphen_rLVXAEtftlC8BZXEbR24Sqbff4Tec9qq7h6_mBL8k1X0Y3nYLOLuL3Z7GZ830/s1600/getmydna.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2O6uzx_GibIsaTRYrkTJP6ho_n68zMh6eOQ7kG40t0eE9vR6lXZM8Q9_Pz8hGXfj9zlMaNS-7gI2hyphenhyphen_rLVXAEtftlC8BZXEbR24Sqbff4Tec9qq7h6_mBL8k1X0Y3nYLOLuL3Z7GZ830/s1600/getmydna.png" /></a>
<br />
The workflow runs as follows:<br />
1) We dump all DNA matches information in our user account
<br />
2) We process the information to generate a table of all pairwise DNA sharing
<br />
3) We visualize the <i>DNA matches graph</i>
<br />
<br />
However, for now, you will have to be familiar with running python commands and installing a few necessary python packages. Here are some examples of what came out for my mother's 23andMe test:
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigsm-44HMIqDMkN5vzYMlM-5QR8R3jVrC2Au4GIbxm3Sk_UVH7PjxJfPLVl6wfGQ1-tR_jqMI7QLnfLQqWnuDBAkcFy2g6awq4-scuDXwys2c7vqzN58c0qhEF4jSzLKbjzOx2yr0sA6E/s1600/mother.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigsm-44HMIqDMkN5vzYMlM-5QR8R3jVrC2Au4GIbxm3Sk_UVH7PjxJfPLVl6wfGQ1-tR_jqMI7QLnfLQqWnuDBAkcFy2g6awq4-scuDXwys2c7vqzN58c0qhEF4jSzLKbjzOx2yr0sA6E/s1600/mother.png" /></a>
<br />
The blue nodes correspond to DNA matches on my paternal grandmother's side and the pink nodes correspond to DNA matches on my maternal grandmother's side. Interestingly, these two groups form two very separate clusters, and this is to be expected, as my maternal grandparents come from two Italian towns quite distant from each other.
<br />
<br />
The following example is from the AncestryDNA account of an American friend of mine:
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSJsKiWlNxpN51eNmYmJKzVkvw2ZjVqaJkOK4O47mCsbnBadwj45iPg0_0bQBr_GO4BaCYMr2G4Bjvng4Lc64NR3cFekhoqfPUX_6NGdCK6tZk9bPi5JfF-1mFS06WRmtVNJGMmKPY4Hs/s1600/friend.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSJsKiWlNxpN51eNmYmJKzVkvw2ZjVqaJkOK4O47mCsbnBadwj45iPg0_0bQBr_GO4BaCYMr2G4Bjvng4Lc64NR3cFekhoqfPUX_6NGdCK6tZk9bPi5JfF-1mFS06WRmtVNJGMmKPY4Hs/s1600/friend.png" /></a>
<br />
The small nodes correspond to DNA matches estimated as my friend's distant cousins, while the larger nodes are estimated as 4th cousins or closer, and the dark colored nodes correspond to nodes for which AncestryDNA has identified the most recent common ancestor in the family trees within nine generations. The clusters make some sense. For example, we were able to identify a small cluster corresponding to a little town in Poland where my friend's great-great-grandmother is from. It is a great graph and I am jealous: if you have Northern European roots, you get a lot more DNA matches than if you have Southern European roots like me, most likely due to the bias in who buys the AncestryDNA test.
<br />
<br />
If you are savvy enough, I encourage you to use the tool on your own data and do send me feedback! :-)
<br />
<br />
<b>Limitations of 23andMe and AncestryDNA</b>
<br />
<br />
When taking a DNA test with 23andMe or AncestryDNA, all matches sharing at least 5 centimorgans with the individual being tested will be reported for further review. In either case some matches will be anonymous and some will share as much as their own family tree, depending on the settings selected by the user. I believe that for chromosome X matches between males 23andMe might lower the threshold to something like 2 centimorgans, as it is significantly easier to identify sharing on the X chromosome among males. The 5 centimorgans threshold for shared segments is likely selected for two reasons:
<br />
<br />
1) Below 5 centimorgans, segments become increasingly likely to be false positives unless your genome has been properly <i>phased</i>, which is mostly not the case, with the exception of those customers who also had their parents tested.
<br />
<br />
2) Shorter segments are likely to be due to very distant relationships, and as such they might have very little value for genealogy, as they would trace back several centuries, well beyond what the paper trail allows in most countries.
<br />
<br />
<b>Limitations of 23andMe</b>
<br />
<br />
When browsing the <i>Relative Finder</i> feature within 23andMe, you can see a full list of individuals sharing with you DNA segments larger than 5 centimorgans. Unlike with AncestryDNA, for each DNA match you can also see how many separate segments you share. However, you cannot determine whether two of your DNA matches are DNA matches with each other unless both users have agreed to share their genetic profiles with you.
<br />
<br />
Due to this limitation you can only build the <i>DNA matches graph</i> for a restricted class of individuals engaged enough in the service to manage their sharing requests. Furthermore, you can only retrieve information for one pair of individuals at a time, making the process completely impractical if attempted manually. Due to how slow the 23andMe server is at retrieving such information, even with the script mentioned above it will take about three seconds to retrieve information about each pair. For 250 individuals (over 31,000 pairs) it might take approximately 24 hours, even if most retrieval requests will yield empty results.
<br />
<br />
<b>Limitations of AncestryDNA</b>
<br />
<br />
Unlike 23andMe, AncestryDNA reports neither the number of shared segments among DNA matches nor the amount of genome shared. However, when browsing the site, the browser does receive two variables invisible to the user:
<br />
<br />
1) <i>sharedCentimorgans</i>, an estimate of the amount of DNA shared across shared DNA segments larger than 5 centimorgans
<br />
<br />
2) <i>meiosisValue</i>, an estimate, derived from <i>sharedCentimorgans</i>, of the number of meioses separating you and your DNA match; this number is never larger than 10
<br />
<br />
From what I observed, pairs of individuals within the AncestryDNA database can then be split into three categories:
<br />
<br />
1) Pairs that share no DNA segments larger than 5 centimorgans
<br />
<br />
2) Pairs that share at least one DNA segment longer than 5 centimorgans with a <i>meiosisValue</i> of exactly 10 (corresponding to <i>sharedCentimorgans</i> between 5 and 17.65, that is, relationships classified as 10:<i>DISTANT COUSIN</i>)
<br />
<br />
3) Pairs that share at least one DNA segment longer than 5 centimorgans with a <i>meiosisValue</i> smaller than 10 (corresponding to <i>sharedCentimorgans</i> greater than 17.65, that is, relationships classified as 0:<i>SELF/TWIN</i>, 1:<i>PARENT/CHILD</i>, 2:<i>IMMEDIATE FAMILY</i>, 3:<i>CLOSE FAMILY</i>, 4:<i>1ST COUSIN</i>, 5,6:<i>2ND COUSIN</i>, 7:<i>3RD COUSIN</i>, 8,9:<i>4TH COUSIN</i>)
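The three categories can be sketched as a small classifier. The 17.65 centimorgan cutoff is the value observed above; the function itself is just an illustration of the rule, not AncestryDNA's actual code:

```python
def match_category(shares_segment, shared_cm):
    """Classify a pair of AncestryDNA profiles into the three categories above."""
    if not shares_segment:
        return 1  # no shared DNA segment larger than 5 centimorgans
    if shared_cm <= 17.65:
        return 2  # meiosisValue == 10, reported as DISTANT COUSIN
    return 3      # meiosisValue < 10, i.e. estimated 4th cousin or closer

print(match_category(True, 12.0))  # → 2
print(match_category(True, 25.0))  # → 3
```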
<br />
<br />
When accessing your AncestryDNA test, all DNA matches corresponding to the last two categories will be reported. Most of the pairs consisting of you and one of your DNA matches (in my case 96-98%) will fall in the second category. However, when you are reviewing one of your DNA matches <i>A</i> in AncestryDNA, the Shared Matches feature will list only DNA matches <i>B</i> such that <b>both</b> <i>A</i> and <i>B</i> <b>and</b> you and <i>B</i> fall in the third category, as also explained in another <a href="http://www.thegeneticgenealogist.com/2015/08/28/ancestrydna-announces-new-in-common-with-tool/">post</a>.
<br />
<br />
As of this writing, even when listing the DNA matches you have in common with your father/mother, you are actually only listing those DNA matches which form a third-category pair with your father/mother. Such a limitation is not present with 23andMe, but I believe this is just a bug on AncestryDNA's side which will get fixed soon.
<br />
<br />
To summarize, through AncestryDNA you will only have access to those edges of the <i>DNA matches graph</i> corresponding to pairs sharing at least 17.65 centimorgans of DNA. This is quite a conservative selection and likely the result of careful debate within AncestryDNA to find a balance between being informative and being concise.Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-2735750853096182491.post-65049709610102911542015-05-19T12:35:00.000-07:002016-01-23T15:25:24.073-08:00Annotate CpG mutations in a VCF fileMutations in the human genome are not all made the same. Even when restricting our attention to the most common form of human variation, that is, single nucleotide polymorphisms, there are different categories. To a first approximation we have transitions and transversions. And among transitions we have transitions at CpG sites, which are much more common than transitions at non-CpG sites. Hence the need to distinguish the two. If you have mutations encoded in the standard VCF format, it is easy to distinguish transitions from transversions. The first kind shows as an A<->G or a C<->T, while the second kind shows as an A<->C, A<->T, C<->G, or G<->T. But distinguishing CpG transitions from non-CpG transitions within a VCF is not possible without additional information, as we need the sequence context.<br />
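The transition/transversion distinction, unlike the CpG one, needs only the REF and ALT alleles; a minimal sketch:

```python
def snp_class(ref, alt):
    """Classify a bi-allelic SNP as a transition or a transversion."""
    purines, pyrimidines = {"A", "G"}, {"C", "T"}
    if {ref, alt} <= purines or {ref, alt} <= pyrimidines:
        return "transition"   # A<->G or C<->T
    return "transversion"     # A<->C, A<->T, C<->G, or G<->T

print(snp_class("C", "T"))  # → transition
print(snp_class("C", "G"))  # → transversion
```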
<br />
We definitely need the reference genome sequence of the organism the VCF file refers to. However, there is no tool to my knowledge that will take a VCF file and a fasta sequence and yield as output an annotated VCF file indicating which mutations are CpG mutations (it would be nice if someone wrote a bcftools plugin for this purpose). But there are tools that will annotate a VCF file given the presence of a mutation in another VCF file. Therefore one solution would be to have a VCF file with all possible CpG mutations. Here is how to do that:<br />
<pre><code>
# create your personal binary directory
mkdir -p ~/bin/
# download human genome reference (GRCh37)
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta &&
samtools faidx human_g1k_v37.fasta
# install latest version of bedtools
git clone https://github.com/arq5x/bedtools2.git &&
cd bedtools2 && make && cd .. &&
cp bedtools2/bin/bedtools ~/bin/
# create a VCF file with all CpG mutations (it takes several hours)
awk '{for (i=1; i<$2; i++)
print $1"\t"i-1"\t"i+1"\t"$1" "i}' human_g1k_v37.fasta.fai |
bedtools getfasta -name -fi human_g1k_v37.fasta \
-bed /dev/stdin -fo /dev/stdout |
tr '\n' ' ' | tr '>' '\n' | grep "CG $" |
awk -v OFS="\t" 'BEGIN {print "##fileformat=VCFv4.1";
print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"}
{print $1"\t"$2"\t.\tC\tT\t.\t.\t.";
print $1"\t"$2+1"\t.\tG\tA\t.\t.\t."}' |
bgzip > cpg.vcf.gz &&
tabix -f cpg.vcf.gz
</code></pre>
To be completely thorough, these mutations will be CpG mutations assuming that the reference sequence is the ancestral sequence, which is not always the case. But for the majority of rare mutations it is a fair assumption. Furthermore, this code does not take into account whether a CpG mutation is located within a CpG island, information that might also be important, as not all CpG sites are equally mutable.<br />
<br />
Now that we have our VCF file with all CpG mutations, we can use SnpSift to annotate our input VCF file. I selected SnpSift because it is quite fast and flexible compared to other tools like bcftools (see <a href="https://github.com/samtools/bcftools/issues/108">here</a>). This can be achieved as follows:<br />
<pre><code>
# install latest version of snpEff/SnpSift
wget http://downloads.sourceforge.net/project/snpeff/snpEff_latest_core.zip &&
unzip snpEff_latest_core.zip
# annotate your VCF file with SnpSift
java -jar snpEff/SnpSift.jar annotate -exists CPG cpg.vcf.gz input.vcf.gz | bgzip > output.vcf.gz &&
tabix -f output.vcf.gz
</code></pre>
If you want to extract from the VCF file only the variants which are CpG mutations, the following code will work:<br />
<pre><code>
# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/
# extract from VCF file all CpG mutations
bcftools view -Oz -i "CPG==1" output.vcf.gz -o output.cpg.vcf.gz &&
tabix -f output.cpg.vcf.gz
</code></pre>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2735750853096182491.post-3682322057956173892014-11-25T12:02:00.000-08:002015-07-07T13:40:32.115-07:00Best practice for converting VCF files to plink formatConverting VCF files to plink format has never been easier. However, there are a few issues related to some intrinsic limitations of the plink format. The first is related to the fact that variants in a plink file are bi-allelic only, while variants in a VCF file can be multi-allelic. The second is related to an intrinsic limitation of plink which makes indel definitions ambiguous. Here is an example: is the following variant an insertion or a deletion compared to the GRCh37 reference?<br />
<pre><code>
20 31022441 A AG</code></pre>
<br />
There is no way to tell, as the plink format does not record this information.<br />
<br />
Keeping this in mind, we are going to need two pieces of software for the conversion, bcftools and plink2. Here is how to install them:<br />
<pre><code>
# create your personal binary directory
mkdir -p ~/bin/
# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/
# install latest version of plink2
wget https://www.cog-genomics.org/static/bin/plink/plink_linux_x86_64.zip
unzip -od ~/bin/ plink_linux_x86_64.zip plink
</code></pre>
<br />
We are also going to need a copy of the GRCh37 reference. If you don't have this, it will take a while to download, but it can be done with the following commands:<br />
<pre><code>
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta
</code></pre>
<br />
The following command will take the VCF file, strip the variant IDs, split multi-allelic sites into bi-allelic sites, assign names to make sure indels will not become ambiguous, and finally convert to plink format:<br />
<pre><code>
bcftools norm -Ou -m -any input.vcf.gz |
bcftools norm -Ou -f human_g1k_v37.fasta |
bcftools annotate -Ob -x ID \
-I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out output
</code></pre>
<br />
Usually dbSNP IDs are used for variants in VCF files, but in my opinion this is really a bad practice especially after splitting multi-allelic sites into bi-allelic sites (see <a href="http://annovar.openbioinformatics.org/en/latest/articles/dbSNP/">here</a> for a discussion).<br />
<br />
The first command will split multi-allelic alleles so that a variant like the following:<br />
<pre><code>
20 31022441 AG AGG,A
</code></pre>
<br />
Will become two variants as follows:<br />
<pre><code>
20 31022441 AG AGG
20 31022441 AG A
</code></pre>
<br />
The second command will make sure that after having been split, indels are normalized so to have a unique representation. The above multi-allelic variant would then become:<br />
<pre><code>
20 31022441 A AG
20 31022441 AG A
</code></pre>
<br />
Notice that plink will have a very hard time distinguishing between the two variants above, as they look alike once you forget which allele is the reference allele. The third command will assign a unique name to each bi-allelic variant:<br />
<pre><code>
20 31022441 20:31022441:A:AG A AG
20 31022441 20:31022441:AG:A AG A
</code></pre>
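The naming scheme applied by the bcftools annotate step can be sketched directly; each bi-allelic record gets an ID built from its four defining fields:

```python
def variant_id(chrom, pos, ref, alt):
    """Build an unambiguous variant name mirroring bcftools' '%CHROM:%POS:%REF:%ALT'."""
    return f"{chrom}:{pos}:{ref}:{alt}"

# the two normalized indels from the example above get distinct names
print(variant_id("20", 31022441, "A", "AG"))  # → 20:31022441:A:AG
print(variant_id("20", 31022441, "AG", "A"))  # → 20:31022441:AG:A
```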
<br />
The fourth command will convert the final output to plink binary format. At this point, you will have a plink binary file with unique names for each variant. You can test that this is the case with the following command:<br />
<pre><code>
cut -f2 output.bim | sort | uniq -c | awk '$1>1'
</code></pre>
<br />
If the above command returns any output, then you still have duplicate IDs, but this means your VCF file was flawed to begin with. Plink will not like having multiple variants with the same ID.<br />
<br />
You should now be able to use the file you have generated, and to merge it with other plink files generated from VCF files in the same way.<br />
<br />
Your last worry is the individuals' sex, as the VCF format, contrary to the plink format, does not encode this information. If your VCF file contains enough information about the X chromosome, you should be able to assign the sex straight from genotype. This is a command that might do the trick:<br />
<br />
<pre><code>plink --bfile output --impute-sex .3 .3 --make-bed --out output && rm output.{bed,bim,fam}~
</code></pre>
<br />
However, the two thresholds you should use (.3 .3 in the above example) to separate males from females will depend on the particular type of data you are using (e.g. exome or whole genome) and might have to be selected ad hoc. See <a href="https://www.cog-genomics.org/plink2/basic_stats#check_sex">here</a> for additional information.
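As I understand it, plink decides from the X-chromosome inbreeding coefficient F: values below the first threshold are called female, values above the second are called male, and anything in between is left unknown. A sketch of that decision rule, with the .3/.3 thresholds from the command above (the F values in the example are made up):

```python
def impute_sex(f_x, female_max=0.3, male_min=0.3):
    """Mimic plink's threshold rule on the X-chromosome F estimate."""
    if f_x < female_max:
        return 2  # plink sex code for female
    if f_x > male_min:
        return 1  # plink sex code for male
    return 0      # unknown: F falls between the two thresholds

print(impute_sex(0.02))  # → 2 (female)
print(impute_sex(0.97))  # → 1 (male)
```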
<br />
A python wrapper to perform the above operations is available online (<a href="https://raw.githubusercontent.com/freeseek/gpipe/master/vcf2plink.py">here</a>).Unknownnoreply@blogger.com21tag:blogger.com,1999:blog-2735750853096182491.post-35903555622707927452014-05-13T12:57:00.002-07:002014-05-13T12:57:32.522-07:00How do I identify all the homopolymer run indels in a VCF file?With sequencing projects, we analysts are usually left to deal with genotype data in VCF format. These files can contain genotypes for a variety of markers, from SNPs to structural variants. Small structural variants, commonly known as indels, are usually harder to call from sequencing data, and we tend to be wary of their genotypes. However, not all indels are born the same. There are a variety of different kinds of indels, and depending on your project, you might be more or less interested in some of them. A kind of indel that is particularly difficult to call is the homopolymer run indel, that is, a deletion or insertion within a repeat of the same nucleotide. A very famous homopolymer run indel is the one causing medullary cystic kidney disease type 1 by disrupting the coding sequence within a VNTR of MUC1.<br />
<br />
It might be important to identify which indels in your VCF file are homopolymer run indels, but this is not so straightforward. Also, this information cannot be extracted from the VCF file alone, without the reference, so the information from two files needs to be integrated. I thought of a little hack to do this quickly, without writing any serious amount of code, using bedtools only. The following bash script should do it:<br />
<blockquote class="tr_bq">
#!/bin/bash<br />input="$1" # input VCF file in any format accepted by bcftools<br />ref="$2" # reference genome in fasta format </blockquote>
<blockquote class="tr_bq">
bcftools view -HG $input | cut -f1-5 | awk '$4!~"^[ACGT]($|,)" || $5!~"^[ACGT]($|,)"' |<br /> awk '{print $1"\t"$2+length($4)-1"\t"$2+length($4)+5"\t"$3}' |<br /> bedtools getfasta -name -fi $ref -bed /dev/stdin -fo /dev/stdout |<br /> awk 'NR%2==1 {printf substr($0,2)"\t"} NR%2==0' |<br /> grep "AAAAAA$\|CCCCCC$\|GGGGGG$\|TTTTTT$" | cut -f1 | sort</blockquote>
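The same check can be sketched in pure Python: look at the six reference bases immediately after the REF allele and test whether they form a run of a single nucleotide. The `ref_seq` string and coordinates here are made up for illustration:

```python
def is_homopolymer_indel(ref_seq, pos, ref_allele, run=6):
    """True if the `run` bases right after the REF allele are all the same nucleotide.

    ref_seq is the full reference sequence of the chromosome; pos is 1-based.
    """
    start = (pos - 1) + len(ref_allele)  # 0-based index of first base after REF
    context = ref_seq[start:start + run]
    return len(context) == run and len(set(context)) == 1

# toy reference with an A-run starting right after the indel's REF allele
ref_seq = "GGTAAAAAAAC"
print(is_homopolymer_indel(ref_seq, 3, "TA"))  # → True
print(is_homopolymer_indel(ref_seq, 1, "G"))   # → False
```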
The bash script takes the VCF file and the reference as input and yields the names of all markers identified as homopolymer run indels. Notice that it requires each marker in your VCF file to have its own unique ID. Also, the definition of a homopolymer run indel is arbitrary here, consisting of a repeat of six base pairs. It should be easy to change the code if you want to tweak that.Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-2735750853096182491.post-40490893332559222772013-12-16T10:24:00.001-08:002013-12-16T10:24:23.517-08:00Estimating global ancestral components with 1000 GenomesFollowing the last two posts, here I will show you how to use the PCA results after merging your dataset with the 1000 Genomes project dataset to estimate global ancestral components for the four main ancestries around the world, that is, European, African, Native American, and Asian.<br />
<br />
Since there are no Native Americans in the 1000 Genomes project, we need the global ancestry estimates for the Latinos in the project in order to estimate this component. These estimates are available as part of the project, so we just need to download them:<br />
<br />
<pre>anc="%DIRECTORY WITH ANCESTRAL COMPONENTS%"
mkdir -p $anc
# download global proportions table
(echo "IID EUR AFR NAT UNK";
for pop in ASW CLM MXL PUR; do
wget -O- ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/ancestry_deconvolution/${pop}_phase1_globalproportions_and_unknown.txt
done) > $anc/gap
</pre>
<br />
The 1000 Genomes project individuals can then be used to train a linear model that estimates global ancestral components. If you followed the directory structure for the PCA as discussed in the previous <a href="http://apol1.blogspot.com/2013/12/pca-of-your-plink-dataset-with-1000.html">post</a>, then the following R script will take the three tables kgp/$set.kgp.pop, kgp/$set.kgp.pca, and $anc/gap as input and will output a fourth table, say kgp/$set.anc, with the global ancestry estimates.
<br />
<br />
<pre>#!/usr/bin/Rscript
args <- commandArgs(trailingOnly = TRUE)
</pre>
<pre># load and merge input tables
df1 <- read.table(args[1],header=TRUE)
df2 <- read.table(args[2],header=TRUE)
df3 <- read.table(args[3],header=TRUE)
data <- merge(df1,df2,all=TRUE,incomparables=NaN)
data <- merge(data,df3,all=TRUE,incomparables=NaN)
data$ASN <- NaN
# European populations
eur <- data$POP=="CEU" | data$POP=="FIN" | data$POP=="GBR" | data$POP=="IBS" | data$POP=="TSI"
data[which(eur),"EUR"] <- 1; data[which(eur),"AFR"] <- 0; data[which(eur),"NAT"] <- 0; data[which(eur),"ASN"] <- 0
# African populations
afr <- data$POP=="YRI"
data[which(afr),"EUR"] <- 0; data[which(afr),"AFR"] <- 1; data[which(afr),"NAT"] <- 0; data[which(afr),"ASN"] <- 0
# Asian populations
asn <- data$POP=="CDX" | data$POP=="CHB" | data$POP=="CHD" | data$POP=="CHS" | data$POP=="JPT"
data[which(asn),"EUR"] <- 0; data[which(asn),"AFR"] <- 0; data[which(asn),"NAT"] <- 0; data[which(asn),"ASN"] <- 1
# Admixed populations
mix <- data$POP=="ASW" | data$POP=="CLM" | data$POP=="MXL" | data$POP=="PUR"
for (anc in c("EUR","AFR","ASN")) {
data[which(mix),anc] <- data[which(mix),anc] * 1/(1-data[which(mix),"UNK"])
}
data[which(mix),"ASN"] <- 0
# African American samples to be excluded
exclude = c("NA20299","NA20314","NA20414","NA19625","NA19921")
for (anc in c("EUR","AFR","NAT","ASN")) {
data[is.element(data$IID,exclude),anc] <- NaN
}
# predict global ancestral proportions
for (anc in c("EUR","AFR","NAT","ASN")) {
tmp <- lm(as.formula(paste0(anc," ~ PCA1 + PCA2 + PCA3")), data)
data[is.na(data[,anc]),anc] = predict.lm(tmp,data[is.na(data[,anc]),])
}
# output data to disk
write.table(data[,c("FID","IID","EUR","AFR","NAT","ASN")],args[4],row.names=FALSE,quote=FALSE)
</pre>
<br />
If you look into the code, you will see that some African American samples are excluded from the computation. This is due to the fact that a few African American individuals in the 1000 Genomes project have a significant amount of Native American ancestry, but this does not show up in the global ancestry proportion tables as it was not modeled. It is therefore better to exclude those individuals from these computations.<br />
<br />
Also, you might notice that, due to the inherent noise in the way the global ancestral components are modeled, it is possible that some ancestral components will be negative. This is to be expected, as these are noisy estimates. Make sure any further downstream analysis takes care of this eventuality. Overall, estimating ancestral components from principal components is a noisier business than estimating ancestral components from local ancestry deconvolutions. However, for many purposes these estimates should work just fine.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2735750853096182491.post-76645647388681513282013-12-13T13:14:00.001-08:002013-12-13T13:14:19.118-08:00PCA of your plink dataset with 1000 GenomesEstimating global ancestral components of your samples should really not be harder than running a principal component analysis (PCA). Here in this post, I will show you how you can do exactly that, using the 1000 Genomes project samples. You will first need your dataset and the 1000 Genomes dataset in plink format, as explained in the <a href="http://apol1.blogspot.com/2013/12/1000-genomes-pca-analysis.html">previous post</a>.<br />
<br />
As usual, you will need a few variables defined:
<br />
<pre>set="%PLINK DATASET PREFIX%"
pth="%PATH TO THE 1000 GENOMES PLINK DATASET%"
gct="%DIRECTORY WITH GCTA%"
</pre>
<br />
It is a good idea to make sure that your dataset is properly cleaned and pruned before merging it with the 1000 Genomes project data. It is also important that SNPs be coded on the forward strand, as this is how the 1000 Genomes project data is coded. The following code will identify SNPs that are shared and compatible between the datasets, merge the individuals over these SNPs, and compute the genetic relationship matrix necessary for the PCA analysis:
<br />
<br />
<pre>mkdir -p kgp
# identify SNPs compatible with the merge
awk 'NR==FNR {chr[$2]=$1; pos[$2]=$4; a1[$2]=$5; a2[$2]=$6} NR>FNR && ($2 in chr) && chr[$2]==$1 && pos[$2]==$4 && (a1[$2]==$5 && a2[$2]==$6 || a1[$2]==$6 && a2[$2]==$5) {print $2}' $set.bim <(cat $(seq -f $pth/kgp.chr%g.bim 1 22) $pth/kgp.chrX.bim) > kgp/extract.txt
# extract SNPs from the 1000 Genomes dataset
for chr in {1..22} X; do plink --noweb --bfile $pth/kgp.chr$chr --extract kgp/extract.txt --make-bed --out kgp/kgp.chr$chr; done
/bin/cp kgp/kgp.chr1.fam kgp/kgp.fam
for chr in {1..22} X; do cat kgp/kgp.chr$chr.bim; done > kgp/kgp.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp/kgp.chr$chr.bed; done) > kgp/kgp.bed
for chr in {1..22} X; do /bin/rm kgp/kgp.chr$chr.bed kgp/kgp.chr$chr.bim kgp/kgp.chr$chr.fam kgp/kgp.chr$chr.log; done
# merge two datasets
plink --noweb --bfile $set --bmerge kgp/kgp.bed kgp/kgp.bim kgp/kgp.fam --make-bed --out kgp/$set.kgp
# create population categorical data
awk 'BEGIN {print "FID","IID","POP"} NR==FNR {pop[$2]=$7} !($2 in pop) {pop[$2]="SET"} NR>FNR {print $1,$2,pop[$2]}' $pth/20130606_g1k.ped kgp/$set.kgp.fam > kgp/$set.kgp.pop
# use GCTA to compute the genetic relationship matrix
gcta64 --bfile kgp/$set.kgp --autosome --make-grm-bin --out kgp/$set.kgp --thread-num 10
</pre>
At this point we are ready to compute principal components.
<br />
<pre># compute principal components using GCTA
gcta64 --grm-bin kgp/$set.kgp --pca 20 --out kgp/$set.kgp --thread-num 10
# create PCA matrix for use with R data.frame
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp/$set.kgp.eigenval | tr '\n' ' '; echo; cat kgp/$set.kgp.eigenvec) |
awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp/$set.kgp.pca
</pre>
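The awk one-liner above scales each principal component by the square root of its eigenvalue, so that a sample's coordinates reflect how much variance each component explains. The same transformation in plain Python (the numbers are made up):

```python
import math

def scale_components(eigenvalues, eigenvectors):
    """Scale each sample's PC coordinates by sqrt(eigenvalue), as the awk script does."""
    scale = [math.sqrt(ev) for ev in eigenvalues]
    return [[x * s for x, s in zip(row, scale)] for row in eigenvectors]

eigenvalues = [4.0, 1.0]                  # one eigenvalue per component
eigenvectors = [[0.5, 0.5], [-0.5, 0.5]]  # one row of PC coordinates per sample
print(scale_components(eigenvalues, eigenvectors))  # → [[1.0, 0.5], [-1.0, 0.5]]
```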
<br />
The shell code above should have created two files, kgp/$set.kgp.pop and kgp/$set.kgp.pca, which can be easily loaded into R with the following script, which takes the two files as parameters:
<br />
<br />
<pre>#!/usr/bin/Rscript
suppressPackageStartupMessages(library(ggplot2))
args <- commandArgs(trailingOnly = TRUE)
df1 <- read.table(args[1], header = TRUE)
df2 <- read.table(args[2], header = TRUE)
data <- merge(df1,df2)
ggplot(data,aes(x=PCA1,y=PCA2,colour=POP,shape=POP)) + geom_point(alpha=1/3) + scale_colour_discrete() + scale_shape_manual(values=1:length(levels(data[,"POP"])))
</pre>
<br />
Notice that you will need R and ggplot2 already installed to run this script.Unknownnoreply@blogger.com4tag:blogger.com,1999:blog-2735750853096182491.post-89868005232342206312013-12-12T16:17:00.000-08:002014-10-29T13:08:25.796-07:001000 Genomes PCA analysisThe easiest way to run a PCA analysis with the 1000 Genomes samples is to download the data, convert it to plink format, and use GCTA to perform the bulk of the computation. The dataset provided on the beagle website is likely the easiest to start with. Here is some code that can get the work done.<br />
<br />
First we need to setup a few variables:
<br />
<pre>gct="%DIRECTORY WITH GCTA%"
vcf="%DIRECTORY WITH KGP VCF FILES%"
</pre>
<br />
And then we can download all necessary files:
<br />
<pre># install plink and utilities to convert to plink format
sudo apt-get install plink
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tfam.sh
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tped.sh
chmod a+x vcf2tfam.sh vcf2tped.sh</pre>
<pre># download the 1000 Genomes files
pfx="http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/chr"
sfx=".1kg.ref.phase1_release_v3.20101123.vcf.gz"
for chr in {1..22} X; do wget $pfx$chr$sfx $pfx$chr$sfx.tbi -P $vcf/; done</pre>
<pre>wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped
# install GCTA
wget http://www.complextraitgenomics.com/software/gcta/gcta_1.22.zip
mkdir -p $gct
unzip gcta_1.22.zip gcta64 -d $gct/
</pre>
<br />
For each chromosome we now need to create a plink binary file:
<br />
<pre># convert each VCF file to a plink binary file with sex information included
pfx="$vcf/chr"
for chr in {1..22} X; do
tabix -H $pfx$chr$sfx | ./vcf2tfam.sh /dev/stdin > kgp.chr$chr.tfam
tabix $pfx$chr$sfx $chr | ./vcf2tped.sh /dev/stdin > kgp.chr$chr.tped
p-link --noweb --tfile kgp.chr$chr --make-bed --out kgp.chr$chr
/bin/rm kgp.chr$chr.tfam kgp.chr$chr.tped kgp.chr$chr.nosex kgp.chr$chr.log
awk 'NR==FNR {sex[$2]=$5} NR>FNR {$5=sex[$2]; print}' 20130606_g1k.ped kgp.chr$chr.fam > kgp.chr$chr.fam.tmp && /bin/mv kgp.chr$chr.fam.tmp kgp.chr$chr.fam
done
</pre>
<br />
At this point we can use plink to prune each chromosome and then join the individual datasets into a single large dataset as discussed in a <a href="http://apol1.blogspot.com/2013/10/merging-plink-binary-files.html">previous post</a>:<br />
<pre># prune each chromosome
for chr in {1..22} X; do
p-link --noweb --bfile kgp.chr$chr --maf .01 --indep 50 5 2 --out kgp.chr$chr
p-link --noweb --bfile kgp.chr$chr --extract kgp.chr$chr.prune.in --make-bed --out kgp.chr$chr.prune
done
# join pruned datasets into single dataset
cat kgp.chr1.prune.fam > kgp.prune.fam
for chr in {1..22} X; do cat kgp.chr$chr.prune.bim; done > kgp.prune.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp.chr$chr.prune.bed; done) > kgp.prune.bed
</pre>
<br />
We can now run GCTA to compute the genetic relationship matrix and then the main principal components:<br />
<pre># perform PCA analysis of the whole dataset
$gct/gcta64 --bfile kgp.prune --autosome --make-grm-bin --out kgp --thread-num 10
$gct/gcta64 --grm-bin kgp --pca 20 --out kgp --thread-num 10
# create PCA matrix to be loaded into R
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp.eigenval | tr '\n' ' '; echo; cat kgp.eigenvec) |
awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp.pca
</pre>
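Before moving on, the eigenvalues can be sanity-checked by computing the share of variance each extracted component explains relative to the others (toy values below; with real data feed kgp.eigenval to the same awk command):

```shell
# share of variance per component, relative to the extracted eigenvalues only
printf '4.0\n3.0\n2.0\n1.0\n' > eigenval.demo
awk '{v[NR]=$1; s+=$1} END {for (i=1; i<=NR; i++) printf "PC%d %.0f%%\n", i, 100*v[i]/s}' eigenval.demo
# prints: PC1 40%, PC2 30%, PC3 20%, PC4 10%
```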
<br />
This will generate a table that can be easily loaded into R for further analyses.<br />
<br />
IBD pipeline (2013-11-18)<br />
Segments inherited identical-by-descent (IBD) can be informative for many interesting analyses. However, it takes a few steps to set up a fully functional IBD pipeline. Here is a basic one that will work with any dataset in binary plink format: it first phases the data with hapi-ur, then detects IBD segments using beagle, and finally detects clusters of IBD individuals using dash.<br />
<br />
To begin with, you need to set up a few variables:
<br />
<pre>hpr="%DIRECTORY WITH HAPI-UR BINARY FILES%"
gmp="%DIRECTORY WITH GENETIC MAP FILES%"
bgl="%DIRECTORY WITH BEAGLE BINARY FILE%"
dsh="%DIRECTORY WITH DASH BINARY FILE%"
set="%PLINK DATASET PREFIX%"
</pre>
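For example (hypothetical paths — substitute your own; the only requirement is that $set.bed, $set.bim, and $set.fam exist):

```shell
# hypothetical values for the pipeline variables -- adjust to your machine
hpr="$HOME/bin/hapi-ur"
gmp="$HOME/data/genetic_maps"
bgl="$HOME/bin/beagle"
dsh="$HOME/bin/dash"
set="mydata"   # expects mydata.bed, mydata.bim, mydata.fam in the working directory
```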
<br />
And you need a few programs pre-installed on your Linux machine:
<br />
<pre># install plink
sudo apt-get install plink
# install hapi-ur
wget https://hapi-ur.googlecode.com/files/hapi-ur-1.01.tgz
tar xzvf hapi-ur-1.01.tgz
mkdir -p $hpr
/bin/cp hapi-ur-1.01/hapi-ur hapi-ur-1.01/insert-map.pl $hpr/
# install genetic maps
wget http://mathgen.stats.ox.ac.uk/impute/genetic_maps_b37.tgz
tar xzvf genetic_maps_b37.tgz
mkdir -p $gmp
for i in {1..22} X_PAR1 X_nonPAR X_PAR2; do gzip -c genetic_maps_b37/genetic_map_chr${i}_combined_b37.txt > $gmp/chr$i.gmap.gz; done
# install beagle
mkdir -p $bgl
wget http://faculty.washington.edu/browning/beagle/b4.r1196.jar -P $bgl/
# install dash
wget http://www1.cs.columbia.edu/~gusev/dash/dash-1-1.tar.gz
tar xzvf dash-1-1.tar.gz
mkdir -p $dsh
/bin/cp dash-1-1/bin/dash_cc $dsh/
</pre>
<br />
The following pipeline will run the analysis independently on each autosome:
<br />
<pre># perform IBD and DASH analysis separately for each autosome
for i in {1..22}; do
# create a directory
mkdir -p chr$i
# extract autosome of interest
p-link --noweb --bfile $set --chr $i --make-bed --out chr$i/$set
# generate marker file with genetic distance
zcat $gmp/chr$i.gmap.gz | $hpr/insert-map.pl chr$i/$set.bim - | awk -v OFS="\t" \
'length($5)>1 || $6=="-" {$5="I"; $6="D"} $5=="-" || length($6)>1 {$5="D"; $6="I"} {print $1,$2,$3,$4,$5,$6}' > chr$i/$set.gmap.bim
# clean up just in case
/bin/rm -f chr$i/$set.sample chr$i/$set.haps
# phase haplotypes using hapi-ur
$hpr/hapi-ur -g chr$i/$set.bed -s chr$i/$set.gmap.bim -i chr$i/$set.fam -w 73 --impute -o chr$i/$set
# convert hapi-ur output for use with beagle
(echo "##fileformat=VCFv4.1";
echo "##FORMAT=&lt;ID=GT,Number=1,Type=String,Description=\"Genotype\"&gt;";
echo -en "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t";
awk 'NR>2 {printf $1"."$2"\t"}' chr$i/$set.sample | sed 's/\t$/\n/g';
awk -v c=$i '{printf c"\t"$3"\t"$2"\t"$4"\t"$5"\t.\tPASS\t.\tGT";
for (i=6; i&lt;NF; i+=2) printf "\t"$i"|"$(i+1); printf "\n"}' chr$i/$set.haps) |
java -Xmx3g -jar $bgl/b4.r1196.jar ibd=true gt=/dev/stdin usephase=true burnin-its=0 phase-its=0 out=chr$i/$set
# convert beagle output for use with DASH
awk '{sub("\\\."," ",$1); sub("\\\."," ",$3); print $1"."$2-1,$3"."$4-1,$6,$7}' chr$i/$set.hbd chr$i/$set.ibd |
$dsh/dash_cc chr$i/$set.fam chr$i/$set
# create map file with IBD clusters
cut -f1-5 chr$i/$set.clst | awk -v c=$i '{ print c,c"_"$1,$2,$3,$4,$5 }' > chr$i/$set.cmap
# create map file with IBD clusters for use with plink
awk -v c=$i '{ print c,$2,0,int(($3+$4)/2) }' chr$i/$set.cmap > chr$i/$set.map
# convert output to plink binary format
p-link --maf 0.001 --file chr$i/$set --make-bed --recode --out chr$i/$set
done
</pre>
<br />
You can then join the results in a single plink binary file using the code discussed in a <a href="http://apol1.blogspot.com/2013/10/merging-plink-binary-files.html">previous post</a>:
<br />
<pre># join separate autosome outputs into a single plink binary file
for i in {1..22}; do cat chr$i/$set.cmap; done > $set.dash.cmap
cp $set.fam $set.dash.fam
for i in {1..22}; do cat chr$i/$set.bim; done > $set.dash.bim
(echo -en "\x6C\x1B\x01"; for i in {1..22}; do tail -c +4 chr$i/$set.bed; done) > $set.dash.bed
</pre>
<br />
Longest homopolymer in coding sequence in the human genome (2013-10-30)<br />
Most homopolymer sequences are expected to be negatively selected in coding sequence, due to their higher mutation rate. So what is the longest homopolymer in coding sequence in the human genome? Here is a short script to answer this question.<br />
<br />
First you will need a few programs:
<br />
<pre>sudo apt-get install wget samtools bedtools
</pre>
<br />
And you will need a copy of the hg19 genome:
<br />
<pre>wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar xfO chromFa.tar.gz > hg19.fa
samtools faidx hg19.fa
</pre>
<br />
Finally, generate a bed interval file with the coordinates of the coding exons and look for homopolymers within these:
<br />
<pre>wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
zcat refGene.txt.gz | awk '{split($10,a,","); split($11,b,","); for (i=1; i&lt;length(a); i++) {from=$7>a[i]?$7:a[i]; to=$8&lt;b[i]?$8:b[i]; if (to>from) print $3"\t"from"\t"to}}' | bedtools sort | bedtools merge > refGene.bed
bedtools getfasta -fi hg19.fa -bed refGene.bed -fo refGene.fa
for bp in A C G T; do hp=$(for i in {1..13}; do echo -n $bp; done); grep -B1 $hp refGene.fa; done
</pre>
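To see what the final grep is doing, here is the same 13-mer homopolymer search run on a two-record toy FASTA instead of refGene.fa:

```shell
# toy FASTA: seq2 contains a run of exactly 13 A's
printf '>seq1\nACGTACGT\n>seq2\nGGAAAAAAAAAAAAAGG\n' > toy.fa
for bp in A C G T; do
  hp=$(for i in {1..13}; do echo -n $bp; done)   # e.g. AAAAAAAAAAAAA
  grep -B1 $hp toy.fa                            # header line + matching line
done
# only seq2 matches: ">seq2" then "GGAAAAAAAAAAAAAGG"
```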
<br />
It turns out that the longest homopolymer in coding sequence is 13bp long and lies within the CAMKK2 gene.<br />
<br />
Merging plink binary files (2013-10-11)<br />
If you have a bunch of plink binary files for the same set of people split over different chromosomes, how do you merge them? A simple answer would be to use the --merge-list option in plink. However, that requires some setting up, and it turns out there is a much easier way.<br />
<br />
Assuming that the order of the samples within the plink files is the same (i.e. the fam files are all identical) and that each plink binary file is in SNP-major format (this is important! see here: http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml) then the following code will just work:<br />
<br />
<pre>cat plink.chr1.fam > plink.fam
for chr in {1..22} X Y; do cat plink.chr$chr.bim; done > plink.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X Y; do tail -c +4 plink.chr$chr.bed; done) > plink.bed</pre>
<br />
This assumes the plink files you want to merge have names like plink.chr1.bed, plink.chr2.bed, ..., plink.chrY.bed, and that each file corresponds to a different chromosome of your dataset. This is a typical scenario, especially if you are converting a big sequencing project delivered as multiple VCF files into plink format.<br />
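The `echo -en "\x6C\x1B\x01"` above writes plink's three-byte header: two magic bytes identifying the file type, plus a third byte flagging SNP-major order. A quick standalone check that a header carries the expected bytes:

```shell
# write the three-byte plink header and inspect it in hex
printf '\x6c\x1b\x01' > header.bed
od -An -tx1 header.bed
# a SNP-major .bed file begins with bytes: 6c 1b 01
```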
<br />
A plink binary file in SNP-major format is just a big table preceded by three bytes that identify the file type and the order (SNP-major or individual-major). By default plink generates SNP-major binary files, which are ordered by SNP, like VCF files, so you can simply concatenate them after removing the three-byte headers.<br />
<br />
Convert VCF files to PLINK binary format (2013-10-08)<br />
How many times have you needed to convert a VCF file to PLINK binary format? The 1000 Genomes Project has a recommended tool (http://www.1000genomes.org/vcf-ped-converter), but it only works for small regions. Another easy-to-use tool is SnpSift (http://snpeff.sourceforge.net/SnpSift.html#vcf2tped).
<br />
<br />
Overall, it really isn't that complicated to convert a VCF file. The idea is to first convert it to tped format and then let plink do the job of converting that to binary format.
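For reference, the intermediate format is simple: a .tped holds one variant per line (chromosome, variant ID, genetic distance, position, then two alleles per sample), and a .tfam holds one sample per line, matching the first six columns of a .fam file. A toy pair, with made-up IDs:

```shell
# one variant typed in three samples: chrom, ID, cM, pos, then 2 alleles/sample
echo "1 rs123 0 10583 G A G G A A" > demo.tped
# FID IID PAT MAT SEX PHENOTYPE, one line per sample
printf 'f1 s1 0 0 1 -9\nf1 s2 0 0 2 -9\nf1 s3 0 0 1 -9\n' > demo.tfam
# consistency check: allele columns must equal 2 x number of samples
awk '{print (NF-4)/2}' demo.tped   # prints 3, matching the 3 samples
```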
<br />
<br />
<pre>sudo apt-get install plink tabix
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tfam.sh
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tped.sh
chmod a+x vcf2tfam.sh vcf2tped.sh</pre>
<br />
The tools are fairly simple and are meant to be flexible. Now, suppose your VCF file is bgzip-compressed. A good idea might be to split it by chromosome and generate one plink file per chromosome. Here is some code that will do that:
<br />
<br />
<pre>for chr in {1..22} X Y MT; do
tabix -H $vcf | ./vcf2tfam.sh /dev/stdin > gpc.chr$chr.tfam
tabix $vcf $chr | ./vcf2tped.sh /dev/stdin > gpc.chr$chr.tped
plink --tfile gpc.chr$chr --make-bed --out gpc.chr$chr
/bin/rm gpc.chr$chr.tfam gpc.chr$chr.tped gpc.chr$chr.nosex gpc.chr$chr.log
done</pre>
<br />
Impute APOE and APOL1 with 23andMe (2013-08-28)<br />
If you have paid to get your genome genotyped by 23andMe, you probably know that there are many variants that 23andMe does not directly genotype. To work around this problem, large datasets like the 1000 Genomes Project can be used to compare your genotypes with those of other individuals and guess what your missing genotypes are. This process is called imputation and is routinely practiced by many researchers in my area. However, little is available online about imputing your own 23andMe data. The best link I could find was <a href="http://www.genomesunzipped.org/2013/03/learning-more-from-your-23andme-results-with-imputation.php">here</a>, but the instructions provided still require a fair amount of knowledge of some specific tools.<br />
<br />
I think this is unacceptable, so I decided to come up with my own scripts and provide a couple of examples showing how to impute your own rs429358 APOE genotype (the famous Alzheimer's variant, which you will have genotyped only if you bought the more recent 23andMe v3 chip) and how to impute the rs73885319, rs60910145, and rs71785313 APOL1 genotypes (those associated with non-diabetic kidney disease in people of African descent, as shown in my previous <a href="http://dx.doi.org/10.1126/science.1193032">paper</a>), which unfortunately are still not tested by 23andMe.<br />
<br />
I wrote a few simple scripts that will allow you to impute these genotypes in a matter of minutes (or seconds, depending on how fast you can cut and paste the code below). The following instructions should work on any recent Linux Ubuntu box. You should be able to use this code on a Mac as well, provided you have the necessary basic tools already installed.<br />
<br />
The first preliminary step is to download your raw 23andMe data. Go to your <a href="https://www.23andme.com/you/download/">account</a> and download the "All DNA" data set.<br />
<br />
The second preliminary step is to install a few UNIX-friendly programs to manipulate genetic datasets, the imputation software <a href="http://faculty.washington.edu/browning/beagle/b4.html">Beagle</a>, a couple of scripts I wrote, and a template VCF file for the variants tested by 23andMe. The following commands will do:<br />
<br />
<pre>sudo apt-get install bedtools dos2unix openjdk-8-jre tabix unzip wget
wget https://faculty.washington.edu/browning/beagle/beagle.16Jun16.7e4.jar
wget http://personal.broadinstitute.org/giulio/23andme/make_vcf_template.sh
wget http://personal.broadinstitute.org/giulio/23andme/23andme_to_vcf.sh
wget http://personal.broadinstitute.org/giulio/23andme/vcf_impute.sh
wget http://personal.broadinstitute.org/giulio/23andme/23andme.vcf.gz
chmod a+x make_vcf_template.sh 23andme_to_vcf.sh vcf_impute.sh</pre>
<br />
The 23andme.vcf.gz file is a VCF template I pre-built for 23andMe v2 and 23andMe v3 chips. If you want to build it yourself (optional, you don't need to do this!), from a directory where you have stored your 23andMe raw files, run:<br />
<br />
<pre>wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/00-All.vcf.gz
./make_vcf_template.sh 00-All.vcf.gz genome_*_Full_*.zip</pre>
<br />
This step takes time as the whole dbSNP dataset (~1.4GB) will need to be downloaded and processed.<br />
<br />
Now the first step is to convert the raw downloaded data to a more standard format, the <a href="http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41">VCF</a> format:<br />
<br />
<pre>./23andme_to_vcf.sh 23andme.vcf.gz genome_*_Full_*.zip</pre>
<br />
Where genome_*_Full_*.zip should be substituted with the name of the zip file you downloaded from 23andMe. The second and last step is to create a list of variants (from a single locus) to be imputed and pass the genomic locations of these variants to a script which will run the imputation.<br />
<br />
The script provided will automatically download a piece of the 1000 Genomes Project reference panel around the variants of interest, subset this genotype panel to the variants to be imputed and the variants available in your VCF file, and run the imputation with this minimal reference panel. This way the imputation process will be immediate and you will have your results right away.<br />
<br />
For APOE, we are only interested in the rs429358 variant and the rs7412 variant (which is already genotyped both in the 23andMe v2 and the 23andMe v3 chip), so the code will look like this:<br />
<br />
<pre>echo -e "19\t45411940\t45411941\n19\t45412078\t45412079" > apoe.bed
./vcf_impute.sh beagle.16Jun16.7e4.jar apoe.bed genome_*_Full_*.vcf.gz</pre>
<br />
Where genome_*_Full_*.vcf.gz should be substituted with the name of the compressed VCF file generated in the previous step. For APOL1 we are interested instead in three variants:<br />
<br />
<pre>echo -e "22\t36661905\t36661906\n22\t36662033\t36662034\n22\t36662040\t36662047" > apol1.bed
./vcf_impute.sh beagle.16Jun16.7e4.jar apol1.bed genome_*_Full_*.vcf.gz</pre>
<br />
The results will be displayed on screen in VCF format with best-guess genotypes and individual genotype likelihoods. Of course, the results are only probabilistic, so you will need to interpret them accordingly. As for the biological interpretation, snpedia has a good summary both for <a href="http://snpedia.com/index.php/APOE">APOE</a> and for <a href="http://snpedia.com/index.php/APOL1">APOL1</a>.<br />
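For APOE specifically, interpretation boils down to combining the two variants: on each haplotype, the rs429358/rs7412 allele pair determines the ε allele (T/T→ε2, T/C→ε3, C/C→ε4). Here is a small lookup helper for that mapping — my own illustration, not one of the scripts above:

```shell
# map one haplotype's rs429358 and rs7412 alleles to an APOE epsilon allele
apoe_allele() {
  case "$1$2" in          # $1 = rs429358 allele, $2 = rs7412 allele
    TT) echo e2 ;;
    TC) echo e3 ;;
    CC) echo e4 ;;
    CT) echo e1 ;;        # extremely rare haplotype
    *)  echo NA ;;
  esac
}
apoe_allele T C   # prints: e3
apoe_allele C C   # prints: e4
```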
<br />
If you have any comments, complaints, or suggestions, please send me an <a href="mailto:giulio.genovese@gmail.com">email</a>! No perl code was used for any of the scripts.