Saturday, October 24, 2015

Distinguishing parents from children from genotyping array data

From genotype data for a trio (father, mother, child), it is straightforward to identify whom the parents are. It suffices to identify which pair shares genotypes from the third person so that the Mendelian error rate is minimized. From genotype data for a duo (parent, child), it is straightforward to identify that the two individuals are immediately related and that the relationship is a parent-child one. However, determining who is the parent and who is the child is a challenge of a different nature.

Without phasing information, sharing between a parent and a child is quite a symmetric relationship. One piece of information that can give directionality would be the identification of de-novo mutations, or mutations not shared by an outlier individual, that would only be present in the child and not in the parent. From genotype data we cannot assay de-novo mutations, however.

The other piece of information that can give directionality is more closely related to linkage analysis. A distant relative might share some DNA with the parent and possibly the same amount of DNA with the child, or occasionally only part of that amount with the child. The reverse is also possible but quite unlikely.

If you have DNA information for a duo either in 23andMe or in AncestryDNA, you can test whether this information is actionable by first downloading your genotype data using the instructions from my previous post and then visualizing estimated sharing between members of the duo and shared distant relatives with an additional python script.

This is what I have obtained using my own 23andMe and AncestryDNA accounts: Red markers indicate individuals for whom sharing differs with the individuals of the duo by an amount larger than 0.1% for 23andMe, or 5 centiMorgans for AncestryDNA. As you can see from the figure, there are always more red markers on the parent's side of the diagonal.

There are two explanations for this observation:
1) A distant relative shares more than one segment with the parent and a smaller number of segments with the child
2) A distant relative shares one large segment of DNA with the parent and a smaller chunk of that segment with the child due to a recombination event which split the segment in two parts, of which only one was passed to the child

Due to the large number of distant cousins sharing only one segment of DNA, I believe that it is more common to observe case (2) although you still need to expect the passed segment to be at least 5 centiMorgans long in order to be detected and reported.

Notice also that occasionally the child might share more with the distant cousin than the parent does (red markers under the diagonal). It is possible that statistical noise causes a shared DNA segment to be reported in the child but not in the parent. It is also occasionally possible that the distant relative is related to the child both through the mother and through the father, though in my experience this circumstance seems to be rare.

Limitations of 23andMe

While the Relative Finder section of 23andMe shows a long list of DNA matches for each profile, the script used in this post will not pair matches showing up as anonymous. This most likely will exclude 60-70% DNA matches, depending on profile anonymity chosen by your distant cousins.

Sunday, September 27, 2015

Visualize DNA matches graph

Through high-throughput SNP genotyping it has become possible to identify shared DNA with our fifth degree cousins. Two companies, 23andMe and AncestryDNA, currently each hold each a database of more than one million people of dense genotype data. Each database gives us the possibility to identify hundreds of distant DNA cousins as each service reports a list of all individuals who share DNA segments at least 5 centimorgans long.

DNA cousins from 1st up to 4th degree will usually share several such DNA segments making them stand out from the crowd of more distantly related DNA cousins. For the large list of individuals that only share with us one or two DNA segments, it is difficult to distinguish between a match as close as a fifth degree cousin and someone much more distantly related who just happens to share a long DNA segment which through luck stayed intact across several centuries.

Is there any information other than the number of shared DNA segments and their length in centimorgans that we can leverage to prioritize our DNA matches? Since August 26th 2015, AncestryDNA has launched a feature whereby for a given match we can further learn whether it in turn shares DNA with other of our DNA matches, with some limitations. We can also do the same for DNA matches that have agreed to share genomes with us on 23andMe, though it is a little bit more cumbersome.

We can therefore define a DNA matches graph where each node is one of our DNA matches and two nodes are connected by an edge if the two corresponding DNA matches share DNA with each other. What can we learn from this graph? What does the topology look like? Does the graph contain connected components? For example, if our parents were to be from very different places, it would be quite unlikely that our DNA matches from our mother's side would be connected to our DNA matches from our father's side. The same holds if our grandparents are from different places, and so on.

To be able to glance at our DNA matches graph, we need to access all the information that either AncestryDNA or 23andMe provides us at once. To retrieve this information manually is just impractical. To overcome this hurdle, I have put together a few python scripts which will do the job for us.

The workflow runs as follows:
1) We dump all DNA matches information in our user account
2) We process the information to generate a table of all pairwise DNA sharing
3) We visualize the DNA matches graph

However, for now, you will have to be familiar with running python commands and installing a few necessary python packages. Here are some examples of what came out for my mother's 23andMe test:

The blue nodes correspond to DNA matches on my paternal grandmother's side and the pink nodes correspond to DNA matches on my maternal grandmother's side. Interestingly these two groups form two very separate clusters and this is to be expected as my maternal grandparents come from two Italian towns quite distant from each other.

The following example is from the AncestryDNA account of an American friend of mine:

The small nodes correspond to DNA matches estimated as my friend's distant cousins, while the larger nodes are estimated as 4th cousins or closer, and the dark colored nodes correspond to nodes for which AncestryDNA has identified the most recent common ancestor in the family trees within nine generations. The clusters make some sense. For example, we were able to identify a little one corresponding to a little town from Poland where my friend's great-great grandmother is from. It is a great graph and I am jealous because if you have Northern European roots, you get a lot more DNA matches than if you have Southern European roots like me, most likely due to the bias in which individuals buy the AncestryDNA test.

If you are savvy enough, I encourage you to use the tool on your own data and do send me feedback! :-)

Limitations of 23andMe and AncestryDNA

When taking a DNA test with 23andMe or AncestryDNA all matches sharing at least 5 centimorgans with the individual being tested will be reported for further review. In either case some matches will be anonymous and some matches will share as much as their own family tree, depending on the settings selected by the user. I believe for chromosome X matches for males 23andMe might be lowering the threshold to something like 2 centimorgans as it is significantly easier to identify sharing on the X chromosome among males. This 5 centimorgans threshold for shared segments is likely selected for two reasons:

1) Shorter segments become increasingly likely to be false positives below 5 centimorgans without your genome being properly phased which is mostly not the case with the exception of those customers who also had their parents tested.

2) Shorter segments are likely to be due to really far away relationships and as such they might have very little value for genealogy as they would trace back several centuries, much beyond what the paper trail allows in most countries.

Limitations of 23andMe

When browsing the Relative Finder feature within 23andMe, you can see a full list of individuals sharing with you DNA segments larger than 5 centimorgans. For each DNA match you can also see how many separate segments you share, unlike AncestryDNA. However, you cannot determine whether two of your DNA matches are DNA matches with each other unless both users have agreed to share their genetic profiles with you.

Due to this limitation you can only build the DNA matches graph for a restricted class of individuals who are engaged enough in the service to be able to manage their sharing requests. Furthermore, you can only retrieve information for one pair of individuals at a time, making the process completely impractical if attempted manually. Due to how slow the 23andMe server is at retrieving such information, even with the script mentioned above, it will take about three seconds to retrieve information about each pair. For 250 individuals it might take approximately 24 hours, even if most retrieval requests will yield empty results.

Limitations of AncestryDNA

Unlike 23andMe, AncestryDNA does not report the number of shared segments among DNA matches nor the amount of genome shared. However, when browsing the site the browser does receive two variables invisible to the user:

1) sharedCentimorgans being an estimate of the amount of DNA shared across shared DNA segments larger than 5 centimorgans

2) meiosisValue being an estimate from sharedCentimorgans of the number of meioses separating you and your DNA match and this number is never larger than 10

From what I observed pairs of individuals within the AncestryDNA database can then be split in three categories:

1) Pairs that share no DNA segments larger than 5 centimorgans

2) Pairs that share at least one DNA segment longer than 5 centimorgans with a meiosisValue of exactly 10 (corresponding to sharedCentimorgans between 5 and 17.65, that is, relationships classified as 10:DISTANT COUSIN)

3) Pairs that share at least one DNA segment longer than 5 centimorgans with a meiosisValue smaller than 10 (corresponding to sharedCentimorgans greater than 17.65, that is, relationships classified as 0:SELF/TWIN, 1:PARENT/CHILD, 2:IMMEDIATE FAMILY, 3:CLOSE FAMILY, 4:1ST COUSIN, 5,6:2ND COUSIN, 7:3RD COUSIN, 8,9:4TH COUSIN)

When accessing your AncestryDNA test, you will have reported all DNA matches corresponding to the last two categories. Most (in my case 96-98%) of your pairs including you and one of your DNA matches will fall in the second category. However, when you are reviewing one of your DNA matches A in AncestryDNA, the Shared Matches feature will list only DNA matches B such that both A and B and you and B fall in the third category as also explained in another post.

As of this writing, even when listing the DNA matches you have in common with your father/mother you are actually only listing those DNA matches which make a third category pair with your father/mother. Such limitation is not present with 23andMe but I believe this is just a bug on AncestryDNA's side which will get fixed soon.

To summarize, through AncestryDNA you will only have access to those edges of the DNA matches graph corresponding to pairs sharing at least 17.65 centimorgans of DNA. This is quite a conservative selection and likely the result of careful debate within AncestryDNA to find a balance between being at the same time informative and concise.

Tuesday, May 19, 2015

Annotate CpG mutations in a VCF file

Mutations in the human genome are not made all the same. Even when restricting our attention to the most common form of human variation, that is, single nucleotide polymorphism, there are different categories. On a first approximation we have transversion and transitions. And among transitions we have transitions on CpG sites, which are much more common than transitions on non-CpG sites. Hence, the need to distinguish the two. If you have mutations encoded in the standard VCF format, it is easy to distinguish transitions from transversions. The first kind shows as a A<->G or a C<->T, while the second kind shows as a A<->C, A<->T, C<->G, or G<->T. But distinguishing CpG transitions from non-CpG transitions within a VCF is not possible without additional information as we need information about the sequence context.

We definitely need the reference genome sequence of the organism the VCF file refers to. However there is no tool to my knowledge that will take a VCF file and a fasta sequence and yield as output an annotated VCF file indicating which mutations are CpG mutations (it would be nice if someone wrote a bcftools pluging for this purporse). But there are tools that will annotate a VCF file given the presence of a mutation in another VCF file. Therefore one solution would be to have a VCF file with all possible CpG mutations. Here is how to do that:

# create your personal binary directory
mkdir -p ~/bin/

# download human genome reference (GRCh37)
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta &&
samtools faidx human_g1k_v37.fasta

# install latest version of bedtools
git clone https://github.com/arq5x/bedtools2.git &&
cd bedtools2 && make && cd .. &&
cp bedtools2/bin/bedtools ~/bin/

# create a VCF file with all CpG mutations (it takes several hours)
awk '{for (i=1; i<$2; i++)
  print $1"\t"i-1"\t"i+1"\t"$1" "i}' human_g1k_v37.fasta.fai |
  bedtools getfasta -name -fi human_g1k_v37.fasta \
  -bed /dev/stdin -fo /dev/stdout |
  tr '\n' ' ' | tr '>' '\n' | grep "CG $" |
  awk -v OFS="\t" 'BEGIN {print "##fileformat=VCFv4.1";
  print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"}
  {print $1"\t"$2"\t.\tC\tT\t.\t.\t.";
  print $1"\t"$2+1"\t.\tG\tA\t.\t.\t."}' |
  bgzip > cpg.vcf.gz &&
  tabix -f cpg.vcf.gz
To be completely thorough these mutations will be CpG mutations assuming that the reference sequence is the ancestral sequence which is not always the case. But for the majority of rare mutations it is a fair assumption. Furthermore, this code does not take into account whether a CpG mutation is located within a CpG island and this is also information that might be important as not all CpG sites are equally mutable.

Now that we have our VCF file with all CpG mutations, we can use SnpSift to annotate our input VCF file. I selected SnpSift because it is quite fast and flexible compared to other tools like bcftools (see here). This can be achieved as follows:

# install latest version of snpEff/SnpSift
wget http://downloads.sourceforge.net/project/snpeff/snpEff_latest_core.zip &&
unzip snpEff_latest_core.zip

# annotate your VCF file with SnpSift
java -jar snpEff/SnpSift.jar annotate -exists CPG cpg.vcf.gz input.vcf.gz | bgzip > output.vcf.gz &&
tabix -f output.vcf.gz
If you want to extract from the VCF file only the variants which are CpG mutations, the following code will work:

# install latest version of bcftools
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/

# extract from VCF file all CpG mutations
bcftools view -Oz -i "CPG==1" output.vcf.gz -o output.cpg.vcf.gz &&
tabix -f output.cpg.vcf.gz