Saturday, October 24, 2015

Distinguishing parents from children from genotyping array data

From genotype data for a trio (father, mother, child), it is straightforward to identify who the parents are. It suffices to find which of the three individuals, when treated as the child of the other two, minimizes the Mendelian error rate. From genotype data for a duo (parent, child), it is straightforward to establish that the two individuals are immediately related and that the relationship is a parent-child one. However, determining who is the parent and who is the child is a challenge of a different nature.
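The trio case can be sketched in a few lines of Python. This is only a minimal illustration, not the script used in this post. It assumes genotypes are coded as the number of copies of the alternate allele (0, 1 or 2), with None for missing calls, and the function names are made up for the example.

```python
from itertools import product

# Alleles (0 = reference, 1 = alternate) that a parent with a given
# genotype can transmit to a child.
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian_error(parent1, parent2, child):
    """True if the child genotype cannot arise from the two parental genotypes."""
    possible = {a + b for a, b in product(TRANSMITTABLE[parent1], TRANSMITTABLE[parent2])}
    return child not in possible


def mendelian_error_rate(parent1, parent2, child):
    """Fraction of fully called markers that show a Mendelian error."""
    calls = [t for t in zip(parent1, parent2, child) if None not in t]
    if not calls:
        return float("nan")
    errors = sum(is_mendelian_error(p1, p2, c) for p1, p2, c in calls)
    return errors / len(calls)


def guess_child(genotypes):
    """Given {name: genotype list} for three people, return the name that
    minimizes the error rate when treated as the child, plus all rates."""
    names = list(genotypes)
    rates = {}
    for child in names:
        p1, p2 = [n for n in names if n != child]
        rates[child] = mendelian_error_rate(genotypes[p1], genotypes[p2], genotypes[child])
    return min(rates, key=rates.get), rates
```

With real array data the error rate for the correct assignment is not exactly zero because of genotyping errors, but it should be far lower than for the two wrong assignments.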

Without phasing information, sharing between a parent and a child is quite a symmetric relationship. One piece of information that could give directionality is the identification of de novo mutations, which would be present only in the child and not in the parent. Genotyping arrays only assay a fixed set of known variants, however, so de novo mutations cannot be detected from this kind of data.

The other piece of information that can give directionality is more closely related to linkage analysis. A distant relative might share some DNA with the parent and possibly the same amount of DNA with the child, or occasionally only part of that amount with the child. The reverse is also possible but quite unlikely.

If you have DNA information for a duo in either 23andMe or AncestryDNA, you can test whether this information is actionable. First download your genotype data using the instructions from my previous post, then visualize the estimated sharing between each member of the duo and their shared distant relatives with an additional Python script.

This is what I have obtained using my own 23andMe and AncestryDNA accounts: Red markers indicate distant relatives whose sharing with the two individuals of the duo differs by more than 0.1% for 23andMe, or 5 centiMorgans for AncestryDNA. As you can see from the figure, there are always more red markers on the parent's side of the diagonal.
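The counting behind the red markers can be sketched as follows. This is not the script mentioned above, just a hedged illustration. It assumes you already have, for each member of the duo, a dictionary mapping a relative's identifier to the amount shared (in % or in centiMorgans); the function names and the input layout are hypothetical.

```python
def count_asymmetric_matches(sharing_a, sharing_b, threshold):
    """Count shared relatives whose sharing differs by more than `threshold`
    between person A and person B.

    Returns (more_with_a, more_with_b). Only relatives that appear in both
    dictionaries are considered.
    """
    more_with_a = 0
    more_with_b = 0
    for relative in sharing_a.keys() & sharing_b.keys():
        diff = sharing_a[relative] - sharing_b[relative]
        if diff > threshold:
            more_with_a += 1
        elif diff < -threshold:
            more_with_b += 1
    return more_with_a, more_with_b


def likely_parent(sharing_a, sharing_b, threshold):
    """Return "A" or "B" for the individual with more asymmetric matches on
    their side, or None if the counts are tied."""
    more_with_a, more_with_b = count_asymmetric_matches(sharing_a, sharing_b, threshold)
    if more_with_a == more_with_b:
        return None
    return "A" if more_with_a > more_with_b else "B"
```

For example, likely_parent(a_matches, b_matches, threshold=5.0) would apply the 5 centiMorgan cutoff used for AncestryDNA, and threshold=0.1 the cutoff used for 23andMe. With only a handful of asymmetric matches the answer should be treated as a hint rather than a proof.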

There are two explanations for this observation:
1) A distant relative shares more than one segment with the parent and a smaller number of segments with the child.
2) A distant relative shares one large segment of DNA with the parent and a smaller chunk of that segment with the child, due to a recombination event which split the segment into two parts, of which only one was passed to the child.

Due to the large number of distant cousins sharing only one segment of DNA, I believe that it is more common to observe case (2), although the transmitted part of the segment still needs to be at least 5 centiMorgans long in order to be detected and reported.
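To get a feeling for how often case (2) can happen, here is a rough back-of-the-envelope sketch. It assumes a simple model that is not part of the analysis above: crossovers fall along the genome as a Poisson process with one expected crossover per 100 centiMorgans (no interference), and the parent passes either of their two copies of the region with equal probability. The function name is made up for the example.

```python
import math


def transmission_probabilities(length_cm):
    """Chance that a parental segment of `length_cm` centiMorgans is passed
    to the child whole, in part, or not at all, under a Poisson crossover
    model with no interference."""
    # Probability that no crossover falls inside the segment.
    no_crossover = math.exp(-length_cm / 100.0)
    return {
        # No crossover, and the copy carrying the segment is transmitted.
        "whole": 0.5 * no_crossover,
        # No crossover, and the other copy is transmitted.
        "none": 0.5 * no_crossover,
        # At least one crossover inside the segment: only part is transmitted.
        "partial": 1.0 - no_crossover,
    }
```

Under these assumptions a 20 centiMorgan segment is split in roughly 18% of transmissions. Some of the resulting pieces will fall below the 5 centiMorgan reporting threshold, in which case the relative simply disappears from the child's match list.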

Notice also that occasionally the child might share more with the distant cousin than the parent does (red markers under the diagonal). It is possible that statistical noise causes a shared DNA segment to be reported in the child but not in the parent. It is also occasionally possible that the distant relative is related to the child both through the mother and through the father, though in my experience this circumstance seems to be rare.

Limitations of 23andMe

While the Relative Finder section of 23andMe shows a long list of DNA matches for each profile, the script used in this post cannot pair matches that show up as anonymous. This will most likely exclude 60-70% of DNA matches, depending on the profile anonymity settings chosen by your distant cousins.