Using Bayes’ Theorem, you can find relationship probabilities based on cM values and update the results when new data is available.
This is the third in a series of blogs about finding my great-great grandfather, the dad of my great-grandmother, Agnes Florence Thornton Saxton. These blogs contain old images, maps, formulas, and charts. Although I’ve made an effort to optimize them for mobile devices, they are much easier to see in desktop mode.
There are tools that help you calculate the probability of a genetic relationship, but Bayes’ Theorem lets you find the answer for yourself. With DNA Painter’s WATO (What Are The Odds) tree, you can upload a GED file to build a tree, enter known cM values, and run different hypotheses to get relationship probabilities. Once you navigate the interface, it automatically generates the hypotheses and runs the statistics in the background. When it’s done, it provides a graphical explanation that lets you see the probabilities right on the various relationship lines. It’s a great tool.
But, if you’re like me, you want to see how they know that. Or, perhaps you are using a different source to obtain your cM ranges or known probability estimates and want to see how that would change the outcome. Using Bayesian Statistics, you can update probabilities as more data becomes available.
This is the heart of statistics, and there are big formulas in this blog post. Statistics gets a bad rap, but I’m going to walk through each step to make this as simple as possible. Much of the work is setting up the equations, defining variables, and gathering the right data. The actual math is basic arithmetic.
The Problem
I am going to determine the likelihood that I am a descendant of John G. McCaskey as opposed to his brother, William McCaskey. I will do this by calculating the probability that his known descendants match one of two possible relationships to me. In this example, I have used the same four matches that I used to establish my relationship to John G. McCaskey in DNA Painter’s WATO tree. I also used real cM values from either my first cousin’s DNA test or mine, choosing the higher of the two.
If we are all descended from John G., then the other descendants of John G. are half cousins to me. In that case, Agnes would have been a half-sister to his other kids, so their kids would be half-first cousins to each other, and so forth. On the other hand, if we are descended from his brother, William, then Agnes was a first cousin to John G.’s children. In that case, John G.’s descendants would be full cousins.
The results of my calculations show that I’m more likely to be descended from John G. (66.2%) than William (33.8%). The overall result is the same as the WATO tree’s, but my probabilities are different. In the WATO tree, I ran five hypotheses, which included their father, John W., and an unknown male child. When John W. and another child are put into the mix, it becomes even less likely that I’m descended from William (<1%), most likely that I’m descended from John G. (about 84%), and possible that I’m descended from John W. (16%). Because that last scenario is unlikely, I have chosen to demonstrate the two most likely scenarios here.
To view the equations on a mobile device, turn your device from portrait to landscape mode.
Step 1: Determining Genetic Relationship
First, I will calculate the probability that each cousin is either a half cousin or a full cousin based on their cM values. That requires setting up variables that I will carry through the entire problem. I’ll call the matches Cousin 1, Cousin 2, and so forth. Each of the matches will get two hypotheses (either the half or the full relationship), so I will denote those as H1 for half cousins and H2 for full cousins.
Next, I need to get the right data. The cM values for each match come from AncestryDNA’s tests. But I also need the range of possible cM values for each relationship, as well as the prior probability that each known cM value would result in that relationship. So, I am using the ranges and prior probabilities from The Shared cM Project 4.0 tool v4. That also means that I’m using the same data that DNA Painter used to calculate my results. The goal is to update these prior probabilities with posterior probabilities that more closely match each cousin’s likely relationship to me.
The Solution
Then, I’ll apply Bayes’ Theorem. The formula for this theorem is:
\begin{equation}
P(H|D) = \frac{P(D|H) \cdot P(H)}{P(D)}
\end{equation}
Where
- $P(H|D)$ is the probability of the hypothesis $H$ given the data $D$.
- $P(D|H)$ is the probability of observing the data $D$ given that the hypothesis $H$ is true.
- $P(H)$ is the initial probability of the hypothesis $H$ before observing the data.
- $P(D)$ is the marginal likelihood of the data, which can be computed using the law of total probability: \[ P(D) = \sum_{j} P(D \mid H_j) \cdot P(H_j) \].
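Expressed in code, the theorem is only a few lines. Here is a minimal Python sketch of my own (not part of DNA Painter or any other tool) that takes the priors and likelihoods for a set of competing hypotheses and returns the posteriors:

```python
def bayes_posteriors(priors, likelihoods):
    """Apply Bayes' Theorem to a set of competing hypotheses.

    priors      -- P(H_j): initial probability of each hypothesis
    likelihoods -- P(D | H_j): probability of the data under each hypothesis
    Returns P(H_j | D) for each hypothesis; the results sum to 1.
    """
    # Marginal likelihood P(D), via the law of total probability
    p_d = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / p_d for p, l in zip(priors, likelihoods)]

# Two hypotheses with equal priors: the one that explains the data
# four times better ends up four times as probable
print(bayes_posteriors([0.5, 0.5], [0.8, 0.2]))  # ≈ [0.8, 0.2]
```

With equal priors, the posteriors simply follow the likelihood ratio, which is exactly the situation for Cousin 1 below, whose two priors are both 0.52.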
The Results
Finally, for each cousin, I’ll calculate the likelihood that the amount of DNA we share would result in the given relationships. Using that information, I will calculate the probabilities of my relationship to John and to William. I will then use OpenAI’s ChatGPT to generate the results for all four cousins and format the MathJax and HTML.
Cousin 1
Shared cMs: 168
Possible Relationships: Half First Cousin 2x Removed (Half 1C2R) or Second Cousin 1x Removed (2C1R)
My hypotheses are:
- H1: Half 1C2R (he’s descended from John)
- H2: 2C1R (he’s descended from William)
Using the relationship chart by The Shared cM Project, these are the likelihoods of each relationship:
- The range for Half 1C2R is 16-269 (253 cM span).
- The range for 2C1R is 14-353 (339 cM span).
- For \( H_1 \) (Half 1C2R):
\[
P(D \mid H_1) = \frac{1}{253}
\] - For \( H_2 \) (2C1R):
\[
P(D \mid H_2) = \frac{1}{339}
\]
Prior probabilities from The Shared cM Project tool v4:
- For \( H_1 \):
\[
P(H_1) = 0.52
\] - For \( H_2 \):
\[
P(H_2) = 0.52
\]
Therefore, the marginal likelihood of Data \( P(D) \) for each is:
\[
P(D) = P(D \mid H_1) \cdot P(H_1) + P(D \mid H_2) \cdot P(H_2)
\]
\[
P(D) = \left(\frac{1}{253}\right) \cdot 0.52 + \left(\frac{1}{339}\right) \cdot 0.52
\]
Next, calculate each term:
\[
\frac{1}{253} \cdot 0.52 \approx 0.00205
\]
\[
\frac{1}{339} \cdot 0.52 \approx 0.00153
\]
Then, add the terms to get the sum:
\[
P(D) \approx 0.00205 + 0.00153 = 0.00358
\]
Finally, solve for posterior probabilities:
- For \( H_1 \):
\[
P(H_1 \mid D) = \frac{P(D \mid H_1) \cdot P(H_1)}{P(D)}
\]
\[
P(H_1 \mid D) = \frac{\left(\frac{1}{253}\right) \cdot 0.52}{0.00358}
\]
\[
P(H_1 \mid D) \approx \frac{0.00205}{0.00358} \approx 0.573
\] - For \( H_2 \):
\[
P(H_2 \mid D) = \frac{P(D \mid H_2) \cdot P(H_2)}{P(D)}
\]
\[
P(H_2 \mid D) = \frac{\left(\frac{1}{339}\right) \cdot 0.52}{0.00358}
\]
\[
P(H_2 \mid D) \approx \frac{0.00153}{0.00358} \approx 0.427
\]
The results are:
- \( P(H_1 \mid D) \approx 0.573 \)
- \( P(H_2 \mid D) \approx 0.427 \)
Given the data, hypothesis \( H_1 \) (Half 1C2R) has a higher posterior probability than hypothesis \( H_2 \) (2C1R).
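The same arithmetic can be checked with a short script. This is just a sanity check of my own, assuming (as the calculation above does) that a cM value is equally likely anywhere within a relationship’s range, so that \( P(D \mid H) = 1/\text{span} \); carrying full precision instead of rounding at each step may shift the last decimal place.

```python
# Cousin 1: 168 shared cM, equal priors of 0.52 for both hypotheses
priors = [0.52, 0.52]             # P(H1), P(H2) from The Shared cM Project tool
likelihoods = [1 / 253, 1 / 339]  # uniform over each relationship's cM span

p_d = sum(p * l for p, l in zip(priors, likelihoods))  # marginal P(D)
p_h1, p_h2 = [p * l / p_d for p, l in zip(priors, likelihoods)]

print(f"P(H1|D) = {p_h1:.3f}, P(H2|D) = {p_h2:.3f}")  # ≈ 0.573 and 0.427
```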
Cousin 2
Shared cMs: 118
Possible Relationships: Half Second Cousin 1x Removed (Half 2C1R) or Third Cousin 1x Removed (3C1R)
Hypotheses:
H1: Half 2C1R
H2: 3C1R
Likelihoods of Data:
- The range for Half 2C1R is 0-190 (190 cM span).
- The range for 3C1R is 0-192 (192 cM span).
- For \( H_1 \) (Half 2C1R):
\[
P(D \mid H_1) = \frac{1}{190}
\] - For \( H_2 \) (3C1R):
\[
P(D \mid H_2) = \frac{1}{192}
\]
Prior probabilities:
- For \( H_1 \):
\[
P(H_1) = 0.26
\] - For \( H_2 \):
\[
P(H_2) = 0.14
\]
Marginal likelihood of Data \( P(D) \):
\[
P(D) = P(D \mid H_1) \cdot P(H_1) + P(D \mid H_2) \cdot P(H_2)
\]
\[
P(D) = \left(\frac{1}{190}\right) \cdot 0.26 + \left(\frac{1}{192}\right) \cdot 0.14
\]
Calculate each term:
\[
\frac{1}{190} \cdot 0.26 \approx 0.00137
\]
\[
\frac{1}{192} \cdot 0.14 \approx 0.00073
\]
Then:
\[
P(D) \approx 0.00137 + 0.00073 = 0.00210
\]
Posterior probabilities:
- For \( H_1 \):
\[
P(H_1 \mid D) = \frac{P(D \mid H_1) \cdot P(H_1)}{P(D)}
\]
\[
P(H_1 \mid D) = \frac{\left(\frac{1}{190}\right) \cdot 0.26}{0.00210}
\]
\[
P(H_1 \mid D) \approx \frac{0.00137}{0.00210} \approx 0.652
\] - For \( H_2 \):
\[
P(H_2 \mid D) = \frac{P(D \mid H_2) \cdot P(H_2)}{P(D)}
\]
\[
P(H_2 \mid D) = \frac{\left(\frac{1}{192}\right) \cdot 0.14}{0.00210}
\]
\[
P(H_2 \mid D) \approx \frac{0.00073}{0.00210} \approx 0.348
\]
Results:
- \( P(H_1 \mid D) \approx 0.652 \)
- \( P(H_2 \mid D) \approx 0.348 \)
Given the data, hypothesis \( H_1 \) (Half 2C1R) has a higher posterior probability than hypothesis \( H_2 \) (3C1R).
Cousin 3
Shared cMs: 59
Possible Relationships: Half 2C1R or 3C1R
Hypotheses:
H1: Half 2C1R
H2: 3C1R
Likelihoods of Data:
- The range for Half 2C1R is 0-190 (190 cM span).
- The range for 3C1R is 0-192 (192 cM span).
- For \( H_1 \) (Half 2C1R):
\[
P(D \mid H_1) = \frac{1}{190}
\] - For \( H_2 \) (3C1R):
\[
P(D \mid H_2) = \frac{1}{192}
\]
Prior probabilities:
- For \( H_1 \):
\[
P(H_1) = 0.22
\] - For \( H_2 \):
\[
P(H_2) = 0.29
\]
Marginal likelihood of Data \( P(D) \):
\[
P(D) = P(D \mid H_1) \cdot P(H_1) + P(D \mid H_2) \cdot P(H_2)
\]
\[
P(D) = \left(\frac{1}{190}\right) \cdot 0.22 + \left(\frac{1}{192}\right) \cdot 0.29
\]
Calculate each term:
\[
\frac{1}{190} \cdot 0.22 \approx 0.00116
\]
\[
\frac{1}{192} \cdot 0.29 \approx 0.00151
\]
Then:
\[
P(D) \approx 0.00116 + 0.00151 = 0.00267
\]
Posterior probabilities:
- For \( H_1 \):
\[
P(H_1 \mid D) = \frac{P(D \mid H_1) \cdot P(H_1)}{P(D)}
\]
\[
P(H_1 \mid D) = \frac{\left(\frac{1}{190}\right) \cdot 0.22}{0.00267}
\]
\[
P(H_1 \mid D) \approx \frac{0.00116}{0.00267} \approx 0.434
\] - For \( H_2 \):
\[
P(H_2 \mid D) = \frac{P(D \mid H_2) \cdot P(H_2)}{P(D)}
\]
\[
P(H_2 \mid D) = \frac{\left(\frac{1}{192}\right) \cdot 0.29}{0.00267}
\]
\[
P(H_2 \mid D) \approx \frac{0.00151}{0.00267} \approx 0.566
\]
Results:
- \( P(H_1 \mid D) \approx 0.434 \)
- \( P(H_2 \mid D) \approx 0.566 \)
Given the data, hypothesis \( H_2 \) (3C1R) has a higher posterior probability than hypothesis \( H_1 \) (Half 2C1R).
Cousin 4
Shared cMs: 35
Possible Relationships: Half 3rd Cousin or 4th Cousin
Hypotheses:
- H1: Half 3rd Cousin
- H2: 4th Cousin
Likelihoods of Data:
- The range for Half 3rd Cousin is 0-168 (168 cM span).
- The range for 4th Cousin is 0-139 (139 cM span).
- For \( H_1 \) (Half 3rd Cousin):
\[
P(D \mid H_1) = \frac{1}{168}
\] - For \( H_2 \) (4th Cousin):
\[
P(D \mid H_2) = \frac{1}{139}
\]
Prior probabilities:
- For \( H_1 \):
\[
P(H_1) = 0.9
\] - For \( H_2 \):
\[
P(H_2) = 0.18
\]
Marginal likelihood of Data \( P(D) \):
\[
P(D) = P(D \mid H_1) \cdot P(H_1) + P(D \mid H_2) \cdot P(H_2)
\]
\[
P(D) = \left(\frac{1}{168}\right) \cdot 0.9 + \left(\frac{1}{139}\right) \cdot 0.18
\]
Calculate each term:
\[
\frac{1}{168} \cdot 0.9 \approx 0.00536
\]
\[
\frac{1}{139} \cdot 0.18 \approx 0.00130
\]
Then:
\[
P(D) \approx 0.00536 + 0.00130 = 0.00666
\]
Posterior probabilities:
- For \( H_1 \):
\[
P(H_1 \mid D) = \frac{P(D \mid H_1) \cdot P(H_1)}{P(D)}
\]
\[
P(H_1 \mid D) = \frac{\left(\frac{1}{168}\right) \cdot 0.9}{0.00666}
\]
\[
P(H_1 \mid D) \approx \frac{0.00536}{0.00666} \approx 0.805
\] - For \( H_2 \):
\[
P(H_2 \mid D) = \frac{P(D \mid H_2) \cdot P(H_2)}{P(D)}
\]
\[
P(H_2 \mid D) = \frac{\left(\frac{1}{139}\right) \cdot 0.18}{0.00666}
\]
\[
P(H_2 \mid D) \approx \frac{0.00130}{0.00666} \approx 0.195
\]
Results:
- \( P(H_1 \mid D) \approx 0.805 \)
- \( P(H_2 \mid D) \approx 0.195 \)
Given the data, hypothesis \( H_1 \) (Half 3rd Cousin) has a higher posterior probability than hypothesis \( H_2 \) (4th Cousin).
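All four per-cousin calculations above follow the same pattern, so they can be reproduced in one loop. This is a sketch of my own using the ranges and priors quoted above; full-precision results may differ slightly from the hand-rounded values.

```python
# Per cousin: (cM span under H1, span under H2, prior for H1, prior for H2)
cousins = {
    "Cousin 1": (253, 339, 0.52, 0.52),  # Half 1C2R vs 2C1R
    "Cousin 2": (190, 192, 0.26, 0.14),  # Half 2C1R vs 3C1R
    "Cousin 3": (190, 192, 0.22, 0.29),  # Half 2C1R vs 3C1R
    "Cousin 4": (168, 139, 0.90, 0.18),  # Half 3rd Cousin vs 4th Cousin
}

posteriors = {}
for name, (span1, span2, prior1, prior2) in cousins.items():
    num1 = prior1 * (1 / span1)  # P(D|H1) * P(H1)
    num2 = prior2 * (1 / span2)  # P(D|H2) * P(H2)
    p_d = num1 + num2            # marginal likelihood P(D)
    posteriors[name] = (num1 / p_d, num2 / p_d)
    print(f"{name}: P(H1|D) = {num1 / p_d:.3f}, P(H2|D) = {num2 / p_d:.3f}")
```

As in the hand calculations, H1 wins for Cousins 1, 2, and 4, while H2 wins for Cousin 3.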
Step 2: Determining Descent
Now, I need to determine if I’m more likely descended from either John or William based on those results. To do this, I need to use the posterior probabilities of each hypothesis (H1 and H2) for each cousin and then combine them to infer the overall likelihood of descent from John versus William.
Calculating Combined Probabilities:
To find the overall probability, I can use the product of the individual posterior probabilities, assuming that each cousin’s match is independent of the others.
For John (matching H1):
\[
P(\text{John} \mid \text{Data}) \propto P(H_1 \mid \text{Data}_1) \times P(H_1 \mid \text{Data}_2) \times P(H_1 \mid \text{Data}_3) \times P(H_1 \mid \text{Data}_4)
\]
This means the probability of being descended from John is proportional to the product of the individual posterior probabilities for each cousin under the theory that they match H1 relationships.
For William (matching H2):
\[
P(\text{William} \mid \text{Data}) \propto P(H_2 \mid \text{Data}_1) \times P(H_2 \mid \text{Data}_2) \times P(H_2 \mid \text{Data}_3) \times P(H_2 \mid \text{Data}_4)
\]
This means the probability of being descended from William is proportional to the product of the individual posterior probabilities for each cousin under the theory that they match H2 relationships.
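This product-and-normalize step can be sketched in a few lines of Python, using the per-cousin P(H1 | D) values I carry into the calculations below. Note that keeping full precision, rather than the rounded intermediate products, shifts the final figures by a few tenths of a percent.

```python
import math

# Posterior P(H1|D) for each of the four cousins; P(H2|D) is the complement
p_h1 = [0.573, 0.488, 0.399, 0.695]

# Assuming the cousins' matches are independent, multiply across cousins
john = math.prod(p_h1)                     # all four match H1 relationships
william = math.prod(1 - p for p in p_h1)   # all four match H2 relationships

# Normalize so the two scenarios sum to 1
total = john + william
print(f"John: {john / total:.1%}, William: {william / total:.1%}")
```

At full precision this prints about 65.9% and 34.1%, slightly different from the 66.2% and 33.8% that come from the rounded sums below; the conclusion is unchanged.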
Calculations:
Now, I’ll run those calculations.
For John:
\[
P(\text{John} \mid \text{Data}) \propto 0.573 \times 0.488 \times 0.399 \times 0.695 \approx 0.0785
\]
For William:
\[
P(\text{William} \mid \text{Data}) \propto 0.427 \times 0.512 \times 0.601 \times 0.305 \approx 0.0400
\]
Step 3: Normalizing the Results
Finally, I have to normalize the results so that the two probabilities sum to 1. First, add the unnormalized values:
\[
0.0785 + 0.0400 = 0.1185
\]
Then, divide each probability by the total:
\[
P(\text{John} \mid \text{Data}) = \frac{0.0785}{0.1185} \approx 0.662
\]
\[
P(\text{William} \mid \text{Data}) = \frac{0.0400}{0.1185} \approx 0.338
\]
Conclusion
Based on the given data and posterior probabilities:
- The probability that I am descended from John is approximately 66.2%.
- The probability that I am descended from William is approximately 33.8%.
And that is how I can demonstrate that I’m statistically more likely to be John G.’s great-great grandson.
View Matt Saxton’s family tree here on Ancestry.
Mistakes happen. If you see errors in my logic or believe I am wrong, please contact me here.