The Maths of Handedness (Part 1)
It was a few days ago. I had just arrived in London and had to take the tube down to my temporary accommodations. Having a few hours to spare, I decided to catch up on my podcasts, starting with a rebroadcast of the Radiolab episode What's left when you're right (you can find the full story here if you're interested). To make a long story short (and slightly spoil the podcast), they end up talking about the differences between left and right-handed individuals. In particular, one of the interviewees and the narrator make a statement that, for some reason, stayed with me:
"If you have two right-handed parents, their chance of having a left-hander is 9.5%. If you have one righty one lefty parent, your odds do go up, almost 20% chance. Now, two southpaw parents, you have a 26% chance of delivering into the world a southpaw. And if you add up those chances and look out across the entire human species, we are about 90% right-handed, 10% left-handed." - Radiolab, giving me a reason to get distracted
My brain, in its infinite inability to concentrate on one thing, decided to go off on a tangent and ask itself, is that true? What does the maths of inheritance look like? More specifically, if those percentages are true, would you see a left-right handed split similar to the one we observe in reality? Well, let's find out.
The Setup
One of the most important parts of any maths problem is the statement. As any good mathematician would do (which, to clarify, I am not), let's simplify the problem down to make the maths comprehensible. Imagine we have a very large population of people. The exact number doesn't matter, but let's say it's orders of magnitude larger than the number of handedness options (in other words, two; we'll get to why this matters later). Let's say that, with every generation, random pairs of people get together, each pair produces two children, and the parents (not the children) proceed to immediately die. Truly a sight to behold. Then the cycle repeats. After a large enough number of generations, what proportions of left to right-handed individuals would this population stabilize to?
As you can see, this whole statement is bizarrely unrealistic. It gets quite close to being plausible, but misses a lot of things. Not any two people can or will have offspring: they have to be of opposite sexes, not everyone wants/does/can have children, and couples are biased towards being two people who live nearby and come from the same culture. Not every couple that has children has exactly two. Some people have children with more than one partner. And, most importantly, people don't die right after having children, so we don't get this nice, lockstep world where all the children of a generation are born at the same time. There are also subtler assumptions, such as the idea that there's a proportion of handedness that the population tends to stabilize to after enough generations. We haven't shown that yet. That being said, all these simplifications can still give us surprisingly good estimates, as we'll see once we dig into the math. So, let's jump in!
The Maths
Let's begin by defining all the different terms. Here we're moving into the world of statistics and (mostly) probability, so correct notation is key (or so my teachers kept on telling me).
Let's say that, for a given generation \(n\), the probability of a random person being left-handed is \(P(G_n(L))\). Let's say that the probability of a given child being left-handed given the handedness \(x\) and \(y\) of their parents is \(P(G_n(L) | A_1(x)\cap A_2(y))\), where \(A_1(x)\) and \(A_2(y)\) are the events that the first and second parent of said child have handedness \(x\) and \(y\), respectively.
Now, we can start talking about the probabilities we were given at the start. They can be expressed mathematically as:
\[
\begin{aligned}
P(G_n(L) | A_1(L) \cap A_2(L)) &= 0.26 \\
P(G_n(L) | A_1(x) \cap A_2(y)) &= 0.2 \quad \forall x, y \in \{L, R\},\ x\neq y \\
P(G_n(L) | A_1(R) \cap A_2(R)) &= 0.095
\end{aligned}
\]
Now that we have this, we can define \(P(G_n(L))\) more generally. A child's handedness depends solely on their parents'. We can use the law of total probability to do this. There are only 4 possible combinations of parent handedness, and they're mutually exclusive (disjoint, in set terms): both parents left-handed, both right-handed, or one of each (in either order). We can figure out the probability of a child being left-handed by adding up, across the four scenarios, the probability of being left-handed in that scenario (which we can express with conditional probability) times the probability of that scenario occurring. If you're still curious about what this means and you didn't understand what the hell I said, you can read this article which explains the law of total probability. For now, this means we can write the probability as such: \[ P(G_n(L)) = \sum_{x,y\in\{L,R\}} P(G_n(L) | A_1(x)\cap A_2(y))P(A_1(x)\cap A_2(y)) \] Notice how the first factor in each term of the sum is something we know! We can just plug those numbers in. The second factor, \(P(A_1(x)\cap A_2(y))\), is more interesting. One might be tempted to say that the events \(A_1(x)\) and \(A_2(y)\) are independent, so you can just multiply their probabilities, right? Why would my father's handedness in any way depend on my mother's? However, this forgets an important detail: populations are finite.
Let's look at this issue with a hypothetical example. Say you have a population of two left-handed people and two right-handed people. Remember, in our setup, any two people can pair off. If one parent is left-handed, there's only one left-handed person left among the remaining three. Thus, the chance of the other parent being left-handed drops from 50% to ~33%. Note how we still don't care who's pairing with whom (we're still not considering gender), but the idea gets across. How do we get past this? Do we calculate all the possible combinations? Of course not! We say that the population is really large. If you make it large enough, every parent you pick makes only a tiny dent in the pool of people. In the limit (that is, with an infinitely large population), one parent's handedness is fully independent of the other's. You can see why this would roughly apply to the general human population: you won't eliminate that many left-handed people by picking one. Additionally, unlike in this toy scenario, biological sex matters in reality. The pool of people your mother comes from is separate from your father's, which softens some of these points.
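The toy example above is easy to check directly. Here's a minimal Python sketch (the variable names are mine, not from any library) that computes both probabilities exactly using `fractions`:

```python
from fractions import Fraction

# Tiny pool from the example: two left-handed (L) and two right-handed (R) people.
pool = ["L", "L", "R", "R"]

# Unconditional chance the first parent drawn is left-handed: 2 out of 4.
p_first_left = Fraction(pool.count("L"), len(pool))

# Once one left-hander is taken, only 1 of the remaining 3 is left-handed.
remaining = pool.copy()
remaining.remove("L")
p_second_left_given_first_left = Fraction(remaining.count("L"), len(remaining))

print(p_first_left)                    # 1/2
print(p_second_left_given_first_left)  # 1/3
```

Drawing without replacement is exactly what breaks independence here; as you grow the pool, that 1/3 creeps back towards 1/2.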
For this reason, we can approximate \(P(A_1(x)\cap A_2(y)) \approx P(A_1(x))P(A_2(y))\). Here comes another assumption: we pick parents randomly from the previous generation. Thus, we can just say that \(P(A_i(x)) = P(G_{n-1}(x))\) for each parent \(i\). We can now rewrite our formula as: \[ P(G_n(L)) = \sum_{x,y\in\{L,R\}} P(G_n(L) | A_1(x)\cap A_2(y))P(G_{n-1}(x))P(G_{n-1}(y)) \]
Now we can talk about steady-state solutions. In a sense, this equation describes a dynamic system: we have a state that evolves in steps. Every time you plug in numbers, you get back a number that you can plug in to the next iteration. All you need is an initial state and a formula like the one above. As an example, say we have a population that starts with a 50/50 split of left and right-handed people. That is to say, \(P(G_0(L))=0.5\) and \(P(G_0(R))=0.5\). If we plug these numbers into the formula, we get: \[ P(G_1(L)) = (0.26)(0.5)(0.5) + 2(0.2)(0.5)(0.5) + (0.095)(0.5)(0.5) = 0.18875 \]
You can then repeat again and again and again...
\(P(G_2(L)) = (0.26)(0.18875)^2 + 2(0.2)(0.81125)(0.18875) + (0.095)(0.81125)^2 \approx 0.133\)
\(P(G_3(L)) \approx 0.122\)
\(P(G_4(L)) \approx 0.120\)
[...]
An interesting property of this system is that it seems to stabilize to a single point. If you keep going, the value creeps down, down, down, until, presumably, it reaches a point where it doesn't shift much anymore. These are called steady states because, well, they're steady (who would have guessed). Let's assume this statement is true for now; we can show it more rigorously later. That means that there's a value for which, after a long enough time, \(P(G_n(L)) \approx P(G_{n+1}(L))\). Some of you might recognize this as the same rough idea as limits at infinity: there are sequences that, as \(n\) grows, get nearer and nearer to a single value. This is just another example of that.
Now that we have this new goal in mind, we can finally just say \(P(G_n(L)) = P(G_{n-1}(L))\). We can now rewrite our formula into something we can solve: \[ P(G_n(L)) = \sum_{x,y\in\{L,R\}} P(G_n(L) | A_1(x)\cap A_2(y))P(G_n(x))P(G_n(y)) \] To make things easier, we will use a shorter notation:
\(P(G_n(L)) = G_L\)
\(P(G_n(R)) = G_R\)
\(P(G_n(L)|A_1(L)\cap A_2(L)) = G_{LL}\)
\(P(G_n(L)|A_1(L)\cap A_2(R)) = G_{LR}\)
\(P(G_n(L)|A_1(R)\cap A_2(L)) = G_{RL}\)
\(P(G_n(L)|A_1(R)\cap A_2(R)) = G_{RR}\)
[...]
We can now write our formula out: \[ G_L = G_{LL}G_L^2 + G_{LR}G_LG_R + G_{RL}G_RG_L + G_{RR}G_R^2 \] We can now make two substitutions to work out the math. First off, we know that \(G_R = 1 - G_L\) (since we're working with probabilities). Secondly, we know that \(G_{LR} = G_{RL}\) (our definitions above define them as having the same value). We can now simplify this down:
\(G_L = G_{LL}G_L^2 + 2G_{LR}G_LG_R + G_{RR}G_R^2\)
\(G_L = G_{LL}G_L^2 + 2G_{LR}G_L(1-G_L)+G_{RR}(1-G_L)^2\)
At this point, you might realize this looks an awful lot like a quadratic equation, so we can group terms and try to solve it as such:
\(0 = G_{LL}G_L^2 + 2G_{LR}G_L(1-G_L) + G_{RR}(1-G_L)^2 - G_L\)
\(0 = G_{LL}G_L^2 + 2G_{LR}G_L - 2G_{LR}G_L^2 + G_{RR}(1-2G_L + G_L^2) - G_L\)

\(0 = G_{LL}G_L^2 + 2G_{LR}G_L - 2G_{LR}G_L^2 + G_{RR} - 2G_{RR}G_L + G_{RR}G_L^2 - G_L\)

\(0 = G_L^2(G_{LL}-2G_{LR}+G_{RR}) + G_L(2G_{LR} - 2G_{RR} - 1) + G_{RR}\)
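If you don't trust the algebra (I wouldn't blame you), a quick numeric sanity check in Python can confirm that the grouped quadratic form agrees with the expanded one for arbitrary values of \(G_L\) (the function names here are mine):

```python
# Conditional probabilities from the Radiolab quote.
GLL, GLR, GRR = 0.26, 0.20, 0.095

def expanded(g):
    """The steady-state equation before grouping, moved to one side."""
    return GLL*g*g + 2*GLR*g*(1-g) + GRR*(1-g)**2 - g

def grouped(g):
    """The same equation grouped into a*g^2 + b*g + c."""
    a = GLL - 2*GLR + GRR
    b = 2*GLR - 2*GRR - 1
    c = GRR
    return a*g*g + b*g + c

# The two forms should agree everywhere, not just at the root.
for g in (0.0, 0.119, 0.5, 1.0):
    assert abs(expanded(g) - grouped(g)) < 1e-12
print("grouping checks out")
```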
It's at this point where you might be tempted to say "Oh! I can complete the square here. I know enough to do it!". Don't. Just don't. I tried, and all it will bring you is pain and useless knowledge. Just plug it into the quadratic formula and call it a day. To spare you the suffering, you get this: \[ G_L = \frac{-(2G_{LR}-2G_{RR}-1)\pm \sqrt{(2G_{LR}-2G_{RR}-1)^2-4(G_{LL}-2G_{LR}+G_{RR})(G_{RR})}}{2(G_{LL}-2G_{LR}+G_{RR})} \] Could you simplify it? Probably. Am I gonna bother? No, and neither should you. We have a form where we can plug in numbers, and that's good enough for me (again, I'm an engineer). If we plug numbers in, we get: \[ G_L = \frac{0.79\pm\sqrt{0.6412}}{-0.09} \] As with all quadratic equations, we get two results, which I'll call \(G_{L_1}\) and \(G_{L_2}\). Once you plug in the numbers, you get:
\(G_{L_1} \approx -17.675\)
\(G_{L_2} \approx 0.119\)
It's pretty clear which solution is the one within the realm of probability. So, this system predicts that around 12% of all people would be left-handed, which is pretty damn accurate, considering the actual number is around 10%.
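If you'd rather not push the numbers around by hand, here's a minimal Python sketch that evaluates the quadratic formula with the values from the post:

```python
import math

# Conditional probabilities from the Radiolab quote.
GLL, GLR, GRR = 0.26, 0.20, 0.095

# Coefficients of the quadratic in G_L derived above.
a = GLL - 2*GLR + GRR        # -0.045
b = 2*GLR - 2*GRR - 1        # -0.79
c = GRR                      #  0.095

disc = b*b - 4*a*c           # 0.6412
root1 = (-b + math.sqrt(disc)) / (2*a)
root2 = (-b - math.sqrt(disc)) / (2*a)

print(root1)  # ≈ -17.675 (not a valid probability, discard)
print(root2)  # ≈  0.1194 (the steady state)
```

Reassuringly, the second root matches the value the generation-by-generation iteration was creeping towards.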
Closing Remarks
Well, we've now seen a greatly simplified mathematical model give us a result close to what we observe in reality. By stating a problem carefully and isolating the small set of parts that really matter, you can create a model that predicts real-world behaviour with surprising accuracy.
It's a common trope to joke about how scientists simplify problems down to the point of being useless (just look at xkcd, xkcd again, and spherical cows in a vacuum). Hell, this problem was full of absurd simplifications for a real human population. That being said, knowing how to simplify a problem properly is a great tool for solving it quickly and getting to a pretty good estimate. In the next entry, I'll talk about how you might solve this problem in a more complex model using a wonderful tool: computers.