After a string of Insights puzzles related to physics, on the relationship between time and entropy, half-lives, overhang and quantum weirdness, we turn this month to the mysteries of biological evolution. Carrie Arnold’s article, “Evolution Runs Faster on Short Timescales,” explores new research showing that genetic changes that are quite brisk when measured over a few generations seem to slow down considerably when measured over millions of years. One of the researchers who have studied this in genomes is Simon Ho, an evolutionary biologist at the University of Sydney. To quote Arnold:

When [Ho] calculated how quickly DNA mutations accumulated in birds and primates over just a few thousand years, Ho found the genomes chock-full of small mutations. This indicated a briskly ticking evolutionary clock. But when he zoomed out and compared DNA sequences separated by millions of years, he found something very different. The evolutionary clock had slowed to a crawl.

The article goes on to describe how, from a mathematical perspective, evolutionary rates decrease exponentially as the timescale increases.

Can we replicate this in a toy version of a single gene? Let’s find out. But first, let’s go over the basics for those who need to brush up on them: Here’s DNA 101 for puzzle enthusiasts.

A gene is a piece of DNA, which is essentially a linear chain of chemical bases that are abbreviated using the letters A, C, G and T. Each of these four letters (bases) appears in random sequence along a given gene, in about equal amounts. Thus the sequence CATGGTACCGAT represents a piece of DNA that is 12 units long. The way DNA works is that each successive three-letter piece of DNA, called a triplet, codes for one of 20 possible units, called amino acids, that make up a protein. Proteins are the body’s workhorses, each one performing different functions thanks to its unique structure and its unique linear sequence of amino acids. Thus in our DNA sequence above, there are four triplets, CAT, GGT, ACC and GAT, each of which codes for a specific amino acid. This piece of DNA, acting through the cell machinery, will form a piece of a specific protein fragment that is four units long.

Now, DNA is generally copied with high fidelity from cell to cell across generations. But, on rare occasions, you can get a “point mutation,” in which one of the letters of the gene sequence is replaced by another random one, causing the gene to produce a different protein, which may be more or less efficient at doing what it was supposed to. This is basically how evolutionary change happens. We can define the speed at which DNA mutates over time as the evolutionary rate: We can measure it over a given period by counting the number of letters that have changed between the original DNA sequence and the current sequence, divided by the number of years that have passed.

OK, lesson over. That’s all the biology we need for our puzzle.

Question 1:

Imagine a gene that is 108 letters with A, T, G, C in random sequence. Assume that every year, there is a random change — one of the letters somewhere on this gene mutates and is replaced by one of the other three. After each year, you compare the current copy of the gene with the original and tally how many letters have changed. After a certain time “the evolutionary clock will have slowed to a crawl” — that is, the number of changed letters will have stopped rising. The evolutionary rate from here on is zero. How many letters of the original gene will have changed at that point? How many years will it take to get to this point? Is the curve exponential?

Question 2:

The above scenario is not very realistic. Every letter in a real-life gene sequence has a different chance of having a mutation that “sticks.” The letters at some locations in the DNA sequence are preserved, because changes in them are catastrophic; others, at inconsequential locations, can change readily. One general rule is that the third letter of every triplet can change easily. This is because the third position in a triplet is often redundant: The first two positions fix the amino acid the triplet codes for.

Assume that the third letter of each triplet is three times as likely to get mutated as are the first and second. Now try to answer the same questions as in Question 1.

Bonus Question:

I gave this hypothetical piece of DNA 108 letters. What was my reason for choosing that number? Is it because 108 has mystical significance, as a Google search will indicate?

I hope these simple mathematical models give you a feel for how evolutionary rates work. This is, of course, nothing like the complexity of what actually happens in a single real-world gene. First, most genes have many more than 108 letters. Second, even scenario 2 above was an oversimplification: The letters at every location have a different likelihood of being preserved, depending on how important the amino acid they code for is to the function of the final protein. Welcome to biology! Yes, mathematical models can work and can give us some insight, but we must always remember that they are gross oversimplifications. In biology, analytical mathematics can take us only so far, and any attempt to capture the nuance of the real world requires highly sophisticated computer models.

How different this is from the physics we discussed in the last Insights puzzle. In it, some commenters, legitimately voicing one of the influential schools of thought, insisted that mathematics is all there is at the quantum level: Reality either does not exist or cannot be known apart from the models! Whether that is possible is a point worth pondering by all those who make and use mathematical models. For me, it is reassuring to return to the messiness of biology once in a while. Happy puzzling!

*Editor’s note: The reader who submits the most interesting, creative or insightful solution (as judged by the columnist) in the comments section will receive a *Quanta Magazine* T-shirt. ( Update: The solution is now available here.) And if you’d like to suggest a favorite puzzle for a future Insights column, submit it as a comment below, clearly marked “NEW PUZZLE SUGGESTION” (it will not appear online, so solutions to the puzzle above should be submitted separately).*

*Note that we may hold comments for the first day or two to allow for independent contributions by readers.*

I'm reminded of the coupon collector problem from my probability class. To start, let's make the simplifying assumption that once a base changes, we don't care if it changes back; in this case, we're just waiting for each of the 108 bases to change, which takes (asymptotic) n*log(n) time, so something over 400 years. Each year, you are more likely to run into "no change", as the target for "change" gets smaller.

Of course, at some point, this simplification breaks down and the gene enters a dynamic equilibrium where some base is as likely to change back to its initial value as some other is to change away. In fact, this will happen when only 1/4 of the bases are still the same, since a differing base has a 1/3 chance of changing back if it's the one selected but is 3 times as likely to be selected as a base matching your records.

With fast and slow changing bases, you're going to run into the dynamic equilibrium with the fast ones way before the slow ones, so you will see a halt to fast evolution before you see a halt to slow evolution.

I'm still working on the first two, but I presume you choice 108 as the number of bases because 108/3 = 36, and there are 36 amino acids in one protein peptide.

Question 1:

For a sequence of N letters, I find that the average number of letters that have changed after T years, relative to the original sequence, to be

< #changes > = (3N/4) { 1 – [ 1 – 4/(3N) ]^T },

which converges to 3N/4 changes as T increases, i.e., to 72 changes for N=108. This also follows just by noting that after sufficient time each letter will have changed to one of the 4 possible values randomly, so each letter will have a 3/4 probability of being different.

To reach the asymptotic number of changes by time T*, the average number of changes should be within 1/2 a letter from 72 on average after this time. Hence from the above formula

(3N/4) [ 1 – 4/(3N) ]^T* < 1/2,

which can be solved to give

T* > [ log 3N/2 ] / [ log 3N/(3N-4) ]

= [ log 156 ] / [ log 81/80 ]

= 406.5 years.

So 407 years suffices. I suppose one should really also look at the variance about the average to see how long this variance takes to decrease below 1.

For large N and taking natural logarithms, then via ln (1-x) ~ -x, T* is approximated by

T* ~ (3N/4) ln (3N/2)

~ (3/4) N ln N

for large N. So the asymptotic time grows a little more than linearly with the number of letters.

Here is my try with the help of a Markov chain.

The state space S is {0;1;…;N}, where N is the length of the sequence.

The random variable X on S gives the number of letters in the sequence

that are different from the original sequence.

Let P(X = j) be the probability that X takes the value j after n steps can

be computed from the transition matrix

p_{ij} = i/3N delta(i,j-1) + 2i/3N delta(i,j) + (N-i)/N delta(i,j+1) ,

which is the probability that the value of X changes from i to j and where

delta ist the Kronecker-Delta and i,j in S.

In one step j can only take the values i-1, i and i+1.

The stochastic vector that represents the original sequence is a_i = delta(i,0).

After n steps the probability distribution is

P(X=j) = sum_i a_i (p^n)_{ij}

where p^n is the n-th power of the transition matrix p.

The expected Value of X is

E(X) = sum_j P(X=j) j = sum_j sum_i a_i (p^n)_ij j .

The result is E(X) = 3N/4 ( 1 – [(3N-4)/3N]^n) .

It is in good agreement with the simulation of the random process that I run in mathematica.

So my answer to question 1 is:

E(X) depends exponentially on n with time constant -1/ln(1-4/3N)

and goes to 3N/4 as n goes to infinity.

The latter makes sense since for X = 3N/4 the probability reaches a kind of equilibrium.

I am not sure wheather the hitting time for X = 3N/4 should have been computed.

Thus far I have not figured out how to do that.

Moreover, concerning the computation of the n-th power of the triangular matrix p:

I could only do it with the help of mathematica but not analytically.

I have found some recursion formula for its determinant but that wasn't really helpfull.

I would be interested if someone followed a similar approach and would desribe how to do these things.

After reading the first two comments I have decided to have a look at the variance and the number of steps

it takes until E(X) differs from 3N/4 only by 1sigma:

The variance is Var(X) = 3N/16 so standard deviation is sigma = sqrt(3N)/4.

Again I figured this out with the help of mathematica.

Suggestions for analytical computation are appreciated.

The number n of steps that is needed such that E(X) = 3N/4 – sigma is

n = ln(4sigma/3N)/ln(1-4/3N),

which for large N is

n = 3N/8 ln(3N)

or n = 286 for N = 108.

1) 'Dynamic' equilibrium occurs after there are 81 letters in error and 27 unchanged or corrected letters. This could take place in as few as 81 years or probably much longer since, in 81 years, some of the changed letters will have changed back.

When there are 81 changed letters, the change expectation is the sum of numcorrect*p(xerror) – numchanges*p(correction) = 27*1 – 81*(1/3) = 0.

The probability of change to a correct letter = 1, and the probability of a correction to a changed letter = 1/3.

(2) I assume the third letter in a triplet (codon) has three acceptable states and one unacceptable state. So the probability of changing from acceptable to an unacceptable state is 1/3rd while the probability of correction from unacceptable to an acceptable state is 1.

There are 36-3rd codons, so when 27 of the 3rd letters/codons are acceptable and 9 are unacceptable, there is dynamic equilibrium since 27*(1/3) – 9*(1) = 0.

In Problem 2), when the 36-3rd letter/codons are in 'dynamic' equilibrium, the other 72 letters (1st and 2nd codons) may not yet be in equilibrium. The 72 other letters are in equilibrium when there are (approximately) 54 letters in error and 18 letters correct.

Since (18/72)*1 – (54/72)*(1/3) = 0.

There is a typo in the last line of my last comment.

It should read "or n = 234 for N = 108".

1) At 81 mutations the evolutionary rate goes to zero. This is because the chance of one of the 81 going back to the original is (81/108)(1/3) = 27/108 which is the same as the one of the same ones (108-81) to mutate to something different (27/108).

To compute the number of years to reach this state I used Markov chains. With the state with 81 mutations being treated as an absorbing node.

Let p(n) = p(n-1)*S; S is the matrix that defines the markov chain and p(i) is the probability vector with the probability of being in each state. Lets index the states so that all k absorbing states are indexed 1…k. Then the S matrix can be broken up as

[ I 0; R Q]

If writing p(n) = S*p(n-1) then we have [I R'; 0 Q']

Note that Q^k_{ij} is the probability that after k steps we are in state j after starting from node i. So if we sum I + Q + Q^2 … = inverse(I-Q) = N we have N_{ij} is the expected number of times we visit node j after starting from node i.

Expected number of nodes visited before being absorbed if starting from a non-absorbing node: N1.

Node 1 is the absorbing node and for other nodes i the number of mutations is 82-i. The probability of staying at the same node is (164-2*i)/324. Transitioning to a state with one more mutation is (26+i)/108 and one less is (82-i)/324.

The expected number of years in getting to 81 transitions is 281.761.

The curve is exponential.

To get at the question of when the process reaches equilibrium, I would like to posit an underlying stochastic process that generates the mutations. Suppose that we have 108 4-sided dice with vertices labeled A, C, T and G. Rolling a die results in a mutation if the die’s value changes, so the probability of a mutation is ¾ for each roll. Each year we randomly choose and roll dice (with replacement) until we obtain a mutation, so the mean number of rolls to obtain a mutation is 4/3. (Note that the same results are obtained by just rolling one die every 9 months. This results in the same mutation rate and a somewhat simpler stochastic process.)

Over the years, the number of dice that have not yet been rolled decreases. The process reaches equilibrium when the last die is rolled for the first time. This is the precise point at which the state of the dice no longer contains any information about the initial state. At this point the expected value of the number of mutations increases to exactly

3/4 * 108 = 81

States with more or less than 81 mutations have a higher probability of moving toward 81 mutations than of moving farther away, so at 81 mutations the process is in dynamic equilibrium.

To find the mean time to reach this equilibrium state, consider the mean number of rolls required to go from k dice rolled to k+1 dice rolled. This is follows a geometric distribution with the probability of rolling a die for the first time of

p = (108-k)/108

and a mean number of rolls of

1/p = 108/(108-k).

Summing the mean number of rolls over k = 0 to 107 results in a mean of 568.5 die rolls to get to equilibrium, which at 4/3 rolls per year is 426.4 years.

For question 2, the probability of the third letter of a triplet being mutated is three times the probability that the first or second letter is mutated. If the probability of a first or second letter mutating is p, then the probability of a third letter mutating is 3p. We would like the sum over all of the dice of the probability of choosing a die to be 1. With 72 first and second letters 36 third letters, this means

72p + 36*3p = 1

or

p = 1/180

Now we can find the number of rolls to reach equilibrium. The probability of rolling a die for the first time is

p_j_k = 1 – j/180 – k/60

where j is the number of first/second-letter dice already rolled and k is the number of third-letter dice already rolled.

Finding the expected time to roll all of the dice is a little more complicated because the probability of rolling a first/second-letter die

p_j+1_k|j_k =(2/5 – j/60)/p_j_k

is different than the probability for a third-letter die

p_j_k+1|j_k = (3/5 – k/180)/p_j_k

The probability that the process reaches each possible state can be calculated with the recursion

q_j_k = q_j-1_k * p_j_k|j-1_k + q_j_k-1 * p_j_k|j_k-1

where q_0_0 = 1

q_j_-1 = q_-1_k = 0

From there it is simple calculate the expected number of rolls in each state

q_j_k / p_j_k

and sum over j = 0 to 72 and k = 0 to 36 (j<72) and k = 0 to 35 (j=72) to get the expected number of rolls to reach equilibrium. This turns out to be 875.0 rolls, which would take 656.2 years at 4/3 rolls per year. This is 1.54 times the time required when all letters mutate at the same rate. This suggests that actual mutation rates are much lower than 1 per generation per 108 base pairs and/or selection pressure is much more effective at suppressing mutations. Otherwise, an organism’s genome would be completed scrambled in a few hundred generations.

Question 1:

We can represent the state of the gene's evolution by a number X which represents the number of letters that differ from the original gene. (This representation works because the question only cares about the number of different letters; the value of the letters doesn't matter.)

The initial state of the system is X=0. Each year a mutation can cause X to change by up to one in either direction. It can increase by one if nature causes a letter that's in the original gene to mutate to something different. It can decrease by one if nature causes a letter that's different from the original gene to happen to flip back to the original. It can also stay the same by flipping a letter that's already different to another different letter.

Let's write N as the length of the gene (=108).

The evolution of the system is based on several probabilities:

– X/N is the probability of nature happening to flip an already-different letter

– in this case, there's a 1/3 chance of flipping back to the original letter (resulting in state X-1), and a 2/3 chance of flipping to another different letter (resulting in state X again).

– (N-X)/N probability to flip a currently-the-same letter

– in this case, we always end up with a letter that's different from the original (resulting in state X+1).

We can write the expression for the evolution of the system as:

{ X-1 with probability X/N * 1/3

next_state = { X with probability X/N * 2/3

{ X+1 with probability (N-X)/N

It's a Markov Chain. The above formula defines the transition matrix.

At the system's steady state, there's an equal likelihood of going to increasing X as for decreasing X. Setting these two expressions equal, we see

X/N * 1/3 = (N-X)/N

X * 1/3 = (N-X)

X = 3*(N-X)

X = 3*N – 3*X

4*X = 3*N

X = 3*N/4

Thus, the system is at a steady state at when X = 3*N/4 = 81 letters differ from the original gene.

You can also derive this by using the transition matrix to solve for the full distribution over the steady state. If you do this, you'll see the mode of the distribution is at X = 81.

To understand how many years it takes to get to this point, I wrote a simple python script to compute the distribution over possible states at any given year by starting from the initial state and repeatedly multiplying by the transition matrix.

One key point is that it's not right to ask "how many years will it take to get to this point" because it may never reach the point. It's a probabilistic matter. It's better to ask, after how many years is there a, say, 95% probability the system has passed through X>=81? (Once the system hits X=81, it's pretty close to the steady state. The system has no memory. While it's possible that in the steady state–the stationary distribution–X bounces around to, say, 100, and that's not reachable within a few steps immediately after entering into X>=81, I think this new formulation of the question is clearer and more precise.)

I modified my python script to compute this cumulative probability as it computes the distribution over states at each step. An output graph is here:

https://docs.google.com/spreadsheets/d/1xca0r2lkU5WnbWi83WBnZe0f9lfmwinOuA-KEk8wYUc/edit#gid=646244014

It looks like it passes X>=81 at 262 years 50% of the time, passes it with 95% probability at 453 years, and passes X>=81 with 99% probability at 583 years. (These long timescales also support my claim that reformulating the question the way I did was reasonable. After all, on the scale of hundreds of years, the time it takes to possibly go from X=81 to X=100+ is comparatively small.)

For reference, I published the python script here:

http://www.mark-pearson.com/quanta-magazine-puzzle/genetic-evolution/1.py.txt

addendum to Question 1:

I realized I forgot to provide an answer to one part of the question. If the scientist looks each year at how many letters differ from the original gene, how many will they see? Here's a chart

https://docs.google.com/spreadsheets/d/1xca0r2lkU5WnbWi83WBnZe0f9lfmwinOuA-KEk8wYUc/edit#gid=1788723866

It rapidly converges to 81.

The curve is exponential, fit perfectly by the formula 81 – 81 * 0.987654320987654^year.

Data from this python script, a slightly modified version of the original one.

http://www.mark-pearson.com/quanta-magazine-puzzle/genetic-evolution/1.average.py.txt

P.S. I also noticed my original python code for this had a minor bug (now fixed), where I left X=108 out of the transition matrix. Because the odds of the gene hitting X=108 during the timespans in question is so tiny (<1e-10), it doesn't change the qualitative conclusions I already mentioned and changes the dates I cited by at most one year.

1. As discussed above, we know the probabilities will level off at 1/4 for every letter in the sequence randomly taking on the initial value. If we start with all of the sequences being "correct", and add a mutation every year we will lose 1/108 correct letters a year. This will continue until period T, where the "correctness" will equal 1/4. Thus we have an exponential function:

(1 – 1/108) ^ T = 1/4

T = log(1/4) / log(1 – 1/108) = 149 yrs.

2. Since one third of the letters are will reach equilibrium in one-third of the time, we need to set up the equation differently. 36 letters will mutate 3x as fast, while 72 will continue as in the first question. We can say:

(1/3 – 1/36) ^ (T/3) + (2/3 – 1/72) ^ T = 1/4

T [ 1/3 * log(1/3 – 1/36) + log(2/3 – 1/72) ] = log(1/4)

T = log(1/4) * [ 1/3 * log(1/3 – 1/36) + log(2/3 – 1/72) ] ^ -1 = 1.69 yrs.

Bonus: No idea, but what Cassidy Block says sounds legit.

Correction on 2.

I had some poor algebra on the second question. Here is the corrected answer, where I found the zeros of the new function:

(1/3 – 1/36) ^ (T/3) + (2/3 – 1/72) ^ T = 1/4

(1/3 – 1/36) ^ (T/3) + (2/3 – 1/72) ^ T – 1/4 = 0

Solve for T, which gives ~ 5 yrs.

OK, I haven't gotten any logical numbers for question 2, and I think I have found out why. Let's take a whole 'nother look at it…

I thought the exponent would be T/3 because I saw that as being more mutations per year, however that actually slows the decay of those portions, which is counter-intuitive. In order to triple it's mutations per year, we do indeed have to multiply T by 3. But this is not the only problem.

I used the values 1/3 and 2/3 because I was considering them parts of the same whole, but this doesn't really make sense given that we are looking at two separate sets. In reality, each set has it's own rate of decay. Knowing this, let's re-write the equation (again)…

(1 – 1/36) ^ (3 * T) + (1 – 1/72) ^ T = 1/4

I then plugged that into a function solver to arrive at a T-value of ~ 99 yrs, which makes much more sense to me than any of the other numbers I have produced. Whether this is correct or not, at least it won't bug me as much!

Question 2:

[sorry for the late submission; I was busy.]

This question can be solved in an analogous manner to my question 1 submission. The complication here is that we need to represent the state of the system by how many letters are different from the original gene among the first two letters in each triplet and how many letters are different from the original gene among the last letter in each triplet. Let's call the state of the system (Y,Z). The initial state is (0,0).

The evolution of the system is based on several probabilities:

– 2/5 is the probability that nature will modify the first two letters in a triplet (i.e., the Y part of the state).

– 3/5 is the probability that nature will modify the last letter in a triplet (i.e., the Z part of the state).

Because there are twice as many Y letters than Z letters, setting this probabilities like this mean that each individual Y letter has one-third as likely to change as an individual Z letter.

We can write the expression for the evolution of the system as

next_state =

{ Y-1, Z 2/5 * Y / (2/3*N) * 1/3 [pick different, make original]

{ Y, Z 2/5 * Y / (2/3*N) * 2/3 [pick different, make a different non-original letter in first part of triplet] +

Y, Z 3/5 * Z / (1/3*N) * 2/3 [pick different, make a different non-original letter in last part of triplet]

{ Y+1, Z 2/5 * (2/3*N – Y) / (2/3*N) [pick original, make different]

{ Y, Z-1 3/5 * Z / (1/3*N) * 1/3 [pick different, make original]

{ Y, Z+1 3/5 * (1/3*N – Z) / (1/3*N) [pick original, make different]

To identify the point when "the number of changed letters will have stopped rising", we have to look for when both Y and Z–which are independent–have both stopped rising on average. This happens to Y when the probability of increasing is the same as the probability of decreasing:

2/5 * Y / (2/3*N) * 1/3 = 2/5 * (2/3*N – Y) / (2/3*N)

and analogously for Z

3/5 * Z / (1/3*N) * 1/3 = 3/5 * (1/3*N – Z) / (1/3*N)

Solving both of these yields Y=N/2 and Z=N/4.

In other words, the total number of differences in the (on average) steady state is N/2 + N/4 = 3*N/4 = 81, just like in question 1.

This isn't surprising. After all, we have the same odds for going to an original letter versus a different letter as in question one. The difference between this question and question one is just that some letters are less likely to switch at all. This means it'll take us longer to reach the time at which the gene doesn't seem to show additional evolution.

With a new python script,

http://www.mark-pearson.com/quanta-magazine-puzzle/genetic-evolution/2.py.txt

I generated an analogous graph as in question 1 to see how likely the gene will have hit/crossed the 81+ difference line by each year.

https://docs.google.com/spreadsheets/d/1mawx5j6b0GGHwEBCAdHS4s4AgUDbGXEBw6FWnTVpAs8/edit#gid=646244014

It looks like it passes 81+ difference line at 360 years 50% of the time, passes it with 95% probability at 608 years, and passes X>=81 with 99% probability at 768 years. This is longer than in question 1 by about 100-200 years at this probability measurements.

Frankly, I'm a bit surprised it's not longer. In the grand scheme of evolution, another hundred or two hundred generations isn't a lot. I would've thought a priori that being three times more resilient to mutations in some letters would've have a stronger effect.

As for how many letters we can expect to be different from the original gene each year, here's a graph:

https://docs.google.com/spreadsheets/d/1mawx5j6b0GGHwEBCAdHS4s4AgUDbGXEBw6FWnTVpAs8/edit#gid=1788723866

generated by this python script

http://www.mark-pearson.com/quanta-magazine-puzzle/genetic-evolution/2.av.py.txt

If you flip back and forth between that graph from this question and the analogous graph my answer to question 1 (put them in separate tabs and use alt-tab back and forth), it's clear (and unsurprising) that this question's graph converges slower.

Unlike in question one, the curve is not perfectly fit by an exponential, likely because the transition probabilities differ along various dimensions.

Question 3:

In question 1, I observed that an exponential curve (81 – 81 * 0.987654320987654^year) exactly fit the curve representing the expected number of differences by year.

This parameter here–0.987654320987654–is actually 80/81. The author could have just as easily chosen 36 or 12 or 8 to be the gene length for this question. The key here is that with a setup like this (four possible letters, and thus three possible letters to change to), you need a number divisible by 4 so that 3/4rds of gene length is still an integer. Using, say, 8, would yield an exponential curve focused around 6 (=3/4 * N): 6 – 6*(5/6)^year.

But 8 is unsatisfactory because it doesn't lend itself to question 2, where we divide everything up into triplets. (Yay for biological inspiration!)

12 would've satisfied all these constraints, as would've 36. But both those numbers sound unrealistically small for the length of gene. Maybe that's why the author chose 108. 🙂

Question (2) So for this one I created a 2D markov chain with states going from (0,0) mutations to (72, 36) for a total of 73*37 states. The equilibrium point works out as before to be (54, 27) mutations. With the markov chain I could confirm that the eigenvector corresponding to the largest eigenvalue (1) had a maximum value for this state. And as before finding the expected time to hit this state for the first time gives us 579.22 years (little more than twice the time for Question 1)

I will leave the statistics and mathematics to those of you who have that talent. But about the number 108: I looked up the number 108 in terms of genes and found that there are 108 genes that are linked to schizophrenia. I'm guessing that that's not what you had in mind when you asked the question so I looked up the number 108 in general terms and came up with this by Sadhguru: "The diameter of the Sun multiplied by 108 equals the distance between Sun and Earth, and the diameter of the Moon multiplied by 108 equals the distance between Earth and Moon. The diameter of the Sun is 108 times the diameter of the Earth."

This is probably not what you were thinking about either. But it's an interesting phenomena to find the significance of a number if you're looking for it.

The statistics and mathematics presented by most responders is one approach, but I doubt that it has much relationship to how such changes actually occur. E.g., as noted in the text, some substitutions are lethal to the organism; if one happens to be the first substitution that will be the end of the chain. Other substitutions distort the DNA. Cytosine and guanine generally pair, as do adenine and thymine. However when a change pairs cytosine and adenine the DNA 'rails' no longer are parallel, and this affects the smooth functioning of the DNA and thus the organism. Assuming it isn't fatal, it increases the probability of a subsequent change back to the preferred pair. While all 108 base pairs may change over evolutionary time, the changes will only 'stick' if they are either advantageous or of no importance to the organism. The 'observation' of no changes over long periods of time may simply reflect that the environment is not changing.

Genetic mutations are a exploration in diversity pitted against what environmental pressures are applied at that specific time period i.e. there are multiple clocks operating in different time scales over the short, mid and long term (possibly in unison – very unlikely [a sign of only one successful way through the combinatorial time field], but more likely independent) like cogs in combination safe that are forever changing and the short term clock role is to maximise variation, but to get through to the next aligned cog only a small number or specific mutation/s attains this passage to the next time period. This will in turn be put under evolutionary pressure while trying to maximise it chances and will have to navigate the pressure and alignment features of that time period until it can slip through into the next time phase which will reduce the amount genetic variation by trial, pressure and luck and so forth.

To solve this mathematically I would have no idea, but I would use some form of combinatorial, gambling, attrition and feedback/looping based mathematics.

A shifting maze of environmental pressures over differing time periods gradually reducing and slowing the evolutionary clock of dna synthesis over the long term but remembering that in the here and now of every time period it is operating as fast as it possibly can within the limits of its genetic scope!

After spending more time with question 1, I have obtained some more results:

Examples of different sequence-length N have confirmed that the limiting distribution,

that is the matrix power (p^n)_{i,j} for n goes to infinity,

is a binomial distribution of length N with probablity m

Binomial[N,j] m^j(1-m)^{N-j} ,

where m is 3/4 , the fraction of false letters among the letters.

This makes sense, because in equilibrium you will hit a false letter with probability m.

(I am still looking for an expression for (p^n)_{ij} in dependance of N)

Accordingly, for n goes to infinity, the expectation value and variance are mN and m(1-m)N respectively.

The exact dependance of E(X_n) on n has already been given before.

Now, I also have an analytical expression for the variance in dependance of n with parameters N and m.

Since its quite long and I have not found a way to simplify it, I will just give the result for m = 3/4 and N = 108.

It is

Var(X_n) = 81/4 (1 – 646 (80/81)^n + 321 (79/81)^n)

with the limit m(1-m)N = 81/4 for n goes to infinity.

Finally I have looked up how to compute the expected hitting time for the state j, that is

h_{mN}(j) = 1 + \sum_{i=0}^N p_{i,j} h_{mN}(i) ,

where in the sum the term with i = mN is omitted.

I would be interested in a way how do make sense of the resulting recurrence relation for h analytically.

I have solved this tridiagonal linear system for h for all j with mathematica and

for j = 0 it yields 281.76 which was given before by Ashish.

I will now have a look on question 2.

Question 2:

There are now two kinds of species which mutate at different speeds.

The probability of a mutation in sequence X, which consists of Nx = 2N/3 = 72 letters is q = 2/5. The probability of a mutation in sequence Y, which consists of Ny = 1N/3 = 36 letters is 1-q = 3/5.

For both the quantitative behaviour is the same as in question 1, but with timescales adjusted to reflect these different probabilities:

E(Xn) = 3Nx/4 [ 1 – ((3Nx-4)/3Nx)^(qn)] ,

E(Yn) = 3Ny/4 [ 1 – ((3Ny-4)/3Ny)^((1-q)n)] ,

which results in E(X) = 54 and E(Y) = 27 for n goes to infinity.

The expected total number of mutations is

E(Xn+Yn) = E(Xn) + E(Yn) ,

and therefore also depends exponentially on n, with limit 3(Nx+Ny)/4 = 81.

Both expected hitting times can be computed in the same way as in question 1:

For q=1, the hitting time for X = 54 would be nx = 176.32 and ny = 78.15 for Y = 27 respectively.

However, since q is not equal to 1, these values have to be weighted appropiately to obtain the hitting time for X + Y = 81,

nx/q + ny/(1-q) = 571.03

which is similar to a previous result given by Ashish.

Question 2 Addendum:

I read over the comments and realized I made a mistake on question 2. My Markov chain implementation for this question used the wrong absorbing state and thus I reached the wrong conclusion about how long it took for the gene to stop showing additional evolution. (I thought the answer was shorter than it was.) I replaced the graph on my spreadsheet:

https://docs.google.com/spreadsheets/d/1mawx5j6b0GGHwEBCAdHS4s4AgUDbGXEBw6FWnTVpAs8/edit#gid=646244014

I'm embarrassed by my error mainly because I should've spotted it because the answer disagreed so much with the graph I generated showing expected number of changes over time.

Indeed, now I see the gene reaches the no-further-evolution line at 432 years 50% of the time, passes it with 95% probability at 790 years.

Note that this answer differs from Ashish's and Dennis's answers, which show an expected hitting time of 570-something years. Indeed, if you look at my expected graph,

https://docs.google.com/spreadsheets/d/1mawx5j6b0GGHwEBCAdHS4s4AgUDbGXEBw6FWnTVpAs8/edit#gid=1788723866

I show the same thing (570-something years).

I believe the difference here is medians versus averages.

revised script:

http://www.mark-pearson.com/quanta-magazine-puzzle/genetic-evolution/2.revised.py.txt

Question 2:

I have now calculated the expectation value of Xn and Yn

and have to revise my earlier result.

The correct n-dependance is:

E(Xn) = mNx(1 – [(mNx-q)/mNx]^n) ,

E(Yn) = mNy(1 – [(mNy-(1-q))/mNy]^n) ,

where m=3/4 is the ratio of false letters and Nx=2N/3 and Ny = N/3

and q is 2/5.

My earlier remarks concerning the hitting time will also be affected by this difference

and this could explain the small difference to Ashishs result I obtained in my previous post.