Are you one of those people who are so involved with the U.S. presidential election that you anxiously visit election-forecasting sites such as FiveThirtyEight, the Upshot or the Princeton Election Consortium (PEC) to look at their latest numbers? These poll-aggregating sites use sophisticated mathematical models to forecast the result of the presidential race. They provide numbers that boldly give each candidate’s probability of winning. This month, Insights invites *Quanta* readers to puzzle about principled ways to decide how much credence should be given to these numbers.

We are all familiar with innumeracy — the lack of a basic knowledge of mathematics. A significant proportion of the population is effectively mathematically illiterate and can easily be led astray by statistical statements and quantitative arguments in news stories. Innumeracy usually afflicts people who are not good with numbers, but there is a subtler affliction that does not spare mathematically competent people and sometimes affects them even more virulently — overprecision. As I described in my solution to Is Infinity Real?, we are conditioned to respect numbers calculated to multiple decimal places. We often ascribe mystical accuracy to them and fail to examine whether their precision is justified.

We have all encountered examples of this in real life. The standard joke concerns the museum docent who tells a visitor that a dinosaur skeleton is 75,000,005 years old. Why? The curator explains that it was 75 million years old when he started working at the museum five years earlier. I remember a geography textbook that declared that the distance between two cities was “about 1609.3 kilometers,” an absurdly overprecise translation of “about a thousand miles” that occurred when the country went metric. And I remember my then five-year-old son, who was used to precise Saturday-morning cartoon schedules, saying to me, “Daddy, come inside, it’s three minutes to the thunderstorm.”

Question 1

FiveThirtyEight and the PEC both predicted the outcome of the 2012 presidential race with spectacular accuracy. Yet on September 26 of this year, just before the first Clinton-Trump debate and 43 days before the election, FiveThirtyEight predicted on the basis of highly complex calculations that the chances for Clinton to win the election were 54.8 percent, while the PEC, with no doubt an equally complex model, estimated her chances to be 84 percent, almost 30 points higher. Both these models have highly successful histories, yet these numbers cannot both have been right. Do you think either number is justified as a principled prediction? Are these predictions meant only to titillate our psychological compulsion to have a running score, or do they have any other value? How many significant digits of these numbers can be trusted? How many presidential elections would it take to definitively conclude that one of these September 26 predictions was better than the other? And finally, as someone who backs a particular candidate, what is the wisest practical attitude to have about these probabilities?

In order to reason about this, we have to define what these probabilities mean. In Bayesian theory, the probability that a future event will occur is called a subjective probability, or credence, something that we encountered in the Sleeping Beauty puzzle. You can make the subjective probability practical by means of a bet: If you believe on a certain day that Clinton has, say, an 80 percent probability of winning, it means that you would rationally accept a bet that gave you more than $5 for every $4 that you bet. Bookmakers make a living on this, so it matters a great deal to them. Interestingly, the odds offered by the major betting site PredictWise on the same day implied a probability of a Clinton win about halfway between the FiveThirtyEight and PEC forecasts: 68 percent. But even if you win the bet, it does not validate the model you are using. You might have just been lucky.
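This betting interpretation is easy to make concrete: a probability p corresponds to a fair total payout of 1/p per dollar staked. A minimal sketch, using the percentages quoted above:

```python
# Convert a subjective win probability into fair betting terms.
def fair_decimal_odds(p):
    """Fair total payout per $1 staked on an event of probability p."""
    return 1.0 / p

# At 80 percent, fair odds are 1.25: you should accept any bet returning
# more than $5 for every $4 staked, since 5/4 = 1.25.
for name, p in [("FiveThirtyEight", 0.548), ("PEC", 0.84), ("PredictWise", 0.68)]:
    print(f"{name}: p = {p:.3f}, fair payout = ${fair_decimal_odds(p):.2f} per $1 staked")
```

A bookmaker offering worse odds than 1/p pockets the difference, which is why the implied probabilities of real betting markets are themselves informative forecasts.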

Of course, if you are a purist and betting is not something you care for, you might give up on the actual numbers and focus only on the trend line. Every presidential race is punctuated by events that might affect the outcome — debates, revelations, hackings, foot-in-the-mouth episodes, tax-return sightings, 11-year-old videotapes and so on. It is often easy to say whether each event will affect a particular candidate’s chances positively or negatively. Yet trend lines cannot easily satisfy that insistent voice within you that wants to know the “score” with the faux authority of a decimal place.

Nevertheless, there are some insights that can be gleaned just by looking qualitatively at the shapes of curves that do not have any numbers on them at all — something that would horrify a middle-school teacher teaching elementary graphs. Below I discuss a puzzle based on one of my favorite curves, the upside-down *U* curve, or what I call the “why too much of anything is bad” curve. This curve is similar to the Gaussian bell curve, or the downward parabola. But you do not need any numbers to learn from this curve. It occurs extremely often in real-world situations. Even those who are allergic to numbers can benefit from studying it and practice spotting it in the world around them.

On the *x*-axis is the dose or amount of a food, drug, activity or pretty much anything that you can have in life. On the *y*-axis is some measure of your well-being. What the curve tells you is that anything that is good for you will, in excess, hurt you — including, yes, being too rich or too thin.

In the right pane, you see the same curve illustrating an important point in evolutionary theory. Here the *x*-axis is any characteristic that has a genetic basis — say, your appetite for sugar, based on your genes. The *y*-axis is your evolutionary fitness — how much this trait contributes to your reproductive success. This graph helps to explain the truism “Everything you like is bad for you.” We originally evolved in an environment where things like sugar, fat and food were scarce — on the left, or upward, arm of the curve — so we evolved to like these far too much. In these days of plenty, we find ourselves on the right arm of the curve, resulting in cravings that are bad or even prematurely fatal for us.

Question 2

The graph in the right panel is the basis of an insight originally articulated by the eminent statistician and evolutionist Sir Ronald Fisher. Imagine that you as an individual are situated at some point on the curve. Presumably this would be at some point near the top, since the fact that you exist implies that you are a product of evolutionarily fit ancestors. If you have a large genetic mutation, you are likely to move to a different point on the curve, which will probably be bad for you. Yet, Fisher asserted, if the mutation were small enough, it would have a 50 percent chance of being good for you. Can you figure out why this should be so?

What will happen if we rotate the graph into a three-dimensional figure looking like a rounded cone and use it to represent not one but thousands of different characteristics? In what way will this modify these conclusions?

There is much to be learned just by looking at shapes of curves. For instance, we all know about the curve for exponential growth in mathematics. But, for the same reason that infinity does not exist in the real world, this curve does not really exist in real life — it shoots toward infinity too quickly. It is usually replaced by the *s*-shaped curve or the “all good things must end” curve (the generic version of the logistic curve) — or, if things get really ugly, by the upside-down *U* curve we just saw.
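The contrast is easy to see numerically. A small sketch comparing the two curves (the growth rate r and the ceiling K are arbitrary illustrative values, not anything from the column):

```python
import math

def exponential(t, r=1.0):
    """Unbounded exponential growth starting at 1."""
    return math.exp(r * t)

def logistic(t, r=1.0, K=100.0):
    """Logistic growth starting at 1 and saturating at the ceiling K."""
    return K / (1 + (K - 1) * math.exp(-r * t))

# The two curves track each other early on, then part ways as the
# logistic one bends into its s-shape and levels off at K.
for t in [0, 2, 4, 6, 8, 10]:
    print(f"t={t:2d}  exponential={exponential(t):10.1f}  logistic={logistic(t):6.1f}")
```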

Understanding such qualitative processes with numberless curves does indeed provide insights, but you need to be careful. It is easy even for experts to reach false or controversial conclusions by considering only the shapes of graphs, as has been the case with the Laffer curve, which has the upside-down *U* shape we discussed above. So in most cases, it is indeed wise to follow the teaching of your middle-school math teacher and insist on proper numbering and scaling of all graphs.

Furthermore, every graph needs to be looked at with a critical eye, and the data examined in context, especially when the prediction is an electoral claim. It is as easy to mislead with graphs and numbers as it is with words. To paraphrase a common warning: *caveat suffragator* — let the voter beware! Coincidentally, the Latin root for “vote” from which we get the word “suffrage” is uncomfortably similar to the word “suffering.”

I look forward to further insights from you, *Quanta* readers.

*Editor’s note: The reader who submits the most interesting, creative or insightful solution (as judged by the columnist) in the comments section will receive a *Quanta Magazine* T-shirt. And if you’d like to suggest a favorite puzzle for a future Insights column, submit it as a comment below, clearly marked “NEW PUZZLE SUGGESTION” (it will not appear online, so solutions to the puzzle above should be submitted separately).*

*Note that we may hold comments for the first day or two to allow for independent contributions by readers.*

1) I think neither of these predictions is terribly accurate. For one, they are wildly disparate; assuming they are both predictions of the same quantity, this at minimum means that the quantity being measured in this particular instance has a huge standard deviation. That would make both predictions pretty worthless. One could argue that the average of the two would produce a more accurate prediction (a situation that typically holds), but I don't think that is the case here.

I would argue that the two models (which are, I think, good models in typical circumstances) disagree to such a large extent because of the distribution of registered voters this election cycle. According to Pew Research, these are the trends for registered voters identifying as Democrat, Republican or independent:

http://www.people-press.org/interactives/party-id-trend/

I would particularly watch the trend from sometime during the Reagan years until Obama's first election, and even his second. During this period the proportions of each group were fairly consistent, especially if one ignores the trend from 2012 on. After 2012, however, one can clearly see the number of Republicans drop and the number of independents steadily rise. The data in the graph run through 2014, and one can rest assured that the trend has continued. At this point, the proportion of registered voters who identify as independents has likely grown to more than 50 percent.

Add to this that the debate between Clinton and Trump has sunk to the point where they are arguing about who lies the most, who is (or is associated with) the rapiest male in a position of power, who will start WWIII sooner, who has stronger ties to some foreign power, who is associated with more nefarious foreign powers, and so on, and you can rest assured that the situation will get even more extreme, with greater and greater numbers of registered voters identifying as independents. As a result, I firmly believe that a large bloc of registered voters swings wildly back and forth between support for Trump and Clinton, and this, together with a smaller proportion of registered voters identifying as either Democrat or Republican (thus forming more narrowly focused groups, value-wise, which also happen to be not that different from each other), is what the divergence between the models cited above represents.

With all that, I find it strange that neither model seems to make any kind of prediction for a win by an independent, particularly Jill Stein. As an aside, imagine if Stein were getting Clinton's media attention and could count on all of her positives being blared around while any negatives were entirely ignored. Imagine the same for Bernie, or Trump. Also, why can't Clinton beat the spread, or even come close to beating it, against Trump? Hmmmm, does someone have a preferred order: Clinton, Trump, Johnson, and… who is the other candidate again? …Oh yeah, Jill Stein.

2) Because any random individual has an extremely low probability of actually being at the average. Thus, if the mutation is small, it has about a 50 percent chance of moving the individual closer to the average, which in this case represents maximum fitness, and thus a 50 percent chance of being an improvement.

I am assuming that in the second part of the question you are picturing a bunch of graph lines radiating from the peak of a 3-D distribution, such that each line represents a different trait. Keep in mind that your measures would have to be transformed, in potentially nonlinear ways, to make all the traits occupy the same physical space within a distribution. In this case, nothing changes. If we have the same mental picture, I am curious why you thought it would; if we have different mental pictures, I would love a more thorough description of what you are picturing.

Question 1

"These numbers cannot both have been right". I don't agree with this, in the following sense. If a model M predicts a sequence of probabilities { p(E_k|M) }, for events E_1, E_2,… to occur (e.g., winners of various elections), and the result r_k is defined to be 1 if E_k does occur and 0 otherwise, then the model can be said to be successful over a large number N of predictions if the quantity

error_N =| sum_k p(E_k|M) – sum_k r_k |/N

is sufficiently small. The notion of "sufficiently" relates to how much one is willing to lose in the long run. In usual probability theory error_N is expected to scale like 1/sqrt{N}.
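That 1/sqrt(N) scaling is easy to check with a quick simulation of a model that is perfectly calibrated by construction: assign each event a random forecast probability and resolve it with exactly that probability. (A sketch of the definition above, not either site's actual method.)

```python
import random

def calibration_error(N, rng):
    """error_N = |sum_k p(E_k|M) - sum_k r_k| / N for a model that is
    perfectly calibrated by construction."""
    ps = [rng.random() for _ in range(N)]             # forecasts p(E_k|M)
    rs = [1 if rng.random() < p else 0 for p in ps]   # outcomes r_k, drawn at rate p_k
    return abs(sum(ps) - sum(rs)) / N

rng = random.Random(0)
for N in [100, 10_000, 100_000]:
    # Average over a few runs to smooth out the fluctuations.
    avg = sum(calibration_error(N, rng) for _ in range(10)) / 10
    print(f"N={N:>7}: error_N ~ {avg:.5f}   1/sqrt(N) = {N ** -0.5:.5f}")
```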

The above definition of success agrees with the usual one: if each event is given the same independent probability, p(E_k|M) = q, by the model, then the total number of occurrences is ~Nq. But it also allows for the case where each event is intrinsically unique, as in an election. With this definition, two different models for unique events can each be successful overall while giving different individual predictions.

In this sense, there is only a case for saying one model must be "wrong", i.e., for bookmakers to prefer one over the other, if, over a long sequence, they lose more money on it than on the other – e.g., if error_N is larger for one than the other. For finite N (which is always the case in practice), this comes down in part to the size of fluctuations from the mean.

Of course, if one imagines there is some "true" model that, given all the variables, predicts the result correctly every time (with p(E_k|M) = 0 or 1 in every case), then neither of the models mentioned is "right". This could be the case if the events lay in the past (but elections don't), or if the world were both deterministic (it doesn't seem to be) and predictable with current computational resources (no way).

In practice, I guess one would look at the past fluctuations of each model from the actual outcomes, in addition to error_N from the past outcomes, in deciding which one to use for this particular election. But one can then also subdivide past elections into categories based on similarities, and look only at the predictions of the models for the category in which the present election fits. One will soon give up, I think!

The important thing is that in making a bet on the outcome, one cannot bet on the probability (54.8 percent or 84 percent), but only on the outcome itself. Of course, one could make side bets on the deviation of the outcome from the model's predicted probability, but this is a different bet.

Question 2:

The curve looks like the Laffer curve, which I've seen used to explain things like why paying too little tax can be as bad as paying too much tax (if one wants things like roads, defence forces, and other infrastructure).

I'm not sure that I understand the question. But since one is unlikely to be sitting at the optimal point on the curve, but slightly to the right or to the left, then a small mutation will have a 50 percent chance of moving towards the maximum, and a 50 percent chance of moving away from it, no matter which side one is sitting on. The primary reason would seem to be that, to lowest order, the curve is symmetric in the vicinity of the maximum, being well approximated by a parabola in this region.

Question 2 (again)

Ah, I see the article mentions the Laffer curve, and also more dimensions. If the shape was a multidimensional cone, then the advantage depends on the prior distribution for moving in any given direction – there could be correlations between directions, e.g., mutations of nearby genes. But if it is symmetric, then the small mutation near the peak is more likely to be disadvantageous. E.g., for a symmetric paraboloid with a two-dimensional surface, the lines of constant fitness are circles on the plane. Moving a short distance in a random direction from a point on such a circle is more likely to take one outside the circle than inside it (a bit like the recent drunkard's walk problem). The probability will depend on the ratio of the step to the radius of the circle.

I guess the geneticists do look at random walks on the fitness surface, to model genetic drift and diffusion. That would be interesting.

Question 1:

I think both numbers can be justified. Each comes from a separate model of the world, with its own predictions about how the world situation could evolve and how likely those different paths are.

As for the percentages, you can trust as many significant figures as the modeler provides. The percentages are accurate with respect to a model. If the modeler calculated them through simulation, such as Monte Carlo, they would know how many significant figures to report based on how many samples they ran and how many additional samples it would take to change a digit in the result.
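This can be made precise: the standard error of a probability estimated from n independent simulations is sqrt(p(1-p)/n), which converts a run size directly into a number of trustworthy digits. A sketch (the run size n below is an illustrative assumption, not either site's actual number):

```python
import math

def monte_carlo_std_error(p, n):
    """Standard error of a probability estimated from n independent simulations."""
    return math.sqrt(p * (1 - p) / n)

# n = 20,000 is an illustrative run size, chosen for the example only.
p, n = 0.548, 20_000
se = monte_carlo_std_error(p, n)
print(f"p = {p}, n = {n}: standard error ~ {se:.4f}")
# About 0.0035: the simulation noise alone justifies barely two digits
# ("55 percent"), before any error in the model or the polls themselves.
```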

I don't think one would ever be able to conclude that one of these predictions was better than the other. It's impossible to duplicate this election. We can't wipe the slate clean, erase everyone's memories, and rerun it (though frankly, I think I know a lot of people who wish they could forget this election entirely! :-)). By watching many presidential elections over time, one could conclude that the system FiveThirtyEight uses to build models is likely better than the system the PEC uses (or vice versa), but it's impossible to conclude that one particular model was more right than the other. To restate: it's not possible to take multiple samples of this election to see which model estimates the true probability more accurately.
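One rough way to put a number on how long that watching would take: treat elections as independent coin flips and ask how fast the evidence for one forecast over the other accumulates. A sketch (it assumes, generously, that one of the two September 26 probabilities is exactly right and that elections are i.i.d., which they are not):

```python
import math

def elections_to_distinguish(p_true, p_other, llr_threshold=3.0):
    """Expected number of i.i.d. binary events needed for the log-likelihood
    ratio favoring the correct forecast to reach the threshold (3 nats ~ 20:1)."""
    # Expected per-event log-likelihood ratio when p_true is the correct probability.
    per_event = (p_true * math.log(p_true / p_other)
                 + (1 - p_true) * math.log((1 - p_true) / (1 - p_other)))
    return math.ceil(llr_threshold / per_event)

# If the PEC's 84 percent were exactly right, separating it from 54.8 percent:
n = elections_to_distinguish(0.84, 0.548)
print(f"about {n} elections, i.e., {4 * n} years of quadrennial contests")
```

Roughly 16 elections, or 64 years of presidential races, and even that assumes the underlying probability stays put, which it plainly would not.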

As someone who backs a candidate, I think the wisest action is to ignore the predictions. A prediction is supposed to take the state of the world into account, including how that state may evolve. If you pay attention to the prediction and change your actions based on it (e.g., "my candidate is going to win, so I no longer need to canvass for him or her"), then the prediction rests on a faulty premise: you're invalidating it. Thus you shouldn't act on a prediction, because acting on it makes it wrong, which makes it illogical to have acted on it.

In question 2, under various continuity & differentiability assumptions, the fitness function along a short vector (small-enough mutation) through almost any point in characteristic space will be either strictly increasing or strictly decreasing. Assuming a mutation and its reverse (opposite vector) are equally likely, fitness will thus be as likely increasing as decreasing when a small-enough mutation occurs.

In question 1 — "these numbers cannot both have been right" — they can both be right w.r.t. their respective models: this is known as the Reference Class Problem.

I respectfully disagree with the sentiment of this author. Of course the predictions differ, because they are using different models and data. That does not mean they are not useful. Polls also always differ depending on which news network is conducting them (e.g., Fox vs. CNN).

Instead of trying to discredit the existing probability models, it would have been more interesting if the author had looked at the models to evaluate why they differ and which model might be more reliable than the others.

Thank you for this very interesting article and commentaries. This is a fun read! I wonder if the author or any of the commentators would indulge a modification to the assumption of Question 2: what happens if the "why too much of anything is bad" curve is replaced with either an extremely heavy-tailed curve (or "cone") or an extremely skewed one?

Question 1:

Suppose you draw five playing cards from a freshly-shuffled deck, don't show them to me, and ask me what the chance is that there are more red cards than black ones. Obviously I'd be crazy to say anything other than 50%. But now suppose that I inadvertently saw one of the cards you drew, namely the ace of spades. My estimate would immediately drop to a little more than 30%. While the original 50% estimate was inarguably correct, it was not very robust to small changes in the information available to me. So, while the probability (50%) was an output of my model, it didn't really give the full story of what my model was telling me — the model could also supply additional relevant information, like the robustness of this probability to small changes in the inputs.
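Both figures in this card example (the 50 percent and the "little more than 30 percent") can be checked exactly with binomial coefficients; a quick sketch:

```python
from math import comb

def prob_majority_red(reds, blacks, hand):
    """P(more red than black) when drawing `hand` cards without replacement."""
    total = comb(reds + blacks, hand)
    need = hand // 2 + 1  # strict majority (no ties possible with an odd hand)
    return sum(comb(reds, k) * comb(blacks, hand - k)
               for k in range(need, hand + 1)) / total

# Full deck, five cards: exactly 50 percent, by red/black symmetry.
print(prob_majority_red(26, 26, 5))  # 0.5

# Having glimpsed the ace of spades: 4 unknown cards from 26 red and 25 black,
# and at least 3 of them must be red for a red majority overall.
p_seen = sum(comb(26, k) * comb(25, 4 - k) for k in (3, 4)) / comb(51, 4)
print(round(p_seen, 4))  # 0.3199
```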

The U.S. Presidential election is not altogether different from the scenario described above. As we saw in 2000, it's possible to "win" the election by half a million votes and still lose due to the electoral college system. Since most electoral votes are all but determined before the candidates are even selected, predicting the election comes down to determining whether five or so "cards" (blocks of electoral districts) will turn out to be "black" (blue) or "red" (red).

Since these models are not black boxes, the best thing to do isn't to treat them as though they were (as we're implicitly doing by asking how many elections we'd have to see to determine which model was "correct"). Instead, we should take far more information from the models than simply the final probability they output, and only *then* compare their predictions to one another.

Given all this information, a supporter might see that the predicted outcome of the election is especially sensitive to, let's say, voter turnout in a particular locale, encouraging him or her to attempt to get out the vote there.

Question 2:

In fact, we need almost no assumption as to the nature of the curve, so long as we can assume it's smooth. Provided we are not exactly *at* a local maximum of fitness (which, given a unimodal fitness function like the one shown, is actual "maximum fitness,") then fitness is either increasing or decreasing with respect to the characteristic in question. In the former case, a small move to the left (i.e., small enough not to leap right over a local maximum) will hurt, and a small move to the right will help. In the latter case the situation is reversed. In either case, provided there's a 50/50 chance to move left or right then there's a 50/50 chance that the change will be a positive one.

The situation is not changed in higher dimensions, as all change happens in *some* direction. We may simply restrict our attention to the direction the change occurs in, reducing to the one-dimensional case considered before.

@Daniel MacLaury

Re Question 2, I agree with you (and @eJ) that, in the 1D case, one just needs to be near a local maximum for a 50:50 chance of improvement by a sufficiently small mutation. My assumption that the curve needed to be symmetric is not necessary.

But I disagree that the chances remain 50:50 in the multidimensional case. One cannot, after mutation in a given direction, say that this improves fitness because it does if one restricts the definition of fitness to that direction. The fitness function is already defined on the surface, and one has to consider its total change.

I gave an argument earlier that the mutation is more likely than not to be disadvantageous, based on the idea that if one moves randomly a small distance away in some direction from the circumference of a circle, then one is more likely to end up outside the circle (less fitness) than inside the circle (greater fitness). This was for a circularly symmetric fitness function. I will argue that it holds more generally further below.

But first, to give an explicit example for the circularly symmetric case, suppose the fitness function is defined on the plane of two parameters as

f(x,y) = 1000 – (x^2 +y^2),

with a maximum at the origin, x=y=0. If the organism starts at coordinate (a,b)= (1,0) and moves a small distance d<<1 in some random direction theta, the new coordinate will be

(a',b') = (1+d cos theta, d sin theta). The change in the fitness is therefore

f(a',b') – f(a,b) = 1 – [1+d cos theta]^2 – [d sin theta]^2

= – d^2 – 2d cos theta

= -d ( d + 2 cos theta).

The probability of this being positive is the probability that cos theta is less than -d/2, i.e., fixing theta to range over -pi to pi, that

theta > pi/2 + arcsin(d/2) or theta < -pi/2 – arcsin(d/2).

Since p(theta)=1/(2pi), this gives the probability of the mutation being positive as

p(positive) = 2[pi/2 – arcsin(d/2)]/(2pi)

= 1/2 – (1/pi) arcsin(d/2),

which is always less than 1/2 (for small d one has the approximation [1-d/pi]/2).
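The closed-form result above, p(positive) = 1/2 - (1/pi) arcsin(d/2), is easy to verify with a Monte Carlo check of the same setup (start at distance 1 from the peak, step a distance d in a uniformly random direction):

```python
import math
import random

def p_improvement_exact(d):
    """P(fitness increases) after a step of size d from (1, 0), per the formula above."""
    return 0.5 - math.asin(d / 2) / math.pi

def p_improvement_mc(d, trials=200_000, seed=1):
    """Monte Carlo estimate for f(x, y) = 1000 - (x^2 + y^2), starting at (1, 0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        theta = rng.uniform(-math.pi, math.pi)
        # Change in fitness is -d * (d + 2 cos theta); improvement iff positive.
        if -d * (d + 2 * math.cos(theta)) > 0:
            hits += 1
    return hits / trials

d = 0.2
print(p_improvement_exact(d))  # 0.4681..., always below 1/2
print(p_improvement_mc(d))     # should agree to a couple of decimal places
```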

More generally, the fitness function will have a set of simple closed contours in the neighbourhood of the local maximum, with the value of the fitness function being constant on each contour, and decreasing as one moves away from the local maximum. It is clear that if these contours are convex, then moving a small fixed distance d away from a point on a given contour, in a random direction, will more likely end up outside the contour than inside it (in the special case where the convex contour has a straight segment – e.g., the contour is a square – the probability of ending up inside or outside will be 50:50 only if one is more than a distance d from the closest bend or 'corner' of the contour). So, fitness will decrease on average under a small random mutation.

If the contours are not convex I think the same conclusion can be drawn, as long as the mutation is small compared to the curvature of the contour, and the contour has a finite length (e.g., is not a fractal). The idea here is to imagine, at each point P along a given contour C, drawing a small circle of radius d centred on P (here d is the size of the mutation). These circles taken together will fill up a strip extending outside the contour up to a distance d from C, and extending inside the contour up to a distance d. It is clear that the area A_out of the outside part of this strip is larger than the area A_in of the inside part of the strip. Hence, if one is equally likely to start at any point P on the contour (admittedly, a questionable assumption), and moves d in a random direction, one expects the odds of increasing one's fitness, i.e., of moving inside the contour, to be approximately given by A_in:A_out, which is less than 50:50.

@Michael, OK, I agree. You've answered the question for small d; I've answered it for small d(theta), which was a mistake. In the limit, for "infinitesimal" d, it's 50/50 plus-or-minus an infinitesimal amount.

@Michael: You wrote:

"But I disagree that the chances remain 50:50 in the multidimensional case. One cannot, after mutation in a given direction, say that this improves fitness because it does if one restricts the definition of fitness to that direction. The fitness function is already defined on the surface, and one has to consider its total change."

I agree, as I believe this is akin to saying that there is some interaction between fitness traits. For example, if we are measuring night sight fitness, since the eyes are on the face, when the eyes get larger, the nose and/or mouth must get smaller, and thus the fitness mechanisms associated with them would likely change. This sounds like a reasonable and sensible assumption to me.

The only critique I have is that it seems to me you are assuming that the given traits would be correlated such that a small change in one would lead to at most a small change in some other. I agree that this is a fair assumption, but it would not hold up in certain extreme circumstances. For example, to use the above example of eyes, the interaction could be such that after a certain point, for a small increase in eyes and thus night vision, a relatively great decrease could be seen in either the nose or mouth (such that the decrease in nose size does not effect overall fitness. This might be a poor and/or awkward example…), whereas increases in other traits don't have the same degree of interaction. Thus in this case the degree to which any given traits are inversely correlated (one goes down when one goes up, or vice versa) would, on a case by case basis, determine if changing a given trait by a small amount would increase fitness.

Just to clarify, in paragraph three above I wrote: "such that the decrease in nose size does not effect overall fitness" but meant "such that the decrease in nose size greatly and negatively effects overall fitness". My apologies for the confusion, and I hope the rest makes sense 🙂

@Michael:

I see, we interpreted the question differently — I was considering infinitesimal displacements, but you're considering the (apparently more interesting) case where the displacements are small but non-infinitesimal.

Looking back I see that I suggested that my infinitesimal displacements were just as good as finite ones that didn't cross a local maximum, which in retrospect is obviously wrong (except in one dimension).