Insights Puzzle

Solution: ‘Which Forecasts Are True?’

Aside from potential clues gleaned from a fluke result, it would take hundreds of U.S. presidential elections to definitively conclude that one election forecasting model is superior to another.

Come November 8, the seemingly interminable U.S. presidential campaign season will come to an end. For election prognosticators, the volumes of text and mathematical calculations — based on feeding hundreds of carefully vetted and aggregated polls into constantly churning models — will be relegated to history. The outcome of the race will be condensed into a single binary result: Either Hillary Clinton or Donald Trump will be president. This result, coupled with the margin of victory (the electoral vote count), will be the only empirical data we will have to gauge the accuracy of the forecasters. For math lovers, it’s also a chance to assess how well mathematical models can predict complex real-world phenomena and to expose their pitfalls.

In the Quanta Insights puzzle this month, I invited readers to ponder these questions.

Question 1

FiveThirtyEight and the Princeton Election Consortium (PEC) both predicted the outcome of the 2012 presidential race with spectacular accuracy. Yet on September 26 of this year, just before the first Clinton-Trump debate and 43 days before the election, FiveThirtyEight predicted on the basis of highly complex calculations that the chances for Clinton to win the election were 54.8 percent, while the PEC, with no doubt an equally complex model, estimated her chances to be 84 percent, almost 30 points higher. Both these models have highly successful histories, yet these numbers cannot both have been right. Do you think either number is justified as a principled prediction? Are these predictions meant only to titillate our psychological compulsion to have a running score, or do they have any other value? How many significant digits of these numbers can be trusted? How many presidential elections would it take to definitively conclude that one of these September 26 predictions was better than the other? And finally, as someone who backs a particular candidate, what is the wisest practical attitude to have about these probabilities?

A monthly puzzle celebrating the sudden insights and unexpected twists of scientific problem solving. Your guide is Pradeep Mutalik, a medical research scientist at the Yale Center for Medical Informatics and a lifelong puzzle enthusiast.

Quanta readers seemed loath to attack this question head-on. Reader Mark P’s words summarize these reactions: “I think both numbers can be justified. Each comes from separate models for the world.…” This is true, but it trivializes the problem of overprecision. Given that any mathematical model can calculate the probability of an event to hundreds of decimal places, the principled way to decide how many of those digits are meaningful depends on what the margin of error is likely to be. The expected accuracy of the prediction must determine the number of significant digits or decimal places presented. In other statistical contexts, data is presented with “error bars,” and most reputable pollsters include “estimated errors” with their poll numbers. There is also a deeper issue here: As I mentioned, both FiveThirtyEight and the PEC have a remarkably good history of predicting previous presidential election winners almost to the exact number of electoral votes obtained. Yet their probability estimates on September 26 differed by almost 30 percentage points. What gives?

My answer is that these two problems — predicting who will win by what margin based on the aggregation of polls close to the election date, and predicting the probability of who will win at a certain point within a race, especially many days out — are two separate problems entirely. The first of these has been attacked with good success by these and other modelers, justifying the credibility and precision of their models; the models can be improved incrementally, but they are reasonably close to being as good as possible. The second problem — that of estimating the probability of a candidate winning at a specific point in time — on the other hand, is akin to weather prediction: It models a chaotic phenomenon, and the state of the art in making accurate determinations here still has a long, long way to go.

I should say that I really admire the poll aggregation and presidential vote prediction work done by Nate Silver and Sam Wang, the chief architects of FiveThirtyEight and the PEC, respectively. Their modeling of the first problem described above is top-notch. As the election date nears, you can see these two models coming closer together; as of October 26, exactly a month after their widely divergent probabilistic predictions, their predictions of a Clinton win are 85 and 97 percent, respectively. More importantly, their predictions of the electoral votes she is expected to win are 333 and 332, respectively. Based on previous history we can expect with high confidence that their final predictions just before the election will be pretty accurate — assuming we do not have the kind of systematic polling error seen in the Brexit vote, or a hidden-preference bias of the sort known as the Bradley effect. That said, I don’t think that the precision of the divergent probability numbers offered by these two models on September 26 can be justified in a principled way. And the reason for this is that the phenomenon being modeled is far too complex and chaotic.

It is not that probabilities in real-world phenomena cannot be precisely estimated. Some predictions of quantum mechanics, which is a completely probabilistic theory of the world, have been verified to a precision of one part in a trillion, good to 12 significant digits. The reason for this, as I’ve said before in these columns, is that the physics of one or two particles is far simpler, and therefore far kinder to our mathematical models, than the highly messy and complex situations encountered in biology, psychology and politics.

The standard empirical method of testing probabilistic predictions is to repeat the same experiment many times and count the outcomes. This is all very well in quantum-mechanical experiments, where the exact same experiment can be repeated a million times. But how do we determine the empirical accuracy of a prediction in the real world, where each situation and each presidential election is different? The answer is to measure similar cases, which brings up the “reference class problem,” as eJ mentioned: What classes of events do we consider similar enough to assign our probabilities? We have to restrict ourselves to U.S. presidential elections, and the amount of data we have is far too limited to make any kind of empirical verification of our probabilities. We would have to examine hundreds or thousands of elections, and track hundreds of factors that affected their results 43 days out, including the chances of result-changing events of all kinds, to have any confidence in the precision of our predictions. Only some of the most relevant factors are incorporated in current models, and the assumptions and weights attached to them are pretty arbitrary. In an interesting recent article, Nate Silver looked at four such assumptions — how to treat undecided voters, how far back to look at presidential elections, how to model rare events and how to model correlation between state votes — which, if changed, can explain differences between his and other models. In particular, it seems to me that FiveThirtyEight assumes a far higher voter volatility (how likely people are to change their minds) in the current election than the PEC model does; the PEC takes the longer view that in the current polarized atmosphere, voter volatility is low, in spite of the presence of two third-party candidates.
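To get a feel for the numbers, here is a minimal back-of-the-envelope sketch (not part of either model) of how much raw outcome data plain counting would require. It assumes, purely for illustration, that a PEC-style figure of about 0.84 really were the true win probability and that we had a supply of independent, comparable elections — exactly the supply the reference class problem says we lack:

import numpy as np

rng = np.random.default_rng(0)

def elections_needed(p_true, half_width, z=1.96):
    """Roughly how many comparable elections are needed before the observed
    win frequency pins the true probability down to within +/- half_width
    (95 percent confidence, normal approximation to the binomial)."""
    return int(np.ceil((z / half_width) ** 2 * p_true * (1 - p_true)))

p = 0.84                              # treat the PEC-style figure as the truth
for hw in (0.10, 0.05, 0.01):
    print(f"to within +/- {hw:.2f}: about {elections_needed(p, hw)} elections")

# Monte Carlo sanity check for the +/- 0.05 case.
n = elections_needed(p, 0.05)
freqs = rng.binomial(n, p, size=20_000) / n
print("simulated 95% spread of observed frequencies:",
      np.round(np.quantile(freqs, [0.025, 0.975]), 3))

Even resolving the probability to within 10 points takes on the order of 50 comparable elections, to within 5 points a couple of hundred, and to within a single point several thousand, which is why a figure like 54.8 percent carries far more digits than any conceivable election record could support.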

So, to come back to my original questions, I do not think that the stated probabilities can be justified. Probabilities do have a psychological value and do satisfy our compulsion to score the race, but this score is to a large extent arbitrary, and therefore bogus. I think that these probability numbers are not even accurate to one significant digit: They have an error of more than 10 points. It is much better to use the seven-point scale used by other modelers: solid blue, strong blue, leaning blue, toss-up, leaning red, strong red and solid red, and leave it at that. To definitively conclude that one model is superior to another would require hundreds or even thousands of U.S. presidential elections. On the other hand, chance results could give us some clues. For example, if Trump should win, it would strongly suggest that the more volatile FiveThirtyEight model is more accurate than the more stable PEC model. The wisest practical attitude is to ignore the absolute values of the probabilities (except to the extent that they map into the seven-point scale) and to focus on the aggregated poll averages — the percentage of the electorate that supports one candidate or the other, based on a combination of all high-quality polls. Relative changes within probabilistic models can simply be used to get some idea of shifting trends over time, in response to events like debates, outrageous statements and WikiLeaks releases.
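For what it is worth, the mapping onto that coarser scale is simple enough to write down. The cutoffs below are assumptions chosen purely for illustration, since the column does not specify any; the point is only that the coarse bins discard exactly the decimal places the models cannot support.

def seven_point_rating(p_blue):
    """Map a win probability for the 'blue' candidate onto the seven-point
    scale mentioned above, using assumed (illustrative) cutoffs."""
    if p_blue >= 0.95:
        return "solid blue"
    if p_blue >= 0.80:
        return "strong blue"
    if p_blue >= 0.60:
        return "leaning blue"
    if p_blue > 0.40:
        return "toss-up"
    if p_blue > 0.20:
        return "leaning red"
    if p_blue > 0.05:
        return "strong red"
    return "solid red"

for p in (0.548, 0.84):               # the two September 26 forecasts
    print(p, "->", seven_point_rating(p))

Under these assumed cutoffs the September 26 figures land in "toss-up" and "strong blue," respectively; the categories preserve the substance of the disagreement while dropping the spurious precision.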

Question 2

The graph below is the basis of an insight originally articulated by the eminent statistician and evolutionist Sir Ronald Fisher. Imagine that you as an individual are situated at some point on the curve. Presumably this would be at some point near the top, since the fact that you exist implies that you are a product of evolutionarily fit ancestors. If you have a large genetic mutation, you are likely to move to a different point on the curve, which will probably be bad for you. Yet, Fisher asserted, if the mutation were small enough, it would have a 50 percent chance of being good for you. Can you figure out why this should be so?

What will happen if we rotate the graph into a three-dimensional figure looking like a rounded cone and use it to represent not one but thousands of different characteristics? In what way will this modify these conclusions?

I can do no better than to quote the answer given by Daniel McLaury:

In fact, we need almost no assumption as to the nature of the curve, so long as we can assume it’s smooth. Provided we are not exactly at a local maximum of fitness (which, given a unimodal fitness function like the one shown, is actual “maximum fitness”), then fitness is either increasing or decreasing with respect to the characteristic in question. In the former case, a small move to the left (i.e., small enough not to leap right over a local maximum) will hurt, and a small move to the right will help. In the latter case, the situation is reversed. In either case, provided there’s a 50-50 chance to move left or right then there’s a 50/50 chance that the change will be a positive one.

The situation is not changed in higher dimensions, as all change happens in some direction. We may simply restrict our attention to the direction the change occurs in, reducing to the one-dimensional case considered before.
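A minimal Monte Carlo sketch, assuming nothing more than a quadratic fitness peak (any smooth unimodal curve behaves the same way near its top), makes the quoted argument concrete: as the mutation size shrinks, the fraction of random mutations that improve fitness approaches one-half, whether one trait changes or fifty change at once.

import numpy as np

rng = np.random.default_rng(1)

# A smooth, unimodal fitness surface; the quadratic peak at the origin is
# just a convenient stand-in for the curve described in the question.
def fitness(x):
    return -np.sum(x * x, axis=-1)

def beneficial_fraction(x0, step, n_dims, trials=200_000):
    """Fraction of random mutations of a given size that raise fitness,
    starting from a point x0 placed near, but not at, the optimum."""
    directions = rng.normal(size=(trials, n_dims))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return np.mean(fitness(x0 + step * directions) > fitness(x0))

for n_dims in (1, 50):
    x0 = np.full(n_dims, 0.5 / np.sqrt(n_dims))   # distance 0.5 from the peak
    for step in (0.5, 0.1, 0.01, 0.001):
        frac = beneficial_fraction(x0, step, n_dims)
        print(f"{n_dims} trait(s), mutation size {step}: {frac:.3f}")

For tiny mutations the printed fractions hover around 0.500 in both cases, as Fisher argued; for larger jumps the fraction falls below one-half, and it falls faster when many traits shift together, which is the small-but-not-infinitesimal effect debated below.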

There has been some debate in the comments section regarding an answer given by Michael, who treated the multidimensional figure purely mathematically and came up with an answer of less than 50 percent for a small, but not infinitesimal, change. This is, however, a “trick question.” Our third dimension is created by juxtaposing hundreds of different traits with their own upside-down U profiles of fitness. Each individual line making up this 3-D figure is discrete and represents a different trait that may have no interaction with the one adjacent to it. In effect, we are adding a discrete categorical axis, not a continuous numerical one. So it is not kosher to perform the kinds of mathematical operations that Michael did to make his argument. This is the argument that I believe Hans makes. The mathematics of evolutionary fitness is extremely interesting and has shown that there is a great deal of evolutionary change that is neutral with respect to fitness. Perhaps we can go into it in more depth in a future column.

The Quanta T-shirt for this month goes to Daniel McLaury for his interesting comments on both questions.

See you soon for new insights.



Reader Comments

  • Your juxtaposition of hundreds of different traits appears to bear relation neither to the question as posed nor to any reasonable theory of fitness. Some "trick".

  • Hi eJ,

    I knew you were going to say that 🙂
    But how else can we compress thousands of different characteristics into a three-dimensional diagram? We would require a thousand-dimensional hypercube.

    I agree that the conversation between Michael, Hans and you is relevant for a pair of characteristics that are related in a way that you can map the interaction between them into a smooth mathematical space. Some pairs of characteristics may satisfy these assumptions, others will not.

    This speaks to the theme of this column. When we model messy biological phenomena, we have to make many idealizing assumptions. We know that from a biological point of view characteristics may interact not at all, or may do so in complex ways. We assume that genetic variation is continuous although we know it is discrete. We can certainly learn something from our models, but we have to be forever vigilant that we are not reading more into them than the real-world situation warrants.

  • Here is a solution to the evolutionary fitness problem, given the updated version of the question. I agree that I don't really see the relevance of 'rotating the curve' to visualize many different genetic traits – if we do this we introduce an uncountably infinite number of traits indexed by the angle in the plane, and confuse ourselves because these traits should be thought of as independent, unless we want to couple them.

    Let's set up the problem then: we have a finite set of genetic traits G, we'll assume they're independent, and label them 1,…,N. For one trait we have an evolutionary fitness function u: R -> R. For all traits we would have a fitness function h: R^N -> R. In some cases we may want to assume the fitness function is the same for each trait, in which case we have an assignment f: R^N -> R^N given by f(x) = (u(x_1),…,u(x_N)). In order to obtain a scalar measure of evolutionary fitness, we should take an average or product to get h(x) = (1/N) sum_i u(x_i).

    Now we can phrase the question: given a small, random change in evolution space (R^N), do we expect an increase or decrease in evolutionary fitness?

    Depending on how we define 'small random change', we can consider two versions of the problem: the 'discrete' version, in which we take a small, finite change, or the 'continuous' version, where we consider a small change in an arbitrary direction.

    Let's focus on the discrete problem first. Fix a point x in R^N, evolution space. A small random change will be as follows: say there is a 50% chance of moving by (delta x) in the +x_i direction and a 50% chance of moving by (delta x) in the -x_i direction. Then there is an equal probability, (1/2)^N, of moving by epsilon = (+/- delta x, …, +/- delta x) in evolution space. This change is 'good' if the change in evolutionary fitness is positive, i.e. if sign((delta h)(x)) > 0. The change is 'bad' if sign((delta h)(x)) < 0. We'd like to compute the expected value of the sign of the change.

    It's not hard to write down this expectation value, but it will depend on h. The presence of the sign means that this will be hard to compute precisely. Without the sign we would be computing the expected change, as opposed to the expectation of a good or bad change. This is an interesting question, but is different from the question posed.

    This is a good time to switch over to the continuous setting, where things are a bit cleaner. This time we'll define a 'small random change' as a shift (delta x) in the direction v. We think of v as a unit vector in R^N, so a point on the sphere S^N (e.g. in R^2 we have a 'circle' of directions in which to move in evolution space). It's equally likely to move in any of these directions, so we'll assign a probability 1/vol(S^N).

    Now, we'd like the expected value of the sign of the change. The change of h in the v direction is defined as

    (delta h)(x) = h(x + (delta x)v) – h(x)

    At this point we want to approximate the change using Taylor's theorem:

    (delta h)(x) ~ (grad(h)(x) . v)(delta x) + (1/2) (v^T H(x) v) (delta x)^2 + …

    Let's assume x is near a maximum of h, but not exactly at one. Then we can approximate (delta h)(x) to first order making use of the directional derivative and formalizing a bit what Daniel was saying. The expected sign of the change is then

    E[sgn((delta h)(x))] = (1/vol(S^N)) int_v sgn((grad(h)(x) . v)(delta x)) dv
    =(1/vol(S^N)) int_v sgn(||grad(h)(x)||*||v||cos(theta)(delta x)) dv
    =(1/vol(S^N)) int_v sgn(cos(theta)) dv

    where theta is the angle between (grad h)(x) and the direction vector v. If we think of (grad h)(x) as going through the north pole of the sphere, then we see that sgn(cos(theta)) = +1 on the 'upper hemisphere' and sgn(cos(theta)) = -1 on the 'lower hemisphere'. This means that the integral is zero, i.e. exactly half of the directions to move in are good and half are bad (for example, think of N=2 and draw a contour map of h in the plane).

    Although it's intuitive, we can use the same ideas to say exactly what happens if x is a maximum. Now (grad h)(x) = 0, so we should approximate h to 2nd order.

    (delta h)(x) ~ (1/2) (v^T H(x) v) (delta x)^2

    At the maximum the Hessian is negative-definite. This implies that

    sgn( (1/2) (v^T H(x) v) (delta x)^2 ) = -1

    The expected sign E[sgn((delta h)(x))] = -1, so a small change is always 'bad' if we're at a maximum.

    As a curiosity/fun fact: if you remove the sgn and compute E[(delta h)(x)] at a max, you obtain (1/2) (delta x)^2 c_N * tr(H(x)), where c_N is some constant and we are taking the trace of the Hessian, also known as the Laplacian.

  • You state that "I think that these probability numbers are not even accurate to one significant digit: They have an error of more than 10 points."

    What's your justification? Not that I disagree, but it's not clear to me. Are you saying each model is reporting an error that high through its own testing and parameters? Does the statement merely mean that the two models are erroneous relative to each other? Or are you making a judgement based on a prior sense that the range of reasonable parameters for these models is much wider than the ranges actually used in both of them?

    The reason is important because it'll help justify where the "10" came from. (It sounds like you won't agree with the statement if it said 20, based on your comments about wanting a 7-point scale.) If this is merely your judgement (the third option above), then other reasonable people may put different error bars on these predictions. In effect, you're saying that your mental model (with different priors / priors over parameters) has a different prediction and error bars, better than these other models.
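As a numerical footnote to the long fitness calculation above: the expected-sign claims are easy to check by brute force. The sketch below assumes the averaged quadratic fitness h(x) = (1/N) sum_i u(x_i) with u(t) = -t^2 from that comment's own setup, samples uniformly random directions, and confirms that the expected sign is roughly zero near the maximum and exactly -1 at it. For this direction distribution the unspecified constant c_N works out to 1/N, since E[v v^T] = I/N; that identification is an addition of ours, not the commenter's.

import numpy as np

rng = np.random.default_rng(2)
N, delta, trials = 10, 1e-3, 400_000

def h(x):
    """Averaged quadratic fitness; its maximum is at the origin,
    where the Hessian is H = -(2/N) * I."""
    return -np.mean(x * x, axis=-1)

def expected_sign_and_change(x0):
    v = rng.normal(size=(trials, N))
    v /= np.linalg.norm(v, axis=1, keepdims=True)    # uniform unit directions
    dh = h(x0 + delta * v) - h(x0)
    return np.mean(np.sign(dh)), np.mean(dh)

# Near, but not at, the maximum: the signs should average out to roughly 0.
print("near the maximum:", expected_sign_and_change(np.full(N, 0.3)))

# Exactly at the maximum: every direction is downhill, so the sign is -1,
# and the mean change should match (1/2) * delta^2 * tr(H) / N.
sign_at_max, mean_change_at_max = expected_sign_and_change(np.zeros(N))
trace_H = -2.0
print("at the maximum:", sign_at_max, mean_change_at_max,
      "predicted mean change:", 0.5 * delta**2 * trace_H / N)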
