Big Data and the Trump Shocker
by Rob Kaiser, Editor, Consulting Psychology Journal
The polls, pols, and pundits all got it wrong—and the world was stunned when Donald Trump defeated Hillary Clinton. People are still reeling; from the talking heads on TV to the regulars at the coffee shop to onlookers around the globe, the question is, “How did this happen?”
Even fivethirtyeight.com, the astonishingly accurate forecasting website created by the great data scientist Nate Silver, had Hillary at 71% for the win on the eve of Election Day. 538 is unique in that it is not a poll: it is an aggregator that combines results from hundreds of polls and weights them by factors such as quality. By averaging so many individual polls, it attempts to overcome the imperfections of each and arrive at a less biased and more reliable prediction.
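To make the aggregation idea concrete, here is a minimal sketch in Python of a quality-weighted poll average. It is not 538's actual model, which is far more elaborate; the poll numbers, the quality weights, and the function name weighted_average are all invented for illustration.

```python
# Hypothetical polls: (Clinton share in %, quality weight).
# Invented numbers, not real 2016 polls and not 538's weighting scheme.
polls = [
    (48.0, 1.0),
    (46.0, 0.5),
    (50.0, 2.0),
    (44.0, 0.5),
]


def weighted_average(polls):
    """Return the mean of the poll shares, weighted by each poll's quality."""
    total_weight = sum(weight for _, weight in polls)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(share * weight for share, weight in polls) / total_weight


# Compare a plain average with the quality-weighted one.
simple_mean = sum(share for share, _ in polls) / len(polls)
aggregate = weighted_average(polls)
print(f"simple mean:   {simple_mean:.2f}")
print(f"weighted mean: {aggregate:.2f}")
```

Averaging this way damps the random noise of any single poll, but it cannot remove an error that all of the polls share.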
To be fair, 538 gave Trump a better chance of winning than did most prognosticators, in large part because of its super-sophisticated way of modeling data and its recognition that not all errors are random. But it too failed to see that Trump would likely be the 45th president of the U.S.
How could so many experts with so much data be so wrong? The answer provides some vital lessons for the promise and pitfalls of using Big Data to make big decisions.
Statistics are only as good as the data that feed into the model. How well does the sample mirror the population you are trying to predict? How clean and credible are the data points? How well do the numbers reflect the phenomenon?
The most common polling technique is to ask people how they intend to vote. But with so many people on mobile phones and with unlisted numbers, it is hard to reach big swaths of the population. Non-response bias, where people who don’t participate in polls are systematically different from those who do, appears to have seriously skewed this year’s polling data.
Further, a dispiriting aspect of this election was that most Americans did not approve of either candidate. Many people simply weren’t comfortable saying who they would vote for—for fear of being looked down upon or, worse and unfortunately quite often, shouted down and insulted.
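The arithmetic behind this kind of skew is simple enough to sketch in Python. The example below assumes a dead-even race and made-up response rates for each candidate's supporters; the function name observed_share and every number in it are hypothetical, chosen only to show the mechanism.

```python
def observed_share(true_share, response_rate_a, response_rate_b):
    """Expected share of poll respondents who back candidate A.

    true_share: fraction of all voters who back A.
    response_rate_a, response_rate_b: fraction of A's and B's supporters
    who actually answer the poll (hypothetical values).
    """
    responders_a = true_share * response_rate_a
    responders_b = (1.0 - true_share) * response_rate_b
    return responders_a / (responders_a + responders_b)


# Assumed scenario: the race is tied, but A's supporters are slightly
# less willing to talk to pollsters than B's supporters.
equal = observed_share(0.50, 0.10, 0.10)
skewed = observed_share(0.50, 0.09, 0.10)
print(f"equal response rates:   {equal:.3f}")
print(f"unequal response rates: {skewed:.3f}")
```

With equal response rates the poll recovers the true 50%. With the unequal rates, candidate A appears to trail by several points in a tied race, and because the gap comes from who answers rather than how many answer, calling more people would not close it.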
And numbers don’t always reveal the whole picture. Qualitative data is often needed to flesh out bare-bones quantitative data and bring it to life. Big-city pollsters were too far removed to notice how Trump rallies were so much bigger and more spirited than Clinton rallies. His supporters came in passionate droves, packing arenas. She could barely muster crowds in the thousands. Mainstream media also failed to notice, or at least report, this leading indicator.
It seems pretty clear that most polls and pundits misread the electorate: Midwestern, rural, older, working-class, and predominantly white voters had a far greater impact than expected. Their anger at being left behind in the global economy, leap-frogged socially by historically disadvantaged groups, and contemptuously overlooked by an increasingly out-of-touch and entitled ruling elite explains both Trump’s victory and why it was such a surprise: neither the Republican nor the Democratic party, nor the media, has paid attention to them.
The lessons of Trump’s election upset for enthusiasts of Big Data are fairly clear too.
First, make sure your sample is representative. There is a huge difference between a convenience sample and a relevant sample. You can scrape a bunch of data off a social media site, but it probably won’t reflect the opinions of the underemployed Rust Belt. And by definition the silent majority’s voice won’t be heard until after the fact.
Second, make sure your measures are valid and reliable. Asking people directly usually gets you the answer they want you to hear, not necessarily what they really believe. There is a science to measurement, but constructing good measures is a painstaking and tedious process.
Third, and most importantly, data alone can’t tell the story. You need a coherent theory, both to guide data collection and to interpret the resulting statistical pattern. (You are probably thinking that this is just a variation on Immanuel Kant’s great line about concepts and percepts. It is!) The usual suspects have been scrambling to construct a new narrative, one about populist middle-American resentment and a movement to take back their country. Ignoring this groundswell explains why we didn’t see a Trump victory coming; but catching up to it is helping to explain, in hindsight, how it happened.
The answer is not to swing the pendulum toward “small data.” As my buddy and research methods genius, Ryne Sherman, pointed out, more data is usually better. But perhaps only up to a point. You need a big enough sample to yield reliable statistics, but you also need to know exactly who you are sampling and what you are measuring. And, of course, you need a pretty good idea about the questions you are asking and why they matter.
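A quick way to see the "up to a point" part is the textbook margin-of-error formula for a proportion. The Python sketch below assumes a simple random sample and a 95% confidence level (z = 1.96); those assumptions, and the function name margin_of_error, are illustrative and not drawn from any particular poll.

```python
import math


def margin_of_error(share, n, z=1.96):
    """Approximate margin of error for a proportion.

    Assumes a simple random sample of size n; z = 1.96 corresponds
    to a 95% confidence level.
    """
    return z * math.sqrt(share * (1.0 - share) / n)


# Each tenfold increase in sample size narrows the margin by only
# a factor of about 3.2 (the square root of 10).
for n in (100, 1_000, 10_000, 100_000):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:>7,}: +/- {100 * moe:.1f} points")
```

Going from 1,000 to 100,000 respondents tightens the margin from roughly 3 points to roughly 0.3, and that is only the random part of the error. A sample that misses the right people, or a question that measures the wrong thing, stays just as wrong at any size.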
Baking these conditions into any data analysis project can prevent a lot of the head-scratching we’ve seen over the last couple of days.