Think and Save the World

Why You Should Track Your Predictions and Score Your Calibration

5 min read

Philip Tetlock spent twenty years studying political forecasters and discovered something that initially seemed offensive to experts: across most domains, prediction accuracy does not correlate strongly with credentials, fame, or the confidence with which views are expressed. The people who appeared on television confidently predicting geopolitical events were, on average, barely better than random chance. The people who were meaningfully better than chance shared a different characteristic: they kept score.

This research, summarized in "Superforecasting," revealed a community of amateur forecasters participating in structured prediction tournaments who consistently outperformed professional intelligence analysts — people with access to classified information, significant expertise, and institutional resources. The amateurs won because they treated forecasting as a skill to be developed through feedback, rather than a performance to be evaluated by fluency and confidence.

The lesson is structural. You cannot improve at a skill without accurate feedback. You cannot get accurate feedback on predictions if you do not record them before outcomes are known. Most people do neither, which means most people's predictive ability is not improving over time regardless of how much experience they accumulate.

The Mechanics of a Prediction Log

A functional prediction log has four fields: the prediction itself (specific, falsifiable), the timeframe, the confidence level (expressed as a probability between 0% and 100%), and — added later — the outcome.
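
As a concrete sketch, here is what one entry might look like as a Python dataclass. The field names and types are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Prediction:
    """One entry in a prediction log. Field names are illustrative."""
    statement: str                  # specific and falsifiable
    resolve_by: date                # when the prediction can be judged
    confidence: float               # stated probability, 0.0 to 1.0
    outcome: Optional[bool] = None  # filled in only after resolution

# Recorded before the outcome is known:
p = Prediction(
    statement="GDP growth will be below 1.5% in Q3",
    resolve_by=date(2025, 10, 31),
    confidence=0.70,
)
```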

Specificity is non-negotiable. "I think the economy will slow down" is not a prediction — it is a hedge. A prediction is "I think GDP growth will be below 1.5% in Q3." One is falsifiable; the other can be reinterpreted to be correct regardless of what happens. Your brain will exploit vagueness to protect its sense of accuracy, so you must eliminate the vagueness in advance.

Probability assignments are where most people initially resist. They feel artificial, falsely precise. But the resistance is itself informative — it means you are accustomed to expressing beliefs in ways that are immune to being wrong. "I think X will probably happen" cannot be scored. "I think there is a 75% chance X will happen" can. The discomfort of assigning a number is the discomfort of accountability, and that discomfort is the whole point.

After recording predictions with probabilities, you can score them. The standard tool is the Brier score: the mean squared difference between your stated probability and the outcome (1 if the event occurred, 0 if it did not), where lower is better. Its conceptual equivalent is a calibration check: when your predictions resolve, ask whether the frequency of correct predictions matches your stated confidence levels — of the predictions you made at 70% confidence, did roughly 70% come true? Across a sample of 30 to 50 predictions, patterns emerge. Some people find they are systematically overconfident in predictions about other people's behavior. Others find that they hedge excessively on topics where they actually have strong signal. Both patterns reveal something about how the mind is processing information.
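
A minimal sketch of both checks, assuming (confidence, outcome) pairs pulled from a resolved log; the function names are my own:

```python
from collections import defaultdict

def brier_score(preds):
    """Mean squared error between stated probability and outcome (0/1).
    Lower is better; always guessing 50% scores 0.25."""
    return sum((p - float(o)) ** 2 for p, o in preds) / len(preds)

def calibration_table(preds, bucket_width=0.1):
    """Group predictions by stated confidence and compare each stated
    level to the observed frequency of correct resolution."""
    buckets = defaultdict(list)
    for p, o in preds:
        buckets[round(p / bucket_width) * bucket_width].append(o)
    for level in sorted(buckets):
        outcomes = buckets[level]
        hit_rate = sum(outcomes) / len(outcomes)
        print(f"stated {level:.0%}: observed {hit_rate:.0%} "
              f"across {len(outcomes)} predictions")

# (confidence, resolved_true) pairs from a resolved log:
resolved = [(0.9, True), (0.9, True), (0.9, False),
            (0.7, True), (0.7, False), (0.6, True)]
print(f"Brier score: {brier_score(resolved):.3f}")
calibration_table(resolved)
```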

The Psychology of Resistance

Several psychological mechanisms work against honest prediction-tracking.

Hindsight bias is the most powerful. Once an event occurs, the mind retroactively adjusts the probability it remembers assigning. The event that surprised you gets remembered as something you expected. The event you saw coming gets remembered as more confidently predicted than it was. The effect is so robust that it appears across cultures, age groups, and expertise levels. The only reliable defense is writing the prediction down before the event occurs.

Motivated reasoning is the second mechanism. When you have a stake in an outcome — financially, emotionally, reputationally — your predictions about it are corrupted before you make them. You are not evaluating evidence; you are constructing a case for the conclusion you prefer. Tracking exposes this over time, but it does not eliminate the bias in the moment. The useful practice is to flag predictions made under high motivated reasoning and hold them to a higher scoring standard.

The third mechanism is domain-specific overconfidence. Researchers have documented that the relationship between confidence and accuracy is often negative in domains where people consider themselves expert. The mechanism is that expertise creates fluency with a domain's concepts and vocabulary, which is experienced as competence but does not reliably translate to predictive accuracy about what the domain will do next. Political scientists are not reliably better than laypeople at predicting elections. Economists are not reliably better at predicting market movements. This is not an argument against expertise — it is an argument for keeping score even in areas where you feel most qualified.

What Calibration Actually Measures

Calibration is not primarily a measure of how smart you are. It is a measure of the accuracy of your uncertainty estimates. A well-calibrated person is one who knows what they know and knows what they do not know, and whose stated confidence tracks that difference.

Poorly calibrated overconfidence is dangerous because it generates action where hesitation would be appropriate. The investor who thinks the outcome is 90% certain when it is actually 55% certain will bet more than the odds justify, repeatedly, and will lose money even if individual bets occasionally pay off.
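
To make the cost concrete, the sketch below evaluates the expected log growth of a bankroll per even-money bet — the quantity the Kelly criterion maximizes. The framing is mine, not the article's, but the arithmetic illustrates the same point:

```python
import math

def expected_log_growth(f, p):
    """Expected log growth per even-money bet when staking a fraction f
    of the bankroll and the true win probability is p."""
    return p * math.log(1 + f) + (1 - p) * math.log(1 - f)

# Believing p = 0.90 tempts a large stake; the Kelly-optimal stake at
# an honest p = 0.55 is only 2p - 1 = 0.10 of the bankroll.
true_p = 0.55
print(expected_log_growth(0.80, true_p))  # ~ -0.40: bankroll shrinks every bet
print(expected_log_growth(0.10, true_p))  # ~ +0.005: slow but positive growth
```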

Poorly calibrated underconfidence is a different problem. People who chronically understate their confidence miss opportunities that their actual knowledge would justify taking. They defer to others who are no better informed but more assertive about it. Underconfidence is less commonly discussed because it looks like humility, but it is equally a miscalibration.

The Revision Loop

The point of prediction tracking is not to build an archive of your accuracy. It is to create the feedback loop that makes revision of your beliefs possible.

When you review your prediction log quarterly, you are not just seeing which predictions were right. You are looking for systematic patterns in the errors. Do you consistently overestimate how quickly projects will complete? Do you consistently underestimate resistance from a particular person or type of person? Do you consistently assign low probability to tail risks and get caught by them? Each pattern is a heuristic that needs revision.

The revision process looks like this: identify the pattern in the errors, hypothesize the belief or assumption that generated the pattern, and explicitly update that belief. Then watch whether subsequent predictions in that category improve. This is the scientific method applied to your own mind — hypothesis, observation, revision, test.
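
If each log entry carries a category tag, part of this pattern hunt can be mechanized. A sketch, assuming tagged (category, confidence, outcome) records; the tags are my own illustration:

```python
from collections import defaultdict

def error_by_category(records):
    """Average gap between stated confidence and outcome per category.
    A large positive gap flags systematic overconfidence in that domain;
    a large negative gap flags systematic hedging."""
    gaps = defaultdict(list)
    for category, confidence, outcome in records:
        gaps[category].append(confidence - float(outcome))
    return {cat: sum(g) / len(g) for cat, g in gaps.items()}

records = [
    ("project-timelines", 0.9, False),
    ("project-timelines", 0.8, False),
    ("hiring", 0.6, True),
    ("hiring", 0.7, True),
]
# {'project-timelines': 0.85, 'hiring': -0.35}: far too confident about
# timelines, too hedged about hiring -- two candidate beliefs to revise.
print(error_by_category(records))
```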

The people who become significantly better forecasters over time are not the people with the best initial calibration. They are the people who treat each error as information about their model of the world rather than as a random event to be forgotten. The record is the mechanism. Without it, experience accumulates without converting into skill.

Starting the Practice

A minimum viable prediction log is ten to fifteen predictions per month across the domains where your predictions actually matter — work, relationships, projects, finances, health. Do not curate the predictions to include only confident ones; the uncertain ones are where the most learning lives.

Review monthly for raw accuracy. Review quarterly for patterns. Review annually for the assumptions those patterns suggest need revising.

The practice compounds. People who track predictions for two years have a materially different relationship to uncertainty than people who do not. They hedge less reflexively, commit more deliberately, and revise their beliefs with more precision. These are not personality traits. They are skills built from feedback that most people never collect.
