4 reasons people don’t trust your backtest

Systematic traders, asset allocators, and other asset managers often use backtesting, in the form of models and screens, to build and analyze investment strategies.

Despite their popularity and usefulness in understanding market behavior, backtests are perceived by many in the finance world, and even (especially?) by those who build them, as a dark art lacking credibility. Those who use backtests professionally need others to trust their work, but, for good reason, this trust can be very difficult to establish.

Below, we discuss the key reasons why backtests are widely distrusted. In future articles, we will consider what can be done to fix this.

1. Biases sneak in

Biases sneak into backtests in various guises and are difficult to detect, even in one’s own work. Detecting them in others’ backtests can be next to impossible.

A few common examples are look-ahead bias, data-snooping bias, and survivorship bias.

Look-ahead bias

Look-ahead bias occurs when a backtest uses information that would not have been available at the time of the trade, creating unrealistic and overly optimistic results.

For example, a backtested strategy may use end-of-day prices to make intraday decisions, thus introducing information unavailable during the trading day. Another common error is sorting stocks based on P/E ratio in January while using earnings figures that only become available in February.
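The cure for the P/E example is to lag fundamentals by their release dates. Here is a minimal pandas sketch of that point-in-time discipline; the tiny dataset, column names, and report dates are hypothetical, invented purely for illustration:

```python
import pandas as pd

# Hypothetical monthly panel: one row per (date, ticker). The Q4
# earnings figures are not reported until February 15.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-31"] * 2 + ["2024-02-29"] * 2),
    "ticker": ["AAA", "BBB", "AAA", "BBB"],
    "price": [100.0, 50.0, 105.0, 48.0],
    "eps": [5.0, 2.0, 5.0, 2.0],
    "eps_report_date": pd.to_datetime(["2024-02-15"] * 4),
})

# Look-ahead bias: January rows use EPS that was reported on February 15.
df["pe_biased"] = df["price"] / df["eps"]

# Point-in-time: only use EPS once its report date has passed.
eps_known = df["eps"].where(df["date"] >= df["eps_report_date"])
df["pe_point_in_time"] = df["price"] / eps_known

print(df[["date", "ticker", "pe_biased", "pe_point_in_time"]])
```

In January, the point-in-time P/E is missing, as it should be: any ranking on that column must wait until the earnings are actually public.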

Many even subtler forms of look-ahead bias are also common.

Knowledge from the future can inflate a backtest’s apparent predictive power, and that inflation will not carry over to real-world use, where future knowledge is naturally unavailable. Imagine a backtest forecasting economic growth for 2021 that accidentally uses 2020 GDP figures, figures that were only released in early 2021, after the forecast would have been made. Looking at the output alone, how can one know whether information released in 2021 improved the performance of a model supposedly run in 2020?

Data-snooping and overfitting biases

Data-snooping bias occurs when multiple hypotheses or strategies are tested on the same data set, increasing the likelihood of finding, by pure chance, a pattern that appears statistically significant.

A diligent analyst may examine hundreds or thousands of variations of various trading rules, looking for a strategy that yields an edge. Because she runs so many tests, the resulting strategy may appear to work, even though it has no predictive power.

From the perspective of a backtest consumer, it matters critically whether an analyst ran 5 simulations or 5,000,000 to find a high-Sharpe backtest. But how can one tell whether the predictive screen being evaluated came from 5 attempts or 5,000,000?
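A minimal simulation shows how large the effect can be. Every “strategy” below is pure noise with no edge whatsoever; the counts and return parameters are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_strategies = 252, 5_000

# Daily returns for many candidate strategies: pure noise, zero true edge.
returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))

# Annualized Sharpe ratio of each candidate.
sharpes = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)

print(f"Best of the first 5 candidates: Sharpe {sharpes[:5].max():.2f}")
print(f"Best of all {n_strategies} candidates: Sharpe {sharpes.max():.2f}")
```

The best of a handful of noise strategies looks unremarkable, while the best of thousands routinely shows an annualized Sharpe above 3, found by chance alone.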

A close cousin of data-snooping bias is overfitting bias, in which a machine learning model is fit to idiosyncrasies of a historical dataset that are irrelevant to forecasting.

Survivorship bias

Survivorship bias occurs when only successful entities that have survived until the present are included in an analysis while omitting those that failed. This bias can skew results and present an overly optimistic view of performance. For example, a backtest of mutual funds might only include funds that are still active today, ignoring those that closed due to poor performance.
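A quick simulation makes the distortion concrete. The return distribution and the rule that money-losing funds close and disappear from the database are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_funds, n_years = 1_000, 10

# Hypothetical annual returns for a universe of mutual funds.
annual = rng.normal(loc=0.05, scale=0.15, size=(n_funds, n_years))
cumulative = (1 + annual).prod(axis=1) - 1

# Assume funds that lost money over the decade closed, so only
# the winners remain visible in today's database.
survived = cumulative > 0

print(f"Full universe mean annual return:  {annual.mean():.2%}")
print(f"Survivors-only mean annual return: {annual[survived].mean():.2%}")
```

Backtesting against only the survivors overstates the average return of the universe an investor could actually have bought.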

2. Inherent complexity

A credible backtest requires a sophisticated analyst and robust, highly developed infrastructure. All aspects of the simulation and the underlying data must be correct for a backtest to yield accurate results.

A typical backtest involves thousands or millions of data points and exponentially more calculations. These calculations use data that is often hard to obtain, poorly documented, and of unclear origin.

For example, finding the top 20 companies by market capitalization in each of the ten largest industries today is challenging but doable for many. Calculating this list accurately for every day in the past ten years is a challenge of a different order of magnitude.
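The selection logic itself is only a few lines; the hard part is the point-in-time data feeding it. The sketch below runs the daily ranking over a randomly generated universe, precisely because assembling correct historical market caps and industry classifications is the real challenge:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic point-in-time panel: one row per (date, ticker), with the
# industry label and market cap as they were known on that date.
dates = pd.bdate_range("2024-01-02", periods=3)
tickers = [f"T{i:03d}" for i in range(300)]
industry = {t: f"IND{rng.integers(0, 15):02d}" for t in tickers}
universe = pd.DataFrame(
    [(d, t, industry[t]) for d in dates for t in tickers],
    columns=["date", "ticker", "industry"],
)
universe["market_cap"] = rng.lognormal(mean=9.0, sigma=1.0, size=len(universe))

def daily_top_20(day: pd.DataFrame) -> pd.DataFrame:
    """Top 20 companies by market cap in each of the 10 largest industries."""
    top_inds = day.groupby("industry")["market_cap"].sum().nlargest(10).index
    return (
        day[day["industry"].isin(top_inds)]
        .sort_values("market_cap", ascending=False)
        .groupby("industry")
        .head(20)
    )

selection = universe.groupby("date", group_keys=False).apply(daily_top_20)
print(selection.groupby("date").size())
```

Running this over ten years of real data requires that every market cap and every industry label be correct as of each historical date, across delistings, reclassifications, and restatements. That is where most of the work, and most of the errors, live.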

3. Lack of standardization

There are almost as many ways to run a backtest as there are possible backtests. While some core principles are shared, each person or company may use their own methodology.

As a result, two analysts might develop backtests for similar trading strategies whose outputs look comparable but are not. One analyst uses a simple moving average, while another uses an exponential one. One analyst calculates implied volatilities using Black-Scholes, while another uses a local volatility model. One analyst uses minute-by-minute prices, while another uses tick data. Etc.
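Even the first of these choices, simple versus exponential averaging, can flip trading signals. Below is a toy comparison on a random price path, where the 50-day window and the “price above its average” rule are arbitrary illustrations:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical daily closing prices: a random walk.
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))

# Analyst A uses a simple 50-day average; analyst B, an exponential one.
sma = prices.rolling(50).mean()
ema = prices.ewm(span=50).mean()

# The same "price above its 50-day average" rule yields different signals.
signal_a = prices > sma
signal_b = prices > ema
valid = sma.notna()  # compare only where both averages are defined
disagree = (signal_a[valid] != signal_b[valid]).mean()
print(f"The two analysts disagree on {disagree:.1%} of days")
```

Two backtests of the “same” strategy can thus hold different positions on a meaningful fraction of days, and their summary statistics will diverge accordingly.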

These differences make interpreting another person’s backtest a difficult challenge. Problems with data or calculations hide in unlikely places, and a full audit is generally impractical, if not impossible.

4. Motivated reasoning

The human tendency for motivated reasoning amplifies each of the above difficulties. A strong backtest can help land a job interview or improve an investor presentation. A poor backtest result often just means there’s more work to do.

Analysts want to find strong results and are rewarded for doing so. Methodological problems in backtests that yield desirable conclusions may not receive the same scrutiny as those that yield poor results. The motivation for finding a positive result is powerful.

Conclusion

An old allocator saw goes, “I’ve never seen a bad backtest.” But if all backtests look good, doesn’t that make them fundamentally meaningless from an allocator’s perspective?

Can anything be done to salvage a backtest’s external credibility? In a future post, I’ll cover a few approaches to making backtests more credible.
