From Fragility to Robustness: The Value of Ensembles

A Case Study in Robust Equity Momentum

Google dictionary defines the word robust as follows:

  • sturdy in construction
  • able to withstand or overcome adverse conditions

… and offers the following definitions for the word fragile:

  • easily broken or damaged
  • flimsy or insubstantial; easily destroyed
  • not strong or sturdy; delicate and vulnerable

How can an investment model be “sturdy in construction” and “able to withstand or overcome adverse conditions”? How might we tell when an investment model is “easily broken or damaged” or “delicate and vulnerable”?

Why does it matter?

In this brief case study we will explore the concept of fragility using a slimmed-down version of the Newfound/ReSolve Robust Equity Momentum Index (NRROMOT), which rotates between regional equity indexes and bonds based on trend and momentum indicators. We use a slimmed-down version for computational tractability, since we will be performing a large number of simulations.

Supervised Human Learning

It is useful to think about the construction of systematic investment strategies as a machine learning process. For the purpose of this article a human (yours truly) will perform much of the analysis that would be performed by machines, but the process is the same.

Specifically, the process we will follow in this article is akin to supervised machine learning because we are attempting to train a model to deliver on a specific objective. We want to predict which markets will produce the highest returns in the next period so that we can compound our wealth at the highest rate with manageable losses.

A model requires explanatory variables that are used to inform predictions. Consistent with the NRROMOT index, we will use measures of trend and momentum to predict the optimal asset to hold for each period.

Trend and momentum are close cousins. Trend measures the direction of an asset’s movement, up or down, while momentum compares the strength of trends between assets. We use the following trend/momentum oriented explanatory variables to fit our model:

  • Time-series momentum (TS) with lookbacks from 30 to 300 days in 15-day increments (30, 45, 60, …, 300 days).
  • Price relative to moving average (PMA) with the same lookbacks of 30 to 300 days in 15-day increments.
  • Short-term moving average relative to long-term moving average (DMA) with short/long lookback pairs of 8/30, 11/45, 15/60, 19/75, 22/90, 26/105, 30/120, 34/135, 38/150, 41/165, 45/180, 49/195, 52/210, 56/225, 60/240, 64/255, 68/270, 71/285, 75/300 days.
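The article does not spell out the precise NRROMOT formulas, but a minimal sketch of the three signal families, using our own simplified sign-based definitions (not the official index specification), might look like this:

```python
import numpy as np

def ts_momentum(prices, lookback):
    """Time-series momentum: sign of the total return over the lookback window."""
    return np.sign(prices[-1] / prices[-1 - lookback] - 1.0)

def pma(prices, lookback):
    """Price relative to moving average: sign of last price minus the lookback-day mean."""
    return np.sign(prices[-1] - prices[-lookback:].mean())

def dma(prices, short, long):
    """Dual moving average: sign of the short-window mean minus the long-window mean."""
    return np.sign(prices[-short:].mean() - prices[-long:].mean())

# Example: a steadily rising price series should read as an uptrend on all three signals.
prices = np.linspace(100.0, 130.0, 301)
signals = [ts_momentum(prices, 120), pma(prices, 120), dma(prices, 30, 120)]
```

Each function returns +1 for an uptrend and -1 for a downtrend; varying the lookback arguments over the grids above yields the 57 specifications tested below.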

The idea behind NRROMOT is that we want to own the regional equity index with the highest momentum so long as global equities are in a positive trend. When global equities are in a negative trend, we will own either short or intermediate-term Treasuries based on which of these has the strongest trend. Figure 1 describes the basic logic to determine the optimal holding at each rebalance.

Figure 1: Strategy Decision Tree

Source: Newfound Research. For illustrative purposes only
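Our reading of the Figure 1 decision logic can be sketched as follows. This is a hypothetical helper with made-up asset labels, not the official index rules; signals are assumed to be +1/-1 trend readings and momentum a per-asset score:

```python
def select_holding(signals, momentum):
    """Pick the asset to hold at a rebalance, per the Figure 1 decision tree (our reading).

    signals:  dict of trend signals (+1 uptrend / -1 downtrend) per asset
    momentum: dict of momentum scores for the regional equity indexes
    """
    if signals["global_equities"] > 0:
        # Risk-on: hold the regional equity index with the strongest momentum.
        return max(("us_equities", "foreign_equities"), key=momentum.get)
    # Risk-off: hold whichever Treasury sleeve shows the stronger trend.
    if signals["treasury_7_10"] >= signals["treasury_1_3"]:
        return "treasury_7_10"
    return "treasury_1_3"
```

For example, with global equities in a positive trend and US equities showing the strongest momentum, the function returns the US equity sleeve; with global equities in a downtrend it falls back to the stronger-trending Treasury sleeve.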

We have daily total return data for US equities (S&P 500), foreign equities (EAFE), global equities (ACWI), 7-10 year Treasury bonds, and 1-3 year Treasury bonds back to about 1990. Thus, allowing for priming periods our simulations will start in 1992. Table 1 summarizes the performance of the individual assets over our test horizon.

Table 1: Performance summary for constituent asset classes, 1992 to 2019.

                        1-3 Year      7-10 Year     Global        US            Foreign
                        Treasuries    Treasuries    Equities      Equities      Equities
Start Date              Jan 03, 1992  Jan 03, 1992  Jan 03, 1992  Jan 03, 1992  Jan 03, 1992
Annualized Return       2.67%         6.14%         6.11%         10.05%        3.63%
Annualized Volatility   2.00%         6.60%         17.20%        17.60%        17.80%
Sharpe Ratio            0.46          0.67          0.33          0.53          0.19
Max Drawdown            -7.20%        -11.40%       -59.00%       -55.50%       -63.60%

Source: Data from Bloomberg and CSI Data. Data extensions available upon request at author’s discretion.

Bias / Variance Tradeoff

The purpose of a model is to make predictions with minimal error. But the concept of error is not well understood by finance practitioners.

Folks in data science describe error in terms of bias and variance. Models with high bias tend to be simple and generalize well out of sample, but they may leave some explanatory power on the table (i.e. middling backtest but live results are more likely to resemble simulated results). Models with high variance are more tightly coupled with the training data. They are highly explanatory in-sample but may not generalize well on unseen data (i.e. great backtest, poor live results).

To boil it down, data scientists understand that model engineering requires a tradeoff between model complexity (we want to explain as much of the effect as possible) and model robustness (we want a model that works well on data that may differ slightly from what it was trained on).

Consider a junior analyst attempting to use our trend and momentum features to engineer a trading model. In our experience, less experienced quants will start out by testing the performance of each individual strategy over the full sample period. Figure 2 plots the compound annual growth rates (CAGR) for strategies specified on each of our 57 trend definitions, traded weekly in five tranches[1].

Figure 2: Ordered compound annual growth rates of individual strategy specifications.

Source: Data from Bloomberg and CSI. Analysis by ReSolve Asset Management.

In scrutinizing the results in Figure 2, our junior analyst – and many experienced financial engineers! – might be tempted to conclude that the dma_38,150 indicator is the optimal predictor. After all, this indicator produced the highest returns of all the features tested.

However, had our junior analyst been trained in data science rather than financial engineering, he would realize that his analysis contains a serious flaw: he has specified his model based entirely on in-sample performance. In other words, he has chosen the model that best fit the data based on what actually happened in the past. But we have no idea whether the chosen model is likely to be optimal when applied to data that the model hasn’t seen yet.

Walk-forward analysis

At the heart of this issue is the question of whether the best performing model in the past will go on to be the best performing model in the future. While we obviously can’t know how markets will unfold in the future, we can imagine having to decide on an optimal model to trade at times in the past, and observe how those decisions would have played out in subsequent years.

One systematic way to explore this approach – “walk-forward” analysis – follows this process:

  1. Run simulations for all strategies over the full sample period
  2. At each rebalance, use all of the available returns for each strategy up to that date to identify the top-performing strategies
  3. In the following period, allocate only to the strategies with the best performance up to that date

In essence, at each point in time we are going to make a decision about which models are “optimal” based on all available data up to that date, and then we will hold those models in the next period.
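The walk-forward loop can be sketched as follows. This is a hypothetical helper, not ReSolve’s code: for simplicity it ranks strategies by cumulative return to date, but any of the selection metrics discussed in this section (Sharpe, return/Ulcer, and so on) could be swapped in:

```python
import numpy as np

def walk_forward(strategy_returns, top_n=10, rebalance_every=5):
    """Walk-forward selection sketch.

    strategy_returns: (T, N) array of per-period returns for N candidate strategies.
    At each rebalance, rank strategies by cumulative return to date and hold an
    equal-weight basket of the top_n performers until the next rebalance.
    """
    T, N = strategy_returns.shape
    portfolio = np.zeros(T)
    held = np.arange(N)  # before any history exists, hold everything equally
    for t in range(T):
        if t > 0 and t % rebalance_every == 0:
            cum = (1.0 + strategy_returns[:t]).prod(axis=0)
            held = np.argsort(cum)[-top_n:]  # indices of best performers to date
        portfolio[t] = strategy_returns[t, held].mean()
    return portfolio

# Toy check: one strategy dominates, so after the first rebalance the
# walk-forward portfolio should hold it exclusively.
toy = np.zeros((10, 3))
toy[:, 0] = 0.01
path = walk_forward(toy, top_n=1, rebalance_every=2)
```

The key design point is that the ranking at time t uses only returns observed before t, so no look-ahead information leaks into the selection.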

This prompts the question of how we should judge which models are optimal. It is common (among junior quants anyway) to choose the strategy with the highest returns. More experienced quants might choose strategies with the highest risk-adjusted returns, measured by Sharpe ratio for example. More sophisticated analysts might seek the portfolio of strategies that maximized the meta-strategy Sharpe ratio. We also employed an approach that bootstrapped the returns up to each rebalance date; we found the max Sharpe optimal strategy weights for each bootstrap sample over a five-year holding horizon and averaged the weights.

Before we discuss the results of our walk-forward analysis, however, we should decide how we might judge performance. What is an unbiased, neutral benchmark against which we can determine whether our walk-forward approach is effective?

The most neutral model that we can construct from our features is one that gives each feature equal weight. We’ll call this strategy the ‘ensemble’ model. If our dynamic methods are able to select certain specifications that materially outperform our ensemble on a walk-forward basis, this would indicate that there is some persistence in top performing specifications. If not, we can reject the theory that specifications that happened to work best on the in-sample period are more likely to produce better results in the future.

Let’s examine the results from our walk-forward tests, described in Figure 3.

Figure 3. Performance of walk-forward simulations.

                        Ensemble      WF Top 10     WF Top 10     WF Top 10     WF            WF Max
                                      CAGR          Sharpe        Return/Ulcer  Combo         Sharpe
Start Date              Jan 03, 1992  Jan 03, 1992  Jan 03, 1992  Jan 03, 1992  Jan 03, 1992  Jan 03, 1992
Annualized Return       11.36%        11.34%        11.26%        10.78%        10.97%        11.15%
Sharpe Ratio            0.90          0.84          0.84          0.80          0.82          0.87
Annualized Volatility   10.60%        11.50%        11.50%        11.50%        11.40%        10.80%
Max Drawdown            -14.90%       -18.20%       -17.30%       -16.60%       -17.90%       -16.40%
Positive Rolling Yrs    89.30%        88.40%        89.40%        87.70%        87.60%        90.70%
Growth of $100          2,017.19      2,007.15      1,966.89      1,744.09      1,831.39      1,914.01

(WF = walk-forward)

Source: Data from Bloomberg and CSI. Analysis by ReSolve Asset Management.

It doesn’t appear as though the walk-forward methods add any value in excess of the naive ensemble. The historical performance of individual strategy specifications does not seem to provide sufficient information to choose a subset of “optimal” models that would be expected to outperform our naive equal-weight ensemble. As a result, choosing a single model based on its in-sample explanatory power exhibits low bias error but high variance error, since the choice of model does not generalize to out-of-sample data.

Jitter Resampling

The only way to determine the optimal bias/variance tradeoff for model selection is to evaluate models on data that they haven’t seen before.

Walk-forward testing is useful because it recreates the analyst’s choices at each point in the past. However, its drawback is that we don’t use all of the data for our evaluation; rather, we use only the data available up to each point in time.

Another way to examine the robustness of model specifications is to run the models on brand new data. This is not a trivial exercise as the new data needs to preserve the characteristics of the original data that our models were trained on while introducing enough randomness to tease out potential model fragility.

We propose a method we call “jitter resampling”, which creates new data by subtly shuffling the order of returns in the local area around each daily data point. This approach preserves the mean and volatility, as well as the trend behaviour of each market, which is central to our trend equity thesis.

Specifically, for each daily return we replace the return at time t with a sample return drawn from returns at t-3 through t+3. There is a 40 percent probability that we sample the same return; a 30 percent probability that we replace with the return at t±1; a 20 percent probability that we replace with the return at t±2; and a 10 percent probability that we replace with the return at t±3. We perform this resampling by row so that the cross-sectional relationships between assets are preserved in each sample.
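A minimal sketch of this resampling scheme follows. One assumption on our part: the article states the total probability for each offset magnitude, so we split each of the t±1, t±2, and t±3 probabilities evenly between the plus and minus directions, and we clamp offsets at the edges of the sample:

```python
import numpy as np

def jitter_resample(returns, rng):
    """Jitter-resample a (T, N) daily return matrix (rows = days, columns = assets).

    For each day t, draw an offset in {-3, ..., +3} and take the entire row at
    t + offset, preserving cross-sectional relationships between assets.
    Offset probabilities: 40% stay at t; 30%, 20%, 10% for |1|, |2|, |3|
    (each split evenly between the +/- directions -- our assumption).
    """
    offsets = np.array([-3, -2, -1, 0, 1, 2, 3])
    probs = np.array([0.05, 0.10, 0.15, 0.40, 0.15, 0.10, 0.05])
    T = returns.shape[0]
    draw = rng.choice(offsets, size=T, p=probs)
    idx = np.clip(np.arange(T) + draw, 0, T - 1)  # clamp at the sample edges
    return returns[idx]

# Example: one synthetic universe of 1000 days across 5 assets.
rng = np.random.default_rng(0)
sample = jitter_resample(np.random.default_rng(1).normal(size=(1000, 5)), rng)
```

Because whole rows are moved together, the cross-sectional correlation structure of each day survives the shuffle, while the local reordering perturbs the exact path of each market.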

We created a thousand synthetic data sets for our asset class universe and re-ran our simulations for all model specifications on each synthetic universe. This produced a thousand simulations for each model specification, where all specifications were tested on the same data set in each sample.

We were specifically interested in the distribution of terminal wealth across models, since this is a useful proxy for how robust a model is to small changes in sequences of returns. For each model in each sample we found the percent rank of terminal wealth over the full investment horizon (i.e. we found the rank of each model’s terminal wealth relative to all other models on that sample, and then standardized to a value between zero and 100). Figure 4 plots the distribution of model ranks for the top models of each type in the original sample alongside the ensemble.
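The percent-rank standardization described above is straightforward to reproduce; a minimal sketch (our own helper, not ReSolve’s code):

```python
import numpy as np

def percent_ranks(terminal_wealth):
    """Rank each model's terminal wealth within its sample, scaled to 0-100.

    terminal_wealth: (S, M) array of terminal wealth for M models across S samples.
    """
    # Double argsort yields each element's rank (0..M-1) within its row.
    order = terminal_wealth.argsort(axis=1).argsort(axis=1)
    M = terminal_wealth.shape[1]
    return 100.0 * order / (M - 1)

# Example: three models in one sample rank at 0, 50, and 100.
ranks = percent_ranks(np.array([[1.0, 2.0, 3.0]]))
```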

Figure 4: Distribution of percent ranks of terminal wealth for select strategies relative to all other strategies across one thousand samples.

Source: Data from Bloomberg and CSI. Analysis by ReSolve Asset Management.

We selected for examination the models that had produced the best performance in the original test, and compared the distribution of outcomes for these models against the distribution of outcomes for the ensemble. The performance of top in-sample models exhibited extremely wide rank dispersion on the out-of-sample data, suggesting high variance error.

On the other hand, the ensemble model produced returns above the median (dashed line) on average. Of greater importance, the ensemble produced a much tighter distribution of relative terminal wealth suggesting low variance error. There were no extremely negative outcomes.

Conclusion

In creating investment models (or any models, for that matter), investors must seek out models that are most likely to deliver the performance they need given the unknowable sequence of returns they will experience once the model goes live with real funds.

Our case study revealed that, while certain models delivered materially better outcomes on the in-sample data used to evaluate the models in hindsight, our walk-forward analysis showed that this relative performance could not be used to select model specifications that are more likely to perform well out of sample.

Our jitter resampling analysis subjected each model to slightly modified data that preserved the distribution and trending character of the original returns. Consistent with the true objectives of investors, we evaluated how effectively each model navigated these changes in the data by measuring the dispersion of terminal wealth. Strategies that performed best on the in-sample data struggled with the out-of-sample data, delivering a wide range of outcomes.

In contrast, the ensemble strategy consistently produced better-than-average results and presented a more manageable probability of material adverse outcomes.

[1] We rebalance 1/5th of the portfolio every day to approximate the effect of running five strategies in parallel, each rebalanced on a different day of the week. For the purpose of this case study we want to isolate the impact of parameter-specification choices relative to the ensemble.