Part 2: Evidence Based Investing is Dead. Long Live Evidence Based Investing!

Note: this is Part Two of a two-part article series. Please see Part One here.

Michael Edesess’ article, The Trend that is Ruining Finance Research, makes the case that financial research is flawed. In this two-part article series, we examine the points that Edesess raised in some detail. His arguments have some merit. Importantly, however, his article fails to undermine the value of finance research in general. Rather, his points serve to highlight that finance is a real profession, requiring skills, education, and experience that differentiate professionals from laymen.

Edesess’ case against evidence based investing rests on three general assertions. First, there is a very real issue with using a static t-statistic threshold when the number of independent tests becomes very large. Second, financial research is often conducted on a universe of securities that includes a large number of micro-cap and nano-cap stocks; these stocks often do not trade regularly, exhibit large overnight jumps in price, and are illiquid and costly to trade. Third, the regression models used in most financial research are poorly calibrated to draw conclusions from non-stationary financial data with large outliers.

This article will tackle the “p-hacking” issue in finance, and propose a framework to help those who embrace evidence based investing to make judicious decisions based on a more thoughtful interpretation of finance research.

P-Hacking and scaling significance tests

When Fama, French, Jegadeesh, et al. published the first factor models in the early 1990s, it was reasonable to reject the null hypothesis (no effect) with an observed t-statistic of 2. After all, the computational power and data at the time did not afford very much in the way of data mining. Moreover, these early researchers were careful to derive their models very thoughtfully from first principles, lending economic credence to their results.

However, as Cam Harvey has so assiduously noted, the t-statistic threshold required to signal statistical significance must rise over time to reflect the growing number of independent tests. Based on several different approaches to the problem, he suggests that current finance research should exceed a t-statistic of at least 3 to be considered significant. If the results are derived explicitly through data mining, or through multivariate tests, the threshold should be closer to 4, while results derived from first principles, based on economic or behavioral conjecture and with a properly structured hypothesis test, may be considered significant at thresholds somewhat below 3.
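To see why the hurdle rises with the number of tests, consider a simple family-wise error adjustment. The sketch below uses a Bonferroni-style correction under a normal approximation; it is a minimal illustration of the scaling effect rather than Harvey’s exact methodology (he and his co-authors also consider less conservative corrections such as Holm and Benjamini-Hochberg-Yekutieli), and the test counts shown are purely illustrative.

```python
# Minimal sketch: how the two-sided t-statistic hurdle grows with the number
# of independent tests under a Bonferroni-style family-wise error correction.
# Illustrative only; not Harvey's exact procedure, and the test counts are
# hypothetical.

from scipy.stats import norm

def bonferroni_t_threshold(num_tests, alpha=0.05):
    """Two-sided t-stat hurdle that holds the family-wise error rate at alpha
    across num_tests independent tests (normal approximation)."""
    adjusted_alpha = alpha / num_tests
    return norm.ppf(1 - adjusted_alpha / 2)

for n in [1, 10, 100, 500, 1000]:
    print(f"{n:>5} tests -> t-stat hurdle ~ {bonferroni_t_threshold(n):.2f}")

# Approximate output: 1 -> 1.96, 10 -> 2.81, 100 -> 3.48, 500 -> 3.89, 1000 -> 4.06
```

With even a few hundred implicit tests, the hurdle moves from roughly 2 toward 4, which is the intuition behind Harvey’s higher thresholds.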

Cam Harvey’s recommendations make tremendous sense. The empirical finance community, like so many other academic communities such as medicine and psychology, is guilty of propagating “magical thinking” for the sake of selling journal subscriptions and advertising. With few exceptions, journals only publish papers with interesting and significant findings. As a result, the true number of significance tests run in finance likely vastly exceeds the number of published journal articles.

A lack of reproducibility

Finance professionals should be concerned that researchers are performing more and more tests each year, while journals report only a fraction of the tests that are performed. These issues are amplified by the fact that many papers are never independently verified. Where researchers do attempt to verify results and find errors, journals simply publish corrections that are often buried at the bottom of issues many months or years later. This is unsatisfactory.

The Merriam-Webster dictionary defines “profession” as “a calling requiring specialized knowledge and often long and intensive academic preparation”. Implicit in this definition is the idea that professionals are responsible for understanding and validating the research they use to inform their recommendations to clients. But far too few advisers, even those of the “evidence based” variety, take the time to thoroughly investigate the papers they rely on to form client portfolios. Far fewer have the skills, resources, or inclination to independently validate the strategies they endorse.

My team has identified several major errors in research published in some of the most prestigious finance periodicals. In August we questioned the results of a paper on volatility published in one of the most popular practitioner journals. The author, shaken and contrite, confirmed that he had miscalculated the effect and overstated the results by more than a factor of two. I genuinely believe the author did his best to present the facts, but errors happen. That’s why it is incumbent on practitioners to verify results before making allocations with other people’s money.

Out-of-sample confirmation

Cam Harvey and his co-authors are not alone in their desire to bring statistical rigor to the financial research process. Many respected practitioners share their concerns and apply similar methods in their own practices.

One way for practitioners to gain greater confidence in prospective factors is through out-of-sample testing. Fortunately, there is an abundance of out-of-sample analysis validating the most robust factors. One obvious out-of-sample test involves applying the factor to a brand new universe: if a method worked on U.S. stocks, it should also work in international stock markets. In addition, if a factor was identified in 1993, then tests over the 20-year period from 1994 to 2013 are also considered out of sample. One might also ‘perturb’ a factor’s specification to test for robustness, say by changing the definition of ‘value’ from price-to-book to price-to-cash-flow or price-to-earnings.
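To make the perturbation idea concrete, the sketch below applies the same long-short construction to several alternative value definitions and compares the resulting spreads. It is a hedged sketch of the robustness check, not a reference implementation of any published study: the `monthly_data` DataFrame and the column names (“bp”, “cfp”, “ep”, “month”, “next_month_return”) are hypothetical placeholders for whatever data a practitioner has on hand.

```python
# Sketch of a specification-perturbation test: identical decile construction
# applied to several alternative 'value' definitions. All column names and
# the monthly_data DataFrame are hypothetical placeholders.

import pandas as pd

VALUE_DEFINITIONS = {
    "book_to_price": "bp",
    "cashflow_to_price": "cfp",
    "earnings_to_price": "ep",
}

def long_short_spread(df, signal_col, ret_col="next_month_return", q=10):
    """Average next-month return of the top signal decile minus the bottom
    decile, computed month by month."""
    def one_month(month_df):
        codes = pd.qcut(month_df[signal_col], q, labels=False, duplicates="drop")
        deciles = pd.Series(codes, index=month_df.index)
        top = month_df.loc[deciles == deciles.max(), ret_col].mean()
        bottom = month_df.loc[deciles == deciles.min(), ret_col].mean()
        return top - bottom
    return df.groupby("month").apply(one_month)

def robustness_report(monthly_data):
    """Annualized Sharpe ratio of the long-short value spread under each
    alternative definition."""
    for label, col in VALUE_DEFINITIONS.items():
        spread = long_short_spread(monthly_data, col)
        sharpe = spread.mean() / spread.std() * (12 ** 0.5)
        print(f"{label:>20}: annualized Sharpe ~ {sharpe:.2f}")
```

If the spread’s risk-adjusted return survives across all reasonable definitions, and on universes and periods the original authors never touched, a practitioner can place far more weight on the result.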

In “Finding Smart Beta in the Factor Zoo”, Jason Hsu and Vitali Kalesnik at Research Affiliates performed tests of the value, momentum, low beta, quality and size factors on stocks across U.S. and international markets. For tests on U.S. markets they used data back to 1967, while international tests were run from 1987. Recall that the size, value and momentum factors were first documented in the early 1990s, and the low beta anomaly was first catalogued by Haugen in the mid-1970s. In addition, all factors were first identified using exclusively U.S. stocks. As such, by testing in international markets over the period 1987-2013 their analysis was legitimately ‘out of sample’. That is, they tested on out-of-sample universes, and over a 26-year horizon in which 20 years were out of sample in time. Results in international markets were consistent with the results of the seminal papers.

In addition, Hsu and Kalesnik tested using different definitions of the factors. For example, they tested ‘value’ as defined by dividends-to-price, cash-flow-to-price, and earnings-to-price as well as the original book-to-price metric. They also varied the lookback horizons and skip-months for momentum, and tested both beta and volatility for the low-beta factor, again with different lookback horizons. As you can see from Figure 1, the value, momentum and low beta factors all proved robust to alternative definitions.

Figure 1. Value, low beta and momentum factors prove robust to alternative specifications

Source: Research Affiliates

Clearly Jason Hsu and his colleagues at Research Affiliates take seriously the concerns raised by Cam Harvey, and have taken steps to increase the empirical rigour of their solutions, but they are not alone in their quest. The principals at AQR, notably Cliff Asness and colleagues, performed their own analysis of the value and momentum factors across both a universe of global stocks and a universe of global asset class indexes. Their tests span the period 1972-2011, so about 40% of their analysis period is out of sample in time. Of course, about half of their global stock universe, and the entire global asset class universe, is also out of sample for the entire period. Their results are summarized in Figure 2 below.

Figure 2. Statistical significance of value and momentum factors across global stocks and asset classes, 1972-2011

Source: Asness, Moskowitz and Pedersen, “Value and Momentum Everywhere”

Note the statistical significance, highlighted in green, of the risk-adjusted excess returns from the value and momentum factors in global stocks (top) and global asset classes (bottom). This analysis validates the persistence of the value and momentum factors across a largely out-of-sample data set. Even better, the t-statistics exceed the higher thresholds proposed by Cam Harvey, and tests on the asset class universe clear the higher hurdles with substantial margin to spare (full disclosure: ReSolve investment solutions rely largely on the asset class momentum and low beta factors).

A professional way forward

Edesess asserts that, in the absence of reliable research, investment professionals should make life-changing decisions for clients based on “common sense”. But common sense is just a narrow data sample, one’s own experience, filtered through an often imperfect, emotionally charged, heavily biased cognitive prism. Further, there is no mention of “common sense” in the dictionary definition of the word “professional”, nor in any practical definition. The fact is, to call ourselves professionals, investment practitioners must make judicious decisions based on finance research. There are many reasons why this may be challenging, but the alternative is unacceptable.

Success in empirical finance requires a mosaic of experience, mental models, data, humility, and a fundamental understanding of how decisions are made in markets. For example, my framework considers that investors are corrupted by the following forces when making decisions in uncertain markets:

  • incentives
  • agency issues
  • behavioral biases (prospect theory, herding, overreaction, underreaction)
  • non-wealth-maximizing preferences (e.g. lottery preferences, leverage aversion, home bias)
  • structural challenges (e.g. siloed decision making, regulation, compliance, information diffusion)

A dense body of literature in behavioral finance, and my own experience with clients, advisors, and investment managers, supports the view that these forces drive investors to make decisions that are not purely in the interest of their own wealth. These inefficiencies manifest in investable sources of excess return for those investors with the capacity to take the other side of the trade. As I seek to interpret the empirical literature, and innovate in pursuit of sustainable premia, there must be a clear connection between these forces and the premia under investigation.

Evidence based investment professionals should also have a healthy understanding of, and respect for, complex adaptive systems. Even where an investor is satisfied that an effect is rooted in the factors above, and economically significant, she must be honest with herself about whether there are sustainable barriers to arbitrage that would allow it to persist. A solid risk-based explanation is a wide moat that suggests an effect should persist. As Michael Edesess asserts, the equity risk premium is solidly rooted in risk. The volatility risk premium is also obviously rooted in risk, as is the duration premium. Some other commonly cited risk premia have plausible risk explanations, but might also be explained by behavioral biases or alternative preferences.

Most investors think about risk in terms of loss, but I would argue that tracking error, regulatory risk, liability risk, career risk, and other types of risk play an integral role in investor decision making. Most investors find it very difficult to underperform their home market index, or to miss out on bubble-like investments, for any length of time. Regulators impose constraints on leverage and concentration. The new fiduciary standards may subject advisers to liability for making recommendations that deviate from other “prudent investors”. Institutional investors face career risk from recommending investments that may underperform in the short term. These forces result in non-wealth-maximizing decision making, and they are real risks that manifest in persistent anomalies.

For example, equity mutual fund managers are typically incentivized based on the assets in their funds and on investment performance relative to their benchmark. Benchmark-centric performance metrics such as the Information Ratio penalize managers for tracking error. Yet outperformance necessarily requires managers to take bets that differ from the index.
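To make the incentive concrete, the Information Ratio divides active return by tracking error, so any bet away from the index is penalized through the denominator unless it pays off in the numerator. The sketch below is a minimal illustration with hypothetical monthly return series; the function name and inputs are placeholders, not a reference to any particular fund’s methodology.

```python
# Minimal sketch of the benchmark-relative math: the information ratio (IR)
# scales annualized active return by annualized tracking error. The return
# series are hypothetical monthly returns.

import numpy as np

def information_ratio(portfolio_returns, benchmark_returns, periods_per_year=12):
    """Return (IR, annualized tracking error) for two aligned return series."""
    active = np.asarray(portfolio_returns) - np.asarray(benchmark_returns)
    tracking_error = active.std(ddof=1) * np.sqrt(periods_per_year)
    active_return = active.mean() * periods_per_year
    return active_return / tracking_error, tracking_error

# A manager earning 1% annualized active return with 5% tracking error has an
# IR of 0.2, while a closet indexer earning 0.5% with 1% tracking error has an
# IR of 0.5; the metric rewards hugging the benchmark unless active bets pay
# off handsomely.
```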

If there were no leverage constraints, a manager could overlay diversified beta exposure to complement their active bets. But regulatory constraints prevent 40-Act mutual funds from taking on leverage, except in certain narrow circumstances. In practice, this leads managers to place concentrated bets on certain stocks with large active risk.

To balance this risk, managers often lower portfolio tracking error by investing in a basket of high-beta stocks with their remaining capital. Thus, due to regulatory constraints and incentive structure, mutual fund managers place a premium on high-beta stocks that is independent of expected returns. This lowers the ex-ante expected return on these stocks, and is a strong candidate for the source of the low-beta anomaly. A few other commonly cited alternative premia have equally valid explanations rooted in similar forces. [Please see discussion of multi-asset strategies below, as an example of strong barriers to arbitrage].

The framework above is not perfect. It is an organic concept that evolves over time with my own experience in markets. I invite you to append your own belief systems to make it your own. But it is a way forward. Ultimately, our goal as a profession should be that all advisers have the “specialized knowledge and long and intensive academic preparation” to deliver informed, robust advice to clients. “Common sense” is a necessary, but profoundly insufficient, foundation for a professional code of conduct. Investors deserve better.

Multi-asset strategies: A case study in structural barriers to arbitrage

The framework above presents a compelling case for multi-asset strategies. Multi-asset anomalies arise from the same forces as securities-based anomalies, are even more economically significant, and may have larger barriers to arbitrage. Most institutional portfolios are structured along asset-class silos, where each silo is charged with seeking alpha within its own narrow sandbox. As a result, the equity team at one institution competes with very little friction against the equity teams at every other institution.

However, there are large barriers to arbitrage at the multi-asset level. The asset allocation is set by a policy committee, and guided by long-term capital market expectations. Institutions rarely take on material active risk at this level of the portfolio. Some institutions are bound by rigid actuarial rules, and are able to tolerate very little deviation from policy weights. Committee level decisions are typically slow, incremental, and reactive. And peer-oriented compensation schemes heavily and asymmetrically penalize short-term tracking error relative to long-term alpha generation. These are large and persistent barriers to arbitrage that suggest multi-asset anomalies like trend and carry have a long shelf-life.

Moreover, for all the reasons stated above, investors have been slow to embrace active multi-asset strategies. According to the Blackrock research department (which I queried on this very subject last year), active multi-asset strategies like GTAA, managed futures, risk parity, global macro, and slower-moving asset allocation strategies in managed accounts account for just 13% of global liquid market capitalization. This stands in stark contrast to the proportion of active management in stocks and bonds, where 65% and 87% of those markets, respectively, are run under active mandates. And this does not count assets tracking active indexes, such as “smart-beta” ETFs.

Figure 3. Proportion of actively managed assets by mandate

Source: Blackrock

This gap does not exist for lack of evidence. As shown in Figure 4 below, global multi-asset carry and trend strategies exhibit historical Sharpe ratios roughly twice those observed in historical tests of traditional equity-based factors like cross-sectional momentum and value. Admittedly, some multi-asset factor strategies have struggled in the current central-bank-dominated cycle, alongside many traditional equity factors. Stock momentum and value have had a very rough decade indeed.

It’s noteworthy that the period 1932-1942 was also a very difficult decade for most systematic strategies, as central banks were also active during that period, distorting the natural price-discovery process. More generally, factor-based investors should expect long periods of “famine”; if factor investors feasted every night, the feast would quickly dwindle to a thin gruel, as the arbitrage would be risk-free.

Figure 4. Sharpe ratios: global equities vs. global asset classes

Sources: Value and Momentum data from Asness, Moskowitz & Pedersen “Value and Momentum Everywhere” (2013). Carry (dividend) equity factor is for U.S. only from Ken French database (long top decile value-weighted, short bottom decile value-weighted for stocks in top 30% by market capitalization). Carry factor is from Koijen et al., “Carry” (2013). Defensive factor from Frazzini & Pedersen, “Betting Against Beta” (2014). Equity trend data from “The Enduring Effect of Time-Series Momentum on Stock Returns over nearly 100-Years” by D’Souza et al. (2015). Multi-asset trend data from Hurst, Ooi, and Pedersen, “A Century of Evidence on Trend-Following Investing” (2017).