Sunday, July 7, 2019

Correlation screening - A way to quickly sort factor relationships


Correlation screening has been formally developed over the last ten years as a means of finding, from a large set of candidates, the factors that matter most for prediction. It is an effective tool for data mining.

The investment problem that needs to be solved is very simple. Assuming there are a large number of fundamental macro factors that could possibly explain equity returns, how do you sort through this set to find the ones that are most important? A high-level description of correlation screening is that it measures the correlation of each candidate factor with the variable to be predicted across the entire dataset and picks those factors whose correlation is above a defined threshold. Many may view this as atheoretical science, but we are taking a more open-minded view. For example, we may know that there is a relationship between asset returns and growth, but we don't know which specific pieces of information are most useful as a proxy for growth expectations. More specifically, assume that the PMI index is a good proxy for growth. Correlation screening could help with finding the best predictive expression of PMI from a broad set of choices.
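As a rough illustration of the threshold idea described above, here is a minimal sketch in plain Python. The factor names (pmi_level, pmi_change, credit_spread), the toy numbers, and the 0.5 cutoff are all made up for the example, and the absolute Pearson correlation is only one possible screening statistic. It is not the method of any particular firm or paper.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / math.sqrt(var_x * var_y)


def screen(factors, target, threshold):
    """Keep factors whose absolute correlation with the target exceeds
    the threshold, ordered from strongest to weakest."""
    scores = {name: pearson(series, target) for name, series in factors.items()}
    kept = {name: rho for name, rho in scores.items() if abs(rho) > threshold}
    return dict(sorted(kept.items(), key=lambda item: -abs(item[1])))


# Toy data, invented for illustration only.
equity_returns = [0.02, -0.01, 0.03, -0.02, 0.01, 0.00]
factors = {
    "pmi_level": [1.0, -0.5, 1.5, -1.0, 0.5, 0.0],
    "pmi_change": [0.1, 0.3, -0.2, 0.0, 0.2, -0.1],
    "credit_spread": [-2, 1, -3, 2, -1, 1],
}

selected = screen(factors, equity_returns, threshold=0.5)
for name, rho in selected.items():
    print(f"{name}: {rho:+.2f}")
```

The screen uses the absolute value so that strongly negative relationships survive as well as positive ones. In this toy set, pmi_change falls below the cutoff and is dropped.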


An interesting financial application of correlation screening is the active model for equity risk premium timing developed by Hull Trading. The firm has written on this topic and used correlation screening as a means of teasing out the factors that may have the best predictive power for stock forecasts. It was used to test a large number of predictor variations across varying time periods and predictive horizons. (See "A Practitioner’s Defense of Return Predictability" by Blair Hull and Xiao Qiao.) The chart below shows the factors picked through time.



Can this be viewed as data snooping? The approach gives a structured way of picking variables without any theory, and it can be abused if employed without further analysis and out-of-sample testing.

In the simplest case, the researcher chooses a broad group of predictive factors and sets a correlation threshold as a screen for testing. The researcher then focuses attention on the variables whose correlation exceeds that threshold. The advantage of correlation screening is that it can look at a large number of specifications of variables that are expected to be predictive. The screening can also cover a large number of alternative predictive horizons and do this work in real time. The approach, often used with machine learning, is a simple, practical tool for quick analysis. Call it pre-testing before deeper analysis.
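To make the idea of screening across predictive horizons concrete, the sketch below correlates each factor observed at time t with the summed return over the following h periods and keeps the (factor, horizon) pairs that clear the threshold. The factor names, the toy series, the horizons, and the 0.5 cutoff are hypothetical, and summing simple returns is an assumption made to keep the example short.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / math.sqrt(var_x * var_y)


def forward_returns(returns, horizon):
    """Sum of the next `horizon` returns for every t that has a full window."""
    return [sum(returns[t + 1 : t + 1 + horizon])
            for t in range(len(returns) - horizon)]


def screen_horizons(factors, returns, horizons, threshold):
    """Return (factor, horizon, correlation) triples that pass the screen,
    strongest first."""
    hits = []
    for name, series in factors.items():
        for h in horizons:
            fwd = forward_returns(returns, h)
            # Align the factor at time t with the return that follows it.
            rho = pearson(series[: len(fwd)], fwd)
            if abs(rho) > threshold:
                hits.append((name, h, rho))
    return sorted(hits, key=lambda hit: -abs(hit[2]))


# Toy data, invented for illustration: "leading_factor" is built so that
# next period's return is 0.01 times its current value.
returns = [0.0, 0.01, -0.01, 0.02, -0.02, 0.01, 0.0, -0.01]
factors = {
    "leading_factor": [1, -1, 2, -2, 1, 0, -1, 2],
    "unrelated_factor": [1, 1, -1, -1, 1, 1, -1, -1],
}

hits = screen_horizons(factors, returns, horizons=[1, 2], threshold=0.5)
for name, h, rho in hits:
    print(f"{name} at horizon {h}: {rho:+.2f}")
```

Because every extra specification or horizon is another chance to find a spurious correlation, the pairs that pass a screen like this are candidates for deeper analysis and out-of-sample testing, not finished signals.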
