Thursday, February 18, 2021

Data preprocessing - Getting the foundations right is important for the return factory


A production process, or factory, is an apt analogy for how a hedge fund generates returns. This is especially true of quant firms, where there are well-defined, repetitive tasks. The quant return factory differs from the discretionary, artisan stock-picker. Using this analogy requires investors to dive into the production line, starting with data preprocessing. A trader thinking like a factory manager should focus on key data bottlenecks and on the inference engine that drives decisions. For the investor, differentiating managers requires thinking through the factory production line.

See our past writing on the factory narrative:

For the production process, good inputs are critical to good output. If the quality of the data is poor, the level of model sophistication employed does not matter. If the preprocessing of data takes too long or is not done correctly, there will be a "garbage in" problem.

Some of the key components of the production process that should be considered:

Data collection - 

What is the source of the data? Is there an alternative source that provides a check on prices and helps scrub bad data? Most would be surprised that even price data have problems.
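One way to picture the cross-check is a small routine that compares a primary price feed against an alternative source and flags dates where they disagree. This is only a minimal sketch: the function name, the 0.5% tolerance, and the sample prices are all hypothetical, and a real scrubbing process would need rules for what to do with each flagged date.

```python
def flag_price_discrepancies(primary, secondary, tolerance=0.005):
    """Compare two {date: price} mappings and flag suspect dates.

    A date is flagged as "missing" if only one source has it, and as
    "mismatch" if the relative difference exceeds `tolerance`.
    The tolerance is an illustrative assumption, not a recommended value.
    """
    flagged = []
    for day in sorted(set(primary) | set(secondary)):
        p = primary.get(day)
        s = secondary.get(day)
        if p is None or s is None:
            flagged.append((day, "missing"))
            continue
        scale = max(abs(p), abs(s))
        if scale > 0 and abs(p - s) / scale > tolerance:
            flagged.append((day, "mismatch"))
    return flagged


# Hypothetical sample: the last primary price has a misplaced decimal point.
primary = {"2021-02-16": 100.0, "2021-02-17": 101.2, "2021-02-18": 10.15}
secondary = {"2021-02-16": 100.0, "2021-02-17": 101.2, "2021-02-18": 101.5}
suspect = flag_price_discrepancies(primary, secondary)
```

Flagging is the easy part. Deciding which source is right, and whether to replace or drop the observation, still requires review.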

For fundamental data, there are always exceptions and outliers that have to be reviewed, addressed, and, where necessary, replaced. For macro managers, economic revisions and the collection of announcement data are problems that can be overcome with effort.

The tilt toward new data sources requires a new scrubbing process because checking these data can be difficult. Additionally, there has to be an assessment as to whether the new data are orthogonal to, or unique relative to, existing data.
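A rough first screen for orthogonality is to measure how strongly a candidate series correlates with each series already in use. The sketch below assumes equal-length, aligned series; the signal names and numbers are hypothetical. Pairwise correlation will not detect a candidate that is a combination of several existing series, so a low reading is suggestive rather than conclusive.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(vx * vy)


def closest_existing_series(candidate, existing):
    """Return (abs correlation, name) of the existing series most like the candidate."""
    return max((abs(pearson(candidate, s)), name) for name, s in existing.items())


# Hypothetical existing signals and a candidate that merely rescales one of them.
existing = {"momentum": [1, 2, 3, 4, 5], "carry": [2, 1, 2, 1, 2]}
candidate = [2, 4, 6, 8, 10]
overlap, nearest = closest_existing_series(candidate, existing)
```

Here the candidate is a rescaled copy of the hypothetical momentum series, so it would add nothing new despite arriving from a different source.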

Data preprocessing - 

What form of the data should be used for analysis? Take the simple case of market volatility. What is the right number for volatility? There are historical measures with different look-back periods, implied volatility, and calculated estimates that may use open, high, low, and close prices. Choices have to be made, and numbers calculated and stored. Decisions also have to be made on whether some calculations are made and stored before being fed into models.
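To make the volatility choice concrete, here is a minimal sketch of two historical estimators: the standard deviation of close-to-close log returns, and the Parkinson estimator, which uses the daily high-low range. The function names, the 252-day annualization factor, and the sample prices are assumptions for illustration; implied volatility is not covered because it comes from option prices rather than the price history.

```python
import math


def close_to_close_vol(closes, lookback, periods_per_year=252):
    """Annualized sample standard deviation of the last `lookback` log returns."""
    if lookback < 2 or len(closes) < lookback + 1:
        raise ValueError("need lookback >= 2 and at least lookback + 1 closes")
    window = closes[-(lookback + 1):]
    rets = [math.log(b / a) for a, b in zip(window, window[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return math.sqrt(var * periods_per_year)


def parkinson_vol(highs, lows, lookback, periods_per_year=252):
    """Annualized Parkinson estimator: mean of ln(H/L)^2 divided by 4 ln 2."""
    if lookback < 1 or len(highs) < lookback or len(lows) < lookback:
        raise ValueError("need at least `lookback` highs and lows")
    sq_ranges = [math.log(h / l) ** 2
                 for h, l in zip(highs[-lookback:], lows[-lookback:])]
    var = sum(sq_ranges) / (4 * len(sq_ranges) * math.log(2))
    return math.sqrt(var * periods_per_year)


# Hypothetical daily bars; the same history yields several "volatility" numbers.
closes = [100.0, 101.0, 100.5, 102.0, 101.0, 103.0, 102.5]
highs = [100.8, 101.6, 101.2, 102.4, 102.1, 103.3, 103.1]
lows = [99.5, 100.2, 100.1, 100.6, 100.7, 101.1, 102.0]
estimates = {
    "c2c_3d": close_to_close_vol(closes, 3),
    "c2c_6d": close_to_close_vol(closes, 6),
    "parkinson_6d": parkinson_vol(highs, lows, 6),
}
```

The three entries in `estimates` differ even though they describe the same asset over overlapping windows, which is why the choice of estimator and look-back has to be made deliberately and the results stored consistently.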

Data categorization  -

Data have to be categorized. For the macro category, there is the issue of using government data collected with some delay versus survey data, which may have less delay. Assets have different sensitivities to different data, so the data may have to be categorized accordingly. Some data arrive on set, pre-announced dates and can be incorporated directly into a process, while other data may come as a surprise because the release was not pre-announced.
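The categorization above could be captured by tagging each observation with the period it describes, the date it became public, and whether the release was scheduled. The sketch below is one hypothetical layout, with invented series names and dates; it shows how such tags let a process use only what was actually known on a given decision date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    series: str             # hypothetical series identifier
    reference_period: date  # end of the period the number describes
    release_date: date      # date the number became public
    scheduled: bool         # True if the release date was pre-announced


def release_lag_days(obs):
    """Days between the end of the reference period and publication."""
    return (obs.release_date - obs.reference_period).days


def available_as_of(observations, decision_date):
    """Keep only observations already published on the decision date."""
    return [o for o in observations if o.release_date <= decision_date]


# Invented examples: a delayed government figure, a faster survey,
# and an unscheduled announcement.
observations = [
    Observation("gov_output", date(2020, 12, 31), date(2021, 1, 28), True),
    Observation("business_survey", date(2021, 1, 31), date(2021, 2, 1), True),
    Observation("policy_announcement", date(2021, 2, 10), date(2021, 2, 10), False),
]
known = available_as_of(observations, date(2021, 2, 5))
lags = {o.series: release_lag_days(o) for o in observations}
```

Filtering by release date rather than reference period keeps a backtest from using numbers that had not yet been published.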

Production process - 

All data have to be structured so that they enter models at the right time, through a process that produces timely output matching the decision time period.

There are many ways of managing a production process. Both manager and investor should review it on a periodic basis to ensure that data management and analysis are done efficiently.


