Thursday, December 30, 2021

Overfitting, underfitting, and model building


Overfitting produces a model that follows a particular data set too closely and thus fails to fit new data or reliably predict observations outside the initial or training set. You will feel good about the fit on the overfitted training set; you will be disappointed when testing out of sample.

Overfitting can be thought of as fitting the model to noise, while underfitting is failing to fit the model to the signal. A prediction from an overfitted model will reproduce the noise; an underfitted model will just generate something close to the mean.

Overfitting: Training: good vs. Test: bad

Underfitting: Training: bad vs. Test: bad

One should expect more shrinkage, that is, a larger gap between training and test results, for an overfitted model.
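The training/test contrast above can be sketched in a few lines. This is a minimal illustration, not a recipe: the sine-plus-noise data set, the sample sizes, and the polynomial degrees are all made-up assumptions chosen to make the gap visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a sine signal plus noise (purely illustrative).
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 20)
x_test = np.sort(rng.uniform(0, 1, 20))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 20)

def mse(deg):
    """Fit a degree-`deg` polynomial to the training set; return train and test MSE."""
    coefs = np.polyfit(x_train, y_train, deg)
    train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

for deg in (1, 3, 15):
    train, test = mse(deg)
    print(f"degree {deg:2d}: train MSE {train:.3f}, test MSE {test:.3f}")
```

The low-degree fit is "training bad, test bad" (underfit); the degree-15 fit drives training error toward zero while test error balloons (overfit); the gap between the two columns is the shrinkage the paragraph above describes.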


Underfitting: the model is missing parameters that are important for explaining a relationship or making a prediction. Underfitting can also take the form of an inappropriate specification. For example, a linear model will always underfit a non-linear relationship.
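The linear-model example is easy to see numerically. The quadratic relationship below is an assumed toy case: no straight line can remove the curvature, while adding the squared term fits it exactly.

```python
import numpy as np

# Assumed toy relationship: y = x^2, with no noise at all.
x = np.linspace(-1, 1, 50)
y = x ** 2

# A straight line (degree 1) cannot capture the curvature, whatever its slope.
line = np.polyfit(x, y, 1)
resid_linear = np.mean((np.polyval(line, x) - y) ** 2)

# Including the squared term (degree 2) recovers the relationship exactly.
quad = np.polyfit(x, y, 2)
resid_quad = np.mean((np.polyval(quad, x) - y) ** 2)

print(f"linear residual MSE:    {resid_linear:.4f}")
print(f"quadratic residual MSE: {resid_quad:.2e}")
```

Note that the linear model's error here is not noise; the data are noiseless. It is pure specification error, which no amount of additional data will cure.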

Training error will decrease as more features are added, which is good, but as with many things, too much of a good thing has adverse consequences. Validation error should also decline with more features, but there is a limit to this improvement. If validation error starts to increase while training error continues to decline, the model is overfitting.
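That divergence between training and validation error can be simulated directly. In this made-up setup, only the first two of thirty candidate features actually drive the outcome; the rest are pure noise, so piling them in can only help the training fit while hurting validation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: y depends on only the first 2 of 30 candidate features.
n_train, n_val, n_total = 40, 200, 30
X_train = rng.normal(size=(n_train, n_total))
X_val = rng.normal(size=(n_val, n_total))
beta = np.zeros(n_total)
beta[:2] = [2.0, -1.0]
y_train = X_train @ beta + rng.normal(0, 1, n_train)
y_val = X_val @ beta + rng.normal(0, 1, n_val)

def errors(k):
    """OLS using the first k features; return (train MSE, validation MSE)."""
    coef, *_ = np.linalg.lstsq(X_train[:, :k], y_train, rcond=None)
    tr = np.mean((X_train[:, :k] @ coef - y_train) ** 2)
    va = np.mean((X_val[:, :k] @ coef - y_val) ** 2)
    return tr, va

for k in (1, 2, 10, 30):
    tr, va = errors(k)
    print(f"{k:2d} features: train MSE {tr:.2f}, validation MSE {va:.2f}")
```

Training error falls with every feature added (for nested least squares it cannot rise), while validation error bottoms out near the true model size and then climbs: the overfitting signature described above.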

The modeler should always keep in the back of their mind the trade-off graph between complexity and error. With more complexity, training error goes down, but test error will be higher. For simple models, training error is higher, but test error may be lower. The same trade-off can be shown in a bias-variance graph.





The simple trend-follower is willing to suffer some underfitting rather than chase a model that looks good in back-testing (an overfit) by adding a large number of features. The trend model may do worse in training yet have better success out of sample.
