Handling outliers in the dataset
Suppose we have a lot of outliers in a multiple linear regression (with many features), and dropping every observation that contains an outlier or a missing value reduces the number of observations significantly, say by 30%. Won't this cause the model to underfit? Is there another way to deal with outliers without reducing the number of observations?
Thank you for your question!
There are indeed alternative statistical approaches to dealing with outliers and missing values. One option is to fill in (impute) the values yourself, for example with the mean or median of the feature, instead of dropping the rows. You can find a great discussion of this problem in another one of our courses, Data Preprocessing with NumPy, in the Substituting Missing Values in Ndarrays section.
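To make the idea concrete, here is a minimal sketch of one way to keep every observation. It is not the course's exact method: the function name `impute_and_clip`, the percentile cut-offs, and the toy array are all assumptions chosen for illustration. Missing values are replaced with the column median, and extreme values are then capped at per-column percentiles (often called winsorizing) rather than dropped.

```python
import numpy as np

def impute_and_clip(X, lower_pct=1.0, upper_pct=99.0):
    """Hypothetical helper: fill NaNs with column medians, then cap
    each column at its lower/upper percentiles. No rows are removed."""
    X = np.asarray(X, dtype=float).copy()

    # Column medians computed while ignoring missing values.
    medians = np.nanmedian(X, axis=0)

    # Replace each NaN with the median of its own column.
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]

    # Per-column percentile bounds, then cap values outside them.
    lo = np.percentile(X, lower_pct, axis=0)
    hi = np.percentile(X, upper_pct, axis=0)
    return np.clip(X, lo, hi)

# Toy data (made up): one missing value per column and one extreme value.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 12.0],
    [np.nan, 11.0],
    [100.0, 13.0],
])

cleaned = impute_and_clip(X, lower_pct=10, upper_pct=90)
print(cleaned.shape)  # same number of rows as X
```

Whether capping is appropriate depends on why the outliers are there. If they are genuine observations rather than errors, a transformation of the feature or a robust regression method may be a better fit than changing the values.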