Colab_section_5_outliers.ipynb

Question

I am curious about the function below. In the notebook, the z-score upper and lower values seem to be hardcoded, so I changed it as shown below. Does this make sense?

import scipy.stats

def z_score_outliers(array,
                     z_score_lower,
                     z_score_upper):

    z_scores = scipy.stats.zscore(array)
    outliers = (z_scores > z_score_upper) | (z_scores < z_score_lower)

    return array[outliers]

Also, could you please explain:

## Create a training-test set
X = df[features]
X_train = X[:4000]
X_test = X[1000:]

My X has only 2,746 rows, which would mean the test data set is just a subset of the training data set. In the last course I learnt that the test and training data sets should be completely isolated to avoid leakage. Are there different rules when using IsolationForest?

Thanks

Answer 1

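Yes, your modified function makes sense. You have made the z-score upper and lower bounds parameters of the function instead of hardcoding them, which makes it more flexible and reusable. Keep in mind that z-scores are computed from the mean and standard deviation, so the method implicitly assumes roughly normal, symmetric data, which might not always be the case. For skewed data, symmetric bounds such as -3 and 3 might not be appropriate, and having separate lower and upper bounds lets you adjust for that.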
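As a rough illustration of how the parameterized function could be called, here is a minimal sketch. The default bounds of -3 and 3, the np.asarray conversion, and the sample data are assumptions added for the example; they are not taken from the notebook.

```python
import numpy as np
import scipy.stats


def z_score_outliers(array, z_score_lower=-3.0, z_score_upper=3.0):
    """Return the elements whose z-score falls outside [lower, upper].

    The default bounds are an assumption for this sketch.
    """
    array = np.asarray(array, dtype=float)
    z_scores = scipy.stats.zscore(array)
    outliers = (z_scores > z_score_upper) | (z_scores < z_score_lower)
    return array[outliers]


# Hypothetical data: twenty values near 10 plus one extreme value.
data = np.array([10.0] * 10 + [11.0] * 10 + [100.0])

# Only the extreme value has a z-score above the upper bound.
found = z_score_outliers(data, z_score_lower=-3.0, z_score_upper=3.0)
print(found)
```

Passing different values for z_score_lower and z_score_upper is how you would use asymmetric thresholds on skewed data.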

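You are right about the slices: with 2,746 rows, X[:4000] selects every row and X[1000:] selects rows 1000 onward, so the test set is a subset of the training set. IsolationForest is unsupervised (it does not use labels), so it is generally less sensitive to the training-test split than other models. The rule is not different, though: if you want to see how the model behaves on unseen data, the two sets should not overlap. In this case, you can modify the slice sizes of the train and test data so that they fit your data.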
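A minimal sketch of a non-overlapping split might look like the following. The random array stands in for df[features], and the cut-off of 2000 rows is an assumed value chosen only so that it fits 2,746 rows; neither comes from the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical stand-in for df[features]: 2,746 rows, 3 numeric columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(2746, 3))

# Split by position so the two sets share no rows.
split = 2000  # assumed cut-off; any value below len(X) works
X_train = X[:split]
X_test = X[split:]

# Fit on the training rows only.
model = IsolationForest(random_state=0)
model.fit(X_train)

# predict returns 1 for inliers and -1 for outliers.
labels = model.predict(X_test)
print(len(X_train), len(X_test))
```

With a pandas DataFrame, the same positional slices (X[:split] and X[split:]) select rows in the same way.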
