I am curious about the function below. In the notebook, the z-score upper/lower values seem to be hardcoded, so I changed it as below. Does this make sense?
z_scores = scipy.stats.zscore(array)
outliers = (z_scores > z_score_upper) | (z_scores < z_score_lower)
Also, could you please explain:
## Create a training-test set
X = df[features]
X_train = X[:4000]
X_test = X[1000:]
My X has only 2,746 rows, which would mean the test data set is just a subset of the training data set. In the last course I learnt that the test and training data sets should be completely isolated to avoid leakage. Are there different rules when using IsolationForest?
Yes, your modified function makes sense. You have turned the z-score upper and lower bounds into parameters that are passed to the function instead of being hardcoded, which makes the function more flexible and reusable. Keep in mind that z-scores are computed from the mean and standard deviation, so the thresholds are most meaningful when the data is roughly symmetric. For skewed data you may want different magnitudes for the upper and lower bounds, which your version now allows.
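As a minimal sketch of the parameterized version: the function name `find_outliers`, the default bounds of -3 and 3, and the sample values are my own assumptions for illustration, not taken from the notebook.

```python
import numpy as np
from scipy import stats


def find_outliers(array, z_score_lower=-3.0, z_score_upper=3.0):
    """Return a boolean mask that is True where the z-score is outside the bounds.

    The name and the default bounds are illustrative assumptions.
    """
    z_scores = stats.zscore(array)
    return (z_scores > z_score_upper) | (z_scores < z_score_lower)


# Made-up sample data with one obviously large value at the end.
values = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 11.5, 8.5, 10.0, 50.0])

# The bounds are passed in by the caller instead of being hardcoded.
mask = find_outliers(values, z_score_lower=-2.0, z_score_upper=2.0)
print(values[mask])
```

Note that on this small sample the large value has a z-score just under 3, so the default bounds would not flag it. A single extreme value inflates the standard deviation, which is one reason to keep the bounds adjustable.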
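IsolationForest is unsupervised (it is fit without labels), so it is generally less sensitive to the training-test split than supervised models. You are right about the slices, though: with 2,746 rows, X[:4000] selects every row and X[1000:] selects rows 1000 onward, so the test set is a subset of the training set. Scoring rows the model was fit on does not tell you how it behaves on unseen data, so in this case you can modify the size of the train and test data so that the two slices do not overlap.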
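Here is a small sketch of the overlap and of one way to resize the split. The synthetic data, the 80/20 split point, and `random_state=0` are assumptions for illustration. Only the row count (2,746) and the original slice bounds come from the question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n_rows = 2746  # row count from the question
X = rng.normal(size=(n_rows, 3))  # synthetic stand-in for df[features]

# Original slices: count the row indices that land in both train and test.
row_ids = np.arange(n_rows)
overlap = np.intersect1d(row_ids[:4000], row_ids[1000:]).size
print("rows shared by train and test:", overlap)

# One possible non-overlapping split (80/20 is an arbitrary choice).
split_at = int(0.8 * n_rows)
X_train = X[:split_at]
X_test = X[split_at:]

# Fit on the training rows only, then predict on rows the model has not seen.
model = IsolationForest(random_state=0).fit(X_train)
pred = model.predict(X_test)  # +1 for inliers, -1 for outliers
print("test rows flagged as outliers:", int((pred == -1).sum()))
```

Slicing by position keeps the original row order. If the rows are not ordered in any meaningful way, a shuffled split such as scikit-learn's `train_test_split` would be a reasonable alternative.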