Last answered:

13 Nov 2023

Posted on:

01 Nov 2023

Colab_section_5_outliers.ipynb

I am curious about the function below. In the notebook, the z-score upper/lower values seem to be hardcoded, so I changed it as shown below. Does this make sense?


import scipy.stats

def z_score_outliers(array,
                     z_score_lower,
                     z_score_upper):

    # Standardize the values, then flag anything outside the given bounds
    z_scores = scipy.stats.zscore(array)
    outliers = (z_scores > z_score_upper) | (z_scores < z_score_lower)

    return array[outliers]


Also, could you please explain this part:


## Create a training-test set
X = df[features]
X_train = X[:4000]
X_test = X[1000:]


My X has only 2,746 rows, which would mean the test data set is just a subset of the training data set. In the last course I learnt that the test and training data sets should be completely isolated to avoid leakage. Are there different rules when using IsolationForest?


Thanks 

1 answer (0 marked as helpful)
Posted on:

13 Nov 2023

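Yes, your modified function makes sense. You have made the z-score upper and lower bounds parameters that can be passed to the function rather than hardcoding them, which makes the function more flexible and reusable. Keep in mind that z-scores are computed from the mean and standard deviation, so fixed cut-offs are most meaningful when the data is roughly symmetric and bell-shaped, which might not always be the case. For skewed data, your version at least lets you choose different lower and upper bounds.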
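As a quick illustration of how the parameterized function behaves, here is a minimal sketch. The data is made up for the example (it is not from the course notebook), and the thresholds of 3 and 2 are arbitrary choices.

```python
import numpy as np
import scipy.stats


def z_score_outliers(array, z_score_lower, z_score_upper):
    """Return the elements whose z-score lies outside the given bounds."""
    z_scores = scipy.stats.zscore(array)
    outliers = (z_scores > z_score_upper) | (z_scores < z_score_lower)
    return array[outliers]


# Made-up data: twenty values near 10 plus one extreme value.
data = np.array([10.0] * 10 + [11.0] * 10 + [100.0])

# Symmetric bounds, as in the hardcoded version.
symmetric = z_score_outliers(data, z_score_lower=-3, z_score_upper=3)

# Asymmetric bounds: ignore low values, flag only unusually high ones.
high_only = z_score_outliers(data, z_score_lower=-np.inf, z_score_upper=2)

print(symmetric)
print(high_only)
```

In this made-up data only the value 100 is far enough from the mean to be flagged by either call.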

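You are right about the overlap: with only 2,746 rows, X[:4000] selects every row and X[1000:] selects rows 1000 onward, so the test set is a subset of the training set. IsolationForest is unsupervised (it is fitted without labels), so it is generally less sensitive to the training-test split than supervised models. Still, if you want to see how the model scores data it has not seen, you can modify the slice boundaries so that the train and test sets fit the size of your data and do not overlap.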
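A non-overlapping split could look like the sketch below. It uses a random NumPy array as a stand-in for df[features], and the cut-off of 2000 is a hypothetical value chosen only for illustration. Any index below the number of rows would work.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for df[features]: 2,746 rows and 2 hypothetical feature columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(2746, 2))

# The original slices overlap because slicing past the end is clamped.
n_train_original = len(X[:4000])   # all 2,746 rows
n_test_original = len(X[1000:])    # rows 1000 onward, all also in the train set

# Non-overlapping alternative: one cut-off index used for both slices.
split = 2000  # hypothetical cut-off, adjust to your data
X_train = X[:split]
X_test = X[split:]

model = IsolationForest(random_state=0)
model.fit(X_train)

# predict returns 1 for inliers and -1 for outliers.
pred = model.predict(X_test)
print(len(X_train), len(X_test), (pred == -1).sum())
```

If the rows are ordered in some meaningful way (for example by time), a plain slice keeps that order. Shuffling first, for instance with sklearn.model_selection.train_test_split, is another option.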
