I have two questions:
- Splitting the data. In the tutorial, the data was split first and then preprocessed separately for the train and test sets. Why not preprocess the data first and then split?
- In the section ‘Preprocessing Discrete Variables: Automating Calculations’, regarding the code below:
df_inputs_prepr['home_ownership:RENT_OTHER_NONE_ANY'] = sum([df_inputs_prepr['home_ownership:RENT'], df_inputs_prepr['home_ownership:OTHER'],
Where does df_inputs_prepr['home_ownership:RENT'] come from? I cannot find how ['home_ownership:RENT'] was determined.
Thanks for reaching out!
- We must split the data before preprocessing. The main reason is that we use the training data to calculate the WoE, and we then base our fine classing and coarse classing decisions on it.
If we used the test data there too, we would have ‘peeked’ into the test data and used it to make decisions about the training dataset. This is definitely not recommended.
- Not quite sure what you mean here. Isn’t RENT one of the categories of home ownership?
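To illustrate the first point, here is a minimal sketch of computing WoE on the training rows only and then applying the resulting mapping to the test rows. The toy data, the `good_bad` column name, and the `woe_table` helper are assumptions made for illustration, not the course's actual code, and the split is a fixed slice only so that the example is reproducible.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one discrete feature and a good/bad flag (1 = good).
df = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "MORTGAGE"] * 4,
    "good_bad": [0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1],
})

# 1. Split first. A fixed slice keeps this example deterministic; in practice
#    a random split (e.g. scikit-learn's train_test_split) would be used.
train = df.iloc[:9]
test = df.iloc[9:]


def woe_table(data, feature, target):
    """WoE per category: ln(share of all goods / share of all bads)."""
    counts = data.groupby(feature)[target].agg(n_obs="count", n_good="sum")
    counts["n_bad"] = counts["n_obs"] - counts["n_good"]
    share_good = counts["n_good"] / counts["n_good"].sum()
    share_bad = counts["n_bad"] / counts["n_bad"].sum()
    return np.log(share_good / share_bad)


# 2. Learn the WoE values from the training rows only.
woe = woe_table(train, "home_ownership", "good_bad")

# 3. Apply the mapping learned on train to the test rows; the test targets
#    are never consulted, so nothing leaks into the preprocessing decisions.
test_woe = test["home_ownership"].map(woe)
print(woe)
print(test_woe)
```

Had the WoE been computed on the full dataset before splitting, the test targets would have influenced the category values and any classing decisions based on them.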
The 365 Team
Thanks for your quick response! Really helpful!
Regarding the second question: yes, RENT is one of the categories, but ['home_ownership:RENT'] is not a column of the dataset. How come we can use df['home_ownership:RENT'] directly?
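One possible explanation, offered as an assumption because the thread does not confirm it: columns with names like this usually appear when dummy variables are created from the categorical column in an earlier step, for example with `pd.get_dummies` using `prefix='home_ownership'` and `prefix_sep=':'`. The sketch below uses toy data and a hypothetical combined column name, `home_ownership:RENT_OTHER`, to show how such a column could come to exist and then be summed with others.

```python
import pandas as pd

# Toy data (hypothetical); the real dataset has many more rows and columns.
df_inputs_prepr = pd.DataFrame(
    {"home_ownership": ["RENT", "OWN", "OTHER", "MORTGAGE", "RENT"]}
)

# Create one 0/1 column per category, named "<prefix>:<category>".
dummies = pd.get_dummies(
    df_inputs_prepr["home_ownership"],
    prefix="home_ownership",
    prefix_sep=":",
    dtype=int,
)
df_inputs_prepr = pd.concat([df_inputs_prepr, dummies], axis=1)
print(df_inputs_prepr.columns.tolist())

# Once the dummy columns exist, several categories can be merged into one
# by summing them (the combined column name here is illustrative only).
df_inputs_prepr["home_ownership:RENT_OTHER"] = sum([
    df_inputs_prepr["home_ownership:RENT"],
    df_inputs_prepr["home_ownership:OTHER"],
])
print(df_inputs_prepr["home_ownership:RENT_OTHER"].tolist())
```

If the course follows this pattern, the dummy-creation step should appear somewhere before the section in question.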