Credit Risk Modeling


I have two questions:

  1. Splitting the data. In the tutorial, the data was split first, and then the train and test sets were preprocessed separately. Why not preprocess the data first and then split?
  2. In the section ‘Preprocessing Discrete Variables: Automating Calculations’, regarding the code below,
    df_inputs_prepr['home_ownership:RENT_OTHER_NONE_ANY'] = sum([df_inputs_prepr['home_ownership:RENT'], df_inputs_prepr['home_ownership:OTHER'],
    where does df_inputs_prepr['home_ownership:RENT'] come from? I cannot find how ['home_ownership:RENT'] was determined.
1 Answer

365 Team

Hi Humming,
Thanks for reaching out!

  1. We must split the data before preprocessing. The main reason is that we use only the train data to calculate the WoE, and we then base our fine classing and coarse classing decisions on those values.

    If we used the test data there too, we would have ‘peeked into the test data’ and used it to make decisions about the training dataset. This is definitely not recommended, because the test set would then no longer give an unbiased estimate of how the model performs on unseen data.

  2. Not quite sure what you mean here. Isn’t RENT one of the categories of home ownership?
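
To illustrate point 1, here is a minimal sketch of computing WoE on the train split only and then applying the learned values to the test split. The toy data, the helper name `woe_table`, and the convention WoE = ln(share of goods / share of bads) with 1 = good are assumptions made for this example; this is not the course notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 1 = good borrower, 0 = bad borrower.
rows = []
for category, n_good, n_bad in [("RENT", 600, 400), ("OWN", 800, 200), ("MORTGAGE", 900, 100)]:
    rows += [(category, 1)] * n_good + [(category, 0)] * n_bad
df = pd.DataFrame(rows, columns=["home_ownership", "good_bad"])

def woe_table(frame, feature, target):
    """Hypothetical helper: WoE per category = ln(share of goods / share of bads)."""
    grouped = frame.groupby(feature)[target].agg(n_obs="count", n_good="sum")
    grouped["n_bad"] = grouped["n_obs"] - grouped["n_good"]
    share_good = grouped["n_good"] / grouped["n_good"].sum()
    share_bad = grouped["n_bad"] / grouped["n_bad"].sum()
    return np.log(share_good / share_bad)

# Split first, so the test rows play no part in the WoE calculation.
train, test = train_test_split(df, test_size=0.2, random_state=42)

# WoE is learned from the train split only.
woe = woe_table(train, "home_ownership", "good_bad")

# The test split reuses the train-derived values; nothing is recomputed on it.
test_woe = test["home_ownership"].map(woe)
```

Any classing decision based on `woe` is then made without looking at `test`, which is the point of splitting first.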

The 365 Team

Thanks for your quick response!  Really helpful!

Regarding the second question: yes, RENT is one of the categories, but ['home_ownership:RENT'] is not a column in the original dataset. How come we can use df['home_ownership:RENT'] directly?
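
One likely explanation, which this thread does not confirm: columns named like 'home_ownership:RENT' are usually dummy variables created in an earlier preprocessing step, for example with `pd.get_dummies` using `prefix_sep=':'`. The sketch below assumes that is how they were produced. The toy DataFrame and the combined column name 'home_ownership:RENT_OTHER' are hypothetical.

```python
import pandas as pd

# Toy data (illustrative only); the real dataset has many more rows and columns.
loan_data = pd.DataFrame({"home_ownership": ["RENT", "OWN", "MORTGAGE", "OTHER", "RENT"]})

# One dummy column per category, named '<prefix>:<category>'.
dummies = pd.get_dummies(
    loan_data["home_ownership"], prefix="home_ownership", prefix_sep=":", dtype=int
)
loan_data = pd.concat([loan_data, dummies], axis=1)

# After this step 'home_ownership:RENT' exists as a regular column,
# so several dummies can be summed into one coarse-classed category.
loan_data["home_ownership:RENT_OTHER"] = sum(
    [loan_data["home_ownership:RENT"], loan_data["home_ownership:OTHER"]]
)
```

If the course notebook builds its dummies this way, the 'home_ownership:RENT' column would already be present in the DataFrame by the time the line from the question runs.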
