Credit Risk Modeling
I have two questions:
- Splitting the data. In the tutorial, the data was split first, and then the train and test data were preprocessed separately. Why not preprocess the data first and then split?
- In the section 'Preprocessing Discrete Variables: Automating Calculations', regarding the code below: df_inputs_prepr['home_ownership:RENT_OTHER_NONE_ANY'] = sum([df_inputs_prepr['home_ownership:RENT'], df_inputs_prepr['home_ownership:OTHER'], df_inputs_prepr['home_ownership:NONE'], df_inputs_prepr['home_ownership:ANY']]). Where does df_inputs_prepr['home_ownership:RENT'] come from? I cannot find how ['home_ownership:RENT'] is determined.
3 answers ( 0 marked as helpful)
Hi Humming,
Thanks for reaching out!
- We must split the data prior to preprocessing. The main reason is that we use the train data to calculate the WoE, and based on that we draw our conclusions regarding fine classing and coarse classing. If we used the test data there too, we would have 'peeked into the test data' and used the test set to make decisions about the training dataset. This is definitely not recommended.
- Not quite sure what you mean here. Isn't RENT one of the categories of home ownership?
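To illustrate the first point, here is a minimal sketch of computing WoE on the train data only and then applying the resulting mapping to the test data. The toy data, the fixed positional split, and the helper name woe_table are assumptions made for this example and are not the course's code; it also assumes good_bad = 1 marks a good borrower and that WoE = ln(proportion of goods / proportion of bads). In practice the split would be random.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (good_bad: 1 = good, 0 = bad).
df = pd.DataFrame({
    "home_ownership": ["RENT", "RENT", "RENT", "OWN", "OWN", "OWN",
                       "MORTGAGE", "MORTGAGE", "MORTGAGE",
                       "RENT", "OWN", "MORTGAGE"],
    "good_bad": [0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
})

# 1) Split FIRST. A fixed positional split keeps the example deterministic.
train = df.iloc[:9]
test = df.iloc[9:]


def woe_table(frame, feature, target):
    """Hypothetical helper: WoE per category, computed from `frame` only."""
    grouped = frame.groupby(feature)[target].agg(n_obs="count", n_good="sum")
    grouped["n_bad"] = grouped["n_obs"] - grouped["n_good"]
    grouped["prop_good"] = grouped["n_good"] / grouped["n_good"].sum()
    grouped["prop_bad"] = grouped["n_bad"] / grouped["n_bad"].sum()
    grouped["WoE"] = np.log(grouped["prop_good"] / grouped["prop_bad"])
    return grouped


# 2) Learn the WoE values from the train data only.
woe_train = woe_table(train, "home_ownership", "good_bad")

# 3) Apply the train-derived mapping to the test data; the test labels
#    are never used to decide anything about the preprocessing.
test_woe = test["home_ownership"].map(woe_train["WoE"])
print(woe_train[["n_good", "n_bad", "WoE"]])
print(test_woe)
```

Any coarse classing decisions would then be made from woe_train alone and simply replayed on the test data.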
Hey, I'm a new member of this site who's been following the course Credit Risk Modeling in Python on Udemy.
I've been blocked by a persistent bug when running the code block that contains this definition:
df_inputs_prepr['home_ownership:RENT_OTHER_NONE_ANY'] = sum([df_inputs_prepr['home_ownership:RENT'], df_inputs_prepr['home_ownership:OTHER'], df_inputs_prepr['home_ownership:NONE'],df_inputs_prepr['home_ownership:ANY']])
The error thrown is:
# If we have a listlike key, _check_indexing_error will raise
KeyError: 'home_ownership:RENT'
Any assistance will be appreciated.
I hope I'm asking this in the right place.
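A KeyError like this means pandas could not find a column with exactly that name in the dataframe. As a quick diagnostic, the sketch below lists which of the expected columns are absent. The toy dataframe is made up for illustration and stands in for the real df_inputs_prepr.

```python
import pandas as pd

# Toy stand-in for df_inputs_prepr; the values are illustrative only.
df_inputs_prepr = pd.DataFrame(
    {"home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"]}
)

# Columns that the summing line expects to find.
expected = [
    "home_ownership:RENT",
    "home_ownership:OTHER",
    "home_ownership:NONE",
    "home_ownership:ANY",
]

# Any name listed here would raise a KeyError when indexed.
missing = [col for col in expected if col not in df_inputs_prepr.columns]
print(missing)
```

If every expected column is reported missing, the step that creates the per-category columns has probably not been run on this dataframe. If only some are missing, those categories most likely do not occur in the data at hand.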
I was able to get it working by creating new columns, one for each of the values of the 'home_ownership' column, appended at the end, so that we end up with 'home_ownership:RENT', 'home_ownership:OTHER', 'home_ownership:NONE' and 'home_ownership:ANY' as new fields of the df_inputs_prepr dataframe.
# Strings for each unique value of df_inputs_prepr['home_ownership']
categories = df_inputs_prepr['home_ownership'].unique()
# Loop through all values of 'home_ownership' and append a new column for each one that is missing
for element in categories:
    column_name = 'home_ownership:' + element
    # Only add the column if it does not exist yet; the new column is filled with 0 in every row
    if column_name not in df_inputs_prepr.columns:
        df_inputs_prepr[column_name] = 0
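An alternative sketch, assuming the 'home_ownership:...' columns are meant to be dummy (indicator) variables: build them with pd.get_dummies, using prefix and prefix_sep to get the 'home_ownership:RENT' naming, and add a zero column only for categories that do not occur in the data. The toy dataframe and the expected list are assumptions for this example.

```python
import pandas as pd

# Toy stand-in for df_inputs_prepr; the values are illustrative only.
df_inputs_prepr = pd.DataFrame(
    {"home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT", "OTHER"]}
)

# One indicator column per category that occurs in the data,
# named like 'home_ownership:RENT'.
dummies = pd.get_dummies(
    df_inputs_prepr["home_ownership"],
    prefix="home_ownership",
    prefix_sep=":",
).astype(int)

# Categories that the combined column needs.
expected = [
    "home_ownership:RENT",
    "home_ownership:OTHER",
    "home_ownership:NONE",
    "home_ownership:ANY",
]

# Categories absent from this data get an all-zero indicator column.
for col in expected:
    if col not in dummies.columns:
        dummies[col] = 0

df_inputs_prepr = pd.concat([df_inputs_prepr, dummies], axis=1)

# Combined category: 1 if the row belongs to any of the four categories.
df_inputs_prepr["home_ownership:RENT_OTHER_NONE_ANY"] = (
    df_inputs_prepr[expected].sum(axis=1)
)
print(df_inputs_prepr["home_ownership:RENT_OTHER_NONE_ANY"].tolist())
```

With this approach, rows that actually have a given category keep a 1 in its indicator column, and only categories that never appear in the data are filled with zeros.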