Credit risk modeling is where data science and fintech meet. It is one of the most important activities conducted in a bank, and the one that has attracted the most attention since the Great Recession. This course is the only comprehensive credit risk modeling course in Python available right now. It shows the complete credit risk modeling picture: from preprocessing, through probability of default (PD), loss given default (LGD), and exposure at default (EAD) modeling, to finally calculating expected loss (EL).


We start by explaining why credit risk is important for financial institutions. We also define ground-zero terms such as expected loss, probability of default, loss given default, and exposure at default.

What is credit risk and why is it important?

Expected loss (EL) and its components: PD, LGD and EAD

Capital adequacy, regulations, and the Basel II accord

Basel II approaches: SA, F-IRB, and A-IRB

Different facility types (asset classes) and credit risk modeling approaches
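The relationship between the four core quantities can be shown with a quick back-of-the-envelope calculation (the numbers below are made up purely for illustration):

```python
# Expected loss is the product of its three components:
#   EL = PD * LGD * EAD
pd_ = 0.10      # probability of default: 10% chance the borrower defaults
lgd = 0.60      # loss given default: 60% of the exposure is lost on default
ead = 50_000.0  # exposure at default: amount outstanding at the moment of default

expected_loss = pd_ * lgd * ead  # roughly 3,000
```

A bank would compute this per loan and aggregate across the portfolio to size its provisions and capital.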

Our example focuses on consumer loans. Since there are more than 100 potential features, we've devoted a complete section to explain why some features are chosen over others.

Our example: consumer loans. A first look at the dataset

Dependent variables and independent variables

Every raw dataset has its drawbacks. While most preprocessing is model-specific, in some cases (like missing-value imputation) we can generalize the data preparation.

Importing the data into Python

Preprocessing a few continuous variables

Preprocessing a few discrete variables

Check for missing values and clean
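A minimal sketch of the kind of general-purpose cleaning this section covers, using pandas; the column names here are invented for illustration and are not the course's actual features:

```python
import numpy as np
import pandas as pd

# Toy loan data with missing values; column names are illustrative.
loans = pd.DataFrame({
    "annual_inc": [45_000.0, np.nan, 62_000.0, 38_000.0],
    "emp_length": [2.0, 10.0, np.nan, 5.0],
})

# A common, model-agnostic choice: impute a continuous variable with
# its mean, or with zero where absence itself is informative.
loans["annual_inc"] = loans["annual_inc"].fillna(loans["annual_inc"].mean())
loans["emp_length"] = loans["emp_length"].fillna(0)

assert not loans.isna().any().any()  # no missing values remain
```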

Once all general preprocessing is complete, we dive into model-specific preprocessing. For the probability of default model, we employ fine classing, coarse classing, the weight of evidence, and the information value criterion. Conventionally, we turn all variables into dummy indicators prior to modeling.

What is the PD model going to look like?

Dependent variable: Good/Bad (default) definition

Fine classing, weight of evidence, coarse classing, information value

Data preparation. Splitting data

Data preparation. Preprocessing discrete variables: automating calculations

Data preparation. Preprocessing discrete variables: visualizing results

Data preparation. Preprocessing discrete variables: creating dummies

Data preparation. Preprocessing continuous variables: automating calculations


Data preparation. Preprocessing continuous variables: creating dummies

Data preparation. Preprocessing the test dataset

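The weight of evidence and information value calculations used throughout this section can be sketched in a few lines. The counts below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy example: counts of good (non-default) and bad (default) borrowers
# per coarse class of some categorical feature. Numbers are illustrative.
bins = pd.DataFrame({
    "grade":  ["A", "B", "C"],
    "n_good": [400, 300, 100],
    "n_bad":  [10, 30, 60],
})

# Weight of evidence per class: ln(share of all goods / share of all bads)
bins["pct_good"] = bins["n_good"] / bins["n_good"].sum()
bins["pct_bad"] = bins["n_bad"] / bins["n_bad"].sum()
bins["woe"] = np.log(bins["pct_good"] / bins["pct_bad"])

# Information value of the whole feature: sum over classes of
# (share of goods - share of bads) * WoE. Higher IV = stronger predictor.
iv = ((bins["pct_good"] - bins["pct_bad"]) * bins["woe"]).sum()
```

Classes with similar WoE are then merged during coarse classing, and features with negligible IV are dropped.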

Having set up all variables to be dummies, we estimate the probability of default. The most intuitive and widely accepted approach is to employ a logistic regression.

The PD model. Logistic regression with dummy variables

Loading the data and selecting the features

PD model estimation

Build a logistic regression model with p-values

Interpreting the coefficients in the PD model
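A minimal sketch of a PD model of this shape, on toy data rather than the course's dataset; scikit-learn's `LogisticRegression` stands in for whatever wrapper the course builds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy design matrix of dummy variables (each column a 0/1 indicator for
# one coarse class); y is 1 for good borrowers, 0 for defaulters.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(500, 3)).astype(float)
y = (rng.random(500) < 0.8).astype(int)  # roughly 80% good borrowers

model = LogisticRegression()
model.fit(X, y)

# Coefficients act multiplicatively on the odds of being 'good':
# exp(coef) is the odds ratio of a class versus the reference class.
odds_ratios = np.exp(model.coef_)

# P(default) = 1 - P(good)
pd_estimates = 1 - model.predict_proba(X)[:, 1]
```

With dummy-only inputs, each coefficient is directly interpretable as the effect of belonging to that class rather than the omitted reference class.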

Since every model fits its training data too closely to some extent, it is crucial to test the results on out-of-sample observations. We therefore evaluate the model's accuracy, its area under the curve (AUC), the Gini coefficient, and the Kolmogorov-Smirnov statistic.

Out-of-sample validation (test)

Evaluation of model performance: accuracy and area under the curve (AUC)

Evaluation of model performance: Gini and Kolmogorov-Smirnov
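These three metrics are closely related, which a short sketch on simulated scores makes clear (the data below is synthetic, not the course's):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic out-of-sample results: y_true is 1 for good borrowers,
# y_score is the model's predicted probability of being good.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0, 1)

auc = roc_auc_score(y_true, y_score)

# Gini is just a rescaling of AUC: 0 = random ranking, 1 = perfect.
gini = 2 * auc - 1

# Kolmogorov-Smirnov statistic: the largest gap between the cumulative
# score distributions of goods and bads, i.e. max(TPR - FPR) on the ROC.
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)
```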

In practice, banks don't really want a complicated model implemented in Python. Instead, they prefer a simple scorecard containing only yes/no questions, which any bank employee can use. In this section we learn how to create one.

Calculating probability of default for a single customer

Creating a scorecard

Calculating credit score

From credit score to PD

Setting cut-offs
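The score-to-PD mapping is a linear rescaling of the model's log-odds, inverted through the logistic function. A minimal sketch follows; the 300-850 range and the log-odds bounds are assumptions for illustration (in practice the bounds come from the minimum and maximum attainable scorecard points):

```python
import numpy as np

MIN_SCORE, MAX_SCORE = 300, 850  # assumed FICO-style score range
LO_MIN, LO_MAX = -5.0, 5.0       # assumed bounds on log-odds of being 'good'

def log_odds_to_score(log_odds):
    """Linearly rescale log-odds onto the credit score range."""
    frac = (log_odds - LO_MIN) / (LO_MAX - LO_MIN)
    return MIN_SCORE + frac * (MAX_SCORE - MIN_SCORE)

def score_to_pd(score):
    """Invert the scaling back to log-odds of being 'good', then
    PD = 1 - P(good) = 1 / (1 + exp(log_odds))."""
    frac = (score - MIN_SCORE) / (MAX_SCORE - MIN_SCORE)
    log_odds = LO_MIN + frac * (LO_MAX - LO_MIN)
    return 1.0 / (1.0 + np.exp(log_odds))
```

A cut-off is then a single score value: applicants below it are rejected, and moving it trades off approval rate against expected default rate.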

Model estimation is extremely important, but model maintenance is an often-neglected step. A common approach is to monitor population stability over time using the population stability index (PSI) and to revisit the model if needed.

PD model monitoring via assessing population stability

Population stability index: preprocessing

Population stability index: calculation and interpretation
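The PSI calculation itself is short. Here is a sketch on simulated score distributions; the thresholds in the docstring are common rules of thumb, not hard standards:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population stability index between a baseline (e.g. training)
    distribution and a more recent one. Rules of thumb: < 0.1 stable,
    0.1-0.25 worth investigating, > 0.25 significant shift."""
    # Bin edges from the baseline population's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Small floor guards against log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return np.sum((a_pct - e_pct) * np.log(a_pct / e_pct))

rng = np.random.default_rng(1)
baseline = rng.normal(600, 50, 10_000)  # illustrative score distribution
shifted = rng.normal(580, 50, 10_000)   # same shape, shifted mean
```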

To calculate the final expected loss, we need three ingredients. Probability of default (PD), loss given default (LGD) and exposure at default (EAD). In this section we preprocess our data to be able to estimate the LGD and EAD models.

LGD and EAD models: independent variables

LGD and EAD models: dependent variables

LGD and EAD models: distribution of recovery rates and credit conversion factors
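The two dependent variables of this section can be sketched as simple ratios. The column names below mimic LendingClub-style fields and are an assumption; adjust them to your own dataset:

```python
import pandas as pd

# Toy defaulted-loans data; column names are illustrative assumptions.
defaults = pd.DataFrame({
    "funded_amnt": [10_000.0, 8_000.0, 15_000.0],
    "recoveries": [2_500.0, 0.0, 6_000.0],
    "total_rec_prncp": [4_000.0, 1_000.0, 9_000.0],
})

# Recovery rate: fraction of the funded amount eventually recovered.
# LGD is then 1 - recovery_rate.
defaults["recovery_rate"] = defaults["recoveries"] / defaults["funded_amnt"]

# Credit conversion factor: fraction of the funded amount still
# outstanding at default; EAD = funded_amnt * CCF.
defaults["ccf"] = (
    (defaults["funded_amnt"] - defaults["total_rec_prncp"])
    / defaults["funded_amnt"]
)

# Both ratios should live in [0, 1]; clipping guards against data errors.
defaults["recovery_rate"] = defaults["recovery_rate"].clip(0, 1)
defaults["ccf"] = defaults["ccf"].clip(0, 1)
```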

LGD models are often estimated using a beta regression. To keep the modeling simpler, we employ a two-stage regression model that approximates one: we combine the predictions of a logistic regression with those of a linear regression to estimate the loss given default.

LGD model: preparing the inputs

LGD model: testing the model

LGD model: estimating the accuracy of the model

LGD model: saving the model

LGD model: stage 2 - linear regression

LGD model: stage 2 - linear regression evaluation

LGD model: combining stage 1 and stage 2
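The two-stage idea can be sketched as follows, on synthetic data rather than the course's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data: X are borrower features, rr is the observed recovery
# rate, with a sizeable mass of exact zeros (nothing recovered at all).
rng = np.random.default_rng(7)
X = rng.random((400, 3))
rr = np.where(rng.random(400) < 0.4, 0.0, rng.beta(2.0, 3.0, 400))

# Stage 1: logistic regression on whether any recovery occurred at all.
any_recovery = (rr > 0).astype(int)
stage1 = LogisticRegression().fit(X, any_recovery)

# Stage 2: linear regression on the recovery rate, trained only on
# loans with a positive recovery.
mask = rr > 0
stage2 = LinearRegression().fit(X[mask], rr[mask])

# Combined: P(recovery > 0) * E[recovery | recovery > 0],
# then LGD = 1 - predicted recovery rate, clipped into [0, 1].
rr_hat = stage1.predict_proba(X)[:, 1] * stage2.predict(X)
lgd_hat = np.clip(1 - rr_hat, 0, 1)
```

Splitting the zero/non-zero decision from the magnitude is what lets two simple models mimic the bounded, zero-inflated shape a beta regression would capture.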

Exposure at default (EAD) modeling is very similar to LGD modeling. In this section we take advantage of a linear regression to calculate EAD.

EAD model estimation and interpretation

EAD model validation
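A corresponding sketch for EAD, again on synthetic data: regress the credit conversion factor on borrower features, then scale by the funded amount.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic setup: the true CCF is a noisy linear function of features.
rng = np.random.default_rng(3)
X = rng.random((300, 4))
ccf = np.clip(
    X @ np.array([0.2, 0.1, 0.3, 0.1]) + 0.2 + rng.normal(0, 0.05, 300),
    0, 1,
)

model = LinearRegression().fit(X, ccf)
ccf_hat = np.clip(model.predict(X), 0, 1)  # keep predictions in [0, 1]

# EAD = funded amount * predicted credit conversion factor
funded_amnt = rng.uniform(1_000, 35_000, 300)
ead_hat = funded_amnt * ccf_hat
```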

After having calculated PD, LGD, and EAD, we reach the final step: computing expected loss (EL). This is also the number of most interest to C-level executives, and it is the finale of the credit risk modeling process.

Calculating expected loss
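At the portfolio level, this final step is a per-loan multiplication and a sum. The figures below are illustrative stand-ins for the three models' outputs:

```python
import pandas as pd

# Toy portfolio with the three fitted components per loan.
portfolio = pd.DataFrame({
    "pd": [0.05, 0.20, 0.10],
    "lgd": [0.40, 0.75, 0.55],
    "ead": [12_000.0, 5_000.0, 20_000.0],
    "funded_amnt": [15_000.0, 6_000.0, 25_000.0],
})

# Expected loss per loan and for the whole portfolio
portfolio["el"] = portfolio["pd"] * portfolio["lgd"] * portfolio["ead"]
total_el = portfolio["el"].sum()

# EL is often tracked as a share of the total funded amount,
# and compared against the bank's provisions and capital.
el_ratio = total_el / portfolio["funded_amnt"].sum()
```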

MODULE 4

This course is part of Module 4 of the 365 Data Science Program. The complete training consists of four modules, each building on your knowledge from the previous one. Module 4 is focused on developing a specialized, industry-relevant skill set, and students are encouraged to complete Modules 1, 2, and 3 before starting this part of the training. Here you will learn how to perform Credit Risk Modeling for banks, Customer Analytics for retail or other commercial companies, and Time Series Analysis for finance and stock data.

Real-life projects and data. Solve them on your own computer as you would in the office.

Our expert instructors are happy to help. Post a question and get a personal answer from one of our instructors.

Earn a verifiable certificate after each completed course. Celebrate your successes and share your progress with your professional network!

Sign up today for FREE!

Whether you want to scale your career or transition into a new field, data science is the number one skillset employers look for. Grow your analytics expertise and get hired as a data scientist!