Credit Risk Analytics

Credit Risk Prediction using customer information and financial transactions.

Risk analytics is one of the key areas of data science and business intelligence in finance. With risk analytics and management, a company can make strategic decisions and increase its trustworthiness and security by identifying, assessing, and controlling threats to its capital and earnings. Risk analytics applies data science to manage risk in a way that supports revenue growth while minimizing losses from fraudulent activity by bad actors.

This was my first project applying data science to finance. The data comes from the Lending Club dataset.

“I was amazed by how difficult it was to put the ML pipeline into production.”

Visit this repository and check out the notebook for more information about the project steps and results. A summary follows:

Objective

The objective of this exercise was to classify our current and past customers in order to predict whether new clients will repay their loans.

Target

The target we want to predict is the `loan_status` variable.
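As a minimal sketch of how a status column can be turned into a binary target, the snippet below keeps only loans with a known outcome and maps them to 0/1. The status labels, the mapping, and the `target` column name are assumptions made for illustration; the notebook may encode `loan_status` differently.

```python
import pandas as pd

# Toy data. The status labels are illustrative assumptions, not the
# project's confirmed category set.
loans = pd.DataFrame({
    "loan_amnt": [10000, 5000, 12000, 8000],
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"],
})

# Hypothetical mapping: 0 = loan repaid, 1 = loan not repaid.
STATUS_TO_TARGET = {"Fully Paid": 0, "Charged Off": 1}

# Keep only loans whose outcome is already known, then encode the target.
finished = loans[loans["loan_status"].isin(STATUS_TO_TARGET)].copy()
finished["target"] = finished["loan_status"].map(STATUS_TO_TARGET)

print(finished[["loan_status", "target"]])
```

Loans that are still in progress carry no final outcome, so in this sketch they are dropped rather than guessed.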

Data

Initially, we had 4 datasets with information about our clients. To proceed with the data analysis, we merged them into one dataset with a shape of 884876 rows × 151 columns.
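The combining step could look roughly like the sketch below, assuming the four files share the same columns and can simply be stacked row-wise. If they instead hold different attributes of the same clients, a key-based `DataFrame.merge` would be the appropriate tool. The toy frames and column names are hypothetical.

```python
import pandas as pd

# Four toy frames standing in for the four source datasets (hypothetical data).
parts = [
    pd.DataFrame({"id": [1, 2], "loan_amnt": [10000, 5000]}),
    pd.DataFrame({"id": [3], "loan_amnt": [12000]}),
    pd.DataFrame({"id": [4, 5], "loan_amnt": [8000, 3000]}),
    pd.DataFrame({"id": [6], "loan_amnt": [7000]}),
]

# Stack the frames row-wise and rebuild a clean 0..n-1 index.
loans = pd.concat(parts, ignore_index=True)

print(loans.shape)  # (6, 2)
```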

EDA

After performing an exploratory data analysis and selecting the best variables, we obtained a dataset of 425015 rows × 51 columns.

Processing data

To prepare the data for the algorithms below, we processed it (scaling, one-hot encoding, ...) and created our TRAIN, VALIDATION, and TEST datasets.
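A minimal sketch of this kind of preprocessing with scikit-learn is shown below. The column names, split proportions, and toy values are assumptions for illustration and are not taken from the notebook. The split happens first so that the scaler and encoder are fitted on the training rows only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names and values.
data = pd.DataFrame({
    "loan_amnt": [10000, 5000, 12000, 8000, 3000, 7000, 15000, 9000, 4000, 11000],
    "int_rate": [10.5, 13.2, 7.9, 15.1, 9.4, 12.0, 6.5, 14.3, 11.1, 8.8],
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT", "OWN",
                       "MORTGAGE", "RENT", "OWN", "MORTGAGE", "RENT"],
    "target": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
})
X = data.drop(columns="target")
y = data["target"]

# Split first: 20% for TEST, then 25% of the remainder for VALIDATION.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# Scale numeric columns and one-hot encode categorical ones.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["loan_amnt", "int_rate"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["home_ownership"]),
])

# Fit on TRAIN only, then apply the same transformation to the other sets.
X_train_t = preprocessor.fit_transform(X_train)
X_val_t = preprocessor.transform(X_val)
X_test_t = preprocessor.transform(X_test)

print(X_train_t.shape[0], X_val_t.shape[0], X_test_t.shape[0])  # 6 2 2
```

Fitting the transformers on the training set alone keeps information from the validation and test rows out of the model inputs, and `handle_unknown="ignore"` avoids errors when a category appears only outside the training set.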

Predictions

Algorithms	Accuracy	ROC-AUC
Logistic Regression	76.11 %	0.72
Support Vector Machine	58.89 %	0.51
Random Forest	76.25 %	0.73
XGBoost	76.58 %	0.73
K-Nearest Neighbors	73.80 %	0.59
XGBoost Tuned	73.42 %	0.73
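For reference, the two metrics in the table can be computed as in the sketch below. It uses synthetic data and a plain logistic regression as stand-ins, so the numbers it prints are unrelated to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in, not the project's data).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy is computed from hard class predictions.
accuracy = accuracy_score(y_test, model.predict(X_test))

# ROC-AUC is computed from the predicted probability of the positive class.
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"Accuracy: {accuracy:.2%}  ROC-AUC: {roc_auc:.2f}")
```

Accuracy depends on the decision threshold and on the class balance, while ROC-AUC measures how well the model ranks positives above negatives, with 0.5 corresponding to random ranking. This is why the two columns do not always move together.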

Conclusions

We consider the tuned XGBoost the best model, based on its accuracy, precision, recall, and F1-score. However, the differences between the models' metrics are very small, so if we had to choose one for production, it would be the Logistic Regression, because it is easy to interpret and faster than XGBoost or the Random Forest classifier.
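To illustrate what interpretability means in practice for a logistic regression, the sketch below fits one on synthetic, standardized features and reads each coefficient as an odds ratio. The feature names are placeholders and the data is synthetic, so this is not the project's model.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder feature names and synthetic data (assumptions for illustration).
feature_names = ["feature_a", "feature_b", "feature_c"]
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=0)

# Standardize so that coefficients are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one standard deviation increase in the feature.
summary = pd.DataFrame({
    "feature": feature_names,
    "coefficient": model.coef_[0],
    "odds_ratio": np.exp(model.coef_[0]),
})
print(summary)
```

An odds ratio above 1 means the feature pushes predictions toward the positive class, and a value below 1 pushes them away from it. Tree ensembles offer no comparably direct reading.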

Members of the group

Octavio del Sueldo: hugo.delsueldo@cunef.edu

Jose Lopez: jose.lopez@cunef.edu