Credit Risk Analytics
Risk Analytics is one of the key areas of data science and business intelligence in finance. With Risk analytics and management, a company is able to take strategic decisions, increase the trustworthiness and security of the company by identifying, assessing, and controlling threats to an organization’s capital and earnings. Risk Analytics is the application of data science knowledge to manage risk in a way that helps in revenue growth with minimal losses associated with fraudulent activities by bad actors.
This was my first project related to Data Science applied to Finance. We obtain the data from the lending club dataset.
“ I was so amazed by how difficult was to put the ML pipeline in production.”
Visit this repository and check the notebook out to find more information about the steps of the project and the results. A summary here below:
Objective
The objective of this exercise was to classify our current and old customers in order to predict if the new clients will pay the loan or not.
Target
The target we want to predict is the `loan_status` variable.
Data
Initially, we have 4 datasets with the information of our clients, to proceed with the data analysis we merged them into one dataset. The shape of this data is 884876 rows × 151 columns.
EDA
After performing an Exploratory Data Analysis and selecting the best variables we got 425015 rows × 51 columns dataset.
Processing data
In order to perform the next algorithms, we processed the data (scaling, one hot encoding...) and created our TRAIN, VALIDATION, and TEST datasets.
Predictions
Algorithms | Accuracy | ROC-AUC |
---|---|---|
Logistic Regression | 76.11 % | 0.72 |
Support Vector Machine | 58.89 % | 0.51 |
Random Forest | 76.25 % | 0.73 |
XGBoost | 76.58 % | 0.73 |
K-Nearest Neighbors | 73.80 % | 0.59 |
XGBoost Tuned | 73.42 % | 0.73 |
Conclusions
The best model is the XGboost tuned, because it has the best accuracy, precisions, recalls, and f1-score. But as we can see the differences between the metrics are very low, so if we have to conclude which is the best one for production would be the Logistic Regression because it is clearly interpretable and faster than the XGboost or Random Forest Classifier.
Members of the group
Octavio del Sueldo: hugo.delsueldo@cunef.edu
Jose Lopez: jose.lopez@cunef.edu