Credit Risk Model
Project Overview
Loans and mortgages are common services in the financial industry. In order to assist the loan approval process, an effective predictive model is crucial to identify whether a loan applicant is likely to repay or default. For financial compliance purposes, the prediction results from the trained model must be unbiased and explainable.
Challenges
The primary goal was to develop and train a structurally simple model that achieves performance comparable to sophisticated models such as neural networks. Finance is one of the most heavily regulated industries, with strict compliance requirements. This made the task challenging, because strong predictive performance and explainability were both mandatory.
Methodology
- Data Extraction: Extract the required data from a SQL database into R for pre-processing and model training.
- Data Pre-processing: Engineer new variables through mathematical combinations of existing variables.
- Training and Iteration: Fit a logistic regression model to the data and iteratively select the most important variables.
- Optimization: Select and fine-tune the best-performing model by benchmarking it against a sophisticated neural network model.
- Model Validation: Report the performance of the model on held-out validation data.
- Deployment: Pipe the necessary data from the SQL database to the trained model in production.
- Documentation: Record the final variables selected for the logistic regression model along with their respective coefficients. The final handover document provides a clear justification of how each variable influences the model's predicted default probability for a loan applicant.
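The training and documentation steps above can be illustrated with a minimal sketch. It is not the project's code: the original model was trained in R, whereas this example uses plain Python with no external libraries so that it runs on its own. The applicant data, the debt-to-income feature, and the names `engineer` and `fit_logistic` are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def engineer(row):
    # Hypothetical engineered variable: debt-to-income ratio.
    income, debt = row
    return [debt / income]

def fit_logistic(X, y, lr=0.5, epochs=2000):
    # Batch gradient descent on the mean log-loss.
    # Returns (intercept, list of coefficients).
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            err = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) - yi
            g0 += err
            for j in range(p):
                g[j] += err * xi[j]
        b0 -= lr * g0 / n
        b = [bj - lr * gj / n for bj, gj in zip(b, g)]
    return b0, b

# Made-up applicants: (income, debt) in thousands; label 1 = default.
raw = [(50, 5), (60, 10), (40, 30), (30, 27), (80, 8), (35, 30), (45, 36), (70, 14)]
y = [0, 0, 1, 1, 0, 1, 0, 1]

X = [engineer(r) for r in raw]
intercept, coefs = fit_logistic(X, y)

# Each coefficient is a change in log-odds per unit of its variable, so
# exp(0.1 * coefficient) is the odds multiplier for a 0.1 rise in the ratio.
odds_ratio = math.exp(0.1 * coefs[0])
print(f"intercept={intercept:.3f}, coefficient={coefs[0]:.3f}")
print(f"odds multiplier per +0.1 debt-to-income: {odds_ratio:.3f}")
```

Reading a coefficient as an odds multiplier in this way is what allows a handover document to state, variable by variable, how the predicted default probability moves.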
Tools
The project primarily used a logistic regression model, with R for model training and Python for benchmarking. During deployment, data from the SQL database was piped to the trained model.
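As a rough illustration of the deployment flow, the sketch below reads applicant rows with SQL, derives the model inputs, and applies fixed logistic regression coefficients. Everything specific in it is assumed: the in-memory SQLite database stands in for the production SQL database, and the table name, column names, and coefficient values are hypothetical.

```python
import math
import sqlite3

# Hypothetical coefficients, standing in for the values recorded at handover.
INTERCEPT = -3.0
COEFS = {"debt_to_income": 4.0, "late_payments": 0.5}

def default_probability(features):
    # Logistic regression scoring: sigmoid of the linear predictor.
    z = INTERCEPT + sum(COEFS[name] * features[name] for name in COEFS)
    return 1.0 / (1.0 + math.exp(-z))

def score_applicants(conn):
    # Pull raw columns with SQL, derive the model inputs, then score each row.
    rows = conn.execute(
        "SELECT applicant_id, income, debt, late_payments "
        "FROM applicants ORDER BY applicant_id"
    )
    scores = {}
    for applicant_id, income, debt, late in rows:
        features = {"debt_to_income": debt / income, "late_payments": late}
        scores[applicant_id] = default_probability(features)
    return scores

# In-memory database with made-up rows, used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE applicants ("
    "applicant_id INTEGER PRIMARY KEY, income REAL, debt REAL, late_payments INTEGER)"
)
conn.executemany(
    "INSERT INTO applicants VALUES (?, ?, ?, ?)",
    [(1, 60000, 6000, 0), (2, 30000, 27000, 3)],
)
scores = score_applicants(conn)
for applicant_id, p in scores.items():
    print(applicant_id, round(p, 3))
```

Because scoring needs only the coefficients and a sigmoid, the production step can stay this small, which is one practical benefit of a structurally simple model.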
Results
The outcome was a logistic regression model with accuracy comparable to that of the benchmark neural network model. Plots of model performance on the training and validation data indicated that further training would yield little additional improvement. The model's structural simplicity made it stable, so continuous integration/continuous deployment (CI/CD) practices were optional rather than required.
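One way to put numbers behind a training-versus-validation comparison is to compute the same metrics on both sets and look at the gap. The sketch below does this for accuracy and log-loss. The labels and predicted probabilities are made up, and the helper names are assumptions, not the project's own evaluation code.

```python
import math

def accuracy(y_true, p_pred, threshold=0.5):
    # Share of cases where the thresholded prediction matches the label.
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

def log_loss(y_true, p_pred, eps=1e-12):
    # Mean negative log-likelihood; probabilities are clipped to avoid log(0).
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def generalization_gap(train, valid):
    # Each argument is a (labels, predicted_probabilities) pair.
    return {
        "accuracy_gap": accuracy(*train) - accuracy(*valid),
        "log_loss_gap": log_loss(*valid) - log_loss(*train),
    }

# Made-up labels and predicted default probabilities.
train = ([0, 0, 1, 1, 0, 1], [0.1, 0.2, 0.8, 0.7, 0.3, 0.9])
valid = ([0, 1, 1, 0], [0.2, 0.7, 0.6, 0.4])

gap = generalization_gap(train, valid)
print(gap)
```

Small gaps on both metrics suggest the model generalizes about as well as it fits, while a large gap would point to overfitting.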