This is why having a working framework will make you a better data professional
Everyone works with data one way or another nowadays. Unfortunately, many also struggle to keep track of the steps they take when doing so. Why does this happen?
Many of the most successful data professionals, and many professionals overall, rely on frameworks that lay out the steps for each task. For data work, a well-known framework is CRISP-DM (Cross-Industry Standard Process for Data Mining).
This framework consists of the following steps:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
We will showcase how applying this framework to a case study can help deliver better and cleaner results. We will work with lendingclub.com data from 2007–2010 covering almost 10,000 borrowers who took out loans, some of which were paid back and others still in progress. The whole case study is done in Python, and you can check the full analysis in this repository.
What will we look into?
For this case study, I was interested in using lendingclub.com data to better understand:
- How long does it take users to pay back their loans?
- Which loan purposes tend to be more profitable?
- Can we predict which loans will be fully paid?
These three questions aim to understand and automate the lending process with an ML model that could help drive growth for lendingclub.com.
Part 1: How long does it take users to pay back their loans?
We start by looking at the time to payment in months. As an extra step, we split the distribution using each quartile as a boundary to get a clearer view of the result.
We can conclude that loans on lendingclub.com are paid back quickly, in less than 2 months most of the time.
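The quartile split described above can be sketched roughly as follows. This is a minimal illustration with pandas, not the notebook's actual code: the `months_to_pay` values are made up, and the real analysis would read them from the dataset.

```python
import pandas as pd

# Hypothetical months-to-repayment values (assumption, not real lendingclub.com data).
months_to_pay = pd.Series(
    [0.5, 1.0, 1.2, 1.5, 1.8, 2.0, 3.5, 6.0], name="months_to_pay"
)

# Quartile boundaries: 25th, 50th, and 75th percentiles.
quartiles = months_to_pay.quantile([0.25, 0.5, 0.75])

# Assign each loan to the quartile bucket it falls into.
buckets = pd.qcut(months_to_pay, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Summarize each bucket so the distribution can be read one quartile at a time.
summary = months_to_pay.groupby(buckets, observed=True).agg(["count", "min", "max"])

print(quartiles)
print(summary)
```

Reading the median and the upper quartile side by side is one way to back a statement like "most loans are paid in less than 2 months" with numbers rather than only a plot.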
Part 2: Which loan purposes tend to be more profitable?
We now want to check which kind of loan is the most profitable. For that, we first look at the number of loans and the mean installment amount per purpose.
The purposes appear in a different order in each plot. That difference matters for our final answer to the second question, which we reach after plotting the distribution of profit per purpose.
Looking at the results for each purpose, we now know that the three purposes that tend to be most profitable are Debt Consolidation, Small Business, and Credit Card. The mean installment per purpose also shows that Small Business loans have the largest installments.
If we could give out more loans in that category, which has one of the lowest total counts, we could increase profit considerably.
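A per-purpose comparison like the one above could be computed along these lines. This is only a sketch: the sample rows are invented, and the column names `purpose`, `installment`, and `profit` are assumptions about how the data might be laid out, not the notebook's actual schema.

```python
import pandas as pd

# Invented sample rows; column names are assumptions for illustration only.
loans = pd.DataFrame(
    {
        "purpose": [
            "debt_consolidation",
            "debt_consolidation",
            "credit_card",
            "small_business",
            "credit_card",
            "debt_consolidation",
        ],
        "installment": [300.0, 500.0, 250.0, 800.0, 350.0, 400.0],
        "profit": [120.0, 200.0, 90.0, 310.0, 110.0, 150.0],
    }
)

# One row per purpose: how many loans, the mean installment, and the total profit.
per_purpose = (
    loans.groupby("purpose")
    .agg(
        loan_count=("installment", "count"),
        mean_installment=("installment", "mean"),
        total_profit=("profit", "sum"),
    )
    .sort_values("total_profit", ascending=False)
)

print(per_purpose)
```

Keeping the count, mean installment, and profit in one table makes it easier to spot a category that is profitable per loan but rarely granted.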
Part 3: Can we predict which loans will be fully paid?
Now that we know how long the average user takes to fully pay a loan and which loan purposes are most profitable, we focus on predicting which loans will be fully paid. To do that, we choose the model that best predicts the not_fully_paid variable in the dataset.
Here we work with two models, Decision Tree and Random Forest, and check which gets the best results.
Decision Tree Results:
Random Forest Results:
After running both models, we find that Random Forest is the better model for this case, based on its better metrics in the Classification Report. A next step in this analysis would be to look at which features matter most in order to improve the model's metrics.
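For readers who want to try this comparison themselves, here is a minimal sketch with scikit-learn. It is not the notebook's code: the data is synthetic (generated with `make_classification` as a stand-in for the loan features and the not_fully_paid target), and the class imbalance, split size, and hyperparameters are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption): 1 plays the role of not_fully_paid.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Hold out 30% of the rows for evaluation, keeping the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the same split and collect its classification report.
reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    reports[name] = classification_report(y_test, predictions, output_dict=True)
    print(name)
    print(classification_report(y_test, predictions))
```

Evaluating both models on the same held-out split keeps the comparison fair. With an imbalanced target, the precision and recall of the minority class are usually more informative than overall accuracy.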
Conclusions
Overall, we now know that with a model that accurately predicts whether loans for a given purpose will be paid, we could automate lending and give more loans to the three most profitable purposes, increasing earnings. This would especially boost Small Business loans, which could be a lever for growth in the future.
This model could be sent to the lendingclub.com team for deployment… provided, of course, that the data were more representative of what they are working with nowadays.
Final thoughts
If this is the first time you have heard of using a framework to work with data, I hope this example has been helpful. Since I started working with data on my own, everything has been a bit easier whenever I use this or any other framework. In the end, a framework is just a series of steps, and you can even create one for your own use.
The notebook with the full analysis is available in this repository, and if you would like to analyze the data further, I invite you to do so.
Thank you for reading!