Machine Learning for Financial Statement Analysis & Credit Outcomes Prediction
What is Chaptr Global?
Chaptr Global is an education financing venture that services learn-now, pay-later financing options for schools in Africa. Common types of deferred tuition are income-share agreements (ISAs) and flexible payment plans, both of which allow students to learn now and pay later without the burden of debt. By eliminating the cost barrier to quality education, Chaptr can support a sustainable student financing model in which graduate repayments go back into financing newly enrolled students.
How do ISAs work?
For an ISA to work, Chaptr has to handle the student onboarding layer and graduate repayment collection layer:
- On onboarding, Chaptr tries to determine whether a student is a good fit for an income-share agreement by assessing whether they can afford the program and whether they are likely to see a positive return on investment (ROI) on graduation.
- On repayment collection, Chaptr has to get income reports from students, collect payments, and eventually verify that the income reports are accurate and that students are not misrepresenting their income.
Where does BAINSA fit in?
Now that you understand Chaptr, let’s discuss our involvement in this project. We have two main focus areas – Financial Statement Analysis and Credit Outcome Prediction. The group divided into two sub-teams, each concentrating on one of the two issues.
Team A: Financial Statement Analysis
Chaptr can assess an individual’s income by examining their bank and mobile money records. During enrollment, it must determine which students need assistance to participate in the program by reviewing their family income. During repayment collection, it can inspect a student’s reported income to spot discrepancies. Undetected discrepancies reduce the return on investment of ISAs and so threaten the sustainability of this financing method. These were the challenges Team A set out to address.
Below is the project scope, broken down into three parts:
- Gather bank and mobile money statements alongside the statement owner’s individual income (payslips/tax returns). This is the training data for the income-from-statement analysis. Data collection involves collecting PDF statements, collecting the owner’s income data, and parsing the statements for inflows and outflows.
- Apply data analysis techniques such as exploratory analysis and feature engineering.
- Develop a model that accurately estimates one’s monthly recurring income.
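To make the parsing and income-estimation steps concrete, here is a minimal sketch of how inflows and outflows could be aggregated per month once a statement has been parsed. It is not the model the team trained: the transaction tuples, the function names `monthly_flows` and `recurring_income_estimate`, and the "same description in at least two months" rule are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical parsed transactions: (date, signed amount, description).
# Positive amounts are inflows, negative amounts are outflows.
transactions = [
    (date(2023, 1, 28), 1200.0, "SALARY ACME LTD"),
    (date(2023, 1, 30), -300.0, "RENT"),
    (date(2023, 2, 27), 1200.0, "SALARY ACME LTD"),
    (date(2023, 2, 28), 150.0, "MOBILE MONEY TRANSFER"),
    (date(2023, 3, 28), 1250.0, "SALARY ACME LTD"),
    (date(2023, 3, 29), -320.0, "RENT"),
]


def monthly_flows(txns):
    """Sum inflows and outflows per (year, month)."""
    flows = defaultdict(lambda: {"inflow": 0.0, "outflow": 0.0})
    for day, amount, _desc in txns:
        key = (day.year, day.month)
        if amount >= 0:
            flows[key]["inflow"] += amount
        else:
            flows[key]["outflow"] += -amount
    return dict(flows)


def recurring_income_estimate(txns, min_months=2):
    """Median monthly total of inflows whose description appears
    in at least `min_months` distinct months (an assumed heuristic)."""
    months_by_desc = defaultdict(set)
    for day, amount, desc in txns:
        if amount > 0:
            months_by_desc[desc].add((day.year, day.month))
    recurring = {d for d, months in months_by_desc.items()
                 if len(months) >= min_months}

    per_month = defaultdict(float)
    for day, amount, desc in txns:
        if amount > 0 and desc in recurring:
            per_month[(day.year, day.month)] += amount
    return median(per_month.values()) if per_month else 0.0


flows = monthly_flows(transactions)
estimate = recurring_income_estimate(transactions)
print(flows[(2023, 1)], estimate)
```

A heuristic like this could serve as a baseline or as an engineered feature. The one-off mobile money transfer is excluded because it appears in only one month.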
Strategy:
Ideally, our objective was to construct a regression model that lets us accurately evaluate an individual’s default risk. Such a model would serve as a tool for assessing and validating the consistency of income over time, supporting a thorough analysis of financial stability.
We described each statement with a range of features, including but not limited to:
- Total cash flow rate
- Transaction frequency
We assessed model performance with the following metrics:
- Precision – the proportion of positive identifications that were indeed correct → (True Positives)/(True Positives + False Positives)
- Recall – the proportion of real positives that were correctly identified → (True Positives)/(True Positives + False Negatives)
- F1 Score – a metric that is useful when we need to balance precision and recall → 2 × Precision × Recall / (Precision + Recall)
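The precision, recall, and F1 formulas above can be checked with a short sketch. The labels and predictions below are made up for illustration, and `precision_recall_f1` is a hypothetical helper, not code from the project.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against division by zero when there are no predicted or no real positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative data only: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

With three true positives, one false positive, and one false negative, precision and recall are both 3/4, so F1 is also 3/4.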
Because only a small amount of data was available, we used ensemble methods with the maximum depth limited to 50 and a learning rate of 0.1:
- Random Forest
- XGBoost
- LightGBM
Team B: Credit Outcome Prediction
Bank and mobile money statements allow Chaptr to vet an individual’s income based on inflows and outflows. However, Chaptr cannot offer ISAs to all of its students, and this is the challenge we tried to resolve. Specifically, on onboarding we want to determine which students Chaptr should invest in because of their high likelihood of positive outcomes. To improve our model, we can measure graduate employment status and job quality (e.g. income) against student profiles during repayment collection and use the results to refine our outcome predictions.
We split the goal into three main steps:
- Data Collection: Collect profile data on different students through surveys, as well as by scraping social media and job/recruiting websites.
- Data Analysis: Run exploratory data analysis and feature engineering to better understand and model the profile data.
- Modeling & Evaluation: Fit a model that allows us to predict one’s job placement and job quality rate accurately.
Team B quickly took up the challenge and built an efficient, automated scraper for an open source project, utilizing AWS Lambda. The scraper exported target profile data (age, education, work experience, median salary, company names) into a binary format, which was then used for data analysis and modeling. This approach significantly improved the efficiency of data collection.
Once we had a sufficient amount of data, we were ready to move forward with our analysis. The team explored different modeling techniques such as logistic regression, random forests, and gradient boosting. We evaluated the performance of these models using metrics such as accuracy, precision, recall, and area under the ROC curve. Random forests had the highest accuracy and area under the ROC curve, outperforming the other two algorithms. We also conducted feature importance analysis to identify the most influential factors in forecasting a student’s relative probability of gaining employment after graduation and their anticipated income.
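For readers unfamiliar with the metrics used in this comparison, the sketch below shows how accuracy and area under the ROC curve can be computed from a model’s predicted probabilities. The labels and scores are invented for illustration, and `roc_auc` and `accuracy` are hypothetical helpers, not the project’s evaluation code. The AUC is computed by its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of examples classified correctly at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for t, p in zip(y_true, preds) if t == p)
    return correct / len(y_true)


# Illustrative data only: 1 = employed after graduation, 0 = not employed.
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
auc = roc_auc(y_true, scores)
acc = accuracy(y_true, scores)
print(auc, acc)
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is one reason to report both when comparing models.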
Following extensive data collection, analysis, and model selection, the team successfully trained a predictive model. Chaptr is currently testing the model to evaluate its accuracy and effectiveness in estimating job placement and job quality rates for students.
Conclusion
Throughout the project, both teams collaborated closely and shared insights from their respective analyses. We iteratively refined our models, taking into account feedback from the other team and incorporating additional data as it became available. Moving forward, together with Chaptr we plan to continue refining the models and incorporating additional data sources to improve their accuracy and robustness. This includes exploring alternative algorithms such as neural networks and other ensemble methods. We are confident our research will be of great benefit to educational institutions, and could change how student outcomes are measured globally. We’re immensely thankful to BAINSA and Chaptr for offering us this incredible chance.
Authors: Emil Mollov and Kassym Mukhanbetiyar