A while back we looked at churn analysis and its importance for a company. We also presented our approach to estimating a customer’s value using RFM analysis and scoring. In this post, I am going to present a real churn & score solution which we implemented for one of our clients. In building the solution, we sought to address three main challenges: how to identify customers who were about to stop using a car service station, how to evaluate them and how to understand the reasons which influenced their decision.
This post is the third in a series of articles on customer churn analysis. The first two can be found here:
We have already discussed the framework in detail, so I am going to use the graphics below as a reference and describe how we carried out each specific step. Bear in mind that this framework can be used in many machine learning problems, not only churn analysis. Our experience in developing models and implementing them in various contexts has translated into a product that is flexible enough to address multiple problems across a range of industries.
We first had to decide which kind of data sources to use. That decision would allow us to agree on the following:
The first goal is crucial—only an accurate specification of the modeling problem allows you to achieve the results you want. Together with our client’s business subject matter experts, we established the following definition of “churn”: a churning customer is one who has not visited the car service in the 12 months following their last visit. We then proceeded to review possible data sources. Most of the data had already been transferred to a data lake, which simplified what would have been a far more complicated and laborious review process. After careful consideration, we identified several possible areas of interest:
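The churn definition above can be turned into a labeling rule. The following is a minimal sketch, not the project's actual code: the customer IDs, visit dates, and snapshot date are made up, and "12 months" is approximated as 365 days.

```python
from datetime import date, timedelta

# Assumption: "12 months" is approximated as 365 days.
CHURN_WINDOW = timedelta(days=365)


def is_churned(visits: list[date], snapshot: date) -> bool:
    """Return True if more than CHURN_WINDOW has passed since the
    customer's last visit on or before the snapshot date."""
    last_visit = max(v for v in visits if v <= snapshot)
    return snapshot - last_visit > CHURN_WINDOW


# Hypothetical visit histories keyed by customer ID.
history = {
    "C001": [date(2021, 3, 1), date(2022, 1, 15)],
    "C002": [date(2020, 5, 20)],
}
snapshot = date(2022, 6, 30)

labels = {cid: is_churned(visits, snapshot) for cid, visits in history.items()}
print(labels)
```

Fixing a snapshot date matters: a customer whose last visit was less than 12 months before the snapshot cannot yet be labeled as churned.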
A vital step in every ML project is to perform exploratory data analysis. It allows you to investigate potential modeling features and specify what additional operations will be required. These may include imputing empty records, limiting categories (if there are too many), removing outliers and correcting variable formats. The presentation of such an analysis need not be aesthetically appealing or graphically polished. It is, after all, no more (and no less) than an efficient way for a data scientist to verify initial assumptions.
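As a rough illustration of the clean-up operations that EDA typically points to, here is a sketch using only the Python standard library. The field names (`brand`, `mileage_km`), the thresholds, and the sample records are hypothetical, and the interquartile-range rule is just one common way to flag outliers.

```python
from collections import Counter
from statistics import median, quantiles

# Hypothetical raw records: mileage arrives as text and may be missing.
raw = [
    {"brand": "Audi", "mileage_km": "42000"},
    {"brand": "Audi", "mileage_km": "51000"},
    {"brand": "Skoda", "mileage_km": None},
    {"brand": "Skoda", "mileage_km": "38000"},
    {"brand": "Audi", "mileage_km": "47000"},
    {"brand": "Skoda", "mileage_km": "55000"},
    {"brand": "Lancia", "mileage_km": "44000"},
    {"brand": "Audi", "mileage_km": "990000"},  # implausible value
]


def clean(records, min_count=2, iqr_factor=1.5):
    # 1. Correct variable formats: text -> float, keeping None for missing.
    rows = [
        {**r, "mileage_km": None if r["mileage_km"] is None else float(r["mileage_km"])}
        for r in records
    ]
    # 2. Remove outliers with the interquartile-range rule.
    known = [r["mileage_km"] for r in rows if r["mileage_km"] is not None]
    q1, _, q3 = quantiles(known, n=4)
    low, high = q1 - iqr_factor * (q3 - q1), q3 + iqr_factor * (q3 - q1)
    rows = [r for r in rows if r["mileage_km"] is None or low <= r["mileage_km"] <= high]
    # 3. Impute empty records with the median of the remaining values.
    fill = median(r["mileage_km"] for r in rows if r["mileage_km"] is not None)
    # 4. Limit categories: rare brands are grouped under "Other".
    counts = Counter(r["brand"] for r in rows)
    return [
        {
            "brand": r["brand"] if counts[r["brand"]] >= min_count else "Other",
            "mileage_km": fill if r["mileage_km"] is None else r["mileage_km"],
        }
        for r in rows
    ]


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

The order of the steps is a design choice: outliers are removed before imputation so that an implausible value does not distort the median used to fill the gaps.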
Apart from information that is available straight away, we wanted to enrich our dataset with additional variables. That’s why we used several kinds of aggregation, interactions, indicator variables and text mining techniques. Only then could we be sure we were maximizing the benefits of the data sources available to us.
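The kinds of derived variables mentioned above can be sketched as follows. This is an illustrative example under stated assumptions: the visit records, the field names, and the complaint keywords are invented, and the keyword match stands in for real text mining only in the crudest sense.

```python
from collections import defaultdict

# Hypothetical service visits with a free-text note.
visits = [
    {"customer_id": "C001", "cost": 120.0, "note": "Oil change"},
    {"customer_id": "C001", "cost": 480.0, "note": "Brake complaint, customer unhappy"},
    {"customer_id": "C002", "cost": 90.0, "note": "Tyre swap"},
]


def build_features(visits, complaint_words=("complaint", "unhappy")):
    grouped = defaultdict(list)
    for visit in visits:
        grouped[visit["customer_id"]].append(visit)

    features = {}
    for customer_id, rows in grouped.items():
        n_visits = len(rows)
        total_spend = sum(r["cost"] for r in rows)
        is_repeat = int(n_visits > 1)
        has_complaint = int(
            any(word in r["note"].lower() for r in rows for word in complaint_words)
        )
        features[customer_id] = {
            "n_visits": n_visits,                          # aggregation
            "total_spend": total_spend,                    # aggregation
            "avg_spend": total_spend / n_visits,           # aggregation
            "is_repeat": is_repeat,                        # indicator variable
            "has_complaint": has_complaint,                # keyword flag from text
            "repeat_x_complaint": is_repeat * has_complaint,  # interaction
        }
    return features


features = build_features(visits)
print(features)
```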
Having chosen our optimal modeling dataset, we next needed to specify the features we would use in our models. There are several ways to go about this, depending on the type of model being considered. For instance, you can employ stepwise selection, LASSO and feature importance to shrink the final set of features. However, we could also have tackled the problem with a more traditional approach, such as correlation analysis, and specified a desired subset of variables beforehand.
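To make the correlation-based option concrete, here is a small sketch that keeps only the features whose absolute Pearson correlation with the churn label reaches a threshold. The feature names, the sample values, and the 0.3 threshold are assumptions made for illustration; they are not taken from the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def select_by_correlation(columns, target, threshold=0.3):
    """Return feature names with |r| >= threshold, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return sorted(kept, key=lambda name: abs(scores[name]), reverse=True)


# Hypothetical candidate features for eight customers.
columns = {
    "months_since_last_visit": [1, 3, 14, 2, 18, 20, 4, 16],
    "n_visits": [6, 5, 1, 4, 2, 1, 5, 2],
    "plate_number_is_odd": [0, 1, 0, 1, 0, 1, 0, 1],
}
churned = [0, 0, 1, 0, 1, 1, 0, 1]

selected = select_by_correlation(columns, churned)
print(selected)
```

A filter like this only captures linear, one-variable-at-a-time relationships, which is one reason model-based techniques such as LASSO or feature importance are often used alongside it.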
They say that data scientists spend 80% of their workday preparing data, a claim that has proven true numerous times in our own projects. After preparing the data, we were ready to start modeling. For the present project, we tried out four different models to tackle our problem: logistic regression, gradient boosted decision trees, random forest and support vector machines. Each has different characteristics, mostly concerning computational complexity and interpretability.
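A comparison of the four model families might look like the sketch below. It assumes scikit-learn is installed and uses a synthetic dataset from `make_classification` in place of the real customer data; the hyperparameters are library defaults rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the modeling dataset (assumption, not project data).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

models = {
    # Scaling matters for logistic regression and SVM, less so for tree ensembles.
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Mean accuracy over 5 cross-validation folds for each model.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(cv_accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} {score:.3f}")
```

Evaluating every candidate with the same cross-validation folds keeps the comparison fair and makes the logistic regression score a usable baseline for the more complex models.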
We compared the models using several indicators, starting with model accuracy (the proportion of correct predictions among all predictions) but also taking into consideration sensitivity (the share of actual churners the model identifies) and precision (the share of churn indications that prove correct). Ultimately, we were able to achieve around 85% accuracy. So, out of 100 predictions, around 85 proved correct.
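The three indicators are easy to confuse, so here is a short sketch that computes them from predicted and actual labels. The labels are invented for illustration and do not reproduce the project's results.

```python
def churn_metrics(actual: list[int], predicted: list[int]) -> dict[str, float]:
    """Compute accuracy, sensitivity and precision for binary labels (1 = churn)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # churners correctly flagged
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # loyal customers correctly passed
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed churners
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
    }


# Made-up labels for ten customers.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

metrics = churn_metrics(actual, predicted)
print(metrics)
```

In this toy example the model is right for 7 of 10 customers, catches 3 of 4 actual churners, and is correct in 3 of its 5 churn indications, which shows why accuracy alone can hide how useful the churn list really is.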
Before describing the final parts, I’ll look at several steps in an ML project that are key to its success. Because each problem has unique requirements and circumstances, the list is far from comprehensive. However, each of these steps has featured prominently in every analytical endeavor we have encountered:
Regardless of how experienced your data science team may be, never forget to regularly update your business stakeholders on your findings. Those findings may seem logical and spot-on to you, but you aren’t the final user of the solution, so you may need to shift your approach.
Don’t overlook the crucial step of exploratory data analysis, the health check of your data integration methods and assumptions. Only after confirming that your EDA results are correct can you safely assume that the dataset you’ve prepared is correct and error-free. You can also decide on the subsequent steps to fill out your pipeline.
Kaggle Masters and other leading data scientists agree that preparing a broad set of well-thought-out features is key to success in machine learning. Only then are you able to benefit from the full potential of your data.
First consider using regression, a very popular “benchmark model” which can reveal a great deal about what kind of result to expect in your modeling problem.
When used recklessly, even the most advanced ML methods can prove less accurate than a simple approach. For instance, gradient boosted decision trees will give you better results only if their many parameters are carefully tuned.
Ultimately, we estimated the probability that each of our client’s customers (and their cars) would churn. So what now? There are usually two main use cases for our churn modeling results. The first is fairly straightforward: we generate a single list of customer identification numbers together with the likelihood of their churning. Our client loaded this list into their system in order to make immediate use of the modeling results and launch quick marketing campaigns.
The second use case goes back to the RFM analysis and scoring. We not only evaluate the chances of a customer leaving the client, but also try to capture the value the company would lose if that customer left. This enables the client to decide how much to spend in an effort to retain that customer.
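One simple way to combine the two outputs is to multiply the churn probability by the customer's value and rank customers by the result. The sketch below assumes a single monetary `value` per customer derived from scoring; that field, the sample figures, and the idea of treating the product as a rough ceiling on retention spend are illustrative assumptions, not the client's actual rule.

```python
# Hypothetical scored customers: churn probability from the model,
# value from the scoring step.
customers = [
    {"customer_id": "C001", "churn_probability": 0.80, "value": 500.0},
    {"customer_id": "C002", "churn_probability": 0.30, "value": 2000.0},
    {"customer_id": "C003", "churn_probability": 0.90, "value": 100.0},
]


def rank_by_value_at_risk(customers):
    """Attach expected loss (probability * value) and sort, highest first."""
    scored = [
        {**c, "value_at_risk": c["churn_probability"] * c["value"]}
        for c in customers
    ]
    return sorted(scored, key=lambda c: c["value_at_risk"], reverse=True)


ranked = rank_by_value_at_risk(customers)
for c in ranked:
    print(f'{c["customer_id"]}  value at risk: {c["value_at_risk"]:.2f}')
```

Note how the ranking differs from a list sorted by churn probability alone: the customer most likely to leave carries the least value at risk, while a moderately likely churner with high value comes first.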
The information we collected on both churn and score enabled us to prepare a comprehensive managerial dashboard for in-depth analysis of the results. It presents the key factors influencing the model’s decisions and lets users slice the results along multiple dimensions to tailor the analysis. Thanks to the scoring analysis, our client could focus on specific customer segments and move ahead with appropriate anti-churn measures. You can find a sample dashboard below.