Continuing our journey from the previous post, where we defined the problem of churn prediction, in this installment we will create the model in Azure Machine Learning. We are trying to predict the likelihood of a customer churning based on certain features in the profile stored in the Telecom Customer entity. We will use a technique called Supervised Learning, where the model is first trained on our data so that it can learn the underlying trends before it starts giving us insights.
Obviously, you need access to Azure Machine Learning. Once you log in, you can create a new Experiment. That gives you a workspace designer and a toolbox (somewhat like SSIS/BizTalk) where you can drag and drop modules and feed them into each other. It is a flexible model, and for most tasks you do not need to write code.
Below is a screenshot of my experiment, with the toolbox on the left.
Now, machine learning is slightly atypical territory for a usual CRM audience. I cannot fit the full details of each of these tools into this blog, but I will touch on each step so that you can understand, at a high level, what is going on inside these boxes. Let us address them one by one.
Dynamics CRM 2016 Telecom
This module is the input data module, where we read the CRM customer information in the form of a dataset. At the time of writing this blog, there is no direct connector available from Azure Machine Learning to Dynamics CRM Online. But where there is a will, there is a way, i.e. I discovered that you can connect to Dynamics CRM using either of the following approaches:
- You schedule a daily export of Dynamics CRM data into a location that Azure Machine Learning can read, e.g. Azure Blob storage or a web URL over HTTP.
- You write a small Python-based module that connects to Dynamics CRM, authenticating through Azure Active Directory, and passes the data to Azure ML as a DataFrame (see the sketch after this list).
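The sketch below illustrates the second approach: authenticate against Azure Active Directory, query the Dynamics CRM Web API and return the records as a pandas DataFrame. The organisation URL, tenant, client ID and entity name are placeholders of my own, not values from this experiment, so treat it as a starting point rather than a finished module.

```python
# Minimal sketch: pull records from the Dynamics CRM Web API into a DataFrame.
# All URLs, IDs and the entity name below are placeholders (assumptions).
import adal              # Azure Active Directory Authentication Library
import pandas as pd
import requests

ORG_URL   = "https://yourorg.crm.dynamics.com"               # placeholder
AUTHORITY = "https://login.microsoftonline.com/yourtenant"   # placeholder
CLIENT_ID = "<app-registration-client-id>"                   # placeholder

def fetch_crm_records(username, password, entity="contacts", top=5000):
    """Authenticate via Azure AD and read CRM records over the Web API."""
    context = adal.AuthenticationContext(AUTHORITY)
    token = context.acquire_token_with_username_password(
        ORG_URL, username, password, CLIENT_ID)

    headers = {
        "Authorization": "Bearer " + token["accessToken"],
        "OData-MaxVersion": "4.0",
        "OData-Version": "4.0",
        "Accept": "application/json",
    }
    response = requests.get(
        "{}/api/data/v8.2/{}?$top={}".format(ORG_URL, entity, top),
        headers=headers)
    response.raise_for_status()

    # The Web API wraps the returned records in a "value" array.
    return pd.DataFrame(response.json()["value"])
```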
From my experience, an automated sync from Dynamics to Azure ML is not that important, but it is important the other way round, i.e. from Azure ML back to Dynamics.
Split Data
This module splits your data into two sets:
- Training dataset – the data the machine learning model learns from
- Testing dataset – the data used to measure the accuracy of the model
I have chosen a stratified split, which ensures that the testing dataset is balanced with respect to the classes being predicted. The split ratio is 80/20, i.e. 80% of the records will be used for training and 20% for testing.
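In the Azure ML designer this is pure configuration, but for readers who think in code, a roughly equivalent stratified 80/20 split in scikit-learn would look like the sketch below; `features` and `churn_labels` are placeholder names standing in for the Telecom Customer columns and the churn flag.

```python
# Sketch of a stratified 80/20 split (the Split Data module does this via
# configuration). `features` and `churn_labels` are placeholder variables.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features,
    churn_labels,
    test_size=0.20,         # 20% held back for testing
    stratify=churn_labels,  # keep the churn/no-churn ratio in both sets
    random_state=42,
)
```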
Two-class Decision Forest
This is the main classifier, i.e. the module that does the brunt of the work. The classifier of choice here is a random forest with bootstrap aggregation. A two-class classifier makes sense for us because our prediction has two outcomes, i.e. whether the customer will churn or not.
Random forests are fast classifiers and are difficult to overfit. Rather than learning a single decision tree, they build many trees on different random samples of your data (an ensemble), and at the end the scores of the individual trees are combined into an overall prediction score. You can read more about this classifier here.
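As a rough stand-in for the Two-class Decision Forest module, the scikit-learn equivalent would be a bagged random forest; the number of trees below is an assumed value, not a setting taken from this experiment.

```python
# Rough scikit-learn stand-in for a two-class decision forest:
# an ensemble of decision trees built with bootstrap aggregation (bagging).
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,  # number of trees in the ensemble (assumed value)
    bootstrap=True,    # each tree is trained on a bootstrap sample
    random_state=42,
)
```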
Train Model
This module connects the classifier to the data. As you can see in the screenshot of the experiment posted above, there are two arrows coming out of Split Data. The one on the left is the 80% output, i.e. the training dataset. The output of this module is a trained model that is ready to make predictions.
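Continuing the sketch above, training simply means fitting the forest to the 80% training portion produced by the split:

```python
# Train Model equivalent: fit the forest on the training split from earlier.
trained_model = forest.fit(X_train, y_train)
```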
Score Model
This step uses the trained model from the previous step and tests it against our test data. Put simply, here we start feeding the model data it has not seen before and count how many times it gives the correct prediction versus the wrong prediction.
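In the code sketch, scoring amounts to predicting on the held-back 20% and counting hits and misses:

```python
# Score Model equivalent: predict on unseen records and count hits vs misses.
predictions = trained_model.predict(X_test)
hits = int((predictions == y_test).sum())
misses = int((predictions != y_test).sum())
print("correct:", hits, "wrong:", misses)
```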
Evaluate Model
The scores (hits vs misses) generated by the previous module are evaluated in this step. In data science there are standard metrics to measure this kind of performance, e.g. confusion matrices, ROC curves and many more. Below is a screenshot of the confusion matrix.
I know there is a lot of confusing detail here (hence the name confusion matrix), but as a rule of thumb we focus on AUC, i.e. the area under the ROC curve. As shown in the results above, we have a decent 72.9% area under the curve, which in layman's terms indicates how well the model separates churners from non-churners. A higher percentage does not necessarily equate to a better model; a very high figure (e.g. 90%) can be a sign of overfitting, i.e. a state where your model does very well on the sample data but not so well on real-world data. So our model is good to go.
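For completeness, the same metrics in the code sketch would be computed along these lines (again using the placeholder variables from earlier):

```python
# Evaluate Model equivalent: confusion matrix and area under the ROC curve.
from sklearn.metrics import confusion_matrix, roc_auc_score

churn_probability = trained_model.predict_proba(X_test)[:, 1]  # P(churn)
print(confusion_matrix(y_test, predictions))
print("AUC:", roc_auc_score(y_test, churn_probability))
```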
You can read more about the metrics and terms above here.