Your Source for Data Science

Churn Prediction with Apache Spark Machine Learning

Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a subscription to a service. Though originally used within the telecommunications industry, it has become common practice across banks, ISPs, insurance firms, and other verticals.


The prediction process is heavily data-driven and often utilizes advanced machine learning techniques. In this post, we’ll take a look at what types of customer data are typically used, do some preliminary analysis of the data, and generate churn prediction models–all with Spark and its machine learning frameworks.

Customer 360 Using data science in order to better understand and predict customer behavior is an iterative process, which involves:

  1. Data discovery and model creation:
    • Analysis of historical data
    • Identifying new data sources, which traditional analytics or databases are not using, due to the format, size, or structure
    • Collecting, correlating, and analyzing data across multiple data sources
    • Knowing and applying the right kind of machine learning algorithms to get value out of the data
  2. Using the model in production to make predictions
  3. Data discovery and updating the model with new data


In order to understand the customer, a number of factors can be analyzed, such as:

  • Customer demographic data (age, marital status, etc.)
  • Sentiment analysis of social media
  • Customer usage patterns and geographical usage trends
  • Calling-circle data
  • Browsing behavior from clickstream logs
  • Support call center statistics
  • Historical data that show patterns of behavior that suggest churn

With this analysis, telecom companies can gain insights to predict and enhance the customer experience, prevent churn, and tailor marketing campaigns.


Classification is a family of supervised machine learning algorithms that identify which category an item belongs to (e.g., whether a transaction is fraud or not fraud), based on labeled examples of known items (e.g., transactions known to be fraud or not). Classification takes a set of data with known labels and pre-determined features and learns how to label new records based on that information. Features are the “if questions” that you ask. The label is the answer to those questions. In the example below, if it walks, swims, and quacks like a duck, then the label is “duck.”


Let’s go through an example of telecom customer churn:

  • What are we trying to predict?
    • Whether a customer has a high probability of unsubscribing from the service or not
    • Churn is the Label: True or False
  • What are the “if questions” or properties that you can use to make predictions?
    • Call statistics, customer service calls, etc.
    • To build a classifier model, you extract the features of interest that most contribute to the classification.
Source Continue Reading