
1. Business Problem:

We are looking at a subscription-model business for a bank, where subscriptions are the primary source of revenue. Such companies try to minimize customer churn (subscription cancellation) to retain their customers. In this business problem, we will look into customer behavior patterns to understand disengagement with the product.

1.1 Problem statement

The objective is to identify which users are likely to cancel the subscription, so that we can focus on re-engaging them with the product. This could be a generic e-mail explaining the benefits of the product and focusing on user value. We need to be careful to contact only users who will stay after the reminder, not those who would leave because of it.

1.2 Source

The data is shared for learning purposes and does not represent real customers in any form.

1.3 Business objective / Constraints

  • The cost of misclassification (churn vs. not churn) is high
  • No strict latency requirements
  • Interpretability is important for Marketing / Product Managers

2. Machine Learning Problem

We would like to frame this business problem as a machine learning problem.

2.1 Type of Machine Learning Problem

It is a classic classification problem - to be precise, a case of binary classification (churn vs. not churn).

2.2 Performance metrics

We would like to measure both how accurate and how precise our model is. Mainly, we will use the F1 score as the metric.
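As a quick illustration on toy labels (not the project data), the F1 score is the harmonic mean of precision and recall; the result matches `sklearn.metrics.f1_score`:

```python
# Minimal illustration of precision, recall and F1 on toy labels.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Because F1 balances precision and recall, it is a better fit than plain accuracy when, as here, both kinds of misclassification are costly.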

2.3 Data

#### Importing Libraries ####

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

dataset = pd.read_csv('churn_data.csv') 
dataset.head()
[Output: first five rows of the dataset - 5 rows × 31 columns, from user and churn through reward_rate and is_referred]

2.3.1 Data Column details

  • userid - MongoDB userid
  • churn - Active = No; Suspended < 30 days = No; else = Yes
  • age - age of the customer
  • city - city of the customer
  • state - state where the customer lives
  • postal_code - zip code of the customer
  • zodiac_sign - zodiac sign of the customer
  • rent_or_own - does the customer rent or own a house
  • more_than_one_mobile_device - does the customer use more than one mobile device
  • payFreq - pay frequency of the customer
  • in_collections - is the customer in collections
  • loan_pending - is the loan pending
  • withdrawn_application - has the customer withdrawn the loan application
  • paid_off_loan - has the customer paid off the loan
  • did_not_accept_funding - customer did not accept funding
  • cash_back_engagement - sum of cash back dollars received by a customer / number of days in the app
  • cash_back_amount - sum of cash back dollars received by a customer
  • used_ios - has the user used an iPhone
  • used_android - has the user used an Android-based phone
  • has_used_mobile_and_web - has the user used both mobile and web platforms
  • has_used_web - has the user used the MoneyLion web app
  • has_used_mobile - has the user used the MoneyLion mobile app
  • has_reffered - has the user referred anyone
  • cards_clicked - how many times a user has clicked the cards
  • cards_not_helpful - how many times a user marked the cards as not helpful
  • cards_helpful - how many times a user marked the cards as helpful
  • cards_viewed - how many times a user viewed the cards
  • cards_share - how many times a user shared their cards
  • trivia_view_results - how many times a user viewed trivia results
  • trivia_view_unlocked - how many times a user viewed the trivia unlocked screen
  • trivia_view_locked - how many times a user viewed the trivia locked screen
  • trivia_shared_results - how many times a user shared trivia results
  • trivia_played - how many times a user played trivia
  • re_linked_account - has the user re-linked an account
  • un_linked_account - has the user un-linked an account
  • credit_score - customer's credit score
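The churn rule above can be sketched as a small function. This is purely illustrative: `status` and `days_suspended` are hypothetical names, not columns of this dataset, which already ships with the churn label computed.

```python
# Hypothetical sketch of the churn rule: Active -> No,
# Suspended for fewer than 30 days -> No, anything else -> Yes.
def churn_label(status, days_suspended=0):
    if status == 'Active':
        return 'No'
    if status == 'Suspended' and days_suspended < 30:
        return 'No'
    return 'Yes'

print(churn_label('Active'))         # No
print(churn_label('Suspended', 10))  # No
print(churn_label('Suspended', 45))  # Yes
```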

3. Exploratory Data Analysis (EDA)

# Viewing the Data

dataset.head(5)
[Output: first five rows of the dataset - 5 rows × 31 columns]

## Data columns
dataset.columns
Index(['user', 'churn', 'age', 'housing', 'credit_score', 'deposits',
       'withdrawal', 'purchases_partners', 'purchases', 'cc_taken',
       'cc_recommended', 'cc_disliked', 'cc_liked', 'cc_application_begin',
       'app_downloaded', 'web_user', 'app_web_user', 'ios_user',
       'android_user', 'registered_phones', 'payment_type', 'waiting_4_loan',
       'cancelled_loan', 'received_loan', 'rejected_loan', 'zodiac_sign',
       'left_for_two_month_plus', 'left_for_one_month', 'rewards_earned',
       'reward_rate', 'is_referred'],
      dtype='object')
## Data describe
dataset.describe() 
[Output: summary statistics (count, mean, std, min, quartiles, max) for the numeric columns - 8 rows × 28 columns; note credit_score has 18969 non-null values and rewards_earned 23773, out of 27000 rows]

# Cleaning Data
# Inspect, then keep only rows with a plausible credit score (>= 300);
# this also drops rows where credit_score is NaN
dataset[dataset.credit_score < 300]
dataset = dataset[dataset.credit_score >= 300]
# Removing NaN: inspect the missing values, then drop the two columns with many NaNs
dataset.isna().any()
dataset.isna().sum()
dataset = dataset.drop(columns = ['credit_score', 'rewards_earned'])
## Histograms
dataset2 = dataset.drop(columns = ['user', 'churn'])
fig = plt.figure(figsize=(15, 12))
plt.suptitle('Histograms of Numerical Columns', fontsize=20)
for i in range(1, dataset2.shape[1] + 1):
    plt.subplot(6, 5, i)
    f = plt.gca()
    f.axes.get_yaxis().set_visible(False)
    f.set_title(dataset2.columns.values[i - 1])

    vals = np.size(dataset2.iloc[:, i - 1].unique())
    
    plt.hist(dataset2.iloc[:, i - 1], bins=vals)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])

[figure: Histograms of Numerical Columns]

# Binary / categorical columns (is_referred listed once)
dataset2 = dataset[['housing', 'is_referred', 'app_downloaded',
                    'web_user', 'app_web_user', 'ios_user',
                    'android_user', 'registered_phones', 'payment_type',
                    'waiting_4_loan', 'cancelled_loan',
                    'received_loan', 'rejected_loan', 'zodiac_sign',
                    'left_for_two_month_plus', 'left_for_one_month']]
fig = plt.figure(figsize=(15, 12))
plt.suptitle('Pie Chart Distributions', fontsize=20)
for i in range(1, dataset2.shape[1] + 1):
    plt.subplot(6, 3, i)
    f = plt.gca()
    f.axes.get_yaxis().set_visible(False)
    f.set_title(dataset2.columns.values[i - 1])
   
    values = dataset2.iloc[:, i - 1].value_counts(normalize = True).values
    index = dataset2.iloc[:, i - 1].value_counts(normalize = True).index
    plt.pie(values, labels = index, autopct='%1.1f%%')
    plt.axis('equal')
fig.tight_layout(rect=[0, 0.03, 1, 0.95])

[figure: Pie Chart Distributions]

# Uneven features
df1= (dataset[dataset2.waiting_4_loan == 1].churn.value_counts()).to_frame()
df1["Name"] = 'waiting_4_loan'

df2= (dataset[dataset2.cancelled_loan == 1].churn.value_counts()).to_frame()
df2["Name"] = 'cancelled_loan'

df3= (dataset[dataset2.received_loan == 1].churn.value_counts()).to_frame()
df3["Name"] = 'received_loan'


df4= (dataset[dataset2.rejected_loan == 1].churn.value_counts()).to_frame()
df4["Name"] = 'rejected_loan'

df5= (dataset[dataset2.left_for_one_month == 1].churn.value_counts()).to_frame()
df5["Name"] = 'left_for_one_month'

df = pd.concat([df1, df2, df3, df4, df5])
df.reset_index(level=0, inplace=True)

df.columns = ['Churn_index', 'Churn_Count', 'Features']
## Visualize Uneven Features
# factorplot creates its own figure, so no extra plt.figure() is needed
g = sns.factorplot(x = 'Features', y = 'Churn_Count', hue = 'Churn_index', data = df, kind = 'bar')
g.set_xticklabels(rotation=90)
plt.show()

[figure: churn counts for the uneven features]

## Correlation with Response Variable
dataset.drop(columns = ['housing', 'payment_type',
                         'churn', 'zodiac_sign']
    ).corrwith(dataset.churn).plot.bar(figsize=(20,10),
              title = 'Correlation with Response variable',
              fontsize = 15, rot = 45,
              color = sns.color_palette("muted"),                          
              grid = True)

[figure: Correlation with Response variable]

# Correlation Matrix
sns.set(style="white")

# Compute the correlation matrix
corr = dataset.drop(columns = ['user', 'churn']).corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 15))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(240, 10,n =9,  sep=20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})


[figure: masked correlation heatmap of the features]

# Removing Correlated Fields
dataset = dataset.drop(columns = ['app_web_user'])

Note: Although some fields are somewhat correlated, they are not collinear. These features are not functions of each other, so they won't break the model, but they won't help much either; feature selection should remove them.
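One simple way to do that feature selection is correlation-threshold pruning: drop one feature from each highly correlated pair. The notebook does this by hand for app_web_user; below is a generic sketch on toy data (the 0.9 threshold is an illustrative choice, not from the notebook):

```python
# Sketch: drop one feature from each pair with |correlation| above a threshold.
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(drop_correlated(toy).columns.tolist())  # ['a', 'c'] - 'b' duplicates 'a'
```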


4. Machine Learning

4.1 Data

Importing Libraries

#### Importing Libraries ####

import pandas as pd
import numpy as np
import random
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

Data Preparation

## Data Preparation
user_identifier = dataset['user']
dataset = dataset.drop(columns = ['user'])
dataset.housing.value_counts()
R     9221
na    7910
O     1834
Name: housing, dtype: int64
dataset.groupby('housing')['churn'].nunique().reset_index()
  housing  churn
0       O      2
1       R      2
2      na      2
dataset = pd.get_dummies(dataset)
dataset.columns
Index(['churn', 'age', 'deposits', 'withdrawal', 'purchases_partners',
       'purchases', 'cc_taken', 'cc_recommended', 'cc_disliked', 'cc_liked',
       'cc_application_begin', 'app_downloaded', 'web_user', 'ios_user',
       'android_user', 'registered_phones', 'waiting_4_loan', 'cancelled_loan',
       'received_loan', 'rejected_loan', 'left_for_two_month_plus',
       'left_for_one_month', 'reward_rate', 'is_referred', 'housing_O',
       'housing_R', 'housing_na', 'payment_type_Bi-Weekly',
       'payment_type_Monthly', 'payment_type_Semi-Monthly',
       'payment_type_Weekly', 'payment_type_na', 'zodiac_sign_Aquarius',
       'zodiac_sign_Aries', 'zodiac_sign_Cancer', 'zodiac_sign_Capricorn',
       'zodiac_sign_Gemini', 'zodiac_sign_Leo', 'zodiac_sign_Libra',
       'zodiac_sign_Pisces', 'zodiac_sign_Sagittarius', 'zodiac_sign_Scorpio',
       'zodiac_sign_Taurus', 'zodiac_sign_Virgo', 'zodiac_sign_na'],
      dtype='object')
dataset = dataset.drop(columns = ['housing_na', 'zodiac_sign_na', 'payment_type_na'])
dataset.columns
Index(['churn', 'age', 'deposits', 'withdrawal', 'purchases_partners',
       'purchases', 'cc_taken', 'cc_recommended', 'cc_disliked', 'cc_liked',
       'cc_application_begin', 'app_downloaded', 'web_user', 'ios_user',
       'android_user', 'registered_phones', 'waiting_4_loan', 'cancelled_loan',
       'received_loan', 'rejected_loan', 'left_for_two_month_plus',
       'left_for_one_month', 'reward_rate', 'is_referred', 'housing_O',
       'housing_R', 'payment_type_Bi-Weekly', 'payment_type_Monthly',
       'payment_type_Semi-Monthly', 'payment_type_Weekly',
       'zodiac_sign_Aquarius', 'zodiac_sign_Aries', 'zodiac_sign_Cancer',
       'zodiac_sign_Capricorn', 'zodiac_sign_Gemini', 'zodiac_sign_Leo',
       'zodiac_sign_Libra', 'zodiac_sign_Pisces', 'zodiac_sign_Sagittarius',
       'zodiac_sign_Scorpio', 'zodiac_sign_Taurus', 'zodiac_sign_Virgo'],
      dtype='object')
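Dropping the `_na` dummy for each categorical leaves 'na' as the implicit baseline and avoids the dummy-variable trap (perfectly collinear one-hot columns). A sketch on toy data, not the notebook's exact call:

```python
# Sketch: one-hot encode, then drop the 'na' dummy so it becomes the
# implicit baseline - mirroring the drop of the *_na columns above.
import pandas as pd

toy = pd.DataFrame({'housing': ['R', 'O', 'na', 'R']})
dummies = pd.get_dummies(toy, columns=['housing']).drop(columns=['housing_na'])
print(dummies.columns.tolist())  # ['housing_O', 'housing_R']
```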

Training and Test Set Split

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset.drop(columns = 'churn'), dataset['churn'],
                                                    test_size = 0.2,
                                                    random_state = 0)


# Balancing the Training Set
y_train.value_counts()
0    8934
1    6238
Name: churn, dtype: int64
pos_index = y_train[y_train.values == 1].index
neg_index = y_train[y_train.values == 0].index
if len(pos_index) > len(neg_index):
    higher = pos_index
    lower = neg_index
else:
    higher = neg_index
    lower = pos_index

np.random.seed(0)  # seed NumPy's RNG, since np.random.choice is used below
higher = np.random.choice(higher, size=len(lower), replace=False)  # downsample without replacement
lower = np.asarray(lower)

new_indexes = np.concatenate((lower, higher))

X_train = X_train.loc[new_indexes,]
y_train = y_train[new_indexes]
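The same majority-class downsampling can also be expressed with `sklearn.utils.resample`; an alternative sketch on toy data, not what the notebook runs:

```python
# Alternative sketch of the class balancing above using sklearn.utils.resample.
import pandas as pd
from sklearn.utils import resample

toy = pd.DataFrame({'x': range(10), 'churn': [0] * 7 + [1] * 3})
majority = toy[toy.churn == 0]
minority = toy[toy.churn == 1]
# downsample the majority class to the minority class size, without replacement
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
balanced = pd.concat([majority_down, minority])
print(balanced.churn.value_counts().to_dict())  # both classes now have 3 rows
```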

Feature Scaling

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train2 = pd.DataFrame(sc_X.fit_transform(X_train))
X_test2 = pd.DataFrame(sc_X.transform(X_test))
X_train2.columns = X_train.columns.values
X_test2.columns = X_test.columns.values
X_train2.index = X_train.index.values
X_test2.index = X_test.index.values
X_train = X_train2
X_test = X_test2

Generic function to run Grid Search with multiple classifiers

from sklearn.metrics import confusion_matrix
import itertools
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = 'DejaVu Sans'

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
from datetime import datetime
def perform_model(model, X_train, y_train, X_test, y_test, class_labels, cm_normalize=True, \
                 print_cm=True, cm_cmap=plt.cm.Greens):
    
    
    # to store results at various phases
    results = dict()
    
    # time at which model starts training 
    train_start_time = datetime.now()
    print('training the model..')
    model.fit(X_train, y_train)
    print('Done \n \n')
    train_end_time = datetime.now()
    results['training_time'] =  train_end_time - train_start_time
    print('training_time(HH:MM:SS.ms) - {}\n\n'.format(results['training_time']))
    
    
    # predict test data
    print('Predicting test data')
    test_start_time = datetime.now()
    y_pred = model.predict(X_test)
    test_end_time = datetime.now()
    print('Done \n \n')
    results['testing_time'] = test_end_time - test_start_time
    print('testing time(HH:MM:SS:ms) - {}\n\n'.format(results['testing_time']))
    results['predicted'] = y_pred
   

    # calculate overall accuracy of the model
    accuracy = metrics.accuracy_score(y_true=y_test, y_pred=y_pred)
    # store accuracy in results
    results['accuracy'] = accuracy
    print('---------------------')
    print('|      Accuracy      |')
    print('---------------------')
    print('\n    {}\n\n'.format(accuracy))
    
    
    # calculate f1-score of the model
    f1_score = metrics.f1_score(y_true=y_test, y_pred=y_pred)
    # store f1-score in results
    results['f1_score'] = f1_score
    print('---------------------')
    print('|      f1_score      |')
    print('---------------------')
    print('\n    {}\n\n'.format(f1_score))
    
    
    # confusion matrix
    cm = metrics.confusion_matrix(y_test, y_pred)
    results['confusion_matrix'] = cm
    if print_cm: 
        print('--------------------')
        print('| Confusion Matrix |')
        print('--------------------')
        print('\n {}'.format(cm))
        
    # plot confusion matrix
    plt.figure(figsize=(9,9))
    plt.grid(b=False)
    plot_confusion_matrix(cm, classes=class_labels, normalize=True, title='Normalized confusion matrix', cmap = cm_cmap)
    plt.show()
    
    # get classification report
    print('-------------------------')
    print('| Classification Report |')
    print('-------------------------')
    classification_report = metrics.classification_report(y_test, y_pred)
    # store report in results
    results['classification_report'] = classification_report
    print(classification_report)
    
    # add the trained  model to the results
    results['model'] = model
    
    return results
    
    
def print_grid_search_attributes(model):
    # Estimator that gave highest score among all the estimators formed in GridSearch
    print('--------------------------')
    print('|      Best Estimator     |')
    print('--------------------------')
    print('\n\t{}\n'.format(model.best_estimator_))


    # parameters that gave best results while performing grid search
    print('--------------------------')
    print('|     Best parameters     |')
    print('--------------------------')
    print('\tParameters of best estimator : \n\n\t{}\n'.format(model.best_params_))


    #  number of cross validation splits
    print('---------------------------------')
    print('|   No of CrossValidation sets   |')
    print('---------------------------------')
    print('\n\tTotal number of cross validation sets: {}\n'.format(model.n_splits_))


    # Average cross validated score of the best estimator, from the Grid Search 
    print('--------------------------')
    print('|        Best Score       |')
    print('--------------------------')
    print('\n\tAverage cross-validated score of the best estimator : \n\n\t{}\n'.format(model.best_score_))

from sklearn import linear_model
from sklearn import metrics

from sklearn.model_selection import GridSearchCV
labels = ['No', 'Yes']

Logistic Regression

# start Grid search
import warnings
warnings.filterwarnings('ignore')

parameters = {'C':[0.01, 0.1, 1, 10, 20, 30], 'penalty':['l2','l1']}
log_reg = linear_model.LogisticRegression()
log_reg_grid = GridSearchCV(log_reg, param_grid=parameters, cv=3, verbose=1, n_jobs=-1)
log_reg_grid_results =  perform_model(log_reg_grid, X_train, y_train, X_test, y_test, class_labels=labels)
training the model..
Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:    7.3s finished


Done 
 

training_time(HH:MM:SS.ms) - 0:00:07.655025


Predicting test data
Done 
 

testing time(HH:MM:SS:ms) - 0:00:00.001999


---------------------
|      Accuracy      |
---------------------

    0.6430266279989454


---------------------
|      f1_score      |
---------------------

    0.6414194915254238


--------------------
| Confusion Matrix |
--------------------

 [[1228  999]
 [ 355 1211]]

[figure: normalized confusion matrix]

-------------------------
| Classification Report |
-------------------------
              precision    recall  f1-score   support

           0       0.78      0.55      0.64      2227
           1       0.55      0.77      0.64      1566

    accuracy                           0.64      3793
   macro avg       0.66      0.66      0.64      3793
weighted avg       0.68      0.64      0.64      3793
# observe the attributes of the model 
print_grid_search_attributes(log_reg_grid_results['model'])
--------------------------
|      Best Estimator     |
--------------------------

	LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

--------------------------
|     Best parameters     |
--------------------------
	Parameters of best estimator : 

	{'C': 0.1, 'penalty': 'l1'}

---------------------------------
|   No of CrossValidation sets   |
---------------------------------

	Total number of cross validation sets: 3

--------------------------
|        Best Score       |
--------------------------

	Average cross-validated score of the best estimator : 

	0.6550176338570054

Linear SVC

from sklearn.svm import LinearSVC
parameters = {'C':[0.125, 0.5, 1, 2, 8, 16]}
lr_svc = LinearSVC()
lr_svc_grid = GridSearchCV(lr_svc, param_grid=parameters, n_jobs=-1, verbose=1)
lr_svc_grid_results = perform_model(lr_svc_grid, X_train, y_train, X_test, y_test, class_labels=labels)
training the model..
Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:   13.1s finished


Done 
 

training_time(HH:MM:SS.ms) - 0:00:16.869027


Predicting test data
Done 
 

testing time(HH:MM:SS:ms) - 0:00:00.000971


---------------------
|      Accuracy      |
---------------------

    0.6353809649354073


---------------------
|      f1_score      |
---------------------

    0.6349960411718131


--------------------
| Confusion Matrix |
--------------------

 [[1207 1020]
 [ 363 1203]]

[figure: normalized confusion matrix]

-------------------------
| Classification Report |
-------------------------
              precision    recall  f1-score   support

           0       0.77      0.54      0.64      2227
           1       0.54      0.77      0.63      1566

    accuracy                           0.64      3793
   macro avg       0.65      0.66      0.64      3793
weighted avg       0.67      0.64      0.64      3793
print_grid_search_attributes(lr_svc_grid_results['model'])
--------------------------
|      Best Estimator     |
--------------------------

	LinearSVC(C=0.125, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

--------------------------
|     Best parameters     |
--------------------------
	Parameters of best estimator : 

	{'C': 0.125}

---------------------------------
|   No of CrossValidation sets   |
---------------------------------

	Total number of cross validation sets: 3

--------------------------
|        Best Score       |
--------------------------

	Average cross-validated score of the best estimator : 

	0.6533344020519397

Kernel SVM with Grid Search

from sklearn.svm import SVC
parameters = {'C':[2,8,16],\
              'gamma': [ 0.0078125, 0.125, 2]}
rbf_svm = SVC(kernel='rbf')
rbf_svm_grid = GridSearchCV(rbf_svm,param_grid=parameters, n_jobs=-1)
rbf_svm_grid_results = perform_model(rbf_svm_grid, X_train, y_train, X_test, y_test, class_labels=labels)
training the model..
Done 
 

training_time(HH:MM:SS.ms) - 0:01:44.801175


Predicting test data
Done 
 

testing time(HH:MM:SS:ms) - 0:00:02.137951


---------------------
|      Accuracy      |
---------------------

    0.4932770893751648


---------------------
|      f1_score      |
---------------------

    0.6022350993377484


--------------------
| Confusion Matrix |
--------------------

 [[ 416 1811]
 [ 111 1455]]

[figure: normalized confusion matrix]

-------------------------
| Classification Report |
-------------------------
              precision    recall  f1-score   support

           0       0.79      0.19      0.30      2227
           1       0.45      0.93      0.60      1566

    accuracy                           0.49      3793
   macro avg       0.62      0.56      0.45      3793
weighted avg       0.65      0.49      0.43      3793
print_grid_search_attributes(rbf_svm_grid_results['model'])
--------------------------
|      Best Estimator     |
--------------------------

	SVC(C=8, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=2, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False)

--------------------------
|     Best parameters     |
--------------------------
	Parameters of best estimator : 

	{'C': 8, 'gamma': 2}

---------------------------------
|   No of CrossValidation sets   |
---------------------------------

	Total number of cross validation sets: 3

--------------------------
|        Best Score       |
--------------------------

	Average cross-validated score of the best estimator : 

	0.6999839692209041

Decision Tree with Grid Search

from sklearn.tree import DecisionTreeClassifier
parameters = {'max_depth':np.arange(3,10,2)}
dt = DecisionTreeClassifier()
dt_grid = GridSearchCV(dt,param_grid=parameters, n_jobs=-1)
dt_grid_results = perform_model(dt_grid, X_train, y_train, X_test, y_test, class_labels=labels)
print_grid_search_attributes(dt_grid_results['model'])
training the model..
Done 
 

training_time(HH:MM:SS.ms) - 0:00:00.224999


Predicting test data
Done 
 

testing time(HH:MM:SS:ms) - 0:00:00.002001


---------------------
|      Accuracy      |
---------------------

    0.6867914579488531


---------------------
|      f1_score      |
---------------------

    0.6455847255369928


--------------------
| Confusion Matrix |
--------------------

 [[1523  704]
 [ 484 1082]]

[figure: normalized confusion matrix]

-------------------------
| Classification Report |
-------------------------
              precision    recall  f1-score   support

           0       0.76      0.68      0.72      2227
           1       0.61      0.69      0.65      1566

    accuracy                           0.69      3793
   macro avg       0.68      0.69      0.68      3793
weighted avg       0.70      0.69      0.69      3793

--------------------------
|      Best Estimator     |
--------------------------

	DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

--------------------------
|     Best parameters     |
--------------------------
	Parameters of best estimator : 

	{'max_depth': 9}

---------------------------------
|   No of CrossValidation sets   |
---------------------------------

	Total number of cross validation sets: 3

--------------------------
|        Best Score       |
--------------------------

	Average cross-validated score of the best estimator : 

	0.6878807310035268

Random Forest with Grid Search

from sklearn.ensemble import RandomForestClassifier
params = {'n_estimators': np.arange(10,201,20), 'max_depth':np.arange(3,15,2)}
rfc = RandomForestClassifier()
rfc_grid = GridSearchCV(rfc, param_grid=params, n_jobs=-1)
rfc_grid_results = perform_model(rfc_grid, X_train, y_train, X_test, y_test, class_labels=labels)
print_grid_search_attributes(rfc_grid_results['model'])
training the model..
Done 
 

training_time(HH:MM:SS.ms) - 0:00:34.146021


Predicting test data
Done 
 

testing time(HH:MM:SS:ms) - 0:00:00.063967


---------------------
|      Accuracy      |
---------------------

    0.6925916161349855


---------------------
|      f1_score      |
---------------------

    0.673389355742297


--------------------
| Confusion Matrix |
--------------------

 [[1425  802]
 [ 364 1202]]

[figure: normalized confusion matrix]

-------------------------
| Classification Report |
-------------------------
              precision    recall  f1-score   support

           0       0.80      0.64      0.71      2227
           1       0.60      0.77      0.67      1566

    accuracy                           0.69      3793
   macro avg       0.70      0.70      0.69      3793
weighted avg       0.72      0.69      0.69      3793

--------------------------
|      Best Estimator     |
--------------------------

	RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=13, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=110,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

--------------------------
|     Best parameters     |
--------------------------
	Parameters of best estimator : 

	{'max_depth': 13, 'n_estimators': 110}

---------------------------------
|   No of CrossValidation sets   |
--------------------------------

	Total numbre of cross validation sets: 3

--------------------------
|        Best Score       |
--------------------------

	Average Cross Validate scores of best estimator : 

	0.737976915678102
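Since interpretability is a stated constraint (section 1.3), the tuned Random Forest's `feature_importances_` are a cheap first look at what drives churn predictions. A minimal sketch follows; the notebook's `X_train` is not reproduced here, so a small synthetic frame with a few column names borrowed from the dataset stands in for it (an assumption).

```python
# Sketch: feature importances from a Random Forest with the tuned
# hyperparameters (max_depth=13, n_estimators=110).
# Synthetic data stands in for the notebook's X_train / y_train.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(500, 4),
                 columns=['deposits', 'withdrawal', 'purchases', 'reward_rate'])
# make the target depend mostly on 'deposits' so one importance dominates
y = (X['deposits'] + 0.1 * rng.rand(500) > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=110, max_depth=13, random_state=0)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

On the real model, the same two lines applied to the grid search's best estimator give a ranking that Marketing / Product Managers can act on directly.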
from sklearn.ensemble import GradientBoostingClassifier
param_grid = {'max_depth': np.arange(5, 8, 1),
              'n_estimators': np.arange(130, 170, 10)}
gbdt = GradientBoostingClassifier()
gbdt_grid = GridSearchCV(gbdt, param_grid=param_grid, n_jobs=-1)
gbdt_grid_results = perform_model(gbdt_grid, X_train, y_train, X_test, y_test, class_labels=labels)
print_grid_search_attributes(gbdt_grid_results['model'])
training the model..
Done 
 

training_time(HH:MM:SS.ms) - 0:01:15.668046


Predicting test data
Done 
 

testing time(HH:MM:SS:ms) - 0:00:00.019998


---------------------
|      Accuracy      |
---------------------

    0.7044555760611653


---------------------
|      f1_score      |
---------------------

    0.6779661016949152


--------------------
| Confusion Matrix |
--------------------

 [[1492  735]
 [ 386 1180]]

*(confusion matrix plot)*

-------------------------
| Classification Report |
-------------------------
              precision    recall  f1-score   support

           0       0.79      0.67      0.73      2227
           1       0.62      0.75      0.68      1566

    accuracy                           0.70      3793
   macro avg       0.71      0.71      0.70      3793
weighted avg       0.72      0.70      0.71      3793

--------------------------
|      Best Estimator     |
--------------------------

	GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=7,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=150,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

--------------------------
|     Best parameters     |
--------------------------
	Parameters of best estimator : 

	{'max_depth': 7, 'n_estimators': 150}

---------------------------------
|   No of CrossValidation sets   |
---------------------------------

	Total number of cross validation sets: 3

--------------------------
|        Best Score       |
--------------------------

	Average cross-validation score of best estimator: 

	0.746232766912472
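Grid-searching `n_estimators` refits the whole ensemble for each candidate value. For boosted trees, `staged_predict` scores every intermediate stage from a single fit, which is a cheaper way to probe that axis. A hedged sketch, with synthetic data standing in for the notebook's train/test split:

```python
# Sketch: score every boosting stage from one GBDT fit instead of
# refitting per n_estimators candidate. Synthetic stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(max_depth=7, n_estimators=160, random_state=0)
gbdt.fit(X_tr, y_tr)

# f1 on the held-out split after each stage 1..160
stage_f1 = [f1_score(y_te, pred) for pred in gbdt.staged_predict(X_te)]
best_stage = int(np.argmax(stage_f1)) + 1
print(f'best n_estimators on this split: {best_stage}')
```

In a proper setup the stage scores would come from a validation fold rather than the test set; the test split is used here only to keep the sketch short.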
## Xgboost
from xgboost import XGBClassifier
param_grid = {
        'min_child_weight': [1, 5],
        'gamma': [0.5, 1],
        'subsample': [0.6],
        'colsample_bytree': [0.6, 0.8],
        'max_depth': [3, 4]
        }

xgboost = XGBClassifier(learning_rate=0.02, n_estimators=100, objective='binary:logistic',
                    silent=True, nthread=1)

xgboost_grid = GridSearchCV(xgboost, param_grid=param_grid, n_jobs=-1)
xgboost_grid_results = perform_model(xgboost_grid, X_train, y_train, X_test, y_test, class_labels=labels)
print_grid_search_attributes(xgboost_grid_results['model'])
training the model..
Done 
 

training_time(HH:MM:SS.ms) - 0:00:20.966037


Predicting test data
Done 
 

testing time(HH:MM:SS:ms) - 0:00:00.016998


---------------------
|      Accuracy      |
---------------------

    0.6767730029000791


---------------------
|      f1_score      |
---------------------

    0.6528878822197055


--------------------
| Confusion Matrix |
--------------------

 [[1414  813]
 [ 413 1153]]

*(confusion matrix plot)*

-------------------------
| Classification Report |
-------------------------
              precision    recall  f1-score   support

           0       0.77      0.63      0.70      2227
           1       0.59      0.74      0.65      1566

    accuracy                           0.68      3793
   macro avg       0.68      0.69      0.68      3793
weighted avg       0.70      0.68      0.68      3793

--------------------------
|      Best Estimator     |
--------------------------

	XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=1,
              learning_rate=0.02, max_delta_step=0, max_depth=4,
              min_child_weight=5, missing=None, n_estimators=100, n_jobs=1,
              nthread=1, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=True, subsample=0.6, verbosity=1)

--------------------------
|     Best parameters     |
--------------------------
	Parameters of best estimator : 

	{'colsample_bytree': 0.8, 'gamma': 1, 'max_depth': 4, 'min_child_weight': 5, 'subsample': 0.6}

---------------------------------
|   No of CrossValidation sets   |
---------------------------------

	Total number of cross validation sets: 3

--------------------------
|        Best Score       |
--------------------------

	Average cross-validation score of best estimator: 

	0.6892433472266752
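As a sanity check, the headline metrics can be re-derived by hand from any of the confusion matrices above. Using the Random Forest's `[[1425, 802], [364, 1202]]` (rows = actual not-churn/churn, columns = predicted):

```python
# Re-derive accuracy and churn-class f1 from the Random Forest's
# confusion matrix as printed above.
tn, fp, fn, tp = 1425, 802, 364, 1202

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy = {accuracy:.4f}')    # 0.6926, matching the report
print(f'f1 (churn class) = {f1:.4f}')  # 0.6734, matching the report
```

Note the asymmetry: recall on churners (0.77) is higher than their precision (0.60), so the model catches most churners at the cost of some false alarms, which suits the stated goal of casting a wide re-engagement net.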

Compare Results

print('\n                     Accuracy  F1-score    Error')
print('                 ----------  ---------  --------')
print('Logistic Regression : {:.04}%     {:.04}%    {:.04}%'.format(log_reg_grid_results['accuracy'] * 100,\
                                                           log_reg_grid_results['f1_score'] * 100,\
                                                  100-(log_reg_grid_results['accuracy'] * 100)))

print('Linear SVC          : {:.04}%    {:.04}%     {:.04}% '.format(lr_svc_grid_results['accuracy'] * 100,\
                                                            lr_svc_grid_results['f1_score'] * 100,\
                                                        100-(lr_svc_grid_results['accuracy'] * 100)))

print('rbf SVM classifier  : {:.04}%    {:.04}%    {:.04}% '.format(rbf_svm_grid_results['accuracy'] * 100,\
                                                           rbf_svm_grid_results['f1_score'] * 100,\
                                                          100-(rbf_svm_grid_results['accuracy'] * 100)))

print('DecisionTree        : {:.04}%    {:.04}%    {:.04}% '.format(dt_grid_results['accuracy'] * 100,\
                                                           dt_grid_results['f1_score'] * 100,\
                                                        100-(dt_grid_results['accuracy'] * 100)))

print('Random Forest       : {:.04}%    {:.04}%    {:.04}% '.format(rfc_grid_results['accuracy'] * 100,\
                                                           rfc_grid_results['f1_score'] * 100,\
                                                           100-(rfc_grid_results['accuracy'] * 100)))

print('GradientBoosting DT : {:.04}%    {:.04}%    {:.04}% '.format(gbdt_grid_results['accuracy'] * 100,\
                                                                    gbdt_grid_results['f1_score'] * 100,\
                                                        100-(gbdt_grid_results['accuracy'] * 100)))
print('Xgboost             : {:.04}%    {:.04}%    {:.04}% '.format(xgboost_grid_results['accuracy'] * 100,\
                                                                    xgboost_grid_results['f1_score'] * 100,\
                                                        100-(xgboost_grid_results['accuracy'] * 100)))
                     Accuracy  F1-score    Error
                 ----------  ---------  --------
Logistic Regression : 64.3%     64.14%    35.7%
Linear SVC          : 63.54%    63.5%     36.46% 
rbf SVM classifier  : 49.33%    60.22%    50.67% 
DecisionTree        : 68.68%    64.56%    31.32% 
Random Forest       : 69.26%    67.34%    30.74% 
GradientBoosting DT : 70.45%    67.8%    29.55% 
Xgboost             : 67.68%    65.29%    32.32% 
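The block of near-identical `print` calls above is easy to get out of sync (each model name is paired with its results dict by hand). A less error-prone alternative is to collect the results in one mapping and tabulate with pandas; the sketch below hard-codes the reported scores for illustration, where in the notebook each entry would be the corresponding `*_grid_results` dict.

```python
# Sketch: tabulate model comparison with pandas instead of repeated
# print calls. Scores are the reported values, hard-coded here.
import pandas as pd

results = {
    'Logistic Regression': {'accuracy': 0.6430, 'f1_score': 0.6414},
    'Linear SVC':          {'accuracy': 0.6354, 'f1_score': 0.6350},
    'rbf SVM classifier':  {'accuracy': 0.4933, 'f1_score': 0.6022},
    'DecisionTree':        {'accuracy': 0.6868, 'f1_score': 0.6456},
    'Random Forest':       {'accuracy': 0.6926, 'f1_score': 0.6734},
    'GradientBoosting DT': {'accuracy': 0.7045, 'f1_score': 0.6780},
    'Xgboost':             {'accuracy': 0.6768, 'f1_score': 0.6529},
}
table = pd.DataFrame(results).T.mul(100)
table['Error'] = 100 - table['accuracy']
print(table.round(2))
```

Adding a new model is then a one-line change to the `results` dict rather than another hand-formatted `print`.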

5 Conclusion

Our model indicates which users are likely to churn. The date of the expected churn is purposely left open-ended: the aim is not to create new offers for people who are certain to leave, but to re-engage people who are starting to lose interest in the app.

In a follow-up project, possibly with a different data set, I would like to model the time until churn (survival analysis), which would add a sense of how soon a likely churner is expected to leave.


6 Next Steps

  • Add interpretability: Local Surrogate (LIME) or Individual Conditional Expectation (ICE)
  • Add further features such as session details, session duration, etc.
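On the ICE item above, a minimal sketch of what that would look like with scikit-learn's inspection API; synthetic data stands in for the notebook's feature matrix (an assumption):

```python
# Sketch: Individual Conditional Expectation (ICE) curves via
# sklearn.inspection.partial_dependence with kind='individual'.
# Synthetic data stands in for the notebook's X_train.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# kind='individual' returns one predicted-probability curve per sample
# for the chosen feature, rather than the averaged PDP curve
ice = partial_dependence(rf, X, features=[0], kind='individual')
curves = ice['individual']  # shape: (n_outputs, n_samples, n_grid_points)
print(curves.shape)
```

Unlike a single averaged curve, the per-sample curves reveal whether a feature's effect on churn probability differs across customer segments.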