Customer Churn Prediction Analysis

CIS 9660 - Data Mining for Business Analytics

Authors

Harrison Cabe

Raúl J. Solá Navarro

Samuel Spitzer

Victor Murra Schott

Published

May 15, 2026

1 Introduction

Customer churn refers to customers who cancel or stop using a company’s service. 1 In the telecom industry, losing customers is expensive because bringing in new ones costs significantly more than keeping existing ones. 2 Being able to identify which customers are likely to leave before they actually do gives companies a chance to step in with a targeted offer or intervention.

This project uses the Kaggle Telco Customer Churn dataset 3 to answer the following question: Can we predict whether a telecom customer will churn based on their contract type, monthly charges, and service usage? We build and compare logistic regression models, evaluate their performance using standard classification metrics, and translate the results into business recommendations.

The analysis has two goals. The first is predictive: build a model that correctly classifies customers as likely to churn or stay. The second is inferential: test whether the combination of InternetService type and Contract type has a meaningful effect on churn beyond what each variable contributes on its own and, specifically, whether certain contract-service pairings carry risk that neither variable alone can fully explain.

2 Dataset Description

The dataset comes from Kaggle’s Telco Customer Churn and covers 7,043 customers of a fictional telecommunications company. It includes 21 variables 4 across demographics, subscribed services, and account details, with Churn (Yes/No) as the target variable.

The predictor variables fall into three groups:

  • Demographics: gender, SeniorCitizen, Partner, Dependents
  • Services: PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies
  • Account: Contract, tenure, MonthlyCharges, TotalCharges, PaymentMethod, PaperlessBilling
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, roc_auc_score,
                             confusion_matrix, RocCurveDisplay, make_scorer, recall_score)

URL = "https://raw.githubusercontent.com/RaulSolaNavarro/CIS9660-2026-SPRING/main/data/Telco-Customer-Churn.csv"
df = pd.read_csv(URL)

# ── Cleaning (Sammy) ──────────────────────────────────────────────────────────
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
n_dropped = int(df['TotalCharges'].isnull().sum())
df = df.dropna()

# Collapse "No <service> service" into "No" (Sammy)
df['MultipleLines']     = df['MultipleLines'].replace('No phone service', 'No')
df['OnlineSecurity']    = df['OnlineSecurity'].replace('No internet service', 'No')
df['OnlineBackup']      = df['OnlineBackup'].replace('No internet service', 'No')
df['DeviceProtection']  = df['DeviceProtection'].replace('No internet service', 'No')
df['TechSupport']       = df['TechSupport'].replace('No internet service', 'No')
df['StreamingTV']       = df['StreamingTV'].replace('No internet service', 'No')
df['StreamingMovies']   = df['StreamingMovies'].replace('No internet service', 'No')

# Binary encoding using .map() (Sammy)
df['gender']            = df['gender'].map({'Male': 1, 'Female': 0})
df['Partner']           = df['Partner'].map({'Yes': 1, 'No': 0})
df['Dependents']        = df['Dependents'].map({'Yes': 1, 'No': 0})
df['PhoneService']      = df['PhoneService'].map({'Yes': 1, 'No': 0})
df['PaperlessBilling']  = df['PaperlessBilling'].map({'Yes': 1, 'No': 0})
df['Churn']             = df['Churn'].map({'Yes': 1, 'No': 0})
df['MultipleLines']     = df['MultipleLines'].map({'Yes': 1, 'No': 0})
df['OnlineSecurity']    = df['OnlineSecurity'].map({'Yes': 1, 'No': 0})
df['OnlineBackup']      = df['OnlineBackup'].map({'Yes': 1, 'No': 0})
df['DeviceProtection']  = df['DeviceProtection'].map({'Yes': 1, 'No': 0})
df['TechSupport']       = df['TechSupport'].map({'Yes': 1, 'No': 0})
df['StreamingTV']       = df['StreamingTV'].map({'Yes': 1, 'No': 0})
df['StreamingMovies']   = df['StreamingMovies'].map({'Yes': 1, 'No': 0})

# Fix TotalCharges dtype (Sammy)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# One-hot encode nominal multi-category columns (Sammy)
df = pd.get_dummies(df, columns=['InternetService', 'Contract', 'PaymentMethod'],
                    drop_first=True)

# Convert booleans from get_dummies to int (Sammy)
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)

# Drop customerID
df = df.drop(columns=['customerID'])

# Keep a raw copy for EDA plots that need original category labels
df_raw = pd.read_csv(URL)
df_raw['TotalCharges'] = pd.to_numeric(df_raw['TotalCharges'], errors='coerce')
df_raw = df_raw.dropna()

# ── Inline values ─────────────────────────────────────────────────────────────
n_obs         = len(df)
churn_rate    = df['Churn'].mean()
no_churn_rate = 1 - churn_rate
total_charges_tenure_corr = df['TotalCharges'].corr(df['tenure'])

# EDA values from raw labels
contract_churn  = df_raw.groupby('Contract')['Churn'].apply(
    lambda x: (x == 'Yes').mean())
mtm_churn       = contract_churn['Month-to-month']
two_yr_churn    = contract_churn['Two year']

internet_churn  = df_raw.groupby('InternetService')['Churn'].apply(
    lambda x: (x == 'Yes').mean())
fiber_churn     = internet_churn['Fiber optic']

median_charges_churn    = df_raw[df_raw['Churn']=='Yes']['MonthlyCharges'].median()
median_charges_no_churn = df_raw[df_raw['Churn']=='No']['MonthlyCharges'].median()
median_tenure_churn     = df_raw[df_raw['Churn']=='Yes']['tenure'].median()
median_tenure_no_churn  = df_raw[df_raw['Churn']=='No']['tenure'].median()

tenure_churn_corr  = df['tenure'].corr(df['Churn'])
monthly_churn_corr = df['MonthlyCharges'].corr(df['Churn'])

# Pre-compute feature count: 23 base features (after dropping TotalCharges),
# expanded to 27 in the interaction model (+4 interaction terms)
n_features_base = df.drop(columns=['Churn', 'TotalCharges']).shape[1]  # 23
n_features_model = n_features_base + 4  # 27 with interaction terms

During data cleaning, 11 rows were dropped because TotalCharges was blank for customers with zero tenure (brand new accounts with no billing history yet), leaving a working dataset of 7,032 observations. TotalCharges was also removed from the model because it is almost entirely determined by tenure (correlation of 0.83), which would distort the model’s coefficient estimates. The Churn column was converted to a binary integer (1 = churned, 0 = stayed).

Table 1: Dataset Overview
Variable Value
Observations 7,032
Features 23
Target Churn (Yes/No)
Churn Rate 26.6%

3 Exploratory Data Analysis

Before building any models, we looked at how the key variables relate to churn. The dataset is moderately imbalanced: 73.4% of customers did not churn and 26.6% did.

Code
# Boxplots (Sammy)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(x='Churn', y='MonthlyCharges', data=df_raw, ax=axes[0])
axes[0].set_title('Monthly Charges by Churn')
axes[0].set_xticklabels(['No Churn', 'Churn'])
sns.boxplot(x='Churn', y='tenure', data=df_raw, ax=axes[1])
axes[1].set_title('Tenure by Churn')
axes[1].set_xticklabels(['No Churn', 'Churn'])
plt.tight_layout()
plt.show()

# Bar charts by category (Sammy)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

contract_churn_plot = df_raw.groupby('Contract')['Churn'].value_counts(
    normalize=True).unstack()
contract_churn_plot.plot(kind='bar', ax=axes[0], colormap='Set2')
axes[0].set_title('Churn Rate by Contract Type')
axes[0].set_xlabel('Contract Type')
axes[0].set_ylabel('Proportion')
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=45, ha='right')
axes[0].legend(['No Churn', 'Churn'])

internet_churn_plot = df_raw.groupby('InternetService')['Churn'].value_counts(
    normalize=True).unstack()
internet_churn_plot.plot(kind='bar', ax=axes[1], colormap='Set2')
axes[1].set_title('Churn Rate by Internet Service')
axes[1].set_xlabel('Internet Service')
axes[1].set_ylabel('Proportion')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45, ha='right')
axes[1].legend(['No Churn', 'Churn'])

payment_churn_plot = df_raw.groupby('PaymentMethod')['Churn'].value_counts(
    normalize=True).unstack()
payment_churn_plot.plot(kind='bar', ax=axes[2], colormap='Set2')
axes[2].set_title('Churn Rate by Payment Method')
axes[2].set_xlabel('Payment Method')
axes[2].set_ylabel('Proportion')
axes[2].set_xticklabels(axes[2].get_xticklabels(), rotation=45, ha='right')
axes[2].legend(['No Churn', 'Churn'])

plt.tight_layout()
plt.show()

Exploratory Data Analysis - Key Variables vs Churn

A few patterns stand out. Contract type is the variable most strongly tied to churn: customers on month-to-month contracts churned at a rate of 42.7%, while customers on two-year contracts churned at just 2.8%. Fiber optic internet users also showed a high churn rate of 41.9%, which may point to service quality or pricing issues in that segment. Customers who churned tended to have higher monthly bills (median around $80 vs. $64 for those who stayed) and had been with the company for less time (median tenure of about 10 months vs. 38 months for loyal customers).

Code
# Full correlation heatmap across all encoded features (Sammy)
plt.figure(figsize=(16, 12))
sns.heatmap(df.corr(), annot=True, fmt='.2f',
            cmap='coolwarm', center=0, linewidths=0.5)
plt.title('Correlation Heatmap', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

Correlation Heatmap - All Features (Sammy)

The correlation heatmap 5 shows the pairwise Pearson correlation coefficient 6 between every pair of variables after preprocessing. Among the numeric variables, tenure has the strongest relationship with churn (correlation of -0.35), meaning longer-tenured customers are less likely to leave. MonthlyCharges has a weaker but positive relationship (0.19), meaning higher bills are associated with slightly higher churn. TotalCharges is largely redundant with tenure (correlation of 0.83), which is why it was excluded from the models.

4 Methodology

4.1 Data Preprocessing

Binary variables were encoded directly using .map(), and columns with a redundant “No service” category were collapsed into a simple Yes/No before encoding. Multi-category nominal columns (InternetService, Contract, PaymentMethod) were one-hot encoded using pd.get_dummies() to avoid implying a false ordering between categories. Features were then standardized so that no single variable dominates the model due to its scale. The data was split into 80% for training and 20% for testing, with the split done in a way that preserves the same churn rate in both sets.

Code
X = df.drop(columns=['Churn', 'TotalCharges'])
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

n_train = X_train.shape[0]
n_test  = X_test.shape[0]

print(f"Training set: {n_train} observations")
print(f"Test set:     {n_test} observations")
print(f"Churn rate - Train: {y_train.mean():.3f} | Test: {y_test.mean():.3f}")
Training set: 5625 observations
Test set:     1407 observations
Churn rate - Train: 0.266 | Test: 0.266

The training set contains 5,625 observations and the test set contains 1,407, with a churn rate of 26.6% in both sets confirming the stratified split 7 worked correctly. After one-hot encoding, the feature space expanded from 22 base features to 26 features in the interaction model (which adds four contract × internet service terms).

4.2 Logistic Regression Models

We built two models. The baseline model uses all available features after pre-processing. The interaction model extends the baseline by adding four interaction terms that capture the combined effect of contract type and internet service on churn: one-year contract × fiber optic, two-year contract × fiber optic, one-year contract × no internet, and two-year contract × no internet. Since month-to-month and DSL are the reference categories, their effects are captured in the model intercept and serve as the baseline against which all other combinations are compared.

Logistic regression was chosen because its output is easy to interpret. Each coefficient tells you the direction and size of a variable’s effect on the odds of churning. Given the business goal, churn recall was treated as the most important metric, because missing a customer who is about to leave is more costly than incorrectly flagging one who would have stayed.

5 Results

5.1 Model Performance

Code
X_train_int = X_train.copy()
X_test_int  = X_test.copy()

# Four interaction terms (Harrison)
X_train_int['Contract1_x_FiberOptic'] = X_train['Contract_One year'] * X_train['InternetService_Fiber optic']
X_train_int['Contract2_x_FiberOptic'] = X_train['Contract_Two year'] * X_train['InternetService_Fiber optic']
X_train_int['Contract1_x_NoInternet'] = X_train['Contract_One year'] * X_train['InternetService_No']
X_train_int['Contract2_x_NoInternet'] = X_train['Contract_Two year'] * X_train['InternetService_No']

X_test_int['Contract1_x_FiberOptic'] = X_test['Contract_One year'] * X_test['InternetService_Fiber optic']
X_test_int['Contract2_x_FiberOptic'] = X_test['Contract_Two year'] * X_test['InternetService_Fiber optic']
X_test_int['Contract1_x_NoInternet'] = X_test['Contract_One year'] * X_test['InternetService_No']
X_test_int['Contract2_x_NoInternet'] = X_test['Contract_Two year'] * X_test['InternetService_No']

scaler_int = StandardScaler()
X_train_int_scaled = scaler_int.fit_transform(X_train_int)
X_test_int_scaled  = scaler_int.transform(X_test_int)

lr_int = LogisticRegression(random_state=42, max_iter=1000)
lr_int.fit(X_train_int_scaled, y_train)

y_pred_int  = lr_int.predict(X_test_int_scaled)
y_proba_int = lr_int.predict_proba(X_test_int_scaled)[:, 1]

roc_auc_int         = roc_auc_score(y_test, y_proba_int)
report_int          = classification_report(y_test, y_pred_int,
                                            target_names=['No Churn','Churn'],
                                            output_dict=True)
accuracy_int        = report_int['accuracy']
churn_recall_int    = report_int['Churn']['recall']
churn_precision_int = report_int['Churn']['precision']
churn_f1_int        = report_int['Churn']['f1-score']

print("=== Interaction Model Performance ===")
print(f"ROC-AUC: {roc_auc_int:.4f}")
print(classification_report(y_test, y_pred_int, target_names=['No Churn','Churn']))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
cm = confusion_matrix(y_test, y_pred_int)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['No Churn','Churn'], yticklabels=['No Churn','Churn'])
axes[0].set_title('Confusion Matrix - Interaction Model')
axes[0].set_ylabel('Actual')
axes[0].set_xlabel('Predicted')
RocCurveDisplay.from_predictions(y_test, y_proba_int, ax=axes[1], color='tomato')
axes[1].plot([0,1],[0,1],'k--')
axes[1].set_title('ROC Curve - Interaction Model')
plt.tight_layout()
plt.show()
=== Interaction Model Performance ===
ROC-AUC: 0.8341
              precision    recall  f1-score   support

    No Churn       0.85      0.89      0.87      1033
       Churn       0.64      0.56      0.60       374

    accuracy                           0.80      1407
   macro avg       0.75      0.73      0.73      1407
weighted avg       0.79      0.80      0.80      1407

Confusion Matrix and ROC Curve - Interaction Model

The interaction model achieved an ROC-AUC of 0.8341, exceeding the target of 0.80. Overall accuracy was 80.1%, just under the 80% target. Churn recall at the default threshold of 0.50 was 56.4%, which falls short of the 75% target. The weighted average recall of 80% seen in the classification report is misleading here: it is pulled upward by how well the model predicts the much larger non-churn class, which has nearly three times as many customers. The churn-specific recall is the honest metric for this business goal, and the threshold tuning section below addresses it directly.

Code
from scipy import stats
import statsmodels.api as sm

# Prepare features - baseline (no interaction terms)
scaler_base = StandardScaler()
X_train_base_scaled = scaler_base.fit_transform(X_train)
X_test_base_scaled  = scaler_base.transform(X_test)

# Add constant for statsmodels (intercept)
X_train_base_sm = sm.add_constant(X_train_base_scaled)
X_train_int_sm  = sm.add_constant(X_train_int_scaled)

# Fit both models with statsmodels
baseline_sm    = sm.Logit(y_train, X_train_base_sm).fit(disp=0)
interaction_sm = sm.Logit(y_train, X_train_int_sm).fit(disp=0)

# Likelihood Ratio Test
lr_stat = -2 * (baseline_sm.llf - interaction_sm.llf)
df_diff = interaction_sm.df_model - baseline_sm.df_model
p_value = stats.chi2.sf(lr_stat, df_diff)

print("=== Likelihood Ratio Test ===")
print(f"Baseline log-likelihood:    {baseline_sm.llf:.4f}")
print(f"Interaction log-likelihood: {interaction_sm.llf:.4f}")
print(f"LR statistic:               {lr_stat:.4f}")
print(f"Degrees of freedom:         {int(df_diff)}")
print(f"p-value:                    {p_value:.4f}")

# Individual coefficient p-values from interaction model
coef_pval_df = pd.DataFrame({
    'Feature':      X_train_int.columns,
    'Coefficient':  interaction_sm.params[1:],
    'p_value':      interaction_sm.pvalues[1:]
}).sort_values('Coefficient', key=abs, ascending=False).head(15)

lrt_pvalue   = p_value
lrt_stat     = lr_stat
lrt_df       = int(df_diff)
=== Likelihood Ratio Test ===
Baseline log-likelihood:    -2321.8814
Interaction log-likelihood: -2315.5576
LR statistic:               12.6475
Degrees of freedom:         4
p-value:                    0.0131
Code
def sig_label(p):
    if p < 0.001:  return '***'
    elif p < 0.01: return '**'
    elif p < 0.05: return '*'
    else:          return ''

coef_pval_df['sig'] = coef_pval_df['p_value'].apply(sig_label)

fig, ax = plt.subplots(figsize=(10, 7))
colors = ['tomato' if c > 0 else 'steelblue' for c in coef_pval_df['Coefficient']]
bars = ax.barh(coef_pval_df['Feature'], coef_pval_df['Coefficient'], color=colors)
ax.axvline(0, color='black', linewidth=0.8)

for bar, (_, row) in zip(bars, coef_pval_df.iterrows()):
    x = bar.get_width()
    offset = 0.03 if x >= 0 else -0.03
    ha = 'left' if x >= 0 else 'right'
    ax.text(x + offset, bar.get_y() + bar.get_height() / 2,
            row['sig'], va='center', ha=ha, fontsize=12, fontweight='bold',
            color='black')

ax.set_title(
    'Top 15 Coefficients - Interaction Model\n(* p<0.05  ** p<0.01  *** p<0.001)',
    fontsize=12, fontweight='bold')
ax.set_xlabel('Coefficient (log-odds)')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

Interaction Model Coefficients with Significance Levels

5.2 Likelihood Ratio Test: Baseline vs. Interaction Model

The interaction model adds four terms to the baseline. To test whether this added complexity is statistically justified, rather than just a chance improvement on the training data, we use the Likelihood Ratio Test (LRT). The LRT compares how well each model fits the data by looking at their log-likelihoods: a model that fits better will assign higher probability to the outcomes it actually observed. 8

The null hypothesis is that the four interaction terms contribute nothing (the baseline model fits the data just as well). A p-value below 0.05 means we can reject that and conclude the interactions carry real signal. The chart above shows the top 15 coefficients from the interaction model, with significance stars indicating which individual terms are statistically distinguishable from zero.

The LRT yielded a chi-squared statistic of 12.65 with 4 degrees of freedom (p = 0.013). Since p < 0.05, we reject the null hypothesis and conclude that the interaction terms jointly improve model fit in a statistically meaningful way. Looking at individual coefficients, Contract1_x_NoInternet was the only interaction term significant at the 0.05 level (p = 0.018), meaning that one-year contract holders with no internet service have a meaningfully lower churn probability than the month-to-month DSL reference group, beyond what either variable explains on its own. The other three interaction terms (Contract2_x_FiberOptic, Contract2_x_NoInternet, and Contract1_x_FiberOptic) did not reach statistical significance individually (p = 0.18, 0.54, and 0.55 respectively), suggesting their effects are captured largely by the main contract and internet service terms already in the model.

5.3 Threshold Tuning

By default, the model predicts churn when the estimated probability is 0.50 or higher. Lowering this cutoff means the model flags more customers as likely churners, which improves recall at the cost of flagging some customers who would have stayed. The table below shows how performance changes across different thresholds.

Code
thresholds = [0.30, 0.35, 0.40, 0.45, 0.50]
rows = []
for t in thresholds:
    y_pred_t = (y_proba_int >= t).astype(int)
    report = classification_report(y_test, y_pred_t,
                                   target_names=['No Churn','Churn'],
                                   output_dict=True)
    rows.append({
        'Threshold': t,
        'Accuracy':  f"{report['accuracy']:.3f}",
        'Churn Recall': f"{report['Churn']['recall']:.3f}",
        'Churn Precision': f"{report['Churn']['precision']:.3f}",
        'F1 Score': f"{report['Churn']['f1-score']:.3f}"
    })

threshold_df = pd.DataFrame(rows)
t30 = threshold_df[threshold_df['Threshold'] == 0.30].iloc[0]
t30_recall    = float(t30['Churn Recall'])
t30_precision = float(t30['Churn Precision'])

threshold_df.style.hide(axis='index')
Threshold Accuracy Churn Recall Churn Precision F1 Score
0.300000 0.745 0.751 0.514 0.610
0.350000 0.766 0.706 0.547 0.616
0.400000 0.783 0.671 0.578 0.621
0.450000 0.792 0.626 0.605 0.615
0.500000 0.801 0.564 0.643 0.601

At a threshold of 0.30, the model catches 75.1% of actual churners, meeting the recall target from the proposal. The trade-off is that precision drops to 51.4%, meaning about half of the customers the model flags would not have actually churned. In a retention context this is an acceptable trade-off: reaching out to a loyal customer with a discount offer costs little, while doing nothing about a customer who is about to leave is far more expensive.

5.4 Cross-Validation

To check that the model’s performance holds up on data it has not seen before, we ran 5-fold cross-validation on the training set.

Code
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    'accuracy': 'accuracy',
    'roc_auc':  'roc_auc',
    'recall':   make_scorer(recall_score),
    'f1':       'f1'
}
cv_results = cross_validate(
    LogisticRegression(random_state=42, max_iter=1000),
    scaler_int.fit_transform(X_train_int),
    y_train, cv=cv, scoring=scoring)

cv_roc_mean = cv_results['test_roc_auc'].mean()
cv_roc_std  = cv_results['test_roc_auc'].std()

cv_df = pd.DataFrame({
    'Metric':  ['Accuracy', 'ROC-AUC', 'Recall', 'F1'],
    'Mean':    [cv_results[f'test_{m}'].mean() for m in ['accuracy','roc_auc','recall','f1']],
    'Std Dev': [cv_results[f'test_{m}'].std()  for m in ['accuracy','roc_auc','recall','f1']]
})
cv_df['Mean']    = cv_df['Mean'].map('{:.4f}'.format)
cv_df['Std Dev'] = cv_df['Std Dev'].map('{:.4f}'.format)
cv_df.style.hide(axis='index')
Metric Mean Std Dev
Accuracy 0.8009 0.0115
ROC-AUC 0.8451 0.0053
Recall 0.5351 0.0210
F1 0.5883 0.0219

The results are consistent across all five folds. The ROC-AUC averaged 0.8451 with a standard deviation of 0.005, meaning scores ranged roughly between 0.84 and 0.85 across folds, indicating the model is stable and not overfitting to the training data. A model that had simply memorized the training data would score well on training data but deteriorate on held-out chunks; the stability here indicates that is not happening. The cross-validation recall of 53.5% reflects the default threshold of 0.50, the same threshold used in the classification report, and is consistent with the threshold tuning results: at threshold 0.30, churn recall rises to 75.1% as shown in the previous section.

Code
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest on the same features as the interaction model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_int_scaled, y_train)

y_pred_rf  = rf.predict(X_test_int_scaled)
y_proba_rf = rf.predict_proba(X_test_int_scaled)[:, 1]

roc_auc_rf      = roc_auc_score(y_test, y_proba_rf)
report_rf       = classification_report(y_test, y_pred_rf,
                                        target_names=['No Churn','Churn'],
                                        output_dict=True)
accuracy_rf     = report_rf['accuracy']
recall_rf       = report_rf['Churn']['recall']
precision_rf    = report_rf['Churn']['precision']

# Threshold tuning for Random Forest
y_pred_rf_t30    = (y_proba_rf >= 0.30).astype(int)
report_rf_t30    = classification_report(y_test, y_pred_rf_t30,
                                         target_names=['No Churn','Churn'],
                                         output_dict=True)
recall_rf_t30    = report_rf_t30['Churn']['recall']
precision_rf_t30 = report_rf_t30['Churn']['precision']
accuracy_rf_t30  = report_rf_t30['accuracy']

print("=== Random Forest Performance ===")
print(f"ROC-AUC: {roc_auc_rf:.4f}")
print(classification_report(y_test, y_pred_rf, target_names=['No Churn','Churn']))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens', ax=axes[0],
            xticklabels=['No Churn','Churn'], yticklabels=['No Churn','Churn'])
axes[0].set_title('Confusion Matrix - Random Forest')
axes[0].set_ylabel('Actual')
axes[0].set_xlabel('Predicted')
RocCurveDisplay.from_predictions(y_test, y_proba_rf, ax=axes[1], color='seagreen')
axes[1].plot([0,1],[0,1],'k--')
axes[1].set_title('ROC Curve - Random Forest')
plt.tight_layout()
plt.show()

# Comparison table
comparison = pd.DataFrame({
    'Metric': ['ROC-AUC', 'Accuracy (t=0.50)', 'Churn Recall (t=0.50)',
               'Churn Recall (t=0.30)', 'Churn Precision (t=0.30)'],
    'Logistic Regression': [f"{roc_auc_int:.4f}", f"{accuracy_int:.1%}",
                             f"{churn_recall_int:.1%}", f"{t30_recall:.1%}",
                             f"{t30_precision:.1%}"],
    'Random Forest':       [f"{roc_auc_rf:.4f}", f"{accuracy_rf:.1%}",
                             f"{recall_rf:.1%}", f"{recall_rf_t30:.1%}",
                             f"{precision_rf_t30:.1%}"]
})
comparison.style.hide(axis='index')
=== Random Forest Performance ===
ROC-AUC: 0.8104
              precision    recall  f1-score   support

    No Churn       0.83      0.89      0.86      1033
       Churn       0.61      0.49      0.54       374

    accuracy                           0.78      1407
   macro avg       0.72      0.69      0.70      1407
weighted avg       0.77      0.78      0.77      1407

Confusion Matrix and ROC Curve - Random Forest
Metric Logistic Regression Random Forest
ROC-AUC 0.8341 0.8104
Accuracy (t=0.50) 80.1% 78.2%
Churn Recall (t=0.50) 56.4% 48.9%
Churn Recall (t=0.30) 75.1% 74.6%
Churn Precision (t=0.30) 51.4% 49.4%

5.5 Random Forest Comparison

To evaluate whether a more flexible model improves on the logistic regression, we trained a Random Forest classifier on the same feature set. Random Forest builds many decision trees on random subsets of the data and averages their predictions, which allows it to capture non-linear relationships and complex interactions without requiring them to be specified manually.9 The table above compares the two models directly.

The Random Forest achieved a ROC-AUC of 0.8104, compared to 0.8341 for the logistic regression which is a meaningful gap in overall discriminative ability. At the default threshold of 0.50, the Random Forest also underperformed on recall (48.9% vs. 56.4%) and accuracy (78.2% vs. 80.1%). At threshold 0.30, the two models converge closely: the Random Forest reaches 74.6% churn recall vs. 75.1% for logistic regression, with nearly identical precision (49.4% vs. 51.4%). The McNemar’s test in the next section confirms whether this remaining difference is statistically meaningful.

Code
# Random Forest feature importances
rf_importance = pd.DataFrame({
    'Feature':    X_train_int.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False).head(15)

# Logistic Regression top 15 coefficients by absolute value
feature_names = list(X_train_int.columns)
coef_df = pd.DataFrame({
    'Feature':     feature_names,
    'Coefficient': lr_int.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False).head(15)

top_pos_feature = coef_df[coef_df['Coefficient'] > 0].iloc[0]
top_neg_feature = coef_df[coef_df['Coefficient'] < 0].iloc[0]

fig, axes = plt.subplots(1, 2, figsize=(16, 7))

# Left: Random Forest importance
axes[0].barh(rf_importance['Feature'], rf_importance['Importance'], color='seagreen')
axes[0].set_title('Random Forest\nFeature Importance', fontweight='bold')
axes[0].set_xlabel('Mean Decrease in Impurity')
axes[0].invert_yaxis()

# Right: Logistic Regression coefficients
lr_colors = ['tomato' if c > 0 else 'steelblue' for c in coef_df['Coefficient']]
axes[1].barh(coef_df['Feature'], coef_df['Coefficient'], color=lr_colors)
axes[1].axvline(0, color='black', linewidth=0.8)
axes[1].set_title('Logistic Regression\nCoefficients (log-odds)', fontweight='bold')
axes[1].set_xlabel('Coefficient')
axes[1].invert_yaxis()

plt.suptitle('Feature Importance Comparison', fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

Feature Importance Comparison: Random Forest vs. Logistic Regression

Both models agree on the most important drivers of churn. tenure and contract type (Contract_Two year, Contract_One year) rank among the top predictors in both models, as does InternetService_Fiber optic. This convergence across two fundamentally different modeling approaches strengthens confidence that these variables reflect genuine patterns in the data rather than model-specific artifacts.

Code
from statsmodels.stats.contingency_tables import mcnemar

# Get predictions from both models at threshold 0.30
y_pred_lr_t30 = (y_proba_int >= 0.30).astype(int)
y_pred_rf_t30 = (y_proba_rf >= 0.30).astype(int)

# Build contingency table
both_correct = ((y_pred_lr_t30 == y_test) & (y_pred_rf_t30 == y_test)).sum()
lr_only      = ((y_pred_lr_t30 == y_test) & (y_pred_rf_t30 != y_test)).sum()
rf_only      = ((y_pred_lr_t30 != y_test) & (y_pred_rf_t30 == y_test)).sum()
both_wrong   = ((y_pred_lr_t30 != y_test) & (y_pred_rf_t30 != y_test)).sum()

contingency_table = [[both_correct, lr_only],
                     [rf_only,      both_wrong]]

result = mcnemar(contingency_table, exact=False, correction=True)

mcnemar_stat   = result.statistic
mcnemar_pvalue = result.pvalue

print("McNemar's Test Results")
print(f"Contingency table: {contingency_table}")
print(f"Chi-squared statistic: {mcnemar_stat:.4f}")
print(f"p-value: {mcnemar_pvalue:.4f}")
McNemar's Test Results
Contingency table: [[np.int64(952), np.int64(96)], [np.int64(74), np.int64(285)]]
Chi-squared statistic: 2.5941
p-value: 0.1073

5.6 Statistical Model Comparison

Comparing two models by looking at their metrics alone does not tell you whether the difference in performance is statistically meaningful or just due to chance. McNemar’s test addresses this directly: rather than comparing overall accuracy, it looks specifically at the cases where the two models disagree (where one got it right and the other got it wrong). 10

Both models were evaluated at threshold 0.30, consistent with the threshold tuning results above. A p-value below 0.05 would indicate the two models make meaningfully different errors and that the performance gap is statistically significant.

The test yielded a chi-squared statistic of 2.59 and a p-value of 0.107. Since p > 0.05, we cannot conclude that the two models make meaningfully different errors on this test set since the performance gap is not statistically significant. This is an important finding: despite the Random Forest’s added flexibility, it does not outperform the simpler logistic regression in a way that can be distinguished from chance. Given that logistic regression also offers interpretable coefficients and p-values that directly support the inferential goal of this analysis, it is the preferred model for this use case.

5.7 Feature Importance

Red bars in the logistic regression coefficient chart indicate variables that increase the likelihood of churn, while blue bars indicate variables that reduce it. MonthlyCharges (+0.72) and tenure (-0.81) remain the two strongest individual predictors. Among the interaction terms, two-year fiber optic customers show the highest churn risk of all contract-service combinations, which may reflect customers locked into long-term contracts who feel under-served by their internet service. In contrast, customers on long-term contracts with no internet service show the lowest churn, suggesting that simpler, phone-only plans attract a more stable customer base.

6 Conclusions

The logistic regression model shows that churn in this telecom dataset is quite predictable. At the default threshold, the interaction model achieves an ROC-AUC of 0.8341 and overall accuracy of 80.1%. By lowering the classification threshold to 0.30, the model meets the project’s churn recall target of 75%. Cross-validation confirms these results hold across different subsets of the data, with a stable ROC-AUC of 0.8451 across five folds.

Three findings stand out. First, contract type is the clearest signal: month-to-month customers churn at 42.7%, compared to 2.8% for customers on two-year contracts. Second, monthly charges consistently increase churn risk, which points to price sensitivity as a real factor. Third, tenure works in the opposite direction: customers who have been around longer are much less likely to leave. This means early retention efforts are especially valuable.

The Likelihood Ratio Test confirmed that the interaction terms jointly improve model fit (p = 0.013), though only one of the four individual interaction terms (one-year contract holders with no internet service) reached statistical significance on its own. Comparing the logistic regression against a Random Forest showed that the more flexible model did not produce meaningfully better predictions: the McNemar’s test returned p = 0.107, and both models agreed on the same top predictors. This convergence supports choosing the simpler, more interpretable logistic regression for this use case.

7 Practical Implications

The model’s output can feed directly into retention programs:

Contract upgrade offers: Customers on month-to-month contracts with a predicted churn probability above 0.30 are the highest-priority group for outreach. Offering a discounted upgrade to an annual contract could meaningfully reduce the 42.7% churn rate in this segment.

Pricing reviews: Customers with high monthly bills are more likely to churn. Loyalty discounts or bundle offers targeted at high-charge customers in the top risk group could improve retention without broad price cuts.

Fiber optic service: The 41.9% churn rate among fiber optic users is worth investigating separately. It may reflect service reliability issues, pricing relative to competition, or unmet expectations. A focused customer satisfaction review in this segment would help identify the root cause.

Early customer engagement: Since short tenure is one of the strongest signals of churn risk, putting more resources into the customer experience during the first year of service could improve long-term retention rates and increase the overall value of the customer base.

Portions of this project were completed with the assistance of Claude (Anthropic). AI assistance was used in the following areas: structuring and writing the Quarto .qmd document, setting up the GitHub Actions workflow for GitHub Pages deployment, and debugging Python code errors during development. All of the base code (EDA + Models), analytical decisions, interpretations, and conclusions are the authors’ own.

Footnotes

  1. In this dataset, churn is measured as a binary snapshot: each customer is labeled as churned or retained based on whether they left within the last month of the observation window. It does not capture how long a customer was at risk or when exactly they left, so churn rate here means the proportion of customers in the sample who churned during that period (26.6% of 7,032 customers).↩︎

  2. Gallo, A. (2014). The value of keeping the right customers. Harvard Business Review. https://hbr.org/2014/10/the-value-of-keeping-the-right-customers. Gallo synthesizes research by Frederick Reichheld of Bain & Company showing that acquiring a new customer costs 5 to 25 times more than retaining an existing one, and that a 5% increase in retention rates can boost profits by 25% to 95%.↩︎

  3. An IBM sample dataset of a fictional telecommunications company.↩︎

  4. The raw dataset contains 21 columns. customerID is dropped immediately as a non-predictive identifier. The 13 binary Yes/No columns are encoded in place with .map() and do not change the column count. One-hot encoding with drop_first=True then expands three categorical columns: InternetService (3 levels → 2 dummies), Contract (3 levels → 2 dummies), and PaymentMethod (4 levels → 3 dummies), adding a net of 4 columns. This gives 24 columns total. TotalCharges is then excluded from the model due to its near-perfect collinearity with tenure (r = 0.83), leaving 23 features used in modeling.↩︎

  5. Reading the heatmap: Each cell shows the Pearson correlation coefficient \(r\) between the row variable and the column variable, ranging from -1 to 1. Red cells indicate a positive relationship (both variables tend to move together); blue cells indicate a negative relationship (one rises as the other falls); white or near-zero cells indicate little linear association. The matrix is symmetric, meaning that the cell at row \(i\), column \(j\) is identical to row \(j\), column \(i\), so only one rectangle carries new information. Diagonal cells are always 1.00 (a variable is perfectly correlated with itself). To read churn predictors specifically, scan the Churn row or column: red values flag features positively associated with churn, blue values flag features negatively associated with churn.↩︎

  6. The Pearson coefficient only captures linear relationships between two variables. Two variables can have a strong curved or non-monotonic relationship and still produce an \(r\) near zero, meaning the heatmap may understate some associations. For this analysis the limitation is acceptable: logistic regression models the log-odds of churn as a linear function of the predictors, so linear correlation is the appropriate screening tool.↩︎

  7. Stratified splitting preserves the proportion of the target class in each subset. Without it, a purely random split could by chance assign disproportionately few churners to the test set, making recall and precision estimates unreliable. With stratify=y, scikit-learn splits churners and non-churners separately, takes 80% of each group, and combines them, guaranteeing that both sets mirror the original 26.6% churn rate. This is especially important here given the class imbalance (73.4% no churn vs. 26.6% churn).↩︎

  8. The Likelihood Ratio Test statistic is calculated as \(-2(\ell_0 - \ell_1)\), where \(\ell_0\) is the log-likelihood of the baseline model and \(\ell_1\) is the log-likelihood of the interaction model. Under the null hypothesis that the interaction terms have no effect, this statistic follows a chi-squared distribution with degrees of freedom equal to the number of added terms. In this case, 4. A p-value below 0.05 indicates the interaction model fits the data significantly better than the baseline.↩︎

  9. Unlike logistic regression, Random Forest does not assume a linear relationship between features and the log-odds of churn. It can detect patterns that logistic regression misses, at the cost of interpretability. Coefficients with a clear directional meaning are replaced by feature importance scores that measure how much each variable reduces prediction error across all trees.↩︎

  10. McNemar’s test is the standard non-parametric test for comparing two classifiers evaluated on the same test set. It builds a 2×2 contingency table of agreements and disagreements between the two models, and tests whether the off-diagonal cells (the cases where only one model was correct) are symmetric. A p-value below 0.05 indicates the two models make meaningfully different errors, i.e. the performance difference is unlikely to be due to chance. It is preferred over a simple accuracy comparison because it accounts for the paired nature of the predictions.↩︎