Customer Churn Prediction Analysis

CIS 9660 - Group 2 Final Presentation

2026-05-11

Team Members

Our Team

  • Harrison Cabe
  • Raúl J. Solá Navarro
  • Samuel Spitzer
  • Victor Murra Schott

Course

CIS 9660 - Data Mining for Business Analytics

Baruch College, Spring 2026

The Business Problem

Research Question

Can we predict whether a telecom customer will churn based on their contract type, monthly charges, and service usage?


Why It Matters

  • Acquiring a new customer costs significantly more than retaining one
  • Early identification of at-risk customers enables targeted intervention
  • Even small reductions in churn have significant revenue impact

Our Goals

Predictive: Build a model that correctly classifies customers as likely to churn or stay

Inferential: Test whether the combination of InternetService and Contract type significantly affects churn

Dataset

Kaggle: Telco Customer Churn

🔗 kaggle.com/datasets/blastchar/telco-customer-churn

Property       Value
Observations   7,032 (after cleaning)
Features       23 (after encoding)
Target         Churn (Yes / No)
Churn Rate     26.6%

Preprocessing Steps

  • Dropped 11 rows with missing TotalCharges
  • Collapsed redundant “No service” categories
  • Binary-encoded Yes/No columns with .map()
  • One-hot encoded InternetService, Contract, PaymentMethod
  • Scaled MonthlyCharges, TotalCharges, tenure
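
A minimal sketch of the preprocessing steps above, assuming pandas and the Kaggle column names. The toy rows, the blank-string representation of missing TotalCharges, the use of drop_first=True, and z-score scaling are all assumptions; the slides do not state which scaler or dummy-coding scheme was used.

```python
import pandas as pd

# Toy rows using the Kaggle column names; the values are made up for illustration.
raw = pd.DataFrame({
    "tenure": [1, 34, 0, 45, 8],
    "MonthlyCharges": [29.85, 56.95, 20.25, 19.70, 99.65],
    "TotalCharges": ["29.85", "1889.5", " ", "886.5", "820.5"],  # " " = missing (assumed)
    "OnlineSecurity": ["No", "Yes", "No internet service", "No internet service", "No"],
    "InternetService": ["DSL", "DSL", "No", "No", "Fiber optic"],
    "Contract": ["Month-to-month", "One year", "Two year", "Two year", "Month-to-month"],
    "Churn": ["Yes", "No", "No", "No", "Yes"],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # 1. Blank TotalCharges become NaN; those rows are dropped.
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    out = out.dropna(subset=["TotalCharges"]).reset_index(drop=True)
    # 2. Collapse the redundant "No internet service" level into "No".
    out["OnlineSecurity"] = out["OnlineSecurity"].replace("No internet service", "No")
    # 3. Binary-encode Yes/No columns with .map().
    for col in ["OnlineSecurity", "Churn"]:
        out[col] = out[col].map({"Yes": 1, "No": 0})
    # 4. One-hot encode multi-level categoricals. drop_first=True (an assumption)
    #    drops DSL and Month-to-month, which become the reference levels.
    out = pd.get_dummies(out, columns=["InternetService", "Contract"],
                         drop_first=True, dtype=int)
    # 5. Z-score the numeric columns (assumed scaler).
    for col in ["tenure", "MonthlyCharges", "TotalCharges"]:
        out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out


clean = preprocess(raw)
print(clean.columns.tolist())
```

In a real pipeline the scaling statistics would be computed on the training split only and then applied to the test split, to avoid leaking test information.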

Exploratory Data Analysis

Key Patterns in the Data

Contract Type

Contract         Churn Rate
Month-to-month   42.7%
One year         11.3%
Two year         2.8%

Internet Service

Service       Churn Rate
Fiber optic   41.9%
DSL           19.0%
No internet   7.4%

Continuous Variables

Variable                   No Churn   Churn
Monthly Charges (median)   $64        $80
Tenure (median)            38 mo      10 mo

Payment Method

Method             Churn Rate
Electronic check   45.3%
Bank transfer      16.7%
Credit card        15.3%
Mailed check       19.2%

Correlation Findings

Numeric Variables vs. Churn


📉 tenure

Correlation: -0.35

Strongest numeric predictor → longer-tenured customers are significantly less likely to leave

📈 MonthlyCharges

Correlation: +0.19

Higher bills are associated with increased churn risk

⚠️ TotalCharges

Correlation with tenure: 0.83

Excluded from modeling to avoid multicollinearity
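
The correlation checks above can be reproduced with a plain Pearson coefficient; against a 0/1 churn flag this is the point-biserial correlation. The sketch below uses made-up toy values, and the 0.8 cutoff for flagging collinear pairs is a hypothetical rule of thumb, not a figure from the analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data (illustrative only): TotalCharges is roughly tenure x MonthlyCharges.
tenure = [1, 2, 5, 10, 24, 36, 48, 60, 66, 72]
monthly = [70, 85, 90, 45, 80, 55, 95, 60, 100, 75]
total = [t * m for t, m in zip(tenure, monthly)]
churn = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

r_tenure_churn = pearson(tenure, churn)   # negative: longer tenure, less churn
r_tenure_total = pearson(tenure, total)   # strongly positive

COLLINEARITY_CUTOFF = 0.8  # hypothetical threshold
drop_total_charges = abs(r_tenure_total) > COLLINEARITY_CUTOFF
print(round(r_tenure_churn, 2), round(r_tenure_total, 2), drop_total_charges)
```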

Methodology


Model 1: Baseline

Logistic regression on all preprocessed features (TotalCharges excluded; see Correlation Findings)

  • 80/20 train/test split with stratification
  • max_iter=1000 for convergence
  • Evaluated at default threshold of 0.50
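
A minimal sketch of this baseline setup, assuming scikit-learn. The synthetic features, the random seeds, and the variable names are placeholders; only the stratified 80/20 split, max_iter=1000, and the 0.50 threshold come from the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed feature matrix (illustrative only).
rng = np.random.default_rng(42)
n_rows = 1000
X = rng.normal(size=(n_rows, 3))
true_logits = -1.0 - 0.8 * X[:, 0] + 0.5 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logits)))

# 80/20 split; stratify=y keeps the churn rate equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# max_iter=1000 gives the solver enough iterations to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted churn probabilities, then hard labels at the default 0.50 threshold.
proba = model.predict_proba(X_test)[:, 1]
pred = (proba >= 0.50).astype(int)

auc = roc_auc_score(y_test, proba)
print(f"ROC-AUC on the toy data: {auc:.3f}")
```

Keeping the probabilities in `proba`, rather than only the hard labels, is what makes later threshold tuning possible without refitting.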

Model 2: Interaction

Extends baseline with 4 interaction terms

  • Contract1 × FiberOptic
  • Contract2 × FiberOptic
  • Contract1 × NoInternet
  • Contract2 × NoInternet

Month-to-month × DSL serves as the reference category
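
One way to build the four interaction terms is to multiply the corresponding dummy columns. The sketch below assumes pandas and dummy names like those in the coefficient table; the `_x_` naming of the new columns is hypothetical.

```python
import pandas as pd

# Toy dummy-coded rows (illustrative). Month-to-month and DSL are the
# reference levels, so they have no columns of their own.
X = pd.DataFrame({
    "Contract_one_year":           [0, 1, 0, 0, 1],
    "Contract_two_year":           [0, 0, 1, 1, 0],
    "InternetService_fiber_optic": [1, 1, 1, 0, 0],
    "InternetService_no":          [0, 0, 0, 1, 1],
})


def add_interactions(df: pd.DataFrame) -> pd.DataFrame:
    """Append the 2 x 2 contract-by-internet-service product columns."""
    out = df.copy()
    for contract in ["Contract_one_year", "Contract_two_year"]:
        for service in ["InternetService_fiber_optic", "InternetService_no"]:
            # The product is 1 only when both dummies are 1.
            out[f"{contract}_x_{service}"] = out[contract] * out[service]
    return out


X_int = add_interactions(X)
print(X_int.filter(like="_x_"))
```

A customer in the reference cell (month-to-month, DSL) has zeros in all four product columns, so each interaction coefficient measures a departure from that baseline beyond the two main effects.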

Model Results

Performance at Default Threshold (0.50)


Metric            Baseline   Interaction   Target
ROC-AUC           0.8289     0.8341        ≥ 0.80 ✅
Accuracy          79.8%      80.1%         ≥ 80% ✅
Churn Recall      56.4%      56.4%         ≥ 75% ❌
Churn Precision   63.6%      64.0%


Recall falls short at the default threshold → addressed through threshold tuning

Threshold Tuning

Lowering the Classification Threshold to 0.30

Threshold   Accuracy   Churn Recall   Precision
0.50        80.1%      56.4%          64.0%
0.40        78.3%      67.1%          57.8%
0.30        74.5%      75.1%          51.4%

Why 0.30?

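At a threshold of 0.30 the model catches ~75% of actual churners, at the cost of lower precision (51.4%): about half of the flagged customers would not have churned

The trade-off is acceptable because the two errors have unequal costs:

  • Missing a churner = losing a customer
  • Flagging a loyal customer = small cost of a discount offer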
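
The effect of lowering the threshold can be sketched in a few lines of plain Python. The labels and probabilities below are made up to show the mechanics; they are not the project's predictions.

```python
def metrics_at_threshold(y_true, proba, threshold):
    """Accuracy, churn recall, and churn precision at a given cutoff."""
    pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Toy labels (1 = churned) and predicted churn probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
proba = [0.90, 0.55, 0.35, 0.20, 0.60, 0.45, 0.32, 0.25, 0.10, 0.05]

for threshold in (0.50, 0.40, 0.30):
    print(threshold, metrics_at_threshold(y_true, proba, threshold))
```

On this toy data, moving from 0.50 to 0.30 raises recall while accuracy and precision fall, which is the same direction of trade-off as in the table above.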

Interaction Term Findings

What the Four Interaction Terms Revealed

📈 Increases Churn Risk

  • Two-year × Fiber optic: highest churn risk of all combinations; this may reflect customers locked into long-term contracts who feel underserved by their internet service
  • One-year × Fiber optic: elevated but lower risk than two-year

📉 Reduces Churn Risk

  • Two-year × No internet: lowest churn of all; simple phone-only plans attract a very stable customer base
  • One-year × No internet: similarly protective

Key Insight

It’s not just fiber optic that drives churn; fiber optic customers who are locked into long-term contracts are most at risk

Feature Importance

Top Drivers of Churn (Interaction Model Coefficients)


📈 Increases Churn

Feature                          Coefficient
InternetService_fiber_optic      +1.33
MonthlyCharges                   +0.72
StreamingTV                      +0.49
PaymentMethod_electronic_check   +0.41

📉 Reduces Churn

Feature              Coefficient
InternetService_no   -1.02
tenure               -0.82
Contract_two_year    -0.69
Contract_one_year    -0.29
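
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to communicate. The sketch below uses the coefficients listed on this slide; the dictionary layout is only for illustration.

```python
import math

# Coefficients as listed on the slide (interaction model).
coefficients = {
    "InternetService_fiber_optic": 1.33,
    "MonthlyCharges": 0.72,
    "StreamingTV": 0.49,
    "PaymentMethod_electronic_check": 0.41,
    "Contract_one_year": -0.29,
    "Contract_two_year": -0.69,
    "tenure": -0.82,
    "InternetService_no": -1.02,
}

# exp(coefficient) = multiplicative change in the odds of churn.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name:32s} odds ratio {ratio:5.2f} ({direction} churn odds)")
```

For example, exp(1.33) ≈ 3.78, so fiber optic service multiplies the odds of churn by roughly 3.8, other features held fixed. Two caveats apply when reading these numbers. For scaled variables such as tenure and MonthlyCharges, the ratio is per one unit of the scaled variable. And because the model includes interaction terms, the contract and internet-service main effects describe customers at the reference level of the other variable (fiber optic vs. DSL for month-to-month customers, two-year vs. month-to-month for DSL customers).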

Practical Implications


🎯 Contract Upgrade Offers

Offer discounted annual contracts to month-to-month customers whose predicted churn probability is ≥ 0.30

💰 Pricing Interventions

Loyalty discounts for high-charge customers in the top risk quartile

🔧 Fiber Optic Service Review

The 41.9% churn rate among fiber optic customers, together with the elevated risk for those on long-term contracts, signals a service quality or pricing issue worth investigating

📅 Early Engagement

Half of all churners leave within their first 10 months (median tenure among churners: 10 months)

Invest in onboarding and early retention programs!

Conclusions

What We Found

  • ROC-AUC of 0.83: the model discriminates well between churners and non-churners
  • At threshold 0.30, churn recall reaches ~75%, meeting the project target
  • tenure and InternetService_fiber_optic are among the strongest individual predictors
  • Contract type is the strongest categorical signal: two-year customers churn at just 2.8% vs 42.7% for month-to-month

What’s New from Interactions

The four-term interaction model revealed that fiber optic alone does not drive churn: long-term contract holders on fiber optic are most at risk, suggesting dissatisfaction with a service they feel locked into

Full Report

🔗 RaulSolaNavarro.github.io/CIS9660-2026-SPRING/churn-report.html

Post-Presentation Additions

Enhancements added to the final report after the presentation


📊 Statistical Testing

  • Likelihood Ratio Test: formally confirmed the interaction terms jointly improve model fit over the baseline (χ² = 12.65, df = 4, p = 0.013)
  • Individual p-values via statsmodels: identified which specific interaction terms are statistically significant
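
The likelihood ratio statistic is twice the gap in log-likelihood between the interaction model and the baseline, compared against a chi-square distribution with one degree of freedom per added term (four here). The sketch below uses only the standard library and a closed form that holds for even degrees of freedom; the log-likelihood values in the second call are hypothetical. With statsmodels, the fitted results' `llf` attribute would typically supply the log-likelihoods, and `scipy.stats.chi2.sf` is the usual general-purpose tail function.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


def likelihood_ratio_test(llf_restricted: float, llf_full: float, df: int):
    """Return (statistic, p-value) for nested models fit by maximum likelihood."""
    statistic = 2.0 * (llf_full - llf_restricted)
    return statistic, chi2_sf_even_df(statistic, df)


# Reported statistic with 4 added interaction terms.
p_reported = chi2_sf_even_df(12.65, df=4)
print(f"p = {p_reported:.3f}")  # about 0.013, in line with the reported value

# Hypothetical log-likelihoods that happen to produce the same statistic.
stat, p = likelihood_ratio_test(llf_restricted=-100.0, llf_full=-93.675, df=4)
```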

🌲 Model Comparison

  • Random Forest trained on the same feature set as a benchmark
  • McNemar’s test found no statistically significant performance gap between the models (p = 0.107)
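
McNemar’s test compares two classifiers on the same test set by looking only at the cases where exactly one of them is correct. The sketch below implements the exact (binomial) form with the standard library. Whether the report used the exact or the chi-square version is not stated, so this is an assumption, and the toy predictions are made up. In statsmodels, `statsmodels.stats.contingency_tables.mcnemar` covers both variants.

```python
import math


def discordant_counts(y_true, pred_a, pred_b):
    """Count cases where only model A is right (b) and only model B is right (c)."""
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value: binomial test of b vs. c with p = 0.5."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree on correctness
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy example: logistic regression (A) vs. random forest (B) on six customers.
y_true = [1, 0, 1, 1, 0, 0]
pred_a = [1, 0, 0, 1, 0, 1]
pred_b = [1, 1, 1, 1, 0, 1]

b, c = discordant_counts(y_true, pred_a, pred_b)
print(b, c, mcnemar_exact(b, c))
```

A large p-value means the data give no evidence that one model's errors differ systematically from the other's; it does not prove the two models are equivalent.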

📈 New Visualizations

  • Coefficient chart with significance stars (* p<0.05, ** p<0.01, *** p<0.001)
  • Side-by-side feature importance comparison: Random Forest vs. Logistic Regression

🔑 Key Takeaway

Both models agree on the same top predictors (tenure, contract type, and fiber optic service), strengthening confidence that these reflect genuine patterns and supporting the choice of the simpler, more interpretable logistic regression

Thank You


Dataset: Telco Customer Churn (Kaggle), 7,032 obs × 23 features

Method: Logistic Regression + 4 Interaction Terms (Contract × InternetService)

Key Finding: Long-term fiber optic customers are the highest-risk churn segment

Full Report: RaulSolaNavarro.github.io/CIS9660-2026-SPRING/churn-report.html

Questions?

  • Harrison Cabe
  • Raúl J. Solá Navarro
  • Samuel Spitzer
  • Victor Murra Schott