## Overview
Goal: predict whether a customer defaults (default = Yes) using balance, income, and student, and compare Logistic Regression against LDA.
Key idea: default is rare (~3–4%), so the probability cutoff you choose matters. Lower cutoffs flag more accounts as “high risk” and catch more true defaulters.
## Method
- Random 70/30 train-test split (seeded).
- Fit Logistic Regression and LDA on the training set.
- Compute predicted P(default = Yes) for the test set.
- Evaluate three cutoffs: 0.50, 0.10, 0.05.
Reporting focuses on two rates: the predicted default rate (share of accounts flagged as defaulters) and the caught defaulter rate (share of true defaulters that are flagged).
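The pipeline above can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the actual analysis: the feature distributions, coefficients, and seed are invented, and only the structure (seeded 70/30 split, two models, probability cutoffs) mirrors the method described.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for the Default data (balance, income, student)
rng = np.random.default_rng(1)
n = 10_000
balance = rng.normal(800, 450, n).clip(min=0)
income = rng.normal(33_000, 13_000, n).clip(min=0)
student = rng.integers(0, 2, n).astype(float)
X = np.column_stack([balance, income, student])
# Rare positive class driven mainly by balance (roughly a few percent)
p_true = 1 / (1 + np.exp(-(-8 + 0.0055 * balance)))
y = (rng.random(n) < p_true).astype(int)

# Seeded 70/30 train-test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "Logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000)),
    "LDA": LinearDiscriminantAnalysis(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]      # P(default = Yes)
    for cutoff in (0.50, 0.10, 0.05):
        pred = p >= cutoff
        acc = (pred == y_te).mean()
        flagged = pred.mean()                # predicted default rate
        caught = pred[y_te == 1].mean()      # caught defaulter rate
        print(f"{name} @ {cutoff:.2f}: "
              f"acc={acc:.4f}, flagged={flagged:.4f}, caught={caught:.4f}")
```

Because the set of accounts flagged at cutoff 0.05 is a superset of those flagged at 0.50, the flagged and caught rates are guaranteed to be non-decreasing as the cutoff is lowered, regardless of model or data.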
## Results
Baseline context: if we predicted “No default” for everyone, accuracy would be about 96.37% (since the test default rate is ~3.63%).
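The baseline follows directly from the test-set class balance: always predicting "No default" is correct exactly when the account does not default. A quick arithmetic check, using the ~3.63% test default rate stated above:

```python
# Always-"No" baseline: accuracy = share of non-defaulters in the test set
test_default_rate = 0.0363
baseline_accuracy = 1 - test_default_rate
print(f"{baseline_accuracy:.2%}")  # → 96.37%
```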
| Model | Cutoff | Test accuracy | Test error | Predicted default rate | Caught defaulter rate |
|---|---|---|---|---|---|
| Logistic | 0.50 | 97.20% | 2.80% | 1.50% | 32.11% |
| Logistic | 0.10 | 93.53% | 6.47% | 8.10% | 72.48% |
| Logistic | 0.05 | 89.97% | 10.03% | 12.53% | 84.40% |
| LDA | 0.50 | 97.10% | 2.90% | 1.13% | 25.69% |
| LDA | 0.10 | 93.40% | 6.60% | 8.23% | 72.48% |
| LDA | 0.05 | 88.43% | 11.57% | 14.07% | 84.40% |
## Takeaways
- Cutoff 0.50 is very conservative: high accuracy, but it catches only ~26–32% of defaulters.
- Cutoff 0.10 flags ~8% of accounts and catches ~72% of defaulters (more useful as a screening rule).
- Cutoff 0.05 catches ~84% of defaulters, but flags ~12–14% of accounts and increases overall error.
- On this split, Logistic performs slightly better than LDA, but they’re broadly similar.
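Both reported rates fall straight out of the test-set confusion matrix. A minimal sketch on small hypothetical labels (not the actual predictions from this analysis):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels/predictions: 1 = default, 0 = no default
y_true = np.array([0, 0, 1, 0, 0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 0, 0, 0, 1, 0, 0])

# sklearn orders the flattened 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
predicted_default_rate = (tp + fp) / len(y_true)  # share of accounts flagged
caught_defaulter_rate = tp / (tp + fn)            # share of true defaulters flagged
accuracy = (tp + tn) / len(y_true)
print(predicted_default_rate, caught_defaulter_rate, accuracy)
```

The caught defaulter rate is just recall on the default class, so `sklearn.metrics.recall_score(y_true, y_pred)` gives the same number.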
## Plots

## Files
Note: results depend on the random train/test split. A next step would be cross-validation or a validation set for selecting cutoffs.