## Overview
Goal: predict whether a customer defaults (default = Yes) using balance, income, and student, and compare Logistic Regression against LDA.
Key idea: default is rare (~3–4%), so the probability cutoff you choose matters. Lower cutoffs flag more accounts as “high risk” and catch more true defaulters.
## Method
- Random 70/30 train-test split (seeded).
- Fit Logistic Regression and LDA on the training set.
- Compute predicted P(default = Yes) for the test set.
- Evaluate three cutoffs: 0.50, 0.10, 0.05.
Reporting focuses on two rates: the predicted default rate (share of accounts flagged as defaulters) and the caught defaulter rate (share of true defaulters that are flagged).
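The pipeline above can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the actual analysis: the feature distributions, coefficients, and seed are invented, and only the structure (seeded 70/30 split, two models, probability cutoffs) mirrors the method described.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for the Default data (balance, income, student)
rng = np.random.default_rng(1)
n = 10_000
balance = rng.normal(800, 450, n).clip(min=0)
income = rng.normal(33_000, 13_000, n).clip(min=0)
student = rng.integers(0, 2, n).astype(float)
X = np.column_stack([balance, income, student])
# Rare positive class driven mainly by balance (roughly a few percent)
p_true = 1 / (1 + np.exp(-(-8 + 0.0055 * balance)))
y = (rng.random(n) < p_true).astype(int)

# Seeded 70/30 train-test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "Logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000)),
    "LDA": LinearDiscriminantAnalysis(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]      # P(default = Yes)
    for cutoff in (0.50, 0.10, 0.05):
        pred = p >= cutoff
        acc = (pred == y_te).mean()
        flagged = pred.mean()                # predicted default rate
        caught = pred[y_te == 1].mean()      # caught defaulter rate
        print(f"{name} @ {cutoff:.2f}: "
              f"acc={acc:.4f}, flagged={flagged:.4f}, caught={caught:.4f}")
```

Because the set of accounts flagged at cutoff 0.05 is a superset of those flagged at 0.50, the flagged and caught rates are guaranteed to be non-decreasing as the cutoff is lowered, regardless of model or data.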
## Results
Baseline context: if we predicted “No default” for everyone, accuracy would be about 96.37% (since the test default rate is ~3.63%).
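The baseline follows directly from the test-set class balance: always predicting "No default" is correct exactly when the account does not default. A quick arithmetic check, using the ~3.63% test default rate stated above:

```python
# Always-"No" baseline: accuracy = share of non-defaulters in the test set
test_default_rate = 0.0363
baseline_accuracy = 1 - test_default_rate
print(f"{baseline_accuracy:.2%}")  # → 96.37%
```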
| Model | Cutoff | Test accuracy | Test error | Predicted default rate | Caught defaulter rate |
|---|---|---|---|---|---|
| Logistic | 0.50 | 97.20% | 2.80% | 1.50% | 32.11% |
| Logistic | 0.10 | 93.53% | 6.47% | 8.10% | 72.48% |
| Logistic | 0.05 | 89.97% | 10.03% | 12.53% | 84.40% |
| LDA | 0.50 | 97.10% | 2.90% | 1.13% | 25.69% |
| LDA | 0.10 | 93.40% | 6.60% | 8.23% | 72.48% |
| LDA | 0.05 | 88.43% | 11.57% | 14.07% | 84.40% |
## Takeaways
- Cutoff 0.50 is very conservative: high accuracy, but it catches only ~26–32% of defaulters.
- Cutoff 0.10 flags ~8% of accounts and catches ~72% of defaulters (more useful as a screening rule).
- Cutoff 0.05 catches ~84% of defaulters, but flags ~12–14% of accounts and increases overall error.
- On this split, Logistic performs slightly better than LDA, but they’re broadly similar.
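Both reported rates fall straight out of the test-set confusion matrix. A minimal sketch on small hypothetical labels (not the actual predictions from this analysis):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels/predictions: 1 = default, 0 = no default
y_true = np.array([0, 0, 1, 0, 0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 0, 0, 0, 1, 0, 0])

# sklearn orders the flattened 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
predicted_default_rate = (tp + fp) / len(y_true)  # share of accounts flagged
caught_defaulter_rate = tp / (tp + fn)            # share of true defaulters flagged
accuracy = (tp + tn) / len(y_true)
print(predicted_default_rate, caught_defaulter_rate, accuracy)
```

The caught defaulter rate is just recall on the default class, so `sklearn.metrics.recall_score(y_true, y_pred)` gives the same number.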
## Plots

## Files
Note: results depend on the random train/test split. A next step would be cross-validation or a validation set for selecting cutoffs.