Logistic Regression with caret

A quick notebook on how to create and interpret LR models.
code
ML
algorithms
college
Author

Matthias Quinn

Published

October 26, 2019

Logistic Regression

Goals:

  1. Learn the basics of a logistic regression
  2. Learn how to implement logistic regression in R with Caret

Notes:

  • Logistic Regression is suited for categorical response variables and one or more predictor variables.
  • Log(Odds) = \beta1*X1+...\beta N*Xn
  • The log(odds) ratio is ln[p/(1-p)]
  • Predicted probability of an event occurring: p = 1/(1+\exp^{-z})

This project will be using the German Credit dataset from the caret package. Goal is to predict whether a consumer is good or bad for business.

This project will be using the German Credit dataset from the caret package. Goal is to predict whether a consumer is good or bad for business.

Code
library(caret)
Loading required package: ggplot2
Loading required package: lattice
Code
library(readr)
library(jsonlite)
data("GermanCredit")
mydata <- read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
Rows: 400 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): admit, gre, gpa, rank

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
mydata$rank <- factor(mydata$rank)

Step 1. Split the data into training and test sets

Code
set.seed(12345789)
inTrain <- createDataPartition(y = GermanCredit$Class, p = 0.6, list = FALSE, )
training <- GermanCredit[inTrain, ]
testing  <- GermanCredit[-inTrain, ]

To create the logistic model, we’ll use the train function from caret.

Code
lmodel <- train(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own + CreditHistory.Critical,
                data = training,
                method = "glm",
                family = "binomial")

To obtain the odds for the coefficients and remove the log:

Code
exp(coef(lmodel$finalModel))
           (Intercept)                    Age          ForeignWorker 
             3.0516947              1.0095028              0.2494196 
   Property.RealEstate            Housing.Own CreditHistory.Critical 
             1.7752920              1.7580389              2.2363912 

Keep in mind that you can’t just read the coefficients like you would in simple linear regression.

For example, our model is suggesting that for every one unit increase in Age, the (odds) of the consumer having good credit increases by a factor of 1.02

95% Confidence Interval for terms:

Code
logistic = glm(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own + CreditHistory.Critical,
                data = training,
                family = "binomial")

cbind(OR = exp(coef(logistic)), exp(confint(logistic)), pValue = (summary(logistic)$coefficients[, 4]))
Waiting for profiling to be done...
                              OR      2.5 %     97.5 %       pValue
(Intercept)            3.0516947 0.74027688 20.9221296 0.1700607984
Age                    1.0095028 0.99357880  1.0262071 0.2501534852
ForeignWorker          0.2494196 0.03925941  0.8803206 0.0651729645
Property.RealEstate    1.7752920 1.15611482  2.7758529 0.0100252083
Housing.Own            1.7580389 1.18604549  2.6003291 0.0047917432
CreditHistory.Critical 2.2363912 1.45002355  3.5270854 0.0003723358

Essentially, if the confidence interval contains 1, it is not very helpful

##Making Predictions: To make predictions, simple use the predict function from base R.

Code
predictions = predict(lmodel, newdata = testing, type = "prob")
head(predictions)
         Bad      Good
2  0.2547727 0.7452273
6  0.4854772 0.5145228
10 0.2040837 0.7959163
15 0.5020246 0.4979754
17 0.1683434 0.8316566
18 0.3710473 0.6289527

Model Evaluation and Diagnostics

Is this model any good? How well does it fit our data? Which predictors are the most important?

1. Likelihood Ratio Test:

Tests whether the full model or a lesser model is the better fit for the data. A smaller p-value provides sufficient evidence that the full model is better than a reduced model.

Code
mod_fit_one <- glm(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own + 
                     CreditHistory.Critical,
                   data=training,
                   family="binomial")
mod_fit_two <- glm(Class ~ Age + ForeignWorker,
                   data=training,
                   family = binomial(link = "logit"))

anova(mod_fit_one, mod_fit_two, test ="Chisq")
Analysis of Deviance Table

Model 1: Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own + 
    CreditHistory.Critical
Model 2: Class ~ Age + ForeignWorker
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       594     691.14                          
2       597     723.12 -3  -31.976 5.294e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion The full model is a better fit than a reduced model.

2. Pseudo R^2

Linear regression has the well-known R^{2}. However, logistic regression does not, so we’ll use something called McFadden’s R^{2}, which is given by: 1-\frac{ln(LM)}{ln(L0)} where ln(LM) is the log likelihood value for the fitted model and ln(L0) is the log likelihood for the null model with just the intercept. Values closer to 0 indicate less predictive power and values closer to 1 indicate more predictive power.

Code
library(pscl)
Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis
Code
pR2(mod_fit_one)
fitting null model for pseudo-r2
          llh       llhNull            G2      McFadden          r2ML 
-345.57110076 -366.51858123   41.89496095    0.05715257    0.06744294 
         r2CU 
   0.09562580 

The output above indicates that are model has low predictive power

3. Hosmer-Lemeshow Test

This test examines whether the proportion of events are similar to the predicted probabilities of occurrence in subgroups of the data using a chi-square test. Interpretation: Small values with large p-values indicate a good fit to the data while large values with p-values below 0.05 indicate a poor fit. Null hypothesis: the model fits the data

4. Variable Importance

Measures the absolute value of the t-statistic for each model parameter

Code
varImp(lmodel)
glm variable importance

                       Overall
CreditHistory.Critical  100.00
Housing.Own              69.35
Property.RealEstate      59.15
ForeignWorker            28.81
Age                       0.00

5. Confusion Matrix

Code
preds = predict(lmodel, newdata = testing)
confusionMatrix(data = preds, reference = testing$Class)
Confusion Matrix and Statistics

          Reference
Prediction Bad Good
      Bad   12   15
      Good 108  265
                                          
               Accuracy : 0.6925          
                 95% CI : (0.6447, 0.7374)
    No Information Rate : 0.7             
    P-Value [Acc > NIR] : 0.651           
                                          
                  Kappa : 0.0596          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.1000          
            Specificity : 0.9464          
         Pos Pred Value : 0.4444          
         Neg Pred Value : 0.7105          
             Prevalence : 0.3000          
         Detection Rate : 0.0300          
   Detection Prevalence : 0.0675          
      Balanced Accuracy : 0.5232          
                                          
       'Positive' Class : Bad             
                                          

Probit Regression:

E(Y|X) = P(Y=1|X)

Code
myprobit <- glm(admit ~ gre + gpa + rank,
                family = binomial(link = "probit"),
                data = mydata)
summary(myprobit)

Call:
glm(formula = admit ~ gre + gpa + rank, family = binomial(link = "probit"), 
    data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6163  -0.8710  -0.6389   1.1560   2.1035  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.386836   0.673946  -3.542 0.000398 ***
gre          0.001376   0.000650   2.116 0.034329 *  
gpa          0.477730   0.197197   2.423 0.015410 *  
rank2       -0.415399   0.194977  -2.131 0.033130 *  
rank3       -0.812138   0.208358  -3.898 9.71e-05 ***
rank4       -0.935899   0.245272  -3.816 0.000136 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.41  on 394  degrees of freedom
AIC: 470.41

Number of Fisher Scoring iterations: 4

As is evident, all model terms are statistically significant at \alpha = 0.05

To interpret the probit coefficients and output:

  • For a one unit increase in gre, the z-score of admission increases by 0.001.
  • For a one unit increase in gpa, the z-score of admission increases by 0.478. *Attending a rank 2 school decreases the z-score of admission by 0.4154 compared to a rank 1 school.
  • Null deviance represents the difference between a full model and a model with just the y-intercept as a coefficient. The goal is for the model to have a low residual deviance.