── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): admit, gre, gpa, rank
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# Treat rank as a categorical predictor rather than a number
mydata$rank <- factor(mydata$rank)
Step 1. Split the data into training and test sets
Code
library(caret)      # provides createDataPartition() and the GermanCredit data
data(GermanCredit)

set.seed(12345789)
# Stratified split on the outcome: 60% training, 40% testing
inTrain <- createDataPartition(y = GermanCredit$Class, p = 0.6, list = FALSE)
training <- GermanCredit[inTrain, ]
testing <- GermanCredit[-inTrain, ]
To create the logistic model, we’ll use the train function from caret.
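The chunk that fits the model did not survive rendering; below is a minimal sketch of what the call likely looks like, assuming the object is named lmodel (the name used when predicting later) and the same predictors as the likelihood ratio test models further down.
Code
# Fit a logistic regression via caret's train(); glm with a binomial family
lmodel <- train(Class ~ Age + ForeignWorker + Property.RealEstate +
                  Housing.Own + CreditHistory.Critical,
                data = training, method = "glm", family = "binomial")
summary(lmodel)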
Keep in mind that you can't read the coefficients directly the way you would in simple linear regression: they are on the log-odds scale, so exponentiate them to get odds ratios.
For example, our model suggests that for every one-unit increase in Age, the odds of the consumer having good credit increase by a factor of 1.02.
Essentially, if the confidence interval for an odds ratio contains 1, that predictor is not statistically significant.
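A quick way to pull the odds ratios and their confidence intervals out of the fitted model (a sketch, assuming the caret object is named lmodel as above):
Code
# Odds ratios: exponentiate the log-odds coefficients of the underlying glm
exp(coef(lmodel$finalModel))
# Odds-ratio confidence intervals from the profile-likelihood intervals
exp(confint(lmodel$finalModel))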
Making Predictions
To make predictions, simply use the predict function from base R.
Code
predictions <- predict(lmodel, newdata = testing, type = "prob")
head(predictions)
Bad Good
2 0.2547727 0.7452273
6 0.4854772 0.5145228
10 0.2040837 0.7959163
15 0.5020246 0.4979754
17 0.1683434 0.8316566
18 0.3710473 0.6289527
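Since type = "prob" returns class probabilities, a natural follow-up is to predict hard class labels and compare them with the observed classes; a sketch using caret's confusionMatrix, assuming the same lmodel and testing objects:
Code
# Predicted class labels (0.5 probability cutoff by default)
pred_class <- predict(lmodel, newdata = testing)
confusionMatrix(pred_class, testing$Class)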
Model Evaluation and Diagnostics
Is this model any good? How well does it fit our data? Which predictors are the most important?
1. Likelihood Ratio Test:
Tests whether the full model fits the data better than a reduced (nested) model. A small p-value is evidence that the full model fits the data significantly better than the reduced model.
Code
# Full model: five predictors
mod_fit_one <- glm(Class ~ Age + ForeignWorker + Property.RealEstate +
                     Housing.Own + CreditHistory.Critical,
                   data = training, family = "binomial")
# Reduced (nested) model: two predictors
mod_fit_two <- glm(Class ~ Age + ForeignWorker,
                   data = training, family = binomial(link = "logit"))
# Likelihood ratio (chi-square) test comparing the two fits
anova(mod_fit_one, mod_fit_two, test = "Chisq")
Analysis of Deviance Table
Model 1: Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
CreditHistory.Critical
Model 2: Class ~ Age + ForeignWorker
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 594 691.14
2 597 723.12 -3 -31.976 5.294e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: the full model is a better fit than the reduced model.
2. Pseudo R^2
Linear regression has the well-known R^{2}. However, logistic regression does not, so we'll use something called McFadden's R^{2}, which is given by: 1-\frac{\ln(L_M)}{\ln(L_0)} where \ln(L_M) is the log-likelihood for the fitted model and \ln(L_0) is the log-likelihood for the null model with just the intercept. Values closer to 0 indicate less predictive power and values closer to 1 indicate more predictive power.
Code
library(pscl)
Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis
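The call that produced the McFadden's R^{2} value discussed below was not preserved in the rendered output; a minimal sketch, assuming pscl's pR2() applied to the glm underlying the caret model:
Code
# Pseudo R^2 measures for the fitted model; "McFadden" is the entry used here
pR2(lmodel$finalModel)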
The output above indicates that our model has low predictive power.
3. Hosmer-Lemeshow Test
This test examines whether the observed proportions of events match the predicted probabilities of occurrence in subgroups of the data, using a chi-square test. Interpretation: a small test statistic with a large p-value indicates a good fit to the data, while a large statistic with a p-value below 0.05 indicates a poor fit. Null hypothesis: the model fits the data.
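No code for this test survives in the rendered output; one common implementation is hoslem.test() from the ResourceSelection package. A sketch, assuming the mod_fit_one glm from the likelihood ratio test above and that "Good" is the event level:
Code
library(ResourceSelection)
# Observed 0/1 outcomes vs. fitted probabilities, split into g = 10 groups
hoslem.test(as.numeric(training$Class == "Good"),
            fitted(mod_fit_one), g = 10)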
4. Variable Importance
Measures the absolute value of the t-statistic (the z value, for a glm) for each model parameter.
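In caret this is what varImp() reports. The summary that follows was produced by a probit model fit on the admissions data loaded at the top of this section; that chunk was not preserved, so here is a minimal sketch, assuming the model object is named myprobit:
Code
# Variable importance for the caret model trained earlier
varImp(lmodel)

# Probit regression on the admissions data (rank was converted to a factor above)
myprobit <- glm(admit ~ gre + gpa + rank,
                family = binomial(link = "probit"), data = mydata)
summary(myprobit)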
Call:
glm(formula = admit ~ gre + gpa + rank, family = binomial(link = "probit"),
data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6163 -0.8710 -0.6389 1.1560 2.1035
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.386836 0.673946 -3.542 0.000398 ***
gre 0.001376 0.000650 2.116 0.034329 *
gpa 0.477730 0.197197 2.423 0.015410 *
rank2 -0.415399 0.194977 -2.131 0.033130 *
rank3 -0.812138 0.208358 -3.898 9.71e-05 ***
rank4 -0.935899 0.245272 -3.816 0.000136 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.41 on 394 degrees of freedom
AIC: 470.41
Number of Fisher Scoring iterations: 4
As is evident, all model terms are statistically significant at \alpha = 0.05.
To interpret the probit coefficients and output:
For a one-unit increase in gre, the z-score of admission increases by 0.001.
For a one-unit increase in gpa, the z-score of admission increases by 0.478.
Attending a rank 2 school decreases the z-score of admission by 0.4154 compared to a rank 1 school.
The null deviance measures the fit of a model with just the intercept, while the residual deviance measures the fit of the full model; the drop from one to the other is the improvement contributed by the predictors. The goal is for the model to have a residual deviance well below the null deviance.
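That drop in deviance can itself be tested against a chi-square distribution as an overall test of the model; a short sketch using the myprobit object assumed above:
Code
# Model chi-square: 499.98 - 458.41 = 41.57 on 399 - 394 = 5 degrees of freedom
with(myprobit, pchisq(null.deviance - deviance,
                      df.null - df.residual, lower.tail = FALSE))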