
Probit and Complementary Log-Log Models for Binary Regression


Introduction to Alternatives to Logit Models:

The logit model is only one of many methods for fitting a regression model with a binary dependent variable. Two other models are also worth discussing: the probit model and the complementary log-log model. The goal of this short blog is to compare them with the logit model, which was discussed in Binary Logistic Regression (click for more).

Differences in Distribution:

The observed variable y is classified as 1 or 0 depending on whether a latent score z falls above or below a threshold value. The three models differ only in the distribution assumed for the errors of that latent score (illustrated in the short sketch after this list):

  • Logit: The errors have a standard logistic distribution

  • Probit: The errors have a standard normal distribution

  • Complementary Log-Log: The errors have a standard extreme-value distribution (also called the Gumbel or double-exponential distribution)
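The three link functions are simply the inverses of the cumulative distribution functions of these error distributions. The short sketch below is my own illustration (not part of the original post) and assumes numpy and scipy are available; it shows the three inverse links side by side:

import numpy as np
from scipy import stats

z = np.linspace(-3, 3, 7)

# Inverse link = CDF of the assumed error distribution
p_logit   = stats.logistic.cdf(z)        # logit: 1 / (1 + exp(-z))
p_probit  = stats.norm.cdf(z)            # probit: standard normal CDF
p_cloglog = 1 - np.exp(-np.exp(z))       # complementary log-log: extreme-value (Gumbel) CDF

print(np.round(np.column_stack([z, p_logit, p_probit, p_cloglog]), 3))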

Probit Function:

The standard normal distribution has a mean of 0 and a standard deviation of 1. A standard normal variable has a cumulative distribution function; take a look at this link. For every value of the variable, the table provides the probability that a standard normal variable is less than that value. The inverse of the cumulative distribution function is the probit transformation. While probabilities range between 0 and 1, the probit function ranges between negative infinity and infinity.
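As a quick illustration (my addition, assuming scipy is installed), the probit transformation is just scipy.stats.norm.ppf, the inverse of the standard normal CDF:

from scipy.stats import norm

for p in [0.025, 0.5, 0.975]:
    z = norm.ppf(p)                                 # probit: probability -> z-score
    print(p, round(z, 3), round(norm.cdf(z), 3))    # cdf undoes ppf
# probabilities near 0 or 1 map to large negative or positive z-scores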

When fitting a binary regression model, the probit and logit models closely resemble each other and will likely lead to similar findings. The logit model’s exponentiated coefficients can be interpreted as odds ratios, while the probit model may have an advantage when several binary regressors are used in an analysis.

Complementary Log-Log Function:

The function is widely used in survival analysis. A major difference between the complementary log-log model and the logit or probit models is that the complementary log-log link is asymmetric, while the other two are symmetric around a probability of 0.5. This feature is especially important when fitting Cox regression models that rely on proportional hazards.
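A small numerical sketch (my own illustration, not from the original post) makes the asymmetry visible: the logit of p is the exact negative of the logit of 1 - p, while the complementary log-log is not:

import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def cloglog(p):
    return np.log(-np.log(1 - p))

for p in [0.1, 0.9]:
    print(p, round(logit(p), 3), round(cloglog(p), 3))
# logit(0.1) = -logit(0.9), but cloglog(0.1) != -cloglog(0.9)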

A more in-depth discussion of the complementary log-log function is available from the University of Alberta by clicking here. I also recommend Paul Allison’s Logistic Regression Using SAS for its explanation of both the probit and complementary log-log functions.

Comparison of Outputs:

While in prior work I extensively used scikit-learn, here I wanted to use the GLM class of the statsmodels package so that the link function can be specified (especially for complementary log-log). Logit and probit are also available in statsmodels as standalone models. I still used scikit-learn for partitioning the data.

from sklearn.model_selection import train_test_split
import statsmodels.api as sm

#Create training and test datasets

#Create regressors and dependent variable
#note that donr was dropped from X because it is the dependent variable
#all other variables were dropped because they were identified as not significant contributors to the regression
X = charity_df.drop(['donr', 'reg3', 'reg4','hinc','genf','avhv', 'lgif', 'rgif', 'agif', 'ID'], axis=1) 
y = charity_df['donr']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 5)
X_train = sm.add_constant(X_train)   #add the intercept column required by statsmodels

glm_logit = sm.GLM(y_train, X_train, family=sm.families.Binomial(link=sm.families.links.Logit()))
glm_probit = sm.GLM(y_train, X_train, family=sm.families.Binomial(link=sm.families.links.Probit()))
glm_cloglog = sm.GLM(y_train, X_train, family=sm.families.Binomial(link=sm.families.links.CLogLog()))

# Fit each of the three models
result_logit = glm_logit.fit()
result_probit = glm_probit.fit()
result_cloglog = glm_cloglog.fit()


# Collect both summary formats for each model and print them
stats_logit = result_logit.summary()
stats_logit2 = result_logit.summary2()
stats_probit = result_probit.summary()
stats_probit2 = result_probit.summary2()
stats_cloglog = result_cloglog.summary()
stats_cloglog2 = result_cloglog.summary2()

print(stats_logit)
print(stats_logit2)
print(stats_probit)
print(stats_probit2)
print(stats_cloglog)
print(stats_cloglog2)
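Beyond eyeballing the three summary tables, the fitted models can also be compared on a common scale. The snippet below is a sketch of my own that reuses the variable names from the code above; treat it as illustrative, and note that X_test must receive the same constant column before predicting:

import numpy as np
import pandas as pd

# Goodness of fit on the training data
comparison = pd.DataFrame({
    'aic':      [result_logit.aic, result_probit.aic, result_cloglog.aic],
    'deviance': [result_logit.deviance, result_probit.deviance, result_cloglog.deviance],
}, index=['logit', 'probit', 'cloglog'])
print(comparison)

# Only the logit coefficients exponentiate to odds ratios
print(np.exp(result_logit.params))

# Predicted probabilities on the test set (add the same constant column first)
X_test_const = sm.add_constant(X_test)
print(result_logit.predict(X_test_const).head())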

The Logit Model:

[GLM summary output for the logit model]

The Probit Model:

[GLM summary output for the probit model]

The Complementary Log-Log Model:

[GLM summary output for the complementary log-log model]