November 18, 2020

Announcements

  • No class next week - Happy Thanksgiving!
  • Please complete the course evaluation if you have not done so already.
  • Many of you have not yet signed up for a presentation time. The first one is in two weeks! Click here to sign up for a time slot to present.

Relationship between dichotomous (x) and continuous (y) variables

df <- data.frame(
    x = rep(c(0, 1), each = 10),
    y = c(rnorm(10, mean = 1, sd = 1),
          rnorm(10, mean = 2.5, sd = 1.5))
)
head(df)
##   x         y
## 1 0 1.9243372
## 2 0 0.6924972
## 3 0 0.3735868
## 4 0 0.3772007
## 5 0 1.5807124
## 6 0 1.6292041
tab <- describeBy(df$y, group = df$x, mat = TRUE, skew = FALSE)
tab$group1 <- as.integer(as.character(tab$group1))

Relationship between dichotomous (x) and continuous (y) variables

ggplot(df, aes(x = x, y = y)) + geom_point(alpha = 0.5) +
    geom_point(data = tab, aes(x = group1, y = mean), color = 'red', size = 4) + 
    geom_smooth(method = lm, se = FALSE, formula = y ~ x)

Regression so far…

At this point we have covered:

  • Simple linear regression
    • Relationship between numerical response and a numerical or categorical predictor
  • Multiple regression
    • Relationship between numerical response and multiple numerical and/or categorical predictors

What we haven’t seen is what to do when the predictors are weird (nonlinear, complicated dependence structure, etc.) or when the response is weird (categorical, count data, etc.)

Odds

Odds are another way of quantifying the probability of an event, commonly used in gambling (and logistic regression).

Odds

For some event \(E\),

\[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1-P(E)}\]

Similarly, if we are told the odds of E are \(x\) to \(y\) then

\[\text{odds}(E) = \frac{x}{y} = \frac{x/(x+y)}{y/(x+y)} \]

which implies

\[P(E) = x/(x+y),\quad P(E^c) = y/(x+y)\]

Generalized Linear Models

Generalized linear models (GLM) is a generalization of OLS that allows for the response variables (i.e. dependent variables) to have an error distribution that is not normally distributed. Logistic regression is just one type of GLM, specifically for dichotomous response variables that follow a binomial distribution.

All generalized linear models have the following three characteristics:

  1. A probability distribution describing the outcome variable
  2. A linear model
    \(\eta = \beta_0+\beta_1 X_1 + \cdots + \beta_n X_n\)
  3. A link function that relates the linear model to the parameter of the outcome distribution
    \(g(p) = \eta\) or \(p = g^{-1}(\eta)\)

Logistic Regression

Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.

We assume a binomial distribution produced the outcome variable and we therefore want to model p the probability of success for a given set of predictors.

To finish specifying the Logistic model we just need to establish a reasonable link function that connects \(\eta\) to \(p\). There are a variety of options but the most commonly used is the logit function.

Logit function

\[logit(p) = \log\left(\frac{p}{1-p}\right),\text{ for $0\le p \le 1$}\]

The Logistic Function

\[ \sigma \left( t \right) =\frac { { e }^{ t } }{ { e }^{ t }+1 } =\frac { 1 }{ 1+{ e }^{ -t } } \]

logistic <- function(t) { return(1 / (1 + exp(-t))) }
df <- data.frame(x=seq(-4, 4, by=0.01))
df$sigma_t <- logistic(df$x)
plot(df$x, df$sigma_t)

t as a Linear Function

\[ t = \beta_0 + \beta_1 x \]

The logistic function can now be rewritten as

\[ F\left( x \right) =\frac { 1 }{ 1+{ e }^{ -\left( { \beta }_{ 0 }+\beta _{ 1 }x \right) } } \]

Similar to OLS, we wish to minimize the errors. However, instead of minimizing the least squared residuals, we will use a maximum likelihood function.

Example: Hours Studying Predicting Passing

study <- data.frame(
    Hours=c(0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,
            3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50),
    Pass=c(0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1)
)
lr.out <- glm(Pass ~ Hours, data=study, family=binomial(link='logit'))
lr.out
## 
## Call:  glm(formula = Pass ~ Hours, family = binomial(link = "logit"), 
##     data = study)
## 
## Coefficients:
## (Intercept)        Hours  
##      -4.078        1.505  
## 
## Degrees of Freedom: 19 Total (i.e. Null);  18 Residual
## Null Deviance:       27.73 
## Residual Deviance: 16.06     AIC: 20.06

Model

\[log(\frac{p}{1-p}) = -4.078 + 1.505 \times Hours\]

Plotting the Results

Prediction

Odds (or probability) of passing if studied zero hours?

\[log(\frac{p}{1-p}) = -4.078 + 1.505 \times 0\] \[\frac{p}{1-p} = exp(-4.078) = 0.0169\] \[p = \frac{0.0169}{1.169} = .016\]

Odds (or probability) of passing if studied 4 hours?

\[log(\frac{p}{1-p}) = -4.078 + 1.505 \times 4\] \[\frac{p}{1-p} = exp(1.942) = 6.97\] \[p = \frac{6.97}{7.97} = 0.875\]

Fitted Values

study[1,]
##   Hours Pass    Predict Predict_Pass
## 1   0.5    0 0.03471034        FALSE
logistic <- function(x, b0, b1) {
    return(1 / (1 + exp(-1 * (b0 + b1 * x)) ))
}
logistic(.5, b0=-4.078, b1=1.505)
## [1] 0.03470667

Of course, the fitted function will do the same:

fitted(lr.out)[1]
##          1 
## 0.03471034

Model Performance

The use of statistical models to predict outcomes, typically on new data, is called predictive modeling. Logistic regression is a common statistical procedure used for prediction. We will utilize a confusion matrix to evaluate accuracy of the predictions.

Titanic

str(titanic_train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Data Setup

We will split the data into a training set (70% of observations) and validation set (30%).

train.rows <- sample(nrow(titanic), nrow(titanic) * .7)
titanic_train <- titanic[train.rows,]
titanic_test <- titanic[-train.rows,]

This is the proportions of survivors and defines what our “guessing” rate is. That is, if we guessed no one survived, we would be correct 62% of the time.

(survived <- table(titanic_train$survived) %>% prop.table)
## 
##        No       Yes 
## 0.6200873 0.3799127

Model Training

lr.out <- glm(survived ~ pclass + sex + sibsp + parch,
              family = binomial(link = 'logit'), data = titanic_train)
summary(lr.out)
## 
## Call:
## glm(formula = survived ~ pclass + sex + sibsp + parch, family = binomial(link = "logit"), 
##     data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1258  -0.6870  -0.5023   0.6760   2.2936  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.47998    0.16614   8.908  < 2e-16 ***
## pclass.L    -1.13024    0.14711  -7.683 1.56e-14 ***
## pclass.Q     0.09514    0.16473   0.578    0.564    
## sexmale     -2.72599    0.18640 -14.624  < 2e-16 ***
## sibsp       -0.16882    0.10945  -1.542    0.123    
## parch       -0.04256    0.09764  -0.436    0.663    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1216.49  on 915  degrees of freedom
## Residual deviance:  866.04  on 910  degrees of freedom
## AIC: 878.04
## 
## Number of Fisher Scoring iterations: 4

Predicted Values

titanic_train$prediction <- predict(lr.out, type = 'response', newdata = titanic_train)
ggplot(titanic_train, aes(x = prediction, color = survived)) + geom_density()

Results

titanic_train$prediction_class <- titanic_train$prediction > 0.5
tab <- table(titanic_train$prediction_class, 
             titanic_train$survived) %>% prop.table() %>% print()
##        
##                 No        Yes
##   FALSE 0.53820961 0.12445415
##   TRUE  0.08187773 0.25545852

For the training set, the overall accuracy is 79.37%. Recall that 37.99% of passengers survived. Therefore, the simplest model would be to predict that everyone died, which would mean we would be correct 62.01% of the time. Therefore, our prediction model is 17.36% better than guessing.

Checking with the validation dataset

(survived_test <- table(titanic_test$survived) %>% prop.table())
## 
##        No       Yes 
## 0.6132316 0.3867684
titanic_test$prediction <- predict(lr.out, newdata = titanic_test, type = 'response')
titanic_test$prediciton_class <- titanic_test$prediction > 0.5
tab_test <- table(titanic_test$prediciton_class, titanic_test$survived) %>%
    prop.table() %>% print()
##        
##                No       Yes
##   FALSE 0.5089059 0.1246819
##   TRUE  0.1043257 0.2620865

The overall accuracy is 77.1%, or 15.8% better than guessing.

Receiver Operating Characteristic (ROC) Curve

The ROC curve is created by plotting the true positive rate (TPR; AKA sensitivity) against the false positive rate (FPR; AKA probability of false alarm) at various threshold settings.

roc <- calculate_roc(titanic_train$prediction, titanic_train$survived == 'Yes')
summary(roc)
## AUC = 0.82
## Cost of false-positive = 1
## Cost of false-negative = 1
## Threshold with minimum cost = 0.576

ROC Curve

plot(roc)

ROC Curve

plot(roc, curve = 'accuracy')

Caution on Interpreting Accuracy

  • Loh, Sooo, and Zing (2016) predicted sexual orientation based on Facebook Status.
  • They reported model accuracies of approximately 90% using SVM, logistic regression and/or random forest methods.
  • Gallup (2018) poll estimates that 4.5% of the Americal population identifies as LGBT.
  • My proposed model: I predict all Americans are heterosexual.
  • The accuracy of my model is 95.5%, or 5.5% better than Facebook’s model!
  • Predicting “rare” events (i.e. when the proportion of one of the two outcomes large) is difficult and requires independent (predictor) variables that strongly associated with the dependent (outcome) variable.

Fitted Values Revisited

What happens when the ratio of true-to-false increases (i.e. want to predict “rare” events)?

Let’s simulate a dataset where the ratio of true-to-false is 10-to-1. We can also define the distribution of the dependent variable. Here, there is moderate separation in the distributions.

test.df2 <- getSimulatedData(treat.mean=.6, 
                             control.mean=.4)

The multilevelPSA::psrange function will sample with varying ratios from 1:10 to 1:1. It takes multiple samples and averages the ranges and distributions of the fitted values from logistic regression.

psranges2 <- psrange(test.df2,
                     test.df2$treat, 
                     treat ~ .,
                     samples=seq(100,1000,by=100), 
                     nboot=20)

plot(psranges2)

Additional Resources