November 4, 2020

Announcements

Data project proposals have been graded. If you didn’t get 5/5 you can resubmit for full credit. Be sure to see my comments.

The Data Science Institute at Columbia University will host the third annual Machine Learning in Science & Engineering (MLSE 2020) conference virtually on December 14-15, 2020. It is free for students. Click here for more information.

Presentations

Data Project

Click here to sign up for a presentation slot. There are two time slots (12 to 1:30pm and 8 to 9:30pm) on each of three days (December 1st, 3rd, and 8th).

You are required to attend ONLY ONE of those time slots. You will give your presentation, watch the other presentations, and provide peer feedback (which will be shared anonymously afterward).

Presentations should be no more than 10 minutes.

Checklist / Suggested Outline

  • Overview slide

    • Context on the data collection
    • Description of the dependent variable (what is being measured)
    • Description of the independent variables (what is being measured; include at least 2 variables)
    • Research question
  • Summary statistics

  • Include appropriate data visualizations.

  • Statistical output

    • Include the appropriate statistics for your method used.
    • For null hypothesis tests (e.g. t-test, chi-squared, ANOVA, etc.), state the null and alternative hypotheses along with the relevant test statistic and p-value (and a confidence interval if appropriate).
    • For regression models, include the regression output and interpret the R-squared value.
  • Conclusion

    • Why is this analysis important?
    • Limitations of the analysis?

Criteria for Grading

  • Data is presented to support the conclusions using the appropriate analysis (i.e. the statistical method chosen supports the research question).

  • Suitable tables summarize data in a clear and meaningful way even to those unfamiliar with the project.

  • Suitable graphics summarize data in a clear and meaningful way even to those unfamiliar with the project.

  • Data reviewed and analyzed accurately and coherently.

  • Proper use of descriptive and/or inferential statistics.

Example Project

2000 Election

The 2000 election between George Bush and Al Gore was ultimately decided in Florida. However, there was a third candidate on the ballot, Pat Buchanan, and one county with an unexpected outcome. Is there evidence that a large number of votes in that county were cast for an unintended candidate?

The elections data frame contains the breakdown of votes by each of the 67 counties in Florida.

elections <- read.table("../course_data/2000elections.txt", header=TRUE)

Across these 67 counties, a total of 2,910,078 votes were cast for George Bush and 2,909,117 for Al Gore, resulting in Bush winning by 961 votes.

However, in the days following the election there was much controversy surrounding so-called “hanging chads.” That is, there were a number of ballots where it was not clear who the vote was for. This was a particular issue in Palm Beach.

Number of votes by county in Florida

library(ggplot2)

# Scatterplot: one point per county
ggplot(elections, aes(bush, buch)) + geom_point() +
    xlab("Number of votes for Bush") + ylab("Number of votes for Buchanan") +
    ggtitle("Number of votes by county in Florida")

Correlation

cor.test(elections$buch, elections$bush)
## 
##  Pearson's product-moment correlation
## 
## data:  elections$buch and elections$bush
## t = 6.455, df = 65, p-value = 1.574e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4527668 0.7522709
## sample estimates:
##       cor 
## 0.6250012
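
The test statistic and confidence interval above can be recovered from the sample correlation alone. The following is a minimal sketch, assuming the interval is built with the usual Fisher z-transformation; the inputs are copied from the output above, and the variable names are only illustrative.

```r
# Values copied from the cor.test() output above
r  <- 0.6250012   # sample correlation
n  <- 67          # number of counties
df <- n - 2       # degrees of freedom reported by cor.test()

# t statistic and two-sided p-value for H0: true correlation = 0
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)   # compare with t = 6.455
signif(p_value, 4) # compare with the reported p-value
round(ci, 4)       # compare with the reported 95 percent interval
```

Because the interval is computed on the z scale and then transformed back, it is not symmetric around the estimated correlation.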

Linear Regression Model

model1 <- lm(buch ~ bush, data = elections)
summary(model1)
## 
## Call:
## lm(formula = buch ~ bush, data = elections)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -911.30  -46.11  -26.05   12.01 2608.01 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.697e+01  5.446e+01   0.863    0.392    
## bush        4.920e-03  7.622e-04   6.455 1.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 353.9 on 65 degrees of freedom
## Multiple R-squared:  0.3906, Adjusted R-squared:  0.3813 
## F-statistic: 41.67 on 1 and 65 DF,  p-value: 1.574e-08
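
With a single predictor, the regression output ties directly back to the correlation test. The sketch below uses only numbers copied from the two outputs above (the variable names are illustrative) to check those links and to put the slope on a more readable scale.

```r
# Values copied from the outputs above
r      <- 0.6250012   # correlation between buch and bush
t_stat <- 6.455       # t value for the bush coefficient
slope  <- 4.920e-03   # estimated bush coefficient

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2          # compare with Multiple R-squared: 0.3906

# With one predictor, the F statistic is the squared t statistic
f_stat <- t_stat^2        # compare with F-statistic: 41.67

# Expected change in Buchanan votes per 1,000 additional Bush votes
extra_per_1000 <- slope * 1000

round(c(r_squared = r_squared, f_stat = f_stat,
        extra_per_1000 = extra_per_1000), 4)
```

Read this way, the linear model accounts for roughly 39% of the county-to-county variation in Buchanan votes, and each additional 1,000 Bush votes is associated with about 5 more Buchanan votes.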

Residual Analysis
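
One common way to carry out this step is to plot the residuals against the fitted values. The helper below is a hypothetical sketch, not taken from the course materials; it uses base graphics so that it runs without extra packages and works for any fitted lm object.

```r
# Hypothetical helper: residuals-versus-fitted plot for a fitted lm object
plot_residuals <- function(model, main = "Residuals vs. fitted values") {
  out <- data.frame(fitted = fitted(model), residual = resid(model))
  plot(out$fitted, out$residual,
       xlab = "Fitted values", ylab = "Residuals", main = main)
  abline(h = 0, lty = 2)   # reference line at zero
  invisible(out)           # return the plotted values for further checks
}

# Usage with the model fitted above (assumes model1 exists in the session):
# plot_residuals(model1)
```

A fan shape or a few very large residuals in such a plot suggests that the linear model on the original scale is a poor fit.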

Log Transform

ggplot(elections, aes(bush, buch)) + geom_point() +
    scale_x_log10() + scale_y_log10() +
    xlab("Number of votes for Bush (log scale)") + ylab("Number of votes for Buchanan (log scale)") +
    ggtitle("Number of votes by county in Florida")

Correlation with log transformations

cor.test(log(elections$buch), log(elections$bush))
## 
##  Pearson's product-moment correlation
## 
## data:  log(elections$buch) and log(elections$bush)
## t = 19.222, df = 65, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8760098 0.9515894
## sample estimates:
##       cor 
## 0.9221706

Linear Regression Model (log transform)

model2 <- lm(log(buch) ~ log(bush), data = elections)
summary(model2)
## 
## Call:
## lm(formula = log(buch) ~ log(bush), data = elections)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97038 -0.24247  0.00825  0.25452  1.65752 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.55079    0.38903  -6.557 1.04e-08 ***
## log(bush)    0.75620    0.03934  19.222  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4672 on 65 degrees of freedom
## Multiple R-squared:  0.8504, Adjusted R-squared:  0.8481 
## F-statistic: 369.5 on 1 and 65 DF,  p-value: < 2.2e-16
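
In a log-log model the slope describes proportional rather than absolute change. The sketch below takes the slope from the output above and translates it into the expected multiplicative change in Buchanan votes; the variable names are illustrative.

```r
# Slope copied from summary(model2)
b <- 0.75620   # coefficient on log(bush)

# The fitted relationship is buch = exp(intercept) * bush^b, so multiplying
# bush by a factor k multiplies the predicted buch by k^b.
pct_change_10 <- (1.10^b - 1) * 100   # % change in buch for 10% more Bush votes
ratio_double  <- 2^b                  # factor applied to buch when bush doubles

round(c(pct_change_10 = pct_change_10, ratio_double = ratio_double), 3)
```

Under this model, a county with 10% more Bush votes is expected to have about 7.5% more Buchanan votes, and doubling the Bush vote multiplies the expected Buchanan vote by roughly 1.7.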

Regression model without Palm Beach

# Row 50 of elections is Palm Beach; drop it before refitting
model3 <- lm(log(buch) ~ log(bush), data = elections[-50,])
summary(model3)
## 
## Call:
## lm(formula = log(buch) ~ log(bush), data = elections[-50, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97136 -0.22384  0.02279  0.26959  1.00652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.31657    0.35470  -6.531 1.23e-08 ***
## log(bush)    0.72960    0.03599  20.271  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4203 on 64 degrees of freedom
## Multiple R-squared:  0.8652, Adjusted R-squared:  0.8631 
## F-statistic: 410.9 on 1 and 64 DF,  p-value: < 2.2e-16

Residual Analysis (log)

Predict Palm Beach from the model

Obtain the predicted vote count for Palm Beach from the model fitted without Palm Beach

new <- data.frame(bush = elections$bush[50])

The difference between the observed vote count and the predicted count on the original scale

elections$buch[50] - exp(predict(model3, new))
##        1 
## 2809.498

Predict Palm Beach from the model (cont.)

Prediction interval for log(vote count)

predict(model3, new, interval='prediction', level=.95)
##        fit      lwr      upr
## 1 6.392757 5.532353 7.253162
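
An interval for a single new county is wider than an interval for the mean because it also includes the residual variation. As a rough sketch, the pieces can be backed out of the numbers printed above and in summary(model3); the variable names are illustrative, and the decomposition assumes the standard formula for a linear-model prediction interval.

```r
# Values copied from the predict() output and from summary(model3)
fit <- 6.392757; lwr <- 5.532353; upr <- 7.253162
df_resid <- 64       # residual degrees of freedom
sigma    <- 0.4203   # residual standard error

# The interval is fit +/- t_crit * se_pred, with se_pred^2 = sigma^2 + se_fit^2
t_crit  <- qt(0.975, df_resid)
se_pred <- (upr - lwr) / (2 * t_crit)   # standard error for a new observation
se_fit  <- sqrt(se_pred^2 - sigma^2)    # standard error of the fitted mean

round(c(t_crit = t_crit, se_pred = se_pred, se_fit = se_fit), 4)
```

Almost all of the interval's width comes from the residual standard error, so the interval would stay wide even with a very precisely estimated regression line.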

Prediction interval on the original scale

exp(predict(model3, new, interval='prediction',level=.95))
##        fit     lwr      upr
## 1 597.5019 252.738 1412.564
elections$buch[50]
## [1] 3407

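Therefore, what we can say is that Palm Beach likely differs from the other counties: its observed 3,407 Buchanan votes fall far above the 95% prediction interval (roughly 253 to 1,413 votes) from the model fitted to the other 66 counties.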
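
The comparison behind this conclusion can be checked by hand from the log-scale interval. The sketch below uses only the numbers printed above; the variable names are illustrative.

```r
# Log-scale prediction interval and observed count, copied from the output above
log_pi   <- c(fit = 6.392757, lwr = 5.532353, upr = 7.253162)
observed <- 3407

# Back-transform to the vote-count scale
pi_votes <- exp(log_pi)

excess  <- observed - pi_votes[["fit"]]   # observed minus predicted votes
outside <- observed > pi_votes[["upr"]]   # TRUE if above the upper limit
ratio   <- observed / pi_votes[["upr"]]   # observed relative to the upper limit

round(pi_votes, 1)
round(c(excess = excess, ratio = ratio), 2)
outside
```

The observed count is more than twice the upper limit of the interval. This shows that Palm Beach is unusual relative to the other counties, but not by itself why.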

Palm Beach Ballot
