September 2, 2020

Summarizing Data

We will use the lego R package in this class. The package is described as containing information about every Lego set manufactured from 1970 to 2014, a total of 5710 sets; the version loaded below has 6172 rows and includes sets from 2015.

devtools::install_github("seankross/lego")
library(lego)
data(legosets)

Types of Variables

  • Numerical (quantitative)
    • Continuous
    • Discrete
  • Categorical (qualitative)
    • Regular categorical
    • Ordinal
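
As a rough sketch of how these four types usually map onto R objects (the values below are hypothetical toy data, not taken from legosets):

```r
# Hypothetical toy values, loosely modelled on the legosets columns.
price  <- c(9.99, 59.99, 159.99)                  # continuous numerical
pieces <- c(13L, 898L, 2262L)                     # discrete numerical (counts)
theme  <- factor(c("City", "Star Wars", "City"))  # regular (nominal) categorical
size   <- factor(c("small", "medium", "large"),
                 levels = c("small", "medium", "large"),
                 ordered = TRUE)                  # ordinal categorical

class(price)    # "numeric"
class(pieces)   # "integer"
class(theme)    # "factor"
class(size)     # "ordered" "factor"

# Ordered factors support comparisons between levels.
size[1] < size[3]   # TRUE
```

Note that character columns (chr in str output) are often categorical variables that have simply not been converted to factors.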

Data Types in R

str(legosets)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6172 obs. of  14 variables:
##  $ Item_Number : chr  "10246" "10247" "10248" "10249" ...
##  $ Name        : chr  "Detective's Office" "Ferris Wheel" "Ferrari F40" "Toy Shop" ...
##  $ Year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Theme       : chr  "Advanced Models" "Advanced Models" "Advanced Models" "Advanced Models" ...
##  $ Subtheme    : chr  "Modular Buildings" "Fairground" "Vehicles" "Winter Village" ...
##  $ Pieces      : int  2262 2464 1158 898 13 39 32 105 13 11 ...
##  $ Minifigures : int  6 10 NA NA 1 2 2 3 2 2 ...
##  $ Image_URL   : chr  "http://images.brickset.com/sets/images/10246-1.jpg" "http://images.brickset.com/sets/images/10247-1.jpg" "http://images.brickset.com/sets/images/10248-1.jpg" "http://images.brickset.com/sets/images/10249-1.jpg" ...
##  $ GBP_MSRP    : num  132.99 149.99 69.99 59.99 9.99 ...
##  $ USD_MSRP    : num  159.99 199.99 99.99 79.99 9.99 ...
##  $ CAD_MSRP    : num  200 230 120 NA 13 ...
##  $ EUR_MSRP    : num  149.99 179.99 89.99 69.99 9.99 ...
##  $ Packaging   : chr  "Box" "Box" "Box" "Box" ...
##  $ Availability: chr  "Retail - limited" "Retail - limited" "LEGO exclusive" "LEGO exclusive" ...

Qualitative Variables

Descriptive statistics:

  • Contingency Tables
  • Proportional Tables

Plot types:

  • Bar plot

Contingency Tables

table(legosets$Availability, useNA='ifany')
## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##                   695                     2                  1795 
##           Promotional Promotional (Airline)                Retail 
##                   141                    12                  3120 
##      Retail - limited               Unknown 
##                   403                     4
table(legosets$Availability, legosets$Packaging, useNA='ifany')
##                        
##                         Blister pack  Box Box with backing card Bucket Canister
##   LEGO exclusive                  45  147                     0      1        0
##   LEGOLAND exclusive               0    2                     0      0        0
##   Not specified                    0   20                     0      0        0
##   Promotional                      0   44                     0      0        0
##   Promotional (Airline)            0   11                     0      0        0
##   Retail                          53 2575                    16     30       78
##   Retail - limited                 2  302                     1      5        0
##   Unknown                          0    1                     0      0        0
##                        
##                         Foil pack Loose Parts Not specified Other Plastic box
##   LEGO exclusive                0          71             7     5           1
##   LEGOLAND exclusive            0           0             0     0           0
##   Not specified                 5           0          1739     0           6
##   Promotional                   0           1             0     3           2
##   Promotional (Airline)         0           0             1     0           0
##   Retail                      285           0             0    28           0
##   Retail - limited              1           0             0     0           1
##   Unknown                       0           0             0     0           0
##                        
##                         Polybag Shrink-wrapped  Tag  Tub
##   LEGO exclusive            412              0    6    0
##   LEGOLAND exclusive          0              0    0    0
##   Not specified              24              0    0    1
##   Promotional                90              0    0    1
##   Promotional (Airline)       0              0    0    0
##   Retail                      4             18    0   33
##   Retail - limited           86              0    0    5
##   Unknown                     3              0    0    0

Proportional Tables

prop.table(table(legosets$Availability))
## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##          0.1126053143          0.0003240441          0.2908295528 
##           Promotional Promotional (Airline)                Retail 
##          0.0228451069          0.0019442644          0.5055087492 
##      Retail - limited               Unknown 
##          0.0652948801          0.0006480881
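
For a two-way table, prop.table also accepts a margin argument, so proportions can be computed within rows or within columns. A minimal sketch with hypothetical toy data (the vectors below are made up for illustration):

```r
# Hypothetical toy data: five sets with an availability and a packaging value.
availability <- c("Retail", "Retail", "Retail", "Promotional", "Promotional")
packaging    <- c("Box", "Box", "Polybag", "Polybag", "Polybag")

tab <- table(availability, packaging)

# margin = 1: each row sums to 1 (packaging mix within each availability).
row_props <- prop.table(tab, margin = 1)

# margin = 2: each column sums to 1 (availability mix within each packaging).
col_props <- prop.table(tab, margin = 2)

row_props
col_props
```

Without margin, the proportions are relative to the grand total, as in the one-way example above.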

Bar Plots

barplot(table(legosets$Availability), las=3)

Bar Plots

barplot(prop.table(table(legosets$Availability)), las=3)

Quantitative Variables

Descriptive statistics:

  • Mean
  • Median
  • Quartiles
  • Variance: \(s^2=\sum_{i=1}^{n}\frac{\left(x_i-\bar{x}\right)^2}{n-1}\)
  • Standard deviation: \(s=\sqrt{s^2}\)
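
The variance and standard deviation formulas can be checked by hand against var and sd. A small sketch using a hypothetical vector x:

```r
# Hypothetical toy data.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n     <- length(x)
x_bar <- mean(x)                        # 5

# Sample variance: sum of squared deviations divided by n - 1.
s2 <- sum((x - x_bar)^2) / (n - 1)      # 32 / 7

# Standard deviation: square root of the variance.
s <- sqrt(s2)

all.equal(s2, var(x))   # TRUE
all.equal(s, sd(x))     # TRUE
```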

Plot types:

  • Dot plots
  • Histograms
  • Density plots
  • Box plots
  • Scatterplots

Measures of Center

mean(legosets$Pieces, na.rm=TRUE)
## [1] 215.1686
median(legosets$Pieces, na.rm=TRUE)
## [1] 82

Measures of Spread

var(legosets$Pieces, na.rm=TRUE)
## [1] 126876.8
sqrt(var(legosets$Pieces, na.rm=TRUE))
## [1] 356.1976
sd(legosets$Pieces, na.rm=TRUE)
## [1] 356.1976


fivenum(legosets$Pieces, na.rm=TRUE)
## [1]    0.0   30.0   82.0  256.5 5922.0
IQR(legosets$Pieces, na.rm=TRUE)
## [1] 226.25
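
Note that the fourth value from fivenum (256.5) is not exactly the lower quartile plus the IQR (30 + 226.25 = 256.25). fivenum reports Tukey's hinges, whereas IQR is based on quantile, whose default method interpolates differently. A small sketch with a hypothetical vector shows the same effect:

```r
# Hypothetical toy data with an even number of observations.
x <- c(1, 2, 3, 4, 5, 6)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max.
five <- fivenum(x)                           # 1.0 2.0 3.5 5.0 6.0

# Default quantile method (type 7) interpolates between order statistics.
q <- unname(quantile(x, c(0.25, 0.75)))      # 2.25 4.75

# IQR() is the difference of those two quantiles, not of the hinges.
IQR(x)                                       # 2.5
five[4] - five[2]                            # 3
```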

The summary Function

summary(legosets$Pieces)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    30.0    82.0   215.2   256.2  5922.0     112

The psych Package

library(psych)
describe(legosets$Pieces, skew=FALSE)
##    vars    n   mean    sd min  max range   se
## X1    1 6060 215.17 356.2   0 5922  5922 4.58
describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)
##     item                group1 vars    n      mean        sd min  max range
## X11    1        LEGO exclusive    1  659 172.74203 442.96954   1 3428  3427
## X12    2    LEGOLAND exclusive    1    2 211.00000 154.14928 102  320   218
## X13    3         Not specified    1 1747 145.87178 309.19929   1 5195  5194
## X14    4           Promotional    1  140  53.97143 108.42721   1 1000   999
## X15    5 Promotional (Airline)    1   12 126.16667  47.01612  10  203   193
## X16    6                Retail    1 3094 245.78119 294.78052   0 3803  3803
## X17    7      Retail - limited    1  402 410.94030 652.06435   1 5922  5921
## X18    8               Unknown    1    4  27.50000  15.96872   6   44    38
##             se
## X11  17.255643
## X12 109.000000
## X13   7.397620
## X14   9.163772
## X15  13.572384
## X16   5.299546
## X17  32.522014
## X18   7.984360
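
The se column appears to be the standard error of the mean, sd / sqrt(n); the values printed above are consistent with that. A quick check using the numbers from the output (copied by hand, so this is a sketch rather than a call into psych):

```r
# Values copied from the describe() and describeBy() output above.
sd_all <- 356.1976
n_all  <- 6060
se_all <- sd_all / sqrt(n_all)
round(se_all, 2)             # 4.58, matching the se column

# Same check for the LEGOLAND exclusive group (n = 2).
sd_legoland <- 154.14928
n_legoland  <- 2
se_legoland <- sd_legoland / sqrt(n_legoland)
round(se_legoland, 2)        # 109, matching the se column
```

Groups with very few sets, such as LEGOLAND exclusive, have large standard errors, so their means should be read with caution.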

Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

  • for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
  • for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
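
A minimal sketch of what "robust" means here, using two hypothetical vectors that differ only in one extreme value:

```r
# Hypothetical toy data: identical except for the last value.
x     <- c(1, 2, 3, 4, 5)
x_out <- c(1, 2, 3, 4, 100)

# The mean and SD move a lot when one value becomes an outlier ...
mean(x);  mean(x_out)       # 3 and 22
sd(x);    sd(x_out)

# ... while the median and IQR do not change at all.
median(x); median(x_out)    # 3 and 3
IQR(x);    IQR(x_out)       # 2 and 2
```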

Dot Plot

stripchart(legosets$Pieces)

Dot Plot

par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)

par(par.orig)

Histograms

hist(legosets$Pieces)

Transformations

With highly skewed distributions, it is often helpful to transform the data. The log transformation is a common approach, especially when dealing with salary or similar data.

hist(log(legosets$Pieces))
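
One caveat: Pieces has a minimum of 0, and log(0) is -Inf, which is why the box plot of log(Pieces) later in this section warns about an outlier of -Inf. Two common workarounds, sketched here on a hypothetical vector, are to drop the zeros or to use log1p, which computes log(1 + x):

```r
# Hypothetical piece counts, including a zero.
pieces <- c(0, 1, 10, 100, 1000)

log(pieces)                           # first value is -Inf

# Option 1: transform only the strictly positive values.
log_pos <- log(pieces[pieces > 0])

# Option 2: log(1 + x), which maps 0 to 0 and stays finite.
log_shift <- log1p(pieces)
```

Which option is appropriate depends on whether a zero is a real measurement or a stand-in for missing data; that is a judgment call, not something the code can decide.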

Density Plots

plot(density(legosets$Pieces, na.rm=TRUE), main='Lego Pieces per Set')

Density Plot (log transformed)

plot(density(log(legosets$Pieces), na.rm=TRUE), main='Lego Pieces per Set (log transformed)')

Box Plots

boxplot(legosets$Pieces)

boxplot(log(legosets$Pieces))
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn

Scatter Plots

plot(legosets$Pieces, legosets$USD_MSRP)

Examining Possible Outliers (expensive sets)

legosets[which(legosets$USD_MSRP >= 400),]
##      Item_Number                                   Name Year        Theme
## 901      2000430             Identity and Landscape Kit 2013 Serious Play
## 902      2000431                        Connections Kit 2013 Serious Play
## 2050     2000409                 Window Exploration Bag 2010 Serious Play
## 2852       10179 Ultimate Collector's Millennium Falcon 2007    Star Wars
##                       Subtheme Pieces Minifigures
## 901                                NA           6
## 902                              2455          NA
## 2050                             4900          NA
## 2852 Ultimate Collector Series   5195           5
##                                                 Image_URL GBP_MSRP USD_MSRP
## 901  http://images.brickset.com/sets/images/2000430-1.jpg   509.99   789.99
## 902  http://images.brickset.com/sets/images/2000431-1.jpg   490.18   754.99
## 2050 http://images.brickset.com/sets/images/2000409-1.jpg   314.99   484.99
## 2852   http://images.brickset.com/sets/images/10179-1.jpg   342.49   499.99
##      CAD_MSRP EUR_MSRP     Packaging  Availability
## 901    789.99   699.99 Not specified Not specified
## 902    754.99   559.99 Not specified Not specified
## 2050   484.99   359.99 Not specified Not specified
## 2852       NA       NA Not specified Not specified

Examining Possible Outliers (big sets)

legosets[which(legosets$Pieces >= 4000),]
##      Item_Number                                   Name Year           Theme
## 2047       10214                           Tower Bridge 2010 Advanced Models
## 2050     2000409                 Window Exploration Bag 2010    Serious Play
## 2628       10189                              Taj Mahal 2008 Advanced Models
## 2852       10179 Ultimate Collector's Millennium Falcon 2007       Star Wars
##                       Subtheme Pieces Minifigures
## 2047                 Buildings   4287          NA
## 2050                             4900          NA
## 2628                 Buildings   5922          NA
## 2852 Ultimate Collector Series   5195           5
##                                                 Image_URL GBP_MSRP USD_MSRP
## 2047   http://images.brickset.com/sets/images/10214-1.jpg   209.99   239.99
## 2050 http://images.brickset.com/sets/images/2000409-1.jpg   314.99   484.99
## 2628   http://images.brickset.com/sets/images/10189-1.jpg   199.99   299.99
## 2852   http://images.brickset.com/sets/images/10179-1.jpg   342.49   499.99
##      CAD_MSRP EUR_MSRP     Packaging     Availability
## 2047   299.99   219.99           Box Retail - limited
## 2050   484.99   359.99 Not specified    Not specified
## 2628   399.99       NA           Box Retail - limited
## 2852       NA       NA Not specified    Not specified

plot(legosets$Pieces, legosets$USD_MSRP)
bigAndExpensive <- legosets[which(legosets$Pieces >= 4000 | legosets$USD_MSRP >= 400),]
text(bigAndExpensive$Pieces, bigAndExpensive$USD_MSRP, labels=bigAndExpensive$Name)

Pie Charts

There is only one pie chart in OpenIntro Statistics (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

Source: https://en.wikipedia.org/wiki/Pie_chart.

Just say NO to pie charts!

“There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart”

John Tukey

ggplot2

  • ggplot2 is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics.
  • ggplot2 is, in general, more flexible for creating “prettier” and more complex plots.
  • Works by creating layers of different types of objects/geometries (e.g. bars, points, lines, polygons, etc.).
  • ggplot2 has at least three ways of creating plots:
    1. qplot
    2. ggplot(...) + geom_XXX(...) + ...
    3. ggplot(...) + layer(...)
  • We will focus only on the second.

First Example

library(ggplot2)
data(diamonds)
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()

Parts of a ggplot2 Statement

  • Data
    ggplot(myDataFrame, aes(x=x, y=y))
  • Layers
    geom_point(), geom_histogram()
  • Facets
    facet_wrap(~ cut), facet_grid(~ cut)
  • Scales
    scale_y_log10()
  • Other options
    ggtitle('my title'), ylim(c(0, 10000)), xlab('x-axis label')
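
Putting the parts together, one possible statement on the diamonds data looks like the sketch below. The title and axis labels are made up for illustration, and the alpha value is just one reasonable choice for a dense scatterplot.

```r
library(ggplot2)
data(diamonds)

p <- ggplot(diamonds, aes(x = carat, y = price)) +   # data and aesthetics
  geom_point(alpha = 0.2) +                          # layer
  facet_wrap(~ cut) +                                # facets: one panel per cut
  scale_y_log10() +                                  # scale: log10 y-axis
  ggtitle('Diamond price by carat and cut') +        # other options
  xlab('Carat') +
  ylab('Price (log scale)')

p   # printing the object draws the plot
```

ylim is left out here on purpose: it also sets the y scale, so adding it alongside scale_y_log10() would replace one scale with the other.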

Lots of geoms

ls('package:ggplot2')[grep('geom_', ls('package:ggplot2'))]
##  [1] "geom_abline"            "geom_area"              "geom_bar"              
##  [4] "geom_bin2d"             "geom_blank"             "geom_boxplot"          
##  [7] "geom_col"               "geom_contour"           "geom_contour_filled"   
## [10] "geom_count"             "geom_crossbar"          "geom_curve"            
## [13] "geom_density"           "geom_density_2d"        "geom_density_2d_filled"
## [16] "geom_density2d"         "geom_density2d_filled"  "geom_dotplot"          
## [19] "geom_errorbar"          "geom_errorbarh"         "geom_freqpoly"         
## [22] "geom_function"          "geom_hex"               "geom_histogram"        
## [25] "geom_hline"             "geom_jitter"            "geom_label"            
## [28] "geom_line"              "geom_linerange"         "geom_map"              
## [31] "geom_path"              "geom_point"             "geom_pointrange"       
## [34] "geom_polygon"           "geom_qq"                "geom_qq_line"          
## [37] "geom_quantile"          "geom_raster"            "geom_rect"             
## [40] "geom_ribbon"            "geom_rug"               "geom_segment"          
## [43] "geom_sf"                "geom_sf_label"          "geom_sf_text"          
## [46] "geom_smooth"            "geom_spoke"             "geom_step"             
## [49] "geom_text"              "geom_tile"              "geom_violin"           
## [52] "geom_vline"             "update_geom_defaults"

Scatterplot Revisited

ggplot(legosets, aes(x=Pieces, y=USD_MSRP)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, color=Availability)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures, color=Availability)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures)) + geom_point() + facet_wrap(~ Availability)

Boxplots Revisited

ggplot(legosets, aes(x='Lego', y=USD_MSRP)) + geom_boxplot()

Boxplots Revisited (cont.)

ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot()

Boxplots Revisited (cont.)

ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot() + coord_flip()

Likert Scales

Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree).

library(likert)
library(reshape)
data(pisaitems)
items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q']
items24 <- rename(items24, c(
            ST24Q01="I read only if I have to.",
            ST24Q02="Reading is one of my favorite hobbies.",
            ST24Q03="I like talking about books with other people.",
            ST24Q04="I find it hard to finish books.",
            ST24Q05="I feel happy if I receive a book as a present.",
            ST24Q06="For me, reading is a waste of time.",
            ST24Q07="I enjoy going to a bookstore or a library.",
            ST24Q08="I read only to get information that I need.",
            ST24Q09="I cannot sit still and read for more than a few minutes.",
            ST24Q10="I like to express my opinions about books I have read.",
            ST24Q11="I like to exchange books with my friends."))
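
Independently of the likert package, a single Likert item can be represented in base R as an ordered factor. The sketch below uses hypothetical responses on a four-level scale and is not meant to reproduce the package's own computations.

```r
# Hypothetical four-level scale and six made-up responses.
levels4 <- c("Strongly disagree", "Disagree", "Agree", "Strongly agree")
responses <- factor(c("Agree", "Disagree", "Strongly agree",
                      "Agree", "Strongly disagree", "Agree"),
                    levels = levels4, ordered = TRUE)

# Counts and percentages per response level.
counts <- table(responses)
pct    <- round(100 * prop.table(counts), 1)

# Share of responses at "Agree" or above, using the level order.
pct_agree <- 100 * mean(as.integer(responses) >= match("Agree", levels4))

counts
pct
pct_agree
```

Declaring the levels explicitly keeps the scale in its intended order and keeps unused levels in the table.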

likert R Package

l24 <- likert(items24)
summary(l24)
##                                                        Item      low neutral
## 10   I like to express my opinions about books I have read. 41.07516       0
## 5            I feel happy if I receive a book as a present. 46.93475       0
## 8               I read only to get information that I need. 50.39874       0
## 7                I enjoy going to a bookstore or a library. 51.21231       0
## 3             I like talking about books with other people. 54.99129       0
## 11                I like to exchange books with my friends. 55.54115       0
## 2                    Reading is one of my favorite hobbies. 56.64470       0
## 1                                 I read only if I have to. 58.72868       0
## 4                           I find it hard to finish books. 65.35125       0
## 9  I cannot sit still and read for more than a few minutes. 76.24524       0
## 6                       For me, reading is a waste of time. 82.88729       0
##        high     mean        sd
## 10 58.92484 2.604913 0.9009968
## 5  53.06525 2.466751 0.9446590
## 8  49.60126 2.484616 0.9089688
## 7  48.78769 2.428508 0.9164136
## 3  45.00871 2.328049 0.9090326
## 11 44.45885 2.343193 0.9609234
## 2  43.35530 2.344530 0.9277495
## 1  41.27132 2.291811 0.9369023
## 4  34.64875 2.178299 0.8991628
## 9  23.75476 1.974736 0.8793028
## 6  17.11271 1.810093 0.8611554

likert Plots

plot(l24)

likert Plots

plot(l24, type='heat')

likert Plots

plot(l24, type='density')

Dual Scales

Some problems [1]:

  • The designer has to make choices about scales and this can have a big impact on the viewer
  • “Cross-over points” where one series crosses another are results of the design choices, not intrinsic to the data, and viewers (particularly unsophisticated viewers) may read them as meaningful
  • They make it easier to lazily associate correlation with causation, not taking into account autocorrelation and other time-series issues
  • Because of the issues above, in malicious hands they make it possible to deliberately mislead

This example looks at the relationship between the NZ dollar exchange rate and the trade-weighted index.

library(DATA606)
shiny_demo('DualScales', package='DATA606')

My advice:

  • Avoid using them. You can usually do better with other plot types.
  • When necessary (or compelled) to use them, rescale (using z-scores)
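
A minimal sketch of the z-score approach, assuming two series measured on very different scales. The numbers below are hypothetical and are not the NZ dollar or trade-weighted index data used in the demo.

```r
# Hypothetical series on different scales.
exchange_rate <- c(0.65, 0.70, 0.72, 0.68, 0.75)
twi           <- c(68, 72, 75, 70, 78)

# scale() centers each series at 0 and divides by its standard deviation.
z_rate <- as.numeric(scale(exchange_rate))
z_twi  <- as.numeric(scale(twi))

# Both series now share one axis, measured in standard deviations.
matplot(cbind(z_rate, z_twi), type = 'l', lty = 1,
        col = c('steelblue', 'firebrick'),
        xlab = 'Time index', ylab = 'z-score')
legend('topleft', legend = c('Exchange rate', 'TWI'),
       col = c('steelblue', 'firebrick'), lty = 1)
```

With a single shared axis there is no arbitrary choice of two scales, so any crossing of the lines reflects the standardized data rather than a design decision.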

[1] http://blog.revolutionanalytics.com/2016/08/dual-axis-time-series.html
[2] http://ellisp.github.io/blog/2016/08/18/dualaxes