Data Exploration - The Chicago Dataset

Bryan

2018/07/07

Introduction

This article is about using the tidyverse package in R to explore data. The Chicago dataset is the basis for this analysis. The data is air quality measurements taken over 19 years. The focus is on using the dplyr package to do the analysis. The dplyr package is part of the tidyverse in R and a suite of tools to analyze, transform, plot, and manipulate data. The base version of R does all of this; however, dplyr will work faster and cost fewer computing resources when used on big data. The functions used to explore the data are as follows:

Additional packages used are:


read_rds()

I start with importing the data into a data frame with the readr package, and the structure of the data is a tibble, which is the tidyverse data frame structure. In the output, the tibble structure will list the first ten rows of data and column headers with the data types in this dataset.

Not shown in this data but worth mentioning:

chicago = as_tibble(readr::read_rds("../../static/data/chicago.rds"))
(chicago)
## # A tibble: 6,940 x 8
##    city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 chic   31.5  31.5 1987-01-01         NA       34       4.25      20.0
##  2 chic   33    29.9 1987-01-02         NA       NA       3.30      23.2
##  3 chic   33    27.4 1987-01-03         NA       34.2     3.33      23.8
##  4 chic   29    28.6 1987-01-04         NA       47       4.38      30.4
##  5 chic   32    28.9 1987-01-05         NA       NA       4.75      30.3
##  6 chic   40    35.1 1987-01-06         NA       48       5.83      25.8
##  7 chic   34.5  26.8 1987-01-07         NA       41       9.29      20.6
##  8 chic   29    22   1987-01-08         NA       36      11.3       17.0
##  9 chic   26.5  29   1987-01-09         NA       33.3     4.5       23.4
## 10 chic   32.5  27.8 1987-01-10         NA       NA       4.96      19.5
## # ... with 6,930 more rows
Table 1: The Chicago Dataset
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
chic 31.5 31.500 1987-01-01 NA 34.00000 4.250000 19.98810
chic 33.0 29.875 1987-01-02 NA NA 3.304348 23.19099
chic 33.0 27.375 1987-01-03 NA 34.16667 3.333333 23.81548
chic 29.0 28.625 1987-01-04 NA 47.00000 4.375000 30.43452
chic 32.0 28.875 1987-01-05 NA NA 4.750000 30.33333
chic 40.0 35.125 1987-01-06 NA 48.00000 5.833333 25.77233
chic 34.5 26.750 1987-01-07 NA 41.00000 9.291667 20.58171
chic 29.0 22.000 1987-01-08 NA 36.00000 11.291667 17.03723
chic 26.5 29.000 1987-01-09 NA 33.28571 4.500000 23.38889
chic 32.5 27.750 1987-01-10 NA NA 4.958333 19.54167

In table 1, the Chicago dataset has eight variables and 6940 rows. The city variable appears as ‘chic’ for Chicago, and the missing data is filled in by default with NA. Also, the date is in the y/m/d format. We can see the different readings for the real numbers, from 1 to 6 decimal places.

names()

names() function lists all the variable names in the data.

names(chicago) 
## [1] "city"       "tmpd"       "dptp"       "date"       "pm25tmean2"
## [6] "pm10tmean2" "o3tmean2"   "no2tmean2"

List the first three names with the index[:]. Note R is a little different than Python. R will display the last index position [3], where Python would display 1 & 2.

names(chicago)[1:3]
## [1] "city" "tmpd" "dptp"

List the last three variables.

names(chicago)[6:8] 
## [1] "pm10tmean2" "o3tmean2"   "no2tmean2"

Summary Statistics with skimr

The skim() function in the skimr package gives a good idea, in one place, of what the data looks like in table 2. Noting there are 4447 NA’s in pm25tmean2 and 242 NA’s in pm10tmean2. The date range is from 1987 to 2005. The histogram shows us the skew of each variable and a nice feature for skimr.

In base R, I would also use the summary() function. As you will see, the tidyverse does not have a summary function that doesn’t require multiple lines of code. I will generate some of these statistics, but it takes more time than typing in one line into the console.

skimr::skim(chicago)
Table 2: Data summary
Name chicago
Number of rows 6940
Number of columns 8
_______________________
Column type frequency:
character 1
Date 1
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 0 1 4 4 0 1 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 1987-01-01 2005-12-31 1996-07-01 6940

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
tmpd 1 1.00 50.31 19.41 -16.00 35.00 51.00 67.00 92.00 ▁▃▇▇▅
dptp 2 1.00 40.34 18.49 -25.62 27.00 39.88 55.75 78.25 ▁▂▇▇▅
pm25tmean2 4447 0.36 16.23 8.70 1.70 9.70 14.66 20.60 61.50 ▇▇▂▁▁
pm10tmean2 242 0.97 33.90 17.97 2.00 21.50 30.28 42.00 365.00 ▇▁▁▁▁
o3tmean2 0 1.00 19.44 11.39 0.15 10.07 18.52 27.00 66.59 ▇▇▅▁▁
no2tmean2 0 1.00 25.23 7.99 6.16 19.65 24.56 30.14 62.48 ▂▇▃▁▁

select()

To view specific variables, I use the select() function. Perhaps this small dataset in not going to show the benefits of select(), however, with hundreds or thousands of variables, select() will prove a worthy function. Here in table 3, we can isolate the variables city through dptp very quickly.

select(chicago, 
       city:dptp)
## # A tibble: 6,940 x 3
##    city   tmpd  dptp
##    <chr> <dbl> <dbl>
##  1 chic   31.5  31.5
##  2 chic   33    29.9
##  3 chic   33    27.4
##  4 chic   29    28.6
##  5 chic   32    28.9
##  6 chic   40    35.1
##  7 chic   34.5  26.8
##  8 chic   29    22  
##  9 chic   26.5  29  
## 10 chic   32.5  27.8
## # ... with 6,930 more rows
Table 3: Select the first three variables.
city tmpd dptp
chic 31.5 31.500
chic 33.0 29.875
chic 33.0 27.375
chic 29.0 28.625
chic 32.0 28.875
chic 40.0 35.125
chic 34.5 26.750
chic 29.0 22.000
chic 26.5 29.000
chic 32.5 27.750

In table 4, select() can also do the inverse of city:dptp and exclude those variables to display the remaining variables.

select(chicago, -(city:dptp))
## # A tibble: 6,940 x 5
##    date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 1987-01-01         NA       34       4.25      20.0
##  2 1987-01-02         NA       NA       3.30      23.2
##  3 1987-01-03         NA       34.2     3.33      23.8
##  4 1987-01-04         NA       47       4.38      30.4
##  5 1987-01-05         NA       NA       4.75      30.3
##  6 1987-01-06         NA       48       5.83      25.8
##  7 1987-01-07         NA       41       9.29      20.6
##  8 1987-01-08         NA       36      11.3       17.0
##  9 1987-01-09         NA       33.3     4.5       23.4
## 10 1987-01-10         NA       NA       4.96      19.5
## # ... with 6,930 more rows
Table 4: Select all except first three variables.
date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
1987-01-01 NA 34.00000 4.250000 19.98810
1987-01-02 NA NA 3.304348 23.19099
1987-01-03 NA 34.16667 3.333333 23.81548
1987-01-04 NA 47.00000 4.375000 30.43452
1987-01-05 NA NA 4.750000 30.33333
1987-01-06 NA 48.00000 5.833333 25.77233
1987-01-07 NA 41.00000 9.291667 20.58171
1987-01-08 NA 36.00000 11.291667 17.03723
1987-01-09 NA 33.28571 4.500000 23.38889
1987-01-10 NA NA 4.958333 19.54167

Table 5 demonstrates the ends_with() and table 6 the starts_with() functions that only work inside of select(). Additional functions here would be:

Lets find all the variables that end with the number 2 in table 5.

select(chicago, 
       ends_with("2")) 
## # A tibble: 6,940 x 4
##    pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##         <dbl>      <dbl>    <dbl>     <dbl>
##  1         NA       34       4.25      20.0
##  2         NA       NA       3.30      23.2
##  3         NA       34.2     3.33      23.8
##  4         NA       47       4.38      30.4
##  5         NA       NA       4.75      30.3
##  6         NA       48       5.83      25.8
##  7         NA       41       9.29      20.6
##  8         NA       36      11.3       17.0
##  9         NA       33.3     4.5       23.4
## 10         NA       NA       4.96      19.5
## # ... with 6,930 more rows
Table 5: Select variables ending in 2.
pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
NA 34.00000 4.250000 19.98810
NA NA 3.304348 23.19099
NA 34.16667 3.333333 23.81548
NA 47.00000 4.375000 30.43452
NA NA 4.750000 30.33333
NA 48.00000 5.833333 25.77233
NA 41.00000 9.291667 20.58171
NA 36.00000 11.291667 17.03723
NA 33.28571 4.500000 23.38889
NA NA 4.958333 19.54167

Let’s find all the variables that start with the letter “d” in table 6.

select(chicago, 
       starts_with("d"))
## # A tibble: 6,940 x 2
##     dptp date      
##    <dbl> <date>    
##  1  31.5 1987-01-01
##  2  29.9 1987-01-02
##  3  27.4 1987-01-03
##  4  28.6 1987-01-04
##  5  28.9 1987-01-05
##  6  35.1 1987-01-06
##  7  26.8 1987-01-07
##  8  22   1987-01-08
##  9  29   1987-01-09
## 10  27.8 1987-01-10
## # ... with 6,930 more rows
Table 6: Select variables starting with the letter d.
dptp date
31.500 1987-01-01
29.875 1987-01-02
27.375 1987-01-03
28.625 1987-01-04
28.875 1987-01-05
35.125 1987-01-06
26.750 1987-01-07
22.000 1987-01-08
29.000 1987-01-09
27.750 1987-01-10

filter()

The filter() function is a way to find values in the data. Here the results are placed in table 7. Shown is filtering all pm25tmean2 variables that are greater than 30. We can do any combination of variables with specific parameters. The tibble output indicates there are 194 instances where pm25tmean2 is above a reading of 30.

filter(chicago, 
       pm25tmean2 > 30)
## # A tibble: 194 x 8
##    city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 chic     23  21.9 1998-01-17       38.1       32.5     3.18      25.3
##  2 chic     28  25.8 1998-01-23       34.0       38.7     1.75      29.4
##  3 chic     55  51.3 1998-04-30       39.4       34      10.8       25.3
##  4 chic     59  53.7 1998-05-01       35.4       28.5    14.3       31.4
##  5 chic     57  52   1998-05-02       33.3       35      20.7       26.8
##  6 chic     57  56   1998-05-07       32.1       34.5    24.3       34.0
##  7 chic     75  65.8 1998-05-15       56.5       91      38.6       29.0
##  8 chic     61  59   1998-06-09       33.8       26      17.9       25.5
##  9 chic     73  60.3 1998-07-13       30.3       64.5    37.0       37.9
## 10 chic     78  67.1 1998-07-14       41.4       75      40.1       32.6
## # ... with 184 more rows
Table 7: Filter variable pm25tmean2 greater than 30.
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
chic 23 21.9 1998-01-17 38.10 32.46154 3.180556 25.30000
chic 28 25.8 1998-01-23 33.95 38.69231 1.750000 29.37630
chic 55 51.3 1998-04-30 39.40 34.00000 10.786232 25.31310
chic 59 53.7 1998-05-01 35.40 28.50000 14.295125 31.42905
chic 57 52.0 1998-05-02 33.30 35.00000 20.662879 26.79861
chic 57 56.0 1998-05-07 32.10 34.50000 24.270422 33.99167
chic 75 65.8 1998-05-15 56.50 91.00000 38.573007 29.03261
chic 61 59.0 1998-06-09 33.80 26.00000 17.890810 25.49668
chic 73 60.3 1998-07-13 30.30 64.50000 37.018864 37.93056
chic 78 67.1 1998-07-14 41.40 75.00000 40.080902 32.59054

Table 8 filters all values of the variable pm25tmean2 greater than 30, and I summarize to refine that to the minimum and maximum values. The table is showing that 30.05 is the lowest, and 61.5 is the highest value after filtering the variable. %>% is shown in the code and a way to write a few sequence operations together.

chicago %>%
  filter(pm25tmean2 > 30) %>%
  summarize(minValue = min(pm25tmean2), 
            maxValue = max(pm25tmean2))
## # A tibble: 1 x 2
##   minValue maxValue
##      <dbl>    <dbl>
## 1     30.0     61.5
Table 8: Variable pm25tmean2 greater than 30.
minValue maxValue
30.05 61.5

Table 9 is a refinement in table 8. Here I want to find the values higher than 30 in pm25tmean2 and values greater than 80 in tmpd. In the output tibble, there are only 17 instances where this condition is true and a little clustering in 2001, 2002, and 2005.

chicago %>%
  select(date, tmpd, pm25tmean2) %>%
  filter(pm25tmean2 > 30 & tmpd > 80)
## # A tibble: 17 x 3
##    date        tmpd pm25tmean2
##    <date>     <dbl>      <dbl>
##  1 1998-08-23    81       39.6
##  2 1998-09-06    81       31.5
##  3 2001-07-20    82       32.3
##  4 2001-08-01    84       43.7
##  5 2001-08-08    85       38.8
##  6 2001-08-09    84       38.2
##  7 2002-06-20    82       33  
##  8 2002-06-23    82       42.5
##  9 2002-07-08    81       33.1
## 10 2002-07-18    82       38.8
## 11 2003-06-25    82       33.9
## 12 2003-07-04    84       32.9
## 13 2005-06-24    86       31.9
## 14 2005-06-27    82       51.5
## 15 2005-06-28    85       31.2
## 16 2005-07-17    84       32.7
## 17 2005-08-03    84       37.9
Table 9: Filter pm25tmean2 >30 & tmpd >80 with date.
date tmpd pm25tmean2
1998-08-23 81 39.60000
1998-09-06 81 31.50000
2001-07-20 82 32.30000
2001-08-01 84 43.70000
2001-08-08 85 38.83750
2001-08-09 84 38.20000
2002-06-20 82 33.00000
2002-06-23 82 42.50000
2002-07-08 81 33.10000
2002-07-18 82 38.85000
2003-06-25 82 33.90000
2003-07-04 84 32.90000
2005-06-24 86 31.85714
2005-06-27 82 51.53750
2005-06-28 85 31.20000
2005-07-17 84 32.70000
2005-08-03 84 37.90000

This filtering collects all values above 80 for the tmpd variable and then summarizes to find the minimum and maximum values. 80.5 is the minimum, and 92 is the maximum.

chicago %>%
  filter(tmpd > 80) %>%
  summarize(minValue = min(tmpd), 
            maxValue = max(tmpd))
## # A tibble: 1 x 2
##   minValue maxValue
##      <dbl>    <dbl>
## 1     80.5       92

arrange()

Table 10 shows the arrange() function and sorts the o3tmean2 variable. Default from lowest to highest numerical values.

chicago %>%
  arrange(o3tmean2)
## # A tibble: 6,940 x 8
##    city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 chic   43.5  39   1994-12-05       NA         47.5    0.153      27.2
##  2 chic   32.5  34.4 1994-12-15       NA         43.5    0.361      30.5
##  3 chic   38.5  38.5 1993-12-17       NA         16      0.427      23.9
##  4 chic   23.5  16.9 1994-12-13       NA         48.5    0.492      33.3
##  5 chic   34    33.9 2001-01-14       28         22      0.5        26.2
##  6 chic   33    37.5 1995-01-11       NA         68      0.517      37.8
##  7 chic   40.5  44.9 1993-01-03       NA         21.5    0.528      19.8
##  8 chic   61    59.3 1998-12-04       31.6       42.5    0.556      24.4
##  9 chic   35.5  34   1993-01-21       NA         19      0.653      27.0
## 10 chic   36.5  36   1994-12-16       NA         41      0.694      34.4
## # ... with 6,930 more rows
Table 10: Arrange O3 mean variable.
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
chic 43.5 39.000 1994-12-05 NA 47.5 0.1527778 27.20399
chic 32.5 34.375 1994-12-15 NA 43.5 0.3611111 30.45452
chic 38.5 38.500 1993-12-17 NA 16.0 0.4270833 23.92361
chic 23.5 16.875 1994-12-13 NA 48.5 0.4915459 33.34242
chic 34.0 33.900 2001-01-14 28.0 22.0 0.5000000 26.18750
chic 33.0 37.500 1995-01-11 NA 68.0 0.5169082 37.77083
chic 40.5 44.875 1993-01-03 NA 21.5 0.5277778 19.82500
chic 61.0 59.300 1998-12-04 31.6 42.5 0.5555556 24.39167
chic 35.5 34.000 1993-01-21 NA 19.0 0.6527778 27.00284
chic 36.5 36.000 1994-12-16 NA 41.0 0.6944444 34.39167

Table 11 sorts the tmpd variable lowest to highest.

chicago %>%
  arrange(tmpd)
## # A tibble: 6,940 x 8
##    city   tmpd   dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <chr> <dbl>  <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 chic  -16   -25.6  1994-01-18         NA       27      14.7       22.1
##  2 chic  -12   -18.9  1996-02-03         NA       25.6     7.74      30.1
##  3 chic  -10.5 -23.1  1994-01-15         NA       36      12.0       24.7
##  4 chic  -10.5 -21    1996-02-02         NA       35       9.56      31.5
##  5 chic   -8    -2.38 1994-01-19         NA       24.5     4.86      36.6
##  6 chic   -6.5 -15.8  1988-01-05         NA       24      13.2       17.2
##  7 chic   -6.5 -23    1989-12-21         NA       49      16.7       19.5
##  8 chic   -6    -7.88 1996-02-04         NA       21       9.58      30.0
##  9 chic   -5.5 -15.1  1988-01-06         NA       75       7.15      27.7
## 10 chic   -4.5 -11.9  1989-12-22         NA       49       9.42      29.4
## # ... with 6,930 more rows
Table 11: Arrange tmpd variable.
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
chic -16.0 -25.625 1994-01-18 NA 27.00000 14.656250 22.10236
chic -12.0 -18.875 1996-02-03 NA 25.61538 7.736111 30.13542
chic -10.5 -23.125 1994-01-15 NA 36.00000 12.010417 24.70139
chic -10.5 -21.000 1996-02-02 NA 35.00000 9.555556 31.47917
chic -8.0 -2.375 1994-01-19 NA 24.50000 4.864583 36.59796
chic -6.5 -15.750 1988-01-05 NA 24.00000 13.166667 17.22083
chic -6.5 -23.000 1989-12-21 NA 49.00000 16.666667 19.50000
chic -6.0 -7.875 1996-02-04 NA 21.00000 9.583333 29.95833
chic -5.5 -15.125 1988-01-06 NA 75.00000 7.145833 27.66594
chic -4.5 -11.875 1989-12-22 NA 49.00000 9.416667 29.41667

In table 12 we can sort the tmpd variable in descending order by adding the parameter to the arrange() function.

chicago %>%
  arrange(desc(tmpd))
## # A tibble: 6,940 x 8
##    city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 chic   92    76.4 1995-07-13       NA         92.5     58.5      27.3
##  2 chic   91.5  75.6 1995-07-14       NA         54.5     52.0      24.8
##  3 chic   90    77.1 1999-07-30       NA         54       34.2      27.5
##  4 chic   90    70.9 2005-07-24       22.0       44       47.6      12.0
##  5 chic   89.5  70.4 1988-08-02       NA         67       49.8      25.9
##  6 chic   89    71.1 1988-08-01       NA         67.2     44.1      26.0
##  7 chic   88.5  70.2 1988-06-20       NA         88       61.7      25.6
##  8 chic   88    67.4 1988-08-03       NA         65       40.6      27.7
##  9 chic   88    66.4 1988-08-17       NA         71       39.3      23.5
## 10 chic   88    71.2 1991-07-22       NA         70       34.9      24.6
## # ... with 6,930 more rows
Table 12: Arrange tmpd variable descending.
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
chic 92.0 76.375 1995-07-13 NA 92.5 58.54952 27.31775
chic 91.5 75.625 1995-07-14 NA 54.5 52.00397 24.77355
chic 90.0 77.100 1999-07-30 NA 54.0 34.17065 27.47977
chic 90.0 70.900 2005-07-24 22.05 44.0 47.60417 11.96354
chic 89.5 70.375 1988-08-02 NA 67.0 49.84949 25.91973
chic 89.0 71.125 1988-08-01 NA 67.2 44.14991 25.96309
chic 88.5 70.250 1988-06-20 NA 88.0 61.67708 25.64027
chic 88.0 67.375 1988-08-03 NA 65.0 40.64176 27.73958
chic 88.0 66.375 1988-08-17 NA 71.0 39.31240 23.50521
chic 88.0 71.250 1991-07-22 NA 70.0 34.85781 24.55123

rename()

The rename() function will be beneficial if you need to give the variables a descriptive name. In this example, table 13, I shorten the variable name and process these all in one line of code.

chicago %>%
  rename(temp=tmpd, 
         dewpoint=dptp, 
         pm25=pm25tmean2, 
         pm10=pm10tmean2, 
         O3=o3tmean2, 
         NO2=no2tmean2) 
## # A tibble: 6,940 x 8
##    city   temp dewpoint date        pm25  pm10    O3   NO2
##    <chr> <dbl>    <dbl> <date>     <dbl> <dbl> <dbl> <dbl>
##  1 chic   31.5     31.5 1987-01-01    NA  34    4.25  20.0
##  2 chic   33       29.9 1987-01-02    NA  NA    3.30  23.2
##  3 chic   33       27.4 1987-01-03    NA  34.2  3.33  23.8
##  4 chic   29       28.6 1987-01-04    NA  47    4.38  30.4
##  5 chic   32       28.9 1987-01-05    NA  NA    4.75  30.3
##  6 chic   40       35.1 1987-01-06    NA  48    5.83  25.8
##  7 chic   34.5     26.8 1987-01-07    NA  41    9.29  20.6
##  8 chic   29       22   1987-01-08    NA  36   11.3   17.0
##  9 chic   26.5     29   1987-01-09    NA  33.3  4.5   23.4
## 10 chic   32.5     27.8 1987-01-10    NA  NA    4.96  19.5
## # ... with 6,930 more rows
Table 13: Rename variables.
city temp dewpoint date pm25 pm10 O3 NO2
chic 31.5 31.500 1987-01-01 NA 34.00000 4.250000 19.98810
chic 33.0 29.875 1987-01-02 NA NA 3.304348 23.19099
chic 33.0 27.375 1987-01-03 NA 34.16667 3.333333 23.81548
chic 29.0 28.625 1987-01-04 NA 47.00000 4.375000 30.43452
chic 32.0 28.875 1987-01-05 NA NA 4.750000 30.33333
chic 40.0 35.125 1987-01-06 NA 48.00000 5.833333 25.77233
chic 34.5 26.750 1987-01-07 NA 41.00000 9.291667 20.58171
chic 29.0 22.000 1987-01-08 NA 36.00000 11.291667 17.03723
chic 26.5 29.000 1987-01-09 NA 33.28571 4.500000 23.38889
chic 32.5 27.750 1987-01-10 NA NA 4.958333 19.54167

mutate()

In this example, the mutate() function adds a new variable to the data. In this case, we can take the mean of no2tmean2 and subtract it from itself to standardize the variable. This will show if the mean is above or below the average of 25.23 (taken from table 2). In this one line, I calculated and created a new variable for the data called no2detrend. A positive detrend indicates above the mean, and a negative value indicates below the mean for that individual record. See table 14.

smlchicago <- chicago %>%
  select(tmpd, dptp, o3tmean2, no2tmean2)

smlchicago %>%
  mutate(nO2detrend = no2tmean2 - mean(no2tmean2, na.rm=TRUE))
## # A tibble: 6,940 x 5
##     tmpd  dptp o3tmean2 no2tmean2 nO2detrend
##    <dbl> <dbl>    <dbl>     <dbl>      <dbl>
##  1  31.5  31.5     4.25      20.0     -5.24 
##  2  33    29.9     3.30      23.2     -2.04 
##  3  33    27.4     3.33      23.8     -1.42 
##  4  29    28.6     4.38      30.4      5.20 
##  5  32    28.9     4.75      30.3      5.10 
##  6  40    35.1     5.83      25.8      0.540
##  7  34.5  26.8     9.29      20.6     -4.65 
##  8  29    22      11.3       17.0     -8.19 
##  9  26.5  29       4.5       23.4     -1.84 
## 10  32.5  27.8     4.96      19.5     -5.69 
## # ... with 6,930 more rows
Table 14: Create a new variable (difference from the mean).
tmpd dptp o3tmean2 no2tmean2 nO2detrend
31.5 31.500 4.250000 19.98810 -5.2437869
33.0 29.875 3.304348 23.19099 -2.0408884
33.0 27.375 3.333333 23.81548 -1.4164060
29.0 28.625 4.375000 30.43452 5.2026417
32.0 28.875 4.750000 30.33333 5.1014512
40.0 35.125 5.833333 25.77233 0.5404485
34.5 26.750 9.291667 20.58171 -4.6501722
29.0 22.000 11.291667 17.03723 -8.1946513
26.5 29.000 4.500000 23.38889 -1.8429933
32.5 27.750 4.958333 19.54167 -5.6902155

transmute()

The transmute() does the same as above but breaks the new variables out into a tibble or data frame. In table 15, the three new variables create a new tibble. The temperature tmpd variable now gives a better idea of where it is in comparison to the mean average temperature.

transmute(chicago, 
          tmpdS = tmpd - mean(tmpd, na.rm=TRUE), 
          o3detrend = o3tmean2 - mean(o3tmean2, na.rm=TRUE), 
          nO2detrend = no2tmean2 - mean(no2tmean2, na.rm=TRUE))
## # A tibble: 6,940 x 3
##    tmpdS o3detrend nO2detrend
##    <dbl>     <dbl>      <dbl>
##  1 -18.8    -15.2      -5.24 
##  2 -17.3    -16.1      -2.04 
##  3 -17.3    -16.1      -1.42 
##  4 -21.3    -15.1       5.20 
##  5 -18.3    -14.7       5.10 
##  6 -10.3    -13.6       0.540
##  7 -15.8    -10.1      -4.65 
##  8 -21.3     -8.14     -8.19 
##  9 -23.8    -14.9      -1.84 
## 10 -17.8    -14.5      -5.69 
## # ... with 6,930 more rows
Table 15: Breakout 3 new variables to new tibble.
tmpdS o3detrend nO2detrend
-18.80934 -15.185513 -5.2437869
-17.30934 -16.131165 -2.0408884
-17.30934 -16.102180 -1.4164060
-21.30934 -15.060513 5.2026417
-18.30934 -14.685513 5.1014512
-10.30934 -13.602180 0.5404485
-15.80934 -10.143846 -4.6501722
-21.30934 -8.143846 -8.1946513
-23.80934 -14.935513 -1.8429933
-17.80934 -14.477180 -5.6902155

group_by()

summarize()

Table 16 shows how to transform the data. I created a new variable “year” The year is extracted from the y/m/d string using lubridates year() function. The data is organized by year. The summarize() function creates six new columns of min/max measurements per year for three variables. The pm25tmean2 didn’t have any records for ten years until 1998.

mutate(chicago, year = (lubridate::year(date))) %>%
  group_by(year) %>% 
  summarize( 
          pm25min=min(pm25tmean2, na.rm=TRUE),
          pm25max=max(pm25tmean2, na.rm=TRUE),
          o3min=min(o3tmean2, na.rm=TRUE),
          o3max=max(o3tmean2, na.rm=TRUE),
          no2mmin=min(no2tmean2, na.rm=TRUE),
          no2max=max(no2tmean2, na.rm=TRUE)
          )
## # A tibble: 19 x 7
##     year pm25min pm25max o3min o3max no2mmin no2max
##    <dbl>   <dbl>   <dbl> <dbl> <dbl>   <dbl>  <dbl>
##  1  1987  Inf     -Inf   2.12   63.0    7.87   57.5
##  2  1988  Inf     -Inf   2.38   61.7    7.73   62.5
##  3  1989  Inf     -Inf   2      59.7    9.58   59.5
##  4  1990  Inf     -Inf   2.12   52.2    6.73   46.2
##  5  1991  Inf     -Inf   2      63.1    6.39   55.5
##  6  1992  Inf     -Inf   2      50.8    9.96   49.8
##  7  1993  Inf     -Inf   0.427  44.3   10.3    48.9
##  8  1994  Inf     -Inf   0.153  52.2   10.9    53.9
##  9  1995  Inf     -Inf   0.517  66.6    8.93   53.1
## 10  1996  Inf     -Inf   1.17   58.4   10.5    52.7
## 11  1997  Inf     -Inf   0.957  56.5    7.73   53.0
## 12  1998    2.1     56.5 0.556  50.7    8.8    47.8
## 13  1999    5.18    49.0 0.854  57.5    7.64   49.5
## 14  2000    2.8     46.7 1.04   55.8    6.16   50.8
## 15  2001    3.2     49.5 0.5    51.8    8.65   48.7
## 16  2002    2.91    44.0 2.10   54.9    7.80   47.8
## 17  2003    2.5     47.7 1.10   56.2    7.70   57.6
## 18  2004    2.6     48.2 2.20   44.5    7.55   45.0
## 19  2005    1.7     61.5 1.77   58.8    8.83   48.6
Table 16: Minimum-Maximum values per year over 19 years.
year pm25min pm25max o3min o3max no2mmin no2max
1987 Inf -Inf 2.1250000 62.96966 7.866667 57.52542
1988 Inf -Inf 2.3750000 61.67708 7.733333 62.47998
1989 Inf -Inf 2.0000000 59.72727 9.580729 59.51562
1990 Inf -Inf 2.1250000 52.22917 6.726172 46.19444
1991 Inf -Inf 2.0000000 63.10417 6.385417 55.48958
1992 Inf -Inf 2.0000000 50.82870 9.962500 49.77714
1993 Inf -Inf 0.4270833 44.30093 10.263768 48.90139
1994 Inf -Inf 0.1527778 52.17844 10.944444 53.89485
1995 Inf -Inf 0.5169082 66.58750 8.925000 53.13333
1996 Inf -Inf 1.1666667 58.39583 10.458333 52.68850
1997 Inf -Inf 0.9565217 56.54167 7.729167 53.01042
1998 2.1000 56.50000 0.5555556 50.66250 8.800000 47.81558
1999 5.1800 49.04286 0.8541667 57.48864 7.641667 49.54924
2000 2.8000 46.70000 1.0416667 55.76103 6.158333 50.81123
2001 3.2000 49.47143 0.5000000 51.81984 8.650000 48.71875
2002 2.9125 44.04286 2.0964674 54.88043 7.802083 47.76042
2003 2.5000 47.73750 1.0978261 56.16608 7.697917 57.56439
2004 2.6000 48.15714 2.1965580 44.48240 7.552083 44.97817
2005 1.7000 61.50000 1.7708333 58.84126 8.833333 48.59722

Table 17 displays the same min/max data for the three variables; pm25tmean2, o3tmean2, and no2tmean2. And using the group_by() I extracted the month() out with lubridate. This breakdown shows further relationships in the data and has transformed the original data into something to devise a hypothesis.

mutate(chicago, month = (lubridate::month(date))) %>%
  group_by(month) %>%
  summarize( 
          pm25min=min(pm25tmean2, na.rm=TRUE),
          pm25max=max(pm25tmean2, na.rm=TRUE),
          o3min=min(o3tmean2, na.rm=TRUE),
          o3max=max(o3tmean2, na.rm=TRUE),
          no2min=min(no2tmean2, na.rm=TRUE),
          no2max=max(no2tmean2, na.rm=TRUE)
)
## # A tibble: 12 x 7
##    month pm25min pm25max o3min o3max no2min no2max
##    <dbl>   <dbl>   <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1     1    3.3     49.5 0.5    28.2   7.73   52.4
##  2     2    3.8     61.5 1.05   37.4  10.5    57.6
##  3     3    5.4     47.7 1.13   39.0   7.73   53.9
##  4     4    1.7     46.7 7.24   47.9   7.80   57.5
##  5     5    2.4     56.5 4.06   52.8   7.95   51.3
##  6     6    2.8     51.5 6.13   66.6   6.39   52.2
##  7     7    3       41.4 9.69   59.5   7.06   62.5
##  8     8    2.5     44.5 4.90   54.0   7.30   55.9
##  9     9    3.2     56.5 2.71   57.5   6.16   53.1
## 10    10    4       43.6 1.14   47.1   6.73   59.5
## 11    11    3.28    41.7 0.854  29.5   7.35   50  
## 12    12    3.91    42.3 0.153  27.7   9.20   50.3
Table 17: Minimum-Maximum values per month over 19 years.
month pm25min pm25max o3min o3max no2mmin no2max
1 3.300000 49.47143 0.5000000 28.22222 7.729167 52.44697
2 3.800000 61.50000 1.0492424 37.37500 10.490972 57.56439
3 5.400000 47.73750 1.1316425 39.05000 7.733333 53.89485
4 1.700000 46.70000 7.2409656 47.94907 7.802083 57.52542
5 2.400000 56.50000 4.0615942 52.75000 7.947917 51.28535
6 2.800000 51.53750 6.1296212 66.58750 6.385417 52.20833
7 3.000000 41.40000 9.6912202 59.54167 7.064513 62.47998
8 2.500000 44.48750 4.8981884 53.96701 7.295592 55.90522
9 3.200000 56.50000 2.7121706 57.48864 6.158333 53.13333
10 4.000000 43.60000 1.1443511 47.09275 6.726172 59.51562
11 3.275000 41.70000 0.8541667 29.45833 7.350000 50.00000
12 3.914286 42.30000 0.1527778 27.70833 9.204167 50.28154

Table 18 focuses on per year measurements. Rather than min/max (0% and 100% percentiles), we can look at the quantiles of 20-80 and get a better understanding of the probability distribution by year.

mutate(chicago, year = (lubridate::year(date))) %>%
  group_by(year) %>%
  summarize( 
          pm25Q20=quantile(pm25tmean2, 0.2, na.rm=TRUE),
          pm25Q80=quantile(pm25tmean2, 0.8, na.rm=TRUE),
          o3Q20=quantile(o3tmean2, 0.2, na.rm=TRUE),
          o3Q80=quantile(o3tmean2, 0.8, na.rm=TRUE),
          no2Q20=quantile(no2tmean2, 0.2, na.rm=TRUE),
          no2Q80=quantile(no2tmean2, 0.8, na.rm=TRUE)
          )
## # A tibble: 19 x 7
##     year pm25Q20 pm25Q80 o3Q20 o3Q80 no2Q20 no2Q80
##    <dbl>   <dbl>   <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1  1987   NA       NA    9.02  31.2   18.3   32.3
##  2  1988   NA       NA    9.33  34.1   18.7   31.4
##  3  1989   NA       NA    9.55  30.9   20.0   33.2
##  4  1990   NA       NA    9.40  29.0   15.8   29.3
##  5  1991   NA       NA    9.18  29.6   15.5   27.9
##  6  1992   NA       NA    6.77  23.7   20.2   30.9
##  7  1993   NA       NA    6.85  24.7   20.0   31.6
##  8  1994   NA       NA    7.97  26.6   21.1   35.9
##  9  1995   NA       NA    6.84  27.6   20.7   34.9
## 10  1996   NA       NA    6.34  25.1   20.7   33.2
## 11  1997   NA       NA    8.13  28.5   19.0   32.2
## 12  1998   10.5     25.4  6.80  30.0   18.6   30.9
## 13  1999   10.4     25.3  8.50  30.3   18.7   30.9
## 14  2000    9.54    24.2  8.31  27.6   16.8   31.0
## 15  2001    9.1     23.5  8.13  29.2   19.3   31.3
## 16  2002    8.51    21.0 10.4   30.5   17.9   29.3
## 17  2003    8.58    21.1 10.7   31.1   18.7   31.6
## 18  2004    7.89    19.4 11.2   28.9   17.3   29.4
## 19  2005    7.9     22.2 11.4   34.5   16.6   28.6
Table 18: Quantiles 20-80 per year distribution.
year pm25Q20 pm25Q80 o3Q20 o3Q80 no2Q20 no2Q80
1987 NA NA 9.015942 31.17821 18.27433 32.33631
1988 NA NA 9.333333 34.06250 18.65417 31.37754
1989 NA NA 9.550000 30.89374 20.04166 33.24937
1990 NA NA 9.395652 29.02654 15.77971 29.32133
1991 NA NA 9.183333 29.58202 15.51984 27.91394
1992 NA NA 6.770833 23.66204 20.20698 30.87500
1993 NA NA 6.845169 24.65485 20.02333 31.60217
1994 NA NA 7.966667 26.55242 21.07083 35.85169
1995 NA NA 6.844192 27.57417 20.65522 34.94202
1996 NA NA 6.335824 25.11111 20.69792 33.22391
1997 NA NA 8.127778 28.49761 19.00802 32.19722
1998 10.520000 25.40000 6.796860 30.00888 18.63361 30.94758
1999 10.377500 25.33429 8.496717 30.28982 18.71836 30.88763
2000 9.540000 24.19714 8.305556 27.58080 16.79167 31.04167
2001 9.100000 23.51600 8.131966 29.24917 19.29803 31.26060
2002 8.510000 20.96000 10.408333 30.50463 17.90625 29.29532
2003 8.580000 21.14000 10.738095 31.06458 18.66014 31.62000
2004 7.887143 19.43000 11.166667 28.85202 17.30729 29.37500
2005 7.900000 22.20000 11.426826 34.53244 16.59653 28.55060

Table 19 is the break down of per month probability distribution using 20/80. The break down helps us explore the data deeper and find if the data is normal or skewed over a monthly trend.

mutate(chicago, month = (lubridate::month(date))) %>%
  group_by(month) %>%
  summarize( 
          pm25Q20=quantile(pm25tmean2, 0.2, na.rm=TRUE),
          pm25Q80=quantile(pm25tmean2, 0.8, na.rm=TRUE),
          o3Q20=quantile(o3tmean2, 0.2, na.rm=TRUE),
          o3Q80=quantile(o3tmean2, 0.8, na.rm=TRUE),
          no2Q20=quantile(no2tmean2, 0.2, na.rm=TRUE),
          no2Q80=quantile(no2tmean2, 0.8, na.rm=TRUE)
          )
## # A tibble: 12 x 7
##    month pm25Q20 pm25Q80 o3Q20 o3Q80 no2Q20 no2Q80
##    <dbl>   <dbl>   <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1     1   10.9     24.4  4.45  13.5   19.9   32.1
##  2     2   10.5     27.6  7.90  18.5   21.0   33.2
##  3     3    9.7     25.5 12.5   26.6   19.9   33.3
##  4     4    7.72    18.4 18.1   30.5   18.6   32.6
##  5     5    7.8     19.1 20.9   33.5   17.9   31.5
##  6     6    8.7     21.8 22.5   40.5   19.0   32.8
##  7     7    9.24    23.1 22.5   38.3   16.4   29.3
##  8     8    9.24    23.2 19.3   33.3   17.5   30.3
##  9     9    7.62    22.7 12.6   28.7   18.2   33.0
## 10    10    7.6     19.3  8.13  18.4   18.3   31.5
## 11    11    8.50    20.5  4.16  12.4   17.6   29.6
## 12    12   10.6     23.3  3.26  11.2   19.5   29.9
Table 19: Quantiles 20-80 per month distibution.
month pm25Q20 pm25Q80 o3Q20 o3Q80 no2Q20 no2Q80
1 10.920000 24.41333 4.445652 13.46167 19.88214 32.08750
2 10.470000 27.61714 7.899758 18.51063 20.99167 33.17778
3 9.700000 25.46500 12.525000 26.62478 19.86111 33.27674
4 7.724286 18.42000 18.090657 30.48091 18.61875 32.58999
5 7.800000 19.10000 20.944556 33.54967 17.86199 31.46076
6 8.700000 21.85000 22.476235 40.49473 19.00893 32.78384
7 9.240000 23.08500 22.522239 38.30213 16.40743 29.28919
8 9.240000 23.17500 19.326223 33.28750 17.47797 30.29226
9 7.617143 22.66714 12.640212 28.65187 18.18177 33.00062
10 7.600000 19.28000 8.134375 18.38452 18.25942 31.51581
11 8.495000 20.51714 4.161111 12.40333 17.56083 29.62459
12 10.580000 23.26667 3.257699 11.18927 19.52083 29.87431

Conclusion

The article covered several of the main tools found in the tidyverse package for R. I demonstrated around ten different ways to manipulate and transform a dataset using the dplyr functions. We learned the Chicago data records over 19 years of air quality measurements. The pm25tmeans2 was not measured for the first ten years and showed by 4447 missing values. The primary dplyr functions used and combined are:

References

Peng, R. D. (n.d.). Chicago dataset. Retrieved from: http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip

Peng, R. D. (2016). Exploratory Data Analysis with R. p. 4-17. Retrieved from: https://leanpub.com/exdata

RDocumentation. (2018). Introduction to dplyr. Retrieved from: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. p. 43-75, O’Reilly.