Data Exploration - The Chicago Dataset | Bryan Schafroth Portfolio

Introduction

This article is about using the tidyverse package in R to explore data. The Chicago dataset is the basis for this analysis. The data is air quality measurements taken over 19 years. The focus is on using the dplyr package to do the analysis. The dplyr package is part of the tidyverse in R and a suite of tools to analyze, transform, plot, and manipulate data. The base version of R does all of this; however, dplyr will work faster and cost fewer computing resources when used on big data. The functions used to explore the data are as follows:

names()
select()
filter()
summarize()
arrange()
rename()
mutate()
transmute()
group_by()
%>% (pipe)

Additional packages used are:

Tidyverse
Readr
Lubridate
Skimr
Knitr

read_rds()

I start with importing the data into a data frame with the readr package, and the structure of the data is a tibble, which is the tidyverse data frame structure. In the output, the tibble structure will list the first ten rows of data and column headers with the data types in this dataset.

chr is characters
dbl is real numbers
date are dates

Not shown in this data but worth mentioning:

int is integers
dttm is date-time
lgl is logical
fctr is factors or categorical

chicago = as_tibble(readr::read_rds("../../static/data/chicago.rds"))

(chicago)

## # A tibble: 6,940 x 8
##    city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 chic   31.5  31.5 1987-01-01         NA       34       4.25      20.0
##  2 chic   33    29.9 1987-01-02         NA       NA       3.30      23.2
##  3 chic   33    27.4 1987-01-03         NA       34.2     3.33      23.8
##  4 chic   29    28.6 1987-01-04         NA       47       4.38      30.4
##  5 chic   32    28.9 1987-01-05         NA       NA       4.75      30.3
##  6 chic   40    35.1 1987-01-06         NA       48       5.83      25.8
##  7 chic   34.5  26.8 1987-01-07         NA       41       9.29      20.6
##  8 chic   29    22   1987-01-08         NA       36      11.3       17.0
##  9 chic   26.5  29   1987-01-09         NA       33.3     4.5       23.4
## 10 chic   32.5  27.8 1987-01-10         NA       NA       4.96      19.5
## # ... with 6,930 more rows

Table 1: The Chicago Dataset
city	tmpd	dptp	date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
chic	31.5	31.500	1987-01-01	NA	34.00000	4.250000	19.98810
chic	33.0	29.875	1987-01-02	NA	NA	3.304348	23.19099
chic	33.0	27.375	1987-01-03	NA	34.16667	3.333333	23.81548
chic	29.0	28.625	1987-01-04	NA	47.00000	4.375000	30.43452
chic	32.0	28.875	1987-01-05	NA	NA	4.750000	30.33333
chic	40.0	35.125	1987-01-06	NA	48.00000	5.833333	25.77233
chic	34.5	26.750	1987-01-07	NA	41.00000	9.291667	20.58171
chic	29.0	22.000	1987-01-08	NA	36.00000	11.291667	17.03723
chic	26.5	29.000	1987-01-09	NA	33.28571	4.500000	23.38889
chic	32.5	27.750	1987-01-10	NA	NA	4.958333	19.54167

In table 1, the Chicago dataset has eight variables and 6940 rows. The city variable appears as ‘chic’ for Chicago, and the missing data is filled in by default with NA. Also, the date is in the y/m/d format. We can see the different readings for the real numbers, from 1 to 6 decimal places.

names()

names() function lists all the variable names in the data.

names(chicago)

## [1] "city"       "tmpd"       "dptp"       "date"       "pm25tmean2"
## [6] "pm10tmean2" "o3tmean2"   "no2tmean2"

List the first three names with the index[:]. Note R is a little different than Python. R will display the last index position [3], where Python would display 1 & 2.

names(chicago)[1:3]

## [1] "city" "tmpd" "dptp"

List the last three variables.

names(chicago)[6:8]

## [1] "pm10tmean2" "o3tmean2"   "no2tmean2"

Summary Statistics with `skimr`

The skim() function in the skimr package gives a good idea, in one place, of what the data looks like in table 2. Noting there are 4447 NA’s in pm25tmean2 and 242 NA’s in pm10tmean2. The date range is from 1987 to 2005. The histogram shows us the skew of each variable and a nice feature for skimr.

In base R, I would also use the summary() function. As you will see, the tidyverse does not have a summary function that doesn’t require multiple lines of code. I will generate some of these statistics, but it takes more time than typing in one line into the console.

skimr::skim(chicago)

Table 2: Data summary
Name	chicago
Number of rows	6940
Number of columns	8
_______________________
Column type frequency:
character	1
Date	1
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
city	0	1	4	4	0	1	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	1987-01-01	2005-12-31	1996-07-01	6940

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
tmpd	1	1.00	50.31	19.41	-16.00	35.00	51.00	67.00	92.00	▁▃▇▇▅
dptp	2	1.00	40.34	18.49	-25.62	27.00	39.88	55.75	78.25	▁▂▇▇▅
pm25tmean2	4447	0.36	16.23	8.70	1.70	9.70	14.66	20.60	61.50	▇▇▂▁▁
pm10tmean2	242	0.97	33.90	17.97	2.00	21.50	30.28	42.00	365.00	▇▁▁▁▁
o3tmean2	0	1.00	19.44	11.39	0.15	10.07	18.52	27.00	66.59	▇▇▅▁▁
no2tmean2	0	1.00	25.23	7.99	6.16	19.65	24.56	30.14	62.48	▂▇▃▁▁

select()

To view specific variables, I use the select() function. Perhaps this small dataset in not going to show the benefits of select(), however, with hundreds or thousands of variables, select() will prove a worthy function. Here in table 3, we can isolate the variables city through dptp very quickly.

select(chicago, 
       city:dptp)

## # A tibble: 6,940 x 3
##    city   tmpd  dptp
##    <chr> <dbl> <dbl>
##  1 chic   31.5  31.5
##  2 chic   33    29.9
##  3 chic   33    27.4
##  4 chic   29    28.6
##  5 chic   32    28.9
##  6 chic   40    35.1
##  7 chic   34.5  26.8
##  8 chic   29    22  
##  9 chic   26.5  29  
## 10 chic   32.5  27.8
## # ... with 6,930 more rows

Table 3: Select the first three variables.
city	tmpd	dptp
chic	31.5	31.500
chic	33.0	29.875
chic	33.0	27.375
chic	29.0	28.625
chic	32.0	28.875
chic	40.0	35.125
chic	34.5	26.750
chic	29.0	22.000
chic	26.5	29.000
chic	32.5	27.750

In table 4, select() can also do the inverse of city:dptp and exclude those variables to display the remaining variables.

select(chicago, -(city:dptp))

## # A tibble: 6,940 x 5
##    date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 1987-01-01         NA       34       4.25      20.0
##  2 1987-01-02         NA       NA       3.30      23.2
##  3 1987-01-03         NA       34.2     3.33      23.8
##  4 1987-01-04         NA       47       4.38      30.4
##  5 1987-01-05         NA       NA       4.75      30.3
##  6 1987-01-06         NA       48       5.83      25.8
##  7 1987-01-07         NA       41       9.29      20.6
##  8 1987-01-08         NA       36      11.3       17.0
##  9 1987-01-09         NA       33.3     4.5       23.4
## 10 1987-01-10         NA       NA       4.96      19.5
## # ... with 6,930 more rows

Table 4: Select all except first three variables.
date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
1987-01-01	NA	34.00000	4.250000	19.98810
1987-01-02	NA	NA	3.304348	23.19099
1987-01-03	NA	34.16667	3.333333	23.81548
1987-01-04	NA	47.00000	4.375000	30.43452
1987-01-05	NA	NA	4.750000	30.33333
1987-01-06	NA	48.00000	5.833333	25.77233
1987-01-07	NA	41.00000	9.291667	20.58171
1987-01-08	NA	36.00000	11.291667	17.03723
1987-01-09	NA	33.28571	4.500000	23.38889
1987-01-10	NA	NA	4.958333	19.54167

Table 5 demonstrates the ends_with() and table 6 the starts_with() functions that only work inside of select(). Additional functions here would be:

matches()
num_range()
one_of()
everything()
contains()

Lets find all the variables that end with the number 2 in table 5.

select(chicago, 
       ends_with("2"))

## # A tibble: 6,940 x 4
##    pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##         <dbl>      <dbl>    <dbl>     <dbl>
##  1         NA       34       4.25      20.0
##  2         NA       NA       3.30      23.2
##  3         NA       34.2     3.33      23.8
##  4         NA       47       4.38      30.4
##  5         NA       NA       4.75      30.3
##  6         NA       48       5.83      25.8
##  7         NA       41       9.29      20.6
##  8         NA       36      11.3       17.0
##  9         NA       33.3     4.5       23.4
## 10         NA       NA       4.96      19.5
## # ... with 6,930 more rows

Table 5: Select variables ending in 2.
pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
NA	34.00000	4.250000	19.98810
NA	NA	3.304348	23.19099
NA	34.16667	3.333333	23.81548
NA	47.00000	4.375000	30.43452
NA	NA	4.750000	30.33333
NA	48.00000	5.833333	25.77233
NA	41.00000	9.291667	20.58171
NA	36.00000	11.291667	17.03723
NA	33.28571	4.500000	23.38889
NA	NA	4.958333	19.54167

Let’s find all the variables that start with the letter “d” in table 6.

select(chicago, 
       starts_with("d"))

## # A tibble: 6,940 x 2
##     dptp date      
##    <dbl> <date>    
##  1  31.5 1987-01-01
##  2  29.9 1987-01-02
##  3  27.4 1987-01-03
##  4  28.6 1987-01-04
##  5  28.9 1987-01-05
##  6  35.1 1987-01-06
##  7  26.8 1987-01-07
##  8  22   1987-01-08
##  9  29   1987-01-09
## 10  27.8 1987-01-10
## # ... with 6,930 more rows

Table 6: Select variables starting with the letter d.
dptp	date
31.500	1987-01-01
29.875	1987-01-02
27.375	1987-01-03
28.625	1987-01-04
28.875	1987-01-05
35.125	1987-01-06
26.750	1987-01-07
22.000	1987-01-08
29.000	1987-01-09
27.750	1987-01-10

filter()

The filter() function is a way to find values in the data. Here the results are placed in table 7. Shown is filtering all pm25tmean2 variables that are greater than 30. We can do any combination of variables with specific parameters. The tibble output indicates there are 194 instances where pm25tmean2 is above a reading of 30.

filter(chicago, 
       pm25tmean2 > 30)

## # A tibble: 194 x 8
##    city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 chic     23  21.9 1998-01-17       38.1       32.5     3.18      25.3
##  2 chic     28  25.8 1998-01-23       34.0       38.7     1.75      29.4
##  3 chic     55  51.3 1998-04-30       39.4       34      10.8       25.3
##  4 chic     59  53.7 1998-05-01       35.4       28.5    14.3       31.4
##  5 chic     57  52   1998-05-02       33.3       35      20.7       26.8
##  6 chic     57  56   1998-05-07       32.1       34.5    24.3       34.0
##  7 chic     75  65.8 1998-05-15       56.5       91      38.6       29.0
##  8 chic     61  59   1998-06-09       33.8       26      17.9       25.5
##  9 chic     73  60.3 1998-07-13       30.3       64.5    37.0       37.9
## 10 chic     78  67.1 1998-07-14       41.4       75      40.1       32.6
## # ... with 184 more rows

Table 7: Filter variable pm25tmean2 greater than 30.
city	tmpd	dptp	date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
chic	23	21.9	1998-01-17	38.10	32.46154	3.180556	25.30000
chic	28	25.8	1998-01-23	33.95	38.69231	1.750000	29.37630
chic	55	51.3	1998-04-30	39.40	34.00000	10.786232	25.31310
chic	59	53.7	1998-05-01	35.40	28.50000	14.295125	31.42905
chic	57	52.0	1998-05-02	33.30	35.00000	20.662879	26.79861
chic	57	56.0	1998-05-07	32.10	34.50000	24.270422	33.99167
chic	75	65.8	1998-05-15	56.50	91.00000	38.573007	29.03261
chic	61	59.0	1998-06-09	33.80	26.00000	17.890810	25.49668
chic	73	60.3	1998-07-13	30.30	64.50000	37.018864	37.93056
chic	78	67.1	1998-07-14	41.40	75.00000	40.080902	32.59054

Table 8 filters all values of the variable pm25tmean2 greater than 30, and I summarize to refine that to the minimum and maximum values. The table is showing that 30.05 is the lowest, and 61.5 is the highest value after filtering the variable. %>% is shown in the code and a way to write a few sequence operations together.

chicago %>%
  filter(pm25tmean2 > 30) %>%
  summarize(minValue = min(pm25tmean2), 
            maxValue = max(pm25tmean2))

## # A tibble: 1 x 2
##   minValue maxValue
##      <dbl>    <dbl>
## 1     30.0     61.5

Table 8: Variable pm25tmean2 greater than 30.
minValue	maxValue
30.05	61.5

Table 9 is a refinement in table 8. Here I want to find the values higher than 30 in pm25tmean2 and values greater than 80 in tmpd. In the output tibble, there are only 17 instances where this condition is true and a little clustering in 2001, 2002, and 2005.

chicago %>%
  select(date, tmpd, pm25tmean2) %>%
  filter(pm25tmean2 > 30 & tmpd > 80)

## # A tibble: 17 x 3
##    date        tmpd pm25tmean2
##    <date>     <dbl>      <dbl>
##  1 1998-08-23    81       39.6
##  2 1998-09-06    81       31.5
##  3 2001-07-20    82       32.3
##  4 2001-08-01    84       43.7
##  5 2001-08-08    85       38.8
##  6 2001-08-09    84       38.2
##  7 2002-06-20    82       33  
##  8 2002-06-23    82       42.5
##  9 2002-07-08    81       33.1
## 10 2002-07-18    82       38.8
## 11 2003-06-25    82       33.9
## 12 2003-07-04    84       32.9
## 13 2005-06-24    86       31.9
## 14 2005-06-27    82       51.5
## 15 2005-06-28    85       31.2
## 16 2005-07-17    84       32.7
## 17 2005-08-03    84       37.9

Table 9: Filter pm25tmean2 >30 & tmpd >80 with date.
date	tmpd	pm25tmean2
1998-08-23	81	39.60000
1998-09-06	81	31.50000
2001-07-20	82	32.30000
2001-08-01	84	43.70000
2001-08-08	85	38.83750
2001-08-09	84	38.20000
2002-06-20	82	33.00000
2002-06-23	82	42.50000
2002-07-08	81	33.10000
2002-07-18	82	38.85000
2003-06-25	82	33.90000
2003-07-04	84	32.90000
2005-06-24	86	31.85714
2005-06-27	82	51.53750
2005-06-28	85	31.20000
2005-07-17	84	32.70000
2005-08-03	84	37.90000

This filtering collects all values above 80 for the tmpd variable and then summarizes to find the minimum and maximum values. 80.5 is the minimum, and 92 is the maximum.

chicago %>%
  filter(tmpd > 80) %>%
  summarize(minValue = min(tmpd), 
            maxValue = max(tmpd))

## # A tibble: 1 x 2
##   minValue maxValue
##      <dbl>    <dbl>
## 1     80.5       92

arrange()

Table 10 shows the arrange() function and sorts the o3tmean2 variable. Default from lowest to highest numerical values.

chicago %>%
  arrange(o3tmean2)

## # A tibble: 6,940 x 8
##    city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 chic   43.5  39   1994-12-05       NA         47.5    0.153      27.2
##  2 chic   32.5  34.4 1994-12-15       NA         43.5    0.361      30.5
##  3 chic   38.5  38.5 1993-12-17       NA         16      0.427      23.9
##  4 chic   23.5  16.9 1994-12-13       NA         48.5    0.492      33.3
##  5 chic   34    33.9 2001-01-14       28         22      0.5        26.2
##  6 chic   33    37.5 1995-01-11       NA         68      0.517      37.8
##  7 chic   40.5  44.9 1993-01-03       NA         21.5    0.528      19.8
##  8 chic   61    59.3 1998-12-04       31.6       42.5    0.556      24.4
##  9 chic   35.5  34   1993-01-21       NA         19      0.653      27.0
## 10 chic   36.5  36   1994-12-16       NA         41      0.694      34.4
## # ... with 6,930 more rows

Table 10: Arrange O3 mean variable.
city	tmpd	dptp	date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
chic	43.5	39.000	1994-12-05	NA	47.5	0.1527778	27.20399
chic	32.5	34.375	1994-12-15	NA	43.5	0.3611111	30.45452
chic	38.5	38.500	1993-12-17	NA	16.0	0.4270833	23.92361
chic	23.5	16.875	1994-12-13	NA	48.5	0.4915459	33.34242
chic	34.0	33.900	2001-01-14	28.0	22.0	0.5000000	26.18750
chic	33.0	37.500	1995-01-11	NA	68.0	0.5169082	37.77083
chic	40.5	44.875	1993-01-03	NA	21.5	0.5277778	19.82500
chic	61.0	59.300	1998-12-04	31.6	42.5	0.5555556	24.39167
chic	35.5	34.000	1993-01-21	NA	19.0	0.6527778	27.00284
chic	36.5	36.000	1994-12-16	NA	41.0	0.6944444	34.39167

Table 11 sorts the tmpd variable lowest to highest.

chicago %>%
  arrange(tmpd)

## # A tibble: 6,940 x 8
##    city   tmpd   dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <chr> <dbl>  <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 chic  -16   -25.6  1994-01-18         NA       27      14.7       22.1
##  2 chic  -12   -18.9  1996-02-03         NA       25.6     7.74      30.1
##  3 chic  -10.5 -23.1  1994-01-15         NA       36      12.0       24.7
##  4 chic  -10.5 -21    1996-02-02         NA       35       9.56      31.5
##  5 chic   -8    -2.38 1994-01-19         NA       24.5     4.86      36.6
##  6 chic   -6.5 -15.8  1988-01-05         NA       24      13.2       17.2
##  7 chic   -6.5 -23    1989-12-21         NA       49      16.7       19.5
##  8 chic   -6    -7.88 1996-02-04         NA       21       9.58      30.0
##  9 chic   -5.5 -15.1  1988-01-06         NA       75       7.15      27.7
## 10 chic   -4.5 -11.9  1989-12-22         NA       49       9.42      29.4
## # ... with 6,930 more rows

Table 11: Arrange tmpd variable.
city	tmpd	dptp	date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
chic	-16.0	-25.625	1994-01-18	NA	27.00000	14.656250	22.10236
chic	-12.0	-18.875	1996-02-03	NA	25.61538	7.736111	30.13542
chic	-10.5	-23.125	1994-01-15	NA	36.00000	12.010417	24.70139
chic	-10.5	-21.000	1996-02-02	NA	35.00000	9.555556	31.47917
chic	-8.0	-2.375	1994-01-19	NA	24.50000	4.864583	36.59796
chic	-6.5	-15.750	1988-01-05	NA	24.00000	13.166667	17.22083
chic	-6.5	-23.000	1989-12-21	NA	49.00000	16.666667	19.50000
chic	-6.0	-7.875	1996-02-04	NA	21.00000	9.583333	29.95833
chic	-5.5	-15.125	1988-01-06	NA	75.00000	7.145833	27.66594
chic	-4.5	-11.875	1989-12-22	NA	49.00000	9.416667	29.41667

In table 12 we can sort the tmpd variable in descending order by adding the parameter to the arrange() function.

chicago %>%
  arrange(desc(tmpd))

## # A tibble: 6,940 x 8
##    city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
##  1 chic   92    76.4 1995-07-13       NA         92.5     58.5      27.3
##  2 chic   91.5  75.6 1995-07-14       NA         54.5     52.0      24.8
##  3 chic   90    77.1 1999-07-30       NA         54       34.2      27.5
##  4 chic   90    70.9 2005-07-24       22.0       44       47.6      12.0
##  5 chic   89.5  70.4 1988-08-02       NA         67       49.8      25.9
##  6 chic   89    71.1 1988-08-01       NA         67.2     44.1      26.0
##  7 chic   88.5  70.2 1988-06-20       NA         88       61.7      25.6
##  8 chic   88    67.4 1988-08-03       NA         65       40.6      27.7
##  9 chic   88    66.4 1988-08-17       NA         71       39.3      23.5
## 10 chic   88    71.2 1991-07-22       NA         70       34.9      24.6
## # ... with 6,930 more rows

Table 12: Arrange tmpd variable descending.
city	tmpd	dptp	date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
chic	92.0	76.375	1995-07-13	NA	92.5	58.54952	27.31775
chic	91.5	75.625	1995-07-14	NA	54.5	52.00397	24.77355
chic	90.0	77.100	1999-07-30	NA	54.0	34.17065	27.47977
chic	90.0	70.900	2005-07-24	22.05	44.0	47.60417	11.96354
chic	89.5	70.375	1988-08-02	NA	67.0	49.84949	25.91973
chic	89.0	71.125	1988-08-01	NA	67.2	44.14991	25.96309
chic	88.5	70.250	1988-06-20	NA	88.0	61.67708	25.64027
chic	88.0	67.375	1988-08-03	NA	65.0	40.64176	27.73958
chic	88.0	66.375	1988-08-17	NA	71.0	39.31240	23.50521
chic	88.0	71.250	1991-07-22	NA	70.0	34.85781	24.55123

rename()

The rename() function will be beneficial if you need to give the variables a descriptive name. In this example, table 13, I shorten the variable name and process these all in one line of code.

chicago %>%
  rename(temp=tmpd, 
         dewpoint=dptp, 
         pm25=pm25tmean2, 
         pm10=pm10tmean2, 
         O3=o3tmean2, 
         NO2=no2tmean2)

## # A tibble: 6,940 x 8
##    city   temp dewpoint date        pm25  pm10    O3   NO2
##    <chr> <dbl>    <dbl> <date>     <dbl> <dbl> <dbl> <dbl>
##  1 chic   31.5     31.5 1987-01-01    NA  34    4.25  20.0
##  2 chic   33       29.9 1987-01-02    NA  NA    3.30  23.2
##  3 chic   33       27.4 1987-01-03    NA  34.2  3.33  23.8
##  4 chic   29       28.6 1987-01-04    NA  47    4.38  30.4
##  5 chic   32       28.9 1987-01-05    NA  NA    4.75  30.3
##  6 chic   40       35.1 1987-01-06    NA  48    5.83  25.8
##  7 chic   34.5     26.8 1987-01-07    NA  41    9.29  20.6
##  8 chic   29       22   1987-01-08    NA  36   11.3   17.0
##  9 chic   26.5     29   1987-01-09    NA  33.3  4.5   23.4
## 10 chic   32.5     27.8 1987-01-10    NA  NA    4.96  19.5
## # ... with 6,930 more rows

Table 13: Rename variables.
city	temp	dewpoint	date	pm25	pm10	O3	NO2
chic	31.5	31.500	1987-01-01	NA	34.00000	4.250000	19.98810
chic	33.0	29.875	1987-01-02	NA	NA	3.304348	23.19099
chic	33.0	27.375	1987-01-03	NA	34.16667	3.333333	23.81548
chic	29.0	28.625	1987-01-04	NA	47.00000	4.375000	30.43452
chic	32.0	28.875	1987-01-05	NA	NA	4.750000	30.33333
chic	40.0	35.125	1987-01-06	NA	48.00000	5.833333	25.77233
chic	34.5	26.750	1987-01-07	NA	41.00000	9.291667	20.58171
chic	29.0	22.000	1987-01-08	NA	36.00000	11.291667	17.03723
chic	26.5	29.000	1987-01-09	NA	33.28571	4.500000	23.38889
chic	32.5	27.750	1987-01-10	NA	NA	4.958333	19.54167

mutate()

In this example, the mutate() function adds a new variable to the data. In this case, we can take the mean of no2tmean2 and subtract it from itself to standardize the variable. This will show if the mean is above or below the average of 25.23 (taken from table 2). In this one line, I calculated and created a new variable for the data called no2detrend. A positive detrend indicates above the mean, and a negative value indicates below the mean for that individual record. See table 14.

smlchicago <- chicago %>%
  select(tmpd, dptp, o3tmean2, no2tmean2)

smlchicago %>%
  mutate(nO2detrend = no2tmean2 - mean(no2tmean2, na.rm=TRUE))

## # A tibble: 6,940 x 5
##     tmpd  dptp o3tmean2 no2tmean2 nO2detrend
##    <dbl> <dbl>    <dbl>     <dbl>      <dbl>
##  1  31.5  31.5     4.25      20.0     -5.24 
##  2  33    29.9     3.30      23.2     -2.04 
##  3  33    27.4     3.33      23.8     -1.42 
##  4  29    28.6     4.38      30.4      5.20 
##  5  32    28.9     4.75      30.3      5.10 
##  6  40    35.1     5.83      25.8      0.540
##  7  34.5  26.8     9.29      20.6     -4.65 
##  8  29    22      11.3       17.0     -8.19 
##  9  26.5  29       4.5       23.4     -1.84 
## 10  32.5  27.8     4.96      19.5     -5.69 
## # ... with 6,930 more rows

Table 14: Create a new variable (difference from the mean).
tmpd	dptp	o3tmean2	no2tmean2	nO2detrend
31.5	31.500	4.250000	19.98810	-5.2437869
33.0	29.875	3.304348	23.19099	-2.0408884
33.0	27.375	3.333333	23.81548	-1.4164060
29.0	28.625	4.375000	30.43452	5.2026417
32.0	28.875	4.750000	30.33333	5.1014512
40.0	35.125	5.833333	25.77233	0.5404485
34.5	26.750	9.291667	20.58171	-4.6501722
29.0	22.000	11.291667	17.03723	-8.1946513
26.5	29.000	4.500000	23.38889	-1.8429933
32.5	27.750	4.958333	19.54167	-5.6902155

transmute()

The transmute() does the same as above but breaks the new variables out into a tibble or data frame. In table 15, the three new variables create a new tibble. The temperature tmpd variable now gives a better idea of where it is in comparison to the mean average temperature.

transmute(chicago, 
          tmpdS = tmpd - mean(tmpd, na.rm=TRUE), 
          o3detrend = o3tmean2 - mean(o3tmean2, na.rm=TRUE), 
          nO2detrend = no2tmean2 - mean(no2tmean2, na.rm=TRUE))

## # A tibble: 6,940 x 3
##    tmpdS o3detrend nO2detrend
##    <dbl>     <dbl>      <dbl>
##  1 -18.8    -15.2      -5.24 
##  2 -17.3    -16.1      -2.04 
##  3 -17.3    -16.1      -1.42 
##  4 -21.3    -15.1       5.20 
##  5 -18.3    -14.7       5.10 
##  6 -10.3    -13.6       0.540
##  7 -15.8    -10.1      -4.65 
##  8 -21.3     -8.14     -8.19 
##  9 -23.8    -14.9      -1.84 
## 10 -17.8    -14.5      -5.69 
## # ... with 6,930 more rows

Table 15: Breakout 3 new variables to new tibble.
tmpdS	o3detrend	nO2detrend
-18.80934	-15.185513	-5.2437869
-17.30934	-16.131165	-2.0408884
-17.30934	-16.102180	-1.4164060
-21.30934	-15.060513	5.2026417
-18.30934	-14.685513	5.1014512
-10.30934	-13.602180	0.5404485
-15.80934	-10.143846	-4.6501722
-21.30934	-8.143846	-8.1946513
-23.80934	-14.935513	-1.8429933
-17.80934	-14.477180	-5.6902155

group_by()

summarize()

Table 16 shows how to transform the data. I created a new variable “year” The year is extracted from the y/m/d string using lubridates year() function. The data is organized by year. The summarize() function creates six new columns of min/max measurements per year for three variables. The pm25tmean2 didn’t have any records for ten years until 1998.

mutate(chicago, year = (lubridate::year(date))) %>%
  group_by(year) %>% 
  summarize( 
          pm25min=min(pm25tmean2, na.rm=TRUE),
          pm25max=max(pm25tmean2, na.rm=TRUE),
          o3min=min(o3tmean2, na.rm=TRUE),
          o3max=max(o3tmean2, na.rm=TRUE),
          no2mmin=min(no2tmean2, na.rm=TRUE),
          no2max=max(no2tmean2, na.rm=TRUE)
          )

## # A tibble: 19 x 7
##     year pm25min pm25max o3min o3max no2mmin no2max
##    <dbl>   <dbl>   <dbl> <dbl> <dbl>   <dbl>  <dbl>
##  1  1987  Inf     -Inf   2.12   63.0    7.87   57.5
##  2  1988  Inf     -Inf   2.38   61.7    7.73   62.5
##  3  1989  Inf     -Inf   2      59.7    9.58   59.5
##  4  1990  Inf     -Inf   2.12   52.2    6.73   46.2
##  5  1991  Inf     -Inf   2      63.1    6.39   55.5
##  6  1992  Inf     -Inf   2      50.8    9.96   49.8
##  7  1993  Inf     -Inf   0.427  44.3   10.3    48.9
##  8  1994  Inf     -Inf   0.153  52.2   10.9    53.9
##  9  1995  Inf     -Inf   0.517  66.6    8.93   53.1
## 10  1996  Inf     -Inf   1.17   58.4   10.5    52.7
## 11  1997  Inf     -Inf   0.957  56.5    7.73   53.0
## 12  1998    2.1     56.5 0.556  50.7    8.8    47.8
## 13  1999    5.18    49.0 0.854  57.5    7.64   49.5
## 14  2000    2.8     46.7 1.04   55.8    6.16   50.8
## 15  2001    3.2     49.5 0.5    51.8    8.65   48.7
## 16  2002    2.91    44.0 2.10   54.9    7.80   47.8
## 17  2003    2.5     47.7 1.10   56.2    7.70   57.6
## 18  2004    2.6     48.2 2.20   44.5    7.55   45.0
## 19  2005    1.7     61.5 1.77   58.8    8.83   48.6

Table 16: Minimum-Maximum values per year over 19 years.
year	pm25min	pm25max	o3min	o3max	no2mmin	no2max
1987	Inf	-Inf	2.1250000	62.96966	7.866667	57.52542
1988	Inf	-Inf	2.3750000	61.67708	7.733333	62.47998
1989	Inf	-Inf	2.0000000	59.72727	9.580729	59.51562
1990	Inf	-Inf	2.1250000	52.22917	6.726172	46.19444
1991	Inf	-Inf	2.0000000	63.10417	6.385417	55.48958
1992	Inf	-Inf	2.0000000	50.82870	9.962500	49.77714
1993	Inf	-Inf	0.4270833	44.30093	10.263768	48.90139
1994	Inf	-Inf	0.1527778	52.17844	10.944444	53.89485
1995	Inf	-Inf	0.5169082	66.58750	8.925000	53.13333
1996	Inf	-Inf	1.1666667	58.39583	10.458333	52.68850
1997	Inf	-Inf	0.9565217	56.54167	7.729167	53.01042
1998	2.1000	56.50000	0.5555556	50.66250	8.800000	47.81558
1999	5.1800	49.04286	0.8541667	57.48864	7.641667	49.54924
2000	2.8000	46.70000	1.0416667	55.76103	6.158333	50.81123
2001	3.2000	49.47143	0.5000000	51.81984	8.650000	48.71875
2002	2.9125	44.04286	2.0964674	54.88043	7.802083	47.76042
2003	2.5000	47.73750	1.0978261	56.16608	7.697917	57.56439
2004	2.6000	48.15714	2.1965580	44.48240	7.552083	44.97817
2005	1.7000	61.50000	1.7708333	58.84126	8.833333	48.59722

Table 17 displays the same min/max data for the three variables; pm25tmean2, o3tmean2, and no2tmean2. And using the group_by() I extracted the month() out with lubridate. This breakdown shows further relationships in the data and has transformed the original data into something to devise a hypothesis.

mutate(chicago, month = (lubridate::month(date))) %>%
  group_by(month) %>%
  summarize( 
          pm25min=min(pm25tmean2, na.rm=TRUE),
          pm25max=max(pm25tmean2, na.rm=TRUE),
          o3min=min(o3tmean2, na.rm=TRUE),
          o3max=max(o3tmean2, na.rm=TRUE),
          no2min=min(no2tmean2, na.rm=TRUE),
          no2max=max(no2tmean2, na.rm=TRUE)
)

## # A tibble: 12 x 7
##    month pm25min pm25max o3min o3max no2min no2max
##    <dbl>   <dbl>   <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1     1    3.3     49.5 0.5    28.2   7.73   52.4
##  2     2    3.8     61.5 1.05   37.4  10.5    57.6
##  3     3    5.4     47.7 1.13   39.0   7.73   53.9
##  4     4    1.7     46.7 7.24   47.9   7.80   57.5
##  5     5    2.4     56.5 4.06   52.8   7.95   51.3
##  6     6    2.8     51.5 6.13   66.6   6.39   52.2
##  7     7    3       41.4 9.69   59.5   7.06   62.5
##  8     8    2.5     44.5 4.90   54.0   7.30   55.9
##  9     9    3.2     56.5 2.71   57.5   6.16   53.1
## 10    10    4       43.6 1.14   47.1   6.73   59.5
## 11    11    3.28    41.7 0.854  29.5   7.35   50  
## 12    12    3.91    42.3 0.153  27.7   9.20   50.3

Table 17: Minimum-Maximum values per month over 19 years.
month	pm25min	pm25max	o3min	o3max	no2mmin	no2max
1	3.300000	49.47143	0.5000000	28.22222	7.729167	52.44697
2	3.800000	61.50000	1.0492424	37.37500	10.490972	57.56439
3	5.400000	47.73750	1.1316425	39.05000	7.733333	53.89485
4	1.700000	46.70000	7.2409656	47.94907	7.802083	57.52542
5	2.400000	56.50000	4.0615942	52.75000	7.947917	51.28535
6	2.800000	51.53750	6.1296212	66.58750	6.385417	52.20833
7	3.000000	41.40000	9.6912202	59.54167	7.064513	62.47998
8	2.500000	44.48750	4.8981884	53.96701	7.295592	55.90522
9	3.200000	56.50000	2.7121706	57.48864	6.158333	53.13333
10	4.000000	43.60000	1.1443511	47.09275	6.726172	59.51562
11	3.275000	41.70000	0.8541667	29.45833	7.350000	50.00000
12	3.914286	42.30000	0.1527778	27.70833	9.204167	50.28154

Table 18 focuses on per year measurements. Rather than min/max (0% and 100% percentiles), we can look at the quantiles of 20-80 and get a better understanding of the probability distribution by year.

mutate(chicago, year = (lubridate::year(date))) %>%
  group_by(year) %>%
  summarize( 
          pm25Q20=quantile(pm25tmean2, 0.2, na.rm=TRUE),
          pm25Q80=quantile(pm25tmean2, 0.8, na.rm=TRUE),
          o3Q20=quantile(o3tmean2, 0.2, na.rm=TRUE),
          o3Q80=quantile(o3tmean2, 0.8, na.rm=TRUE),
          no2Q20=quantile(no2tmean2, 0.2, na.rm=TRUE),
          no2Q80=quantile(no2tmean2, 0.8, na.rm=TRUE)
          )

## # A tibble: 19 x 7
##     year pm25Q20 pm25Q80 o3Q20 o3Q80 no2Q20 no2Q80
##    <dbl>   <dbl>   <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1  1987   NA       NA    9.02  31.2   18.3   32.3
##  2  1988   NA       NA    9.33  34.1   18.7   31.4
##  3  1989   NA       NA    9.55  30.9   20.0   33.2
##  4  1990   NA       NA    9.40  29.0   15.8   29.3
##  5  1991   NA       NA    9.18  29.6   15.5   27.9
##  6  1992   NA       NA    6.77  23.7   20.2   30.9
##  7  1993   NA       NA    6.85  24.7   20.0   31.6
##  8  1994   NA       NA    7.97  26.6   21.1   35.9
##  9  1995   NA       NA    6.84  27.6   20.7   34.9
## 10  1996   NA       NA    6.34  25.1   20.7   33.2
## 11  1997   NA       NA    8.13  28.5   19.0   32.2
## 12  1998   10.5     25.4  6.80  30.0   18.6   30.9
## 13  1999   10.4     25.3  8.50  30.3   18.7   30.9
## 14  2000    9.54    24.2  8.31  27.6   16.8   31.0
## 15  2001    9.1     23.5  8.13  29.2   19.3   31.3
## 16  2002    8.51    21.0 10.4   30.5   17.9   29.3
## 17  2003    8.58    21.1 10.7   31.1   18.7   31.6
## 18  2004    7.89    19.4 11.2   28.9   17.3   29.4
## 19  2005    7.9     22.2 11.4   34.5   16.6   28.6

Table 18: Quantiles 20-80 per year distribution.
year	pm25Q20	pm25Q80	o3Q20	o3Q80	no2Q20	no2Q80
1987	NA	NA	9.015942	31.17821	18.27433	32.33631
1988	NA	NA	9.333333	34.06250	18.65417	31.37754
1989	NA	NA	9.550000	30.89374	20.04166	33.24937
1990	NA	NA	9.395652	29.02654	15.77971	29.32133
1991	NA	NA	9.183333	29.58202	15.51984	27.91394
1992	NA	NA	6.770833	23.66204	20.20698	30.87500
1993	NA	NA	6.845169	24.65485	20.02333	31.60217
1994	NA	NA	7.966667	26.55242	21.07083	35.85169
1995	NA	NA	6.844192	27.57417	20.65522	34.94202
1996	NA	NA	6.335824	25.11111	20.69792	33.22391
1997	NA	NA	8.127778	28.49761	19.00802	32.19722
1998	10.520000	25.40000	6.796860	30.00888	18.63361	30.94758
1999	10.377500	25.33429	8.496717	30.28982	18.71836	30.88763
2000	9.540000	24.19714	8.305556	27.58080	16.79167	31.04167
2001	9.100000	23.51600	8.131966	29.24917	19.29803	31.26060
2002	8.510000	20.96000	10.408333	30.50463	17.90625	29.29532
2003	8.580000	21.14000	10.738095	31.06458	18.66014	31.62000
2004	7.887143	19.43000	11.166667	28.85202	17.30729	29.37500
2005	7.900000	22.20000	11.426826	34.53244	16.59653	28.55060

Table 19 is the break down of per month probability distribution using 20/80. The break down helps us explore the data deeper and find if the data is normal or skewed over a monthly trend.

mutate(chicago, month = (lubridate::month(date))) %>%
  group_by(month) %>%
  summarize( 
          pm25Q20=quantile(pm25tmean2, 0.2, na.rm=TRUE),
          pm25Q80=quantile(pm25tmean2, 0.8, na.rm=TRUE),
          o3Q20=quantile(o3tmean2, 0.2, na.rm=TRUE),
          o3Q80=quantile(o3tmean2, 0.8, na.rm=TRUE),
          no2Q20=quantile(no2tmean2, 0.2, na.rm=TRUE),
          no2Q80=quantile(no2tmean2, 0.8, na.rm=TRUE)
          )

## # A tibble: 12 x 7
##    month pm25Q20 pm25Q80 o3Q20 o3Q80 no2Q20 no2Q80
##    <dbl>   <dbl>   <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1     1   10.9     24.4  4.45  13.5   19.9   32.1
##  2     2   10.5     27.6  7.90  18.5   21.0   33.2
##  3     3    9.7     25.5 12.5   26.6   19.9   33.3
##  4     4    7.72    18.4 18.1   30.5   18.6   32.6
##  5     5    7.8     19.1 20.9   33.5   17.9   31.5
##  6     6    8.7     21.8 22.5   40.5   19.0   32.8
##  7     7    9.24    23.1 22.5   38.3   16.4   29.3
##  8     8    9.24    23.2 19.3   33.3   17.5   30.3
##  9     9    7.62    22.7 12.6   28.7   18.2   33.0
## 10    10    7.6     19.3  8.13  18.4   18.3   31.5
## 11    11    8.50    20.5  4.16  12.4   17.6   29.6
## 12    12   10.6     23.3  3.26  11.2   19.5   29.9

Table 19: Quantiles 20-80 per month distibution.
month	pm25Q20	pm25Q80	o3Q20	o3Q80	no2Q20	no2Q80
1	10.920000	24.41333	4.445652	13.46167	19.88214	32.08750
2	10.470000	27.61714	7.899758	18.51063	20.99167	33.17778
3	9.700000	25.46500	12.525000	26.62478	19.86111	33.27674
4	7.724286	18.42000	18.090657	30.48091	18.61875	32.58999
5	7.800000	19.10000	20.944556	33.54967	17.86199	31.46076
6	8.700000	21.85000	22.476235	40.49473	19.00893	32.78384
7	9.240000	23.08500	22.522239	38.30213	16.40743	29.28919
8	9.240000	23.17500	19.326223	33.28750	17.47797	30.29226
9	7.617143	22.66714	12.640212	28.65187	18.18177	33.00062
10	7.600000	19.28000	8.134375	18.38452	18.25942	31.51581
11	8.495000	20.51714	4.161111	12.40333	17.56083	29.62459
12	10.580000	23.26667	3.257699	11.18927	19.52083	29.87431

Conclusion

The article covered several of the main tools found in the tidyverse package for R. I demonstrated around ten different ways to manipulate and transform a dataset using the dplyr functions. We learned the Chicago data records over 19 years of air quality measurements. The pm25tmeans2 was not measured for the first ten years and showed by 4447 missing values. The primary dplyr functions used and combined are:

names()
select()
filter()
summarize()
arrange()
rename()
mutate()
transmute()
group_by()
%>%

References

Peng, R. D. (n.d.). Chicago dataset. Retrieved from: http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip

Peng, R. D. (2016). Exploratory Data Analysis with R. p. 4-17. Retrieved from: https://leanpub.com/exdata

RDocumentation. (2018). Introduction to dplyr. Retrieved from: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. p. 43-75, O’Reilly.

Data Exploration - The Chicago Dataset

Bryan

2018/07/07