Introduction
This article is about using the tidyverse package in R to explore data. The Chicago dataset is the basis for this analysis. The data is air quality measurements taken over 19 years. The focus is on using the dplyr package to do the analysis. The dplyr package is part of the tidyverse in R and a suite of tools to analyze, transform, plot, and manipulate data. The base version of R does all of this; however, dplyr will work faster and cost fewer computing resources when used on big data. The functions used to explore the data are as follows:
names()
select()
filter()
summarize()
arrange()
rename()
mutate()
transmute()
group_by()
%>%
(pipe)
Additional packages used are:
- Tidyverse
- Readr
- Lubridate
- Skimr
- Knitr
read_rds()
I start with importing the data into a data frame with the readr package, and the structure of the data is a tibble, which is the tidyverse data frame structure. In the output, the tibble structure will list the first ten rows of data and column headers with the data types in this dataset.
chr
is charactersdbl
is real numbersdate
are dates
Not shown in this data but worth mentioning:
int
is integersdttm
is date-timelgl
is logicalfctr
is factors or categorical
chicago = as_tibble(readr::read_rds("../../static/data/chicago.rds"))
(chicago)
## # A tibble: 6,940 x 8
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## <chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0
## 2 chic 33 29.9 1987-01-02 NA NA 3.30 23.2
## 3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8
## 4 chic 29 28.6 1987-01-04 NA 47 4.38 30.4
## 5 chic 32 28.9 1987-01-05 NA NA 4.75 30.3
## 6 chic 40 35.1 1987-01-06 NA 48 5.83 25.8
## 7 chic 34.5 26.8 1987-01-07 NA 41 9.29 20.6
## 8 chic 29 22 1987-01-08 NA 36 11.3 17.0
## 9 chic 26.5 29 1987-01-09 NA 33.3 4.5 23.4
## 10 chic 32.5 27.8 1987-01-10 NA NA 4.96 19.5
## # ... with 6,930 more rows
city | tmpd | dptp | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|---|---|
chic | 31.5 | 31.500 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 |
chic | 33.0 | 29.875 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 |
chic | 33.0 | 27.375 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 |
chic | 29.0 | 28.625 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 |
chic | 32.0 | 28.875 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 |
chic | 40.0 | 35.125 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 |
chic | 34.5 | 26.750 | 1987-01-07 | NA | 41.00000 | 9.291667 | 20.58171 |
chic | 29.0 | 22.000 | 1987-01-08 | NA | 36.00000 | 11.291667 | 17.03723 |
chic | 26.5 | 29.000 | 1987-01-09 | NA | 33.28571 | 4.500000 | 23.38889 |
chic | 32.5 | 27.750 | 1987-01-10 | NA | NA | 4.958333 | 19.54167 |
In table 1, the Chicago dataset has eight variables and 6940 rows. The city variable appears as ‘chic’ for Chicago, and the missing data is filled in by default with NA
. Also, the date is in the y/m/d format. We can see the different readings for the real numbers, from 1 to 6 decimal places.
names()
names()
function lists all the variable names in the data.
names(chicago)
## [1] "city" "tmpd" "dptp" "date" "pm25tmean2"
## [6] "pm10tmean2" "o3tmean2" "no2tmean2"
List the first three names with the index[:]. Note R is a little different than Python. R will display the last index position [3], where Python would display 1 & 2.
names(chicago)[1:3]
## [1] "city" "tmpd" "dptp"
List the last three variables.
names(chicago)[6:8]
## [1] "pm10tmean2" "o3tmean2" "no2tmean2"
Summary Statistics with skimr
The skim()
function in the skimr package gives a good idea, in one place, of what the data looks like in table 2. Noting there are 4447 NA’s in pm25tmean2 and 242 NA’s in pm10tmean2. The date range is from 1987 to 2005. The histogram shows us the skew of each variable and a nice feature for skimr.
In base R, I would also use the summary()
function. As you will see, the tidyverse does not have a summary function that doesn’t require multiple lines of code. I will generate some of these statistics, but it takes more time than typing in one line into the console.
skimr::skim(chicago)
Name | chicago |
Number of rows | 6940 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 1 |
Date | 1 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
city | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 1987-01-01 | 2005-12-31 | 1996-07-01 | 6940 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
tmpd | 1 | 1.00 | 50.31 | 19.41 | -16.00 | 35.00 | 51.00 | 67.00 | 92.00 | ▁▃▇▇▅ |
dptp | 2 | 1.00 | 40.34 | 18.49 | -25.62 | 27.00 | 39.88 | 55.75 | 78.25 | ▁▂▇▇▅ |
pm25tmean2 | 4447 | 0.36 | 16.23 | 8.70 | 1.70 | 9.70 | 14.66 | 20.60 | 61.50 | ▇▇▂▁▁ |
pm10tmean2 | 242 | 0.97 | 33.90 | 17.97 | 2.00 | 21.50 | 30.28 | 42.00 | 365.00 | ▇▁▁▁▁ |
o3tmean2 | 0 | 1.00 | 19.44 | 11.39 | 0.15 | 10.07 | 18.52 | 27.00 | 66.59 | ▇▇▅▁▁ |
no2tmean2 | 0 | 1.00 | 25.23 | 7.99 | 6.16 | 19.65 | 24.56 | 30.14 | 62.48 | ▂▇▃▁▁ |
select()
To view specific variables, I use the select()
function. Perhaps this small dataset in not going to show the benefits of select()
, however, with hundreds or thousands of variables, select()
will prove a worthy function. Here in table 3, we can isolate the variables city through dptp very quickly.
select(chicago,
city:dptp)
## # A tibble: 6,940 x 3
## city tmpd dptp
## <chr> <dbl> <dbl>
## 1 chic 31.5 31.5
## 2 chic 33 29.9
## 3 chic 33 27.4
## 4 chic 29 28.6
## 5 chic 32 28.9
## 6 chic 40 35.1
## 7 chic 34.5 26.8
## 8 chic 29 22
## 9 chic 26.5 29
## 10 chic 32.5 27.8
## # ... with 6,930 more rows
city | tmpd | dptp |
---|---|---|
chic | 31.5 | 31.500 |
chic | 33.0 | 29.875 |
chic | 33.0 | 27.375 |
chic | 29.0 | 28.625 |
chic | 32.0 | 28.875 |
chic | 40.0 | 35.125 |
chic | 34.5 | 26.750 |
chic | 29.0 | 22.000 |
chic | 26.5 | 29.000 |
chic | 32.5 | 27.750 |
In table 4, select()
can also do the inverse of city:dptp and exclude those variables to display the remaining variables.
select(chicago, -(city:dptp))
## # A tibble: 6,940 x 5
## date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 1987-01-01 NA 34 4.25 20.0
## 2 1987-01-02 NA NA 3.30 23.2
## 3 1987-01-03 NA 34.2 3.33 23.8
## 4 1987-01-04 NA 47 4.38 30.4
## 5 1987-01-05 NA NA 4.75 30.3
## 6 1987-01-06 NA 48 5.83 25.8
## 7 1987-01-07 NA 41 9.29 20.6
## 8 1987-01-08 NA 36 11.3 17.0
## 9 1987-01-09 NA 33.3 4.5 23.4
## 10 1987-01-10 NA NA 4.96 19.5
## # ... with 6,930 more rows
date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|
1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 |
1987-01-02 | NA | NA | 3.304348 | 23.19099 |
1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 |
1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 |
1987-01-05 | NA | NA | 4.750000 | 30.33333 |
1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 |
1987-01-07 | NA | 41.00000 | 9.291667 | 20.58171 |
1987-01-08 | NA | 36.00000 | 11.291667 | 17.03723 |
1987-01-09 | NA | 33.28571 | 4.500000 | 23.38889 |
1987-01-10 | NA | NA | 4.958333 | 19.54167 |
Table 5 demonstrates the ends_with()
and table 6 the starts_with()
functions that only work inside of select()
. Additional functions here would be:
matches()
num_range()
one_of()
everything()
contains()
Lets find all the variables that end with the number 2 in table 5.
select(chicago,
ends_with("2"))
## # A tibble: 6,940 x 4
## pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## <dbl> <dbl> <dbl> <dbl>
## 1 NA 34 4.25 20.0
## 2 NA NA 3.30 23.2
## 3 NA 34.2 3.33 23.8
## 4 NA 47 4.38 30.4
## 5 NA NA 4.75 30.3
## 6 NA 48 5.83 25.8
## 7 NA 41 9.29 20.6
## 8 NA 36 11.3 17.0
## 9 NA 33.3 4.5 23.4
## 10 NA NA 4.96 19.5
## # ... with 6,930 more rows
pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|
NA | 34.00000 | 4.250000 | 19.98810 |
NA | NA | 3.304348 | 23.19099 |
NA | 34.16667 | 3.333333 | 23.81548 |
NA | 47.00000 | 4.375000 | 30.43452 |
NA | NA | 4.750000 | 30.33333 |
NA | 48.00000 | 5.833333 | 25.77233 |
NA | 41.00000 | 9.291667 | 20.58171 |
NA | 36.00000 | 11.291667 | 17.03723 |
NA | 33.28571 | 4.500000 | 23.38889 |
NA | NA | 4.958333 | 19.54167 |
Let’s find all the variables that start with the letter “d” in table 6.
select(chicago,
starts_with("d"))
## # A tibble: 6,940 x 2
## dptp date
## <dbl> <date>
## 1 31.5 1987-01-01
## 2 29.9 1987-01-02
## 3 27.4 1987-01-03
## 4 28.6 1987-01-04
## 5 28.9 1987-01-05
## 6 35.1 1987-01-06
## 7 26.8 1987-01-07
## 8 22 1987-01-08
## 9 29 1987-01-09
## 10 27.8 1987-01-10
## # ... with 6,930 more rows
dptp | date |
---|---|
31.500 | 1987-01-01 |
29.875 | 1987-01-02 |
27.375 | 1987-01-03 |
28.625 | 1987-01-04 |
28.875 | 1987-01-05 |
35.125 | 1987-01-06 |
26.750 | 1987-01-07 |
22.000 | 1987-01-08 |
29.000 | 1987-01-09 |
27.750 | 1987-01-10 |
filter()
The filter()
function is a way to find values in the data. Here the results are placed in table 7. Shown is filtering all pm25tmean2 variables that are greater than 30. We can do any combination of variables with specific parameters. The tibble output indicates there are 194 instances where pm25tmean2 is above a reading of 30.
filter(chicago,
pm25tmean2 > 30)
## # A tibble: 194 x 8
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## <chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 chic 23 21.9 1998-01-17 38.1 32.5 3.18 25.3
## 2 chic 28 25.8 1998-01-23 34.0 38.7 1.75 29.4
## 3 chic 55 51.3 1998-04-30 39.4 34 10.8 25.3
## 4 chic 59 53.7 1998-05-01 35.4 28.5 14.3 31.4
## 5 chic 57 52 1998-05-02 33.3 35 20.7 26.8
## 6 chic 57 56 1998-05-07 32.1 34.5 24.3 34.0
## 7 chic 75 65.8 1998-05-15 56.5 91 38.6 29.0
## 8 chic 61 59 1998-06-09 33.8 26 17.9 25.5
## 9 chic 73 60.3 1998-07-13 30.3 64.5 37.0 37.9
## 10 chic 78 67.1 1998-07-14 41.4 75 40.1 32.6
## # ... with 184 more rows
city | tmpd | dptp | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|---|---|
chic | 23 | 21.9 | 1998-01-17 | 38.10 | 32.46154 | 3.180556 | 25.30000 |
chic | 28 | 25.8 | 1998-01-23 | 33.95 | 38.69231 | 1.750000 | 29.37630 |
chic | 55 | 51.3 | 1998-04-30 | 39.40 | 34.00000 | 10.786232 | 25.31310 |
chic | 59 | 53.7 | 1998-05-01 | 35.40 | 28.50000 | 14.295125 | 31.42905 |
chic | 57 | 52.0 | 1998-05-02 | 33.30 | 35.00000 | 20.662879 | 26.79861 |
chic | 57 | 56.0 | 1998-05-07 | 32.10 | 34.50000 | 24.270422 | 33.99167 |
chic | 75 | 65.8 | 1998-05-15 | 56.50 | 91.00000 | 38.573007 | 29.03261 |
chic | 61 | 59.0 | 1998-06-09 | 33.80 | 26.00000 | 17.890810 | 25.49668 |
chic | 73 | 60.3 | 1998-07-13 | 30.30 | 64.50000 | 37.018864 | 37.93056 |
chic | 78 | 67.1 | 1998-07-14 | 41.40 | 75.00000 | 40.080902 | 32.59054 |
Table 8 filters all values of the variable pm25tmean2 greater than 30, and I summarize to refine that to the minimum and maximum values. The table is showing that 30.05 is the lowest, and 61.5 is the highest value after filtering the variable. %>%
is shown in the code and a way to write a few sequence operations together.
chicago %>%
filter(pm25tmean2 > 30) %>%
summarize(minValue = min(pm25tmean2),
maxValue = max(pm25tmean2))
## # A tibble: 1 x 2
## minValue maxValue
## <dbl> <dbl>
## 1 30.0 61.5
minValue | maxValue |
---|---|
30.05 | 61.5 |
Table 9 is a refinement in table 8. Here I want to find the values higher than 30 in pm25tmean2 and values greater than 80 in tmpd. In the output tibble, there are only 17 instances where this condition is true and a little clustering in 2001, 2002, and 2005.
chicago %>%
select(date, tmpd, pm25tmean2) %>%
filter(pm25tmean2 > 30 & tmpd > 80)
## # A tibble: 17 x 3
## date tmpd pm25tmean2
## <date> <dbl> <dbl>
## 1 1998-08-23 81 39.6
## 2 1998-09-06 81 31.5
## 3 2001-07-20 82 32.3
## 4 2001-08-01 84 43.7
## 5 2001-08-08 85 38.8
## 6 2001-08-09 84 38.2
## 7 2002-06-20 82 33
## 8 2002-06-23 82 42.5
## 9 2002-07-08 81 33.1
## 10 2002-07-18 82 38.8
## 11 2003-06-25 82 33.9
## 12 2003-07-04 84 32.9
## 13 2005-06-24 86 31.9
## 14 2005-06-27 82 51.5
## 15 2005-06-28 85 31.2
## 16 2005-07-17 84 32.7
## 17 2005-08-03 84 37.9
date | tmpd | pm25tmean2 |
---|---|---|
1998-08-23 | 81 | 39.60000 |
1998-09-06 | 81 | 31.50000 |
2001-07-20 | 82 | 32.30000 |
2001-08-01 | 84 | 43.70000 |
2001-08-08 | 85 | 38.83750 |
2001-08-09 | 84 | 38.20000 |
2002-06-20 | 82 | 33.00000 |
2002-06-23 | 82 | 42.50000 |
2002-07-08 | 81 | 33.10000 |
2002-07-18 | 82 | 38.85000 |
2003-06-25 | 82 | 33.90000 |
2003-07-04 | 84 | 32.90000 |
2005-06-24 | 86 | 31.85714 |
2005-06-27 | 82 | 51.53750 |
2005-06-28 | 85 | 31.20000 |
2005-07-17 | 84 | 32.70000 |
2005-08-03 | 84 | 37.90000 |
This filtering collects all values above 80 for the tmpd variable and then summarizes to find the minimum and maximum values. 80.5 is the minimum, and 92 is the maximum.
chicago %>%
filter(tmpd > 80) %>%
summarize(minValue = min(tmpd),
maxValue = max(tmpd))
## # A tibble: 1 x 2
## minValue maxValue
## <dbl> <dbl>
## 1 80.5 92
arrange()
Table 10 shows the arrange()
function and sorts the o3tmean2 variable. Default from lowest to highest numerical values.
chicago %>%
arrange(o3tmean2)
## # A tibble: 6,940 x 8
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## <chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 chic 43.5 39 1994-12-05 NA 47.5 0.153 27.2
## 2 chic 32.5 34.4 1994-12-15 NA 43.5 0.361 30.5
## 3 chic 38.5 38.5 1993-12-17 NA 16 0.427 23.9
## 4 chic 23.5 16.9 1994-12-13 NA 48.5 0.492 33.3
## 5 chic 34 33.9 2001-01-14 28 22 0.5 26.2
## 6 chic 33 37.5 1995-01-11 NA 68 0.517 37.8
## 7 chic 40.5 44.9 1993-01-03 NA 21.5 0.528 19.8
## 8 chic 61 59.3 1998-12-04 31.6 42.5 0.556 24.4
## 9 chic 35.5 34 1993-01-21 NA 19 0.653 27.0
## 10 chic 36.5 36 1994-12-16 NA 41 0.694 34.4
## # ... with 6,930 more rows
city | tmpd | dptp | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|---|---|
chic | 43.5 | 39.000 | 1994-12-05 | NA | 47.5 | 0.1527778 | 27.20399 |
chic | 32.5 | 34.375 | 1994-12-15 | NA | 43.5 | 0.3611111 | 30.45452 |
chic | 38.5 | 38.500 | 1993-12-17 | NA | 16.0 | 0.4270833 | 23.92361 |
chic | 23.5 | 16.875 | 1994-12-13 | NA | 48.5 | 0.4915459 | 33.34242 |
chic | 34.0 | 33.900 | 2001-01-14 | 28.0 | 22.0 | 0.5000000 | 26.18750 |
chic | 33.0 | 37.500 | 1995-01-11 | NA | 68.0 | 0.5169082 | 37.77083 |
chic | 40.5 | 44.875 | 1993-01-03 | NA | 21.5 | 0.5277778 | 19.82500 |
chic | 61.0 | 59.300 | 1998-12-04 | 31.6 | 42.5 | 0.5555556 | 24.39167 |
chic | 35.5 | 34.000 | 1993-01-21 | NA | 19.0 | 0.6527778 | 27.00284 |
chic | 36.5 | 36.000 | 1994-12-16 | NA | 41.0 | 0.6944444 | 34.39167 |
Table 11 sorts the tmpd variable lowest to highest.
chicago %>%
arrange(tmpd)
## # A tibble: 6,940 x 8
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## <chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 chic -16 -25.6 1994-01-18 NA 27 14.7 22.1
## 2 chic -12 -18.9 1996-02-03 NA 25.6 7.74 30.1
## 3 chic -10.5 -23.1 1994-01-15 NA 36 12.0 24.7
## 4 chic -10.5 -21 1996-02-02 NA 35 9.56 31.5
## 5 chic -8 -2.38 1994-01-19 NA 24.5 4.86 36.6
## 6 chic -6.5 -15.8 1988-01-05 NA 24 13.2 17.2
## 7 chic -6.5 -23 1989-12-21 NA 49 16.7 19.5
## 8 chic -6 -7.88 1996-02-04 NA 21 9.58 30.0
## 9 chic -5.5 -15.1 1988-01-06 NA 75 7.15 27.7
## 10 chic -4.5 -11.9 1989-12-22 NA 49 9.42 29.4
## # ... with 6,930 more rows
city | tmpd | dptp | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|---|---|
chic | -16.0 | -25.625 | 1994-01-18 | NA | 27.00000 | 14.656250 | 22.10236 |
chic | -12.0 | -18.875 | 1996-02-03 | NA | 25.61538 | 7.736111 | 30.13542 |
chic | -10.5 | -23.125 | 1994-01-15 | NA | 36.00000 | 12.010417 | 24.70139 |
chic | -10.5 | -21.000 | 1996-02-02 | NA | 35.00000 | 9.555556 | 31.47917 |
chic | -8.0 | -2.375 | 1994-01-19 | NA | 24.50000 | 4.864583 | 36.59796 |
chic | -6.5 | -15.750 | 1988-01-05 | NA | 24.00000 | 13.166667 | 17.22083 |
chic | -6.5 | -23.000 | 1989-12-21 | NA | 49.00000 | 16.666667 | 19.50000 |
chic | -6.0 | -7.875 | 1996-02-04 | NA | 21.00000 | 9.583333 | 29.95833 |
chic | -5.5 | -15.125 | 1988-01-06 | NA | 75.00000 | 7.145833 | 27.66594 |
chic | -4.5 | -11.875 | 1989-12-22 | NA | 49.00000 | 9.416667 | 29.41667 |
In table 12 we can sort the tmpd variable in descending order by adding the parameter to the arrange()
function.
chicago %>%
arrange(desc(tmpd))
## # A tibble: 6,940 x 8
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## <chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 chic 92 76.4 1995-07-13 NA 92.5 58.5 27.3
## 2 chic 91.5 75.6 1995-07-14 NA 54.5 52.0 24.8
## 3 chic 90 77.1 1999-07-30 NA 54 34.2 27.5
## 4 chic 90 70.9 2005-07-24 22.0 44 47.6 12.0
## 5 chic 89.5 70.4 1988-08-02 NA 67 49.8 25.9
## 6 chic 89 71.1 1988-08-01 NA 67.2 44.1 26.0
## 7 chic 88.5 70.2 1988-06-20 NA 88 61.7 25.6
## 8 chic 88 67.4 1988-08-03 NA 65 40.6 27.7
## 9 chic 88 66.4 1988-08-17 NA 71 39.3 23.5
## 10 chic 88 71.2 1991-07-22 NA 70 34.9 24.6
## # ... with 6,930 more rows
city | tmpd | dptp | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|---|---|
chic | 92.0 | 76.375 | 1995-07-13 | NA | 92.5 | 58.54952 | 27.31775 |
chic | 91.5 | 75.625 | 1995-07-14 | NA | 54.5 | 52.00397 | 24.77355 |
chic | 90.0 | 77.100 | 1999-07-30 | NA | 54.0 | 34.17065 | 27.47977 |
chic | 90.0 | 70.900 | 2005-07-24 | 22.05 | 44.0 | 47.60417 | 11.96354 |
chic | 89.5 | 70.375 | 1988-08-02 | NA | 67.0 | 49.84949 | 25.91973 |
chic | 89.0 | 71.125 | 1988-08-01 | NA | 67.2 | 44.14991 | 25.96309 |
chic | 88.5 | 70.250 | 1988-06-20 | NA | 88.0 | 61.67708 | 25.64027 |
chic | 88.0 | 67.375 | 1988-08-03 | NA | 65.0 | 40.64176 | 27.73958 |
chic | 88.0 | 66.375 | 1988-08-17 | NA | 71.0 | 39.31240 | 23.50521 |
chic | 88.0 | 71.250 | 1991-07-22 | NA | 70.0 | 34.85781 | 24.55123 |
rename()
The rename()
function will be beneficial if you need to give the variables a descriptive name. In this example, table 13, I shorten the variable name and process these all in one line of code.
chicago %>%
rename(temp=tmpd,
dewpoint=dptp,
pm25=pm25tmean2,
pm10=pm10tmean2,
O3=o3tmean2,
NO2=no2tmean2)
## # A tibble: 6,940 x 8
## city temp dewpoint date pm25 pm10 O3 NO2
## <chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0
## 2 chic 33 29.9 1987-01-02 NA NA 3.30 23.2
## 3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8
## 4 chic 29 28.6 1987-01-04 NA 47 4.38 30.4
## 5 chic 32 28.9 1987-01-05 NA NA 4.75 30.3
## 6 chic 40 35.1 1987-01-06 NA 48 5.83 25.8
## 7 chic 34.5 26.8 1987-01-07 NA 41 9.29 20.6
## 8 chic 29 22 1987-01-08 NA 36 11.3 17.0
## 9 chic 26.5 29 1987-01-09 NA 33.3 4.5 23.4
## 10 chic 32.5 27.8 1987-01-10 NA NA 4.96 19.5
## # ... with 6,930 more rows
city | temp | dewpoint | date | pm25 | pm10 | O3 | NO2 |
---|---|---|---|---|---|---|---|
chic | 31.5 | 31.500 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 |
chic | 33.0 | 29.875 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 |
chic | 33.0 | 27.375 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 |
chic | 29.0 | 28.625 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 |
chic | 32.0 | 28.875 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 |
chic | 40.0 | 35.125 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 |
chic | 34.5 | 26.750 | 1987-01-07 | NA | 41.00000 | 9.291667 | 20.58171 |
chic | 29.0 | 22.000 | 1987-01-08 | NA | 36.00000 | 11.291667 | 17.03723 |
chic | 26.5 | 29.000 | 1987-01-09 | NA | 33.28571 | 4.500000 | 23.38889 |
chic | 32.5 | 27.750 | 1987-01-10 | NA | NA | 4.958333 | 19.54167 |
mutate()
In this example, the mutate()
function adds a new variable to the data. In this case, we can take the mean of no2tmean2 and subtract it from itself to standardize the variable. This will show if the mean is above or below the average of 25.23 (taken from table 2). In this one line, I calculated and created a new variable for the data called no2detrend. A positive detrend indicates above the mean, and a negative value indicates below the mean for that individual record. See table 14.
smlchicago <- chicago %>%
select(tmpd, dptp, o3tmean2, no2tmean2)
smlchicago %>%
mutate(nO2detrend = no2tmean2 - mean(no2tmean2, na.rm=TRUE))
## # A tibble: 6,940 x 5
## tmpd dptp o3tmean2 no2tmean2 nO2detrend
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31.5 31.5 4.25 20.0 -5.24
## 2 33 29.9 3.30 23.2 -2.04
## 3 33 27.4 3.33 23.8 -1.42
## 4 29 28.6 4.38 30.4 5.20
## 5 32 28.9 4.75 30.3 5.10
## 6 40 35.1 5.83 25.8 0.540
## 7 34.5 26.8 9.29 20.6 -4.65
## 8 29 22 11.3 17.0 -8.19
## 9 26.5 29 4.5 23.4 -1.84
## 10 32.5 27.8 4.96 19.5 -5.69
## # ... with 6,930 more rows
tmpd | dptp | o3tmean2 | no2tmean2 | nO2detrend |
---|---|---|---|---|
31.5 | 31.500 | 4.250000 | 19.98810 | -5.2437869 |
33.0 | 29.875 | 3.304348 | 23.19099 | -2.0408884 |
33.0 | 27.375 | 3.333333 | 23.81548 | -1.4164060 |
29.0 | 28.625 | 4.375000 | 30.43452 | 5.2026417 |
32.0 | 28.875 | 4.750000 | 30.33333 | 5.1014512 |
40.0 | 35.125 | 5.833333 | 25.77233 | 0.5404485 |
34.5 | 26.750 | 9.291667 | 20.58171 | -4.6501722 |
29.0 | 22.000 | 11.291667 | 17.03723 | -8.1946513 |
26.5 | 29.000 | 4.500000 | 23.38889 | -1.8429933 |
32.5 | 27.750 | 4.958333 | 19.54167 | -5.6902155 |
transmute()
The transmute() does the same as above but breaks the new variables out into a tibble or data frame. In table 15, the three new variables create a new tibble. The temperature tmpd
variable now gives a better idea of where it is in comparison to the mean average temperature.
transmute(chicago,
tmpdS = tmpd - mean(tmpd, na.rm=TRUE),
o3detrend = o3tmean2 - mean(o3tmean2, na.rm=TRUE),
nO2detrend = no2tmean2 - mean(no2tmean2, na.rm=TRUE))
## # A tibble: 6,940 x 3
## tmpdS o3detrend nO2detrend
## <dbl> <dbl> <dbl>
## 1 -18.8 -15.2 -5.24
## 2 -17.3 -16.1 -2.04
## 3 -17.3 -16.1 -1.42
## 4 -21.3 -15.1 5.20
## 5 -18.3 -14.7 5.10
## 6 -10.3 -13.6 0.540
## 7 -15.8 -10.1 -4.65
## 8 -21.3 -8.14 -8.19
## 9 -23.8 -14.9 -1.84
## 10 -17.8 -14.5 -5.69
## # ... with 6,930 more rows
tmpdS | o3detrend | nO2detrend |
---|---|---|
-18.80934 | -15.185513 | -5.2437869 |
-17.30934 | -16.131165 | -2.0408884 |
-17.30934 | -16.102180 | -1.4164060 |
-21.30934 | -15.060513 | 5.2026417 |
-18.30934 | -14.685513 | 5.1014512 |
-10.30934 | -13.602180 | 0.5404485 |
-15.80934 | -10.143846 | -4.6501722 |
-21.30934 | -8.143846 | -8.1946513 |
-23.80934 | -14.935513 | -1.8429933 |
-17.80934 | -14.477180 | -5.6902155 |
group_by()
summarize()
Table 16 shows how to transform the data. I created a new variable “year” The year is extracted from the y/m/d string using lubridates year()
function. The data is organized by year. The summarize()
function creates six new columns of min/max measurements per year for three variables. The pm25tmean2 didn’t have any records for ten years until 1998.
mutate(chicago, year = (lubridate::year(date))) %>%
group_by(year) %>%
summarize(
pm25min=min(pm25tmean2, na.rm=TRUE),
pm25max=max(pm25tmean2, na.rm=TRUE),
o3min=min(o3tmean2, na.rm=TRUE),
o3max=max(o3tmean2, na.rm=TRUE),
no2mmin=min(no2tmean2, na.rm=TRUE),
no2max=max(no2tmean2, na.rm=TRUE)
)
## # A tibble: 19 x 7
## year pm25min pm25max o3min o3max no2mmin no2max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1987 Inf -Inf 2.12 63.0 7.87 57.5
## 2 1988 Inf -Inf 2.38 61.7 7.73 62.5
## 3 1989 Inf -Inf 2 59.7 9.58 59.5
## 4 1990 Inf -Inf 2.12 52.2 6.73 46.2
## 5 1991 Inf -Inf 2 63.1 6.39 55.5
## 6 1992 Inf -Inf 2 50.8 9.96 49.8
## 7 1993 Inf -Inf 0.427 44.3 10.3 48.9
## 8 1994 Inf -Inf 0.153 52.2 10.9 53.9
## 9 1995 Inf -Inf 0.517 66.6 8.93 53.1
## 10 1996 Inf -Inf 1.17 58.4 10.5 52.7
## 11 1997 Inf -Inf 0.957 56.5 7.73 53.0
## 12 1998 2.1 56.5 0.556 50.7 8.8 47.8
## 13 1999 5.18 49.0 0.854 57.5 7.64 49.5
## 14 2000 2.8 46.7 1.04 55.8 6.16 50.8
## 15 2001 3.2 49.5 0.5 51.8 8.65 48.7
## 16 2002 2.91 44.0 2.10 54.9 7.80 47.8
## 17 2003 2.5 47.7 1.10 56.2 7.70 57.6
## 18 2004 2.6 48.2 2.20 44.5 7.55 45.0
## 19 2005 1.7 61.5 1.77 58.8 8.83 48.6
year | pm25min | pm25max | o3min | o3max | no2mmin | no2max |
---|---|---|---|---|---|---|
1987 | Inf | -Inf | 2.1250000 | 62.96966 | 7.866667 | 57.52542 |
1988 | Inf | -Inf | 2.3750000 | 61.67708 | 7.733333 | 62.47998 |
1989 | Inf | -Inf | 2.0000000 | 59.72727 | 9.580729 | 59.51562 |
1990 | Inf | -Inf | 2.1250000 | 52.22917 | 6.726172 | 46.19444 |
1991 | Inf | -Inf | 2.0000000 | 63.10417 | 6.385417 | 55.48958 |
1992 | Inf | -Inf | 2.0000000 | 50.82870 | 9.962500 | 49.77714 |
1993 | Inf | -Inf | 0.4270833 | 44.30093 | 10.263768 | 48.90139 |
1994 | Inf | -Inf | 0.1527778 | 52.17844 | 10.944444 | 53.89485 |
1995 | Inf | -Inf | 0.5169082 | 66.58750 | 8.925000 | 53.13333 |
1996 | Inf | -Inf | 1.1666667 | 58.39583 | 10.458333 | 52.68850 |
1997 | Inf | -Inf | 0.9565217 | 56.54167 | 7.729167 | 53.01042 |
1998 | 2.1000 | 56.50000 | 0.5555556 | 50.66250 | 8.800000 | 47.81558 |
1999 | 5.1800 | 49.04286 | 0.8541667 | 57.48864 | 7.641667 | 49.54924 |
2000 | 2.8000 | 46.70000 | 1.0416667 | 55.76103 | 6.158333 | 50.81123 |
2001 | 3.2000 | 49.47143 | 0.5000000 | 51.81984 | 8.650000 | 48.71875 |
2002 | 2.9125 | 44.04286 | 2.0964674 | 54.88043 | 7.802083 | 47.76042 |
2003 | 2.5000 | 47.73750 | 1.0978261 | 56.16608 | 7.697917 | 57.56439 |
2004 | 2.6000 | 48.15714 | 2.1965580 | 44.48240 | 7.552083 | 44.97817 |
2005 | 1.7000 | 61.50000 | 1.7708333 | 58.84126 | 8.833333 | 48.59722 |
Table 17 displays the same min/max data for the three variables; pm25tmean2, o3tmean2, and no2tmean2. And using the group_by()
I extracted the month()
out with lubridate. This breakdown shows further relationships in the data and has transformed the original data into something to devise a hypothesis.
mutate(chicago, month = (lubridate::month(date))) %>%
group_by(month) %>%
summarize(
pm25min=min(pm25tmean2, na.rm=TRUE),
pm25max=max(pm25tmean2, na.rm=TRUE),
o3min=min(o3tmean2, na.rm=TRUE),
o3max=max(o3tmean2, na.rm=TRUE),
no2min=min(no2tmean2, na.rm=TRUE),
no2max=max(no2tmean2, na.rm=TRUE)
)
## # A tibble: 12 x 7
## month pm25min pm25max o3min o3max no2min no2max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 3.3 49.5 0.5 28.2 7.73 52.4
## 2 2 3.8 61.5 1.05 37.4 10.5 57.6
## 3 3 5.4 47.7 1.13 39.0 7.73 53.9
## 4 4 1.7 46.7 7.24 47.9 7.80 57.5
## 5 5 2.4 56.5 4.06 52.8 7.95 51.3
## 6 6 2.8 51.5 6.13 66.6 6.39 52.2
## 7 7 3 41.4 9.69 59.5 7.06 62.5
## 8 8 2.5 44.5 4.90 54.0 7.30 55.9
## 9 9 3.2 56.5 2.71 57.5 6.16 53.1
## 10 10 4 43.6 1.14 47.1 6.73 59.5
## 11 11 3.28 41.7 0.854 29.5 7.35 50
## 12 12 3.91 42.3 0.153 27.7 9.20 50.3
month | pm25min | pm25max | o3min | o3max | no2mmin | no2max |
---|---|---|---|---|---|---|
1 | 3.300000 | 49.47143 | 0.5000000 | 28.22222 | 7.729167 | 52.44697 |
2 | 3.800000 | 61.50000 | 1.0492424 | 37.37500 | 10.490972 | 57.56439 |
3 | 5.400000 | 47.73750 | 1.1316425 | 39.05000 | 7.733333 | 53.89485 |
4 | 1.700000 | 46.70000 | 7.2409656 | 47.94907 | 7.802083 | 57.52542 |
5 | 2.400000 | 56.50000 | 4.0615942 | 52.75000 | 7.947917 | 51.28535 |
6 | 2.800000 | 51.53750 | 6.1296212 | 66.58750 | 6.385417 | 52.20833 |
7 | 3.000000 | 41.40000 | 9.6912202 | 59.54167 | 7.064513 | 62.47998 |
8 | 2.500000 | 44.48750 | 4.8981884 | 53.96701 | 7.295592 | 55.90522 |
9 | 3.200000 | 56.50000 | 2.7121706 | 57.48864 | 6.158333 | 53.13333 |
10 | 4.000000 | 43.60000 | 1.1443511 | 47.09275 | 6.726172 | 59.51562 |
11 | 3.275000 | 41.70000 | 0.8541667 | 29.45833 | 7.350000 | 50.00000 |
12 | 3.914286 | 42.30000 | 0.1527778 | 27.70833 | 9.204167 | 50.28154 |
Table 18 focuses on per year measurements. Rather than min/max (0% and 100% percentiles), we can look at the quantiles of 20-80 and get a better understanding of the probability distribution by year.
mutate(chicago, year = (lubridate::year(date))) %>%
group_by(year) %>%
summarize(
pm25Q20=quantile(pm25tmean2, 0.2, na.rm=TRUE),
pm25Q80=quantile(pm25tmean2, 0.8, na.rm=TRUE),
o3Q20=quantile(o3tmean2, 0.2, na.rm=TRUE),
o3Q80=quantile(o3tmean2, 0.8, na.rm=TRUE),
no2Q20=quantile(no2tmean2, 0.2, na.rm=TRUE),
no2Q80=quantile(no2tmean2, 0.8, na.rm=TRUE)
)
## # A tibble: 19 x 7
## year pm25Q20 pm25Q80 o3Q20 o3Q80 no2Q20 no2Q80
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1987 NA NA 9.02 31.2 18.3 32.3
## 2 1988 NA NA 9.33 34.1 18.7 31.4
## 3 1989 NA NA 9.55 30.9 20.0 33.2
## 4 1990 NA NA 9.40 29.0 15.8 29.3
## 5 1991 NA NA 9.18 29.6 15.5 27.9
## 6 1992 NA NA 6.77 23.7 20.2 30.9
## 7 1993 NA NA 6.85 24.7 20.0 31.6
## 8 1994 NA NA 7.97 26.6 21.1 35.9
## 9 1995 NA NA 6.84 27.6 20.7 34.9
## 10 1996 NA NA 6.34 25.1 20.7 33.2
## 11 1997 NA NA 8.13 28.5 19.0 32.2
## 12 1998 10.5 25.4 6.80 30.0 18.6 30.9
## 13 1999 10.4 25.3 8.50 30.3 18.7 30.9
## 14 2000 9.54 24.2 8.31 27.6 16.8 31.0
## 15 2001 9.1 23.5 8.13 29.2 19.3 31.3
## 16 2002 8.51 21.0 10.4 30.5 17.9 29.3
## 17 2003 8.58 21.1 10.7 31.1 18.7 31.6
## 18 2004 7.89 19.4 11.2 28.9 17.3 29.4
## 19 2005 7.9 22.2 11.4 34.5 16.6 28.6
year | pm25Q20 | pm25Q80 | o3Q20 | o3Q80 | no2Q20 | no2Q80 |
---|---|---|---|---|---|---|
1987 | NA | NA | 9.015942 | 31.17821 | 18.27433 | 32.33631 |
1988 | NA | NA | 9.333333 | 34.06250 | 18.65417 | 31.37754 |
1989 | NA | NA | 9.550000 | 30.89374 | 20.04166 | 33.24937 |
1990 | NA | NA | 9.395652 | 29.02654 | 15.77971 | 29.32133 |
1991 | NA | NA | 9.183333 | 29.58202 | 15.51984 | 27.91394 |
1992 | NA | NA | 6.770833 | 23.66204 | 20.20698 | 30.87500 |
1993 | NA | NA | 6.845169 | 24.65485 | 20.02333 | 31.60217 |
1994 | NA | NA | 7.966667 | 26.55242 | 21.07083 | 35.85169 |
1995 | NA | NA | 6.844192 | 27.57417 | 20.65522 | 34.94202 |
1996 | NA | NA | 6.335824 | 25.11111 | 20.69792 | 33.22391 |
1997 | NA | NA | 8.127778 | 28.49761 | 19.00802 | 32.19722 |
1998 | 10.520000 | 25.40000 | 6.796860 | 30.00888 | 18.63361 | 30.94758 |
1999 | 10.377500 | 25.33429 | 8.496717 | 30.28982 | 18.71836 | 30.88763 |
2000 | 9.540000 | 24.19714 | 8.305556 | 27.58080 | 16.79167 | 31.04167 |
2001 | 9.100000 | 23.51600 | 8.131966 | 29.24917 | 19.29803 | 31.26060 |
2002 | 8.510000 | 20.96000 | 10.408333 | 30.50463 | 17.90625 | 29.29532 |
2003 | 8.580000 | 21.14000 | 10.738095 | 31.06458 | 18.66014 | 31.62000 |
2004 | 7.887143 | 19.43000 | 11.166667 | 28.85202 | 17.30729 | 29.37500 |
2005 | 7.900000 | 22.20000 | 11.426826 | 34.53244 | 16.59653 | 28.55060 |
Table 19 is the break down of per month probability distribution using 20/80. The break down helps us explore the data deeper and find if the data is normal or skewed over a monthly trend.
mutate(chicago, month = (lubridate::month(date))) %>%
group_by(month) %>%
summarize(
pm25Q20=quantile(pm25tmean2, 0.2, na.rm=TRUE),
pm25Q80=quantile(pm25tmean2, 0.8, na.rm=TRUE),
o3Q20=quantile(o3tmean2, 0.2, na.rm=TRUE),
o3Q80=quantile(o3tmean2, 0.8, na.rm=TRUE),
no2Q20=quantile(no2tmean2, 0.2, na.rm=TRUE),
no2Q80=quantile(no2tmean2, 0.8, na.rm=TRUE)
)
## # A tibble: 12 x 7
## month pm25Q20 pm25Q80 o3Q20 o3Q80 no2Q20 no2Q80
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 10.9 24.4 4.45 13.5 19.9 32.1
## 2 2 10.5 27.6 7.90 18.5 21.0 33.2
## 3 3 9.7 25.5 12.5 26.6 19.9 33.3
## 4 4 7.72 18.4 18.1 30.5 18.6 32.6
## 5 5 7.8 19.1 20.9 33.5 17.9 31.5
## 6 6 8.7 21.8 22.5 40.5 19.0 32.8
## 7 7 9.24 23.1 22.5 38.3 16.4 29.3
## 8 8 9.24 23.2 19.3 33.3 17.5 30.3
## 9 9 7.62 22.7 12.6 28.7 18.2 33.0
## 10 10 7.6 19.3 8.13 18.4 18.3 31.5
## 11 11 8.50 20.5 4.16 12.4 17.6 29.6
## 12 12 10.6 23.3 3.26 11.2 19.5 29.9
month | pm25Q20 | pm25Q80 | o3Q20 | o3Q80 | no2Q20 | no2Q80 |
---|---|---|---|---|---|---|
1 | 10.920000 | 24.41333 | 4.445652 | 13.46167 | 19.88214 | 32.08750 |
2 | 10.470000 | 27.61714 | 7.899758 | 18.51063 | 20.99167 | 33.17778 |
3 | 9.700000 | 25.46500 | 12.525000 | 26.62478 | 19.86111 | 33.27674 |
4 | 7.724286 | 18.42000 | 18.090657 | 30.48091 | 18.61875 | 32.58999 |
5 | 7.800000 | 19.10000 | 20.944556 | 33.54967 | 17.86199 | 31.46076 |
6 | 8.700000 | 21.85000 | 22.476235 | 40.49473 | 19.00893 | 32.78384 |
7 | 9.240000 | 23.08500 | 22.522239 | 38.30213 | 16.40743 | 29.28919 |
8 | 9.240000 | 23.17500 | 19.326223 | 33.28750 | 17.47797 | 30.29226 |
9 | 7.617143 | 22.66714 | 12.640212 | 28.65187 | 18.18177 | 33.00062 |
10 | 7.600000 | 19.28000 | 8.134375 | 18.38452 | 18.25942 | 31.51581 |
11 | 8.495000 | 20.51714 | 4.161111 | 12.40333 | 17.56083 | 29.62459 |
12 | 10.580000 | 23.26667 | 3.257699 | 11.18927 | 19.52083 | 29.87431 |
Conclusion
The article covered several of the main tools found in the tidyverse package for R. I demonstrated around ten different ways to manipulate and transform a dataset using the dplyr functions. We learned the Chicago data records over 19 years of air quality measurements. The pm25tmeans2 was not measured for the first ten years and showed by 4447 missing values. The primary dplyr functions used and combined are:
names()
select()
filter()
summarize()
arrange()
rename()
mutate()
transmute()
group_by()
%>%
References
Peng, R. D. (n.d.). Chicago dataset. Retrieved from: http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip
Peng, R. D. (2016). Exploratory Data Analysis with R. p. 4-17. Retrieved from: https://leanpub.com/exdata
RDocumentation. (2018). Introduction to dplyr. Retrieved from: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. p. 43-75, O’Reilly.