class: title-slide # Econ 520: Data Science for Economists ## Lecture 7: Data cleaning & wrangling via tidyverse <br> <p align=center> Pedro H. C. Sant'Anna </p> <div style="margin-top: -.7cm;"></div> <p align=center> Emory University </p> <br> <p align=center> Spring 2024 </p> --- # Table of contents <div style="margin-top: -.6cm;"></div> 1. [Prologue](#prologue) 2. [Tidyverse basics](#basics) 3. [Data wrangling with dplyr](#dplyr) - [filter](#filter) - [arrange](#arrange) - [select](#select) - [mutate](#mutate) - [summarize](#summarize) - [joins](#joins) 4. [Data tidying with tidyr](#tidyr) - [pivot_longer](#pivot_longer) / [pivot_wider](#pivot_wider) - [separate](#separate) - [unite](#unite) --- class: center, middle name: prologue # Prologue <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html> --- # Who are we building on? - This lecture builds on Grant McDermott's [Lectures](https://github.com/uo-ec607/lectures) <div style="margin-top: .5cm;"></div> - We will be using the .hi[tidyverse] package, which is a collection of packages designed for data science. <div style="margin-top: 0.5cm;"></div> - The .hi[tidyverse] is built around the following core packages: - **ggplot2** for data visualization - **dplyr** for data manipulation - **tidyr** for data tidying - **readr** for data import - **purrr** for functional programming - **tibble** for tibbles, a modern reimagining of data frames - **stringr** for strings - **forcats** for factors --- # What is "tidy" data? ### Resources: - [Vignette](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) (from the **tidyr** package) - [Original paper](https://vita.had.co.nz/papers/tidy-data.pdf) (Hadley Wickham, 2014 JSS) -- <div style="margin-top: 1.5cm;"></div> ### Key points: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table. -- Basically, tidy data is more likely to be in [long format](https://en.wikipedia.org/wiki/Wide_and_narrow_data) than in wide format. --- # Checklist ### R packages we will need today: ☑ [**tidyverse**](https://www.tidyverse.org/) ☑ [**nycflights13**](hhttps://github.com/hadley/nycflights13) -- <div style="margin-top: 1.cm;"></div> We will not load them just yet but you can install/update them using ```r install.packages(c('tidyverse', 'nycflights13'), dependencies = TRUE) ``` As we saw last lecture, another way is to use the `p_load` function in the `pacman` package. --- class: center, middle name: basics # Tidyverse basics <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html> --- # Tidyverse vs. base R Much digital ink has been spilled over the "tidyverse vs. base R vs. data.table " debate. -- We won't delve into this debate here, and will focus on tidyverse. - The documentation and community support are outstanding. - Having a consistent philosophy and syntax makes it easier to learn. - Provides a convenient "front-end" to big data tools that we'll use later in the course. - For data cleaning, wrangling, and plotting, the tidyverse really is a no-brainer.<sup>1</sup> .footnote[ <sup>1</sup> I'm also a fan of [**data.table**](http://r-datatable.com/), which can be very fast. We will not cover it, though. ] --- # Tidyverse vs. base R **But**... this certainly shouldn't put you off learning base R alternatives. - Base R is extremely flexible and powerful (and stable). - There are some things that you'll have to venture outside of the tidyverse for. - A combination of tidyverse and base R is often the best solution to a problem. - Excellent base R data manipulation tutorials: [here](https://www.rspatial.org/intr/index.html) and [here](https://github.com/matloff/fasteR). --- # Tidyverse vs. base R (cont.) - One point of convenience is that there is often a direct correspondence between a tidyverse command and its base R equivalent. - These generally follow a `tidyverse::snake_case` vs. `base::period.case` rule. E.g. Compare: | tidyverse | base | |---|---| | `?readr::read_csv` | `?utils::read.csv` | | `?dplyr::if_else` | `?base::ifelse` | | `?tibble::tibble` | `?base::data.frame` | - The tidyverse alternative typically offers some enhancements or other useful options (and sometimes restrictions) over its base counterpart. --- # Tidyverse packages <div style="margin-top: -0.7cm;"></div> Let's load the tidyverse meta-package and check the output. ```r library(tidyverse) ``` ``` ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr 1.1.4 ✔ readr 2.1.5 ## ✔ forcats 1.0.0 ✔ stringr 1.5.1 ## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1 ## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1 ## ✔ purrr 1.0.2 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors ``` -- <div style="margin-top: -0.5cm;"></div> - We see that we have actually loaded a number of packages (which could also be loaded individually): **ggplot2**, **tibble**, **dplyr**, etc. - We can also see information about the package versions and some namespace conflicts. --- # Tidyverse packages (cont.) - The tidyverse actually comes with a lot more packages than those that are just loaded automatically. ```r tidyverse_packages() ``` ``` ## [1] "broom" "conflicted" "cli" "dbplyr" ## [5] "dplyr" "dtplyr" "forcats" "ggplot2" ## [9] "googledrive" "googlesheets4" "haven" "hms" ## [13] "httr" "jsonlite" "lubridate" "magrittr" ## [17] "modelr" "pillar" "purrr" "ragg" ## [21] "readr" "readxl" "reprex" "rlang" ## [25] "rstudioapi" "rvest" "stringr" "tibble" ## [29] "tidyr" "xml2" "tidyverse" ``` - We'll use (or have alreadu used) several of these additional packages during the course. - E.g., the **lubridate** package for working with dates, and the **rvest** package for web scraping. - However, these packages will have to be loaded separately from **tidyverse**. --- # Tidyverse packages (cont.) - We will cover many tidyverse packages over the length of the course. - Today, however, we will focus on two packages: 1. [**dplyr**](https://dplyr.tidyverse.org/) 2. [**tidyr**](https://tidyr.tidyverse.org/) - These are the workhorse packages for cleaning and wrangling data. They are thus the ones that you will likely make the most use of, alongside **ggplot2**. - Data cleaning and wrangling occupies an inordinate amount of time, no matter where you are in your career. --- # An aside on pipes: %>% - We already used the tidyverse pipe operator, denoted `%>%`, in previous lectrues. - I want to reiterate how cool pipes are, and how using them can dramatically improve the experience of reading and writing code. Compare: ```r ## These next two lines of code do exactly the same thing. mpg %>% filter(manufacturer=="audi") %>% group_by(model) %>% summarize(hwy_mean = mean(hwy)) summarize(group_by(filter(mpg, manufacturer=="audi"), model), hwy_mean = mean(hwy)) ``` -- The first line reads from left to right, exactly how I thought of the operations in my head. - Take this object (`mpg`), do this (`filter`), then do this (`group_by`), etc. The second line totally inverts this logical order (the final operation comes first!) - Who wants to read things inside out? --- # An aside on pipes: %>% (cont.) - The piped version of the code is even more readable if we write it over several lines. - Here it is again and, this time, I'll run it for good measure so you can see the output: ```r mpg %>% filter(manufacturer=="audi") %>% group_by(model) %>% summarize(hwy_mean = mean(hwy)) ``` ``` ## # A tibble: 3 × 2 ## model hwy_mean ## <chr> <dbl> ## 1 a4 28.3 ## 2 a4 quattro 25.8 ## 3 a6 quattro 24 ``` - Remember: Using vertical space costs nothing and makes for much more readable/writable code than cramming things horizontally. --- name: nativepipe # A further aside on the base R pipe: |> The magrittr pipe has proven so successful and popular, that the R core team [recently announced](https://t.co/HVECPENQ5C?amp=1) a "native" pipe would be coming to base R, denoted `|>`.<sup>1</sup> For example: ```r mtcars |> subset(cyl==4) |> head() mtcars |> subset(cyl==4) |> d => lm(mpg ~ disp, data = d) ``` .footnote[<sup>1</sup> That's actually a `|` followed by a `>`. The default font on these slides just makes it look extra fancy.] - This native pipe complements some other new cool features, like support for ["lambda" functions](https://stackoverflow.com/questions/16501/what-is-a-lambda-function) in R. - If you do not want to load **tidyverse** just for the pipe, you can use the native pipe in base R. --- class: center, middle name: dplyr # dplyr <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html> --- # Aside: dplyr 1.0.0 release Some of the **dplyr** features that we'll cover today were introduced in [version 1.0.0](https://www.tidyverse.org/blog/2020/06/dplyr-1-0-0/) of the package. - Version 1.0.0 is a big deal since it marks a stable code base for the package going forward. - We are currently on version 1.1.4, I guess. - Please make sure that you are running at least **dplyr** 1.0.0 before continuing. ```r packageVersion('dplyr') ``` ``` ## [1] '1.1.4' ``` ```r # install.packages('dplyr') ## install updated version if < 1.0.0 ``` --- # Key dplyr verbs There are five key .hi[dplyr] verbs that you need to learn. <div style="margin-top: -0.4cm;"></div> 1. `filter`: Filter (i.e. subset) rows based on their values. 2. `arrange`: Arrange (i.e. reorder) rows based on their values. 3. `select`: Select (i.e. subset) columns by their names: 4. `mutate`: Create new columns. 5. `summarize`: Collapse multiple rows into a single summary value.<sup>1</sup> .footnote[ <sup>1</sup> `summarise` with a "s" works too. ] -- <div style="margin-top: -0.3cm;"></div> - Let's practice these commands together using the `starwars` data frame that comes pre-packaged with dplyr. --- name: filter # 1) dplyr::filter We can chain multiple filter commands with the pipe (`%>%`), or just separate them within a single filter command using commas. ```r starwars %>% filter( species == "Human", height >= 190 ) ``` ``` ## # A tibble: 4 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Darth Va… 202 136 none white yellow 41.9 male mascu… ## 2 Qui-Gon … 193 89 brown fair blue 92 male mascu… ## 3 Dooku 193 80 white fair brown 102 male mascu… ## 4 Bail Pre… 191 NA black tan brown 67 male mascu… ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` --- # 1) dplyr::filter *cont.* Regular expressions work well too. ```r starwars %>% filter(grepl("Skywalker", name)) ``` ``` ## # A tibble: 3 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke Sky… 172 77 blond fair blue 19 male mascu… ## 2 Anakin S… 188 84 blond fair blue 41.9 male mascu… ## 3 Shmi Sky… 163 NA black fair brown 72 fema… femin… ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` --- # 1) dplyr::filter *cont.* A very common `filter` use case is identifying (or removing) missing data cases. ```r starwars %>% filter(is.na(height)) ``` ``` ## # A tibble: 6 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Arvel Cr… NA NA brown fair brown NA male mascu… ## 2 Finn NA NA black dark dark NA male mascu… ## 3 Rey NA NA brown light hazel NA fema… femin… ## 4 Poe Dame… NA NA brown light brown NA male mascu… ## 5 BB8 NA NA none none black NA none mascu… ## 6 Captain … NA NA none none unknown NA fema… femin… ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` -- - To remove missing observations, simply use negation: `filter(!is.na(height))`. Try this yourself. - Alternative, you can use the `tidyr::drop_na(height)` function to remove missing observations. --- # 1) dplyr::filter *cont.* Let's illustrate! ```r starwars %>% tidyr::drop_na(height) ``` ``` ## # A tibble: 81 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke Sk… 172 77 blond fair blue 19 male mascu… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… ## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… ## 4 Darth V… 202 136 none white yellow 41.9 male mascu… ## 5 Leia Or… 150 49 brown light brown 19 fema… femin… ## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu… ## 7 Beru Wh… 165 75 brown light blue 47 fema… femin… ## 8 R5-D4 97 32 <NA> white, red red NA none mascu… ## 9 Biggs D… 183 84 black light brown 24 male mascu… ## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu… ## # ℹ 71 more rows ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` --- name: arrange # 2) dplyr::arrange ```r starwars %>% arrange(birth_year) ``` ``` ## # A tibble: 87 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Wicket … 88 20 brown brown brown 8 male mascu… ## 2 IG-88 200 140 none metal red 15 none mascu… ## 3 Luke Sk… 172 77 blond fair blue 19 male mascu… ## 4 Leia Or… 150 49 brown light brown 19 fema… femin… ## 5 Wedge A… 170 77 brown fair hazel 21 male mascu… ## 6 Plo Koon 188 80 none orange black 22 male mascu… ## 7 Biggs D… 183 84 black light brown 24 male mascu… ## 8 Han Solo 180 80 brown fair brown 29 male mascu… ## 9 Lando C… 177 79 black dark brown 31 male mascu… ## 10 Boba Fe… 183 78.2 black fair brown 31.5 male mascu… ## # ℹ 77 more rows ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` -- *Note:* Arranging on a character-based column (i.e. strings) will sort alphabetically. Try this yourself by arranging according to the "name" column. --- # 2) dplyr::arrange *cont.* We can also arrange items in descending order using `arrange(desc())`. ```r starwars %>% arrange(desc(birth_year)) ``` ``` ## # A tibble: 87 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Yoda 66 17 white green brown 896 male mascu… ## 2 Jabba D… 175 1358 <NA> green-tan… orange 600 herm… mascu… ## 3 Chewbac… 228 112 brown unknown blue 200 male mascu… ## 4 C-3PO 167 75 <NA> gold yellow 112 none mascu… ## 5 Dooku 193 80 white fair brown 102 male mascu… ## 6 Qui-Gon… 193 89 brown fair blue 92 male mascu… ## 7 Ki-Adi-… 198 82 white pale yellow 92 male mascu… ## 8 Finis V… 170 NA blond fair blue 91 male mascu… ## 9 Palpati… 170 75 grey pale yellow 82 male mascu… ## 10 Cliegg … 183 NA brown fair blue 82 male mascu… ## # ℹ 77 more rows ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` --- name: select # 3) dplyr::select Use commas to select multiple columns out of a data frame. (You can also use "first:last" for consecutive columns). Deselect a column with "-". ```r starwars %>% select(name:skin_color, species, -height) ``` ``` ## # A tibble: 87 × 5 ## name mass hair_color skin_color species ## <chr> <dbl> <chr> <chr> <chr> ## 1 Luke Skywalker 77 blond fair Human ## 2 C-3PO 75 <NA> gold Droid ## 3 R2-D2 32 <NA> white, blue Droid ## 4 Darth Vader 136 none white Human ## 5 Leia Organa 49 brown light Human ## 6 Owen Lars 120 brown, grey light Human ## 7 Beru Whitesun Lars 75 brown light Human ## 8 R5-D4 32 <NA> white, red Droid ## 9 Biggs Darklighter 84 black light Human ## 10 Obi-Wan Kenobi 77 auburn, white fair Human ## # ℹ 77 more rows ``` --- # 3) dplyr::select *cont.* You can also rename some (or all) of your selected variables in place. ```r starwars %>% select(alias=name, crib=homeworld, sex=gender) ``` ``` ## # A tibble: 87 × 3 ## alias crib sex ## <chr> <chr> <chr> ## 1 Luke Skywalker Tatooine masculine ## 2 C-3PO Tatooine masculine ## 3 R2-D2 Naboo masculine ## 4 Darth Vader Tatooine masculine ## 5 Leia Organa Alderaan feminine ## 6 Owen Lars Tatooine masculine ## 7 Beru Whitesun Lars Tatooine feminine ## 8 R5-D4 Tatooine masculine ## 9 Biggs Darklighter Tatooine masculine ## 10 Obi-Wan Kenobi Stewjon masculine ## # ℹ 77 more rows ``` --- # 3) dplyr::select *cont.* (rename) - If you just want to rename columns without subsetting them, you can use `rename`. - Try this now by replacing `select(...)` in the above code chunk with `rename(...)`. --- # 3) dplyr::select *cont.* The `select(contains(PATTERN))` option provides a nice shortcut in relevant cases. ```r starwars %>% select(name, contains("color")) ``` ``` ## # A tibble: 87 × 4 ## name hair_color skin_color eye_color ## <chr> <chr> <chr> <chr> ## 1 Luke Skywalker blond fair blue ## 2 C-3PO <NA> gold yellow ## 3 R2-D2 <NA> white, blue red ## 4 Darth Vader none white yellow ## 5 Leia Organa brown light brown ## 6 Owen Lars brown, grey light blue ## 7 Beru Whitesun Lars brown light blue ## 8 R5-D4 <NA> white, red red ## 9 Biggs Darklighter black light brown ## 10 Obi-Wan Kenobi auburn, white fair blue-gray ## # ℹ 77 more rows ``` --- # 3) dplyr::select *cont.* The `select(..., everything())` option is another useful shortcut if you only want to bring some variable(s) to the "front" of a data frame. ```r starwars %>% select(species, homeworld, everything()) %>% head(5) ``` ``` ## # A tibble: 5 × 14 ## species homeworld name height mass hair_color skin_color eye_color ## <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr> ## 1 Human Tatooine Luke Skywalker 172 77 blond fair blue ## 2 Droid Tatooine C-3PO 167 75 <NA> gold yellow ## 3 Droid Naboo R2-D2 96 32 <NA> white, blue red ## 4 Human Tatooine Darth Vader 202 136 none white yellow ## 5 Human Alderaan Leia Organa 150 49 brown light brown ## # ℹ 6 more variables: birth_year <dbl>, sex <chr>, gender <chr>, films <list>, ## # vehicles <list>, starships <list> ``` -- *Note:* The `relocate` function has brought a lot more functionality to ordering of columns. See [here](https://www.tidyverse.org/blog/2020/03/dplyr-1-0-0-select-rename-relocate/). --- name: mutate # 4) dplyr::mutate You can create new columns from scratch, or as transformations of existing columns. ```r starwars %>% select(name, birth_year) %>% mutate(dog_years = birth_year * 7) %>% mutate(comment = paste0(name, " is ", dog_years, " in dog years.")) ``` ``` ## # A tibble: 87 × 4 ## name birth_year dog_years comment ## <chr> <dbl> <dbl> <chr> ## 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years. ## 2 C-3PO 112 784 C-3PO is 784 in dog years. ## 3 R2-D2 33 231 R2-D2 is 231 in dog years. ## 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years. ## 5 Leia Organa 19 133 Leia Organa is 133 in dog years. ## 6 Owen Lars 52 364 Owen Lars is 364 in dog years. ## 7 Beru Whitesun Lars 47 329 Beru Whitesun Lars is 329 in dog yea… ## 8 R5-D4 NA NA R5-D4 is NA in dog years. ## 9 Biggs Darklighter 24 168 Biggs Darklighter is 168 in dog year… ## 10 Obi-Wan Kenobi 57 399 Obi-Wan Kenobi is 399 in dog years. ## # ℹ 77 more rows ``` --- # 4) dplyr::mutate *cont.* *Note:* `mutate` is order aware. So you can chain multiple mutates in a single call. ```r starwars %>% select(name, birth_year) %>% mutate( dog_years = birth_year * 7, ## Separate with a comma comment = paste0(name, " is ", dog_years, " in dog years.") ) ``` ``` ## # A tibble: 87 × 4 ## name birth_year dog_years comment ## <chr> <dbl> <dbl> <chr> ## 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years. ## 2 C-3PO 112 784 C-3PO is 784 in dog years. ## 3 R2-D2 33 231 R2-D2 is 231 in dog years. ## 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years. ## 5 Leia Organa 19 133 Leia Organa is 133 in dog years. ## 6 Owen Lars 52 364 Owen Lars is 364 in dog years. ## 7 Beru Whitesun Lars 47 329 Beru Whitesun Lars is 329 in dog yea… ## 8 R5-D4 NA NA R5-D4 is NA in dog years. ## 9 Biggs Darklighter 24 168 Biggs Darklighter is 168 in dog year… ## 10 Obi-Wan Kenobi 57 399 Obi-Wan Kenobi is 399 in dog years. ## # ℹ 77 more rows ``` --- # 4) dplyr::mutate *cont.* Boolean, logical and conditional operators all work well with `mutate` too. ```r starwars %>% select(name, height) %>% filter(name %in% c("Luke Skywalker", "Anakin Skywalker")) %>% mutate(tall1 = height > 180) %>% mutate(tall2 = ifelse(height > 180, "Tall", "Short")) ## Same effect, but can choose labels ``` ``` ## # A tibble: 2 × 4 ## name height tall1 tall2 ## <chr> <int> <lgl> <chr> ## 1 Luke Skywalker 172 FALSE Short ## 2 Anakin Skywalker 188 TRUE Tall ``` --- # 4) dplyr::mutate *cont.* Lastly, combining `mutate` with the new `across` feature in dplyr 1.0.0+ allows you to easily work on a subset of variables. For example: ```r starwars %>% select(name:eye_color) %>% * mutate(across(where(is.character), toupper)) %>% head(5) ``` ``` ## # A tibble: 5 × 6 ## name height mass hair_color skin_color eye_color ## <chr> <int> <dbl> <chr> <chr> <chr> ## 1 LUKE SKYWALKER 172 77 BLOND FAIR BLUE ## 2 C-3PO 167 75 <NA> GOLD YELLOW ## 3 R2-D2 96 32 <NA> WHITE, BLUE RED ## 4 DARTH VADER 202 136 NONE WHITE YELLOW ## 5 LEIA ORGANA 150 49 BROWN LIGHT BROWN ``` --- name: summarize # 5) dplyr::summarize Particularly useful in combination with the `group_by` command. ```r starwars %>% group_by(species, gender) %>% summarize(mean_height = mean(height, na.rm = TRUE)) ``` ``` ## `summarise()` has grouped output by 'species'. You can override using the ## `.groups` argument. ``` ``` ## # A tibble: 42 × 3 ## # Groups: species [38] ## species gender mean_height ## <chr> <chr> <dbl> ## 1 Aleena masculine 79 ## 2 Besalisk masculine 198 ## 3 Cerean masculine 198 ## 4 Chagrian masculine 196 ## 5 Clawdite feminine 168 ## 6 Droid feminine 96 ## 7 Droid masculine 140 ## 8 Dug masculine 112 ## 9 Ewok masculine 88 ## 10 Geonosian masculine 183 ## # ℹ 32 more rows ``` --- # 5) dplyr::summarize *cont.* Note that including "na.rm = TRUE" is usually a good idea with summarize functions. Otherwise, any missing value will propagate to the summarized value too. ```r ## Probably not what we want starwars %>% summarize(mean_height = mean(height)) ``` ``` ## # A tibble: 1 × 1 ## mean_height ## <dbl> ## 1 NA ``` ```r ## Much better starwars %>% summarize(mean_height = mean(height, na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 1 ## mean_height ## <dbl> ## 1 175. ``` --- # 5) dplyr::summarize *cont.* The same `across`-based workflow that we saw with `mutate` a few slides back also works with `summarize`. For example: ```r starwars %>% group_by(species) %>% * summarize(across(where(is.numeric), mean, na.rm=TRUE)) %>% head(5) ``` ``` ## Warning: There was 1 warning in `summarize()`. ## ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`. ## ℹ In group 1: `species = "Aleena"`. ## Caused by warning: ## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0. ## Supply arguments directly to `.fns` through an anonymous function instead. ## ## # Previously ## across(a:b, mean, na.rm = TRUE) ## ## # Now ## across(a:b, \(x) mean(x, na.rm = TRUE)) ``` ``` ## # A tibble: 5 × 4 ## species height mass birth_year ## <chr> <dbl> <dbl> <dbl> ## 1 Aleena 79 15 NaN ## 2 Besalisk 198 102 NaN ## 3 Cerean 198 82 92 ## 4 Chagrian 196 NaN NaN ## 5 Clawdite 168 55 NaN ``` --- # Other dplyr goodies `group_by` and `ungroup`: For (un)grouping. - Particularly useful with the `summarize` and `mutate` commands, as we've already seen. -- `slice`: Subset rows by position rather than filtering by values. - E.g. `starwars %>% slice(c(1, 5))` -- `pull`: Extract a column from a data frame as a vector or scalar. - E.g. `starwars %>% filter(sex=="female") %>% pull(height)` --- # Other dplyr goodies (cont.) `count` and `distinct`: Number and isolate unique observations. - E.g. `starwars %>% count(species)`, or `starwars %>% distinct(species)`. - You could also use a combination of `mutate`, `group_by`, and `n()`, - E.g., `starwars %>% group_by(species) %>% mutate(num = n())`. --- # Other dplyr goodies (cont.) - There are also a whole class of [window functions](https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html) for getting leads and lags, ranking, creating cumulative aggregates, etc. - See `vignette("window-functions")`. -- </br> - The final set of dplyr "goodies" are the family of join operations. - However, these are important enough that I want to go over some concepts in a bit more depth. - You will encounter these often. --- name: joins # Joins One of the purposes of the dplyr package is merging data with the family [join operations](https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html). - `inner_join(df1, df2)` - `left_join(df1, df2)` - `right_join(df1, df2)` - `full_join(df1, df2)` - `semi_join(df1, df2)` - `anti_join(df1, df2)` You find find it helpful to to see visual depictions of the different join operations [here](https://r4ds.had.co.nz/relational-data.html). --- # Joins (cont.) For the simple examples that I'm going to show here, we'll need some data sets that come bundled with the [**nycflights13**](http://github.com/hadley/nycflights13) package. - This package contains information about all flights that departed from New York City in 2013. - Load it now and then inspect these data frames in your own console. ```r library(nycflights13) flights planes ``` --- # Joins (cont.) Let's perform a [left join](https://stat545.com/bit001_dplyr-cheatsheet.html#left_joinsuperheroes-publishers) on the flights and planes datasets. - *Note*: I'm going subset columns after the join, but only to keep text on the slide. -- ```r left_join(flights, planes) %>% select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) %>% head(5) ``` ``` ## Joining with `by = join_by(year, tailnum)` ``` ``` ## # A tibble: 5 × 10 ## year month day dep_time arr_time carrier flight tailnum type model ## <int> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr> ## 1 2013 1 1 517 830 UA 1545 N14228 <NA> <NA> ## 2 2013 1 1 533 850 UA 1714 N24211 <NA> <NA> ## 3 2013 1 1 542 923 AA 1141 N619AA <NA> <NA> ## 4 2013 1 1 544 1004 B6 725 N804JB <NA> <NA> ## 5 2013 1 1 554 812 DL 461 N668DN <NA> <NA> ``` --- # Joins (cont.) (*continued from previous slide*) Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices: ``` *## Joining, by = c("year", "tailnum") ``` However, there's an obvious problem here: the variable "year" does not have a consistent meaning across our joining datasets! - In one it refers to the *year of flight*, in the other it refers to *year of construction*. -- - Luckily, there's an easy way to avoid this problem. - You just need to be more explicit in your join call by using the `by = ` argument. --- # Joins (cont.) (*continued from previous slide*) - You can also rename any ambiguous columns to avoid confusion. ```r left_join( flights, planes %>% rename(year_built = year), ## Not necessary w/ below line, but helpful by = "tailnum" ## Be specific about the joining column ) %>% select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, year_built, type, model) %>% head(3) ## Just to save vertical space on the slide ``` ``` ## # A tibble: 3 × 11 ## year month day dep_time arr_time carrier flight tailnum year_built type ## <int> <int> <int> <int> <int> <chr> <int> <chr> <int> <chr> ## 1 2013 1 1 517 830 UA 1545 N14228 1999 Fixed w… ## 2 2013 1 1 533 850 UA 1714 N24211 1998 Fixed w… ## 3 2013 1 1 542 923 AA 1141 N619AA 1990 Fixed w… ## # ℹ 1 more variable: model <chr> ``` --- # Joins (cont.) (*continued from previous slide*) Last thing I'll mention for now; note what happens if we again specify the join column... but don't rename the ambiguous "year" column in at least one of the given data frames. ```r left_join( flights, planes, ## Not renaming "year" to "year_built" this time by = "tailnum" ) %>% select(contains("year"), month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) %>% head(3) ``` ``` ## # A tibble: 3 × 11 ## year.x year.y month day dep_time arr_time carrier flight tailnum type model ## <int> <int> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr> ## 1 2013 1999 1 1 517 830 UA 1545 N14228 Fixe… 737-… ## 2 2013 1998 1 1 533 850 UA 1714 N24211 Fixe… 737-… ## 3 2013 1990 1 1 542 923 AA 1141 N619AA Fixe… 757-… ``` -- Make sure you know what "year.x" and "year.y" are. Again, it pays to be specific. --- class: center, middle name: tidyr # tidyr <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html> --- # Key tidyr verbs 1. `pivot_longer`: Pivot wide data into long format (i.e. "melt").<sup>1</sup> 2. `pivot_wider`: Pivot long data into wide format (i.e. "cast").<sup>2</sup> 3. `separate`: Separate (i.e. split) one column into multiple columns. 4. `unite`: Unite (i.e. combine) multiple columns into one. .footnote[ <sup>1</sup> Updated version of `tidyr::gather`. <sup>2</sup> Updated version of `tidyr::spread`. ] -- Let's practice these verbs together in class. - Side question: Which of `pivot_longer` vs `pivot_wider` produces "tidy" data? --- name: pivot_longer # 1) tidyr::pivot_longer ```r stocks = tibble( ## Could use "data.frame" instead of "tibble" if you prefer time = as.Date('2009-01-01') + 0:1, X = rnorm(2, 0, 1), Y = rnorm(2, 0, 2), Z = rnorm(2, 0, 4) ) stocks ``` ``` ## # A tibble: 2 × 4 ## time X Y Z ## <date> <dbl> <dbl> <dbl> ## 1 2009-01-01 -0.313 0.216 0.533 ## 2 2009-01-02 -3.06 2.42 -4.89 ``` ```r stocks %>% pivot_longer(-time, names_to="stock", values_to="price") ``` ``` ## # A tibble: 6 × 3 ## time stock price ## <date> <chr> <dbl> ## 1 2009-01-01 X -0.313 ## 2 2009-01-01 Y 0.216 ## 3 2009-01-01 Z 0.533 ## 4 2009-01-02 X -3.06 ## 5 2009-01-02 Y 2.42 ## 6 2009-01-02 Z -4.89 ``` --- # 1) tidyr::pivot_longer *cont.* Let's quickly save the "tidy" (i.e. long) stocks data frame for use on the next slide. ```r ## Write out the argument names this time: i.e. "names_to=" and "values_to=" tidy_stocks = stocks %>% pivot_longer(-time, names_to="stock", values_to="price") ``` --- name: pivot_wider # 2) tidyr::pivot_wider ```r tidy_stocks %>% pivot_wider(names_from=stock, values_from=price) ``` ``` ## # A tibble: 2 × 4 ## time X Y Z ## <date> <dbl> <dbl> <dbl> ## 1 2009-01-01 -0.313 0.216 0.533 ## 2 2009-01-02 -3.06 2.42 -4.89 ``` ```r tidy_stocks %>% pivot_wider(names_from=time, values_from=price) ``` ``` ## # A tibble: 3 × 3 ## stock `2009-01-01` `2009-01-02` ## <chr> <dbl> <dbl> ## 1 X -0.313 -3.06 ## 2 Y 0.216 2.42 ## 3 Z 0.533 -4.89 ``` -- - Note that the second example, which has combined different pivoting arguments, has effectively transposed the data. --- name: separate # 3) tidyr::separate ```r economists = data.frame(name = c("Adam.Smith", "Paul.Samuelson", "Milton.Friedman")) economists ``` ``` ## name ## 1 Adam.Smith ## 2 Paul.Samuelson ## 3 Milton.Friedman ``` ```r economists %>% separate(name, c("first_name", "last_name")) ``` ``` ## first_name last_name ## 1 Adam Smith ## 2 Paul Samuelson ## 3 Milton Friedman ``` -- - This command is pretty smart. But to avoid ambiguity, you can also specify the separation character with `separate(..., sep=".")`. --- # 3) tidyr::separate *cont.* - A related function is `separate_rows`, for splitting up cells that contain multiple fields or observations (common occurrence with survey data). ```r jobs = data.frame( name = c("Jack", "Jill"), occupation = c("Homemaker", "Philosopher, Philanthropist, Troublemaker") ) jobs ``` ``` ## name occupation ## 1 Jack Homemaker ## 2 Jill Philosopher, Philanthropist, Troublemaker ``` --- # 3) tidyr::separate *cont.* - A related function is `separate_rows`, for splitting up cells that contain multiple fields or observations (common occurrence with survey data). ```r ## Now split out Jill's various occupations into different rows jobs %>% separate_rows(occupation) ``` ``` ## # A tibble: 4 × 2 ## name occupation ## <chr> <chr> ## 1 Jack Homemaker ## 2 Jill Philosopher ## 3 Jill Philanthropist ## 4 Jill Troublemaker ``` --- name: unite # 4) tidyr::unite ```r gdp = data.frame( yr = rep(2023, times = 3), mnth = rep(1, times = 3), dy = 1:3, gdp = rnorm(3, mean = 100, sd = 2) ) gdp ``` ``` ## yr mnth dy gdp ## 1 2023 1 1 99.60413 ## 2 2023 1 2 98.06688 ## 3 2023 1 3 102.47222 ``` ```r ## Combine "yr", "mnth", and "dy" into one "date" column gdp %>% unite(date, c("yr", "mnth", "dy"), sep = "-") ``` ``` ## date gdp ## 1 2023-1-1 99.60413 ## 2 2023-1-2 98.06688 ## 3 2023-1-3 102.47222 ``` --- # 4) tidyr::unite *cont.* Note that `unite` will automatically create a character variable. You can see this better if we convert it to a tibble. ```r gdp_u = gdp %>% unite(date, c("yr", "mnth", "dy"), sep = "-") %>% as_tibble() gdp_u ``` ``` ## # A tibble: 3 × 2 ## date gdp ## <chr> <dbl> ## 1 2023-1-1 99.6 ## 2 2023-1-2 98.1 ## 3 2023-1-3 102. ``` -- - If you want to convert it to something else (e.g. date or numeric) then you will need to modify it using `mutate`. -See the next slide for an example, using the [lubridate](https://lubridate.tidyverse.org/) package's super helpful date conversion functions. --- # 4) tidyr::unite *cont.* *(continued from previous slide)* ```r library(lubridate) gdp_u %>% mutate(date = ymd(date)) ``` ``` ## # A tibble: 3 × 2 ## date gdp ## <date> <dbl> ## 1 2023-01-01 99.6 ## 2 2023-01-02 98.1 ## 3 2023-01-03 102. ``` --- # Other tidyr goodies Use `crossing` to get the full combination of a group of variables.<sup>1</sup> ```r crossing(side=c("left", "right"), height=c("top", "bottom")) ``` ``` ## # A tibble: 4 × 2 ## side height ## <chr> <chr> ## 1 left bottom ## 2 left top ## 3 right bottom ## 4 right top ``` .footnote[ <sup>1</sup> Base R alternative: `expand.grid`. ] -- See `?expand` and `?complete` for more specialized functions that allow you to fill in (implicit) missing data or variable combinations in existing data frames. --- class: center, middle name: summary # Summary <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html> --- # Key verbs ### dplyr 1. `filter` 2. `arrange` 3. `select` 4. `mutate` 5. `summarize` ### tidyr 1. `pivot_longer` 2. `pivot_wider` 3. `separate` 4. `unite` -- Other useful items include: pipes (`%>%`), grouping (`group_by`), joining functions (`left_join`, `inner_join`, etc.).