class: center, middle, inverse, title-slide

# Econ 3035 - Econometric Methods
## R language basics
### Pedro H. C. Sant’Anna
Vanderbilt University
These slides are heavily based on material shared by Grant McDermott

---
name: toc

# Table of contents

1. [Introduction](#intro)
2. [Object-oriented programming in R](#oop)
3. ["Everything is an object"](#eobject)
4. ["Everything has a name"](#ename)
5. [Indexing](#indexing)
6. [Cleaning up](#cleaning)

---
class: inverse, center, middle
name: intro

# Introduction

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

(Some important R concepts)

---

# Basic arithmetic

R is a powerful calculator and recognizes all of the standard arithmetic operators:

```r
1+2 ## Addition
```

```
## [1] 3
```

```r
6-7 ## Subtraction
```

```
## [1] -1
```

```r
5/2 ## Division
```

```
## [1] 2.5
```

```r
2^3 ## Exponentiation
```

```
## [1] 8
```

```r
2+4*1^3 ## Standard order of precedence (`*` before `+`, etc.)
```

```
## [1] 6
```

---

# Logic

R also comes equipped with a full set of logical operators and Booleans, which follow standard programming protocol. For example:

```r
1 > 2
```

```
## [1] FALSE
```

```r
(1 > 2) & (1 > 0.5) ## The "&" stands for "and"
```

```
## [1] FALSE
```

```r
(1 > 2) | (1 > 0.5) ## The "|" stands for "or" (not a pipe a la the shell)
```

```
## [1] TRUE
```

```r
isTRUE(1 < 2)
```

```
## [1] TRUE
```

--

You can read more about logical operators and types <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html" target="_blank">here</a> and <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/logical.html" target="_blank">here</a>. In the next few slides, however, I want to emphasise some special concepts and gotchas...

---

# Logic (cont.)

### Order of precedence

Much like standard arithmetic, logic statements follow a strict order of precedence. Logical operators (`>`, `==`, etc.) are evaluated before Boolean operators (`&` and `|`). Failure to recognize this can lead to unexpected behavior...
```r
1 > 0.5 & 2
```

```
## [1] TRUE
```

--

What's happening here is that R is evaluating two separate "logical" statements:

- `1 > 0.5`, which is obviously TRUE.
- `2`, which is TRUE(!) because R is "helpfully" converting it to `as.logical(2)`.

--

**Solution:** Be explicit about each component of your logic statement(s) and use parentheses.

```r
(1 > 0.5) & (1 > 2)
```

```
## [1] FALSE
```

---

# Logic (cont.)

### Negation: `!`

We use `!` as a shorthand for negation. This will come in very handy when we start filtering data objects based on non-missing (i.e. non-NA) observations.

```r
is.na(1:10)
```

```
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

```r
!is.na(1:10)
```

```
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
```

```r
# Negate(is.na)(1:10) ## This also works. Try it yourself.
```

---

# Logical operators (cont.)

### Value matching: `%in%`

To see whether an object is contained within (i.e. matches one of) a list of items, use `%in%`.

```r
4 %in% 1:10
```

```
## [1] TRUE
```

```r
4 %in% 5:10
```

```
## [1] FALSE
```

---

# Logical operators (cont.)

### Evaluation

We'll get to assignment shortly. However, to preempt it somewhat, we always use two equal signs for logical evaluation.

```r
1 = 1 ## This doesn't work
```

```
## Error in 1 = 1: invalid (do_set) left-hand side to assignment
```

```r
1 == 1 ## This does.
```

```
## [1] TRUE
```

```r
1 != 2 ## Note the single equal sign when combined with a negation.
```

```
## [1] TRUE
```

---

# Logical operators (cont.)

### Evaluation caveat: Floating-point numbers

What do you think will happen if we evaluate `0.1 + 0.2 == 0.3`?

--

```r
0.1 + 0.2 == 0.3
```

```
## [1] FALSE
```

Whattttt?!!! (Or, maybe you're thinking: This makes no sense at all!!)

--

**Problem:** Computers represent numbers as binary (i.e. base 2) floating-points. More [here](https://floating-point-gui.de/basic/).

- Fast and memory efficient, but can lead to unexpected behavior.
- Similar to the way that standard decimal (i.e. base 10) representation can't precisely capture certain fractions (e.g. `\(\frac{1}{3} = 0.3333...\)`).

--

**Solution:** Use `all.equal()` for evaluating floats (i.e. fractions).

```r
all.equal(0.1 + 0.2, 0.3)
```

```
## [1] TRUE
```

---

# Assignment

In R, we can use either `<-` or `=` to handle assignment.<sup>1</sup>

.footnote[
<sup>1</sup> The `<-` is really a `<` followed by a `-`. It just looks like one thing b/c of the [font](https://github.com/tonsky/FiraCode) I'm using here.
]

--

### Assignment with `<-`

`<-` is normally read aloud as "gets". You can think of it as a (left-facing) arrow saying *assign in this direction*.

```r
a <- 10 + 5
a
```

```
## [1] 15
```

--

Of course, an arrow can point in the other direction too (i.e. `->`). So, the following code chunk is equivalent to the previous one, although used much less frequently.

```r
10 + 5 -> a
```

---

# Assignment (cont.)

### Assignment with `=`

You can also use `=` for assignment.

```r
b = 10 + 10 ## Note that the assigned object *must* be on the left with "=".
b
```

```
## [1] 20
```

--

### Which assignment operator to use?

Most R users (purists?) seem to prefer `<-` for assignment, since `=` also has a specific role for evaluation *within* functions.

- We'll see lots of examples of this later.
- But I don't think it matters; `=` is quicker to type and is more intuitive if you're coming from another programming language. (More discussion [here](https://github.com/Robinlovelace/geocompr/issues/319#issuecomment-427376764) and [here](https://www.separatinghyperplanes.com/2018/02/why-you-should-use-and-never.html).)

**Bottom line:** Use whichever you prefer. Just be consistent.

---

# Help

For more information on a (named) function or object in R, consult the "help" documentation. For example:

```R
help(plot)
```

Or, more simply, just use `?`:

```R
# This is what most people use.
?plot
```

--

</br>

**Aside 1:** Comments in R are demarcated by `#`.
- Hit `Ctrl+Shift+c` in RStudio to (un)comment whole sections of highlighted code.

--

**Aside 2:** See the *Examples* section at the bottom of the help file?

- You can run them with the `example()` function. Try it: `example(plot)`.

---
class: inverse, center, middle
name: eobject

# "Everything is an object"

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---

# What are objects?

It's important to emphasise that there are many different *types* (or *classes*) of objects. We'll revisit the issue of "type" vs "class" in a bit. For the moment, it is helpful simply to name some objects that we'll be working with regularly:

- vectors
- matrices
- data frames
- lists
- functions
- etc.

--

Most likely, you already have a good idea of what distinguishes these objects and how to use them.

- However, bear in mind that there are subtleties that may confuse you while you're still getting used to R.
- E.g. There are different kinds of data frames. We'll soon encounter "[tibbles](https://tibble.tidyverse.org/)", which are enhanced versions of the standard data frame in R.

---

# What are objects? (cont.)

Each object class has its own set of rules ("methods") for determining valid operations.

- For example, you can perform many of the same operations on matrices and data frames. But there are some operations that only work on a matrix, and vice-versa.
- At the same time, you can (usually) convert an object from one type to another.

```r
## Create a small data frame called "d".
d = data.frame(x = 1:2, y = 3:4)
d
```

```
##   x y
## 1 1 3
## 2 2 4
```

```r
## Convert it to (i.e. create) a matrix called "m".
m = as.matrix(d)
m
```

```
##      x y
## [1,] 1 3
## [2,] 2 4
```

---

# Object class, type, and structure

Use the `class`, `typeof`, and `str` commands if you want to understand more about a particular object.

```r
# d = data.frame(x = 1:2, y = 3:4) ## Create a small data frame called "d".
class(d) ## Evaluate its class.
```

```
## [1] "data.frame"
```

```r
typeof(d) ## Evaluate its type.
```

```
## [1] "list"
```

```r
str(d) ## Show its structure.
```

```
## 'data.frame':	2 obs. of  2 variables:
##  $ x: int  1 2
##  $ y: int  3 4
```

--

PS: Confused by the fact that `typeof(d)` returns "list"? See [here](https://stackoverflow.com/questions/45396538/typeofdata-frame-shows-list-in-r).

---

# Object class, type, and structure (cont.)

Of course, you can always just inspect/print an object directly in the console.

- E.g. Type `d` and hit Enter.

The `View()` function is also very helpful. This is the same as clicking on the object in your RStudio *Environment* pane. (Try both methods now.)

- E.g. `View(d)`.

---

# Global environment

Let's go back to the simple data frame that we created a few slides earlier.

```r
d
```

```
##   x y
## 1 1 3
## 2 2 4
```

--

Now, let's try to compute the sample mean of these "x" and "y" variables:

```r
mean(x)
```

```
## Error in mean(x): object 'x' not found
```

```r
mean(y)
```

```
## Error in mean(y): object 'y' not found
```

--

Uh-oh. What went wrong here? (Answer on next slide.)

---

# Global environment (cont.)

The error message provides the answer to our question:

```
*## Error in mean(x): object 'x' not found
```

--

R can't find the variables that we've supplied in our [Global Environment](https://www.datamentor.io/r-programming/environment-scope/):

![No "x" or "y" here...](pics/environment.png)

--

Put differently: Because the variables "x" and "y" do not live as separate objects in the global environment, we have to tell R that they belong to the object `d`.

- Think about how you might do this before clicking through to the next slide.

---

# Global environment (cont.)

There are various ways to solve this problem. A simple one is using the indexing operator `$` (more about this operator in a bit).
```r
mean(d$x)
```

```
## [1] 1.5
```

```r
mean(d$y)
```

```
## [1] 3.5
```

--

I wanted to emphasize this global environment issue, because it is something that Stata users (i.e. many economists) struggle with when they first come to R.

- In Stata, the entire workspace essentially consists of one (and only one) data frame. So there can be no ambiguity where variables are coming from.
- However, that "convenience" comes at a really high price IMO. You can never read more than two separate datasets (let alone object types) into memory at the same time, have to resort to all sorts of hacks to add summary variables to your dataset, etc.
- Speaking of which...

---

# Working with multiple objects

As I keep saying, R's ability to keep multiple objects in memory at the same time is a huge plus when it comes to effective data work.

- E.g. We can copy an existing data frame, or create a new one entirely from scratch. Either will exist happily with our existing objects in the global environment.

```r
d2 = data.frame(x = rnorm(10), y = runif(10))
```

![Now with d2 added](pics/environment2.png)

---

# Working with multiple objects (cont.)

Again, however, it does mean that you have to pay attention to the names of those distinct data frames and be specific about which objects you are referring to.

- Do we want to compute the sample mean of "y" and "x" from data frame `d` or from data frame `d2`?

---
class: inverse, center, middle
name: ename

# "Everything has a name"

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---

# Reserved words

We've seen that we can assign objects to different names. However, there are a number of special words that are "reserved" in R.

- These are fundamental commands, operators and relations in base R that you cannot (re)assign, even if you wanted to.
- We already encountered examples with the logical operators.
See [here](http://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) for a full list, including (but not limited to):

```R
if
else
while
function
for
TRUE
FALSE
NULL
Inf
NaN
NA
```

---

# Semi-reserved words

In addition to the list of strictly reserved words, there is a class of words and strings that I am going to call "semi-reserved".

- These are named functions or constants (e.g. `pi`) that you can re-assign if you really wanted to... but already come with important meanings from base R.

Arguably the most important semi-reserved character is `c()`, which we use for concatenation; i.e. creating vectors and binding different objects together.

```r
my_vector = c(1, 2, 5)
my_vector
```

```
## [1] 1 2 5
```

--

What happens if you type the following? (Try it in your console.)

```R
c = 4
c(1, 2, 5)
```

???

Vectors are very important in R, because the language has been optimised for them. Don't worry about this now; later you'll learn what I mean by "vectorising" a function.

---

# Semi-reserved words (cont.)

*(Continued from previous slide.)*

In this case, thankfully nothing. R is "smart" enough to distinguish between the variable `c = 4` that we created and the built-in function `c()` that calls for concatenation.

--

However, this is still *extremely* sloppy coding. R won't always be able to distinguish between conflicting definitions. And neither will you. For example:

```r
pi
```

```
## [1] 3.141593
```

```r
pi = 2
pi
```

```
## [1] 2
```

--

**Bottom line:** Don't use (semi-)reserved characters!

---

# Namespace conflicts

A similar issue crops up when we load two packages that have functions sharing the same name. E.g. Look what happens when we load the `dplyr` package. (We talked about how to install packages in class, so go ahead and install `dplyr` first!)
```r
library(dplyr)
```

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
##     filter, lag
```

```
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
```

--

The messages that you see about some object being *masked from 'package:X'* are warning you about a namespace conflict.

- E.g. Both `dplyr` and the `stats` package (which gets loaded automatically when you start R) have functions named "filter" and "lag".

---

# Namespace conflicts (cont.)

The potential for namespace conflicts is a result of the OOP approach.<sup>1</sup>

- Also reflects the fundamental open-source nature of R and the use of external packages. People are free to call their functions whatever they want, so some overlap is only to be expected.

.footnote[
<sup>1</sup> Similar problems arise in virtually every other programming language (Python, C, etc.)
]

--

Whenever a namespace conflict arises, the most recently loaded package will gain preference. So the `filter()` function now refers specifically to the `dplyr` variant.

But what if we want the `stats` variant? Well, we have two options:

1. Temporarily use `stats::filter()`
2. Permanently assign `filter = stats::filter`

---

# Solving namespace conflicts

### 1. Use `package::function()`

We can explicitly call a conflicted function from a particular package using the `package::function()` syntax. For example:

```r
stats::filter(1:10, rep(1, 2))
```

```
## Time Series:
## Start = 1
## End = 10
## Frequency = 1
##  [1]  3  5  7  9 11 13 15 17 19 NA
```

--

We can also use `::` for more than just conflicted cases.

- E.g. Being explicit about where a function (or dataset) comes from can help add clarity to our code. Try these lines of code in your R console.

```R
dplyr::starwars ## Print the starwars data frame from the dplyr package
scales::comma(c(1000, 1000000)) ## Use the comma function, which comes from the scales package
```

???
The `::` syntax also means that we can call functions without loading the package first. E.g. As long as `dplyr` is installed on our system, then `dplyr::filter(iris, Species=="virginica")` will work.

---

# Solving namespace conflicts (cont.)

### 2. Assign `function = package::function`

A more permanent solution is to assign a conflicted function name to a particular package. This will hold for the remainder of your current R session, or until you change it back. E.g.

```r
filter = stats::filter ## Note the lack of parentheses.
filter = dplyr::filter ## Change it back again.
```

--

### General advice

I would generally advocate for the temporary `package::function()` solution.

Other than that, simply pay attention to any warnings when loading a new package. And remember that `?` is your friend if you're ever unsure. (E.g. `?filter` will tell you which variant is being used.)

- In truth, problematic namespace conflicts are rare. But it's good to be aware of them.

---

# User-side namespace conflicts

A final thing to say about namespace conflicts is that they don't only arise from loading packages. They can arise when users create their own functions with a conflicting name.

- E.g. If I was naive enough to create a new function called `c()`.

--

</br>

In a similar vein, one of the most common and confusing errors that even experienced R programmers run into is related to the habit of calling objects "df" or "data"... both of which are functions in base R!

- See for yourself by typing `?df` or `?data`.

Again, R will figure out what you mean if you are clear/lucky enough. But, much the same as with `c()`, it's relatively easy to run into problems.

- Case in point: Triggering the infamous "object of type closure is not subsettable" error message. (See from 1:45 [here](https://rstudio.com/resources/rstudioconf-2020/object-of-type-closure-is-not-subsettable/).)
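
Here's a minimal sketch of how that error comes about. It assumes only base R. The names `msg` and `my_df` (and its column `x`) are made up for illustration, and I call `stats::df` explicitly so that the result doesn't depend on whatever else happens to be in your global environment.

```r
## `df` is not a data frame. It is a function from the stats package
## (the density of the F distribution).
is.function(stats::df)

## So indexing it as if it were a data frame fails.
## tryCatch() captures the error message instead of halting the script.
msg = tryCatch(stats::df$x, error = function(e) conditionMessage(e))
msg

## The problem goes away once we use a distinct, descriptive name.
my_df = data.frame(x = 1:3)
my_df$x
```

The same thing happens if you type `df$x` in a fresh session where you forgot to create (or load) your own `df` object first.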
---
class: inverse, center, middle
name: indexing

# Indexing

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---

# Option 1: []

We've already seen an example of indexing in the form of R console output. For example:

```r
1+2
```

```
## [1] 3
```

The `[1]` above denotes the first (and, in this case, only) element of our output.<sup>1</sup> In this case, a vector of length one equal to the value "3".

--

Try the following in your console to see a more explicit example of indexed output:

```r
rnorm(n = 100, mean = 0, sd = 1)
# rnorm(100) ## Would work just as well. (Why? Hint: see ?rnorm)
```

.footnote[
<sup>1</sup> Indexing in R begins at 1. Not 0 like some languages (e.g. Python and JavaScript).
]

---

# Option 1: [] (cont.)

More importantly, we can also use `[]` to index objects that we create in R.

```r
a = 1:10
a[4] ## Get the 4th element of object "a"
```

```
## [1] 4
```

```r
a[c(4, 6)] ## Get the 4th and 6th elements
```

```
## [1] 4 6
```

It also works on larger arrays (vectors, matrices, data frames, and lists). For example:

```r
starwars[1, 1] ## Show the cell corresponding to the 1st row & 1st column of the data frame.
```

```
## # A tibble: 1 x 1
##   name
##   <chr>
## 1 Luke Skywalker
```

--

What does `starwars[1:3, 1]` give you?

---

# Option 1: [] (cont.)

We haven't covered them properly yet (patience), but **lists** are a more complex type of array object in R.

- They can contain a random assortment of objects that don't share the same class, or have the same shape (e.g. rank) or common structure.
- E.g. A list can contain a scalar, a string, and a data frame. Or you can have a list of data frames, or even lists of lists.

--

The relevance to indexing is that lists require two square brackets `[[]]` to index the parent list item and then the standard `[]` within that parent item.
An example might help to illustrate:

```r
my_list = list(a = "hello", b = c(1,2,3), c = data.frame(x = 1:5, y = 6:10))
my_list[[1]] ## Return the 1st list object
```

```
## [1] "hello"
```

```r
my_list[[2]][3] ## Return the 3rd element of the 2nd list object
```

```
## [1] 3
```

---

# Option 2: $

Lists provide a nice segue to our other indexing operator: `$`.

- Let's continue with the `my_list` example from the previous slide.

```r
my_list
```

```
## $a
## [1] "hello"
##
## $b
## [1] 1 2 3
##
## $c
##   x  y
## 1 1  6
## 2 2  7
## 3 3  8
## 4 4  9
## 5 5 10
```

---
count: false

# Option 2: $

Lists provide a nice segue to our other indexing operator: `$`.

- Let's continue with the `my_list` example from the previous slide.

```r
my_list
```

```
*## $a
## [1] "hello"
##
*## $b
## [1] 1 2 3
##
*## $c
##   x  y
## 1 1  6
## 2 2  7
## 3 3  8
## 4 4  9
## 5 5 10
```

Notice how our (named) parent list objects are demarcated: "$a", "$b" and "$c".

---

# Option 2: $ (cont.)

We can call these objects directly by name using the dollar sign, e.g.

```r
my_list$a ## Return list object "a"
```

```
## [1] "hello"
```

```r
my_list$b[3] ## Return the 3rd element of list object "b"
```

```
## [1] 3
```

```r
my_list$c$x ## Return column "x" of list object "c"
```

```
## [1] 1 2 3 4 5
```

--

</br>

**Aside:** Typing `View(my_list)` (or, equivalently, clicking on the object in RStudio's environment pane) provides a nice interactive window for exploring the nested structure of lists.

---

# Option 2: $ (cont.)

The `$` form of indexing also works (and in the manner that you probably expect) for other object types in R.

In some cases, you can also combine the two index options.

- E.g. Get the 1st element of the "name" column from the starwars data frame.

```r
starwars$name[1]
```

```
## [1] "Luke Skywalker"
```

--

However, note some key differences between the output from this example and that of our previous `starwars[1, 1]` example. What are they?
- Hint: Apart from the visual cues, try wrapping each command in `str()`.

---

# Option 2: $ (cont.)

The last thing that I want to say about `$` is that it provides another way to avoid the "object not found" problem that we ran into with our earlier sample mean example. The same problem shows up if we try to run a regression:

```r
lm(y ~ x) ## Doesn't work
```

```
## Error in eval(predvars, data, env): object 'y' not found
```

```r
lm(d$y ~ d$x) ## Works!
```

```
##
## Call:
## lm(formula = d$y ~ d$x)
##
## Coefficients:
## (Intercept)          d$x
##           2            1
```

---
class: inverse, center, middle
name: cleaning

# Cleaning up

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---

# Removing objects (and packages)

Use `rm()` to remove an object or objects from your working environment.

```r
a = "hello"
b = "world"
rm(a, b)
```

You can also use `rm(list = ls())` to remove all objects in your working environment (except packages), but this is [frowned upon](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/).

- Better just to start a new R session.

--

Detaching packages is more complicated, because there are so many cross-dependencies (i.e. one package depends on, and might even automatically load, another). However, you can try, e.g. `detach(package:dplyr)`.

- Again, better just to restart your R session.

---

# Removing plots

You can use `dev.off()` to remove any (i.e. all) plots that have been generated during your session. For example, try this in your R console:

```r
plot(1:10)
dev.off()
```

--

You may also have noticed that RStudio has convenient buttons for clearing your workspace environment and removing (individual) plots. Just look for these icons in the relevant window panels:

![](pics/broom.png?display=inline-block)
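
One caveat worth sketching for when you work outside of RStudio: in plain R, `dev.off()` closes only the current graphics device, whereas base R's `graphics.off()` closes all open devices. The snippet below is a rough illustration rather than a recommended workflow. The name `n_open` is made up, and `pdf(file = NULL)` is only there to open a device that doesn't write a file to disk.

```r
## Open a graphics device that writes nowhere, so this also runs outside RStudio.
pdf(file = NULL)
plot(1:10)
n_open = length(dev.list()) ## Number of graphics devices currently open

## Close every open device in one go.
graphics.off()
is.null(dev.list()) ## dev.list() returns NULL once no devices are open
```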