class: title-slide

# Econ 520: Data Science for Economists
## Lecture 12: Linear Regression Refresher

<br>

<p align=center> Pedro H. C. Sant'Anna </p>
<div style="margin-top: -.7cm;"></div>
<p align=center> Emory University </p>

<br>

<p align=center> Spring 2024 </p>

---
class: center, middle
name: prologue

# Lecture Structure

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---

# Main Goals

- The main goal of this lecture is to provide a quick refresher of linear regression analysis, building on what you have seen in .hi[Econ 320].

- We will cover the basics of linear regression analysis in R.

- We will cover estimation and heteroskedasticity-robust inference.

- We will also cover marginal effects and other post-estimation tools.

---

# R packages

- We will use several R packages today.

- New: **estimatr**, **lmtest**, **broom**

- I'll try to be explicit about where a particular function is coming from whenever we use it below.

- A convenient way to install (if necessary) and load everything is to run the code chunk below.

```r
## Load and install the packages that we'll be using today
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, estimatr, sandwich, lmtest, margins, broom, modelsummary)
# Plot theme
theme_set(hrbrthemes::theme_ipsum())
```

---

# Example: starwars data

- One thing I want to mention up front is that we'll mostly be working with the `starwars` data frame that we've already seen in previous lectures.

- This is not a big deal because I just want to refresh your memory about some regression tools.

- Here's a quick reminder of what the `starwars` data frame looks like.

```r
starwars
```

```
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```

---

# Regression Basics

- R's workhorse command for running linear regression models is the built-in `lm()` function.

- The "**lm**" stands for "**l**inear **m**odels" and the syntax is very intuitive.

```r
lm(y ~ x1 + x2 + x3 + ..., data = df)
```

- You'll note that the `lm()` call includes a reference to the data source (in this case, a hypothetical data frame called `df`).

- All the variables `y`, `x1`, `x2`, `x3`, etc. are assumed to be columns in `df`.

---

# Example: OLS

- Let's run a simple bivariate regression of mass on height using our dataset of starwars characters.

```r
ols1 = lm(mass ~ height, data = starwars)
ols1
```

```
## 
## Call:
## lm(formula = mass ~ height, data = starwars)
## 
## Coefficients:
## (Intercept)       height  
##     -11.487        0.624
```
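- Once fit, the model object works immediately with R's standard post-estimation tools. Here is a minimal prediction sketch; the 180 cm input is just an arbitrary illustrative value:

```r
## Predicted mass for a hypothetical 180 cm tall character
predict(ols1, newdata = data.frame(height = 180))
```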
---

# Example: OLS (continued)

- The resulting object is pretty terse, but that's only because it buries most of its valuable information --- of which there is a lot --- within its internal list structure.

- If you're in RStudio, you can inspect this structure by typing `View(ols1)` or simply clicking on the "ols1" object in your environment pane.

- Doing so will prompt an interactive panel to pop up for you to play around with.

- That approach won't work for this knitted R Markdown document, however, so I'll use the `listviewer::jsonedit()` function instead.

---

# Example: OLS (continued)

```r
# View(ols1) ## Run this instead if you're in a live session
listviewer::jsonedit(ols1, mode = "view", height = "400px") ## Better for R Markdown
```
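- If you just want to grab a particular piece of the model rather than browse the whole object, base R's accessor functions pull the main slots out directly. A minimal sketch:

```r
coef(ols1)         ## Regression coefficients
head(resid(ols1))  ## First few residuals
head(fitted(ols1)) ## First few fitted (i.e. predicted) values
```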
---

# Example: OLS (continued)

- As we can see, this `ols1` object has a bunch of important slots, including
  - regression coefficients,
  - vectors of the residuals and fitted (i.e. predicted) values,
  - the rank of the design matrix,
  - the input data,
  - etc. etc.

---

# Example: OLS (continued)

- To summarize the key pieces of information, we can use the `summary()` function.

```r
summary(ols1)
```

```
## 
## Call:
## lm(formula = mass ~ height, data = starwars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -60.95  -29.51  -20.83  -17.65 1260.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.4868   111.3842  -0.103    0.918
## height        0.6240     0.6262   0.997    0.323
## 
## Residual standard error: 169.5 on 57 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.01712,	Adjusted R-squared:  -0.0001194
## F-statistic: 0.9931 on 1 and 57 DF,  p-value: 0.3232
```

---

# Example: OLS (continued)

We can then dig down further by extracting a summary of the regression coefficients:

```r
summary(ols1)$coefficients
```

```
##                Estimate  Std. Error    t value  Pr(>|t|)
## (Intercept) -11.4868157 111.3841576 -0.1031279 0.9182234
## height        0.6240033   0.6261744  0.9965328 0.3232031
```

---

# Get "tidy" OLS coefficients with the `broom` package

- While it's easy to extract regression coefficients via the `summary()` function, in practice I always use the **broom** package ([link](https://broom.tidyverse.org/)) to do so.

- **broom** has a bunch of neat features to convert regression (and other statistical) objects into "tidy" data frames.

- This is especially useful because regression output is so often used as an input to something else, e.g., a plot of coefficients or marginal effects.

---

# Get "tidy" OLS coefficients with the `broom` package (continued)

- Here, I'll use `broom::tidy(..., conf.int = TRUE)` to coerce the `ols1` OLS object into a tidy data frame of coefficient values and key statistics.

```r
# library(broom) ## Already loaded
tidy(ols1, conf.int = TRUE)
```

```
## # A tibble: 2 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)  -11.5     111.       -0.103   0.918 -235.      212.   
## 2 height         0.624     0.626     0.997   0.323   -0.630     1.88
```

---

# Get "tidy" OLS coefficients with the `broom` package (continued)

- Again, I could now pipe this tidied coefficients data frame to a **ggplot2** call, using, say, `geom_pointrange()` to plot the error bars.

- We'll get to some explicit examples further below.

---

# Regressing on subsetted data

- Different species and homeworlds aside, we may have an extreme outlier in our data.

<img src="12slides_files/figure-html/jabba-1.png" style="display: block; margin: auto;" />

---

# How to handle this?

- Maybe we should exclude Jabba from our regression?

- You can do this in two ways: 1) Create a new data frame and then regress, or 2) Subset the original data frame directly in the `lm()` call.

---

# Option 1: Create a new data frame

- Recall that we can keep multiple objects in memory in R.

- So we can easily create a new data frame that excludes Jabba using, say, **dplyr**.

```r
starwars2 = starwars %>% filter(name != "Jabba Desilijic Tiure")
# filter(!(grepl("Jabba", name))) ## Regular expressions also work

ols2 = lm(mass ~ height, data = starwars2)
summary(ols2)
```

```
## 
## Call:
## lm(formula = mass ~ height, data = starwars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.006  -7.804   0.508   4.007  57.901 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -31.25047   12.81488  -2.439   0.0179 *  
## height        0.61273    0.07202   8.508 1.14e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.49 on 56 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.5638,	Adjusted R-squared:  0.556
## F-statistic: 72.38 on 1 and 56 DF,  p-value: 1.138e-11
```
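- With both fits in memory, it's easy to see how much Jabba was driving the original results. A minimal side-by-side sketch using the **modelsummary** package we loaded at the start (output omitted here):

```r
## Compare the with- and without-Jabba regressions in one table
modelsummary(list("Full sample" = ols1, "No Jabba" = ols2))
```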
---

# Option 2: Subset directly in the `lm()` call

Our other alternative is to subset the data directly in the `lm()` call.

```r
ols2a = lm(mass ~ height, data = starwars %>% filter(!(grepl("Jabba", name))))
summary(ols2a)
```

```
## 
## Call:
## lm(formula = mass ~ height, data = starwars %>% filter(!(grepl("Jabba", 
##     name))))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.006  -7.804   0.508   4.007  57.901 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -31.25047   12.81488  -2.439   0.0179 *  
## height        0.61273    0.07202   8.508 1.14e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.49 on 56 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.5638,	Adjusted R-squared:  0.556
## F-statistic: 72.38 on 1 and 56 DF,  p-value: 1.138e-11
```

---

# The issue with `lm` is inference

- The `lm` function is great for estimating the coefficients of a linear model, but it's not so great for inference.

- The `lm` function assumes that the variance of the residuals is constant across all levels of the covariates.

- This is known as the homoskedasticity assumption.

- I strongly recommend that you not rely on it, as this assumption is .hi[VERY] often violated.

- How can we fix that?

---

# Heteroskedastic-robust standard errors

- One way to fix this is to use heteroskedastic-robust standard errors.

- This is done by using the `coeftest` function from the `lmtest` package, combined with a robust variance estimator like `vcovHC` from the `sandwich` package.

```r
# library(lmtest) already loaded
lmtest::coeftest(ols2a, vcov = vcovHC(ols2a, type = "HC2")) %>%
  tidy(conf.int = TRUE)
```

```
## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)  -31.3      9.23       -3.38 1.31e- 3  -49.7     -12.8  
## 2 height         0.613    0.0608     10.1  3.54e-14    0.491     0.735
```

---

# `estimatr::lm_robust()` as a one-stop alternative

- Instead of using `lm` and then `coeftest`, you can use the `estimatr` package to do both at once.

- Let's illustrate by implementing a robust version of the `ols2a` regression that we ran above.

- Note that **estimatr** models automatically print in a pleasing tidied/summary format, although you can certainly pipe them to `tidy()` too.

```r
# library(estimatr) ## Already loaded
ols1_robust = lm_robust(mass ~ height, data = starwars %>% filter(!(grepl("Jabba", name))))
# tidy(ols1_robust, conf.int = TRUE) ## Could tidy too
ols1_robust
```

```
##                Estimate  Std. Error   t value     Pr(>|t|)    CI Lower
## (Intercept) -31.2504692  9.23314438 -3.384597 1.307917e-03 -49.7466800
## height        0.6127301  0.06084511 10.070327 3.544723e-14   0.4908427
##                CI Upper DF
## (Intercept) -12.7542584 56
## height        0.7346175 56
```
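- Note that `lm_robust()` reproduces the HC2 standard errors from the previous slide because HC2 is its default. You can request other variance estimators via its `se_type` argument. A minimal sketch, where the "HC3" choice is just for illustration:

```r
## Same regression, but with (more conservative) HC3 standard errors
lm_robust(mass ~ height,
          data = starwars %>% filter(!(grepl("Jabba", name))),
          se_type = "HC3")
```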
---
class: center, middle
name: Interactions

# Dummy variables and interactions

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---

# Focusing on humans

- For the next few sections, it will prove convenient to demonstrate using a subsample of the starwars data that comprises only the human characters.

- Let's quickly create this new dataset before continuing.

```r
humans = starwars %>%
  filter(species == "Human") %>%
  select(where(Negate(is.list))) ## Drop list columns (optional)
humans
```

```
## # A tibble: 35 × 11
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  3 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  6 Biggs D…    183    84 black      light      brown           24   male  mascu…
##  7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
##  8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
##  9 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
## 10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
## # ℹ 25 more rows
## # ℹ 2 more variables: homeworld <chr>, species <chr>
```

---

# Dummy variables

- Dummy variables are a core component of many regression models.

- However, these can be a pain to create in some statistical languages, since you first have to tabulate a whole new matrix of binary variables and then append it to the original data frame.

- In contrast, R has a very convenient framework for creating and evaluating dummy variables in a regression: Simply specify the variable of interest as a [factor](https://r4ds.had.co.nz/factors.html).

---

# Dummy variables (cont'd)

- Here's an example where we explicitly tell R that "gender" is a factor. Since I don't plan on reusing this model, I'm just going to print the results to screen rather than saving it to my global environment.

```r
summary(estimatr::lm_robust(mass ~ height + as.factor(gender), data = humans))$coefficients
```

```
##                               Estimate  Std. Error    t value  Pr(>|t|)
## (Intercept)                -39.5021260  93.3927675 -0.4229677 0.6776209
## height                       0.5750128   0.5869553  0.9796534 0.3409941
## as.factor(gender)masculine  20.1953844  18.4225031  1.0962346 0.2882632
##                                 CI Lower  CI Upper DF
## (Intercept)                -236.5436417 157.53939 17
## height                       -0.6633547   1.81338 17
## as.factor(gender)masculine  -18.6726996  59.06347 17
```

- Okay, I should tell you that I'm actually making things more complicated than they need to be with the heavy-handed emphasis on factors.

- A case in point is that we don't actually *need* to specify a string (i.e. character) variable as a factor in a regression. R will automatically do this for you, as long as the variable is not a number.

---

# Dummy variables (cont'd)

```r
## Use the non-factored version of "gender" instead; R knows it must be converted
## to a factor for it to be included as a regression variable
summary(estimatr::lm_robust(mass ~ height + gender, data = humans))$coefficients
```

```
##                    Estimate Std. Error    t value  Pr(>|t|)      CI Lower
## (Intercept)     -39.5021260 93.3927675 -0.4229677 0.6776209 -236.5436417
## height            0.5750128  0.5869553  0.9796534 0.3409941   -0.6633547
## gendermasculine  20.1953844 18.4225031  1.0962346 0.2882632  -18.6726996
##                  CI Upper DF
## (Intercept)     157.53939 17
## height            1.81338 17
## gendermasculine  59.06347 17
```
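- If you ever want to see exactly which dummy columns R creates behind the scenes, base R's `model.matrix()` exposes the expanded design matrix. A minimal sketch:

```r
## Peek at the design matrix implied by the formula. The character column
## "gender" is automatically expanded into a 0/1 dummy (gendermasculine).
head(model.matrix(~ height + gender, data = humans))
```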
---

# Interactions

- As with dummy variables, R provides a convenient syntax for specifying interaction terms directly in the regression model without having to create them manually.

- You can use any of the following expansion operators:

  - `x1:x2` "crosses" the variables (equivalent to including only the `x1 × x2` interaction term)

  - `x1/x2` "nests" the second variable within the first (equivalent to `x1 + x1:x2`)

  - `x1*x2` includes all parent and interaction terms (equivalent to `x1 + x2 + x1:x2`)

- As a rule of thumb, it is generally advisable to include all of the parent terms alongside their interactions.

- This makes the `*` option a good default.

---

# Interactions: Example

We might wonder whether the relationship between a person's body mass and their height is modulated by their gender. That is, we want to run a regression of the form,

`$$Mass = \beta_0 + \beta_1 D_{Male} + \beta_2 Height + \beta_3 D_{Male} \times Height$$`

To implement this in R, we simply run the following:

```r
ols_ie = estimatr::lm_robust(mass ~ gender * height, data = humans)
summary(ols_ie)$coefficients
```

```
##                           Estimate  Std. Error    t value  Pr(>|t|)
## (Intercept)             87.8648649 151.9183924  0.5783688 0.5710671
## gendermasculine       -177.2710448 193.7338992 -0.9150234 0.3737641
## height                  -0.1891892   0.9081564 -0.2083223 0.8376061
## gendermasculine:height   1.1479992   1.1278747  1.0178429 0.3238976
##                            CI Lower   CI Upper DF
## (Intercept)             -234.187740 409.917470 16
## gendermasculine         -587.968564 233.426475 16
## height                    -2.114395   1.736016 16
## gendermasculine:height    -1.242988   3.538987 16
```

.hi[How do you interpret the coefficients?]
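---

# Interactions: Example (continued)

- One way to check your interpretation is to compute the implied group-specific slopes directly from the fitted object. A minimal sketch using base R's `coef()` on `ols_ie` (no new data involved):

```r
b = coef(ols_ie)

## Slope of mass on height for feminine characters (the base group): beta_2
b["height"]

## Slope of mass on height for masculine characters: beta_2 + beta_3
b["height"] + b["gendermasculine:height"]
```

- In other words, the interaction coefficient is the *difference* in slopes between the two groups.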