The following is an R notebook/Markdown version of a Medium blog post from August 2, 2017 by José Roberto Ayala Solares. What follows is a direct quote from his blog; we omit quotation markings for ease of reading. We also draw directly on Tyler Ransom's lecture and Grant McDermott's lecture.
The story of this lecture goes like this:
Some time ago, Kevin Markham from Data School published a nice tutorial about web scraping using 16 lines of Python code.
The tutorial is simple and really well made. I strongly encourage you to have a look at it. In fact, that tutorial is the basis of this class, but we are using R instead.
Also, we'll use the same web page, an opinion article called Trump's Lies. This should facilitate any comparison between the two approaches.
For a nice description of the article that we'll be working with, I encourage you to have a look at Kevin's tutorial. In summary, the data that we are interested in consists of a record of lies, each with 4 parts:

1. The date of the lie
2. The lie itself
3. An explanation of why it was a lie
4. A URL for an article that supports the explanation
The primary R package that we'll be using today is rvest (link), a simple web scraping library inspired by Python's Beautiful Soup (link), but with extra tidyverse functionality. rvest is designed to work with webpages that are built server-side and thus requires knowledge of the relevant CSS selectors, which means that now is probably a good time for us to cover what these are.
Time for a quick presentation on CSS (i.e., Cascading Style Sheets) and SelectorGadget. In short, CSS is a language for specifying the appearance of HTML documents (including web pages). It does this by providing web browsers a set of display rules, which are formed by:

1. Properties: the "how" of the display rules (e.g. fonts, colors, position on the page).
2. Selectors: the "what" of the display rules, i.e. which HTML elements the properties apply to.
The key point is that if you can identify the CSS selector(s) of the content you want, then you can isolate it from the rest of the webpage content that you don't want. This is where SelectorGadget, an open source tool that makes CSS selector generation and discovery easy, comes in. We'll work through an example below, but I highly recommend looking over this quick vignette before proceeding.
SelectorGadget is a great tool. But it isn’t available on all browsers and can involve more work than I’d like sometimes, with all that iterative clicking. I therefore wanted to mention an alternative (and very precise) approach to obtaining CSS selectors: Use the “inspect web element” feature of your browser.
The first important rvest function to use is `read_html()`, which returns an XML document that contains all the information about the web page. And we talked about XML last lecture!
library(rvest)
webpage <- read_html("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
webpage
## {html_document}
## <html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="https://schema.org/NewsArticle" itemscope="" xmlns:og="http://opengraphprotocol.org/schema/">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n<style>\n.lt-ie10 .messenger.suggestions {\n display: block !imp ...
As explained in Kevin’s tutorial, every record has the following structure in the HTML code:
<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>
Therefore, to collect all the lies, we need to identify all the `<span>` tags that belong to `class="short-desc"`. The function that will help us do so is `html_nodes()`. This function requires the XML document that we have read and the nodes that we want to select. For the latter, it is encouraged to use SelectorGadget. Using such a tool, we find that all the lies can be selected by using the selector `.short-desc`.
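Applying this selector with `html_nodes()` gives us all of the matching nodes (this is the same call as in the full script at the end of the tutorial):

```r
# Select every <span class="short-desc"> node on the page
results <- webpage %>% html_nodes(".short-desc")
results
```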
## {xml_nodeset (180)}
## [1] <span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Ira ...
## [2] <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time m ...
## [3] <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and ...
## [4] <span class="short-desc"><strong>Jan. 25 </strong>“Now, the audience was ...
## [5] <span class="short-desc"><strong>Jan. 25 </strong>“Take a look at the Pe ...
## [6] <span class="short-desc"><strong>Jan. 25 </strong>“You had millions of p ...
## [7] <span class="short-desc"><strong>Jan. 25 </strong>“So, look, when Presid ...
## [8] <span class="short-desc"><strong>Jan. 26 </strong>“We've taken in tens o ...
## [9] <span class="short-desc"><strong>Jan. 26 </strong>“I cut off hundreds of ...
## [10] <span class="short-desc"><strong>Jan. 28 </strong>“The coverage about me ...
## [11] <span class="short-desc"><strong>Jan. 29 </strong>“The Cuban-Americans, ...
## [12] <span class="short-desc"><strong>Jan. 30 </strong>“Only 109 people out o ...
## [13] <span class="short-desc"><strong>Feb. 3 </strong>“Professional anarchist ...
## [14] <span class="short-desc"><strong>Feb. 4 </strong>“After being forced to ...
## [15] <span class="short-desc"><strong>Feb. 5 </strong>“We had 109 people out ...
## [16] <span class="short-desc"><strong>Feb. 6 </strong>“I have already saved m ...
## [17] <span class="short-desc"><strong>Feb. 6 </strong>“It's gotten to a point ...
## [18] <span class="short-desc"><strong>Feb. 6 </strong>“The failing @nytimes w ...
## [19] <span class="short-desc"><strong>Feb. 6 </strong>“And the previous admin ...
## [20] <span class="short-desc"><strong>Feb. 7 </strong>“And yet the murder rat ...
## ...
This returns a list with 180 XML nodes that contain the information for each of the 180 lies in the web page.
Notice that I am using the `%>%` pipe operator from the magrittr package, which can help to express complex operations as elegant pipelines composed of simple, easily understood pieces.
Let’s start simple and focus on extracting all the necessary details from the first lie. We can then extend this to all the others easily. Remember that the general structure for a single record is:
<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>
Notice that the date is embedded within the `<strong>` tag. To select it, we can use the `html_nodes()` function with the selector `"strong"`.
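Let's grab the first record and select its `<strong>` node:

```r
# Take the first record, then its <strong> tag (which holds the date)
first_result <- results[1]
first_result %>% html_nodes("strong")
```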
## {xml_nodeset (1)}
## [1] <strong>Jan. 21 </strong>
We then need to use the `html_text()` function to extract only the text, with the `trim` argument active to trim leading and trailing spaces. Finally, we make use of the stringr package to add the year to the extracted date.
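Putting these steps together (mirroring the date extraction in the full script at the end):

```r
library(stringr)

# Extract the date text and append the year
date <- first_result %>%
  html_nodes("strong") %>%
  html_text(trim = TRUE) %>%
  str_c(", 2017")
```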
To select the lie, we need to make use of the `xml_contents()` function that is part of the xml2 package (this package is required by the rvest package, so it is not necessary to load it). The function returns a list with the nodes that are part of `first_result`.
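In code:

```r
# List the child nodes of the first record
xml_contents(first_result)
```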
## {xml_nodeset (3)}
## [1] <strong>Jan. 21 </strong>
## [2] “I wasn't a fan of Iraq. I didn't want to go into Iraq.”
## [3] <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczyns ...
We are interested in the lie, which is the text of the second node.
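We can extract it with `html_text()`:

```r
# The lie is the text of the second child node
lie <- xml_contents(first_result)[2] %>% html_text(trim = TRUE)
lie
```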
## [1] "“I wasn't a fan of Iraq. I didn't want to go into Iraq.”"
Notice that there is an extra pair of quotes (“…”) surrounding the lie. To get rid of them, we simply use the `str_sub()` function from the stringr package to select just the lie.
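In code:

```r
# Drop the first and last characters (the surrounding curly quotes)
str_sub(lie, 2, -2)
```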
## [1] "I wasn't a fan of Iraq. I didn't want to go into Iraq."
Hopefully by now it shouldn't be too complicated to see that to extract the explanation we simply need to select the text within the `<span>` tag that belongs to `class="short-truth"`. This will extract the text together with the opening and closing parentheses, but we can easily get rid of them.
explanation <- first_result %>% html_node(".short-truth") %>% html_text(trim = TRUE)
str_sub(explanation, 2, -2)
## [1] "He was for an invasion before he was against it."
Finally, to get the URL, notice that this is an attribute within the `<a>` tag. We simply select this node with the `html_nodes()` function, and then select the `href` attribute with the `html_attr()` function.
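In code:

```r
# The URL lives in the href attribute of the <a> tag
url <- first_result %>% html_nodes("a") %>% html_attr("href")
url
```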
## [1] "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"
We found a way to extract each of the 4 parts of the first record. We can extend this process to all the rest using a for loop. In the end, we want to have a data frame with 180 rows (one for each record) and 4 columns (to keep the date, the lie, the explanation and the URL). One way to do so is to create an empty data frame and simply add a new row as each new record is processed. However, this is not considered good practice. As suggested here, we are going to create a single data frame for each record and store all of them in a list. Once we have the 180 data frames, we'll bind them together using the `bind_rows()` function from the dplyr package. This creates our desired dataset.
# Extract the four fields from each record, one tibble per record
records <- vector("list", length = length(results))
for (i in seq_along(results)) {
  date <- str_c(results[i] %>% html_nodes("strong") %>% html_text(trim = TRUE), ", 2017")
  lie <- str_sub(xml_contents(results[i])[2] %>% html_text(trim = TRUE), 2, -2)
  explanation <- str_sub(results[i] %>% html_nodes(".short-truth") %>% html_text(trim = TRUE), 2, -2)
  url <- results[i] %>% html_nodes("a") %>% html_attr("href")
  records[[i]] <- tibble(date = date, lie = lie, explanation = explanation, url = url)
}
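We then bind the list of data frames together; here we inspect the result with dplyr's `glimpse()` (my choice of inspection function; the full script at the end only does the binding):

```r
df <- bind_rows(records)
glimpse(df)
```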
## Rows: 180
## Columns: 4
## $ date <chr> "Jan. 21, 2017", "Jan. 21, 2017", "Jan. 23, 2017", "Jan. 2…
## $ lie <chr> "I wasn't a fan of Iraq. I didn't want to go into Iraq.", …
## $ explanation <chr> "He was for an invasion before he was against it.", "Trump…
## $ url <chr> "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-t…
Notice that the column for the date is considered a character vector. It'd be nice to have it as a datetime vector instead. To do so, we can use the lubridate package and its `mdy()` (month-day-year) function to make the conversion.
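The conversion is a one-liner (again inspecting with `glimpse()`, my choice of inspection function):

```r
library(lubridate)

# Parse "Jan. 21, 2017"-style strings into Date objects
df$date <- mdy(df$date)
glimpse(df)
```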
## Rows: 180
## Columns: 4
## $ date <date> 2017-01-21, 2017-01-21, 2017-01-23, 2017-01-25, 2017-01-2…
## $ lie <chr> "I wasn't a fan of Iraq. I didn't want to go into Iraq.", …
## $ explanation <chr> "He was for an invasion before he was against it.", "Trump…
## $ url <chr> "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-t…
If you want to export your dataset, you can use either the `write.csv()` function that comes by default with R, or the `write_csv()` function from the readr package, which is roughly twice as fast and more convenient than the base version.
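For example:

```r
library(readr)

# Export the dataset to a CSV file
write_csv(df, "trump_lies.csv")
```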
Similarly, to retrieve your dataset, you can use either the default function `read.csv()` or the `read_csv()` function from the readr package.
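For example, to read back the file we just wrote:

```r
library(readr)

# Read the dataset back in; read_csv guesses the column types
df <- read_csv("trump_lies.csv")
```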
## Rows: 180 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): lie, explanation, url
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The full code for this tutorial is shown below:
# Load packages
library(rvest)
library(stringr)
library(dplyr)
library(lubridate)
library(readr)
# Read web page
webpage <- read_html("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
# Extract records info
results <- webpage %>% html_nodes(".short-desc")
# Building the dataset
records <- vector("list", length = length(results))
for (i in seq_along(results)) {
  date <- str_c(results[i] %>%
                  html_nodes("strong") %>%
                  html_text(trim = TRUE), ", 2017")
  lie <- str_sub(xml_contents(results[i])[2] %>% html_text(trim = TRUE), 2, -2)
  explanation <- str_sub(results[i] %>%
                           html_nodes(".short-truth") %>%
                           html_text(trim = TRUE), 2, -2)
  url <- results[i] %>% html_nodes("a") %>% html_attr("href")
  records[[i]] <- tibble(date = date, lie = lie, explanation = explanation, url = url)
}
df <- bind_rows(records)
# Transform to datetime format
df$date <- mdy(df$date)
# Export to csv
write_csv(df, "trump_lies.csv")