The following is an R notebook/Markdown version of a Medium blog post from August 2, 2017 by José Roberto Ayala Solares. What follows is a direct quote from his blog; we omit quotation markings for ease of reading. We also draw directly on Tyler Ransom's lecture and Grant McDermott's lecture.
The story of this lecture goes like this:
Some time ago, Kevin Markham from Data School published a nice tutorial about web scraping using 16 lines of Python code.
The tutorial is simple and really well made. I strongly encourage you to have a look at it. In fact, that tutorial is the basis of this class, but we are using R instead.
Also, we'll use the same web page, an opinion article called Trump's Lies. This should facilitate any comparison between the two approaches.
For a nice description of the article that we'll be working with, I encourage you to have a look at Kevin's tutorial. In summary, the data that we are interested in consists of a record of lies, each with 4 parts:

1. The date of the lie
2. The lie itself
3. An explanation of why it was a lie
4. A URL for an article that supports the explanation
The primary R package that we'll be using today is rvest (link), a simple web scraping library inspired by Python's Beautiful Soup (link), but with extra tidyverse functionality. rvest is designed to work with webpages that are built server-side and thus requires knowledge of the relevant CSS selectors, which means that now is probably a good time for us to cover what these are.
Time for a quick presentation on CSS (i.e., Cascading Style Sheets) and SelectorGadget. In short, CSS is a language for specifying the appearance of HTML documents (including web pages). It does this by providing web browsers a set of display rules, which are formed by:

1. Properties: the "how" of the display rules (e.g. fonts, colors, position on the page).
2. Selectors: the "what" of the display rules, i.e. which HTML elements the properties apply to.
The key point is that if you can identify the CSS selector(s) of the content you want, then you can isolate it from the rest of the webpage content that you don't want. This is where SelectorGadget, an open source tool that makes CSS selector generation and discovery easy, comes in. We'll work through an example below, but I highly recommend looking over this quick vignette before proceeding.
SelectorGadget is a great tool. But it isn’t available on all browsers and can involve more work than I’d like sometimes, with all that iterative clicking. I therefore wanted to mention an alternative (and very precise) approach to obtaining CSS selectors: Use the “inspect web element” feature of your browser.
The first important rvest function to use is `read_html()`, which returns an XML document that contains all the information about the web page. And we talked about XML last lecture!
library(rvest)
webpage <- read_html("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
webpage
## {html_document}
## <html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="https://schema.org/NewsArticle" itemscope="" xmlns:og="http://opengraphprotocol.org/schema/">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n<style>\n.lt-ie10 .messenger.suggestions {\n display: block !imp ...
As explained in Kevin’s tutorial, every record has the following structure in the HTML code:
<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>
Therefore, to collect all the lies, we need to identify all the `<span>` tags that belong to `class="short-desc"`. The function that will help us do so is `html_nodes()`. This function requires the XML document that we have read and the nodes that we want to select. For the latter, it is encouraged to use SelectorGadget. Using such a tool, we find that all the lies can be selected by using the selector `.short-desc`.
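Applying this selector with `html_nodes()` gives us all of the matching nodes (this is the same call as in the full script at the end of the tutorial):

```r
# Select every <span class="short-desc"> node on the page
results <- webpage %>% html_nodes(".short-desc")
results
```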
## {xml_nodeset (180)}
## [1] <span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Ira ...
## [2] <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time m ...
## [3] <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and ...
## [4] <span class="short-desc"><strong>Jan. 25 </strong>“Now, the audience was ...
## [5] <span class="short-desc"><strong>Jan. 25 </strong>“Take a look at the Pe ...
## [6] <span class="short-desc"><strong>Jan. 25 </strong>“You had millions of p ...
## [7] <span class="short-desc"><strong>Jan. 25 </strong>“So, look, when Presid ...
## [8] <span class="short-desc"><strong>Jan. 26 </strong>“We've taken in tens o ...
## [9] <span class="short-desc"><strong>Jan. 26 </strong>“I cut off hundreds of ...
## [10] <span class="short-desc"><strong>Jan. 28 </strong>“The coverage about me ...
## [11] <span class="short-desc"><strong>Jan. 29 </strong>“The Cuban-Americans, ...
## [12] <span class="short-desc"><strong>Jan. 30 </strong>“Only 109 people out o ...
## [13] <span class="short-desc"><strong>Feb. 3 </strong>“Professional anarchist ...
## [14] <span class="short-desc"><strong>Feb. 4 </strong>“After being forced to ...
## [15] <span class="short-desc"><strong>Feb. 5 </strong>“We had 109 people out ...
## [16] <span class="short-desc"><strong>Feb. 6 </strong>“I have already saved m ...
## [17] <span class="short-desc"><strong>Feb. 6 </strong>“It's gotten to a point ...
## [18] <span class="short-desc"><strong>Feb. 6 </strong>“The failing @nytimes w ...
## [19] <span class="short-desc"><strong>Feb. 6 </strong>“And the previous admin ...
## [20] <span class="short-desc"><strong>Feb. 7 </strong>“And yet the murder rat ...
## ...
This returns a list with 180 XML nodes that contain the information for each of the 180 lies in the web page.
Notice that I am using the `%>%` pipe operator from the magrittr package, which can help to express complex operations as elegant pipelines composed of simple, easily understood pieces.
Let’s start simple and focus on extracting all the necessary details from the first lie. We can then extend this to all the others easily. Remember that the general structure for a single record is:
<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>
Notice that the date is embedded within the `<strong>` tag. To select it, we can use the `html_nodes()` function with the selector `"strong"`.
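Let's grab the first record and select its `<strong>` node:

```r
# Take the first record, then its <strong> tag (which holds the date)
first_result <- results[1]
first_result %>% html_nodes("strong")
```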
## {xml_nodeset (1)}
## [1] <strong>Jan. 21 </strong>
We then need to use the `html_text()` function to extract only the text, with the `trim` argument active to trim leading and trailing spaces. Finally, we make use of the stringr package to add the year to the extracted date.
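Putting these steps together (mirroring the date extraction in the full script at the end):

```r
library(stringr)

# Extract the date text and append the year
date <- first_result %>%
  html_nodes("strong") %>%
  html_text(trim = TRUE) %>%
  str_c(", 2017")
```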
To select the lie, we need to make use of the `xml_contents()` function that is part of the xml2 package (this package is required by the rvest package, so it is not necessary to load it). The function returns a list with the nodes that are part of `first_result`.
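In code:

```r
# List the child nodes of the first record
xml_contents(first_result)
```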
## {xml_nodeset (3)}
## [1] <strong>Jan. 21 </strong>
## [2] “I wasn't a fan of Iraq. I didn't want to go into Iraq.”
## [3] <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczyns ...
We are interested in the lie, which is the text of the second node.
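We can extract it with `html_text()`:

```r
# The lie is the text of the second child node
lie <- xml_contents(first_result)[2] %>% html_text(trim = TRUE)
lie
```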
## [1] "“I wasn't a fan of Iraq. I didn't want to go into Iraq.”"
Notice that there is an extra pair of quotes (“…”) surrounding the lie. To get rid of them, we simply use the `str_sub()` function from the stringr package to select just the lie.
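In code:

```r
# Drop the first and last characters (the surrounding curly quotes)
str_sub(lie, 2, -2)
```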
## [1] "I wasn't a fan of Iraq. I didn't want to go into Iraq."
Hopefully by now it shouldn't be too complicated to see that to extract the explanation we simply need to select the text within the `<span>` tag that belongs to `class="short-truth"`. This will extract the text together with the opening and closing parentheses, but we can easily get rid of them.
explanation <- first_result %>% html_node(".short-truth") %>% html_text(trim = TRUE)
str_sub(explanation, 2, -2)
## [1] "He was for an invasion before he was against it."
Finally, to get the URL, notice that this is an attribute within the `<a>` tag. We simply select this node with the `html_nodes()` function, and then select the `href` attribute with the `html_attr()` function.
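In code:

```r
# The URL lives in the href attribute of the <a> tag
url <- first_result %>% html_nodes("a") %>% html_attr("href")
url
```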
## [1] "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"
We found a way to extract each of the 4 parts of the first record. We can extend this process to all the rest using a for loop. In the end, we want to have a data frame with 180 rows (one for each record) and 4 columns (to keep the date, the lie, the explanation and the URL). One way to do so is to create an empty data frame and simply add a new row as each new record is processed. However, this is not considered good practice. As suggested here, we are going to create a single data frame for each record and store all of them in a list. Once we have the 180 data frames, we'll bind them together using the `bind_rows()` function from the dplyr package. This creates our desired dataset.
# Extract the four fields from each record, one tibble per record
records <- vector("list", length = length(results))
for (i in seq_along(results)) {
  date <- str_c(results[i] %>% html_nodes("strong") %>% html_text(trim = TRUE), ", 2017")
  lie <- str_sub(xml_contents(results[i])[2] %>% html_text(trim = TRUE), 2, -2)
  explanation <- str_sub(results[i] %>% html_nodes(".short-truth") %>% html_text(trim = TRUE), 2, -2)
  url <- results[i] %>% html_nodes("a") %>% html_attr("href")
  records[[i]] <- tibble(date = date, lie = lie, explanation = explanation, url = url)
}
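We then bind the list of data frames together; here we inspect the result with dplyr's `glimpse()` (my choice of inspection function; the full script at the end only does the binding):

```r
df <- bind_rows(records)
glimpse(df)
```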
## Rows: 180
## Columns: 4
## $ date <chr> "Jan. 21, 2017", "Jan. 21, 2017", "Jan. 23, 2017", "Jan. 2…
## $ lie <chr> "I wasn't a fan of Iraq. I didn't want to go into Iraq.", …
## $ explanation <chr> "He was for an invasion before he was against it.", "Trump…
## $ url <chr> "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-t…
Notice that the column for the date is considered a character vector. It'd be nice to have it as a datetime vector instead. To do so, we can use the lubridate package and its `mdy()` (month-day-year) function to make the conversion.
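The conversion is a one-liner (again inspecting with `glimpse()`, my choice of inspection function):

```r
library(lubridate)

# Parse "Jan. 21, 2017"-style strings into Date objects
df$date <- mdy(df$date)
glimpse(df)
```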
## Rows: 180
## Columns: 4
## $ date <date> 2017-01-21, 2017-01-21, 2017-01-23, 2017-01-25, 2017-01-2…
## $ lie <chr> "I wasn't a fan of Iraq. I didn't want to go into Iraq.", …
## $ explanation <chr> "He was for an invasion before he was against it.", "Trump…
## $ url <chr> "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-t…
If you want to export your dataset, you can use either the `write.csv()` function that comes by default with R, or the `write_csv()` function from the readr package, which is roughly twice as fast and more convenient than the base version.
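For example:

```r
library(readr)

# Export the dataset to a CSV file
write_csv(df, "trump_lies.csv")
```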
Similarly, to retrieve your dataset, you can use either the default function `read.csv()` or the `read_csv()` function from the readr package.
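For example, to read back the file we just wrote:

```r
library(readr)

# Read the dataset back in; read_csv guesses the column types
df <- read_csv("trump_lies.csv")
```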
## Rows: 180 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): lie, explanation, url
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The full code for this tutorial is shown below:
# Load packages
library(rvest)
library(stringr)
library(dplyr)
library(lubridate)
library(readr)
# Read web page
webpage <- read_html("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
# Extract records info
results <- webpage %>% html_nodes(".short-desc")
# Building the dataset
records <- vector("list", length = length(results))
for (i in seq_along(results)) {
  date <- str_c(results[i] %>%
                  html_nodes("strong") %>%
                  html_text(trim = TRUE), ", 2017")
  lie <- str_sub(xml_contents(results[i])[2] %>% html_text(trim = TRUE), 2, -2)
  explanation <- str_sub(results[i] %>%
                           html_nodes(".short-truth") %>%
                           html_text(trim = TRUE), 2, -2)
  url <- results[i] %>% html_nodes("a") %>% html_attr("href")
  records[[i]] <- tibble(date = date, lie = lie, explanation = explanation, url = url)
}
df <- bind_rows(records)
# Transform to datetime format
df$date <- mdy(df$date)
# Export to csv
write_csv(df, "trump_lies.csv")