This lecture draws on Grant McDermott’s lecture.
We’re going to be downloading economic data from the FRED API. This will require that you first create a user account and then register a personal API key.
Today we’ll be using JSONView, a browser extension that renders JSON output nicely in Chrome and Firefox. (Not required, but recommended.)
We will use the following R packages: jsonlite, httr, listviewer, usethis, fredr, tidyverse, lubridate, hrbrthemes, janitor.
Here’s a convenient way to install (if necessary) and load all of the above packages.
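One possible way to do this with base R alone is sketched below. The CRAN mirror URL is my choice, not a requirement, and the pacman package's p_load() function is a popular alternative that does the same job in a single call.

```r
# Packages used in this lecture
pkgs <- c("jsonlite", "httr", "listviewer", "usethis", "fredr",
          "tidyverse", "lubridate", "hrbrthemes", "janitor")

# Install whichever ones are not already on this machine.
# (The mirror below is an assumption; any CRAN mirror works.)
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Attach them all in one go
invisible(lapply(pkgs, library, character.only = TRUE))
```

If a package fails to install (hrbrthemes, for instance, may complain about missing fonts on some systems), the rest of the lecture mostly still works; you just lose that package's extras.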
During the last lecture, we practiced scraping data using the rvest package. This technique focuses on CSS selectors (with help from SelectorGadget) and HTML tags. We also saw that web-scraping often involves as much art as science. The plethora of CSS options and the flexibility of HTML itself means that steps which work perfectly well on one website can easily fail on another website.
Today we focus on a different type of scraping: one where the content is rendered client-side. The good news is that, when available, this approach typically makes it much easier to scrape data from the web. The downside is that, again, it can involve as much art as it does science. Moreover, as we discussed last lecture, just because we can scrape data doesn’t mean that we should (there are ethical, legal and other considerations). But let’s proceed…
Recall that websites or applications that are built using a client-side framework typically involve something like the following steps:
All of this requesting, responding and rendering takes place through the host application’s API (or Application Programming Interface).
If you’re new to APIs, then I recommend this excellent resource from Zapier: An Introduction to APIs. It’s fairly in-depth, but you don’t need to work through the whole thing to get the gist. The summary version is that an API is really just a collection of rules and methods that allow different software applications to interact and share information. This includes not only web servers and browsers, but also software packages like the R libraries we’ve been using. Key concepts include:
The method we will rely on most is GET (i.e. ask a server to retrieve information), but other common methods are POST, PUT and DELETE.
A key point in all of this is that, in the case of web APIs, we can access information directly from the API database if we can specify the correct URL(s). These URLs are known as API endpoints.
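To make the idea of "specifying the correct URL" concrete, here is a minimal sketch using httr. The host api.example.com, the path and the limit parameter are all made-up placeholders rather than a real service; the point is only that an endpoint is an ordinary URL assembled from a host, a path and (optionally) some query parameters.

```r
library(httr)

# Assemble a hypothetical endpoint from its pieces.
# (Host, path and the "limit" parameter are placeholders, not a real API.)
endpoint <- modify_url(
  "https://api.example.com",
  path  = "resource/trees.json",
  query = list(limit = "5")
)
endpoint

# Break the URL back into its components to see what is being asked for
parts <- parse_url(endpoint)
parts$hostname
parts$path
parts$query$limit

# Against a real endpoint, the request itself would be a GET call, e.g.
# resp <- GET(endpoint)
# status_code(resp)             # 200 if the request succeeded
# content(resp, as = "text")    # the raw JSON or XML body as a string
```

The GET lines are commented out because the placeholder host will not respond; swap in a real endpoint and they should run as written.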
API endpoints are in many ways similar to the normal website URLs that we’re all used to visiting. For starters, you can navigate to them in your web browser. However, whereas normal websites display information in rich HTML content — pictures, cat videos, nice formatting, etc. — an API endpoint is much less visually appealing.
Navigate your browser to an API endpoint and you’ll just see a load of seemingly unformatted text. In truth, what you’re really seeing is (probably) either JSON (JavaScript Object Notation) or XML (Extensible Markup Language).
You don’t need to worry too much about the syntax of JSON and XML. The important thing is that the object in your browser — that load of seemingly unformatted text — is actually very precisely structured and formatted. Moreover, it contains valuable information that we can easily read into R (or Python). We just need to know the right API endpoint for the data that we want.
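As a quick illustration of how structured that "unformatted" text really is, the sketch below feeds a small JSON string to jsonlite. The payload is invented for this example (the field names are not taken from any real API), but it has the typical shape of an API response: an array of records that all share the same fields.

```r
library(jsonlite)

# A made-up JSON payload: an array of records with identical fields
json_txt <- '[
  {"tree_id": 1, "species": "red maple", "health": "Good"},
  {"tree_id": 2, "species": "pin oak",   "health": "Fair"}
]'

# fromJSON() simplifies an array of same-shaped objects into a data frame
trees <- fromJSON(json_txt)
str(trees)

# From here on it is ordinary R: columns are just vectors
trees$species
```

The same fromJSON() call also accepts a URL in place of a string, which is what makes reading from an API endpoint so convenient.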
Let’s practice doing this through a few example applications. I’ll start with the simplest case (no API key required, explicit API endpoint) and then work through a more complicated example. There is a third case, hidden API endpoints, that we won’t cover today, but I’ll mention it briefly at the end.
NYC Open Data is a pretty amazing initiative. Its mission is to “make the wealth of public data generated by various New York City agencies and other City organizations available for public use”. You can get data on everything from arrest data, to the location of WiFi hotspots, to city job postings, to homeless population counts, to dog licenses, to a directory of toilets in public parks… The list goes on.
I highly encourage you to explore in your own time, but we’re going to do something “earthy” for this first application: Download a sample of tree data from the 2015 NYC Street Tree Census.
Let’s begin with an example from NYC Open Data, because we don’t need to set up an API key in advance.1
All you need to do is complete the following steps:
Here’s an outdated GIF of Grant completing these steps: