Lecture 4

class: title-slide

# Econ 520: Data Science for Economists

## Lecture 4: Data formats, storage, and Big Data tools
<br>

<p align=center>
Pedro H. C. Sant'Anna
</p>
<div style="margin-top: -.7cm;"></div>
<p align=center>
Emory University
</p>
<br>
<p align=center>
Spring 2024
</p>

---

# Prologue

- Today's material builds on [Tyler Ransom's Lecture Notes](https://github.com/tyleransom/DScourseS24) and on  [Grant McDermott's Lecture Notes](https://raw.githack.com/uo-ec607/lectures/master/16-databases/16-databases.html).

- We will provide several links, too, as we will not cover everything in detail.
<div style="margin-top: 2cm;"></div>

- To be clear, we could spend an entire course talking only about this topic.

---
class: center, middle

# About file extensions and file formats

---
# A note on file extensions
<div style="margin-top: 1.5cm;"></div>
- Often, if you download a file, you will immediately understand what type of a file it is by its extension. 
<div style="margin-top: 1.5cm;"></div>
- File extensions in and of themselves don't serve any particular purpose other than convenience.
<div style="margin-top: 1.5cm;"></div>
- File extensions were created so that we could keep track of which files on their workspace are scripts, which are binaries, etc.

---
# Why is the file format important?
<div style="margin-top: 1.5cm;"></div>
- File formats matter because they may need to match the environment you're working in. 
<div style="margin-top: 1.5cm;"></div>
- If you use the wrong file format, it may cause your computations to run slower than otherwise. 
<div style="margin-top: 1.5cm;"></div>
- To the extent that the environment you're working in requires a specific file format, then using the correct format is essential.

---
# Common file extensions when working with data
<div style="margin-top: 2cm;"></div>
- In the following table, I list some of the most common file extensions.
<div style="margin-top: 2cm;"></div>
- For a more complete list of almost every file extension imaginable (note: they missed Stata's `.do` and `.dta` formats), see [here](https://en.wikipedia.org/wiki/List_of_file_formats).
<div style="margin-top: 2cm;"></div>
- Another great discussion about file formats is [here](https://opendata.stackexchange.com/questions/1208/a-python-guide-for-open-data-file-formats) on stackexchange

---
# Open-format file extensions

- The following file extensions are not tied to a specific software program.

- In this sense they are "raw" and can be viewed in any sort of text editor.

.scroll-box-12[
| File extension                     | Description |
|------------------------------------|-------------|
| CSV                                | Comma separated values; data is in tabular form with column breaks marked by commas |
| TSV                                | Tab separated values; data is in tabular form with column breaks marked by tabs |
| DAT                                | Tab-delimited tabular data (ASCII file) |
| TXT                                | Plain text; not organized in any specific manner (though usually columns are delimited with tabs or commas) |
| TEX                                | LaTeX; markup-style typesetting system used in scientific writing |
| XML                                | eXtensible Markup Language; data is in text form with tags marking different fields |
| HTML                               | HyperText Markup Language; similar to XML; used for almost every webpage you view |
| [YAML](https://en.wikipedia.org/wiki/YAML) | YAML Ain't Markup Language; human readable version of XML |
| JSON                               | JavaScript Object Notation; similar to YAML in that it has a human readable element. YAML is technically a superset of JSON. |
| HDF                                | Hierarchical Data Format; bills itself as a "scientific data format" that can handle all types of data (tabular, text, images, etc.) |
| PDF                                | Portable Document Format; not a great way to store data, but there exists much data in PDF format. Good luck unpacking it! |
| TIFF, JPEG                         | These common image formats are used to store data in the form of images, or sometimes pictures of text data. There exist image processing libraries in almost every scientific programming language that can convert data in this format into more usable formats. |
| MP3, WAV                           | These common audio formats may be used to store data. For example, voice-to-text applications have some way of converting income audio (in some format) into data that the machine can comprehend. The same holds for MP4 and other video file formats (e.g. for video input to self-driving cars, etc.) |
]

---
# Examples of CSV, TSV, XML, YAML, and JSON files

A possible JSON representation describing a person ([source](https://en.wikipedia.org/wiki/JSON#Example))
.scroll-box-16[
```JSON
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
}
```
]

---
# Examples of CSV, TSV, XML, YAML, and JSON files

The same example as previously, but in XML: ([source](https://en.wikipedia.org/wiki/JSON#Example))
.scroll-box-16[
```xml
<person>
  <firstName>John</firstName>
  <lastName>Smith</lastName>
  <isAlive>true</isAlive>
  <age>27</age>
  <address>
    <streetAddress>21 2nd Street</streetAddress>
    <city>New York</city>
    <state>NY</state>
    <postalCode>10021</postalCode>
  </address>
  <phoneNumber>
    <type>home</type>
    <number>212 555-1234</number>
  </phoneNumber>
  <phoneNumber>
    <type>fax</type>
    <number>646 555-4567</number>
  </phoneNumber>
  <gender>
    <type>male</type>
  </gender>
</person>
```
]

---
# Examples of CSV, TSV, XML, YAML, and JSON files

The same example, but in YAML: ([source](https://en.wikipedia.org/wiki/JSON#Example))
.scroll-box-12[
```YAML
firstName: John
lastName: Smith
isAlive: true
age: 27
address: 
  streetAddress: 21 2nd Street
  city: New York
  state: NY
  postalCode: '10021'
phoneNumber: 
- type: home
  number: 212 555-1234
- type: fax
  number: 646 555-4567
gender: 
  type: male
```
]

Note that the JSON code above is also valid YAML; YAML simply has an alternative syntax that makes it more human-readable.

---
# Proprietary file extensions

The following file extensions typically require additional software to read, edit, or convert to another format:

| File extension                     | Description |
|------------------------------------|-------------|
| DB                                 | A common file extension for tabular data for SQLite |
| SQLITE                             | Another common file extension for tabular data for SQLite |
| XLS, XLSX                          | Tab-delimited tabular data for Microsoft Excel |
| RDA, RDATA                         | Tabular file format for R |
| MAT                                | ... for Matlab |
| SAS7BDAT                           | ... for SAS |
| SAV                                | ... for SPSS |
| DTA                                | ... for Stata |

---
# Archiving & file compression

Because data can be big and bulky, it is often easier to store and share the data in compressed form.

.scroll-box-8[
| File extension                     | Description |
|------------------------------------|-------------|
| ZIP                                | The most common format for file compression |
| Z                                  | Alternative to ZIP; uses a slightly different format for compression |
| 7Z                                 | Alternative to ZIP; uses [7-Zip](http://www.7-zip.org/) software for compression |
| GZ                                 | Another alternative to ZIP (primarily used in Linux systems), using what's called `gzip` |
| TAR                                | So-called "tarball" which is a way to collect many files into one archive file. TAR stands for "Tape ARchive" |
| TAR.GZ; TGZ                        | A compressed version of a tarball (compression via `gzip`) |
| TAR.BZ2; .TB2; .TBZ; .TBZ2         | Compressed tarball (via `bzip2`) |
]
<div align="center">
<img src="Targzip.svg" height=150>
</div>

---
# Other file types that aren't data
<div style="margin-top: 2cm;"></div>
- There are many file types that don't correspond to readable data. For example, script files (e.g. `.R`, `.py`, `.jl`, `.sql`, `.do`, `.cpp`, `.f90`, ...) are text files with convenient extensions to help the user remember which programming language the code is in
<div style="margin-top: 2cm;"></div>
- As a rule of thumb, if you don't recognize the extension of a file, it's best to inspect the file in a text editor (though pay attention to the size of the file as this can also help you discern whether it's code or data).

---
class: center, middle

# General types of data

---
# General Types of Data
<div style="margin-top: 2cm;"></div>
- When you think of data, you probably think of rows and columns, like a matrix or a spreadsheet. 
<div style="margin-top: 2cm;"></div>
- But it turns out there are other ways to store data, and you should know their similarities and differences to tabular data.
<div style="margin-top: 2cm;"></div>
---
# Dictionaries (a.k.a. Hash tables)

- A dictionary is a list that contains `keys` and `values`.

- Each key points to one value.

- While this may seem like an odd way to store data, it turns out that there are many, many applications in which this is the most efficient way to store things.

- We won't get into the nitty gritty details of dictionaries, but they are the workhorse of computer science, and you should at least know what they are and how they differ from tabular data.

- In fact, dictionaries are often used to store multiple arrays in one file (e.g. Matlab `.mat` files, R `.RData` files, etc.).

---
# Dictionaries (a.k.a Hash tables)

- The capability to manipulate hash tables is included in almost every major scientific programming language (although it is quite clunky in R ... this is why R is not considered to be a "general purpose programming language" by some people).
<div align="center">
<img src="Hash_table_3_1_1_0_1_0_0_SP.svg" height=350>
</div>

```r
#Image source: By Jorge Stolfi - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6471238 
```

---
# Why are dictionaries useful?

- You might look at the previous example and think a vector would be a better way to store phone numbers.

- The power of dictionaries is in their .hi[lookup speed]

- Looking up an index in a dictionary takes the same amount of time no matter how long the dictionary is!

- Computer scientists call this `\(O(1)\)` access time.

- Moreover, dictionaries can index .hi[objects], not just scalars.

- So I could have a dictionary of data frames, a dictionary of arrays, ...

---
class: center, middle

# Big Data file types

---
# Big Data file types

- Big Data file systems like Hadoop and Spark often use the same file types as R, SQL, and Python.
<div style="margin-top: 0.7cm;"></div>
- That is, `CSV` and `TSV` files are the workhorse.
<div style="margin-top: 0.7cm;"></div>
- Because of the nature of distributed file systems, it is often the case that JSON and XML are not good choices because they can't be broken up across machines. 
<div style="margin-top:0.7cm;"></div>
- Note: there is a distinction between JSON files and JSON records; see the second link at the end of this document for further details
<div style="margin-top: 0.7cm;"></div>
- Some terminology about big data file types:
    - `Sequence`
    - `Avro`
    - `Parquet`
---
# Sequence
<div style="margin-top: 2cm;"></div>
- Sequence files are dictionaries that have been optimized for Hadoop and friends.
<div style="margin-top: 2cm;"></div>
- The advantage to taking the dictionary approach is that the files can easily be coupled and decoupled.
<div style="margin-top: 2cm;"></div>
---
# Avro
<div style="margin-top: 2cm;"></div>
- Avro is an evolved version of Sequence---it contains more capability to store complex objects natively.

---
# Parquet

- Parquet is a format that allows Hadoop and friends to partition the data column-wise (rather than row-wise).

- Other formats in this vein are RC (Record Columnar) and ORC (Optimized Record Columnar).

---
# Useful Links about these tools (we won't go in detail)

- [A beginner's guide to Hadoop storage formats](https://blog.matthewrathbone.com/2016/09/01/a-beginners-guide-to-hadoop-storage-formats.html)

- [Hadoop File Formats: It's not just CSV anymore](https://community.hds.com/community/products-and-solutions/pentaho/blog/2017/11/07/hadoop-file-formats-its-not-just-csv-anymore)

---
class: center, middle

# More about datasets

---
# More about Databases

- Many "big data" problems could be more accurately described as "small data problems in disguise".

- Which is to say, the data that we care about is only a subset or aggregation of some larger dataset.
  - For example, we might want to access US Census data, but only for a handful of counties along the border of two contiguous states. 
  
  - Or, we might want to analyse climate data collected from a large number of weather stations, but aggregated up to the national or monthly level. 
 
---
# More about Databases

- In such cases, the underlying bottleneck is interacting with the original data, which is too big to fit into memory.

- How do we store data of this magnitude and and then access it effectively?

- The answer is through a **database**.

---
# What is a database?
- Databases can exist either locally or remotely, as well as in-memory or on-disk.

- Regardless of where a database is located, the key point is that information is stored in a way that allows for very quick extraction and/or aggregation.

- Think about a filing cabinet analogy:
> A filing cabinet arranges items by alphabetical order: Files starting "ABC" in the top drawer, "DEF" in the second drawer, etc. To find Alice's file, you'd only have to search the top draw. For Fred, the second draw, and so on.

- This analogy, whilst slightly imperfect, captures the essence of what makes databases so efficient.

- In truth, most databases rely on [binary search trees](https://en.wikipedia.org/wiki/Binary_search_tree) ("b-trees").

---
# What is a database?

- A b-tree shares the spirit of the file cabinet ordering system, but takes it step further so that we eliminate at least half of the remaining data from our search at each step.

- E.g. We need to find a person's tax records in a million-row dataset. 
  - In the first step we immediately realize that it's somewhere in the first 500k rows.
  - In the next step we immediately realize that it's somewhere in rows 250k-500k, etc.]

- They can very quickly identify the components that they need to focus on for a particular operation.
- Extracting the specific information that we want is a simple matter of submitting a **query** to the database.
- The query is where we tell the database how to manipulate or subset the data into a more manageable form, which we can then pull into our analysis environment (R, Python, etc.) 
---
# What is a database?

- At this point, you might be tempted to think of a database as the "thing" that you interact with directly.

- However, it's important to realize that the data are actually organised in one or more **tables** within the database.

- These tables are rectangular, consisting of rows and columns, where each row is identified by a unique key (remember the dictionary terminology?)

- In that sense, they are very much like the data frames that we're all used to working with.

---
# What is a database?

- Continuing with the analogy, a database then is rather like a list of data frames of `R`.

- To access information from a specific table (data frame), we first have to index it from the database (list) and then execute our query functions.

- The only material difference being that databases can hold much more information and are extremely efficient at executing queries over their vast contents.

> **Analogy Tip:** A table in a database is like a data frame in an R list.

---
class: center, middle

# Database and SQL

---
# Database and SQL

- Virtually every database in existence makes use of [**SQL**](https://en.wikipedia.org/wiki/SQL) (**S**tructured **Q**uery **L**anguage).

- SQL is an extremely powerful tool and has become something of prerequisite for many data science jobs.

---
# Database, SQL and R

- However, SQL is also an archaic language that is much less intuitive than the most popular R tools.

- We'll see several examples of this shortly, but first the good news:
  - We do not really need to learn SQL in detail to start working with databases. 
  
  - This is because the tidyverse --- through `dplyr` --- allows for direct communication with databases from your local R environment.
  
  - We will use `dplyr` and `tidyverse` for data wrangling, so it is the same workflow!

---
# Database, SQL and R

.hi[What does this mean? ]

- Simply that we can interact with the vast datasets that are stored in relational databases using the tidyverse verbs and syntax that we will use for data wrangling.

- All of this is possible thanks to the **dbplyr** package ([link](https://dbplyr.tidyverse.org/)), which provides a database backend to **dplyr**.

- What's happening even further behind the scenes is that, upon installation, **dbplyr** suggests the **DBI** package ([link](https://db.rstudio.com/dbi)) as a dependency.

- **DBI** provides a common interface that allows **dplyr** to work with many different databases using exactly the same code. 
  - You don't even need to leave your RStudio session or learn SQL in detail.
  
---
# Database, SQL and R

- While **DBI** is automatically bundled with **dbplyr**, you'll need to install a specific backend package for the type of database that you want to connect to.

- Some popular ones:
  1. **duckdb** embeds a DuckDB database.
  2. **bigrquery** connects to Google BigQuery.

- Unfortunately, we do not have the time to cover all of these backends.

- But there are some very good guides online!
  - [dbplyr](https://dbplyr.tidyverse.org/)
  - [Grant McDermott's guide](https://raw.githack.com/uo-ec607/lectures/master/16-databases/16-databases.html)
  - We will do a problem set on this!

---
# What if we want to learn SQL?

.hi[Where should I look for materials?]

- Luckily, we also have great guides:

- [W3Schools](https://www.w3schools.com/sql/)
  - [SQLZoo](https://sqlzoo.net/)
  - [Tyler Ransom's guide](https://raw.githack.com/tyleransom/DScourseS24/master/LectureNotes/06-SQL-RDD/06slides.html)

---
# Spark and databases

- We did not have the time to talk about all big data tools out there.

- Something that is getting very popular is [Spark](https://spark.apache.org/docs/0.9.1/scala-programming-guide.html).

- And its connection with Python, [PySpark](https://spark.apache.org/docs/latest/api/python/index.html).

- We will not cover this in this course, but it is a very important tool to know.

- Actually, we could spend the entire course only talking about these tools but we need to move on!

---
class: center, middle

# What is next
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# What is next?

.hi[Web scraping/APIs to gather data from the web]

- We will learn how to gather data from the web.

- Hands-on experience with web scraping.

- Hands-on experience with APIs.