
class: title-slide

# Econ 520: Data Science for Economists

## Lecture 9: Data Wrangling and Visualization in Practice
<br>

<p align=center>
Pedro H. C. Sant'Anna
</p>
<div style="margin-top: -.7cm;"></div>
<p align=center>
Emory University
</p>
<br>
<p align=center>
Spring 2024
</p>

---
class: center, middle
name: prologue

# Prologue

---
# Who are we building on?

- This lecture builds on the book by Békés & Kézdi, "Data Analysis for Business, Economics, and Policy" (2021).

- We will essentially flip the classroom and work on exercises in class.

---
# Background

- We are interested in finding good deals among hotels, hostels and other accommodations in Vienna.

- Towards that end, we web-scraped a price comparison website and obtained a dataset with prices and other features, such as the average customer rating, number of reviews, stars, and distance from the city center, among other variables.

- The data were collected on a given weekday in November 2017.

- Because the website only displays rooms that are available on that day, our data cover only accommodations with available rooms, among those actually listed on the website.

- Our main goal is to explore this dataset and understand the relationship between the price and the other variables.

---

# Part I: Getting started

- Create a new R Project in RStudio.

- Ensure that you have a well-structured folder for your project.
  - data
      - raw
      - processed
  - codes
  - plots
  - tables
  - docs
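
If you prefer to create these folders from the R console instead of by hand, a minimal sketch is below. It assumes the working directory is the project root, which is the default when the R Project is open in RStudio.

```r
# Folder names follow the project structure listed above
folders <- c("data/raw", "data/processed", "codes", "plots", "tables", "docs")

for (folder in folders) {
  # recursive = TRUE also creates the parent `data` folder;
  # showWarnings = FALSE keeps reruns quiet if a folder already exists
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

# Check that every folder now exists
all(dir.exists(folders))
```
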
---

# Part I: Getting the data in

- Download the raw data files from https://www.dropbox.com/scl/fo/m92e6srz2p1rydu21ipdn/h?rlkey=4z1nn6tlhkud7qrbxetvjjowr&dl=0

- Save the raw dataset in the `data/raw` folder in your project.

- Start an R script in the `codes` folder and name it `01_getting_data.R`.

- Open `01_getting_data.R` and load the data from `hotelbookingdata-vienna.csv` into R as a tibble.

- How many observations and variables are there in the dataset?

- Do we have any accommodation with multiple entries? If so, how many?
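
One possible starting point for `01_getting_data.R` is sketched below. The helper name `load_hotels()` is hypothetical, and the path assumes the file was saved in `data/raw` with its original name. The call is guarded so the script does not fail if the file is not there yet.

```r
library(readr)

# Hypothetical helper: read a CSV file into a tibble
load_hotels <- function(path) {
  # read_csv() returns a tibble; show_col_types = FALSE hides the column message
  read_csv(path, show_col_types = FALSE)
}

# Path relative to the project root (assumed location of the raw file)
raw_path <- "data/raw/hotelbookingdata-vienna.csv"

if (file.exists(raw_path)) {
  hotels_raw <- load_hotels(raw_path)
  print(dim(hotels_raw))    # number of observations and number of variables
  print(names(hotels_raw))  # variable names, useful for the next steps
}
```
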
---

# Part II: Analyzing the data

- Do we have any accommodation with multiple entries? 
  - If so, how many? 
  
  - Are they duplicates?
  
  - If so, keep only one of them.

- What is the average price of the accommodations in the dataset?
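
A minimal sketch of these steps on toy data is below. The column names `hotel_id` and `price` are assumptions; check `names()` on the real dataset and adapt accordingly.

```r
library(dplyr)
library(tibble)

# Toy data standing in for the raw dataset (assumed column names)
toy <- tibble(
  hotel_id = c(1, 2, 2, 3, 3, 3),
  price    = c(80, 120, 120, 95, 95, 95)
)

# Accommodations that appear more than once, with their number of entries
multi <- toy %>%
  count(hotel_id, name = "n_entries") %>%
  filter(n_entries > 1)

# Rows that are exact copies of an earlier row (all columns identical)
n_exact_dupes <- sum(duplicated(toy))

# Keep one row from each set of identical rows
toy_unique <- distinct(toy)

# Average price after removing duplicates
mean(toy_unique$price, na.rm = TRUE)
```

Note that `distinct()` only drops rows that are identical in every column. If two entries share an ID but differ elsewhere, they are not duplicates in this sense and deserve a closer look.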

---

# Part II: Analyzing the data (cont.)

- How many of the observations are hotels, hostels, and other types of accommodations?

- Here, note that you should create a new variable `type` that classifies the accommodations into these three categories.

- What proportion of observations in the entire dataset are missing the average customer rating?

- What is this proportion for each type of accommodation (hotels, hostels, and other types of accommodations)?
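
One way to build `type` and compute the missing shares is sketched below on toy data. The column names `accommodation_type` and `rating`, and the labels inside them, are assumptions; the raw file may code them differently and need cleaning first.

```r
library(dplyr)
library(tibble)

# Toy data standing in for the real dataset (assumed names and labels)
toy <- tibble(
  accommodation_type = c("Hotel", "Hotel", "Hostel",
                         "Apartment", "Guest House", "Hostel"),
  rating = c(4.5, 4.0, 3.8, NA, 4.1, NA)
)

# Collapse the original categories into Hotel, Hostel, and Other
toy <- toy %>%
  mutate(
    type = case_when(
      accommodation_type == "Hotel"  ~ "Hotel",
      accommodation_type == "Hostel" ~ "Hostel",
      TRUE                           ~ "Other"
    )
  )

# Number of observations per type
count(toy, type)

# Proportion of missing ratings in the entire dataset
overall_missing <- mean(is.na(toy$rating))

# Proportion of missing ratings by type
missing_by_type <- toy %>%
  group_by(type) %>%
  summarise(prop_missing = mean(is.na(rating)), n_obs = n())
```

The mean of a logical vector is the share of `TRUE` values, which is why `mean(is.na(...))` gives a proportion.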

---

# Part II: Analyzing the data (cont.)

- How does the average price vary with type of accommodation?

- Produce a table with the average price by type of accommodation.
  
  - Expand the table to also include the standard deviation of the price, the median of the price, and the number of observations for each type of accommodation.
  
  - Save the last table as a .csv file in the `tables` folder in your project.
  
- Produce a bar plot with the average price by type of accommodation.
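
A sketch of the summary table, the export, and the bar plot is below. It uses toy data with assumed columns `type` and `price`, and the output file name is only a suggestion.

```r
library(dplyr)
library(tibble)
library(readr)
library(ggplot2)

# Toy data standing in for the cleaned dataset (assumed column names)
toy <- tibble(
  type  = c("Hotel", "Hotel", "Hotel", "Hostel", "Hostel", "Other"),
  price = c(100, 140, 120, 40, 60, 90)
)

# Mean, standard deviation, median, and number of observations by type
price_by_type <- toy %>%
  group_by(type) %>%
  summarise(
    mean_price   = mean(price, na.rm = TRUE),
    sd_price     = sd(price, na.rm = TRUE),
    median_price = median(price, na.rm = TRUE),
    n_obs        = n()
  )

# Save the table in the `tables` folder (created here if it is missing)
dir.create("tables", showWarnings = FALSE)
write_csv(price_by_type, "tables/price_by_type.csv")

# Bar plot of the average price; geom_col() plots the values as given
bar_plot <- ggplot(price_by_type, aes(x = type, y = mean_price)) +
  geom_col() +
  labs(x = "Type of accommodation", y = "Average price")
```

We use `geom_col()` rather than `geom_bar()` because the averages are already computed; `geom_bar()` would count rows by default.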

---
# Part II: Analyzing the data (cont.)

- How do the average price and the average rating vary with the type of accommodation?

- Produce a scatter plot with the average price and the average rating by type of accommodation.
  
  - Add a linear fit to the scatter plot.
  
  - Add a legend to the scatter plot.
  
  - Add a title to the scatter plot.
  
  - Add labels to the x and y axes.
  
  - Save the scatter plot as a .png file in the `plots` folder in your project.
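
The steps above can be sketched as follows, using toy data with assumed columns `type`, `rating`, and `price`. The title, labels, and file name are placeholders to adapt.

```r
library(ggplot2)
library(tibble)

# Toy data standing in for the cleaned dataset (assumed column names)
toy <- tibble(
  type   = rep(c("Hotel", "Hostel", "Other"), each = 4),
  rating = c(3.5, 4.0, 4.5, 4.8, 3.0, 3.5, 4.0, 4.2, 3.2, 3.9, 4.4, 4.7),
  price  = c(90, 110, 140, 180, 35, 45, 55, 60, 70, 85, 100, 130)
)

# Mapping color to `type` produces the legend automatically
scatter <- ggplot(toy, aes(x = rating, y = price, color = type)) +
  geom_point() +
  # One linear fit per type, without confidence bands
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Price and customer rating by type of accommodation",
    x     = "Average customer rating",
    y     = "Price",
    color = "Type"
  )

# Save the plot in the `plots` folder (created here if it is missing)
dir.create("plots", showWarnings = FALSE)
ggsave("plots/price_vs_rating.png", plot = scatter, width = 7, height = 5)
```

For a single fit pooled across all types, move `color = type` from `ggplot()` into `geom_point(aes(color = type))`.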
  
---
# Part II: Analyzing the data (cont.)

- How do the average price and the average rating vary with the type of accommodation?

- Construct a histogram for the star rating.

- Construct a histogram for the number of reviews.

- Construct a histogram for the distance from the city center.
  - Do we see any extreme values? Are they "outliers"?

- Construct a density plot for the price.

- Construct a histogram for the price, among hotels only.

- Add a density plot to the histogram for the price, among hotels only.
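
A sketch of the three kinds of plots is below. The data are simulated purely for illustration and the column names are assumptions. The histograms for the star rating and the number of reviews follow the same pattern as the one for distance.

```r
library(ggplot2)
library(dplyr)
library(tibble)

# Simulated toy data, for illustration only (assumed column names)
set.seed(520)
toy <- tibble(
  type     = sample(c("Hotel", "Hostel", "Other"), 200, replace = TRUE),
  price    = round(rlnorm(200, meanlog = 4.6, sdlog = 0.4)),
  distance = round(rexp(200, rate = 0.8), 1)
)

# Histogram of the distance from the city center
hist_distance <- ggplot(toy, aes(x = distance)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Distance from city center (miles)", y = "Count")

# Density plot of the price
dens_price <- ggplot(toy, aes(x = price)) +
  geom_density() +
  labs(x = "Price", y = "Density")

# Histogram with a density overlay, hotels only
hotels_only <- filter(toy, type == "Hotel")

hist_dens_hotels <- ggplot(hotels_only, aes(x = price)) +
  # Put the histogram on the density scale so both layers share one y axis
  geom_histogram(aes(y = after_stat(density)), binwidth = 20) +
  geom_density() +
  labs(x = "Price (hotels only)", y = "Density")
```

The bin widths are arbitrary choices; try a few values, since the shape of a histogram can change noticeably with the binning.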

---
# Part III: More on visualization

- Construct a histogram for the price, among hotels only.

- Now restrict the observations to be within 8 miles of the city center and construct a histogram for the price.
  
  - Now, instead of restricting the observations to be within 8 miles of the city center, keep the original dataset, construct a histogram of the distance from the city center, and add a vertical line at 8 miles to mark the cutoff for the accommodations we are interested in.
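
Both approaches can be sketched as below, on toy data with assumed columns `price` and `distance` (in miles). A cutoff in miles belongs on the distance axis, so the second plot here is a histogram of distance with a vertical reference line at 8; that reading of the exercise is an assumption.

```r
library(ggplot2)
library(dplyr)
library(tibble)

# Toy data standing in for the hotels-only dataset (assumed column names)
toy <- tibble(
  price    = c(80, 95, 110, 120, 135, 150, 60, 55, 105, 90),
  distance = c(0.5, 1.2, 2.0, 3.1, 0.8, 1.5, 9.4, 12.0, 2.7, 4.3)
)

# Approach 1: drop observations farther than 8 miles, then plot the price
within_8 <- filter(toy, distance <= 8)

hist_price_within_8 <- ggplot(within_8, aes(x = price)) +
  geom_histogram(binwidth = 20) +
  labs(x = "Price (within 8 miles of the center)", y = "Count")

# Approach 2: keep all observations and mark the cutoff on the distance axis
hist_distance_cutoff <- ggplot(toy, aes(x = distance)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = 8, linetype = "dashed") +
  labs(x = "Distance from city center (miles)", y = "Count")
```

The second approach keeps the excluded observations visible, so the reader can judge how many fall beyond the cutoff.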

---
# Part III: More on visualization (cont.)

- Construct a box plot for the price by type of accommodation.

- Construct a violin plot for the price by type of accommodation.

- Save the box plot and the violin plot as .png files in the `plots` folder in your project.
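
A minimal sketch of both plots is below, using toy data with assumed columns `type` and `price`. The file names are placeholders.

```r
library(ggplot2)
library(tibble)

# Toy data standing in for the cleaned dataset (assumed column names)
toy <- tibble(
  type  = rep(c("Hotel", "Hostel", "Other"), each = 5),
  price = c(90, 110, 125, 140, 220,
            30, 40, 45, 55, 60,
            65, 80, 95, 110, 150)
)

# Box plot: median, quartiles, and potential extreme values by type
box_plot <- ggplot(toy, aes(x = type, y = price)) +
  geom_boxplot() +
  labs(x = "Type of accommodation", y = "Price")

# Violin plot: estimated density of the price by type
violin_plot <- ggplot(toy, aes(x = type, y = price)) +
  geom_violin() +
  labs(x = "Type of accommodation", y = "Price")

# Save both plots in the `plots` folder (created here if it is missing)
dir.create("plots", showWarnings = FALSE)
ggsave("plots/price_boxplot.png", plot = box_plot, width = 7, height = 5)
ggsave("plots/price_violin.png", plot = violin_plot, width = 7, height = 5)
```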

---
# Part IV: Conclusion

- What have you learned in this exercise?

- What would be the next steps in this analysis?

- What are the limitations of the analysis?

- Summarize the main findings in two pages, with the most useful insights. Include the most important plots and tables.
  - Think of this as an executive summary.
  - Write this in a Markdown file and export it as a .pdf file in the `docs` folder in your project.