class: title-slide

# Econ 520: Data Science for Economists

## Lecture 9: Data wrangling and Visualization, in practice

<br>

<p align=center> Pedro H. C. Sant'Anna </p>

<div style="margin-top: -.7cm;"></div>

<p align=center> Emory University </p>

<br>

<p align=center> Spring 2024 </p>

---
class: center, middle
name: prologue

# Prologue

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---

# Who are we building on?

- This lecture builds on the materials of the book by Békés & Kézdi, "Data Analysis for Business, Economics, and Policy" (2021).
- We will essentially flip the classroom and work on exercises in class.

---

# Background

- We are interested in finding good deals among hotels, hostels, and other accommodations in Vienna.
- Towards that end, we web-scraped a price comparison website and obtained a dataset with prices and other features, such as average review rating, number of reviews, stars, and distance from the city center, among other variables.
- The data was collected on a given weekday in November 2017.
- The website only displays rooms that are available on that day, so the dataset covers only the accommodations that were actually displayed on the website.
- Our main goal is to explore this dataset and understand the relationship between the price and the other variables.

---

# Part I: Getting started

- Create a new R Project in RStudio.
- Ensure that you have a well-structured folder for your project.
    - data
        - raw
        - processed
    - codes
    - plots
    - tables
    - docs

---

# Part I: Getting the data in

- Download the raw data files from https://www.dropbox.com/scl/fo/m92e6srz2p1rydu21ipdn/h?rlkey=4z1nn6tlhkud7qrbxetvjjowr&dl=0
- Save the raw dataset in the `data/raw` folder in your project.
- Start an R script in the `codes` folder and name it `01_getting_data.R`.
- Open `01_getting_data.R` and load the data from `hotelbookingdata-vienna.csv` into R as a tibble.
- How many observations and variables are there in the dataset?
- Do we have any accommodation with multiple entries? If so, how many?

---

# Part II: Analyzing the data

- Do we have any accommodation with multiple entries?
- If so, how many?
- Are they duplicates?
- If so, keep only one of them.
- What is the average price of the accommodations in the dataset?

---

# Part II: Analyzing the data (cont.)

- How many of the observations are hotels, hostels, and other types of accommodations?
- Here, note that you should create a new variable `type` that classifies the accommodations into these three categories.
- What is the proportion of missing average customer ratings in the entire dataset?
- What is the proportion of missing average customer ratings by type of accommodation (hotels, hostels, and other types of accommodations)?

---

# Part II: Analyzing the data (cont.)

- How does the average price vary with the type of accommodation?
- Produce a table with the average price by type of accommodation.
- Expand the table to also include the standard deviation of the price, the median of the price, and the number of observations for each type of accommodation.
- Save the last table as a .csv file in the `tables` folder in your project.
- Produce a bar plot with the average price by type of accommodation.

---

# Part II: Analyzing the data (cont.)

- How do the average price and the average rating vary with the type of accommodation?
- Produce a scatter plot with the average price and the average rating by type of accommodation.
- Add a linear fit to the scatter plot.
- Add a legend to the scatter plot.
- Add a title to the scatter plot.
- Add labels to the x and y axes.
- Save the scatter plot as a .png file in the `plots` folder in your project.

---

# Part III: More on visualization

- Construct a histogram for the star rating.
- Construct a histogram for the number of reviews.
- Construct a histogram for the distance from the city center.
- Do we see any extreme values? Are they "outliers"?
- Construct a density plot for the price.
- Construct a histogram for the price, among hotels only.
- Add a density plot to the histogram for the price, among hotels only.

---

# Part III: More on visualization (cont.)

- Construct a histogram for the price, among hotels only.
- Now restrict the observations to those within 8 miles of the city center and construct a histogram for the price.
- Now, instead of restricting the observations to those within 8 miles of the city center, keep the original dataset and construct a histogram for the distance from the city center, adding a vertical line at 8 miles to indicate that we are interested in the price of accommodations within 8 miles of the city center.

---

# Part III: More on visualization (cont.)

- Construct a box plot for the price by type of accommodation.
- Construct a violin plot for the price by type of accommodation.
- Save the box plot and the violin plot as .png files in the `plots` folder in your project.

---

# Part IV: Conclusion

- What have you learned in this exercise?
- What would be the next steps in this analysis?
- What are the limitations of the analysis?
- Summarize the main findings in two pages, with the most useful insights. Include the most important plots and tables.
- Think of this as an executive summary.
- Write this in a Markdown file and export it as a .pdf file in the `docs` folder in your project.
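
---

# Appendix: a possible starting point for the wrangling steps

The wrangling questions in Part II (repeated entries, the `type` variable, missing ratings, and the price table) could be approached along the following lines. This is only a minimal sketch on a toy tibble, not a solution for the actual data: the column names `hotel_id`, `accommodation_type`, `price`, and `rating`, as well as all values, are assumptions made for illustration and may not match the columns in `hotelbookingdata-vienna.csv`.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the scraped data. Column names and values are hypothetical.
# With the real file, one would instead start from something like
# readr::read_csv("data/raw/hotelbookingdata-vienna.csv").
hotels_raw <- tribble(
  ~hotel_id, ~accommodation_type, ~price, ~rating,
  1,         "Hotel",             100,    4.0,
  1,         "Hotel",             100,    4.0,  # exact duplicate of the row above
  2,         "Hotel",             140,    4.5,
  3,         "Hostel",             40,    NA,
  4,         "Hostel",             60,    3.5,
  5,         "Apartment",          90,    NA,
  6,         "Guest House",        70,    4.2
)

# Number of ids that appear more than once.
n_repeated_ids <- hotels_raw %>%
  count(hotel_id) %>%
  filter(n > 1) %>%
  nrow()

# Drop rows that are exact duplicates, then classify into three types.
hotels <- hotels_raw %>%
  distinct() %>%
  mutate(
    type = case_when(
      accommodation_type == "Hotel"  ~ "hotel",
      accommodation_type == "Hostel" ~ "hostel",
      TRUE                           ~ "other"
    )
  )

# Share of missing ratings, overall and by type.
missing_overall <- mean(is.na(hotels$rating))
missing_by_type <- hotels %>%
  group_by(type) %>%
  summarise(share_missing = mean(is.na(rating)), .groups = "drop")

# Mean, standard deviation, median, and count of price by type.
price_by_type <- hotels %>%
  group_by(type) %>%
  summarise(
    mean_price   = mean(price),
    sd_price     = sd(price),
    median_price = median(price),
    n_obs        = n(),
    .groups      = "drop"
  )

# Written to a temporary folder here; in the project this would be
# something like "tables/price_by_type.csv".
table_file <- file.path(tempdir(), "price_by_type.csv")
write.csv(price_by_type, table_file, row.names = FALSE)
```

Note that `distinct()` with no arguments only removes rows that are identical in every column, so repeated ids with differing values would be kept and need a separate decision.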
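
---

# Appendix: a possible starting point for the plots

The plotting tasks in Parts II and III could be sketched with `ggplot2` roughly as below. The data are simulated and the column names (`type`, `rating`, `distance`, `price`) are assumptions for illustration only. The 8-mile reference line is interpreted here as a vertical line on a histogram of distance, which is one possible reading of that exercise.

```r
library(ggplot2)

# Simulated stand-in data. Column names and values are hypothetical.
set.seed(520)
hotels <- data.frame(
  type     = rep(c("hotel", "hostel", "other"), times = c(30, 10, 10)),
  rating   = round(runif(50, 2, 5), 1),
  distance = round(c(runif(48, 0, 6), 9.5, 12), 1)
)
hotels$price <- round(40 + 20 * hotels$rating + rnorm(50, sd = 15))

# Bar plot of the average price by type.
p_bar <- ggplot(hotels, aes(x = type, y = price)) +
  stat_summary(fun = mean, geom = "bar")

# Scatter plot with a linear fit per type, a legend, a title, and axis labels.
p_scatter <- ggplot(hotels, aes(x = rating, y = price, color = type)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Price and customer rating by type of accommodation",
    x     = "Average customer rating",
    y     = "Price",
    color = "Type"
  )

# Histogram of hotel prices on the density scale, with a density curve on top.
p_hist <- ggplot(subset(hotels, type == "hotel"), aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15) +
  geom_density()

# Histogram of distance with a dashed vertical reference line at 8 miles.
p_dist <- ggplot(hotels, aes(x = distance)) +
  geom_histogram(bins = 15) +
  geom_vline(xintercept = 8, linetype = "dashed")

# Box plot and violin plot of price by type.
p_box    <- ggplot(hotels, aes(x = type, y = price)) + geom_boxplot()
p_violin <- ggplot(hotels, aes(x = type, y = price)) + geom_violin()

# Saved to a temporary folder here; in the project this would be
# something like "plots/scatter_price_rating.png".
plot_file <- file.path(tempdir(), "scatter_price_rating.png")
ggsave(plot_file, plot = p_scatter, width = 7, height = 5)
```

Mapping `color = type` inside `aes()` is what produces the legend and the separate fit per type; moving it out of `aes()` would give a single pooled fit instead.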