class: title-slide

# Econ 520: Data Science for Economists

## Lecture 1: Course Overview

<br>

<p align=center> Pedro H. C. Sant'Anna </p>
<div style="margin-top: -.7cm;"></div>
<p align=center> Emory University </p>

<br>

<p align=center> Spring 2024 </p>

---

# Outline

1. Course preliminaries

<div style="margin-top: 2.5cm;"></div>

2. What is Data Science?

<div style="margin-top: 2.5cm;"></div>

3. Course Roadmap

---

# Course Preliminaries I - Introducing Ourselves

- Welcome to .hi[ECON 520!] I'm looking forward to teaching you all.
- I'm Professor Pedro Sant'Anna.
- You can reach me via [pedro.santanna@emory.edu](mailto:pedro.santanna@emory.edu).
- Before joining Emory, I was a Principal Researcher at Microsoft, an Amazon Visiting Academic, and an Assistant Professor at Vanderbilt University.
  - At Microsoft, I have worked on projects with Xbox, Azure, and Office.
  - At Amazon, I have worked mostly on projects around supply chain.
- I am a father of 4 kids: twins are expected to be born any day now!

---

# Course Preliminaries II

- Course materials and communications will be posted on Canvas.

<div style="margin-top: 0.6cm;"></div>

- I will also post lectures at https://psantanna.com/Econ520.

<div style="margin-top: 0.6cm;"></div>

- I am setting up a GitHub Classroom for us.

<div style="margin-top: 0.6cm;"></div>

- Lectures: Tuesdays and Thursdays, 11:30 am - 12:45 pm.

<div style="margin-top: 0.6cm;"></div>

- Recording links will be posted online after class.

<div style="margin-top: 0.6cm;"></div>

- Office hours: Wednesdays, 10--11 am, by appointment (via Zoom).

<div style="margin-top: 0.6cm;"></div>

- Software: R or Python
  - I will mostly use R, but AI tools (ChatGPT or Copilot) can easily translate the code to Python!

---

# Course Preliminaries II

- Katie Leinenbach is our TA.
- She will hold office hours on Tuesdays, 5-6 PM, in R400A-9 (4th floor of the Randall Rollins building).
- She will help us with almost everything we need, and I will be available to help her as well.
- Her email is [katherine.leinenbach@emory.edu](mailto:katherine.leinenbach@emory.edu).

---

# Course Preliminaries III - Course Materials

- We will borrow material from a few textbooks and online materials, as no ideal textbook is available.
- As a result, we will not follow a textbook in a chapter-by-chapter manner.
- .hi[Main textbooks we will borrow from]:
  - Gábor Békés & Gábor Kézdi, .it[Data Analysis for Business, Economics, and Politics]. Cambridge University Press, 2021.

<div style="margin-top: 1cm;"></div>

  - Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor, .it[An Introduction to Statistical Learning, with Applications in Python]. Springer, 2023. Freely available at https://www.statlearning.com/

---

# Course Preliminaries III - Course Materials

- .hi[Other textbooks we will borrow from]:
  - Matt Taddy, .it[Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions]. 1st Edition, McGraw-Hill, 2019.

<div style="margin-top: 1.5cm;"></div>

  - Matt Taddy, Leslie Hendrix, and Matthew Harding, .it[Modern Business Analytics: Practical Data Science for Decision Making]. 1st Edition, McGraw-Hill, 2023.

---

# Course Preliminaries III - Course Materials

- .hi[We will also borrow and build on the following online materials]:
  - Grant McDermott's notes on "Data Science for Economists", freely available at https://github.com/uo-ec607/lectures.

<div style="margin-top: 1cm;"></div>

  - Tyler Ransom's notes on "Data Science for Economists", freely available at https://github.com/pedrohcgs/DScourseS24_Ransom.

<div style="margin-top: 1cm;"></div>

  - Michael Knaus's notes on "Causal Machine Learning", freely available at https://github.com/pedrohcgs/causalML-teaching.

---

# Course Preliminaries IV - Grading

- The final grade will be determined by a weighted average of scores from:
  - 7 Problem Sets (50%);
  - 7 Quizzes (25%);
  - Course Project (25%).
<div style="margin-top: 1.5cm;"></div>

- For the Problem Sets and Quizzes, only the 6 items with the highest scores will count toward the grade.

<div style="margin-top: 1.5cm;"></div>

- The mapping from numerical grades to letter grades is stated in the [Syllabus](https://psantanna.com/Econ520/files/Syllabus_Econ520.pdf).

---

# Course Preliminaries V - Project

- Since this course is hands-on, we want you to have a final project summarizing all the skills you have learned here.
- You should form a team of .ul[two students] so the project is collaborative.
- You should choose a research question, collect data on it, and analyze it using methods taught in this course.
- If you are short on ideas, check out https://www.kaggle.com/datasets; Kaggle routinely hosts analytics competitions.
  - Kaggle publishes data for each competition.
- The main criterion for the project is that you apply the skills developed in this class.

---

# Course Preliminaries V - Project

- Here are more details about the project:

<div style="margin-top: 1.5cm;"></div>

- Turn in a written summary report (~10 pages) and a GitHub repository containing all materials required to reproduce the results.

<div style="margin-top: 1.5cm;"></div>

- The summary report should be written in LaTeX (or Markdown) and turned in as a PDF (the source code for the summary report should also be included in your GitHub repository).

<div style="margin-top: 1.5cm;"></div>

- If you have more questions, please let me know.

---

# Course Preliminaries V - Guest Lectures

- I am arranging one or two guest lectures from people working in the industry.
- They will likely be someone from Amazon and someone from Microsoft.
- The idea is for them to explain their work routine and how they use data science in their jobs.
- I will announce this in advance.

---

# Outline

1. Course preliminaries ✅

<div style="margin-top: 2.5cm;"></div>

2. What is Data Science?

<div style="margin-top: 2.5cm;"></div>

3. Course Roadmap

---

# What is Data Science?
- .hi[Data science (DS):] The scientific discipline that deals with transforming data into useful information ("insights") using a variety of statistics, econometrics, AI, ML, data management, and visualization techniques.
- Amazon: Collects data on search history, cart history, and purchases:
  - Analyzes the data to estimate users' willingness to pay for various products (including Prime); recommend new products; and make decisions on stocking and advertising.
- Microsoft: Collects gaming data from Xbox users:
  - Analyzes the data to improve the gaming experience; recommend new games; and give discounts for Game Pass.

---

# What is Data Science?

- The rise of data science has come because of the so-called Big Data revolution.

<div style="margin-top: 1.5cm;"></div>

- The rise of the internet in the late 1990s and 2000s `\(\Rightarrow \,\uparrow\)` opportunities for companies and governments to collect data on consumers & citizens.

<div style="margin-top: 1.5cm;"></div>

- The proliferation of mobile devices & social media from the late 2000s until now has generated even more data.

---

# Skills required for data science

.center[
<img src="skills.jpg" width="75%" />

Source: [NC State Univ.](http://sites.nationalacademies.org/cs/groups/cstbsite/documents/webpage/cstb_181680.pdf) (p. 26)
]

---

# Pillars of Data Science

- Programming (for automation of data collection, manipulation, cleaning, visualization, and modeling).

<div style="margin-top: 1.5cm;"></div>

- Visualization & exploration.

<div style="margin-top: 1.5cm;"></div>

- Statistics, Econometrics, and Machine Learning (to select models, compress data).

<div style="margin-top: 1.5cm;"></div>

- Causal inference (to be able to make a policy prescription).

<div style="margin-top: 1.5cm;"></div>

... .hi[Assuming one has the appropriate foundation of basic calculus and statistics].
---

# The data science workflow

.center[
<img src="https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" width="85%" />

Source: [R for Data Science](http://r4ds.had.co.nz/introduction.html)
]

---

# Big Data

.center[
<img src="frisch.jpg" width="90%" />
]

Source: Frisch, Ragnar. 1933. "Editor's Note" _Econometrica_ 1(1): 1--4

---

# What is Big Data?

It depends on who you ask. It could mean:

1. "Wild" data (unstructured; collected without a particular intention; e.g., Twitter, in contrast with Census surveys).
2. "Wide" data (a.k.a. "Big-K" data because `\(K>N\)`).
3. "Long" data (a.k.a. "Big-N" data because `\(N\)` is very, very large [and may not all fit onto a single hard drive!]).
4. Any data set that cannot be analyzed with classical methods such as OLS.

"Big Data" is not so much about the size of the data as about whether or not "small data" (read: classical) methods can be used.

---

# What is machine learning? What is AI?

- .hi[Machine learning (ML):]
  - Allowing computers to learn for themselves without explicitly being programmed.
  - USPS: Computers that read handwriting on envelopes.
  - Google: AlphaGo, a computer program that defeated a world champion Go player.
  - Apple/Amazon/Microsoft: Siri, Alexa, Cortana voice assistants.
- .hi[Artificial intelligence (AI):]
  - Constructing machines (robots, computers) to think and act like human beings.
- ML is a subset of AI.

---

# Big data & machine learning

- You'll often hear the phrase "big data and machine learning".

<div style="margin-top: 1.5cm;"></div>

- This is because many machine learning algorithms are helpful for big data problems:
  - Selecting which `\(k<K\)` covariates should enter your model.
  - Streamlined techniques for processing "wild" data.
  - New modeling approaches that can leverage the greater amount of information that Big Data contains.

---

# Correlation vs. causation

- Machine learning is not the end-all, be-all of data science.
- A good data scientist knows that correlation is not causation!
- Ultimately, companies want "insights" that they can use to `\(\uparrow\)` profits.
- Tech companies run randomized experiments to estimate causal effects.
- They can also estimate fancier statistical models to account for selection.
- Economists' comparative advantage is in combining machine learning with economic theory to produce optimal policies.

---

# Example

> One very classic example comes from looking at the data of a shopping cart. Why do sales of beer and diapers go hand in hand? The correlation is women tell their husbands to go pick up diapers, and on the way, they pick up beer, too. That is data science: finding trends from your data and using that insight to increase your sales or market better

Source: http://www.chicagotribune.com/bluesky/originals/ct-bsi-inside-job-4c-insights-20171002-story,amp.html

---

# Lifestyle

- What it's like to have a DS job right now: https://www.simplilearn.com/a-day-in-the-life-of-a-data-scientist-article
- Data scientist job profiles: https://www.mygreatlearning.com/blog/different-data-science-jobs-roles-industry/

---

# Outline

1. Course preliminaries ✅

<div style="margin-top: 2.5cm;"></div>

2. What is Data Science? ✅

<div style="margin-top: 2.5cm;"></div>

3. Course Roadmap

---

# Course Roadmap

- Our schedule is ambitious and may be adjusted over the course.

<div style="margin-top: 1.5cm;"></div>

- In addition, I am expecting twins in the first three weeks of the course, so I may miss some classes.

<div style="margin-top: 1.5cm;"></div>

- If that happens, recordings will be posted.
---

# Course Roadmap: Where we are going

- .hi[Part I (~3 weeks): Introduction to Data Science:]
  - Git, GitHub, and coding best practices
  - Sources of data
  - APIs and web scraping
  - Experiments
  - SQL and Big Data tools
- .hi[Part II (~3 weeks): Exploratory Data Analysis:]
  - Data visualization
  - Data wrangling
  - Statistics and probability

---

# Course Roadmap: Where we are going

- .hi[Part III (~6 weeks): Statistical Supervised Learning:]
  - Inference in A/B tests
  - Optimization
  - Linear and logistic regression
- .hi[Part IV (~3 weeks): Introduction to Machine Learning:]
  - Supervised Machine Learning
  - Unsupervised Machine Learning
  - Causal Inference with Machine Learning

---

# Outline

1. Course preliminaries ✅

<div style="margin-top: 2.5cm;"></div>

2. What is Data Science? ✅

<div style="margin-top: 2.5cm;"></div>

3. Course Roadmap ✅