class: title-slide

# Econ 520: Data Science for Economists
## Lecture 3: Clean Code

<br>
<p align=center> Pedro H. C. Sant'Anna </p>
<div style="margin-top: -.7cm;"></div>
<p align=center> Emory University </p>
<br>
<p align=center> Spring 2024 </p>

---

# Prologue

- Today's material comes from [Tyler Ransom's Lecture Notes](https://github.com/OU-PhD-Econometrics/fall-2022/tree/master/LectureNotes/01a-CleanCode) at the University of Oklahoma.

<div style="margin-top: 1cm;"></div>

- Tyler built his lectures on the following sources:

    1. *The Clean Coder*, by Robert C. Martin.
    2. [*Code and Data for the Social Sciences: A Practitioner's Guide*](https://web.stanford.edu/~gentzkow/research/CodeAndData.pdf), by Gentzkow and Shapiro.
    3. Also a small contribution from [here](https://garywoodfine.com/what-is-clean-code/) and other internet pages.

---

# What is Clean Code?

<div style="margin-top: 2cm;"></div>

- .hi[Clean Code:] Code that is easy to understand, easy to modify, and hence easy to debug.

<div style="margin-top: 3cm;"></div>

- Clean code saves you and your collaborators time.

---

# Why clean code matters: Scientific progress

- Good science is based on careful observations.

- Science progresses through iteratively testing hypotheses and making predictions.

- Scientific progress is impeded if:

    - mistaken previous results are erroneously given authority,
    - previous hypothesis tests are not reproducible,
    - previous methods and results are not transparent.

- Thus, for science that involves computer code, clean code is a must!

---

# Why clean code matters: Personal and team sanity

- You will always make mistakes while coding.

<div style="margin-top: 1cm;"></div>

- What makes good programmers great is their ability to quickly identify and correct mistakes.

<div style="margin-top: 1cm;"></div>

- Developing a habit of clean coding from the outset of your career will help you identify and correct mistakes more quickly.
<div style="margin-top: 1cm;"></div>

- It will save you a lot of stress in the long run.

<div style="margin-top: 1cm;"></div>

- It will make your collaborative relationships more pleasant.

---

# Why clean code is under-produced

- If clean code is so beneficial and important, why isn't there more of it?

<div style="margin-top: 1cm;"></div>

--

1. .hi[Competitive pressure] to produce research/products as quickly as possible.

<div style="margin-top: 1cm;"></div>

2. .hi[End user] (journal editor, reviewer, reader, dean) .hi[doesn't care what the code looks like], just that the product works.

<div style="margin-top: 1cm;"></div>

3. In the moment, clean code .hi[takes longer to produce] while seemingly conferring no benefit.

---

# How does one produce clean code? Principles

- Automation;
- Version control;
- Organization of data and software files;
- Abstraction;
- Documentation;
- Time / task management;
- Test-driven development (unit testing, profiling, refactoring);
- Pair programming / code reviews.

---

# Automation

- Gentzkow & Shapiro's two rules for automation:

    1. Automate everything that can be automated.
    2. Write a single script that executes all code from beginning to end.

- There are two reasons automation is so important:

    - Reproducibility (helps with debugging and revisions);
    - Efficiency (having a code base saves you time in the future).

- A single script that shows the sequence of steps taken is the equivalent of "showing your work".

---

# Version control

<div style="margin-top: 2cm;"></div>

- We discussed Git and GitHub in a previous lecture.

<div style="margin-top: 2cm;"></div>

- Version control provides a principled way for you to easily undo changes, test out new specifications, and more.

---

# File organization

1. Separate directories by function.
2. Separate files into inputs and outputs.
3. Make directories portable.

- To see how professionals do this, check out the source code for R's [dplyr](https://github.com/tidyverse/dplyr) package.
- There are separate directories for source code (`/src`), documentation (`/man`), code tests (`/tests`), data (`/data`), examples (`/vignettes`), and more.
- When you use version control, it forces you to make directories portable (otherwise a collaborator will not be able to run your code).
    - Use __relative__ file paths, not absolute file paths.

---

# Data organization

- The key idea is to practice .hi[relational database management].
- A relational database consists of many smaller data sets.
- Each data set is tabular and has a unique, non-missing key.
- Data sets "relate" to each other based on these keys.
- You can implement these practices in any modern statistical analysis software (R, Stata, SAS, Python, Julia, SQL, ...).
- Gentzkow & Shapiro recommend not merging data sets until as far into your code pipeline as possible.

---

# Abstraction

- What is abstraction? It means "reducing the complexity of something by hiding unnecessary details from the user".
- E.g.: A dishwasher. All I need to know is how to put dirty dishes into the machine, and which button to press. I don't need to understand how the electrical wiring or plumbing works.
- In programming, abstraction is usually handled with functions.
- Abstraction is usually a good thing.
- But it can be taken to a harmful extreme: overly abstract code can be "impenetrable", which makes it difficult to modify or debug.

---

# Rules for Abstraction

- Gentzkow & Shapiro give three rules for abstraction:

    1. Abstract to eliminate redundancy.
    2. Abstract to improve clarity.
    3. Otherwise, don't abstract.
---

# Abstract to eliminate redundancy

- Sometimes you might find yourself repeating lines of code with small modifications across the lines:

```r
x1 <- matrix(1, ncol = 2, nrow = 5)
x2 <- 2 * matrix(1, ncol = 2, nrow = 5)
x3 <- 3 * matrix(1, ncol = 2, nrow = 5)
```

A better way to eliminate this redundancy is to write a function:

```r
# Return an nr x nc matrix filled with `scal`, multiplied by `j`
scaled_mat <- function(j, scal = 1, nc, nr) {
  j * base::matrix(scal, ncol = nc, nrow = nr)
}

x1 <- scaled_mat(j = 1, scal = 1, nc = 2, nr = 5)
x2 <- scaled_mat(j = 2, scal = 1, nc = 2, nr = 5)
x3 <- scaled_mat(j = 3, scal = 1, nc = 2, nr = 5)
```

Adjusting the `scaled_mat()` function involves a single change (and not 3)!

---

# Abstract to improve clarity

- Consider the example of obtaining OLS estimates from an N x 1 vector `Y` and an N x K covariate matrix `X` that already exist in our workspace.

- We could code this in three ways:

```r
beta_hat <- solve(t(X) %*% X) %*% (t(X) %*% Y)
```

or, to improve efficiency,

```r
beta_hat_alt <- solve(crossprod(X)) %*% crossprod(X, Y)
```

---

# Abstract to improve clarity

- Alternatively, we could use

```r
estimate_ols <- function(Yvar, Xmat) {
  solve(crossprod(Xmat)) %*% crossprod(Xmat, Yvar)
}

beta_hat <- estimate_ols(Y, X)
```

With this last approach, it is easier to read the code and understand what it is doing.

Note that I used `Yvar` and `Xmat` rather than `Y` and `X` in the function definition: variables inside a function are local and do not exist outside of it, so distinct names keep them from being confused with the workspace objects.

---

# Otherwise, don't abstract

- One could argue that the examples on the previous slides are overly abstract.

- OLS is a simple operation that only takes one line of code.

- If we're only doing it once in our script, then it may not make sense to use the function version.

- Similarly, it may not make sense to use the `scaled_mat()` function if I only need to use it for three lines of code.

- This discussion points out that it can be difficult to know if one has reached the optimal level of abstraction.
- As you're starting out programming, I would advise doing almost everything inside of a function (i.e., err on the side of over-abstraction when starting out).

---

# Documentation

1. Don't write documentation you will not maintain.
2. Code should be self-documenting.

- Generally speaking, commented code is helpful.
- However, comments can be harmful if, e.g., they contain dynamic information.
- It may not be helpful to have to rewrite comments every time you change the code.
- Code can be "self-documenting" by leveraging abstraction: function arguments make it easier to understand what is a variable and what is a constant.

---

# Documentation in R

- R has excellent packages to handle documentation: `roxygen2` (for help files) and `docstring` (built-in documentation).
- These make great documents and help files to increase readability.
- These generally also generate a code guide.

---

# Docstring in action

```r
library(docstring)

estimate_ols <- function(Yvar, Xmat) {
  #' Compute OLS coefficients
  #'
  #' @description This function computes OLS estimates for dependent variable `Yvar` and covariates `Xmat`.
  #'
  #' @param Yvar Nx1 outcome vector.
  #' @param Xmat NxK covariate matrix.
  #' @return OLS coefficients
  b <- solve(crossprod(Xmat)) %*% crossprod(Xmat, Yvar)
  return(b)
}

?estimate_ols
```

---

# Docstring in action

.center[
<img src="docstring.png" width="70%" />
]

If you change the function `estimate_ols()`, e.g. to add a new argument, then you can easily make the same change to the documentation.

---

# Time management

- Time management is key to writing clean code.
- It is foolish to think that one can write clean code in a strained mental state.
- Code written when you are groggy, overly anxious, or distracted will come back to bite you.
- Schedule long blocks of time (1.5 to 3 hours) to work on coding where you eliminate distractions (email, social media, etc.).
- Stop coding when you feel that your focus or energy is dissipating.
---

# Task management

- When collaborating on code, it is essential to not use email or Slack threads to discuss coding tasks.
- Rather, use a task management system that has dedicated messages for a particular point of discussion (bug in the code, feature to develop, etc.).
- I use GitHub issues for all of my coding projects.
- For my personal task management, I use Trello to take all tasks out of my email inbox and put them in Trello's task management system.
- GitHub and Trello also have Kanban-style boards where you can easily visually track progress on tasks.

---

# Test-driven development (unit testing, refactoring, profiling)

- The only way to know that your code works is to test it!
- Test-driven development (TDD) consists of a suite of tools for writing code that can be automatically tested.
- .hi[Unit testing] is nearly universally used in professional software development.
- Unit testing is to software developers what washing hands is to surgeons.

---

# Unit testing

- Unit tests are scripts that check that a piece of code does everything it is supposed to do.
- When professionals write code, they also write unit tests for that code at the same time.
- If code doesn't pass tests, then bugs are caught on the front end.
- Test coverage determines how much of the code base is tested. High coverage rates are a must for unit testing to be useful.
- R's [dplyr package](https://github.com/tidyverse/dplyr) shows that all unit tests are passing and that tests cover 88% of the code base.
- [Here](https://testthat.r-lib.org/) is a nice step-by-step guide for doing this in R.

---

# Refactoring

<div style="margin-top: -.5cm;"></div>

- Refactoring refers to the action of restructuring code without changing its external behavior or functionality.
  Think of it as "reorganizing".

<div style="margin-top: -.1cm;"></div>

- Example:

```r
estimate_ols <- function(Yvar, Xmat) {
  solve(crossprod(Xmat)) %*% crossprod(Xmat, Yvar)
}
```

<div style="margin-top: cm;"></div>

after refactoring becomes

<div style="margin-top: -.5cm;"></div>

```r
estimate_ols <- function(Yvar, Xmat) {
  # chol2inv() expects the Cholesky factor of X'X, not X itself
  chol2inv(chol(crossprod(Xmat))) %*% crossprod(Xmat, Yvar)
}
```

<div style="margin-top: -.5cm;"></div>

- Nothing changed in how the function is called: its name, arguments, and output are the same. Only its internals changed.
- The new version may run faster. The output is unchanged (up to floating-point rounding).
- Refactoring could also mean reducing the number of input arguments.

---

# Profiling

- Profiling refers to checking the resource demands of your code.

<div style="margin-top: 1cm;"></div>

- How much processing time does your script take? How much memory?

<div style="margin-top: 1cm;"></div>

- Clean code should be highly performant: it uses minimal computational resources.

<div style="margin-top: 1cm;"></div>

- Profiling and refactoring go hand in hand, along with unit testing, to ensure that code is maximally optimized.

<div style="margin-top: 1cm;"></div>

- [Here](https://adv-r.hadley.nz/perf-measure.html) is an intro guide to profiling in R.

---

# Pair programming and code reviews

- An essential part of clean code is reviewing code.

<div style="margin-top: .5cm;"></div>

- An excellent way to review code is to do so at the time of writing.

<div style="margin-top: .5cm;"></div>

- .hi[Pair programming] involves sitting two programmers at one computer.

<div style="margin-top: .5cm;"></div>

- One programmer does the writing while the other reviews.

<div style="margin-top: .5cm;"></div>

- This is a great way to spot silly typos and other issues that would extend development time.

<div style="margin-top: .5cm;"></div>

- It's also a great way to quickly refactor code at the start.

<div style="margin-top: .5cm;"></div>

- .hi[I strongly encourage you to do pair programming on problem sets in this course!]
<div style="margin-top: .5cm;"></div>

- You can also do this with code reviews, via GitHub!
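
---

# Unit testing and refactoring: a sketch

To make the unit-testing and refactoring slides concrete, here is a minimal sketch in base R. It is an illustration under stated assumptions, not part of the source material: the simulated data, the seed, and the name `estimate_ols_chol()` are hypothetical, and the Cholesky version assumes `X` has full column rank.

```r
# Hypothetical example data: seed, sample size, and coefficients are arbitrary
set.seed(520)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))    # N x K covariate matrix (with intercept)
Y <- X %*% c(1, 2, -0.5) + rnorm(n)  # N x 1 outcome vector

# Original version from the refactoring slide
estimate_ols <- function(Yvar, Xmat) {
  solve(crossprod(Xmat)) %*% crossprod(Xmat, Yvar)
}

# Refactored version (hypothetical name): same arguments, different internals.
# chol2inv() takes the Cholesky factor of X'X, so X must have full column rank.
estimate_ols_chol <- function(Yvar, Xmat) {
  chol2inv(chol(crossprod(Xmat))) %*% crossprod(Xmat, Yvar)
}

# Reference answer from R's built-in least-squares routine
beta_ref <- as.vector(lm.fit(X, as.vector(Y))$coefficients)

# Unit test 1: the estimator matches the reference solver
stopifnot(isTRUE(all.equal(as.vector(estimate_ols(Y, X)), beta_ref)))

# Unit test 2: refactoring did not change the output (up to rounding)
stopifnot(isTRUE(all.equal(estimate_ols(Y, X), estimate_ols_chol(Y, X))))

# Unit test 3: one coefficient per column of X, returned as a K x 1 matrix
stopifnot(identical(dim(estimate_ols(Y, X)), c(ncol(X), 1L)))
```

If any check fails, `stopifnot()` halts the script with an error, so a single master script run catches the bug on the front end. With the `testthat` package linked on the unit-testing slide, the same checks would typically be wrapped in `test_that()` blocks using `expect_equal()`.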