An introductory course in Applied Econometrics
Topics
- Data Wrangling and Visualisations (Exploratory Data Analysis)
- An introduction to basic data wrangling using the NYT and JHU Covid data.
- The Titanic data: Which passengers were more likely to survive?
- Data visualisations: Bar plots, histograms, density estimation, empirical distribution function, scatter plots.
- Ordinary Least Squares (OLS)
- Regression mechanics: Running regressions is easy, but …
- do the answers we obtain make any sense?
- The art of making sandwiches (or robust standard errors)
- Instrumental variables (IV, 2SLS) in action
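The "sandwich" in the robust-standard-errors topic above refers to the variance estimator $(X'X)^{-1}\,X'\hat{\Omega}X\,(X'X)^{-1}$: the bread is $(X'X)^{-1}$, the meat is $X'\hat{\Omega}X$. A minimal NumPy sketch (the simulated data and the HC1 variant are my own illustration, not part of the course materials):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a heteroskedastic linear model: Var(eps | x) grows with x^2.
n = 1000
x = rng.normal(size=n)
eps = np.abs(x) * rng.normal(size=n)
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])          # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
k = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)

# Classical OLS standard errors: assume Var(eps | X) = sigma^2 * I.
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC1 sandwich: bread = (X'X)^{-1}, meat = X' diag(e_i^2) X.
meat = X.T @ (X * resid[:, None] ** 2)
V_hc1 = n / (n - k) * XtX_inv @ meat @ XtX_inv
se_hc1 = np.sqrt(np.diag(V_hc1))

print(beta_hat, se_classical, se_hc1)
```

With error variance proportional to $x^2$, the robust slope standard error comes out noticeably larger than the classical one, so ignoring heteroskedasticity understates the uncertainty here.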
Exercises
- Data Wrangling and Visualisations:
- Covid in France: modelling excess Covid deaths
- Extracting data from IPUMS using the API
- The Big Mac Index
- OLS and IV in action:
- Replicating Krueger (1999, QJE) on Project STAR
- Replicating Ashenfelter and Rouse (1998, QJE) using the Twinsburg twins
Projects
- Project based on Angrist and Krueger (1991, QJE)
- Project based on Dustmann et al. (2016, JEP)
- Project based on Autor et al. (2013, AER)
- Project based on Acemoglu and Angrist (2000, NBER)
No free lunch
Empirical work is hard for several reasons:
- Fundamentally messy.
- No clear start (statement of a theorem) and end (proof of the theorem).
- Data is always messy, so discretion will always come into play.
- There is no Q.E.D. at the end of empirical papers.
“Why is it necessary to introduce so many different statistical learning approaches, rather than just a single best method? There is no free lunch in statistics: no one method dominates all others over all possible data sets. On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide for any given set of data which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.” – James et al. (An Introduction to Statistical Learning)
This is the “No Free Lunch” (NFL) Theorem (Wolpert 1996): Without any specific knowledge of the problem or data at hand, no one predictive model can be said to be the best.
Consider the relation between a response variable $Y$ and covariates (or predictors or explanatory variables) $X$:
$$Y = f(X) + \varepsilon,$$
where $f$ (the model) represents the systematic information that $X$ provides about $Y$, and $\varepsilon$ is a random error term.
Econometrics:
- Estimation and Inference, usually for the linear (regression) model $Y = X\beta + \varepsilon$: We seek to understand the association (or correlation) between $Y$ and $X$.
- Which predictors are associated with the response? What is the relationship between the response and each predictor?
- Identification (this should really be the first point): What are we willing to assume about the relation between observables and unobservables? Hence: Based on this (usually untestable) hypothesis, which estimation methods are (in)valid?
- Causality: Does $X$ have a causal impact on the response $Y$?
Statistical Learning and Machine Learning, usually assuming that $X$ and $\varepsilon$ are independent:
- Estimation and Inference (as above).
- Prediction and Classification: $\hat{Y} = \hat{f}(X)$, where $\hat{f}$ is our estimate of $f$.
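For the classification side, a minimal sketch of estimating $\hat{f}$ for a binary response: a logistic regression fit by Newton-Raphson on simulated data (my own illustration; a course exercise such as Titanic survival would substitute real data).

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated binary response: P(Y = 1 | x) = logistic(-1 + 3x).
n = 2000
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(-1 + 3 * x)))
y = (rng.uniform(size=n) < p_true).astype(float)

X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)

# Newton-Raphson steps for the logistic log-likelihood.
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)                          # Fisher information weights
    grad = X.T @ (y - p)
    hess = X.T @ (X * W[:, None])
    beta = beta + np.linalg.solve(hess, grad)

# Classify at the 0.5 threshold and measure in-sample accuracy.
y_hat = (1 / (1 + np.exp(-X @ beta)) > 0.5).astype(float)
accuracy = (y_hat == y).mean()
print(beta, accuracy)
```

The estimated coefficients recover the true $(-1, 3)$ up to sampling error, and the fitted probabilities turn into class predictions $\hat{Y}$ by thresholding.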
Predictions (based on correlations) vs. Causality:
- In a prediction task, the goal is to predict an outcome $Y$ given a set of features $X$. A key component of a prediction exercise is that we only care that the prediction $\hat{Y}$ is close to $Y$ in data distributions similar to our training set. A simple correlation between $X$ and $Y$ can be helpful for these types of predictions.
- In a causal task, we want to know how changing an aspect of the world $X$ (not just in our training data) affects an outcome $Y$. In this case, it’s critical to know whether changing $X$ causes an increase in $Y$, or whether the relationship in the data is merely correlational.
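The distinction is easiest to see with simulated data where the truth is known. In the sketch below (my own toy design, not a course data set), $X$ is endogenous through an omitted variable $u$: OLS recovers a biased "predictive" slope, while 2SLS with an instrument $Z$ recovers the causal one.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 20_000
z = rng.normal(size=n)                 # instrument: shifts x, excluded from y
u = rng.normal(size=n)                 # omitted confounder
x = z + u + rng.normal(size=n)         # endogenous regressor (first stage)
y = 1.0 + 1.0 * x + 2.0 * u + rng.normal(size=n)   # true causal slope = 1

def ols(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

beta_ols = ols(np.column_stack([np.ones(n), x]), y)   # biased: picks up cov(x, 2u)

# 2SLS: project x on the instrument, then regress y on the fitted values.
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ ols(Z, x)                                  # first stage
beta_2sls = ols(np.column_stack([np.ones(n), x_hat]), y)   # second stage
# (Point estimates only: standard errors from this naive second stage
#  would be wrong, which is where the sandwich machinery comes back in.)

print(beta_ols[1], beta_2sls[1])
```

In this design the OLS slope converges to $1 + 2/3$, a fine answer for prediction but the wrong one for causal questions; 2SLS is consistent for the causal slope of 1.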
“If applied econometrics was easy, theorists would do it. … Carefully applied to coherent causal questions, regression and 2SLS almost always make sense. Your standard errors probably won’t be quite right, but they rarely are.” - Angrist and Pischke, Mostly Harmless Econometrics.