
An introductory course in Applied Econometrics

Topics

  • Data Wrangling and Visualisations (Exploratory Data Analysis)
    • An introduction to some basic data wrangling using the NYT and JHU Covid data.
    • The Titanic data: Which passengers were more likely to survive?
    • Data visualisations: Bar plots, histograms, density estimation, empirical distribution function, scatter plots.
  • Ordinary Least Squares (OLS)
    • Regression mechanics: Running regressions is easy, but …
    • do the answers we obtain make any sense?
  • The art of making sandwiches (or robust standard errors)
  • Instrumental variables (IV, 2SLS) in action

Exercises

  • Data Wrangling and Visualisations:
    • Covid in France: modelling excess Covid deaths in France
    • Extracting data from IPUMS using the API
    • The Big Mac Index
  • OLS and IV in action:
    • Replicating Krueger (1999, QJE) on Project STAR
    • Replicating Ashenfelter and Rouse (1998, QJE) using the Twinsburg twins

Projects

  1. Project based on Angrist and Krueger (1991, QJE)
  2. Project based on Dustmann et al. (2016, JEP)
  3. Project based on Autor et al. (2013, AER)
  4. Project based on Acemoglu and Angrist (2000, NBER)

No free lunch

“Why is it necessary to introduce so many different statistical learning approaches, rather than just a single best method? There is no free lunch in statistics: no one method dominates all others over all possible data sets. On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide for any given set of data which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.” – James et al. (An Introduction to Statistical Learning)

This is the “No Free Lunch” (NFL) Theorem (Wolpert 1996): Without any specific knowledge of the problem or data at hand, no one predictive model can be said to be the best.

Consider the relationship between a response variable Y and p covariates (also called predictors or explanatory variables) X = (X1, X2, …, Xp):

Y = f(X) + ε, where f (the model) represents the systematic information that X provides about Y, and ε is a random error term.
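
For concreteness, here is a minimal sketch of this model in R (assumed here to be the course language), with a made-up f and simulated data, purely for illustration:

```r
# Simulate data from Y = f(X) + eps with one covariate.
# f and all numbers below are made up purely for illustration.
set.seed(42)
n   <- 200
x   <- runif(n, 0, 10)
f   <- function(x) 2 + 0.5 * x     # the (in practice unknown) systematic part
eps <- rnorm(n, mean = 0, sd = 1)  # random error term
y   <- f(x) + eps
plot(x, y, pch = 20)                        # the data we actually observe
curve(f, add = TRUE, col = "red", lwd = 2)  # the true f we try to recover
```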

Econometrics:

  1. Estimation and Inference, usually for the linear (regression) model f(X) = E(Y|X) = Xβ: We seek to understand the association (or correlation) between Y and X1, X2, …, Xp (a short R sketch follows this list).
    • Which predictors are associated with the response? What is the relationship between the response and each predictor?
  2. Identification (this should really be the first point): What are we willing to assume about the relationship between the observables X and the unobservables ε? Based on this (usually untestable) hypothesis, which estimation methods are (in)valid?
  3. Causality: Does Xp have a causal impact on the response Y?
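
For point 1, a minimal R sketch of estimation and inference in the linear model, on simulated data (all coefficients are made up for illustration):

```r
# OLS estimation and inference for E(Y|X) = X beta, on simulated data.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)  # made-up coefficients
fit <- lm(y ~ x1 + x2)  # OLS
summary(fit)            # estimates, standard errors, t-tests
confint(fit)            # confidence intervals for beta
```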

Statistical Learning and Machine Learning, usually assuming that ε and X are independent:

  1. Estimation and Inference (as above).
  2. Prediction and Classification: Ŷ = f̂(X), where f̂ is our estimate of f (see the sketch below).
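
For point 2, a minimal R sketch of the prediction task: estimate f̂ on a training set, then form Ŷ = f̂(X) for new observations (data simulated for illustration):

```r
# Prediction: fit f-hat on training data, then predict on new X.
set.seed(2)
train   <- data.frame(x = runif(300, 0, 10))
train$y <- 2 + 0.5 * train$x + rnorm(300)  # made-up data-generating process
f_hat <- lm(y ~ x, data = train)           # our estimate of f
new_x <- data.frame(x = c(1, 5, 9))        # new observations
predict(f_hat, newdata = new_x)            # Y-hat = f-hat(X)
```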

Predictions (based on correlations) vs. Causality:

  • In a prediction task, the goal is to predict an outcome Y given a set of features X. A key component of a prediction exercise is that we only care that the prediction f̂(X) is close to Y in data distributions similar to our training set. A simple correlation between X and Y can be helpful for these types of predictions.
  • In a causal task, we want to know how changing an aspect of the world X* (not just in our training data) affects an outcome Y. In this case, it’s critical to know whether changing X causes an increase in Y, or whether the relationship in the data is merely correlational (the simulation sketched below makes this concrete).
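
The difference can be made concrete with a small simulation (hypothetical numbers, for illustration only): a confounder Z drives both X and Y, so X predicts Y well even though changing X would have no effect on Y at all:

```r
# X has no causal effect on Y; both are driven by a confounder Z.
set.seed(3)
n <- 1000
z <- rnorm(n)                    # unobserved confounder
x <- z + rnorm(n, sd = 0.5)      # X caused by Z
y <- 2 * z + rnorm(n, sd = 0.5)  # Y caused by Z only, not by X
fit <- lm(y ~ x)
summary(fit)$r.squared  # high: X is a good predictor of Y
coef(fit)["x"]          # far from 0, yet the causal effect of X on Y is 0
```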