
Short summary

Given outcomes Y and features X, estimation techniques (such as regressions) are designed to study the correlation between Y and X. In a prediction task, the goal is to predict an outcome Y given a set of features X, usually based on observed correlations. If the outcome Y is discrete rather than continuous, we speak of discrete response models and classification. To counter the risk of over-fitting a model on existing data and to obtain predictive models that perform well on not-yet-seen data, resampling techniques will be deployed.

Motivation

“Why is it necessary to introduce so many different statistical learning approaches, rather than just a single best method? There is no free lunch in statistics: no one method dominates all others over all possible data sets. On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide for any given set of data which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.” – James et al. (An Introduction to Statistical Learning)

This is the “No Free Lunch” (NFL) Theorem (Wolpert 1996): Without any specific knowledge of the problem or data at hand, no one predictive model can be said to be the best.

Empirical work is hard for several reasons:

  • Fundamentally messy.
  • No clear start (statement of a theorem) and end (proof of the theorem).
  • Data is always messy → discretion will always come into play.
  • There is no Q.E.D. at the end of empirical papers.

Estimation and inference, correlations v. causality

Consider the relation between a response variable $Y$ and $p$ covariates (or predictors or explanatory variables) $X = (X_1, X_2, \dots, X_p)$.

$$Y = f(X) + \varepsilon$$ where $f$ (the model) represents the systematic information that $X$ provides about $Y$, and $\varepsilon$ is a random error term.

Econometrics:

  1. Estimation and Inference, usually for the linear (regression) model $f(X) = E(Y|X) = X\beta$: We seek to understand the association (or correlation) between $Y$ and $X_1, X_2, \dots, X_p$.
    • Which predictors are associated with the response? What is the relationship between the response and each predictor?
  2. Identification (this should really be the first point): What are we willing to assume about the relation between observables $X$ and unobservables $\varepsilon$? Hence: Based on this (usually untestable) hypothesis, which estimation methods are (in)valid?
  3. Causality: Does $X_p$ have a causal impact on the response $Y$?

Statistical Learning and Machine Learning, usually assuming that $\varepsilon$ and $X$ are independent:

  1. Estimation and Inference (as above).
  2. Prediction and Classification: $\hat{Y} = \hat{f}(X)$, where $\hat{f}$ is our estimate of $f$.
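As a concrete (if minimal) illustration of both tasks, the sketch below simulates data from $Y = f(X) + \varepsilon$ with a linear $f$, estimates $\hat{f}$ by OLS with lm(), and then forms predictions $\hat{Y} = \hat{f}(X)$ at new feature values. The data-generating process, sample size and new $X$ values are illustrative choices, not part of the course data.

```r
# Minimal sketch: simulate Y = f(X) + eps, estimate f-hat by OLS, predict Y-hat.
set.seed(1)
n <- 200
X <- runif(n, -2, 2)
Y <- 1 + 2 * X + rnorm(n)          # true f(X) = 1 + 2X (illustrative choice)

fit <- lm(Y ~ X)                   # estimation of f
summary(fit)                       # inference: estimates, std. errors, t-tests

newdata <- data.frame(X = c(-1, 0, 1))
predict(fit, newdata = newdata)    # prediction: Y-hat = f-hat(X) at new X values
```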

Predictions (based on correlations) vs. Causality:

  • In a prediction task, the goal is to predict an outcome $Y$ given a set of features $X$. A key component of a prediction exercise is that we only care that the prediction $\mathrm{model}(X)$ is close to $Y$ in data distributions similar to our training set. A simple correlation between $X$ and $Y$ can be helpful for these types of predictions.
  • In a causal task, we want to know how changing an aspect of the world $X^*$ (not just in our training data) affects an outcome $Y$. In this case, it's critical to know whether changing $X$ causes an increase in $Y$, or whether the relationship in the data is merely correlational.

Some econometricians, simplifying the two perspectives, suggest that econometric tasks focus on $\widehat{\beta}$ whereas Machine Learning tasks focus on $\widehat{Y}$. But they would (often) acknowledge that “the success of machine learning at intelligence tasks is largely due to its ability to discover complex structure that was not specified in advance. It manages to fit complex and very flexible functional forms to the data without simply overfitting; it finds functions that work well out-of-sample.” (Mullainathan and Spiess 2017)

Another way of thinking about the estimation goals is the mean squared error $MSE(\widehat{\beta}) = Var(\widehat{\beta}) + Bias(\widehat{\beta})^2$. Econometricians obsess about the bias and often forget about the variability; by contrast, Machine Learning methods (e.g. ensemble learners) are often designed to accept a (hopefully) small bias in order to reduce the variance and thereby minimise the MSE.
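The following Monte Carlo sketch makes the decomposition concrete by comparing OLS with a ridge fit (glmnet with alpha = 0). The data-generating process, the number of replications and the penalty value $\lambda = 0.2$ are illustrative assumptions; the point is only that the penalised estimator trades some bias for a reduction in variance.

```r
# Monte Carlo sketch of MSE(beta-hat) = Var(beta-hat) + Bias(beta-hat)^2,
# comparing OLS with ridge regression (glmnet, alpha = 0).
library(glmnet)

set.seed(42)
n <- 50; p <- 20
beta <- c(1, rep(0.2, p - 1))                 # true coefficients (illustrative)
S <- 0.7^abs(outer(1:p, 1:p, "-"))            # AR(1)-correlated predictors
Lchol <- chol(S)

R <- 500                                      # Monte Carlo replications
est_ols <- est_ridge <- matrix(NA_real_, R, p)
for (r in 1:R) {
  X <- matrix(rnorm(n * p), n, p) %*% Lchol
  y <- drop(X %*% beta + rnorm(n))
  est_ols[r, ]   <- coef(lm(y ~ X))[-1]
  est_ridge[r, ] <- as.numeric(coef(glmnet(X, y, alpha = 0), s = 0.2))[-1]
}

decompose <- function(est) {                  # summed over the p coefficients
  bias2 <- (colMeans(est) - beta)^2
  v     <- apply(est, 2, var)
  c(variance = sum(v), bias_sq = sum(bias2), mse = sum(v + bias2))
}
rbind(OLS = decompose(est_ols), Ridge = decompose(est_ridge))
```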

Interpretability

  • Despite its simplicity, the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance (robustness). Many sophisticated Machine Learning models lack this easy interpretability.

  • By removing irrelevant features (by setting the corresponding coefficient estimates to zero) we can obtain a model that is more easily interpreted (feature selection); a lasso sketch after this list illustrates the idea.

  • Occam’s Razor: When faced with several methods that give roughly equivalent performance, pick the simplest.

    • For instance, one would prefer a simple linear model with 10 features to an equivalently performing deep learning model with 1000 or more features.
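As a hedged illustration of feature selection by zeroing coefficients, the sketch below fits a lasso with glmnet on simulated data in which only the first three of fifteen features matter; the variable names and sparsity pattern are illustrative assumptions.

```r
# Sketch of feature selection via the lasso (glmnet, alpha = 1): coefficients
# of irrelevant features are set to exactly zero. Simulated data only.
library(glmnet)

set.seed(7)
n <- 200; p <- 15
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
beta <- c(2, -1.5, 1, rep(0, p - 3))   # only x1, x2, x3 are relevant
y <- drop(X %*% beta + rnorm(n))

cvfit <- cv.glmnet(X, y, alpha = 1)    # penalty chosen by cross-validation
coef(cvfit, s = "lambda.1se")          # most irrelevant coefficients are 0
```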

Low- vs. high-dimensional settings

Classical estimation approaches such as least squares linear regression require $n \gg p$, i.e. many more observations $n$ than features $p$.

Data sets containing more features than observations ($p > n$) are often referred to as high-dimensional, and classical techniques are not appropriate in this setting.

Why? The problem is one of overfitting when $p > n$: Regardless of whether or not there truly is a relationship between the features and the response, least squares will yield a set of coefficient estimates that result in a perfect fit to the data, such that the residuals are zero.

When the model is overfitted, its predictive power on not-yet-seen data will be poor, because we will have fitted the noise component of the data. This is the curse of dimensionality.
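A small simulated example of this phenomenon, with illustrative sample size and feature count: with $p > n$, lm() produces an (essentially) perfect in-sample fit to pure noise, R warns at prediction time that the fit is rank-deficient, and the out-of-sample predictions are poor.

```r
# Sketch: p > n overfitting. The response is pure noise, unrelated to the
# features, yet least squares fits the training data (almost) perfectly.
set.seed(3)
n <- 20; p <- 30
train <- data.frame(y = rnorm(n), matrix(rnorm(n * p), n, p))
test  <- data.frame(y = rnorm(n), matrix(rnorm(n * p), n, p))

fit <- lm(y ~ ., data = train)       # rank-deficient: p + 1 > n coefficients
max(abs(residuals(fit)))             # ~ 0: "perfect" in-sample fit
mean((test$y - predict(fit, newdata = test))^2)  # typically far above var(y)
```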

Topics

  1. Ordinary Least Squares (OLS)
    • LPM: The Titanic data: Survived or not?
  2. Discrete response models: The Generalised Linear Models (GLM). How shall we model outcomes that are not continuous and unrestricted, such as a probability, a discrete categorical outcome, or a count? (A minimal glm() sketch follows after this list.)
    • GLM / Logit: The Titanic data: Survived or not?
    • GLM / Probit: Female labour supply
    • GLM / Poisson regression: Modelling counts of bike rides
    • Penalised GLM: Regularisation using glmnet (later, if time permits).
  3. Resampling methods: Cross-validation
  4. Regression trees
  5. Classification trees
  6. Random Forests
  7. Predictive performance comparisons: Titanic
  8. Multi-class problems. Classification trees and RFs are immediately applicable. An appropriate regression model is the multinomial logit model (which generalises the logit).
  9. Neural networks (Deep Learning)
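As announced in topic 2 above, here is a minimal sketch of the glm() family/link interface for a binary and a count outcome; the simulated data and coefficient values are illustrative stand-ins for the Titanic, labour-supply and bike-ride applications.

```r
# Minimal glm() sketch: logit, probit and Poisson models on simulated data.
set.seed(1)
n <- 300
x <- rnorm(n)

# Binary outcome: binomial family with logit or probit link
yb <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * x))
fit_logit  <- glm(yb ~ x, family = binomial(link = "logit"))
fit_probit <- glm(yb ~ x, family = binomial(link = "probit"))

# Count outcome: Poisson family with a log link
yc <- rpois(n, lambda = exp(0.3 + 0.8 * x))
fit_pois <- glm(yc ~ x, family = poisson(link = "log"))

coef(summary(fit_logit))   # estimates, standard errors, z-values
```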

Next Level / Course:

  • Deep Reinforcement Learning (DQN, PPO) using PyTorch