Economics for Data Science
Chiara Binelli
Academic year 2019-2020
Email: chiara.binelli@unimib.it
Data Science and Economics
- Economics approach:
1. Have a theory that identifies a relationship of interest (ex. impact of completing college on wages). 2. Estimate the impact of a treatment (ex. completing college) on an outcome variable (ex. wages) holding everything else constant.
- Focus on some coefficients of interest to estimate causal effects.
- Effort to estimate unbiased effects with carefully constructed standard errors.
- Data Science approach – data-driven approach:
1. Predict how a given outcome varies with a large number of potential predictors. 2. May or may not use prior theory to establish which predictors are relevant.
- Data-driven model selection to identify meaningful predictive variables.
- Less attention to statistical uncertainty and standard errors and more to
model uncertainty.
Two main limitations of the Data Science approach:
- 1. Lack of theory: data-driven approach (predictive models are
chosen using data-driven cross-validation methods).
- 2. Lack of statistical significance: focus on predictions that
minimize mean-squared errors without much attention to statistical significance since the exact source of variation identifying the prediction is difficult to assess.
– Thus, bias is allowed in order to reduce variance. – Example: LASSO penalizes the inclusion of covariates so that, if two covariates are correlated, only one will be included and its parameter will reflect the impact of both the included and the excluded covariate: omitted variable bias (OVB)!
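The LASSO point above can be sketched numerically; the data and penalty value below are illustrative assumptions, not from the slides.

```python
# Illustrative sketch (assumed data): two highly correlated covariates,
# each with a true effect of 1.0 on y. LASSO tends to keep only one of
# them, and the surviving coefficient absorbs the impact of both the
# included and the excluded covariate - the omitted variable bias above.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)      # x2 is almost identical to x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)  # both covariates matter

model = Lasso(alpha=0.5).fit(np.column_stack([x1, x2]), y)
print(model.coef_)  # typically one coefficient is shrunk toward 0
```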
Data Science and Economics
- Economists often interested in assessing the
effectiveness of a policy or testing theories that predict a causal relationship
– Main goal is to identify statistically significant causal effects. – A model with high degree of predictive fit is seen as secondary to finding an empirical specification that identifies a causal effect.
- Common Data Science techniques such as classification
and regression trees, LASSO, boosting, and cross-validation have not been much used in Economics.
Data Science and Economics
- Concrete example (Einav and Levin 2014): assess if
taking online classes improves earnings.
- Economics approach:
– Either design an experiment that induces some workers to take online classes for reasons unrelated to their earning potential.
- e.g. change in the price of online classes.
– Or, absent the experiment, use observational data to estimate the impact of online classes on earnings in an unbiased way. – Focus on:
- Obtaining a point estimate of the impact of online classes on earnings
that is precisely estimated.
- Discussing whether there are omitted variables that might confound a
causal interpretation (e.g. workers’ ambition driving a decision to take classes and work harder at the same time).
Data Science and Economics
- Data Science approach:
– Identify which variables predict earnings, given a vast set of predictors in the data, and the potential for building a model that predicts earnings well, both in sample and out of sample. – Focus on:
- Model that predicts earnings both for individuals that have and have
not taken online classes.
– NOTE: focus is not on causal effect and statistical significance but rather on prediction.
Machine Learning and Statistical Inference
- Flexibility of machine learning algorithms means that two
different functions that use different variables can produce similar predictions:
– In traditional estimation, large standard errors express the uncertainty in attributing effects. – In machine learning, there is a lack of consistency in model selection – how should we measure this?
- Computing standard errors in machine learning
algorithms is difficult due to the data-driven approach:
– Leeb and Pötscher (2006, 2008) develop conditions under which it is impossible to compute consistent estimates of model parameters after data-driven selection.
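A minimal simulation of this selection instability (illustrative, with assumed data and penalty): across bootstrap resamples of the same dataset, LASSO can select different variables, and no standard error summarizes that uncertainty.

```python
# Sketch (assumed data): two correlated candidate predictors. Refitting
# LASSO on bootstrap resamples records which variables survive; the set
# of distinct selection patterns measures model-selection instability.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.3 * rng.normal(size=n)    # strongly correlated with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

patterns = set()
for b in range(50):                          # 50 bootstrap resamples
    idx = rng.integers(0, n, size=n)
    coef = Lasso(alpha=0.8).fit(X[idx], y[idx]).coef_
    patterns.add(tuple(np.abs(coef) > 1e-6)) # which variables were kept
print(patterns)  # often more than one distinct selection pattern
```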
Big Data and Statistical Inference
- When big data represent all the data for a given set of
variables, should we compute standard errors?
- Very much YES!
– The error of a model comes from two sources: omitted variables and measurement error.
- Omitted variables error: some relevant explanatory
variables are omitted and thus end up in the error term.
- Measurement error: the dependent variable is measured
with error.
Big Data and Statistical Inference
- Sample error is very different from model error: it is the
difference between the sample-based regression results and the results based on the full population.
- Probability theory tells us that for a well constructed
sample, regression coefficients are unbiased estimates of population regression coefficients.
- Tests of statistical significance are relevant both for
samples and for entire data populations.
– To read more on this, see Babones, S. J. 2013. Statistical Modeling with Cross-Sectional Designs, Chapter 5, pp. 107-118.
Data Science and Economics
- Due to theory, the Economics approach is more
interpretable in terms of which variation identifies the impact of interest and its statistical significance.
- The Data Science approach is better for predictions:
– Examples: comparison of the performance of OLS vs machine learning algorithms (regression trees, random forest, LASSO, ensemble) in Mullainathan and Spiess (2017 Journal of Economic Perspectives); advantages of using ensemble methods to improve predictions (Athey et al. 2019). – Intuition: machine learning algorithms easily allow introducing pairwise interactions between all potential predictors.
- The two approaches have mutual benefits.
Economics for Data Science
- From Economics to Data Science: 2 main contributions
1. Provide a theory: theory to ask interesting questions and to analyze complex big datasets. With data complexity, it is crucial to have models to guide the choice of variables, the relationships between variables, the hypotheses to test and the experiments to run. 2. Focus on causality: crucial to answer important questions.
- From Data Science to Economics: 3 main contributions
1. Test robustness to misspecifications (Athey and Imbens 2015). 2. New tools for causal inference. 3. Better predictions.
From Economics to Data Science:
- 1. Provide a Theory
- Example: online advertising auctions.
- Important question for Google or Facebook:
– Which ads to show online and how much to charge for the ads? 1. Machine learning methods to build a predictive model to assess the likelihood that a user will click on an ad. By exploiting the enormous amount of data available online, this predictive model tells us which ads to show. 2. Economic theory to build auction models to set prices.
- Several e-commerce companies have built teams of
economists (often with PhDs in Economics), statisticians and computer scientists.
From Economics to Data Science:
- 1. Provide a Theory
- A theory is a way to investigate the mechanism through which
X affects Y. It is a way to make ML interpretable.
- “Interpretable machine learning”: ML field to go beyond a
“black box” approach to explain the logic behind predictions.
– To interpret a model, we require the following insights: 1. Identify the most important features. 2. For any single prediction, the effect of each feature in the data on that particular prediction. 3. The effect of each feature over a large number of possible predictions.
- Molnar (2019): https://christophm.github.io/interpretable-ml-book/ and the Kaggle crash course on ML explainability: https://www.kaggle.com/learn/machine-learning-explainability
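Insight 1 can be illustrated with permutation importance, one standard technique from this literature (the data and linear model below are assumptions for the example):

```python
# Permutation importance: shuffle one feature at a time and measure how
# much the model's error grows; features whose shuffling hurts most are
# the most important. (Simulated data; feature 2 is pure noise.)
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit a linear model
base_mse = np.mean((y - X @ beta) ** 2)

importance = []
for j in range(3):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])        # break feature j's link to y
    importance.append(np.mean((y - Xp @ beta) ** 2) - base_mse)
print(importance)  # feature 0 matters most, feature 2 least
```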
From Economics to Data Science:
- 2. Focus on Causality
- Machine learning algorithms optimize properties of the observed
data: they improve performance by optimizing parameters over a set of inputs. E.g., to build a predictive model we minimize overfitting.
– “As long as we optimize some properties of the observed data, however noble or sophisticated, while making no reference to the world outside the observed data, we are limited to questions of association.” Pearl (2018)
- However, lots of important questions involve cause-and-effect
- relationships. Until recently we had no mathematical framework to
articulate and answer these questions.
- "More has been learned about causal inference in the last few
decades than the sum total of everything that had been learned about it in all prior recorded history" (Gary King, Harvard). "The Causal Revolution" (Pearl and Mackenzie, 2018).
From Economics to Data Science:
- 2. Focus on Causality
(Pearl 2018)
- Human-level AI cannot emerge solely from model-blind
learning machines; it requires the symbiotic collaboration of data and models.
- Data science is only as much of a science as it facilitates
the interpretation of data - a two-body problem, connecting data to reality.
- Data alone are hardly a science, regardless of how big
they get and how skillfully they are manipulated.
– We need a theory to interpret the data.
From Data Science to Economics:
- 1. Test Robustness to Misspecifications
(Athey and Imbens 2015)
- Researchers interested in the effect of a given variable
(x) on an outcome (y) typically report the estimated impact of x on y and a measure of uncertainty of the estimate such as the standard error (se).
- Problem: the measure of uncertainty, that is the
statistical significance of the estimate, depends on the model’s specification:
– Different specifications of the model (variables included, functional forms, etc.) produce different estimates of x on y and associated se.
- Athey and Imbens (2015) propose a simple machine
learning approach to assess the sensitivity of the point estimates to model specification.
From Data Science to Economics:
- 1. Test Robustness to Misspecifications
(Athey and Imbens 2015)
- Idea: compute a measure of how sensitive the estimates
are to a range of alternative models.
- Estimate a series of models and compute the standard
deviation of the point estimate of the effect of x on y over the set of models.
- Each member of the set of model specifications is
generated by splitting the sample into subsamples based on covariate values, estimating the model separately for each subsample, and then combining the results to form a new estimate of the overall effect.
From Data Science to Economics:
- 1. Test Robustness to Misspecifications
- The idea to measure how sensitive the estimates are to
a range of alternative model specifications has been long discussed in Economics starting with Leamer (1983):
– Similar statistical inference results are obtained under different sets of assumptions.
- Machine learning provides effective tools to do this.
– Calculate the standard deviation of the estimated effect of x on y in different samples.
From Data Science to Economics:
- 2. New Tools for Causal Inference
- When estimating the causal effect of a given treatment,
we want to compare the observed outcome with the hypothetical outcome in the absence of the treatment (counterfactual).
- Machine learning methods can be used to build the best
predictive model for the counterfactual without the (sometimes excessive) monetary costs of running a randomized controlled experiment.
- Ex.: compare actual visits to a website following an
advertisement campaign (observed outcome) to the predicted visits absent the advertisement (counterfactual outcome) using time series data on past visits, seasonal effects, data on Google queries (pages 22-24 Varian 2014).
- Flexibly control for a large number of covariates and for
heterogeneous effects.
- Answer causal questions left unanswered (Varian 2016, PNAS).
- Emerging literature combining machine learning methods with
applied econometrics’ techniques to improve the estimation of causal effects (Section 4.2 in Athey and Imbens 2016).
– Random forests, boosting or LASSO to estimate the propensity score in the presence of many covariates (matching). – Improved LASSO through a double-selection method of selecting covariates that are correlated with the outcome, and covariates that are correlated with the treatment so that LASSO can produce causal effects (Belloni et al. 2013). – Wager and Athey (2018): “causal forests” - modification of random forests to produce asymptotically unbiased estimates with CI.
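The first combination can be sketched as follows; the design, penalty level, and use of inverse-propensity weighting are illustrative assumptions, not the exact procedures of the papers cited:

```python
# L1-penalized logistic regression estimates the propensity score
# P(D=1|X) with many candidate covariates (only one truly matters here),
# and inverse-propensity weighting then recovers the treatment effect.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n, p = 5000, 30
X = rng.normal(size=(n, p))
d = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))    # treatment driven by X[:, 0]
y = 1.0 * X[:, 0] + 2.0 * d + rng.normal(size=n)  # true causal effect = 2

ps = (LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
      .fit(X, d).predict_proba(X)[:, 1])
w = np.where(d, 1 / ps, 1 / (1 - ps))             # inverse-propensity weights
ipw = np.average(y[d], weights=w[d]) - np.average(y[~d], weights=w[~d])
print(ipw)  # close to 2, unlike the confounded difference in raw means
```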
From Data Science to Economics:
- 2. New Tools for Causal Inference
- Provide predictions that can be used for estimations
such as for the first stage of an IV estimation.
- Construct counterfactuals for policy evaluation.
- Answer PREDICTION POLICY PROBLEMS
(Kleinberg et al. 2015).
From Data Science to Economics:
- 3. Better Predictions
Prediction Policy Problems (Kleinberg et al. 2015)
- Many questions where causal inference is not necessary.
– Example 1: is the chance of rain high enough to require an umbrella? The benefits of an umbrella depend on rain. – Example 2: are the benefits of a hip surgery high enough to justify the surgery? The benefits of a hip surgery depend on whether the patient lives long enough after the surgery (Kleinberg et al. 2015). – Example 3: should someone who has been arrested be detained or released before trial? The decision depends on a prediction of the arrestee's probability of committing a crime.
- Therefore, pure prediction problems.
Prediction Policy Problems (Kleinberg et al. 2015)
- OLS focuses on unbiasedness (it is the best linear unbiased
estimator!) but can provide poor predictions out of sample.
- OLS: given a dataset D of n points (y, x), pick a function f̂ that
predicts the y value of a new data point. The goal is to minimize a loss function that we take to be the squared error (y - f̂(x))².
- OLS finds the f̂ that minimizes in-sample error:

min_f̂ Σ_{i=1}^{n} (y_i - f̂(x_i))²

- PROBLEM: ensuring zero bias in sample creates problems out of sample.
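The out-of-sample problem can be shown with a short simulation (data and dimensions are assumptions for the example): an OLS fit with many irrelevant regressors has small in-sample error but much larger error on fresh data.

```python
# OLS minimizes in-sample error; with many regressors relative to data
# points, the zero-bias in-sample fit degrades badly out of sample.
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 40                                  # few observations, many regressors
X_tr, X_te = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_tr = 2.0 * X_tr[:, 0] + rng.normal(size=n)   # true signal: one variable
y_te = 2.0 * X_te[:, 0] + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)  # in-sample minimizer
mse_in = np.mean((y_tr - X_tr @ beta) ** 2)
mse_out = np.mean((y_te - X_te @ beta) ** 2)
print(mse_in, mse_out)  # out-of-sample error is much larger
```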
Prediction Policy Problems (Kleinberg et al. 2015)
Prediction Policy Problems (Kleinberg et al. 2015)
- Machine learning techniques maximize predictive
performance by exploiting the bias-variance trade-off.
- Instead of minimizing only in-sample error, ML minimizes:

min_f̂ Σ_{i=1}^{n} (y_i - f̂(x_i))² + λ R(f)

- R(f) is a regularizer that penalizes functions that create variance; λ is the price at which we trade off variance for bias.
- With R(f) = Σ_j |β_j|^d: LASSO: d=1; RIDGE: d=2; OLS: λ=0.
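The penalized objective can be compared with plain OLS on the same kind of data (a sketch with assumed data and penalty levels):

```python
# Same wide, noisy design as the OLS example: the d=1 (LASSO) and d=2
# (ridge) penalties accept some bias in exchange for lower variance,
# which typically improves out-of-sample prediction over OLS.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge

rng = np.random.default_rng(4)
n, p = 60, 40
X_tr, X_te = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_tr = 2.0 * X_tr[:, 0] + rng.normal(size=n)
y_te = 2.0 * X_te[:, 0] + rng.normal(size=n)

results = {}
for name, m in [("OLS", LinearRegression()),
                ("LASSO", Lasso(alpha=0.1)),
                ("Ridge", Ridge(alpha=10.0))]:
    m.fit(X_tr, y_tr)
    results[name] = np.mean((y_te - m.predict(X_te)) ** 2)
    print(name, round(results[name], 2))
```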
Prediction Policy Problems (Kleinberg et al. 2015)
- λ is chosen using cross-validation:
– For a set of lambdas, estimate the algorithm on (f-1) folds and see which value of lambda produces the best prediction in the f-th fold. – Thus, we use the data itself to decide how to make the bias-variance trade-off.
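The fold procedure described above can be hand-rolled in a few lines (assumed data; the λ grid is chosen for illustration):

```python
# For each candidate lambda, fit LASSO on all folds but one, score on
# the held-out fold, and keep the lambda with the lowest average
# out-of-fold error - using the data to set the bias-variance trade-off.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p, folds = 100, 20, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)

def cv_mse(lam):
    errs = []
    for k in range(folds):
        held = np.arange(n) % folds == k               # k-th fold held out
        m = Lasso(alpha=lam).fit(X[~held], y[~held])
        errs.append(np.mean((y[held] - m.predict(X[held])) ** 2))
    return np.mean(errs)

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
best = min(lambdas, key=cv_mse)
print(best)
```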
- Advantages:
– Allow for more flexible functional forms:
- Higher order interaction terms and decision trees that allow for lots of
interactivity.
– Allow to make predictions using “wide” data, that is datasets that have more variables than data points:
- Example: language data often have ten times as many variables as data points.
Examples of Prediction Policy Problems (Chalfin et al. 2016)
- How to reduce excessive use of force by police in the US? Can we
reduce police excessive use of force by using machine learning techniques to identify high-risk officers and replace them with average- risk officers?
- Predictive model of whether an officer was ever involved in police
shooting or accused of abuse as a function of sociodemographic variables, previous behaviors and polygraph results. Training and test data.
- Significant reduction in predicted shootings if the predicted bottom
percentile of officers is replaced with the middle segment of the predicted distribution.
- On the contrary, replacement using the rank-ordering of applications
from the police hiring system increases predicted shootings.
Examples of Prediction Policy Problems (Chalfin et al. 2016)
- How can districts choose which teachers to retain after a
probationary period?
- Predictive model of students’ test scores (assumed
measure that the schools use to decide if teachers do well and should be retained) as a function of teachers’ variables (sociodemographic, classrooms observations, etc.), students (sociodemographic, test scores, etc.), and principals (survey about school and teachers).
- Much bigger gains from replacing the predicted bottom
10% of teachers with average quality teachers rather than by using the system of principal rating of teachers currently in place.
One More Example of Prediction Policy Problems (Kleinberg et al. 2017)
- After arrest, how can judges decide if the defendant is sent
home rather than to jail while waiting for trial?
- Predictive model of risk of committing a crime using an
algorithm that finds past defendants who are like the ones currently in court, and uses the crime rates of these similar defendants to predict the crime rates of those in court.
- Findings: pre-trial decisions made using the algorithm’s
predictions could reduce crimes committed by released defendants by up to 25%, and reduce incarceration by up to 42% without increasing crime.
Caveats of Prediction Policy Problems (Kleinberg et al. 2016)
- Kleinberg et al. (2016) make some important points on how best to use
machine learning for policy: – Focus on problems that require prediction. – Think about the outcome to be predicted.
- Example: predict overall crime rate rather than rates of specific crimes since
crimes are correlated.
– The outcome to predict has to be easy to measure and not affected by factors that are difficult to measure.
- Example: build a prediction model to inform the decision of whom to sentence
rather than whom to bail.
- Problem: sentencing depends not only on recidivism risk, but also on factors that
are not directly measured such as society’s sense of mercy and redemption.
– Test the algorithm using an experiment on new data.
- Example: we only have data on the defendants that a judge decided to release.
- To test if the prediction is good, run an RCT or a natural experiment to compare
bail decisions made using machine learning to those made by the judges.
Prediction and Causal Questions (Athey 2017)
- Pure prediction problems do not answer the more complex question of
estimating heterogeneous effects.
- Hip surgery example: the benefits of a hip surgery depend on whether the
patient lives long enough after the surgery (Kleinberg et al. 2015).
- We know the effect of the treatment is negative for the patients that will
die, so for them it is easy to decide for no surgery.
- However, an important open question remains: which patients should be
given priority to receive surgery among the ones that are likely to survive more than one year?
- Causal question that requires estimating counterfactuals scenarios of
the effects of alternative policies of assigning patients to hip surgeries.
Prediction and Causal Questions (Athey 2017)
- Prediction and causal inference are different.
- Several research questions can be framed as a prediction or as a
causal question.
- Example: data on hotel prices and occupancy rates.
- Question 1: what is the best prediction of hotels’ occupancy
given unusually high observed prices on a given day?
– Prediction question: the answer is likely to be “high occupancy” since hotels tend to raise prices as they become full.
- Question 2: what is the effect of increasing the prices on a
given day on occupancy rates?
– Causal question: the answer is likely to be “low occupancy” since an increase in prices is unlikely to increase occupancy.
Prediction and Causal Questions (Blake et al. 2015)
- Question: how effective is paid search advertisement?
- From simple regressions of sales on the amount spent on
advertisement, huge returns to advertisement.
- When running randomized control experiments, the returns
are zero or negative since the majority of the clicks did not result in sales.
- Prediction models would lead to misleading results.
Prediction and Causal Questions
- The causal question is the question of a counterfactual, i.e.
the question of “What would happen if?”
- An emerging literature is merging machine learning and
Economics to use machine learning methods for causal inference.
- An important example of this literature is the train-test-treat-compare (TTTC)
estimator (Varian 2016, PNAS), which we will discuss in the next weeks.
- Nowcasting (contemporaneous forecasts of economic
statistics):
– E.g.: proxy measures of unemployment claims and consumer confidence (Choi and Varian 2011).
- Improved decision making by exploiting real time
information and running experiments.
From Data Science to Economics: Other Advantages
- Novel measurement and research design:
– Machine learning can deal with data that are too high-dimensional for standard estimation methods (e.g. image information). – Very useful when reliable data on economic outcomes are missing such as in measuring poverty and wealth (Blumenstock 2018 and 2016, Jean et al. 2016, Blumestock, Cadamuro and On 2015), credit scores and loan repayment (Bjorkegren and Grissen 2018), and when we have datasets with missing data (Athey et al. 2019). – Innovative research designs: Bernheim et al. (2013) use a machine learning algorithm trained on a subset of respondents to a survey to predict actual choices from survey respondents, thus providing a tool to infer actual from reported behavior.
Challenges with Big Data
- Access: much of the novel BIG DATA belong to private companies.
- Data management and computation.
- Asking the right question: a huge amount of time is spent just to open and
describe the data.
- The data often do not come from instruments designed to produce valid and
reliable data (Lazer et al. 2014).
- Search algorithms, such as Google search, are not static, they are
constantly changed to improve performance.
- Search behavior is affected by the service provider, and the
data generating process changes.
- Studies using data from search engines, Facebook or Twitter
may not be replicated using data from earlier or later periods.
Machine Learning and Causal Inference
- The Problem of the “Missing” Counterfactual
- Program Evaluation and Machine Learning
- Randomized and Natural (or Quasi-) Experiments
Starred readings:
- * Angrist and Pischke, Mostly Harmless Econometrics, Princeton
and Oxford University Press, 2009, Chapter 2 pages 11-24.
- * Stock and Watson, Introduction to Econometrics, 3rd edition,
Chapter 13 pages 511-529 and 538-540.
- * Athey S. and G. Imbens. 2015. "Machine Learning Methods in
Economics and Econometrics", AER: Papers and Proceedings, 105(5): 476-480.
- * Varian, H. (2016) “Causal Inference in Economics and Marketing”,
PNAS, Vol. 113, No. 27, pages 7310-7315.
Topic 2
Challenge: Estimating the Causal Effect
- Drawing causal inference such as “What is the causal
effect of college education on earnings?” requires answering counterfactual questions:
1. What would the earnings of individuals who did not go to college have been if they had gone to college? 2. What would the earnings of individuals who did go to college have been if they had not gone to college?
- Problem: we never observe counterfactual outcomes
since we can not simultaneously observe a given person in two different states of the world.
- You either go to college or you do not…
The "Missing" Counterfactual Problem
- We can never observe the same person in two different
states of the world at the same time.
- We may have data on the same unit i in two consecutive
trials before and after the treatment (data on your wages before and after you complete college).
- BUT we can not be sure that the treatment effect for i is
the same we would have measured if we had observed i simultaneously in both states (with and without college):
– Carryover effects (effect of college on wages wears off slowly). – Time trends (“unobserved ability” may change over time).
Framework of Potential Outcomes (Rubin’s Causal Model)
- Each individual has two potential outcomes:
– Y0: potential outcome without treatment – Y1: potential outcome with treatment
Treatment effect for each individual: Y1 - Y0, but only one of the two outcomes is observed
- D=1 if i receives treatment, else D=0
- Observed outcome: Y = D*Y1 + (1-D)*Y0
- If individual is treated:
– Y1 is observed, – Y0 is a counterfactual
- If individual is not treated:
– Y0 is observed, – Y1 is a counterfactual
Selection Problem
Therefore, we have:
Y = Y0 + (Y1 - Y0)*D   (1)
Given (1), the comparison of the average value of Y for people observed in D=1 and D=0 is misleading since:
E(Y|D=1) - E(Y|D=0) =
= [E(Y1|D=1) - E(Y0|D=1)] + [E(Y0|D=1) - E(Y0|D=0)] =
= Average effect of treatment on the treated + Bias
The bias is the difference between the (average) counterfactual Y0 in the two populations (treated and untreated)
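The decomposition above can be verified numerically (simulated population with an assumed selection rule):

```python
# Selection problem: people with high Y0 opt into treatment, so the
# naive difference in means equals the true effect (2) plus the bias
# term E(Y0|D=1) - E(Y0|D=0), exactly as in the decomposition.
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
y0 = rng.normal(loc=10.0, size=n)       # potential outcome without treatment
y1 = y0 + 2.0                           # constant treatment effect of 2
d = y0 + rng.normal(size=n) > 10.5      # high-Y0 individuals select in
y = np.where(d, y1, y0)                 # observed outcome Y = Y0 + (Y1-Y0)D

naive = y[d].mean() - y[~d].mean()
bias = y0[d].mean() - y0[~d].mean()     # E(Y0|D=1) - E(Y0|D=0)
print(naive, bias)                      # naive = 2 + bias, not 2
```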
Selection Problem
Random assignment of D solves the selection problem since it makes D independent of potential outcomes. Formally:
E(Y|D=1) - E(Y|D=0) = E(Y1|D=1) - E(Y0|D=0) = E(Y1|D=1) - E(Y0|D=1)
since the independence of Y0 and D implies that E(Y0|D=1) = E(Y0|D=0). And:
E(Y1|D=1) - E(Y0|D=1) = E(Y1 - Y0)
- The effect of the randomly assigned treatment on the
treated is the same as the effect of the treatment on a randomly chosen individual.
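Replacing the selection rule with a coin flip in the same kind of simulated population (assumed setup) shows the independence at work:

```python
# Random assignment makes D independent of (Y0, Y1), so the simple
# difference in means now recovers the treatment effect of 2.
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
y0 = rng.normal(loc=10.0, size=n)
y1 = y0 + 2.0
d = rng.random(n) < 0.5            # random assignment, independent of Y0, Y1
y = np.where(d, y1, y0)

diff = y[d].mean() - y[~d].mean()
print(diff)  # close to 2
```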
Why is the bias likely? Example of a simple Roy model: "I will go to college if it is worth it."
Selection Rule: D=1 if Y1 - Y0 > C
Then in general:
E(Y0|D=1) = E(Y0|Y0 < Y1 - C), for those who chose treatment,
which is different from
E(Y0|D=0) = E(Y0|Y0 > Y1 - C), for those who chose not to be treated.
Bias due to comparative advantages in terms of (Y1 - Y0). For example:
- Participants can have larger potential gains.
- Heterogeneity in costs.
- Heterogeneity in preferences.
Parameters of Interest
Most commonly used, given some observables X:
- Average treatment effect (ATE): E(Y1 - Y0 | X)
- Average effect of treatment on the treated (TTE): E(Y1 - Y0 | D=1, X)
- Average effect of treatment on the untreated (UTE): E(Y1 - Y0 | D=0, X)
- ATE is the weighted average of TTE and UTE: ATE = P(D=1|X)*TTE + P(D=0|X)*UTE
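The last identity can be checked in a simulation with heterogeneous gains (assumed data-generating process):

```python
# With heterogeneous treatment effects and self-selection, TTE > UTE,
# and ATE equals the participation-weighted average of the two.
import numpy as np

rng = np.random.default_rng(8)
n = 100_000
gain = rng.normal(loc=2.0, scale=1.0, size=n)   # heterogeneous Y1 - Y0
d = gain + rng.normal(size=n) > 2.5             # larger gains select in

ate = gain.mean()
tte = gain[d].mean()                            # effect on the treated
ute = gain[~d].mean()                           # effect on the untreated
p = d.mean()                                    # share treated
print(ate, p * tte + (1 - p) * ute)             # the two numbers agree
```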
Goal of “Program Evaluation”
Find a “good” comparison group to make up for not knowing counterfactual outcomes
Illustration:
- Identification problem:
we observe E(Y0|D=0), E(Y1|D=1), but not the counterfactuals: E(Y0|D=1), E(Y1|D=0)
- Example: to estimate the TTE, we would need
TTE = E(Y1|D=1) - E(Y0|D=1) (Problem: second term is unobserved)
- Assumption to identify TTE:
E(Y0|D=1)=E(Y0|D=0)=E(Y0), i.e. no selectivity based on outcome in untreated state
Substitute unobserved second term with observed E(Y0|D=0)
(it is satisfied in randomized experiments)
Different Approaches to Program Evaluation
1. Run an experiment and use simple differences estimator. 2. Use observational data to construct the counterfactual:
- a. Selection on observables:
- "Unconfoundedness assumption": we assume to observe all X
variables that affect both the participation decision or treatment (ex. completing college) and the outcome of interest (ex. wages).
- Diff-in-Diff
- Matching
- Regression discontinuity
- b. Selection on unobservables
- Instrumental variable estimation
- Control function approach
Experiments
- Randomized experiment:
- Setting where the assignment of the treatment does not depend on
either observables or unobservables, and the researcher has control over the assignment (Cochran 1972).
- Designed and implemented consciously by human researchers.
- Entails a conscious use of a treatment and a control group with
random assignment (e.g. clinical trials of drugs).
- “Natural” or quasi-experiment:
- Source of randomization that is “as if” randomly assigned, but this
variation was not part of a conscious randomized treatment and control design.
Randomized Experiments
How can randomization solve the problem of not observing counterfactual outcomes?
- Comparison group selected using a randomization device to exclude a
fraction of program applicants from a given treatment; by definition there is no selection into treatment (if randomization worked).
- Main advantage: comparability between program participants and
nonparticipants – the same distribution of observables and unobservables in treatment and control group.
- Formally: randomization leads to (Y0, Y1) independent of D, so that
E(Y0|D=1) = E(Y0|D=0) and the simple comparison of treatment and control means identifies the treatment effect.