Introduction to Bayesian Statistics, Louis Raes, Spring 2017
Table of contents
Organisation, goals What is Bayesian Statistics? Introduction Statistical basics Doing Bayesian Analysis Conjugacy Grid approximation Metropolis algorithm Gibbs Sampling
Organisation
- Small group; very advanced students
- Lecture: introduce some concepts and ideas → readings and assignments at home → assignments are not graded
- The course is evaluated with an oral exam; in the last lecture, exam requirements are handed out → a set of questions (with much overlap with the assignments) is handed out → prepare a few for the exam
Teaching Style
- Throughout the slides I have put questions (in red).
→ I expect you to solve these in class (I will pause) → someone will have to present the solution to classmates → question too easy? No problem, just solve it, it won't take long → struggling with a question is an indication that you need to read up on this; I refer to chapters and pages from books which may be helpful. A reader with selected readings is provided.
Teaching Style
- Questions in blue are not solved in class but need to be done at home. Prepare them for the next lecture. I expect students to have solved the blue questions and to bring their answers (or attempts) with them. You are allowed to prepare in groups of two. (In case of abuse, I'll toughen requirements.)
Organisation
- Bayesian inference means computation.
- Feel free to choose your language / computing environment → R
and Matlab are good choices; I use R → your choice is your own responsibility
- Tons of resources online; do not mindlessly copy-paste.
→ In practice I too use a lot of online and offline resources, though the hope is that this is not done mindlessly.
Goals
- Have an idea about Bayesian inference.
- Develop intuition about basic concepts.
- After these lectures you should be able to move quicker in case of self-study; if you do not pursue this further, you at least have an idea.
Goals
- Pace is (relatively) slow, getting fundamentals right.
- Do not forget the readings; they are an integral part of the course (and exam)
Further resources
- Doing Bayesian Data Analysis, John Kruschke
Shorthand: Kruschke → easiest resource on Bayesian inference I am aware of → geared towards psychologists → quirky (humor, poetry) → super accessible; focus on R → choice of topics might be a bit erratic, not ideal for economics applications unless you do experiments (warning: I have no idea about the usage of Bayes among experimentalists)
Further resources
- Bayesian Analysis for the Social Sciences, Simon Jackman
Shorthand: Jackman → Graduate level, many applications in political science → Good blend of theory (more formal) and applications → Not too easy; focus on R and Bugs/Jags (Bugs is a bit outdated by now → check out Stan)
Further resources
- Bayesian Data Analysis, Gelman et al.
Shorthand: BDA → "The" textbook on Bayesian Statistics; scholar.google: 19318 cites (5/5/2017) → very comprehensive but with an applied bent → covers applications in many fields
- For applications in economics and material geared towards econometrics → Gary Koop → leading Bayesian; many syllabi, papers, textbooks, and code (Matlab)
Introduction
Bayesian inference means using Bayes’s theorem to update beliefs regarding an object of interest after observing data. prior beliefs → data → posterior beliefs ???
Introduction
Bayesian inference means practical methods to make inferences from data we observe via probability models, to learn about quantities we are interested in. Three steps:
- 1. Set up a full probability model: a joint distribution for all observable and unobservable quantities, consistent with our knowledge about the problem as well as the data collection process
- 2. Condition on the observed data: the posterior
- 3. Evaluate model fit and the implications of the posterior distribution: does the model fit the data? Are the conclusions reasonable? How sensitive are they? → if necessary, go back to step 1. (See BDA chapter 1)
Introduction
- Bayesian statistics is often preferred for philosophical reasons; many disciplines are moving towards interval estimation rather than hypothesis testing, and the Bayesian probability interval is much closer to the (common-sense) interpretation of what a confidence interval should be.
Question:
- 1. In a classic (frequentist) setting, what is the interpretation of a confidence interval?
- 2. What is the interpretation of a p-value? Write this down carefully, in your own words.
Introduction
Ignoring philosophical discussions, Bayesian inference has other advantages: (i) flexible and general → deal with very complicated problems; (ii) take uncertainty seriously → uncertainty in derived quantities (iii) pooling / shrinkage.
Introduction
- To have a meaningful discussion on the merits, we need to
introduce quite a bit of jargon.
- Philosophical discussions and comparisons with other approaches will come later on.
Introduction
Assignment:
- 1. Read the Introduction of Jackman’s book as well as chapter 1
until p.8
Concepts
Question:
- 1. Write down the definition of conditional probability.
- 2. Write down an example highlighting / explaining this.
- 3. What is the law of total probability?
- 4. What is Bayes’ rule.
Introduction
Definition: Let A and B be events with P(B) > 0; then the conditional probability of A given B is: P(A|B) = P(A ∩ B) / P(B). (1) While we state it as a definition, it can be justified with a betting argument following de Finetti or derived from more elementary axioms.
Introduction
Definition (multiplication rule): P(A ∩ B) = P(A|B)P(B) (2)
Definition (law of total probability): P(B) = P(A ∩ B) + P(¬A ∩ B) = P(B|A)P(A) + P(B|¬A)P(¬A) (3)
Introduction
Bayes's Theorem: If A and B are events with P(B) > 0, then P(A|B) = P(B|A)P(A) / P(B) (4) Proof: . . .
Introduction
In Bayesian Statistics we use the above theorem as follows. Let A denote a hypothesis, and B the evidence, then we see that the theorem provides the probability of the hypothesis A after having seen the data B.
Introduction
A common illustration is a (drug) testing example. Do this yourself. Write down a reasonable rate of false negatives for a drug test and a reasonable rate of false positives. Choose a true rate of drug use in a subject pool (think of students taking some meds during exam period, or athletes). Now derive the posterior probability for a subject, randomly drawn from the subject pool for testing, who returns a positive test. What is the probability that the subject has used a substance? Does the result align with your a priori intuition?
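The calculation above is a direct application of Bayes' rule with the law of total probability in the denominator. A minimal sketch (the course leaves the computing language free; this one is in Python, and the rates below are illustrative assumptions, not values from the slides):

```python
def posterior_positive(prevalence, sensitivity, false_positive_rate):
    # Bayes' rule: P(use | +) = P(+ | use) P(use) / P(+),
    # with P(+) computed via the law of total probability.
    p_pos = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    return sensitivity * prevalence / p_pos

# Illustrative assumptions: 2% of subjects use, 5% false negatives (so 95%
# sensitivity), 10% false positives.
p = posterior_positive(0.02, 0.95, 0.10)
# Even after a positive test, the probability of use stays well below one half.
```

Because the base rate is low, most positives are false positives, which is exactly the a priori intuition the question probes.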
Introduction
In our research we often work in a continuous setting and we consider continuous parameters we want to learn about: a regression coefficient, a proportion, etc. Let θ denote the parameter we want to learn about and y = (y1, . . . , yn)′ the data at hand. Beliefs are represented as probability density functions, or pdf's. So the prior on θ is p(θ) and the posterior is p(θ|y). Bayes's Theorem:
p(θ|y) = p(y|θ)p(θ) / ∫ p(y|θ)p(θ) dθ (5)
Proof: . . .
Introduction
Often this is written as p(θ|y) ∝ p(y|θ)p(θ), (6) where the constant of proportionality (the denominator missing here), ensures that the posterior integrates to 1, as a proper density must. Furthermore, the term p(y|θ) is nothing more than the likelihood. Hence the mantra: the posterior is proportional to the prior times the likelihood.
Introduction
A practical example: coin tossing. What is the chance that a coin comes up heads? → two possibilities that are mutually exclusive → each datum is independent. Analogous problems: the proportion of babies that are girls, or the proportion of heart surgery patients who survive after 1 year.
Introduction
Step 1: specify the likelihood. The probability of a coin coming up heads is a function of an unknown parameter θ: p(y = 1|θ) = f(θ) (7) We assume the identity function, so we have p(y = 1|θ) = θ and also p(y = 0|θ) = 1 − θ. Combined we have the Bernoulli distribution: p(y|θ) = θ^y (1 − θ)^(1−y)
Introduction
When we flip a coin N times, we have N observations, so we get:
p({y1, . . . , yN}|θ) = ∏i p(yi|θ) = ∏i θ^(yi) (1 − θ)^(1−yi) (8)
Or, writing z for the number of times heads shows up in N flips, we get:
p(z, N|θ) = θ^z (1 − θ)^(N−z) (9)
Introduction
So we have a likelihood, now we need a prior. The key requirement for this prior is that it lives on the interval [0, 1]. Why? → Remember: the posterior can only exist over the support of the prior!
Introduction
Proposal: the Beta distribution: p(θ|a, b) = Beta(θ|a, b) = θ^(a−1) (1 − θ)^(b−1) / B(a, b) (10)
Introduction
The beta distribution depends on two parameters; B(a, b) is a normalizing constant that ensures the area under the density integrates to 1:
B(a, b) = ∫₀¹ θ^(a−1) (1 − θ)^(b−1) dθ
mean: θ̄ = a/(a + b)
Introduction
Think of a and b as prior observed data: a heads and b tails in a total of a + b flips. Question: What should this prior look like for large a and small b? What should it look like for a large a and an equally large b? What do you think it looks like for a = 1, b = 1? Let your instincts speak (no googling); draw a few sketches.
Beta(5,5); Beta(5,1); Beta(1,1)
[Figure: the three prior densities p(θ) plotted against θ on [0, 1].]
Introduction
So the goal is to move from the data and the priors to the posterior. Say we have observed 5 coin flips: heads, tails, tails, heads, heads. → What does the posterior look like?
Prior: Beta(1,1), Data: 1,0,0,1,1
[Figure: three panels against θ showing the Beta(1,1) prior, the likelihood, and the resulting posterior.]
Prior: Beta(10,10), Data: 1,0,0,1,1
[Figure: three panels against θ showing the Beta(10,10) prior, the likelihood, and the resulting posterior.]
Obtaining the posterior
- Analytically: conjugacy
- Computationally:
- 1. Grid approximation
- 2. Metropolis algorithm (MCMC, Gibbs, Hamiltonian Monte Carlo)
Conjugate Priors
Formally: If F is a class of sampling distributions p(y|θ), and P is a class of prior distributions for θ, then the class P is conjugate to F if: p(θ|y) ∈ P for all p(.|θ) ∈ F and p(.) ∈ P. Key interest in natural conjugate priors, which arise by taking P to be the set of all densities having the same functional form as the likelihood.
Conjugate Priors
→ Conjugacy means that the posterior looks like the prior. → Conjugacy means computational convenience. → Conjugacy means that the prior has some interpretation as prior data
Conjugate Priors
Let us go back to our example. Posterior ∝ prior × likelihood Prior: Beta(a, b) Likelihood: Bernoulli
Conjugate Priors
Posterior = prior × likelihood:
p(θ|z, N) = θ^(a−1) (1 − θ)^(b−1) θ^z (1 − θ)^(N−z) / [B(a, b) p(z, N)]
= θ^(z+a−1) (1 − θ)^(b−1+N−z) / B(z + a, N − z + b) (11)
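In code, the conjugate update in equation (11) reduces to bookkeeping: add the heads to a and the tails to b. A minimal Python sketch (language choice is free in this course) using the five flips from the slides:

```python
def beta_binomial_posterior(a, b, flips):
    # Beta(a, b) prior + Bernoulli flips -> Beta(a + z, b + N - z), per eq. (11).
    z, N = sum(flips), len(flips)
    return a + z, b + N - z

# The five flips from the slides: heads, tails, tails, heads, heads.
a_post, b_post = beta_binomial_posterior(1, 1, [1, 0, 0, 1, 1])
post_mean = a_post / (a_post + b_post)   # (z + a) / (a + b + N)
```

With the flat Beta(1,1) prior this gives a Beta(4, 3) posterior, matching the first figure above.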
The relationship between prior and posterior
Prior mean: a/(a + b)
Posterior mean: (z + a)/(z + a + b + N − z) = (z + a)/(a + b + N)
→ rearrange into a weighted mean of the prior mean and the data proportion z/N:
(z + a)/(a + b + N) = (z/N) × N/(N + a + b) + a/(a + b) × (a + b)/(N + a + b) (12)
The relationship between prior and posterior
So the posterior is always somewhere between prior mean and proportion in data. Question: Look at the weight, when does the prior become less important?
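The weighted-mean identity in equation (12) is easy to verify numerically. A small Python sketch; the parameter values in the comments are arbitrary illustrations:

```python
def posterior_mean(a, b, z, N):
    return (z + a) / (a + b + N)

def weighted_form(a, b, z, N):
    # Eq. (12): weight N/(N + a + b) on the data proportion z/N and
    # weight (a + b)/(N + a + b) on the prior mean a/(a + b).
    w = N / (N + a + b)
    return (z / N) * w + (a / (a + b)) * (1 - w)

# As N grows (with z/N fixed), the prior weight (a + b)/(N + a + b) shrinks,
# so the posterior mean is pulled towards the data proportion.
```

This is the answer to the question above in code form: the prior loses importance as N dominates a + b.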
Conjugacy
Even an applied Bayesian should have a good working knowledge of distributions. Some distributions that are common in a Bayesian context (but maybe less familiar) are the Beta, Gamma, Wishart and Cauchy distributions, to name a few. On the next slide I reproduce a table from a randomly chosen recent economics article: J. Fuhrer, JME, 2017.
Assignment: Visualize the priors and overlay the prior with the posterior. We do not know the shape of the posterior entirely, so you may pick the same distribution as the prior (but differently parametrized).
Conjugacy
We have a posterior. Now what?
Bayesian analysis yields an entire distribution: the posterior. What to report? → Any feature of the posterior you deem relevant . . . But, only mentioning a single moment → ignoring lots of information.
We have a posterior. Now what?
- We can specify a loss function and derive from it what to communicate from a posterior. A quadratic loss function leads to the mean, a symmetric linear loss function leads to the median, and asymmetric linear loss functions lead to quantiles other than the median.
Assignment:
- 1. Read the relevant pages in Jackman to obtain some intuition on this.
- 2. Think of a situation (economic model, regression parameter, . . . ) in which we could reasonably specify an asymmetric loss function, i.e. you'd prefer to err in one direction rather than the other. Describe such a situation.
We have a posterior. Now what?
- We can also communicate confidence intervals → credible regions → uncertainty intervals.
Assignment: Read this short post on the blog by Andrew Gelman:
http://andrewgelman.com/2016/11/26/reminder-instead-confidence-interval-lets-say-uncertainty-interval/
Also read the comments and the links to other posts within the comments. Read on Highest Probability Regions in Jackman.
Intermezzo: Bayes Factors
We should briefly mention Bayes factors. The Bayes factor is a comparison between two models M0 and M1.
Definition: Given data y and two competing models M0 and M1, the Bayes factor is the ratio of the posterior odds to the prior odds:
[p(M1|y) / p(M0|y)] / [p(M1) / p(M0)] (13)
and provides a summary of the evidence for M1 against M0 provided by the data.
Intermezzo: Bayes Factors
A Bayes factor below 1 indicates negative support for M1, a Bayes factor of 3-12 means strong support etc. Most Bayesian textbooks provide a further discussion on Bayes Factors (e.g. Jackman or BDA).
Obtaining a posterior
So far we have only discussed obtaining a posterior via conjugate priors. This already covers many possibilities; see for example the wiki overview. But it is still limiting: → maybe we have a reason to specify some goofy-looking prior. → We may be dealing with a complex model which cannot easily be squeezed into the conjugate mold. → Most models for observational data (or models with sufficient structure) cannot be dealt with via conjugacy.
Grid Approximation
Let us go back to the coin flipping example. Now we specify a prior defined only for a finite number of θ values (the grid). For example, we assume that θ can only take the values 0.25, 0.5, 0.75. → Our prior expresses probability mass at each value of θ.
Grid Approximation
Bayes' rule is then:
p(θ|y) = p(y|θ)p(θ) / Σθ p(y|θ)p(θ)
with the sum in the denominator over the finite number of discrete values of θ that we are considering, and p(θ) the probability mass at θ.
Grid Approximation
By specifying a grid we may approximate continuous prior densities. We use the discrete form of Bayes' rule, which may be much easier than the continuous case, which requires evaluating an integral.
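The discrete form above takes only a few lines of code. A minimal Python sketch with an assumed triangular prior and made-up flips (the assignment that follows asks you to build your own version with a goofier prior):

```python
def grid_posterior(grid, prior_mass, flips):
    # Discrete Bayes' rule: posterior mass = likelihood * prior / sum over grid.
    z, N = sum(flips), len(flips)
    unnorm = [t ** z * (1 - t) ** (N - z) * p for t, p in zip(grid, prior_mass)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

grid = [i / 100 for i in range(1, 100)]
raw = [min(t, 1 - t) for t in grid]   # a "goofy" triangular prior, peaked at 0.5
s = sum(raw)
prior = [p / s for p in raw]          # normalize to a probability mass function
post = grid_posterior(grid, prior, [1, 0, 0, 1, 1])
```

Plotting prior, likelihood and post against the grid gives the three panels the assignment asks for.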
Grid Approximation
Assignment: Apply grid approximation to the coin tossing problem. Create code where you generate 10 coin flips (randomly) from a fair coin. Specify a nonstandard prior (make something goofy, be creative) and apply grid approximation. Make graphs of the likelihood, the prior and the posterior.
Grid Approximation
→ Grid approximation serves mostly a pedagogical purpose. → If we have a 10-dimensional parameter space and we want to set up a grid of 1000 values per dimension, then we have 1000^10 values to evaluate → problematic
Metropolis algorithm
MCMC methods, or Markov Chain Monte Carlo methods, are what makes Bayesian statistics work in practice. There are many varieties, and a decent understanding requires some investment in the theory of Markov chains (e.g. Jackman chapter 4). Here the focus is on providing a low-tech, intuitive understanding; I follow Kruschke closely. Assignment: Look for an explanation of Metropolis-Hastings online or in a textbook, and try to grasp it.
Metropolis algorithm
Goal: understand the posterior. Solution: sample many points from it and just compute mean, standard deviation etc. So even if we lack analytical formulas, if we can randomly sample many points, we can calculate what we need.
Metropolis algorithm
Assignment: Consider Uniform[0, 3]. What is the mean? Generate 10 values from this distribution and calculate the mean. Do this also for 100, 1,000 and 10,000 values. Put these on a plot along with the analytically calculated mean. In class: "guess" what you would obtain and sketch it by hand. At home: write some code to do this.
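One possible shape for the at-home code, as a Python sketch (the analytic mean of Uniform[0, 3] is 1.5):

```python
import random

random.seed(1)
true_mean = 1.5                      # analytic mean of Uniform[0, 3]
estimates = {}
for n in [10, 100, 1_000, 10_000]:
    draws = [random.uniform(0, 3) for _ in range(n)]
    estimates[n] = sum(draws) / n    # Monte Carlo estimate of the mean
# The estimates wobble around 1.5 and the wobble shrinks as n grows.
```

Plot the estimates against n, with a horizontal line at 1.5, to see the convergence.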
Metropolis algorithm
The Monte Carlo Principle: Anything we want to learn about a random variable θ can be learned by sampling many times from f (θ), the density of θ. Simulation consistency: (informally:) As the number of draws of f (θ) increases, the simulation based estimate of some feature of θ, h(θ) improves. (See Jackman 3 for a formal discussion).
Metropolis algorithm: a politician
How can we get representative samples from a distribution? The story of a politician (Kruschke, chapter 7, see reader). A politician travels through a series of islands. Each day he decides whether to stay on the island, travel west or travel east. All islands lie on a line and he can move only one island left or right on a given day. He wants to spend time on the islands proportionally to their population: about twice as much time on island A as on island B if island A is twice as populous.
Metropolis algorithm: a politician
The politician has little information. He knows the population of the island where he resides, and he can ask the population of an adjacent island. His mechanism is as follows: flip a coin; if heads, propose moving east, if tails, propose moving west. If the proposed island (east or west) is more populous, definitely go. If it is less populous, go only probabilistically: the probability of moving is pmove = pop_proposed / pop_current. The politician draws from the Uniform[0, 1] distribution; if his draw is between 0 and pmove, he moves, otherwise he stays.
Metropolis algorithm: a politician
In the long run, the probability that the politician is on any one of the islands exactly matches the relative population of that island!
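This claim can be checked by simulation. A minimal Python sketch of the island-hopping rule; the seven relative populations are assumed for illustration (the slides use a graph not reproduced here):

```python
import random

random.seed(0)
pop = [1, 2, 3, 4, 5, 6, 7]            # assumed relative populations of 7 islands
pos = 3                                 # start on the middle island
visits = [0] * len(pop)
steps = 100_000
for _ in range(steps):
    proposal = pos + random.choice([-1, 1])   # flip a coin: propose west or east
    if 0 <= proposal < len(pop):
        if random.random() < min(pop[proposal] / pop[pos], 1.0):
            pos = proposal                    # move with probability pmove
    visits[pos] += 1                          # proposing off the chain means staying

freq = [v / steps for v in visits]
target = [p / sum(pop) for p in pop]    # long-run frequencies should match these
```

After many steps, freq tracks target: the politician's time on each island is proportional to its population.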
Metropolis algorithm: example of island hopping
Let us consider an example of island hopping: Assume we have 7 islands, with relative proportions as given in the graph below.
[Figure: the target distribution over the 7 islands and the politician's visit frequencies.]
Summary:
◮ We are at θ_current
◮ Propose left or right, each with 50% chance → the proposal distribution
◮ The decision to accept is based on the ratio of the target distribution at the proposed position to the target distribution at the current position.
◮ Ratio > 1: accept; ratio < 1: accept probabilistically.
So the probability to move is pmove = min(P(θ_proposed)/P(θ_current), 1)
Metropolis algorithm: example of island hopping
What do we need to do this?
- 1. Be able to draw values from the proposal distribution.
- 2. Be able to evaluate the target distribution at any proposed
position.
- 3. Be able to draw from the uniform distribution.
→ With these things we can generate samples from the target distribution.
Metropolis algorithm: some intuition on why it works
We show here the intuition on why it works (following Kruschke). Say we are at position θ, and write the probability of moving to θ + 1 as p(θ → θ + 1). Then p(θ → θ + 1) = 0.5 × min(P(θ + 1)/P(θ), 1). Vice versa: p(θ + 1 → θ) = 0.5 × min(P(θ)/P(θ + 1), 1)
Metropolis algorithm: some intuition on why it works
The ratio of these probabilities:
p(θ → θ + 1) / p(θ + 1 → θ) = [0.5 × min(P(θ + 1)/P(θ), 1)] / [0.5 × min(P(θ)/P(θ + 1), 1)]
= 1 / [P(θ)/P(θ + 1)] if P(θ + 1) > P(θ)
= [P(θ + 1)/P(θ)] / 1 if P(θ + 1) < P(θ)
In both cases this equals P(θ + 1)/P(θ). (14)
Metropolis algorithm: some intuition on why it works
The relative probability of transitions between adjacent positions, exactly matches the relative values of the target distribution. → intuition: if it is true for adjacent positions, we extrapolate to the whole range of positions. Read the relevant pages in Kruschke to see this claim further worked out in this simple context. (See reader)
Metropolis algorithm: essentials
Note the super stylized setup: discrete positions, 1 dimension, propose left or right. The algorithm works far more generally: continuous settings, multiple dimensions, more general proposal distributions.
Metropolis algorithm: general description (Jackman ch.5)
The Metropolis-Hastings algorithm defines a set of jumping rules to generate a Markov chain over Θ, the support of the posterior p(θ|y). At the start of iteration t we have values θ^(t−1), and we go to θ^(t) as follows:
- 1. Sample a vector of proposal values θ∗ from a proposal or jumping distribution Jt(θ∗|θ^(t−1))
- 2. Calculate the acceptance ratio: r = [p(θ∗|y) Jt(θ^(t−1)|θ∗)] / [p(θ^(t−1)|y) Jt(θ∗|θ^(t−1))]
- 3. α = min(r, 1)
- 4. Sample u ∼ U(0, 1)
- 5. If u ≤ α, then θ^(t) = θ∗
- 6. Otherwise θ^(t) = θ^(t−1)
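The steps above can be sketched for a continuous target. A minimal Python sketch using the coin-example posterior (Beta(1,1) prior, z = 3 heads in N = 5 flips, so the target is proportional to θ³(1 − θ)²); with a symmetric Gaussian proposal the Jt ratio cancels, so this reduces to plain Metropolis:

```python
import random

def target(theta):
    # Unnormalized posterior from the coin example: theta^3 (1 - theta)^2.
    if not 0.0 < theta < 1.0:
        return 0.0
    return theta ** 3 * (1 - theta) ** 2

random.seed(42)
theta = 0.5
chain = []
for t in range(50_000):
    proposal = theta + random.gauss(0, 0.2)   # symmetric jump: Jt ratio cancels
    r = target(proposal) / target(theta)      # acceptance ratio
    if random.random() < min(r, 1.0):
        theta = proposal                      # accept the proposal
    chain.append(theta)                       # otherwise keep the old value
burned = chain[5_000:]                        # discard a burn-in period
est_mean = sum(burned) / len(burned)          # true posterior mean is 4/7
```

The proposal scale (0.2 here) is an arbitrary choice; too small and the chain crawls, too large and almost everything is rejected, which is the "bad proposal" issue raised below.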
Metropolis algorithm: issues
- What should the proposal distribution look like? What would be a bad choice?
Metropolis algorithm: issues
- Burn-in, Efficiency, Convergence
- Metropolis algorithm is an example of Markov Chain Monte
Carlo.
Metropolis algorithm: some code
I have written some code to demonstrate the Metropolis algorithm. Metropolis.r → inspect the code.
Metropolis algorithm: some code
Assignment: Do exercise 7.1 of Kruschke (see handout). Ideally you try to build the algorithm from scratch. If that does not work out, start from the code provided in Kruschke or my code. → focus on understanding the how and why
Metropolis algorithm: further thoughts
The Metropolis-Hastings algorithm is remarkable, and while it works in theory, practice may prove very difficult. In a realistic multi-dimensional setting, merely specifying a reasonable proposal distribution may be very hard to do. → Many more sophisticated algorithms exist to tackle various problems: e.g. adaptive algorithms
Metropolis algorithm: further thoughts
Another approach relies on breaking the high dimensional posterior density p(θ|y) into a series of lower dimensional problems. This brings us to the Gibbs sampler.
What is Gibbs Sampling
The Gibbs sampler rests on the following idea: Joint probability densities can be fully characterized by their component conditional densities.
Gibbs: Two coins
As an illustrative example, consider the following from Kruschke (chapter 8): assume two coins which are flipped. We want to make inferences about the difference in the proportions of heads. (How different are the biases?)
Gibbs: Two coins: priors
Do not confuse the independence of the priors with the independence of the observations: → it could be reasonable to assume dependent priors, but independent observations! The likelihood here is:
p(y|θ1, θ2) = θ1^(z1) (1 − θ1)^(N1−z1) θ2^(z2) (1 − θ2)^(N2−z2)
with Nx and zx the number of trials and successes for coin x.
Gibbs: Two coins:
The problem is fairly simple and can be solved analytically, by grid approximation, or by Metropolis-Hastings.
Here: Gibbs sampling
Gibbs: requirement
To do Gibbs, we need to generate samples from the posterior distribution conditional on each individual parameter. → additional requirement. If we can draw from p(θ1|θ2, y) and from p(θ2|θ1, y), Gibbs provides us with p(θ1, θ2|y).
Gibbs:
Gibbs is again a random walk where the next step only depends on the current position. Gibbs cycles through each parameter component θi ∈ θ. → Say we select θi; then we generate a random value directly from p(θi|θ_{j≠i}, y). The new parameter component, along with all other components unchanged, constitutes the new position in the random walk.
Gibbs:
Gibbs can be seen as a special case of Metropolis-Hastings, with a proposal that is always accepted.
Assignment:
- 1. Try to write code for the two-coin problem. Simulate 10 flips for each coin (assume the coins are independent). Next, specify beta priors and show what the conditional beta distributions look like. Not mandatory: write code for the Gibbs sampler in this context and make graphs of the prior, likelihood and posterior.
Gibbs: general description (Jackman ch. 5)
Consider partitioning the parameter vector θ into d blocks (subvectors or scalars), θ = (θ1, . . . , θd). Index iterations by t:
- 1. sample θ1^(t+1) from g1(θ1 | θ2^(t), θ3^(t), . . . , θd^(t), y)
- 2. sample θ2^(t+1) from g2(θ2 | θ1^(t+1), θ3^(t), . . . , θd^(t), y)
- 3. . . .
- 4. sample θd^(t+1) from gd(θd | θ1^(t+1), θ2^(t+1), . . . , θ_(d−1)^(t+1), y)
- 5. set θ^(t+1) = (θ1^(t+1), θ2^(t+1), . . . , θd^(t+1))′
So after d steps we have a fully updated θ vector.
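The scheme above can be sketched for the two-coin problem. One caveat: with independent beta priors and independent flips, the full conditionals do not actually depend on the other coin, which is what keeps this sketch short. The data and prior parameters below are assumptions, not values from the slides:

```python
import random

random.seed(7)
# Assumed example data: 10 flips per coin, coin 1 shows 7 heads, coin 2 shows 3.
# Beta(2, 2) priors on each bias; both choices are illustrative.
a, b = 2, 2
z1, N1 = 7, 10
z2, N2 = 3, 10

draws = []
for t in range(20_000):
    # Full conditionals: p(theta1 | theta2, y) = Beta(a + z1, b + N1 - z1),
    # and likewise for theta2, so each Gibbs step is a direct beta draw.
    theta1 = random.betavariate(a + z1, b + N1 - z1)
    theta2 = random.betavariate(a + z2, b + N2 - z2)
    draws.append(theta1 - theta2)

diff_mean = sum(draws) / len(draws)   # posterior mean of theta1 - theta2
```

A histogram of draws is the posterior of the bias difference the two-coin question asks about.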
Gibbs:
Gibbs is important for historical reasons. As a standalone algorithm you'll rarely see it in practice, but it is an important building block and a foundational algorithm. See for example Metropolis-within-Gibbs, data augmentation, the slice sampler, etc. Assignment: pick any of the questions below.
- 1. Use an example to explain and demonstrate Metropolis within
Gibbs
- 2. Explain the intuition behind data augmentation.