Introduction to Bayesian Statistics, Louis Raes, Spring 2017
Table of contents
Organisation, goals What is Bayesian Statistics? Introduction Statistical basics Doing Bayesian Analysis Conjugacy Grid approximation Metropolis algorithm Gibbs Sampling
Organisation
- Small group; very advanced students
- Lecture: introduce some concepts and ideas → readings and assignments at home → assignments are not graded
- The course is evaluated with an oral exam; in the last lecture, exam requirements are handed out → a set of questions (with much overlap with the assignments) is handed out → prepare a few for the exam
Teaching Style
- Throughout the slides I have put questions (in red).
→ I expect you to solve these in class (I will pause) → someone will have to present the solution to classmates → question too easy? No problem, just solve it, it won't take long → struggling with a question is an indication that you need to read up on this; I refer to chapters and pages from books which may be helpful. A reader with selected readings is provided.
Teaching Style
- Questions in blue are not solved in class but need to be done at home. Prepare them for the next lecture. I expect students to have solved the blue questions and to bring their answers (or attempts) with them. You are allowed to prepare in groups of two. (In case of abuse, I'll toughen requirements.)
Organisation
- Bayesian inference means computation.
- Feel free to choose your language / computing environment → R
and Matlab are good choices; I use R → your choice is your own responsibility
- Tons of resources online; do not mindlessly copy-paste.
→ In practice I too use a lot of online and offline resources, though the hope is that this is not done mindlessly.
Goals
- Have an idea about Bayesian inference.
- Develop intuition about basic concepts.
- After these lectures you should be able to move quicker in case of self-study; if you do not pursue this further, you at least have an idea.
Goals
- Pace is (relatively) slow, getting fundamentals right.
- Do not forget the readings; they are an integral part of the course (and exam)
Further resources
- Doing Bayesian Data Analysis, John Kruschke
Shorthand: Kruschke → easiest resource on Bayesian inference I am aware of → geared towards psychologists → quirky (humor, poetry) → super accessible; focus on R → choice of topics might be a bit erratic, not ideal for economics applications unless you do experiments (warning: I have no idea about the usage of Bayes among experimentalists)
Further resources
- Bayesian Analysis for the Social Sciences, Simon Jackman
Shorthand: Jackman → Graduate level, many applications in political science → Good blend of theory (more formal) and applications → Not too easy; focus on R and Bugs/Jags (Bugs is a bit outdated by now → check out Stan)
Further resources
- Bayesian Data Analysis, Gelman et al.
Shorthand: BDA → "The" textbook on Bayesian Statistics; scholar.google: 19318 cites (5/5/2017) → very comprehensive but with an applied bent → covers applications in many fields
- For applications in economics and material geared towards econometrics → Gary Koop → leading Bayesian; many syllabi, papers, textbooks, and code (Matlab)
Introduction
Bayesian inference means using Bayes’s theorem to update beliefs regarding an object of interest after observing data. prior beliefs → data → posterior beliefs ???
Introduction
Bayesian inference means practical methods to make inferences from data we observe via probability models, to learn about quantities we are interested in. Three steps:
- 1. Set up a full probability model: a joint distribution for all observable and unobservable quantities, consistent with our knowledge about the problem as well as the data collection process
- 2. Condition on the observed data: the posterior
- 3. Evaluate model fit and the implications of the posterior distribution: does the model fit the data? Are the conclusions reasonable? How sensitive are they? → if necessary, go back to step 1. (See BDA chapter 1)
Introduction
- Bayesian statistics is often preferred for philosophical reasons; many disciplines are moving towards interval estimation rather than hypothesis testing, and the Bayesian probability interval is much closer to the (common-sense) interpretation of what a confidence interval should be.
Question:
- 1. In a classic (frequentist) setting, what is the interpretation of a confidence interval?
- 2. What is the interpretation of a p-value? Write this down carefully, in your own words.
Introduction
Ignoring philosophical discussions, Bayesian inference has other advantages: (i) flexible and general → deal with very complicated problems; (ii) take uncertainty seriously → uncertainty in derived quantities (iii) pooling / shrinkage.
Introduction
- To have a meaningful discussion on the merits, we need to
introduce quite a bit of jargon.
- Philosophical discussions and comparisons with other approaches will come later on.
Introduction
Assignment:
- 1. Read the Introduction of Jackman’s book as well as chapter 1
until p.8
Concepts
Question:
- 1. Write down the definition of conditional probability.
- 2. Write down an example highlighting / explaining this.
- 3. What is the law of total probability?
- 4. What is Bayes’ rule.
Introduction
Definition: Let A and B be events with P(B) > 0; then the conditional probability of A given B is: P(A|B) = P(A ∩ B) / P(B). (1) While we state it as a definition, it can be justified with a betting argument following de Finetti or derived from more elementary axioms.
Introduction
Definition (multiplication rule): P(A ∩ B) = P(A|B)P(B) (2)
Definition (law of total probability): P(B) = P(A ∩ B) + P(¬A ∩ B) = P(B|A)P(A) + P(B|¬A)P(¬A) (3)
Introduction
Bayes's Theorem: If A and B are events with P(B) > 0, then P(A|B) = P(B|A)P(A) / P(B) (4) Proof: . . .
Introduction
In Bayesian Statistics we use the above theorem as follows. Let A denote a hypothesis, and B the evidence, then we see that the theorem provides the probability of the hypothesis A after having seen the data B.
Introduction
A common illustration is a (drug) testing example. Do this yourself. Write down a reasonable rate of false negatives for a drug test and a reasonable rate of false positives. Choose a true rate of drug use in a subject pool (think of students taking some meds during exam period, or athletes). Now derive the posterior probability for a subject, randomly drawn from the subject pool for testing, who returns a positive test. What is the probability that the subject has used a substance? Does the result align with your a priori intuition?
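The calculation above is a direct application of Bayes' rule with the law of total probability in the denominator. A minimal sketch (the course leaves the computing language free; this one is in Python, and the rates below are illustrative assumptions, not values from the slides):

```python
def posterior_positive(prevalence, sensitivity, false_positive_rate):
    # Bayes' rule: P(use | +) = P(+ | use) P(use) / P(+),
    # with P(+) computed via the law of total probability.
    p_pos = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    return sensitivity * prevalence / p_pos

# Illustrative assumptions: 2% of subjects use, 5% false negatives (so 95%
# sensitivity), 10% false positives.
p = posterior_positive(0.02, 0.95, 0.10)
# Even after a positive test, the probability of use stays well below one half.
```

Because the base rate is low, most positives are false positives, which is exactly the a priori intuition the question probes.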
Introduction
In our research we often work in a continuous setting and we consider continuous parameters we want to learn about: a regression coefficient, a proportion, etc. Let θ denote the parameter we want to learn about and y = (y1, . . . , yn)′ the data at hand. Beliefs are represented as probability density functions, or pdf's. So the prior on θ is p(θ) and the posterior is p(θ|y). Bayes's Theorem:
p(θ|y) = p(y|θ)p(θ) / ∫ p(y|θ)p(θ) dθ (5)
Proof: . . .
Introduction
Often this is written as p(θ|y) ∝ p(y|θ)p(θ), (6) where the constant of proportionality (the denominator missing here), ensures that the posterior integrates to 1, as a proper density must. Furthermore, the term p(y|θ) is nothing more than the likelihood. Hence the mantra: the posterior is proportional to the prior times the likelihood.
Introduction
A practical example: coin tossing. What is the chance that a coin comes up heads? → two possibilities that are mutually exclusive → each datum is independent. Analogous problems: the proportion of babies that are girls, or the proportion of heart surgery patients who survive after 1 year.
Introduction
Step 1: specify the likelihood. The probability of a coin coming up heads is a function of an unknown parameter θ: p(y = 1|θ) = f(θ) (7) We assume the identity function, so we have p(y = 1|θ) = θ and also p(y = 0|θ) = 1 − θ. Combined we have the Bernoulli distribution: p(y|θ) = θ^y (1 − θ)^(1−y)
Introduction
When we flip a coin N times, we have N observations, so we get:
p({y1, . . . , yN}|θ) = ∏i p(yi|θ) = ∏i θ^(yi) (1 − θ)^(1−yi) (8)
Or, writing z for the number of times heads shows up in N flips, we get:
p(z, N|θ) = θ^z (1 − θ)^(N−z) (9)
Introduction
So we have a likelihood, now we need a prior. The key requirement for this prior is that it lives on the interval [0, 1]. Why? → Remember: the posterior can only exist over the support of the prior!
Introduction
Proposal: the Beta distribution: p(θ|a, b) = Beta(θ|a, b) = θ^(a−1) (1 − θ)^(b−1) / B(a, b) (10)
Introduction
The beta distribution depends on two parameters; B(a, b) is a normalizing constant that ensures the area under the density integrates to 1:
B(a, b) = ∫₀¹ θ^(a−1) (1 − θ)^(b−1) dθ
mean: θ̄ = a/(a + b)
Introduction
Think of a and b as prior observed data: a heads and b tails in a total of a + b flips. Question: What should this prior look like for large a and small b? What should it look like for a large a and an equally large b? What do you think it looks like for a = 1, b = 1? Let your instincts speak (no googling); draw a few sketches.
Beta(5,5); Beta(5,1); Beta(1,1)
[Figure: the three prior densities p(θ) plotted against θ on [0, 1].]
Introduction
So the goal is to move from the data and the priors to the posterior. Say we have observed 5 coin flips: heads, tails, tails, heads, heads. → What does the posterior look like?
Prior: Beta(1,1), Data: 1,0,0,1,1
[Figure: three panels against θ showing the Beta(1,1) prior, the likelihood, and the resulting posterior.]
Prior: Beta(10,10), Data: 1,0,0,1,1
[Figure: three panels against θ showing the Beta(10,10) prior, the likelihood, and the resulting posterior.]
Obtaining the posterior
- Analytically: conjugacy
- Computationally:
- 1. Grid approximation
- 2. Metropolis algorithm (MCMC, Gibbs, Hamiltonian Monte Carlo)
Conjugate Priors
Formally: If F is a class of sampling distributions p(y|θ), and P is a class of prior distributions for θ, then the class P is conjugate to F if: p(θ|y) ∈ P for all p(.|θ) ∈ F and p(.) ∈ P. Key interest in natural conjugate priors, which arise by taking P to be the set of all densities having the same functional form as the likelihood.
Conjugate Priors
→ Conjugacy means that the posterior looks like the prior. → Conjugacy means computational convenience. → Conjugacy means that the prior has some interpretation as prior data
Conjugate Priors
Let us go back to our example. Posterior ∝ prior × likelihood Prior: Beta(a, b) Likelihood: Bernoulli
Conjugate Priors
Posterior = prior × likelihood:
p(θ|z, N) = θ^(a−1) (1 − θ)^(b−1) θ^z (1 − θ)^(N−z) / [B(a, b) p(z, N)]
= θ^(z+a−1) (1 − θ)^(b−1+N−z) / B(z + a, N − z + b) (11)
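In code, the conjugate update in equation (11) reduces to bookkeeping: add the heads to a and the tails to b. A minimal Python sketch (language choice is free in this course) using the five flips from the slides:

```python
def beta_binomial_posterior(a, b, flips):
    # Beta(a, b) prior + Bernoulli flips -> Beta(a + z, b + N - z), per eq. (11).
    z, N = sum(flips), len(flips)
    return a + z, b + N - z

# The five flips from the slides: heads, tails, tails, heads, heads.
a_post, b_post = beta_binomial_posterior(1, 1, [1, 0, 0, 1, 1])
post_mean = a_post / (a_post + b_post)   # (z + a) / (a + b + N)
```

With the flat Beta(1,1) prior this gives a Beta(4, 3) posterior, matching the first figure above.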
The relationship between prior and posterior
Prior mean: a/(a + b)
Posterior mean: (z + a)/(z + a + b + N − z) = (z + a)/(a + b + N)
→ rearrange into a weighted mean of the prior mean and the data proportion z/N:
(z + a)/(a + b + N) = (z/N) × N/(N + a + b) + a/(a + b) × (a + b)/(N + a + b) (12)
The relationship between prior and posterior
So the posterior is always somewhere between prior mean and proportion in data. Question: Look at the weight, when does the prior become less important?
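The weighted-mean identity in equation (12) is easy to verify numerically. A small Python sketch; the parameter values in the comments are arbitrary illustrations:

```python
def posterior_mean(a, b, z, N):
    return (z + a) / (a + b + N)

def weighted_form(a, b, z, N):
    # Eq. (12): weight N/(N + a + b) on the data proportion z/N and
    # weight (a + b)/(N + a + b) on the prior mean a/(a + b).
    w = N / (N + a + b)
    return (z / N) * w + (a / (a + b)) * (1 - w)

# As N grows (with z/N fixed), the prior weight (a + b)/(N + a + b) shrinks,
# so the posterior mean is pulled towards the data proportion.
```

This is the answer to the question above in code form: the prior loses importance as N dominates a + b.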
Conjugacy
Even an applied Bayesian should have a good working knowledge of distributions. Some distributions that are common in a Bayesian context (but maybe less familiar) are the Beta, Gamma, Wishart and Cauchy distributions, to name a few. On the next slide I reproduce a table from a randomly chosen recent economics article: J. Fuhrer, JME, 2017.
Assignment: Visualize the priors and overlay the prior with the posterior. We do not know the shape of the posterior entirely, so you may pick the same distribution as the prior (but differently parametrized).
Conjugacy
We have a posterior. Now what?
Bayesian analysis yields an entire distribution: the posterior. What to report? → Any feature of the posterior you deem relevant . . . But, only mentioning a single moment → ignoring lots of information.
We have a posterior. Now what?
- We can specify a loss function and derive from it what to communicate from a posterior. A quadratic loss function leads to the mean, a symmetric linear loss function leads to the median, and asymmetric linear loss functions lead to quantiles other than the median.
Assignment:
- 1. Read the relevant pages in Jackman to obtain some intuition on this.
- 2. Think of a situation (economic model, regression parameter, . . . ) in which we could reasonably specify an asymmetric loss function, i.e. you'd prefer to err in one direction rather than the other. Describe such a situation.
We have a posterior. Now what?
- We can also communicate confidence intervals → credible regions → uncertainty intervals.
Assignment: Read this short post on the blog by Andrew Gelman:
http://andrewgelman.com/2016/11/26/reminder-instead-confidence-interval-lets-say-uncertainty-interval/
Also read the comments and the links to other posts within the comments. Read on Highest Probability Regions in Jackman.
Intermezzo: Bayes Factors
We should briefly mention Bayes factors. The Bayes factor is a comparison between two models M0 and M1.
Definition: Given data y and two competing models M0 and M1, the Bayes factor is the ratio of the posterior odds to the prior odds:
[p(M1|y) / p(M0|y)] / [p(M1) / p(M0)] (13)
and provides a summary of the evidence for M1 against M0 provided by the data.
Intermezzo: Bayes Factors
A Bayes factor below 1 indicates negative support for M1, a Bayes factor of 3-12 means strong support etc. Most Bayesian textbooks provide a further discussion on Bayes Factors (e.g. Jackman or BDA).
Obtaining a posterior
So far we have only discussed obtaining a posterior via conjugate priors. This already covers many possibilities; see for example the wiki overview. But it is still limiting: → maybe we have a reason to specify some goofy-looking prior. → We may be dealing with a complex model which cannot easily be squeezed into the conjugate mold. → Most models for observational data (or models with sufficient structure) cannot be dealt with via conjugacy.
Grid Approximation
Let us go back to the coin flipping example. Now we specify a prior defined only for a finite number of θ values (the grid). For example, we assume that θ can only take the values 0.25, 0.5, 0.75. → Our prior expresses probability mass at each value of θ.
Grid Approximation
Bayes' rule is then:
p(θ|y) = p(y|θ)p(θ) / Σθ p(y|θ)p(θ)
with the sum in the denominator over the finite number of discrete values of θ that we are considering, and p(θ) the probability mass at θ.
Grid Approximation
By specifying a grid we may approximate continuous prior densities. We use the discrete form of Bayes' rule, which may be much easier than the continuous case, which requires evaluating an integral.
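The discrete form above takes only a few lines of code. A minimal Python sketch with an assumed triangular prior and made-up flips (the assignment that follows asks you to build your own version with a goofier prior):

```python
def grid_posterior(grid, prior_mass, flips):
    # Discrete Bayes' rule: posterior mass = likelihood * prior / sum over grid.
    z, N = sum(flips), len(flips)
    unnorm = [t ** z * (1 - t) ** (N - z) * p for t, p in zip(grid, prior_mass)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

grid = [i / 100 for i in range(1, 100)]
raw = [min(t, 1 - t) for t in grid]   # a "goofy" triangular prior, peaked at 0.5
s = sum(raw)
prior = [p / s for p in raw]          # normalize to a probability mass function
post = grid_posterior(grid, prior, [1, 0, 0, 1, 1])
```

Plotting prior, likelihood and post against the grid gives the three panels the assignment asks for.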
Grid Approximation
Assignment: Apply grid approximation to the coin tossing problem. Create code where you generate 10 coin flips (randomly) from a fair coin. Specify a nonstandard prior (make something goofy, be creative) and apply grid approximation. Make graphs of the likelihood, the prior and the posterior.
Grid Approximation
→ Grid approximation serves mostly a pedagogical purpose. → If we have a 10-dimensional parameter space and we want to set up a grid of 1000 values per dimension, then we have 1000^10 values to evaluate → problematic
Metropolis algorithm
MCMC methods, or Markov Chain Monte Carlo methods, are what makes Bayesian statistics work in practice. There are many varieties, and a decent understanding requires some investment in the theory of Markov chains (e.g. Jackman chapter 4). Here the focus is on providing a low-tech, intuitive understanding; I follow Kruschke closely. Assignment: Look for an explanation of Metropolis-Hastings online or in a textbook, and try to grasp it.
Metropolis algorithm
Goal: understand the posterior. Solution: sample many points from it and just compute mean, standard deviation etc. So even if we lack analytical formulas, if we can randomly sample many points, we can calculate what we need.
Metropolis algorithm
Assignment: Consider Uniform[0, 3]. What is the mean? Generate 10 values from this distribution and calculate the mean. Do this also for 100, 1,000 and 10,000 values. Put these on a plot along with the analytically calculated mean. In class: "guess" what you would obtain and sketch it by hand. At home: write some code to do this.
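One possible shape for the at-home code, as a Python sketch (the analytic mean of Uniform[0, 3] is 1.5):

```python
import random

random.seed(1)
true_mean = 1.5                      # analytic mean of Uniform[0, 3]
estimates = {}
for n in [10, 100, 1_000, 10_000]:
    draws = [random.uniform(0, 3) for _ in range(n)]
    estimates[n] = sum(draws) / n    # Monte Carlo estimate of the mean
# The estimates wobble around 1.5 and the wobble shrinks as n grows.
```

Plot the estimates against n, with a horizontal line at 1.5, to see the convergence.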
Metropolis algorithm
The Monte Carlo Principle: Anything we want to learn about a random variable θ can be learned by sampling many times from f (θ), the density of θ. Simulation consistency: (informally:) As the number of draws of f (θ) increases, the simulation based estimate of some feature of θ, h(θ) improves. (See Jackman 3 for a formal discussion).
Metropolis algorithm: a politician
How can we get representative samples from a distribution? The story of a politician (Kruschke, chapter 7, see reader). A politician travels through a series of islands. Each day he decides whether to stay on the island, travel west or travel east. All islands lie on a line and he can move only one island left or right on a given day. He wants to spend time on the islands proportionally to their population: about twice as much time on island A as on island B if island A is twice as populous.
Metropolis algorithm: a politician
The politician has little information. He knows the population of the island where he resides, and he can ask the population of an adjacent island. His mechanism is as follows: flip a coin; if heads, propose moving east, if tails, propose moving west. If the proposed island (east or west) is more populous, definitely go. If it is less populous, go only probabilistically: the probability of moving is pmove = pop_proposed / pop_current. The politician draws from the Uniform[0, 1] distribution; if his draw is between 0 and pmove, he moves, otherwise he stays.
Metropolis algorithm: a politician
In the long run, the probability that the politician is on any one of the islands exactly matches the relative population of that island!
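This claim can be checked by simulation. A minimal Python sketch of the island-hopping rule; the seven relative populations are assumed for illustration (the slides use a graph not reproduced here):

```python
import random

random.seed(0)
pop = [1, 2, 3, 4, 5, 6, 7]            # assumed relative populations of 7 islands
pos = 3                                 # start on the middle island
visits = [0] * len(pop)
steps = 100_000
for _ in range(steps):
    proposal = pos + random.choice([-1, 1])   # flip a coin: propose west or east
    if 0 <= proposal < len(pop):
        if random.random() < min(pop[proposal] / pop[pos], 1.0):
            pos = proposal                    # move with probability pmove
    visits[pos] += 1                          # proposing off the chain means staying

freq = [v / steps for v in visits]
target = [p / sum(pop) for p in pop]    # long-run frequencies should match these
```

After many steps, freq tracks target: the politician's time on each island is proportional to its population.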
Metropolis algorithm: example of island hopping
Let us consider an example of island hopping: Assume we have 7 islands, with relative proportions as given in the graph below.
[Figure: the target distribution over the 7 islands and the politician's visit frequencies.]
Summary:
◮ We are at θ_current
◮ Propose left or right, each with 50% chance → the proposal distribution
◮ The decision to accept is based on the ratio of the target distribution at the proposed position to the target distribution at the current position.
◮ Ratio > 1: accept; ratio < 1: accept probabilistically.
So the probability to move is pmove = min(P(θ_proposed)/P(θ_current), 1)
Metropolis algorithm: example of island hopping
What do we need to do this?
- 1. Be able to draw values from the proposal distribution.
- 2. Be able to evaluate the target distribution at any proposed
position.
- 3. Be able to draw from the uniform distribution.
→ With these things we can generate samples from the target distribution.
Metropolis algorithm: some intuition on why it works
We show here the intuition on why it works (following Kruschke). Say we are at position θ, and write the probability of moving to θ + 1 as p(θ → θ + 1). Then p(θ → θ + 1) = 0.5 × min(P(θ + 1)/P(θ), 1). Vice versa: p(θ + 1 → θ) = 0.5 × min(P(θ)/P(θ + 1), 1)
Metropolis algorithm: some intuition on why it works
The ratio of these probabilities:
p(θ → θ + 1) / p(θ + 1 → θ) = [0.5 × min(P(θ + 1)/P(θ), 1)] / [0.5 × min(P(θ)/P(θ + 1), 1)]
= 1 / [P(θ)/P(θ + 1)] if P(θ + 1) > P(θ)
= [P(θ + 1)/P(θ)] / 1 if P(θ + 1) < P(θ)
In both cases this equals P(θ + 1)/P(θ). (14)
Metropolis algorithm: some intuition on why it works
The relative probability of transitions between adjacent positions, exactly matches the relative values of the target distribution. → intuition: if it is true for adjacent positions, we extrapolate to the whole range of positions. Read the relevant pages in Kruschke to see this claim further worked out in this simple context. (See reader)
Metropolis algorithm: essentials
Note the super stylized setup: discrete positions, 1 dimension, propose left or right. The algorithm works far more generally: continuous settings, multiple dimensions, more general proposal distributions.
Metropolis algorithm: general description (Jackman ch.5)
The Metropolis-Hastings algorithm defines a set of jumping rules to generate a Markov chain over Θ, the support of the posterior p(θ|y). At the start of iteration t we have values θ^(t−1), and we go to θ^(t) as follows:
- 1. Sample a vector of proposal values θ∗ from a proposal or jumping distribution Jt(θ∗|θ^(t−1))
- 2. Calculate the acceptance ratio: r = [p(θ∗|y) Jt(θ^(t−1)|θ∗)] / [p(θ^(t−1)|y) Jt(θ∗|θ^(t−1))]
- 3. α = min(r, 1)
- 4. Sample u ∼ U(0, 1)
- 5. If u ≤ α, then θ^(t) = θ∗
- 6. Otherwise θ^(t) = θ^(t−1)
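The steps above can be sketched for a continuous target. A minimal Python sketch using the coin-example posterior (Beta(1,1) prior, z = 3 heads in N = 5 flips, so the target is proportional to θ³(1 − θ)²); with a symmetric Gaussian proposal the Jt ratio cancels, so this reduces to plain Metropolis:

```python
import random

def target(theta):
    # Unnormalized posterior from the coin example: theta^3 (1 - theta)^2.
    if not 0.0 < theta < 1.0:
        return 0.0
    return theta ** 3 * (1 - theta) ** 2

random.seed(42)
theta = 0.5
chain = []
for t in range(50_000):
    proposal = theta + random.gauss(0, 0.2)   # symmetric jump: Jt ratio cancels
    r = target(proposal) / target(theta)      # acceptance ratio
    if random.random() < min(r, 1.0):
        theta = proposal                      # accept the proposal
    chain.append(theta)                       # otherwise keep the old value
burned = chain[5_000:]                        # discard a burn-in period
est_mean = sum(burned) / len(burned)          # true posterior mean is 4/7
```

The proposal scale (0.2 here) is an arbitrary choice; too small and the chain crawls, too large and almost everything is rejected, which is the "bad proposal" issue raised below.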
Metropolis algorithm: issues
- What should the proposal distribution look like? What would be a bad choice?
Metropolis algorithm: issues
- Burn-in, Efficiency, Convergence
- Metropolis algorithm is an example of Markov Chain Monte
Carlo.
Metropolis algorithm: some code
I have written some code to demonstrate the Metropolis algorithm. Metropolis.r → inspect the code.
Metropolis algorithm: some code
Assignment: Do exercise 7.1 of Kruschke (see handout). Ideally you try to build the algorithm from scratch. If that does not work out, start from the code provided in Kruschke or my code. → focus on understanding the how and why
Metropolis algorithm: further thoughts
The Metropolis-Hastings algorithm is remarkable, and while it works in theory, practice may prove very difficult. In a realistic multi-dimensional setting, merely specifying a reasonable proposal distribution may be very hard to do. → Many more sophisticated algorithms exist to tackle various problems: e.g. adaptive algorithms
Metropolis algorithm: further thoughts
Another approach relies on breaking the high dimensional posterior density p(θ|y) into a series of lower dimensional problems. This brings us to the Gibbs sampler.
What is Gibbs Sampling
The Gibbs sampler rests on the following idea: Joint probability densities can be fully characterized by their component conditional densities.
Gibbs: Two coins
As an illustrative example, consider the following from Kruschke (chapter 8): assume two coins which are flipped. We want to make inferences about the difference in the proportions of heads. (How different are the biases?)
Gibbs: Two coins: priors
Do not confuse the independence of the priors with the independence of the observations: → it could be reasonable to assume dependent priors, but independent observations! The likelihood here is:
p(y|θ1, θ2) = θ1^(z1) (1 − θ1)^(N1−z1) θ2^(z2) (1 − θ2)^(N2−z2)
with Nx and zx the number of trials and successes for coin x.
Gibbs: Two coins:
The problem is fairly simple and can be solved analytically, by grid approximation, or by Metropolis-Hastings.
Here: Gibbs sampling
Gibbs: requirement
To do Gibbs, we need to generate samples from the posterior distribution conditional on each individual parameter. → additional requirement. If we can draw from p(θ1|θ2, y) and from p(θ2|θ1, y), Gibbs provides us with p(θ1, θ2|y).
Gibbs:
Gibbs is again a random walk where the next step only depends on the current position. Gibbs cycles through each parameter component θi ∈ θ. → Say we select θi; then we generate a random value directly from p(θi|θ_{j≠i}, y). The new parameter component, along with all other components unchanged, constitutes the new position in the random walk.
Gibbs:
Gibbs can be seen as a special case of Metropolis-Hastings, with a proposal that is always accepted.
Assignment:
- 1. Try to write code for the two-coin problem. Simulate 10 flips for each coin (assume the coins are independent). Next, specify beta priors and show what the conditional beta distributions look like. Not mandatory: write code for the Gibbs sampler in this context and make graphs of the prior, likelihood and posterior.
Gibbs: general description (Jackman ch. 5)
Consider partitioning the parameter vector θ into d blocks (subvectors or scalars), θ = (θ1, . . . , θd). Index iterations by t:
- 1. sample θ1^(t+1) from g1(θ1 | θ2^(t), θ3^(t), . . . , θd^(t), y)
- 2. sample θ2^(t+1) from g2(θ2 | θ1^(t+1), θ3^(t), . . . , θd^(t), y)
- 3. . . .
- 4. sample θd^(t+1) from gd(θd | θ1^(t+1), θ2^(t+1), . . . , θ_(d−1)^(t+1), y)
- 5. set θ^(t+1) = (θ1^(t+1), θ2^(t+1), . . . , θd^(t+1))′
So after d steps we have a fully updated θ vector.
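The scheme above can be sketched for the two-coin problem. One caveat: with independent beta priors and independent flips, the full conditionals do not actually depend on the other coin, which is what keeps this sketch short. The data and prior parameters below are assumptions, not values from the slides:

```python
import random

random.seed(7)
# Assumed example data: 10 flips per coin, coin 1 shows 7 heads, coin 2 shows 3.
# Beta(2, 2) priors on each bias; both choices are illustrative.
a, b = 2, 2
z1, N1 = 7, 10
z2, N2 = 3, 10

draws = []
for t in range(20_000):
    # Full conditionals: p(theta1 | theta2, y) = Beta(a + z1, b + N1 - z1),
    # and likewise for theta2, so each Gibbs step is a direct beta draw.
    theta1 = random.betavariate(a + z1, b + N1 - z1)
    theta2 = random.betavariate(a + z2, b + N2 - z2)
    draws.append(theta1 - theta2)

diff_mean = sum(draws) / len(draws)   # posterior mean of theta1 - theta2
```

A histogram of draws is the posterior of the bias difference the two-coin question asks about.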
Gibbs:
Gibbs is important for historical reasons. As a standalone algorithm you'll rarely see it in practice, but it is an important building block and a foundational algorithm. See for example Metropolis-within-Gibbs, data augmentation, the slice sampler, etc. Assignment: pick any of the questions below.
- 1. Use an example to explain and demonstrate Metropolis within
Gibbs
- 2. Explain the intuition behind data augmentation.