Probabilistic & Unsupervised Learning: Introduction and Foundations

Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept of Computer Science, University College London
Term 1, Autumn 2018
What do we mean by learning?

[Image: painting by Jan Steen]

Not just remembering:

◮ Systematising (noisy) observations: discovering structure.
◮ Predicting new outcomes: generalising.
◮ Choosing actions wisely.
Three learning problems

◮ Systematising (noisy) observations: discovering structure.
  ◮ Unsupervised learning. Observe (sensory) input alone:
      x1, x2, x3, x4, . . .
    Describe the pattern of the data [p(x)]; identify and extract underlying structural variables [xi → yi].
◮ Predicting new outcomes: generalising.
  ◮ Supervised learning. Observe input/output pairs (“teaching”):
      (x1, y1), (x2, y2), (x3, y3), (x4, y4), . . .
    Predict the correct y∗ for a new test input x∗.
◮ Choosing actions wisely.
  ◮ Reinforcement learning. Rewards or payoffs (and possibly also inputs) depend on actions:
      x1 : a1 → r1, x2 : a2 → r2, x3 : a3 → r3, . . .
    Find a policy for action choice that maximises payoff.
Unsupervised Learning

Find underlying structure:

◮ separate generating processes (clusters)
◮ reduced-dimensionality representations
◮ good explanations (causes) of the data
◮ modelling the data density

[Figure: filters Φ and basis functions W relating causes to an image patch I, drawn from an image ensemble]

Uses of unsupervised learning:

◮ structure discovery, science
◮ data compression
◮ outlier detection
◮ input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards)
◮ a theory of biological learning and perception
Supervised learning

Two main examples:

Classification: discrete (class label) outputs. [Figure: labelled points x and o separated in input space]

Regression: continuous-valued outputs. [Figure: scatter of (x, y) data]

But also: ranks, relationships, trees, etc.

Variants may relate to unsupervised learning:

◮ semi-supervised learning (most x unlabelled; assumes the structure of {x} and the relationship x → y are linked).
◮ multitask (transfer) learning (predict different y in different contexts; assumes links between the structures of the relationships).
A probabilistic approach

Data are generated by random and/or unknown processes. Our approach to learning starts with a probabilistic model of data production:

  P(data|parameters):   P(x|θ) or P(y|x, θ)

This is the generative model or likelihood.

◮ The probabilistic model can be used to
  ◮ make inferences about missing inputs
  ◮ generate predictions/fantasies/imagery
  ◮ make predictions or decisions which minimise expected loss
  ◮ communicate the data in an efficient way
◮ Probabilistic modelling is often equivalent to other views of learning:
  ◮ information theoretic: finding compact representations of the data
  ◮ physical analogies: minimising the (free) energy of a corresponding statistical mechanical system
  ◮ structural risk: compensating for overconfidence in powerful models

The calculus of probabilities naturally handles randomness. It is also the right way to reason about unknown values.
Representing beliefs

Let b(x) represent our strength of belief in (plausibility of) proposition x:

  0 ≤ b(x) ≤ 1
  b(x) = 0   x is definitely not true
  b(x) = 1   x is definitely true
  b(x|y)     strength of belief that x is true given that we know y is true

Cox Axioms (Desiderata):

◮ Let b(x) be real. As b(x) increases, b(¬x) decreases, so the function mapping b(x) ↔ b(¬x) is monotonically decreasing and self-inverse.
◮ b(x ∧ y) depends only on b(y) and b(x|y).
◮ Consistency:
  ◮ If a conclusion can be reasoned in more than one way, then every way should lead to the same answer.
  ◮ Beliefs always take into account all relevant evidence.
  ◮ Equivalent states of knowledge are represented by equivalent plausibility assignments.

Consequence: belief functions (e.g. b(x), b(x|y), b(x, y)) must be isomorphic to probabilities, satisfying all the usual laws, including Bayes’ rule. (See Jaynes, Probability Theory: The Logic of Science.)
Basic rules of probability

◮ Probabilities are non-negative: P(x) ≥ 0 ∀x.
◮ Probabilities normalise: ∑_{x∈X} P(x) = 1 for distributions over a discrete variable x, and ∫_{−∞}^{+∞} p(x) dx = 1 for probability densities over a continuous variable x.
◮ The joint probability of x and y is P(x, y).
◮ The marginal probability of x is P(x) = ∑_y P(x, y), assuming y is discrete.
◮ The conditional probability of x given y is P(x|y) = P(x, y)/P(y).
◮ Bayes’ Rule:

  P(x, y) = P(x)P(y|x) = P(y)P(x|y)  ⇒  P(y|x) = P(x|y)P(y) / P(x)

Warning: I will not be obsessively careful in my use of p and P for probability density and probability distribution. It should be obvious from context.
The Dutch book theorem

Assume you are willing to accept bets with odds proportional to the strength of your beliefs. That is, b(x) = 0.9 implies that you will accept a bet on x at 1 : 9 (win ≥ $1 if x is true; lose $9 if x is false).

Then, unless your beliefs satisfy the rules of probability theory, including Bayes’ rule, there exists a set of simultaneous bets (called a “Dutch Book”) which you are willing to accept, and for which you are guaranteed to lose money, no matter what the outcome.

E.g. suppose A ∩ B = ∅, and

  b(A) = 0.3,  b(B) = 0.2,  b(A ∪ B) = 0.6

⇒ you accept the bets:  ¬A at 3 : 7,  ¬B at 2 : 8,  A ∪ B at 4 : 6.

But then:

  ¬A ∩ B  ⇒ win +3 − 8 + 4 = −1
  A ∩ ¬B  ⇒ win −7 + 2 + 4 = −1
  ¬A ∩ ¬B ⇒ win +3 + 2 − 6 = −1

The only way to guard against Dutch Books is to ensure that your beliefs are coherent: i.e. that they satisfy the rules of probability.
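The guaranteed loss can be verified by enumeration. A minimal sketch (not from the slides), encoding the three bets above with the stakes implied by b(¬A) = 0.7, b(¬B) = 0.8, and b(A ∪ B) = 0.6:

```python
# Enumerate the payoff of the three accepted bets for every possible outcome.
def payoff(a, b):
    """Total winnings given the truth values of A and B (A and B disjoint)."""
    total = 0
    total += +3 if not a else -7        # bet on ¬A at 3 : 7
    total += +2 if not b else -8        # bet on ¬B at 2 : 8
    total += +4 if (a or b) else -6     # bet on A ∪ B at 4 : 6
    return total

# A ∩ B = ∅, so only three outcomes are possible; each loses $1.
outcomes = [(False, True), (True, False), (False, False)]
print([payoff(a, b) for a, b in outcomes])   # → [-1, -1, -1]
```

Whatever happens, the bettor loses exactly $1: the book is “Dutch”.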
Bayesian learning

Apply the basic rules of probability to learning from data.

◮ Problem specification:
  Data: D = {x1, . . . , xn}
  Models: M1, M2, etc.
  Parameters: θi (per model)
  Prior probability of models: P(Mi)
  Prior probabilities of model parameters: P(θi|Mi)
  Model of data given parameters (likelihood model): P(x|θi, Mi)

◮ Data probability (likelihood):

  P(D|θi, Mi) = ∏_{j=1}^{n} P(xj|θi, Mi) ≡ L(θi)

  (provided the data are independently and identically distributed — iid).

◮ Parameter learning (posterior):

  P(θi|D, Mi) = P(D|θi, Mi)P(θi|Mi) / P(D|Mi);   P(D|Mi) = ∫ dθi P(D|θi, Mi)P(θi|Mi)

  P(D|Mi) is called the marginal likelihood or evidence for Mi. It is proportional to the posterior probability of model Mi being the one that generated the data.

◮ Model selection:

  P(Mi|D) = P(D|Mi)P(Mi) / P(D)
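For a one-dimensional parameter, all of these quantities can be approximated on a grid. A minimal sketch (not from the slides), computing the posterior and the evidence for a Bernoulli likelihood with a uniform prior; the grid resolution and the example data are arbitrary choices:

```python
# Grid-based Bayesian parameter learning: posterior ∝ likelihood × prior,
# with the evidence P(D) = ∫ dθ P(D|θ) p(θ) computed as the normaliser.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)     # grid over the parameter θ
prior = np.ones_like(theta)                # uniform prior density on [0, 1]
data = [1, 1, 0, 1, 0, 0]                  # observed coin flips (1 = heads)

# iid likelihood: L(θ) = ∏_j P(x_j | θ)
lik = np.prod([theta if x else (1 - theta) for x in data], axis=0)

dtheta = theta[1] - theta[0]
evidence = np.sum(lik * prior) * dtheta    # P(D) ≈ ∫ L(θ) p(θ) dθ
posterior = lik * prior / evidence         # normalised posterior density

assert np.isclose(np.sum(posterior) * dtheta, 1.0)
```

For these data (3 heads, 3 tails) the evidence is the Beta function B(4, 4) ≈ 0.0071, which the grid sum reproduces.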
Bayesian learning: A coin toss example

Coin toss: one parameter q — the probability of obtaining heads. So our space of models is the set of distributions over q ∈ [0, 1].

Learner A believes model MA: all values of q are equally plausible.
Learner B believes model MB: it is more plausible that the coin is “fair” (q ≈ 0.5) than “biased”.

[Figure: prior densities P(q); A is flat (α1 = α2 = 1.0), B is peaked at q = 0.5 (α1 = α2 = 4.0)]

Both prior beliefs can be described by the Beta distribution:

  p(q|α1, α2) = q^(α1−1)(1 − q)^(α2−1) / B(α1, α2) = Beta(q|α1, α2)
Bayesian learning: The coin toss (cont)

Now we observe a toss. Two possible outcomes:

  p(H|q) = q      p(T|q) = 1 − q

Suppose our single coin toss comes out heads. The probability of the observed data (likelihood) is: p(H|q) = q.

Using Bayes’ rule, we multiply the prior p(q) by the likelihood and renormalise to get the posterior probability:

  p(q|H) = p(q)p(H|q) / p(H)
         ∝ q Beta(q|α1, α2)
         ∝ q · q^(α1−1)(1 − q)^(α2−1)
         = Beta(q|α1 + 1, α2)
Bayesian learning: The coin toss (cont)

After the single heads observation:

              A              B
  Prior:      Beta(q|1, 1)   Beta(q|4, 4)
  Posterior:  Beta(q|2, 1)   Beta(q|5, 4)

[Figure: the four densities P(q) plotted over q ∈ [0, 1]]
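The conjugate update can be verified numerically. A minimal sketch (not from the slides), checking that multiplying learner B’s Beta(4, 4) prior by the single-heads likelihood q and renormalising gives exactly Beta(5, 4):

```python
# Numerical check of the Beta posterior update after observing one head.
import numpy as np
from scipy.stats import beta

q = np.linspace(1e-6, 1 - 1e-6, 100_001)
a1, a2 = 4.0, 4.0                          # learner B's prior parameters

unnorm = q * beta.pdf(q, a1, a2)           # likelihood × prior (unnormalised)
dq = q[1] - q[0]
posterior = unnorm / (unnorm.sum() * dq)   # renormalise on the grid

assert np.allclose(posterior, beta.pdf(q, a1 + 1, a2), atol=1e-3)
```

The same check with a1 = a2 = 1 reproduces learner A’s Beta(2, 1) posterior.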
Bayesian learning: The coin toss (cont)

What about multiple tosses? Suppose we observe D = { H H T H T T }:

  p({ H H T H T T }|q) = q q (1 − q) q (1 − q)(1 − q) = q^3(1 − q)^3

This is still straightforward:

  p(q|D) = p(q)p(D|q) / p(D)
         ∝ q^3(1 − q)^3 Beta(q|α1, α2)
         ∝ Beta(q|α1 + 3, α2 + 3)

[Figure: the resulting posterior densities P(q) for learners A and B]
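Because the update only depends on the counts, it reduces to bookkeeping. A minimal sketch (not from the slides): starting from Beta(α1, α2), observing h heads and t tails in any order gives the posterior Beta(α1 + h, α2 + t):

```python
# The Beta-Bernoulli update reduces to counting heads and tails.
def beta_posterior(alpha1, alpha2, tosses):
    """Return posterior Beta parameters after a sequence of 'H'/'T' tosses."""
    h = tosses.count('H')
    t = tosses.count('T')
    return alpha1 + h, alpha2 + t

# Learner A (uniform prior) after D = HHTHTT:
print(beta_posterior(1, 1, "HHTHTT"))   # → (4, 4)
# Learner B after the same data:
print(beta_posterior(4, 4, "HHTHTT"))   # → (7, 7)
```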
Conjugate priors

Updating the prior to form the posterior was particularly easy in these examples. This is because we used a conjugate prior for an exponential family likelihood.

Exponential family distributions take the form:

  P(x|θ) = g(θ) f(x) e^{φ(θ)ᵀ T(x)}

with g(θ) the normalising constant. Given n iid observations,

  P({xi}|θ) = ∏_i P(xi|θ) = g(θ)^n e^{φ(θ)ᵀ ∑_i T(xi)} ∏_i f(xi)

Thus, if the prior takes the conjugate form

  P(θ) = F(τ, ν) g(θ)^ν e^{φ(θ)ᵀ τ}

with F(τ, ν) the normaliser, then the posterior is

  P(θ|{xi}) ∝ P({xi}|θ)P(θ) ∝ g(θ)^{ν+n} e^{φ(θ)ᵀ (τ + ∑_i T(xi))}

with the normaliser given by F(τ + ∑_i T(xi), ν + n).
Conjugate priors

The posterior given an exponential family likelihood and conjugate prior is:

  P(θ|{xi}) = F(τ + ∑_i T(xi), ν + n) g(θ)^{ν+n} exp{ φ(θ)ᵀ (τ + ∑_i T(xi)) }

Here,

  φ(θ)        is the vector of natural parameters
  ∑_i T(xi)   is the vector of sufficient statistics
  τ           are pseudo-observations which define the prior
  ν           is the scale of the prior (need not be an integer)

As new data come in, each observation increments the sufficient statistics vector and the scale to define the posterior.

The prior appears to be based on “pseudo-observations”, but:

  1. This is different to applying Bayes’ rule starting from no prior! Sometimes we can take a uniform prior (say on [0, 1] for q), but for unbounded θ there may be no equivalent.
  2. A valid conjugate prior might have non-integral ν or impossible τ, with no likelihood equivalent.
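The incremental update described above is pure bookkeeping on (τ, ν). A minimal sketch (not from the slides); the function name and the Bernoulli example are illustrative, not from the course:

```python
# Conjugate updating in natural-parameter form: the posterior's (τ', ν') are
# the prior's (τ, ν) plus the observed sufficient statistics and the count n.
import numpy as np

def conjugate_update(tau, nu, suff_stats):
    """Posterior (τ', ν') after observations with sufficient stats T(x_i)."""
    T = np.atleast_2d(suff_stats)           # one row per observation
    return tau + T.sum(axis=0), nu + T.shape[0]

# Bernoulli example: T(x) = x, so the sufficient statistic counts heads.
tau, nu = np.array([0.0]), 0                # prior pseudo-observations
tau, nu = conjugate_update(tau, nu, [[1], [1], [0], [1], [0], [0]])
print(tau, nu)                              # → [3.] 6
```

The same two lines of arithmetic serve any exponential-family likelihood once T(x) is specified.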
Conjugacy in the coin flip

Distributions are not always written in their natural exponential form. The Bernoulli distribution (a single coin flip) with parameter q and observation x ∈ {0, 1} can be written:

  P(x|q) = q^x (1 − q)^(1−x)
         = e^{x log q + (1−x) log(1−q)}
         = e^{log(1−q) + x log(q/(1−q))}
         = (1 − q) e^{log(q/(1−q)) x}

So the natural parameter is the log odds log(q/(1 − q)), and the sufficient statistic (for multiple tosses) is the number of heads.

The conjugate prior is

  P(q) = F(τ, ν) (1 − q)^ν e^{log(q/(1−q)) τ}
       = F(τ, ν) (1 − q)^ν e^{τ log q − τ log(1−q)}
       = F(τ, ν) (1 − q)^(ν−τ) q^τ

which has the form of the Beta distribution ⇒ F(τ, ν) = 1/B(τ + 1, ν − τ + 1).

In general, then, the posterior will be P(q|{xi}) = Beta(α1, α2), with

  α1 = 1 + τ + ∑_i xi
  α2 = 1 + (ν + n) − (τ + ∑_i xi)

If we observe a head, we add 1 to the sufficient statistic ∑_i xi, and also 1 to the count n; this increments α1. If we observe a tail, we add 1 to n but not to ∑_i xi, incrementing α2.
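The mapping from the natural parameterisation (τ, ν) plus data to standard Beta parameters is a direct transcription of the formulas above. A minimal sketch (not from the slides); the helper name is illustrative:

```python
# Map Bernoulli conjugate-prior parameters (τ, ν) and observed flips to the
# standard Beta parameters: α1 = 1 + τ + Σx_i, α2 = 1 + (ν + n) − (τ + Σx_i).
def beta_params(tau, nu, flips):
    """flips: list of 0/1 outcomes. Returns (α1, α2) of the posterior."""
    n, heads = len(flips), sum(flips)
    a1 = 1 + tau + heads
    a2 = 1 + (nu + n) - (tau + heads)
    return a1, a2

# τ = ν = 0 corresponds to the uniform prior Beta(1, 1); after HHTHTT:
print(beta_params(0, 0, [1, 1, 0, 1, 0, 0]))   # → (4, 4)
```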
Bayesian coins – comparing models

We have seen how to update posteriors within each model. To study the choice of model, consider two more extreme models: "fair" and "bent". A priori, we may think that "fair" is more probable, e.g.: p(fair) = 0.8, p(bent) = 0.2. For the bent coin, we assume all parameter values are equally likely, whilst the fair coin has a fixed probability:

[Figure: p(q|bent) is uniform over the parameter q ∈ [0, 1]; p(q|fair) is a point mass at q = 0.5]

We make 10 tosses, and get: D = (T H T H T T T T T T).
Bayesian coins – comparing models

Which model should we prefer a posteriori (i.e. after seeing the data)? The evidence for the fair model is:

P(D|fair) = (1/2)¹⁰ ≈ 0.001

and for the bent model is:

P(D|bent) = ∫ dq P(D|q, bent) p(q|bent) = ∫ dq q²(1 − q)⁸ = B(3, 9) ≈ 0.002

Thus, the posterior for the models, by Bayes' rule: P(fair|D) ∝ 0.0008, P(bent|D) ∝ 0.0004, i.e. a two-thirds probability that the coin is fair.

How do we make predictions? Could choose the fair model (model selection). Or could weight the predictions from each model by their probability (model averaging). The probability of H at the next toss is:

P(H|D) = P(H|D, fair)P(fair|D) + P(H|D, bent)P(bent|D) = (2/3 × 1/2) + (1/3 × 3/12) = 5/12.
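These quantities can be reproduced numerically. A short sketch; the exact values round to the 2/3 and 5/12 quoted above:

```python
from math import gamma

def beta_fn(a, b):
    """Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

# D = (T H T H T T T T T T): 2 heads, 8 tails
p_D_fair = 0.5 ** 10          # ~ 0.001
p_D_bent = beta_fn(3, 9)      # integral of q^2 (1-q)^8 dq ~ 0.002

# Posterior over models, with prior p(fair) = 0.8, p(bent) = 0.2
z = 0.8 * p_D_fair + 0.2 * p_D_bent
p_fair = 0.8 * p_D_fair / z   # ~ 2/3
p_bent = 1 - p_fair

# Model averaging: the bent coin predicts its posterior mean, 3/12
p_head = p_fair * 0.5 + p_bent * (3 / 12)   # ~ 5/12
```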
Learning parameters

The Bayesian probabilistic prescription tells us how to reason about models and their parameters. But it is often impractical for realistic models (outside the exponential family).

◮ Point estimates of parameters or other predictions
  ◮ Compute the posterior and find the single parameter that minimises the expected loss:
    θ_BP = argmin_θ̂ ∫ dθ L(θ̂, θ) P(θ|D)
    ◮ The posterior mean ⟨θ⟩_{P(θ|D)} minimises the squared loss.
  ◮ Maximum a Posteriori (MAP) estimate: assume a prior over the model parameters P(θ), and compute the parameters that are most probable under the posterior:
    θ_MAP = argmax_θ P(θ|D) = argmax_θ P(θ)P(D|θ).
    ◮ Equivalent to minimising the 0/1 loss.
  ◮ Maximum Likelihood (ML) learning: no prior over the parameters. Compute the parameter value that maximises the likelihood function alone:
    θ_ML = argmax_θ P(D|θ).
    ◮ Parameterisation-independent.
◮ Approximations may allow us to recover samples from the posterior, or to find a distribution which is close in some sense.
◮ Choosing between these and other alternatives may be a matter of definition, of goals (loss function), or of practicality.
◮ For the next few weeks we will look at ML and MAP learning in more complex models. We will then return to the fully Bayesian formulation for the few interesting cases where it is tractable. Approximations will be addressed in the second half of the course.
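For the coin example these estimators are all closed-form. A small sketch contrasting them on the Beta(3, 9) posterior (2 heads in 10 tosses, uniform prior):

```python
# Point estimates for a Bernoulli parameter with a Beta posterior.
n_heads, n = 2, 10
a1, a2 = 1 + n_heads, 1 + (n - n_heads)    # posterior Beta(3, 9)

theta_ml = n_heads / n                      # maximises P(D|theta)
theta_map = (a1 - 1) / (a1 + a2 - 2)        # posterior mode; equals ML under a uniform prior
theta_mean = a1 / (a1 + a2)                 # posterior mean: minimises squared loss
print(theta_ml, theta_map, theta_mean)      # 0.2 0.2 0.25
```

Note that with a uniform prior the MAP and ML estimates coincide, while the posterior mean is pulled towards 1/2.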
Modelling associations between variables

[Figure: scatter plot of two data features, x_{i1} against x_{i2}, on axes from −1 to 1]

◮ Data set D = {x₁, . . . , x_N}
◮ with each data point a vector of D features: xᵢ = [x_{i1} . . . x_{iD}]
◮ Assume data are i.i.d. (independent and identically distributed).

A simple form of unsupervised (structure) learning: model the mean of the data and the correlations between the D features in the data. We can use a multivariate Gaussian model:

p(x|µ, Σ) = N(µ, Σ) = |2πΣ|^{−1/2} exp( −(1/2)(x − µ)ᵀΣ⁻¹(x − µ) )
ML Learning for a Gaussian

Data set D = {x₁, . . . , x_N}, likelihood: p(D|µ, Σ) = ∏_{n=1}^N p(xₙ|µ, Σ).

Goal: find µ and Σ that maximise the likelihood ⇔ maximise the log likelihood:

ℓ = log ∏_{n=1}^N p(xₙ|µ, Σ) = ∑ₙ log p(xₙ|µ, Σ)
  = −(N/2) log |2πΣ| − (1/2) ∑ₙ (xₙ − µ)ᵀΣ⁻¹(xₙ − µ)

Note: equivalently, minimise −ℓ, which is quadratic in µ.

Procedure: take derivatives and set to zero:

∂ℓ/∂µ = 0 ⇒ µ̂ = (1/N) ∑ₙ xₙ    (sample mean)
∂ℓ/∂Σ = 0 ⇒ Σ̂ = (1/N) ∑ₙ (xₙ − µ̂)(xₙ − µ̂)ᵀ    (sample covariance)
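A quick numerical check of these two estimators, as a NumPy sketch on toy data; note the ML covariance is normalised by N, not N − 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))    # N = 500 samples, D = 3 features

# ML estimates: sample mean and (1/N-normalised) sample covariance
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)

# np.cov with bias=True also divides by N rather than N - 1
assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
```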
Refresher – matrix derivatives of scalar forms

We will use the following facts:

xᵀAy = yᵀAᵀx = Tr[xᵀAy]    (scalars equal their own transpose and trace)

Tr[A] = Tr[Aᵀ]

Tr[ABC] = Tr[CAB] = Tr[BCA]

∂/∂A_{ij} Tr[AᵀB] = ∂/∂A_{ij} ∑ₙ [AᵀB]ₙₙ = ∂/∂A_{ij} ∑_{mn} A_{mn}B_{mn} = B_{ij}
  ⇒ ∂/∂A Tr[AᵀB] = B

∂/∂A Tr[AᵀBAC] = ∂/∂A Tr[F₁(A)ᵀ B F₂(A) C]    with F₁ and F₂ both identity maps
  = ∂/∂F₁ Tr[F₁ᵀBF₂C] ∂F₁/∂A + ∂/∂F₂ Tr[F₂ᵀBᵀF₁Cᵀ] ∂F₂/∂A
  = BF₂C + BᵀF₁Cᵀ = BAC + BᵀACᵀ

∂/∂A_{ij} log |A| = (1/|A|) ∂/∂A_{ij} |A| = (1/|A|) ∂/∂A_{ij} ∑ₖ (−1)^{i+k} A_{ik} |[A]_{ik}| = (1/|A|) (−1)^{i+j} |[A]_{ij}|
  ⇒ ∂/∂A log |A| = (A⁻¹)ᵀ
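These identities are easy to sanity-check with central finite differences, as in this sketch on random matrices (A is made positive definite so |A| > 0):

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.normal(size=(4, 4))
A = R @ R.T + np.eye(4)        # positive definite, so |A| > 0
B = rng.normal(size=(4, 4))
C = rng.normal(size=(4, 4))

def num_grad(f, M, eps=1e-5):
    """Central finite-difference gradient of scalar f w.r.t. matrix M."""
    G = np.zeros_like(M)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            E = np.zeros_like(M)
            E[i, j] = eps
            G[i, j] = (f(M + E) - f(M - E)) / (2 * eps)
    return G

# d/dA Tr[A^T B] = B
assert np.allclose(num_grad(lambda M: np.trace(M.T @ B), A), B, atol=1e-5)

# d/dA Tr[A^T B A C] = BAC + B^T A C^T
g = num_grad(lambda M: np.trace(M.T @ B @ M @ C), A)
assert np.allclose(g, B @ A @ C + B.T @ A @ C.T, atol=1e-4)

# d/dA log |A| = (A^{-1})^T
g = num_grad(lambda M: np.log(np.linalg.det(M)), A)
assert np.allclose(g, np.linalg.inv(A).T, atol=1e-5)
```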
Gaussian Derivatives

∂(−ℓ)/∂µ = ∂/∂µ [ (N/2) log |2πΣ| + (1/2) ∑ₙ (xₙ − µ)ᵀΣ⁻¹(xₙ − µ) ]
  = (1/2) ∑ₙ ∂/∂µ [ (xₙ − µ)ᵀΣ⁻¹(xₙ − µ) ]
  = (1/2) ∑ₙ ∂/∂µ [ xₙᵀΣ⁻¹xₙ + µᵀΣ⁻¹µ − 2µᵀΣ⁻¹xₙ ]
  = (1/2) ∑ₙ [ ∂/∂µ (µᵀΣ⁻¹µ) − 2 ∂/∂µ (µᵀΣ⁻¹xₙ) ]
  = (1/2) ∑ₙ [ 2Σ⁻¹µ − 2Σ⁻¹xₙ ]
  = NΣ⁻¹µ − Σ⁻¹ ∑ₙ xₙ = 0  ⇒  µ̂ = (1/N) ∑ₙ xₙ
Gaussian Derivatives

∂(−ℓ)/∂Σ⁻¹ = ∂/∂Σ⁻¹ [ (N/2) log |2πΣ| + (1/2) ∑ₙ (xₙ − µ)ᵀΣ⁻¹(xₙ − µ) ]
  = ∂/∂Σ⁻¹ [ (N/2) log |2πI| ] − ∂/∂Σ⁻¹ [ (N/2) log |Σ⁻¹| ] + (1/2) ∑ₙ ∂/∂Σ⁻¹ [ (xₙ − µ)ᵀΣ⁻¹(xₙ − µ) ]
  = −(N/2) Σᵀ + (1/2) ∑ₙ (xₙ − µ)(xₙ − µ)ᵀ = 0  ⇒  Σ̂ = (1/N) ∑ₙ (xₙ − µ)(xₙ − µ)ᵀ
Equivalences

[Figure: scatter plot of two data features, x_{i1} against x_{i2}]

Modelling correlations
⇔ maximising likelihood of a Gaussian model
⇔ minimising a squared error cost function
⇔ minimising data coding cost in bits (assuming Gaussian distributed data)
Multivariate Linear Regression

The relationship between variables can also be modelled as a conditional distribution.

[Figure: scatter plot of two data features, x_{i1} against x_{i2}]

◮ data D = {(x₁, y₁), . . . , (x_N, y_N)}
◮ each xᵢ (yᵢ) is a vector of D_x (D_y) features,
◮ yᵢ is conditionally independent of all else, given xᵢ.

A simple form of supervised (predictive) learning: model y as a linear function of x, with Gaussian noise:

p(y|x, W, Σ_y) = |2πΣ_y|^{−1/2} exp( −(1/2)(y − Wx)ᵀΣ_y⁻¹(y − Wx) )
Multivariate Linear Regression – ML estimate

ML estimates are obtained by maximising the (conditional) likelihood, as before:

ℓ = ∑ᵢ log p(yᵢ|xᵢ, W, Σ_y) = −(N/2) log |2πΣ_y| − (1/2) ∑ᵢ (yᵢ − Wxᵢ)ᵀΣ_y⁻¹(yᵢ − Wxᵢ)

∂(−ℓ)/∂W = ∂/∂W [ (N/2) log |2πΣ_y| + (1/2) ∑ᵢ (yᵢ − Wxᵢ)ᵀΣ_y⁻¹(yᵢ − Wxᵢ) ]
  = (1/2) ∑ᵢ ∂/∂W [ (yᵢ − Wxᵢ)ᵀΣ_y⁻¹(yᵢ − Wxᵢ) ]
  = (1/2) ∑ᵢ ∂/∂W [ yᵢᵀΣ_y⁻¹yᵢ + xᵢᵀWᵀΣ_y⁻¹Wxᵢ − 2xᵢᵀWᵀΣ_y⁻¹yᵢ ]
  = (1/2) ∑ᵢ [ ∂/∂W Tr(WᵀΣ_y⁻¹Wxᵢxᵢᵀ) − 2 ∂/∂W Tr(WᵀΣ_y⁻¹yᵢxᵢᵀ) ]
  = (1/2) ∑ᵢ [ 2Σ_y⁻¹Wxᵢxᵢᵀ − 2Σ_y⁻¹yᵢxᵢᵀ ] = 0
  ⇒  Ŵ = ( ∑ᵢ yᵢxᵢᵀ )( ∑ᵢ xᵢxᵢᵀ )⁻¹
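This is just the ordinary least-squares solution, as a quick NumPy sketch on hypothetical toy data confirms:

```python
import numpy as np

rng = np.random.default_rng(2)
N, Dx, Dy = 200, 3, 2
X = rng.normal(size=(N, Dx))                  # inputs x_i as rows
W_true = rng.normal(size=(Dy, Dx))
Y = X @ W_true.T + 0.1 * rng.normal(size=(N, Dy))

# ML estimate: W = (sum_i y_i x_i^T)(sum_i x_i x_i^T)^{-1}
W_ml = (Y.T @ X) @ np.linalg.inv(X.T @ X)

# Agrees with ordinary least squares
W_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0].T
assert np.allclose(W_ml, W_lstsq)
```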
Multivariate Linear Regression – Posterior

Let yᵢ be scalar (so that W is a row vector) and write w for the column vector of weights. A conjugate prior for w is P(w|A) = N(0, A⁻¹). Then the log posterior on w is

log P(w|D, A, σ_y) = log P(D|w, A, σ_y) + log P(w|A, σ_y) − log P(D|A, σ_y)
  = −(1/2) wᵀAw − (1/2) ∑ᵢ (yᵢ − wᵀxᵢ)² σ_y⁻² + const
  = −(1/2) wᵀ( A + σ_y⁻² ∑ᵢ xᵢxᵢᵀ )w + wᵀ ∑ᵢ (yᵢxᵢ) σ_y⁻² + const        [write Σ_w⁻¹ = A + σ_y⁻² ∑ᵢ xᵢxᵢᵀ]
  = −(1/2) wᵀΣ_w⁻¹w + wᵀΣ_w⁻¹( Σ_w ∑ᵢ (yᵢxᵢ) σ_y⁻² ) + const        [write µ_w = Σ_w ∑ᵢ (yᵢxᵢ) σ_y⁻²]
  = log N( Σ_w ∑ᵢ (yᵢxᵢ) σ_y⁻², Σ_w )
MAP and ML for linear regression
As the posterior is Gaussian, the MAP and posterior mean weights are the same: wMAP =
- A +
- i xixT
i
σ2
y
−1
- Σw
- i yixi
σ2
y
=
- Aσ2
y +
- i
xixT
i
−1
i
yixi Compare this to the (transposed) ML weight vector for scalar outputs: wML = WT =
i
xixT
i
−1
i
yixi
◮ The prior acts to “inflate” the apparent covariance of inputs. ◮ As A is positive (semi)definite, shrinks the weights towards the prior mean (here 0). ◮ If A = αI this is known as the ridge regression estimator.
MAP and ML for linear regression
As the posterior is Gaussian, the MAP and posterior mean weights are the same: wMAP =
- A +
- i xixT
i
σ2
y
−1
- Σw
- i yixi
σ2
y
=
- Aσ2
y +
- i
xixT
i
−1
i
yixi Compare this to the (transposed) ML weight vector for scalar outputs: wML = WT =
i
xixT
i
−1
i
yixi
◮ The prior acts to “inflate” the apparent covariance of inputs. ◮ As A is positive (semi)definite, shrinks the weights towards the prior mean (here 0). ◮ If A = αI this is known as the ridge regression estimator. ◮ The MAP/shrinkage/ridge weight estimate often has lower squared error (despite bias)
and makes more accurate predictions on test inputs than the ML estimate.
MAP and ML for linear regression
As the posterior is Gaussian, the MAP and posterior mean weights are the same: wMAP =
- A +
- i xixT
i
σ2
y
−1
- Σw
- i yixi
σ2
y
=
- Aσ2
y +
- i
xixT
i
−1
i
yixi Compare this to the (transposed) ML weight vector for scalar outputs: wML = WT =
i
xixT
i
−1
i
yixi
◮ The prior acts to “inflate” the apparent covariance of inputs. ◮ As A is positive (semi)definite, shrinks the weights towards the prior mean (here 0). ◮ If A = αI this is known as the ridge regression estimator. ◮ The MAP/shrinkage/ridge weight estimate often has lower squared error (despite bias)
and makes more accurate predictions on test inputs than the ML estimate.
◮ An example of prior-based regularisation of estimates.
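The two estimators differ only in the $A\sigma_y^2$ term added to the input covariance. A small sketch, assuming $A = \alpha I$ (the ridge case) and invented data, shows the shrinkage directly:

```python
import numpy as np

# Compare the ML and MAP/ridge weight estimates with few, noisy observations.
# All data and the value of alpha are illustrative assumptions.
rng = np.random.default_rng(1)
D, N = 10, 15                                  # few observations relative to D
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
sigma_y = 1.0
y = X @ w_true + sigma_y * rng.normal(size=N)

alpha = 1.0                                    # assumed prior precision scale

# w_ML  = (sum_i x_i x_i^T)^{-1} sum_i y_i x_i
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
# w_MAP = (alpha sigma_y^2 I + sum_i x_i x_i^T)^{-1} sum_i y_i x_i
w_map = np.linalg.solve(alpha * sigma_y**2 * np.eye(D) + X.T @ X, X.T @ y)

# The prior "inflates" the input covariance, which shrinks the weights:
print(np.linalg.norm(w_map), "<=", np.linalg.norm(w_ml))
```

The shrinkage is guaranteed whenever $X^{\mathsf{T}}X$ is invertible: in the singular-value basis each component of $\mathbf{w}_{\text{ML}}$ is scaled by $s_j^2/(s_j^2 + \alpha\sigma_y^2) < 1$.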
Gaussians for Regression

◮ Models the conditional P(y|x).
◮ If we also model P(x), then learning is indistinguishable from unsupervised learning. In particular, if P(x) is Gaussian and P(y|x) is linear-Gaussian, then x and y are jointly Gaussian.
◮ Generalised Linear Models (GLMs) generalise to non-Gaussian, exponential-family distributions and to non-linear link functions:
\[
y_i \sim \text{ExpFam}(\mu_i, \phi) \qquad g(\mu_i) = \mathbf{w}^{\mathsf{T}}\mathbf{x}_i
\]
Posterior, or even ML, estimation is not possible in closed form ⇒ iterative methods such as gradient ascent or iteratively re-weighted least squares (IRLS).
A warning to fMRIers: in SPM, “GLM” stands for the “general” (not “generalised”) linear model, which is just linear.
◮ These models (Gaussians, linear-Gaussian regression and GLMs) are important building blocks for the more sophisticated models we will develop later.
◮ Gaussian models are also used for regression in Gaussian Process models. We’ll see these later too.
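The IRLS iteration mentioned above can be sketched for one concrete GLM: logistic regression, i.e. Bernoulli outputs with the logit link $g(\mu) = \log\frac{\mu}{1-\mu}$. The data and iteration count below are illustrative assumptions; each update is a Newton step on the log likelihood.

```python
import numpy as np

# IRLS for logistic regression (an exponential-family GLM), on simulated data.
rng = np.random.default_rng(2)
D, N = 2, 500
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(N, D))
p = 1.0 / (1.0 + np.exp(-(X @ w_true)))       # inverse logit gives mu_i
y = (rng.random(N) < p).astype(float)         # Bernoulli observations

w = np.zeros(D)
for _ in range(25):                           # IRLS / Newton iterations
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))      # current means, g^{-1}(w^T x_i)
    s = mu * (1.0 - mu)                       # working weights (Bernoulli variance)
    # Newton step: w <- w + (X^T S X)^{-1} X^T (y - mu), S = diag(s)
    w = w + np.linalg.solve(X.T @ (s[:, None] * X), X.T @ (y - mu))

print(w)                                      # roughly recovers w_true
```

Each iteration solves a weighted least-squares problem, hence the name; for the Gaussian GLM with identity link the weights are constant and a single step reproduces the closed-form ML solution.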
Three limitations of the multivariate Gaussian model

◮ What about higher-order statistical structure in the data?
⇒ nonlinear and hierarchical models
◮ What happens if there are outliers?
⇒ other noise models
◮ There are D(D + 1)/2 parameters in the multivariate Gaussian model. What if D is very large?
⇒ dimensionality reduction
End Notes
◮ It is very important that you understand all the material in the following cribsheet:
http://www.gatsby.ucl.ac.uk/teaching/courses/ml1/cribsheet.pdf
◮ The following notes by (the late) Sam Roweis are quite useful:
◮ Matrix identities and matrix derivatives:
http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf
◮ Gaussian identities: