Learning Neural Networks - PowerPoint PPT Presentation


SLIDE 1

Learning Neural Networks

Neural Networks can represent complex decision boundaries

– Variable size. Any boolean function can be represented. Hidden units can be interpreted as new features
– Deterministic
– Continuous Parameters

Learning Algorithms for neural networks

– Local Search. The same algorithm as for sigmoid threshold units
– Eager
– Batch or Online

SLIDE 2

Neural Network Hypothesis Space

Each unit a6, a7, a8, and ŷ computes a sigmoid function of its inputs:

a6 = σ(W6 · X)
a7 = σ(W7 · X)
a8 = σ(W8 · X)
ŷ = σ(W9 · A)

where A = [1, a6, a7, a8] is called the vector of hidden unit activations.

Original motivation: Differentiable approximation to multi-layer LTUs

[Figure: feed-forward network with inputs x1–x4, hidden units a6, a7, a8 (weight vectors W6, W7, W8), and output ŷ (weight vector W9)]
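A minimal NumPy sketch of this forward pass (the function name, the array shapes, and the leading-1 bias convention are assumptions; the slide's diagram leaves them implicit):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed shapes: X is [1, x1, x2, x3, x4] (leading 1 for the bias),
# W6, W7, W8 are length-5 weight vectors, W9 is length-4.
def forward(X, W6, W7, W8, W9):
    a6 = sigmoid(W6 @ X)
    a7 = sigmoid(W7 @ X)
    a8 = sigmoid(W8 @ X)
    A = np.array([1.0, a6, a7, a8])   # vector of hidden unit activations
    y_hat = sigmoid(W9 @ A)
    return y_hat, A
```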

SLIDE 3

Representational Power

Any Boolean Formula

– Consider a formula in disjunctive normal form:

(x1 ∧ ¬x2) ∨ (x2 ∧ x4) ∨ (¬x3 ∧ x5)

Each AND can be represented by a hidden unit and the OR can be represented by the output unit (see the sketch below). Arbitrary boolean functions require exponentially many hidden units, however.
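To make the AND/OR construction concrete, here is a hypothetical sketch of the formula above. The weight magnitudes (±10, with biases chosen so each unit fires only when its conjunct is satisfied) are an illustrative choice, not from the slide; with large weights a sigmoid behaves almost like a hard threshold:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def formula(x):
    """x = (x1, x2, x3, x4, x5), each 0 or 1; returns ~1 iff the DNF is true."""
    x1, x2, x3, x4, x5 = x
    h1 = sigmoid(10*x1 - 10*x2 - 5)         # hidden unit for x1 AND NOT x2
    h2 = sigmoid(10*x2 + 10*x4 - 15)        # hidden unit for x2 AND x4
    h3 = sigmoid(-10*x3 + 10*x5 - 5)        # hidden unit for NOT x3 AND x5
    return sigmoid(10*(h1 + h2 + h3) - 5)   # output unit: OR of the conjuncts

print(formula((1, 0, 0, 0, 0)))   # ~0.99: first conjunct true
print(formula((0, 0, 1, 0, 0)))   # ~0.01: all conjuncts false
```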

Bounded functions

– Suppose we make the output a linear function of the hidden units: ŷ = W9 · A. It can be proved that any bounded continuous function can be approximated to arbitrary accuracy with enough hidden units.

Arbitrary Functions

– Any function can be approximated to arbitrary accuracy with two hidden layers of sigmoid units and a linear output unit.

SLIDE 4

Fixed versus Variable Size

In principle, a network has a fixed number of parameters and therefore can only represent a fixed hypothesis space (if the number of hidden units is fixed). However, we will initialize the weights to values near zero and use gradient descent. The more steps of gradient descent we take, the more functions can be "reached" from the starting weights. So it turns out to be more accurate to treat networks as having a variable hypothesis space that depends on the number of steps of gradient descent.

SLIDE 5

Backpropagation: Gradient Descent for Multi-Layer Networks

It is traditional to train neural networks to minimize the squared error. This is really a mistake: they should be trained to maximize the log likelihood instead. But we will study the MSE first.

We must apply the chain rule many times to compute the gradient. We will number the units from 0 to U and index them by u and v. wv,u will be the weight connecting unit u to unit v. (Note: This seems backwards. It is the uth input to node v.)

ŷ = σ(W9 · [1, σ(W6 · X), σ(W7 · X), σ(W8 · X)])
Ji(W) = ½ (ŷi − yi)²

SLIDE 6

Derivation: Output Unit

Suppose w9,6 is a component of W9, the output weight vector; it carries the input from a6.

∂Ji(W)/∂w9,6 = ∂/∂w9,6 [½ (ŷi − yi)²]
= ½ · 2 · (ŷi − yi) · ∂/∂w9,6 (σ(W9 · Ai) − yi)
= (ŷi − yi) · σ(W9 · Ai)(1 − σ(W9 · Ai)) · ∂/∂w9,6 (W9 · Ai)
= (ŷi − yi) ŷi(1 − ŷi) · a6

SLIDE 7

The Delta Rule

Define

δ9 = (ŷi − yi) ŷi(1 − ŷi)

then

∂Ji(W)/∂w9,6 = (ŷi − yi) ŷi(1 − ŷi) · a6 = δ9 · a6

SLIDE 8

Derivation: Hidden Units

∂Ji(W)/∂w6,2 = (ŷi − yi) · σ(W9 · Ai)(1 − σ(W9 · Ai)) · ∂/∂w6,2 (W9 · Ai)
= δ9 · w9,6 · ∂/∂w6,2 σ(W6 · X)
= δ9 · w9,6 · σ(W6 · X)(1 − σ(W6 · X)) · ∂/∂w6,2 (W6 · X)
= δ9 · w9,6 · a6(1 − a6) · x2

Define δ6 = δ9 · w9,6 · a6(1 − a6) and rewrite as ∂Ji(W)/∂w6,2 = δ6 · x2.
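These deltas can be checked numerically. Below is a small sketch comparing the analytic gradients of the last two slides (δ9 · a6 for an output weight, δ6 · x2 for a hidden weight) against finite differences; the random weights, the example input, and the step size are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([1.0, 0.2, -0.4, 0.9, 0.1])   # [1, x1, x2, x3, x4]
W6, W7, W8 = rng.normal(0.0, 0.1, (3, 5))  # hidden weight vectors
W9 = rng.normal(0.0, 0.1, 4)               # output weights [bias, w9,6, w9,7, w9,8]
y = 1.0

def J(W6_, W9_):
    A = np.array([1.0, sigmoid(W6_ @ X), sigmoid(W7 @ X), sigmoid(W8 @ X)])
    return 0.5 * (sigmoid(W9_ @ A) - y) ** 2

A = np.array([1.0, sigmoid(W6 @ X), sigmoid(W7 @ X), sigmoid(W8 @ X)])
y_hat = sigmoid(W9 @ A)
a6 = A[1]
delta9 = (y_hat - y) * y_hat * (1 - y_hat)
delta6 = delta9 * W9[1] * a6 * (1 - a6)

eps = 1e-6
W9p = W9.copy(); W9p[1] += eps             # perturb w9,6
print(delta9 * a6, (J(W6, W9p) - J(W6, W9)) / eps)
W6p = W6.copy(); W6p[2] += eps             # perturb w6,2 (the weight on x2)
print(delta6 * X[2], (J(W6p, W9) - J(W6, W9)) / eps)
# each printed pair should agree to several decimal places
```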

SLIDE 9

Networks with Multiple Output Units

We get a separate contribution to the gradient from each output unit. Hence, for input-to-hidden weights, we must sum up the contributions:

δ6 = a6(1 − a6) ∑u=9,10 wu,6 δu

[Figure: network with inputs x1–x4, hidden units a6, a7, a8 (weight vectors W6, W7, W8), and two output units ŷ1, ŷ2]

SLIDE 10

The Backpropagation Algorithm

Forward Pass. Compute au and ŷv for hidden units u and output units v.

Compute Errors. Compute εv = (ŷv − yv) for each output unit v.

Compute Output Deltas. Compute δv = ŷv(1 − ŷv) εv for each output unit v.

Compute Hidden Deltas. Compute δu = au(1 − au) ∑v wv,u δv for each hidden unit u.

Compute Gradient.

– Compute ∂Ji/∂wu,j = δu xij for input-to-hidden weights.
– Compute ∂Ji/∂wv,u = δv aiu for hidden-to-output weights.

Take Gradient Step.

W := W − η ∇W J(xi)
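Putting the steps together for the 4-3-1 network of the earlier slides, here is a minimal sketch of one online update (the function name, array layout, and learning rate η = 0.1 are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(X, y, W6, W7, W8, W9, eta=0.1):
    # Forward pass: a_u for the hidden units, y_hat for the output unit.
    a = np.array([sigmoid(W @ X) for W in (W6, W7, W8)])   # [a6, a7, a8]
    A = np.concatenate(([1.0], a))                         # hidden activations
    y_hat = sigmoid(W9 @ A)
    # Error and output delta.
    eps_v = y_hat - y                        # epsilon_v = y_hat_v - y_v
    delta_out = eps_v * y_hat * (1 - y_hat)  # delta_v = y_hat(1 - y_hat) eps_v
    # Hidden deltas: delta_u = a_u(1 - a_u) * sum_v w_{v,u} delta_v.
    delta_hid = a * (1 - a) * W9[1:] * delta_out
    # Gradient step (in place): W := W - eta * gradient.
    W9 -= eta * delta_out * A                # hidden-to-output: delta_v * a_u
    for W, d in zip((W6, W7, W8), delta_hid):
        W -= eta * d * X                     # input-to-hidden: delta_u * x_j
    return y_hat
```

Calling `backprop_step` once per training example implements the online variant discussed two slides below.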

SLIDE 11

Proper Initialization

Start in the "linear" regions

– Keep all weights near zero, so that all sigmoid units are in their linear regions. This makes the whole net the equivalent of one linear threshold unit, a very simple function.

Break symmetry.

– Ensure that each hidden unit has different input weights so that the hidden units move in different directions.

Set each weight to a random number in the range [−1, +1] × 1/√(fan-in), where the "fan-in" of weight wv,u is the number of inputs to unit v.
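A short sketch of this rule (the helper name is hypothetical):

```python
import numpy as np

def init_weights(fan_in, rng=np.random.default_rng()):
    # Uniform in [-1, +1], scaled by 1/sqrt(fan-in), as on the slide.
    return rng.uniform(-1.0, 1.0, size=fan_in) / np.sqrt(fan_in)

# e.g. a hidden unit with 5 inputs (the bias plus x1..x4):
W6 = init_weights(5)
```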

SLIDE 12

Batch, Online, and Online with Momentum

Batch. Sum the ∇W J(xi) for each example i. Then take a gradient descent step.

Online. Take a gradient descent step with each ∇W J(xi) as it is computed.

Momentum. Maintain an exponentially-weighted moving sum of recent gradients:

∆W(t+1) := µ ∆W(t) + ∇W J(xi)
W(t+1) := W(t) − η ∆W(t+1)

Typical values of µ are in the range [0.7, 0.95].
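A sketch of the momentum update exactly as stated (the function name and the default values of η and µ are assumptions):

```python
import numpy as np

def momentum_step(W, dW, grad, eta=0.1, mu=0.9):
    """One momentum update; dW is the moving sum of recent gradients."""
    dW = mu * dW + grad   # dW(t+1) := mu * dW(t) + grad J(x_i)
    W = W - eta * dW      # W(t+1)  := W(t) - eta * dW(t+1)
    return W, dW
```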

SLIDE 13

Softmax Output Layer

Let a9 and a10 be the output activations: a9 = W9 · A, a10 = W10 · A. Then define

ŷ1 = exp(a9) / (exp(a9) + exp(a10))
ŷ2 = exp(a10) / (exp(a9) + exp(a10))

The objective function is the negative log likelihood:

J(W) = ∑i ∑k −I[yi = k] log ŷk

where I[expr] is 1 if expr is true and 0 otherwise.

[Figure: the two-output network of Slide 9 with a softmax layer (weight vectors W9, W10) producing ŷ1 and ŷ2 from the hidden activations a6–a8]
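A small sketch of the softmax and the per-example negative log likelihood (the function name is an assumption, and the max-subtraction for numerical stability is a standard addition, not on the slide):

```python
import numpy as np

def softmax_nll(a, y):
    """a = [a9, a10], the output activations; y in {0, 1} is the true class."""
    a = a - np.max(a)                  # stabilize the exponentials
    p = np.exp(a) / np.sum(np.exp(a))  # [y_hat_1, y_hat_2]
    return -np.log(p[y])               # -sum_k I[y = k] log y_hat_k
```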

SLIDE 14

Neural Network Evaluation

Criterion                  Perc  Logistic  LDA  Nets      Trees
Mixed data                 no    no        no   no        yes
Missing values             no    no        yes  no        yes
Outliers                   no    yes       no   yes       yes
Monotone transformations   no    no        no   somewhat  yes
Scalability                yes   yes       yes  yes       yes
Irrelevant inputs          no    no        no   no        somewhat
Linear combinations        yes   yes       yes  yes       no
Interpretable              yes   yes       yes  no        yes
Accurate                   yes   yes       yes  yes       no