


Fundamental Belief: Universal Approximation Theorems

Ju Sun

Computer Science & Engineering University of Minnesota, Twin Cities

January 29, 2020

1 / 23


Logistics

– HW 0 posted (due: midnight Feb 07)
– Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.) now available at the UMN library (limited e-access)
– Guest lectures (Feb 04: tutorial on NumPy, SciPy, Colab. Bring your laptops if possible!)
– Feb 06: discussion of the course project & ideas

2 / 23


Outline

– Recap
– Why should we trust NNs?
– Suggested reading

3 / 23


Recap I

– biological neuron vs. artificial neuron
– biological NN vs. artificial NN
– Artificial NN: (over)-simplification on neuron & connection levels

4 / 23


Recap II

Zoo of NN models in ML:
– Linear regression
– Perceptron and logistic regression
– Softmax regression
– Multilayer perceptron (feedforward NNs)

Also:
– Support vector machines (SVM)
– PCA (autoencoder)
– Matrix factorization

5 / 23


Recap III

Brief history of NNs:
– 1943: first NNs invented (McCulloch and Pitts)
– 1958–1969: perceptron (Rosenblatt)
– 1969: Perceptrons (Minsky and Papert)—end of the perceptron era
– 1980s–1990s: Neocognitron, CNNs, back-prop, SGD—still what we use today
– 1990s–2010s: SVMs, AdaBoost, decision trees and random forests
– 2010s–now: DNNs and deep learning

6 / 23


Outline

– Recap
– Why should we trust NNs?
– Suggested reading

7 / 23


Supervised learning

General view:
– Gather training data (x1, y1), …, (xn, yn)
– Choose a family of functions H, so that there is f ∈ H to ensure yi ≈ f(xi) for all i
– Set up a loss function ℓ
– Find an f ∈ H to minimize the average loss

    min_{f ∈ H}  (1/n) ∑_{i=1}^{n} ℓ(yi, f(xi))

NN view:
– Gather training data (x1, y1), …, (xn, yn)
– Choose a NN with k neurons, so that there is a group of weights (w1, …, wk, b1, …, bk) to ensure yi ≈ {NN(w1, …, wk, b1, …, bk)}(xi) for all i
– Set up a loss function ℓ
– Find weights (w1, …, wk, b1, …, bk) to minimize the average loss

    min_{w's, b's}  (1/n) ∑_{i=1}^{n} ℓ[yi, {NN(w1, …, wk, b1, …, bk)}(xi)]

Why should we trust NNs?

8 / 23
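
To make the NN view concrete, here is a minimal NumPy sketch: a one-hidden-layer network with k sigmoid neurons fit by gradient descent on the average squared loss. The toy target, width k, step size, and iteration count are illustrative assumptions, not choices from the course.

# Minimal sketch of the "NN view": fit weights (w's, b's) of a small
# one-hidden-layer network by gradient descent on the average squared loss.
# Toy target, network width, and step size are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

# training data (x_i, y_i), here y_i = f0(x_i) for a toy f0
n = 200
x = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * x)

k = 20                                    # number of hidden neurons
W = rng.normal(size=(1, k))               # w_1, ..., w_k
b = np.zeros(k)                           # b_1, ..., b_k
v = rng.normal(size=(k, 1)) / np.sqrt(k)  # output weights

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(5000):
    h = sigmoid(x @ W + b)                # hidden activations, n x k
    pred = h @ v                          # {NN(w's, b's)}(x_i)
    resid = pred - y
    loss = np.mean(resid ** 2)            # (1/n) sum_i l(y_i, NN(x_i))

    # gradients of the average squared loss w.r.t. v, W, b
    grad_v = 2 * h.T @ resid / n
    grad_h = 2 * resid @ v.T / n
    grad_z = grad_h * h * (1 - h)         # sigmoid'(z) = sigmoid(z)(1 - sigmoid(z))
    grad_W = x.T @ grad_z
    grad_b = grad_z.sum(axis=0)

    v -= lr * grad_v
    W -= lr * grad_W
    b -= lr * grad_b

print("final average loss:", loss)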


Function approximation

More accurate description of supervised learning:
– Underlying true function: f0
– Training data: yi ≈ f0(xi)
– Choose a family of functions H, so that ∃ f ∈ H and f and f0 are close
– Approximation capacity: H matters (e.g., linear? quadratic? sinusoids? etc.)
– Optimization & generalization: how to find the best f ∈ H matters

We focus on approximation capacity now.

9 / 23


A word on notation

– k-layer NNs: with k layers of weights
– k-hidden-layer NNs: with k hidden layers of nodes (i.e., (k + 1)-layer NNs)

10 / 23


First trial

Think of single-output (i.e., R) problems first. A single neuron:

(f → σ: as before, the activation function is always written as σ)

H = {x → σ(w⊺x + b)}

– σ identity or linear: linear functions
– σ the sign function, sign(w⊺x + b) (perceptron): 0/1 function with a hyperplane threshold
– σ(z) = 1/(1 + e^{−z}) (sigmoid): {x → 1/(1 + e^{−(w⊺x + b)})}
– σ(z) = max(0, z) (ReLU): {x → max(0, w⊺x + b)}

11 / 23
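
A small sketch of this single-neuron hypothesis class H = {x → σ(w⊺x + b)} under the activations listed above; the weights and the test input are arbitrary illustrative values.

# One neuron x -> sigma(w^T x + b) under different activations.
# The weights and the sample input below are arbitrary illustrative values.
import numpy as np

def neuron(w, b, sigma):
    """Return the function x -> sigma(w^T x + b)."""
    return lambda x: sigma(np.dot(w, x) + b)

identity = lambda z: z                           # linear neuron
step01   = lambda z: np.heaviside(z, 0.0)        # 0/1 hyperplane threshold (perceptron-style)
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))    # logistic unit
relu     = lambda z: np.maximum(0.0, z)          # ReLU unit

w, b = np.array([2.0, -1.0]), 0.5
x = np.array([0.3, 0.8])

for name, act in [("identity", identity), ("step", step01),
                  ("sigmoid", sigmoid), ("relu", relu)]:
    print(name, neuron(w, b, act)(x))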


Second trial

Think of single-output (i.e., R) problems first. Add depth! … But make all hidden-node activations identity or linear:

σ(wL⊺ (WL−1 (… (W1 x + b1) …) + bL−1) + bL)

No better than a single neuron! Why?

12 / 23
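
One way to see the answer: with identity activations every layer is affine, and a composition of affine maps collapses into a single affine map w⊺x + b. A quick numeric check of this (layer sizes and random weights below are arbitrary illustrative choices):

# Check: a stack of purely linear/identity layers collapses to one affine map.
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 5, 3, 1]                       # input dim 4, two hidden layers, scalar output
Ws = [rng.normal(size=(m, d)) for d, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]

def deep_linear(x):
    for W, b in zip(Ws, bs):
        x = W @ x + b                      # identity activation at every layer
    return x

# Collapse the stack into a single affine map x -> W_eff x + b_eff.
W_eff = np.eye(sizes[0])
b_eff = np.zeros(sizes[0])
for W, b in zip(Ws, bs):
    b_eff = W @ b_eff + b
    W_eff = W @ W_eff

x = rng.normal(size=sizes[0])
print(np.allclose(deep_linear(x), W_eff @ x + b_eff))   # True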


Third trial

Think of single-output (i.e., R) problems first. Add both depth & nonlinearity: a two-layer network, with linear activation at the output.

Surprising news: universal approximation theorem. The 2-layer network can approximate arbitrary continuous functions arbitrarily well, provided that the hidden layer is sufficiently wide — so we don't need to worry about approximation capacity.

13 / 23


Universal approximation theorem

Theorem (UAT, [Cybenko, 1989, Hornik, 1991]). Let σ : R → R be a nonconstant, bounded, and continuous function, and let Im denote the m-dimensional unit hypercube [0, 1]^m. The space of real-valued continuous functions on Im is denoted by C(Im). Then, given any ε > 0 and any function f ∈ C(Im), there exist an integer N, real constants vi, bi ∈ R, and real vectors wi ∈ R^m for i = 1, …, N, such that we may define

    F(x) = ∑_{i=1}^{N} vi σ(wi⊺ x + bi)

as an approximate realization of the function f; that is,

    |F(x) − f(x)| < ε for all x ∈ Im.

14 / 23
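
A hedged illustration of the form F(x) = ∑ vi σ(wi⊺x + bi): draw the inner weights wi, bi at random and fit only the outer weights vi by least squares to a toy continuous target on [0, 1]. This is not the theorem's construction, just one easy way to produce such an F; N, the target, and the sampling ranges are all assumptions for illustration.

# F(x) = sum_i v_i * sigma(w_i x + b_i) approximating a continuous function
# on [0, 1]: random inner weights + least squares for the v_i.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x       # toy target in C([0, 1])

N = 100                                             # number of hidden neurons
w = rng.normal(scale=20.0, size=N)                  # random w_i (scalar inputs here)
b = rng.uniform(-20.0, 0.0, size=N)                 # random b_i

x = np.linspace(0.0, 1.0, 400)
Phi = sigmoid(np.outer(x, w) + b)                   # Phi[j, i] = sigma(w_i x_j + b_i)
v, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)      # fit the outer weights v_i

F = Phi @ v
print("max |F(x) - f(x)| on the grid:", np.max(np.abs(F - f(x))))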


Thoughts on UAT

– σ : R → R nonconstant, bounded, and continuous: what about ReLU (leaky ReLU) or the sign function (as in the perceptron)? We have theorem(s) for those too
– Im denotes the m-dimensional unit hypercube [0, 1]^m: this can be replaced by any compact subset of R^m
– there exist an integer N: but how large does N need to be? (later)
– the space of real-valued continuous functions on Im: two examples to ponder on: binary classification, and learning the square-root function

15 / 23


Why could UAT hold?

The proof is very technical ... functional analysis

16 / 23


Why could UAT hold?

Visual “proof” (http://neuralnetworksanddeeplearning.com/chap4.html). Think of R → R functions first, with σ(z) = 1/(1 + e^{−z}):

– Step 1: build “step” functions
– Step 2: build “bump” functions
– Step 3: sum up bumps to approximate

17 / 23


Step 1: build step functions

y = 1/(1 + e^{−(wx + b)}) = 1/(1 + e^{−w(x − s)}), where s = −b/w

– Larger w, sharper transition
– Transition around s = −b/w

18 / 23
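
A tiny sketch of Step 1: σ(wx + b) behaves like a step with its transition at s = −b/w, and the step gets sharper as w grows. The particular s and w values below are illustrative.

# Step 1 sketch: sigma(w*x + b) looks like a step at s = -b/w; larger w,
# sharper transition.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def step(x, s, w=200.0):
    """Approximate step with transition at s: sigmoid(w*x + b) with b = -w*s."""
    return sigmoid(w * (x - s))

x = np.linspace(0.0, 1.0, 5)
print(step(x, s=0.4, w=5.0))              # gentle transition around x = 0.4
print(step(x, s=0.4, w=500.0))            # nearly a 0/1 step at x = 0.4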


Step 2: build bump functions

0.6 · step(0.3) − 0.6 · step(0.6), where step(s) denotes a step with its transition at s. Write h for the bump height (here h = 0.6).

19 / 23
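
Step 2 as a sketch: a bump of height h built from two steps, reusing the step construction above; h = 0.6 and the transition points 0.3, 0.6 follow the slide's example.

# Step 2 sketch: a "bump" of height h between two transition points,
# built as h*step(s1) - h*step(s2).
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
step = lambda x, s, w=500.0: sigmoid(w * (x - s))

def bump(x, s1, s2, h):
    """Roughly h on [s1, s2] and roughly 0 elsewhere."""
    return h * step(x, s1) - h * step(x, s2)

x = np.linspace(0.0, 1.0, 11)
print(np.round(bump(x, 0.3, 0.6, h=0.6), 3))   # the slide's 0.6*step(0.3) - 0.6*step(0.6)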


Step 3: sum up bumps to approximate

Two bumps; five bumps; … Does the ultimate idea look familiar?

20 / 23
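
Step 3 as a sketch: tile [0, 1] with bumps whose heights track a toy target f, and watch the worst-case error on a grid shrink as the number of bumps grows. The target, the grid, and the step sharpness are illustrative assumptions.

# Step 3 sketch: approximate a continuous f on [0, 1] by summing bumps whose
# heights are f evaluated at the midpoint of each sub-interval.
import numpy as np

# clip the argument to avoid overflow warnings for very sharp steps
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
step = lambda x, s, w=2000.0: sigmoid(w * (x - s))
bump = lambda x, s1, s2, h: h * step(x, s1) - h * step(x, s2)

f = lambda x: 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x)   # toy target

def approx(x, num_bumps):
    edges = np.linspace(0.0, 1.0, num_bumps + 1)
    out = np.zeros_like(x)
    for s1, s2 in zip(edges[:-1], edges[1:]):
        out += bump(x, s1, s2, h=f((s1 + s2) / 2))          # bump height = f at midpoint
    return out

x = np.linspace(0.01, 0.99, 500)
for num_bumps in (2, 5, 50):
    print(num_bumps, "bumps, max error:",
          np.max(np.abs(approx(x, num_bumps) - f(x))))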


Outline

– Recap
– Why should we trust NNs?
– Suggested reading

21 / 23


Suggested reading

– Chap 4, Neural Networks and Deep Learning (online book) http://neuralnetworksanddeeplearning.com/chap4.html

22 / 23


References i

[Cybenko, 1989] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303–314.

[Hornik, 1991] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.

23 / 23