


Fundamental Belief: Universal Approximation Theorems

Ju Sun

Computer Science & Engineering University of Minnesota, Twin Cities

January 29, 2020

1 / 23


Logistics

– HW 0 posted (due: midnight Feb 07)
– Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.) now available at the UMN library (limited e-access)
– Guest lectures (Feb 04: tutorial on NumPy, SciPy, Colab. Bring your laptops if possible!)
– Feb 06: discussion of the course project & ideas

2 / 23


Outline

– Recap
– Why should we trust NNs?
– Suggested reading

3 / 23


Recap I

– biological neuron vs. artificial neuron
– biological NN vs. artificial NN
– Artificial NN: (over)-simplification on neuron & connection levels

4 / 23


Recap II

Zoo of NN models in ML:
– Linear regression
– Perceptron and logistic regression
– Softmax regression
– Multilayer perceptron (feedforward NNs)

Also:
– Support vector machines (SVM)
– PCA (autoencoder)
– Matrix factorization

5 / 23


Recap III

Brief history of NNs:
– 1943: first NNs invented (McCulloch and Pitts)
– 1958–1969: perceptron (Rosenblatt)
– 1969: Perceptrons (Minsky and Papert)—end of the perceptron era
– 1980s–1990s: Neocognitron, CNNs, back-prop, SGD—still what we use today
– 1990s–2010s: SVMs, AdaBoost, decision trees and random forests
– 2010s–now: DNNs and deep learning

6 / 23


Outline

– Recap
– Why should we trust NNs?
– Suggested reading

7 / 23


Supervised learning

General view:
– Gather training data (x1, y1), …, (xn, yn)
– Choose a family of functions H, so that there is f ∈ H to ensure yi ≈ f(xi) for all i
– Set up a loss function ℓ
– Find an f ∈ H to minimize the average loss

    min_{f ∈ H}  (1/n) ∑_{i=1}^{n} ℓ(yi, f(xi))

NN view:
– Gather training data (x1, y1), …, (xn, yn)
– Choose a NN with k neurons, so that there is a group of weights (w1, …, wk, b1, …, bk) to ensure yi ≈ {NN(w1, …, wk, b1, …, bk)}(xi) for all i
– Set up a loss function ℓ
– Find weights (w1, …, wk, b1, …, bk) to minimize the average loss

    min_{w's, b's}  (1/n) ∑_{i=1}^{n} ℓ[yi, {NN(w1, …, wk, b1, …, bk)}(xi)]

Why should we trust NNs?

8 / 23
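
To make the NN view concrete, here is a minimal NumPy sketch: a one-hidden-layer network with k sigmoid neurons fit by gradient descent on the average squared loss. The toy target, width k, step size, and iteration count are illustrative assumptions, not choices from the course.

# Minimal sketch of the "NN view": fit weights (w's, b's) of a small
# one-hidden-layer network by gradient descent on the average squared loss.
# Toy target, network width, and step size are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

# training data (x_i, y_i), here y_i = f0(x_i) for a toy f0
n = 200
x = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * x)

k = 20                                    # number of hidden neurons
W = rng.normal(size=(1, k))               # w_1, ..., w_k
b = np.zeros(k)                           # b_1, ..., b_k
v = rng.normal(size=(k, 1)) / np.sqrt(k)  # output weights

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(5000):
    h = sigmoid(x @ W + b)                # hidden activations, n x k
    pred = h @ v                          # {NN(w's, b's)}(x_i)
    resid = pred - y
    loss = np.mean(resid ** 2)            # (1/n) sum_i l(y_i, NN(x_i))

    # gradients of the average squared loss w.r.t. v, W, b
    grad_v = 2 * h.T @ resid / n
    grad_h = 2 * resid @ v.T / n
    grad_z = grad_h * h * (1 - h)         # sigmoid'(z) = sigmoid(z)(1 - sigmoid(z))
    grad_W = x.T @ grad_z
    grad_b = grad_z.sum(axis=0)

    v -= lr * grad_v
    W -= lr * grad_W
    b -= lr * grad_b

print("final average loss:", loss)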


Function approximation

More accurate description of supervised learning:
– Underlying true function: f0
– Training data: yi ≈ f0(xi)
– Choose a family of functions H, so that ∃ f ∈ H and f and f0 are close
– Approximation capacity: H matters (e.g., linear? quadratic? sinusoids? etc.)
– Optimization & generalization: how to find the best f ∈ H matters

We focus on approximation capacity now.

9 / 23


A word on notation

– k-layer NNs: with k layers of weights
– k-hidden-layer NNs: with k hidden layers of nodes (i.e., (k + 1)-layer NNs)

10 / 23


First trial

Think of single-output (i.e., R) problems first. A single neuron:

(f → σ: as before, the activation function is always written as σ)

H = {x → σ(w⊺x + b)}

– σ identity or linear: linear functions
– σ the sign function, sign(w⊺x + b) (perceptron): 0/1 function with a hyperplane threshold
– σ(z) = 1/(1 + e^{−z}) (sigmoid): {x → 1/(1 + e^{−(w⊺x + b)})}
– σ(z) = max(0, z) (ReLU): {x → max(0, w⊺x + b)}

11 / 23
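
A small sketch of this single-neuron hypothesis class H = {x → σ(w⊺x + b)} under the activations listed above; the weights and the test input are arbitrary illustrative values.

# One neuron x -> sigma(w^T x + b) under different activations.
# The weights and the sample input below are arbitrary illustrative values.
import numpy as np

def neuron(w, b, sigma):
    """Return the function x -> sigma(w^T x + b)."""
    return lambda x: sigma(np.dot(w, x) + b)

identity = lambda z: z                           # linear neuron
step01   = lambda z: np.heaviside(z, 0.0)        # 0/1 hyperplane threshold (perceptron-style)
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))    # logistic unit
relu     = lambda z: np.maximum(0.0, z)          # ReLU unit

w, b = np.array([2.0, -1.0]), 0.5
x = np.array([0.3, 0.8])

for name, act in [("identity", identity), ("step", step01),
                  ("sigmoid", sigmoid), ("relu", relu)]:
    print(name, neuron(w, b, act)(x))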


Second trial

Think of single-output (i.e., R) problems first. Add depth! … But make all hidden-node activations identity or linear:

σ(wL⊺ (WL−1 (… (W1 x + b1) …) + bL−1) + bL)

No better than a single neuron! Why?

12 / 23
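
One way to see the answer: with identity activations every layer is affine, and a composition of affine maps collapses into a single affine map w⊺x + b. A quick numeric check of this (layer sizes and random weights below are arbitrary illustrative choices):

# Check: a stack of purely linear/identity layers collapses to one affine map.
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 5, 3, 1]                       # input dim 4, two hidden layers, scalar output
Ws = [rng.normal(size=(m, d)) for d, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]

def deep_linear(x):
    for W, b in zip(Ws, bs):
        x = W @ x + b                      # identity activation at every layer
    return x

# Collapse the stack into a single affine map x -> W_eff x + b_eff.
W_eff = np.eye(sizes[0])
b_eff = np.zeros(sizes[0])
for W, b in zip(Ws, bs):
    b_eff = W @ b_eff + b
    W_eff = W @ W_eff

x = rng.normal(size=sizes[0])
print(np.allclose(deep_linear(x), W_eff @ x + b_eff))   # True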


Third trial

Think of single-output (i.e., R) problems first. Add both depth & nonlinearity: a two-layer network, with linear activation at the output.

Surprising news: universal approximation theorem. The 2-layer network can approximate arbitrary continuous functions arbitrarily well, provided that the hidden layer is sufficiently wide — so we don't need to worry about approximation capacity.

13 / 23


Universal approximation theorem

Theorem (UAT, [Cybenko, 1989, Hornik, 1991]). Let σ : R → R be a nonconstant, bounded, and continuous function, and let Im denote the m-dimensional unit hypercube [0, 1]^m. The space of real-valued continuous functions on Im is denoted by C(Im). Then, given any ε > 0 and any function f ∈ C(Im), there exist an integer N, real constants vi, bi ∈ R, and real vectors wi ∈ R^m for i = 1, …, N, such that we may define

    F(x) = ∑_{i=1}^{N} vi σ(wi⊺ x + bi)

as an approximate realization of the function f; that is,

    |F(x) − f(x)| < ε for all x ∈ Im.

14 / 23
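
A hedged illustration of the form F(x) = ∑ vi σ(wi⊺x + bi): draw the inner weights wi, bi at random and fit only the outer weights vi by least squares to a toy continuous target on [0, 1]. This is not the theorem's construction, just one easy way to produce such an F; N, the target, and the sampling ranges are all assumptions for illustration.

# F(x) = sum_i v_i * sigma(w_i x + b_i) approximating a continuous function
# on [0, 1]: random inner weights + least squares for the v_i.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x       # toy target in C([0, 1])

N = 100                                             # number of hidden neurons
w = rng.normal(scale=20.0, size=N)                  # random w_i (scalar inputs here)
b = rng.uniform(-20.0, 0.0, size=N)                 # random b_i

x = np.linspace(0.0, 1.0, 400)
Phi = sigmoid(np.outer(x, w) + b)                   # Phi[j, i] = sigma(w_i x_j + b_i)
v, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)      # fit the outer weights v_i

F = Phi @ v
print("max |F(x) - f(x)| on the grid:", np.max(np.abs(F - f(x))))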


Thoughts on UAT

– σ : R → R nonconstant, bounded, and continuous: what about ReLU (leaky ReLU) or the sign function (as in the perceptron)? We have theorem(s) for those too
– Im denotes the m-dimensional unit hypercube [0, 1]^m: this can be replaced by any compact subset of R^m
– there exist an integer N: but how large does N need to be? (later)
– the space of real-valued continuous functions on Im: two examples to ponder on: binary classification, and learning the square-root function

15 / 23


Why could UAT hold?

The proof is very technical ... functional analysis

16 / 23


Why could UAT hold?

Visual “proof” (http://neuralnetworksanddeeplearning.com/chap4.html). Think of R → R functions first, with σ(z) = 1/(1 + e^{−z}):

– Step 1: build “step” functions
– Step 2: build “bump” functions
– Step 3: sum up bumps to approximate

17 / 23


Step 1: build step functions

y = 1/(1 + e^{−(wx + b)}) = 1/(1 + e^{−w(x − s)}), where s = −b/w

– Larger w, sharper transition
– Transition around s = −b/w

18 / 23
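
A tiny sketch of Step 1: σ(wx + b) behaves like a step with its transition at s = −b/w, and the step gets sharper as w grows. The particular s and w values below are illustrative.

# Step 1 sketch: sigma(w*x + b) looks like a step at s = -b/w; larger w,
# sharper transition.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def step(x, s, w=200.0):
    """Approximate step with transition at s: sigmoid(w*x + b) with b = -w*s."""
    return sigmoid(w * (x - s))

x = np.linspace(0.0, 1.0, 5)
print(step(x, s=0.4, w=5.0))              # gentle transition around x = 0.4
print(step(x, s=0.4, w=500.0))            # nearly a 0/1 step at x = 0.4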


Step 2: build bump functions

0.6 · step(0.3) − 0.6 · step(0.6), where step(s) denotes a step with its transition at s. Write h for the bump height (here h = 0.6).

19 / 23
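
Step 2 as a sketch: a bump of height h built from two steps, reusing the step construction above; h = 0.6 and the transition points 0.3, 0.6 follow the slide's example.

# Step 2 sketch: a "bump" of height h between two transition points,
# built as h*step(s1) - h*step(s2).
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
step = lambda x, s, w=500.0: sigmoid(w * (x - s))

def bump(x, s1, s2, h):
    """Roughly h on [s1, s2] and roughly 0 elsewhere."""
    return h * step(x, s1) - h * step(x, s2)

x = np.linspace(0.0, 1.0, 11)
print(np.round(bump(x, 0.3, 0.6, h=0.6), 3))   # the slide's 0.6*step(0.3) - 0.6*step(0.6)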


Step 3: sum up bumps to approximate

Two bumps; five bumps; … Does the ultimate idea look familiar?

20 / 23
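
Step 3 as a sketch: tile [0, 1] with bumps whose heights track a toy target f, and watch the worst-case error on a grid shrink as the number of bumps grows. The target, the grid, and the step sharpness are illustrative assumptions.

# Step 3 sketch: approximate a continuous f on [0, 1] by summing bumps whose
# heights are f evaluated at the midpoint of each sub-interval.
import numpy as np

# clip the argument to avoid overflow warnings for very sharp steps
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
step = lambda x, s, w=2000.0: sigmoid(w * (x - s))
bump = lambda x, s1, s2, h: h * step(x, s1) - h * step(x, s2)

f = lambda x: 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x)   # toy target

def approx(x, num_bumps):
    edges = np.linspace(0.0, 1.0, num_bumps + 1)
    out = np.zeros_like(x)
    for s1, s2 in zip(edges[:-1], edges[1:]):
        out += bump(x, s1, s2, h=f((s1 + s2) / 2))          # bump height = f at midpoint
    return out

x = np.linspace(0.01, 0.99, 500)
for num_bumps in (2, 5, 50):
    print(num_bumps, "bumps, max error:",
          np.max(np.abs(approx(x, num_bumps) - f(x))))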


Outline

– Recap
– Why should we trust NNs?
– Suggested reading

21 / 23


Suggested reading

– Chap 4, Neural Networks and Deep Learning (online book) http://neuralnetworksanddeeplearning.com/chap4.html

22 / 23


References i

[Cybenko, 1989] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303–314.

[Hornik, 1991] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.

23 / 23