SLIDE 1

Department of Computer Science University of Bristol

COMSM0045 – Applied Deep Learning 2020/21

comsm0045-applied-deep-learning.github.io | Lecture 01

BASICS OF ARTIFICIAL NEURAL NETWORKS

Tilo Burghardt | tilo@cs.bris.ac.uk

SLIDE 2

Applied Deep Learning | University of Bristol Lecture 1 | 2

Agenda for Lecture 1

  • Neurons and their Structure
  • Single- and Multi-Layer Perceptrons
  • Basics of Cost Functions
  • Gradient Descent and the Delta Rule
  • Notation and Structure of Deep Feed-Forward Networks

SLIDE 3

Biological Inspiration

SLIDE 4

Golgi’s first Drawings of Neurons

CAMILLO GOLGI (image source: www.the-scientist.com)

Computation in biological neural networks emerges from the co-operation of individual computational components: neuron cells.

SLIDE 5

Schematic Model of a Neuron

Labelled parts: nucleus, cell body, dendrites, axon, myelin sheath, axon terminals, synapse. Main flow of information: feed-forward.

SLIDE 6

Pavlov and Assistant Conditioning a Dog

image source: www.psysci.co

An environment can condition the behaviour of biological neural networks, leading to the incorporation of new information.

SLIDE 7

Neuro-Plasticity

  • plasticity refers to a system’s ability to adapt its structure and/or behaviour to accommodate new information
  • the brain shows various forms of plasticity: natural forms include synaptic plasticity (mainly chemical), structural sprouting (growth), rerouting (functional changes), and neurogenesis (new neurons)

image source: www.cognifit.com

Example of structural sprouting (temporal system evolution).

SLIDE 8

Artificial Feed-forward Networks

SLIDE 9

Frank Rosenblatt (left) during the development of the Perceptron (1950s)

image source: csis.pace.edu

SLIDE 10

Simplification of a Neuron to a Computational Unit

Diagram: the inputs x1, x2, x3, … (the input width) are multiplied with the weights w1, w2, w3, …, summed together with a bias b, and passed through the activation function to produce the output y. Flow of information: feed-forward.

The unit's output is defined as

y := sign( Σ_i w_i x_i + b ),   where sign(v) = +1 if v ≥ 0 and −1 otherwise.
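This unit can be sketched in a few lines of Python (an illustration only; the names `perceptron_output`, `sign` and the sign(0) = +1 convention are assumptions, not from the slides):

```python
# Minimal sketch of the computational unit above:
# y = sign(sum_i w_i * x_i + b).

def sign(v):
    """Activation: +1 if v >= 0, -1 otherwise (sign(0) = +1 assumed)."""
    return 1 if v >= 0 else -1

def perceptron_output(x, w, b):
    """Weighted sum of the inputs plus bias, passed through sign."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sign(v)
```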

SLIDE 11

Notational Details for the Perceptron

The unit computes

y = f(x; w) = g(w^T x) = sign(w^T x)

where x is the input, w the parameters, and g the activation function (here sign) applied after the summation. The bias is handled by a fixed bias unit x0 = 1, so that w = [w0 w1 w2 …]^T (also written θ). y = f(x) is shorthand for f(x; w).

CONVENTION: the bias is incorporated in the parameter vector; various different variable names are used for parameters, most often we will use w.
NOTATION: a lower-case letter in non-italic font refers to a vector; a capital letter in non-italic font refers to a matrix or vector set.
NOTATION: italic font refers to scalars.
NOTATION: a semicolon separates the input (left) from the parameters (right).

SLIDE 12

Geometrical Interpretation of the State Space

Diagram: the parameters w define a hyperplane in (x1, x2)-space with normal vector (w1, w2) and axis intercepts w0/w1 and w0/w2; w^T x > 0 holds in the positive-sign area, w^T x < 0 in the negative-sign area, and w^T x = 0 on the hyperplane itself.

The basic Perceptron defines a hyperplane w^T x = 0 in x-state space that linearly separates two regions of that space (which corresponds to a two-class linear classification). The hyperplane acts as the decision boundary.

SLIDE 13

Basic Perceptron (Supervised) Learning Rule

  • Idea: whenever the system produces a misclassification with the current weights, adjust the weights by ∆w towards a better-performing weight vector:

∆w = η ( f*(x) − f(x) ) x   if f(x) ≠ f*(x),   and ∆w = 0 otherwise

... where η is the learning rate, f(x) is the actual output, and f*(x) is the ground truth.
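The rule above can be sketched directly (a hedged illustration; `perceptron_update` and its argument names are my own, not from the slides):

```python
# Sketch of the perceptron learning rule: if prediction f and target
# f_star disagree, shift the weights by eta * (f_star - f) * x;
# otherwise leave them unchanged.

def perceptron_update(w, x, f, f_star, eta=0.5):
    if f == f_star:
        return list(w)                        # correctly classified: no change
    return [wi + eta * (f_star - f) * xi      # misclassified: move towards x
            for wi, xi in zip(w, x)]
```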

SLIDE 14

Training a Single-Layer Perceptron

The training loop:

  • compute the output f(x) = sign(w^T x)
  • compare output and ground truth for the training pair (x, f*(x)): is f(x) ≠ f*(x)?
  • if so, adjust each weight: ∆w_i = η ( f*(x) − f(x) ) x_i, and otherwise ∆w_i = 0
  • consider the next (training) input pair and repeat

SLIDE 15

Perceptron Learning Example: OR

Perceptron training attempt of OR using ∆w = η ( f*(x) − f(x) ) x with η = 0.5, sampling some training pairs (x, f*).

OR truth table (±1 encoding):

x1 x2 | f*
 0  0 | −1
 0  1 | +1
 1  0 | +1
 1  1 | +1

The training-trace table lists, for each sampled pair, the input (x0, x1, x2) with bias unit x0 = 1, the current parameters w, the output f, the target f*, and the update ∆w. Starting from w = (0,0,0), the learning progress eventually yields an update of (0,0,0) for every sample: training has converged.

The encoding could be changed to the traditional value 0 by adjusting the output of the sign function to 0; the training algorithm would still be valid.
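The whole training run sketched above can be reproduced with a short script (a sketch under the slide's assumptions: bias input x0 = 1, ±1 labels, η = 0.5; `train` and the other names are illustrative):

```python
# Sketch of perceptron training on OR: repeated sweeps over the samples,
# updating the weights on every mistake, as in the trace table.

def sign(v):
    return 1 if v >= 0 else -1

def train(samples, eta=0.5, epochs=20):
    """Perceptron learning: start at w = (0,0,0), update on each mistake."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, f_star in samples:
            f = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if f != f_star:
                w = [wi + eta * (f_star - f) * xi for wi, xi in zip(w, x)]
    return w

# OR truth table with bias component x0 = 1 and -1/+1 labels
OR = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]

w = train(OR)
# after convergence every OR sample is classified correctly
assert all(sign(sum(wi * xi for wi, xi in zip(w, x))) == t for x, t in OR)
```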

SLIDE 16

Geometrical Interpretation of OR Space Learned

Diagram: the learned weights define a hyperplane with axis intercepts 1 = w0/w1 and 1 = w0/w2; w^T x > 0 holds in the positive-sign area containing the three inputs with class label 1, w^T x < 0 in the negative-sign area containing (0, 0) with class label −1, and w^T x = 0 on the hyperplane itself.

SLIDE 17

Larger Example Visualisation

image source: datasciencelab.wordpress.com

SLIDE 18

Cost Functions

SLIDE 19

Idea: Given a set X of input vectors x of one or more variables and a parameterisation w, a Cost Function is a map J onto a real number representing a cost or loss associated with the input configurations. (Negatively related to ‘goodness of fit’.)

Cost (or Loss) Functions

Expected Loss:   J(X; w) = E_{(x, f*(x)) ~ p̃} [ L( f(x; w), f*(x) ) ]

Empirical Risk:  J(X; w) = (1/|X|) Σ_{x ∈ X} L( f(x; w), f*(x) )

MSE Example:     J(X; w) = (1/|X|) Σ_{x ∈ X} ( f(x; w) − f*(x) )²

where L is the per-example loss function.
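The empirical-risk average can be sketched directly (function names `empirical_risk` and `squared_error` are illustrative, not from the slides):

```python
# Sketch of the empirical risk: average a per-example loss L over the
# set X, here with the squared-error loss of the MSE example.

def squared_error(y, t):
    """Per-example loss L(f(x;w), f*(x)) = (f(x;w) - f*(x))^2."""
    return (y - t) ** 2

def empirical_risk(f, f_star, X, loss=squared_error):
    """J(X; w) = (1/|X|) * sum over x in X of L(f(x; w), f*(x))."""
    return sum(loss(f(x), f_star(x)) for x in X) / len(X)

# toy usage: the model predicts x itself while the target is always 0,
# so the MSE over X = [1, 2] is (1 + 4) / 2 = 2.5
risk = empirical_risk(lambda x: x, lambda x: 0.0, [1.0, 2.0])
```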

SLIDE 20

Energy Landscapes over Parameter Space

Figure: the cost function J plotted over the parameter dimensions of w.

SLIDE 21

Steepest Gradient Descent

SLIDE 22

Idea of ‘Steepest’ Gradient Descent

The parameters are updated by moving along the parameter dimensions of w against the direction of the steepest gradient:

w_{t+1} = w_t − η ∇_w J(X; w_t)

where w_{t+1} is the new weight vector, w_t the old one, η the learning rate, and ∇_w J(X; w_t) the steepest gradient of the cost.

SLIDE 23

The Delta Rule

For an MSE-type cost function with the identity function as the activation function,

J(X; w) = (1 / 2|X|) Σ_{x ∈ X} ( w^T x − f*(x) )²

the weight-vector change is modelled as a move along the steepest descent. The change for a single weight w_k follows from

∂J(X; w) / ∂w_k = (1/|X|) Σ_{x ∈ X} ( w^T x − f*(x) ) x_k

where ( w^T x − f*(x) ) is the error; ... and for a single sample,

∆w_k = η ( f*(x) − w^T x ) x_k

This term looks similar to the Perceptron learning rule and is also known as the Delta Rule (Widrow & Hoff, 1960).
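The single-sample step can be sketched as follows (`delta_rule_step` and its argument names are illustrative; identity activation as above):

```python
# Sketch of the single-sample delta rule:
# w_k <- w_k + eta * (f*(x) - w.x) * x_k, which is one steepest-descent
# step on the squared error (1/2)(w.x - f*(x))^2.

def delta_rule_step(w, x, f_star, eta=0.1):
    error = f_star - sum(wi * xi for wi, xi in zip(w, x))   # f*(x) - w.x
    return [wi + eta * error * xi for wi, xi in zip(w, x)]
```

Repeated steps on the same sample shrink the error geometrically, which is how the rule drives the weights towards the target.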

SLIDE 24

Linear Separability

SLIDE 25

Basic Learning Example: XOR

Perceptron training attempt of XOR using ∆w = η ( f*(x) − f(x) ) x with η = 0.5, sampling some training pairs (x, f*).

Will the learning process ever produce a solution?

XOR truth table (±1 encoding):

x1 x2 | f*
 0  0 | −1
 0  1 | +1
 1  0 | +1
 1  1 | −1

As for OR, the training-trace table lists for each sampled pair the input (x0, x1, x2), the current parameters w, the output f, the target f*, and the update ∆w, starting from w = (0,0,0). Unlike for OR, the updates never all become (0,0,0): the weights keep cycling and the learning progress never terminates.
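The non-convergence can be checked with the same loop that solved OR (a sketch under the same assumptions: η = 0.5, bias input x0 = 1, ±1 labels; all names illustrative): however long we train, at least one XOR sample stays misclassified.

```python
# Sketch: perceptron training never reaches zero errors on XOR,
# because no hyperplane separates the two classes.

def sign(v):
    return 1 if v >= 0 else -1

def predict(w, x):
    return sign(sum(wi * xi for wi, xi in zip(w, x)))

XOR = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], -1)]

w = [0.0, 0.0, 0.0]
for _ in range(100):                      # many epochs of perceptron updates
    for x, t in XOR:
        f = predict(w, x)
        if f != t:
            w = [wi + 0.5 * (t - f) * xi for wi, xi in zip(w, x)]

errors = sum(predict(w, x) != t for x, t in XOR)
assert errors > 0                         # some sample is still misclassified
```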

SLIDE 26

Geometrical Interpretation of XOR Space

NO hyperplane separates the two classes. ⇒ Single-Layer Perceptrons (SLPs) can only learn linearly separable problems.

Diagram: in (x1, x2)-space, the points (0, 1) and (1, 0) carry class label 1 while (0, 0) and (1, 1) carry class label −1; no single hyperplane can place the first pair in the positive-sign area (w^T x > 0) and the second pair in the negative-sign area (w^T x < 0).
SLIDE 27

Encoding Arbitrary Decision Boundaries

  • Idea: use a Multi-Layer Perceptron (MLP) with non-linear activation functions

Diagram: example of a hyper-surface that separates the two classes, with f(x) > 0 in the positive-sign area containing (0, 1) and (1, 0) (class label 1) and f(x) < 0 in the negative-sign area containing (0, 0) and (1, 1) (class label −1).
SLIDE 28

Multi-Layer Architectures

SLIDE 29

Structure and Notation for Deep Architectures

The network output is the composition of its layers:

f = f_N( … f_2( f_1( input; W^1 ); W^2 ) … ; W^N ) = g_N( (W^N)^T … g_2( (W^2)^T g_1( (W^1)^T input ) ) … )

with all parameters W = { W^1, W^2, …, W^N }.

Diagram: the input layer f0 feeds the first hidden layer f1 (of network width d) via weights w^1, further hidden layers f2, … via w^2, …, up to the output layer f_N via w^N; the number of layers is the network depth N. A weight w_ij^l connects the i-th neuron in layer l−1 to the j-th neuron in layer l.

NOTATION: a superscript usually refers to the layer number, a subscript to the position in the layer.
NOTATION: bold math-script is used for tensors of order above 2 (basically 3D arrays or higher).
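The layered composition can be sketched in plain Python (illustrative only; tanh stands in for a generic non-linear g, and the names `layer`, `forward`, `Ws` are my own):

```python
# Sketch of the composition f = g_N((W^N)^T ... g_2((W^2)^T g_1((W^1)^T x)) ...):
# each layer multiplies by a weight matrix and applies g elementwise.

import math

def g(v):
    """An example non-linear activation (tanh)."""
    return math.tanh(v)

def layer(W, x):
    """One layer: weight matrix W (one row per output unit), then g."""
    return [g(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in W]

def forward(Ws, x):
    """Compose the layers: the output of layer l-1 is the input of layer l."""
    for W in Ws:
        x = layer(W, x)
    return x

# a depth-2 example: 2 inputs -> 2 hidden units -> 1 output
Ws = [[[1.0, -1.0], [-1.0, 1.0]], [[1.0, 1.0]]]
out = forward(Ws, [0.5, -0.5])
```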

SLIDE 30

Outlook: Learning Representations

SLIDE 31

Representational Power of Feedforward Networks

  • The basic Perceptron represents a linear classifier.
  • Boolean functions can be represented by layered networks with one hidden layer (networks may be very wide, requiring an exponential number of hidden neurons compared to the input).
  • Layered networks with one hidden layer can also represent any continuous function [Cybenko 1989; Hornik et al. 1989].
  • Layered networks with two hidden layers can represent any mathematical function [Cybenko 1988].

⇒ long-standing optimism about the potential of neural networks to model learning and intelligent systems
⇒ question arises: why use more than two hidden layers – why is ‘deep’ advantageous at all? (see Lecture 4)

SLIDE 32

Deep Composition

source: Ian Goodfellow, www.deeplearningbook.org

SLIDE 33

The Concept of Deep Representation Learning

source: Ian Goodfellow, www.deeplearningbook.org

SLIDE 34

“It is only after much hesitation that the writer has reconciled himself to the addition of the term "neurodynamics" to the list of such recent linguistic artifacts as "cybernetics", "bionics", "autonomics", "biomimesis", "synnoetics", "intelectronics", and "robotics". It is hoped that by selecting a term which more clearly delimits our realm of interest and indicates its relationship to traditional academic disciplines, the underlying motivation of the perceptron program may be more successfully communicated.”

-- Frank Rosenblatt, from “Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms”, Spartan Books, 1962

source: www.lmtech.info

SLIDE 35

Next Time: Towards Training Deep Architectures

  • Computational Graphs
  • Reverse Auto-Differentiation