SLIDE 1

Department of Computer Science University of Bristol

COMSM0045 – Applied Deep Learning 2020/21

comsm0045-applied-deep-learning.github.io | Lecture 01

BASICS OF ARTIFICIAL NEURAL NETWORKS

Tilo Burghardt | tilo@cs.bris.ac.uk

SLIDE 2

Applied Deep Learning | University of Bristol Lecture 1 | 2

Agenda for Lecture 1

  • Neurons and their Structure
  • Single- and Multi-Layer Perceptrons
  • Basics of Cost Functions
  • Gradient Descent and the Delta Rule
  • Notation and Structure of Deep Feed-Forward Networks

SLIDE 3

Biological Inspiration

SLIDE 4

Golgi’s first Drawings of Neurons

CAMILLO GOLGI (image source: www.the-scientist.com)

Computation in biological neural networks emerges from the co-operation of individual computational components: neuron cells.

SLIDE 5

Schematic Model of a Neuron

Labelled parts: nucleus, cell body, dendrites, axon, myelin sheath, axon terminals, synapse. Main flow of information: feed-forward.

SLIDE 6

Pavlov and Assistant Conditioning a Dog

image source: www.psysci.co

An environment can condition the behaviour of biological neural networks, leading to the incorporation of new information.

SLIDE 7

Neuro-Plasticity

  • plasticity refers to a system’s ability to adapt its structure and/or behaviour to accommodate new information
  • the brain shows various forms of plasticity: natural forms include synaptic plasticity (mainly chemical), structural sprouting (growth), rerouting (functional changes), and neurogenesis (new neurons)

image source: www.cognifit.com

Example of structural sprouting (temporal system evolution).

SLIDE 8

Artificial Feed-forward Networks

SLIDE 9

Frank Rosenblatt (left) during the development of the Perceptron (1950s)

image source: csis.pace.edu

SLIDE 10

Simplification of a Neuron to a Computational Unit

Diagram: the inputs x1, x2, x3, … (the input width) are multiplied with the weights w1, w2, w3, …, summed together with a bias b, and passed through the activation function to produce the output y. Flow of information: feed-forward.

The unit's output is defined as

y := sign( Σ_i w_i x_i + b ),   where sign(v) = +1 if v ≥ 0 and −1 otherwise.
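This unit can be sketched in a few lines of Python (an illustration only; the names `perceptron_output`, `sign` and the sign(0) = +1 convention are assumptions, not from the slides):

```python
# Minimal sketch of the computational unit above:
# y = sign(sum_i w_i * x_i + b).

def sign(v):
    """Activation: +1 if v >= 0, -1 otherwise (sign(0) = +1 assumed)."""
    return 1 if v >= 0 else -1

def perceptron_output(x, w, b):
    """Weighted sum of the inputs plus bias, passed through sign."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sign(v)
```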

SLIDE 11

Notational Details for the Perceptron

The unit computes

y = f(x; w) = g(w^T x) = sign(w^T x)

where x is the input, w the parameters, and g the activation function (here sign) applied after the summation. The bias is handled by a fixed bias unit x0 = 1, so that w = [w0 w1 w2 …]^T (also written θ). y = f(x) is shorthand for f(x; w).

CONVENTION: the bias is incorporated in the parameter vector; various different variable names are used for parameters, most often we will use w.
NOTATION: a lower-case letter in non-italic font refers to a vector; a capital letter in non-italic font refers to a matrix or vector set.
NOTATION: italic font refers to scalars.
NOTATION: a semicolon separates the input (left) from the parameters (right).

SLIDE 12

Geometrical Interpretation of the State Space

Diagram: the parameters w define a hyperplane in (x1, x2)-space with normal vector (w1, w2) and axis intercepts w0/w1 and w0/w2; w^T x > 0 holds in the positive-sign area, w^T x < 0 in the negative-sign area, and w^T x = 0 on the hyperplane itself.

The basic Perceptron defines a hyperplane w^T x = 0 in x-state space that linearly separates two regions of that space (which corresponds to a two-class linear classification). The hyperplane acts as the decision boundary.

SLIDE 13

Basic Perceptron (Supervised) Learning Rule

  • Idea: whenever the system produces a misclassification with the current weights, adjust the weights by ∆w towards a better-performing weight vector:

∆w = η ( f*(x) − f(x) ) x   if f(x) ≠ f*(x),   and ∆w = 0 otherwise

... where η is the learning rate, f(x) is the actual output, and f*(x) is the ground truth.
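The rule above can be sketched directly (a hedged illustration; `perceptron_update` and its argument names are my own, not from the slides):

```python
# Sketch of the perceptron learning rule: if prediction f and target
# f_star disagree, shift the weights by eta * (f_star - f) * x;
# otherwise leave them unchanged.

def perceptron_update(w, x, f, f_star, eta=0.5):
    if f == f_star:
        return list(w)                        # correctly classified: no change
    return [wi + eta * (f_star - f) * xi      # misclassified: move towards x
            for wi, xi in zip(w, x)]
```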

SLIDE 14

Training a Single-Layer Perceptron

The training loop:

  • compute the output f(x) = sign(w^T x)
  • compare output and ground truth for the training pair (x, f*(x)): is f(x) ≠ f*(x)?
  • if so, adjust each weight: ∆w_i = η ( f*(x) − f(x) ) x_i, and otherwise ∆w_i = 0
  • consider the next (training) input pair and repeat

SLIDE 15

Perceptron Learning Example: OR

Perceptron training attempt of OR using ∆w = η ( f*(x) − f(x) ) x with η = 0.5, sampling some training pairs (x, f*).

OR truth table (±1 encoding):

x1 x2 | f*
 0  0 | −1
 0  1 | +1
 1  0 | +1
 1  1 | +1

The training-trace table lists, for each sampled pair, the input (x0, x1, x2) with bias unit x0 = 1, the current parameters w, the output f, the target f*, and the update ∆w. Starting from w = (0,0,0), the learning progress eventually yields an update of (0,0,0) for every sample: training has converged.

The encoding could be changed to the traditional value 0 by adjusting the output of the sign function to 0; the training algorithm would still be valid.
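The whole training run sketched above can be reproduced with a short script (a sketch under the slide's assumptions: bias input x0 = 1, ±1 labels, η = 0.5; `train` and the other names are illustrative):

```python
# Sketch of perceptron training on OR: repeated sweeps over the samples,
# updating the weights on every mistake, as in the trace table.

def sign(v):
    return 1 if v >= 0 else -1

def train(samples, eta=0.5, epochs=20):
    """Perceptron learning: start at w = (0,0,0), update on each mistake."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, f_star in samples:
            f = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if f != f_star:
                w = [wi + eta * (f_star - f) * xi for wi, xi in zip(w, x)]
    return w

# OR truth table with bias component x0 = 1 and -1/+1 labels
OR = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]

w = train(OR)
# after convergence every OR sample is classified correctly
assert all(sign(sum(wi * xi for wi, xi in zip(w, x))) == t for x, t in OR)
```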

SLIDE 16

Geometrical Interpretation of OR Space Learned

Diagram: the learned weights define a hyperplane with axis intercepts 1 = w0/w1 and 1 = w0/w2; w^T x > 0 holds in the positive-sign area containing the three inputs with class label 1, w^T x < 0 in the negative-sign area containing (0, 0) with class label −1, and w^T x = 0 on the hyperplane itself.

SLIDE 17

Larger Example Visualisation

image source: datasciencelab.wordpress.com

SLIDE 18

Cost Functions

SLIDE 19

Idea: Given a set X of input vectors x of one or more variables and a parameterisation w, a Cost Function is a map J onto a real number representing a cost or loss associated with the input configurations. (Negatively related to ‘goodness of fit’.)

Cost (or Loss) Functions

Expected Loss:   J(X; w) = E_{(x, f*(x)) ~ p̃} [ L( f(x; w), f*(x) ) ]

Empirical Risk:  J(X; w) = (1/|X|) Σ_{x ∈ X} L( f(x; w), f*(x) )

MSE Example:     J(X; w) = (1/|X|) Σ_{x ∈ X} ( f(x; w) − f*(x) )²

where L is the per-example loss function.
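The empirical-risk average can be sketched directly (function names `empirical_risk` and `squared_error` are illustrative, not from the slides):

```python
# Sketch of the empirical risk: average a per-example loss L over the
# set X, here with the squared-error loss of the MSE example.

def squared_error(y, t):
    """Per-example loss L(f(x;w), f*(x)) = (f(x;w) - f*(x))^2."""
    return (y - t) ** 2

def empirical_risk(f, f_star, X, loss=squared_error):
    """J(X; w) = (1/|X|) * sum over x in X of L(f(x; w), f*(x))."""
    return sum(loss(f(x), f_star(x)) for x in X) / len(X)

# toy usage: the model predicts x itself while the target is always 0,
# so the MSE over X = [1, 2] is (1 + 4) / 2 = 2.5
risk = empirical_risk(lambda x: x, lambda x: 0.0, [1.0, 2.0])
```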

SLIDE 20

Energy Landscapes over Parameter Space

Figure: the cost function J plotted over the parameter dimensions of w.

SLIDE 21

Steepest Gradient Descent

SLIDE 22

Idea of ‘Steepest’ Gradient Descent

The parameters are updated by moving along the parameter dimensions of w against the direction of the steepest gradient:

w_{t+1} = w_t − η ∇_w J(X; w_t)

where w_{t+1} is the new weight vector, w_t the old one, η the learning rate, and ∇_w J(X; w_t) the steepest gradient of the cost.

SLIDE 23

The Delta Rule

For an MSE-type cost function with the identity function as the activation function,

J(X; w) = (1 / 2|X|) Σ_{x ∈ X} ( w^T x − f*(x) )²

the weight-vector change is modelled as a move along the steepest descent. The change for a single weight w_k follows from

∂J(X; w) / ∂w_k = (1/|X|) Σ_{x ∈ X} ( w^T x − f*(x) ) x_k

where ( w^T x − f*(x) ) is the error; ... and for a single sample,

∆w_k = η ( f*(x) − w^T x ) x_k

This term looks similar to the Perceptron learning rule and is also known as the Delta Rule (Widrow & Hoff, 1960).
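The single-sample step can be sketched as follows (`delta_rule_step` and its argument names are illustrative; identity activation as above):

```python
# Sketch of the single-sample delta rule:
# w_k <- w_k + eta * (f*(x) - w.x) * x_k, which is one steepest-descent
# step on the squared error (1/2)(w.x - f*(x))^2.

def delta_rule_step(w, x, f_star, eta=0.1):
    error = f_star - sum(wi * xi for wi, xi in zip(w, x))   # f*(x) - w.x
    return [wi + eta * error * xi for wi, xi in zip(w, x)]
```

Repeated steps on the same sample shrink the error geometrically, which is how the rule drives the weights towards the target.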

SLIDE 24

Linear Separability

SLIDE 25

Basic Learning Example: XOR

Perceptron training attempt of XOR using ∆w = η ( f*(x) − f(x) ) x with η = 0.5, sampling some training pairs (x, f*).

Will the learning process ever produce a solution?

XOR truth table (±1 encoding):

x1 x2 | f*
 0  0 | −1
 0  1 | +1
 1  0 | +1
 1  1 | −1

As for OR, the training-trace table lists for each sampled pair the input (x0, x1, x2), the current parameters w, the output f, the target f*, and the update ∆w, starting from w = (0,0,0). Unlike for OR, the updates never all become (0,0,0): the weights keep cycling and the learning progress never terminates.
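The non-convergence can be checked with the same loop that solved OR (a sketch under the same assumptions: η = 0.5, bias input x0 = 1, ±1 labels; all names illustrative): however long we train, at least one XOR sample stays misclassified.

```python
# Sketch: perceptron training never reaches zero errors on XOR,
# because no hyperplane separates the two classes.

def sign(v):
    return 1 if v >= 0 else -1

def predict(w, x):
    return sign(sum(wi * xi for wi, xi in zip(w, x)))

XOR = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], -1)]

w = [0.0, 0.0, 0.0]
for _ in range(100):                      # many epochs of perceptron updates
    for x, t in XOR:
        f = predict(w, x)
        if f != t:
            w = [wi + 0.5 * (t - f) * xi for wi, xi in zip(w, x)]

errors = sum(predict(w, x) != t for x, t in XOR)
assert errors > 0                         # some sample is still misclassified
```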

SLIDE 26

Geometrical Interpretation of XOR Space

NO hyperplane separates the two classes. ⇒ Single-Layer Perceptrons (SLPs) can only learn linearly separable problems.

Diagram: in (x1, x2)-space, the points (0, 1) and (1, 0) carry class label 1 while (0, 0) and (1, 1) carry class label −1; no single hyperplane can place the first pair in the positive-sign area (w^T x > 0) and the second pair in the negative-sign area (w^T x < 0).
SLIDE 27

Encoding Arbitrary Decision Boundaries

  • Idea: use a Multi-Layer Perceptron (MLP) with non-linear activation functions

Diagram: example of a hyper-surface that separates the two classes, with f(x) > 0 in the positive-sign area containing (0, 1) and (1, 0) (class label 1) and f(x) < 0 in the negative-sign area containing (0, 0) and (1, 1) (class label −1).
SLIDE 28

Multi-Layer Architectures

SLIDE 29

Structure and Notation for Deep Architectures

The network output is the composition of its layers:

f = f_N( … f_2( f_1( input; W^1 ); W^2 ) … ; W^N ) = g_N( (W^N)^T … g_2( (W^2)^T g_1( (W^1)^T input ) ) … )

with all parameters W = { W^1, W^2, …, W^N }.

Diagram: the input layer f0 feeds the first hidden layer f1 (of network width d) via weights w^1, further hidden layers f2, … via w^2, …, up to the output layer f_N via w^N; the number of layers is the network depth N. A weight w_ij^l connects the i-th neuron in layer l−1 to the j-th neuron in layer l.

NOTATION: a superscript usually refers to the layer number, a subscript to the position in the layer.
NOTATION: bold math-script is used for tensors of order above 2 (basically 3D arrays or higher).
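The layered composition can be sketched in plain Python (illustrative only; tanh stands in for a generic non-linear g, and the names `layer`, `forward`, `Ws` are my own):

```python
# Sketch of the composition f = g_N((W^N)^T ... g_2((W^2)^T g_1((W^1)^T x)) ...):
# each layer multiplies by a weight matrix and applies g elementwise.

import math

def g(v):
    """An example non-linear activation (tanh)."""
    return math.tanh(v)

def layer(W, x):
    """One layer: weight matrix W (one row per output unit), then g."""
    return [g(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in W]

def forward(Ws, x):
    """Compose the layers: the output of layer l-1 is the input of layer l."""
    for W in Ws:
        x = layer(W, x)
    return x

# a depth-2 example: 2 inputs -> 2 hidden units -> 1 output
Ws = [[[1.0, -1.0], [-1.0, 1.0]], [[1.0, 1.0]]]
out = forward(Ws, [0.5, -0.5])
```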

SLIDE 30

Outlook: Learning Representations

SLIDE 31

Representational Power of Feedforward Networks

  • The basic Perceptron represents a linear classifier.
  • Boolean functions can be represented by layered networks with one hidden layer (networks may be very wide, requiring an exponential number of hidden neurons compared to the input).
  • Layered networks with one hidden layer can also represent any continuous function [Cybenko 1989; Hornik et al. 1989].
  • Layered networks with two hidden layers can represent any mathematical function [Cybenko 1988].

⇒ long-standing optimism about the potential of neural networks to model learning and intelligent systems
⇒ question arises: why use more than two hidden layers – why is ‘deep’ advantageous at all? (see Lecture 4)

SLIDE 32

Deep Composition

source: Ian Goodfellow, www.deeplearningbook.org

SLIDE 33

The Concept of Deep Representation Learning

source: Ian Goodfellow, www.deeplearningbook.org

SLIDE 34

“It is only after much hesitation that the writer has reconciled himself to the addition of the term "neurodynamics" to the list of such recent linguistic artifacts as "cybernetics", "bionics", "autonomics", "biomimesis", "synnoetics", "intelectronics", and "robotics". It is hoped that by selecting a term which more clearly delimits our realm of interest and indicates its relationship to traditional academic disciplines, the underlying motivation of the perceptron program may be more successfully communicated.”

-- Frank Rosenblatt, from “Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms”, Spartan Books, 1962

source: www.lmtech.info

SLIDE 35

Next Time: Towards Training Deep Architectures

  • Computational Graphs
  • Reverse Auto-Differentiation