Neural Networks
CE417: Introduction to Artificial Intelligence
Sharif University of Technology, Fall 2019
Soleymani
Some slides are based on Anca Dragan's slides, CS188, UC Berkeley, and some are adapted from Bhiksha Raj, 11-785, CMU 2019.
[Figure, repeated across slides: a perceptron unit with inputs x1, …, xd, weights w1, …, wd, and a threshold θ]
} A step function across a hyperplane
[Figure: the same perceptron where an extra input x_{d+1} = 1 with weight w_{d+1} absorbs the threshold as a bias weight]
} Restating:
$$z = \begin{cases} 1 & \text{if } \sum_{i=1}^{d+1} w_i x_i \ge 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{where } x_{d+1} = 1$$
} Note that the boundary $\sum_{i=1}^{d+1} w_i x_i = 0$ is a hyperplane.
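As a minimal sketch (the function name is ours, not from the slides), the thresholded unit with the bias folded in as the extra weight $w_{d+1}$ on a constant input 1:

```python
import numpy as np

def perceptron_predict(w, x):
    """Step activation across the hyperplane w . [x, 1] = 0.

    w: weight vector of length d+1 (last entry is the bias weight w_{d+1})
    x: input vector of length d
    """
    x_ext = np.append(x, 1.0)            # extended input with x_{d+1} = 1
    return 1 if np.dot(w, x_ext) >= 0 else 0
```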
} The perceptron learning algorithm finds a hyperplane $\sum_{i=1}^{d+1} w_i x_i = 0$ that perfectly separates the two groups of points, provided the data are linearly separable.
} Perceptron: if $\mathrm{sign}(w \cdot x^{(n)}) \ne y^{(n)}$, then update $w \leftarrow w + y^{(n)} x^{(n)}$
} ADALINE: $w \leftarrow w + \eta \big(y^{(n)} - w \cdot x^{(n)}\big) x^{(n)}$
    } known as the Widrow-Hoff, LMS, or delta rule
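To make the two update rules concrete, an illustrative sketch assuming labels $y \in \{-1, +1\}$ for the perceptron and a linear (pre-threshold) output for the delta rule; the helper names and learning rate are our choices:

```python
import numpy as np

def perceptron_update(w, x_ext, y):
    """Perceptron rule: update only on a misclassification (y in {-1, +1})."""
    if np.sign(np.dot(w, x_ext)) != y:
        w = w + y * x_ext
    return w

def delta_rule_update(w, x_ext, y, eta=0.1):
    """ADALINE / Widrow-Hoff / LMS rule: step against the squared-error gradient."""
    error = y - np.dot(w, x_ext)         # error on the linear output, before thresholding
    return w + eta * error * x_ext
```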
} Delta-rule (gradient descent) training is guaranteed to converge for a sufficiently small learning rate
} It is guaranteed to converge to the hypothesis with the minimum squared error, even when the data are not linearly separable
} It can also be used for regression problems
[Figure: a unit that computes a weighted sum of its inputs and passes it through a soft, sigmoid-like activation]
} They have nice derivatives.
[Figure: common soft activation functions]
} tanh: $\tanh(z)$
} softplus: $\log(1 + e^{z})$
} sigmoid: $\dfrac{1}{1 + \exp(-z)}$
} We will hear more about activations later
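A small sketch of these activations and their derivatives in NumPy (the derivative identities are standard; the helper names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # derivative expressed via the function itself

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2      # likewise for tanh

def softplus(z):
    return np.log1p(np.exp(z))        # log(1 + e^z); its derivative is the sigmoid
```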
} Maximum likelihood estimation:
$$\max_w \; \ell\ell(w) = \sum_i \log P\big(y^{(i)} \mid x^{(i)}; w\big)$$
with:
$$P\big(y^{(i)} = +1 \mid x^{(i)}; w\big) = \frac{1}{1 + e^{-w \cdot x^{(i)}}}$$
} Multi-class linear classification
} A weight vector for each class: $w_y$
} Score (activation) of a class $y$: $w_y \cdot x$
} Prediction w/ highest score wins: $y^* = \arg\max_y w_y \cdot x$
} How to make the scores into probabilities?
} Softmax turns the scores $z_y = w_y \cdot x$ into probabilities:
$$P(y \mid x; w) = \frac{e^{w_y \cdot x}}{\sum_{y'} e^{w_{y'} \cdot x}}$$
} Maximum likelihood estimation:
$$\max_w \; \ell\ell(w) = \sum_i \log P\big(y^{(i)} \mid x^{(i)}; w\big)$$
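A sketch of the softmax step; the max-subtraction is a standard numerical-stability trick, not something the slides discuss:

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores z_y = w_y . x into probabilities."""
    z = z - np.max(z)                 # stabilizer: does not change the result
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # e.g. z_1, z_2, z_3
print(softmax(scores))                # non-negative, sums to 1
```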
} Batch gradient ascent on the log likelihood $\ell\ell(w) = \sum_i \log P\big(y^{(i)} \mid x^{(i)}; w\big)$:
} init $w$
} for iter = 1, 2, …
$$w \leftarrow w + \alpha \sum_i \nabla_w \log P\big(y^{(i)} \mid x^{(i)}; w\big)$$
} Stochastic gradient ascent on the log likelihood $\ell\ell(w) = \sum_{i=1}^{N} \log P\big(y^{(i)} \mid x^{(i)}; w\big)$:
} init $w$
} for iter = 1, 2, …
    } pick random $j$
$$w \leftarrow w + \alpha \, \nabla_w \log P\big(y^{(j)} \mid x^{(j)}; w\big)$$
} Mini-batch gradient ascent on the log likelihood:
} init $w$
} for iter = 1, 2, …
    } pick random subset of training examples $J$
$$w \leftarrow w + \alpha \sum_{j \in J} \nabla_w \log P\big(y^{(j)} \mid x^{(j)}; w\big)$$
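The three variants differ only in how many examples feed each update. A schematic sketch, where grad_ll is a hypothetical stand-in for $\nabla_w \log P(y \mid x; w)$ (its form depends on the model):

```python
import numpy as np

def train(X, Y, grad_ll, alpha=0.1, iters=100, batch_size=None):
    """Gradient ascent on the log likelihood.

    batch_size=None -> batch (all examples per step)
    batch_size=1    -> stochastic
    batch_size=k>1  -> mini-batch
    """
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(iters):
        J = np.arange(n) if batch_size is None else rng.choice(n, size=batch_size)
        g = sum(grad_ll(w, X[j], Y[j]) for j in J)   # sum of per-example gradients
        w = w + alpha * g                            # ascend the log likelihood
    return w
```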
} A simple function such as XOR cannot be modeled with a single linear unit.
} More layers of linear units do not help: the composition is still linear.
} Fixed output non-linearities are not enough.
} The hidden layer can be seen as learning the features
} A large number of neurons
    } brings a danger of overfitting
    } (hence early stopping!)
Cybenko (1989), "Approximation by Superpositions of a Sigmoidal Function"
Hornik (1991), "Approximation Capabilities of Multilayer Feedforward Networks"
Leshno and Schocken (1991), "Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function"
} A feed-forward network with a single hidden layer can approximate any continuous function
    } under mild assumptions on the activation function
    } e.g., sigmoid activation functions (Cybenko, 1989)
    } when a sufficiently large (but finite) number of hidden units is used
} The construction of such a network requires the nonlinear activation function
} Two-layer MLP (the number of layers of adaptive weights is counted):
$$f_k(x) = \sigma\left(\sum_j w_{kj}^{[2]} \, a_j\right) \;\Rightarrow\; f_k(x) = \sigma\left(\sum_j w_{kj}^{[2]} \, \sigma\left(\sum_{i=0}^{d} w_{ji}^{[1]} x_i\right)\right)$$
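The two-layer formula transcribed into NumPy; the shapes and the bias handling (appending a constant 1 instead of the $x_0 = 1$ convention) are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp2_forward(W1, W2, x):
    """f_k(x) = sigma( sum_j W2[k,j] * sigma( sum_i W1[j,i] * x_i ) ).

    W1: (M, d+1) first-layer weights, W2: (K, M+1) second-layer weights;
    a constant 1 is appended to supply the bias terms.
    """
    h = sigmoid(W1 @ np.append(x, 1.0))      # hidden activations, shape (M,)
    return sigmoid(W2 @ np.append(h, 1.0))   # outputs, shape (K,)
```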
[Figures, across several slides: perceptrons implementing Boolean gates (AND, OR, NOT) over inputs X and output Y, with appropriate weights and thresholds, e.g. a bias of −0.5]
[Figure, repeated across several slides: a truth table over inputs X1, …, X5 with output Y, built up row by row into a network]
} Any Boolean function can be read off its truth table: one hidden unit computes the AND of the literals for each row where Y = 1, and an OR output unit combines them (the disjunctive normal form), so a one-hidden-layer MLP can represent any Boolean function.
[Figure: MNIST digits as inputs in a 784-dimensional space]
[Figure: the four Boolean input points (0,0), (0,1), (1,0), (1,1), shown across three panels]
[Figure: a threshold unit over inputs x1, x2 defining a linear decision boundary in the plane]
Can now be composed into "networks" to compute arbitrary classification "boundaries".
[Figures, across several slides: adding one linear boundary at a time over x1, x2 to build up a closed (e.g., pentagonal) decision region]
} AND: with five half-plane units $z_1, \dots, z_5$, the output unit fires when
$$\sum_{i=1}^{5} z_i \ge 4.5$$
i.e., only inside the intersection of all five half-planes.
[Figures: cascading AND and OR units over x1, x2 to combine such regions]
} MLP with unit-step activation function: decision regions found by an output unit
    } Single layer (no hidden layer): half space (region found by a hyperplane)
    } Two layers (one hidden layer): polyhedral (open or closed) region (intersection of half spaces)
    } Three layers (two hidden layers): arbitrary regions (union of polyhedra)
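A sketch of the "intersection of half spaces" row of the table: five step units each test a half-plane, and an output unit fires only when all five agree ($\sum z_i \ge 4.5$, matching the earlier slide); the particular half-planes are invented for illustration:

```python
import numpy as np

def step(v):
    return (v >= 0).astype(float)

# Five hypothetical half-planes (columns: w1, w2, bias) bounding a pentagon-like region.
H = np.array([[ 1.0,  0.0, 1.0],
              [-1.0,  0.0, 1.0],
              [ 0.0,  1.0, 1.0],
              [ 0.0, -1.0, 1.0],
              [ 1.0,  1.0, 1.5]])

def in_region(x1, x2):
    z = step(H @ np.array([x1, x2, 1.0]))  # one 0/1 output per half-plane unit
    return z.sum() >= 4.5                  # AND: all five units must fire

print(in_region(0.0, 0.0))   # True: the origin satisfies every half-plane
print(in_region(3.0, 0.0))   # False: violates the first constraint
```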
[Figures: combining two threshold units with weights +1 and −1 and thresholds T1, T2 to produce a pulse that is 1 only between T1 and T2, a building block for approximating arbitrary functions]
} A single hidden layer suffices in principle
} But it could be exponentially or even infinitely wide in its input size
[Figure: a network with outputs f1(x), f2(x), f3(x), …, fK(x) passed through a softmax layer]
$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}, \quad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}, \quad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$
[Figure, repeated: a deep network with inputs x1, …, xL, hidden layers z^{(1)}, z^{(2)}, …, z^{(n−1)}, z^{(n)} (layer k having K^{(k)} units), output units z^{(OUT)}, and weights w_{i,j} from unit i of one layer to unit j of the next; the outputs feed the softmax above]
[source: MIT 6.S191 introtodeeplearning.com]
} We consider a neural network as a parametric function $f(x; W)$
} Training finds the parameters $W$ that minimize the total loss $\sum_{n=1}^{N} \mathrm{Loss}\big(f(x^{(n)}; W), y^{(n)}\big)$ over the training set
} SSE:
$$E = \sum_{n=1}^{N} \sum_{k=1}^{K} \left( f_k^{(n)} - y_k^{(n)} \right)^2$$
} Cross-entropy:
$$E = -\sum_{n=1}^{N} \sum_{k=1}^{K} y_k^{(n)} \log f_k^{(n)}$$
} We need an efficient way of adapting all the weights, not just those of the output layer
} Learning the weights going into hidden units is equivalent to learning features
} This is difficult because nobody is telling us directly what the hidden units should do
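Both losses as code, assuming predictions f and targets y are arrays of shape (N, K); the eps guard is our addition:

```python
import numpy as np

def sse(f, y):
    """Sum-of-squared-errors over all examples and output units."""
    return np.sum((f - y) ** 2)

def cross_entropy(f, y, eps=1e-12):
    """-sum_n sum_k y log f; eps guards against log(0)."""
    return -np.sum(y * np.log(f + eps))
```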
} Start from random weights and then adjust them iteratively to get lower cost.
} Update the weights according to the gradient of the cost function
} Which changes to the weights improve the cost the most?
} The magnitude of each element of $\nabla E$ shows how sensitive the cost is to the corresponding weight
} Back-propagation: the training algorithm used to adjust the weights in multi-layer networks
} The back-propagation algorithm is based on gradient descent
} It uses the chain rule, reusing intermediate results to compute gradients efficiently
Total training error:
$$E = \sum_{n=1}^{N} \mathrm{Loss}\big(f^{(n)}, y^{(n)}\big)$$
} Gradient descent algorithm
} Initialize all weights and biases $w_{ij}^{[l]}$
    } Using the extended notation: the bias is also a weight
} Do:
    } For every layer $l$, for all $i, j$, update:
    $$w_{ij}^{[l]} \leftarrow w_{ij}^{[l]} - \eta \, \frac{\partial E}{\partial w_{ij}^{[l]}}$$
} Until $E$ has converged
Total derivative over the training set:
$$\frac{\partial E}{\partial w_{ij}^{[l]}} = \sum_{n=1}^{N} \frac{\partial \, \mathrm{Loss}\big(f^{(n)}, y^{(n)}\big)}{\partial w_{ij}^{[l]}}$$
} Initialize all weights $w_{ij}^{[l]}$
} Do:
    } For all $l, i, j$, initialize $\frac{\partial E}{\partial w_{ij}^{[l]}} = 0$
    } For all $n = 1 : N$
        } For every layer $l$, for all $i, j$:
            } Compute $\frac{\partial \, \mathrm{Loss}(f^{(n)}, y^{(n)})}{\partial w_{ij}^{[l]}}$
            } Accumulate $\frac{\partial E}{\partial w_{ij}^{[l]}} \mathrel{+}= \frac{\partial \, \mathrm{Loss}(f^{(n)}, y^{(n)})}{\partial w_{ij}^{[l]}}$
    } For every layer $l$, for all $i, j$:
    $$w_{ij}^{[l]} \leftarrow w_{ij}^{[l]} - \frac{\eta}{N} \, \frac{\partial E}{\partial w_{ij}^{[l]}}$$
} Until $E$ has converged
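The accumulation loop above as a sketch; loss_grad(w, x, y) is a hypothetical stand-in for the per-example derivative that back-propagation (next slides) provides:

```python
import numpy as np

def batch_gradient_descent(w, X, Y, loss_grad, eta=0.1, epochs=100):
    """Accumulate per-example gradients, then take one averaged step per epoch."""
    N = len(X)
    for _ in range(epochs):
        dE = np.zeros_like(w)               # dE/dw initialized to 0
        for n in range(N):
            dE += loss_grad(w, X[n], Y[n])  # per-example contribution
        w = w - (eta / N) * dE              # averaged gradient step
    return w
```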
} But the neural net $f$ is never one of those simple functions?
} No problem. Chain rule: if $f(x) = g(h(x))$, then $f'(x) = g'(h(x)) \, h'(x)$
} Forward propagation through layers $l = 1, \dots, L$:
$$a^{[l]} = W^{[l]} z^{[l-1]}, \qquad z^{[l]} = f\big(a^{[l]}\big), \qquad z^{[0]} = x, \qquad z^{[L]} = \text{output}$$
For convenience, we use the same activation function for all layers. However, output-layer neurons most commonly need no activation function (they give class scores or real-valued targets).
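A sketch of this forward pass that also caches every $a^{[l]}$ and $z^{[l]}$, since the backward pass will need them (Ws is a list of weight matrices; biases are omitted for brevity):

```python
import numpy as np

def forward(Ws, x, f):
    """Forward propagation: a[l] = W[l] z[l-1], z[l] = f(a[l]); z[0] = x."""
    zs, As = [x], []
    for W in Ws:
        a = W @ zs[-1]
        As.append(a)       # pre-activations, needed for f'(a) in the backward pass
        zs.append(f(a))    # layer outputs
    return zs, As          # zs[-1] is the network output
```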
} Output layer ($l = L$):
$$z_j^{[L]} = f\big(a_j^{[L]}\big), \qquad a_j^{[L]} = \sum_i w_{ij}^{[L]} z_i^{[L-1]}$$
} Squared-error loss:
$$\mathrm{Loss} = \sum_j \big(z_j^{[L]} - y_j\big)^2$$
} By the chain rule:
$$\frac{\partial\,\mathrm{Loss}}{\partial a_j^{[L]}} = \frac{\partial\,\mathrm{Loss}}{\partial z_j^{[L]}} \, f'\big(a_j^{[L]}\big)$$
$$\frac{\partial\,\mathrm{Loss}}{\partial w_{ij}^{[L]}} = \frac{\partial\,\mathrm{Loss}}{\partial a_j^{[L]}} \, z_i^{[L-1]} = \frac{\partial\,\mathrm{Loss}}{\partial z_j^{[L]}} \, f'\big(a_j^{[L]}\big) \, z_i^{[L-1]}$$
} For a general layer $l$, the same chain rule gives:
$$\frac{\partial\,\mathrm{Loss}}{\partial w_{ij}^{[l]}} = \frac{\partial\,\mathrm{Loss}}{\partial z_j^{[l]}} \times \frac{\partial z_j^{[l]}}{\partial a_j^{[l]}} \times \frac{\partial a_j^{[l]}}{\partial w_{ij}^{[l]}} = \frac{\partial\,\mathrm{Loss}}{\partial z_j^{[l]}} \, f'\big(a_j^{[l]}\big) \, z_i^{[l-1]}$$
$$\frac{\partial\,\mathrm{Loss}}{\partial z_i^{[l-1]}} = \sum_{j=1}^{d^{[l]}} \frac{\partial\,\mathrm{Loss}}{\partial z_j^{[l]}} \times f'\big(a_j^{[l]}\big) \times w_{ij}^{[l]}$$
[Figure: the local computation at unit $j$ of layer $l$: inputs $z_i^{[l-1]}$ are combined through weights $w_{ij}^{[l]}$ into $a_j^{[l]} = \sum_i w_{ij}^{[l]} z_i^{[l-1]}$, and the unit outputs $z_j^{[l]} = f\big(a_j^{[l]}\big)$]
} Define the sensitivity $\delta_j^{[l]} = \frac{\partial\,\mathrm{Loss}}{\partial z_j^{[l]}}$: how sensitive the loss is to the output $z_j^{[l]}$
} Then:
$$\frac{\partial\,\mathrm{Loss}}{\partial w_{ij}^{[l]}} = \delta_j^{[l]} \times z_i^{[l-1]} \times f'\big(a_j^{[l]}\big)$$
} Sensitivity vectors can be obtained by running a backward process in the network structure (hence the name back-propagation)
} We compute $\delta^{[l-1]}$ from $\delta^{[l]}$:
$$\delta_i^{[l-1]} = \sum_{j=1}^{d^{[l]}} \delta_j^{[l]} \times f'\big(a_j^{[l]}\big) \times w_{ij}^{[l]}$$
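The recursion as code, consuming the caches from the earlier forward sketch; it assumes the squared-error loss for the initial $\delta$ and the convention that row $j$ of each weight matrix holds the weights into unit $j$:

```python
import numpy as np

def backward(Ws, zs, As, y, df):
    """Back-propagate sensitivities and collect per-weight gradients."""
    delta = 2.0 * (zs[-1] - y)             # dLoss/dz at the output (SSE loss)
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        da = delta * df(As[l])             # dLoss/da[l] = delta_j * f'(a_j)
        grads[l] = np.outer(da, zs[l])     # dLoss/dW[l][j,i] = da_j * z_i^{l-1}
        delta = Ws[l].T @ da               # recursion: delta^{l-1}_i = sum_j w_ij f'(a_j) delta_j
    return grads
```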
} Initialize all weights to small random numbers.
} While not satisfied
    } For each training example do:
        1. Feed the example through the network to compute the activations $a_j^{[l]}$ and outputs $z_j^{[l]}$
        2. Run the backward process to obtain the sensitivities $\delta_j^{[l]}$
        3. Update each weight as $w_{ij}^{[l]} \leftarrow w_{ij}^{[l]} - \eta \, \frac{\partial\,\mathrm{Loss}}{\partial w_{ij}^{[l]}}$, where $\frac{\partial\,\mathrm{Loss}}{\partial w_{ij}^{[l]}} = \delta_j^{[l]} \times z_i^{[l-1]} \times f'\big(a_j^{[l]}\big)$
} Automatic differentiation software, e.g., Theano, TensorFlow, PyTorch
} Only need to program the function g(x, y, w)
} Can automatically compute all derivatives w.r.t. all entries in w
} Autodiff / backpropagation can often be done at computational cost comparable to evaluating the function itself
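For instance, a toy PyTorch sketch (our example, not from the slides): autograd caches the forward computation, and one backward() call fills in all the gradients:

```python
import torch

w = torch.randn(3, requires_grad=True)   # all parameters we want gradients for
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor(1.0)

loss = (torch.sigmoid(w @ x) - y) ** 2   # g(x, y, w): just program the function
loss.backward()                          # backward pass = backpropagation
print(w.grad)                            # dloss/dw, computed automatically
```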