SLIDE 1 School of Data Science, Fudan University
DATA130006 Text Management and Analysis
Basis of Neural Networks
Zhongyu Wei (魏忠钰)
SLIDE 2 General Neural Architectures for NLP
- 1. Represent words/features with dense vectors (embeddings) via a lookup table
- 2. Concatenate the vectors
- 3. Feed the result through multi-layer neural networks (sketched below)
§ Classification § Matching § Ranking
- R. Collobert et al. "Natural Language Processing (Almost) from Scratch"
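A minimal NumPy sketch of this three-step pipeline; the vocabulary size, embedding dimension, hidden size, and window of word ids are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Lookup table: each row is the dense vector (embedding) of one word.
vocab_size, emb_dim, hidden_dim, num_classes = 10_000, 50, 100, 5
embeddings = rng.normal(size=(vocab_size, emb_dim))

# A window of word ids (e.g. the context around a target word).
word_ids = np.array([12, 507, 3])

# 2. Concatenate the looked-up vectors into one input vector.
x = embeddings[word_ids].reshape(-1)              # shape: (3 * emb_dim,)

# 3. Multi-layer neural network on top (here: one hidden layer).
W1 = rng.normal(size=(hidden_dim, x.size)); b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(num_classes, hidden_dim)); b2 = np.zeros(num_classes)

h = np.tanh(W1 @ x + b1)                          # hidden representation
scores = W2 @ h + b2                              # one score per class
print(scores.shape)                               # (5,) -> used for classification/matching/ranking
```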
SLIDE 3 Machine Learning
§ Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. (from Wikipedia)
SLIDE 4 Formal Specification of Machine Learning
§ Input Data: $(x_i, y_i)$, $1 \le i \le n$
§ Model
§ Linear Model: $y = f(x) = w^\top x + b$
§ Generalized Linear Model: $y = f(x) = w^\top \phi(x) + b$
§ Non-linear Model: Neural Network
§ Criterion:
§ Loss Function: $L(y, f(x))$ → Optimization
§ $R(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i; \theta))$ → Minimization
§ Regularization: $\|\theta\|^2$
§ Objective Function: $R(\theta) + \lambda \|\theta\|^2$
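As a sketch, the regularized objective can be written out directly for a linear model; the squared loss and the value of the coefficient lam are assumptions for illustration, since the slide leaves the loss unspecified here.

```python
import numpy as np

def objective(w, b, X, y, lam):
    """R(theta) + lam * ||theta||^2 for a linear model y_hat = X @ w + b
    with squared loss; lam is the regularization coefficient (assumed)."""
    y_hat = X @ w + b
    empirical_risk = np.mean((y - y_hat) ** 2)   # (1/n) sum_i L(y_i, f(x_i))
    regularizer = np.sum(w ** 2)                 # ||theta||^2 (bias often excluded)
    return empirical_risk + lam * regularizer

# Toy data: n = 4 examples, d = 2 features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
print(objective(np.array([1.0, 2.0]), 0.0, X, y, lam=0.1))
```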
SLIDE 5
Linear Classifier
$f(x, W) = Wx + b$
SLIDE 6 Generalized Linear Classification
§ Hypothesis is a logistic function of a linear combination $z = f(x) = w^\top x + b$:
$F(x) = \frac{1}{1 + \exp(-z)}$
§ We can interpret $F(x)$ as $P(y = 1 \mid x)$
§ Then the log-odds ratio $\ln \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = w^\top x$ is linear
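A small sketch checking the log-odds claim numerically; the weights, bias, and input are made-up values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2])
b = 0.3
x = np.array([2.0, 1.0])

z = w @ x + b                # linear combination
p1 = sigmoid(z)              # F(x), interpreted as P(y=1|x)
p0 = 1.0 - p1                # P(y=0|x)

# The log-odds ratio recovers the linear score exactly.
print(np.log(p1 / p0), z)    # both equal 0.1
```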
SLIDE 7
Softmax
§ Softmax regression is a generalization of logistic regression to multi-class classification problems
§ With softmax, the posterior probability of $y = c$ is:
$P(y = c \mid x) = \mathrm{softmax}(w_c^\top x) = \frac{\exp(w_c^\top x)}{\sum_{k=1}^{C} \exp(w_k^\top x)}$
§ Class $c$ is represented by the one-hot vector
$y = [I(1 = c), I(2 = c), \ldots, I(C = c)]^\top$
§ where $I(\cdot)$ is the indicator function
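A sketch of the softmax posterior and the one-hot target; the class count C and the scores are illustrative, and subtracting the maximum is a standard numerical-stability trick not shown on the slide.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

C = 4                                        # number of classes (assumed)
scores = np.array([2.0, 1.0, 0.1, -1.0])     # w_k^T x for k = 1..C
probs = softmax(scores)                      # P(y = c | x) for each class c
print(probs, probs.sum())                    # probabilities sum to 1

# One-hot representation of class c: y = [I(1=c), ..., I(C=c)]
c = 2
y = (np.arange(1, C + 1) == c).astype(float)
print(y)                                     # [0. 1. 0. 0.]
```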
SLIDE 8 Examples of word classification
W: K × D, b: K × 1
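A minimal sketch of scoring one word with these shapes; reading K as the number of classes and D as the dimensionality of the word's feature vector is an assumption about the notation.

```python
import numpy as np

K, D = 3, 5                           # K classes, D-dimensional word features (assumed meaning)
rng = np.random.default_rng(1)
W = rng.normal(size=(K, D))           # W: K x D
b = np.zeros(K)                       # b: K x 1

x = rng.normal(size=D)                # feature vector of one word
scores = W @ x + b                    # one score per class
print(scores.shape, scores.argmax())  # (3,), index of the predicted class
```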
SLIDE 9
How to learn W?
$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i; \theta))$
§ Hinge Loss (SVM) § Softmax loss: cross-entropy loss
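A sketch of the two losses on a single example; the margin of 1 follows the usual multiclass SVM formulation, and the class scores are made up.

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # class scores f(x_i; W) for one example
y = 0                                  # index of the correct class

# Multiclass hinge loss (SVM): sum over j != y of max(0, s_j - s_y + 1)
margins = np.maximum(0.0, scores - scores[y] + 1.0)
margins[y] = 0.0
hinge_loss = margins.sum()

# Cross-entropy (softmax) loss: -log P(y | x)
z = scores - scores.max()
log_probs = z - np.log(np.exp(z).sum())
cross_entropy_loss = -log_probs[y]

print(hinge_loss, cross_entropy_loss)
```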
SLIDE 10
SVM vs Softmax (Quiz)
SLIDE 11
Parameter Learning
§ In ML, our objective is to learn the parameter $\theta$ that minimizes the loss function. § How do we learn $\theta$?
SLIDE 12
Gradient Descent
§ Gradient Descent: update the parameters in the direction of the negative gradient, $\theta \leftarrow \theta - \alpha \nabla_\theta R(\theta)$ § The step size $\alpha$ is also called the learning rate in ML.
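A sketch of the update rule on a simple quadratic; the toy objective and the step size are illustrative assumptions.

```python
import numpy as np

def R(theta):
    return np.sum((theta - 3.0) ** 2)       # toy objective, minimized at theta = 3

def grad_R(theta):
    return 2.0 * (theta - 3.0)              # its exact gradient

theta = np.array([0.0, 10.0])
alpha = 0.1                                 # learning rate
for step in range(100):
    theta = theta - alpha * grad_R(theta)   # gradient descent update

print(theta, R(theta))                      # close to [3, 3], objective near 0
```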
SLIDE 13
Gradient Descent
SLIDE 14
Learning Rate
SLIDE 15
Gradient Descent
SLIDE 16
Stochastic Gradient Descent (SGD)
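A sketch of the stochastic variant: the gradient is estimated on a small random mini-batch instead of the full dataset. Linear regression with squared loss is an assumption made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=n)

w = np.zeros(d)
alpha, batch_size = 0.1, 32
for step in range(500):
    idx = rng.integers(0, n, size=batch_size)       # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient of the batch loss
    w -= alpha * grad                               # SGD update

print(w)                                            # close to [1.0, -2.0, 0.5]
```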
SLIDE 17
Computational graphs
SLIDE 18
Backpropagation: a simple example
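The slides work the example through figures; as a sketch, assuming the common f(x, y, z) = (x + y) · z example, backpropagation just applies the chain rule backward through the computational graph.

```python
# Forward pass through the graph f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y                  # intermediate node
f = q * z                  # output node

# Backward pass: chain rule, from the output back to the inputs.
df_df = 1.0
df_dq = z * df_df          # d(q*z)/dq = z
df_dz = q * df_df          # d(q*z)/dz = q
df_dx = 1.0 * df_dq        # d(x+y)/dx = 1
df_dy = 1.0 * df_dq        # d(x+y)/dy = 1

print(f, df_dx, df_dy, df_dz)   # -12.0, -4.0, -4.0, 3.0
```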
SLIDES 19–52 (figures only; no extractable text)
SLIDE 53
Biological Neuron
SLIDE 54
Artificial Neuron
SLIDE 55
Activation Functions
SLIDE 56
SLIDE 57
Activation Functions
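A sketch of an artificial neuron a = g(wᵀx + b) with a few common activation functions; which particular functions the figures show is not stated in the text, so sigmoid, tanh, and ReLU are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

# One artificial neuron: weighted sum of inputs, plus bias, then a non-linearity g.
w = np.array([0.4, -0.6, 0.2])
b = 0.1
x = np.array([1.0, 2.0, -1.0])
z = w @ x + b                         # pre-activation
for g in (sigmoid, tanh, relu):
    print(g.__name__, g(z))           # activation under each choice of g
```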
SLIDE 58
Feedforward Neural Network
SLIDE 59
Neural Network
SLIDE 60
Feedforward Computing
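A sketch of feedforward computing, layer by layer: h(l) = g(W(l) h(l-1) + b(l)) with h(0) = x. The layer sizes and the tanh non-linearity are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]            # input dim, two hidden layers, output dim (assumed)

# One weight matrix and bias vector per layer.
Ws = [rng.normal(size=(m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
bs = [np.zeros(m) for m in layer_sizes[1:]]

def feedforward(x):
    h = x                                           # h^(0) = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ h + b                               # affine transform of the previous layer
        h = np.tanh(z) if l < len(Ws) - 1 else z    # non-linearity on hidden layers only
    return h                                        # output scores

print(feedforward(rng.normal(size=4)))              # 3 output values
```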