

SLIDE 1

Applied Math for Machine Learning

  • Prof. Kuan-Ting Lai

2020/4/11

SLIDE 2

Applied Math for Machine Learning

  • Linear Algebra
  • Probability
  • Calculus
  • Optimization
SLIDE 3

Linear Algebra

  • Scalar
− A real number
  • Vector (1D)
− Has a magnitude & a direction
  • Matrix (2D)
− An array of numbers arranged in rows & columns
  • Tensor (≥3D)
− A multi-dimensional array of numbers

SLIDE 4

Real-world examples of Data Tensors

  • Timeseries Data – 3D (samples, timesteps, features)
  • Images – 4D (samples, height, width, channels)
  • Video – 5D (samples, frames, height, width, channels)
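A quick NumPy sketch of these shapes (all sizes here are illustrative, not from the slides):

    import numpy as np

    # Time series: (samples, timesteps, features)
    timeseries = np.zeros((250, 390, 3))
    # Images: (samples, height, width, channels)
    images = np.zeros((64, 32, 32, 3))
    # Video: (samples, frames, height, width, channels)
    videos = np.zeros((4, 240, 144, 256, 3))

    print(timeseries.ndim, images.ndim, videos.ndim)  # 3 4 5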


SLIDE 5

Vector Dimension vs. Tensor Dimension

  • The number of entries in a vector is also called its “dimension”
  • In deep learning, the number of axes of a tensor is also called its “rank”
  • Matrix = 2D array = 2D tensor = rank-2 tensor

https://deeplizard.com/learn/video/AiyK0idr4uM
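For example, a small NumPy check of the distinction:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5])
    print(x.shape)  # (5,)  -> a 5-dimensional vector...
    print(x.ndim)   # 1     -> ...but a rank-1 (1D) tensor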

SLIDE 6

The Matrix

SLIDE 7

Matrix

  • Define a matrix A with m rows and n columns: A ∈ ℝ^(m×n)

Santanu Pattanayak, “Pro Deep Learning with TensorFlow,” Apress, 2017

SLIDE 8

Matrix Operations

  • Addition and Subtraction
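For example, in NumPy (illustrative values):

    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[5, 6], [7, 8]])
    print(A + B)  # element-wise addition:    [[ 6  8] [10 12]]
    print(A - B)  # element-wise subtraction: [[-4 -4] [-4 -4]]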
SLIDE 9

Matrix Multiplication

  • Two matrices A ∈ ℝ^(m×n) and B ∈ ℝ^(p×q)
  • The number of columns of A must equal the number of rows of B, i.e. n == p
  • A · B = C, where C ∈ ℝ^(m×q)
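A NumPy illustration of the shape rule (values are arbitrary):

    import numpy as np

    A = np.arange(6).reshape(2, 3)   # m x n = 2 x 3
    B = np.arange(12).reshape(3, 4)  # p x q = 3 x 4, so n == p
    C = A @ B                        # matrix product
    print(C.shape)                   # (2, 4) = m x q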

SLIDE 10

Example of Matrix Multiplication (3-1)

https://www.mathsisfun.com/algebra/matrix-multiplying.html

SLIDE 11

Example of Matrix Multiplication (3-2)

https://www.mathsisfun.com/algebra/matrix-multiplying.html

SLIDE 12

Example of Matrix Multiplication (3-3)

https://www.mathsisfun.com/algebra/matrix-multiplying.html

SLIDE 13

Matrix Transpose

https://en.wikipedia.org/wiki/Transpose
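For example, in NumPy:

    import numpy as np

    A = np.array([[1, 2, 3],
                  [4, 5, 6]])   # shape (2, 3)
    print(A.T)                  # rows become columns
    print(A.T.shape)            # (3, 2)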

SLIDE 14

Dot Product

  • The dot product of two vectors is a scalar
  • The inner product is a generalization of the dot product
  • Notation: v1 ∙ v2 or v1ᵀv2
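For example, in NumPy:

    import numpy as np

    v1 = np.array([1, 2, 3])
    v2 = np.array([4, 5, 6])
    print(np.dot(v1, v2))  # 1*4 + 2*5 + 3*6 = 32, a scalar
    print(v1 @ v2)         # same result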
SLIDE 15

Dot Product of Matrices

SLIDE 16

Linear Independence

  • A vector is linearly dependent on other vectors if it can be expressed as a linear combination of them
  • A set of vectors v1, v2, ⋯, vn is linearly independent if a1v1 + a2v2 + ⋯ + anvn = 0 implies ai = 0, ∀i ∈ {1, 2, ⋯, n}
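One practical test, sketched with NumPy: stack the vectors as columns; the set is independent iff the rank of the matrix equals the number of vectors.

    import numpy as np

    v1 = np.array([1.0, 0.0, 0.0])
    v2 = np.array([0.0, 1.0, 0.0])
    v3 = v1 + 2 * v2                  # a linear combination of v1 and v2

    V = np.column_stack([v1, v2, v3])
    print(np.linalg.matrix_rank(V))   # 2 < 3 -> the set is dependent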

SLIDE 17

Span the Vector Space

  • n linearly independent vectors can span an n-dimensional space

SLIDE 18

Rank of a Matrix

  • Rank is:
− The number of linearly independent row or column vectors
− The dimension of the vector space generated by its columns
  • Row rank = column rank
  • Example: reducing a matrix to row-echelon form exposes its rank

https://en.wikipedia.org/wiki/Rank_(linear_algebra)
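A NumPy check on a small example (values chosen for illustration):

    import numpy as np

    A = np.array([[1, 2, 1],
                  [2, 4, 2],   # = 2 x row 1, adds no new direction
                  [1, 0, 1]])
    print(np.linalg.matrix_rank(A))    # 2
    print(np.linalg.matrix_rank(A.T))  # 2 -> row rank = column rank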

SLIDE 19

Identity Matrix I

  • Any vector or matrix multiplied by I remains unchanged
  • For a matrix A, AI = IA = A
SLIDE 20

Inverse of a Matrix

  • The product of a square matrix A and its inverse matrix A⁻¹ is the identity matrix I
  • AA⁻¹ = A⁻¹A = I
  • An inverse matrix is square, but not all square matrices have inverses
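For example, in NumPy:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    A_inv = np.linalg.inv(A)
    print(np.allclose(A @ A_inv, np.eye(2)))  # True: A A^-1 = I

    # A singular square matrix has no inverse:
    S = np.array([[1.0, 2.0],
                  [2.0, 4.0]])
    # np.linalg.inv(S) would raise LinAlgError ("Singular matrix")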
SLIDE 21

Pseudo Inverse

  • A non-square matrix may have a left-inverse or right-inverse matrix
  • Example:
− Form the square matrix AᵀA
− Multiply both sides by the inverse matrix (AᵀA)⁻¹
− (AᵀA)⁻¹Aᵀ is the pseudo-inverse

Ax = b, A ∈ ℝ^(m×n), b ∈ ℝ^m
AᵀAx = Aᵀb
x = (AᵀA)⁻¹Aᵀb
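The same least-squares solution sketched in NumPy; np.linalg.pinv computes the pseudo-inverse directly (via SVD), which is numerically safer than forming AᵀA. The data values are made up:

    import numpy as np

    A = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])           # non-square: 3 x 2
    b = np.array([1.0, 2.0, 2.9])

    x1 = np.linalg.inv(A.T @ A) @ A.T @ b  # (A^T A)^-1 A^T b
    x2 = np.linalg.pinv(A) @ b             # pseudo-inverse
    print(np.allclose(x1, x2))             # True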

SLIDE 22

Norm

  • Norm is a measure of a vector’s magnitude
  • ℓ2 norm
  • ℓ1 norm
  • ℓp norm
  • ℓ∞ norm
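The corresponding NumPy calls (definitions in the comments):

    import numpy as np

    x = np.array([3.0, -4.0])
    print(np.linalg.norm(x, 2))       # l2: sqrt(3^2 + (-4)^2) = 5.0
    print(np.linalg.norm(x, 1))       # l1: |3| + |-4| = 7.0
    print(np.linalg.norm(x, 3))       # lp with p = 3: (|3|^3 + |4|^3)^(1/3)
    print(np.linalg.norm(x, np.inf))  # l-inf: max |x_i| = 4.0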
SLIDE 23

Eigenvectors

  • An eigenvector is a non-zero vector that is changed by only a scalar factor λ when the linear transformation A is applied to it:

Ax = λx, A ∈ ℝ^(n×n), x ∈ ℝⁿ

  • The x are eigenvectors and the λ are eigenvalues
  • One of the most important concepts in machine learning, e.g.:
− Principal Component Analysis (PCA)
− Eigenvector centrality
− PageRank
− …
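For example, in NumPy (a diagonal matrix, so the eigenvalues are easy to read off):

    import numpy as np

    A = np.array([[2.0, 0.0],
                  [0.0, 3.0]])
    eigvals, eigvecs = np.linalg.eig(A)
    print(eigvals)                             # [2. 3.]
    x = eigvecs[:, 0]                          # columns are eigenvectors
    print(np.allclose(A @ x, eigvals[0] * x))  # True: A x = lambda x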

SLIDE 24

Example: Shear Mapping

  • The horizontal axis is the eigenvector
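A minimal sketch with the shear matrix [[1, 1], [0, 1]] (the shear amount is illustrative):

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [0.0, 1.0]])   # horizontal shear
    e1 = np.array([1.0, 0.0])    # a vector along the horizontal axis
    print(A @ e1)                # [1. 0.] -> unchanged: eigenvector, eigenvalue 1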

SLIDE 25

Principal Component Analysis (PCA)

  • Eigenvectors of the covariance matrix

https://en.wikipedia.org/wiki/Principal_component_analysis
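A minimal PCA sketch via the eigendecomposition of the covariance matrix; the synthetic data and variable names are illustrative (in practice one would usually use SVD or sklearn.decomposition.PCA):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0],
                                              [1.0, 0.5]])  # correlated data

    Xc = X - X.mean(axis=0)               # center the data
    C = np.cov(Xc, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]     # sort by decreasing variance
    W = eigvecs[:, order]                 # principal components (orthogonal)
    Z = Xc @ W                            # project onto the components
    print(Z.var(axis=0))                  # variance captured per component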

SLIDE 26

NumPy for Linear Algebra

  • NumPy is the fundamental package for scientific computing with Python. It contains among other things:
− A powerful N-dimensional array object
− Sophisticated (broadcasting) functions
− Tools for integrating C/C++ and Fortran code
− Useful linear algebra, Fourier transform, and random number capabilities

SLIDE 27

Create Tensors

  • Scalars (0D tensors)
  • Vectors (1D tensors)
  • Matrices (2D tensors)
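For example:

    import numpy as np

    s = np.array(12)                 # scalar, 0D tensor
    v = np.array([1, 2, 3])          # vector, 1D tensor
    M = np.array([[1, 2], [3, 4]])   # matrix, 2D tensor
    print(s.ndim, v.ndim, M.ndim)    # 0 1 2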

SLIDE 28

Create 3D Tensor
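For example, a 3D tensor can be built by stacking 2D matrices (illustrative values):

    import numpy as np

    T = np.arange(24).reshape(2, 3, 4)  # two "pages" of 3 x 4 matrices
    print(T.ndim)                       # 3
    print(T.shape)                      # (2, 3, 4)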

SLIDE 29

Attributes of a NumPy Tensor

  • Number of axes (dimensions, rank)
− x.ndim
  • Shape
− A tuple of integers showing how many entries the tensor has along each axis
  • Data type
− uint8, float32 or float64
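For example:

    import numpy as np

    x = np.zeros((64, 32, 32), dtype=np.float32)
    print(x.ndim)   # 3 -> number of axes (rank)
    print(x.shape)  # (64, 32, 32) -> entries along each axis
    print(x.dtype)  # float32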

SLIDE 30

NumPy Multiplication
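A small sketch of the distinction that matters here: * is element-wise, while np.dot / @ is the matrix product.

    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[5, 6], [7, 8]])
    print(A * B)   # element-wise:   [[ 5 12] [21 32]]
    print(A @ B)   # matrix product: [[19 22] [43 50]]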

SLIDE 31

Unfolding the Manifold

  • Tensor operations are complex geometric transformations in high-dimensional space
− Dimension reduction

SLIDE 32

Basics of Probability

SLIDE 33

Three Axioms of Probability

  • Given events E1, E2, ⋯, EN in a sample space S, S = E1 ∪ E2 ∪ ⋯ ∪ EN
  • First axiom
− P(E) ∈ ℝ, 0 ≤ P(E) ≤ 1
  • Second axiom
− P(S) = 1
  • Third axiom
− Additivity: for any countable sequence of mutually exclusive events Ei,
P(E1 ∪ E2 ∪ ⋯ ∪ En) = P(E1) + P(E2) + ⋯ + P(En) = Σi P(Ei)

SLIDE 34

Union, Intersection, and Conditional Probability

  • P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
  • P(A ∩ B) is simplified as P(AB)
  • Conditional probability P(A|B) is the probability of event A given that B has occurred
− P(A|B) = P(AB) / P(B)
− P(AB) = P(A|B) P(B) = P(B|A) P(A)

SLIDE 35

Chain Rule of Probability

  • The joint probability can be expanded using the chain rule, as shown below
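For events A1, A2, ⋯, An:

P(A1 A2 ⋯ An) = P(A1) P(A2|A1) P(A3|A1 A2) ⋯ P(An|A1 A2 ⋯ An−1)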
SLIDE 36

Mutually Exclusive

  • P(AB) = 0
  • P(A ∪ B) = P(A) + P(B)
SLIDE 37

Independence of Events

  • Two events A and B are said to be independent if the probability of their intersection equals the product of their individual probabilities
− P(AB) = P(A) P(B)
− P(A|B) = P(A)

SLIDE 38

Bayes Rule

P(A|B) = P(B|A) P(A) / P(B)

Proof: Recall P(A|B) = P(AB) / P(B),
so P(AB) = P(A|B) P(B) = P(B|A) P(A).
Then Bayes: P(A|B) = P(B|A) P(A) / P(B)
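A small numeric check with made-up numbers (the classic disease-testing setup):

    # Hypothetical values: P(D) = 0.01, P(pos|D) = 0.99, P(pos|not D) = 0.05
    p_d = 0.01
    p_pos_given_d = 0.99
    p_pos_given_not_d = 0.05

    p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)  # total probability
    p_d_given_pos = p_pos_given_d * p_d / p_pos                  # Bayes rule
    print(round(p_d_given_pos, 3))                               # 0.167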

SLIDE 39

Naïve Bayes Classifier

SLIDE 40

Naïve = Assume All Features Independent
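A minimal sketch with scikit-learn's GaussianNB, which makes exactly this independence assumption; the toy data is made up and scikit-learn is assumed to be installed:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.array([[1.0, 2.0], [1.2, 1.8],    # class 0
                  [8.0, 9.0], [7.5, 9.2]])   # class 1
    y = np.array([0, 0, 1, 1])

    clf = GaussianNB()   # features treated as conditionally independent
    clf.fit(X, y)
    print(clf.predict([[1.1, 2.1], [8.2, 8.9]]))  # [0 1]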

SLIDE 41

Normal (Gaussian) Distribution

  • One of the most important distributions
  • Central limit theorem
− Averages of samples of independently drawn random variables converge to a normal distribution as the sample size grows
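A quick NumPy demonstration: the uniform distribution is far from normal, yet the distribution of sample averages is approximately Gaussian (sample sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    means = rng.uniform(size=(10000, 50)).mean(axis=1)  # 10000 averages of 50 draws
    print(means.mean())  # ~0.5
    print(means.std())   # ~sqrt(1/12)/sqrt(50) ~ 0.041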

SLIDE 42
SLIDE 43

Differentiation

  • Notation: f′(x) or df(x)/dx
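A numerical sketch of the definition, using a central difference (the step size h is an illustrative choice):

    def derivative(f, x, h=1e-6):
        """Approximate f'(x) by a central difference."""
        return (f(x + h) - f(x - h)) / (2 * h)

    print(derivative(lambda x: x**2, 3.0))  # ~6.0, since d(x^2)/dx = 2x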

SLIDE 44

Derivatives of Basic Functions

  • Derivatives written in dy/dx notation
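A few standard entries of such a table, checked with SymPy (assuming SymPy is available):

    import sympy as sp

    x = sp.symbols('x')
    for f in (x**3, sp.exp(x), sp.log(x), sp.sin(x)):
        print(f, '->', sp.diff(f, x))
    # x**3 -> 3*x**2, exp(x) -> exp(x), log(x) -> 1/x, sin(x) -> cos(x)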

SLIDE 45

Gradient of a Function

  • The gradient is a multi-variable generalization of the derivative
  • Apply partial derivatives
  • Example
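For instance, for the (illustrative) function f(x, y) = x² + 3y, the gradient is (∂f/∂x, ∂f/∂y) = (2x, 3):

    import numpy as np

    def grad_f(x, y):
        """Gradient of f(x, y) = x^2 + 3y."""
        return np.array([2 * x, 3.0])

    print(grad_f(1.0, 2.0))  # [2. 3.]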
SLIDE 46

Chain Rule

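A sketch with an illustrative composition z = sin(y), y = x², where dz/dx = dz/dy · dy/dx:

    import numpy as np

    x = 1.5
    y = x**2
    dz_dy = np.cos(y)     # d sin(y) / dy
    dy_dx = 2 * x         # d x^2 / dx
    print(dz_dy * dy_dx)  # chain-rule value of dz/dx

    # Numerical check with a central difference:
    h = 1e-6
    print((np.sin((x + h)**2) - np.sin((x - h)**2)) / (2 * h))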

SLIDE 47

Maxima and Minima for a Univariate Function

  • If df(x)/dx = 0, it is a minimum or a maximum point; we then study the second derivative:
− If d²f(x)/dx² < 0 => maximum
− If d²f(x)/dx² > 0 => minimum
− If d²f(x)/dx² = 0 => point of inflection
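A worked example with the (illustrative) function f(x) = x³ − 3x, so f′(x) = 3x² − 3 and f″(x) = 6x:

    # Critical points where f'(x) = 0: x = -1 and x = 1
    for x in (-1.0, 1.0):
        fpp = 6 * x                                  # second derivative
        print(x, 'maximum' if fpp < 0 else 'minimum')
    # -1.0 maximum, 1.0 minimum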

SLIDE 48

Gradient Descent
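A minimal gradient-descent sketch on the toy objective f(x) = (x − 3)², whose gradient is 2(x − 3); the learning rate and iteration count are illustrative:

    x = 0.0
    lr = 0.1                 # learning rate
    for _ in range(100):
        grad = 2 * (x - 3)   # f'(x)
        x -= lr * grad       # step against the gradient
    print(x)                 # ~3.0, the minimum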

SLIDE 49

Gradient Descent along a 2D Surface

SLIDE 50

Avoid Local Minimum using Momentum
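A sketch of the momentum update on the same toy objective; the velocity term accumulates past gradients, which is what lets the iterate roll through shallow local dips (all constants are illustrative):

    x, v = 0.0, 0.0
    lr, beta = 0.1, 0.9           # learning rate, momentum coefficient
    for _ in range(200):
        grad = 2 * (x - 3)        # gradient of f(x) = (x - 3)^2
        v = beta * v - lr * grad  # update velocity
        x += v                    # move by the velocity
    print(round(x, 3))            # ~3.0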

SLIDE 51

Optimization

https://en.wikipedia.org/wiki/Optimization_problem

SLIDE 52

Principal Component Analysis (PCA)

  • Assumptions
− Linearity
− Mean and variance are sufficient statistics
− The principal components are orthogonal

SLIDE 53

Principal Component Analysis (PCA)

max cov(Y, Y) s.t. WᵀW = I

SLIDE 54

References

  • Francois Chollet, “Deep Learning with Python,” Chapter 2, “Mathematical Building Blocks of Neural Networks”
  • Santanu Pattanayak, “Pro Deep Learning with TensorFlow,” Apress, 2017
  • Machine Learning Cheat Sheet
  • https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
  • https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when

  • Wikipedia