SLIDE 1 Applied Math for Machine Learning
2020/4/11
SLIDE 2 Applied Math for Machine Learning
- Linear Algebra
- Probability
- Calculus
- Optimization
SLIDE 3 Linear Algebra
− Scalar: a real number
− Vector: has a magnitude & a direction
− Matrix: an array of numbers arranged in rows & columns
− Tensor: a multi-dimensional array of numbers
SLIDE 4 Real-world examples of Data Tensors
- Timeseries Data – 3D (samples, timesteps, features)
- Images – 4D (samples, height, width, channels)
- Video – 5D (samples, frames, height, width, channels)
SLIDE 5 Vector Dimension vs. Tensor Dimension
- The number of entries in a vector is also called its “dimension”
- In deep learning, the number of axes of a tensor is also called its “rank”
- Matrix = 2d array = 2d tensor = rank 2 tensor
https://deeplizard.com/learn/video/AiyK0idr4uM
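A minimal NumPy sketch of the distinction (values are arbitrary):
```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
print(v.shape)  # (3,) -> a "3-dimensional" vector (3 entries)
print(v.ndim)   # 1    -> but a rank-1 tensor (one axis)
```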
SLIDE 6
The Matrix
SLIDE 7 Matrix
- Define a matrix with m rows and n columns:
Santanu Pattanayak, “Pro Deep Learning with TensorFlow,” Apress, 2017
SLIDE 8 Matrix Operations
SLIDE 9 Matrix Multiplication
- Two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$
- The number of columns of A must equal the number of rows of B, i.e. $n = p$
- $AB = C$, where $C \in \mathbb{R}^{m \times q}$
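For instance, a small NumPy sketch of the shape rule (shapes chosen for illustration):
```python
import numpy as np

A = np.random.rand(2, 3)  # m=2, n=3
B = np.random.rand(3, 4)  # p=3, q=4; n == p, so the product is defined
C = A @ B                 # matrix multiplication
print(C.shape)            # (2, 4), i.e. m x q
```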
SLIDE 10 Example of Matrix Multiplication (3-1)
https://www.mathsisfun.com/algebra/matrix-multiplying.html
SLIDE 11 Example of Matrix Multiplication (3-2)
https://www.mathsisfun.com/algebra/matrix-multiplying.html
SLIDE 12 Example of Matrix Multiplication (3-3)
https://www.mathsisfun.com/algebra/matrix-multiplying.html
SLIDE 13 Matrix Transpose
https://en.wikipedia.org/wiki/Transpose
SLIDE 14 Dot Product
- The dot product of two vectors is a scalar
- The inner product is a generalization of the dot product
- Notation: $v_1 \cdot v_2$ or $v_1^{T} v_2$
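A quick NumPy illustration (values are arbitrary):
```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])
print(np.dot(v1, v2))  # 32.0 -> a scalar, equal to v1 @ v2
```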
SLIDE 15
Dot Product of Matrices
SLIDE 16 Linear Independence
- A vector is linearly dependent on other vectors if it can be expressed as a linear combination of them
- A set of vectors $v_1, v_2, \cdots, v_n$ is linearly independent if $a_1 v_1 + a_2 v_2 + \cdots + a_n v_n = 0$ implies $a_i = 0\ \forall i \in \{1, 2, \cdots, n\}$
SLIDE 17 Span the Vector Space
- n linearly independent vectors can span an n-dimensional space
SLIDE 18 Rank of a Matrix
− The number of linearly independent row or column vectors
− The dimension of the vector space generated by its columns
- Row rank = Column rank
- Example: reduce the matrix to row-echelon form; the rank is the number of non-zero rows
https://en.wikipedia.org/wiki/Rank_(linear_algebra)
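As a sketch, NumPy can compute the rank directly (the matrix below is illustrative):
```python
import numpy as np

A = np.array([[1, 2, 1],
              [2, 4, 2],   # = 2 * first row -> linearly dependent
              [3, 6, 4]])
print(np.linalg.matrix_rank(A))  # 2
```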
SLIDE 19 Identity Matrix I
- Any vector or matrix multiplied by I remains unchanged
- For a matrix $A$: $AI = IA = A$
SLIDE 20 Inverse of a Matrix
- The product of a square matrix $A$ and its inverse matrix $A^{-1}$ is the identity matrix $I$
- $AA^{-1} = A^{-1}A = I$
- An inverse matrix is square, but not all square matrices have inverses
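A minimal NumPy check (the matrix is an arbitrary invertible example):
```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
A_inv = np.linalg.inv(A)
print(A @ A_inv)  # identity, up to floating-point error
```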
SLIDE 21 Pseudo Inverse
- A non-square matrix may have a left-inverse or a right-inverse
- Example: solve $Ax = b$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$
− Form the square matrix $A^T A$
− Multiply both sides by the inverse $(A^T A)^{-1}$
− $(A^T A)^{-1} A^T$ is the pseudo-inverse
$Ax = b \;\Rightarrow\; A^T A x = A^T b \;\Rightarrow\; x = (A^T A)^{-1} A^T b$
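A sketch of this in NumPy, assuming an illustrative overdetermined system (3 equations, 2 unknowns):
```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

x = np.linalg.inv(A.T @ A) @ A.T @ b  # (A^T A)^-1 A^T b
print(x)
print(np.linalg.pinv(A) @ b)          # NumPy's pseudo-inverse gives the same x
```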
SLIDE 22 Norm
- A norm is a measure of a vector’s magnitude
- $l_2$ norm: $\|v\|_2 = \sqrt{\sum_i v_i^2}$
- $l_1$ norm: $\|v\|_1 = \sum_i |v_i|$
- $l_p$ norm: $\|v\|_p = \left(\sum_i |v_i|^p\right)^{1/p}$
- $l_\infty$ norm: $\|v\|_\infty = \max_i |v_i|$
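For instance, with np.linalg.norm (the vector is arbitrary):
```python
import numpy as np

v = np.array([3.0, -4.0])
print(np.linalg.norm(v, 2))       # l2 norm: 5.0
print(np.linalg.norm(v, 1))       # l1 norm: 7.0
print(np.linalg.norm(v, np.inf))  # l-infinity norm: 4.0
```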
SLIDE 23 Eigen Vectors
- An eigenvector is a non-zero vector that is changed only by a scalar factor $\lambda$ when the linear transform $A$ is applied to it:
- $x$ are eigenvectors and $\lambda$ are eigenvalues
- One of the most important concepts in machine learning, e.g.:
− Principal Component Analysis (PCA)
− Eigenvector centrality
− PageRank
− …
$Ax = \lambda x, \quad A \in \mathbb{R}^{n \times n},\ x \in \mathbb{R}^{n}$
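A minimal NumPy sketch (the matrix is an arbitrary example):
```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)   # [2. 3.] -> the eigenvalues
print(eigvecs)   # columns are the corresponding eigenvectors
```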
SLIDE 24 Example: Shear Mapping
[Figure: a shear mapping leaves the horizontal eigenvector’s direction unchanged]
SLIDE 25 Principal Component Analysis (PCA)
- Eigenvector of Covariance Matrix
https://en.wikipedia.org/wiki/Principal_component_analysis
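A sketch of PCA via the eigendecomposition of the covariance matrix (the data is synthetic, for illustration only):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0],
                                          [1.0, 0.5]])  # correlated 2D data
Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigh: covariance is symmetric
order = np.argsort(eigvals)[::-1]     # sort by decreasing variance
W = eigvecs[:, order]                 # principal components
Z = Xc @ W                            # project onto the components
```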
SLIDE 26 NumPy for Linear Algebra
- NumPy is the fundamental package for scientific computing with Python. It contains among other things:
− a powerful N-dimensional array object
− sophisticated (broadcasting) functions
− tools for integrating C/C++ and Fortran code
− useful linear algebra, Fourier transform, and random number capabilities
SLIDE 27 Create Tensors
- Scalars (0D tensors)
- Vectors (1D tensors)
- Matrices (2D tensors)
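For instance (values are arbitrary):
```python
import numpy as np

x0 = np.array(12)                 # scalar (0D tensor)
x1 = np.array([12, 3, 6, 14])     # vector (1D tensor)
x2 = np.array([[5, 78, 2],
               [6, 79, 3]])       # matrix (2D tensor)
print(x0.ndim, x1.ndim, x2.ndim)  # 0 1 2
```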
SLIDE 28
Create 3D Tensor
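For instance, a 3D tensor matching the timeseries layout from Slide 4 (sizes are arbitrary):
```python
import numpy as np

x3 = np.zeros((2, 3, 4))  # e.g. 2 samples, 3 timesteps, 4 features
print(x3.ndim)            # 3
print(x3.shape)           # (2, 3, 4)
```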
SLIDE 29 Attributes of a NumPy Tensor
- Number of axes (dimensions, rank)
− x.ndim
- Shape
− x.shape
− A tuple of integers showing how many entries the tensor has along each axis
- Data type
− x.dtype
− uint8, float32 or float64
SLIDE 30
Numpy Multiplication
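A minimal sketch of the two kinds of multiplication (matrices are arbitrary):
```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])
print(a * b)         # element-wise product
print(np.dot(a, b))  # matrix product, equivalently a @ b
```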
SLIDE 31 Unfolding the Manifold
- Tensor operations are complex geometric transformations in high-dimensional space
− Dimensionality reduction
SLIDE 32
Basics of Probability
SLIDE 33 Three Axioms of Probability
- Given an event $E$ in a sample space $S$, $S = \bigcup_{i=1}^{N} E_i$
− $P(E) \in \mathbb{R},\ 0 \le P(E) \le 1$
− $P(S) = 1$
− Additivity: for any countable sequence of mutually exclusive events $E_i$,
$P\!\left(\bigcup_{i=1}^{n} E_i\right) = P(E_1) + P(E_2) + \cdots + P(E_n) = \sum_{i=1}^{n} P(E_i)$
SLIDE 34 Union, Intersection, and Conditional Probability
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- $P(A \cap B)$ is abbreviated as $P(AB)$
- Conditional probability $P(A|B)$: the probability of event A given that B has occurred
− $P(A|B) = \frac{P(AB)}{P(B)}$
− $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$
SLIDE 35 Chain Rule of Probability
- The joint probability can be factored using the chain rule:
$P(A_1 A_2 \cdots A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 A_2) \cdots P(A_n|A_1 \cdots A_{n-1})$
SLIDE 36 Mutually Exclusive
- $P(AB) = 0$
- $P(A \cup B) = P(A) + P(B)$
SLIDE 37 Independence of Events
- Two events A and B are said to be independent if the probability of their intersection is equal to the product of their individual probabilities
− $P(AB) = P(A)\,P(B)$
− $P(A|B) = P(A)$
SLIDE 38 Bayes Rule
$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$
Proof: recall $P(A|B) = \frac{P(AB)}{P(B)}$,
so $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$.
Then Bayes: $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$
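A worked numeric example (the probabilities are hypothetical, chosen only to illustrate the formula):
```python
p_a = 0.01              # P(A): prior, e.g. a rare condition
p_b_given_a = 0.99      # P(B|A): test positive given A
p_b_given_not_a = 0.05  # P(B|not A): false positive rate
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)  # total probability
print(p_b_given_a * p_a / p_b)  # P(A|B) ~ 0.167 by Bayes rule
```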
SLIDE 39
Naïve Bayes Classifier
SLIDE 40
Naïve = Assume All Features Independent
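A minimal NumPy sketch of the idea (the function name and smoothing constant are illustrative, not from the slides): each class is scored by its prior times the product of per-feature Gaussian likelihoods, which is exactly the independence assumption.
```python
import numpy as np

def gaussian_nb_predict(X_train, y_train, x):
    """Score each class by log prior + sum of per-feature
    log-likelihoods (features assumed independent given the class)."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9  # avoid division by 0
        log_like = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores.append(np.log(prior) + log_like)
    return classes[int(np.argmax(scores))]
```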
SLIDE 41 Normal (Gaussian) Distribution
- One of the most important distributions
- Central limit theorem
− Averages of many independent samples of a random variable converge in distribution to the normal distribution
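A quick NumPy demonstration (sample sizes are arbitrary): uniform samples are far from normal, yet their per-row means cluster into a bell shape.
```python
import numpy as np

rng = np.random.default_rng(0)
means = rng.uniform(0, 1, size=(10000, 50)).mean(axis=1)
print(means.mean())  # ~0.5
print(means.std())   # ~sqrt(1/12/50) ~ 0.041, approximately normal
```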
SLIDE 42
SLIDE 43
Differentiation
SLIDE 44 Derivatives of Basic Functions
[Table: derivatives $\frac{dy}{dx}$ of basic functions]
SLIDE 45 Gradient of a Function
- Gradient is a multi-variable generalization of the derivative
- Apply partial derivatives
- Example
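A minimal sketch, assuming the illustrative function $f(x, y) = x^2 + y^2$ (not from the slides):
```python
import numpy as np

def grad_f(x, y):
    # f(x, y) = x**2 + y**2, so the partial derivatives are 2x and 2y
    return np.array([2 * x, 2 * y])

print(grad_f(1.0, 2.0))  # [2. 4.]
```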
SLIDE 47 Maxima and Minima for Univariate Function
- If $\frac{df(x)}{dx} = 0$, it is a minimum or a maximum point; then we study the second derivative:
− If $\frac{d^2 f(x)}{dx^2} < 0$ => maxima
− If $\frac{d^2 f(x)}{dx^2} > 0$ => minima
− If $\frac{d^2 f(x)}{dx^2} = 0$ => point of inflection
SLIDE 48
Gradient Descent
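A minimal sketch of the update rule, assuming the illustrative objective $f(x) = (x - 3)^2$ (not from the slides):
```python
x = 0.0   # initial guess
lr = 0.1  # learning rate
for _ in range(100):
    grad = 2 * (x - 3)  # f'(x) for f(x) = (x - 3)**2
    x -= lr * grad      # step against the gradient
print(x)  # ~3.0, the minimum
```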
SLIDE 49
Gradient Descent along a 2D Surface
SLIDE 50
Avoid Local Minimum using Momentum
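A sketch of the momentum variant on the same illustrative objective: the velocity accumulates past gradients, which can carry the iterate through shallow local dips.
```python
x, v = 0.0, 0.0
lr, beta = 0.1, 0.9  # learning rate, momentum coefficient
for _ in range(100):
    grad = 2 * (x - 3)        # f'(x) for f(x) = (x - 3)**2
    v = beta * v - lr * grad  # accumulate a velocity
    x += v
print(x)  # ~3.0
```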
SLIDE 51 Optimization
https://en.wikipedia.org/wiki/Optimization_problem
SLIDE 52 Principal Component Analysis (PCA)
- Assumptions:
− Linearity
− Mean and variance are sufficient statistics
− The principal components are orthogonal
SLIDE 53
Principal Component Analysis (PCA)
$\max\ \mathrm{cov}(\mathbf{Y}, \mathbf{Y}) \quad \mathrm{s.t.} \quad \mathbf{W}^{T}\mathbf{W} = \mathbf{I}$
SLIDE 54 References
- Francois Chollet, “Deep Learning with Python,” Chapter 2, “The Mathematical Building Blocks of Neural Networks”
- Santanu Pattanayak, “Pro Deep Learning with TensorFlow,” Apress, 2017
- Machine Learning Cheat Sheet
- https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
- https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when