Dimensionality reduction - AI FUNDAMENTALS - Nemanja Radojkovic - PowerPoint PPT Presentation



slide-1
SLIDE 1

Dimensionality reduction

AI FUNDAMENTALS

Nemanja Radojkovic

Senior Data Scientist

slide-2
SLIDE 2

AI FUNDAMENTALS

Definition

"Dimensionality reduction is the process of reducing the number of variables under consideration by obtaining a set of principal variables."
slide-3
SLIDE 3

AI FUNDAMENTALS

Why?

Pros:

  • Reduces overfitting
  • Obtains independent features
  • Lowers computational intensity
  • Enables visualization

Cons:

  • Compression => loss of information => loss of performance

slide-4
SLIDE 4

AI FUNDAMENTALS

Types

  • Feature selection (B ⊆ A): selecting a subset of the existing features, based on predictive power. Non-trivial problem: we are looking for the best "team of features", not the individually best features!
  • Feature extraction (B = f(A)): transforming and combining existing features into new ones, via linear or non-linear projections.
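The two types can be contrasted in a short sketch (scikit-learn's `SelectKBest` and the bundled iris dataset are illustrative assumptions, not part of the slides):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Feature selection: keep 2 of the existing features,
# ranked by predictive power w.r.t. the target y
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: combine all 4 features into 2 new ones
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # both (150, 2)
```

Note that selection keeps original, interpretable columns, while extraction produces new, combined ones.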

slide-5
SLIDE 5

AI FUNDAMENTALS

Common algorithms

Linear (faster, deterministic):

  • Principal Component Analysis (PCA)

from sklearn.decomposition import PCA

  • Latent Dirichlet Allocation

from sklearn.decomposition import LatentDirichletAllocation

Non-linear (slower, non-deterministic):

  • Isomap

from sklearn.manifold import Isomap

  • t-distributed Stochastic Neighbor Embedding (t-SNE)

from sklearn.manifold import TSNE
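A minimal usage sketch for one of the listed non-linear algorithms, t-SNE (the digits dataset, the 300-sample subset, and the parameter values are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:300]  # small subset: t-SNE is slow on large data

# Non-deterministic: fix random_state for reproducible embeddings
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_2d.shape)  # (300, 2), ready for a scatter plot
```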

slide-6
SLIDE 6

AI FUNDAMENTALS

Principal Component Analysis (PCA)

  • Family: linear methods.
  • Intuition: principal components are the directions of highest variability in the data. Reduction = keeping only the top N principal components.
  • Assumption: normal distribution of data.
  • Caveat: very sensitive to outliers.

Code example:

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
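To pick `n_components` in practice, a common trick (not shown on the slide) is to inspect the cumulative explained variance; this sketch assumes scikit-learn's bundled iris dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Fit with all components first, just to inspect the variance
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that retains 95% of the variance
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_95)  # 2: on iris, two components keep ~97.8% of the variance
```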

slide-7
SLIDE 7

Use it wisely!

AI FUNDAMENTALS

slide-8
SLIDE 8

Clustering

AI FUNDAMENTALS

Nemanja Radojkovic

Senior Data Scientist

slide-9
SLIDE 9

AI FUNDAMENTALS

What is clustering?

  • Cluster = a group of entities or events sharing similar attributes.
  • Clustering (AI) = the process of applying Machine Learning algorithms for the automatic discovery of clusters.

slide-10
SLIDE 10

AI FUNDAMENTALS

Popular clustering algorithms

KMeans clustering

from sklearn.cluster import KMeans

Spectral clustering

from sklearn.cluster import SpectralClustering

DBSCAN

from sklearn.cluster import DBSCAN
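A minimal sketch of the first listed algorithm, KMeans, on synthetic data (`make_blobs` and the parameter values are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters must be chosen up front (see the elbow method later)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(len(set(labels)))  # 3 distinct cluster labels assigned
```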

slide-11
SLIDE 11

AI FUNDAMENTALS

slide-12
SLIDE 12

AI FUNDAMENTALS

slide-13
SLIDE 13

AI FUNDAMENTALS

slide-14
SLIDE 14

AI FUNDAMENTALS

How many clusters do I have?

-> Elbow method!
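The elbow method can be sketched as follows: fit KMeans for a range of k and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply (the synthetic data and parameter values are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia for k = 1..8; plotted, the "elbow" appears at the true k
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 9)
]
```

The curve is monotonically decreasing, so the signal is the *kink*, not the minimum.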

slide-15
SLIDE 15

AI FUNDAMENTALS

How many clusters do I have?

slide-16
SLIDE 16

AI FUNDAMENTALS

Cluster analysis and tuning

Unsupervised (no "ground truth", no expectations):

  • Variance Ratio Criterion: sklearn.metrics.calinski_harabasz_score
    "What is the average distance of each point to the center of its cluster, AND what is the distance between the clusters?"
  • Silhouette score: sklearn.metrics.silhouette_score
    "How close is each point to its own cluster VS how close it is to the others?"

Supervised ("ground truth"/expectations provided):

  • Mutual information (MI) criterion: sklearn.metrics.mutual_info_score
  • Homogeneity score: sklearn.metrics.homogeneity_score
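The unsupervised metrics above can be exercised in a short sketch (the synthetic blobs are an illustrative assumption; note that current scikit-learn spells it `calinski_harabasz_score`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

sil = silhouette_score(X, labels)         # in [-1, 1], higher is better
vrc = calinski_harabasz_score(X, labels)  # unbounded, higher is better
```

Both scores need only X and the labels, so they can be compared across candidate values of n_clusters without any ground truth.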

slide-17
SLIDE 17

Explore, experiment and tune!

AI FUNDAMENTALS

slide-18
SLIDE 18

Anomaly detection

AI FUNDAMENTALS

Nemanja Radojkovic

Senior Data Scientist

slide-19
SLIDE 19

AI FUNDAMENTALS

Definition and use cases

Detecting unusual entities or events. It is hard to define what's odd, but possible to define what's normal.

Use cases:

  • Credit card fraud detection
  • Network security monitoring
  • Heart-rate monitoring

slide-20
SLIDE 20

AI FUNDAMENTALS

Approaches: Thresholding

slide-21
SLIDE 21

AI FUNDAMENTALS

Approaches: Rate of change

slide-22
SLIDE 22

AI FUNDAMENTALS

Approaches: Shape monitoring

slide-23
SLIDE 23

AI FUNDAMENTALS

Algorithms

Robust covariance (assumes normal distribution)

from sklearn.covariance import EllipticEnvelope

Isolation Forest (powerful, but more computationally demanding)

from sklearn.ensemble import IsolationForest

One-Class SVM (sensitive to outliers, many false negatives)

from sklearn.svm import OneClassSVM

slide-24
SLIDE 24

AI FUNDAMENTALS

slide-25
SLIDE 25

AI FUNDAMENTALS

Training and testing

Example: Isolation Forest

from sklearn.ensemble import IsolationForest

algorithm = IsolationForest()

# Fit the model
algorithm.fit(X)

# Apply the model and detect the outliers
results = algorithm.predict(X)

slide-26
SLIDE 26

AI FUNDAMENTALS

Evaluation

from sklearn.metrics import (confusion_matrix, precision_score, recall_score)

confusion_matrix(y_true, y_predicted)

Precision = how many of the detected anomalies are TRUE anomalies? Recall = how many of the TRUE anomalies did we manage to detect? Example: arrhythmia detection.
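A small worked example of those two questions (the labels below are made up for illustration; 1 marks an anomaly):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical arrhythmia labels: 1 = anomaly, 0 = normal heartbeat
y_true      = [0, 0, 0, 1, 1, 1, 0, 1]
y_predicted = [0, 0, 1, 1, 1, 0, 0, 1]

print(confusion_matrix(y_true, y_predicted))

# 4 detections, 3 of them true anomalies -> precision 0.75
p = precision_score(y_true, y_predicted)
# 4 true anomalies, 3 of them detected  -> recall 0.75
r = recall_score(y_true, y_predicted)
```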

slide-27
SLIDE 27

Want to learn more?

AI FUNDAMENTALS

slide-28
SLIDE 28

Selecting the right model

AI FUNDAMENTALS

Nemanja Radojkovic

Senior Data Scientist

slide-29
SLIDE 29

AI FUNDAMENTALS

Model-to-problem fit

Type of learning:

  • Target variable defined and known? => Supervised. Classification? Regression?
  • No target variable, exploration? => Unsupervised. Dimensionality reduction? Clustering? Anomaly detection?

slide-30
SLIDE 30

AI FUNDAMENTALS

Defining the priorities

Interpretable models:

  • Linear models (Linear, Logistic, Lasso, Ridge regression)
  • Decision Trees

Well-performing models:

  • Tree ensembles (Random Forests, Gradient Boosted Trees)
  • Support Vector Machines
  • Artificial Neural Networks

Simplicity first!

slide-31
SLIDE 31

AI FUNDAMENTALS

Using multiple metrics

  • Satisfying metrics: cut-off criteria that every candidate model needs to meet. Multiple satisfying metrics are possible (e.g. minimum accuracy, maximum execution time, etc.).
  • Optimizing metric: illustrates the ultimate business priority (e.g. "minimize false positives", "maximize recall").
  • "There can be only one": the final model passes the bar on all satisfying metrics and has the best score on the optimizing metric.
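The satisfying-vs-optimizing logic can be sketched in plain Python (the candidate models and all their scores are invented for illustration):

```python
# Hypothetical candidates, scored on two satisfying metrics
# (minimum accuracy, maximum latency) and one optimizing metric (recall)
candidates = [
    {"name": "logreg",   "accuracy": 0.91, "latency_ms": 12,  "recall": 0.80},
    {"name": "forest",   "accuracy": 0.94, "latency_ms": 45,  "recall": 0.88},
    {"name": "deep_net", "accuracy": 0.95, "latency_ms": 220, "recall": 0.90},
]

# Satisfying metrics: hard cut-offs every candidate must pass
passing = [m for m in candidates
           if m["accuracy"] >= 0.90 and m["latency_ms"] <= 100]

# Optimizing metric: "there can be only one" - best recall among the passers
best = max(passing, key=lambda m: m["recall"])
print(best["name"])  # "forest": deep_net wins on recall but fails latency
```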

slide-32
SLIDE 32

AI FUNDAMENTALS

Interpretation

Global: "What are the general decision-making rules of this model?" Common approaches:

  • Decision tree visualization
  • Feature importance plot

Local: "Why was this specific example classified in this way?"

  • LIME algorithm (Local Interpretable Model-Agnostic Explanations)
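A minimal sketch of the "feature importance" approach to global interpretation (RandomForestClassifier on the iris dataset is an illustrative assumption):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Global interpretation: one importance per input feature, summing to 1;
# plotting these as a bar chart gives the "feature importance plot"
importances = model.feature_importances_
print(importances)
```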

slide-33
SLIDE 33

Model selection and interpretation

AI FUNDAMENTALS