SLIDE 1

DATA ANALYTICS USING DEEP LEARNING

GT 8803 // FALL 2018 // JOY ARULRAJ

LECTURE #19: LEARNING STATE REPRESENTATIONS FOR QUERY OPTIMIZATION WITH DEEP REINFORCEMENT LEARNING

SLIDE 2

PAPER

  • Learning State Representations for Query Optimization with Deep Reinforcement Learning

– Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, S. Sathiya Keerthi
– University of Washington, Microsoft, Criteo Research

  • Key Topics

– Deep reinforcement learning
– Query optimization

SLIDE 3

RELATED LINKS

  • Paper - https://arxiv.org/abs/1803.08604
  • J Ortiz - https://homes.cs.washington.edu/~jortiz16/
  • M Balazinska - https://www.cs.washington.edu/people/faculty/magda
  • J Gehrke - http://www.cs.cornell.edu/johannes/
  • SS Keerthi - http://www.keerthis.com/


SLIDE 4

AGENDA

  • Problem Overview
  • Background
  • Key Ideas
  • Technical Details
  • Experiments
  • Discussion


SLIDE 5

TODAY’S PAPER

[Word cloud: Query, Reinforcement Learning, Deep Learning, Cardinality, Database]

SLIDE 6

Problem Overview

  • Query Optimization is still a difficult problem
  • Deep Reinforcement Learning (DRL) is an evolving approach to solve complex problems.
  • Can DRL be used to improve query plan optimization?

SLIDE 7

PROBLEM OVERVIEW

  • Contribution #1: Generate a model that determines a subquery’s cardinality
  • Contribution #2: Use reinforcement learning as a Markov process to propose a query plan
  • Some Challenges:

– State isn’t obvious like in some contexts (e.g. games)
– Choosing the reward can be tricky

SLIDE 8

Background: Query Optimization

  • Ongoing problem in database systems research
  • Current systems still aren’t great - Why???

– Plans must be efficient in time and resources - tradeoffs
– Current DBMSs make simplified assumptions
  • Avoid multidimensional/complex methods
– Result -> Estimation errors and poor query plans

SLIDE 9

Background: Query Optimization

– Join order
  • When a join includes more than 2 relations, join time can vary depending on the size of the relations
– Subquery optimization
  • group by, exists operators can often be simplified, but…
  • can be computationally complex to determine
– Cardinality estimation
  • Hard to map predicates as new data comes in
  • Requires stats to be updated

EXPRESS LEARNING - DATABASE MANAGEMENT SYSTEMS

https://en.wikipedia.org/wiki/Query_optimization

SLIDE 10

Background: Query Optimization

  • Commonly used approaches

– Data sketches
– Sampling
– Histograms
– Heuristics

SLIDE 11

BACKGROUND: DEEP LEARNING

  • What is it?

– Maps input x to output y through a series of hidden layers
– Transforms data into representations
  • e.g. images of cats become pixel values
– Hidden layers apply a series of functions
– Errors decrease over time via backpropagation
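That last point can be sketched with a toy one-weight "network" (purely illustrative, not from the slides): a single weight trained by gradient descent, the same step backpropagation applies layer by layer, makes the error shrink over iterations.

```python
# Toy illustration: y_hat = w * x, trained by gradient descent.
# Backpropagation applies this same gradient step layer by layer
# in a deep network.
w = 0.0          # weight, arbitrary initialization
lr = 0.1         # learning rate
x, y = 2.0, 4.0  # one training example; the ideal weight is 2.0

errors = []
for _ in range(20):
    y_hat = w * x
    errors.append((y_hat - y) ** 2)   # squared error
    grad = 2 * (y_hat - y) * x        # d(error)/dw
    w -= lr * grad                    # gradient descent step
```

On this convex toy problem the error decreases at every step and the weight converges to 2.0.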


SLIDE 12

BACKGROUND: DEEP LEARNING

  • What is it good for?

– Machine translation
– Object detection
– Winning games
– Much more…

SLIDE 13

BACKGROUND: DEEP LEARNING

  • Why?

– Performs well across multiple domains
– We have improved, cheaper hardware and large datasets for training
– It’s good at finding patterns that aren’t obvious to humans (even domain experts)
– Libraries
  • PyTorch, TensorFlow, Keras

SLIDE 14

BACKGROUND: REINFORCEMENT LEARNING

  • What is it?

– Agents – the learner in the model
– States – condition of the environment
– Actions – inputs from the agent (based on previous learning or trial/error)
– Rewards – feedback to the agent (positive or not)
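These four ingredients show up in even the smallest sketch (a hypothetical toy environment, unrelated to the paper): an agent walks a line of states 0–4, actions move it left or right, and a reward arrives only at the goal.

```python
import random

# Toy environment: states 0..4 on a line; the goal state is 4.
def step(state, action):
    next_state = min(4, max(0, state + action))  # environment transition
    reward = 1 if next_state == 4 else 0         # feedback to the agent
    return next_state, reward

random.seed(0)
state, total_reward = 0, 0
for _ in range(50):                   # one 50-step episode
    action = random.choice([-1, +1])  # trial-and-error policy
    state, reward = step(state, action)
    total_reward += reward
```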

http://introtodeeplearning.com/materials/2018_6S191_Lecture5.pdf

SLIDE 15

BACKGROUND: REINFORCEMENT LEARNING

  • What is it good for?

– Beating Atari games
– Training autonomous vehicles, robots
– Optimizing stocks, gambling, auction bids, etc.

SLIDE 16

BACKGROUND: REINFORCEMENT LEARNING

  • Why?

– May perform better than brute-force deep learning models
– Agents can use trial/error or greedy approaches to optimize reward
– Can be good in complex state spaces because you don’t have to provide fully labeled outputs for the model to train on; you can just provide simpler rewards

SLIDE 17

KEY IDEA

Can deep reinforcement learning be used to learn query representations?

SLIDE 18

SUBQUERY LEARNING VIA DRL


[Diagram: at each time step, the state (first a representation of database properties, then a subquery representation) is combined with the query operation at time t by a state transition function to produce the subquery representation at state t+1; the operation at time t+1 yields state t+2. The actions are the query operations applied at each time step.]

SLIDE 19

EXAMPLE


select * from customers C, orders O where C.col1 = O.col1 and O.col1 <= 10

[Diagram: the representation of database properties (state t) is transformed by the query operation at time t into a representation of the subquery (state t+1), and by the query operation at time t+1 into the subquery representation at state t+2.]
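For the example query above, the step-by-step decomposition might look like the following sketch (the tuple encoding and state names are purely illustrative, not the paper’s):

```python
# Hypothetical decomposition of the example query into one
# operation per time step; each step maps the previous state to a
# new subquery representation.
query_ops = [
    ("select", "O.col1 <= 10"),    # time t: selection on orders
    ("join", "C.col1 = O.col1"),   # time t+1: equi-join with customers
]

states = ["db_properties"]         # state t: database representation
for op_name, predicate in query_ops:
    states.append(f"subquery_after_{op_name}")
```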

SLIDE 20

APPROACH

  • Map the query and database (Q, D) to a feature vector
  • Two options:

– Transform values using deep networks and output cardinality
– Recursive approach taking subquery (ht) and operation (at) as input

SLIDE 21

MORE ON APPROACH

  • Two options:

– Transform values using deep networks and output cardinality
  • Needs lots of data – very sparse
– Recursive approach taking subquery (ht) and operation (at) as input
  • Recursive approach is selected
  • ht is learned by the model
  • Thus we have an NNST model that learns based on NNObserved and NNInit

SLIDE 22

HOW TO ENCODE DATA


SLIDE 23

TECHNICAL DETAIL


  • Use cardinality as an observed variable

[Diagram inputs: query operation, prior subquery representation]

SLIDE 24

STEPS

  • NNInit: h0 = g(x, a0)

x = database properties (min/max values, # distinct values, 1D histogram)
a = single relational operator (= ≠ < > ≤ ≥)

  • NNST: ht+1 = g(ht, at)

h = latent representation of the model itself (a subquery)
a = single relational operation

  • NNObserved

Mapping from the hidden state to observed variables at time t
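The shape of the recursion can be sketched as follows; the three functions here are trivial stand-ins (the real NNInit, NNST, and NNObserved are trained neural networks), so only the composition pattern is meaningful.

```python
# Stand-ins for the three networks, showing only how they compose:
# h0 = NN_init(x, a0); h_{t+1} = NN_ST(h_t, a_t);
# NN_Observed maps any h_t to an observed variable (cardinality).

def nn_init(x, a):       # database properties + first operation -> h0
    return [xi + a for xi in x]

def nn_st(h, a):         # prior subquery representation + operation -> next h
    return [0.5 * hi + a for hi in h]

def nn_observed(h):      # hidden state -> observed variable
    return sum(h)

x = [1.0, 2.0, 3.0]      # stand-in for min/max/distinct/histogram features
h = nn_init(x, 0.1)      # apply the first operation
for a in [0.2, 0.3]:     # fold in the remaining operations recursively
    h = nn_st(h, a)

card_estimate = nn_observed(h)
```

Because each step consumes only the prior representation and one operation, the same machinery handles arbitrarily deep subquery chains.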


SLIDE 25

Experiments

  • Uses IMDB dataset

– 3 GB
– Real data (has skew and correlations between columns)

  • TensorFlow (Python)
  • Baseline estimates against SQL Server

SLIDE 26

Experiment #1

  • Train init function with properties from IMDB
  • 20K queries (15K train/5K test)
  • Model uses stochastic gradient descent (SGD)
  • Learning rate of .01
  • 50 hidden nodes in hidden layer


SLIDE 27

Experiment #1


  • Fewer epochs == greater errors
  • m=3, 6th epoch similar to SQL Server
  • > 6th epoch, outperforms SQL Server
  • Greater cardinality == longer to converge (outperforms SQL Server by 9th epoch)

SLIDE 28

Experiment #1


SLIDE 29

EXPERIMENT #2

  • Combined models
  • Select and join operation

– Where a is the join (⋈)

  • Hidden state is able to store enough info to predict cardinality

SLIDE 30

EXPERIMENT #2


SLIDE 31

NEXT STEPS


Can subquery representations be used to build query plans?

SLIDE 32

GOAL

  • Given a database D and a query Q, train a model that can learn to predict subquery cardinalities (and the best join)…

[Diagram: from the subquery representation at state t+1, the candidate actions are joins with relations B, C, and D (Actions 1–3); the model must predict which join is best.]

SLIDE 33

ASSUMPTIONS

  • Model-free environment where probabilities between states are unknown
  • Each state encodes operations that have already been done
  • The model needs a good reward to be successful
  • Need to determine the optimal policy

SLIDE 34

Example


SLIDE 35

APPROACH

  • For all relations in a database, assume a set of relations with attributes
  • Vector at time t represents equi-join predicates and 1D selection predicates

– e.g. if a predicate exists, set value to 1, otherwise 0
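A minimal sketch of such an indicator vector (the predicate universe below is hypothetical, chosen to match the earlier example query):

```python
# Assumed predicate universe: one vector slot per possible equi-join
# or 1-D selection predicate; a slot is 1 iff the predicate appears
# in the (sub)query at time t.
ALL_PREDICATES = [
    "C.col1 = O.col1",   # equi-join predicate
    "O.col1 <= 10",      # 1-D selection predicate
    "C.col2 <= 5",       # possible predicate, unused in this query
]

def encode(active):
    return [1 if p in active else 0 for p in ALL_PREDICATES]

v_t = encode({"C.col1 = O.col1", "O.col1 <= 10"})  # vector at time t
```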


SLIDE 36

How to reward?

  • Can be given at each state or at the end.
  • Option 1:

– Minimize cost based on existing query estimators

  • Option 2:

– Use cardinality from the learned model
– Experimental

SLIDE 37

Q-LEARNING

  • Init with random values
  • For each state, the next value of Q comes from:

– Current value of Q
– Learning rate
– Reward
– Max value for a reward given a greedy policy
– Discount factor
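Those five quantities combine in the standard Q-learning update; the sketch below uses a tiny two-state table with toy numbers (illustrative only, and initialized with zeros rather than random values for clarity).

```python
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

def q_update(q, state, action, reward, next_state):
    # Max value obtainable from the next state under a greedy policy
    best_next = max(q[next_state].values())
    td_target = reward + gamma * best_next
    # Move the current value of Q toward the target by the learning rate
    q[state][action] += alpha * (td_target - q[state][action])

q = {"s0": {"a": 0.0, "b": 0.0},   # toy Q-table (zeros instead of random init)
     "s1": {"a": 1.0, "b": 0.0}}
q_update(q, "s0", "a", reward=2.0, next_state="s1")
```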


SLIDE 38

Q-Learning

https://medium.freecodecamp.org/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc

SLIDE 39

OPEN PROBLEMS

  • How to choose rewards?
  • State space is large and Q-learning can be impractical

– Need to approximate solutions

SLIDE 40

Related Work

  • Eliminate optimizer
  • Use RL for query processing
  • Feedback loop on optimizer
  • Neural networks to estimate cardinality
  • Neural networks to build fast indexes
  • DRL to determine join order


SLIDE 41

STRENGTHS

  • Deep learning is a more feasible approach than manually written queries
  • Unique approach using a recursive model
  • Deep learning models can approximate and exceed the performance of industry-standard optimizers

SLIDE 42

WEAKNESSES

  • Q-Learning is impractical and difficult

– Large state space
– Reward selection problem

  • Evaluating query plans takes time, but so does training iterative models; it would be valuable to compare the two

SLIDE 43

MODEL DETAILS (NOT IN PAPER)


SLIDE 44

Further Discussion

  • Strategies to pick a reward function?
  • For actions, discuss a value-based recursive approach vs. a policy gradient approach
  • Is there a way to pick the most representative queries to reduce the state space?

SLIDE 45

references

  • 1. Slides from Jennifer Ortiz, “Deep Learning for Query Plan Resource and Cost Estimation”, Teradata Analytics Universe, 2018.
  • 2. Express Learning - Database Management Systems. Pearson Education India, 2012.
  • 3. MIT 6.S191: Introduction to Deep Learning, http://introtodeeplearning.com/