Joining Ranked Input In Practice Ihab F. Ilyas Purdue University - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Joining Ranked Input In Practice

Ihab F. Ilyas Purdue University

Joint work with

Walid G. Aref Ahmed K. Elmagarmid Purdue University Hewlett Packard

slide-2
SLIDE 2

Motivation

  • Almost every application that depends on ranking the results of user queries needs a way to combine the results of multiple rankings:

– Similarity queries on multiple features.
– Query by multiple examples.
– Joining multiple ordered streams.
– Document ranking for multi-keyword search.
– Document search results from multiple search engines.

  • Most of these applications are interested only in the top K results to be presented to the user.

slide-3
SLIDE 3

Motivation

  • We need a ranking operator that takes multiple ranked streams as input and produces the combined ranking with minimal access to the inputs.

  • The only alternative is to consume all the inputs and sort on the computed overall score. Sometimes this is not even feasible!
slide-4
SLIDE 4

The Driving Application

  • STEAM: The Continuous Media Streaming Database Project at Purdue (Demo - ICDE2002)

  • Objectives of STEAM:

– Build a database system prototype that has the capabilities to manipulate continuous media objects.
– Allow for flexible querying of these objects.
– Store media objects inside the database and present the results as output streams from the database.

  • Challenge:

– Modify the different engine components to satisfy the new functionality requirements and the new data types.

  • Application:

– Medical video database application on top of the system.

slide-5
SLIDE 5
slide-6
SLIDE 6

The Driving Application

  • Use the feature approach.
  • Many physical features can be extracted.
  • Individual features are multi-dimensional.
  • Query Types:

– Single-Feature Queries:

“get 4 video shots that are most similar in color to a given query image”

– Multi-Feature Queries:

“get 4 video shots that are most similar in color, texture and edge histogram to a given query image”

slide-7
SLIDE 7

Single-Feature Query

[Figure: single-feature query. Labels: Color Histogram, Edge Histogram, Texture, Tamura, Query.]

slide-8
SLIDE 8

Single-Feature Query

  • For low to medium dimensionality, use a k-Nearest Neighbor (k-NN) algorithm on a high-dimensional index structure.

  • For very high-dimensional features, use a sequential scan over the objects.

slide-9
SLIDE 9

Multi-Feature Query

[Figure: multi-feature query. Labels: Color Histogram, Edge Histogram, Texture, Tamura, Query.]

slide-10
SLIDE 10

Multi-Feature Query

A score function is needed to combine features.

– Alternative 1:

  • Compute the score of each object in the database.
  • Sort the results on the combined score!

– Alternative 2:

  • Concatenate features in one single feature vector.
  • Treat it as a single-feature query.
  • Dimensionality effect!
slide-11
SLIDE 11

Multi-Feature Query

– Alternative 3:

  • Index on each feature separately.
  • Get the top K objects for each feature separately.
  • Join the results.
  • Not even correct! An object in the overall top K need not appear in the top K of every feature.

– Alternative 4:

  • Index on each feature separately.
  • Retrieve objects from each index ranked on one feature.
  • Try to combine the score of each ranking.
  • Stop when you have K objects that are “guaranteed” to have the top K total scores.
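
A minimal sketch of why Alternative 3 fails, using three small ranked lists (the same values that appear in the NRA example later in this deck). It assumes the overall score is the plain sum of the per-list scores; the function names are illustrative and not from the talk.

```python
# Three ranked lists of (object id, score), each sorted by descending score.
L1 = [("A", 10), ("B", 4), ("C", 3), ("D", 2), ("E", 1)]
L2 = [("A", 5), ("D", 4), ("C", 3), ("B", 2), ("E", 1)]
L3 = [("D", 7), ("C", 6), ("E", 5), ("B", 4), ("A", 3)]
LISTS = [L1, L2, L3]


def true_top_k(lists, k):
    """Alternative 1: score every object (assumed: sum of scores), then sort."""
    totals = {}
    for ranked in lists:
        for oid, score in ranked:
            totals[oid] = totals.get(oid, 0) + score
    return sorted(totals.items(), key=lambda kv: -kv[1])[:k]


def join_of_top_k(lists, k):
    """Alternative 3: take each list's top k, then join on object id."""
    tops = [dict(ranked[:k]) for ranked in lists]
    common = set(tops[0]).intersection(*tops[1:])
    joined = [(oid, sum(t[oid] for t in tops)) for oid in common]
    return sorted(joined, key=lambda kv: -kv[1])


print(true_top_k(LISTS, 2))     # [('A', 18), ('D', 13)]
print(join_of_top_k(LISTS, 2))  # []
```

With K = 2 the per-list top-2 sets are {A, B}, {A, D} and {D, C}; their join is empty, although A and D are the true top 2.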

slide-12
SLIDE 12

Theory?

Efficient solutions:

  • Fagin’s Algorithm (FA) [JCSS’99]
  • Fagin et al. (TA, NRA, CA) [PODS’01]
  • Nepal and Ramakrishna (Multi-step) [ICDE’99]
  • Natsev et al. (J*) [VLDB’01]
  • Güntzer et al. (Quick-Combine, Stream-Combine) [VLDB’00, ITCC’01]

slide-13
SLIDE 13

NRA (No-Random-Access)

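Input lists (object, score), each sorted by descending score; k = 2:

L1: A,10  B,4  C,3  D,2  E,1
L2: A,5   D,4  C,3  B,2  E,1
L3: D,7   C,6  E,5  B,4  A,3

Buffer after reading each depth, shown as object(worst-best):

Depth 1: A(15-22)  D(7-22)
Depth 2: A(15-21)  D(11-15)  B(4-14)  C(6-14)
Depth 3: A(15-20)  D(11-14)  C(12-12)  B(4-12)  E(5-11)
Depth 4: A(15-19)  D(13-13)  C(12-12)  B(10-10)  E(5-9)

Queue: A(15-19)  D(13-13)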
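
The worst/best bounds in this example can be reproduced with a short sketch. It is not the authors' code: it assumes a sum score function, non-negative scores, lists of equal length that are read in lockstep, and a stopping test that uses `>=`. The names `nra_bounds` and `nra_top_k` are made up for illustration.

```python
L1 = [("A", 10), ("B", 4), ("C", 3), ("D", 2), ("E", 1)]
L2 = [("A", 5), ("D", 4), ("C", 3), ("B", 2), ("E", 1)]
L3 = [("D", 7), ("C", 6), ("E", 5), ("B", 4), ("A", 3)]
LISTS = [L1, L2, L3]


def nra_bounds(lists, depth):
    """Worst/best total score of every object seen in the first `depth` rows.

    Worst grade: sum of the scores seen so far (a missing score counts as 0).
    Best grade: a missing score is replaced by the last score read from that
    list, which is an upper bound because each list is sorted.
    """
    seen = {}  # oid -> {list index: score}
    for i, ranked in enumerate(lists):
        for oid, score in ranked[:depth]:
            seen.setdefault(oid, {})[i] = score
    last = [ranked[depth - 1][1] for ranked in lists]
    bounds = {}
    for oid, scores in seen.items():
        worst = sum(scores.values())
        best = sum(scores.get(i, last[i]) for i in range(len(lists)))
        bounds[oid] = (worst, best)
    return bounds


def nra_top_k(lists, k):
    """Deepen one row at a time; stop once k objects are guaranteed top k."""
    top = []
    for depth in range(1, len(lists[0]) + 1):
        bounds = nra_bounds(lists, depth)
        ranked = sorted(bounds.items(), key=lambda kv: -kv[1][0])
        top, rest = ranked[:k], ranked[k:]
        # An object not seen anywhere yet scores at most the sum of last scores.
        unseen_best = sum(r[depth - 1][1] for r in lists)
        rival_best = max([best for _, (_, best) in rest] + [unseen_best])
        if len(top) == k and top[-1][1][0] >= rival_best:
            return depth, top
    return len(lists[0]), top  # inputs exhausted: all bounds are exact


print(nra_bounds(LISTS, 2))
print(nra_top_k(LISTS, 2))  # (4, [('A', (15, 19)), ('D', (13, 13))])
```

On the slide's data the sketch stops at depth 4, where D's worst grade (13) is no lower than the best grade of any remaining object (12).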

slide-14
SLIDE 14

The J* Algorithm

  • Based on the A* search algorithm.
  • Handles general join conditions.
  • Transforms the problem into navigation in a graph of nodes, aiming to reach a node with a valid join combination.

  • Has to see all the scores of an object before reporting it!

slide-15
SLIDE 15

The J* Algorithm

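[Figure: J* example. Input labels: A D C B E and A B E C D; rank labels: 1 2 3 4 5 and 5 4 3 2 1. Queue entries shown, in order: D,A (10); D,D (10); D,A (10); D,A (9); C,A (9); D,D (10); D,D (10); C,A (9); D,C (9); D,D (9); C,A (9).]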

slide-16
SLIDE 16

Rank Join Algorithms

slide-17
SLIDE 17

Systems?

Few attempts – No Practical Solution

  • Garlic (IBM): FA algorithm
  • HERON: Quick-Combine
slide-18
SLIDE 18

How to Integrate into Engine?

  • Table Functions:

SELECT Images.name
FROM GlobalRank(Images, Color, Texture, QueryImage.jpg, ScoreFunction, 10);

  • Query Operator

– An implementation outside the SQL engine loses the benefit of the query optimizer.
– As an operator, it has a chance to be shuffled with other operators in the plan for better performance (pushing down predicates and projections).
– In brief, it is under the optimizer’s control!

slide-19
SLIDE 19

Our Objective

  • Realize the NRA-like algorithms as a physical query operator to be used practically in current database engines.

– Perform necessary modifications.
– Identify several optimization issues.

  • Compare the new operator with the J* (operator ready).

slide-20
SLIDE 20

The NRA-RJ Operator

  • Implements a logical ranking operator RJOIN.
  • RJOIN takes n sorted inputs, a join condition and a score function.
  • Each input attaches a score to each object in that input.
  • The output is the joined stream with an overall score attached to each object, computed using the score function.
  • The output is sorted on that overall score.
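
To make the RJOIN contract concrete, here is a blocking reference sketch of its result. It assumes an equi-join on the object id and reads every input completely, which is exactly what a rank-join operator tries to avoid, so it is useful only as a specification to test against. The name `rjoin_reference` and the sample data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Ranked = Sequence[Tuple[str, float]]  # (oid, score), sorted by descending score


def rjoin_reference(inputs: List[Ranked], f: Callable[..., float]) -> List[Tuple[str, float]]:
    """Join all inputs on oid, apply the score function f, sort on the result."""
    tables: List[Dict[str, float]] = [dict(ranked) for ranked in inputs]
    common = set(tables[0]).intersection(*tables[1:])
    joined = [(oid, f(*(t[oid] for t in tables))) for oid in common]
    # Highest overall score first; ties broken by oid to keep output deterministic.
    return sorted(joined, key=lambda row: (-row[1], row[0]))


color = [("s1", 9), ("s2", 5), ("s3", 4)]    # hypothetical video-shot scores
texture = [("s3", 8), ("s1", 6), ("s2", 1)]

print(rjoin_reference([color, texture], lambda c, t: c + t))
# [('s1', 15), ('s3', 12), ('s2', 6)]
print(rjoin_reference([color, texture], min))
# [('s1', 6), ('s3', 4), ('s2', 1)]
```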
slide-21
SLIDE 21

The NRA-RJ Operator

[Figure: RJOIN takes Sorted Input 1, Sorted Input 2, …, Sorted Input n, carrying tuples <oid, score1, …>, <oid, score2, …>, …, <oid, scoren, …>, and produces a sorted output of tuples <oid, f(score1, score2, …, scoren), …>.]

slide-22
SLIDE 22

The NRA-RJ Operator

  • Example Inputs:

– Different views of the same database objects sorted on different criteria: e.g., the feature ranking of video shots w.r.t. different features.
– External sources: e.g., web search results from different search engines.

  • Usage covers a wide range of queries in many new applications: similarity queries in multimedia databases, multiple streams processing, etc.

slide-23
SLIDE 23

Why Query Operator?

[Figure: two query plans, each built from two NRA-RJ operators over sorted inputs (inputs labeled A, B, C, D). The first plan contains a single selection σ; the second contains additional σ operators.]

slide-24
SLIDE 24

From an Algorithm To a Query Operator

Important Properties

  • Incremental

– So the operator can fit into the Open/Get-Next/Close framework.

  • Pipelined

– If one is going to sort, one is better off sorting once for all the scores.

– Most of the queries that use the operator are interested in getting the first answer fast.

slide-25
SLIDE 25

NRA

  • The output stream has no specific grade/score attached, so it is not a valid input.
  • Considers all the inputs together: more context, faster termination.
  • Less flexible: a multi-way, one-layer operator at the leaf level of the query plan.

NRA-RJ

  • Allows inputs to have ranges of scores.
  • Reports output objects with a range from worst to best grade.
  • Considers some of the inputs at each stage: less context.
  • More flexible: a pipelined special join operator that can be integrated well into the query plan.

slide-26
SLIDE 26

NRA-RJ

  • Open(): Open left and right inputs.
  • Close(): Close left and right inputs.
  • GetNext():

    If Queue.Top.WorstGrade > Maximum(BestGrade of all other objects)
        Remove Queue.Top from Queue.
        Return the tuple.
    Loop
        Left.GetNext()
        Right.GetNext()
        Compute threshold.
        Update best and worst grades of objects in Queue.
        If Queue.Top.WorstGrade > Maximum(BestGrade of all other objects)
            Break
    End Loop
    Remove Queue.Top from Queue.
    Return the tuple.
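
The pseudocode above can be turned into a small runnable sketch of a binary, pipelined operator. This is an illustration under stated assumptions, not the authors' implementation: the score function is a sum, scores are non-negative, the join is an equi-join on a key (every object appears in both inputs), one tuple is pulled from each side per round, and the stopping test uses `>=` so that ties do not block the pipeline. The class names `Scan` and `NRARJ` are hypothetical.

```python
import math


class Scan:
    """Leaf input: (oid, score) rows already sorted by descending score."""

    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.rows):
            return None
        oid, score = self.rows[self.pos]
        self.pos += 1
        return oid, score, score  # exact score, so worst == best

    def close(self):
        self.pos = len(self.rows)


class NRARJ:
    """Binary rank-join sketch: equi-join on oid, overall score = sum."""

    def __init__(self, left, right):
        self.children = [left, right]

    def open(self):
        for child in self.children:
            child.open()
        self.seen = {}                   # oid -> [left range, right range]
        self.cap = [math.inf, math.inf]  # bound on scores a child may still return
        self.done = [False, False]
        self.emitted = set()

    def close(self):
        for child in self.children:
            child.close()

    def _bounds(self, oid):
        worst = best = 0
        for side, rng in enumerate(self.seen[oid]):
            if rng is None:
                best += self.cap[side]   # missing side: 0 at worst, cap at best
            else:
                worst, best = worst + rng[0], best + rng[1]
        return worst, best

    def _ready(self):
        """Return the oid whose worst grade is >= every rival's best grade."""
        if not self.seen:
            return None
        graded = {oid: self._bounds(oid) for oid in self.seen}
        top = max(graded, key=lambda oid: graded[oid][0])
        rivals = [best for oid, (_, best) in graded.items() if oid != top]
        rivals.append(sum(self.cap))     # objects not seen on either side
        return top if graded[top][0] >= max(rivals) else None

    def get_next(self):
        while True:
            top = self._ready()
            if top is not None:
                result = (top, *self._bounds(top))
                del self.seen[top]
                self.emitted.add(top)
                return result
            if all(self.done):
                return None
            for side, child in enumerate(self.children):
                row = None if self.done[side] else child.get_next()
                if row is None:
                    self.done[side], self.cap[side] = True, 0
                    continue
                oid, worst, best = row
                self.cap[side] = worst   # later rows cannot beat this
                if oid not in self.emitted:
                    self.seen.setdefault(oid, [None, None])[side] = (worst, best)


L1 = [("A", 10), ("B", 4), ("C", 3), ("D", 2), ("E", 1)]
L2 = [("A", 5), ("D", 4), ("C", 3), ("B", 2), ("E", 1)]
L3 = [("D", 7), ("C", 6), ("E", 5), ("B", 4), ("A", 3)]

plan = NRARJ(NRARJ(Scan(L1), Scan(L2)), Scan(L3))  # two-operator pipeline
plan.open()
results = list(iter(plan.get_next, None))
plan.close()
print(results[0])  # ('A', 15, 21)
```

The first tuple is reported as A with the range 15 to 21 before A's third score has been read, which is why the operator must accept and emit score ranges rather than exact scores.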

slide-27
SLIDE 27

NRA-RJ

Inputs, shown as object(worst-best):

L1: A(10-10)  B(4-4)  C(3-3)  D(2-2)  E(1-1)
L2: A(5-5)    D(4-4)  C(3-3)  B(2-2)  E(1-1)
L3: D(7-7)    C(6-6)  E(5-5)  B(4-4)  A(3-3)

First NRA-RJ (L1, L2):

Queue 1: A(15-15)
Buffer 1: B(4-8) D(4-8); then C(6-6) B(4-7) D(4-7); then B(6-6) D(6-6) C(6-6)

Second NRA-RJ (output of the first, L3):

Buffer 2: A(15-22) D(7-22); then A(15-21) D(7-13) C(12-12)
Queue 2: A(15-21)

slide-28
SLIDE 28

Optimizing NRA-RJ

  • The local ranking problem

Solution: a balancing factor!

slide-29
SLIDE 29

Optimizing NRA-RJ

[Figure: two NRA-RJ pipelines over Input A, Input B and Input C. In the first, the edges are labeled m, m and n; in the second, m’, m’ and n/p.]

slide-30
SLIDE 30

NRA-RJ vs. J*

  • Stopping condition

Input 1: A:10  B:5  C:4  D:3  E:2
Input 2: B:5   C:4  D:3  E:2  A:1

NRA-RJ at d = 2: A(10-14)  B(10-10)  C(4-9)

slide-31
SLIDE 31

NRA-RJ vs. J*

  • Space Complexity

– Worst Case: (important for allocation)

  • NRA-RJ: A queue of (N) objects
  • J* : A queue of (2N –1) objects

– Best Case:

  • NRA-RJ: Empty Queue
  • J*: A queue of (2K) objects.
slide-32
SLIDE 32

Performance Study

  • The Prototype:

– Begin with Predator (ORDBMS) on top of Shore (SM).
– Modify/add all necessary components.

  • Incorporate the GiST index structure inside Shore to provide HD indexes.
  • Change the query engine to add the NRA-RJ, the STOP_AFTER and the NN operators.
  • Modify the buffer manager to handle large media segments.
  • Add a stream manager to handle streaming of media in/out of the database.

– Runs on Sun Enterprise 450 with UltraSparc-II processors running SunOS 5.6.

slide-33
SLIDE 33
slide-34
SLIDE 34

Empirical Results

slide-35
SLIDE 35

Input Ordering

NRA-RJ is sensitive to input ordering.

[Figure: two panels plotting CPU Time and Buffer Size against K for input orderings O1 to O6.]

K    NRA-RJ     NRA        J*
5    3.692383   2.699707   4.881836
10   12.87939   9.37793    10.77881
15   17.28613   13.97217   15.12305
20   19.15967   15.3457    16.49658
25   20.65723   16.62451   17.71289
30   23.05957   19.27686   20.36523
35   23.83545   20.27148   21.39111
40   25.38232   21.78711   22.87549
45   26.87256   23.25537   25.11572
50   28.7417    25.71826   27.45361
55   29.88037   27.04443   28.93604
60   30.51514   27.9917    29.72705
65   31.16211   28.60742   30.37402
70   32.28662   30.07568   31.77979
80   33.29834   32.1123    33.78516

slide-36
SLIDE 36

Input Ordering

J* is less sensitive to input ordering.

[Figure: three panels plotting CPU Time, Buffer Size and Page Accesses against K for input orderings O1 to O6.]

slide-37
SLIDE 37

NRA-RJ vs J*

[Figure: three panels plotting CPU Time, Buffer Size and Page Access against K for NRA-RJ(1), NRA-RJ(2), J*(1) and J*(2).]

Choosing the input ordering is an important optimizer decision.

slide-38
SLIDE 38

NRA-RJ vs. NRA vs. J*

[Figure: three panels plotting Page Accesses, Buffer Size and CPU Time against K for NRA-RJ, NRA and J*.]

slide-39
SLIDE 39

[Figure: three panels plotting Page Accesses, Buffer Size and CPU Time against pipeline length (2 to 6) for NRA-RJ, NRA and J*.]

slide-40
SLIDE 40

Conclusion

  • NRA-RJ is a practical binary pipelined query operator.
  • It can be adopted by real database engines as a new join operator.
  • For equi-join on key attributes, the proposed NRA-RJ with the local-ranking minimization is the best solution.
  • For general join conditions, J* is the only rank-join algorithm proposed.
  • Choosing the input order and the balancing factor are important optimizer decisions.

slide-41
SLIDE 41

Future Work

  • Heuristics to choose a “good” input order. Sampling on input or consuming the first k tuples are possible approaches.
  • How to correctly cost the NRA-RJ operator.
  • Adaptive NRA-RJ. What happens if one of the streams stops or becomes very slow?