slide-1
SLIDE 1

Probabilistic Query Evaluation: Towards Tractable Combined Complexity

Mikaël Monet1,2, supervised by Pierre Senellart2,3 and Antoine Amarilli1

May 31st, 2017

1LTCI, Télécom ParisTech, Université Paris-Saclay; Paris, France 2Inria Paris; Paris, France 3École normale supérieure, PSL Research University; Paris, France

slide-3
SLIDE 3

Introduction

  • Uncertainty in data

→ Untrustworthy sources, automated information extraction, imperfect sensor precision in experimental sciences, etc.

  • Need framework to model this uncertainty and reason about it

→ Probabilistic Databases!

1/20

slide-7
SLIDE 7

Plan

1) Define TID model and probabilistic query evaluation (PQE)
2) Existing approaches (efficient PQE in the data)
3) Efficient PQE in the query and the data
4) Efficient PQE in the data, reasonable complexity in the query

2/20

slide-8
SLIDE 8

Tuple-independent databases (TID)

  • Probabilistic databases: model uncertainty about data
  • Simplest model: tuple-independent databases (TID)
  • A relational database I
  • A probability valuation π mapping each fact of I to [0, 1]
  • Semantics of a TID (I, π): a probability distribution on I′ ⊆ I:
  • Each fact F ∈ I is present with probability π(F) and absent with probability 1 − π(F)
  • Assume independence across facts

3/20

slide-14
SLIDE 14

Example: TID

S
  a b   .5
  a c   .2

This TID (I, π) represents the following probability distribution:

  .5 × .2                S = {(a, b), (a, c)}
  .5 × (1 − .2)          S = {(a, b)}
  (1 − .5) × .2          S = {(a, c)}
  (1 − .5) × (1 − .2)    S = ∅
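The four possible worlds of this example can be enumerated mechanically. The following is a minimal Python sketch, not code from the talk: the encoding of facts as tuples and the helper name `possible_worlds` are assumptions made for illustration.

```python
from itertools import product

# Facts of the example TID with their probabilities (values from the slide).
# Encoding a fact as a (relation, first, second) tuple is an assumption.
tid = {("S", "a", "b"): 0.5, ("S", "a", "c"): 0.2}

def possible_worlds(tid):
    """Yield (world, probability) for every subset of the facts."""
    facts = list(tid)
    for keep in product([True, False], repeat=len(facts)):
        world = frozenset(f for f, k in zip(facts, keep) if k)
        prob = 1.0
        # Independence across facts: multiply the per-fact probabilities.
        for f, k in zip(facts, keep):
            prob *= tid[f] if k else 1.0 - tid[f]
        yield world, prob

worlds = dict(possible_worlds(tid))
for world, prob in worlds.items():
    print(sorted(world), round(prob, 3))
```

The enumeration has 2^|I| worlds, so this is only usable on tiny instances.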

4/20

slide-17
SLIDE 17

Probabilistic query evaluation (PQE)

Let us fix:

  • Relational signature σ
  • Class I of relational instances on σ (e.g., acyclic, treelike)
  • Class Q of Boolean queries (e.g., paths, trees)

Probabilistic query evaluation (PQE) problem for Q and I:

  • Given a query q ∈ Q
  • Given an instance I ∈ I and a probability valuation π
  • Compute the probability that (I, π) satisfies q

→ Pr((I, π) ⊨ q) = Σ_{J ⊆ I, J ⊨ q} Pr(J)
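Read literally, this sum gives a brute-force algorithm: enumerate every sub-instance J, keep those that satisfy q, and add up their probabilities. Below is a hedged Python sketch of that reading. The function name `pqe_brute_force`, the tuple encoding of facts, and the example query "some S-fact is present" are illustrative assumptions, not part of the talk.

```python
from itertools import product

def pqe_brute_force(tid, query):
    """Sum Pr(J) over all sub-instances J of I that satisfy the query.

    `tid` maps facts to probabilities; `query` is a Boolean predicate on a
    set of facts. Runs in time exponential in the number of facts.
    """
    facts = list(tid)
    total = 0.0
    for keep in product([True, False], repeat=len(facts)):
        world = {f for f, k in zip(facts, keep) if k}
        if query(world):
            prob = 1.0
            for f, k in zip(facts, keep):
                prob *= tid[f] if k else 1.0 - tid[f]
            total += prob
    return total

# Small TID over a relation S (values assumed for illustration).
tid = {("S", "a", "b"): 0.5, ("S", "a", "c"): 0.2}

def q(world):
    """Boolean query: there exist x, y such that S(x, y) holds."""
    return any(fact[0] == "S" for fact in world)

p = pqe_brute_force(tid, q)
print(p)
```

The exponential cost of this enumeration is what the complexity questions in the rest of the talk try to avoid.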

5/20

slide-18
SLIDE 18

Complexity of probabilistic query evaluation (PQE)

Question: what is the (data, combined) complexity of PQE depending on the class Q of queries and class I of instances?

6/20

slide-25
SLIDE 25

Data complexity results: related work

  • Existing data dichotomy result on queries [Dalvi & Suciu, 2012]
  • I is all instances
  • There is a class S ⊆ UCQs of safe queries

→ PQE is PTIME for any q ∈ S
→ PQE is #P-hard for any q ∈ UCQs \ S

  • Existing data dichotomy result on instances

→ PQE for MSO on bounded-treewidth instances has linear data complexity [Amarilli, Bourhis, & Senellart, 2015]
→ There is an FO query for which PQE is #P-hard on any unbounded-treewidth graph family I (under some assumptions) [Amarilli, Bourhis, & Senellart, 2016]
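To give a feel for the tractable side of the query dichotomy, consider the query ∃x ∃y S(x, y), a standard textbook example of a safe query (this example is an assumption added here and does not appear on the slides). Because facts are independent, its probability is 1 minus the probability that every S-fact is absent, which takes one pass over the data. A minimal sketch, with the hypothetical helper name `prob_exists_fact`:

```python
from math import prod

def prob_exists_fact(probabilities):
    """Probability that at least one of several independent facts is present.

    Evaluates the query "there exist x, y with S(x, y)" on a TID whose
    S-facts have the given probabilities, in time linear in the data.
    """
    return 1.0 - prod(1.0 - p for p in probabilities)

# S-facts with probabilities .5 and .2 (illustrative values).
print(prob_exists_fact([0.5, 0.2]))
```

No possible worlds are enumerated here; the closed form follows directly from the independence assumption of the TID model.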

What about combined complexity?

7/20

slide-26
SLIDE 26

Wish list

We want:

  • PQE tractable in combined complexity

OR

  • PQE tractable in the data, reasonable in the query

8/20

slide-29
SLIDE 29

Restrict to CQs on graph signatures

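Query: ∃x y z t R(x, y) ∧ S(y, z) ∧ S(t, z)

→ Query graph: x -R-> y -S-> z <-S- t

Instance:

R
  a b   .1
  b c   .1
  c d   .05
  d a   1.
  d b   .8

S
  b d   .7

→ Instance graph (labeled edges with probabilities): a -R (.1)-> b, b -R (.1)-> c, c -R (.05)-> d, d -R (1.)-> a, d -R (.8)-> b, b -S (.7)-> d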
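On a graph signature, a world satisfies the conjunctive query exactly when there is a homomorphism from the query graph to that world. The sketch below checks this by backtracking and sums over all 2^6 worlds of the instance above. It is an illustrative brute-force computation under an assumed tuple encoding, not the method studied in the talk; the names `satisfies` and `probability` are hypothetical.

```python
from itertools import product

# Instance from the slide: (label, source, target) -> probability.
tid = {
    ("R", "a", "b"): 0.1,
    ("R", "b", "c"): 0.1,
    ("R", "c", "d"): 0.05,
    ("R", "d", "a"): 1.0,
    ("R", "d", "b"): 0.8,
    ("S", "b", "d"): 0.7,
}

# Query atoms R(x, y), S(y, z), S(t, z), all variables existential.
query = [("R", "x", "y"), ("S", "y", "z"), ("S", "t", "z")]

def satisfies(world, atoms, assignment=None):
    """Backtracking search for a homomorphism from the atoms into the world."""
    assignment = assignment or {}
    if not atoms:
        return True
    label, u, v = atoms[0]
    for (lab, a, b) in world:
        if lab != label:
            continue
        new = dict(assignment)
        # Bind u and v, rejecting the edge if it conflicts with earlier bindings.
        if new.setdefault(u, a) != a or new.setdefault(v, b) != b:
            continue
        if satisfies(world, atoms[1:], new):
            return True
    return False

def probability(tid, atoms):
    """Sum the probabilities of all worlds that satisfy the query."""
    facts = list(tid)
    total = 0.0
    for keep in product([True, False], repeat=len(facts)):
        world = [f for f, k in zip(facts, keep) if k]
        if satisfies(world, atoms):
            prob = 1.0
            for f, k in zip(facts, keep):
                prob *= tid[f] if k else 1.0 - tid[f]
            total += prob
    return total

p = probability(tid, query)
print(round(p, 3))
```

As a sanity check by hand: the only S-edge is b -> d, so a match needs that edge together with an R-edge into b, giving .7 × (1 − (1 − .1) × (1 − .8)) = .574.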

9/20

slide-33
SLIDE 33

Restrict instances to trees

Q = one-way paths (1WP), I = polytrees (PT)

Q: a directed path with edge labels T, S, S, S, T

I: [Figure: a polytree with edges labeled S or T, plus a probability for each edge]

Proposition: PQE of 1WP on PT is #P-hard

10/20

slide-34
SLIDE 34

Our graph classes

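[Figure: one example graph per class: 1WP (one-way paths), 2WP (two-way paths), DWT (downwards trees), PT (polytrees)]

Inclusions between the classes:

1WP ⊆ 2WP ⊆ PT ⊆ Connected ⊆ All
1WP ⊆ DWT ⊆ PT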

11/20

slide-36
SLIDE 36

Results

[Two tables, one for the case of 2 labels and one for the case of no labels. Rows: query class Q. Columns: instance class I. Both range over 1WP, 2WP, DWT, PT, Connected. Each cell is marked PTIME or #P-hard.]

12/20

slide-42
SLIDE 42

Led to a publication in PODS’2017

Contributions:

  • Detailed study of the combined complexity of PQE
  • Focus on CQs on arity-two signatures
  • Showed the importance of various features on the problem: labels, global orientation, branching, connectedness
  • Established the complexity for all combinations of the graph classes we considered

Drawbacks and future work:

  • Our graph classes may seem “arbitrary”
  • Not yet a dichotomy, just starting to understand the problem
  • Practical applications?

13/20

slide-43
SLIDE 43

Lowering our expectations

What if we want the complexity to be:

  • Tractable in the data
  • Not too horrible in the query

Can we then support a more expressive query language (e.g., disjunctions, negations, recursion)?

14/20

slide-48
SLIDE 48

Starting point

  • Existing data dichotomy result on instances

→ PQE for MSO on bounded-treewidth instances has linear data complexity [Amarilli, Bourhis, & Senellart, 2015]

  • Problem: nonelementary in the query: a tower of exponentials 2^(2^(···^|Q|))

The instance class is parameterized

Idea: one parameter for the instances and one parameter for the queries

15/20

slide-52
SLIDE 52

Parameterized Complexity

Idea: one parameter kI for the instance (treewidth) AND one parameter kQ for the query

  • Instance classes I1, I2, · · ·
  • Query classes Q1, Q2, · · ·

Definition: The problem is fixed-parameter tractable (FPT) linear if there exists a computable function f such that it can be solved in time f(kI, kQ) × |Q| × |I|

16/20

slide-55
SLIDE 55

Publication in ICDT’2017

1) A new language...

  • We introduce the language of intensional-clique-guarded Datalog (ICG-Datalog), parameterized by body-size kP

2) ... with FPT-linear (combined) evaluation...

  • Given an ICG-Datalog program P with body-size kP and a relational instance I of treewidth kI, checking if I ⊨ P can be done in time f(kP, kI) × |P| × |I|

3) ... and also FPT-linear (combined) computation of provenance

  • We design a new concise provenance representation based on cyclic Boolean circuits: cycluits

17/20

slide-56
SLIDE 56

ICG-Datalog program P of body-size ≤ kP:

C(x) ← Subway("Corvisart", x)
C(x) ← C(y) ∧ Subway(y, x)
Goal() ← ¬C("Châtelet")

"Under which conditions is it impossible to go from station Corvisart to station Châtelet with the subway?"

Database I of treewidth ≤ kI: (Paris Metro map)

1) Program P → two-way alternating tree automaton A, in O(g(kP, kI) |P|); database I → tree encoding E, in O(g'(kI) |I|)
2) Automaton A and tree encoding E → provenance cycluit, in O(|A| · |E|)
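To make the semantics of this example program concrete, here is a naive evaluation on a single, non-probabilistic database: compute the least fixpoint of the two rules for C, then apply the negation in the Goal rule. This is only a sketch of standard stratified Datalog evaluation, not the automaton-based approach of the talk. The toy `subway` edges are hypothetical and do not reflect the real Metro map, and the function names are made up for illustration.

```python
# Hypothetical toy Subway relation (directed edges), for illustration only.
subway = {
    ("Corvisart", "Place d'Italie"),
    ("Place d'Italie", "Châtelet"),
    ("Châtelet", "Corvisart"),
}

def reachable_from(source, edges):
    """Least fixpoint of C(x) <- Subway(source, x) and C(x) <- C(y), Subway(y, x)."""
    c = {x for (y, x) in edges if y == source}
    changed = True
    while changed:
        changed = False
        for (y, x) in edges:
            if y in c and x not in c:
                c.add(x)
                changed = True
    return c

def goal(edges):
    """Goal() <- not C("Châtelet"), evaluated once the fixpoint for C is known."""
    return "Châtelet" not in reachable_from("Corvisart", edges)

print(goal(subway))
print(goal(subway - {("Place d'Italie", "Châtelet")}))
```

With the full toy relation, Châtelet is reachable and Goal fails; removing the second edge makes Goal hold. The provenance question on the slide asks for a description of all such sets of present edges.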

18/20

slide-57
SLIDE 57

Application to PQE

Theorem: Having fixed kP and kI, we can solve PQE in O(2^(2^(|P|^α)) |I| |P|).

  • 2EXP, but still better than previous nonelementary bounds

19/20

slide-58
SLIDE 58

Conclusion

Up to now:

  • Study of the combined complexity of PQE
  • Tractable cases quite restricted
  • If we lower our expectations then we can capture more expressive query languages

Ongoing and future work:

  • Lots of open technical questions
  • Started a collaboration with Dan Olteanu (Univ. of Oxford) on mixed probabilistic models

  • Practical applications?

Thanks for your attention!

20/20