SLIDE 1

Inference and Learning for Probabilistic Logic Programming

Fabrizio Riguzzi

Dipartimento di Matematica e Informatica Università di Ferrara

SLIDE 2

Outline

1. Probabilistic Logic Languages
2. Distribution Semantics
3. Inference
4. Parameter Learning
5. Structure Learning
6. Challenges for Future Work

SLIDE 3

Probabilistic Logic Languages

Combining Logic and Probability

• Useful to model domains with complex and uncertain relationships among entities
• Many approaches proposed in the areas of Logic Programming, Uncertainty in AI, Machine Learning, and Databases
• Distribution Semantics [Sato ICLP95]:
  • A probabilistic logic program defines a probability distribution over normal logic programs (called instances, possible worlds, or simply worlds)
  • The distribution is extended to a joint distribution over worlds and interpretations (or queries)
  • The probability of a query is obtained from this distribution
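Concretely (this formula is standard for the distribution semantics, though not spelled out on the slide): the probability of a query Q is the sum of the probabilities of the worlds where Q holds,

P(Q) = Σ_{w : w ⊨ Q} P(w)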

SLIDE 4

Probabilistic Logic Languages

Probabilistic Logic Programming (PLP) Languages under the Distribution Semantics

• Probabilistic Logic Programs [Dantsin RCLP91]
• Probabilistic Horn Abduction [Poole NGC93]
• Independent Choice Logic (ICL) [Poole AI97]
• PRISM [Sato ICLP95]
• Logic Programs with Annotated Disjunctions (LPADs) [Vennekens et al. ICLP04]
• ProbLog [De Raedt et al. IJCAI07]

They differ in the way they define the distribution over logic programs.

SLIDE 5

Probabilistic Logic Languages

Logic Programs with Annotated Disjunctions

sneezing(X) : 0.7 ∨ null : 0.3 ← flu(X).
sneezing(X) : 0.8 ∨ null : 0.2 ← hay_fever(X).
flu(bob).
hay_fever(bob).

• Distributions over the heads of rules
• null does not appear in the body of any rule
• Worlds are obtained by selecting one atom from the head of every grounding of each clause
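As a worked check, here is a minimal Python sketch (my own illustration, not from the talk) that computes P(sneezing(bob)) for this LPAD by enumerating worlds; each grounding of each disjunctive clause independently selects one head atom.

from itertools import product

# One entry per clause grounding: the possible head choices, as pairs
# (does the choice make sneezing(bob) true?, probability of the choice).
clauses = [
    [(True, 0.7), (False, 0.3)],  # sneezing(X):0.7 ∨ null:0.3 ← flu(X)
    [(True, 0.8), (False, 0.2)],  # sneezing(X):0.8 ∨ null:0.2 ← hay_fever(X)
]

p_query = 0.0
for world in product(*clauses):       # 2 x 2 = 4 worlds
    p_world = 1.0
    for _, p in world:
        p_world *= p                  # world probability: product of the choices
    if any(sneezes for sneezes, _ in world):
        p_query += p_world            # sneezing(bob) holds in this world

print(p_query)                        # 1 - 0.3 * 0.2 = 0.94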

SLIDE 6

Probabilistic Logic Languages

ProbLog

sneezing(X) ← flu(X), flu_sneezing(X).
sneezing(X) ← hay_fever(X), hay_fever_sneezing(X).
flu(bob).
hay_fever(bob).
0.7 :: flu_sneezing(X).
0.8 :: hay_fever_sneezing(X).

• Distributions over facts
• Worlds are obtained by including or not every grounding of each probabilistic fact
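The same query under ProbLog's view, where worlds are the subsets of the ground probabilistic facts (again a sketch of mine, not from the talk): a fact with probability p is included with probability p and left out with probability 1 − p.

from itertools import product

facts = {"flu_sneezing(bob)": 0.7, "hay_fever_sneezing(bob)": 0.8}

p_query = 0.0
for choices in product([True, False], repeat=len(facts)):
    p_world = 1.0
    for (_, p), included in zip(facts.items(), choices):
        p_world *= p if included else 1 - p
    # sneezing(bob) is derivable when either probabilistic fact is
    # included, since flu(bob) and hay_fever(bob) always hold.
    if any(choices):
        p_query += p_world

print(p_query)  # 0.94, the same value as under the LPAD encoding

That the two encodings agree on this query reflects the known translations between LPADs and ProbLog.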

SLIDE 7

Distribution Semantics

Reasoning Tasks

• Inference: compute the probability of a query given the model and, possibly, some evidence
• Weight learning: the structural part of the model (the logic formulas) is known but the numeric part (the weights) is not; infer the weights from data
• Structure learning: infer both the structure and the weights of the model from data

SLIDE 8

Inference

Inference for PLP under DS

Computing the probability of a query (no evidence).

Knowledge compilation:
• compile the program to an intermediate representation:
  • Binary Decision Diagrams (ProbLog [De Raedt et al. IJCAI07], cplint [Riguzzi AIIA07, Riguzzi LJIGPL09], PITA [Riguzzi & Swift ICLP10])
  • deterministic, Decomposable Negation Normal Form circuits (d-DNNF) (ProbLog2 [Fierens et al. TPLP13])
  • Sentential Decision Diagrams (this morning's talk)
• compute the probability by weighted model counting

Bayesian network based:
• convert to a BN and use BN inference algorithms (CVE [Meert et al. ILP09])

Lifted inference
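To make the knowledge-compilation route concrete, here is a small Python sketch (my own, under the assumption that a BDD has already been built) that computes a query probability from a BDD by Shannon expansion: at a node for variable X with probability p, the probability is p times that of the high child plus 1 − p times that of the low child.

def bdd_prob(node, probs, cache=None):
    # node is True, False, or a tuple (var, low, high); probs maps each
    # Boolean variable to its probability of being true. Variables are
    # independent, as under the distribution semantics.
    if cache is None:
        cache = {}
    if node is True:
        return 1.0
    if node is False:
        return 0.0
    if id(node) not in cache:
        var, low, high = node
        p = probs[var]
        # Shannon expansion: condition on var being true or false
        cache[id(node)] = (p * bdd_prob(high, probs, cache)
                           + (1 - p) * bdd_prob(low, probs, cache))
    return cache[id(node)]

# BDD for X1 ∨ X2, the explanations of sneezing(bob) above
X2 = ("X2", False, True)
X1 = ("X1", X2, True)
print(bdd_prob(X1, {"X1": 0.7, "X2": 0.8}))  # 0.94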

SLIDE 9

Inference

Lifted Inference

• Previous approaches: ground the program and run propositional probabilistic inference
• Inference has high complexity: #P in general
• The grounding may be exponential in the size of the domains of the variables
• In special cases we can use algorithms polynomial in the size of the domains of the variables

Example:

p :: famous(Y).
popular(X) :- friends(X, Y), famous(Y).

In this case P(popular(john)) = 1 − (1 − p)^m, where m is the number of friends of john, because popular(john) is the noisy OR of famous(P) for all friends P of john. We do not need to know the identities of these friends, and hence need not ground the clauses.
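A short contrast (my own sketch, with hypothetical p and m) of the grounded and the lifted computation: the grounded version costs one factor per ground famous/1 fact, while the lifted version is a single closed-form noisy-OR evaluation whose cost does not grow with m.

p, m = 0.3, 1000   # hypothetical fact probability and number of friends

p_false = 1.0
for _ in range(m):              # grounded: one factor per friend
    p_false *= 1 - p
p_grounded = 1 - p_false

p_lifted = 1 - (1 - p) ** m     # lifted: one closed-form evaluation

assert abs(p_grounded - p_lifted) < 1e-9
print(p_lifted)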

SLIDE 10

Inference

Lifted variable elimination

• LP2 [Bellodi et al. ICLP14]: translate a ProbLog program into the Prolog Factor Language (PFL) [Gomes and Santos Costa ILP12], then run lifted variable elimination (GC-FOVE [Taghipour et al. JAIR13]) on the PFL program
• Example: workshop attributes [Milch et al. AAAI08]. A workshop is being organized and a number of people have been invited; series indicates whether the workshop is successful enough to start a series of related meetings

series :- s.
series :- attends(P).
attends(P) :- at(P,A).
0.1::s.
0.3::at(P,A) :- person(P), attribute(A).
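Under the distribution semantics all choices are independent, so P(series) has a closed form that a lifted method can exploit; a quick sanity check (my own sketch, with hypothetical n people and m attributes):

n, m = 50, 10            # hypothetical numbers of people and attributes
p_s, p_at = 0.1, 0.3

p_attends = 1 - (1 - p_at) ** m               # attends(P): noisy OR of at(P,A)
p_series = 1 - (1 - p_s) * (1 - p_attends) ** n
print(p_series)          # equals 1 - 0.9 * 0.7 ** (n * m)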

SLIDE 11

Inference

Lifted Variable Elimination

• series is the noisy OR of the attends(P) atoms in the factor
• attends(P) is the noisy OR of the at(P,A) atoms in the factor
• After grounding, factors derived from the second and the fourth clauses should not be multiplied together but should be combined with heterogeneous multiplication, as in VE with causal independence
• The variables series and attends(P) are in fact convergent variables

SLIDE 12

Inference

Lifted Variable Elimination with Causal Independence

• Heterogeneous factors are combined with heterogeneous multiplication
• Deputy variables stand in for convergent variables
• We introduce two new types of factors to PFL, het and deputy, and two new operations, heterogeneous multiplication and heterogeneous summation

het series1p, s; identity; [].
het series2p, attends(P); identity; [person(P)].
deputy series2, series2p; [].
deputy series1, series1p; [].
bayes series, series1, series2; disjunction; [].
het attends1p(P), at(P,A); identity; [person(P),attribute(A)].
deputy attends1(P), attends1p(P); [person(P)].
bayes attends(P), attends1(P); identity; [person(P)].
bayes s; [0.9, 0.1]; [].
bayes at(P,A); [0.7, 0.3]; [person(P),attribute(A)].

SLIDE 13

Inference

Workshop Attributes

Query series, with the number of people fixed to 50 and the number of attributes m increasing.

[Log-scale plots of runtime (s) against the number of attributes: one comparing LP2, PITA, and ProbLog2, and one showing LP2 alone on larger domains.]

SLIDE 14

Inference

Weighted Model Counting (WMC)

• First-order WMC [Van den Broeck et al. IJCAI11] compiles theories in first-order logic with a weight function on literals (without existential quantifiers) to FO d-DNNF, from which WMC is polynomial
• Problem: when translating ProbLog into first-order logic, existential quantifiers appear for variables occurring only in the body: series ← attends(P) translates to series ∨ ¬∃P attends(P)
• Skolemization deals with the existential quantifiers ([Van den Broeck et al. KR14], previous talk): for each existential quantifier, two predicates are introduced, a Tseitin predicate and a Skolem predicate; no function symbols are used

SLIDE 15

Parameter Learning

Parameter Learning

• Problem: given a program and a set of interpretations, find the parameters maximizing the likelihood of the interpretations (or of instances of a target predicate)
• Exploit the equivalence with BNs to use BN learning algorithms
• The interpretations record the truth values of ground atoms, not of the choice variables
• Unseen data: relative frequency can't be used; an Expectation-Maximization algorithm must be used:
  • Expectation step: the distribution of the unseen variables in each instance is computed given the observed data
  • Maximization step: new parameters are computed from the distributions using relative frequency
  • Stop when the likelihood no longer improves

SLIDE 16

Parameter Learning

Parameter Learning

• [Thon et al. PKDD08] proposed an adaptation of EM for CPT-L, a simplified version of LPADs; the algorithm computes the counts efficiently by repeatedly traversing the BDDs representing the explanations
• [Ishihata et al. ILP08] independently proposed a similar algorithm
• LFI-PROBLOG [Gutmann et al. ECML11] is the adaptation of EM to ProbLog
• EMBLEM [Bellodi & Riguzzi IDA13] adapts [Ishihata et al. ILP08] to LPADs

SLIDE 17

Parameter Learning

EMBLEM

• EMBLEM: EM over Bdds for probabilistic Logic programs Efficient Mining
• Input: an LPAD, logical interpretations (data), and target predicate(s)
• All ground atoms in the interpretations for the target predicate(s) correspond to as many queries
• BDDs encode the disjunction of the explanations for each query Q

SLIDE 18

Parameter Learning

EM Algorithm

Expectation step (synthesis):

1. Expectations E[c_ik0] and E[c_ik1], where c_ikx is the number of times the Boolean variable X_ijk takes value x, for all clauses C_i and k = 1, …, n_i − 1:

   E[c_ikx] = Σ_Q E[c_ikx | Q]

2. Expected counts per query E[c_ikx | Q], for all queries Q and x ∈ {0, 1}:

   E[c_ikx | Q] = Σ_{j ∈ g(i)} P(X_ijk = x | Q),  where g(i) = {j | θ_j is a substitution grounding C_i}

Maximization step:

Update the parameters π_ik, representing P(X_ijk = 1):

   π_ik = E[c_ik1] / (E[c_ik0] + E[c_ik1])
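A toy run of these updates (my own sketch, not the EMBLEM implementation, which computes the expectations by traversing BDDs): EM for the two parameters of the sneezing program, where each example records only whether sneezing(bob) is true.

def em(observations, pi1=0.5, pi2=0.5, iters=200):
    # observations: list of booleans, the truth value of the query in
    # each example; the query is true iff X1 or X2 is chosen true.
    for _ in range(iters):
        e1 = e2 = 0.0
        for q in observations:
            if q:
                # Expectation: P(Xi = 1 | Q = true) by Bayes' rule
                p_q = 1 - (1 - pi1) * (1 - pi2)
                e1 += pi1 / p_q
                e2 += pi2 / p_q
            # if q is false, both causes are certainly false: count 0
        n = len(observations)
        pi1, pi2 = e1 / n, e2 / n   # Maximization: relative frequency
    return pi1, pi2

print(em([True] * 94 + [False] * 6))

With only the query observed, the two parameters are not separately identifiable: starting from symmetric values, EM converges to π1 = π2 ≈ 0.755, which reproduces the query probability 1 − (1 − π)² = 0.94.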

SLIDE 19

Structure Learning

Structure Learning for LPADs

Find the model and the parameters that maximize the probability of the data (log-likelihood).

1. SLIPCASE: Structure LearnIng of ProbabilistiC logic progrAmS with Em over bdds [Bellodi & Riguzzi ILP11]
   • Beam search in the space of probabilistic programs
2. SLIPCOVER: Structure LearnIng of Probabilistic logic programs by searching OVER the clause space [Bellodi & Riguzzi TPLP13]
   • 1. Beam search in the space of clauses to find the promising ones
   • 2. Greedy search in the space of probabilistic programs guided by the LL of the data

Both perform parameter learning by means of EMBLEM.

SLIDE 20

Structure Learning

SLIPCASE

Compute the optimal parameters and the log-likelihood LL of the data for Theory with EMBLEM; set best_theory = Theory, best_likelihood = LL.

Beam search:

1. Beam: the N theories with the highest log-likelihood, initially Theory
2. Remove the 1st theory from the beam and generate its refinements: language bias with modeh/modeb declarations, +/− a literal in a clause and +/− a clause
3. Estimate the LL of each refinement with Nmax iterations of EMBLEM
4. Update (best_theory, best_likelihood)
5. Insert the refinements in the beam, ordered by likelihood
6. Remove the refinements exceeding the size of the beam

Stop the search after MaxSteps iterations or if the beam is empty; then run EMBLEM over the best theory. (A schematic version of the search follows.)
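A sketch of a beam search of this shape, under stated assumptions: refine and score are hypothetical stand-ins for SLIPCASE's theory-revision operators and its EMBLEM-based scoring, which are not reproduced here.

import heapq

def beam_search(theory, refine, score, beam_size, max_steps):
    best_theory, best_ll = theory, score(theory)
    beam = [(-best_ll, 0, theory)]  # min-heap on negated LL; int breaks ties
    tie = 1
    for _ in range(max_steps):
        if not beam:
            break                             # stop on an empty beam
        _, _, current = heapq.heappop(beam)   # remove the 1st theory
        for ref in refine(current):           # +/- literal, +/- clause
            ll = score(ref)                   # e.g. Nmax EMBLEM iterations
            if ll > best_ll:
                best_theory, best_ll = ref, ll
            heapq.heappush(beam, (-ll, tie, ref))
            tie += 1
        # keep only the beam_size theories with the highest LL
        beam = heapq.nsmallest(beam_size, beam)
        heapq.heapify(beam)
    return best_theory, best_ll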

SLIDE 21

Structure Learning

SLIPCOVER

• Cycle on the set of predicates that can appear in the head of clauses, either target or background
• For each predicate, beam search in the space of clauses
• The initial set of beams IBs, one for each predicate appearing in a head declaration, is generated by building a set of bottom clauses as in Progol [Muggleton NGC95]
• EMBLEM is then executed for a theory composed of the single refined clause Cl′; the LL is used as the score of the updated clause Cl′′
• Cl′′ is then inserted into a list of promising clauses

SLIDE 22

Structure Learning

SLIPCOVER

Then greedy search in the space of theories:

• SLIPCOVER starts with an empty theory and adds one target clause at a time from the list of promising clauses
• After each addition, it runs EMBLEM and computes the LL of the data as the score of the resulting theory; if the score is better than the current best, the clause is kept in the theory, otherwise it is discarded

Finally, SLIPCOVER adds all the promising background clauses to the theory and performs parameter learning on the resulting theory.
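The greedy phase reduces to a few lines (again a sketch of mine; promising_clauses and score are hypothetical stand-ins for SLIPCOVER's promising-clause list and its EMBLEM-based LL computation):

def greedy_theory_search(promising_clauses, score):
    theory, best_ll = [], float("-inf")
    for clause in promising_clauses:    # one target clause at a time
        ll = score(theory + [clause])   # run EMBLEM, get the LL of the data
        if ll > best_ll:                # keep the clause only if LL improves
            theory.append(clause)
            best_ll = ll
    return theory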

SLIDE 23

Structure Learning

Experiments - Area Under the PR Curve

System        HIV          UW-CSE        WebKB        Mutagenesis   Hepatitis
SLIPCOVER     0.82 ± 0.05  0.11 ± 0.08   0.47 ± 0.05  0.95 ± 0.01   0.80 ± 0.01
SLIPCASE      0.78 ± 0.05  0.03 ± 0.01   0.31 ± 0.21  0.92 ± 0.08   0.71 ± 0.05
LSM           0.37 ± 0.03  0.07 ± 0.02   –            –             0.53 ± 0.04
SEM-CP-logic  0.58 ± 0.03  –             –            –             –
Aleph         –            0.07 ± 0.02   0.15 ± 0.05  0.73 ± 0.09   –
ALEPH++       –            0.05 ± 0.006  0.37 ± 0.16  0.95 ± 0.009  –

SLIDE 24

Challenges for Future Work

Challenges for Future Work

Inference:
• Identify the portion of a program relevant to a query in a lifted way using First-Order Bayes Ball [Meert et al. ECML10]
• Lift other circuits, such as SDDs

Learning:
• Apply lifted inference to parameter learning; for WFOMC see [Van den Broeck et al. STARAI13]
• Approximate inference for parameter optimization: pseudo-likelihood, piecewise shattering
• Other search approaches for structure learning, such as local and randomized search, and gradient-based boosting [Natarajan et al. ML12]
