SLIDE 1

Machine Learning and Association rules

Petr Berka, Jan Rauch University of Economics, Prague {berka|rauch}@vse.cz

SLIDE 2

Tutorial Outline

- Statistics, machine learning and data mining – basic concepts, similarities and differences (P. Berka)
- Machine Learning Methods and Algorithms – general overview and selected methods (P. Berka)
- Break
- GUHA Method and LISp-Miner System (J. Rauch)

Tutorial @ COMPSTAT 2010
SLIDE 3

Part 1

Statistics, machine learning and data mining

SLIDE 4

Statistics

- A formal science that deals with the collection, analysis, interpretation, explanation and presentation of (usually numerical) data.
- The science of making effective use of numerical data relating to groups of individuals or experiments.

(Wikipedia)

SLIDE 5


Machine Learning

- "The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience." (Mitchell, 1997)
- "Things learn when they change their behavior in a way that makes them perform better in the future." (Witten, Frank, 1999)

SLIDE 6


Knowledge Discovery in Databases

- "Non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns from data." (Fayyad et al., 1996)
- "Analysis of observational data sets to find unsuspected relationships and summarize data in novel ways that are both understandable and useful to the data owner." (Hand, Mannila, Smyth, 2001)

SLIDE 7


The CRISP-DM Methodology

(diagram of the CRISP-DM process model for data mining)

SLIDE 8


(diagram: Machine Learning covers skill acquisition, empirical concept learning and analytical concept learning; Statistics covers exploratory data analysis, descriptive statistics and confirmatory data analysis; Data Mining lies at their overlap)

SLIDE 9

Statistics vs. Machine Learning

 Hypothesis driven  Model oriented

 formulate hypothesis  collect data (in a

controlled way)

 analyze data  interpret results

 Data driven  Algorithm oriented

 formulate a task  preprocess available

data

 apply (different)

algorithms

 interpret results

SLIDE 10

Terminological differences

Machine Learning            | Statistics
----------------------------|---------------------------------
attribute                   | variable
target attribute, class     | dependent variable, response
input attribute             | independent variable, predictor
learning                    | fitting, parameter estimation
weights (in neural nets)    | parameters (in regression)
error                       | residuum

SLIDE 11

Similarities

- algorithms
  - decision trees: C4.5 ~ CART
  - neural networks ~ regression
  - nearest neighbor classification
- methods
  - cross-validation test
  - χ2 test

SLIDE 12

Part 2

Machine Learning Methods and Algorithms

SLIDE 13


Learning methods

rote learning (memorizing)

learning from instruction, learning by being told

learning by analogy, instance-based learning, lazy learning

explanation-based learning

learning from examples

learning from observation and discovery

SLIDE 14


Feedback during learning

- pre-classified examples (supervised learning)
- rewards or punishments (reinforcement learning)
- indirect hints derived from the behaviour of a teacher (apprenticeship learning)
- nothing (unsupervised learning)

SLIDE 15


Illustrative Example

Data about patients with different atherosclerosis risk

Pac-id | DIAST | CHLST | risk
P1     |  100  |  300  | yes
P2     |   85  |  247  | no
P3     |   87  |  291  | yes
P4     |  105  |  259  | yes
P5     |   81  |  231  | no
P6     |  105  |  288  | yes
...

SLIDE 16

Atherosclerosis risk factors study

Longitudinal (1975-2000) study of atherosclerosis risk factors in the population of middle-aged men divided into three groups (normal, risk, pathological).

- to identify the prevalence of atherosclerosis risk factors in a population of middle-aged men,
- to follow the development of these risk factors and their impact on the examined men's health, especially with respect to atherosclerotic CVD,
- to study the impact of complex risk factor intervention on the development of risk factors and CVD mortality,
- to compare (after 10-12 years) the risk factor profiles and health of the selected men in the different groups.

SLIDE 17

Data STULONG

Four data tables: Entry (1419 rows x 64 columns), Control (10572 x 66), Letter (403 x 62), Death (389 x 5)

Find knowledge that can be used to classify new patients according to atherosclerosis risk

SLIDE 18


Empirical concept learning

- examples belonging to the same class have similar characteristics (similarity-based learning)
- we infer general knowledge from a finite set of examples (inductive learning)

SLIDE 19


Empirical concept learning from data (1/3)

- Analyzed data: a matrix D = (x_ij) with one row per example (i = 1, ..., n) and one column per input attribute (j = 1, ..., m); the training data matrix D_TR extends D with the column of target values y_1, ..., y_n.

- Classification task: we search for knowledge (represented by a decision function f), f: x -> y, that for the input values x of an example infers the value of the target attribute, ŷ = f(x).

SLIDE 20


Empirical concept learning from data (2/3)

- During classification of an example o_i we can make an error Q_f(o_i, ŷ_i), e.g. the quadratic error

  Q_f(o_i, ŷ_i) = (y_i - ŷ_i)^2

  or the 0/1 error

  Q_f(o_i, ŷ_i) = 1 for ŷ_i ≠ y_i, 0 for ŷ_i = y_i.

- For the whole training data D_TR we can compute the total error Err(f, D_TR), e.g. as

  Err(f, D_TR) = (1/n) Σ_{i=1}^{n} Q_f(o_i, ŷ_i).
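As a small sketch of these two formulas, the 0/1 error and the mean training error can be computed as below; the decision function and toy data are hypothetical illustrations, not the STULONG attributes:

```python
def zero_one_loss(y_true, y_pred):
    """Q_f(o_i, y_hat_i): 1 for a misclassification, 0 otherwise."""
    return 0 if y_true == y_pred else 1

def total_error(f, examples):
    """Err(f, D_TR) = (1/n) * sum of per-example losses over the training data."""
    losses = [zero_one_loss(y, f(x)) for x, y in examples]
    return sum(losses) / len(losses)

# toy decision function: predict high risk when DIAST exceeds 90
f = lambda x: "yes" if x["DIAST"] > 90 else "no"
data = [({"DIAST": 100}, "yes"), ({"DIAST": 85}, "no"), ({"DIAST": 87}, "yes")]
print(total_error(f, data))  # 1 of 3 examples misclassified -> 0.333...
```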

SLIDE 21


Empirical concept learning from data (3/3)

- The goal of learning is to find such knowledge f* that minimizes this error:

  Err(f*, D_TR) = min_f Err(f, D_TR)

SLIDE 22


Empirical concept learning as …

- ... search
  - we are learning both the structure and the parameters of a model
- ... approximation
  - we are learning the parameters of a model

SLIDE 23


Search (1/2)

- Ordering of models:
  - MGM – most general model (one cluster for all examples)
  - MSM – most specific model(s) (a single cluster for each example)
  - M1 more general than M2; M2 more specific than M1

SLIDE 24


Search (2/2)

- Search methods:
  - Direction: top-down, bottom-up
  - Strategy: blind, heuristic, random
  - Breadth: single, parallel

SLIDE 25


Approximation (1/2)

Estimation of the parameters of a model (decision function) y = f(x) using a set of values [x_i, y_i].

Least squares method: looking for parameters that minimize the overall error

  Σ_i (y_i - f(x_i))^2,

transformed to solving the equation

  d/dq Σ_i (y_i - f(x_i))^2 = 0.

SLIDE 26


Approximation (2/2)

- Analytical solution (known type of the function): solving a set of equations for the parameters
  - regression
- Numerical solution (unknown type of the function)
  - gradient methods

Modification of the parameters q = [q_0, q_1, ..., q_Q] as q_j <- q_j + Δq_j, where

  Δq_j = -η ∂Err/∂q_j

and the gradient of the error is ∇Err(q) = [∂Err/∂q_0, ..., ∂Err/∂q_Q].
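A minimal numeric sketch of this update rule; the quadratic error function and the learning rate η = 0.1 are illustrative choices, not taken from the slides:

```python
def gradient_step(params, grad, eta=0.1):
    """One gradient-descent update: q_j <- q_j - eta * dErr/dq_j."""
    return [q - eta * g for q, g in zip(params, grad)]

# minimize Err(q) = (q0 - 3)^2; its gradient is [2 * (q0 - 3)]
q = [0.0]
for _ in range(100):
    q = gradient_step(q, [2 * (q[0] - 3)])
print(round(q[0], 3))  # converges towards 3.0
```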

SLIDE 27


Selected algorithms

decision trees

decision rules

association rules

neural networks

genetic algorithms

Bayesian methods

nearest-neighbor methods

SLIDE 28


Decision tree algorithms

TDIDT algorithm

1. select the best splitting attribute as the root of the current (sub)tree,
2. divide the data in this node into subsets according to the values of the selected attribute and add a new node for each subset,
3. if there is an added node for which the data do not belong to the same class, go to step 1.

Limitations: only categorical attributes, only data without noise.

SLIDE 29


Splitting criteria

- How to select a splitting attribute?

Contingency table of the input attribute X (rows X_1, ..., X_R, row sums r_i) and the class attribute Y (columns Y_1, ..., Y_S, column sums s_j), with cell counts a_ij and total n:

- Entropy (min) – ID3, C4.5:  H(X) = Σ_{i=1}^{R} (r_i/n) (-Σ_{j=1}^{S} (a_ij/r_i) log2 (a_ij/r_i))
- Gini index (min) – CART:    Gini(X) = Σ_{i=1}^{R} (r_i/n) (1 - Σ_{j=1}^{S} (a_ij/r_i)^2)
- χ2 (max) – CHAID:           χ2(X) = Σ_{i=1}^{R} Σ_{j=1}^{S} (a_ij - r_i s_j / n)^2 / (r_i s_j / n)
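The three criteria can be sketched directly from such a contingency table; the toy table below is illustrative, not from the STULONG data:

```python
import math

def entropy(table):
    """H(X): class entropy after the split, weighted by row frequencies r_i/n."""
    n = sum(map(sum, table))
    return sum((sum(row) / n) *
               -sum((a / sum(row)) * math.log2(a / sum(row)) for a in row if a)
               for row in table if sum(row))

def gini(table):
    """Gini(X): weighted Gini impurity of the rows (minimized by CART)."""
    n = sum(map(sum, table))
    return sum((sum(row) / n) * (1 - sum((a / sum(row)) ** 2 for a in row))
               for row in table if sum(row))

def chi2(table):
    """Chi-square statistic of the contingency table (maximized by CHAID)."""
    n = sum(map(sum, table))
    cols = [sum(row[j] for row in table) for j in range(len(table[0]))]
    return sum((a - sum(row) * c / n) ** 2 / (sum(row) * c / n)
               for row in table for a, c in zip(row, cols))

t = [[30, 10], [5, 55]]  # rows: values of X, columns: classes
print(round(entropy(t), 3), round(gini(t), 3), round(chi2(t), 1))
```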

SLIDE 30


Decision trees in the attribute space

SLIDE 31


Decision trees (search)

- top-down (TDIDT)
  - single, heuristic: ID3, C4.5 (Quinlan), CART (Breiman et al.)
  - parallel, heuristic: Option trees (Buntine), Random forest (Breiman)
  - random, parallel: using genetic programming
- bottom-up: additional technique used during tree pruning

SLIDE 32


Decision rules – set covering algorithms

Each training example is covered by a single rule, which allows straightforward use during classification.

Set covering algorithm:

1. create a rule that covers some examples of one class and does not cover any examples of other classes,
2. remove the covered examples from the training data,
3. if there are some examples not covered by any rule, go to step 1.
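A deliberately simplified sketch of this loop, where every rule tests a single attribute-value pair; real set-covering learners such as AQ or CN2 search a much richer rule space:

```python
def find_rule(examples):
    """Step 1: find an (attribute, value, class) rule that covers at least one
    example and only examples of a single class."""
    for x, y in examples:
        for attr, val in x.items():
            covered = [cy for cx, cy in examples if cx.get(attr) == val]
            if covered and all(c == y for c in covered):
                return (attr, val, y)
    return None

def set_covering(examples):
    """Steps 2-3: remove the covered examples and repeat until all are covered."""
    rules, remaining = [], list(examples)
    while remaining:
        rule = find_rule(remaining)
        if rule is None:  # no pure single-condition rule exists; give up
            break
        attr, val, _ = rule
        rules.append(rule)
        remaining = [(x, y) for x, y in remaining if x.get(attr) != val]
    return rules

data = [({"DIAST": "high"}, "yes"), ({"DIAST": "low"}, "no")]
print(set_covering(data))
```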

SLIDE 33


Decision rules in the attribute space

IF DIAST(high) THEN risk(yes)
IF CHLST(high) THEN risk(yes)
IF DIAST(low) AND CHLST(low) THEN risk(no)

SLIDE 34


Decision rules (search)

- top-down
  - parallel, heuristic: CN2 (Clark, Niblett), CN4 (Bruha)
- bottom-up
  - single, heuristic: Find-S (Mitchell)
  - parallel, heuristic: AQ (Michalski)
- random
  - parallel: GA-CN4 (Králík, Bruha)

Example of top-down rule specialization: IF DIAST(low) THEN ... is specialized to IF DIAST(low) AND CHLST(low) THEN ...

SLIDE 35

Decision rules – compositional algorithms (search)

KEX algorithm

1. add the empty rule to the rule set KB
2. repeat
   2.1 find by rule specialization a rule Ant => C that fulfils the user-given criteria on length and validity,
   2.2 if this rule significantly improves the set of rules KB built so far, then add the rule to KB

Each training example can be covered by several rules; these rules all contribute to the final decision during classification.

SLIDE 36

KEX algorithm – more details

SLIDE 37


Association rules

        SUC    ¬SUC
ANT     257      43    300
¬ANT     66    1036   1102
        323    1079   1402

IF smoking(no) AND diast(low) THEN chlst(low)

- support: a/(a+b+c+d) = 257/1402 = 0.18
- confidence: a/(a+b) = 257/300 = 0.86
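Reading the two measures off the fourfold table from this slide (a = 257, b = 43, c = 66, d = 1036):

```python
def support(a, b, c, d):
    """Support of ANT => SUC: fraction of all rows satisfying both parts."""
    return a / (a + b + c + d)

def confidence(a, b, c, d):
    """Confidence of ANT => SUC: fraction of ANT rows that also satisfy SUC."""
    return a / (a + b)

print(round(support(257, 43, 66, 1036), 2))     # 0.18
print(round(confidence(257, 43, 66, 1036), 2))  # 0.86
```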

SLIDE 38


Association rule (generating as top-down search)

(figure: the space of attribute combinations generated by top-down search, expanded either depth-first or breadth-first)

- depth-first, breadth-first: Apriori (Agrawal), LISp-Miner (Rauch)
- heuristic: KAD (Ivánek, Stejskal)

SLIDE 39

Association rules algorithm

apriori algorithm

1. set k = 1 and add all items that reach minsup into L
2. repeat:
   2.1 increase k
   2.2 consider an itemset C of length k
   2.3 if all subsets of length k-1 of the itemset C are in L, then: if C reaches minsup, add C into L
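A compact sketch of this level-wise search; the market-basket transactions are purely illustrative:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all itemsets reaching minsup; a k-itemset is only considered
    when every (k-1)-subset is already frequent (the apriori property)."""
    count = lambda c: sum(c <= t for t in transactions)
    L = {frozenset([i]) for t in transactions for i in t}
    L = {c for c in L if count(c) >= minsup}
    frequent, k = set(L), 2
    while L:
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        L = {c for c in candidates if count(c) >= minsup}
        frequent |= L
        k += 1
    return frequent

tx = [{"beer", "chips"}, {"beer", "chips", "wine"}, {"beer"}, {"chips"}]
print(sorted(map(sorted, apriori(tx, 2))))
```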

SLIDE 40

apriori – more details

SLIDE 41


Neural networks – single neuron

y' = 1 for Σ_{i=1}^{m} w_i x_i ≥ w
y' = 0 for Σ_{i=1}^{m} w_i x_i < w

(threshold unit: the output fires when the weighted sum of the inputs reaches the threshold w)

SLIDE 42


Neural networks – multilayer perceptron

SLIDE 43

Backpropagation algorithm = approximation

SLIDE 44

Genetic algorithms = parallel random search

SLIDE 45


Genetic algorithms

- Genetic operations:
  - selection
  - cross-over
  - mutation

SLIDE 46


Bayesian methods

- Naive Bayesian classifier (approximation):

  P(H | E_1, ..., E_K) = P(H) Π_{k=1}^{K} P(E_k | H) / P(E)

- Bayesian network (search, approximation):

  P(u_1, ..., u_n) = Π_{i=1}^{n} P(u_i | parents(u_i))

SLIDE 47


Naive bayesian classifier

- Computing the probabilities

  P(risk=yes) = 0.71, P(risk=no) = 0.19
  P(smoking=yes | risk=yes) = 0.81, P(smoking=no | risk=no) = 0.19
  ...

- Classification: choose the class H_i with the highest value of P(H_i) Π_k P(E_k | H_i)
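A sketch of this classification rule; the probability values below are illustrative numbers in the spirit of the slide, not the actual STULONG estimates:

```python
def naive_bayes_classify(priors, likelihoods, evidence):
    """Return the class H maximizing P(H) * prod_k P(E_k | H)."""
    best, best_score = None, -1.0
    for h, p_h in priors.items():
        score = p_h
        for e in evidence:
            score *= likelihoods[h][e]
        if score > best_score:
            best, best_score = h, score
    return best

# hypothetical probability estimates (priors chosen to sum to 1 here)
priors = {"yes": 0.71, "no": 0.29}
likelihoods = {"yes": {"smoking=yes": 0.81}, "no": {"smoking=yes": 0.19}}
print(naive_bayes_classify(priors, likelihoods, ["smoking=yes"]))  # yes
```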

SLIDE 48


Nearest-neighbor methods

Algorithm k-NN

Learning: add the examples [x_i, y_i] into the case base.

Classification of a new example x:
1. find the K nearest neighbors x_1, x_2, ..., x_K,
2. assign ŷ = the majority class among x_1, ..., x_K.
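The two steps above can be sketched as follows; the case base echoes the earlier toy (DIAST, CHLST) example and is illustrative only:

```python
import math
from collections import Counter

def knn_classify(case_base, x, k=3):
    """Return the majority class among the k nearest stored examples
    (Euclidean distance on numeric attribute vectors)."""
    neighbors = sorted(case_base, key=lambda ex: math.dist(ex[0], x))[:k]
    return Counter(y for _, y in neighbors).most_common(1)[0][0]

# toy case base of (DIAST, CHLST) -> risk pairs
base = [((100, 300), "yes"), ((85, 247), "no"), ((87, 291), "yes"),
        ((105, 259), "yes"), ((81, 231), "no")]
print(knn_classify(base, (90, 280), k=3))  # yes
```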

SLIDE 49


Nearest-neighbors in the attribute space

 Using examples  Using centroids

SLIDE 50


Nearest-neighbor methods

- Selecting instances to be added
  - no search: IB1 (Aha)
  - simple heuristic top-down search: IB2, IB3 (Aha)
- Clustering (identifying centroids)
  - simple heuristic search: top-down (divisive), bottom-up (agglomerative)
  - approximation: K-NN (given number of clusters)

SLIDE 51


Further readings

- T. Mitchell: Machine Learning. McGraw-Hill, 1997
- J. Han, M. Kamber: Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001
- I. Witten, E. Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java. 2nd edition. Morgan Kaufmann, 2005
- http://www.aaai.org/AITopics
- http://www.kdnuggets.com

SLIDE 52

Break

SLIDE 53

Part 3

GUHA Method and LISp-Miner System

SLIDE 54

GUHA Method and LISp-Miner System

Why here?

- Association rules coined by Agrawal in the 1990s
- More general rules studied since the 1960s
- GUHA method of mechanizing hypothesis formation
- Theory based on a combination of
  - mathematical logic
  - mathematical statistics
- Several implementations
  - LISp-Miner system
- Relevant tools and theory

SLIDE 55

Outline

 GUHA – main features  Association rule – couple of Boolean attributes  GUHA procedure ASSOC  LISp-Miner system  Related research 55 Tutorial @ COMPSTAT 2010

SLIDE 56

GUHA – main features

Starting questions:

- Can computers formulate and verify scientific hypotheses?
- Can computers in a rational way analyse empirical data and produce a reasonable reflection of the observed empirical world?
- Can it be done using mathematical logic and statistics?

1978

SLIDE 57

Examples of hypothesis formation

Evidence -> observational statement -> theoretical statement

(1): theoretical statement -> observational statement
(2), (3): theoretical statement ??? observational statement

SLIDE 58

From an observational statement to a theoretical statement

- Justified by some rules of rational inductive inference
- Some philosophers reject any possibility of formulating such rules
- Nobody believes that there can be universal rules
- There are non-trivial rules of inductive inference applicable under some well-described circumstances
- Some of them are useful in mechanized inductive inference

Scheme of inductive inference:

  theoretical assumptions, observational statement
  ------------------------------------------------
               theoretical statement

SLIDE 59

Logic of discovery

Five questions about the scheme of inductive inference (theoretical assumptions, observational statement / theoretical statement):

- L0: In what languages does one formulate observational and theoretical statements? (What is the syntax and semantics of these languages? What is their relation to the classical first order predicate calculus?)
- L1: What are rational inductive inference rules bridging the gap between observational and theoretical sentences? (What does it mean that a theoretical statement is justified?)
- L2: Are there rational methods for deciding whether a theoretical statement is justified (on the basis of given theoretical assumptions and observational statements)?
- L3: What are the conditions for a theoretical statement or a set of theoretical statements to be of interest (importance) with respect to the task of scientific cognition?
- L4: Are there methods for suggesting such a set of statements, which is as interesting as possible?

L0 – L2: logic of induction. L3 – L4: logic of suggestion. L0 – L4: logic of discovery.

SLIDE 60

GUHA Procedure

DATA + a simple definition of a large set of relevant observational statements
-> generation and verification of particular observational statements
-> all the prime observational statements

Observational statement : theoretical statement = 1 : 1

SLIDE 61

Outline

 GUHA – main features  Association rule – couple of Boolean attributes

 Data matrix and Boolean attributes  Association rule  4ft-quantifiers

 GUHA procedure ASSOC  LISp-Miner  Related research 61 Tutorial @ COMPSTAT 2010

SLIDE 62

Data matrix and Boolean attributes

Data matrix M with columns A1, A2, ..., Am; Boolean attributes derived from M, e.g. A1(3), A2(7,9), and their conjunction A1(3) ∧ A2(7,9) (value 1 in the rows where the attribute is true).

SLIDE 63

Association rule: φ ≈ ψ, where φ is the antecedent, ψ the succedent and ≈ a 4ft quantifier.

The rule is evaluated on the fourfold table 4ft(φ, ψ, M) = (a, b, c, d) by a function F(a, b, c, d):
1 ... φ ≈ ψ is true in M, 0 ... φ ≈ ψ is false in M.

SLIDE 64

Important simple 4ft-quantifiers (1)

Fourfold table of φ and ψ in M: a, b, c, d.

- Founded implication ⇒_{p,Base}: a/(a+b) ≥ p and a ≥ Base
- Double founded implication ⇔_{p,Base}: a/(a+b+c) ≥ p and a ≥ Base
- Founded equivalence ≡_{p,Base}: (a+d)/(a+b+c+d) ≥ p and a ≥ Base

SLIDE 65

Important simple 4ft-quantifiers (2)

Fourfold table of φ and ψ in M: a, b, c, d.

- Above average ⇒+_{p,Base}: a/(a+b) ≥ (1+p) (a+c)/(a+b+c+d) and a ≥ Base
- "Classical" ⇒_{C,S}: confidence a/(a+b) ≥ C and support a/(a+b+c+d) ≥ S

SLIDE 66

4ft-quantifiers – statistical hypothesis tests (1)

Lower critical implication ⇒!_{p,α,Base} for 0 < p ≤ 1, 0 < α < 0.5:

The rule φ ⇒!_{p,α,Base} ψ corresponds to the statistical test (on the level α) of the null hypothesis H0: P(ψ|φ) ≤ p against the alternative H1: P(ψ|φ) > p. Here P(ψ|φ) is the conditional probability of the validity of ψ under the condition φ.

SLIDE 67

4ft-quantifiers – statistical hypothesis tests (2)

Fisher's quantifier, for 0 < α < 0.5 and a minimal Base:

The rule corresponds to the statistical test (on the level α) of the null hypothesis of independence of φ and ψ against the alternative of positive dependence.

SLIDE 68

Outline

 GUHA – main features  Association rule – couple of Boolean attributes  GUHA procedure ASSOC  LISp-Miner  Related research 68 Tutorial @ COMPSTAT 2010

SLIDE 69

GUHA procedure ASSOC

Inputs: data matrix M, a set of relevant antecedents, a set of relevant succedents, a 4ft-quantifier.

The procedure generates and verifies all relevant rules and outputs all prime rules.

SLIDE 70

GUHA – selected implementations (1)

- 1966 – MINSK 22 (I. Havel): Boolean data matrix, simplified version of association rules, punch tape
- end of the 1960s – IBM 7040 (I. Havel)
- 1976 – IBM 370 (I. Havel, J. Rauch): Boolean data matrix, association rules, statistical quantifiers, bit strings, punch cards

SLIDE 71

GUHA – selected implementations (2)

- Early 1990s – PC-GUHA, MS DOS (A. Sochorová, P. Hájek, J. Rauch)
- Since 1995 – GUHA+-, Windows (D. Coufal et al.)
- Since 1996 – LISp-Miner, Windows (M. Šimůnek, J. Rauch et al.): 7 GUHA procedures, KEX, related research
- Since 2006 – Ferda (M. Ralbovský et al.)

SLIDE 72

Outline

- GUHA – main features
- Association rule – a couple of Boolean attributes
- GUHA procedure ASSOC
- LISp-Miner
  - overview
  - application examples
- Related research

SLIDE 73

LISp-Miner overview

 4ft-Miner  KL-Miner  CF-Miner 

4ftAction-Miner

 SD4ft-Miner  SDKL-Miner  SDCF-Miner

http://lispminer.vse.cz

KEX

i.e. 7 GUHA procedures

LMDataSource

SLIDE 74

LISp-Miner, application examples

 Stulong data set  4ft-Miner (enhanced ASSOC procedure):

 B (Physical, Social)

? B (Biochemical)

 SD4ft-Miner:

 normal

risk: B (Physical, Social)

? B (Biochemical)

SLIDE 75

Stulong data set (1)

http://euromise.vse.cz/challenge2004/

SLIDE 76

Stulong data set (2)

http://euromise.vse.cz/challenge2004/data/entry/

SLIDE 77

Social characteristics: Education, Marital status, Responsibility in a job

SLIDE 78

Physical examinations

Weight [kg], Height [cm], Skinfold above musculus triceps [mm], Skinfold above musculus subscapularis [mm], ... additional attributes

SLIDE 79

Biochemical examinations

Cholesterol [mg%], Triglycerides [mg%]

SLIDE 80

LISp-Miner, application examples

 Stulong data set  4ft-Miner (enhanced ASSOC procedure):

 B (Physical, Social)

? B (Biochemical)

 SD4ft-Miner:

 normal

risk: B (Physical, Social)

? B (Biochemical)

SLIDE 81

B (Physical, Social) ≈? B (Biochemical)

In the ENTRY data matrix, are there some interesting relations between Boolean attributes describing a combination of results of physical examinations and social characteristics, and results of biochemical examinations?

The relation ≈? is evaluated using the fourfold table (a, b, c, d) over ENTRY.

SLIDE 82

Applying the GUHA procedure 4ft-Miner

Entry data matrix + set of relevant B (Physical, Social) + set of relevant B (Biochemical)
-> generation and verification of the rules B (Physical, Social) ≈? B (Biochemical)
-> all prime rules B (Physical, Social) ≈ B (Biochemical)

SLIDE 83

Defining B (Social, Physical) (1)

B (Social, Physical) = B (Social) ∧ B (Physical)

B (Social) = conjunctions (length up to 2) of [B (Education), B (Marital Status), B (Responsibility_Job)]

B (Physical) = conjunctions (length 1 – 4) of [B (Weight), B (Height), B (Subscapular), B (Triceps)]

SLIDE 84

Defining B (Social, Physical) (2)

B (Education): subsets of length 1 – 1. Education has the categories basic school, apprentice school, secondary school and university, giving the literals Education(basic school), Education(apprentice school), Education(secondary school), Education(university).

SLIDE 85

Note: attribute A with categories 1, 2, 3, 4, 5; literals with coefficients Subset (1 – 3):

A(1), A(2), A(3), A(4), A(5)
A(1,2), A(1,3), A(1,4), A(1,5), A(2,3), A(2,4), A(2,5), A(3,4), A(3,5), A(4,5)
A(1,2,3), A(1,2,4), A(1,2,5), ..., A(3,4,5)

SLIDE 86

Defining B (Social, Physical) (3)

Set of categories of Weight: 52, 53, 54, 55, ..., 130, 131, 132, 133

B (Weight): intervals of length 10 – 10: Weight(52 – 61), Weight(53 – 62), ...

SLIDE 87

Defining B (Social, Physical) (4)

Set of categories of Triceps: (0;5], (5;10], (10;15], ..., (25;30], (30;35], (35;40]

B (Triceps): left cuts 1 – 3: Triceps(1 – 5), Triceps(1 – 10), Triceps(1 – 15), i.e. Triceps(low)

SLIDE 88

Defining B (Social, Physical) (5)

Set of categories of Triceps: (0;5], (5;10], (10;15], ..., (25;30], (30;35], (35;40]

B (Triceps): right cuts 1 – 3: Triceps(35 – 40), Triceps(30 – 40), Triceps(25 – 45), i.e. Triceps(high)

SLIDE 89

Defining B (Social, Physical) (6)

Examples of B (Social, Physical) (conjunctions of literals):
Education(basic school); Education(university) ∧ Marital_Status(single); Weight(52 – 61) ∧ Marital_Status(divorced); Weight(52 – 61) ∧ Triceps(25 – 45); Weight(52 – 61) ∧ Height(52 – 61) ∧ Subscapular(0 – 10) ∧ Triceps(25 – 45)

SLIDE 90

Note: Types of coefficients

See examples above

SLIDE 91

Defining B (Biochemical)

Analogously to B (Social, Physical). Examples of B (Biochemical):

Cholesterol(110 – 120), Cholesterol(110 – 130), ..., Cholesterol(110 – 210); Cholesterol(≥ 380), Cholesterol(≥ 370), ..., Cholesterol(≥ 290); Cholesterol(≥ 380) ∧ Triglicerides(≤ 50), ..., Cholesterol(≥ 380) ∧ Triglicerides(≤ 300), ...

SLIDE 92

Defining ≈? in φ ≈? ψ

≈? corresponds to a condition concerning the fourfold table 4ft(φ, ψ, M) = (a, b, c, d).

17 types of 4ft-quantifiers are available.

SLIDE 93

Two examples of ≈?

Founded implication ⇒_{p,B}: at least 100p per cent of the objects of M satisfying the antecedent also satisfy the succedent, and there are at least B(ase) objects satisfying both, i.e. a/(a+b) ≥ p and a ≥ B.

Above average ⇒+_{p,B}: the relative frequency of objects of M satisfying the succedent among the objects satisfying the antecedent is at least 100p per cent higher than the relative frequency of the succedent in the whole data matrix M, and there are at least B(ase) objects satisfying both, i.e. a/(a+b) ≥ (1+p) (a+c)/(a+b+c+d) and a ≥ B.

SLIDE 94

Solving B(Social, Physical) ⇒_{0.9,50} B(Biochemical) (1)

Task: B(Social, Physical) ⇒_{0.9,50} B(Biochemical)

SLIDE 95

Solving B(Social, Physical) ⇒_{0.9,50} B(Biochemical) (2)

On a PC with 1.66 GHz and 2 GB RAM: 5 · 10^6 rules verified in 2 min 40 sec, 0 true rules.

Problem: the confidence 0.9 in ⇒_{0.9,50} is too high.
Solution: use confidence 0.5.

SLIDE 96

Solving B(Social, Physical) ⇒_{0.5,50} B(Biochemical) (1)

Task: B(Social, Physical) ⇒_{0.5,50} B(Biochemical)

SLIDE 97

Solving B(Social, Physical) ⇒_{0.5,50} B(Biochemical) (2)

30 rules with confidence ≥ 0.5 were found.

Problem: the strongest rule has confidence only 0.526 (see detail).
Solution: search for rules expressing a 70% higher relative frequency than the average, i.e. use ⇒+_{0.7,50} instead of ⇒_{0.5,50}.

SLIDE 98

Solving B(Social, Physical) ⇒_{0.5,50} B(Biochemical) (3)

Detail of results – the strongest rule: Subscapular(0;10] ⇒_{0.53,51} Triglicerides(≤ 115)

Entry                 Triglicerides(≤ 115)   ¬Triglicerides(≤ 115)
Subscapular(0;10]              51                      46
¬Subscapular(0;10]            303                     729

SLIDE 99

Solving B(Social, Physical) ⇒+_{0.7,50} B(Biochemical) (1)

Task: B(Social, Physical) ⇒+_{0.7,50} B(Biochemical)

SLIDE 100

Solving B(Social, Physical) ⇒+_{0.7,50} B(Biochemical) (2)

14 rules were found whose succedent has a relative frequency at least 70% higher than the average; for an example see the detail.

SLIDE 101

Solving B(Social, Physical) ⇒+_{0.7,50} B(Biochemical) (3)

Detail of results – the strongest rule:

Antecedent: Weight(65;75] ∧ Subscapular(≥ 15) ∧ Triceps(≥ 15)
Succedent: Triglicerides(≥ 95)

Entry        succ.   ¬succ.
ant.           51      114     165
¬ant.         140      824     964
              191      938    1129

Relative frequency of patients satisfying the succedent in the whole data matrix: 191/1129 = 0.17.
Relative frequency among the patients satisfying the antecedent: 51/165 = 0.31, i.e. 82% higher, thus ⇒+_{0.82,51}.
Confidence = 51/165 = 0.31 (not interesting as an implication!).

SLIDE 102

- mines for rules φ ≈ ψ and conditional rules φ ≈ ψ / χ
- very fine tools to define the sets of relevant antecedents, succedents and conditions
- elements of semantics, e.g. right cuts 1 – 3, i.e. Triceps(high)
- measures of association defined on 4ft(φ, ψ, M) = (a, b, c, d)
- works very fast
- does not use apriori; uses a bit-string approach

SLIDE 103

LISp-Miner, application examples

 Stulong data set  4ft-Miner (enhanced ASSOC procedure):

 B (Physical, Social)

? B (Biochemical)

 SD4ft-Miner:

 normal

risk: B (Physical, Social)

? B (Biochemical)

SLIDE 104

SD4ft-Miner motivation

Patient groups: normal, risk, pathological.

Is there any difference between normal and risk patients concerning B (Social, Physical) ≈? B (Biochemical)?

normal vs. risk: B (Social, Physical) ≈? B (Biochemical)

SLIDE 105

Normal vs. Risk: B (Social, Physical) ≈? B (Biochemical) (1)

Is there any difference between normal and risk patients concerning ⇒_{p,B}?

Fourfold tables of the rule on the two groups: normal (a1, b1, c1, d1) and risk (a2, b2, c2, d2).

Example of a difference condition (condition of interestingness):

  |confidence_normal - confidence_risk| = |a1/(a1+b1) - a2/(a2+b2)| ≥ 0.3

SLIDE 106

Normal vs. Risk: B (Social, Physical) ≈? B (Biochemical) (2)

The SD4ft-Miner procedure searches for pairs B(Social, Physical), B(Biochemical) such that

  |a1/(a1+b1) - a2/(a2+b2)| ≥ 0.3

SLIDE 107

Normal vs. Risk: B (Social, Physical) ≈? B (Biochemical) (3)

19 000 000 patterns were verified in 10 minutes; 32 patterns were found. For the strongest one, see the detail.

SLIDE 108

normal vs. risk: B (Social, Physical) ≈? B (Biochemical) (4)

Detail of results – the strongest pattern:

Antecedent: Marital_Status(married) ∧ Weight(75;85] ∧ Height(172;181] ∧ Triceps(≤ 15)
Succedent: Cholesterol(≥ 210)

Entry / normal: a = 32, b = 25, c = 90, d = 129
Entry / risk:   a = 32, b = 119, c = 188, d = 520

confidence_normal = 32/57 = 0.56, confidence_risk = 32/151 = 0.21, difference = 0.35
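The interestingness condition for this pattern can be checked from the two (a, b) pairs reported on the slide (32, 25 for the normal group; 32, 119 for the risk group):

```python
def confidence(a, b):
    """Confidence a/(a+b) of the rule on one group's fourfold table."""
    return a / (a + b)

def sd4ft_interesting(normal_ab, risk_ab, threshold=0.3):
    """SD4ft condition: the two confidences differ by at least the threshold."""
    diff = abs(confidence(*normal_ab) - confidence(*risk_ab))
    return diff, diff >= threshold

diff, ok = sd4ft_interesting((32, 25), (32, 119))
print(round(diff, 2), ok)  # 0.35 True
```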

SLIDE 109

- mines for SD4ft patterns comparing two sets of objects
- are there any differences between the two sets concerning the relation of some antecedent and succedent when a condition is satisfied?
- based on the same principles as 4ft-Miner: definitions of the two sets, the antecedents, succedents and conditions, and measures of association on a, b, c, d
- a powerful tool that requires careful application
- necessity to use domain knowledge

SLIDE 110

Outline

- GUHA – main features
- Association rule – a couple of Boolean attributes
- GUHA procedure ASSOC
- LISp-Miner
- Related research
  - domain knowledge
  - SEWEBAR project
  - observational calculi
  - EverMiner project

SLIDE 111

LISp-Miner Knowledge Base (1)

Storing and maintaining groups of attributes.

SLIDE 112

LISp-Miner Knowledge Base (2)

Mutual influence of attributes, e.g.:
- If Education increases then Beer consumption decreases
- If Age increases then BMI increases too

SLIDE 113

SEWEBAR project

http://sewebar.vse.cz/

SLIDE 114

EverMiner project

SLIDE 115

Observational calculi

- logical calculi whose formulas are the patterns mined from data
- study of the logical properties of such calculi
- logic of association rules: deduction rules between association rules, e.g. whether φ ⇒_{0.9,50} ψ / φ' ⇒_{0.9,50} ψ' is correct iff ... ;
- various applications

SLIDE 116

LISp-Miner - authors

http://lispminer.vse.cz/people.html Scientific features: Jan Rauch Implementation features: Milan Šimůnek

SLIDE 117

Further readings

- Rauch J., Šimůnek M. (2005) An Alternative Approach to Mining Association Rules. In: Lin T. Y. et al. (eds) Data Mining: Foundations, Methods, and Applications. Springer-Verlag, pp. 219–238

- Šimůnek M. (2003) Academic KDD Project LISp-Miner. In: Abraham A. et al. (eds) Advances in Soft Computing – Intelligent Systems Design and Applications. Springer, Berlin Heidelberg New York

- Rauch J. (2005) Logic of Association Rules. Applied Intelligence 22, 9–28

- Rauch J., Šimůnek M. (2009) Dealing with Background Knowledge in the SEWEBAR Project. In: Berendt B. et al. (eds) Knowledge Discovery Enhanced with Semantic and Social Information. Springer-Verlag, Berlin, pp. 89–106

- Kliegr T., Ralbovský M., Svátek V., Šimůnek M., Jirkovský V., Nemrava J., Zemánek J. (2009) Semantic Analytical Reports: A Framework for Post-processing Data Mining Results. In: Foundations of Intelligent Systems. Springer-Verlag, Berlin, pp. 88–98
