Techniques and Rule Patterns for Declaratively Querying Web Data with FLORID

Techniques and Rule Patterns for Declaratively Querying Web Data with FLORID. Bertram Ludäscher, Rainer Himmeröder, Wolfgang May. Institut für Informatik, Universität Freiburg.


slide-1
SLIDE 1 Techniques and Rule Patterns for Declaratively Querying Web Data with FLORID
Bertram Ludäscher, Rainer Himmeröder, Wolfgang May
Institut für Informatik, Universität Freiburg, Germany
Overview
  • Introduction
  • FLORID Web model
  • Integration of Web Access with DOOD paradigm
  • Data Integration: A Case Study
  • Navigation
  • Conclusions
slide-2
SLIDE 2 INTRODUCTION
Goal: A uniform framework/system for
  • Querying the Web
      • express declaratively how to query/navigate on the Web
      • extract data from Web pages for populating a database (Web-data warehousing)
  • Management of Semistructured Data
      • structure is irregular, partial, unknown, implicit in the data
      • example: HTML pages
      • querying/navigation using general path expressions, both in the Web (via links) and in the database
      • discover structure
  • Information Integration
      • heterogeneous sources with different structure
      • wrappers, mediators
slide-3
SLIDE 3 QUERYING THE WEB WITH F-LOGIC/FLORID
DOOD Paradigm
  • deduction: data-driven exploration of the Web and high-level querying
  • object-orientation: flexible modeling of semistructured data; optional methods instead of NULLs
WebFLORID
  • extension of F-logic for querying and restructuring the Web
  • declarative rule-based programming style: uniform language for wrappers and mediators
  • meta features: schema browsing/reasoning, variables at class/method positions
  • restructuring of information
  • navigation by general path expressions
  • uniform access to local db and Web data
  • integration of heterogeneous information
slide-4
SLIDE 4 F-LOGIC IN A NUTSHELL
Basic Constructs
      Object : Class                               ISA-relation
      SubClass :: Class                            SUBCLASS-relation
      Class[Method@(Ptypes) => Rtype]              SIGNATURE (single-valued)
      Class[Method@(Ptypes) =>> Rtypes]            and multi-valued
      Object[Method@(Params) -> R]                 DATA (single-valued)
      Object[Method@(Params) ->> {R1,...,Rn}]      and multi-valued
      Obj.MP-Spec    Obj..MP-Spec                  PATH EXPRESSION
Object Creation via Path Expressions in the Head
      X.father : man :- X : person.
      X.mother : woman :- X : person.
      ?- person[M => C].
         M/father  C/man
         M/mother  C/woman
slide-5
SLIDE 5 WEB MODEL
The Web: Graph consisting of nodes (urls containing web documents) and links
[Figure: two urls, each holding a web document wd of the form HTML / HEAD / A HREF=url label /A / HTML, connected by an edge hrefs@(label)]
Link Structure
  • Signature: webdoc[hrefs@(string) =>> url]
  • Example: wd : webdoc[hrefs@(label) ->> url]
Further Attributes
      webdoc[self => url; address => string; modif => string; error => string]
  • Additional: user-programmed evaluation of the web documents
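The link structure above, where each web document carries hrefs@(label) edges to other urls, can be illustrated outside F-logic. The following is only a minimal Python sketch, not FLORID's implementation: the class name HrefCollector, the helper hrefs, and the sample page are hypothetical, and only the standard-library html.parser module is used.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects (label, url) pairs from <A HREF=...>label</A> anchors."""

    def __init__(self):
        super().__init__()
        self.links = []      # collected (label, url) pairs
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def hrefs(html_text):
    """Return the list of (label, url) pairs found in one document."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample document, not taken from the slides.
page = ('<HTML><HEAD></HEAD><A HREF="europe.html">Europe</A> '
        '<A HREF="asia.html">Asia</A></HTML>')
links = hrefs(page)
```

Each pair in links plays the role of one hrefs@(label) ->> url fact for the document.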
slide-6
SLIDE 6 INTEGRATION OF THE WEB MODEL IN THE DEDUCTIVE SYSTEM
[Figure: the F-logic DB holds a url u (url :: string); its method get (url[get => webdoc]) yields a webdoc with hrefs and address]
Rule-Based Exploration
      U.get :- U : url.                        generate OID
          U.get : webdoc                       add to webdoc
          U.get[address -> ...; hrefs ...]     fill in slots
      U : explored :- U : url.get.
      NewU : url :- U : url[hrefs ->> NewU].
slide-7
SLIDE 7 SEMANTICS
  • Path Expressions [FLU, VLDB]: closure axioms
      • extended Herbrand universe $U^*$, Herbrand base $HB^*$
  • Web Interface
      • set of reserved names $R = \{\mathit{get}, \mathit{url}, \mathit{hrefs}, \ldots\}$
      • $\mathit{explore} : \mathit{URL} \to \mathcal{P}(HB_{\mathit{URL} \cup R})$ maps URLs to sets of new facts
  • Web Access Axiom: for $H \subseteq HB$,
      $H \models u : \mathit{url} \wedge u.\mathit{get} \;\Rightarrow\; H \models \varphi_{new}$ for all facts $\varphi_{new} \in \mathit{explore}(u)$
      • "if get is defined for a URL u, then all explored data is in H"
      • minimal Herbrand Web Model
  • Integration with Bottom-up Evaluation
      $T^W_P(H) := H \cup T_P(H) \cup \bigcup \{\, \mathit{explore}(u) \mid (u : \mathit{url}),\ (u.\mathit{get}) \in T_P(H) \,\}$
  • declarative semantics; if $\mathit{explore} = \emptyset$ then WebFLORID = FLORID
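The interplay of bottom-up evaluation and Web access can be sketched as a small fixpoint computation. This is a hedged Python illustration under stated assumptions, not the FLORID evaluator: the in-memory WEB dictionary, the tuple encoding of facts, the two example rules inside t_p, and all function names are hypothetical.

```python
# Hypothetical in-memory "Web": url -> list of (label, target url).
WEB = {
    "a.html": [("to b", "b.html"), ("to c", "c.html")],
    "b.html": [("back", "a.html")],
    "c.html": [],
    "d.html": [("to a", "a.html")],   # never reached from a.html
}


def explore(u):
    """Facts contributed by accessing url u (empty if u is unknown)."""
    return {("hrefs", u, label, target) for label, target in WEB.get(u, [])}


def t_p(facts):
    """Immediate consequences of two example rules:
    (1) the target of every known link is a url;
    (2) get is requested for every known url."""
    derived = set()
    for f in facts:
        if f[0] == "hrefs":
            derived.add(("url", f[3]))
        elif f[0] == "url":
            derived.add(("get", f[1]))
    return derived


def t_w_p(facts):
    """One step of the web-aware operator: rule consequences plus
    explore(u) for every u that is a url and has get requested."""
    new = facts | t_p(facts)
    for f in list(new):
        if f[0] == "get" and ("url", f[1]) in new:
            new |= explore(f[1])
    return new


def fixpoint(facts):
    """Iterate t_w_p until no new facts are derived."""
    while True:
        nxt = t_w_p(facts)
        if nxt == facts:
            return facts
        facts = nxt


model = fixpoint({("url", "a.html")})
explored = sorted(f[1] for f in model if f[0] == "get")
```

With explore returning the empty set for every url, t_w_p reduces to the ordinary rule step, which mirrors the remark that WebFLORID then coincides with FLORID.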
slide-8
SLIDE 8 EXAMPLE INTEGRATION: CIA WORLD FACTBOOK and WORLD ONLINE
CIA WORLD FACTBOOK (CIA)
  • geography, people, government, economy
  • no cities apart from country capitals
  • information: link structure and formatted text
  • flat text, structure quite regular
  • only B, I, BR tags used for structuring
WORLD ONLINE (WOL)
  • administrative divisions, main cities
  • information: link structure and tables
  • structured (tables) but not regular: different table layout, columns
slide-9
SLIDE 9 EXAMPLE INTEGRATION: CIA WORLD FACTBOOK and WORLD ONLINE
slide-10
SLIDE 10 INTEGRATION METHODOLOGY: Typical Steps and Rules
CIA Factbook: Matching via Regular Expressions
  • accessing relevant pages:
      C[url@(cia) -> U] :- C : continent[file@(cia) -> FN], strcat(cia_src, FN, U).
      U : url.get :- C : continent[url@(cia) -> U].
      cid(C) : country[url@(cia) -> U; name@(cia) -> Label; continent -> CT] :-
          CT : continent.url@(cia).get[hrefs@(Label) ->> U].
      U : url.get :- _ : country[url@(cia) -> U].
  • extracting raw data:
      pattern(capital_name, "Capital ... \n").
      pattern(total_area, "total area ... \n ... sq km").
      C[Method -> X] :- pattern(Method, RegEx), pmatch(C : country.url@(cia).get, RegEx, X).
  • restructuring and data cleaning:
      C : real_country :- C : country[capital_name -> CA], not substr("none", CA).
  • Patterns and rules for comma-lists (ethnic groups, languages)
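The extraction and cleaning steps above (one pattern per method, a generic matching rule, and a filter on capitals containing "none") can be sketched with ordinary regular expressions. This is a hypothetical Python illustration, not the FLORID rules themselves: the sample page text, the regular expressions, and the names PATTERNS, extract, and is_real_country are assumptions, and the values are placeholders rather than Factbook data.

```python
import re

# Hypothetical flat-text excerpt in the style of a country page;
# the values are placeholders, not taken from the CIA World Factbook.
PAGE = """<B>Capital:</B> Exampleville<BR>
<B>Area:</B> total area: 12,345 sq km<BR>
"""

# Method name -> regular expression with exactly one capture group.
PATTERNS = {
    "capital_name": r"Capital:</B>\s*([^<\n]+)",
    "total_area": r"total area:\s*([\d,]+) sq km",
}


def extract(page):
    """Apply every pattern to the page; keep the methods that matched."""
    record = {}
    for method, regex in PATTERNS.items():
        match = re.search(regex, page)
        if match:
            record[method] = match.group(1).strip()
    return record


def is_real_country(record):
    """Data cleaning: keep entries whose capital does not contain 'none'."""
    capital = record.get("capital_name")
    return capital is not None and "none" not in capital


country = extract(PAGE)
```

Keeping the patterns in a table and applying them with one generic function follows the slide's idea of a single rule ranging over all pattern facts.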
slide-11
SLIDE 11 INTEGRATION METHODOLOGY: Typical Steps and Rules
WOL Pages: Parsing (nsgmls parser integrated into FLORID) and Evaluating
  • Accessing and parsing relevant pages:
      U : url.parse :- C : country[url@(wol) -> U].          generates parse-tree of the document
      element(Tab, Row, Col)[contents -> Cont; type -> Type] :-
          U : url.parse[table ->> Tab],
          Tab.table.tbody@(Row).tr@(Col)[X@(Type) -> Cont].
  • Identifying Main-Cities Table and column attributes:
      C.main_city_tab = T[header_row -> HZ; pop_year@(PS) -> Y; city_col -> CS; pop_col -> PS] :-
          C : country[url@(wol) -> U], U.parse[table ->> T],
          element(T, _, _)[contents -> Cont], substr("main cities", Cont),
          element(T, HZ, CS)[contents -> Header1; type -> th], substr("city", Header1),
          element(T, HZ, PS)[contents -> Header2; type -> th], substr("pop", Header2),
          pmatch(Header2, ..., Y).
  • Evaluation of main-cities table:
      C[main_cities ->> cty(C, CN) : city[country -> C; name@(wol) -> CN; population@(Y) -> P]] :-
          C : country.main_city_tab = T[city_col -> CS; pop_col -> PS; pop_year@(PS) -> Y],
          element(T, DZ, CS)[contents -> CN; type -> td],
          element(T, DZ, PS)[contents -> P; type -> td].
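The table evaluation above first locates the header row and the city and population columns by substring tests on th cells, then reads the td cells of those columns. A minimal Python sketch of that idea follows; it assumes the table has already been parsed into rows of (cell type, contents) pairs, and the sample table, its values, and the names find_columns and main_cities are hypothetical.

```python
import re

# Hypothetical parsed table: each row is a list of (cell type, contents).
TABLE = [
    [("th", "City"), ("th", "Province"), ("th", "Pop. 1995")],
    [("td", "Exampleville"), ("td", "North"), ("td", "120000")],
    [("td", "Samplestad"), ("td", "South"), ("td", "45000")],
]


def find_columns(table):
    """Return (header_row, city_col, pop_col, year), or None if absent."""
    for r, row in enumerate(table):
        city_col = pop_col = year = None
        for c, (cell_type, text) in enumerate(row):
            if cell_type != "th":
                continue
            low = text.lower()
            if "city" in low:
                city_col = c
            if "pop" in low:
                pop_col = c
                # The census year is taken from the header text, if present.
                match = re.search(r"(\d{4})", text)
                year = int(match.group(1)) if match else None
        if city_col is not None and pop_col is not None:
            return r, city_col, pop_col, year
    return None


def main_cities(table):
    """Return (city name, population, year) for every data row."""
    found = find_columns(table)
    if found is None:
        return []
    header_row, city_col, pop_col, year = found
    cities = []
    for row in table[header_row + 1:]:
        if len(row) > max(city_col, pop_col) and row[city_col][0] == "td":
            cities.append((row[city_col][1], int(row[pop_col][1]), year))
    return cities


cities = main_cities(TABLE)
```

Because the columns are found by content rather than by position, the same code tolerates tables whose layout differs from country to country, which is the point made for the WOL pages.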
slide-12
SLIDE 12 QUERYING INTEGRATED DATA
QUERY: Name the capitals from CIA with their population from WOL.
      ?- _ : country[name@(cia) -> Country; capital_name -> City],
         _ : city[name@(wol) -> City; population -> P].
         P/…  City/Vienna  Country/Austria
         P/…  City/Prague  Country/Czech Republic
         P/…  City/Paris   Country/France
         P/…  City/Berlin  Country/Germany
Fusing Country Objects
      C1 = C2 :- C1 : country[name@(cia) -> N], C2 : country[name@(wol) -> N].
      C1 = C2 :- C1 : country[continent -> CT; main_cities ->> _[name@(wol) -> N]],
                 C2 : country[continent -> CT; capital_name -> N; name@(cia) -> _].
Linking Capitals to Countries
      C : country[capital -> Cap[name@(cia) -> CN]] :-
          C : cia_rel_country[capital@(cia) -> CN; main_cities ->> Cap[name@(wol) -> CN]].
      ?- _ : country[name@(cia) -> Country; capital -> _[name -> City; population -> P]].
         same answer as above
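The effect of fusing country objects before linking capitals can be illustrated with a plain join. The sketch below is hypothetical Python, not the F-logic query: the two dictionaries stand in for the CIA and WOL sources, the populations are placeholder numbers, and "Texasland" and "Nowhere" are invented entries that show why the join uses the country as well as the city name.

```python
# Illustrative data only; the populations are placeholders, not WOL figures.
cia_countries = {
    "Austria": {"capital": "Vienna"},
    "France": {"capital": "Paris"},
    "Nowhere": {"capital": "Ghosttown"},     # has no WOL entry
}
wol_cities = {
    ("Vienna", "Austria"): 1000,
    ("Paris", "France"): 2000,
    ("Paris", "Texasland"): 30,              # same city name, other country
}


def capitals_with_population(cia, wol):
    """Join capitals (CIA) with city populations (WOL) per country."""
    result = []
    for country, attrs in cia.items():
        key = (attrs["capital"], country)
        if key in wol:
            result.append((attrs["capital"], country, wol[key]))
    return sorted(result)


answers = capitals_with_population(cia_countries, wol_cities)
```

Joining on the city name alone would also pair the French capital with the unrelated city of the same name; joining through the fused country object avoids that.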
slide-13
SLIDE 13 SEMISTRUCTURED DATA
  • Matching: link structure known; document structure fixed and known
  • Parsing/Evaluation: link structure known; varying document structure; content-based queries, data extraction
  • "Fishing in the Web": link structure not known; must be extracted
Def.: A semistructured database (ssdb) is a finite set $D$ of labeled edges $x \xrightarrow{l} y$; an edge $x \xrightarrow{l} y \in D$ corresponds to x[l ->> {y}].
Mapping a ssdb to F-logic:
      X : node, Y : node, L : label, X[L ->> {Y}] :- ssdb(X, L, Y).
Example: Web Skeleton Extractor P_ext
      root[src ->> {u1, ..., un}].                 define root nodes
      node :: url.                                 nodes are urls
      U : node.get :- root[src ->> {U}].           get root nodes
      Y : node, L : label, X[L ->> {Y}] :-         define new nodes/labels/links
          X : node.get[hrefs@(L) ->> {Y}].         by following hrefs
      Y.get :- Y : node, ...                       access nodes which satisfy ...
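The skeleton extractor above turns reachable link structure into a set of labeled edges and accesses only nodes that satisfy a condition. A minimal Python sketch of the same traversal follows; the in-memory WEB dictionary, the accept predicate, and the function name skeleton are hypothetical, and no real Web access takes place.

```python
from collections import deque

# Hypothetical in-memory "Web": url -> list of (label, target url).
WEB = {
    "u1": [("next", "u2"), ("ext", "x1")],
    "u2": [("home", "u1"), ("more", "u3")],
    "u3": [],
    "x1": [("deep", "x2")],
}


def skeleton(roots, web, accept=lambda url: True):
    """Return the labeled edges (x, label, y) found from the root nodes.

    Every link of an accessed node becomes an edge, but a target node is
    itself accessed only if accept(target) holds."""
    edges = set()
    seen = set(roots)
    queue = deque(roots)
    while queue:
        x = queue.popleft()
        for label, y in web.get(x, []):
            edges.add((x, label, y))
            if y not in seen and accept(y):
                seen.add(y)
                queue.append(y)
    return edges


# Only follow urls whose name starts with "u" (an arbitrary condition).
edges = skeleton(["u1"], WEB, accept=lambda u: u.startswith("u"))
```

The edge to x1 is recorded because it leaves an accessed node, while the links inside x1 are never seen; this matches the split between defining new nodes and deciding which of them to access.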
slide-14
SLIDE 14 SEMISTRUCTURED DATA
Specialization of the Skeleton Extractor for DBLP
      root[src ->> {dblp}].
      dblp = "http://www.informatik.uni-trier.de/~ley/db/".
      ..., substr("trier", Y)                and consider only urls containing "trier"
      ..., substr("db/journals/is", Y)       restrict to IS journal
Queries with path expressions
      ?- dblp.."Inf. Systems"..L.."Michael E. Senko".
Def.: General path expressions (GPE) over a set of labels $L$:
  • $L \cup \{\mathit{any}\} \subseteq GPE$
  • if $M, N \in GPE$ and $n \in \mathbb{N}$, then the following are in GPE: $M.N$, $M|N$, $(M)$, $M?$, $M^*$, $M^+$, $M^n$
  • if $\rho$ is a binary relation symbol, then … $\in GPE$
  • if $l \in L$ and $\varphi$ is a unary relation symbol, then … $\in GPE$
  • specification/implementation by simple path expressions and rules
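A few of the general path expression constructors (labels, any, concatenation, alternative, and Kleene star) can be evaluated over a set of labeled edges by computing the node pairs each expression connects. The following Python sketch is an illustration under assumptions, not the FLORID implementation: the tuple encoding of expressions, the function names, and the small edge set with placeholder node names are hypothetical.

```python
def nodes_of(edges):
    """All nodes that occur as source or target of some edge."""
    return {x for x, _, _ in edges} | {y for _, _, y in edges}


def compose(r, s):
    """Relational composition: (x, z) if (x, y) in r and (y, z) in s."""
    return {(x, z) for x, y in r for y2, z in s if y == y2}


def eval_gpe(expr, edges):
    """Set of node pairs (x, y) connected by a path matching expr."""
    kind = expr[0]
    if kind == "label":
        return {(x, y) for x, l, y in edges if l == expr[1]}
    if kind == "any":
        return {(x, y) for x, _, y in edges}
    if kind == "seq":
        return compose(eval_gpe(expr[1], edges), eval_gpe(expr[2], edges))
    if kind == "alt":
        return eval_gpe(expr[1], edges) | eval_gpe(expr[2], edges)
    if kind == "star":
        step = eval_gpe(expr[1], edges)
        closure = {(n, n) for n in nodes_of(edges)}   # zero repetitions
        while True:
            bigger = closure | compose(closure, step)
            if bigger == closure:
                return closure
            closure = bigger
    raise ValueError(f"unknown expression kind: {kind}")


# Hypothetical skeleton with placeholder node names.
EDGES = {
    ("dblp", "journals", "is"),
    ("is", "vol1", "p1"),
    ("is", "vol2", "p2"),
    ("p1", "author", "a1"),
    ("p2", "author", "a2"),
}

# any* . author : everything reachable from a node via a final author edge
query = ("seq", ("star", ("any",)), ("label", "author"))
targets = sorted(y for x, y in eval_gpe(query, EDGES) if x == "dblp")
```

The star case is computed as a least fixpoint of repeated composition, which parallels the slide's remark that general path expressions can be implemented by simple path expressions plus rules.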
slide-15
SLIDE 15 CONCLUSIONS
Summary
  • DOOD paradigm attractive for querying and restructuring the Web
  • uniform access to local db and Web data
  • integration of heterogeneous information
  • seamless integration of an SGML parser
  • reasoning about document structure and Web structure
  • use of search engines (AltaVista)
Implementation in WebFLORID
  • Florid: http://www.informatik.uni-freiburg.de/~dbis/florid