1Table: A System for Managing Structured Web Data (Yang Zhang)

slide-1
SLIDE 1

1Table

A System for Managing Structured Web Data

Yang Zhang with: Alon Halevy, Mike Cafarella, Nodira Khoussainova, Eugene Wu and Daisy Zhe

slide-2
SLIDE 2

Structured Web Data

  • Web is more than just text
      – tables, tags, lists, etc.
      – 50% of pages have tables
      – 25% of tables appear to be useful data tables (relational, entity, sets, etc.)
  • No existing tools to effectively query this data
      – RDBMSs don’t scale and process noisy data poorly
      – Search engines are structure‐blind
  • 1Table fills the gap!

[Pie chart legend: no tables, other tables, data tables]

slide-3
SLIDE 3

The 1Table Project

  • Table Search
  • Synthetic Table Generation
  • Schema Reconciliation
  • Reference Reconciliation
  • Data Visualization

slide-4
SLIDE 4

The 1Table Project

  • Schema Reconciliation
  • Reference Reconciliation
  • Data Visualization
  • Table Search
  • Synthetic Table Generation

slide-5
SLIDE 5

HOBO: TABLE SEARCH

1Table Project

slide-6
SLIDE 6

The Quest for Infrastructure

  • _: limited indexing options, inefficient structure
  • _: lots of hoops, unstructured
  • _: little bang for the buck, slow setup, inefficient structure
  • Wanted control over query model, ranking

Hobo: “poor man’s text search”

slide-7
SLIDE 7

Challenges

  • Millions of tables (~100M in Core)
  • Noisy: many are not data tables (layout)
  • Query by: attributes? values? similar examples?

  • No structured metadata

Hobo

  • Similar to traditional inverted index search
  • Schema‐agnostic structured query model
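
A minimal sketch of what an inverted index over table cells could look like. The posting layout (table id, row, col, offset) is an assumption based on the physical attributes named later in the deck, and `Posting`, `build_index`, and the toy tables are hypothetical names and data, not Hobo's actual structures. Here `offset` is assumed to be the token position inside a cell.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    table_id: int
    row: int
    col: int
    offset: int  # assumed: token position inside the cell


def build_index(tables):
    """tables: dict of table_id -> list of rows, each row a list of cell strings."""
    index = defaultdict(list)
    for table_id, rows in tables.items():
        for r, row in enumerate(rows):
            for c, cell in enumerate(row):
                for off, token in enumerate(cell.lower().split()):
                    index[token].append(Posting(table_id, r, c, off))
    return index


# Toy corpus (made up for illustration).
tables = {
    0: [["united states", "us"], ["china", "cn"]],
    1: [["france", "paris"], ["germany", "berlin"]],
}
index = build_index(tables)
```

Because every posting keeps its cell coordinates, the index itself needs no schema: structure is only interpreted at query time.
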
slide-8
SLIDE 8

Hobo Query Processor

[Architecture diagram. Labels: Master; Slave 0, Slave 1, …, Slave 499; shard slaves; GFS holding TID, Table, and Index files labeled 00000 … 00216]

slide-9
SLIDE 9

Processing Pipeline

docjoins → (extraction) → raw tables → (filtering) → good tables → (labeling, annotation, munging) → analyzed/cleaned tables → (indexing) → Hobo inverted index → (querying) → query processor

[Diagram also shows: Daffie annotation servers]

slide-10
SLIDE 10

Recipe: Hobo Query Model

  • Start with Google.com-style conjunction of disjunctions
  • Add structural primitives: terms have attributes
  • Introduce binding of variables to terms
  • Impose binary relational constraints (½ cup)
  • Mix bindings and constraints in arbitrary boolean expressions

  • Serve and enjoy
slide-11
SLIDE 11

Query Model

“united states”  →  x and y where x.offset + 1 = y.offset

slide-12
SLIDE 12

Query Model

x = “france”, y = “paris”, z = “germany”

x and y and z where x.row = y.row and x.col = z.col

slide-13
SLIDE 13

Query Model

  • What attributes are currently available?

      – Physical: offset, col, row
      – Logical: source (header/body/context)
      – For ranking: size, pageRank, isDataTable, hasHeaders, …
      – Easy to add more!

  • Fast (poly‐time) constraint verifier
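
To make the query model concrete, here is a naive sketch that binds variables to terms and checks row/col constraints per table. It enumerates candidate assignments by brute force, which is only an illustration and not the verifier the slide refers to. The postings data and the names `postings`, `match`, `ROW`, and `COL` are hypothetical.

```python
from itertools import product

# Hypothetical postings: term -> list of (table_id, row, col).
postings = {
    "france": [(1, 0, 0), (2, 3, 1)],
    "paris": [(1, 0, 1), (2, 0, 0)],
    "germany": [(1, 1, 0), (2, 3, 0)],
}
ROW, COL = 1, 2  # tuple positions of the row and col attributes


def match(bindings, constraints):
    """bindings: variable -> term. constraints: predicates over {variable: posting}.
    Returns ids of tables where some assignment satisfies every constraint."""
    variables = list(bindings)
    # Only tables containing every bound term can match (the conjunction).
    tables = set.intersection(
        *({p[0] for p in postings[bindings[v]]} for v in variables)
    )
    hits = set()
    for t in tables:
        candidates = [
            [p for p in postings[bindings[v]] if p[0] == t] for v in variables
        ]
        for combo in product(*candidates):
            env = dict(zip(variables, combo))
            if all(check(env) for check in constraints):
                hits.add(t)
                break
    return hits


result = match(
    {"x": "france", "y": "paris", "z": "germany"},
    [
        lambda e: e["x"][ROW] == e["y"][ROW],  # x.row = y.row
        lambda e: e["x"][COL] == e["z"][COL],  # x.col = z.col
    ],
)
```

With this toy data only table 1 matches: table 2 contains all three terms, but "france" and "paris" sit in different rows there.
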
slide-14
SLIDE 14

Query Languages

High‐level template‐based query language example:

((("united states") (us)) ((china | prc) (cn)) ((_) (to)))

Low‐level constraint‐based query language:

    and {
      a = and { a = term { united } b = term { states } where a.pos + 1 = b.pos }
      b = or { term { china } term { prc } }
      c = us
      d = cn
      e = to
      where a.col == b.col
            c.col == d.col
            c.col == e.col
            a.row == c.row
            b.row == d.row
    }

[Diagram: a table template passed through the parser and rewriter]

    “united states”   us
    china | prc       cn
    *                 to
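
The step from a template grid to row/col constraints can be sketched as follows. This is one plausible reading of what a rewriter might do, not the actual 1Table parser: `rewrite` is a hypothetical function, wildcards are modeled as `None`, and variables are named column by column so the output lines up with the low-level example on this slide.

```python
def rewrite(grid):
    """grid: list of rows; each cell is a term expression, or None for a wildcard.
    Returns (bindings, constraints) in the style of the low-level language."""
    names = iter("abcdefghijklmnopqrstuvwxyz")
    n_cols = max(len(row) for row in grid)
    var = {}  # (row, col) -> variable name
    # Name variables column by column, skipping wildcard cells.
    for c in range(n_cols):
        for r, row in enumerate(grid):
            if c < len(row) and row[c] is not None:
                var[(r, c)] = next(names)
    bindings = {v: grid[r][c] for (r, c), v in var.items()}

    constraints = []
    # Cells in the same template column must share a column in the table.
    for c in range(n_cols):
        col_vars = [var[(r, c)] for r in range(len(grid)) if (r, c) in var]
        constraints += [f"{col_vars[0]}.col == {v}.col" for v in col_vars[1:]]
    # Cells in the same template row must share a row in the table.
    for r in range(len(grid)):
        row_vars = [var[(r, c)] for c in range(n_cols) if (r, c) in var]
        constraints += [f"{row_vars[0]}.row == {v}.row" for v in row_vars[1:]]
    return bindings, constraints


bindings, constraints = rewrite([
    ['"united states"', "us"],
    ["china | prc", "cn"],
    [None, "to"],
])
```

For this grid the sketch produces the same five constraints as the low-level query above; the wildcard cell contributes no variable, so `e` (bound to `to`) gets only a column constraint.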

slide-15
SLIDE 15

Demo!

slide-16
SLIDE 16

Areas for Future Work

  • Low‐hanging performance fruit
      – O(n) constraint verification by ordering/hashing
      – Smarter concurrent iteration over inverted index
      – Query rewriting
      – More resources
  • Soft constraints: not required, but used for ranking
  • Frontend: richer data visualization
  • Ranking of results
  • Easy integration into Dataspaces
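
As an illustration of the hashing idea in the first bullet, an equality constraint such as x.row = y.row can be checked with a hash join instead of comparing every pair of postings. This is a generic sketch under assumed data shapes, not the planned Hobo implementation; `join_on` and the sample postings are hypothetical.

```python
from collections import defaultdict


def join_on(xs, ys, key):
    """Pair postings from xs and ys that agree on key(posting).
    Builds a hash table over ys once, then probes it with each x, so the
    work is proportional to the inputs plus the number of matching pairs."""
    buckets = defaultdict(list)
    for y in ys:
        buckets[key(y)].append(y)
    return [(x, y) for x in xs for y in buckets.get(key(x), [])]


# Postings as (table_id, row, col); values are made up for illustration.
france = [(1, 0, 0), (2, 3, 1)]
paris = [(1, 0, 1), (2, 0, 0)]

# Same table and same row.
same_row = join_on(france, paris, key=lambda p: (p[0], p[1]))
```

Only the table 1 postings pair up here, because that is the only table where both terms share a row.
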
slide-17
SLIDE 17

TABLE SUGGEST

1Table Project

slide-18
SLIDE 18

Synthetic Table Generation

What country corresponds to code “tr”?

Input:

    united states   us
    china           cn
    (empty)         tr

Generated table:

    united states   us
    china           cn
    turkey          tr
    japan           jp
    ...             …
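
One simple way to answer a question like this is to let tables in the corpus vote. The sketch below assumes a small in-memory corpus and trusts a column pair only if it reproduces at least one known example row; `suggest` and the corpus contents are hypothetical and are not taken from the talk.

```python
from collections import Counter


def suggest(corpus, examples, probe):
    """corpus: list of non-empty tables (lists of rows).
    examples: known (key, value) pairs. probe: a value whose key is unknown.
    Returns candidate keys ranked by how many column pairs support them."""
    votes = Counter()
    for table in corpus:
        width = max(len(row) for row in table)
        for kc in range(width):
            for vc in range(width):
                if kc == vc:
                    continue
                pairs = {
                    (row[kc], row[vc]) for row in table if len(row) > max(kc, vc)
                }
                # Skip column pairs that do not reproduce any known example.
                if not any(e in pairs for e in examples):
                    continue
                for key, value in pairs:
                    if value == probe:
                        votes[key] += 1
    return votes.most_common()


corpus = [
    [["united states", "us"], ["turkey", "tr"], ["japan", "jp"]],
    [["cn", "china"], ["tr", "turkey"]],  # same relation, columns swapped
    [["tr", "noise"], ["x", "y"]],        # unrelated table, ignored
]
ranked = suggest(corpus, [("united states", "us"), ("china", "cn")], "tr")
```

The first two tables each contribute one vote for "turkey"; the third never reproduces a known example, so its "tr" row is ignored.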

slide-19
SLIDE 19

Challenges

  • Inconsistent/inaccurate information
  • Resolving data from multiple sources
  • Ad‐hoc semantics
  • Data with nested (sub‐cell) structure

      – .us (united states)
      – united states/us
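
For the two sub-cell formats listed above, a pattern-based splitter is one possible starting point. The regular expressions and the `split_cell` helper are assumptions for illustration and cover only these two shapes.

```python
import re

PATTERNS = [
    # ".us (united states)": optional dot, code, then name in parentheses
    re.compile(r"^\.?(?P<code>\w+)\s*\((?P<name>[^)]+)\)$"),
    # "united states/us": name, slash, code
    re.compile(r"^(?P<name>[^/]+)/(?P<code>\w+)$"),
]


def split_cell(cell):
    """Return (name, code) if the cell matches a known nested format, else None."""
    text = cell.strip().lower()
    for pattern in PATTERNS:
        m = pattern.match(text)
        if m:
            return m.group("name").strip(), m.group("code")
    return None
```

Both `split_cell(".us (united states)")` and `split_cell("united states/us")` yield `("united states", "us")`, while a plain cell such as `"china"` returns `None`.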

slide-20
SLIDE 20

TableSuggest Features

  • Spreadsheet that suggests values to fill in
  • Can draw data from _ and Google Sets, but primarily 1Table (Hobo)
  • Hodgepodge of techniques (added ad hoc after inspecting results)
      – Type enumeration (_, Hobo)
      – Set expansion (Sets, Hobo)
      – Attribute resolution (Hobo)
      – Column clustering (1Table)
      – …

slide-21
SLIDE 21

Demo!

slide-22
SLIDE 22

Areas for Future Work

  • More principled evaluation
  • Implementation infelicities
  • Support for numeric queries using two‐tier indexing structure with “range buckets”

  • Richer sub‐structure extraction (lists)
  • Incremental indexing with live data feeds/sources
  • Tailoring to specific domains
  • Entity tables
  • Aggregating values in denormalized tables
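
The slide does not spell out the "range buckets" structure, so the following is only one possible reading: a first tier that maps a numeric value to a fixed-width bucket, and a second tier that filters the exact values inside the buckets a query touches. `RangeIndex`, `BUCKET_WIDTH`, and the sample data are hypothetical.

```python
from collections import defaultdict

BUCKET_WIDTH = 100  # assumed fixed-width buckets


class RangeIndex:
    def __init__(self):
        # Tier 1: bucket id -> Tier 2: list of (value, cell_id)
        self.buckets = defaultdict(list)

    def add(self, value, cell_id):
        self.buckets[int(value // BUCKET_WIDTH)].append((value, cell_id))

    def query(self, lo, hi):
        """Return cell ids whose value lies in the closed range [lo, hi]."""
        hits = []
        first, last = int(lo // BUCKET_WIDTH), int(hi // BUCKET_WIDTH)
        for b in range(first, last + 1):
            for value, cell_id in self.buckets.get(b, []):
                if lo <= value <= hi:
                    hits.append(cell_id)
        return hits


idx = RangeIndex()
for value, cell_id in [(5, "a"), (150, "b"), (199, "c"), (250, "d")]:
    idx.add(value, cell_id)
hits = idx.query(100, 200)
```

Here the query touches buckets 1 and 2 and returns cells "b" and "c"; the value 250 shares bucket 2 with the upper bound but is filtered out by the exact comparison.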