Fast Dictionary-based Compression for Inverted Indexes



SLIDE 1

Fast Dictionary-based Compression for Inverted Indexes

12/02/2019

Matthias Petri
The University of Melbourne, Melbourne, Australia

Giulio Ermanno Pibiri
The University of Pisa and ISTI-CNR, Pisa, Italy

Alistair Moffat
The University of Melbourne, Melbourne, Australia

SLIDE 3

Context — Inverted Indexes

We focus on compression effectiveness and decoding speed for inverted indexes. The inverted index is the de facto data structure at the basis of every large-scale retrieval system.

Example collection of five documents:

  1: red is always good
  2: the house is red
  3: the house is always hungry
  4: boy is red
  5: the boy is hungry

Vocabulary, with terms numbered t1 to t8 in this order:

V = {always, boy, good, house, hungry, is, red, the}

Inverted lists, one per term:

  Lt1 = [1, 3]
  Lt2 = [4, 5]
  Lt3 = [1]
  Lt4 = [2, 3]
  Lt5 = [3, 5]
  Lt6 = [1, 2, 3, 4, 5]
  Lt7 = [1, 2, 4]
  Lt8 = [2, 3, 5]
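The example on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the five document strings are a reconstruction consistent with the inverted lists shown on the slide (the word order inside each document is an assumption), and `build_inverted_index` is a hypothetical helper name, not part of the authors' code.

```python
from collections import defaultdict

# Word order inside each document is an assumption; only the set of words
# per document is fixed by the inverted lists on the slide.
docs = {
    1: "red is always good",
    2: "the house is red",
    3: "the house is always hungry",
    4: "boy is red",
    5: "the boy is hungry",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)
    # Terms are emitted in lexicographic order, matching t1, ..., t8.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(docs)
for i, (term, plist) in enumerate(index.items(), start=1):
    print(f"Lt{i} ({term}) = {plist}")
```

Each list is sorted by document id, which is what makes gap-based compression of the lists possible.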

SLIDE 6

A large body of research describes different space/time trade-offs. Many solutions:

  • Elias gamma/delta
  • Variable-Byte family
  • Binary Interpolative Coding
  • Simple family
  • PForDelta
  • Optimized PForDelta
  • Elias-Fano
  • Partitioned Elias-Fano
  • Clustered Elias-Fano
  • Asymmetric Numeral Systems

The two ends of the spectrum:

  • Space: Binary Interpolative Coding, ~3X smaller
  • Time: Variable-Byte + SIMD, ~4.5X faster

RQ: Can we inherit both advantages?
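To make the time end of the spectrum concrete, here is a small sketch of one common Variable-Byte convention: 7 data bits per byte, with the high bit used as a continuation flag. It is a plain scalar version with no SIMD, the function names are illustrative, and the variants benchmarked in the talk may lay out bits differently.

```python
def vbyte_encode(values):
    """Encode non-negative integers, 7 data bits per byte, low group first.
    The high bit is set on every byte except the last byte of each integer."""
    out = bytearray()
    for v in values:
        while v >= 128:
            out.append((v & 0x7F) | 0x80)  # more bytes follow
            v >>= 7
        out.append(v)                      # final byte, high bit clear
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                     # integer continues in the next byte
        else:
            values.append(current)
            current, shift = 0, 0
    return values

sample = [1, 127, 128, 300, 2**31]
assert vbyte_decode(vbyte_encode(sample)) == sample
```

Decoding touches whole bytes only, which is why this family tends to be fast; the price is at least 8 bits per integer, however small the value.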

SLIDE 8

A crucial fact

Patterns of d-gaps are repetitive.

[Figure: d-gap patterns measured on the Gov2 collection]
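As a small illustration of what "repetitive patterns of d-gaps" means, the sketch below turns a sorted list of document ids into gaps and counts how often each window of consecutive gaps occurs. The input list is made up for the example, the convention that the first gap equals the first id is an assumption, and both function names are hypothetical.

```python
from collections import Counter

def to_dgaps(postings):
    """Replace each id by its difference from the previous id.
    The first id is kept as is (i.e. the previous id is taken to be 0)."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def pattern_counts(gaps, length):
    """Count every window of `length` consecutive gaps."""
    return Counter(tuple(gaps[i:i + length])
                   for i in range(len(gaps) - length + 1))

postings = [3, 4, 5, 6, 8, 9, 10, 11, 13, 14, 15, 16]  # made-up list
gaps = to_dgaps(postings)
print(gaps)  # [3, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1]
print(pattern_counts(gaps, 4)[(2, 1, 1, 1)])  # 2
```

Even in this tiny list the same windows of gaps recur, which is the property a dictionary of patterns can exploit.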

SLIDE 12

DINT (Dictionary of INTegers)

  • Encode a whole pattern with a single dictionary reference of b bits.
  • Decode a whole pattern with a single dictionary access.

Fixed-to-fixed arrangement.

[Figure: the input stream parsed into b-bit codewords c1, c2, ..., c11 plus an exception e; the dictionary is drawn with dimensions 2^b and l + 1.]

1/3 of the time is saved.
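The encode/decode idea on this slide can be sketched as a toy in Python. This is not the authors' implementation (their C++ code is linked later in the deck): the pattern-selection score, the greedy parse, the use of slot 0 as an exception marker, and the plain list of integers standing in for b-bit codewords are all assumptions made for illustration.

```python
from collections import Counter

EXCEPTION = 0  # codeword reserved for integers that match no dictionary entry

def build_dictionary(gaps, l, b):
    """Keep the 2**b - 1 patterns of length l, l/2, ..., 1 with the largest
    (frequency * length) score. The score is an assumed heuristic."""
    counts = Counter()
    length = l
    while length >= 1:
        for i in range(len(gaps) - length + 1):
            counts[tuple(gaps[i:i + length])] += 1
        length //= 2
    ranked = sorted(counts, key=lambda p: (-counts[p] * len(p), p))
    return [()] + ranked[: 2 ** b - 1]  # slot 0 is the exception marker

def encode(gaps, dictionary):
    """Greedy parse: at each position emit the longest matching pattern."""
    index_of = {p: i for i, p in enumerate(dictionary) if i != EXCEPTION}
    lengths = sorted({len(p) for p in index_of}, reverse=True)
    codes, i = [], 0
    while i < len(gaps):
        for length in lengths:
            pattern = tuple(gaps[i:i + length])
            if pattern in index_of:
                codes.append(index_of[pattern])
                i += len(pattern)
                break
        else:
            codes += [EXCEPTION, gaps[i]]  # marker followed by the raw integer
            i += 1
    return codes

def decode(codes, dictionary):
    """One table lookup per codeword yields a whole pattern."""
    out, i = [], 0
    while i < len(codes):
        if codes[i] == EXCEPTION:
            out.append(codes[i + 1])
            i += 2
        else:
            out.extend(dictionary[codes[i]])
            i += 1
    return out

gaps = [3, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 7]
dictionary = build_dictionary(gaps, l=4, b=2)  # toy sizes: 2**2 = 4 slots
codes = encode(gaps, dictionary)
assert decode(codes, dictionary) == gaps
print(len(gaps), "integers ->", len(codes), "codewords")
```

The decoder does no bit-level arithmetic per integer: each codeword is an index into a table, and one lookup copies out an entire pattern.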

SLIDE 13

Refinements

  1. Packed dictionary structure: exploiting string overlap
  2. Optimal block parsing
  3. Multiple dictionaries
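The slide does not say how optimal block parsing is computed. One natural reading, sketched below under that assumption, is a dynamic program over positions in the block that picks the cheapest sequence of dictionary patterns. The cost model (1 unit per dictionary reference, 2 units per exception) and the function name `optimal_parse` are hypothetical.

```python
def optimal_parse(block, patterns, exception_cost=2):
    """Return (cost, parse) where cost is the fewest codeword units needed.
    cost[i] holds the cheapest encoding of block[i:]."""
    n = len(block)
    cost = [0] * (n + 1)
    choice = [1] * n  # number of integers consumed at position i
    for i in range(n - 1, -1, -1):
        # Fallback: encode block[i] alone as an exception.
        cost[i], choice[i] = exception_cost + cost[i + 1], 1
        for p in patterns:
            k = len(p)
            # A slice running past the end is shorter than p and never matches.
            if tuple(block[i:i + k]) == p and 1 + cost[i + k] < cost[i]:
                cost[i], choice[i] = 1 + cost[i + k], k
    # Follow the recorded choices to recover the segmentation.
    # Exceptions appear in the parse as 1-tuples.
    parse, i = [], 0
    while i < n:
        parse.append(tuple(block[i:i + choice[i]]))
        i += choice[i]
    return cost[0], parse

patterns = [(1, 1, 1), (1, 1), (1, 2, 2, 2), (2,)]
print(optimal_parse([1, 1, 1, 2, 2, 2], patterns))
# (2, [(1, 1), (1, 2, 2, 2)])
```

On this input a greedy longest-match parse would take (1, 1, 1) first and then need three more references for the 2s, four codewords in total, whereas the dynamic program finds a parse with two.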

SLIDE 14

Experimental results: setting

  • Code: C++, available at https://github.com/jermp/dint
  • Machine: Intel Xeon 6144 processor, 512 GiB RAM, Linux 4.13.0
  • Compiler: gcc 7.2.0 (with all optimizations)
  • Datasets: [table not recovered]

SLIDE 16

Experimental results: compression effectiveness

[Figure not recovered. Parameters shown on the slide: l = 16, b = 16]

SLIDE 19

Experimental results: effectiveness/efficiency plot

SLIDE 20

Further reading

Chapters 6 and 7 of my Ph.D. thesis (more datasets, comparisons, query timings): http://pages.di.unipi.it/pibiri/papers/phd_thesis.pdf

SLIDE 21

Thanks for your attention, time, patience!

Any questions?