General Overview - rel. model Carnegie Mellon Univ. Dept. of - - PDF document

general overview rel model carnegie mellon univ dept of
SMART_READER_LITE
LIVE PREVIEW

General Overview - rel. model Carnegie Mellon Univ. Dept. of - - PDF document

C. Faloutsos General Overview - rel. model Carnegie Mellon Univ. Dept. of Computer Science Relational model - SQL Formal & commercial query languages 15-415 - Database Applications Functional Dependencies Normalization


slide-1
SLIDE 1
  • C. Faloutsos

CMU - 15-415 1

Carnegie Mellon

Carnegie Mellon Univ.

  • Dept. of Computer Science

15-415 - Database Applications

  • C. Faloutsos

Indexing and Hashing – part II

15-415 - C. Faloutsos 2 Carnegie Mellon

General Overview - rel. model

  • Relational model - SQL

– Formal & commercial query languages

  • Functional Dependencies
  • Normalization
  • Physical Design
  • Indexing

15-415 - C. Faloutsos 3 Carnegie Mellon

Indexing- overview

  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics:

– dynamic hashing – multi-attribute indexing

15-415 - C. Faloutsos 4 Carnegie Mellon

(Static) Hashing

Problem: “find EMP record with ssn=123” What if disk space was free, and time was at premium?

15-415 - C. Faloutsos 5 Carnegie Mellon

Hashing

A: Brilliant idea: key-to-address transformation:

#0 page #123 page #999,999,999

123; Smith; Main str

15-415 - C. Faloutsos 6 Carnegie Mellon

Hashing

Since space is NOT free:

  • use M, instead of 999,999,999 slots
  • hash function: h(key) = slot-id

#0 page #123 page

#999,999,999

123; Smith; Main str

slide-2
SLIDE 2
  • C. Faloutsos

CMU - 15-415 2

15-415 - C. Faloutsos 7 Carnegie Mellon

Hashing

Typically: each hash bucket is a page, holding many records:

#0 page #h(123) M 123; Smith; Main str

15-415 - C. Faloutsos 8 Carnegie Mellon

Hashing

Notice: could have clustering, or non-clustering versions:

#0 page #h(123) M

123; Smith; Main str.

15-415 - C. Faloutsos 9 Carnegie Mellon

123

...

Hashing

Notice: could have clustering, or non-clustering versions:

#0 page #h(123) M

... EMP file

123; Smith; Main str.

...

234; Johnson; Forbes ave 345; Tompson; Fifth ave

...

15-415 - C. Faloutsos 10 Carnegie Mellon

Indexing- overview

  • ISAM and B-trees
  • hashing

– hashing functions – size of hash table – collision resolution

  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics:

15-415 - C. Faloutsos 11 Carnegie Mellon

Design decisions

1) formula h() for hashing function 2) size of hash table M 3) collision resolution method

15-415 - C. Faloutsos 12 Carnegie Mellon

Design decisions - functions

  • Goal: uniform spread of keys over hash

buckets

  • Popular choices:

– Division hashing – Multiplication hashing

slide-3
SLIDE 3
  • C. Faloutsos

CMU - 15-415 3

15-415 - C. Faloutsos 13 Carnegie Mellon

Division hashing

h(x) = (a*x+b) mod M

  • eg., h(ssn) = (ssn) mod 1,000

– gives the last three digits of ssn

  • M: size of hash table - choose a prime

number, defensively (why?)

15-415 - C. Faloutsos 14 Carnegie Mellon

  • eg., M=2; hash on driver-license number

(dln), where last digit is ‘gender’ (0/1 = M/F)

  • in an army unit with predominantly male

soldiers

  • Thus: avoid cases where M and keys have

common divisors - prime M guards against that!

Division hashing

15-415 - C. Faloutsos 15 Carnegie Mellon

Multiplication hashing

h(x) = [ fractional-part-

  • : golden ratio ( 0.618... = ( sqrt(5)-1)/2 )
  • in general, we need an irrational number
  • advantage: M need not be a prime number
  • but must be irrational

15-415 - C. Faloutsos 16 Carnegie Mellon

Other hashing functions

  • quadratic hashing (bad)
  • ...
  • conclusion: use division hashing

15-415 - C. Faloutsos 17 Carnegie Mellon

Design decisions

1) formula h() for hashing function 2) size of hash table M 3) collision resolution method

15-415 - C. Faloutsos 18 Carnegie Mellon

Size of hash table

  • eg., 50,000 employees, 10 employee-

records / page

  • Q: M=?? pages/buckets/slots
slide-4
SLIDE 4
  • C. Faloutsos

CMU - 15-415 4

15-415 - C. Faloutsos 19 Carnegie Mellon

Size of hash table

  • eg., 50,000 employees, 10 employees/page
  • Q: M=?? pages/buckets/slots
  • A: utilization ~ 90% and

– M: prime number

Eg., in our case: M= closest prime to 50,000/10 / 0.9 = 5,555

15-415 - C. Faloutsos 20 Carnegie Mellon

Design decisions

1) formula h() for hashing function 2) size of hash table M 3) collision resolution method

15-415 - C. Faloutsos 21 Carnegie Mellon

Collision resolution

  • Q: what is a ‘collision’?
  • A: ??

15-415 - C. Faloutsos 22 Carnegie Mellon

Collision resolution

#0 page #h(123) M

123; Smith; Main str.

15-415 - C. Faloutsos 23 Carnegie Mellon

Collision resolution

  • Q: what is a ‘collision’?
  • A: ??
  • Q: why worry about collisions/overflows?

(recall that buckets are ~90% full)

  • A: ‘birthday paradox’

15-415 - C. Faloutsos 24 Carnegie Mellon

Collision resolution

  • open addressing

– linear probing (ie., put to next slot/bucket) – re-hashing

  • separate chaining (ie., put links to overflow

pages)

slide-5
SLIDE 5
  • C. Faloutsos

CMU - 15-415 5

15-415 - C. Faloutsos 25 Carnegie Mellon

Collision resolution

#0 page #h(123) M

123; Smith; Main str. linear probing:

15-415 - C. Faloutsos 26 Carnegie Mellon

Collision resolution

#0 page #h(123) M

123; Smith; Main str. re-hashing h1() h2()

15-415 - C. Faloutsos 27 Carnegie Mellon

Collision resolution

123; Smith; Main str. separate chaining

15-415 - C. Faloutsos 28 Carnegie Mellon

Design decisions - conclusions

  • function: division hashing

– h(x) = ( a*x+b ) mod M

  • size M: ~90% util.; prime number.
  • collision resolution: separate chaining

– easier to implement (deletions!); – no danger of becoming full

15-415 - C. Faloutsos 29 Carnegie Mellon

Indexing- overview

  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics:

– dynamic hashing – multi-attribute indexing

15-415 - C. Faloutsos 30 Carnegie Mellon

Hashing vs B-trees:

Hashing offers

  • speed ! ( O(1) avg. search time)

..but:

slide-6
SLIDE 6
  • C. Faloutsos

CMU - 15-415 6

15-415 - C. Faloutsos 31 Carnegie Mellon

Hashing vs B-trees:

..but B-trees give:

  • key ordering:

– range queries – proximity queries – sequential scan

  • O(log(N)) guarantees for search, ins./del.
  • graceful growing/shrinking

15-415 - C. Faloutsos 32 Carnegie Mellon

Hashing vs B-trees:

thus:

  • B-trees are implemented in most systems

footnotes:

  • hashing is not (why not?)
  • ‘dbm’ and ‘ndbm’ of UNIX: offer one or both

15-415 - C. Faloutsos 33 Carnegie Mellon

Indexing- overview

  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics:

– dynamic hashing – multi-attribute indexing

15-415 - C. Faloutsos 34 Carnegie Mellon

Indexing in SQL

  • create index <index-name> on <relation-

name> (<attribute-list>)

  • create unique index <index-name> on

<relation-name> (<attribute-list>)

  • drop index <index-name>

15-415 - C. Faloutsos 35 Carnegie Mellon

Indexing in SQL

  • eg.,

create index ssn-index

  • n STUDENT (ssn)
  • or (eg., on TAKES(ssn,cid, grade) ):

create index sc-index

  • n TAKES (ssn, c-id)

15-415 - C. Faloutsos 36 Carnegie Mellon

Indexing- overview

  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics: (theoretical interest)

– dynamic hashing – multi-attribute indexing

slide-7
SLIDE 7
  • C. Faloutsos

CMU - 15-415 7

15-415 - C. Faloutsos 37 Carnegie Mellon

Problem with static hashing

  • problem: overflow?
  • problem: underflow? (underutilization)

15-415 - C. Faloutsos 38 Carnegie Mellon

Solution: Dynamic/extendible hashing

  • idea: shrink / expand hash table on demand..
  • ..dynamic hashing

Details: how to grow gracefully, on overflow? Many solutions - One of them: ‘extendible hashing’

15-415 - C. Faloutsos 39 Carnegie Mellon

Extendible hashing

#0 page #h(123) M

123; Smith; Main str.

15-415 - C. Faloutsos 40 Carnegie Mellon

Extendible hashing

#0 page #h(123) M

123; Smith; Main str. solution: split the bucket in two

15-415 - C. Faloutsos 41 Carnegie Mellon

Extendible hashing

in detail:

  • keep a directory, with ptrs to hash-buckets
  • Q: how to divide contents of bucket in two?
  • A: hash each key into a very long bit string;

keep only as many bits as needed Eventually:

15-415 - C. Faloutsos 42 Carnegie Mellon

Extendible hashing

directory 00... 01... 10... 11...

10101... 10110... 1101... 10011... 0111... 0001... 101001...

slide-8
SLIDE 8
  • C. Faloutsos

CMU - 15-415 8

15-415 - C. Faloutsos 43 Carnegie Mellon

Extendible hashing

directory 00... 01... 10... 11...

10101... 10110... 1101... 10011... 0111... 0001... 101001...

15-415 - C. Faloutsos 44 Carnegie Mellon

Extendible hashing

directory 00... 01... 10... 11...

10101... 10110... 1101... 10011... 0111... 0001... 101001...

split on 3-rd bit

15-415 - C. Faloutsos 45 Carnegie Mellon

Extendible hashing

directory 00... 01... 10... 11...

1101... 10011... 0111... 0001... 101001... 10101... 10110...

new page / bucket

15-415 - C. Faloutsos 46 Carnegie Mellon

Extendible hashing

directory (doubled)

1101... 10011... 0111... 0001... 101001... 10101... 10110...

new page / bucket

000... 001... 010... 011... 100... 101... 110... 111...

15-415 - C. Faloutsos 47 Carnegie Mellon

Extendible hashing

00... 01... 10... 11...

10101... 10110... 1101... 10011... 0111... 0001... 101001...

000... 001... 010... 011... 100... 101... 110... 111...

1101... 10011... 0111... 0001... 101001... 10101... 10110...

BEFORE AFTER

15-415 - C. Faloutsos 48 Carnegie Mellon

Extendible hashing

  • Summary: directory doubles on demand
  • or halves, on shrinking files
  • needs ‘local’ and ‘global’ depth (see book)
  • Mainly, of theoretical interest - same for

– ‘linear hashing’ of Litwin – ‘order preserving’ – ‘perfect hashing’ (no collisions!)

slide-9
SLIDE 9
  • C. Faloutsos

CMU - 15-415 9

15-415 - C. Faloutsos 49 Carnegie Mellon

Indexing- overview

  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics:

– dynamic hashing – multi-attribute indexing

15-415 - C. Faloutsos 50 Carnegie Mellon

multiple-key access

  • how to support queries on multiple

attributes, like

– grade>=3 and course=‘415’

  • major motivation: Geographic Information

systems (GIS)

15-415 - C. Faloutsos 51 Carnegie Mellon

multiple-key access

x y

15-415 - C. Faloutsos 52 Carnegie Mellon

multiple-key access

Typical query:

  • Find cities within x miles from Pittsburgh

thus, we want to store nearby cities on the same disk page:

15-415 - C. Faloutsos 53 Carnegie Mellon

multiple-key access

x y

15-415 - C. Faloutsos 54 Carnegie Mellon

multiple-key access

x y

slide-10
SLIDE 10
  • C. Faloutsos

CMU - 15-415 10

15-415 - C. Faloutsos 55 Carnegie Mellon

multiple-key access - R-trees

x y

15-415 - C. Faloutsos 56 Carnegie Mellon

multiple-key access - R-trees

x

  • R-trees: very successful for GIS
  • (along with z-ordering)
  • more details: at ‘advanced topics’, later
  • even more details: in 15-826

15-415 - C. Faloutsos 57 Carnegie Mellon

Indexing- overview

  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics:

– dynamic hashing – multi-attribute indexing