Can you put the balls in boxes so that no box has more than one - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Warmup

Announcements: PA2 due tonight

Can you put the balls in boxes so that no box has more than one ball? Where do these go?

  • No. You cannot put more things in than there are positions. Mathematicians call this the "pigeonhole principle".

Take home: We have to deal with collisions unless we have big hash tables.

slide-2
SLIDE 2

Collisions

A collision occurs when two keys map to the same place, that is, for x, y ∈ K with x ≠ y we have hash(x) = hash(y).

K is the set of keys

[Figure: keys (Geoff, Cinda, Will, Andy/Cinda) pass through the hash function into a hash table with slots such as 201, 202, 203, up to m−1; one slot is labelled "Collision".]

We will use m to denote the size of the hash table.
We will use n or k to denote the number of keys.

slide-3
SLIDE 3

What is the probability a collision happens?

What is the probability someone in this room has a birthday today? What is the probability two people in this room share the same birthday?

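Assume 365 days in a year.

Someone has a birthday today:

  • Prob of one person not having bday today: $\frac{364}{365}$
  • Prob of k people not having bday today: $\left(\frac{364}{365}\right)^k$
  • Prob of someone having bday today: $1 - \left(\frac{364}{365}\right)^k$

Two people share a birthday:

  • With one person p = 0. With 366 people p = 1 by pigeonhole.
  • Intuitive but wrong: p = (k − 1) / 365
  • Prob of k people not sharing: $\frac{365}{365} \cdot \frac{364}{365} \cdots \frac{365 - (k - 1)}{365}$
  • Prob of sharing = 1 − (prob of not sharing), which reaches about 0.5 at k = 23.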
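The two birthday calculations on this slide can be checked numerically. The sketch below is only an illustration: the function names and the `days` parameter are made up for this example, and it assumes every day is equally likely.

```cpp
#include <cassert>

// Probability that at least two of k people share a birthday,
// assuming `days` equally likely birthdays (365 on the slide).
double probSharedBirthday(int k, int days = 365) {
    assert(k >= 0 && days > 0);
    if (k > days) return 1.0;      // pigeonhole: a shared day is forced
    double noShare = 1.0;          // probability that all k birthdays differ
    for (int i = 0; i < k; ++i) {
        noShare *= static_cast<double>(days - i) / days;
    }
    return 1.0 - noShare;
}

// Probability that at least one of k people has a birthday on one fixed day.
double probBirthdayToday(int k, int days = 365) {
    assert(k >= 0 && days > 0);
    double none = 1.0;             // probability that nobody has that day
    for (int i = 0; i < k; ++i) {
        none *= static_cast<double>(days - 1) / days;
    }
    return 1.0 - none;
}
```

Calling `probSharedBirthday(22)` and `probSharedBirthday(23)` shows the probability crossing one half between those two group sizes.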

slide-4
SLIDE 4

Expected value

Definition: The expected value of a (discrete) random variable X is:

$E[X] = \sum_{x \in \mathcal{X}} x \cdot \mathrm{Prob}(X = x)$

where $\mathcal{X}$ is the set of values X could have. You could think of this as the average value.

Question: Suppose we roll a fair six-sided die, that is, all values are equally probable. What is the expected value?

$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \cdots + 6 \cdot \frac{1}{6} = 3.5$

Remember this is like an average. You don't really expect a single roll of the die to show 3.5. But if you roll many times and average the results, the value you get will be close to 3.5.
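The definition translates directly into a weighted sum. This is a minimal sketch; `expectedValue` and `expectedFairDie` are names invented for this example, and the caller is assumed to pass probabilities that sum to 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// E[X] = sum over all values x of x * Prob(X = x).
// values[i] occurs with probability probs[i].
double expectedValue(const std::vector<double>& values,
                     const std::vector<double>& probs) {
    assert(values.size() == probs.size());
    double e = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        e += values[i] * probs[i];
    }
    return e;
}

// The fair six-sided die from the slide: faces 1..6, each with probability 1/6.
double expectedFairDie() {
    std::vector<double> values, probs;
    for (int face = 1; face <= 6; ++face) {
        values.push_back(face);
        probs.push_back(1.0 / 6.0);
    }
    return expectedValue(values, probs);
}
```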

slide-5
SLIDE 5

Linearity of expected value

For any two random variables X and Y we have E[X + Y] = E[X] + E[Y]. This holds for any two random variables even if they are NOT independent.

Example: Rolling two six-sided dice. What is the expected value of the sum of the dice?

      |  1   2   3   4   5   6
    --+------------------------
    1 |  2   3   4   5   6   7
    2 |  3   4   5   6   7   8
    3 |  4   5   6   7   8   9
    4 |  5   6   7   8   9  10
    5 |  6   7   8   9  10  11
    6 |  7   8   9  10  11  12

Slow way: there are 36 possibilities (for example, two ways to get 3), so

$2 \cdot \frac{1}{36} + 3 \cdot \frac{2}{36} + \cdots + 12 \cdot \frac{1}{36} = 7$

Better way: E[X + Y] = E[X] + E[Y] = 3.5 + 3.5 = 7. We use that E[X] = E[Y] = 3.5 from the last slide because we have fair dice.
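Both routes to the answer can be compared in code. The sketch below enumerates all outcomes for the slow way and uses linearity for the better way; the function names are hypothetical.

```cpp
#include <cassert>

// Slow way: average a + b over every equally likely (a, b) outcome.
double expectedSumTwoDice(int sides = 6) {
    assert(sides > 0);
    long total = 0;
    for (int a = 1; a <= sides; ++a) {
        for (int b = 1; b <= sides; ++b) {
            total += a + b;
        }
    }
    return static_cast<double>(total) / (sides * sides);
}

// Expected value of one fair die with faces 1..sides.
double expectedOneDie(int sides = 6) {
    assert(sides > 0);
    return (sides + 1) / 2.0;
}

// Better way: linearity of expectation, E[X + Y] = E[X] + E[Y].
double expectedSumByLinearity(int sides = 6) {
    return expectedOneDie(sides) + expectedOneDie(sides);
}
```

The enumeration does work proportional to the number of outcomes, while the linearity version needs only the single-die expectation.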

slide-6
SLIDE 6

How many collisions should we expect?

What is the expected number of people who share a birthday in the room? Let

$X_{ij} = \begin{cases} 1 & \text{if person } i \text{ and } j \text{ share a birthday} \\ 0 & \text{otherwise} \end{cases}$

(an indicator variable). Then $\sum_{i=1}^{n} \sum_{j=i+1}^{n} X_{ij}$ counts the pairs of people who share a birthday.

$E\left[\sum_{i=1}^{n} \sum_{j=i+1}^{n} X_{ij}\right] = \sum_{i=1}^{n} \sum_{j=i+1}^{n} E[X_{ij}] = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \mathrm{Prob}(X_{ij} = 1) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \frac{1}{365} = \frac{n(n-1)}{2} \cdot \frac{1}{365}$

Here $\mathrm{Prob}(X_{ij} = 1) = \frac{1}{365}$: the first person picks any day, then the second can only pick the same day in 1 out of 365 ways.

Generalisation and collisions: If we randomly put k items into m bins we expect $\frac{1}{m} \cdot \frac{k(k-1)}{2}$ pairs to collide (share the same bin). To have one or more expected collisions we solve

$\frac{1}{m} \cdot \frac{k(k-1)}{2} \geq 1 \implies k \gtrsim \sqrt{2m}$

Take home: We expect collisions even with few keys. Example: If m = 10,000 then with about 140 keys we expect a collision.
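The expected number of colliding pairs is easy to tabulate. This sketch assumes k keys are placed uniformly at random into m bins; both function names are made up for illustration.

```cpp
#include <cassert>

// Expected number of colliding pairs: (1/m) * k(k-1)/2.
double expectedCollidingPairs(long long k, long long m) {
    assert(k >= 0 && m > 0);
    return static_cast<double>(k) * (k - 1) / 2.0 / m;
}

// Smallest k for which at least one colliding pair is expected,
// that is, the smallest k with k(k-1) >= 2m.
long long keysForOneExpectedCollision(long long m) {
    assert(m > 0);
    long long k = 1;
    while (k * (k - 1) < 2 * m) {
        ++k;
    }
    return k;
}
```

For m = 10,000 the exact threshold from this loop is 142, in line with the slide's "about 140" and with the square root of 2m, which is roughly 141.4.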

slide-7
SLIDE 7

Building a hash function

Conceptually there are two challenges:

  1. Mapping our keys to integers
  2. Mapping the resulting integers to array indices

[Figure: Step 1 takes keys (Geoff, Cinda, Andy/Cinda) to integers such as 1023, 100003234 and 42; Step 2 takes those integers to array indices such as 201, 203 and 202, drawn from {1,2,3,...,m−1}.]

In practice we typically solve them together.

Later we will pretend that our keys are integers, but they could be anything; we just assume we know how to do step 1.

slide-8
SLIDE 8

Mapping strings to integers

How do we map “Andy” to a number? Using the ASCII codes A=65, n=110, d=100, y=121:

$256^3 \cdot 121 + 256^2 \cdot 100 + 256^1 \cdot 110 + 256^0 \cdot 65 = 2036624961$

Is this a good mapping scheme? What if the string is long?

  • GOT book ~ 6,000,000 chars
  • Human genome ~ 3,000,000,000 chars

We would overflow very quickly with these strings.

Solution: We can use mod to wrap the values back.
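To see how little room there is before overflow, here is a sketch of the base-256 mapping without any mod. `stringToInteger` is a hypothetical name, and the 8-character limit comes from the 64-bit integer type chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Read s as a base-256 number whose first character is the least
// significant digit, matching the "Andy" calculation on the slide.
// Only safe for up to 8 characters: a ninth would overflow 64 bits.
std::uint64_t stringToInteger(const std::string& s) {
    assert(s.size() <= 8);
    std::uint64_t value = 0;
    for (std::size_t i = s.size(); i-- > 0; ) {
        value = 256 * value + static_cast<unsigned char>(s[i]);
    }
    return value;
}
```

`stringToInteger("Andy")` gives 2036624961, the value on the slide. Even a 64-bit integer runs out after 8 characters, far short of a book or a genome.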

slide-9
SLIDE 9

Hashing strings

int hash( string s, int p ) {
    int h = 0;
    // Horner's rule: start at the last character and work inwards
    for (int i = s.length() - 1; i >= 0; i--) {
        h = (256 * h + s[i]) % p;   // mod wraps the value to avoid overflow
    }
    return h;
}

Horner's rule: $x^3 + x^2 + x = x(x^2 + x + 1) = x(x(x + 1) + 1)$. Evaluate from the inside out.

Wrapping values to avoid overflow: (a + b) mod c = ((a mod c) + (b mod c)) mod c

hash("Andy", 1024) = 577
hash("Cinda", 1024) = 323
hash("Geoff", 1024) = 327
hash("Andy/Cinda", 1024) = 577 (collision with "Andy")

Runtime: What is the right parameter for runtime? The length of the string, s, so the runtime is Θ(s). This could be bad if we hash a human genome.

Solution: When building the hash function pick a random set of positions and only hash those.
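The slide does not say how the random positions are chosen, so the following is only one possible sketch of that idea. The class name, constructor parameters and seed handling are all assumptions made for this example, and p is assumed small enough that 256 * p fits in an int.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: picks a fixed random sample of positions when the
// hash function is built, then hashes only those positions of each string.
class SampledStringHash {
public:
    SampledStringHash(std::size_t sampleCount, std::size_t maxLength,
                      int p, unsigned seed)
        : p_(p) {
        assert(p > 0 && maxLength > 0);
        std::mt19937 gen(seed);
        std::uniform_int_distribution<std::size_t> pick(0, maxLength - 1);
        for (std::size_t i = 0; i < sampleCount; ++i) {
            positions_.push_back(pick(gen));
        }
    }

    // Runtime depends on the sample size, not on the string length.
    int operator()(const std::string& s) const {
        int h = 0;
        for (std::size_t pos : positions_) {
            if (pos < s.size()) {   // skip positions past the end of s
                h = (256 * h + static_cast<unsigned char>(s[pos])) % p_;
            }
        }
        return h;
    }

private:
    int p_;
    std::vector<std::size_t> positions_;
};
```

The trade-off is that two strings differing only at unsampled positions always collide, so the sample size has to be balanced against speed.
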
slide-10
SLIDE 10

Fixed hash functions are a bad idea

Always using the same hash function can lead to poor performance:

  • If a malicious user knows your hash function they can always cause a collision.
  • Even a nice user could cause problems. Suppose we use the previous hash function and set p = 65; note this is the ASCII value for “A”. Then what does the following code do?

cout << hash("A", 65);
cout << hash("AA", 65);
cout << hash("AAA", 65);
cout << hash("AAAA", 65);

The solution is to create a new hash function every time you create a dictionary. For the current example this could mean choosing a random value of p.
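To answer the question about p = 65, the string hash from the previous slide can be run directly. In this sketch it is renamed `hashString` to avoid clashing with `std::hash`, and `randomModulus` is a hypothetical helper showing one way to pick p at random; the range it draws from is an assumption.

```cpp
#include <cassert>
#include <random>
#include <string>

// Same scheme as the slide's hash(): Horner's rule with mod p at each step.
int hashString(const std::string& s, int p) {
    assert(p > 0);
    int h = 0;
    for (int i = static_cast<int>(s.length()) - 1; i >= 0; i--) {
        h = (256 * h + static_cast<unsigned char>(s[i])) % p;
    }
    return h;
}

// Hypothetical: choose the modulus at random from [lo, hi] each time a
// dictionary is created, so the hash function is not fixed in advance.
int randomModulus(int lo, int hi, std::mt19937& gen) {
    assert(0 < lo && lo <= hi);
    std::uniform_int_distribution<int> pick(lo, hi);
    return pick(gen);
}
```

With p = 65 every string made only of the letter "A" hashes to 0, because each step computes (256 · 0 + 65) mod 65 = 0. All four `cout` lines therefore print 0.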


slide-11
SLIDE 11

Universal hashing

Definition: A set H of hash functions is universal if for all x ≠ y, the probability that hash(x) = hash(y) is at most 1/m when hash() is chosen at random from H. Note: hash is random, x and y are fixed.

Example: Let p be a prime number larger than any key. Choose a at random from {1,2,...,p−1} and b at random from {0,1,...,p−1}. The following set of functions is universal:

h(x) = ((a · x + b) mod p) mod m

  • Why should p be prime and bigger than the keys? Then p is not divisible by x. Same argument for picking a and b.
  • The final mod m squishes the values into the array.
  • Why can't a be 0? Every value maps to the same place!
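A member of this family can be drawn at random when the table is built. This is a minimal sketch under stated assumptions: the struct name is invented, keys are unsigned integers smaller than p, and p is below 2^32 so that a · x + b fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// One randomly chosen function h(x) = ((a*x + b) mod p) mod m.
struct UniversalHash {
    std::uint64_t p, m, a, b;

    // prime: a prime larger than any key (assumed < 2^32 in this sketch).
    // tableSize: the number of slots m in the hash table.
    UniversalHash(std::uint64_t prime, std::uint64_t tableSize,
                  std::mt19937_64& gen)
        : p(prime), m(tableSize), a(0), b(0) {
        assert(prime > 1 && tableSize > 0);
        a = std::uniform_int_distribution<std::uint64_t>(1, p - 1)(gen);  // a is never 0
        b = std::uniform_int_distribution<std::uint64_t>(0, p - 1)(gen);
    }

    std::uint64_t operator()(std::uint64_t x) const {
        assert(x < p);   // keys must be smaller than p
        return ((a * x + b) % p) % m;
    }
};
```

One way to use it is to construct a fresh `UniversalHash` with, say, the prime 2147483647 each time a dictionary is created, so no fixed input set is bad for every run.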

slide-12
SLIDE 12

Collisions and hash function summary

  • We can only avoid collisions if the size of the set of all possible keys we want to hash is less than or equal to the size of our hash table and we have a perfect hash function
      • This is bad if the set of keys is infinitely large
      • This is still bad if the set of keys is very big as it will require a big hash table and lots of memory
  • We have to deal with collisions
      • Even with ≈ √m keys we will expect to have a collision
  • We need a collision resolution strategy
      • Separate chaining
      • Open addressing
      • Many other interesting strategies we won’t talk about