SLIDE 1 Warmup
Announcements: PA2 due tonight
Can you put the balls in boxes so that no box has more than one ball?
- No. You cannot put more things in than there are positions. Mathematicians
call this the "pigeonhole principle". Where do these go?
Take home: We have to deal with collisions unless we have big hash tables.
SLIDE 2
Collisions
A collision occurs when two keys map to the same place, that is, for x, y ∈ K with x ≠ y we have hash(x) = hash(y).
K is the set of keys
[Figure: keys (Geoff, Cinda, Andy/Cinda, Will) are passed through the hash function into a hash table with slots up to m-1; two keys landing in the same slot are labelled "Collision".]
We will use m to denote the size of the hash table.
We will use n or k to denote the number of keys.
SLIDE 3 What is the probability a collision happens?
What is the probability someone in this room has a birthday today? What is the probability two people in this room share the same birthday?
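Assume 365 days in a year.
Prob of one person not having a birthday today: $\frac{364}{365}$
Prob of k people not having a birthday today: $\left(\frac{364}{365}\right)^k$
Prob of someone having a birthday today: $1 - \left(\frac{364}{365}\right)^k$
Prob of k people not sharing a birthday: $\frac{365}{365} \cdot \frac{364}{365} \cdots \frac{365 - (k - 1)}{365}$
Prob of sharing: $p = 1 - \frac{365}{365} \cdot \frac{364}{365} \cdots \frac{365 - (k - 1)}{365}$
With one person p = 0. With 366 people p = 1 by pigeonhole.
[Plot: p against k for k from 1 to 366; p reaches 0.5 at k = 23.]
Intuitive but wrong: p = (k - 1) / 365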
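The product for the probability of sharing is easy to evaluate numerically. The following is a minimal C++ sketch, not taken from the slides; the function name `prob_shared_birthday` and the `days` parameter are assumptions made for illustration. It multiplies the k "no repeat so far" factors and subtracts the result from 1.

```cpp
#include <cassert>

// Probability that at least two of k people share a birthday,
// assuming `days` equally likely birthdays (365 on the slide).
// Hypothetical helper, written for illustration only.
double prob_shared_birthday(int k, int days = 365) {
    assert(k >= 0 && days > 0);
    if (k > days) return 1.0;  // pigeonhole: a repeat is forced
    double p_all_distinct = 1.0;
    for (int i = 0; i < k; ++i) {
        // Person i must avoid the i days already taken.
        p_all_distinct *= static_cast<double>(days - i) / days;
    }
    return 1.0 - p_all_distinct;
}
```

Under these assumptions the value crosses 0.5 between k = 22 and k = 23, which agrees with the k = 23 mark on the plot.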
SLIDE 4 Expected value
Definition: The expected value of a (discrete) random variable X is:
$E[X] = \sum_{x \in \mathcal{X}} x \cdot \mathrm{Prob}(X = x)$
where $\mathcal{X}$ is the set of values X could have.
Question: Suppose we roll a fair six-sided die, that is, all values are equally probable. What is the expected value?
You could think of this as the average value.
$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \cdots + 6 \cdot \frac{1}{6} = 3.5$
Remember this is like an average. You don't really expect a single roll of the die to show 3.5. But if you do a bunch and average then the value you get would be close to 3.5.
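The definition translates directly into a weighted sum. Below is a small C++ sketch under the assumption that the distribution is given as two parallel lists (values and their probabilities); the names `expected_value` and `fair_die_expected_value` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// E[X] = sum over x of x * Prob(X = x), for a finite distribution
// given as parallel vectors. Hypothetical helper for illustration.
double expected_value(const std::vector<double>& values,
                      const std::vector<double>& probs) {
    assert(values.size() == probs.size());
    double e = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        e += values[i] * probs[i];
    }
    return e;
}

// The die from the slide: faces 1..6, each with probability 1/6.
double fair_die_expected_value() {
    const std::vector<double> faces{1, 2, 3, 4, 5, 6};
    const std::vector<double> probs(6, 1.0 / 6.0);
    return expected_value(faces, probs);
}
```

Up to floating-point rounding, `fair_die_expected_value()` gives 3.5.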
SLIDE 5 Linearity of expected value
For any two random variables X and Y we have E[X + Y] = E[X] + E[Y]. This holds for any two random variables even if they are NOT independent.
Example: Rolling two six-sided dice. What is the expected value of their sum?

      | 1  2  3  4  5  6
    --+------------------
    1 | 2  3  4  5  6  7
    2 | 3  4  5  6  7  8
    3 | 4  5  6  7  8  9
    4 | 5  6  7  8  9 10
    5 | 6  7  8  9 10 11
    6 | 7  8  9 10 11 12

Slow way: there are 36 possibilities (for example, two ways to get 3), so
$2 \cdot \frac{1}{36} + 3 \cdot \frac{2}{36} + \cdots + 12 \cdot \frac{1}{36} = 7$
Better way: E[X + Y] = E[X] + E[Y] = 3.5 + 3.5 = 7. We use that E[X] = E[Y] = 3.5 from the last slide because we have fair dice.
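The two routes can be compared in code. This is an illustrative C++ sketch, with hypothetical function names: one function does the slow enumeration over all 36 outcomes, the other uses the single-die expectation so that linearity gives the answer by doubling.

```cpp
#include <cassert>

// Slow way: average x + y over all sides*sides equally likely outcomes.
double expected_two_dice_sum(int sides = 6) {
    assert(sides > 0);
    long total = 0;
    for (int x = 1; x <= sides; ++x) {
        for (int y = 1; y <= sides; ++y) {
            total += x + y;
        }
    }
    return static_cast<double>(total) / (sides * sides);
}

// Expected value of one fair die: the average of 1..sides.
double expected_one_die(int sides = 6) {
    assert(sides > 0);
    return (sides + 1) / 2.0;
}
```

By linearity, `expected_two_dice_sum(6)` should equal `2 * expected_one_die(6)`, and both come to 7.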
SLIDE 6 How many collisions should we expect?
What is the expected number of pairs of people who share a birthday in the room?
Let $X_{ij} = 1$ if persons i and j share a birthday, and 0 otherwise (an indicator variable).
Then $\sum_{i=1}^{n} \sum_{j=i+1}^{n} X_{ij}$ is the number of pairs of people who share birthdays.
$E\left[\sum_{i=1}^{n} \sum_{j=i+1}^{n} X_{ij}\right] = \sum_{i=1}^{n} \sum_{j=i+1}^{n} E[X_{ij}] = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \mathrm{Prob}(X_{ij} = 1) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \frac{1}{365} = \frac{n(n-1)}{2} \cdot \frac{1}{365}$
Why $\mathrm{Prob}(X_{ij} = 1) = \frac{1}{365}$: the first person picks any day, then the second can only pick the same day 1 out of 365 ways.
Generalisation and collisions: If we randomly put k items into m bins we expect $\frac{1}{m} \cdot \frac{k(k-1)}{2}$ pairs to collide (share the same bin). To have one or more expected collisions we solve $\frac{1}{m} \cdot \frac{k(k-1)}{2} \ge 1 \implies k \gtrsim \sqrt{2m}$.
Take home: We expect collisions even with few keys. Example: If m = 10,000 then with about 140 keys we expect a collision.
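The expected-collision formula can be checked with a few lines of code. The sketch below is illustrative only; the function names are hypothetical, and it assumes each of the k items lands in one of m bins uniformly at random, as in the generalisation on this slide.

```cpp
#include <cassert>

// Expected number of colliding pairs when k items are placed
// uniformly at random into m bins: (1/m) * k(k-1)/2.
double expected_colliding_pairs(long long k, long long m) {
    assert(k >= 0 && m > 0);
    return static_cast<double>(k) * (k - 1) / 2.0 / m;
}

// Smallest k for which the expected number of colliding pairs is >= 1.
long long keys_for_one_expected_collision(long long m) {
    assert(m > 0);
    long long k = 1;
    while (expected_colliding_pairs(k, m) < 1.0) {
        ++k;
    }
    return k;
}
```

For m = 10,000 this search stops at k = 142, close to √(2m) ≈ 141.4 and consistent with the "about 140 keys" estimate.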
SLIDE 7 Building a hash function
Conceptually there are two challenges:
- 1. Mapping our keys to integers
- 2. Mapping the resulting integers to array indices {1,2,3,...,m−1}
[Figure: keys (Geoff, Cinda, Andy/Cinda) -> Step 1 -> integers (1023, 100003234, 42) -> Step 2 -> array indices 1 to 6 of a table holding 201, 203, 202.]
In practice we typically solve them together.
Later we will pretend that our keys are integers, but they could be anything; we just assume we know how to do step 1.
SLIDE 8
Mapping strings to integers
How do we map “Andy” to a number? Use the ASCII codes A=65, n=110, d=100, y=121:
$256^3 \cdot 121 + 256^2 \cdot 100 + 256^1 \cdot 110 + 256^0 \cdot 65 = 2036624961$
Is this a good mapping scheme?
What if the string is long?
GOT book ~ 6,000,000 chars
Human genome ~ 3,000,000,000 chars
We would overflow very quickly with these strings.
Solution: We can use mod to wrap the values back.
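The "Andy" computation can be reproduced in code. This is a hedged sketch, not from the slides: the name `string_to_integer` is hypothetical, and the 8-character limit is an assumption added so that the result fits in 64 bits, which is exactly the overflow problem noted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Treats s as a base-256 number with s[0] as the least significant
// digit, matching the "Andy" example. Hypothetical helper; strings
// longer than 8 characters would overflow a 64-bit integer.
std::uint64_t string_to_integer(const std::string& s) {
    assert(s.size() <= 8);
    std::uint64_t value = 0;
    for (std::size_t i = s.size(); i-- > 0;) {
        value = value * 256 + static_cast<unsigned char>(s[i]);
    }
    return value;
}
```

With ASCII input, `string_to_integer("Andy")` evaluates to 2036624961.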
SLIDE 9 Hashing strings
    #include <string>
    using std::string;

    int hash(string s, int p) {
        int h = 0;
        // Horner's rule: process the characters from last to first.
        for (int i = (int)s.length() - 1; i >= 0; i--) {
            h = (256 * h + s[i]) % p;  // mod p wraps the value to avoid overflow
        }
        return h;
    }

Horner's rule: x^3 + x^2 + x = x(x^2 + x + 1) = x(x(x + 1) + 1). Evaluate from the inside out.
Wrapping values to avoid overflow: (a + b) mod c = ((a mod c) + (b mod c)) mod c
hash("Andy", 1024) = 577
hash("Cinda", 1024) = 323
hash("Geoff", 1024) = 327
hash("Andy/Cinda", 1024) = 577 (collision with "Andy")
Runtime: What is the right parameter for runtime? Length of string = s, so the runtime is Θ(s).
This could be bad if we hash a human genome. Solution: When building the hash function pick a random set of positions and
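To check the hash values quoted on this slide, here is a self-contained sketch of the same loop. The name `string_hash` is hypothetical (chosen to avoid clashing with `std::hash`), and using `long long` for the intermediate value is an assumption added to reduce the risk of overflow for large p.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same scheme as the slide's hash(): Horner's rule over the
// characters from last to first, reduced mod p at every step.
int string_hash(const std::string& s, int p) {
    assert(p > 0);
    long long h = 0;
    for (std::size_t i = s.size(); i-- > 0;) {
        h = (256 * h + static_cast<unsigned char>(s[i])) % p;
    }
    return static_cast<int>(h);
}
```

With p = 1024, the term 256·h mod 1024 depends only on h mod 4, so the result is determined by the first character and the low two bits of the second. That is why "Andy" and "Andy/Cinda" collide at 577.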
SLIDE 10 Fixed hash functions are a bad idea
Always using the same hash function can lead to poor performance:
- If a malicious user knows your hash function they can
always cause a collision
- Even a nice user could cause problems. Suppose we use
the previous hash function and set p = 65, note this is the ASCII value for “A”. Then what does the following code do?
    cout << hash("A", 65);
    cout << hash("AA", 65);
    cout << hash("AAA", 65);
    cout << hash("AAAA", 65);
The solution is to create a new hash function every time you create a dictionary. For the current example this could mean choosing a random value of p.
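A sketch of both points, under stated assumptions: `hash_with_modulus` is a hypothetical stand-in for the string hash from the previous slide, and `random_modulus` picks p uniformly from a caller-supplied range [lo, hi]. The slides only say "a random value of p", so the range and the choice of generator are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Stand-in for the previous slide's string hash (hypothetical name).
int hash_with_modulus(const std::string& s, int p) {
    assert(p > 0);
    long long h = 0;
    for (std::size_t i = s.size(); i-- > 0;) {
        h = (256 * h + static_cast<unsigned char>(s[i])) % p;
    }
    return static_cast<int>(h);
}

// Picks a modulus uniformly at random from [lo, hi]; intended to be
// called once per dictionary. The range is an assumption.
int random_modulus(int lo, int hi, std::mt19937& rng) {
    assert(0 < lo && lo <= hi);
    std::uniform_int_distribution<int> dist(lo, hi);
    return dist(rng);
}
```

With p = 65 every 'A' contributes 65, which is 0 mod 65, so `hash_with_modulus` returns 0 for "A", "AA", "AAA" and "AAAA" alike. A randomly chosen p makes such an unlucky match between keys and modulus less likely to recur across dictionaries.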
SLIDE 11
Universal hashing
Definition: A set H of hash functions is universal if for all x ≠ y, the probability that hash(x) = hash(y) is at most 1/m when hash() is chosen at random from H.
Note: hash is random, x and y are fixed.
Example: Let p be a prime number larger than any key. Choose a at random from {1,2,...,p−1} and b at random from {0,1,...,p−1}. Then the following set of functions is universal:
h(x) = ((a·x + b) mod p) mod m
The final mod m squishes values into the array.
Why should p be prime and bigger than the keys? Then p is not divisible by x. Same argument for picking a and b.
Why can't a be 0? Every value maps to the same place!
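The family h(x) = ((a·x + b) mod p) mod m can be sketched as follows. This is an illustration, not the course's implementation: the names `UniversalHash` and `make_universal_hash` are hypothetical, the caller is assumed to supply a prime p larger than every key (the code does not check primality), and p is restricted to below 2^32 so that a·x + b fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// One member of the family h(x) = ((a*x + b) mod p) mod m.
struct UniversalHash {
    std::uint64_t a, b, p, m;
    std::uint64_t operator()(std::uint64_t x) const {
        return ((a * x + b) % p) % m;
    }
};

// Draws a from {1,...,p-1} and b from {0,...,p-1} uniformly at random.
// Assumes p is a prime larger than any key; primality is NOT verified.
UniversalHash make_universal_hash(std::uint64_t p, std::uint64_t m,
                                  std::mt19937_64& rng) {
    assert(p >= 2 && m >= 1);
    assert(p < (1ULL << 32));  // keeps a*x + b below 2^64 for keys x < p
    std::uniform_int_distribution<std::uint64_t> pick_a(1, p - 1);
    std::uniform_int_distribution<std::uint64_t> pick_b(0, p - 1);
    return UniversalHash{pick_a(rng), pick_b(rng), p, m};
}
```

A dictionary would call `make_universal_hash` once at construction and then use the returned object for every key, so that x and y stay fixed while the function itself is the random choice.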
SLIDE 12 Collisions and hash function summary
- We can only avoid collisions if the size of the set of all possible keys we want to hash is less than or equal to the size of our hash table and we have a perfect hash function
  - This is bad if the set of keys is infinitely large
  - This is still bad if the set of keys is very big as it will require a big hash table and lots of memory
- We have to deal with collisions
  - Even with ≈ √m keys we will expect to have a collision
- We need a collision resolution strategy
  - Separate chaining
  - Open addressing
  - Many other interesting strategies we won’t talk about