CS425: Algorithms for Web Scale Data
Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org
15 CS 425 – Lecture 3 Mustafa Ozdal, Bilkent University
Input text:
5-shingles:
9-shingles:
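As a rough illustration of the input text / 5-shingles / 9-shingles breakdown, here is a minimal sketch of character-level k-shingling. The `shingles` helper and the sample string are assumptions made for this example; the sample is not the input text from the original slide.

```python
def shingles(text: str, k: int) -> set:
    """Return the set of all k-character substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Hypothetical sample text, chosen only for illustration.
sample = "The most effective way"

five = shingles(sample, 5)
nine = shingles(sample, 9)

print(sorted(five)[:3])
print("The most " in nine)
```

A document shorter than k characters yields an empty set under this definition.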
Storage of k-shingles: k bytes per shingle.
Instead, hash each shingle to a 4-byte integer.
E.g. “The most ” → 4320
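One way to sketch the hashing step is with CRC32, which always produces a value that fits in 4 bytes. CRC32 is an assumption chosen here for convenience; the slides do not name a hash function, and the value 4320 above is only illustrative, so this code is not expected to reproduce it.

```python
import zlib

def shingle_hash(shingle: str) -> int:
    """Map a shingle to an integer in [0, 2**32) using CRC32 (an assumed choice)."""
    return zlib.crc32(shingle.encode("utf-8"))

h = shingle_hash("The most ")
# The hash fits in exactly 4 bytes, regardless of the shingle length k.
print(h, len(h.to_bytes(4, "big")))
```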
Which one is better?
1. Using 4-shingles?
2. Using 9-shingles, and then hashing each to a 4-byte integer?
Consider the number of distinct elements that can be represented with 4 bytes.
Not all characters are common, e.g. we are unlikely to see shingles like “zy%p”.
Rule of thumb: # of k-shingles is about $20^k$
Using 4-shingles:
# of shingles: $20^4$ = 160K
Using 9-shingles and then hashing to 4-byte values:
# of shingles: $20^9$ = 512B
# of buckets: $2^{32}$ = 4.3B
512B shingles (uniformly) distributed to 4.3B buckets
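The arithmetic behind these counts can be checked directly. This is a small sketch that takes the rule-of-thumb alphabet size of 20 from the slide; the variable names are made up for the example.

```python
ALPHABET = 20  # rule-of-thumb number of common characters

four_shingles = ALPHABET ** 4   # distinct 4-shingles
nine_shingles = ALPHABET ** 9   # distinct 9-shingles
buckets = 2 ** 32               # values representable in 4 bytes

print(four_shingles)            # 160000
print(nine_shingles)            # 512000000000
print(buckets)                  # 4294967296
print(nine_shingles / buckets)  # average shingles per bucket
```

On these numbers, 4-shingles occupy only about 160K of the roughly 4.3B values a 4-byte slot can hold, whereas hashed 9-shingles can spread over essentially all of them.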
2nd element of the permutation is the first to map to a 1
4th element of the permutation is the first to map to a 1
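The two captions above describe how a min-hash value is read off a permuted column. A minimal sketch, assuming a column is a list of 0/1 values indexed by row and a permutation is the order in which rows are visited; the `minhash` helper and the sample data are hypothetical.

```python
def minhash(column, permutation):
    """Return the 1-based position of the first row, in permuted order, holding a 1."""
    for position, row in enumerate(permutation, start=1):
        if column[row] == 1:
            return position
    return None  # all-zero column: no row maps to a 1

column = [0, 1, 0, 1, 0]        # hypothetical column over rows 0..4
permutation = [2, 3, 0, 1, 4]   # hypothetical order in which rows are visited

# Row 2 holds a 0, row 3 holds a 1, so the 2nd element is the first to map to a 1.
print(minhash(column, permutation))  # 2
```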
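Choose a random permutation π.
Claim: Pr[h(Ci) = h(Cj)] = sim(Ci, Cj)
Proof: Consider 3 types of rows:
Type X: Ci and Cj both have 1s
Type Y: exactly one of Ci and Cj has a 1
Type Z: Ci and Cj both have 0s
After random permutation π, what if the first X-type row comes before the first Y-type row? Then h(Ci) = h(Cj).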
What is the probability that the first not-Z row is of type X?
$\Pr[h(C_i) = h(C_j)] = \frac{|X|}{|X| + |Y|}$
$\mathrm{sim}(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|} = \frac{|X|}{|X| + |Y|} = \Pr[h(C_i) = h(C_j)]$
Conclusion: Pr[h(Ci) = h(Cj)] = sim(Ci, Cj)
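The conclusion can be checked empirically with a small simulation. This is a sketch under assumptions: columns are represented as sets of row indices that hold a 1, and the helper names, sample columns, and trial count are all chosen for illustration.

```python
import random

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b)

def minhash_set(rows_with_one: set, order: list) -> int:
    """Index in `order` of the first row that belongs to the set."""
    return next(i for i, row in enumerate(order) if row in rows_with_one)

def agreement_rate(a, b, n_rows, trials, seed=0):
    """Fraction of random permutations on which the two min-hash values agree."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    hits = 0
    for _ in range(trials):
        rng.shuffle(order)  # draw a fresh random permutation of the rows
        hits += minhash_set(a, order) == minhash_set(b, order)
    return hits / trials

ci = {0, 1, 2, 3}   # hypothetical column Ci
cj = {2, 3, 4, 5}   # hypothetical column Cj; |X| = 2, |Y| = 4

print(jaccard(ci, cj))
print(agreement_rate(ci, cj, n_rows=10, trials=20000))
```

The two printed values should be close to each other, with the gap shrinking as the number of trials grows.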
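What is the expected value of the Jaccard similarity of two signatures?
$\frac{1}{s} \sum_{\pi=1}^{s} \Pr[h_\pi(C_1) = h_\pi(C_2)]$
Here s is the number of min-hash values in each signature. Since each term equals sim(C1, C2), the expected value is sim(C1, C2).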
Law of large numbers: the average of the results obtained from a large number of trials should be close to the expected value.
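The effect of signature length can be sketched as follows: build signatures from s random permutations and estimate similarity as the fraction of positions where they agree. The helper names, sample columns, and the values of s are assumptions for illustration only.

```python
import random

def signature(rows_with_one, permutations):
    """One min-hash value per permutation: position of the first row in the set."""
    return [next(i for i, row in enumerate(p) if row in rows_with_one)
            for p in permutations]

def estimate(sig1, sig2):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def random_permutations(n_rows, s, seed=0):
    """s random orderings of the rows 0..n_rows-1."""
    rng = random.Random(seed)
    return [rng.sample(range(n_rows), n_rows) for _ in range(s)]

c1, c2 = {0, 1, 2, 3}, {2, 3, 4, 5}   # true Jaccard similarity = 2/6

for s in (10, 100, 1000):
    perms = random_permutations(10, s)
    print(s, estimate(signature(c1, perms), signature(c2, perms)))
```

Estimates from longer signatures should tend to sit closer to 2/6, in line with the law of large numbers.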
How to pick a random hash function h(x)?
Universal hashing: $h_{a,b}(x) = ((a \cdot x + b) \bmod p) \bmod N$
where:
a, b … random integers
p … prime number (p > N)
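A minimal sketch of drawing hash functions from this family. The specific prime, the ranges used for a and b (a from 1 to p-1, b from 0 to p-1, a common convention), and the `make_hash` helper are assumptions; the slide only requires random integers a, b and a prime p > N.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1; assumes N < P and keys x < P

def make_hash(n_buckets, rng):
    """Draw one function h_{a,b}(x) = ((a*x + b) mod P) mod n_buckets."""
    a = rng.randrange(1, P)   # assumed range: 1 <= a < P
    b = rng.randrange(0, P)   # assumed range: 0 <= b < P
    def h(x):
        return ((a * x + b) % P) % n_buckets
    return h

rng = random.Random(42)
hashes = [make_hash(1000, rng) for _ in range(3)]

# Each function maps the same key to some bucket in [0, 1000).
print([h(12345) for h in hashes])
```

Drawing many such functions with different (a, b) pairs is one way to stand in for random permutations without materializing them.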