Link Spam Alliances
  Zoltán Gyöngyi, Hector Garcia-Molina
Class List
  ▪ Spam 101: Intro to web spam
  ▪ Spam 221: Spamming PageRank
    • Spam farm model
    • Optimal farm structure
    • Alliances of two farms
    • Larger alliances
  ▪ Spam 321: Link spam detection seminar
  Very Large Data Bases ● Trondheim, September 1, 2005
Spam 101
  kaiser pharmacy online
Spam 101
  Save today on Viagra, Lipitor, Zoloft, … Phentermine 90 Pills/$119
Spam 101
  Pet shops commonly carry fish for home aquariums, small birds such as parakeets, small mammals such as fancy rats and hamsters…
Spam 101
  Pharmacy is the profession of compounding and dispensing medication. More recently, the term has come to include other services…
  Lawyers, Loans, Mortgage, Ringtones, Viagra
Spam 101
  Spamming = misleading search engines to obtain higher-than-deserved ranking
  Link spamming = building link structures that boost PageRank score
Spam 221: PageRank
  A page is important if many important pages point to it
  $p_0 = c \sum_i p_i / \mathrm{out}(i) + (1 - c)$
    • $p_0$: PageRank of page $p_0$
    • $p_i$: PageRank of a page $p_i$ that points to page $p_0$
Spam 221: PageRank
  $p_0 = c \sum_i p_i / \mathrm{out}(i) + (1 - c)$
    • $c$: damping factor ≈ 0.85
    • $1 - c$: random jump probability ≈ 0.15 (uniform static score)
    • $\mathrm{out}(i)$: outdegree of page $p_i$
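The PageRank formula above can be evaluated by plain fixed-point iteration. The following is a minimal sketch, not the authors' code: the function name `pagerank`, the dict-of-lists graph format, and the iteration count are assumptions made for illustration. It uses the slide's unnormalized form, in which every page receives the static score $(1 - c)$ and a page without outlinks passes nothing on.

```python
def pagerank(out_links, c=0.85, iters=200):
    """Iterate p_j = c * sum_i p_i / out(i) + (1 - c) over all pages.

    out_links maps each page to the list of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    scores = {page: 1.0 - c for page in pages}
    for _ in range(iters):
        # Every page starts each round with the static score (1 - c).
        new_scores = {page: 1.0 - c for page in pages}
        for page, targets in out_links.items():
            if not targets:
                continue  # a page without outlinks passes nothing on
            share = c * scores[page] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Two pages that link only to each other, plus a page nobody links to.
example = pagerank({"a": ["b"], "b": ["a"], "lonely": []})
```

In this toy graph, `a` and `b` converge to a score of 1, while `lonely` keeps only the static score $1 - c = 0.15$.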
Spam 221: Spam Farm Model
  [Figure: spam farm diagram with nodes labeled 0, 1, 2, …, k]
Spam 221: Spam Farm Model
  ▪ Single target page $p_0$
    • Increase exposure
    • In particular, increase PageRank
Spam 221: Spam Farm Model
  ▪ Boosting pages $p_1, \dots, p_k$
    • Owned/controlled by spammer
  [Example page: Canada Rx. Cheap Canadian drugs here: import pharmacy online best prescriptions discount savings]
Spam 221: Spam Farm Model
  ▪ Leakages $\lambda_0, \dots, \lambda_k$
    • Fractions of PageRank
    • Through hijacked links
      – Spammer has limited access to source page
    • $\lambda = \lambda_0 + \dots + \lambda_k$
  [Example: Joe’s Blog, posted on 04/28/05 … Comments: “Great thoughts! I also wrote about this issue in my blog.” (by as7869)]
Spam 221: Optimal Farm
  ▪ Simple
    $p_0 = \lambda + (1 - c)(c k + 1)$
    • Every link points to $p_0$
  ▪ Optimal
    $q_0 = p_0 / (1 - c^2)$
    • Links to boosting pages
    • 3.6x increase in target PageRank
  For $c = 0.85$
  [Figures: simple farm with boosting pages $p_1, p_2, \dots, p_k$, target $p_0$, and leakage $\lambda$; optimal farm with boosting pages $q_1, q_2, \dots, q_k$, target $q_0$, and leakage $\lambda$]
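The two farm formulas can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the helper name `target_score` and its parameters are hypothetical, the leakage $\lambda$ is added directly to the target score as in the slide's formula, and "links to boosting pages" is read as the target linking back to all $k$ boosting pages.

```python
def target_score(k, c=0.85, leakage=0.0, link_back=False, iters=500):
    """Target score of a farm with k boosting pages (illustrative sketch)."""
    target = 0.0
    for _ in range(iters):
        if link_back:
            # Optimal farm (assumed structure): the target spreads
            # c * target evenly over its k boosting pages.
            boost = c * target / k + (1.0 - c)
        else:
            # Simple farm: boosting pages have no inlinks.
            boost = 1.0 - c
        # Each boosting page links only to the target.
        target = leakage + c * k * boost + (1.0 - c)
    return target


simple = target_score(100)                    # p_0 = (1 - c)(c k + 1)
optimal = target_score(100, link_back=True)   # q_0 = p_0 / (1 - c^2)
gain = optimal / simple
```

With $c = 0.85$ the gain equals $1 / (1 - c^2) \approx 3.6$, matching the slide's figure.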
Spam 221: Optimal Farm
  ▪ Optimal
    $q_0 = p_0 / (1 - c^2)$
    • Links to boosting pages
    • 3.6x increase in target PageRank
  ▪ Optimal #2
    $r_0 = p_0 / (1 - c^2)$
    • Same PageRank
    • Fewer links
  [Figures: optimal farm with boosting pages $q_1, q_2, \dots, q_k$, target $q_0$, and leakage $\lambda$; second optimal farm with boosting pages $r_1, r_2, r_3, \dots, r_k$, target $r_0$, and leakage $\lambda$]
  Lesson #1: Short loop(s) increase target PageRank
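A short derivation shows why a single loop is enough. It assumes that in Optimal #2 the target links to just one boosting page, $r_1$, while all $k$ boosting pages still link to the target. This structure is an assumption read off the "fewer links" remark, since the slide's figure is not recoverable.

```latex
\begin{align*}
r_1 &= c\,r_0 + (1 - c), \qquad r_i = 1 - c \quad (i = 2, \dots, k) \\
r_0 &= \lambda + c\,r_1 + c\,(k - 1)(1 - c) + (1 - c) \\
    &= \lambda + c^2 r_0 + (1 - c)(c k + 1) \\
    &= c^2 r_0 + p_0 \\
\Rightarrow\; r_0 &= \frac{p_0}{1 - c^2}
\end{align*}
```

The factor $1 / (1 - c^2)$ comes entirely from the two-step loop target → $r_1$ → target, which is why linking back to more boosting pages adds nothing.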
Spam 221: Two Farms
  ▪ Alliances = interconnected farms
    • Single spammer, several target pages/farms
    • Multiple spammers
  What happens if you and I team up?
Spam 221: Two Farms
  ▪ We can do this…
  ▪ … but it won’t help: target scores balance out
    $d = c / (1 + c)$
    $p_0 = q_0 = d (k + m) / 2$
Spam 221: Two Farms
  ▪ However, we can also do this…
    • Remove the links to boosting pages
  [Figure: two farms with boosting pages $p_1, p_2, \dots, p_k$ and $q_1, q_2, \dots, q_m$ and targets $p_0$ and $q_0$]
  ▪ … and both target scores increase
    • For $k = m$, we have a 6.7x increase
    $p_0 = d k + c d m + 1$
    $q_0 = d m + c d k + 1$
  Lesson #2: Target pages should only link to other targets
  Lesson #3: In an alliance of two, both participants win
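The closed forms for $p_0$ and $q_0$ can be reproduced by iterating the PageRank recurrence for two targets that link only to each other. This is a sketch under assumptions: the function name `two_farm_scores` is hypothetical, leakage is taken as zero, and the 6.7x figure is interpreted as a comparison against the simple single farm $(1 - c)(c k + 1)$, which the slides do not state explicitly.

```python
def two_farm_scores(k, m, c=0.85, iters=500):
    """Target scores when two targets link only to each other (sketch)."""
    p0 = q0 = 0.0
    for _ in range(iters):
        # Simultaneous update: each target collects c * (1 - c) from each of
        # its own boosting pages, the static score (1 - c), and the other
        # target's whole damped score (each target has a single outlink).
        p0, q0 = (c * q0 + c * k * (1.0 - c) + (1.0 - c),
                  c * p0 + c * m * (1.0 - c) + (1.0 - c))
    return p0, q0


c, k, m = 0.85, 1000, 3000
d = c / (1.0 + c)
p0, q0 = two_farm_scores(k, m, c)
closed_p0 = d * k + c * d * m + 1.0
closed_q0 = d * m + c * d * k + 1.0

# Equal farm sizes: compare against the simple farm without leakage.
equal_p0, _ = two_farm_scores(k, k, c)
increase = equal_p0 / ((1.0 - c) * (c * k + 1.0))
```

For $k = m$ the ratio works out to $1 / (1 - c) \approx 6.7$ under this reading.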
Spam 221: Larger Alliances
  ▪ “Extremes”
    • Ring core
    • Completely connected core
Spam 221: Larger Alliances
  ▪ Target scores for ring/complete cores
    • 10 farms of sizes 1000, 2000, …, 10000
  [Figure: target PageRank (y-axis, 0 to 6000) versus farm number (x-axis, 1 to 10); curves: Complete, Ring, Optimal Single]
  Problem: farm 10 “loses” in a ring
  Lesson #4: Larger alliances need to be stable to keep all participants happy
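The comparison in the plot can be approximated with a small script. This is a sketch under several assumptions, not a reproduction of the authors' experiment: the function name `alliance_scores` is hypothetical, leakage is zero, targets link only to other targets, and the ring direction (each target links to the previous farm's target, with farm 1 linking to farm 10) is chosen here because it reproduces the remark that farm 10 loses.

```python
def alliance_scores(sizes, core, c=0.85, iters=500):
    """Target scores for farms whose targets form a ring or complete core."""
    n = len(sizes)
    # Each farm's own boosting pages plus the target's static score.
    base = [(1.0 - c) * (c * k + 1.0) for k in sizes]
    t = [0.0] * n
    for _ in range(iters):
        if core == "ring":
            # ASSUMED direction: target i receives from target i + 1,
            # so the last farm receives from the first (smallest) farm.
            t = [base[i] + c * t[(i + 1) % n] for i in range(n)]
        elif core == "complete":
            # Every target splits its score among the other n - 1 targets.
            total = sum(t)
            t = [base[i] + c * (total - t[i]) / (n - 1) for i in range(n)]
        else:
            raise ValueError(f"unknown core: {core}")
    return t


sizes = [1000 * i for i in range(1, 11)]
ring = alliance_scores(sizes, "ring")
complete = alliance_scores(sizes, "complete")
# Optimal single farm without leakage: (c k + 1) / (1 + c).
single = [(0.85 * k + 1.0) / 1.85 for k in sizes]
losers_in_ring = [i + 1 for i in range(len(sizes)) if ring[i] < single[i]]
```

Under these assumptions every farm beats its optimal single-farm score in the complete core, while `losers_in_ring` contains farm 10.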
Spam 221: Larger Alliances
  ▪ Stable alliance = no farm has incentive to split off
    • Alliances of two are always stable
    • Larger alliances are not necessarily stable
  ▪ Dynamics (see paper)
    • Should a new farm be added?
    • What about adding more boosting pages?
    • When/with whom should a farm split off?
    • Should a “loser” be compensated?
Spam 321: Spam Detection
  ▪ Identifying regular structures
    • Inlink/outlink/PageRank distribution “unnatural”
    • Fetterly et al., 2004
    • Benczúr et al., 2005
  [Figure: farm with boosting pages $p_1, p_2, \dots, p_k$, target $p_0$, and leakage $\lambda$]
  $p_1 = p_2 = \dots = p_k$
Spam 321: Spam Detection
  ▪ Detecting collusion
    • Alliance cores preserve (capture) PageRank
    • Zhang et al., 2004
  [Figure: two farms with boosting pages $p_1, p_2, \dots, p_k$ and $q_1, q_2, \dots, q_m$ and targets $p_0$ and $q_0$]
  $(p_0 + q_0) / (\sum_i p_i + \sum_j q_j) \approx c / (1 - c)$
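The ratio above follows from the two-farm closed forms $p_0 = d k + c d m + 1$ and $q_0 = d m + c d k + 1$ with $d = c / (1 + c)$. The sketch below checks it numerically; the function name `capture_ratio` is hypothetical, and it assumes zero leakage and boosting pages without inlinks, so that each boosting page scores exactly $1 - c$.

```python
def capture_ratio(k, m, c=0.85):
    """(p0 + q0) / (total boosting-page score) for a two-farm alliance."""
    d = c / (1.0 + c)
    p0 = d * k + c * d * m + 1.0
    q0 = d * m + c * d * k + 1.0
    # Assumption: boosting pages have no inlinks, so each scores (1 - c).
    boosting_total = (k + m) * (1.0 - c)
    return (p0 + q0) / boosting_total


ratio = capture_ratio(1000, 1000)
limit = 0.85 / (1.0 - 0.85)
```

Here $p_0 + q_0 = c (k + m) + 2$, so the ratio approaches $c / (1 - c) \approx 5.67$ as the farms grow.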
Spam 321: Spam Detection
  ▪ Estimating spam mass
    • Target PageRank depends on boosting
    • Work in progress
  [Figure: target page with leakage $\lambda$, comparing scores $p_0$ and $p'_0$]
  $(p_0 - p'_0) / p_0$ large
Review Session
  ▪ Link spammers target PageRank
  ▪ Spam farm model
    • Single target page
    • Boosting pages + leakage
  ▪ Alliances of two
    • Always better than alone
  ▪ Larger alliances
    • Different core structures
    • Not necessarily stable
      – Conditions on joining and leaving
Review Session
  ▪ Related work
    • Bianchini et al., 2005. Inside PageRank
    • Langville and Meyer, 2004. Deeper Inside PageRank
    • Baeza-Yates et al., 2005. PageRank Increase under Different Collusion Topologies
  ▪ Future work
    • Spam detection
    • Cost model extension
Spam 221: Larger Alliances
  ▪ Various core structures
    • 4 farms of size 50
    • One target probed (others symmetrical)
  [Figure: two panels over score group (x-axis, 1 to 20): target PageRank (y-axis, 40 to 160; “ring” marked) and number of graphs (y-axis, 20 to 100)]