Vlad Sandulescu, joint work with Martin Ester
Detecting Singleton Review Spammers Using Semantic Similarity
2015.05.19
Online reviews
31% of consumers read online reviews before actually making a purchase (rising)
by the end of 2014, 15%
fake reviews
Immediately upon entering, we became aware of the fact that this is a unique and charming hotel. The main lobby is decorated by live vines overlapping the open-feeling roof and by chandeliers, quite a contrast. The hotel staff were courteous, welcoming and
street noises of New York were never noticeable. The location is convenient to everything in the area of Columbus Circle and Carnegie Hall and there is a subway
⋆ ⋆ ⋆ ⋆ ⋆
Ken K. Burke, VA
0 friends 4 reviews
Behavioural features text analysis
Hypothesis
detecting more subtle similarities between fake reviews written by the same author
between reviews, through paraphrase and synonyms
Goals
WordNet synsets
transport: carry, delight, enchant, ravish, ship, move, displace, tape drive, tape transport, rapture, ecstasy, raptus, shipping, transferral, transportation, transfer, transmit, channel, channelise, channelize, conveyance, exaltation, diffusion, send, enrapture, enthral, enthrall
transport - shipping = 0.8
transport - move = 0.2
Vectorial-based measures
For T1 and T2, their cosine similarity can be formulated as:
\[
\cos(T_1, T_2) = \frac{T_1 \cdot T_2}{\|T_1\| \, \|T_2\|} = \frac{\sum_{i=1}^{n} T_{1i} T_{2i}}{\sqrt{\sum_{i=1}^{n} (T_{1i})^2} \, \sqrt{\sum_{i=1}^{n} (T_{2i})^2}}
\]
Knowledge-based measures
For T1 and T2, their semantic similarity (Mihalcea et al.) can be formulated as:
\[
sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in \{T_1\}} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in \{T_1\}} idf(w)} + \frac{\sum_{w \in \{T_2\}} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in \{T_2\}} idf(w)} \right)
\]
transport - "The shop now offers night delivery"
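As a rough illustration of the two measures, the Python sketch below computes cosine similarity over bag-of-words count vectors and the Mihalcea et al. score. The names `toy_word_sim`, `TOY_SIM` and the `idf` dictionary are hypothetical stand-ins: the slides rely on WordNet-based word similarity, which is replaced here by a tiny hand-made table (reusing the 0.8 and 0.2 values from the synset slide) so that the example stays self-contained.

```python
import math
from collections import Counter


def cosine_similarity(tokens1, tokens2):
    """Cosine between two bag-of-words count vectors."""
    v1, v2 = Counter(tokens1), Counter(tokens2)
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical word-to-word similarities; a real system would query WordNet.
TOY_SIM = {
    frozenset(("transport", "shipping")): 0.8,
    frozenset(("transport", "move")): 0.2,
}


def toy_word_sim(a, b):
    """Assumed stand-in for a WordNet-based word similarity in [0, 1]."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def max_sim(word, tokens, word_sim):
    """Highest similarity between `word` and any word of the other text."""
    return max((word_sim(word, other) for other in tokens), default=0.0)


def directed_score(source, target, word_sim, idf):
    """idf-weighted average of maxSim over the distinct words of `source`."""
    words = set(source)
    total_idf = sum(idf.get(w, 1.0) for w in words)
    if total_idf == 0:
        return 0.0
    weighted = sum(max_sim(w, target, word_sim) * idf.get(w, 1.0) for w in words)
    return weighted / total_idf


def mihalcea_similarity(tokens1, tokens2, word_sim, idf):
    """Average of the two directed scores, following the formula shape above."""
    return 0.5 * (
        directed_score(tokens1, tokens2, word_sim, idf)
        + directed_score(tokens2, tokens1, word_sim, idf)
    )


t1 = ["fast", "transport"]
t2 = ["fast", "shipping"]
idf = {}  # assumption: unknown words default to idf = 1.0
print(cosine_similarity(t1, t2))
print(mihalcea_similarity(t1, t2, toy_word_sim, idf))
```

With these toy inputs the cosine score only credits the shared word "fast", while the knowledge-based score also credits the "transport"/"shipping" pair, which is the kind of paraphrase the vectorial measure misses.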
Aspect-based opinion mining
Immediately upon entering, we became aware of the fact that this is a unique and charming hotel. The main lobby is decorated by live vines overlapping the open-feeling roof and by chandeliers, quite a contrast. The hotel staff were courteous, welcoming and
street noises of New York were never noticeable. The location is convenient to everything in the area of Columbus Circle and Carnegie Hall and there is a subway
⋆ ⋆ ⋆ ⋆ ⋆
Ken K. Burke, VA
0 friends 4 reviews
N, D (plate labels: N words per document, D documents)
Θd represents the topic proportions for the dth document
Zd,n represents the topic assignment for the nth word in the dth document
Wd,n represents the observed word for the nth word in the dth document
β represents a distribution over the words in the known vocabulary
Topic Modeling for opinion spam detection
\[
IR(P, Q) = KL(P \,\|\, M) + KL(Q \,\|\, M), \quad M = \tfrac{1}{2}(P + Q)
\]
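The later slides report an "IR similarity" between LDA topic distributions. Assuming IR stands for the information radius (the sum of two KL divergences, each taken against the average of the two distributions), a minimal sketch could look as follows. The names `p` and `q` are illustrative, and how the divergence is turned into a similarity score is not shown in the slides, so that step is left out.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def information_radius(p, q):
    """KL(p || m) + KL(q || m), where m is the element-wise average of p and q."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl_divergence(p, m) + kl_divergence(q, m)


# Hypothetical topic proportions of two reviews over three topics.
review_a = [0.7, 0.2, 0.1]
review_b = [0.6, 0.3, 0.1]
print(information_radius(review_a, review_b))
```

Because the average `m` is non-zero wherever `p` or `q` is non-zero, the divergence stays finite even when the two reviews share no topics, which plain KL divergence does not guarantee.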
Ott dataset
Recommended reviews = truthful
Not recommended = fake
One submission per turker; short, illegible or plagiarized reviews were rejected
from 130 US and UK businesses
from 660 New York restaurants
from TripAdvisor and AMT
Preprocessing
lemma lemma
Pairwise similarity
"I am working hard on my presentation at WWW"
I/PRP am/VBP working/VBG hard/RB on/IN my/PRP$ presentation/NN at/IN WWW/NNP
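To make the preprocessing step concrete, here is a small Python sketch that parses a `word/TAG` string like the one in the example and keeps only content words. Restricting to nouns, verbs, adjectives and adverbs is an assumption made for illustration; the slides do not state which tags are kept, and the helper names `parse_tagged` and `content_words` are hypothetical.

```python
def parse_tagged(tagged_sentence):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in tagged_sentence.split():
        # rpartition splits on the last '/', so words containing '/' survive.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


# Assumed filter: Penn Treebank prefixes for nouns, verbs, adjectives, adverbs.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")


def content_words(pairs):
    """Lower-cased words whose tag starts with one of the content prefixes."""
    return [word.lower() for word, tag in pairs if tag.startswith(CONTENT_PREFIXES)]


tagged = "I/PRP am/VBP working/VBG hard/RB on/IN my/PRP$ presentation/NN at/IN WWW/NNP"
print(content_words(parse_tagged(tagged)))
```

A lemmatizer would normally run on the retained words before the pairwise similarity is computed; that step is omitted here because it needs an external lexicon.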
CPL - ↑P, T>0.75
↑T ⇒ ↑P
P=90%, T>0.8
Semantic ↑ F1-score
P=90%, T>0.85
Trustpilot's spammers are lazy
Yelp's spam is higher quality
[Figure: Yelp/Trustpilot - classifier performance with vectorial and semantic similarity measures. Panels: (a) Yelp - Precision, (b) Yelp - F1 Score, (c) Trustpilot - Precision, (d) Trustpilot - F1 Score. X-axis: Threshold. Series: cos, cpnl, cpl, mih.]
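The threshold sweep behind these plots can be sketched in a few lines of Python. The sketch assumes each review has a single score (for instance its highest similarity to another review of the same business) and is flagged as fake when that score reaches the threshold T. The scores and labels below are made up for illustration and are not results from the talk.

```python
def precision_recall_f1(scores, labels, threshold):
    """Flag a review as fake when score >= threshold, then score the flags."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical similarity scores and ground-truth labels (True = fake).
scores = [0.95, 0.85, 0.70, 0.60, 0.30]
labels = [True, True, False, True, False]

for t in (0.5, 0.8):
    p, r, f1 = precision_recall_f1(scores, labels, t)
    print(f"T={t}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data, raising T from 0.5 to 0.8 increases precision while lowering recall, the same trade-off the "↑T ⇒ ↑P" note summarises.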
Semantic similarity results
[Figure: Distribution of truthful and deceptive reviews - Ott. Cumulative percentage of reviews vs. similarity values. Panels: (a) Cos, (b) Mihalcea. Series: truthful, deceptive.]
Vectorial ∼ 2% diff
Semantic ∼ 6-10% diff
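A cumulative-percentage curve of this kind can be computed as below. This is only a sketch of the plotted quantity: the function name `cumulative_percentage`, the grid and the sample values are hypothetical, and the gap between the truthful and deceptive curves at a given similarity value is what the "diff" figures would refer to under that reading.

```python
def cumulative_percentage(similarities, grid):
    """Percentage of reviews whose similarity is <= each grid value."""
    n = len(similarities)
    if n == 0:
        raise ValueError("need at least one similarity value")
    return [100.0 * sum(1 for s in similarities if s <= g) / n for g in grid]


# Hypothetical similarity values for two small groups of reviews.
truthful = [0.10, 0.20, 0.25, 0.40]
deceptive = [0.20, 0.35, 0.45, 0.60]
grid = [0.2, 0.4, 0.6]

curve_truthful = cumulative_percentage(truthful, grid)
curve_deceptive = cumulative_percentage(deceptive, grid)
gaps = [t - d for t, d in zip(curve_truthful, curve_deceptive)]
print(curve_truthful, curve_deceptive, gaps)
```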
much shorter
talks about the same aspects
[Figure: Yelp/Trustpilot - classifier performance for IR similarity with bag-of-words LDA. Panels: (a) Yelp - Precision, (b) Yelp - F1 Score, (c) Trustpilot - Precision, (d) Trustpilot - F1 Score. X-axis: Threshold. Series: IR10, IR30, IR50, IR70, IR100.]
Bag-of-words LDA model results
[Figure: Yelp - classifier performance for IR similarity with bag-of-opinion-phrases LDA. Panels: (a) Precision, (b) F1 Score. X-axis: Threshold. Series: IR10, IR30, IR50, IR70, IR100.]
Bag-of-opinion-phrases LDA model results
Key points
Thank you. Questions?