TopicSketch: Real-time Bursty Topic Detection from Twitter Wei Xie , - - PowerPoint PPT Presentation



slide-1
SLIDE 1

TopicSketch: Real-time Bursty Topic Detection from Twitter

Wei Xie, Feida Zhu, Jing Jiang, Ee-Peng Lim and Ke Wang*
Living Analytics Research Centre, Singapore Management University

* Ke Wang is from Simon Fraser University; this work was done while the author was visiting the Living Analytics Research Centre at Singapore Management University.

slide-2
SLIDE 2

Twitter as News Media

  • Twitter works as a huge news medium.
  • For some topics, especially bursty topics, news first appears on Twitter rather than in traditional news media.
  • It is both interesting and useful to detect bursty topics from Twitter.

slide-3
SLIDE 3

Handling Tweet Stream is Challenging

  • Large Volume
      • Number of tweets per day: 340 million
  • Large Velocity
      • Number of tweets per second: 9,000 (average) / 143,000 (peak)
  • Large Variety
      • All kinds of activities and topics appear on Twitter

slide-4
SLIDE 4

Outline

  • Motivation
  • Related Work
  • Proposed Method
      • Intuition
      • Indicator of burst
      • Assumptions
      • Solution
      • Framework
      • Dimension reduction
  • Experiment
  • Conclusion

slide-8
SLIDE 8

Related Work

  • Topic Modelling
      • Liangjie Hong, et al. A time-dependent topic model for multiple text streams. KDD 2011
      • Qiming Diao, et al. Finding Bursty Topics from Microblogs. ACL 2012
  • Topic Detection & Tracking
      • Sasa Petrovic, et al. Streaming First Story Detection with application to Twitter. HLT-NAACL 2010
      • Chenliang Li, et al. Twevent: segment-based event detection from tweets. CIKM 2012

Both lines of work have difficulty handling a large tweet stream, because they need to process a very large amount of historical data.

slide-12
SLIDE 12

Intuition

  • Rather than keeping the large historical data, maybe we can take a snapshot of the current data stream.
  • At the very least, a snapshot takes much less space, and hopefully we can infer topics from it efficiently.

But how?

slide-19
SLIDE 19

Acceleration as an Indicator

Adopt concepts from physics:

  • Velocity: the rate of change of the volume of the tweet stream
  • Acceleration: the rate of change of the velocity of the tweet stream
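Written as derivatives, the two definitions read as follows. Here s(t) is notation assumed for the cumulative volume of the tweet stream up to time t, and v and a are shorthand for its velocity and acceleration:

```latex
v(t) = s'(t) = \frac{\mathrm{d}s}{\mathrm{d}t},
\qquad
a(t) = s''(t) = \frac{\mathrm{d}v}{\mathrm{d}t} = \frac{\mathrm{d}^2 s}{\mathrm{d}t^2}
```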

slide-20
SLIDE 20

Acceleration as an Indicator

Acceleration: a very good early indicator of a burst.

slide-27
SLIDE 27

Acceleration as an Indicator

Usually we can observe the peak of acceleration earlier than the peak of velocity.

[Figure: two stacked plots over time t (axis ticks 2000, 4000, 6000, 8000): acceleration s''(t) above, velocity s'(t) below.]
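The ordering of the two peaks can be checked on a synthetic burst. The sketch below is only an illustration: the bell-shaped velocity curve, its parameters (`centre`, `width`, `height`) and the helper names are assumptions, not data or code from the talk. It differentiates the velocity numerically and compares where each curve peaks.

```python
import math

def burst_velocity(t, centre=5000.0, width=800.0, height=20.0):
    """Hypothetical bell-shaped arrival rate (tweets per second) around `centre`."""
    return height * math.exp(-((t - centre) ** 2) / (2.0 * width ** 2))

def finite_difference(values, step):
    """Central-difference derivative; the two end points are left at 0.0."""
    deriv = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        deriv[i] = (values[i + 1] - values[i - 1]) / (2.0 * step)
    return deriv

step = 10.0
times = [i * step for i in range(1001)]            # 0 .. 10000 seconds
velocity = [burst_velocity(t) for t in times]      # plays the role of s'(t)
acceleration = finite_difference(velocity, step)   # plays the role of s''(t)

# Time at which each curve reaches its maximum.
t_peak_velocity = times[max(range(len(times)), key=velocity.__getitem__)]
t_peak_acceleration = times[max(range(len(times)), key=acceleration.__getitem__)]
lead_time = t_peak_velocity - t_peak_acceleration  # positive: acceleration peaks first
```

For this particular curve the acceleration peaks roughly one `width` before the velocity does, which is the head start a detector would gain by watching acceleration.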

slide-34
SLIDE 34

Acceleration as an Indicator

Acceleration: a very good early indicator of a burst.

  1. Is there any burst at all?
      The acceleration of the whole tweet stream.
  2. Is any word bursting?
      The acceleration of each word in the tweet stream.
  3. Is any topic bursting?
      The acceleration of each pair of words in the tweet stream.
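The three quantities can be tracked with the same machinery, keyed by the empty tuple (whole stream), a single word, or a word pair. The sketch below is a minimal illustration, not the authors' implementation: it assumes acceleration is estimated from the gap between two exponentially decayed rates with different time constants, and the class names, the time constants and the example words are all hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

class DecayedRate:
    """Exponentially decayed event rate with time constant `tau` (seconds).

    Timestamps passed to add/read are assumed to be non-decreasing.
    """
    def __init__(self, tau):
        self.tau = tau
        self.value = 0.0
        self.last_time = 0.0

    def _decay_to(self, t):
        self.value *= math.exp(-(t - self.last_time) / self.tau)
        self.last_time = t

    def add(self, t, weight=1.0):
        self._decay_to(t)
        self.value += weight / self.tau

    def read(self, t):
        self._decay_to(t)
        return self.value

class AccelerationTracker:
    """Keys: () for the whole stream, (w,) for a word, (w1, w2) for a sorted word pair."""
    def __init__(self, tau_short=60.0, tau_long=600.0):
        self.tau_short, self.tau_long = tau_short, tau_long
        self.short = defaultdict(lambda: DecayedRate(tau_short))
        self.long = defaultdict(lambda: DecayedRate(tau_long))

    def observe(self, t, words):
        vocab = sorted(set(words))
        keys = [()] + [(w,) for w in vocab] + list(combinations(vocab, 2))
        for key in keys:
            self.short[key].add(t)
            self.long[key].add(t)

    def acceleration(self, t, key=()):
        # The long-window rate lags the short-window rate when volume is rising,
        # so their gap divided by the difference in lag approximates dv/dt.
        gap = self.short[key].read(t) - self.long[key].read(t)
        return gap / (self.tau_long - self.tau_short)

# Hypothetical stream: steady background chatter plus a topic that starts late.
events = [(float(t), ["coffee", "morning"]) for t in range(0, 3001, 10)]
events += [(float(t), ["quake", "tremor"]) for t in range(2701, 3000, 2)]
events.sort(key=lambda event: event[0])

tracker = AccelerationTracker()
for t, words in events:
    tracker.observe(t, words)

now = 3000.0
stream_acc = tracker.acceleration(now)                      # question 1
quake_acc = tracker.acceleration(now, ("quake",))           # question 2
coffee_acc = tracker.acceleration(now, ("coffee",))
pair_acc = tracker.acceleration(now, ("quake", "tremor"))   # question 3
```

Keeping one counter per word pair is what makes the exact version expensive, since the number of pairs grows quadratically with the vocabulary.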

slide-39
SLIDE 39

Assumptions

  • Each topic k is represented as a distribution over words, p_k.
  • The tweet stream is modelled as a mixture of multiple latent topic streams. The stream of topic k has velocity v_k(t) and acceleration a_k(t).
  • Each tweet is related to only one topic.

The final goal is to discover the unknown p_k and a_k(t) from a snapshot of the tweet stream.
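These assumptions describe a simple generative picture, which the following sketch simulates. Everything concrete in it is made up for illustration: the two word distributions, the velocities in `rates`, the fixed tweet length and the function name `sample_tweet`. It also assumes that the chance of the next tweet belonging to topic k is proportional to that topic's current velocity.

```python
import random

def sample_tweet(topics, rates, length, rng):
    """Draw one tweet: pick a single topic, then draw every word from that topic."""
    k = rng.choices(range(len(topics)), weights=rates, k=1)[0]
    vocab = list(topics[k])
    probs = [topics[k][w] for w in vocab]
    return k, rng.choices(vocab, weights=probs, k=length)

# Hypothetical topics p_k (word -> probability) and velocities v_k(t) in tweets/second.
topics = [
    {"iphone": 0.5, "wwdc": 0.3, "retina": 0.2},
    {"goal": 0.6, "match": 0.4},
]
rates = [3.0, 1.0]

rng = random.Random(0)
tweets = [sample_tweet(topics, rates, 4, rng) for _ in range(2000)]

# Fraction of tweets generated by topic 0; it should sit near 3.0 / (3.0 + 1.0).
share_topic0 = sum(1 for k, _ in tweets if k == 0) / len(tweets)
```

Because every tweet draws all of its words from one topic, words of the same topic co-occur far more often than words of different topics, which is why word-pair statistics carry the topic signal.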

slide-40
SLIDE 40

Sketch as Snapshot


slide-44
SLIDE 44

Properties

  • Minimise the difference between observation and expectation.
  • Topics with small accelerations will be filtered out.

slide-46
SLIDE 46

Real-time Framework

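[Figure: real-time framework diagram. Labels: tweet stream D(t) along a time axis; current tweet d at time t; word vector; sketch holding S''(t), X''(t) (dimension N) and Y''(t) (N × N); monitor (plot of S''(t) over time); estimator; reporter; steps numbered (1) to (5).]

N is very large.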

slide-47
SLIDE 47

Dimension Reduction

[Figure: the sketch after dimension reduction. Labels: S''(t), X''(t) and Y''(t), with dimensions H and B in place of N.]

From O(N^2) to O(H·B^2), where B << N and H << N.

  • G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
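The size reduction can be illustrated with a count-min-style layout: H hash functions each map a word to one of B buckets, and pair statistics are kept per bucket pair rather than per word pair. The sketch below is an assumption-laden illustration, not the authors' code: the class `HashedPairSketch`, the salted BLAKE2 hashing, and the use of plain counts (where the real sketch would hold accelerations) are all choices made here.

```python
import hashlib
from itertools import combinations

class HashedPairSketch:
    """H tables of B x B cells; each table uses its own hash from words to buckets."""
    def __init__(self, num_hashes, num_buckets):
        self.H = num_hashes
        self.B = num_buckets
        self.tables = [[[0.0] * num_buckets for _ in range(num_buckets)]
                       for _ in range(num_hashes)]

    def bucket(self, h, word):
        # Salting with the table index gives H different hash functions.
        digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8,
                                 salt=h.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.B

    def add_tweet(self, words, weight=1.0):
        for w1, w2 in combinations(sorted(set(words)), 2):
            for h in range(self.H):
                i, j = self.bucket(h, w1), self.bucket(h, w2)
                self.tables[h][i][j] += weight
                if i != j:
                    self.tables[h][j][i] += weight  # keep each table symmetric

    def estimate(self, w1, w2):
        # Collisions can only inflate a cell, so the minimum over tables
        # never undercounts the true pair weight.
        return min(self.tables[h][self.bucket(h, w1)][self.bucket(h, w2)]
                   for h in range(self.H))

sketch = HashedPairSketch(num_hashes=3, num_buckets=16)
sketch.add_tweet(["iphone", "4", "retina", "display"])
sketch.add_tweet(["iphone", "retina", "#wwdc"])
sketch.add_tweet(["farmville", "iphone", "#wwdc"])

pair_estimate = sketch.estimate("iphone", "retina")   # true co-occurrence count is 2
num_cells = sum(len(row) for table in sketch.tables for row in table)
```

The storage is H·B·B cells regardless of vocabulary size, at the cost of collisions between words that share a bucket; using several hash functions and taking the minimum limits that overestimate.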

slide-50
SLIDE 50

Efficiency Evaluation

Dataset: Singapore-based Twitter data containing over 30 million tweets. We use these tweets to simulate a live tweet stream.

  • Throughput
  • Inference time

slide-51
SLIDE 51

Effectiveness Evaluation

  • Compare with Twevent.
  • Use the same dataset, which contains over 4 million tweets.
  • List all the events detected by both algorithms between June 7, 2010 and June 12, 2010, a period in which several big events happened.

slide-53
SLIDE 53

Effectiveness Evaluation

Event: Steve Jobs released iPhone 4 during WWDC2010

Twevent keywords for the event: steve jobs, iMovie, wwdc, iphone, wifi

TopicSketch keywords, by sub-event:

  • Farmville client for iPhone 4 was demonstrated: #wwdc, iphone, farmville
  • Retina display of iPhone 4 was introduced: iphone, 4, #wwdc, display, retina
  • iMovie for iPhone 4 was demonstrated: iphone, 4, imovie, #wwdc
  • New iPhone 4 was available in Singapore in July: iphone, 4, singapore, july

slide-57
SLIDE 57

Effectiveness Evaluation

[Figure: two time-series plots covering 16:00 to 19:00. Upper plot (y-axis ticks 5 to 20): farmville, display, imovie, july. Lower plot (y-axis ticks 5 to 10): wwdc, #wwdc. Annotations: Sub-event, Event, Detection.]

slide-58
SLIDE 58

Conclusion

  • We proposed TopicSketch, a framework for real-time detection of bursty topics from Twitter.
  • We developed the concept of a “sketch”, which provides a “snapshot” of the current tweet stream. The sketch can be updated efficiently, and bursty topics can be found from it efficiently.
  • TopicSketch provides temporally ordered sub-events to describe an event, which is more informative than traditional methods.

slide-59
SLIDE 59

Thanks
