Data Stream Processing, Part I: Motivation, Data Streams, Reservoir Sampling

SLIDE 1

Data Stream Processing

Part I

Motivation Data Streams Reservoir Sampling

SLIDE 2

Homework 1 is due this Friday the 20th of October

SLIDE 3

Data Processing so far ...

Input Document Output Document

SLIDE 4

Sensor Data Example

Input Document (figure axes: time, ºC)

One 4-byte real per hour: 96 bytes per day

SLIDE 5

Sensor Data Example

Input Document (figure axis: time)

One 4-byte real every 100 ms: 3.5 MB per day

SLIDE 6

Sensor Data Example

Input Document (figure axis: time)

One million 4-byte reals every 100 ms: 3.5 TB per day
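
The three sensor scenarios can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `bytes_per_day` and its parameters are assumptions made for illustration and do not come from the slides.

```python
# Back-of-the-envelope daily data volume for a sensor feed.
# bytes_per_day is a hypothetical helper, not part of the original slides.
MS_PER_DAY = 24 * 60 * 60 * 1000


def bytes_per_day(values_per_reading: int, interval_ms: int, bytes_per_value: int = 4) -> int:
    """Bytes produced per day when `values_per_reading` values arrive every `interval_ms`."""
    readings_per_day = MS_PER_DAY // interval_ms
    return readings_per_day * values_per_reading * bytes_per_value


hourly = bytes_per_day(1, 3_600_000)     # one 4-byte real per hour
fast = bytes_per_day(1, 100)             # one 4-byte real every 100 ms
massive = bytes_per_day(1_000_000, 100)  # one million 4-byte reals every 100 ms

print(f"{hourly} bytes, {fast / 1e6:.2f} MB, {massive / 1e12:.2f} TB")
```

Integer milliseconds are used for the interval so that the reading count stays exact.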

SLIDE 7

Sensor Data Example

Stream of large, unbounded data:
too large for memory
too high latency for disk

We need real-time processing!

SLIDE 8

Sensor Data Example

Input Document (figure axis: time)

Process data stream directly

SLIDE 9

Data Streams

SLIDE 11

What is a Data Stream?

Definition (Golab and Ozsu, 2003): A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.

continuous and sequential input
typically unpredictable input rate
can be large amounts of data
not error-free

SLIDE 12

Data Stream Applications

Online, real-time processing
Event detection and reaction
Aggregation
Approximation

SLIDE 16

Data Stream Example

Stock monitoring
Website traffic monitoring
Network management
Highway traffic

SLIDE 19

Data Stream Characteristics

All items have the same structure, for example a tuple or object: (sender, recipient, text body)

Timestamps: explicit vs. implicit, physical vs. logical
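
One way to picture these characteristics in code is a fixed record type plus two kinds of timestamp. This is only an illustrative sketch. The `sent_at` field, the function `number_on_arrival`, and the sample values are assumptions, and the mapping of "explicit/physical" to a source-supplied clock time and "implicit/logical" to an arrival sequence number is one common reading rather than something the slides spell out.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Message(NamedTuple):
    """One stream item; every item in the stream shares this structure."""
    sender: str
    recipient: str
    text_body: str
    sent_at: float  # assumed field: explicit, physical timestamp set by the source


def number_on_arrival(items: Iterable[Message]) -> Iterator[Tuple[int, Message]]:
    """Attach an implicit, logical timestamp: the arrival sequence number."""
    return enumerate(items)


# Hypothetical sample items, in arrival order.
stream = [
    Message("alice", "bob", "hi", sent_at=1000.0),
    Message("bob", "alice", "hello", sent_at=1000.5),
]
numbered = list(number_on_arrival(stream))
```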

SLIDE 20

Database Management vs. Data Stream Management

SLIDE 21

DBMS vs. DSMS

Feature           DBMS                 DSMS
Model             persistent relation  transient relation
Relation          tuple set/bag        tuple sequence
Data update       modifications        appends
Query             transient            persistent
Query answer      exact                approximate
Query evaluation  arbitrary            one pass
Query plan        fixed                adaptive

SLIDE 22

DSMS Architecture

SLIDE 25

Data Stream Mining

event detection and reaction
counting frequency of specific items
pattern detection
aggregation
approximation
sampling

SLIDE 26

Reservoir Sampling

SLIDE 27

Problem: Sampling

Lines from a large text file
Stream: Sample search engine queries, updated live

SLIDE 30

The Simple Way

1. Scan the text file, counting lines
2. Generate random line numbers in [0, |lines|)
3. Sort the line numbers
4. Scan the text file, outputting selected lines

Cost: two scans
Impossible / impractical for a stream
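
The four steps can be sketched as follows. The function name `two_pass_sample` and its parameters are hypothetical, and the sketch assumes the random line numbers are drawn without repetition, which the slide does not specify.

```python
import random
from typing import List, Optional


def two_pass_sample(path: str, k: int, rng: Optional[random.Random] = None) -> List[str]:
    """Pick k lines from a text file using two full scans (hypothetical helper)."""
    rng = rng or random.Random()

    # Step 1: scan the file, counting lines.
    with open(path) as handle:
        total = sum(1 for _ in handle)

    # Steps 2 and 3: draw distinct line numbers in [0, total) and sort them.
    wanted = sorted(rng.sample(range(total), min(k, total)))

    # Step 4: scan again, collecting the selected lines in file order.
    selected: List[str] = []
    position = 0
    with open(path) as handle:
        for number, line in enumerate(handle):
            if position == len(wanted):
                break  # all requested lines found
            if number == wanted[position]:
                selected.append(line.rstrip("\n"))
                position += 1
    return selected
```

Both scans need the whole input to be available again, which is exactly what a stream cannot offer.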

SLIDE 33

The Simple Way for a Stream

Problem: Sample top 1000 queries

1. Assign each query a random number
2. Keep the queries with the 1000 highest random numbers
3. Discard the rest

Additional storage required for random numbers.
So far not reservoir sampling!
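
A possible single-pass version of this idea keeps the current winners in a min-heap, so the smallest surviving random number is always at the front. This is a sketch under assumptions: the name `random_key_sample` is hypothetical, and the heap is one choice of data structure that the slides do not prescribe.

```python
import heapq
import random
from typing import Iterable, List, Optional, Tuple


def random_key_sample(stream: Iterable[str], k: int = 1000,
                      rng: Optional[random.Random] = None) -> List[str]:
    """Keep the k items whose random keys are highest (hypothetical helper)."""
    if k <= 0:
        return []
    rng = rng or random.Random()
    # Min-heap of (random key, arrival index, item); the index breaks ties.
    heap: List[Tuple[float, int, str]] = []
    for index, item in enumerate(stream):
        entry = (rng.random(), index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new key beats the smallest kept key, so that one is discarded.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]
```

The random key stored next to every kept item is the additional storage the slide mentions.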

SLIDE 41

Sample One Line

Probability of keeping a line and dropping all others?

keep 1st line: 1
keep 2nd line: 1/2
keep 3rd line: 1/2
keep nth line: 1/2

SLIDE 44

Sample One Line

Flip a coin at each line. If it’s heads, record the line (and forget the others).

    #!/usr/bin/env python
    import sys
    import random

    # Start with the first line, then let every later line win a fair coin flip.
    reservoir = sys.stdin.readline().strip()
    for line in sys.stdin:
        if random.randint(0, 1) == 0:  # heads
            reservoir = line.strip()
    print(reservoir)

This is biased. The last line has probability 1/2.

It should be the same probability for each line!
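
To see how strong the bias is, the exact probability of each line ending up as the output can be computed directly. This is a small sketch; `coin_flip_probabilities` is a hypothetical helper and the choice of n = 4 is arbitrary.

```python
from fractions import Fraction
from typing import List


def coin_flip_probabilities(n: int) -> List[Fraction]:
    """Exact chance that each of n lines is the one printed by the coin-flip script."""
    half = Fraction(1, 2)
    probabilities = []
    for i in range(1, n + 1):
        later_lines = n - i
        survive = half ** later_lines             # every later flip must be tails
        chosen = Fraction(1) if i == 1 else half  # line 1 is always recorded first
        probabilities.append(chosen * survive)
    return probabilities


print(coin_flip_probabilities(4))
```

For four lines the probabilities are 1/8, 1/8, 1/4 and 1/2, so later lines are strongly favoured.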

SLIDE 49

Uniformly Sample One Line

keep 1st line: 1
keep 2nd line: 1/2
keep 3rd line: 1/3
keep nth line: 1/n

Probability of each line being the sample after 1, 2, 3, and n lines:

1/1
1/2 1/2
1/3 1/3 1/3
1/n 1/n 1/n 1/n

SLIDE 50

Uniformly Sample One Line

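    #!/usr/bin/env python
    import sys
    import random

    reservoir = None  # stays None if the input is empty
    line_number = 0   # number of lines read before the current one
    for line in sys.stdin:
        # randint is inclusive, so the nth line wins with probability 1/n
        if random.randint(0, line_number) == 0:
            reservoir = line.strip()
        line_number += 1
    if reservoir is not None:
        print(reservoir)

Line n overwrites the reservoir with probability 1/n ⟹ uniform sampling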
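
The same logic can be wrapped in a function and checked empirically. This is a sketch rather than a reference implementation: the name `sample_one`, the five-item test stream, the trial count, and the fixed seed are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from typing import Iterable, Optional


def sample_one(stream: Iterable[str], rng: Optional[random.Random] = None) -> Optional[str]:
    """Return one uniformly chosen item from a stream, or None if it is empty."""
    rng = rng or random.Random()
    reservoir = None
    for line_number, line in enumerate(stream):
        # randint is inclusive, so this succeeds with probability 1 / (line_number + 1).
        if rng.randint(0, line_number) == 0:
            reservoir = line
    return reservoir


TRIALS = 50_000
rng = random.Random(42)  # fixed seed only to make the run repeatable
counts = Counter(sample_one("abcde", rng) for _ in range(TRIALS))
for item in sorted(counts):
    print(item, counts[item] / TRIALS)  # each share should be close to 1/5
```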

SLIDE 51

Proof Sketch: Induction

Base: One line is sampled with probability 1.

Inductive step: Assume each of the first n lines was sampled with probability $1/n$.

When the (n + 1)th line is added, the reservoir is kept with probability $n/(n+1)$. Thus the first n lines each have probability

$$\frac{1}{n} \cdot \frac{n}{n+1} = \frac{1}{n+1}$$

And the (n + 1)th line also has probability $1/(n+1)$ by construction.
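
The induction step can also be replayed with exact fractions: each new line scales the probabilities of all earlier lines by the chance that the reservoir is kept. This is a minimal sketch, and `exact_distribution` is a hypothetical helper.

```python
from fractions import Fraction
from typing import List


def exact_distribution(n: int) -> List[Fraction]:
    """Exact probability that each of the first n lines sits in a size-1 reservoir."""
    probabilities: List[Fraction] = []
    for count in range(1, n + 1):
        replace = Fraction(1, count)  # the new line overwrites the reservoir
        keep = 1 - replace            # the reservoir is kept
        probabilities = [p * keep for p in probabilities]
        probabilities.append(replace)
    return probabilities


print(exact_distribution(5))
```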

SLIDE 55

Sample Multiple Lines

Reservoir size r = 1: 1/5 1/5 1/5 1/5 1/5
Reservoir size r = 2: 2/5 2/5 2/5 2/5 2/5

With reservoir size r and sample count n (the number of lines seen so far), substitute an entry with probability r/n

SLIDE 56

Sample Multiple Lines

Without Replacement

First few lines: Fill the reservoir
Afterwards: Substitute an entry with probability |samples| / |lines|
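
A minimal sketch of this scheme follows, in a formulation often called Algorithm R. The function name `reservoir_sample` and its parameters are assumptions. Drawing one random slot in [0, seen) and replacing only when it falls inside the reservoir is one way to get the |samples| / |lines| probability and a uniformly chosen victim in a single step.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], r: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Uniformly sample up to r items without replacement in one pass."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < r:
            reservoir.append(item)      # first r items fill the reservoir
        else:
            slot = rng.randrange(seen)  # uniform in [0, seen)
            if slot < r:                # happens with probability r / seen
                reservoir[slot] = item
    return reservoir


print(reservoir_sample(range(1_000_000), 5))
```

Memory use depends only on r, not on the length of the stream.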

SLIDE 57

Summary

Efficiently sample streaming data
Small memory
