An Introduction to Distributed Data Streaming
Elements and Systems
Paris Carbone<parisc@kth.se> PhD Candidate KTH Royal Institute of Technology
how to avoid this?
Standing Query (Q)
most recent views of the stream ~ windows
[Figure: operator f over streams S1, S2, So, S’1, S’2]
[Figure: sources (stream1, stream2) feed the standing query Q; sinks receive approximations, predictions, alerts, …]
Discussion: Why do we need windows?
[Figure: timeline in seconds (20 to 100) showing Average #1, Average #2 and Average #3 computed by f over window W: 1min, 20sec]
[Figure: three timelines in seconds (20 to 120), one per window type, each showing the averages (Average #1, #2, …) it produces]
Sliding: range > slide
Tumbling: range = slide
Jumping: range < slide
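The three window types differ only in how range relates to slide. A minimal sketch, assuming time-based windows that start at multiples of the slide (from time 0) and cover [start, start + range). The names `windows` and `window_averages` are illustrative and do not come from the slides.

```python
from statistics import mean


def windows(events, range_s, slide_s):
    """Group (timestamp, value) events into time windows.

    Assumption: a window starts every `slide_s` seconds, beginning at 0,
    and covers the half-open interval [start, start + range_s).
    Returns {window_start: [values]} ordered by window start.
    """
    out = {}
    for ts, value in events:
        # Latest window start at or before this timestamp.
        start = (ts // slide_s) * slide_s
        # Walk back over every earlier window that still covers ts.
        while start >= 0 and start + range_s > ts:
            out.setdefault(start, []).append(value)
            start -= slide_s
    return dict(sorted(out.items()))


def window_averages(events, range_s, slide_s):
    """Average of the values in each window."""
    return {start: mean(vals)
            for start, vals in windows(events, range_s, slide_s).items()}
```

With range = 60 and slide = 20, a reading at second 50 lands in three overlapping windows (sliding). With range = slide every reading lands in exactly one window (tumbling). With range < slide some readings fall between windows and are skipped (jumping).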
We cannot store every event seen indefinitely
Synopsis s: a summary of everything seen so far
[Figure: operator f updating its synopsis s between times t and t’]
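A synopsis can be sketched as a small, fixed-size state that is updated once per event. The class below is a hypothetical example (count, running mean and maximum), chosen only because each of these can be maintained without storing past events.

```python
from dataclasses import dataclass


@dataclass
class Synopsis:
    """Constant-space summary of everything seen so far (illustrative)."""
    count: int = 0
    mean: float = 0.0
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        # Incremental mean: no need to keep the events themselves.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        self.maximum = max(self.maximum, value)
```

The state stays the same size no matter how many events arrive, which is the point of keeping a synopsis instead of the stream.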
What about window synopses?
whole stream
records with a 5% error
properties such as shortest paths
training and classification
[Figure: operator f with state s running as parallel instances]
How do we partition the input streams?
… parallel instance. Typical partitioners are:
[Figure: partitioners P route records to parallel instances of f, each with state s, by color]
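Partitioning by a key such as color can be sketched as follows. This is a toy model, not the API of any particular system: each partitioner is a hypothetical function that returns the indices of the parallel instances that should receive a record.

```python
import random
import zlib


def key_partition(record, key, parallelism):
    """Key-based: records with the same key value go to the same instance."""
    # crc32 is used as a stable hash (Python's hash() of str varies per run).
    digest = zlib.crc32(str(record[key]).encode("utf-8"))
    return [digest % parallelism]


def shuffle_partition(record, parallelism, rng=random):
    """Shuffle: pick one instance at random to spread the load."""
    return [rng.randrange(parallelism)]


def broadcast_partition(record, parallelism):
    """Broadcast: every instance receives a copy of the record."""
    return list(range(parallelism))
```

Key-based partitioning is what makes per-key state possible: all records of one color meet in the same instance, so that instance can keep the state for that color locally.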
Fire Detection Pipeline
[Figure: {area,temp} readings (trigger periodically) and {area,smoke} detections (trigger) feed an unknown pipeline "?" that outputs {loc,alert!}]
Src: Sensor Data Sources
  Periodic Temperature Updates {area,temp}
  Smoke Detections {area}
A (state s): Rolling Arithmetic Mean of Temperatures, {area,temp} → {area,avgTemp}
F (state s): State Machine-based Fire Alarm, {area,avgTemp} → {alarm}
trivial… What is the state and its transitions?
high temperature within the same area
[Figure: Src → P (key:area) → parallel instances of A with state s, consuming {area,temp} over a window w]
w = ?
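The slide leaves the window w open. One possible choice, shown here purely as an assumption, is a count-based window holding the last n readings per area. The class name `RollingMean` and the parameter `window_size` are hypothetical.

```python
from collections import defaultdict, deque


class RollingMean:
    """Operator A: rolling arithmetic mean of temperatures, keyed by area."""

    def __init__(self, window_size):
        # Assumption: w = the last `window_size` readings of each area.
        self.windows = defaultdict(lambda: deque(maxlen=window_size))

    def on_event(self, area, temp):
        window = self.windows[area]
        window.append(temp)  # the deque drops the oldest reading when full
        return area, sum(window) / len(window)
```

Because the input is partitioned with key:area, each parallel instance of A only ever sees its own areas, so the per-area windows can live inside the instance.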
[Figure: operator F with state s]
Events: T : avgTemp>40, T : avgTemp<40, S : Smoke
Example input: …TTTSTTSTTTT….
States: OK, HOT, SMOKE, FIRE
[Figure: state diagram over these states, with transitions labeled T T T S S T T]
synopsis = 1 state
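The state machine can be sketched as a transition table. The states OK, HOT, SMOKE and FIRE and the three event types come from the slide; the edges of the diagram did not survive, so the transitions below are an assumption, as are the event names `T_HIGH` and `T_LOW` used to tell the two temperature events apart.

```python
# Assumed transitions; any (state, event) pair not listed keeps the state.
TRANSITIONS = {
    ("OK", "T_HIGH"): "HOT",
    ("OK", "S"): "SMOKE",
    ("HOT", "T_LOW"): "OK",
    ("HOT", "S"): "FIRE",
    ("SMOKE", "T_HIGH"): "FIRE",
    ("SMOKE", "T_LOW"): "OK",
}


def classify(avg_temp):
    """Map an average temperature to an event (40 itself counts as low)."""
    return "T_HIGH" if avg_temp > 40 else "T_LOW"


class FireAlarm:
    """Operator F: the synopsis is a single state per area."""

    def __init__(self):
        self.state = {}  # area -> current state

    def on_event(self, area, event):
        current = self.state.get(area, "OK")
        nxt = TRANSITIONS.get((current, event), current)
        self.state[area] = nxt
        # Emit an alert only on the transition into FIRE.
        if nxt == "FIRE" and current != "FIRE":
            return (area, "alert")
        return None
```

Whatever the exact edges are, the state that F must remember per area is one of four values, which is why the synopsis is a single state.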
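[Figure: full pipeline. Two Src operators emit {area,temp} and {area,smoke}. Partitioners P (key:area) route temperatures to windowed (w) instances of A (state s), which emit {area,avg_temp}. A partitioner P (key:area) routes {area,avg_temp} and {area,smoke} to instances of F (state s), which emit {area, alert}.]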
Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
Open Source: Flink, Storm, Samza, Spark
Timeline of concepts and systems:
’88 Active DataBases
’88 HiPac
’95 Materialised Views
’00 Eddies
’01 Complex Event Processing
’02 Aurora
’03 TelegraphCQ
’03 STREAM
’05 Borealis
’05 Decentralised Stream Queries
’05 High Availability
’12 Policy-Based Windowing
’12 Twitter Storm
’12 IBM System S
’13 Spark Streaming
’13 Parallel Recovery
’13 Google Millwheel
’13 Discretized Streams
’14 Apache Flink
’15 User-Defined Windows
Compositional
… for composing custom …
… as windowing is often missing
Declarative
… functions on abstract data stream types
… as windowing is supported
DStream, DataStream, PCollection…
execution graph / topology
and data analysts
(Bolts)
[Figure: topology Spout → Bolt → Bolt]
Spouts are the topology sources. They listen to data feeds.
Bolts represent all intermediate computation vertices of the topology. They do arbitrary data manipulation.
Each operator can emit/subscribe to Streams (computation results).
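The spout and bolt roles can be imitated with plain Python generators. This is a toy model of the compositional style, not the Storm API: the function names and the doubling and filtering logic are made up for illustration.

```python
def number_spout(feed):
    """Spout: the topology source; turns a data feed into a stream."""
    for item in feed:
        yield item


def double_bolt(stream):
    """Bolt: subscribes to a stream and emits a transformed stream."""
    for n in stream:
        yield n * 2


def multiples_of_three_bolt(stream):
    """Bolt: arbitrary data manipulation, here a filter."""
    for n in stream:
        if n % 3 == 0:
            yield n


def run_topology(feed):
    """Wire Spout -> Bolt -> Bolt and collect the final stream."""
    return list(multiples_of_three_bolt(double_bolt(number_spout(feed))))
```

The user composes the graph by hand, vertex by vertex, which is what distinguishes this style from a declarative API.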
[Figure: data flow numbers → new_numbers → toFile]
Flink stack: Streaming Program, Flink Client, Flink Job Graph Builder/Optimiser, Flink Runtime
1) Real Streaming (Distributed Data Flow)
Long-lived task execution
State is kept inside tasks
2) Batched Execution (Hadoop, Spark) (Spark Streaming)
partitioned in time windows
policies
range slide
Source: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
Partitioning primitives per system:
Flink: forward(), shuffle(), broadcast(), keyBy(), partitionCustom()
Storm: shuffleGrouping(), allGrouping(), fieldsGrouping(), customGrouping()
Spark Streaming: repartition(num), reduceByKey(), updateStateByKey()
Spectrum: no fine-grained control ↔ full control
implementing a rolling max per key
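The slide's code did not survive extraction. As a stand-in, here is a minimal system-independent sketch of a rolling max per key, assuming records arrive one at a time as (key, value) pairs and the state is a dictionary held by a long-lived task.

```python
def rolling_max(stream):
    """Emit, for every record, the largest value seen so far for its key."""
    state = {}  # key -> current maximum (the operator's state)
    for key, value in stream:
        if key not in state or value > state[key]:
            state[key] = value
        yield key, state[key]
```

For example, `list(rolling_max([("a", 1), ("a", 3), ("a", 2)]))` yields `("a", 1)`, `("a", 3)`, `("a", 3)`.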
(Spark Streaming)
put new states in output RDD
dstream.updateStateByKey(…)
In S’
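In the batched model, state is updated once per micro-batch rather than once per record. The sketch below imitates that pattern in plain Python without using Spark. The helper `update_state_by_key` is hypothetical; only the shape of the update function (new values for a key plus its previous state, returning the new state) follows the convention of `updateStateByKey`.

```python
def update_max(new_values, old_state):
    """Update function: rolling max for one key."""
    candidates = list(new_values)
    if old_state is not None:
        candidates.append(old_state)
    return max(candidates) if candidates else None


def update_state_by_key(batch, states, update_func):
    """Apply update_func to every key of one micro-batch.

    `batch` is a list of (key, value) pairs, `states` maps key -> state.
    Returns a new mapping of states; the input mapping is not modified.
    """
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_states = {}
    # Keys that already have state are visited even without new values.
    for key in set(grouped) | set(states):
        result = update_func(grouped.get(key, []), states.get(key))
        if result is not None:  # returning None drops the key
            new_states[key] = result
    return new_states
```

Each call produces a fresh set of states from the previous set and the new batch, which mirrors the idea of putting new states in an output dataset instead of mutating state in place.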
[Figure: standing query Q; add more sensors]
[Figure: standing query Q after a failure; recovered!, but lost smoke events]
Main Features
System  Guarantees     Technique
Storm   at least once  event dependency tracking
Spark   exactly once   source upstream backup
Flink   exactly once   periodic snapshots
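The idea behind periodic snapshots can be sketched with a single task. This is a deliberately simplified model and not Flink's distributed snapshot algorithm: it assumes a replayable source addressed by offset, one task, and it only covers the effect on state (not on already emitted output). The class and method names are hypothetical.

```python
import copy


class CountingTask:
    """Counts records per key and snapshots (offset, state) periodically."""

    def __init__(self, source, snapshot_every):
        self.source = source              # assumption: replayable by offset
        self.snapshot_every = snapshot_every
        self.offset = 0
        self.counts = {}
        self.snapshot = (0, {})           # last consistent (offset, state)

    def step(self):
        key = self.source[self.offset]
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1
        if self.offset % self.snapshot_every == 0:
            self.snapshot = (self.offset, copy.deepcopy(self.counts))

    def fail_and_recover(self):
        # Roll state AND source position back together, then replay.
        self.offset, saved = self.snapshot
        self.counts = copy.deepcopy(saved)

    def run(self):
        while self.offset < len(self.source):
            self.step()
        return self.counts
```

Because the state and the source offset are restored together, records processed after the last snapshot are replayed onto a state that has not seen them yet, so each record affects the state exactly once. Replaying without restoring the state would count some records twice, which corresponds to at-least-once behaviour.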