Building a Big Data Machine Learning Platform
Cliff Click, CTO 0xdata cliffc@0xdata.com http://0xdata.com http://cliffc.org/blog
H2O is... Pure Java, Open Source: 0xdata.com https://github.com/0xdata/h2o/
A Platform for doing
2
3
– Plus read nearby rows, and/or computes a reduction
4
5
6
7
8
9
10
11
– (append order is preserved)
12
– (append order is preserved)
13
14
Setting carAges in map() makes it an output field. Private per-map call, single-threaded write access. Must be rolled up in the reduce call.
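The output-field pattern above can be sketched in plain Java. This is a minimal illustration, not H2O's actual task API: the class name CarAgeTask, the run() driver, and the reading of carAges as a histogram of ages are all assumptions made for the example.

```java
import java.util.List;

// Hypothetical stand-in for a map/reduce task; not H2O's real API.
class CarAgeTask {
    // Output field: allocated inside map(), so each map call owns a private copy.
    long[] carAges;

    // map(): single-threaded pass over one chunk of rows, filling the private histogram.
    void map(int[] ageChunk, int maxAge) {
        carAges = new long[maxAge + 1];
        for (int age : ageChunk) {
            carAges[age]++;
        }
    }

    // reduce(): roll another task's partial histogram into this one.
    void reduce(CarAgeTask other) {
        for (int i = 0; i < carAges.length; i++) {
            carAges[i] += other.carAges[i];
        }
    }

    // Sequential driver for illustration only; a real platform would run the maps in parallel.
    static long[] run(List<int[]> chunks, int maxAge) {
        CarAgeTask total = null;
        for (int[] chunk : chunks) {
            CarAgeTask partial = new CarAgeTask();
            partial.map(chunk, maxAge);
            if (total == null) {
                total = partial;
            } else {
                total.reduce(partial);
            }
        }
        return total == null ? new long[maxAge + 1] : total.carAges;
    }
}
```

Because every map call writes only to its own array, no locking is needed; all combining happens in reduce().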
15
16
Setting dnbhs in <init> makes it an input field. Shared across all map() calls. Often read-only. This one is written, so it needs a reduce.
17
18
19
Vec
20
Vec >> 2 billion elements
FORTRAN speed
21
JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap (32 GB each)
Vec >> 2 billion elements
22
JVM 4 Heap JVM 1 Heap JVM 2 Heap JVM 3 Heap
Vec Vec Vec Vec Vec
in heaps
concurrent access
any row, any JVM
more on that later
23
JVM 4 Heap JVM 1 Heap JVM 2 Heap JVM 3 Heap
age sex zip ID car
user data
(vs array-of-structs)
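The contrast between the two layouts can be sketched as follows. The column names age, sex, zip, ID and car come from the slide; the field types, the class name PeopleColumns, and the meanAge() helpers are assumptions chosen for illustration.

```java
// Array-of-structs: one heap object per row, with fields scattered across the heap.
class Person {
    int age;
    byte sex;
    int zip;
    long id;
    int car;

    static double meanAge(Person[] people) {
        if (people.length == 0) return Double.NaN;
        long sum = 0;
        for (Person p : people) sum += p.age;   // each step may touch a different object
        return (double) sum / people.length;
    }
}

// Struct-of-arrays: one primitive array per column, so a scan over one column is contiguous.
class PeopleColumns {
    final int[] age;
    final byte[] sex;
    final int[] zip;
    final long[] id;
    final int[] car;

    PeopleColumns(int rows) {
        age = new int[rows];
        sex = new byte[rows];
        zip = new int[rows];
        id = new long[rows];
        car = new int[rows];
    }

    double meanAge() {
        if (age.length == 0) return Double.NaN;
        long sum = 0;
        for (int a : age) sum += a;             // reads only the age column
        return (double) sum / age.length;
    }
}
```

Both meanAge() versions compute the same value; the columnar one reads only the bytes of the column it needs.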
24
JVM 4 Heap JVM 1 Heap JVM 2 Heap JVM 3 Heap
Vec Vec Vec Vec Vec
1e6 elements
clock cycles including compression
Good: more data per cache-miss
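One way compression can pay for itself is by packing values into fewer bytes and decoding on every read. The sketch below is an assumption about how such a scheme might look, not the format H2O actually uses: the class ByteOffsetChunk is hypothetical, and it stores each value as an unsigned byte offset from the chunk minimum.

```java
// Hypothetical compressed chunk: 1 byte per value instead of 8, decoded on each access.
class ByteOffsetChunk {
    private final long base;       // minimum value in the chunk
    private final byte[] offsets;  // value - base, stored as an unsigned byte

    private ByteOffsetChunk(long base, byte[] offsets) {
        this.base = base;
        this.offsets = offsets;
    }

    static ByteOffsetChunk compress(long[] values) {
        if (values.length == 0) return new ByteOffsetChunk(0L, new byte[0]);
        long min = values[0];
        long max = values[0];
        for (long v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        long range = max - min;    // a negative result means the subtraction overflowed
        if (range < 0 || range > 255) {
            throw new IllegalArgumentException("value range does not fit in one byte");
        }
        byte[] offsets = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            offsets[i] = (byte) (values[i] - min);
        }
        return new ByteOffsetChunk(min, offsets);
    }

    // Decompression is one mask and one add per element.
    long at(int row) {
        return base + (offsets[row] & 0xFF);
    }

    int length() {
        return offsets.length;
    }
}
```

With this layout a single cache line holds eight times as many elements as a plain long[] would, at the cost of a mask and an add per read.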
25
JVM 4 Heap JVM 1 Heap JVM 2 Heap JVM 3 Heap
age sex zip ID car
single thread
Mutable Vectors
speed
use JMM rules
class Person { }
26
JVM 4 Heap JVM 1 Heap JVM 2 Heap JVM 3 Heap
Vec Vec Vec Vec Vec
Chunk of rows
control overheads
get fine-grained parallelism
simple single-threaded style
27
JVM 4 Heap JVM 1 Heap JVM 2 Heap JVM 3 Heap
Vec Vec Vec Vec Vec
Chunks in parallel
handles all sync
communication, data management
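The idea of writing simple single-threaded code per chunk while a framework runs the chunks in parallel can be sketched with the JDK's Fork/Join classes. This is a single-JVM illustration only; the class ChunkSum is hypothetical and nothing here is distributed across a cluster.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a column stored as an array of chunks, splitting the chunk range recursively.
class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks;
    private final int lo;
    private final int hi;   // exclusive

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 1) {
            // "map": plain single-threaded loop over one chunk of rows
            double sum = 0;
            for (double d : chunks[lo]) sum += d;
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        ChunkSum right = new ChunkSum(chunks, mid, hi);
        left.fork();                      // run the left half asynchronously
        double rightSum = right.compute();
        return left.join() + rightSum;    // "reduce": combine partial results
    }

    static double sum(double[][] chunks) {
        if (chunks.length == 0) return 0.0;
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }
}
```

The leaf loop contains no synchronization at all; forking, joining, and scheduling are left to the pool.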
28
29
30
*See fine print
31
Spark: Partition[User]
JVM Heap
These structures are limited to 1 JVM heap.
There can be many in the heap, limited only by memory.
H2O: Chunk
*Only data correspondence is shown; a real data copy is required!
32
JVM Heap #1 JVM Heap #2 JVM Heap #3 JVM Heap #4
Frame
RDD
H2O: Chunk Partition Vec
33
34
35
– (may be CPU limited in a few cases)
36
– Logistic Regression, Poisson, Gamma
37
38
39
40
41
– Just the loads & stores, nothing else
– (more tokens if subclassed objects used)
– Small stuff via UDP & big via TCP
– (can pull cable & re-insert & all will recover)
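The "small via UDP, big via TCP" rule can be sketched as a size-based choice. The slides give no cutoff, so the 1400-byte limit below is purely an assumption (roughly one Ethernet frame of payload), and the class TransportChooser is hypothetical.

```java
// Hypothetical transport selection by payload size; the threshold is an assumed value.
class TransportChooser {
    enum Transport { UDP, TCP }

    // Assumption: anything that fits in a single datagram goes by UDP.
    static final int MAX_UDP_PAYLOAD_BYTES = 1400;

    static Transport choose(int payloadBytes) {
        if (payloadBytes < 0) {
            throw new IllegalArgumentException("payload size must be non-negative");
        }
        return payloadBytes <= MAX_UDP_PAYLOAD_BYTES ? Transport.UDP : Transport.TCP;
    }
}
```

Since UDP delivery is not guaranteed, a scheme like this needs acknowledgement and retry on top of it, which fits the recovery claim in the bullet above.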
42
43
44
– (ManagedBlocker.block API is painful)
– Still get thread starvation sometimes
– Painful to write explicit-CPS in Java
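For readers who have not met the API mentioned above, here is a minimal sketch of the JDK's ForkJoinPool.ManagedBlocker wrapped around a blocking queue read. The class QueueTaker and its take() helper are hypothetical names; the interface and ForkJoinPool.managedBlock are standard java.util.concurrent.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ForkJoinPool;

// Tells the pool that this thread may block, so it can add a compensating worker.
class QueueTaker<T> implements ForkJoinPool.ManagedBlocker {
    private final BlockingQueue<T> queue;
    private T item;

    QueueTaker(BlockingQueue<T> queue) {
        this.queue = queue;
    }

    // Called by the pool when blocking is unavoidable.
    @Override
    public boolean block() throws InterruptedException {
        if (item == null) {
            item = queue.take();
        }
        return true;   // no further blocking needed
    }

    // Called first, to try to finish without blocking.
    @Override
    public boolean isReleasable() {
        return item != null || (item = queue.poll()) != null;
    }

    T get() {
        return item;
    }

    static <T> T take(BlockingQueue<T> queue) {
        QueueTaker<T> taker = new QueueTaker<>(queue);
        try {
            ForkJoinPool.managedBlock(taker);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return taker.get();
    }
}
```

Every blocking call site needs a wrapper of this shape, which gives some sense of why the slide calls the API painful.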
45
– (and even pass around cluster distributed)
46
NFS HDFS byte[] extends Iced extends DTask AutoBuffer RPC extends DRemoteTask D/F/J extends MRTask User code?
K/V get/put UDP / TCP
47
48