

SLIDE 1

Distributed Asynchronous Online Learning for Natural Language Processing

Kevin Gimpel, Dipanjan Das, and Noah A. Smith

SLIDE 2

Introduction

Two recent lines of research in speeding up large learning problems:

  • Parallel/distributed computing
  • Online (and mini-batch) learning algorithms: stochastic gradient descent, perceptron, MIRA, stepwise EM

How can we bring together the benefits of parallel computing and online learning?

SLIDE 3

Introduction

We use asynchronous algorithms (Nedic, Bertsekas, and Borkar, 2001; Langford, Smola, and Zinkevich, 2009).

We apply them to structured prediction tasks:

  • Supervised learning
  • Unsupervised learning, with both convex and non-convex objectives

Asynchronous learning speeds convergence and works best with small mini-batches.

SLIDE 4

Problem Setting

Iterative learning:

  • Moderate to large numbers of training examples
  • Expensive inference procedures for each example
  • For concreteness, we start with gradient-based optimization

Single machine with multiple processors:

  • Exploit shared memory for parameters, lexicons, feature caches, etc.
  • Maintain one master copy of model parameters

SLIDE 5

Single-Processor Batch Learning

[Diagram: one processor P, dataset D, parameters θ.]

SLIDE 6

Single-Processor Batch Learning

[Diagram: timeline view; processor P holds dataset D and the current parameters θ.]

SLIDE 7

Single-Processor Batch Learning

[Diagram: calculate the gradient on data D using parameters θ_t.]

SLIDE 8

Single-Processor Batch Learning

[Diagram: calculate the gradient on data D using parameters θ_t, then update using the gradient to obtain θ_{t+1}.]
SLIDE 9

Single-Processor Batch Learning

[Diagram: the cycle repeats over time: calculate the gradient on D using θ_t, update to obtain θ_{t+1}, and so on.]
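To make the diagram concrete, here is a minimal single-processor batch gradient descent loop. This is a sketch under assumptions: the objective, its gradient, and the constant step size are stand-ins, not the paper's actual model.

```python
import numpy as np

def batch_gradient_descent(grad, theta, data, step_size=0.1, iters=100):
    """One processor: compute the gradient over ALL of the data,
    then take a single parameter step; repeat."""
    for t in range(iters):
        g = np.zeros_like(theta)
        for example in data:           # full pass over D per update
            g += grad(theta, example)
        theta = theta - step_size * g  # theta_{t+1} = theta_t - eta * gradient
    return theta
```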
SLIDE 10

Parallel Batch Learning

[Diagram: dataset split as D = D_1 ∪ D_2 ∪ D_3 across processors P_1, P_2, P_3.]

Divide the data into parts, compute the gradient on the parts in parallel; one processor updates the parameters.

SLIDE 11

Parallel Batch Learning

[Diagram: processors P_1, P_2, P_3 compute partial gradients on D_1, D_2, D_3 in parallel.]

Divide the data into parts, compute the gradient on the parts in parallel; one processor updates the parameters.

SLIDE 12

Parallel Batch Learning

[Diagram: the partial gradients are combined, one processor applies the update to θ, and the cycle repeats.]

Divide the data into parts, compute the gradient on the parts in parallel; one processor updates the parameters.
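A minimal sketch of this pattern with a process pool summing per-shard gradients. The names and the pool API are illustrative assumptions; the paper's system used shared memory on a single multi-processor machine rather than this exact mechanism.

```python
import numpy as np
from multiprocessing import Pool

def shard_gradient(args):
    """Worker: partial gradient over one shard of the data."""
    grad, theta, shard = args
    g = np.zeros_like(theta)
    for example in shard:
        g += grad(theta, example)
    return g

def parallel_batch_step(grad, theta, shards, step_size=0.1):
    """Workers compute per-shard gradients in parallel;
    a single master sums them and updates theta once."""
    with Pool(len(shards)) as pool:
        parts = pool.map(shard_gradient, [(grad, theta, s) for s in shards])
    return theta - step_size * sum(parts)
```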

SLIDE 13

Parallel Synchronous Mini-Batch Learning

(Finkel, Kleeman, and Manning, 2008)

[Diagram: each mini-batch B = B_1 ∪ B_2 ∪ B_3 is split across processors P_1, P_2, P_3, with an update to θ after every mini-batch.]

Same architecture, just more frequent updates.
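A sketch of the same parallel step applied per mini-batch rather than per full pass. Assumptions: threads stand in for processors, and the shard-splitting and step size are illustrative. Note the barrier: every worker must finish before the single update is applied.

```python
from concurrent.futures import ThreadPoolExecutor

def sync_minibatch_sgd(grad, theta, minibatches, n_procs=4, step_size=0.1):
    """Split each mini-batch across processors, wait for all partial
    gradients (a synchronization barrier), then update theta once."""
    with ThreadPoolExecutor(n_procs) as pool:
        for batch in minibatches:
            shards = [batch[i::n_procs] for i in range(n_procs)]
            parts = pool.map(
                lambda s: sum(grad(theta, ex) for ex in s), shards)
            theta = theta - step_size * sum(parts)  # all workers idle here
    return theta
```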

SLIDE 14

Parallel Asynchronous Mini-Batch Learning

(Nedic, Bertsekas, and Borkar, 2001)

[Diagram: processors P_1, P_2, P_3 each take their own mini-batch B_i; the parameters θ are shared.]
SLIDE 15

Time

θ

θ

P P P

Gradient:

Parallel Asynchronous Mini-Batch Learning

Nedic, Bertsekas, and Borkar (2001) Mini-batches: Processors:

P

Parameters: θ

, θ , θ , θ

  • B
SLIDE 16

Parallel Asynchronous Mini-Batch Learning

[Diagram: a processor finishes its mini-batch and updates the shared θ without waiting for the others.]
SLIDE 17

Parallel Asynchronous Mini-Batch Learning

[Diagram: another processor grabs the next mini-batch and the freshly updated θ while the rest are still computing.]

SLIDE 18

Parallel Asynchronous Mini-Batch Learning

[Diagram: updates from different processors interleave; some gradients are computed from slightly stale θ.]

SLIDE 19

Parallel Asynchronous Mini-Batch Learning

[Diagram: the interleaving continues; no processor ever waits at a synchronization barrier.]

SLIDE 20

Parallel Asynchronous Mini-Batch Learning

[Diagram: further asynchronous updates to θ accumulate over time.]

SLIDE 21

Parallel Asynchronous Mini-Batch Learning

[Diagram: full timeline of interleaved asynchronous updates across P_1, P_2, P_3.]

  • Gradients computed using stale parameters
  • Increased processor utilization
  • Only idle time caused by lock for updating parameters
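A minimal sketch of this scheme with worker threads and a single lock guarding the parameter update. This is illustrative, not the paper's code: the helper names, step size, and work-queue handling are assumptions; what it preserves is the one master copy of θ, stale snapshot reads, and a lock held only for the update itself.

```python
import threading
import numpy as np

def async_minibatch_sgd(grad, theta, minibatches, n_workers=4, step_size=0.1):
    """Workers repeatedly grab a mini-batch, compute a gradient from a
    (possibly stale) snapshot of theta, then briefly lock to update the
    single master copy. There is no barrier between workers."""
    update_lock = threading.Lock()
    queue_lock = threading.Lock()
    batches = iter(minibatches)

    def worker():
        while True:
            with queue_lock:
                batch = next(batches, None)
            if batch is None:
                return
            snapshot = theta.copy()       # read without waiting: may be stale
            g = sum(grad(snapshot, ex) for ex in batch)
            with update_lock:             # the only idle time
                theta[:] -= step_size * g # in-place, so all workers see it

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return theta
```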

SLIDE 22

Theoretical Results

How does the use of stale parameters affect convergence?

Convergence results exist for convex optimization using stochastic gradient descent:

  • Convergence guaranteed when the maximum delay is bounded (Nedic, Bertsekas, and Borkar, 2001)
  • Convergence rates linear in the maximum delay (Langford, Smola, and Zinkevich, 2009)
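In symbols, these results analyze delayed SGD, where each gradient is computed from parameters that are some number of steps old. The notation below is mine, not copied from the slides:

```latex
% Delayed SGD update: the gradient at step t uses parameters tau(t) steps old.
\theta_{t+1} = \theta_t - \eta_t \, \nabla L\bigl(\theta_{t-\tau(t)}\bigr),
\qquad 0 \le \tau(t) \le \tau_{\max}
```

Convergence is guaranteed for convex L when τ_max is bounded, and the rates degrade only linearly in τ_max.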

SLIDE 23

Experiments

Task                                 Model        Method                       Convex?  |D|   |θ|    m
Named-Entity Recognition             CRF          Stochastic Gradient Descent  Y        15k   1.3M   4
Word Alignment                       IBM Model 1  Stepwise EM                  Y        300k  14.2M  10k
Unsupervised Part-of-Speech Tagging  HMM          Stepwise EM                  N        42k   2M     4

m = mini-batch size

To compare algorithms, we use wall clock time (with a dedicated 4-processor machine).

SLIDE 24

Experiments

Task                      Model  Method                       Convex?  |D|  |θ|   m
Named-Entity Recognition  CRF    Stochastic Gradient Descent  Y        15k  1.3M  4

CoNLL 2003 English data. Label each token with an entity type (person, location, organization, or miscellaneous) or non-entity.

We show convergence in F1 on development data.

SLIDE 25

Asynchronous Updating Speeds Convergence

[Plot: F1 (84–91) vs. wall clock time (hours) for Asynchronous (4 processors), Synchronous (4 processors), and Single-processor.]

All use a mini-batch size of 4.

SLIDE 26

Comparison with Ideal Speed-up

[Plot: F1 vs. wall clock time (hours) for Asynchronous (4 processors) against an ideal speed-up curve.]

SLIDE 27

Why Does Asynchronous Converge Faster?

  • Processors are kept in near-constant use
  • Synchronous SGD leads to idle processors and a need for load balancing

SLIDE 28

[Plots: F1 (85–91) vs. wall clock time (hours). Left: Synchronous with 4 processors, 2 processors, and single-processor. Right: Asynchronous with 4 processors, 2 processors, and single-processor.]

Clearer improvement for asynchronous algorithms when increasing the number of processors.

SLIDE 29

Artificial Delays

[Plot: F1 (85–91) vs. wall clock time (hours) for Asynchronous with no delay, µ = 5, µ = 10, and µ = 20, alongside Single-processor with no delay.]

After completing a mini-batch, there is a 25% chance of delaying. The delay (in seconds) is sampled from a normal distribution with mean µ.

  • Avg. time per mini-batch = 0.62 s
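A sketch of how such a delay could be injected into each worker's loop. The slides specify only the mean of the distribution (the variance was garbled in extraction), so sigma is left as a parameter here:

```python
import random
import time

def maybe_delay(mu, sigma, p=0.25):
    """With probability p, sleep for a duration (in seconds) drawn from a
    normal distribution with mean mu. sigma is an assumption; the slides
    only give the mean. Clamp at zero so we never sleep a negative time."""
    if random.random() < p:
        time.sleep(max(0.0, random.gauss(mu, sigma)))
```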
SLIDE 30

Experiments

Task            Model        Method       Convex?  |D|   |θ|    m
Word Alignment  IBM Model 1  Stepwise EM  Y        300k  14.2M  10k

Given parallel sentences, draw links between words:

    konnten sie es übersetzen ?
    could you translate it ?

We show convergence in log-likelihood (convergence in AER is similar).

SLIDE 31

Stepwise EM

(Sato and Ishii, 2000; Cappe and Moulines, 2009)

  • Similar to stochastic gradient descent in the space of sufficient statistics, with a particular scaling of the update
  • More efficient than incremental EM (Neal and Hinton, 1998)
  • Found to converge much faster than batch EM (Liang and Klein, 2009)
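A minimal sketch of the stepwise EM update on sufficient statistics, in the interpolated form studied by Cappe and Moulines (2009) and Liang and Klein (2009). The stepsize schedule and the helper names (`expected_stats`, `m_step`) are illustrative assumptions, not the paper's code:

```python
def stepwise_em(init_stats, expected_stats, m_step, minibatches, alpha=0.7):
    """Interpolate running sufficient statistics toward the expected
    statistics of each mini-batch:  s <- (1 - eta_k) * s + eta_k * s_hat,
    with a decaying stepsize such as eta_k = (k + 2) ** -alpha."""
    s = init_stats
    theta = m_step(s)
    for k, batch in enumerate(minibatches):
        eta = (k + 2) ** (-alpha)
        s_hat = expected_stats(theta, batch)  # E-step on the mini-batch
        s = (1 - eta) * s + eta * s_hat       # stochastic step in stats space
        theta = m_step(s)                     # normalize counts into parameters
    return theta
```

This is what "SGD in the space of sufficient statistics" means concretely: the stochastic step is taken on the counts s, and the parameters θ are recovered from s by the usual M-step.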

SLIDE 32

Word Alignment Results

[Plot: log-likelihood (−40 to −20) vs. wall clock time (minutes) for:]

  • Asynch. Stepwise EM (4 processors)
  • Synch. Stepwise EM (4 processors)
  • Synch. Stepwise EM (1 processor)
  • Batch EM (1 processor)

For stepwise EM, mini-batch size = 10,000.

SLIDE 33

Word Alignment Results

[Plot: same log-likelihood vs. wall clock time curves as the previous slide.]

Asynchronous is no faster than synchronous!

For stepwise EM, mini-batch size = 10,000.


SLIDE 35

Comparing Mini-Batch Sizes

[Plot: log-likelihood (−32 to −20) vs. wall clock time (minutes) for:]

  • Asynch. (m = 10,000)
  • Synch. (m = 10,000)
  • Asynch. (m = 1,000)
  • Synch. (m = 1,000)
  • Asynch. (m = 100)
  • Synch. (m = 100)
SLIDE 36

Comparing Mini-Batch Sizes

[Plot: same curves as the previous slide.]

Asynchronous is faster when using small mini-batches.

SLIDE 37

Comparing Mini-Batch Sizes

[Plot: same curves as the previous slides, with an annotation marking the error from asynchronous updating.]

SLIDE 38

Word Alignment Results

[Plot: the word alignment log-likelihood curves again, for reference.]

For stepwise EM, mini-batch size = 10,000.

SLIDE 39

Comparison with Ideal Speed-up

[Plot: the word alignment curves plus an ideal speed-up curve for Asynch. Stepwise EM (4 processors).]

For stepwise EM, mini-batch size = 10,000.

SLIDE 40

MapReduce?

We also ran these algorithms on a large MapReduce cluster (M45 from Yahoo!).

Batch EM:

  • Each iteration is one MapReduce job, using 24 mappers and 1 reducer

Asynchronous Stepwise EM:

  • 4 mini-batches processed simultaneously, each run as a MapReduce job
  • Each uses 6 mappers and 1 reducer
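For EM, the map/reduce decomposition is the standard one: mappers run the E-step per example and emit expected counts, a reducer sums them, and the M-step (or the stepwise interpolation) turns the sums into parameters. A schematic sketch, assuming a model-specific `e_step` supplied by the caller; none of these names are from the deck:

```python
def map_expected_counts(e_step, theta, example):
    """Mapper: run the E-step on one training example and emit
    (event, expected_count) pairs for that example."""
    yield from e_step(theta, example)

def reduce_sum(event, counts):
    """Reducer: sum the expected counts for one event across all mappers.
    The M-step then normalizes these sums into new parameters."""
    yield event, sum(counts)
```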

SLIDE 41

MapReduce?

[Plots: log-likelihood (−40 to −20) vs. wall clock time (minutes). Left: Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor). Right: Asynch. Stepwise EM (MapReduce) and Batch EM (MapReduce).]


SLIDE 43

Experiments

Task                                 Model  Method       Convex?  |D|  |θ|  m
Unsupervised Part-of-Speech Tagging  HMM    Stepwise EM  N        42k  2M   4

Bigram HMM with 45 states. We plot convergence in likelihood and many-to-1 accuracy.

SLIDE 44

Part-of-Speech Tagging Results

[Plots: log-likelihood (×10^6, −7.5 to −6) and many-to-1 accuracy (50–65%) vs. wall clock time (hours) for:]

  • Asynch. Stepwise EM (4 processors)
  • Synch. Stepwise EM (4 processors)
  • Synch. Stepwise EM (1 processor)
  • Batch EM (1 processor)

Mini-batch size = 4 for stepwise EM.

SLIDE 45

Comparison with Ideal

[Plots: log-likelihood and many-to-1 accuracy vs. wall clock time (hours) for Asynch. Stepwise EM (4 processors) against an ideal speed-up curve.]

SLIDE 46

Comparison with Ideal

[Plots: same curves as the previous slide.]

Asynchronous better than ideal?

SLIDE 47

Conclusions and Future Work

  • Asynchronous algorithms speed convergence and do not introduce additional error
  • Effective for unsupervised learning and non-convex objectives
  • If your problem works well with small mini-batches, try this!

Future work:

  • Theoretical results for the non-convex case
  • Explore effects of increasing the number of processors
  • New architectures (maintain multiple copies of θ)

SLIDE 48

Thanks!