[PPT] - From Several Hundred Million to Some Billion Words: PowerPoint Presentation

SLIDE 1

From Several Hundred Million to Some Billion Words:
Scaling up a Corpus Indexer and a Search Engine with MapReduce

Jordi Porta
Computational Linguistics Area
Departamento de Tecnología y Sistemas
Centro de Estudios de la Real Academia Española
(porta@rae.es)

SLIDE 2

Outline

"The more observations we gather about language use,
the more accurate description we have about language itself"
Lin & Dyer, 2010: Data-intensive text processing with MapReduce

A Case Study: Elpais.com
Basic Data Structures for Complex Queries
Introduction to MapReduce
The Scale out vs. Scale up Dilemma
Scaling up with Phoenix++
Experiments introducing MapReduce into a CMS:

– Query Engine:
    Word Counting
    Word Cooccurrences
    N-grams

– Indexer:
    File Inversion
        – Text Inversion
        – Structure Inversion

"The only thing better than big data is bigger data"

SLIDE 3

A Case Study: Elpais.com on-line archive

Lists of linguistic elements:
– Concordances
– Counts
– Associations
– Distributions
– …
Sub-corpora
Non-trivial processing:
– Cooccurrence networks / clustering / …
– Word vectors
– …

Base Corpus:

Elpais.com on-line archive:
– 1976–2011
– 2.3 million news items
– Cleaned + PoS tagged
– 23 million structural elements
– 878 million tokens

SLIDE 4

Basic Data Structures for Corpus Indexing Supporting Complex Queries

Textual indices:

– Posting Lists:
    [token="de"]
    [token="la" and msd="Nfs"]
    [token="pelota"] within [section="sports"]
    …

Structural indices:

– Interval Trees:
    P within TEXT
    P containing HI[REND=IT] containing [lemma="de"]
    …
– Red/Black-Trees

[Diagram: a posting-list entry holds Doc id, Hits, Pos1, Pos2, …; an interval-tree entry holds Doc id, First Token, Last Token, … for an element such as <p> … </p>]
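
A minimal sketch of how a posting list and a structural index could together answer a query such as [token="pelota"] within [section="sports"]. The record fields follow the diagram labels (doc id, positions; doc id, first token, last token). The names `Posting`, `Interval` and `within` are hypothetical, and a sorted list with binary search stands in for the interval tree, assuming that the intervals of one document do not overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc_id: int
    positions: tuple  # token offsets inside the document; hits == len(positions)


@dataclass(frozen=True)
class Interval:
    doc_id: int
    first: int  # first token covered by the structural element
    last: int   # last token covered (inclusive)


def within(postings, intervals):
    """Return (doc_id, position) pairs for hits that fall inside an interval."""
    by_doc = {}
    for iv in intervals:
        by_doc.setdefault(iv.doc_id, []).append(iv)
    for ivs in by_doc.values():
        ivs.sort(key=lambda iv: iv.first)

    hits = []
    for p in postings:
        ivs = by_doc.get(p.doc_id, [])
        starts = [iv.first for iv in ivs]
        for pos in p.positions:
            # Rightmost interval that starts at or before this position.
            i = bisect_right(starts, pos) - 1
            if i >= 0 and ivs[i].first <= pos <= ivs[i].last:
                hits.append((p.doc_id, pos))
    return hits


# Hypothetical data: "pelota" occurs in docs 1 and 2; docs 1 and 3 have a sports section.
pelota = [Posting(1, (3, 40)), Posting(2, (7,))]
sports = [Interval(1, 0, 10), Interval(3, 0, 50)]
result = within(pelota, sports)
```

Only the hit at position 3 of document 1 survives, since position 40 lies outside the section and document 2 has no matching section.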

SLIDE 5

How Text is Indexed

[Diagram labels: String, Singletons, Documents (begin, end, offset, …), Positional Indices, Structural Indices, Interval tree, Posting list, Corpus Layers, Hash Table. Sizes shown: 168M+4.7G, 779M, 331M+6.8G, 135M, 3.7G, 54M, 321M+7.5G, 130M, 1.5G (tags)]

SLIDE 6

What is MapReduce?

A widely used parallel computing model
For batch processing of data on terabyte or petabyte scales
On clusters of commodity servers (typically)
Fault tolerance: task monitoring and replication
Abstracts parallelization, synchronization and communication
Hadoop is the most popular scale-out MapReduce implementation

SLIDE 7

Hadoop MapReduce

Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v2)
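
The two signatures can be illustrated with a single-process sketch. It only mimics the dataflow (map, group by key, reduce) and is not Hadoop's API; `map_reduce`, `words_by_length` and `distinct_sorted` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (k1, v1), group the emitted (k2, v2) by k2, then reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def words_by_length(doc_id, text):
    # Map(k1, v1) -> list(k2, v2): key each word by its length.
    return [(len(word), word) for word in text.split()]


def distinct_sorted(length, words):
    # Reduce(k2, list(v2)) -> list(v2): keep each distinct word once.
    return sorted(set(words))


records = [("d1", "de la pelota"), ("d2", "la nube")]
grouped = map_reduce(records, words_by_length, distinct_sorted)
```

Here `grouped` maps each word length to the distinct words of that length across both documents.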

SLIDE 8

Scaling out vs Scaling up Dilemma

Scaling out (horiz.)

– Cluster of commodity servers
– Pros:
    Cheaper
    Fault tolerance
    Easy to update
– Cons:
    Power consumption
    Cooling
    Space
    Difficult to implement
    Networking equipment
    Licensing costs

Scaling up (vert.)

– Powerful server (+proc./+RAM)
– Pros:
    Power consumption
    Cooling
    Space
    Easy to implement
    Networking equip. ~ 0
    Licensing costs
– Cons:
    Expensive
    Fault tolerance
    Difficult to update

SLIDE 9

Scaling out vs Scaling up Dilemma

Appuswamy et al. (2014), "Scale-up vs Scale-out for Hadoop: Time to rethink?" In Proc. 4th Ann. Symp. on Cloud Computing.

Input size in analytics production clusters:

– Microsoft / Yahoo: median (50%) < 14 GB; Facebook: 90% < 100 GB

Conclusion:

– "Scaling up is more cost-effective for many real-world problems"

SLIDE 10

Scaling up with Phoenix++ MapReduce

Justin Talbot, Richard M. Yoo, and Christos Kozyrakis (2011): "Phoenix++: Modular MapReduce for Shared-Memory Systems". Second International Workshop on MapReduce and its Applications. (Code: http://mapreduce.stanford.edu)

Split: divides input data into chunks for map tasks
Combiner: thread-local objects
Map: invokes the Combiner for each key-value pair emitted (not at the end of Map)
Partition/Shuffle: not in Phoenix++
Reduce: aggregates the key-value pairs held in the Combiners (once all Map tasks have finished)
Merge: sorts key-value pairs (optional phase)
Problems: introduces a rehash between the map and reduce phases
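
The phases listed above can be sketched in Python. This is not the Phoenix++ C++ API: `split`, `map_task` and `run` are hypothetical names, a `Counter` per map task stands in for a thread-local combiner, and Python threads only illustrate the dataflow (they do not give real parallel speedup for CPU-bound work).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Split: divide the input into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_task(chunk, map_fn):
    """Map: fold every emitted pair into the task's own combiner right away."""
    combiner = Counter()  # private to this task, so no locking is needed
    for item in chunk:
        for key, value in map_fn(item):
            combiner[key] += value
    return combiner


def run(data, map_fn, n_threads=4, merge=True):
    chunks = split(data, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        combiners = list(pool.map(lambda c: map_task(c, map_fn), chunks))
    # Reduce: starts only after all map tasks have finished.
    reduced = Counter()
    for combiner in combiners:
        reduced.update(combiner)
    # Merge (optional): sort the final key-value pairs.
    return sorted(reduced.items()) if merge else list(reduced.items())


tokens = "la nube de cenizas de la".split()
result = run(tokens, lambda tok: [(tok, 1)], n_threads=2)
```

Because each map task owns its combiner, no shuffle step is needed between map and reduce; the reduce step simply adds the per-task tables together.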

SLIDE 11

Experiments: Data & Machinery

Base Corpus

– Elpais.com on-line archive (1976–2011)
– 2.3 million news items
– Cleaned + PoS tagged
– 23 million structural elements
– 878 million tokens

Multiples of Elpais.com

– x2, x4, x8, x16
– Don't obey Heaps' Law ($V(n) = K \cdot n^{\beta}$)

1x Dell PowerEdge R615

– RAM: 256 GB
– CPUs: 2 x Xeon E5-2620 / 2 GHz (24 threads)
– DISKS: 3 x SAS 7,200 rpm, RAID-5

[Plot: Types (0.5 to 4.5 million) against Tokens (0.5 to 2 billion)]
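
A small sketch of why corpus multiples would not follow Heaps' Law, under the assumption (not stated in the slides) that a multiple is built by repeating the base corpus: the token count grows while the vocabulary stays the same. The function name `type_token_curve` and the toy sentence are illustrative only.

```python
def type_token_curve(tokens, step):
    """Return (tokens seen, distinct types seen) sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


base = "la nube de cenizas procedente de la erupción".split()
x1 = type_token_curve(base, step=len(base))
x4 = type_token_curve(base * 4, step=len(base))  # hypothetical "x4" multiple
```

In `x4` the number of types is identical at every sample point, whereas Heaps' Law would predict continued vocabulary growth as tokens are added.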

SLIDE 12

Word Counting

Given a (sub-)corpus, compute the frequencies of all the words (word frequency profile)

SLIDE 13

Word Counting

Split(docs) =
   divide docs into <doc, begin, end>

Map(<doc, begin, end>) =
   for i = begin to end do
      emit(words[docs[doc].offset + i], 1)
   end

Combiner:

Word       Freq.
0 (la)     456
19 (de)    5667
…          …
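
The Split/Map/Combiner pseudocode can be turned into a runnable sketch. The layout of `words` (one array of token ids for the whole corpus) and `docs` (one record per document with an `offset`) follows the pseudocode; the `length` field, the inclusive `end` bound and the toy data are assumptions made for this example.

```python
from collections import Counter

# Token ids for the whole corpus; doc 0 is words[0:4], doc 1 is words[4:7].
words = [0, 10, 19, 24, 19, 0, 19]
docs = [{"offset": 0, "length": 4}, {"offset": 4, "length": 3}]


def split(docs):
    """Split: one <doc, begin, end> chunk per document (end is inclusive)."""
    return [(d, 0, docs[d]["length"] - 1) for d in range(len(docs))]


def map_chunk(chunk, combiner):
    """Map: emit (word id, 1) for each token; the combiner adds it up on emit."""
    doc, begin, end = chunk
    for i in range(begin, end + 1):
        combiner[words[docs[doc]["offset"] + i]] += 1


combiner = Counter()
for chunk in split(docs):
    map_chunk(chunk, combiner)
```

After the loop, `combiner` plays the role of the Word/Freq. table: word id 19 has frequency 3, word id 0 has frequency 2, and ids 10 and 24 have frequency 1.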

SLIDE 14

Word Counting

[Plot: Elapsed Time (seconds, 5 to 75) against Corpus Size (1 to 8 billion tokens), one curve each for 1, 2, 4, 8, 16 and 24 threads]

Annotations: Baseline (1 thread): ~11 s at 1 Bw, 75 s at 8 Bw. Fastest curves: ~2 s at 1 Bw, ~10 s at 8 Bw. 16 threads ≈ 24 threads.

SLIDE 15

Word Counting

[Plot: Elapsed Time (seconds) against Number of Threads (1 to 24) for elpais.com x1, x2 and x4]

Diminishing returns from additional threads:
– 1 thread -> 2 threads: 12.0 s -> 6.2 s
– 8 threads -> 16 threads: 1.5 s -> 1.0 s
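
The two timing pairs can be read as speedups. The short calculation below uses only the figures quoted on the slide; the helper names `speedup` and `efficiency` are introduced here for illustration.

```python
def speedup(t_before, t_after):
    """How many times faster the run became."""
    return t_before / t_after


def efficiency(t_before, t_after, thread_ratio):
    """Speedup divided by the factor by which the thread count grew."""
    return speedup(t_before, t_after) / thread_ratio


s_1_to_2 = speedup(12.0, 6.2)    # doubling from 1 to 2 threads
s_8_to_16 = speedup(1.5, 1.0)    # doubling from 8 to 16 threads
e_1_to_2 = efficiency(12.0, 6.2, 2)
e_8_to_16 = efficiency(1.5, 1.0, 2)
```

Doubling from 1 to 2 threads gives a speedup of about 1.94 (close to the ideal 2), while doubling from 8 to 16 threads gives only 1.5, which is the decrease the slide points at.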

SLIDE 16

Word Cooccurrences

Given a word (or a lemma or a PoS, …), compute the frequency of the surrounding words (left/right)

SLIDE 17

Word Cooccurrences

Split(Posting list) =
   divide the posting list into <doc, n, positions>

Map(<doc, n, pos>) =
   for i = 0 to n - 1 do
      for j = 1 to L do
         emit(words[docs[doc].offset + pos[i] - j], 1)
      end
      for j = 1 to R do
         emit(words[docs[doc].offset + pos[i] + j], 1)
      end
   end

Combiner:

Word       Freq.
0 (la)     456
19 (de)    5667
…          …

[Diagram: Left Context (-L … -2 -1) and Right Context (+1 +2 … +R) around each occurrence]
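
A runnable sketch of the cooccurrence Map, driven by the posting list of the query word. The `words`/`docs` layout follows the pseudocode; the `length` field and the document-boundary checks are additions of this sketch (the slide's pseudocode does not show how context windows are clipped), and the data is made up.

```python
from collections import Counter

words = [0, 10, 19, 24, 7, 19, 0]          # token ids of a one-document corpus
docs = [{"offset": 0, "length": 7}]


def map_posting(entry, L, R, combiner):
    """Count the words up to L positions left and R positions right of each hit."""
    doc, n, pos = entry
    base = docs[doc]["offset"]
    length = docs[doc]["length"]
    for i in range(n):
        for j in range(1, L + 1):
            if pos[i] - j >= 0:              # stay inside the document (assumption)
                combiner[words[base + pos[i] - j]] += 1
        for j in range(1, R + 1):
            if pos[i] + j < length:          # stay inside the document (assumption)
                combiner[words[base + pos[i] + j]] += 1


# Posting list for word id 19: document 0, two hits, at positions 2 and 5.
posting_list = [(0, 2, [2, 5])]
combiner = Counter()
for entry in posting_list:
    map_posting(entry, L=1, R=1, combiner=combiner)
```

The work is proportional to the number of hits times the window size (L + R), which matches the experiments that vary the radius and the token frequency.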

SLIDE 18

Word Cooccurrences

of [token = "de"] in Elpais.com x1 => 55.7 million occurrences

[Plot: Elapsed Time (seconds, 1 to 11) against Radius (1 to 10 tokens), with L = R = Radius, one curve each for 1, 2, 4, 8, 16 and 24 threads]

Annotations: Baseline (1 thread): 10 s at Radius = 10. 24 threads: 2 s for Radius <= 10.

SLIDE 19

Word Cooccurrences

of [token = "de"] in Elpais.com xN using 24 threads

[Plot: Elapsed Time (seconds, 1 to 8) against Token Frequency (50 to 450 million occurrences) for radius = 1, 3, 5 and 10, with points for x1, x2, x4 and x8]

Annotations: 55.7 Mw: 2 s at radius = 10. 350 M occurrences: < 8 s at radius = 10.

SLIDE 20

N-grams

Given a (sub-)corpus, compute the frequency of all the sequences of words (lemma, PoS, …) with lengths between a minimum and a maximum

SLIDE 21

MapReduce N-grams: The Suffix-σ Algorithm

Klaus Berberich and Srikanta Bedathur (2013): "Computing n-Gram Statistics in MapReduce". In Proceedings of the 16th International Conference on Extending Database Technology (EDBT 2013)

– Stage 1: Maximal n-gram counts are computed and sorted
– Stage 2: Shorter n-gram counts are computed from the suffix array of maximal n-grams in the reducer tasks

New York Times Annotated Corpus (NYT)

– 1.8 million newspaper articles, 1987-2007, ~3 GB
– 1 billion tokens (345,827 types)

ClueWeb09-B (CW)

– 50 million web documents, 2009, ~246 GB
– ~21 billion tokens (979,935 types)

Hadoop Cluster: 10 nodes

– 2x6 cores, 64 GB RAM, 4x2 TB HDD

2-5-grams / min freq. = 10 (NYT), 100 (CW)

SLIDE 22

MapReduce N-grams

Split(Interval tree) =
   divide the interval tree into intervals: <doc, begin, end>

Map(<doc, begin, end>) =
   for pos = begin to end do
      emit(<words[docs[doc].offset + pos],
            words[docs[doc].offset + pos + 1],
            …
            words[docs[doc].offset + pos + max - 1]>,
           1)
   end

Combiner (max = 4):

Maximal n-gram                                   Freq.
<0,10,19,24> (la,nube,de,cenizas)                456
<19,24,43,45> (de,cenizas,",",procedente)        5667
…                                                …
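
A simplified, single-process sketch of the two-stage idea: count the maximal n-grams first, then derive the shorter n-gram counts from them. It relies on the fact that an n-gram occurs once for every position whose truncated suffix starts with it. It is not the full Suffix-σ algorithm of Berberich and Bedathur (no reverse-lexicographic sort or reducer stacks), and the function names are hypothetical.

```python
from collections import Counter


def maximal_ngrams(tokens, max_len):
    """Stage 1: count the suffix at every position, truncated to max_len tokens."""
    return Counter(tuple(tokens[i:i + max_len]) for i in range(len(tokens)))


def ngram_counts(tokens, min_len, max_len):
    """Stage 2: every prefix of a maximal n-gram inherits that n-gram's count."""
    counts = Counter()
    for gram, freq in maximal_ngrams(tokens, max_len).items():
        for n in range(min_len, len(gram) + 1):
            counts[gram[:n]] += freq
    return counts


tokens = "la nube de cenizas de la nube".split()
counts = ngram_counts(tokens, min_len=2, max_len=3)
```

With this input the bigram ("la", "nube") is counted twice, once from the maximal trigram starting at position 0 and once from the shorter suffix at the end of the text.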

SLIDE 23

MapReduce N-grams

[Plot: Elapsed Time (seconds, 100 to 800) against Sample Size (0 to 30 million tokens), one curve each for 2-grams to 3-grams, 2-grams to 5-grams, 2-grams to 10-grams and 2-grams to 20-grams]

Annotations: 3 min. 2-5-grams / min freq. = 10 (NYT), 100 (CW).

Problems with the hash table combiner

SLIDE 24

MapReduce into the Indexer: File Inversion

Text inversion
Structural inversion

SLIDE 25

MapReduce File Inversion: Text Indexing

[Diagram labels: Key, Value (emitted for Doc1 and Doc2); Key, Aggregated value]
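
Text inversion can be sketched as a MapReduce job in which each document emits one (key, value) pair per distinct token and the combiner appends the values of equal keys into one aggregated posting list. The representation of a posting as `(doc_id, [positions])` and the names `map_doc`, `combine` and `invert` are assumptions of this sketch, not the indexer's actual format.

```python
from collections import defaultdict


def map_doc(doc_id, tokens):
    """Map: emit (token, [(doc_id, positions)]) once per distinct token."""
    positions = defaultdict(list)
    for pos, tok in enumerate(tokens):
        positions[tok].append(pos)
    for tok, pos_list in positions.items():
        yield tok, [(doc_id, pos_list)]


def combine(aggregated, value):
    """Combiner: append the emitted postings to the aggregated value of the key."""
    aggregated.extend(value)
    return aggregated


def invert(corpus):
    index = defaultdict(list)
    for doc_id, tokens in corpus:
        for key, value in map_doc(doc_id, tokens):
            combine(index[key], value)
    return dict(index)


corpus = [(1, "la nube de la".split()), (2, "de cenizas".split())]
index = invert(corpus)
```

Documents are processed in increasing id order here, so each aggregated posting list ends up sorted by document id without a separate sort step.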

SLIDE 26

MapReduce File Inversion: Indexing Structure

[Diagram: two emitted tree values for key k are joined into one aggregated tree]

$\mathrm{Time}(\mathrm{Join}(T_1, T_2)) = O(\max(h_1, h_2)) = O(\max(\log n_1, \log n_2))$

SLIDE 27

MapReduce File Inversion: Text Indexing

[Plot: Elapsed Time (hours, 2 to 64) against Corpus Size (2 to 16 billion tokens), indexing with 24 threads]

Corpus size    Elapsed time
1 Bw           2 h
2 Bw           4 h
4 Bw           9:30 h
8 Bw           1 day
15 Bw          2.5 days

Bottleneck: concurrent read access!!

SLIDE 28

Conclusions & Future Work

Conclusions:

– Applications scale well to the sizes of interest using MapReduce
– MapReduce in multicore servers is a cost-effective solution for corpus management tools

Future Work:

– Improve hash containers
– Introduce new associative containers
– Implement other non-trivial statistics

SLIDE 29

Thank you!!
Questions??
