Computational Linguistics for Low-Resource Languages, 27 April 2016

SLIDE 1

Computational Linguistics for Low-Resource Languages

27 April 2016 Alexis Palmer palmer@cl.uni-heidelberg.de

SLIDE 2

Palmer, ICL, UHeidelberg CL4LRL, 27 April 2016

Course requirements & organization

✦ course website: www.cl.uni-heidelberg.de/courses/ss16/cllrl/
✦ schedule and literature to be posted on the course website
✦ your slides will also be posted
✦ language: German is fine too

SLIDE 3

Course requirements & organization

✦ reading & participation: read the papers before the relevant meeting and take part in the discussion
✦ questions: 2 questions per session, submitted by email *before noon* on the day of class
✦ presentation: presentation of selected paper(s), with discussion afterwards
✦ language resource assessment
✦ term paper: original research or in-depth survey and analysis (12-15 pages)

SLIDE 4

Student presentations

✦ topic: 1-2 related papers, depending on length and complexity
✦ presentation: scheduling TBD (depends on number of students); roughly 45 minutes for presentation plus discussion
✦ preparation: draft of slides at least one week before the presentation, plus a meeting for feedback
✦ office hours: Wednesdays 11:30-12:30, or by appointment (M/W/Th)

SLIDE 5

Language resource assessment

✦ goal: determine the state of language resources for a language of your choice
✦ presentation: short presentation (~10 min.), schedule TBD
✦ investigate: digital language resources; any NLP tools? corpora? work on revitalization/preservation? availability of resources?
✦ TODO: choose your language before 04.05 (email me; first come, first served)

SLIDE 6

CL for LRL

Questions of interest

  • What is a low-resource language? (aka less-studied language, resource-poor language, minority language, less-privileged language, ...)

SLIDE 7

CL for LRL

Questions of interest

  • What is a low-resource language? (aka less-studied language, resource-poor language, minority language, less-privileged language, ...)
  • What are the challenges posed by LRLs, and what are the major approaches to addressing these challenges?

SLIDE 8

CL for LRL

Questions of interest

  • What is a low-resource language? (aka less-studied language, resource-poor language, minority language, less-privileged language, ...)
  • What are the challenges posed by LRLs, and what are the major approaches to addressing these challenges?

Some major themes

  • Role of labeled/annotated data
  • Role of expert/linguistic knowledge (annotation & beyond)
  • Single language vs. “universal” solutions
  • Resource creation: does it always make sense? how can it be done most efficiently?

SLIDE 9

And another question... Why do we care?


SLIDE 10

And another question... Why do we care?

✦ practical reasons
✦ cultural reasons
✦ theoretical reasons

SLIDE 11

Language endangerment

Language loss

  • Current estimated rate of language death: one every 2 weeks (Crystal 2000)
  • Half of the world’s languages extinct by the end of this century
  • UNESCO Endangered Languages Programme (under the auspices of the Section on Intangible Cultural Heritage)
  • UN General Assembly: 2008 was the International Year of Languages

SLIDE 12

Language endangerment

Language loss

  • Current estimated rate of language death: one every 2 weeks (Crystal 2000)
  • Half of the world’s languages extinct by the end of this century
  • UNESCO Endangered Languages Programme (under the auspices of the Section on Intangible Cultural Heritage)
  • UN General Assembly: 2008 was the International Year of Languages

UNESCO endangerment status

  • six levels: safe, unsafe (or vulnerable), definitely endangered, severely endangered, critically endangered, extinct
  • criteria go beyond number of speakers

SLIDE 13

Evaluating language endangerment

Criteria to consider (UNESCO 2003)

  • Intergenerational language transmission
  • Absolute number of speakers
  • Proportion of speakers within the total population
  • Trends in existing language domains
  • Response to new domains and media
  • Materials for language education and literacy
  • Governmental and institutional attitudes and policies, including official status and use
  • Community members’ attitudes toward their own language
  • Amount and quality of documentation

SLIDE 14

Globally, 2488 languages in danger

source: UNESCO Interactive Atlas of the World’s Languages in Danger, 2009 edition

SLIDE 15

528 ‘severely endangered’ languages

source: UNESCO Interactive Atlas of the World’s Languages in Danger, 2009 edition

SLIDE 16

Germany: 13 endangered languages

source: UNESCO Interactive Atlas of the World’s Languages in Danger, 2009 edition

SLIDE 17

Documenting endangered languages

The realities

  • Most projects are individual or small-group endeavors with very small budgets
  • Each project seems to find its own workflow
  • Basic approach: collection, transcription, translation, detailed linguistic annotation (NOT a pipeline)
  • Tangible end products: orthographies, grammars, dictionaries, language teaching and learning materials, collections of stories, websites, etc.
  • Such materials support survival of the language
  • Do they support CL/NLP???

SLIDE 18

Uspanteko: 1320 speakers, ‘unsafe’ status

Uspantán, Quiché Department, Guatemala

SLIDE 19

Scenario: IGT for Uspanteko

Corpus of texts in the Mayan language Uspanteko

  • Produced by OKMA (Oxlajuuj Keej Maya' Ajtz'iib')
  • 66 texts, mostly oral history, personal experience, and stories
  • Total 284K words of transcribed text, 74K words glossed

IGT-XML: representational format specifically for IGT (interlinear glossed text)

           # texts   # morphemes
  train         21         38802
  dev            5         16792
  test           6         18704
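
The split sizes listed above can be turned into proportions with a few lines of Python. This is only a minimal sketch: the names `splits` and `share` are illustrative and do not come from the slides; the numbers are the morpheme counts given for train, dev, and test.

```python
# Morpheme counts per split, copied from the slide.
splits = {"train": 38802, "dev": 16792, "test": 18704}

# Total number of morphemes across the three splits.
total = sum(splits.values())


def share(name: str) -> float:
    """Return the fraction of all morphemes that falls into the given split."""
    return splits[name] / total


for name, count in splits.items():
    print(f"{name:5s} {count:6d} morphemes  {share(name):.1%}")
print(f"total {total:6d} morphemes")
```

The three splits add up to 74298 morphemes, with a little over half of them in the training portion.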

SLIDE 20

Types of resources

Data

  • primary: audio, video, texts (archiving)
  • machine-readable corpora
  • data with annotations
  • parallel corpora, comparable corpora

SLIDE 21

Types of resources

Data

  • primary: audio, video, texts (archiving)
  • machine-readable corpora
  • data with annotations
  • parallel corpora, comparable corpora

Linguistic resources

  • traditional: grammars, dictionaries, word lists
  • WordNet, other ontological resources
  • treebanks, etc.

SLIDE 22

Types of resources

Data

  • primary: audio, video, texts (archiving)
  • machine-readable corpora
  • data with annotations
  • parallel corpora, comparable corpora

Linguistic resources

  • traditional: grammars, dictionaries, word lists
  • WordNet, other ontological resources
  • treebanks, etc.

Tools

  • user-oriented: spell checkers, input systems, etc.
  • for NLP: tokenization, POS tagging, parsing, etc.
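
For the language resource assessment, the three resource types above could serve as a simple checklist. The following is a hypothetical sketch, not a prescribed format: the class `ResourceInventory`, its methods, and the placeholder language name are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The three resource types named on the slide.
CATEGORIES = ("data", "linguistic resources", "tools")


@dataclass
class ResourceInventory:
    """Hypothetical record of the resources found so far for one language."""

    language: str
    resources: Dict[str, List[str]] = field(
        default_factory=lambda: {category: [] for category in CATEGORIES}
    )

    def add(self, category: str, description: str) -> None:
        """Record one resource under one of the three categories."""
        if category not in self.resources:
            raise ValueError(f"unknown category: {category}")
        self.resources[category].append(description)

    def missing(self) -> List[str]:
        """Return the categories for which nothing has been recorded yet."""
        return [c for c, items in self.resources.items() if not items]


# Placeholder entries, invented for illustration only.
inventory = ResourceInventory("ExampleLanguage")
inventory.add("data", "archived audio recordings")
print(inventory.missing())
```

An empty category here only means that nothing has been recorded yet, not that no such resource exists.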

SLIDE 23

Challenges and approaches

Having to do with insufficiency of data

  • create more data?
  • leverage resource-rich languages
  • use semi- or unsupervised methods
  • use rule-based methods
  • ...

Having to do with the nature of the data

  • use linguistic knowledge to seed unsupervised models
  • use linguistic knowledge to adapt models/approaches
  • change the data to look more like familiar languages
  • ...
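
One common way to "leverage resource-rich languages" is annotation projection: tags from a tagged sentence in a resource-rich language are copied onto the aligned words of its translation. The sketch below illustrates the idea under strong assumptions: the word alignment is written by hand, and the sentence pair (English, with German standing in for the target language), the tags, and the function name `project_tags` are all invented for this example rather than taken from the slides.

```python
from typing import List, Optional, Tuple


def project_tags(
    source_tags: List[str],
    target_length: int,
    alignment: List[Tuple[int, int]],
) -> List[Optional[str]]:
    """Copy each source tag to the target position it is aligned with.

    Target words without an alignment link keep the value None.
    """
    projected: List[Optional[str]] = [None] * target_length
    for src_idx, tgt_idx in alignment:
        projected[tgt_idx] = source_tags[src_idx]
    return projected


# Toy example, invented for illustration.
source_words = ["she", "is", "sleeping"]
source_tags = ["PRON", "AUX", "VERB"]
target_words = ["sie", "schläft", "gerade"]
alignment = [(0, 0), (2, 1)]  # (source index, target index), hand-written

target_tags = project_tags(source_tags, len(target_words), alignment)
for word, tag in zip(target_words, target_tags):
    print(word, tag)
```

Unaligned target words (here "gerade") receive no tag. In a real setting the alignments would come from an automatic aligner and would be noisy, so projected tags are usually treated as a starting point rather than as gold annotation.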

SLIDE 24

Topics and scheduling

SLIDE 25

Topics

  • More complete list of topics & readings on website
  • Some options:
      • Data/resource creation
      • POS tagging and morphological analysis
      • Syntactic analysis
      • Linguistic universals, linguistic typology
      • Speech tools for LRLs
      • Machine translation
      • Cross-lingual approaches
      • ...

SLIDE 26

Scheduling

  • 4 May: foundations, Bird/Simons, Bird/Abney [me]
  • 11 May: possible start of student presentations

For next week:

  • Bird and Abney on building a Universal Corpus
  • Bird and Simons on requirements for good data
  • Email me with topic preferences (top 3) by Monday (02.05)