

slide-1
SLIDE 1

CSCI 562: EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING

25 Aug 2009

slide-2
SLIDE 2

WHAT WE WANT

What we’ve got? A.L.I.C.E.

slide-3
SLIDE 3

WHAT WOULD YOU DO?

Where is USC located?

slide-4
SLIDE 4

DATA

  • 1980s: if you wanted a computer to “know” something, you had to program it in.
  • Now: just about everything has been posted on the Web by someone, somewhere. But it is all in natural language.

slide-5
SLIDE 5

BAG OF WORDS

Where is USC located?

USC’s two primary campuses, both located in the heart of Los Angeles, welcome thousands of guests and visitors each year. The 226-acre University Park campus, home to the College of Letters, Arts and Sciences, the Graduate School, and most of USC’s professional schools, is adjacent to Exposition Park with its world-class museums and recreational facilities. A few miles to the northeast is the 61-acre Health Sciences campus, home to the Keck School of Medicine of USC and the School of Pharmacy as well as three major teaching hospitals.

Question: is 1, located 1, usc 1, where 1
Passage: is 2, located 1, two 1, usc 3, where …
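The word counts above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a naive tokenizer (lowercase, split on anything that is not a letter or digit); the function name `bag_of_words` is made up for illustration, and only the first sentence of the passage is used.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    # Naive tokenizer (an assumption): "USC's" becomes the tokens "usc" and "s".
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

question = bag_of_words("Where is USC located?")
passage = bag_of_words(
    "USC's two primary campuses, both located in the heart of Los Angeles, "
    "welcome thousands of guests and visitors each year."
)

# Words shared by the question and the passage; word order is ignored entirely.
overlap = question & passage
print(sorted(question.items()))
print(sorted(overlap.items()))
```

The overlap is all a bag-of-words matcher sees: it knows that "usc" and "located" occur in both texts, but nothing about how the words relate to each other.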

slide-6
SLIDE 6

AMBIGUITY

Where can a snow leopard be seen?

slide-7
SLIDE 7

GRAMMAR

What animal does a frog eat?

slide-8
SLIDE 8

GRAMMAR

(SBARQ (WHNP What animal) (SINV (VBZ does) (NP a frog) (VP (VB eat) (NP t))))

slide-9
SLIDE 9

MULTILINGUALITY

Tell me about Tai Yen-Hui.

slide-10
SLIDE 10

STRUCTURE

  • All language has various levels of structure, more than just a bag of words
  • Virtually all Natural Language Processing tasks involve inferring structure from text or transforming one kind of structure into another

slide-11
SLIDE 11

CS 562 - Intro (part 2)

CS 562 - Lecture 1, part II

  • More about ambiguities
  • Key problems to address in this class
      • grammar formalisms
      • search algorithms
      • learning methods
  • Learning Examples

slide-12
SLIDE 12

More about Ambiguities

  • to middle school kids: what does this sentence mean?

Aravind Joshi

I saw her duck.

lexical ambiguity (word-sense)

slide-13
SLIDE 13

More about Ambiguities

  • to middle school kids: what does this sentence mean?

I eat sushi with tuna.

structural ambiguity (PP-attachment)

slide-14
SLIDE 14

More about Ambiguities

  • to middle school kids: what does this sentence mean?

I eat sushi with tuna.

lexical ambiguity (word-sense)

slide-15
SLIDE 15

More about Ambiguities

  • to middle school kids: what does this sentence mean?

Everybody loves somebody.

structural ambiguity (quantifier scope)

???
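The two scope readings of "Everybody loves somebody" can be written out in first-order logic. This is only a sketch: the predicate name loves is notation chosen here, not something taken from the slide.

```latex
% Reading 1: "everybody" takes wide scope (each person may love a different person)
\forall x\, \exists y\; \mathrm{loves}(x, y)

% Reading 2: "somebody" takes wide scope (there is one person whom everybody loves)
\exists y\, \forall x\; \mathrm{loves}(x, y)
```

The second reading implies the first but not the other way around, so the same word string corresponds to two logically different meanings.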

slide-16
SLIDE 16

More about Ambiguities

  • to middle school kids: what does this sentence mean?

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo

Dogs dogs dog dog dogs.

Police police police police police

http://www.cse.buffalo.edu/~rapaport/BuffaloBuffalo/buffalobuffalo.html

slide-17
SLIDE 17

Ambiguities in Translation

zi zhu zhong duan
自 助 终 端
self help terminal device

slide-18
SLIDE 18

Ambiguities in Translation

slide-19
SLIDE 19

or even...

clear evidence that NLP is used in real life!

slide-20
SLIDE 20

Ambiguity Explosion

I saw her duck.

  • how about...
  • I saw her duck with a telescope.
  • I saw her duck with a telescope in the garden...

slide-21
SLIDE 21

Ambiguity Explosion

  • exponential explosion of the search space
  • Q1: how to represent ambiguities (compactly)?
  • Q2: how to search over this space (efficiently)?
  • Q3: how to rank different hypotheses?

(S (NP (PRP I)) (VP (VBD saw) (NP (PRP$ her) (NN duck)) (PP (IN with) (NP (DT a) (NN telescope)))))
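To get a feel for how fast the search space grows, one can count the binary bracketings of a word sequence, which follow the Catalan numbers. This is a rough sketch only: it counts unlabeled binary trees, not the parses licensed by any particular grammar, and the function name `count_binary_trees` is made up for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_binary_trees(n_words):
    """Count the binary bracketings of a sequence of n_words words."""
    if n_words == 1:
        return 1
    # Choose where the root splits the sequence, then bracket each side.
    return sum(
        count_binary_trees(k) * count_binary_trees(n_words - k)
        for k in range(1, n_words)
    )

# 4, 7 and 10 words match the three "I saw her duck ..." sentences.
for n in (4, 7, 10, 20):
    print(n, count_binary_trees(n))
```

Going from 4 to 10 words raises the count from 5 to 4862, and a 20-word sentence already has more than a billion bracketings, which is why enumerating hypotheses one by one is not an option.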

slide-22
SLIDE 22

Answers...

  • Q1: how to represent ambiguities?
      • context-free grammar (unit 2)
      • finite-state automata (unit 1)
  • Q2: how to search in this space?
      • dynamic programming (units 1&2)
  • Q3: how to rank these hypotheses?
      • weighted grammar (units 1-3)
      • weights learned from data
      • (saw, with, telescope) seen more often in texts

(S (NP (PRP I)) (VP (VBD saw) (NP (PRP$ her) (NN duck)) (PP (IN with) (NP (DT a) (NN telescope)))))
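The three answers fit together in one small program: a weighted context-free grammar represents the ambiguity, a CKY chart (dynamic programming) searches it, and the weights rank the parses. The sketch below is illustrative only. The grammar, the simplified tag names, and every weight are invented here rather than learned from data, and the verb reading of "duck" is left out to keep the grammar tiny.

```python
from collections import defaultdict

# Toy weighted grammar in Chomsky normal form; all weights are invented.
BINARY = {  # (left child, right child) -> list of (parent, weight)
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 0.6)],
    ("VP", "PP"): [("VP", 0.4)],   # PP attaches to the verb phrase
    ("Det", "N"): [("NP", 0.7)],
    ("NP", "PP"): [("NP", 0.2)],   # PP attaches to the noun phrase
    ("P", "NP"): [("PP", 1.0)],
}
LEXICON = {  # word -> list of (tag, weight)
    "I": [("NP", 0.1)],
    "saw": [("V", 1.0)],
    "her": [("Det", 0.4)], "a": [("Det", 0.3)], "the": [("Det", 0.3)],
    "duck": [("N", 0.4)], "telescope": [("N", 0.3)], "garden": [("N", 0.3)],
    "with": [("P", 0.5)], "in": [("P", 0.5)],
}

def cky(words):
    """Return (number of parses, best weight, best tree) for the S symbol."""
    n = len(words)
    count = defaultdict(int)   # (i, j, label) -> number of subtrees
    best = {}                  # (i, j, label) -> (weight, bracketed tree)
    for i, word in enumerate(words):
        for tag, w in LEXICON[word]:
            count[i, i + 1, tag] += 1
            best[i, i + 1, tag] = (w, f"({tag} {word})")
    for width in range(2, n + 1):            # fill shorter spans first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):        # split point
                for (left, right), parents in BINARY.items():
                    if (i, k, left) not in best or (k, j, right) not in best:
                        continue
                    lw, lt = best[i, k, left]
                    rw, rt = best[k, j, right]
                    for parent, w in parents:
                        count[i, j, parent] += (
                            count[i, k, left] * count[k, j, right]
                        )
                        cand = (w * lw * rw, f"({parent} {lt} {rt})")
                        old = best.get((i, j, parent))
                        if old is None or cand[0] > old[0]:
                            best[i, j, parent] = cand
    weight, tree = best.get((0, n, "S"), (0.0, None))
    return count[0, n, "S"], weight, tree

n_parses, weight, tree = cky("I saw her duck with a telescope".split())
print(n_parses)
print(tree)
```

With these made-up weights the chart finds two parses and ranks the verb-attachment reading (seeing with a telescope) first, mirroring the (saw, with, telescope) intuition. Each chart cell stores only a count and the single best subtree, so the cost stays polynomial in sentence length even though the number of parses does not.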

slide-23
SLIDE 23

Why Learning?

  • learning is better than hand-written rules, because:
      • less work; easily adapts to new languages/domains
          • Powerset (now bing.com): 15 years for English grammar!
          • now they are writing their Chinese grammar...
          • and languages constantly change!
      • learning can work, and often works better!
          • machine translation: used to be dominated by rule-based systems
          • now statistical methods are better: Google vs. SYSTRAN
          • Google learns from the web, and translates 40+ languages

[also CS 567, Machine Learning, Fall 2009]

slide-24
SLIDE 24

Example - Rosetta Stone

  • the most famous (tri-)parallel text
  • machines can do the same job! (if given parallel text)
  • UN/EU/Canadian proceedings, news, tech docs, ...

slide-25
SLIDE 25

A sci-fi example

(Knight, 1997)

farok crrrok hihok yorok clok kantok ok-yurp

Your assignment: translate this Centauri sentence into Arcturan

slide-26
SLIDE 26
  • 1c. ok-voon ororok sprok .
  • 1a. at-voon bichat dat .
  • 2c. ok-drubel ok-voon anok plok sprok .
  • 2a. at-drubel at-voon pippat rrat dat .
  • 3c. erok sprok izok hihok ghirok .
  • 3a. totat dat arrat vat hilat .
  • 4c. ok-voon anok drok brok jok .
  • 4a. at-voon krat pippat sat lat .
  • 5c. wiwok farok izok stok .
  • 5a. totat jjat quat cat .
  • 6c. lalok sprok izok jok stok .
  • 6a. wat dat krat quat cat .
  • 7c. lalok farok ororok lalok sprok izok enemok .
  • 7a. wat jjat bichat wat dat vat eneat .
  • 8c. lalok brok anok plok nok .
  • 8a. iat lat pippat rrat nnat .
  • 9c. wiwok nok izok kantok ok-yurp .
  • 9a. totat nnat quat oloat at-yurp .
  • 10c. lalok mok nok yorok ghirok clok .
  • 10a. wat nnat gat mat bat hilat .
  • 11c. lalok nok crrrok hihok yorok zanzanok .
  • 11a. wat nnat arrat mat zanzanat .
  • 12c. lalok rarok nok izok hihok mok .
  • 12a. wat nnat forat arrat vat gat .

farok crrrok hihok yorok clok kantok ok-yurp

(Knight,1997)
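One simple way to start on the puzzle is to look at co-occurrence: for a Centauri word, keep only the Arcturan words that appear in every sentence pair containing it. The sketch below applies that filter to the twelve pairs above. It is an illustrative heuristic, not necessarily the procedure Knight describes, and the names `CORPUS` and `candidates` are made up here.

```python
# (Centauri sentence, Arcturan sentence) pairs 1-12 from the slide, periods dropped.
CORPUS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok",
     "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]

def candidates(centauri_word):
    """Arcturan words present in every pair whose Centauri side has the word."""
    sets = [
        set(arcturan.split())
        for centauri, arcturan in CORPUS
        if centauri_word in centauri.split()
    ]
    return set.intersection(*sets) if sets else set()

for word in "farok crrrok hihok yorok clok kantok ok-yurp".split():
    print(word, sorted(candidates(word)))
```

The filter pins down "farok" and "hihok" to a single candidate each, but words seen in only one sentence (such as "crrrok" or "kantok") keep that whole sentence as candidates. Narrowing those down takes further elimination, for example ruling out Arcturan words already explained by other Centauri words.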

slide-33
SLIDE 33

A sci-fi example

(Knight, 1997)

farok crrrok hihok yorok clok kantok ok-yurp

Your assignment: translate this Centauri sentence into Arcturan

jjat arrat mat bat oloat at-yurp
farok crrrok hihok yorok clok kantok ok-yurp

Are these Arcturan words in Arcturan order?
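One rough way to probe the word-order question is to check which adjacent word pairs of the candidate translation have ever been seen in the Arcturan half of the parallel text. The sketch below does that with plain bigram lookup. It is a crude stand-in for a language model, and the names `ARCTURAN`, `seen_bigrams`, and `unseen_pairs` are made up for illustration.

```python
# Arcturan sentences 1a-12a from the parallel text, periods dropped.
ARCTURAN = [
    "at-voon bichat dat",
    "at-drubel at-voon pippat rrat dat",
    "totat dat arrat vat hilat",
    "at-voon krat pippat sat lat",
    "totat jjat quat cat",
    "wat dat krat quat cat",
    "wat jjat bichat wat dat vat eneat",
    "iat lat pippat rrat nnat",
    "totat nnat quat oloat at-yurp",
    "wat nnat gat mat bat hilat",
    "wat nnat arrat mat zanzanat",
    "wat nnat forat arrat vat gat",
]

def seen_bigrams(sentences):
    """Collect every pair of adjacent words that occurs in the sentences."""
    pairs = set()
    for sentence in sentences:
        words = sentence.split()
        pairs.update(zip(words, words[1:]))
    return pairs

def unseen_pairs(candidate, sentences):
    """Adjacent pairs in the candidate that never occur in the sentences."""
    known = seen_bigrams(sentences)
    words = candidate.split()
    return [pair for pair in zip(words, words[1:]) if pair not in known]

print(unseen_pairs("jjat arrat mat bat oloat at-yurp", ARCTURAN))
# prints [('jjat', 'arrat'), ('bat', 'oloat')]
```

Three of the five adjacent pairs are attested, while "jjat arrat" and "bat oloat" are not. With only twelve sentences an unseen pair is weak evidence, but it shows how target-language text alone can be used to judge word order.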

slide-34
SLIDE 34
  • 1e. Garcia and associates .
  • 1s. Garcia y asociados .
  • 2e. Carlos Garcia has three associates .
  • 2s. Carlos Garcia tiene tres asociados .
  • 3e. his associates are not strong .
  • 3s. sus asociados no son fuertes .
  • 4e. Garcia has a company also .
  • 4s. Garcia tambien tiene una empresa .
  • 5e. its clients are angry .
  • 5s. sus clientes estan enfadados .
  • 6e. the associates are also angry .
  • 6s. los asociados tambien estan enfadados .
  • 7e. the clients and the associates are enemies .
  • 7s. los clientes y los asociados son enemigos .
  • 8e. the company has three groups .
  • 8s. la empresa tiene tres grupos .
  • 9e. its groups are in Europe .
  • 9s. sus grupos estan en Europa .
  • 10e. the modern groups sell strong pharmaceuticals .
  • 10s. los grupos modernos venden medicinas fuertes .
  • 11e. the groups do not sell zenzanine .
  • 11s. los grupos no venden zanzanina .
  • 12e. the small groups are not modern .
  • 12s. los grupos pequenos no son modernos .

Clients do not sell pharmaceuticals in Europe .

(Knight,1997)

slide-35
SLIDE 35

Take Home Message

  • languages are more than just bags of words!
  • ambiguity is everywhere, and NLP is all about resolving it
  • we’ll teach machines how to read and translate...
  • and how to learn to read and translate from data
  • have fun in this class! :)