
SLIDE 1

From Pāṇinian Sandhi to Finite State Calculus

Malcolm D. Hyman
Max Planck Institute for the History of Science, Berlin

First International Sanskrit Computational Linguistics Symposium, Paris, 2007 – p.1

SLIDE 2

Overview

1. Research context
2. An XML vocabulary for Pāṇinian rules
3. From Pāṇinian rules to an FST
4. Implications: remarks on linguistic description


SLIDE 3

Research context

Ongoing work on modeling components of Sanskrit grammar according to Pāṇinian principles:

  nominal inflection
  verbal inflection (using Dhātupāṭha)
  stem formation (perfect stem, participial stems, ...)
  morphophonology (sandhi)


SLIDE 4

Methodology

How closely to follow Pāṇini? Practical concerns dictate an incremental approach. We are obliged to interpret Pāṇini. Research results concerning both Indian grammatical methods and facts of the Sanskrit language will emerge from computational studies.


SLIDE 5

Building blocks of an XML model

The rules model not only a Pāṇinian sūtra, but also its context and its interpretation.

  An XML schema
  A sound-based encoding (SLP1)
  A regular expression dialect (PCREs)


SLIDE 6

The SLP1 encoding

Transliteration = SLP1 code:

a = a, ā = A, i = i, ī = I, u = u, ū = U
ṛ = f, ṝ = F, ḷ = x, ḹ = X
e = e, ai = E, o = o, au = O
k = k, kh = K, g = g, gh = G, ṅ = N
c = c, ch = C, j = j, jh = J, ñ = Y
ṭ = w, ṭh = W, ḍ = q, ḍh = Q, ṇ = R
t = t, th = T, d = d, dh = D, n = n
p = p, ph = P, b = b, bh = B, m = m
y = y, r = r, l = l, v = v
ś = S, ṣ = z, s = s, h = h
anusvāra = M; visarga = H
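The SLP1 correspondences on this slide can be turned into a small transliteration routine. The following is only a sketch: the function name `iast_to_slp1` and the greedy two-character matching are assumptions made for illustration, not part of the presented system.

```python
import unicodedata

# Transliteration -> SLP1 pairs, following the correspondences on this slide.
_PAIRS = {
    "a": "a", "ā": "A", "i": "i", "ī": "I", "u": "u", "ū": "U",
    "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "e": "e", "ai": "E", "o": "o", "au": "O",
    "k": "k", "kh": "K", "g": "g", "gh": "G", "ṅ": "N",
    "c": "c", "ch": "C", "j": "j", "jh": "J", "ñ": "Y",
    "ṭ": "w", "ṭh": "W", "ḍ": "q", "ḍh": "Q", "ṇ": "R",
    "t": "t", "th": "T", "d": "d", "dh": "D", "n": "n",
    "p": "p", "ph": "P", "b": "b", "bh": "B", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "ś": "S", "ṣ": "z", "s": "s", "h": "h",
    "ṃ": "M", "ḥ": "H",
}
# Normalize keys so that precomposed and decomposed input compare equal.
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _PAIRS.items()}


def iast_to_slp1(text: str) -> str:
    """Greedy longest-match transliteration; unknown characters pass through."""
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in IAST_TO_SLP1:          # digraphs such as "kh", "ai"
            out.append(IAST_TO_SLP1[pair])
            i += 2
        else:                             # single letters, spaces, etc.
            out.append(IAST_TO_SLP1.get(text[i], text[i]))
            i += 1
    return "".join(out)
```

Because the matching is greedy, a genuine sequence such as a followed by i in hiatus, or d followed by h, would be read as a digraph; a fuller implementation would need a way to mark such cases.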


SLIDE 7

The rule element

8.3.23 mo ’nusvāraḥ

<rule source="m" target="M" rcontext="[@(wb)][@(hal)]" ref="A.8.3.23"/>

(We may need more than one rule to express a sūtra.)
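One way to read such a rule element is as a regular-expression substitution with lookaround contexts. The sketch below makes several assumptions that the slides do not state: the macro values for `wb` (space or `#`) and `hal` (all consonants in SLP1), the helper names, and the use of Python's `re` module in place of the PCRE dialect mentioned earlier.

```python
import re
import xml.etree.ElementTree as ET

RULE_XML = '<rule source="m" target="M" rcontext="[@(wb)][@(hal)]" ref="A.8.3.23"/>'

# Assumed macro values (not given in the slides).
MACROS = {
    "wb": " #",                                  # word boundary: space or '#'
    "hal": "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh",  # consonants in SLP1
}


def expand(pattern: str) -> str:
    """Replace every @(name) reference with the macro's value."""
    return re.sub(r"@\((\w+)\)", lambda m: MACROS[m.group(1)], pattern)


def compile_rule(xml_text: str):
    """Turn a <rule> element into (compiled regex, replacement string)."""
    rule = ET.fromstring(xml_text)
    left = expand(rule.get("lcontext", ""))
    right = expand(rule.get("rcontext", ""))
    pattern = expand(rule.get("source"))
    if left:
        pattern = f"(?<={left})" + pattern   # Python needs a fixed-width lookbehind
    if right:
        pattern = pattern + f"(?={right})"
    return re.compile(pattern), rule.get("target")


def apply_rule(xml_text: str, text: str) -> str:
    regex, target = compile_rule(xml_text)
    return regex.sub(lambda m: target, text)
```

With these assumptions, `apply_rule(RULE_XML, "tam ca")` yields `taM ca`, while `tam eva` is left unchanged because the following word begins with a vowel.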


SLIDE 8

The macro element

We need some means for translating Pāṇini’s metalanguage, e.g. sound classes (pratyāhāras):

<macro name="JaS" value="JBGQDjbgqd" c="voiced stop"/>
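Macro references of the form `@(name)` can be resolved by simple textual substitution before a pattern is compiled. A minimal sketch, assuming a plain dictionary as the macro table; the `wb` entry and the function name `expand_macros` are hypothetical.

```python
import re

MACROS = {
    "JaS": "JBGQDjbgqd",  # voiced stops, as in the <macro> element on this slide
    "wb": " #",           # assumption: space or '#' marks a word boundary
}


def expand_macros(pattern: str, macros: dict) -> str:
    """Replace each @(name) reference with the value of the named macro."""
    return re.sub(r"@\((\w+)\)", lambda m: macros[m.group(1)], pattern)


# A character class built from the macro matches exactly the listed sounds.
voiced_stop = re.compile(expand_macros("[@(JaS)]", MACROS))
```

Here `voiced_stop.fullmatch("g")` succeeds and `voiced_stop.fullmatch("k")` does not. An undefined macro name raises `KeyError`, which seems preferable to silently leaving the reference in the pattern.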


SLIDE 9

The mapping element

1.1.2 adeṅ guṇaḥ

<mapping name="guna" ref="A.1.1.2">
  <map from="@(a)" to="a"/>
  <map from="@(i)" to="e"/>
  <map from="@(u)" to="o"/>
  <map from="@(f)" to="a"/>
  <map from="@(x)" to="a"/>
</mapping>


SLIDE 10

The function element

<function name="gunate">
  <rule source="[@(a)@(i)@(u)]" target="%(guna($1))"/>
  <rule source="[@(f)@(x)]" target="%(guna($1)) %(semivowel($1))"/>
</function>


SLIDE 11

Applying a function

6.1.87 ād guṇaḥ

<rule source="[@(a)][@(wb)]([@(ik)])" target="!(gunate($1))" ref="A.6.1.87"/>
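The interplay of mapping, function, and rule can be sketched in a few lines. The macro contents (`@(a)` as `aA`, `@(ik)` as `iIuUfFxX`, `@(wb)` as space or `#`) and the `semivowel` mapping (ṛ to r, ḷ to l) are assumptions, since the slides do not spell them out.

```python
import re

# The guna mapping of 1.1.2, expanded over short and long vowels (assumed classes).
GUNA = {"a": "a", "A": "a", "i": "e", "I": "e", "u": "o", "U": "o",
        "f": "a", "F": "a", "x": "a", "X": "a"}

# Assumed semivowel mapping for the vocalic liquids.
SEMIVOWEL = {"f": "r", "F": "r", "x": "l", "X": "l"}


def gunate(vowel: str) -> str:
    """guna of the vowel; vocalic r and l additionally receive their semivowel."""
    return GUNA[vowel] + SEMIVOWEL.get(vowel, "")


# 6.1.87: a or A, a word boundary, and an ik vowel are jointly replaced.
RULE_6_1_87 = re.compile(r"[aA][ #]([iIuUfFxX])")


def apply_6_1_87(text: str) -> str:
    return RULE_6_1_87.sub(lambda m: gunate(m.group(1)), text)
```

Under these assumptions `ca iti` becomes `ceti` and `mahA fzi` becomes `maharzi`.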


SLIDE 12

Implementing the modeled rules

The XML model captures some of the structure of Pāṇini’s grammar. But the obvious serial application of the rules is computationally inefficient. The rules can be automatically translated into regular expressions for compilation into a finite state transducer using tools such as xfst (Xerox) or fsa (van Noord). The relation between the underlying strings and the surface strings is a regular relation.


SLIDE 13

The replace operator

Rules may be translated into regular expressions employing the replace operator (Karttunen 1995).

(a|A)( | #)(a|A) → a
(a|A)( | #)(i|I) → e
(a|A)( | #)(u|U) → o
(a|A)( | #)(f|F) → ar
(a|A)( | #)(x|X) → al
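The five replacements can be generated mechanically from a vowel-class table, which illustrates the automatic translation step in a very reduced form. This sketch uses Python's `re.sub` as a stand-in for the replace operator of a finite state toolkit; the table names and the sequential application order are assumptions.

```python
import re

A_CLASS = "aA"
VOWEL_CLASSES = {"a": "aA", "i": "iI", "u": "uU", "f": "fF", "x": "xX"}
GUNA_OUTPUT = {"a": "a", "i": "e", "u": "o", "f": "ar", "x": "al"}


def alternation(chars: str) -> str:
    """Write a set of symbols as a parenthesized alternation, e.g. (i|I)."""
    return "(" + "|".join(chars) + ")"


def build_replace_rules():
    """Return (pattern, replacement) pairs, one per vowel class."""
    boundary = "( |#)"
    return [
        (alternation(A_CLASS) + boundary + alternation(members), GUNA_OUTPUT[name])
        for name, members in VOWEL_CLASSES.items()
    ]


REPLACE_RULES = build_replace_rules()


def apply_replace_rules(text: str) -> str:
    # Applied one after another here; a toolkit would compile them into one FST.
    for pattern, replacement in REPLACE_RULES:
        text = re.sub(pattern, replacement, text)
    return text
```

The second generated pair is `("(a|A)( |#)(i|I)", "e")`, matching the second line of the slide apart from spacing.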


SLIDE 14

Context-dependent replacement

Documented algorithms exist for the translation of context-dependent replacements into FSTs (Mohri & Sproat 1996).

6.1.109 eṅaḥ padāntād ati

<rule source="a" target="’" lcontext="[@(eN)][@(wb)]" ref="6.1.109"/>

a → ’ / (e|o)( | #) _
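The left context of this rule can be expressed as a lookbehind assertion. A minimal sketch, assuming `@(eN)` expands to `eo` and `@(wb)` to space or `#`; the names `RULE_6_1_109` and `AVAGRAHA` are illustrative.

```python
import re

AVAGRAHA = "\u2019"  # the target symbol shown in the rule

# Short a is replaced when preceded by e or o plus a word boundary.
RULE_6_1_109 = re.compile(r"(?<=[eo][ #])a")


def apply_6_1_109(text: str) -> str:
    return RULE_6_1_109.sub(AVAGRAHA, text)
```

The lookbehind works in Python because the context has a fixed width of two symbols; contexts of variable length need a different encoding, such as a capturing group.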


SLIDE 15

An FST for 6.1.109

6.1.109 eṅaḥ padāntād ati

[Figure: state diagram of the transducer; recoverable state labels 1 and 2; recoverable arc labels: e, o; ?; (space), #; a:’]
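A transducer of this kind can be written out by hand as a small state table. The version below is a reconstruction under assumptions, not a transcription of the figure: it uses three states (0 as start, 1 after e or o, 2 after e or o plus a boundary), and the state numbering and helper names are hypothetical.

```python
AVAGRAHA = "\u2019"


def symbol_class(ch: str) -> str:
    """Group input symbols into the classes the arcs distinguish."""
    if ch in ("e", "o"):
        return "eo"
    if ch in (" ", "#"):
        return "wb"
    if ch == "a":
        return "a"
    return "other"


# (state, class) -> (next state, output); None means "copy the input symbol".
ARCS = {
    (1, "wb"): (2, None),
    (2, "a"): (0, AVAGRAHA),
}


def transduce(text: str) -> str:
    state = 0
    out = []
    for ch in text:
        cls = symbol_class(ch)
        if (state, cls) in ARCS:
            state, emitted = ARCS[(state, cls)]
        elif cls == "eo":          # e or o leads to state 1 from any state
            state, emitted = 1, None
        else:                      # any other symbol returns to the start state
            state, emitted = 0, None
        out.append(ch if emitted is None else emitted)
    return "".join(out)
```

Only the arc from state 2 on `a` changes its input; every other arc is an identity arc, which is why the table needs just two explicit entries.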


SLIDE 16

A composed FST for external sandhi

37 sūtras constitute core rules for external sandhi
XML: 48 rules, 61 macros, 16 mappings, 3 functions
compiled regular expressions are ~268KB
composed transducer has 4,994 states, 417,814 arcs


SLIDE 17

Comparing two approaches

Serial application of rules:

FORM      SŪTRA

tat ca
tad ca    8.2.39
taj ca    8.4.40, 44
tac ca    8.4.55
tacca
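The serial derivation can be replayed step by step. The patterns below are drastic simplifications written only to reproduce this one example; they are assumptions and do not capture the full scope of the cited sūtras. The final joining step carries no sūtra number on the slide and is labeled accordingly.

```python
import re

# (label, pattern, replacement); each pattern covers only what this example needs.
STEPS = [
    ("8.2.39", r"t(?= )", "d"),               # word-final t becomes voiced
    ("8.4.40, 44", r"d(?= [cCjJ])", "j"),     # dental becomes palatal before a palatal
    ("8.4.55", r"j(?= [cCkKtTpP])", "c"),     # devoicing before a voiceless stop
    ("", r" ", ""),                           # words written together
]


def derive(form: str):
    """Return the list of (form, label) pairs produced by applying each step in order."""
    history = [(form, "")]
    for label, pattern, replacement in STEPS:
        form = re.sub(pattern, replacement, form)
        history.append((form, label))
    return history


if __name__ == "__main__":
    for form, label in derive("tat ca"):
        print(f"{form:8} {label}")
```

Each step rescans the whole string, which is the inefficiency of serial application that motivates the composed transducer.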


SLIDE 18

Comparing two approaches

A unique path through the transducer:

<t:t><a:a><t:c><" ":c><c:ε><a:a>
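Reading the two sides of this path off separately recovers the underlying and the surface string. A small sketch, with the path written as a list of (input, output) pairs and the empty string standing for ε; the names are illustrative.

```python
# The path from the slide; "" stands for epsilon.
PATH = [("t", "t"), ("a", "a"), ("t", "c"), (" ", "c"), ("c", ""), ("a", "a")]


def read_path(path):
    """Concatenate the input labels and the output labels of a path."""
    underlying = "".join(inp for inp, _ in path)
    surface = "".join(out for _, out in path)
    return underlying, surface
```

`read_path(PATH)` gives `("tat ca", "tacca")`: the intermediate forms of the serial derivation never appear, since the composed transducer maps the underlying string to the surface string in one pass.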


SLIDE 19

Limitations of segmentalism

Segments are atomic, and enumerating them limits linguistic generalization. Features overlap segments. It was J. R. Firth’s insight that “some phonological properties are not uniquely ‘placed’ with respect to particular segments within a larger unit” (Anderson, 1985, 185). Coarticulation “can be detected in almost every phoneme sequence in normal speech” (Goodglass, 1993, 62).


SLIDE 20

Positions of the Indian grammarians

Pāṇini moved beyond the vikāra system of earlier linguistic thinkers (Cardona 1965, 311). Use of abbreviations (pratyāhāras) for sound classes and the principle of sāvarṇya (A. 1.1.50) emphasize featural analysis. Segments contain subsegments (e.g. /ṛ/ contains r: MBh. 3.452.1 ff.). Pitch is a property of the syllable (ṚPr. 3.9) or spreads to adjacent consonants (TPr. 1.43).


SLIDE 21

N-retroflexion in finite state modeling

Non-final /n/ is realized as ṇ after {ṛ, ṝ, r, ṣ} despite intervening vowels, semivowels, gutturals/velars, labials, or anusvāra.

<rule source="n" target="R" lcontext="[fFrz] [#@(aw)@(ku)@(pu)M]*" rcontext=".*[@(ac)]" ref="8.4.1-2"/>


SLIDE 22

N-retroflexion examples

There is a regular relation between a set of underlying and surface strings that includes the following pairs:

UNDERLYING     SURFACE

bṛṃhana        bṛṃhaṇa        ‘making big/strong’
ārabhyamāna    ārabhyamāṇa    ‘being commenced’
niṣanna        niṣaṇṇa        ‘sitting’
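The n-retroflexion rule has a left context of unbounded length, so a fixed-width lookbehind no longer suffices; one workaround is to capture the context and copy it back. The sketch below assumes macro values the slides do not list (`ac` as the vowels, `aw` as vowels plus h, y, v, r, `ku` as velars, `pu` as labials) and treats the space inside the rule's `lcontext` as a layout artifact.

```python
import re

# Assumed macro expansions in SLP1.
AC = "aAiIuUfFxXeEoO"      # vowels
AW = AC + "hyvr"           # vowels plus h, y, v, r
KU = "kKgGN"               # velars
PU = "pPbBm"               # labials

# Trigger, any number of transparent sounds, then n; a vowel must follow somewhere.
N_RETROFLEXION = re.compile(rf"([fFrz][#{AW}{KU}{PU}M]*)n(?=.*[{AC}])")


def retroflex_n(text: str) -> str:
    """Replace n by R (retroflex nasal) in the context of 8.4.1-2 as modeled here."""
    return N_RETROFLEXION.sub(r"\1R", text)
```

On the SLP1 forms of the first two pairs this gives `bfMhaRa` and `AraByamARa`. For `nizanna` it gives `nizaRna`: only the first nasal falls under this rule, and the second ṇ of the surface form is presumably produced by a separate assimilation rule that the sketch does not model.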


SLIDE 23

A prosody of retroflexion

When R is projected onto the linear phonematic plane, ṇ occurs within its extension (Allen 1951, 943).

bᴿṛṃhaṇa
ā-ᴿrabhyamāṇa
ni-ᴿṣaṇṇa


SLIDE 24

How to represent length?

/devāt/ ([+long] segment)
/devaːt/ (phoneme of length)
/devaat/ (two phonemes)


SLIDE 25

Autosegmental approaches to length

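[Figure: two autosegmental representations of devāt: (1) the segmental tier d e v a t with the autosegment [DBL]; (2) the segmental tier d e v a t associated with the CV tier C V C V V C]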
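The CV-tier view of length can be illustrated by projecting an SLP1 string onto a skeleton in which a long vowel occupies two V slots. This is a sketch under assumptions: following the figure, e and o are given a single V slot, the diphthongs E and O are given two, and every non-vowel symbol counts as C.

```python
SHORT_VOWELS = set("aiufxeo")   # one V slot each (e and o as in the figure)
LONG_VOWELS = set("AIUFX")      # two V slots each
DIPHTHONGS = set("EO")          # assumed parallel to long vowels


def cv_skeleton(slp1: str) -> str:
    """Project an SLP1 string onto its CV tier."""
    slots = []
    for ch in slp1:
        if ch in LONG_VOWELS or ch in DIPHTHONGS:
            slots.append("VV")
        elif ch in SHORT_VOWELS:
            slots.append("V")
        else:
            slots.append("C")
    return "".join(slots)
```

`cv_skeleton("devAt")` returns `CVCVVC`, the skeleton shown for devāt, and a diphthong such as `E` receives the same `VV` as a long vowel, which is the structural parallel at issue.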


SLIDE 26

Autosegmental implications

“stability” of suprasegmental units (Goldsmith 1976)
compensatory lengthening (Latin consul → cō̃sul; cf. epigraphic COS)
Swedish has complementary distribution of vocalic/consonantal length in the rime of stressed syllables
long vowels are structurally parallel to diphthongs on the CV tier but not on the segmental tier


SLIDE 27

Length in Indian grammar

The Pāṇinian Śivasūtras specify only five basic vowels, not distinguishing between short and long (or pluta) vowels. Pāṇini characteristically refers to a-varṇa, etc., that is, the a vowel independent of its length (1.1.69).


SLIDE 28

The utility of linguistic descriptions

The virtue of particular linguistic descriptions is substantially relative to their purpose. Linear and non-linear descriptions each have advantages. The Aṣṭādhyāyī is motivated by brevity and explanatory generality. Computational linguistics strives for efficiency and explicitness.

