

SLIDE 1

Guiding People to Information: Providing an Interface to a Digital Library Using Reference as a Basis for Indexing

  • S. Bradshaw, A. Scheinkman & K. Hammond

Context

  • Information Retrieval Systems and Machine Learning

– ML techniques/algorithms used in IR
– IR applied to ML, esp. CBR

  • User feedback in learning systems

Larger Issues Raised

  • Indexing and textual classification:

– As specific to text-based knowledge systems
– As relevant to automated knowledge acquisition in general
– Implications as a cognitive model

Plan for Discussion

  • Background on IR
  • ML in document classification
  • Citation indexing

– CiteSeer, Rosetta

  • Indexing from the perspective of rhetorical theory
  • Practical & theoretical aspects of the underlying cognitive problem of textual indexing & classification
  • CBR & texts; CBR “textuality”
SLIDE 2

Automated Text Categorization

  • Task: assign a value (usually {0, 1}) to each entry in a decision matrix:

$$
\begin{array}{c|ccccc}
       & d_0    & \dots & d_j    & \dots & d_n    \\ \hline
c_0    & a_{00} & \dots & a_{0j} & \dots & a_{0n} \\
\vdots & \vdots &       & \vdots &       & \vdots \\
c_i    & a_{i0} & \dots & a_{ij} & \dots & a_{in} \\
\vdots & \vdots &       & \vdots &       & \vdots \\
c_m    & a_{m0} & \dots & a_{mj} & \dots & a_{mn}
\end{array}
$$

  • Categories are labels (no access to meaning)
  • Attribution is content-based (no metadata)
  • Constraints can differ wrt cardinalities of the assignment

Automated Text Categorization

  • CPC vs DPC:

– Category-pivoted categorization

  • One row at a time
  • When categories added dynamically

– Document-pivoted categorization

  • One column at a time
  • When documents added over a long period of time

  • Assignment vs “relevance”

– The latter is subjective
– It is largely the same as the notion of relevance to an information need

Automated Text Categorization

  • Applications

– Automatic indexing for IR using a controlled dictionary; usually performed by experts; indices = categories
– Classified ads
– Document filtering (e.g., Reuters/AP)
– WSD (word sense disambiguation)

  • Word occurrence contexts = docs, senses = categories
  • Itself important as an indexing technique

– Web-page categorization (e.g., Yahoo)

Document Categorization & ML

  • Earliest efforts (’80s): knowledge engineering (manually building an expert system) using rules

– Example: CONSTRUE (for Reuters)
– Typical problem: “knowledge acquisition bottleneck” (updating)

  • More recently (’90s): ML approach

– Construct not the classifier, but the builder of classifiers
– Variety of approaches (both inductive & lazy)

SLIDE 3

VSM (Vector-Space Model)

  • Vector of n weighted index terms: “bag of words”

– More sophisticated approaches based on noun phrases

  • Linguistic vs. statistical notion of phrase
  • Results not conclusive

VSM (Vector-Space Model)

  • Standard model: tfidf (term frequency × inverse document frequency) = #(tk, dj) · log(|Tr| / #(tk))

  • Assumptions:

– The more often a term occurs in a document, the more representative it is
– The more documents a term occurs in, the less discriminating it is
– Is this always an appropriate model? Are “representative” and “significant” the same?
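The tfidf weighting just defined can be computed directly from the definition; a minimal Python sketch (the function name and the dict-based sparse vectors are illustrative conveniences, not taken from any of the papers discussed):

```python
import math

def tfidf(docs):
    """Weight each term of each document by tf * log(|Tr| / df),
    following the slide's definition #(tk, dj) * log(|Tr| / #(tk))."""
    n = len(docs)
    # document frequency: in how many documents does each term occur?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            w[term] = w.get(term, 0) + 1       # raw term frequency in this doc
        for term in w:
            w[term] *= math.log(n / df[term])  # damp terms common to the collection
        weights.append(w)
    return weights
```

Note that a term occurring in every training document gets weight log(1) = 0, reflecting the second assumption above: it discriminates nothing.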

VSM (Vector-Space Model)

  • Pre-processing:

– Removal of function words
– Stemming

  • “Distance”: based on the dot product of vectors
  • Dimensionality problem

– IR cosine matching scales well, but other learning algorithms used for classifier induction do not
– DR: dimensionality reduction (also reduces overfitting)
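The dot-product “distance” is usually normalized into the cosine match; a small sketch over sparse dict vectors (names are my own):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term->weight vectors:
    the standard VSM matching score, length-normalized so that
    document size does not dominate."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (nu * nv)
```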

VSM (Vector-Space Model)

  • Dimensionality Reduction

– Local: each category separately

  • Each dj has a different representation for each ci
  • In practice: subsets of dj’s original representation*

– Global: all categories are reduced in the same way
– Bases: linear algebra, information theory

  • Feature extraction vs. selection*

– New features are not a subset of the originals; not homogeneous with the originals: combine or transform the originals
SLIDE 4

VSM (Vector-Space Model)

  • Feature selection: TSR (term space reduction): proven to diminish effectiveness the least

– Document frequency (the terms which occur most frequently in the collection are most valuable)

  • Apparently contradicts the premise of tfidf that low-df terms are more informative
  • But the majority of words that occur in a corpus have extremely low df, so reduction by a factor of 10 will only prune these (which are probably insignificant within the documents they occur in as well)

– Other techniques: information gain, chi-square, correlation coefficient, etc.

  • Some improvement
  • Complexity of these techniques precludes easy interpretation of why results are better
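Reduction by document frequency amounts to keeping only the top slice of the df ranking; a hedged sketch (the function name, the `factor` parameter, and the tie-breaking rule are my own assumptions):

```python
def df_reduce(docs, factor=10):
    """Term-space reduction by document frequency: keep the terms that
    occur in the most documents, shrinking the vocabulary by `factor`.
    As the slide argues, this prunes the long tail of very-low-df terms."""
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    keep = max(1, len(df) // factor)
    # highest df first; ties broken alphabetically for determinism
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[:keep])
```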

VSM (Vector-Space Model)

  • Feature extraction: reparameterization

– “Synthesized” rather than naturally occurring features
– A way of dealing with polysemy, homonymy and synonymy

  • Term clustering

– Group words with pairwise semantic relatedness; use their “centroid” as a term

  • One way: co-occurrence/co-absence
  • Latent semantic indexing

– Combines original dimensions on the basis of co-occurrence
– Capable of educing an underlying semantics not available from the original terms

VSM (Vector-Space Model)

  • Example

– A great number of terms each contribute a small amount to the whole category: for “Demographic shifts in the U.S. with economic impact”, the text: “The nation grew to 249.6 million in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West”

  • Problems

– Sometimes the new terms are not readily interpretable
– Could eliminate an original term which was significant

  • Synthetic production of indices: relation to the type of indexing done with citations

Building the Classifier

  • Two phases

– Definition of a mapping function
– Definition of a threshold on values returned by that function

  • Methods for building the mapping function

– Parametric (training data used to estimate parameters of a probability distribution)
– Non-parametric

  • Profile-based (linear classifier): extract a vector from the training (pre-categorized) documents; use this profile to categorize documents in D according to RSV (retrieval status value)
  • Example-based: use the documents in the training set with the highest category status values as candidates for classifying documents in D

SLIDE 5

Building the Classifier

  • Parametric

– “Naïve Bayes”
– Cannot use feature selection (needs the full term space)
– As in most Bayesian learning, assumes that features are independent
– It has been shown to work well

  • Profile

– Embodies an explicit/declarative representation of the category
– Incremental or batch (on training docs)
– Most common batch method: Rocchio (wyj: weight of the term in a given document; wyi: weight for classifying dj as ci)

$$
w_{yi} \;=\; \beta\,\frac{\sum_{\{d_j \mid ca_{ij}=1\}} w_{yj}}{\bigl|\{d_j \mid ca_{ij}=1\}\bigr|} \;+\; \gamma\,\frac{\sum_{\{d_j \mid ca_{ij}=0\}} w_{yj}}{\bigl|\{d_j \mid ca_{ij}=0\}\bigr|}, \qquad \beta + \gamma = 1,\; \beta \ge 0,\; \gamma \le 0
$$
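As a sketch, the batch Rocchio computation might be implemented as below (the function name, the dict-based sparse vectors, and the default β = 1.25, γ = −0.25 chosen to satisfy β + γ = 1 are illustrative assumptions, not values from the papers):

```python
def rocchio_profile(docs, labels, beta=1.25, gamma=-0.25):
    """Batch Rocchio profile for one category: beta times the centroid of
    the positive examples plus gamma times the centroid of the negatives
    (gamma <= 0, so negative examples push weights down).
    docs: sparse term->weight dicts; labels: 1 (in category) / 0 (not)."""
    pos = [d for d, y in zip(docs, labels) if y == 1]
    neg = [d for d, y in zip(docs, labels) if y == 0]
    profile = {}
    for group, coeff in ((pos, beta), (neg, gamma)):
        if not group:
            continue
        for d in group:
            for t, w in d.items():
                profile[t] = profile.get(t, 0.0) + coeff * w / len(group)
    return profile
```

A document is then classified under the category when its RSV against this profile vector clears the learned threshold.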

Building the Classifier

  • Profile-based, cont’d

– In general, rewards closeness to the positive centroid, distance from the negative centroid
– Produces understandable classifiers (amenable to human tuning)
– However, since it is linearly averaged, there are only two subspaces (n-spheres), so this risks excluding most of the positive training examples

  • EBL classifiers

– Not explicit, declarative
– Use k-NN: look at the k training documents most similar to dj to see if they have been classified under ci; a threshold value determines the decision

Building the Classifier

  • EBL classifiers

$$
CSV_i(d_j) \;=\; \sum_{d_z \,\in\, TR_k(d_j)} RSV(d_j, d_z) \cdot ca_{iz}
$$

– TRk(dj): the set of k documents dz for which RSV(dj, dz) is maximum; ca values are from the correct decision matrix
– RSV is some measure of semantic relatedness: could be probabilistic or vector-based cosine

  • Does not subdivide the space into only two subspaces
  • Efficient: O(|Tr|)
  • One variant uses 1, −1 instead of 1, 0 for ca
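A minimal rendering of this example-based CSV (a sketch: the names are mine, and the dot-product RSV is a stand-in for whatever similarity measure is actually used):

```python
def dot(u, v):
    """Plain dot product of two sparse term->weight dicts,
    used here as a simple stand-in RSV."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def csv_knn(query, training, k, rsv=dot):
    """Example-based category status value: sum RSV(query, dz) * ca
    over the k training documents dz with the highest RSV.
    training: list of (vector, ca) pairs, ca in {0, 1}."""
    scored = sorted(((rsv(query, d), ca) for d, ca in training), reverse=True)
    return sum(s * ca for s, ca in scored[:k])
```

A threshold on the returned value then decides membership in ci, exactly as with the profile-based RSV.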

Building the Classifier

  • Lam & Ho: attempts to combine profile- & example-based methods

– The k-NN algorithm is given generalized instances (GIs) instead of training documents

  • Clustering positive instances of category ci into {cli1 … cliki}
  • Extracting a profile from each cluster with a linear classifier
  • Applying k-NN to these

– Avoids the sensitivity of k-NN to noise, but exploits its superiority over linear classifiers
SLIDE 6

Evaluating Classifiers

  • Precision = the conditional probability P(ca = 1 | a = 1), i.e., that if a random document is categorized under c, this is correct [soundness]
  • Recall = the conditional probability P(a = 1 | ca = 1), i.e., that if a random document should be categorized under c, it will be [completeness]
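For a single category these two probabilities reduce to simple counts over the test documents; a sketch (names are illustrative):

```python
def precision_recall(assigned, correct):
    """assigned[j] = 1 if the classifier put document j in the category;
    correct[j] = 1 if it truly belongs.  Precision is the fraction of
    assignments that are right (soundness); recall is the fraction of
    true members that were found (completeness)."""
    tp = sum(a and c for a, c in zip(assigned, correct))  # true positives
    precision = tp / sum(assigned) if sum(assigned) else 0.0
    recall = tp / sum(correct) if sum(correct) else 0.0
    return precision, recall
```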

  • Subjective: based on user expectation
  • Best performers: k-NN, neural nets, regression-based methods

Document Indexing

  • Linked to the classification problem
  • (At least) two applications:

(a) Find two documents which are similar
(b) Given a user-defined query, return all “relevant” documents

  • In the broadest sense, VSM is indexing
  • Two questions:

– Is a technique used for (a) necessarily effective for (b)?
– Is the technique used for (a) rich enough?

Document Indexing

  • Questions posed by Bradshaw et al.

– Does vector-space indexing deliver “relevant” documents?
– What kinds of queries are most common? What does this suggest about indexing?
– How to deal with ambiguity in simple queries?

Citation Indexing

  • CiteSeer:

– Scientific citation index
– Uses a variety of different techniques for “document similarity”

  • Levenshtein distance (the number of insertions, deletions and substitutions needed to transform one string into another)
  • Tfidf

– Applies these to titles, content, metadata
– Includes citations in common as a measure of similarity
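Levenshtein distance has a standard dynamic-programming implementation; a compact sketch:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform string a into string b,
    computed row by row over the edit-distance table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute (free if equal)
        prev = cur
    return prev[-1]
```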

SLIDE 7

Citation Indexing

  • Rosetta:

– Uses citation text to provide the indices, rather than the content of the article
– Closer to the “brief” and “incomplete” way people tend to formulate information needs
– Uses individual terms as well as phrases
– Potential exponential explosion avoided by limiting to 2- or 3-word phrases
– Indices are weighted using a metric similar to tfidf
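The 2-/3-word phrase restriction can be sketched as plain contiguous n-gram extraction over the citing text (an illustration of the idea, not Rosetta’s actual code):

```python
def citation_phrases(citation, max_len=3):
    """Candidate index terms from a citation sentence: the individual
    words plus every contiguous 2- and 3-word phrase.  Capping the
    phrase length avoids the exponential blow-up of arbitrary phrases."""
    words = citation.lower().split()
    phrases = set(words)
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            phrases.add(" ".join(words[i:i + n]))
    return phrases
```

Each extracted phrase would then be weighted with the tfidf-like metric before being added to the index.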

Citation Indexing

  • Rosetta:

– Phrases are used to try to disambiguate
– Assumption: a longer (more specific) phrase more often produces a better match than the simple conjunction of single terms
– When the information need is underspecified, the system tries to get user feedback

  • Builds a directory by matching queries to other terms
  • Presents these to the user for browsing

Citation Indexing

Issues raised by Rosetta

  • Practical questions:

– No in-depth study as of yet; their experiment using abstracts on a small sample may not be representative
– Need a control/comparative study
– Do we need to compensate for the lack of indices on texts which have few citations? Combine this with vector-space indexing?

  • The implications of textual substitution

From CompLit to CompSci: A philologist’s musings on the rhetoric of indexing

SLIDE 8

Metaphor vs. Metonymy (< Roman Jakobson)

Metonymy

  • Example: “the crown of France” [= Louis]
  • Displacement by contact, contiguity, containment
  • Horizontal, syntagmatic
  • “in præsentia” (perceived)
  • Magic: by contact/contamination
  • Aphasia: syntactical
  • Periphrasis by abbreviation, expansion
  • Elaboration (explanatio)
  • Synecdoche -- part for whole or whole for part -- is a type of metonymy

Metaphor

  • Example: “the king’s right arm” [= Richelieu]
  • Displacement by similarity, parallelism, contrast
  • Vertical, paradigmatic
  • “in absentia” (constructed)
  • Magic: by imitation/replication
  • Aphasia: lexical
  • Periphrasis by synonymy, antonymy, homonymy
  • Translation (translatio)

Metaphor vs. Metonymy

  • VSM is a metonymic representation of a text
  • Citations

– May not work in an entirely metonymic way: users paraphrase the text; this may also involve metaphorical transformations

  • Latent semantic indexing:

– Can it capture metaphorical knowledge?*

Metaphor vs. Metonymy

  • Can this kind of information simply be lexicalized?

– Metaphor and synonymy are similar, but can the former be reduced to the latter? Or is the latter a special case of the former?
– Metaphor is “paradigmatic”; but where/how does one codify the paradigms?

  • A lexicon of concepts? How big is that space?
  • Dynamic, not static? Better learned than engineered?

– Content vs. metacontent:

  • It is easy to tell what the “content”* of a text is; where do you draw the boundaries for metacontent?

Text & Metatext

[Diagram: supertext / text / intertext, intertext … / extratext]
supertext = culturo-linguistic superset of possibilities
SLIDE 9

Text & Metatext

  • How does Rosetta achieve this?

– By drawing on a second set of texts which relate to the first

  • Two angles:

– Add this to the system through interaction with human agents
– Build it in somehow

  • Questions:

– Related to the problem of reminding?
– Does this kind of knowledge necessarily reside in the relation between texts?
– What is the rôle of textual (re-)production?

  • How can this process best be emulated/captured by artificial agents?
  • Active synthesis vs. passive analysis?

Interaction with Human Agents

  • Case-based approaches to “knowledge navigation” [Hammond, Burke, Schmitt (1994)]

– Find-me agents
– Example: you want another video “like” Back to the Future

  • BttF II? A Michael J. Fox movie? Crocodile Dundee? Time after Time? Who Framed Roger Rabbit?
  • Some features of this are very similar to the problems of textual hermeneutics we have touched on

Interaction with Human Agents

  • Many choices
  • New elements cannot be generated ad hoc by the system or the user
  • The space is defined by a vocabulary of features that may not be accessible to the user {“stranger in a strange land”}
  • The user discovers new examples in the course of the search
  • The user discovers the features that define the domain in the course of the search
  • While they may not be able to fully articulate constraints, users can comment on examples
  • Examples are “traded”
  • Basic idea: allow the user to suggest changes, then retrieve further examples

Interaction with Human Agents

  • One important idea in this system: supporting non-hierarchical searching

– Searching is not “narrowing”
– A space of possibilities which is dynamically changing
– Use of subagents to guide the user

  • Video navigator:

– A class of problems where specifications are complex and not fully explicit
– Exploits the relationship between browsing and CBR adaptation

  • Adaptation is done in response to failure or a gap between the goal and the retrieved plan; a new plan is created
  • Here, a new prototype is produced which can be used to construct new indices into a fixed case base

SLIDE 10

Interaction with Human Agents

  • Rosetta shares with Find-me systems a reliance on human agents to provide direction, though in Rosetta it is static and text-based, while in Find-me it is dynamic and case-based
  • A reasonable approach for the goals they set themselves

Text &/vs Case

  • Wilson & Bradshaw, “CBR Textuality” (2000)
  • Strongly vs. weakly textual contexts

(a) Raw case information is entirely or predominantly textual, vs.
(b) Textual information as further support for knowledge-rich contexts

(a’) Transform texts totally or partially into articulated cases

  • Example: legal decisions

(b’) Use IR techniques to refine case retrieval

  • Simple textual distance measures, which are relatively weak, can be used to enhance knowledge-based reasoning
  • Example: architectural annotations on a design

Text &/vs Case

  • DRAMA (Leake & Wilson 1999)

– CBR + “concept mapping”:

  • Tries to capture the designer’s conceptualizations of the designs
  • Annotations concerning rationales for choosing individual components in aircraft
  • Also allows the designer to provide additional criteria to be matched against annotations

– Where Rosetta uses paratexts to classify texts, DRAMA uses paratexts to aid in case retrieval

Text &/vs Case

  • More on the TCBR “continuum”

– TCBR means that the components were at origin textual, whatever degree of processing is eventually involved
– TCBR assumes that the presence of the text is in some way essential, either as an end in itself or as difficult to encode differently; in other words, the text is not just the result of a poor representational choice

fully structured cases ↔ semi-structured cases ↔ fully textual cases

SLIDE 11

Text &/vs Case

  • Fully textual: a single textual component (which could be represented in various ways, such as by a term vector)
  • Fully structured: no textual components
  • “Strongly textual CBR”: the importance of a text exceeds the importance of the available knowledge-rich context
  • “Weakly textual CBR”: the importance of a text is exceeded by the importance of the available knowledge-rich context
  • Wrt the individual textual feature: the more additional case context is available, the less need for textually-based discrimination power, and vice versa

Text &/vs Case

  • When augmenting fully structured cases with textual info, the functionality of similarity metrics is extended
  • When transforming text into structured cases:

– Information extraction
– Turn single text components into multiple components

  • Questions raised:

– Could a set of multiple contextualized textual features perform as well as knowledge-rich representations?
– Is it possible to know when enough contextual info has been gained in order to optimize retrieval/adaptation efficiency?

Text &/vs Case

  • DRAMA was actually updated with methods developed in Rosetta: small-text processing is a good match for weakly-textual requirements
  • Integration of standard IR techniques with nearest-neighbor methods:

– VSM
– Tfidf
– Use of noun phrases
– Weighting
– Cosine distance

Text &/vs Case

  • Some final questions:

– Text components incorporated into introspective reasoning?
– For TCBR: a dynamic system in which adjustments are made to the structuring of textual components according to needs? Or even retextualizing the structured components under certain conditions?
– Is there an analogy, in fully structured case bases, to my notions of metaphoricity and “paratextuality” as we have seen them in IR? That is, is there a kind of similarity between cases not accessible through normal indexing/retrieval, but which might somehow be encoded by a set of “paracases”?

SLIDE 12

Bibliography

  • Fabrizio Sebastiani, “A Tutorial on Automated Text Categorization”
  • Kurt D. Bollacker, Steve Lawrence & C. Lee Giles, “CiteSeer: An Autonomous Web Agent for Automatic Retrieval and Identification of Interesting Publications”
  • Chao Yang Ho & Wai Lam, “Automatic Discovery of Document Classification Knowledge from Text Databases”
  • Kristian J. Hammond, Robin Burke & Kathryn Schmitt, “A Case-Based Approach to Knowledge Navigation”
  • David Leake & David Wilson, “Combining CBR with Interactive Knowledge Acquisition, Manipulation & Reuse”
  • David C. Wilson & Shannon Bradshaw, “CBR Textuality”. Available at http://citeseer.nj.nec.com