

SLIDE 1

Guiding People to Information: Providing an Interface to a Digital Library Using Reference as a Basis for Indexing

  • S. Bradshaw, A. Scheinkman & K. Hammond

Context

  • Information Retrieval Systems and Machine Learning

– ML techniques/algorithms used in IR
– IR applied to ML, esp. CBR

  • User feedback in learning systems

Larger Issues Raised

  • Indexing and textual classification:

– As specific to text-based knowledge systems
– As relevant to automated knowledge acquisition in general
– Implications as a cognitive model

Plan for Discussion

  • Background on IR
  • ML in document classification
  • Citation indexing

– CiteSeer, Rosetta

  • Indexing from the perspective of rhetorical theory
  • Practical & theoretical aspects of the underlying cognitive problem of textual indexing & classification
  • CBR & texts; CBR “textuality”
SLIDE 2

Automated Text Categorization

  • Task: assign a value (usually {0, 1}) to each entry in a decision matrix:

$$
\begin{array}{c|ccccc}
       & d_0    & \dots & d_j    & \dots & d_n    \\ \hline
c_0    & a_{00} & \dots & a_{0j} & \dots & a_{0n} \\
\vdots & \vdots &       & \vdots &       & \vdots \\
c_i    & a_{i0} & \dots & a_{ij} & \dots & a_{in} \\
\vdots & \vdots &       & \vdots &       & \vdots \\
c_m    & a_{m0} & \dots & a_{mj} & \dots & a_{mn}
\end{array}
$$

  • Categories are labels (no access to meaning)
  • Attribution is content-based (no metadata)
  • Constraints can differ wrt cardinalities of the assignment

Automated Text Categorization

  • CPC vs DPC:

– Category-pivoted categorization

  • One row at a time
  • When categories added dynamically

– Document-pivoted categorization

  • One column at a time
  • When documents added over a long period of time

  • Assignment vs “relevance”

– The latter is subjective
– It is largely the same as the notion of relevance to an information need

Automated Text Categorization

  • Applications

– Automatic indexing for IR using a controlled dictionary; usually performed by experts; indices = categories
– Classified ads
– Document filtering (e.g., Reuters/AP)
– WSD (word sense disambiguation)

  • Word occurrence contexts = docs, senses = categories
  • Itself important as an indexing technique

– Web-page categorization (e.g., Yahoo)

Document Categorization & ML

  • Earliest efforts (’80s): knowledge engineering (manually building an expert system) using rules

– Example: CONSTRUE (for Reuters)
– Typical problem: “knowledge acquisition bottleneck” (updating)

  • More recently (’90s): ML approach

– Construct not the classifier, but the builder of classifiers
– Variety of approaches (both inductive & lazy)

SLIDE 3

VSM (Vector-Space Model)

  • Vector of n weighted index terms: “bag of words”

– More sophisticated approaches based on noun phrases

  • Linguistic vs. statistical notion of phrase
  • Results not conclusive

VSM (Vector-Space Model)

  • Standard model: tfidf (term frequency × inverse document frequency) = #(tk, dj) · log(|Tr| / #(tk))

  • Assumptions:

– The more often a term occurs in a document, the more representative it is
– The more documents a term occurs in, the less discriminating it is
– Is this always an appropriate model? Are “representative” and “significant” the same?
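The tfidf weighting just defined can be computed directly from the definition; a minimal Python sketch (the function name and the dict-based sparse vectors are illustrative conveniences, not taken from any of the papers discussed):

```python
import math

def tfidf(docs):
    """Weight each term of each document by tf * log(|Tr| / df),
    following the slide's definition #(tk, dj) * log(|Tr| / #(tk))."""
    n = len(docs)
    # document frequency: in how many documents does each term occur?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            w[term] = w.get(term, 0) + 1       # raw term frequency in this doc
        for term in w:
            w[term] *= math.log(n / df[term])  # damp terms common to the collection
        weights.append(w)
    return weights
```

Note that a term occurring in every training document gets weight log(1) = 0, reflecting the second assumption above: it discriminates nothing.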

VSM (Vector-Space Model)

  • Pre-processing:

– Removal of function words
– Stemming

  • “Distance”: based on the dot product of vectors
  • Dimensionality problem

– IR cosine matching scales well, but other learning algorithms used for classifier induction do not
– DR: dimensionality reduction (also reduces overfitting)
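The dot-product “distance” is usually normalized into the cosine match; a small sketch over sparse dict vectors (names are my own):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term->weight vectors:
    the standard VSM matching score, length-normalized so that
    document size does not dominate."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (nu * nv)
```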

VSM (Vector-Space Model)

  • Dimensionality Reduction

– Local: each category separately

  • Each dj has a different representation for each ci
  • In practice: subsets of dj’s original representation*

– Global: all categories are reduced in the same way
– Bases: linear algebra, information theory

  • Feature extraction vs. selection*

– New features are not a subset of the originals; not homogeneous with the originals: combine or transform the originals
SLIDE 4

VSM (Vector-Space Model)

  • Feature selection: TSR (term space reduction): proven to diminish effectiveness the least

– Document frequency (the terms which occur most frequently in the collection are most valuable)

  • Apparently contradicts the premise of tfidf that low-df terms are more informative
  • But the majority of words that occur in a corpus have extremely low df, so reduction by a factor of 10 will only prune these (which are probably insignificant within the documents they occur in as well)

– Other techniques: information gain, chi-square, correlation coefficient, etc.

  • Some improvement
  • Complexity of these techniques precludes easy interpretation of why results are better
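Reduction by document frequency amounts to keeping only the top slice of the df ranking; a hedged sketch (the function name, the `factor` parameter, and the tie-breaking rule are my own assumptions):

```python
def df_reduce(docs, factor=10):
    """Term-space reduction by document frequency: keep the terms that
    occur in the most documents, shrinking the vocabulary by `factor`.
    As the slide argues, this prunes the long tail of very-low-df terms."""
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    keep = max(1, len(df) // factor)
    # highest df first; ties broken alphabetically for determinism
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[:keep])
```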

VSM (Vector-Space Model)

  • Feature extraction: reparameterization

– “Synthesized” rather than naturally occurring features
– A way of dealing with polysemy, homonymy and synonymy

  • Term clustering

– Group words with pairwise semantic relatedness; use their “centroid” as a term

  • One way: co-occurrence/co-absence
  • Latent semantic indexing

– Combines original dimensions on the basis of co-occurrence
– Capable of educing an underlying semantics not available from the original terms

VSM (Vector-Space Model)

  • Example

– A great number of terms each contribute a small amount to the whole category: for “Demographic shifts in the U.S. with economic impact”, the text: “The nation grew to 249.6 million in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West”

  • Problems

– Sometimes the new terms are not readily interpretable
– Could eliminate an original term which was significant

  • Synthetic production of indices: relation to the type of indexing done with citations

Building the Classifier

  • Two phases

– Definition of a mapping function
– Definition of a threshold on values returned by that function

  • Methods for building the mapping function

– Parametric (training data used to estimate parameters of a probability distribution)
– Non-parametric

  • Profile-based (linear classifier): extract a vector from the training (pre-categorized) documents; use this profile to categorize documents in D according to RSV (retrieval status value)
  • Example-based: use the documents in the training set with the highest category status values as candidates for classifying documents in D

SLIDE 5

Building the Classifier

  • Parametric

– “Naïve Bayes”
– Cannot use feature selection (needs the full term space)
– As in most Bayesian learning, assumes that features are independent
– It has been shown to work well

  • Profile

– Embodies an explicit/declarative representation of the category
– Incremental or batch (on training docs)
– Most common batch method: Rocchio (wyj: weight of the term in a given document; wyi: weight for classifying dj as ci)

$$
w_{yi} \;=\; \beta\,\frac{\sum_{\{d_j \mid ca_{ij}=1\}} w_{yj}}{\bigl|\{d_j \mid ca_{ij}=1\}\bigr|} \;+\; \gamma\,\frac{\sum_{\{d_j \mid ca_{ij}=0\}} w_{yj}}{\bigl|\{d_j \mid ca_{ij}=0\}\bigr|}, \qquad \beta + \gamma = 1,\; \beta \ge 0,\; \gamma \le 0
$$
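As a sketch, the batch Rocchio computation might be implemented as below (the function name, the dict-based sparse vectors, and the default β = 1.25, γ = −0.25 chosen to satisfy β + γ = 1 are illustrative assumptions, not values from the papers):

```python
def rocchio_profile(docs, labels, beta=1.25, gamma=-0.25):
    """Batch Rocchio profile for one category: beta times the centroid of
    the positive examples plus gamma times the centroid of the negatives
    (gamma <= 0, so negative examples push weights down).
    docs: sparse term->weight dicts; labels: 1 (in category) / 0 (not)."""
    pos = [d for d, y in zip(docs, labels) if y == 1]
    neg = [d for d, y in zip(docs, labels) if y == 0]
    profile = {}
    for group, coeff in ((pos, beta), (neg, gamma)):
        if not group:
            continue
        for d in group:
            for t, w in d.items():
                profile[t] = profile.get(t, 0.0) + coeff * w / len(group)
    return profile
```

A document is then classified under the category when its RSV against this profile vector clears the learned threshold.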

Building the Classifier

  • Profile-based, cont’d

– In general, rewards closeness to the positive centroid, distance from the negative centroid
– Produces understandable classifiers (amenable to human tuning)
– However, since it is linearly averaged, there are only two subspaces (n-spheres), so this risks excluding most of the positive training examples

  • EBL classifiers

– Not explicit, declarative
– Use k-NN: look at the k training documents most similar to dj to see if they have been classified under ci; a threshold value determines the decision

Building the Classifier

  • EBL classifiers

$$
CSV_i(d_j) \;=\; \sum_{d_z \,\in\, TR_k(d_j)} RSV(d_j, d_z) \cdot ca_{iz}
$$

– TRk(dj): the set of k documents dz for which RSV(dj, dz) is maximum; ca values are from the correct decision matrix
– RSV is some measure of semantic relatedness: could be probabilistic or vector-based cosine

  • Does not subdivide the space into only two subspaces
  • Efficient: O(|Tr|)
  • One variant uses 1, −1 instead of 1, 0 for ca
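A minimal rendering of this example-based CSV (a sketch: the names are mine, and the dot-product RSV is a stand-in for whatever similarity measure is actually used):

```python
def dot(u, v):
    """Plain dot product of two sparse term->weight dicts,
    used here as a simple stand-in RSV."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def csv_knn(query, training, k, rsv=dot):
    """Example-based category status value: sum RSV(query, dz) * ca
    over the k training documents dz with the highest RSV.
    training: list of (vector, ca) pairs, ca in {0, 1}."""
    scored = sorted(((rsv(query, d), ca) for d, ca in training), reverse=True)
    return sum(s * ca for s, ca in scored[:k])
```

A threshold on the returned value then decides membership in ci, exactly as with the profile-based RSV.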

Building the Classifier

  • Lam & Ho: attempts to combine profile- & example-based methods

– The k-NN algorithm is given generalized instances (GIs) instead of training documents

  • Clustering positive instances of category ci into {cli1 … cliki}
  • Extracting a profile from each cluster with a linear classifier
  • Applying k-NN to these

– Avoids the sensitivity of k-NN to noise, but exploits its superiority over linear classifiers
SLIDE 6

Evaluating Classifiers

  • Precision = the conditional probability P(ca = 1 | a = 1), i.e., that if a random document is categorized under c, this is correct [soundness]
  • Recall = the conditional probability P(a = 1 | ca = 1), i.e., that if a random document should be categorized under c, it will be [completeness]
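For a single category these two probabilities reduce to simple counts over the test documents; a sketch (names are illustrative):

```python
def precision_recall(assigned, correct):
    """assigned[j] = 1 if the classifier put document j in the category;
    correct[j] = 1 if it truly belongs.  Precision is the fraction of
    assignments that are right (soundness); recall is the fraction of
    true members that were found (completeness)."""
    tp = sum(a and c for a, c in zip(assigned, correct))  # true positives
    precision = tp / sum(assigned) if sum(assigned) else 0.0
    recall = tp / sum(correct) if sum(correct) else 0.0
    return precision, recall
```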

  • Subjective: based on user expectation
  • Best performers: k-NN, neural nets, regression-based methods

Document Indexing

  • Linked to the classification problem
  • (At least) two applications:

(a) Find two documents which are similar
(b) Given a user-defined query, return all “relevant” documents

  • In the broadest sense, VSM is indexing
  • Two questions:

– Is a technique used for (a) necessarily effective for (b)?
– Is the technique used for (a) rich enough?

Document Indexing

  • Questions posed by Bradshaw et al.

– Does vector-space indexing deliver “relevant” documents?
– What kinds of queries are most common? What does this suggest about indexing?
– How to deal with ambiguity in simple queries?

Citation Indexing

  • CiteSeer:

– Scientific citation index
– Uses a variety of different techniques for “document similarity”

  • Levenshtein distance (the number of insertions, deletions and substitutions needed to transform one string into another)
  • Tfidf

– Applies these to titles, content, metadata
– Includes citations in common as a measure of similarity
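Levenshtein distance has a standard dynamic-programming implementation; a compact sketch:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform string a into string b,
    computed row by row over the edit-distance table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute (free if equal)
        prev = cur
    return prev[-1]
```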

SLIDE 7

Citation Indexing

  • Rosetta:

– Uses citation text to provide the indices, rather than the content of the article
– Closer to the “brief” and “incomplete” way people tend to formulate information needs
– Uses individual terms as well as phrases
– Potential exponential explosion avoided by limiting to 2- or 3-word phrases
– Indices are weighted using a metric similar to tfidf
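The 2-/3-word phrase restriction can be sketched as plain contiguous n-gram extraction over the citing text (an illustration of the idea, not Rosetta’s actual code):

```python
def citation_phrases(citation, max_len=3):
    """Candidate index terms from a citation sentence: the individual
    words plus every contiguous 2- and 3-word phrase.  Capping the
    phrase length avoids the exponential blow-up of arbitrary phrases."""
    words = citation.lower().split()
    phrases = set(words)
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            phrases.add(" ".join(words[i:i + n]))
    return phrases
```

Each extracted phrase would then be weighted with the tfidf-like metric before being added to the index.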

Citation Indexing

  • Rosetta:

– Phrases are used to try to disambiguate
– Assumption: a longer (more specific) phrase more often produces a better match than the simple conjunction of single terms
– When the information need is underspecified, the system tries to get user feedback

  • Builds a directory by matching queries to other terms
  • Presents these to the user for browsing

Citation Indexing

Issues raised by Rosetta

  • Practical questions:

– No in-depth study as of yet; their experiment using abstracts on a small sample may not be representative
– Need a control/comparative study
– Do we need to compensate for the lack of indices on texts which have few citations? Combine this with vector-space indexing?

  • The implications of textual substitution

From CompLit to CompSci: A philologist’s musings on the rhetoric of indexing

SLIDE 8

Metaphor vs. Metonymy (< Roman Jakobson)

Metonymy

  • Example: “the crown of France” [= Louis]
  • Displacement by contact, contiguity, containment
  • Horizontal, syntagmatic
  • “in præsentia” (perceived)
  • Magic: by contact/contamination
  • Aphasia: syntactical
  • Periphrasis by abbreviation, expansion
  • Elaboration (explanatio)
  • Synecdoche -- part for whole or whole for part -- is a type of metonymy

Metaphor

  • Example: “the king’s right arm” [= Richelieu]
  • Displacement by similarity, parallelism, contrast
  • Vertical, paradigmatic
  • “in absentia” (constructed)
  • Magic: by imitation/replication
  • Aphasia: lexical
  • Periphrasis by synonymy, antonymy, homonymy
  • Translation (translatio)

Metaphor vs. Metonymy

  • VSM is a metonymic representation of a text
  • Citations

– May not work in an entirely metonymic way: users paraphrase the text; this may also involve metaphorical transformations

  • Latent semantic indexing:

– Can it capture metaphorical knowledge?*

Metaphor vs. Metonymy

  • Can this kind of information simply be lexicalized?

– Metaphor and synonymy are similar, but can the former be reduced to the latter? Or is the latter a special case of the former?
– Metaphor is “paradigmatic”; but where/how does one codify the paradigms?

  • A lexicon of concepts? How big is that space?
  • Dynamic, not static? Better learned than engineered?

– Content vs. metacontent:

  • It is easy to tell what the “content”* of a text is; where do you draw the boundaries for metacontent?

Text & Metatext

[Diagram: supertext / text / intertext, intertext … / extratext]
supertext = culturo-linguistic superset of possibilities
SLIDE 9

Text & Metatext

  • How does Rosetta achieve this?

– By drawing on a second set of texts which relate to the first

  • Two angles:

– Add this to the system through interaction with human agents
– Build it in somehow

  • Questions:

– Related to the problem of reminding?
– Does this kind of knowledge necessarily reside in the relation between texts?
– What is the rôle of textual (re-)production?

  • How can this process best be emulated/captured by artificial agents?
  • Active synthesis vs. passive analysis?

Interaction with Human Agents

  • Case-based approaches to “knowledge navigation” [Hammond, Burke, Schmitt (1994)]

– Find-me agents
– Example: you want another video “like” Back to the Future

  • BttF II? A Michael J. Fox movie? Crocodile Dundee? Time after Time? Who Framed Roger Rabbit?
  • Some features of this are very similar to the problems of textual hermeneutics we have touched on

Interaction with Human Agents

  • Many choices
  • New elements cannot be generated ad hoc by the system or the user
  • The space is defined by a vocabulary of features that may not be accessible to the user {“stranger in a strange land”}
  • The user discovers new examples in the course of the search
  • The user discovers the features that define the domain in the course of the search
  • While they may not be able to fully articulate constraints, users can comment on examples
  • Examples are “traded”
  • Basic idea: allow the user to suggest changes, then retrieve further examples

Interaction with Human Agents

  • One important idea in this system: supporting non-hierarchical searching

– Searching is not “narrowing”
– A space of possibilities which is dynamically changing
– Use of subagents to guide the user

  • Video navigator:

– A class of problems where specifications are complex and not fully explicit
– Exploits the relationship between browsing and CBR adaptation

  • Adaptation is done in response to failure or a gap between the goal and the retrieved plan; a new plan is created
  • Here, a new prototype is produced which can be used to construct new indices into a fixed case base

SLIDE 10

Interaction with Human Agents

  • Rosetta shares with Find-me systems a reliance on human agents to provide direction, though in Rosetta it is static and text-based, while in Find-me it is dynamic and case-based
  • A reasonable approach for the goals they set themselves

Text &/vs Case

  • Wilson & Bradshaw, “CBR Textuality” (2000)
  • Strongly vs. weakly textual contexts

(a) Raw case information is entirely or predominantly textual, vs.
(b) Textual information as further support for knowledge-rich contexts

(a’) Transform texts totally or partially into articulated cases

  • Example: legal decisions

(b’) Use IR techniques to refine case retrieval

  • Simple textual distance measures, which are relatively weak, can be used to enhance knowledge-based reasoning
  • Example: architectural annotations on a design

Text &/vs Case

  • DRAMA (Leake & Wilson 1999)

– CBR + “concept mapping”:

  • Tries to capture the designer’s conceptualizations of the designs
  • Annotations concerning rationales for choosing individual components in aircraft
  • Also allows the designer to provide additional criteria to be matched against annotations

– Where Rosetta uses paratexts to classify texts, DRAMA uses paratexts to aid in case retrieval

Text &/vs Case

  • More on the TCBR “continuum”

– TCBR means that the components were at origin textual, whatever degree of processing is eventually involved
– TCBR assumes that the presence of the text is in some way essential, either as an end in itself or as difficult to encode differently; in other words, the text is not just the result of a poor representational choice

fully structured cases ↔ semi-structured cases ↔ fully textual cases

SLIDE 11

Text &/vs Case

  • Fully textual: a single textual component (which could be represented in various ways, such as by a term vector)
  • Fully structured: no textual components
  • “Strongly textual CBR”: the importance of a text exceeds the importance of the available knowledge-rich context
  • “Weakly textual CBR”: the importance of a text is exceeded by the importance of the available knowledge-rich context
  • Wrt the individual textual feature: the more additional case context is available, the less need for textually-based discrimination power, and vice versa

Text &/vs Case

  • When augmenting fully structured cases with textual info, the functionality of similarity metrics is extended
  • When transforming text into structured cases:

– Information extraction
– Turn single text components into multiple components

  • Questions raised:

– Could a set of multiple contextualized textual features perform as well as knowledge-rich representations?
– Is it possible to know when enough contextual info has been gained in order to optimize retrieval/adaptation efficiency?

Text &/vs Case

  • DRAMA was actually updated with methods developed in Rosetta: small-text processing is a good match for weakly-textual requirements
  • Integration of standard IR techniques with nearest-neighbor methods:

– VSM
– Tfidf
– Use of noun phrases
– Weighting
– Cosine distance

Text &/vs Case

  • Some final questions:

– Text components incorporated into introspective reasoning?
– For TCBR: a dynamic system in which adjustments are made to the structuring of textual components according to needs? Or even retextualizing the structured components under certain conditions?
– Is there an analogy, in fully structured case bases, to my notions of metaphoricity and “paratextuality” as we have seen them in IR? That is, is there a kind of similarity between cases not accessible through normal indexing/retrieval, but which might somehow be encoded by a set of “paracases”?

SLIDE 12

Bibliography

  • Fabrizio Sebastiani, “A Tutorial on Automated Text Categorization”
  • Kurt D. Bollacker, Steve Lawrence & C. Lee Giles, “CiteSeer: An Autonomous Web Agent for Automatic Retrieval and Identification of Interesting Publications”
  • Chao Yang Ho & Wai Lam, “Automatic Discovery of Document Classification Knowledge from Text Databases”
  • Kristian J. Hammond, Robin Burke & Kathryn Schmitt, “A Case-Based Approach to Knowledge Navigation”
  • David Leake & David Wilson, “Combining CBR with Interactive Knowledge Acquisition, Manipulation & Reuse”
  • David C. Wilson & Shannon Bradshaw, “CBR Textuality”. Available at http://citeseer.nj.nec.com