slide-1
SLIDE 1

Finding Gene and Chemical Names in Patent Text

Supervised Learning with no Training Data

slides by Kuzman Ganchev

Finding Gene and Chemical Names in Patent Text (Ganchev)

slide-2
SLIDE 2

The present invention relates to the use of immunomodifying imidazoquinoline amines, imidazopyridine amines, 6,7-fused cycloalkylimidazopyridine amines, and 1,2-bridged imidazoquinoline amines to inhibit T helper-type 2 (TH2) immune response and thereby treat TH2 mediated diseases. It also relates to the ability of these compounds to inhibit induction of interleukin (IL)-4 and IL-5, and to suppress eosinophilia.

Source: US Patent Number 6,610,319

slide-3
SLIDE 3

1 Introduction

  • Why Tag Patents for Entities?
  • Example output
  • Demonstration

2 How the tagging was done

  • The Gene Tagger
  • The Chemical Tagger

3 Future Work

  • Much (more) Ado about Patents
  • Emerging Research Questions

slide-4
SLIDE 4

Motivation

Why do we care about patents?

  • Large collection of public information
  • Primary resource for drug publications
  • Drug patents often have long lists of chemicals that can be a good start for research (e.g. inhibitors of proteins similar to the patent protein)

slide-5
SLIDE 5

Motivation

Why would you want to tag genes/chemicals?

  • Restrict search to specific classes. E.g. “ace” as a gene name:
      • It inhibits the angiotensin converting enzyme (ACE) that catalyses the conversion of . . .
      • ACE inhibitors include benazepril, captopril, . . .
    rather than as something else:
      • . . . a wound healing agent, for example Ace Mannan (or other components of Aloe vera) . . .
      • On the next day, Lipofect Ace reagent was mixed with . . .
  • Scan a list of named entities to find out if a patent is relevant to your search.

slide-6
SLIDE 6

Motivation

What else could we do?

  • Find the chemical most often mentioned with some gene
  • Find the chemical most specific to a particular gene
  • Give me a ranked list of genes related to a chemical

slide-7
SLIDE 7

Extracted Examples: The Good

  • The compound was found to be not cross-resistant with other inhibitors of DHFR, such as pyrimethamine and cycloguanil.
  • U.S. Pat. No. 5,703,092 discloses the use of dydroxamic acid compounds and carbocyclic acids as metalloproteinase and TNF inhibitors, and in particular in treatment of arthritis and other related inflammatory diseases.
  • N,N-diethyl-m-toluamide (DEET) is an active ingredient in many insect repellants.

slide-8
SLIDE 8

Extracted Examples: The Bad

  • the N-acyl homoserine lactones, which are composed of derivatives of amino acid and fatty acid molecules.
  • at least one additional solvent chosen from C -C alkyl acetates and C -C alkyl alcohols
  • Patch (Nail treatment sheet pack)
  • 10. Immunizing kit.
  • . . . a method for the selective extraction of a . . .
  • Thus, in view of a work by Bushnell et al. (Pacific Sci. 4, 167-83 (1950)) showing . . .

slide-9
SLIDE 9

Demonstration

http://fling-l.seas.upenn.edu/~kuzman/cgi-bin/search.pl

slide-10
SLIDE 10

The Gene/Protein Tagger

  • Used Ryan McDonald’s GeneTaggerCRF mostly out of the box
  • Trained on MEDLINE abstracts (patent text is out-of-domain)
  • Reported performance on MEDLINE abstracts: 0.84 precision and 0.74 recall
  • Performance on patent text is not as good but probably good enough to be useful

slide-11
SLIDE 11

The Chemical Tagger

Available Resources

  • Large list of chemicals
  • Large list of genes
  • Smaller lists of other types of non-chemicals
  • Lots of untagged patent text in the drug domain
  • A small amount of non-expert human time

slide-12
SLIDE 12

Available Resources

List of Chemicals

  • Obtained from the aliases of PubChem compounds
  • About 8.14 × 10^5 names (nowhere near exhaustive)
  • Some names are not usually names of chemicals:
      • organisms: Wormwood, Sunflower, Bitterweed
      • colors: Wool Yellow, Whortleberry Red, Acetyl Red G
      • names: Wolfram, Wang, Sweet grass
  • Lots of complicated systematic names: 0-(3-(N-Cyclopentyl-N-methyl)aminopropyl)-2-chlorophenothiazine

slide-13
SLIDE 13

Available Resources

Other Lists

  • Gene list: 1.1 × 10^6 entries
  • Gene list: 0.15 × 10^6 entries
  • List of common newswire words: 23.9 × 10^3 entries
  • Smaller lists of person, location and organization names (thanks to Partha Talukdar)

slide-14
SLIDE 14

Available Resources

Unlabeled Patent Text

  • All patents issued in 2002-2003
  • 8,280 patents in the drugs domain
  • Most patents are on the order of 2,000-10,000 words

slide-15
SLIDE 15

The Chemical Tagger

Initial strategy

  • Use membership in the list of chemicals
  • Need to match the full multi-word name (otherwise we match e.g. “Red”)

  • Train CRF tagger only on sentences where we found chemicals
  • Very poor recall on systematic names
  • Precision is not perfect
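
The list-matching step above could look roughly like the sketch below. It is not the code behind the slides: the function names, the toy lexicon, and the B/I/O label scheme are assumptions made for illustration. It matches the longest full multi-word name first, so a lone “Red” is never labeled, and it keeps only sentences in which a chemical was found.

```python
def tag_with_lexicon(tokens, lexicon, max_len=8):
    """Label tokens B/I/O by greedy longest match against a set of
    lowercased name tuples (hypothetical representation of the chemical list)."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate span first so full multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in lexicon:
                match_len = n
                break
        if match_len:
            labels[i] = "B"
            for j in range(i + 1, i + match_len):
                labels[j] = "I"
            i += match_len
        else:
            i += 1
    return labels


def sentences_with_chemicals(sentences, lexicon):
    """Keep only (tokens, labels) pairs in which at least one name matched."""
    tagged = ((s, tag_with_lexicon(s, lexicon)) for s in sentences)
    return [(s, labels) for s, labels in tagged if "B" in labels]


# Toy lexicon for illustration only; the real list had about 8.14 x 10^5 names.
LEXICON = {("acetyl", "red", "g"), ("captopril",), ("flavone", "acetic", "acid")}
```

Because only whole entries count as matches, systematic names that are missing from the list stay unlabeled, which is consistent with the poor recall noted above.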

slide-16
SLIDE 16

The Chemical Tagger

Next Step

  • Create a simple character language model to distinguish words that sound like chemicals

  • Label words that sound a lot like chemicals as chemicals
  • Improves performance on systematic names

slide-17
SLIDE 17

Chemical Language Model

Generative Incarnation

  • Naive Bayes model classifies each word
  • Features were character 3-grams

Pro: Easy to implement

Con: Some 3-grams were poorly estimated (so smoothing required)
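
A minimal sketch of such a word classifier follows. It assumes add-alpha smoothing, "^" and "$" as word-boundary markers, and the class and method names shown; the slides do not say which smoothing or padding was actually used.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Character n-grams of a lowercased word padded with boundary markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesWordClassifier:
    """Two-class Naive Bayes over character 3-grams (hypothetical sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha smoothing constant (assumption)
        self.counts = {"chem": Counter(), "other": Counter()}
        self.totals = {"chem": 0, "other": 0}
        self.words = {"chem": 0, "other": 0}
        self.vocab = set()

    def train(self, words, label):
        for word in words:
            self.words[label] += 1
            for gram in char_ngrams(word):
                self.counts[label][gram] += 1
                self.totals[label] += 1
                self.vocab.add(gram)

    def log_score(self, word, label):
        prior = math.log(self.words[label] / sum(self.words.values()))
        v = len(self.vocab) + 1  # one extra slot for unseen 3-grams
        denom = self.totals[label] + self.alpha * v
        return prior + sum(
            math.log((self.counts[label][gram] + self.alpha) / denom)
            for gram in char_ngrams(word)
        )

    def p_chem(self, word):
        a = self.log_score(word, "chem")
        b = self.log_score(word, "other")
        m = max(a, b)  # subtract the max before exponentiating for stability
        return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))
```

Without the smoothing term, any 3-gram unseen in one class would drive that class's probability to zero, which is the estimation problem the "con" above refers to.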

slide-18
SLIDE 18

Chemical Language Model

Discriminative Incarnation

  • Maximum entropy model classifies each word
  • Features were 2, 3 and 4 character N-grams
  • Now we get backoff for free
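
One way to read "backoff for free": every word fires its 2-, 3- and 4-gram features together, so when a 4-gram was never seen in training, the shorter n-grams inside it still carry weight. The sketch below is a binary maximum entropy (logistic regression) classifier trained by plain stochastic gradient ascent. The learning rate, L2 penalty, epoch count and function names are assumptions, not the settings used for the slides.

```python
import math
from collections import defaultdict


def ngram_features(word, sizes=(2, 3, 4)):
    """Set of character n-grams of all requested sizes, with boundary markers."""
    padded = "^" + word.lower() + "$"
    return {padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)}


def predict(weights, word):
    """P(chemical | word) under a binary maximum entropy model."""
    z = sum(weights.get(f, 0.0) for f in ngram_features(word))
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(examples, epochs=50, lr=0.1, l2=0.01):
    """examples: list of (word, label) pairs with label 1 = chemical, 0 = other."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for word, label in examples:
            p = predict(weights, word)
            for f in ngram_features(word):
                # Gradient of the log-likelihood minus a small L2 penalty.
                weights[f] += lr * ((label - p) - l2 * weights[f])
    return weights
```

A word such as "butanol" that never appeared in training still scores high if n-grams like "ol$", "nol" and "anol" were learned from other alcohols.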

slide-19
SLIDE 19

Chemical Language Model

Training

  • Positive examples from list of chemicals:
      • 814,293 entries, 1,813,343 tokens
      • “acid” occurs 57,000 times as a token
      • “methyl” occurs 17,000 times as a token
  • Negative examples:
      • paragraphs of patent text
      • removed lines that contain common chemical 4-grams like meth, thyl, nol$ (by hand, using grep)
      • 459,694 lines, 4,720,732 tokens
      • most common non-stop words: “invention”, “means”, “present”
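
The negative-example filtering was done by hand with grep. A rough Python equivalent is sketched below, under two assumptions: the pattern list contains only the three examples named on the slide, and "$" in "nol$" is read as end-of-word rather than end-of-line.

```python
import re

# Only the three patterns named on the slide; the full hand-built list is unknown.
# "nol$" is interpreted here as "nol" at the end of a word (assumption).
CHEMICAL_NGRAMS = re.compile(r"meth|thyl|nol\b", re.IGNORECASE)


def keep_as_negative(line):
    """True if the line contains none of the chemical-looking character patterns."""
    return CHEMICAL_NGRAMS.search(line) is None


def filter_negative_lines(lines):
    """Return the lines that are safe to use as non-chemical training text."""
    return [line for line in lines if keep_as_negative(line)]
```

Note that a filter this coarse also drops lines containing ordinary words such as "method". That costs some negative data but keeps chemical names out of the negative set.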

slide-20
SLIDE 20

Chemical Language Model

Example Results

P(CHEM)     Examples
.0 − .1     length, soybean, shrink
.1 − .2     Fetal, adapter, casein
.2 − .3     Cataract, diet, extrachromasomal
.3 − .4     phytotoxicity, altrose, oral, drop
.4 − .5     lipid, anestetized, copper
.5 − .6     Serum, alba, nifedipine
.6 − .7     schema, Sephadex, thiamine
.7 − .8     acetate/hexane, proline, amphotericin
.8 − .9     proazulene, acne, enterotoxins
.9 − 1.     alicyclic, bromide, dihydroxyethylstearylamine
1. − 1.     2-(1-hydroxyethyl)-4,5-dimethyl-cyclohexanone

slide-21
SLIDE 21

Problems with References

  • Inline references are confusing for taggers
  • Gene tagger has not trained on text with long references
  • Chemical tagger often tags names of journals or authors as chemicals: . . . by Bushnell et al. (Pacific Sci. 4, 167-83 (1950))

slide-22
SLIDE 22

Problems with References

  • Solution: make a separate tagger for references

slide-23
SLIDE 23

Reference Tagger

  • Trained on 40 instances (12,814 tokens)
  • About 30 features (an extra 20,000 add about 1 point to precision/recall)

  • Tested on 25 instances (6,372 tokens)
  • About 87% token precision/recall
  • Good enough to eliminate most errors on references
  • Total time to develop: 3-4 hours
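
For reference, token-level precision and recall of the kind quoted above can be computed as in this sketch. The label name "REF" and the function name are assumptions; the slides do not describe the evaluation script.

```python
def token_precision_recall(gold, predicted, positive="REF"):
    """Token-level precision and recall for one positive label.

    gold and predicted are equal-length sequences of per-token labels.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```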

slide-24
SLIDE 24

The Chemical Tagger

Current incarnation

  • Train a CRF tagger on training data:
      • Use list membership in the chemicals and non-chemicals lists
      • Use high-confidence positive predictions of language model
      • Label text in references as non-chemical
      • One extra rule: “*-ic acid” is a chemical, except “nucleic acid”
  • Train only on instances that contain chemical names.
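
The labeling recipe above might be combined as in the following sketch. The order in which the sources override each other, the 0.9 confidence threshold, the single-token lists, and all names are assumptions. The slides list the ingredients but not how conflicts between them were resolved.

```python
def weak_labels(tokens, chem_list, nonchem_list, p_chem, in_reference, threshold=0.9):
    """Produce noisy CHEM/O training labels for one sentence.

    chem_list / nonchem_list: sets of lowercased single-token names (simplification).
    p_chem: callable returning the language model's P(chemical) for a word.
    in_reference: per-token booleans from a reference tagger.
    """
    labels = []
    for token, is_ref in zip(tokens, in_reference):
        low = token.lower()
        if is_ref or low in nonchem_list:
            labels.append("O")       # references and known non-chemicals
        elif low in chem_list or p_chem(low) >= threshold:
            labels.append("CHEM")    # list hit or high-confidence model prediction
        else:
            labels.append("O")
    # Extra rule: "*-ic acid" is a chemical, except "nucleic acid".
    for i in range(len(tokens) - 1):
        first = tokens[i].lower()
        if (tokens[i + 1].lower() == "acid" and first.endswith("ic")
                and first != "nucleic" and not in_reference[i]):
            labels[i] = labels[i + 1] = "CHEM"
    return labels


def usable_for_training(labels):
    """Train only on instances that contain at least one chemical label."""
    return "CHEM" in labels
```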

slide-25
SLIDE 25

Future work

Aggregation

As we said in the introduction, we would like to be able to, e.g.:

  • Find the chemical most often mentioned with some gene
  • Find the chemical most specific to a particular gene
  • Find a ranked list of genes related to a chemical
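
Since this is listed as future work, the sketch below is only one possible reading of the three queries. It counts patent-level co-occurrence and uses pointwise mutual information (PMI) as the "specificity" score. Both the PMI choice and the input format, one (genes, chemicals) pair of sets per patent, are assumptions.

```python
import math
from collections import Counter


def cooccurrence_stats(documents):
    """documents: iterable of (genes, chemicals) pairs, one per patent."""
    n_docs = 0
    gene_df, chem_df, pair_df = Counter(), Counter(), Counter()
    for genes, chems in documents:
        genes, chems = set(genes), set(chems)
        n_docs += 1
        gene_df.update(genes)
        chem_df.update(chems)
        pair_df.update((g, c) for g in genes for c in chems)
    return n_docs, gene_df, chem_df, pair_df


def most_frequent_chemical(gene, stats):
    """Chemical mentioned in the most patents that also mention the gene."""
    _, _, _, pair_df = stats
    cands = {c: n for (g, c), n in pair_df.items() if g == gene}
    return max(cands, key=cands.get) if cands else None


def most_specific_chemical(gene, stats):
    """Chemical with the highest PMI with the gene (assumed specificity measure)."""
    n_docs, gene_df, chem_df, pair_df = stats

    def pmi(chem):
        return math.log(pair_df[(gene, chem)] * n_docs / (gene_df[gene] * chem_df[chem]))

    cands = [c for (g, c) in pair_df if g == gene]
    return max(cands, key=pmi) if cands else None


def genes_for_chemical(chem, stats):
    """Genes ranked by how many patents mention them together with the chemical."""
    _, _, _, pair_df = stats
    ranked = [(g, n) for (g, c), n in pair_df.items() if c == chem]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Raw counts favor chemicals that appear everywhere (solvents, for instance), whereas a PMI-style score discounts them. That difference is the gap between "most often mentioned" and "most specific".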

slide-26
SLIDE 26

Future work

Normalization

All of the following (and many others) refer to the same chemical:

  • flavone acetic acid
  • Mitoflaxone
  • 2-Phenyl-8-(carboxymethyl)-benzopyran-4-one
  • 4-Oxo-2-phenyl-4H-1-benzopyran-8-acetic acid

Searching for one should bring up results for the others also.
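
A minimal sketch of alias-based normalization, assuming an alias table such as the PubChem alias list is available as a mapping from a compound identifier to its names. The identifier "compound-1" is a made-up placeholder, not a real database ID.

```python
def normalize_name(name):
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def build_alias_index(compounds):
    """compounds: mapping of compound id -> list of aliases."""
    index = {}
    for compound_id, aliases in compounds.items():
        for alias in aliases:
            index[normalize_name(alias)] = compound_id
    return index


def expand_query(query, index, compounds):
    """Return every known alias of the queried chemical, or just the query itself."""
    compound_id = index.get(normalize_name(query))
    return compounds[compound_id] if compound_id is not None else [query]


# Placeholder identifier; aliases taken from the slide.
COMPOUNDS = {
    "compound-1": [
        "flavone acetic acid",
        "Mitoflaxone",
        "2-Phenyl-8-(carboxymethyl)-benzopyran-4-one",
        "4-Oxo-2-phenyl-4H-1-benzopyran-8-acetic acid",
    ],
}
```

A search for any one alias can then be rewritten as a search for all of them. Names that are absent from the alias table are not helped by this approach.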

slide-27
SLIDE 27

Future work

And another thing...

Sometimes database names for chemicals appear (although not too often).

. . . using the anticancer drug, flavone acetic acid (FAA, NSC 347512).

It would be nice to identify these.
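
For identifiers shaped like the one in the example, a regular expression may be enough as a first pass. The sketch below covers only the "NSC" prefix followed by digits, as seen on the slide. Other databases use other formats, and the pattern is an assumption rather than a tested extractor.

```python
import re

# Matches "NSC 347512", "NSC-347512" or "NSC347512"; only this prefix is covered.
NSC_ID = re.compile(r"\bNSC[ -]?(\d+)\b")


def find_nsc_ids(text):
    """Return the numeric part of every NSC-style identifier in the text."""
    return NSC_ID.findall(text)
```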

slide-28
SLIDE 28

Future work

Initial Lofty Goals

Link extracted patents to other resources (e.g. PlasmoDB). Ideally we want to support searches like:

  • find proteins in set S (obtained from another database) that have orthologs (as defined in a third database) with n small-molecule inhibitors (from patents)
  • find proteins in set S that have no human homologue (different database) but have inhibitors (from patents)
  • find proteins in set S that have orthologs (other database) with inhibitors (from patents)

This need was the initial motivation for tagging genes and chemicals in patent text.

slide-29
SLIDE 29

Emerging Research Questions

Semi-supervised structured learning

  • Hand-tagging a few instances is not prohibitively expensive (but tagging many is)

  • Label context is very important for these problems
  • Lots of unlabeled data is available
  • List of known chemicals can be used as partial supervision
  • Lists of known non-chemicals can be used as partial supervision

slide-30
SLIDE 30

Emerging Research Questions

Transfer learning

  • The gene tagger was trained on out-of-domain data
  • Lots of unlabeled data available
  • Correcting a few mistakes by hand is feasible
  • Creating a new large training set is too expensive

slide-31
SLIDE 31

Many Thanks to

  • Fernando Pereira
  • Dhanasekaran Shanmugam
  • David Roos
  • Mark Liberman
  • many others
