
XPT 2006 XML APIs: SAX 1

2. XML Processor APIs

  • How can (Java) applications manipulate structured (XML) documents?
      – An overview of XML processor interfaces

2.1 SAX: an event-based interface
2.2 DOM: an object-based interface
2.3 JAXP: Java API for XML Processing

Document Parser Interfaces

  • Every XML application contains some kind of a parser
      – editors, browsers
      – transformation/style engines, DB loaders, ...
  • XML parsers have become standard tools of application development frameworks
      – JDK 1.4 contains JAXP, with its default parser (Apache Crimson)

(See, e.g., Leventhal, Lewis & Fuchs: Designing XML Internet Applications, Chapter 10, and D. Megginson: Events vs. Trees [online])

Tasks of a Parser

  • Document instance decomposition
      – elements, attributes, text, processing instructions, entities, ...
  • Verification
      – well-formedness checking
          » syntactical correctness of XML markup
      – validation (against a DTD or Schema; optional)
  • Access to contents of the DTD (if supported)
      – SAX 2.0 Extensions provide info of declarations: element type names and their content model expressions
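
The well-formedness check listed above can be sketched with JAXP. This is a minimal illustration, not part of the original slides: the class name `WellFormednessCheck` and the method `isWellFormed` are made up for the example, and the parser is whatever JAXP implementation the JDK provides by default.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string is well-formed XML.
class WellFormednessCheck {

    /** Returns true if the text parses without a fatal error (no validation). */
    static boolean isWellFormed(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            // DefaultHandler ignores all content events; its fatalError()
            // rethrows the SAXParseException raised for malformed markup.
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<A i=\"1\">Hi!</A>"));  // matching tags
        System.out.println(isWellFormed("<A i=\"1\">Hi!</B>"));  // mismatched end tag
    }
}
```

Validation is a separate, optional step; here the parser only checks the syntax of the markup.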

Document Parser Interfaces

I: Event-based interfaces
      – Command line and ESIS interfaces
          » Element Structure Information Set, traditional interface to stand-alone SGML parsers
      – Event call-back interfaces: SAX

II: Tree-based (object model) interfaces
      – W3C DOM Recommendation
      – Java-specific object models: JAXB, JDOM, dom4J

Command-line ESIS interface

[Figure: the application invokes a stand-alone SGML/XML parser through a command line call; the parser reads the document and returns an ESIS stream to the application.]

Input document:

    <E i="1">Hi!</E>

ESIS stream:

    (E
    Ai CDATA 1
    -Hi!
    )E

Event Call-Back Interfaces

  • Application implements a set of call-back methods for handling parse events
      – parser notifies the application by method calls
      – qualified further by parameters:
          » element type name
          » names and values of attributes
          » values of content strings, …
  • Idea behind "SAX" (Simple API for XML)
      – an industry standard API for XML parsers
      – could be thought of as "Serial Access XML"

An event call-back application

[Figure: the application's main routine calls Parse(); while reading the document, the parser invokes the application's callback routines.]

Input document:

    <?xml version='1.0'?>
    <A i="1">Hi!</A>

Callback routines and the parameters they receive:

    startDocument()
    startElement()    "A", [i="1"]
    characters()      "Hi!"
    endElement()      "A"
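
The call-back flow in the figure can be reproduced with a small handler that records each event it receives. This is a sketch under assumptions: the class name `EventLogger` and the text format of the log entries are invented for illustration; only the SAX and JAXP calls are standard.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records every callback it receives.
class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument()");
    }

    @Override
    public void startElement(String nsURI, String localName,
                             String qName, Attributes atts) {
        // The element type name and its attributes arrive as parameters.
        StringBuilder sb = new StringBuilder("startElement(): \"" + qName + "\"");
        for (int i = 0; i < atts.getLength(); i++) {
            sb.append(' ').append(atts.getQName(i))
              .append("=\"").append(atts.getValue(i)).append('"');
        }
        events.add(sb.toString());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters(): \"" + new String(ch, start, length) + "\"");
    }

    @Override
    public void endElement(String nsURI, String localName, String qName) {
        events.add("endElement(): \"" + qName + "\"");
    }

    @Override
    public void endDocument() {
        events.add("endDocument()");
    }

    /** Main routine: hands control to the parser, which calls back into this class. */
    static List<String> parse(String xml) throws Exception {
        EventLogger logger = new EventLogger();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), logger);
        return logger.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : parse("<?xml version='1.0'?><A i=\"1\">Hi!</A>")) {
            System.out.println(event);
        }
    }
}
```

The application never asks for the next event; it only supplies the methods, and the parser decides when to call them.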

Object Model Interfaces

  • The parser builds ...
      – a document object consisting of sub-objects such as document, elements, attributes, text, …
  • Abstraction level higher than in event based interfaces; more powerful access
      – to descendants, following siblings, …
  • Drawback: Higher memory consumption
      – -> used mainly in client applications (to implement document manipulation by user)

An Object-Model Based Application

[Figure: the parser parses the document and builds an in-memory document representation; the application accesses and modifies that representation.]

Input document:

    <A i="1">Hi!</A>

In-memory document representation:

    Document
      A  (i=1)
        "Hi!"
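
For comparison with the event-based style, the parse, access and modify cycle in the figure might look as follows with the W3C DOM through JAXP. This is only a sketch: the class name `DomSketch` and the helper `build` are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class for the object-model style of processing.
class DomSketch {

    /** Parse: the parser builds the whole in-memory document representation. */
    static Document build(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = build("<A i=\"1\">Hi!</A>");

        // Access: navigate the tree after parsing has finished.
        Element a = doc.getDocumentElement();
        System.out.println(a.getTagName() + " i=" + a.getAttribute("i")
            + " text=" + a.getTextContent());

        // Modify: change the tree in place.
        a.setAttribute("i", "2");
        a.setTextContent("Bye!");
        System.out.println(a.getTagName() + " i=" + a.getAttribute("i")
            + " text=" + a.getTextContent());
    }
}
```

The whole document is held in memory at once, which is the drawback noted on the previous slide.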

2.1 The SAX Event Callback API

  • A de-facto industry standard
      – Developed by members of the xml-dev mailing list
      – Version 1.0 in May 1998, Vers. 2.0 in May 2000
      – Not a parser, but a common interface for different parsers (like, say, JDBC is a common interface to various RDBs)
  • Supported directly by major XML parsers
      – many Java based, and free; Examples: Apache Xerces, Oracle's XML Parser for Java; MSXML (in IE 5), James Clark's XP

SAX 2.0 Interfaces

  • Co-operation of an application and a parser specified in terms of interfaces (i.e., collections of methods)
  • My classification of SAX interfaces:
      – Application-to-parser interfaces
          » to use the parser
      – Parser-to-application (or call-back) interfaces
          » to act on various parsing events
      – Auxiliary interfaces
          » to manipulate parser-provided information

Application-to-Parser Interfaces

  • Implemented by parser (or a SAX driver):
      – XMLReader
          » methods to register objects that implement call-back interfaces, and to invoke the parser
      – XMLFilter (extends XMLReader)
          » to connect XMLReaders as a sequence of filters
          » obtains events from an XMLReader and passes them further (possibly modified)
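
A filter of this kind can be sketched by extending the SAX helper class `XMLFilterImpl`, which implements `XMLFilter` and forwards every event unchanged unless a method is overridden. The class name `UpperCaseFilter` and the upper-casing itself are made-up illustrations of "possibly modified".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: passes events on with element names in upper case.
class UpperCaseFilter extends XMLFilterImpl {

    UpperCaseFilter(XMLReader parent) {
        super(parent);  // the XMLReader this filter obtains its events from
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        super.startElement(uri, localName.toUpperCase(), qName.toUpperCase(), atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName.toUpperCase(), qName.toUpperCase());
    }

    /** Runs parser -> filter -> handler and returns the names the handler saw. */
    static List<String> elementNames(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        XMLReader filter = new UpperCaseFilter(parser);
        List<String> names = new ArrayList<>();
        // The application registers its handler with the filter, not the parser.
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<db><person/></db>"));
    }
}
```

Because a filter is itself an `XMLReader`, several filters can be chained by passing one as the parent of the next.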

Call-Back Interfaces

  • Implemented by application to act on parse events (A DefaultHandler quietly ignores most of them)
      – ContentHandler
          » methods to process document parsing events
      – DTDHandler
          » methods to receive notification of unparsed external entities and notations declared in the DTD
      – ErrorHandler
          » methods for handling parsing errors and warnings
      – EntityResolver
          » methods for customised processing of external entity references
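
As one illustration of these call-back interfaces, the sketch below registers an `ErrorHandler` and switches on DTD validation, so that validity errors reach the application instead of being ignored. The class name `CollectingErrorHandler`, the method `validate` and the message prefixes are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical ErrorHandler that collects problems instead of ignoring them.
class CollectingErrorHandler extends DefaultHandler {
    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        // Recoverable errors, e.g. validity violations; parsing continues.
        problems.add("error: " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Well-formedness violations; parsing cannot continue.
        problems.add("fatal: " + e.getMessage());
        throw e;
    }

    /** Parses with DTD validation switched on and returns the reported problems. */
    static List<String> validate(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        CollectingErrorHandler handler = new CollectingErrorHandler();
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Already recorded by fatalError().
        }
        return handler.problems;
    }

    public static void main(String[] args) throws Exception {
        String dtd = "<!DOCTYPE A [<!ELEMENT A (B)><!ELEMENT B (#PCDATA)>]>";
        System.out.println(validate(dtd + "<A><B>x</B></A>"));          // valid
        System.out.println(validate(dtd + "<A><B>x</B><B>y</B></A>"));  // two B's
    }
}
```

The exact wording of the messages depends on the parser, so the sketch only prefixes them with their severity.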

SAX 2.0: Auxiliary Interfaces

  • Attributes
      – methods to access a list of attributes, e.g.:

            int getLength()
            String getValue(String attrName)

  • Locator
      – methods for locating the origin of parse events (e.g. systemID, line and column numbers, say, for reporting semantic errors controlled by the application)
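
The two auxiliary interfaces can be combined in one small handler: `Attributes` to inspect the attribute list, and `Locator` to say where an application-level problem was found. The sketch is hypothetical; the class name `LocatingHandler`, the method `check` and the rule "every person needs an idnum" are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that reports attributes together with line numbers.
class LocatingHandler extends DefaultHandler {
    private Locator locator;  // stays null if the parser supplies no locator
    final List<String> report = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String nsURI, String localName,
                             String qName, Attributes atts) {
        int line = (locator != null) ? locator.getLineNumber() : -1;
        // Attributes: access by index ...
        for (int i = 0; i < atts.getLength(); i++) {
            report.add("line " + line + ": " + qName + "/@"
                + atts.getQName(i) + " = " + atts.getValue(i));
        }
        // ... or by name. A semantic check controlled by the application:
        if (qName.equals("person") && atts.getValue("idnum") == null) {
            report.add("line " + line + ": person without idnum");
        }
    }

    static List<String> check(String xml) throws Exception {
        LocatingHandler handler = new LocatingHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.report;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<db>\n<person idnum=\"1234\"/>\n<person/>\n</db>";
        for (String entry : check(xml)) {
            System.out.println(entry);
        }
    }
}
```

The missing `idnum` is not an XML error, so the parser says nothing about it; the locator lets the application produce its own positioned message.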

The ContentHandler Interface

  • Information of general document events. (See API documentation for a complete list):
  • setDocumentLocator(Locator locator)
      – Receive a locator for the origin of SAX document events
  • startDocument(); endDocument()
      – notify the beginning/end of a document.
  • startElement(String nsURI, String localName, String rawName, Attributes atts); endElement( … ); same params, w.o. attributes
      – nsURI, localName: for namespace support

Namespaces in SAX: Example

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns="http://www.w3.org/TR/xhtml1/strict">
      <xsl:template match="/">
        <html> <xsl:value-of select="//total"/> </html>
      </xsl:template>
    </xsl:stylesheet>

  • startElement for the xsl:template element would pass the following parameters:
      – nsURI = http://www.w3.org/1999/XSL/Transform
      – localname = template, rawName = xsl:template

Namespaces: Example (2)

    <xsl:stylesheet version="1.0" ...
        xmlns="http://www.w3.org/TR/xhtml1/strict">
      <xsl:template match="/">
        <html> ... </html>
      </xsl:template>
    </xsl:stylesheet>

  • endElement for html would give
      – nsURI = http://www.w3.org/TR/xhtml1/strict (as default namespace for element names without a prefix), localname = html, rawName = html
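
The namespace parameters from the two examples can be observed with a handler that echoes what `startElement` receives. This is a sketch with an invented class name (`NamespaceEcho`). JAXP factories are not namespace-aware by default, so the sketch calls `setNamespaceAware(true)`; without it, `nsURI` and `localName` would be empty.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the namespace parameters of each element.
class NamespaceEcho extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String nsURI, String localName,
                             String rawName, Attributes atts) {
        seen.add(rawName + " -> nsURI=" + nsURI + ", localName=" + localName);
    }

    static List<String> parse(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);  // off by default in JAXP
        NamespaceEcho handler = new NamespaceEcho();
        spf.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
            + " xmlns=\"http://www.w3.org/TR/xhtml1/strict\">"
            + "<xsl:template match=\"/\">"
            + "<html><xsl:value-of select=\"//total\"/></html>"
            + "</xsl:template></xsl:stylesheet>";
        for (String line : parse(xml)) {
            System.out.println(line);
        }
    }
}
```

The unprefixed `html` element is reported with the default namespace URI, while the `xsl:`-prefixed elements carry the XSLT namespace URI.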

ContentHandler interface (cont.)

  • characters(char ch[], int start, int length)
      – notification of character data
  • ignorableWhitespace(char ch[], int start, int length)
      – notification of ignorable whitespace in element content

    <!DOCTYPE A [<!ELEMENT A (B)>
                 <!ELEMENT B (#PCDATA)> ]>
    <A>
      <B>
      </B></A>

[Figure labels: the whitespace between <A> and <B> is ignorable whitespace; the whitespace inside <B> is text content.]
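
The difference between the two callbacks can be made visible by counting what each one receives for the document above. This is a hedged sketch: the class name `WhitespaceCounter` is invented, and validation is switched on because a validating parser is required to report whitespace in element content, whereas a non-validating parser may or may not do so.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that separates ignorable whitespace from text content.
class WhitespaceCounter extends DefaultHandler {
    int ignorable = 0;                              // characters in element content
    final StringBuilder text = new StringBuilder(); // characters in #PCDATA content

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        ignorable += length;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    static WhitespaceCounter parse(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setValidating(true);  // the parser must then use the content models
        WhitespaceCounter handler = new WhitespaceCounter();
        spf.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<!DOCTYPE A [<!ELEMENT A (B)> <!ELEMENT B (#PCDATA)>]>\n"
            + "<A>\n  <B>\n  </B></A>";
        WhitespaceCounter result = parse(xml);
        System.out.println("ignorable characters: " + result.ignorable);
        System.out.println("text characters: " + result.text.length());
    }
}
```

The same three characters (a line break and two spaces) occur in both places; only the DTD tells the parser that the first run is insignificant.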

SAX Processing Example (1/9)

  • Input: XML representation of a personnel database:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <db>
      <person idnum="1234">
        <last>Kilpeläinen</last><first>Pekka</first></person>
      <person idnum="5678">
        <last>Möttönen</last><first>Matti</first></person>
      <person idnum="9012">
        <last>Möttönen</last><first>Maija</first></person>
      <person idnum="3456">
        <last>Römppänen</last><first>Maija</first></person>
    </db>

SAX Processing Example (2/9)

  • Task: Format the document as a list like this:

    Pekka Kilpeläinen (1234)
    Matti Möttönen (5678)
    Maija Möttönen (9012)
    Maija Römppänen (3456)

  • Event-based processing strategy:
      – at the start of person, record the idnum (e.g., 1234)
      – record starts and ends of last and first to store their contents (e.g., "Kilpeläinen" and "Pekka")
      – at the end of a person, output the collected data

SAX Processing Example (3/9)

  • Application: First import relevant interfaces & classes:

    import org.xml.sax.XMLReader;
    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    // Default (no-op) implementation of
    // interface ContentHandler:
    import org.xml.sax.helpers.DefaultHandler;
    // JAXP to instantiate a parser:
    import javax.xml.parsers.*;

SAX Processing Example (4/9)

  • Implement relevant call-back methods:

    public class SAXDBApp extends DefaultHandler {
      // Flags to remember element context:
      private boolean InFirst = false, InLast = false;
      // Storage for element contents and
      // attribute values:
      private String FirstName, LastName, IdNum;

SAX Processing Example (5/9)

  • Call-back methods:
      – record the start of first and last elements, and the idnum attribute of a person:

    public void startElement (
        String nsURI, String localName,
        String rawName, Attributes atts) {
      if (rawName.equals("person"))
        IdNum = atts.getValue("idnum");
      if (rawName.equals("first"))
        InFirst = true;
      if (rawName.equals("last"))
        InLast = true;
    } // startElement

SAX Processing Example (6/9)

  • Call-back methods continue:
      – Record the text content of elements first and last:

    public void characters (
        char buf[], int start, int length) {
      if (InFirst) FirstName =
          new String(buf, start, length);
      if (InLast) LastName =
          new String(buf, start, length);
    } // characters

SAX Processing Example (7/9)

  • At the end of person, output the collected data:

    public void endElement(String nsURI,
        String localName, String qName) {
      if (qName.equals("person"))
        System.out.println(FirstName + " " +
            LastName + " (" + IdNum + ")" );
      // Update the context flags:
      if (qName.equals("first"))
        InFirst = false;
      // (and the same for "last" and InLast)
      if (qName.equals("last"))
        InLast = false;
    } // endElement

SAX Processing Example (8/9)

  • Application main method:

    public static void main (String args[])
        throws Exception {
      // Instantiate an XMLReader (from JAXP
      // SAXParserFactory):
      SAXParserFactory spf =
          SAXParserFactory.newInstance();
      try {
        SAXParser saxParser = spf.newSAXParser();
        XMLReader xmlReader =
            saxParser.getXMLReader();

SAX Processing Example (9/9)

  • Main method continues:

        // Instantiate and pass a new
        // ContentHandler to xmlReader:
        ContentHandler handler = new SAXDBApp();
        xmlReader.setContentHandler(handler);
        for (int i = 0; i < args.length; i++) {
          xmlReader.parse(args[i]);
        }
      } catch (Exception e) {
        System.err.println(e.getMessage());
        System.exit(1);
      };
    } // main
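
One point worth noting about the example: SAX allows a parser to deliver the text of one element in several characters() calls, so storing only the latest chunk can lose part of a name. The following self-contained variant is a sketch of one way to guard against that by accumulating text in a buffer. It is not the SAXDBApp class of the slides: the class name `PersonLister`, the method `format` and the use of an in-memory string as input are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of the example handler that buffers character data.
class PersonLister extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private String firstName, lastName, idNum;
    final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String nsURI, String localName,
                             String qName, Attributes atts) {
        if (qName.equals("person")) idNum = atts.getValue("idnum");
        text.setLength(0);  // start collecting text for this element
    }

    @Override
    public void characters(char[] buf, int start, int length) {
        text.append(buf, start, length);  // may be called more than once
    }

    @Override
    public void endElement(String nsURI, String localName, String qName) {
        if (qName.equals("first")) firstName = text.toString();
        if (qName.equals("last")) lastName = text.toString();
        if (qName.equals("person"))
            lines.add(firstName + " " + lastName + " (" + idNum + ")");
    }

    static List<String> format(String xml) throws Exception {
        PersonLister handler = new PersonLister();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<db>"
            + "<person idnum=\"1234\"><last>Kilpel\u00e4inen</last>"
            + "<first>Pekka</first></person>"
            + "<person idnum=\"5678\"><last>M\u00f6tt\u00f6nen</last>"
            + "<first>Matti</first></person>"
            + "</db>";
        for (String line : format(xml)) {
            System.out.println(line);
        }
    }
}
```

The buffer replaces the InFirst and InLast flags: the element name seen in endElement decides where the collected text is stored.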

SAX: Summary

  • A low-level parser-interface for XML documents
  • Reports document parsing events through method call-backs
      – -> efficient: does not create in-memory representation of the document
      – -> used often on servers, and to process LARGE documents