Nenie XML

An Eiffel XML parser and library

Release 0.1 - July 2001

Franck Arnaud


Contents


Introduction

Nenie XML is an XML parser and library, written purely in Eiffel. It depends only on the Gobo foundation library, itself very portable. It aims at being portable to all supported Eiffel compilers, using the practically portable subset of the Eiffel language. Having no external code dependency, operating system portability should be as good (or as bad) as the Eiffel compilers'.

On the XML front it aims at being fully compliant with the standards it implements. On the other hand, there is a preference towards implementing small and elegant, and preferably mature, standards, and not blindly take everything that comes and go in the XML community. For instance, when schema validation is supported, it is more likely that RELAX-NG is chosen rather than the gratuitously complex XML Schema.

This text assumes knowledge of both XML and Eiffel, and neither is introduced in great detail. The first chapter will discuss the way the library has been designed and some interesting implementation patterns while the remaining sections will cover the usage of the library facilities.

I'd like to thank Eric Bezault for providing the Gobo foundation and scanner and parser tool without which this library would have been much more work. Thanks to the SmallEiffel team for a fast and free compiler, and to Berend de Boer for a stimulating discussion about the handling of Unicode in Eiffel.


1 On the design

Before the detailed description of the library's functionality in the following chapters, the design of the library and its aims are discussed. Some interesting patterns within the implementation are also covered.

1.1 Supporting Unicode

XML is, by default, supporting Unicode as its character set and UTF-8 and UTF-16 are mandatory encodings. Most XML documents are encoded in UTF-8. Most Eiffel compilers on the other hand implement the CHARACTER types for code values between 0 and 255, the character sequence class, STRING, follows of course. There is therefore a problem in supporting unicode (or any form of wide character) text in Eiffel.

1.1.1 A discussion

This problem is not really a limitation of the Eiffel language per se, which does not mandate the size of the CHARACTER type, except its minimum size, so an Eiffel compiler could easily implement wide CHARACTERs. Few compilers[1] available at the time of writing unfortunately does this, so there is a necessity to find a way to handle wide characters in the existing environment.

One possible approach is to duplicate the character and string types for wide chars, which is what for instance the UC_STRING library from Majkel Kretschmar provides. In this scenario, when using wide characters and strings, the ordinary STRING and CHARACTER classes are simply abandoned. This ensures full support, and except for the lack of support for direct manifest constants, is close to what an implementation through the native CHARACTER and STRING types would do.

Unfortunately, this approach requires, for wide-character supporting code, to switch code wholesale to this approach, because STRING is not really supposed to be used anymore. Existing code, or new code written without using the alternative, say UC_STRING, throughout, is cumbersome to interface with code that uses STRING, without converting all the time between the two representations. It also provides choice in one area where it may not be welcome. The character size issue does not justify, in this author's opinion, to create distinct classes. Nearly all code that processes character sequences is not dependent, or should not be dependent, on the number of character codes that is available!

On the other hand, there exists an encoding of Unicode, which was specially designed to support transition from "one-character per byte" implementations, with only the code values 0-127 clearly defined -- the US ASCII character set. The unicode character set shares its lower characters (0 to 127) with US ASCII -- really with ISO Latin 1 for values up to 255, and therefore can be encoded in a manner which is ASCII transparent. UTF-8, probably the primary Unicode encoding, and the one used in XML, does just that. In an UTF-8 byte sequence, byte values between 0 and 127 encode themselves, that is any ASCII-only UTF-8 sequence is exactly the same in UTF-8 and in ordinary ASCII. Characters above 128 are encoded using bytes values 128-256. In any multi-byte encoding of a character, all bytes are above 128. The first byte determines (a) the number of bytes used to encode the current character and (b) the first few bits of the character itself. The following characters, all above 128, start with the bit sequence 10 and use the 6 lower bits for the following bits of the unicode character.

The encoding has the property of encoding the lower values of character codes into fewer bytes (one for ASCII, two for values between 128 and 2048, etc) which combines well with the fact that alphabets for widely used scripts are in the lower ranges of Unicode character codes. Higher values are mostly used for ideographic writing systems, where a character represents a word, so it's not an issue that a word uses 3 or 4 bytes of storage.

Still, back to Eiffel support, the most interesting property of UTF-8, beyond its ubiquity, is that it is ASCII transparent. Any code that processes STRING as sequences of bytes or 0-255 values, which either ignores the encoding or is only ASCII dependent in its handling of encoding, would fully work on STRINGs which happen to contain UTF-8 encoded Unicode. In most programming tasks, when string processing is encoding dependent it more often than not applies only to ASCII, looking for keywords or punctuation. Code that actually processes code in specific languages or encoding -- say a hyphenation engine, is a small proportion of all code.

So, by carefully adapting the code that is encoding-dependent, mostly foreign interfaces -- for instance an UTF-16 input file filter in XML -- and few select pieces of other code that are encoding dependent, the rest of the code can remain using the existing STRING type. Of course, code that is not encoding independent, must deal with STRINGs in the knowledge that they do not really contain 0-255 characters, but wider characters encoded using UTF-8. The assistance of the type system is lost, so this has to be done carefully. This is really the main disadvantage of the approach, but this author believes that this balances well with interoperability with the STRING world and future scalability.

The scalalability of this approach -- promoting encoding independence and interoperability with the existing use of STRING -- comes from providing a natural upgrade path to compilers that do support wide CHARACTER and STRINGs natively. Indeed, nearly all code can be used unchanged in the two cases concurrently (in different programs of course!).

In summary, in the belief that (a) in the long term the native types will support wide characters, (b) encoding-dependent code is actually rare, and (c) interoperability with existing STRING code is valuable, this XML library implements unicode support using ordinary strings, encoded using UTF-8.

There are XML libraries, like eXml, which support the other approach, abandoning STRING for an explicit replacement, and those who prefer this view have choice, for XML software at least. Part of the argument for either approach is which of the following options is more likely to happen: will most Eiffel code authors switch to the STRING replacement, or will compiler vendors extend the STRING type? In the sometimes slow-moving Eiffel world it is hard to guess but this author thinks the latter is much more likely.

1.1.2 Unicode in practice: UTF8-in-STRING

The approach of using UTF-8 encoded in Eiffel STRING implementations that only support character values between 0 and 255 will from now on be called UTF-8 in STRING in this document. When characters are accessed decoded, it is recommended to use the INTEGER type, which is large enough to hold all unicode character codes.

It is important to note that UTF-8 in STRING is not primarily a property of the library, it is a program-wide property, or invariant. A library is compatible with this approach by being encoding independent, or having few, well localised places where encoding specific processing is used. This is of course more of a guideline than a firm rule, and it is surely possible to put the boundary in the middle of a program with some libraries operating with UTF-8 in STRINGs and others using other encoding. This requires a well defined boundary and does not seem desirable, a boundary at the 'outside' world frontier seems much clearer.

By default Nenie XML operates in UTF8-in-STRING, and all STRINGs that come are used in the parser interface can be safely assumed to be in UTF-8. The encoding dependent parts are the UTF-16 input buffer that converts incoming UTF-16 files to UTF-8, and the pedantic parser variant that does unicode XML character classes validation. To use Nenie XML in an environment where say all STRINGs are ISO Latin-1 -- foregoing the possibility to process XML files that contain characters not in Latin-1 -- only the input buffer conversion need be adapted, and the pedantic parser not used. Beyond the parser, other library facilities, like the tree representations, are encoding-independent.

For those localised parts of the code which need encoding-dependent processing, a few facilities are provided: the NXML_UTF8_CURSOR class provides a way to iterate other the character of a UTF-8 characters (as INTEGER) in a STRING encoded in UTF-8. This has the same interface as Gobo container cursors, although it does not inherit from it because Gobo cursors need an associated container in the Gobo hierarchy, in which STRING is, of course, not part of. The UTF-8 processing primitives are exposed in a separate mixin class.

If external iterators are not satisfactory, encoding dependent portions of the code can of course convert in and out of specific wide-character alternatives to STRING classes.

1.2 The parser

Now we know how to process XML characters, we can go on to the design of the parser itself. One of the design aim is to provide a fully XML 1.0 compliant. XML is a reasonably simple standard, and although it suffers from some mild bloat and inelegances here and there, it is an acceptably implementable standard, and test suites are available which facilitate greatly the validation process, as there is both a specification and tests to work against. It is also mature, despite having been published only in 1998, so that most obscure issues have been met before and resolved. The standard document used for these parser, is XML 1.0 (Second edition), the October 2000 W3 Consortium Recommendation.

1.2.1 Resolving namespaces


	<root
		xmlns="http://schema.nenie.org/sample" 
		xmlns:other="http://schema.nenie.org/other">
			<item other:info="we are in the 'other' namespace">
				We are in the 'sample' namespace
			</item>
	</root>

Putting the cart before the horse, let's first cover how namespaces are handled. Historically namespaces were thought about at the time of the XML 1.0 specification but left for later standardisation. They are now standardised is a separate W3 document. Namespaces are widely used, but it is also valid to use XML without namespaces, and an XML 1.0 parser must accept all XML documents, including those which would not be valid with namespaces.

The main issue is that tag and attribute names in XML in documents that use namespaces are of either single names or a name prefixed with a qualifier, separated by a single colon. In ordinary XML, a colon may be anywhere in the name. For instance, ":", ":hello:-text:is_red" or "blah:more-text:atEnd" are all valid XML 1.0 names.

To handle both, while doing all the scanning and parsing in the XML parser and not leave it to a namespace client code to locally reparse the names, the names are parsed as a list of colon-separated items. This list of colon separated names can have any number of items, or empty items, so for instance ":" is parsed as a list of two empty strings. This is then delivered to the client with instances of NXML_PARSER_NAME which has a list like interface (add at end, get indexed items, etc).

This class, which works in all XML 1.0 compliant situations, is optimised for namespace use, in that it stores optimally lists with two items -- the first and second items of the list are attributes, while the rest is in a 'tail' implemented with an ordinary Gobo list. The interface hides this implementation. The client that uses namespaces can easily validate the name by checking that the list's count is 1 or 2 and that the items are not empty strings.

Namespace treatment is done downstream, allowing to do it fully. Given the availability of the raw form, names with resolved namespaces can be lightweight and the class NXML_NAME supports fully resolved namespaces. A resolved namespace is a name -- the single or right hand item in the XML source -- plus its qualifying namespace, the unique URI mentioned in the namespace declaration[2]. The qualified name, an artefact of the XML encoding, can disappear.

1.2.2 Event interfaces

This distinction between parser names is carried to the event interface. The event interface is the low-level interface to the XML parser: a client inherits from the interface, and implements actions for events such as "content arrived", "start tag commenced", "attribute defined", etc.

The lowest level interface, defined NXML_EVENT_INTERFACE, provides XML compliant events, without namespace treatment, therefore all its events which use names use the class NXML_PARSER_NAME introduced above.

Another interface, NXML_NAMESPACE_EVENT_INTERFACE is similar but with resolved names (all names are NXML_NAME). Both interfaces share the events that do not contain names, while events with XML names have procedures distinctively named to avoid name clashes. Both interfaces can thus easily be used in the same class, which is what NXML_PARSER_NAMESPACE does. It takes unresolved events from the low level interface and, after resolving namespaces and performing namespace validation, issues events of the namespace interface.

1.2.3 Parsing XML 1.0

XML 1.0, with which the parser complies, defines several categories of parser. This parser is a non-validating processor. That is it parses the XML document, including the document type definition (DTD) if present. The DTD is not acted upon to check the validity of the document, although it can resolve entity declaration and default attribute declarations defined in the DTD, and normalise the document according to default attribute values and declared entities. In addition, external entities can be resolved.

A validating parser can be built on top of this parser, although some would question whether DTD, which are not compatible with namespaces, are going to see continued use. In any case, the DTD parsing capability ensures compliance and has no impact on documents that do not use DTD.

The correctness of the parser is established by running against a test suite, as described in A.

1.2.4 A gelex/geyacc parser

The parser is implemented using Gobo Lex and Yacc scanner and parser generation tools, available in the file nxml_scanner.l and nxml_parser.y -- generating respectively the classes NXML_SCANNER and NXML_PARSER. Some of the implementation patterns used are described below.

Scanning and parsing

Unlike usual interfaces between a Gobo parser and scanner, using inheritance, this parser uses a client relationship to the scanner, through the read_token/last_value interface. This allows to switch the scanner during the parsing, which is what is used to support reading external entities (internal or external, parameter or ordinary) which are, seen from the parser, a stack of scanners. When an entity reference occurs, the scanner is switched to a scanner that reads the entity and switches back when reading the entity is finished.

Eric Bezault suggested that switching of input buffers rather than scanners be stacked and switched, but the entity scanners at this time also perform some preprocessing on the entities, like digesting the initial Text Declaration (<?xml enconding='UTF-8'?> for instance) or implementing the enclosing of parameters entities within Space tokens required by the specification. This may be reviewed later, by moving this treatment either to input buffers or to the parser.

The scanner, while it really detects some well-formedness errors, also delegates all error handling to the parser. If an invalid character pattern is received, a token is sent in all cases, which the parser will detect either explicitly or as an unexpected token -- a syntax error. Beyond the unexpected data token, there are also some distinct error tokens for erroneous character sequences matched explicitly by the scanner for better error reporting. This separation of roles allows to clearly concentrate error processing in one place.

The handling of names, as discussed above, is as a sequence of name and colon tokens, rather than single tokens. This provides a parsed input for namespace validation if required while maintaining compliance.

Scanner tricks

The scanner uses multiples states to handles DTD parsing, quoted values (which can syntactically contain markup!), comments, processing instructions and the like. States are also used to provide distinct tokens for sometimes the same constructs, for instance in the case of the literal value of entities in an entity value declaration which are, in some circumstances, better resolved when the entity is used (and parsed again) rather than at the time of declaration.

In a DTD, many 'keywords' also match the name tokens, and are keywords depending on position, e.g. the third item in a declaration. To deal with that the parser defines a 'DTD name' rule that is a name token or any of the name-compatible keywords.

One exceptional treatment of keywords is PUBLIC or SYSTEM declarations, whose quoted value is merged in a single token with the keyword (SYSTEM "id" is one token). This is because the quoted values after a SYSTEM or PUBLIC keyword are a different production than an ordinary quoted value, e.g. entity declaration's body, but most instances would match both definitions lexically. States could be used, but the PUBLIC keyword may be followed by one or two quoted values, thus making state switching more tricky. This may be changed in a future revision.

A special treatment of markup is done in the case of the XML declaration, whose parsing is hardcoded in the scanner. So the scanner matches directly version='1.0' as one token rather than like it would for attributes within content. This allows straightforward parsing of the XML declaration, for example in entity headers, while implementing the subtleties that make the XML declaration different syntactically from tag attributes.

The parser

The parser is a straightforward yacc parser. It was initially written by simply typing in an appropriate form the grammar's productions from the XML 1.0 standard. This provided a first version, but this required some reasonably extensive massaging work to resolve the stylistic oddities and other inconsistencies in the spec and to make it a non-ambiguous parser -- it has no shift/reduce or reduce/reduce conflicts -- while supporting the sometimes irregular syntax of XML, in particular within DTDs, which is where most of the complexity lies. Entity resolution, in particular parameter entities which introduce some form of macro processor in DTDs were particularly tricky. The handling of space in DTD, without generating grammar ambiguities, and supporting some of the asymmetries of syntax also provided some challenges.

The parser contains little explicit state: there are the bindings to defined entities, some validation booleans, and the scanner stack and interface discussed above. Validation of attribute duplication and the matching of start and end tag names is done within the parser's own stack of values.

Validating unicode character classes

The Gobo tools use fixed size character tables, which could be a problem with wide characters, but fortunately all meaningful markup in XML is ASCII which is perfectly compatible with the UTF8 in STRING approach followed by this library.

Validating unicode character classes, in XML names or while checking for forbidden character ranges -- like the UTF-16 substitution range disallowed in UTF-8 -- is implemented with an optional post processor, NXML_PARSER_PEDANTIC. For the basic character range checking it could be implemented as a preprocessor. Name checking, being of limited use beyond compliance checking may be better left as an optional post processor.

The other encoding dependent treatment is in the input buffer. The scanner, when it detects a UTF-16 input file -- the detection method is specified and simple, requiring only to look at the first two bytes of the file -- creates a variant of the ordinary Gobo input buffer that converts its input from UTF-16 to UTF-8.

1.3 Tree representation

In many applications, an event interface is not enough and a structured view is more useful. The event interface can be used to build a custom structured view, using application specific classes. The library also provides generic tree representations of the XML document for cases where a general structured view is suitable.

Polymorphic trees

A tree representation is a recursive list or set of nodes. In an object-oriented language, the representation of a tree with multiple node types has to solve the issue of traversal: usually, the nodes are different classes which share a common ancestor. If subnodes of a given node can be of any type, the list or attributes containing subnodes is of the type of the ancestor of all nodes.

When traversing a tree, which is not a custom tree which can contain most of the functionality in the node classes themselves, it is needed to access the actual type. Several patterns are available for this purpose. They include double dispatch with an external interface with distinct features for visiting each node type with dispatching within the node (the Visitor pattern). Another possibility is to provide conversion functions for each subtype in the root type, which is a statically typed implementation of another solution, run-time type checking with assignment attempts.

In our context, this means there is a benefit in simplicity and usability if at least some tree representations can be monomorphic. This is not too hard in XML where useful node types are few[3] and a meaningful unified interface can be devised -- a unified monomorphic interface does not prevent polymorphic instances, it is the interface which is constant so that all node can share one abstract type.

Narrative vs. data centric XML

XML, or more precisely SGML, was first designed as a markup language for text, like this document is or like HTML pages. There the distinction between text and markup is clear: markup is the text that is to be rendered or processed, while markup contains the instructions on how to render it. Strip the markup from a narrative XML document and all the textual content should still be there, and ideally just the content, with a correctly designed schema.

On the other hand, XML is increasingly used for marking up data, a concept closer to data management than text processing. Providing a structured tree is a convenient way to structure many types of information. XML precisely provide a structured tree representation which makes it popular in this area. Unlike narrative markup, data centric markup has little use for inline markup, known as mixed content in DTDs, that is like <em>this</em> in the middle of text. Virtually all data is atomic, which fits well in leaf elements or attributes. This usually brings the well known element or attribute? question for the schema designer. Without entering too much into this debate, this author believes any choice should be followed consistently, either all data is in attributes or all data is in leaf element, with only structural info, like references, in attributes.

The point in the context of this library is that both uses, narrative and data centric, are distinct which justifies providing different facilities for both uses. This is why one of the tree representations provided is geared towards data-centric markup. The main practical difference in the structure is the assumption, that data centric XML has no mixed content.

A data centric representation

The first tree representation in this library is precisely data-centric. The removal of mixed content allows monomorphic node to be build. The primary node is the element, known as NXML_ELEMENT, produced by NXML_PARSER_ELEMENT. An element, or node in the XML tree, has the following components:

The text is the aspect which differ between a leaf node and a node that contains sub-elements. With the data centric precondition, we know that all text content is in leaf nodes. In a non-leaf node, the text contains an XML representation of the subnodes, built from the child elements. So, instead of having two distinct type of nodes for text and elements, text nodes are simply elements which are also leaf nodes.

Names and name table

All the names in a NXML_ELEMENT are NXML_NAME that is a name with its namespace resolved, to the empty namespace for XML documents that do not use namespace. It is possible to twin a name, to obtain a name with the same namespace and another local name -- the local name is a string. Because the primitive form of many queries on an element takes an instance of NXML_NAME, there are syntactic sugar features which take the namespace from the element to lookup attributes or sub-elements, for the very common case when the namespace of the element and its components are the same, as illustrated below:

	-- explicitly create NXML_NAME instances, with same namespace 
	-- as element itself
	a_name := an_element.name.with_name ("tag")
	if an_element.attributes.has (a_name) then
		do_something (an_element.attributes.item (a_name))
		...
		
	-- syntactic sugar, for queries with same namespace as
	-- element
	if an_element.attributes.has_named ("tag") then
		do_something (an_element.attributes.named ("tag"))
		....
		

These features are part of NXML_NAME_TABLE[G] which is a descendant of Gobo's DS_HASH_TABLE[G,NXML_NAME] with what it takes to implement the above syntactic sugar.

As the example above hinted, attributes are in such a table, with STRING items. Because the order of element may matter, and duplicates are allowed, elements are stored in a Gobo list of NXML_ELEMENT. Order may matter but is not always significant, that's why elements are also available indexed in a table, keyed_elements, giving the first element with a given name. It is produced and cached on demand, adding no overhead if unused, and little overhead if used only once -- in time complexity building a hash table has a cost comparable to iterating the list without building the table. Of course, access speed is greatly improved for multiple access by name.

1.4 Limitations and future directions

Surprisingly, this library is not perfect, and it currently has several limitations.

Validating parser

First the parser is not validating, and no validation alternative is available within the library. It is possible to separate the task of processing valid content, using this library, and using an external program for validating, of which numerous are available. If a validating component is implemented, it will probably not be based on DTDs, but rather on a schema language, the most likely to be chosen being RELAX NG, the XML Schema alternative designed by James Clark and Makoto MURATA, standardised under the OASIS umbrella. This is a good design with both greater usability and easier implementability than XML Schema which is suffering from excessive bloat.

The parser, while fully parsing DTDs, does not as of yet expose DTD-related events, therefore not allowing a library client to write a DTD processor or validator without changing the library. This one is easy to fix, requiring only the definition of a meaningful DTD interface.

Performance

Regarding performance, the library was designed with reasonable efficiency in mind, and the first informal tests seem to exhibit acceptable performance[4] but further optimisation, after proper profiling and benchmarking, remains to be done.

Some sections of the code, mainly there for conformance compliance, may be rather primitive performance wise. This includes the character classes validator in the pedantic parser variant, and the UTF-16 input buffer.

Narrative tree

The tree representation also lacks a narrative-centric tree. One could argue that event processing is enough for narrative centric documents, but various options are being considered. Among them extending the current NXML_ELEMENT with a special variant, or doing something different, possibly in association with a future implementation of XPath.

Future directions

The immediate future directions have already been mentioned: an XPath implementation, a narrative tree, possibly a RELAX NG validator. Beyond that interesting projects may lie in the realm of integrating Eiffel classes more directly with XML representations.


2 Parsing

The parser is based on an event interface, which clients inherit. Two of these are available, before and after namespace resolving.

2.1 The event interfaces

The interfaces are purely deferred, inheriting from NXML_COMMON_EVENT_INTERFACE for the common features. All features of the event interfaces are called when the parser is finished parsing the given construct. The feature when_error is called when an error occurs. An invalid document may call some events until it is found invalid. Only for valid documents a complete sequence of events is guaranteed.


deferred class interface NXML_EVENT_INTERFACE
feature(s) from NXML_COMMON_EVENT_INTERFACE
   -- Actions
   when_content (a_string: STRING)
      -- Text content within element.
      require
         not_void: a_string /= Void
		 
   when_comment (a_content: STRING)
      -- Comment in input.
      require
         not_void: a_content /= Void
		 
   when_cdata (a_content: STRING)
      -- Cdata section content.
      -- (Default action: like ordinary content)
      require
         not_void: a_content /= Void
		 
   when_processing_instruction (a_target, a_content: STRING)
      -- Processing instruction in input.
      require
         target: a_target /= Void; --name: is_name (a_target)
         content: a_content /= Void
		 
feature(s) from NXML_COMMON_EVENT_INTERFACE
   -- Error
   when_error (a_message: STRING)
      -- Error has occurred.
      require
         not_void: a_message /= Void
		 
feature(s) from NXML_EVENT_INTERFACE
   -- Actions
   when_start_tag (a_name: NXML_PARSER_NAME)
      -- Element start tag.
      require
         not_void: a_name /= Void
		 
   when_start_tag_attribute (a_name: NXML_PARSER_NAME; a_value: STRING)
      require
         name_not_void: a_name /= Void;
         value_not_void: a_value /= Void
		 
   when_start_tag_end
      -- End of start tag.
	  
   when_end_tag (a_name: NXML_PARSER_NAME)
      -- Element end tag.
      require
         not_void: a_name /= Void
		 --well_formed: a_name.is_equal (matchin name in start_tag)
		 
end of deferred NXML_EVENT_INTERFACE

deferred class interface NXML_NAMESPACE_EVENT_INTERFACE
feature(s) from NXML_COMMON_EVENT_INTERFACE
   -- As above
   
feature(s) from NXML_NAMESPACE_EVENT_INTERFACE
   -- Actions
   when_start (a_name: NXML_NAME)
      -- Element start.
      require
         not_void: a_name /= Void
   
   when_attribute (a_name: NXML_NAME; a_value: STRING)
      -- Attribute.
      require
         not_void: a_name /= Void
   
   when_content_start
      -- Last attribute done.
   
   when_end
      -- Element end.

end of deferred NXML_NAMESPACE_EVENT_INTERFACE

2.2 The parsers

The main parser class is NXML_PARSER, a deferred class which gets its deferred features from NXML_EVENT_INTERFACE. The client inherits from this class -- the creation procedure is make -- and implements event handlers. The caller creates the class, calls parse_file or parse_string and can collect the result in the syntax_error boolean. If an error occurred, the error handling event is also called.

deferred class interface NXML_PARSER

feature -- Creation for descendants
   make
      -- Initialise.
	  
feature(s) from NXML_PARSER
   -- Parser
   parse_file (a_name: STRING)
      -- Parse file.
      require
         not_void: a_name /= Void
   parse_string (a_string: STRING)
      -- Parse string.
      require
         not_void: a_string /= Void

feature(s) from YY_PARSER
   -- Status report
   syntax_error: BOOLEAN
      -- Has last parsing been unsuccesful?

feature(s) from NXML_PARSER
   -- Error diagnostic
   line: INTEGER
      -- Current line.
   column: INTEGER
      -- Current column.
   filename: STRING
      -- Current file.
   error_header: STRING
      -- Header for error message.
      -- <filename>:<line>:<column>:
      ensure
         not_void: Result /= Void

feature(s) from NXML_COMMON_EVENT_INTERFACE
feature(s) from NXML_NAMESPACE_EVENT_INTERFACE
	-- Deferred, as above

end of deferred NXML_PARSER

Some ready made concrete descendants of NXML_PARSER are available:

class interface NXML_PARSER_ELEMENT
creation
   make
      -- Initialise.
feature(s) from NXML_PARSER
   -- Parser
   parse_file (a_name: STRING)
      -- Parse file.
      require
         not_void: a_name /= Void
   parse_string (a_string: STRING)
      -- Parse string.
      require
         not_void: a_string /= Void
feature(s) from YY_PARSER
   -- Parsing
   parse
      -- Parse input stream.
      -- Set `syntax_error' to True if
      -- parsing has not been successful.
feature(s) from YY_PARSER
   -- Status report
   syntax_error: BOOLEAN
      -- Has last parsing been unsuccesful?
feature(s) from NXML_PARSER
   -- Error diagnostic (as above)

feature(s) from NXML_PARSER_ELEMENT
   -- Access
   last_element: NXML_ELEMENT
      -- Last element.
      -- Root node when parsing finished and successful.
	  
end of interface NXML_PARSER_ELEMENT

3 Tree representations

Currently the library has a single tree representation: a data centric representation with a few support classes.

3.1 Names and tables

To support tree representation with namespaces, two classes hold namespaces and names: NXML_NAMESPACE and NXML_NAME. It is intended that these classes are immutable -- their value cannot change and they do not have procedures -- so obtaining new values is done through duplication or creation of new instances. Both classes inherit from HASHABLE and implement copy and equality semantics from ANY.

class interface NXML_NAMESPACE
creation
   make (an_uri: STRING)
      -- Create.
      require
         not_void: an_uri /= Void
		 
feature(s) from NXML_NAMESPACE
   -- Data
   uri: STRING
      -- Namespace unique identifier.
	  
feature(s) from NXML_NAMESPACE
   -- Output
   out: STRING
      -- URI.
      ensure
         definition: Result = uri
		 
   as_xml (a_ns_map: DS_SPARSE_TABLE[STRING,NXML_NAMESPACE]): STRING
      -- Resolved prefix.
      require
         has: a_ns_map.has(Current)
		 
   namespace_declaration: STRING
      -- xmlns=<uri>
      ensure
         new_string: Result /= Void
		 
   namespace_declaration_in (a_string: STRING)
      -- Append `namespace_declaration' to string.
      require
         appendable: a_string /= Void
		 
   namespace_prefix_declaration (a_name: STRING): STRING
      -- xmlns:<a_name>='<uri>'
      require
         is_prefix: is_namespace_prefix(a_name)
      ensure
         new_string: Result /= Void
		 
   namespace_prefix_declaration_in (a_string: STRING; a_name: STRING)
      -- Append `named_namespace_declaration' to string.
      require
         appendable: a_string /= Void;
         is_prefix: is_namespace_prefix(a_name)
		 
   is_namespace_prefix (a_string: STRING): BOOLEAN
      -- Is this string a valid namespace prefix.

feature(s) from HASHABLE
   hash_code: INTEGER
      -- Hash code.
      ensure
         good_hash_value: Result >= 0
		 
feature(s) from NXML_NAMESPACE
   -- ANY
   is_equal (an_other: like Current): BOOLEAN
      -- Is equal?
      require
         other_not_void: an_other /= Void
      ensure
         consistent: standard_is_equal(an_other) implies Result;
         symmetric: Result implies an_other.is_equal(Current)
		 
invariant
   not_void_uri: uri /= Void;

end of NXML_NAMESPACE
class interface NXML_NAME
creation
   make (a_namespace: NXML_NAMESPACE; a_name: STRING)
      -- Make with namespace.
      require
         namespace: a_namespace /= Void;
         name: valid_name(a_name)
      ensure
         namespace: namespace.is_equal(a_namespace);
         name: name.is_equal(a_name)
   make_empty
      -- Empty name.
      ensure
         name: name.count = 0
   make_element (a_name: STRING)
      -- Name with empty namespace.
      require
         name: valid_name(a_name)
      ensure
         name: name.is_equal(a_name)
   make_inherit (a_parent: NXML_NAME; a_name: STRING)
      -- Inherit namespace.
      require
         name: valid_name(a_name)
      ensure
         namespace: namespace.is_equal(a_parent.namespace);
         name: name.is_equal(a_name)

feature(s) from NXML_NAME
   -- Access
   namespace: NXML_NAMESPACE
      -- Namespace.
   name: STRING
      -- Name of element or attribute.
	  
feature(s) from NXML_NAME
   -- Twin
   with_name (a_name: STRING): NXML_NAME
      -- New name with same namespace as current.
      require
         name: valid_name(a_name)
      ensure
         namespace: Result.namespace.is_equal(Current.namespace);
         name: Result.name.is_equal(a_name)
   same_namespace (an_other: NXML_NAME): BOOLEAN
      -- Same namespace as other?
      require
         not_void: an_other /= Void
      ensure
         def: Result = namespace.is_equal(an_other.namespace)

feature(s) from NXML_NAME
   -- Access
   valid_name (a_name: STRING): BOOLEAN
      -- Valid name?
   prefixed (a_ns_map: DS_SPARSE_TABLE[STRING,NXML_NAMESPACE]): STRING
      -- Output as XML (prefixed).
      require
         has_namespace: a_ns_map.has(namespace)
      ensure
         new_string: Result /= Void;
         size: Result.count = namespace.as_xml(a_ns_map).count + 1 + name.count
   prefixed_in (a_string: STRING; a_ns_map: DS_SPARSE_TABLE[STRING,NXML_NAMESPACE])
      -- Append prefixed name to string.
      require
         has_namespace: a_ns_map.has(namespace)
      ensure
         as_xml_in: a_string.count = old a_string.count + prefixed(a_ns_map).count
		 
feature(s) from HASHABLE
   hash_code: INTEGER
      -- Hash code.
      ensure
         good_hash_value: Result >= 0
		 
feature(s) from NXML_NAME
   -- ANY
   is_equal (an_other: like Current): BOOLEAN
      -- Equality.
      require
         other_not_void: an_other /= Void
      ensure
         consistent: standard_is_equal(an_other) implies Result;
         symmetric: Result implies an_other.is_equal(Current)
   out: STRING
      -- Printable representation.
      -- namespace-URI':'name (not valid XML because resolved!)
	  
invariant
   name_not_void: name /= Void;
   namespace_not_void: namespace /= Void;
end of NXML_NAME

Name table

A further support class, NXML_NAME_TABLE[G] is a descendant of DS_HASH_TABLE[G,NXML_NAME]. It adds feature to resolve local names, with the context's namespace. The namespace is given on creation, which is usually done within the construction of the tree representation.

class interface NXML_NAME_TABLE[G]
creation
   make (a_namespace: NXML_NAMESPACE)
      -- Namespace.
      require
         not_void: a_namespace /= Void
      ensure
         keep_reference: namespace = a_namespace
feature(s) from DS_HASH_TABLE
   -- Many features not reproduced here
   
feature(s) from NXML_NAME_TABLE
   -- Named interface to DS_HASH_TABLE
   has_named (a_string: STRING): BOOLEAN
      -- Is the name present in this table?
      require
         name: a_string /= Void
      ensure
         def: Result = has(name(a_string))
   named (a_string: STRING): G
      -- Named.
      require
         has: has_named(a_string)
      ensure
         def: Result = item(name(a_string))
   force_named (v: G; a_string: STRING)
      -- Force with name.
      -- Replace element if key already used.
      require
         name: a_string /= Void
      ensure
         forced: has_named(a_string) -- as `force'.
   force_new_named (v: G; a_string: STRING)
      -- Force new with name.
      require
         name: a_string /= Void;
         has: has_named(a_string)
      ensure
         forced: has_named(a_string) -- as `force_new'.
end of NXML_NAME_TABLE[G]

3.2 Data-centric tree

The main invariant of a data centric tree is that it does not contain mixed content. As described in this document, this is a monomorphic tree, with only nodes of type NXML_ELEMENT, leaf text nodes have is_leaf true, and the text content is returned by the text query.

class interface NXML_ELEMENT
creation
   make_root (a_name: NXML_NAME)
      -- Make root element.
      require
         not_void_name: a_name /= Void
      ensure
         root: is_root;
         starts_as_leaf: is_leaf
   make_child (a_parent: NXML_ELEMENT; a_name: NXML_NAME)
      -- Make with name and no content or attributes .
      require
         not_void_parent: a_parent /= Void;
         not_void_name: a_name /= Void
      ensure
         not_root: not is_root;
         starts_as_leaf: is_leaf
feature(s) from NXML_ELEMENT
   -- Data
   parent: NXML_ELEMENT
      -- Parent
      -- require not is_root
      -- ensure Result /= Void
   name: NXML_NAME
      -- Name.
   attributes: NXML_NAME_TABLE[STRING]
      -- Attributes
   elements: DS_BILINKED_LIST[NXML_ELEMENT]
      -- Elements which are under this element.
feature(s) from NXML_ELEMENT
   -- Access
   is_root: BOOLEAN
      -- Is this the root node?
      ensure
         definition: Result = (parent = Void)
   depth: INTEGER
      -- Depth from root.
      ensure
         start: is_root implies Result = 1;
         recurse: is_root or else Result = parent.depth + 1
   is_leaf: BOOLEAN
      -- Is this a final node with no child elements?
   text: STRING
      -- Text content of leaf or markup (post parsing) of non-leaf.
      -- Mixed content's text, comments etc are ignored.
      ensure
         not_void: Result /= Void
   namespaces: DS_SPARSE_SET[NXML_NAMESPACE]
      -- All namespaces used in this element and child elements.
   keyed_elements: NXML_NAME_TABLE[NXML_ELEMENT]
      -- Elements keyed by name. Only the first instance of 
      -- two or more element with the same name is in this 
      -- table.
      -- Warning: generated once per instance.
feature(s) from NXML_ELEMENT
   -- Text output
   out: STRING
      -- Pseudo XML representation of element.
      -- (pseudo in that namespaces are resolved and thus invalid)
   as_xml (a_ns_map: DS_SPARSE_TABLE[STRING,NXML_NAMESPACE]): STRING
      -- XML representation, using a namepace prefix resolution table.
      require
         ns_map: is_namespace_map(a_ns_map)
   as_xml_in (a_string: STRING; a_ns_map: DS_SPARSE_TABLE[STRING,NXML_NAMESPACE])
      -- Write as_xml in string.
      require
         appendable: a_string /= Void;
         ns_map: is_namespace_map(a_ns_map)
   is_namespace_map (a_ns_map: DS_SPARSE_TABLE[STRING,NXML_NAMESPACE]): BOOLEAN
      -- Is this a complete namespace map for this element?
   default_namespaces_map: DS_HASH_TABLE[STRING,NXML_NAMESPACE]
      -- Generate names for namespaces (ns123).
invariant
   root: is_root = parent = Void;
   name_not_void: name /= Void;
   elements_not_void: elements /= Void;
end of NXML_ELEMENT

A Result of conformance testing

The parser is tested and validated against the OASIS/NIST XML Test Suite 1.0, Second Edition, 2001-03-15, seen at http://www.oasis-open.org/committees/xml-conformance/. This test suite contains a large number of XML documents exhibiting many aspects of XML parsing.

The suite contains two types of tests: binary tests, to which a parser must answer one of "valid" or "invalid" for each input document. There is an expected result for each case, for a given type of conforming parser -- non-validating in this parser's case. Output tests compare the output of a tested parser in some form of canonical XML, with what is expected from a conformant parser.

This section is correct at the time of writing but is subject to be updated as the parser changes and more tests are conducted.

A.1 Binary tests

Out of the box, the parser passes 98.5% of the 1820 tests. The 27 failed tests can be dispatched in the following categories, to give a result of 10 really failed tests (99.5% passed):

Incorrect tests (3 tests)
2 tests are missing an apparently empty file (zero-sized file lost in packaging?) and passes once this file is created. Another test input file does not exist. (sun/valid/empty, xmltest/valuid/not-sa/003, sun/valid/ext01)
Irrelevant tests (3 tests)
3 test are optional errors, with no particular recommended behaviour for a conformant parser, as defined in section 3.4 of the test suite (sun/invalid/pe01, oasis/p11pass1, sun/not-wf/uri-01).
Validity constraint: undeclared external entity in DTD (11 tests)
These tests are where a validity constraint, which must be checked by a validating parser but is not normally detected by a well-formed parser, is enforced. It would be trivial to replace this error with a warning, and the parser is just doing a bit more than it is required to.
Validity constraint: parameter entity proper group nesting (6 tests)
This is similar to the above, in that a validity constraint is detected while not required. It is different from the above in that this detection is the result of the way the parser is implemented, and that this class of error may or may not be detected depending on the input: all documents that satisfy the constraint will be parsed correctly, but of those that do not satisfy the constraint, some will be detected, by accident, and some will not be (depending on whether scanner states are properly nested in the entity). Given that the behaviour is accidental, it can be considered a test failure, which could be fixed.
Not well formed entity (2 tests)
That a content entity is well formed is not checked. Failed, and can be implemented.
PE disallowed in internal subset, specifically in entity value (1 test)
Failed, can be implemented.
Unicode validation (1 test)
Discrete legal character in PI target (ibm/valid/P02/ibm02v01). To be investigated.

A.2 Output tests

Output test validation has not been performed yet.


Notes

1. Except ISE's Eiffel# environment which is said to adopt wide characters from the underlying runtime.

2. While namespace declaration URIs usually look like HTTP references, then need not point to an existing web page, they are no more or less than a unique identifier for the namespace -- using domain names being a convenient way to provide global unicity.

3. XML tree representations often include nodes for things like comments or processing instructions, but these nodes are rarely used.

4. Approximately half a megabyte per second is processed on an ordinary current PC, using the canonical pedantic parser compiled with SmallEiffel.