control structure: do xml-parse
Syntax
   do xml-parse document
      (with id-checking Boolean-expression)?
      (with utf-8 Boolean-expression)?
      (creating xml-dtds key keyname)?
      scan (input source | (input input-function-call))
      action+
   done

   do xml-parse instance
      (with document-element element-name)?
      with (xml-dtds key key | current xml-dtds)
      (with id-checking Boolean-value)?
      scan (input source | (input input-function-call))
   done

You can invoke the markup processor, and its XML parser, with do xml-parse. To invoke the markup processor with its SGML parser, use do sgml-parse.
do xml-parse initiates a code block, ending with done, in which you must use the parse continuation operator (%c or suppress) to initiate processing of the data by markup rules.
The simplest use of do xml-parse is to process a complete XML document:

   do xml-parse document
      scan file "my-xml.xml"
      output "%c"
   done

This assumes that the file "my-xml.xml" contains a complete XML document. Often, however, the DTD and the instance you want to process are in two different files. The simplest way to handle this is to join the two sources:

   do xml-parse document
      scan file "my-dtd.dtd" || file "my-xml.xml"
      output "%c"
   done

But suppose you have 20 instances to process, all of which use the same DTD. It is wasteful to parse the same DTD 20 times. To avoid doing this, you can pre-compile the DTD and place it on the built-in shelf xml-dtds:

   do xml-parse document
      creating xml-dtds key "my-dtd"
      scan file "my-dtd.dtd"
      suppress
   done

You can then process each instance in turn. The following code assumes you have placed the file names of the instances on a shelf called "my-instances":

   repeat over my-instances
      do xml-parse instance
         with xml-dtds key "my-dtd"
         scan file my-instances
         output "%c"
      done
   again

In some cases you may wish to parse a partial instance, that is, a piece of data comprising an element from a DTD that is not the doctype element of that DTD. In this case, you can specify the element to be used as the effective doctype for parsing the data:

   do xml-parse instance
      with document-element "lamb"
      with xml-dtds key "my-dtd"
      scan file "partinst.xml"
      output "%c"
   done

The element's start and end tags can be present, or they can be omitted if the element allows. XML comments, processing instructions, and even marked sections can precede the element's start tag and follow its end tag, but anything else (particularly other elements, data, entity references, or USEMAP declarations) is an error.

By default, OmniMark checks all XML IDREF attributes to make sure they reference a valid ID. This checking may not be appropriate when processing a partial instance, and it also takes time. You can turn this checking on and off using with id-checking followed by a Boolean expression. The following code will parse the specified document without checking IDREFs:

   do xml-parse document
      with id-checking false
      scan file "my-xml.xml"
      output "%c"
   done

When parsing a document, markup rules are fired as follows (if specified in your code):
dtd-start
dtd-end
prolog-end
epilog-start
When parsing an instance part, only general markup rules are fired.
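
The rules named above can be sketched in a small program. This is a minimal illustration, not a definitive implementation: it assumes that "my-xml.xml" (the file name used earlier) contains a document with a DTD, and the messages written to #error are purely illustrative.

```omnimark
; Sketch only: trace the document-level markup rules during a full parse.
process
   do xml-parse document
      scan file "my-xml.xml"
      output "%c"
   done

dtd-start
   put #error "dtd-start%n"

dtd-end
   put #error "dtd-end%n"

prolog-end
   put #error "prolog-end%n"

epilog-start
   put #error "epilog-start%n"

; Catch-all element rule (an assumption for this sketch): continue
; parsing the content of any element that has no rule of its own.
element #implied
   output "%c"
```

If the same rules are left in a program that uses do xml-parse instance, then according to the description above only the element rule (a general markup rule) would be expected to fire.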
If there are errors in the XML declaration or prolog (DTD), then the processing of the content of the do xml-parse action will terminate. Execution resumes with the actions following the parse continuation operator in the body of the do xml-parse. However, the amount of input read is undefined in this situation. That is, OmniMark may choose to consume the entire input source, it may stop reading the input immediately, or it may do something in between.
While you can process XML documents with UTF-8 (a Unicode character encoding), do xml-parse does not do so by default. Using do xml-parse without changing this default means that character references between 128 and 255 are output as single bytes with the corresponding values (their literal binary equivalents) rather than as UTF-8 byte sequences.
To process XML documents using the UTF-8 character encoding, do the following:

   process
      do xml-parse document with utf-8 true
         scan file "myfile.sgm"
         output "%c"
      done

Note that actual UTF-8-encoded characters in your input data are unaffected by this setting.
Note that with utf-8 can only be used with a full document parse and not with an instance parse.
More about UTF-8
XML is a Unicode-based language, and the most common encoding of Unicode characters is UTF-8. UTF-8 encodes characters from 0 to 127 as single bytes and characters 128 and up as multiple bytes. If you issue do xml-parse document with utf-8 true, you are telling the XML parser that numeric character references in the document you are processing (for example &#239;) are to be translated to the appropriate UTF-8 byte sequence on output. Character references between 128 and 255 will therefore not be output as single bytes with the corresponding values, but as the UTF-8 byte sequences that represent those character values.
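
As a concrete illustration of this translation, consider the character reference &#239;, which names code point 239 (hexadecimal EF). In UTF-8 that code point is the two-byte sequence 0xC3 0xAF, whereas a single-byte rendering would be 0xEF. The sketch below assumes a hypothetical file "char-ref.xml" whose entire content is <doc>&#239;</doc>; the file name and the catch-all element rule are illustrative assumptions, not part of the original example.

```omnimark
; Sketch only: "char-ref.xml" is a hypothetical file containing
;    <doc>&#239;</doc>
process
   do xml-parse document with utf-8 true
      scan file "char-ref.xml"
      output "%c"
   done

; Pass the content of every element through to the current output.
element #implied
   output "%c"
```

Based on the behaviour described above, the character reference should reach the output as the bytes 0xC3 0xAF when with utf-8 true is specified, and as the single byte 0xEF when it is not.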
Related Syntax: #current-output, creating, document-end, document-start, external-text-entity, find-end, find-start, suppress
Related Concepts: Input; Input functions; XML DTDs: creating; XML/SGML parsing: built-in shelves