Guide to OmniMark 5
     
Parsing (XML and SGML)

OmniMark has integrated XML and SGML parsers. Because they are part of the language, the parsers do not use a parser interface like DOM or SAX. Instead, they are integrated into the streaming model of OmniMark.

In general terms, any program that reads a data stream and analyzes it to reveal its structure is a parser. Almost all OmniMark programs are parsers in this sense. The integrated XML and SGML parsers perform a specific and formal kind of parsing that corresponds to the requirements of the XML and SGML specifications respectively.

The XML and SGML parsers perform three basic functions:

  1. separate markup from data and report the structure of the document
  2. validate the structure of the document and report errors
  3. expand entities (sometimes with help from your program)

This behavior is appropriate in all cases in which you are attempting to interpret the XML or SGML document based on its structure and content. If you want to process an XML or SGML document in another way (for example, to programmatically edit existing XML or SGML documents) it may be appropriate to write your own "parser" using scanning techniques.

The OmniMark parsers fit into the streaming and hierarchical model of OmniMark processing. The parser takes over the job of scanning the input source and reports the structure of the parsed document by firing markup rules. You write code in the body of the markup rules to respond to the reported structure of the document.

The data content of a parsed document is streamed through to current output unless you explicitly process it. See "data content, processing" and "parsed data, formatting".

Here is a simple XML document:

  <person>
   <name>Mary</name>
   <bio>
    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>
   </bio>
  </person>

Here is a program that processes this XML document and produces HTML output:

  process
     local stream output-file
     open output-file as file "output.txt"
     using output as output-file
     do xml-parse instance
        scan file "input.xml"
        output "<HTML>%c</HTML>"
     done

  element "person"
     output "<BODY>%c</BODY>"

  element "name"
     output "<H1>%c</H1>"

  element "bio"
     output "%c"

  element "p"
     output "<p>%c</p>"

The output of the program is:

  <HTML><BODY>
   <H1>Mary</H1>

    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>

  </BODY></HTML>

The do xml-parse statement is used to configure the parser before parsing begins. It tells the parser what form of parsing to use and what source to scan. See do xml-parse for details on configuring the XML parser. See do sgml-parse for details on configuring the SGML parser.

Parsing actually begins when the parse continuation operator "%c" is encountered within the body of the do xml-parse block. This initiates parsing, which proceeds until the first markup rule is fired; in this case, that is the "person" element rule. Within the element rule you can do any processing you want before and after the content of the "person" element is parsed. Anything you do before "%c" is invoked happens before the element's content is parsed. Anything after "%c" happens after the element's content is parsed. In this case, the HTML tag "<BODY>" is output before the element's content is parsed and the tag "</BODY>" is output after.

The next element rule to fire is "name". It fires as a result of the parsing initiated by "%c" in the "person" element rule. The "person" rule is suspended at the "%c" until all its content is parsed. In this way OmniMark builds up a hierarchy of fired rules that corresponds to the hierarchy of the document being parsed.

The "name" element rule contains a single action: output "<H1>%c</H1>". This causes the string "<H1>" to be output. Then the "%c" causes the parser to resume. The "name" element of the document contains only the data content "Mary". The parser streams this data content to the current output scope. The "name" rule then resumes and outputs the string "</H1>". The result of these three output events, in this order, is that the current output scope receives the text "<H1>Mary</H1>".

You can assign "%c" to a variable:

  element "name"
     local stream name-text
     set name-text to "%c"

The variable name-text will contain "Mary". However, do not be misled into believing that "%c" returns the data content of an element. All "%c" does is force the parser to continue. The parser then outputs the data content of the element to the current output scope. The text "Mary" ends up in the variable name-text because the set action creates a new current output scope and makes its first argument the destination for that output scope. This change of output scope lasts only as long as the set action, but since "%c" occurs in the set action, that scope is in effect when the parser outputs the data content "Mary".
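
As a minimal sketch of this mechanism, the captured text can be reused once the set action has finished and the previous output scope is back in effect. The stream name name-text is the one from the example above; wrapping the captured text in "<H1>" tags is only an illustration.

  element "name"
     local stream name-text
     ; "%c" resumes the parser; the data content lands in name-text
     ; because set makes name-text the current output for this action
     set name-text to "%c"
     ; the set action has ended, so this output goes to the enclosing scope
     output "<H1>" || name-text || "</H1>"

Capturing the content first is useful when you need to inspect or reorder it before deciding what to output.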

The consequence of this mechanism becomes clear if we introduce a set statement into the "person" element:

  element "person"
     local stream person-text
     set person-text to "%c"

This will place the following text into the variable person-text:

   <H1>Mary</H1>

    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>

The variable has become the output destination for all the processing that occurs as a result of parsing the "person" element, including the output of the "name", "bio", and "p" rules. The "<BODY>" tags do not appear because this version of the "person" rule no longer outputs them.

Validation

Both the XML and SGML parsers validate the documents they parse. The kind of validation done depends on how the parser is configured. The example above does well-formed parsing, so the only validation done is to ensure that the document is well formed. You can also configure the parsers for DTD validation.
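
As a rough sketch of DTD validation, assuming the input file carries its own document type declaration, the parse can be configured with the document keyword in place of instance. The file name is the one used in the earlier example; see do xml-parse for the authoritative list of configuration options.

  process
     ; "document" asks the parser to read the DTD referenced by the
     ; document itself and to validate the content against it
     do xml-parse document
        scan file "input.xml"
        output "%c"
     done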

When the parser encounters invalid input, a markup-error rule is fired. A markup error is not a program error; it is simply an event that you can deal with in your program by writing the appropriate code in a markup-error rule.

For more information, see Errors, markup (XML or SGML).
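
A markup-error rule of the kind described above might look like the following sketch. It assumes that the built-in stream #message holds the parser's description of the problem and that reporting to #error is what you want; check both against the markup error documentation before relying on them.

  markup-error
     ; report the problem and let parsing continue
     put #error "Markup error: " || #message || "%n"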

Retrieving parse state information

Because they are streaming parsers, the OmniMark XML and SGML parsers do not build a parse tree in memory. However, the hierarchy of rules existing at any point in a parsing operation contains all the information you need about the current state of the parse and the structure of the document at that point.

For example, you can test to see if an element has a particular parent:

  element "name" when parent is "person"
     output "<H1>%c</H1>"

  element "name" when parent isnt "person"
     output "%c"

You can also make the test inside the rule body:

  element "name"
     do when parent is "person"
        output "<H1>%c</H1>"
     else
        output "%c"
     done

Different parse state information is available in different rules. See the various markup rules for specific information.
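
Tests are not limited to the immediate parent. As a small sketch, assuming the sample document shown earlier, the ancestor test matches an enclosing element at any depth, so it succeeds for "p" elements even though their parent is "bio":

  element "p" when ancestor is "person"
     ; "person" is the grandparent here, not the parent
     output "<p>%c</p>"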

Combining scanning and parsing

You can process a data source by first scanning the data and then parsing the output of the scanning process. This is particularly valuable when you are translating data into XML or SGML and want to use the parser to verify the structure of the output document. It is also a useful way to do certain kinds of processing by first normalizing the data to XML or SGML format and then processing the normalized data.

The most obvious way to process a document by scanning and then parsing is to stream scanning output to a buffer or a file and then to parse the file or buffer. However, this approach is resource intensive and can be slow. You can avoid buffering the intermediate form by feeding the output of the scanning process directly to the parser. That is, the output scope of the scanning process becomes the input source of the parsing operation. You do this with the input keyword and an input function that initiates the scanning process:

  define function make-xml as
     submit #main-input

  process
     do xml-parse instance
      scan input make-xml
        output "%c"
     done

Note that the function make-xml does not return a value. The effect of the input statement is to bind the output of the scanning process to the input of the parse. Thus any output generated as a result of the code in the function make-xml (in this case, the output generated by the find rules that process the data scanned by the submit statement) becomes the XML source to be parsed.

The scanning and parsing processes run in parallel, meaning that the parsing process asks the scanning process for some data, parses it, and then asks for more. This is exactly how the parser behaves with any source. In this case, however, the source is an active OmniMark process. The two processes run in turn until the whole input is processed, with minimal buffering between the two processes. See input function for more details.
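
The find rules that generate the XML are not shown in the example above. As a hedged sketch, assuming the input is plain text with one item per line, the rule below wraps each line in a hypothetical "p" element, and this variant of make-xml supplies a hypothetical "doc" root element so that the generated stream is well formed. The element names and the pattern variable line-text are illustrative assumptions, and the sketch does not escape markup characters such as "<" and "&" in the data.

  define function make-xml as
     ; everything output here becomes input to the parser
     output "<doc>"
     submit #main-input
     output "</doc>"

  find any-text+ => line-text
     ; any-text does not match line ends, so each match is one line
     output "<p>%x(line-text)</p>"

Line ends are not matched by the find rule, so they pass through to the parser unchanged as white space between the "p" elements.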

If you are using this technique to validate an XML or SGML document you are creating from other data, you need to add another stream to the output scope created by input so that you can capture the data you are creating:

  define function make-xml as
     local stream output-file
     open output-file as file "output.xml"
     ; send the output of the find rules to both the parser and the file
     using output as #current-output & output-file
        submit #main-input

  process
     do xml-parse instance
        scan input make-xml
        suppress
     done

In the code above, the output of the find rules will go both to the stream output-file and to the parser. Note that output "%c" in the do xml-parse block has been replaced with suppress, which suppresses all output from the parser and markup rules.

Alternatively, you can create your final XML or SGML output file from the output of the parser:

  define function make-xml as
     submit #main-input

  process
     do xml-parse instance
        scan input make-xml
        output "%c"
     done

  element #implied
     output "<%q>%c</%q>"

       
----

Generated: August 11, 2000 at 3:06:26 pm

Copyright © OmniMark Technologies Corporation, 1988-2000.