About OmniMark

OmniMark is a streaming programming language. It is designed to make it easy for you to write programs using the streaming programming model.

The streaming programming model is an approach to programming that concentrates on describing the process to be applied to a piece of data, and on processing data directly as it streams from one location to another. In the streaming model, the use of data structures to model input data is eliminated, and the use of data structures to model output is greatly reduced. For instance, here is an OmniMark program that takes a document and converts all references to monetary amounts from the English style ("$29.95") to the French style ("29,95$"):

  process
     submit "The doggy in the window costs $24.95."

  find "$" digit+ => dollars "." digit{2} => cents
     output dollars || "," || cents || "$"

This program outputs:

  The doggy in the window costs 24,95$.

Here's how this program works:

The word process starts a process rule. An OmniMark program is a collection of rules. A process rule fires when the program is run. It is the equivalent of function "main" in other languages.
The word submit creates an OmniMark source. In this case, the content of the source is the literal string "The doggy in the window costs $24.95.".
The submit also initiates scanning of the source it creates. Scanning is a process in which data is moved from a source to a destination and has a process applied to it as it moves.
The word find defines a find rule. A find rule is a filter for data that is being scanned. The find rule specifies a pattern to be matched in the data and the process to be applied when the data is matched. When a source is scanned by find rules, data that is not trapped streams through to the current output scope. Data that is matched by a pattern is consumed and does not stream through to output. Output generated by the rule is merged with the data streaming to the current output scope.

The pattern used in this find rule is designed to match English style dollar values. Leaving out the pattern variable assignments, which we'll discuss in a moment, it looks like this:

  "$" digit+ "." digit{2}

The pattern reads as follows: Match a literal dollar sign ("$"). Then match one or more digits (the keyword digit with a plus sign after it, meaning "one or more"). Then match a literal period (".") followed by exactly 2 digits (digit{2}).
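For readers who know regular expressions, the same pattern can be approximated in Python. This is only a rough analog for comparison, not OmniMark code: the names PATTERN and find_amounts are made up for this sketch, and [0-9] is used on the assumption that digit means the ASCII digits.

```python
import re

# Rough regular-expression analog of:  "$" digit+ "." digit{2}
#   \$        a literal dollar sign
#   [0-9]+    one or more digits
#   \.        a literal period
#   [0-9]{2}  exactly two digits
PATTERN = re.compile(r"\$[0-9]+\.[0-9]{2}")


def find_amounts(text):
    """Return every substring of text that matches the pattern."""
    return PATTERN.findall(text)
```

An amount with no cents ("$5") or with only one digit after the period ("$7.5") does not match, because the pattern requires the period and exactly two digits.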

In order for the program to create the proper output it needs to capture the digits that represent the dollars and the cents portions of the matched data. This is done by assigning the matched data to pattern variables, using the pattern variable assignment operator =>. This is the pattern with the pattern variables in place:

  find "$" digit+ => dollars "." digit{2} => cents

When the scanning process encounters a piece of data that matches this pattern it will fire the find rule and the data matched by digit+ will be assigned to dollars and the data matched by digit{2} will be assigned to cents.

Next, the actions associated with the find rule are executed. The output statement writes the dollars and cents values with the "," and "$" characters in the appropriate places. The output goes to the current output scope. Since the unmatched data is also going to this scope, the output of the rule is merged into the source data as it flows to its destination.
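As a point of comparison, the capture-and-rewrite behaviour of the find rule can be sketched in Python with named groups. This is an analog under stated assumptions, not how OmniMark works internally: MONEY, to_french, and convert are hypothetical names, and the whole input is held in one string here, which the OmniMark program does not need to do.

```python
import re

# Named groups play the role of the pattern variables dollars and cents.
MONEY = re.compile(r"\$(?P<dollars>[0-9]+)\.(?P<cents>[0-9]{2})")


def to_french(match):
    """Build the replacement, like: output dollars || "," || cents || "$"."""
    return match.group("dollars") + "," + match.group("cents") + "$"


def convert(text):
    # Text that does not match passes through unchanged; each match is
    # consumed and replaced by whatever to_french returns.
    return MONEY.sub(to_french, text)


print(convert("The doggy in the window costs $24.95."))
```

As in the find rule, unmatched text is copied through untouched and only the matched amounts are rewritten.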

There is a lot of detail in this explanation. To get a better idea of how this program works, paste the program into the OmniMark IDE, create an appropriate input file, and trace through the program.

Taking control of input and output

To process data other than literal strings, you need to be able to create a scanning source from external data sources. You also want to be able to send output somewhere other than the screen. In this revised version of the program, the output goes to the file named by the first command-line argument and the input comes from the file named by the second.

  process
     local stream out-file
     open out-file as file #args[1]
     using output as out-file
      submit file #args[2]

  find "$" digit+ => dollars "." digit{2} => cents
     output dollars || "," || cents || "$"

Note that only the process rule has changed. The find rule that does the actual work of processing the data remains the same no matter where the data comes from or where it goes. Here's how this new process rule works:

The first line of the rule creates a variable of type stream with the name out-file. A stream variable is a conduit through which data can flow to a destination.
The next line uses the open keyword to attach the stream out-file to the file named by the first command-line argument (#args[1]). Any data written to that stream will now be directed to that file. In OmniMark you never write directly to a destination. You always write to a stream attached to a destination. This means there is no difference in how you generate output, no matter where you are sending it.
The next line uses the command using output as to make the stream out-file part of the current output scope. This makes it the target of all output statements that are executed in that output scope.
The submit is now prefixed by the using output as statement, which means that all output generated as a result of the submit will go to that output scope.
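The same separation between "where the data comes from and goes to" and "what is done to it" can be sketched in Python. This is an illustrative analog only: convert_file and MONEY are hypothetical names, the argument order simply mirrors #args[1] (output) and #args[2] (input) above, and reading line by line is an assumption that is safe here because the pattern cannot span a line break.

```python
import re

# The processing step: unchanged regardless of source or destination.
MONEY = re.compile(r"\$([0-9]+)\.([0-9]{2})")


def convert_file(out_path, in_path):
    """Copy in_path to out_path, rewriting amounts as the data passes through."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:  # only one line is held in memory at a time
            dst.write(MONEY.sub(r"\1,\2$", line))
```

Calling convert_file("out.txt", "in.txt") would correspond to running the OmniMark program with those two file names as its arguments.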

The streaming model at work

Beyond the details of the program, notice the streaming model at work:

Firstly, notice that the input data is not buffered. No data structure is created to represent it. The process of replacing the English form with the French form is carried out as the data flows from source to destination. The output is not buffered either. This program will run with equal success on a 2 kilobyte file or a 2 gigabyte file.

Secondly, notice how the program describes the process it performs. A reasonable description of the function of this program would be: "It finds the English format for expressing currency and replaces it with the French format. The input comes from one file and the output goes to another." And when we look at the code, we see that the process rule describes the path the data takes from input file to output file, and the find rule says find the English currency format and output the French currency format.

Thirdly, notice the abstraction involved in dealing with sources and destinations of information. The find rule does not specify what data it is acting on: it is the current input data, whatever source that may flow from. The output statement does not say where the output goes to; it goes to the current output scope, whatever that may be attached to. This means that the same scanning techniques can be applied to any piece of data from program variables, to files, to network data streams, in exactly the same manner. Scanning is a fully general data processing technique, independent of the source or destination of the data to be scanned.
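To make the source and destination abstraction concrete for readers coming from other languages, here is a small Python sketch in which the processing function knows nothing about where its data comes from or goes to. It is an analog, not OmniMark: convert, MONEY, and the sample sentences are invented for illustration, and the in-memory sink is used only so the result can be inspected.

```python
import io
import re

MONEY = re.compile(r"\$([0-9]+)\.([0-9]{2})")


def convert(source, sink):
    """Stream lines from any iterable of strings to any object with write()."""
    for line in source:
        sink.write(MONEY.sub(r"\1,\2$", line))


# A program variable (a list) as the source, an in-memory stream as the sink.
from_list = io.StringIO()
convert(["Tea costs $3.50.\n", "Cake costs $12.00.\n"], from_list)

# A file-like object as the source; the processing code is unchanged.
from_stream = io.StringIO()
convert(io.StringIO("Tea costs $3.50.\nCake costs $12.00.\n"), from_stream)
```

Passing an open file or a socket wrapper as source or sink would work the same way, which is the point the paragraph above makes about scanning.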

Fourthly, notice how much work is done for you by the scanning mechanism. There is no data movement code in this program. There is no need to maintain pointers or offsets into the data. There is no memory management to worry about. There is no need to explicitly buffer input and output. There is not even any need to worry about the opening and closing of files. All these things are done for you, in a highly robust and optimized way.

Streaming parsing

The same streaming techniques apply to XML parsing. Here is a simple XML document:

  <person>
   <name>Mary</name>
   <bio>
    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>
   </bio>
  </person>

Here is a program that processes this XML document to produce HTML output:

  process
      do xml-parse instance
      scan file "input.xml"
        output "<html>%c</html>"
      done

  element "person"
     output "<body>%c</body>"

  element "name"
     output "<H1>%c</H1>"

  element "bio"
     output "%c"

  element "p"
     output "<p>%c</p>"

You should step through this program in the OmniMark IDE to observe how it works. The output of the program is:

  <html><body>
   <H1>Mary</H1>

    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>

  </body></html>

The process rule plays the same role in this program as in the previous one. It establishes an input source and an output destination and it starts the scanning process. The difference here is that it is the parser that scans the data, not find rules. When the parser finds element markup in the source it is scanning, it fires an element rule. Just as with find rules, the unmatched data -- the "data content" in XML terms -- streams through to the current output. Thus each element rule can output into the current output stream just the way a find rule does.

Since the program is creating HTML, its element rules output HTML markup:

  element "person"
     output "<body>%c</body>"

This rule outputs the start and end tags for the HTML body element. In the final output, however, there will be a good deal of markup and data between "<body>" and "</body>". Because XML data is hierarchical in nature, element rules fire hierarchically as well. The "person" element rule is suspended at the point the string "%c" occurs in the output statement. All the contents of the "person" element are then parsed, with the appropriate rules being fired. This results in the other markup and data being sent to output. Once this is done, the "person" element rule resumes and "</body>" is output.
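The suspend-and-resume behaviour around "%c" can be imitated in Python with a SAX handler that writes the text before "%c" when an element starts, remembers the text after it, and writes that text when the element ends. This is a sketch of the idea only, not a description of the OmniMark parser: RULES, RuleHandler, and to_html are hypothetical names, and the rule table is copied from the element rules above.

```python
import io
import xml.sax
from xml.sax.saxutils import escape

# Hypothetical rule table: element name -> (text before "%c", text after "%c").
RULES = {
    "person": ("<body>", "</body>"),
    "name": ("<H1>", "</H1>"),
    "bio": ("", ""),
    "p": ("<p>", "</p>"),
}


class RuleHandler(xml.sax.ContentHandler):
    def __init__(self, out):
        super().__init__()
        self.out = out
        self.pending = []  # closing text of each "suspended" rule, innermost last

    def startElement(self, name, attrs):
        before, after = RULES[name]
        self.out.write(before)      # the part of the rule before "%c"
        self.pending.append(after)  # the rule is now suspended

    def characters(self, content):
        self.out.write(escape(content))  # data content streams through

    def endElement(self, name):
        self.out.write(self.pending.pop())  # the rule resumes after "%c"


def to_html(xml_text):
    out = io.StringIO()
    out.write("<html>")
    xml.sax.parseString(xml_text.encode("utf-8"), RuleHandler(out))
    out.write("</html>")
    return out.getvalue()
```

The stack of pending closing strings corresponds to the hierarchy of suspended element rules: it grows as the parser descends into the document and shrinks as each element ends.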

Processing hierarchical data with find rules

The streaming model also makes it easy to process hierarchical data without the assistance of a parser. To demonstrate this, the following program processes the same XML document using find rules. Once again, you should step through this program in the IDE to see how it works:

  declare catch end-tag

  process
     output "<html>"
     submit file #args[1]
     output "</html>"

  find "<person>"
     output "<body>"
     submit #current-input
     catch end-tag
        output "</body>"

  find "<name>"
     output "<H1>"
     submit #current-input
     catch end-tag
        output "</H1>"

  find "<bio>"
     submit #current-input
     catch end-tag

  find "<p>"
     output "<p>"
     submit #current-input
     catch end-tag
        output "</p>"

  find "</" [\ ">"]* ">"
     throw end-tag

In each find rule for a start tag, the scanning of the current input is handed off to another scanning process. In the single find rule that handles all end tags (find "</" [\ ">"]* ">") the word throw is used to collapse the current process and return execution to the rule that started it. Execution resumes at the catch statement.

Notice how these find rules parallel the element rules from the previous program. The commands submit #current-input and catch end-tag replace the "%c" and do the same thing: they build and collapse the hierarchy of rules that corresponds to the hierarchy in the data stream.
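The nested-scan-plus-throw technique can also be sketched in Python, using recursion over a shared token iterator in place of submit #current-input and an exception in place of throw end-tag. This is an analog under stated assumptions, not OmniMark semantics in full: TOKEN, START, EndTag, scan, and to_html are hypothetical names, and the input is tokenized from a single string for brevity.

```python
import io
import re

# Split the input into end tags, start tags, and runs of text.
TOKEN = re.compile(r"</[^>]*>|<[^>]*>|[^<]+")

# Hypothetical table: start tag -> (output before nested scan, output after).
START = {
    "<person>": ("<body>", "</body>"),
    "<name>": ("<H1>", "</H1>"),
    "<bio>": ("", ""),
    "<p>": ("<p>", "</p>"),
}


class EndTag(Exception):
    """Plays the role of the end-tag catch name."""


def scan(tokens, out):
    for tok in tokens:
        if tok.startswith("</"):
            raise EndTag  # collapse the current scan, like throw end-tag
        if tok in START:
            before, after = START[tok]
            out.write(before)
            try:
                scan(tokens, out)  # nested scan of the same input
            except EndTag:
                out.write(after)   # execution resumes here, like catch end-tag
        else:
            out.write(tok)         # unmatched data streams through


def to_html(text):
    out = io.StringIO()
    out.write("<html>")
    scan((m.group(0) for m in TOKEN.finditer(text)), out)
    out.write("</html>")
    return out.getvalue()
```

Because every level of scan reads from the same iterator, the outer scan continues exactly where the inner one stopped, which is what makes the recursion mirror the nesting of the tags.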

This illustrates how the streaming model eliminates the need to buffer input and output, and how it models the hierarchical structure found in most data, whether XML encoded or not.

To learn more about the basic principles of OmniMark programming see:

To learn about specific OmniMark syntax, just follow the links in the code samples.

----

Generated: August 11, 2000 at 3:06:14 pm
If you have any comments about this section of the documentation, send email to docerrors@omnimark.com

Copyright © OmniMark Technologies Corporation, 1988-2000.