Guide to OmniMark 5

Pattern matching

OmniMark allows you to search for particular strings in input data using find rules. For example, the following find rule will fire if the string "Hamlet:" is encountered in the input:

  find "Hamlet:"
     output "<b>Hamlet</b>: "

Using this method, however, you would have to write a separate find rule for each character name you wanted to enclose in HTML bold tags. For example:

  find "Hamlet:"
     output "<b>Hamlet</b>: "
  find "Horatio:"
     output "<b>Horatio</b>: "
  find "Bernardo:"
     output "<b>Bernardo</b>: "

As you can imagine, this approach quickly becomes repetitive and hard to maintain.

This is where OmniMark "patterns" come in. OmniMark has rich, built-in, pattern-matching capabilities which allow you to match strings by way of a more abstract "model" of a string rather than matching a specific string. For example:

  find letter+ ":"

This find rule will match any string of one or more letters followed immediately by a colon.

Unfortunately, the pattern described in this find rule isn't specific enough to flawlessly match only character names. It will match any string of letters that is followed by a colon that appears anywhere in the text, meaning that words in the middle of sentences will be matched.

Words that appear in the middle of sentences rarely begin with an uppercase letter, while names usually do. This allows us to add further details to our find rule:

  find uc letter+ ":"

This find rule matches any string that begins with an uppercase letter (uc) followed by at least one other letter (letter+) and a colon (":").

If we were actually trying to mark up an ASCII copy of "Hamlet", however, our find rule would only match character names that contain a single word, such as "Hamlet", "Ophelia", or "Horatio". Only the second part of two-part names would be matched, so the names "Queen Gertrude", "Lord Polonius", and so forth, would be incorrectly marked up.

In order to match these more complex names as well as the single-word names, we'll have to further refine our find rule:

  find uc letter+ (white-space+ uc letter+)? ":"

In this version of the find rule, the pattern can match a second word prior to the colon. The pattern (white-space+ uc letter+)? can match one or more white-space characters followed by an uppercase letter and one or more letters. All of this allows the find rule to match character names that consist of one or two words.
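
To see the refined rule at work, here is a minimal sketch of a complete program. The process rule, the sample text passed to submit, and the pattern variable speaker (bound with =>) are assumptions added for illustration and are not part of the original example:

```omnimark
; Sketch only: the sample input and the variable name "speaker" are illustrative.
process
   submit "Hamlet: Words, words, words.%nLord Polonius: What is the matter, my lord?%n"

; The outer parentheses make => capture the whole one- or two-word name.
find (uc letter+ (white-space+ uc letter+)?) => speaker ":"
   output "<b>"
   output speaker
   output "</b>:"
```

Input that no find rule matches is passed through to the output unchanged, so only the speaker names should end up wrapped in bold tags.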

If you wanted to match a series of three digits, you could use the following pattern:

  find digit {3}

If you wanted to match either a four-digit or a five-digit number, you could use the following pattern:

  find digit {4 to 5}

To match a date that occurs in the "yy/mm/dd" format, the following pattern could be used:

  find digit {2} "/" digit {2} "/" digit {2}

A Canadian postal code could be matched with the following pattern:

  find letter digit letter " " digit letter digit
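
Counted patterns like these can also be tried against a single value rather than a whole input stream. The sketch below assumes the do scan construct listed under Related Syntax and the value-end positional pattern, which requires the match to reach the end of the scanned value. The sample string is hypothetical:

```omnimark
; Sketch only: classify one hypothetical value using the patterns above.
process
   do scan "00/08/11"
      ; yy/mm/dd date: three two-digit groups separated by slashes
      match digit {2} "/" digit {2} "/" digit {2} value-end
         output "date%n"
      ; Canadian postal code: letter digit letter, space, digit letter digit
      match letter digit letter " " digit letter digit value-end
         output "postal code%n"
      else
         output "unrecognized%n"
   done
```

The alternatives are tried in order, so the first pattern that matches the value decides which message is written.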

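The letter and uc keywords that are used to create the patterns shown above are called "character classes". OmniMark provides a variety of built-in character classes. Those used in this section are letter (any letter), uc (an uppercase letter), digit (a digit), and white-space (a white-space character).

Any pattern can be modified through the use of occurrence indicators. As shown in the find rules above, letter+ matches one or more letters, letter* matches zero or more letters, and uc? matches zero or one uppercase letter. A count in braces matches an exact number or a range of occurrences: digit {3} matches exactly three digits, and digit {4 to 5} matches four or five.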

Defining your own character classes

You can define your own character classes. For example:

  find ["+-*/"]
     output "found an arithmetic operator%n"

This find rule would fire if any one of the four arithmetic operators was encountered in the input data.

Compound character classes can be created using the except and or keywords. In the examples below, the backslash (\) is the short form of except:

  find [\ "}"]

The find rule above would match any character except for a right brace.

This find rule would match any one of the arithmetic operators or a single digit:

  find ["+-*/" or digit]

This one would match any of the arithmetic operators or any digit except zero ("0"):

  find ["+-*/" or digit \ "0"]
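
As a rough illustration of user-defined character classes in a complete program, the sketch below splits a small arithmetic expression into numbers and operators. The process rule, the input string, and the pattern variable names found-digits and found-op are assumptions made for this example:

```omnimark
; Sketch only: the input string and variable names are illustrative.
process
   submit "12+7*3"

; A run of digits is reported as one number.
find digit+ => found-digits
   output "number "
   output found-digits
   output "%n"

; Any single character from the user-defined class is an operator.
find ["+-*/"] => found-op
   output "operator "
   output found-op
   output "%n"
```

Each find rule consumes the text it matches, so the two rules take turns as the scan moves through the expression.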

Zero-length pattern matching

The occurrence indicators ? and * allow for a pattern to succeed if it is matched zero (or more) times. In effect, this means that these patterns always match, since the "zero" in "zero or more" really means "the pattern succeeds even if it is not found in the data".

This is very useful behavior when there is an optional element in a pattern. For example, this pattern matches a currency amount in dollars whether or not cents are specified:

  find "$" digit+ ("." digit{2})?

The sub-pattern ("." digit{2})? will match a cents amount like ".34" if it exists. If it does not, the sub-pattern still succeeds, matching zero characters, and the find rule as a whole fires on the dollar amount alone.
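
A minimal sketch of this rule in a complete program follows. The process rule, the sample sentence, and the pattern variable amount are assumptions added for illustration:

```omnimark
; Sketch only: the sample sentence and the variable name "amount" are illustrative.
process
   submit "Tea costs $3 and cake costs $4.50 today.%n"

; The optional cents group matches ".50" in one price and nothing in the other.
find ("$" digit+ ("." digit {2})?) => amount
   output "["
   output amount
   output "]"
```

Both prices should be bracketed in the output: the first with the optional group matching zero characters, the second with it matching the cents.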

Because a pattern can succeed while matching zero characters, a rule can fire without consuming any data:

  find ("$" digit+ "." digit{2})?

The entire pattern above has a "zero-or-one" occurrence indicator. While it will match a currency value if one exists, it will also match zero characters at any point in the input. This means that it will fire whenever no previous pattern fires, no matter where it is in the data.

Since no data has been consumed, the pattern matching context has not changed and the rule would then fire again and again. However, OmniMark does not let this happen. OmniMark does not allow two consecutive zero-length pattern matches.

Once any pattern has matched zero characters, all rules in the current scan are prevented from matching zero characters until at least one character has been consumed. You can remove this restriction using the null pattern modifier.

      Related Syntax
   ~
   do scan
   find
   repeat scan
   utf8-char
 

Generated: August 11, 2000 at 3:06:26 pm
If you have any comments about this section of the documentation, send email to docerrors@omnimark.com

Copyright © OmniMark Technologies Corporation, 1988-2000.