Jeni’s “Grouper” (or how to specify and parse additional rules to a grammar) is just a minor task for LBNF

In her recent post "RELAX NG for matching" Jeni Tennison said:


The “grouper” is the most interesting and difficult of these. It needs to act like a tokeniser, except use regular expressions over markup rather than over text. For example, say I had:


I want to be able to create a rule that says “any sequence that looks like a number element that contains a
two-digit number between 1 and 31, followed by a punc element that contains a slash, followed by another two-digit number between 1 and 12, followed by a punc element that contains a slash, followed by another two-digit number should be wrapped in a date element”.

Now this is something that XPath is really bad at. Try writing an expression that selects, from a sequence of elements that may contain other <number> and <punc> elements as well as other elements, only those sequences of elements that match the pattern I just described. It’s something like:

number[. >= 1 and . <= 31 and string-length(.) = 2]
      [following-sibling::*[1]/self::punc = '/']
      [following-sibling::*[2]/self::number[. >= 1 and . <= 12 and string-length(.) = 2]]
      [following-sibling::*[3]/self::punc = '/']
      [following-sibling::*[4]/self::number[string-length(.) = 2]]
  /(self::number, following-sibling::*[position() <= 4])

which is fiddly and messy and only works in this particular example because I know precisely how many elements there are
supposed to be in the group.


Then she proceeds by making the suggestion that

    "As a language, RELAX NG is really good at this"


Jeni ends with the following statement:


"So I think a “grouper” component should use RELAX NG to identify sequences to be marked up. But I have no idea if there are RELAX NG libraries out there that can be used in this way: to identify and extract matching sequences rather than to validate entire documents"


It is obvious that the solution of this problem is not strictly bound to RELAX NG itself
(which just happens to be able to parse a schema defined using the RNC grammar). The tool Jeni reasons about would be any compiler writing system that allows additional grammar rules, that are not required to reach the start symbol of the language, but may be useful for other purposes.

Very fortunately, I know at least one such system:

         the Labelled BNF Grammar Formalism (LBNF).

 Jeni’s "grouper" tool can be implemented by adding additional rules to the language being specified
using LBNF’s "internal pragmas".

Here’s how the authors of LBNF Markus Forsberg, Aarne Ranta from Chalmers University of Technology
and the University of Gothenburg describe LBNF’s internal pragmas:

6 LBNF Pragmas

6.1 Internal pragmas

Sometimes we want to include in the abstract syntax structures that are not part of the concrete syntax, and hence not parsable. They can be, for instance, syntax trees that are produced by a type-annotating type checker. Even though they are not parsable, we may want to pretty-print them, for instance, in the type checker’s error messages. To define such an internal constructor, we use a pragma 

    "internal" Rule ";"

where Rule is a normal LBNF rule. For instance,

   internal EVarT. Exp ::= "(" Ident ":" Type ")";

introduces a type-annotated variant of a variable expression.

Of course, LBNF is a very nice and cool tool to carry out a number of really important language development tasks, besides the "grouper".

This entry was posted in compiler writing systems. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s