[XPath 4.0] Proposal: Short-circuiting functions, function-arity guards and lazy hints

As originally posted on Jan. 2nd 2023 at: the QT site:


I. Shortcutting and lazy hints

Let us have this expression:

let $f := function($arg1 as item()*, $arg2 as item()*) as function(item()*) as item()*
             {  (: Some code here :) }
  return
    $f($x) ($y)

Evaluating $f($x) produces a function. The actual arity of this resulting function can be any number N >= 0 :

  • If N > 1 there would be arity mismatch error, as only one argument $y is provided in the expression.

  • If N = 1 the final function call can be evaluated, and the argument $y must be evaluated, or

  • If N = 0, then $y is unneeded and can safely be ignored according to the updated Coercion Rules / Function Coercion in Xpath 4.0.

Because a possibility exists to be able to ignore the evaluation of $y, it is logical to delay the evaluation of $y until the actual arity of $f($x) is known.

The current XPath 4.0 evaluation rules do not require an implementation to base its decision whether or not to evaluate $y on the actual arity of the function produced by $f($x), thus at present an implementation could decide to evaluate $y regardless of the actual arity of the function produced by $f($x).

This is where a lazy hint comes: it indicates to the XPath processor that it is logical to make the decision about evaluation of $y based on the actual arity of the function returned by $f($x).

A rewrite of the above expression using a lazy hint looks like this:

let $f := function($arg1 as item()*, $arg2 as item()*) as function(item()*) as item()*
             {  (: Some code here :) }  
  return
    $f($x) (lazy $y)

Here is one example of a function with short-cutting and calling it with a lazy hint:

let $fAnd := function($x as xs:boolean, $y as xs:boolean) as xs:boolean
             {
                let $partial := function($x as xs:boolean) as function(xs:boolean) as  xs:boolean
                                {
                                  if(not($x)) then ->(){false()}
                                              else ->($t) {$t}
                                }
                 return $partial($x)($y)
             }
  return
     $fAnd($x (: possibly false() :), lazy $SomeVeryComplexAndSlowComputedExpression)

Without the lazy hint in the above example, it is perfectly possible that an XPath implementation, unrestricted by the current rules, would evaluate $SomeVeryComplexAndSlowComputedExpression – something that is unneeded and could be avoided completely.

Formal syntax and semantics

  1. The lazy keyword should immediately precede any argument in a function call. If specified, it means that it is logical to make the decision about evaluation of this argument based on the actual arity of the function in this function call.

    Based on this definition, it follows that lazy $argK implies lazy for all arguments following $argK in the function call. Thus specifying more than one lazy hint within a given function call is redundant and an implementation may report this redundancy to the user.

    The scope of a lazy keyword specified on an argument is this and all following arguments of (only) the current function call.

  2. It is possible to specify a lazy keyword that is in force for the respective argument(s) of all function calls of the given function. To do this, the lazy keyword must be specified immediately preceding a parameter name in the function definition of that function.

    For example, if the function $f is specified as:

    let $f := function($arg1 as item()*, lazy $arg2 as item()*, $arg3 as item()*, $arg4 as item()* ) 
              { (: some code here:) }
      return
         $someExpression
    

    Then any call of $f in its definition scope that has the form:

    $f($x, $y, $z, $t)
    

    is equivalent to:

    $f($x, lazy $y, $z, $t)
    

II. A function’s arity is a guard for its arguments

Let us have a function $f defined as below:

let $f := function($arg1 as item()*, $arg2 as item()*, …, $argN as item()*)
   as function(item()*, item()*, …, item()*) as item()*
     {
       if($cond0($arg1))       then -> () { 123 }
        else if($cond1($arg1)) then -> ($Z1 as item()*) {$Z1}
        else if($cond2($arg1)) then -> ($Z1 as item()*, $Z2 as item()*) {$Z1 + $Z2}
        (:    .        .        .        .         .        .        .         .  :)
        else if($condK($arg1)) then -> ($Z1 as item()*, $Z2 as item()*, …, $Zk as item()*)
                                       {$Z1 + $Z2 + … + $Zk}
        else ()
     }
  return
     $f($y1, $y2, …, $yN) ($z1, $z2, …, $zk)

A call to $f returns a function whose arity may be any of the numbers: 0, 1, …, K.

Depending on the arity of the returned function (0, 1, …, K), the last (K, K-1, K-2, …, 2, 1, 0) arguments of the function call:

$f($y1, $y2, . . . , $yN) ($z1, $z2, . . . , $zk)

are unneeded and it is logical that they would not need to be evaluated.

So, the actual arity of the result of calling $f is a guard for the arguments of a call to this function-result.

Thus, one more bullet needs to be added to [2.4.5 Guarded Expressions] https://qt4cg.org/specifications/xquery-40/xpath-40.html#id-guarded-expressions), specifying an additional guard-type:

  • In an expression of the type E(A1, A2, ..., AN) any of the arguments A<sub>K</sub> is guarded by the condition  actual-arity(E) ge K. This rule has the consequence that if the actual arity of E() is less than K then if any argument Am (where m &gt;= K) is evaluated, this must not raise a dynamic error. An implementation may base on the actual arity of E() its decision for the evaluation of the arguments.
Advertisement
Posted in XPath, XPath 4.0 | Leave a comment

XPath Gems: Construct the PowerSet of a given set

XPath has become a really powerful functional language. Here is a recent proof of this.

Let us see how one can construct with XPath 3.1 the PowerSet of a set.

A refresh from set theory, and as per Wikipedia: "In mathematics, the power set (or powerset) of a set S is the set of all subsets of S, including the empty set and S itself.[1] In axiomatic set theory (as developed, for example, in the ZFC axioms), the existence of the power set of any set is postulated by the axiom of power set.[2] The powerset of S is variously denoted as P(S), 𝒫(S), P(S), or 2^S. The notation 2^S, meaning the set of all functions from S to a given set of two elements (e.g., {0, 1}), is used because the powerset of S can be identified with, equivalent to, or bijective to the set of all the functions from S to the given two elements set.[1]"

As there is no Set datatype yet in XPath, here we will denote a "set" as a sequence of items, the empty set being represented by the empty sequence. If desired, one could use XPath 3.1 arrays instead to represent items that are themselves sets (or "composite sequences").

As our function must return a set of sets, its return type cannot be just a simple sequence, because in XPath sequences are flat and there is no such thing as "sequence of sequences".

I have chosen to represent the result of the powerSet function below as a sequence of arrays, each array representing one subset of the provided set.

Finally, here is the code itself:

let $powerSet := function($set as item()*) as array(*)*
 {
   fold-right($set, [], function($it as item(), $accum as array(*)*)
                        {
                            $accum, $accum ! array:append(.,   $it)                                   
                        })
 }
  return $powerSet((1, 2, 3, 4))

As expected, the result is this set of 16 (2^4) subsets:

[]
[4]
[3]
[4,3]
[2]
[4,2]
[3,2]
[4,3,2]
[1]
[4,1]
[3,1]
[4,3,1]
[2,1]
[4,2,1]
[3,2,1]
[4,3,2,1]
Posted in XPath, XPath 3.0, XPath 3.1 | Leave a comment

[XPath 4.0] Proposal: Decorators support

Decorators

A Decorator is a tool to wrap (extend/change/modify) the behavior of one or more existing functions without modifying their code.

There are many cases when we want to handle a call to a specific function f() and do some or all of the following:

  1. Perform some initial action based on the context and on the arguments in the call to f().
  2. Transform the set of the actual arguments on the call to some other set of argument values — substituted arguments.
  3. Decide whether or not to invoke f(), passing to it the actual arguments or the substituted ones, created in the previous step.
  4. If we invoked the function in the previous step, we could do something with its result.
  5. Perform some final processing that may (but does not have to) depend on the result of the invocation of f().

Here is a small, typical problem that is well handled with decorators:

We want to have a tool to help us with debugging any function. When a function is called (and if we are in Debug mode), this tool will:

  • Tell us that a call to the function was performed and will list the function name and the parameters, passed to the function
  • Tell us what the result of the call to the function was

We will be able to do such tracing with not just one but with all functions, whose behavior we want to observe.

let $trace-decorator := function($debug as xs:boolean, $f as function(*))
    {
      let $theDecorator := function($args as array(*), $kw-args as map(*))
      {
        if($debug)
        then 
          let $func-name := (function-name($f), $kw-args("$funcName"))[1],
            $pre-msg := "Calling function " || $func-name || " with params: " 
                        || array:flatten(($args)) || "," 
            || map:for-each( $kw-args, 
                             function($key as xs:anyAtomicType, $val)
                             {if($key ne "$funcName")
                               then (" "|| string($key)||": " || string($val))
                               else ()
                              }),
            
            $result := $f($args, $kw-args),
            $post-msg := "Got result: " || string($result) || "&#10;"
           return
             ($pre-msg, $post-msg)
         else $f($args, $kw-args)
      }
      return $theDecorator
    },
    
    $upper := function($args as array(*), $kw-args as map(*))
    {
      let $txt := $args[1]
        return upper-case($txt)
    }
    
    return 
    (
     $trace-decorator(true(), $upper)(["hello"], map{"$funcName" : "$upper"}),
     $trace-decorator(true(), $upper)(["reader"], map{"$funcName" : "$upper"}),
        "=======================================================================",

     $trace-decorator(false(), $upper)(["hello"], map{"$funcName" : "$upper"}),
     $trace-decorator(false(), $upper)(["reader"], map{"$funcName" : "$upper"})
    )

The result of evaluating the above XPath 3.1 expression is exactly what we wanted to get:

Calling function $upper with params: hello,

Got result: HELLO

Calling function $upper with params: reader,

Got result: READER

====================================================

HELLO

READER

So, it is possible to write and use decorators even in XPath 3.1 as above. Then why XPath decorators are as rare as the white peacock?

peacock Photo Via: aboutpetlife.com

The answer is simple: just try to write even the simple decorator in XPath 3.1 and you’ll know how difficult and error-prone this is. This is why several programming languages provide special support for decorators:

  • Python: decorators are a standard feature of the language.

    • A standard syntax is provided to declare that a function is being decorated by another function. Composition of multiple decorators is supported.
    • There is a standard Python way of getting "any actual arguments" with which the unknown in advance function (to be decorated) is called.
    • There is a standard Python way to call any function passing to it just an array (for its positional arguments) and a map (for its keyword arguments).
  • Typescript:

    • A standard syntax is provided to declare that a function is being decorated by another function. Composition of multiple decorators is supported.
    • Decorators can be applied not only to methods but also to their parameters (and to classes, constructors, static and instance properties, accessors).
    • The actual parameters in calling the manipulated method accessed in a standard way as a spread (the opposite of destructuring). The spread syntax is used both for getting the parameters and in providing them in the call to the manipulated method.
  • Javascript : Almost the same as in Typescript (above)

Goal of this proposal

To provide standard XPath support for decorators, as seen in other languages (above):

  1. Provide syntax for specifying the decoration of a function:

Update rule:

[72] FullSignature ::= "function" "(" ParamList? ")" TypeDeclaration?

To:

[72] FullSignature ::= ("^" DecoratorReference)* "function" "(" ParamList? ")" TypeDeclaration?

And add this new rule:

[NN] DecoratorReference ::= VarRef ArgumentList? | FunctionCall
  1. When a decorated function is called, the XPath processor should create from the actual arguments of the call an array $args that holds all positionally-specified arguments in the call, and a map $kw-args that will hold the name – value pairs of all keyword-arguments in the call. Then the decorator will be called with these two arguments: ($args, $kw-args).

  2. When any function is called just with two arguments: ($args, $kw-args) (such as from a decorator), the XPath processor must perform everything necessary in order to call the function in the way it expects to be called. For this purpose, a new overload of fn:apply() is defined:

fn:apply($function as function(*), $array as array(*), $map as map(*)) as item()*

The result of the function is obtained by creating and invoking the same dynamic call that would be the result of a function-call to $function with (positional) arguments taken from the members of the supplied array $array and (keyword arguments) taken from $map.

The effect of calling

fn:apply($f, [$a, $b, $c, ...], map{"k1" : v1, "k2" : v2, ...})

is the same as the effect of the dynamic function call resulting from

$function($a, $b, $c, ...., $k1 = v1, $k2 = v2, ...)

The function conversion rules are applied to the supplied arguments in the usual way.


Example:

With this support added to the language, we simply write:

let $f := ^$trace-decorator()  upper-case#1
  return
    ("hello", "reader") ! $f(true()),
    "=======================================================================",
    ("hello", "reader") ! $f(false())

which produces the same correct result:

Calling function upper-case with params: hello,

Got result: HELLO

Calling function upper-case with params: reader,

Got result: READER

====================================================

HELLO

READER

Do note:

  1. The decorator and the manipulated function are completely independent of each other and may be written long before / after each other.
  2. They can reside in different code files.
  3. We can have a library of useful decorator functions and can append them to decorate any wanted function.
  4. As the updated Rule 72 above suggests, one can specify a chain of decorators manipulating a specific function. The inner-most decorator is passed to the next-inner-most-decorator, and so on…, which is passed … to the outer-most decorator. The different decorators specified don’t know about each other, are completely independent and may be written by different authors at any times and reside in different, unrelated function libraries.

This content was presented as a proposal for inclusion in XPath 4.0 here

Posted in XPath, XPath 4.0 | Leave a comment

[XPath 4.0]Proposal: Maps with Infinite Number of Keys: Total Maps and Decorated maps

Maps with Infinite Number of Keys: Total Maps and Decorated maps

1. Total Maps

Maps have become one of the most useful tools for creating readable, short and efficient XPath code. However, a significant limitation of this datatype is that a map can have only a finite number of keys. In many cases we might want to implement a map that can have more than a fixed, finite number of arguments.

Here is a typical example (Example 1):
A hotel charges per night differently, depending on how long the customer has been staying. For the first night the price is $100, for the second $90, for the third $80 and for every night after the third $75. We can immediately try to express this pricing data as a map, like this:

map {
1 : 100,
2 : 90,
3 : 80
(:  ??? How to express the price for all eventual next nights? :)
}

We could, if we had a special key, something like "TheRest", which means any other key-value, which is not one of the already specified key-values.

Here comes the first part of this proposal:

  1. We introduce a special key value, which, when specified in a map means: any possible key, different from the other keys, specified for the map. For this purpose we use the string: "\"

Adding such a "discard symbol" makes the map a total function on the set of any possible XPath atomic items.

Now we can easily express the hotel-price data as a map:

map {
1 : 100,
2 : 90,
3 : 80
'\' : 75
}

Another useful Example (2) is that now we can express any XPath item, or sequence of items as a map. Let’s do this for a simple constant, like π:

let $π := map {
'\' : math:pi()
}
 return $π?*   (: produces 3.141592653589793  :)

the map above is empty (has no regular keys) and specifies that for any other key-value $k it holds that $π($k) eq math:pi()

Going further, we can express even the empty sequence (Example 3) as the following map:

let $Φ := map {
'\' : ()
}
 return $Φ?*   (: produces the empty sequence :)

Using this representation of the empty sequence, we can provide a solution for the "Forgiveness problem" raised by Jarno Jelovirta in the XML.Com #general channel in March 2021:

This expression will raise an error:

[map {"k0": 1}, map{"k0": [1, 2, 3]}]?*?("k0")?*

[XPTY0004] Input of lookup operator must be map or array: 1.

To prevent ("forgive", thus "Forgiveness Problem") the raising of such errors we could accept the rule that in XPath 4.0 any expression that evaluates to something different than a map or an array, could be coerced to the following map, which returns the empty sequence as the corresponding value for any key requested in a lookup:

map {
'\' : ()
}  (: produces the empty sequence  for any lookup:)

To summarize, what we have achieved so far:

  1. The map constructed in Example 1 is now a total function over the domain of all natural numbers. Any map with a "\" (discard key) is a total function over the value-space of all xs:anyAtomicType values
  2. We can represent any XPath 4.0 item or sequence in an easy and intuitive way as a map.
  3. It is now straight-forward to solve the "Forgiveness Problem" by introducing the natural and intuitive rule for coercing any non-map value to the empty map, and this allows to use anywhere the lookup operator ? without raising an error.

2. Decorated Maps

Although we already achieved a lot in the first part, there are still use-cases for which we don’t have an adequate map solution:

  1. In the example (1) of expressing the hotel prices, we probably shouldn’t get $75 for a key such as -1 or even "blah-blah-blah" But the XPath 4.0 language specification allows any atomic values to be possible keys and thus to be the argument to the map:get() function. If we want validation for the actually-allowed key-values for a specific given map, we need to have additional processing/functionality.

  2. With a discard symbol we can express only one infinite set of possible keys and group them under the same corresponding value. However, there are problems, the data for which needs several infinite sets of key-values to be projected onto different values. Here is one such problem:

Imagine we are the organizers of a very simple lottery, selling many millions of tickets, identified by their number, which is a unique natural number.

We want to grant prizes with this simple strategy.

  • Any ticket number multiple of 10 wins $10.
  • Any ticket number multiple of 100 wins $20
  • Any ticket number multiple of 1000 wins $100
  • Any ticket number multiple of 5000 wins $1000
  • Any ticket number which is a prime number wins $25000
  • Any other ticket number doesn’t win a prize (wins $0)

None of the sets of key-values for each of the 6 categories above can be conveniently expressed with the map that we have so far, although we have merely 6 different cases!

How can we solve this kind of problem still using maps?

Decorators to the rescue

What is decorator, what is the decorator pattern and when it is good to use one? According to Wikipedia:

What solution does it describe? Define Decorator objects that

  • implement the interface of the extended (decorated) object (Component) transparently by forwarding all requests to it
  • perform additional functionality before/after forwarding a request.

This allows working with different Decorator objects to extend the functionality of an object dynamically at run-time.

The idea is to couple a map with a function (the decorator) which can perform any needed preprocessing, such as validation or projection of a supplied value onto one of a predefined small set of values (that are the actual keys of the map). For simplicity, we are not discussing post-processing here, though this can also be part of a decorator, if needed.

Let us see how a decorated-map solution to the lottery problem looks like:

let $prize-table := map {
  "ten" : 10,
  "hundred" : 20,
  "thousand" : 100,
  "five-thousand" : 1000,
  "prime" : 25000,
 "\" : 0
},
$isPrime := function($input as  xs:integer) as xs:boolean
{
  exists(index-of((2, 3, 5, 7, 11, 13, 17, 19, 23), $input)) (: simplified primality checker :)
},
$decorated-map := function($base-map as map(*), $input as xs:anyAtomicType) as item()*
{
  let $raw-result :=
         (
          let $key := 
           if(not($input castable as xs:positiveInteger)) then '\'  (: we can call the error() function here :) 
             else if($input mod 5000 eq 0) then 'five-thousand'
             else if($input mod 1000 eq 0) then 'thousand'
             else if($input mod 100 eq 0) then 'hundred'
             else if($input mod 10 eq 0) then 'ten'
             else if($isPrime($input)) then 'prime'
             else "\"
              return $base-map($key)
         ),
      $post-process := function($x) {$x},  (: using identity here for simplicity :)
      $post-processed := $post-process($raw-result)
    return $post-processed
},

$prizeForTicket := $decorated-map($prize-table, ?),       (: Note: this is exactly the lookup operator  ?    :)
$ticketNumbers := (1, 10, 100, 1000, 5000, 19, -3, "blah-blah-blah")

return $ticketNumbers ! $prizeForTicket(.)          (: produces 0, 10, 20, 100, 1000, 25000, 0, 0 :)

Conclusion

  1. In the 2nd part of this proposal, a new type/function — the decorated-map was described.
  2. We defined the signature of a decorated-map and gave an example how to construct and use one in solving a specific problem.
  3. Finally, we showed that the lookup operator ? on a decorated map $dm is identical to $dm($base-map, ?)

What remains to be done?

The topic of decorators is extremely important, as a decorator may and should be possible to define on any function, not just on maps. This would be addressed in one or more new proposals. Stay tuned 😊

This content was presented as a proposal for inclusion in XPath 4.0 here:

Posted in XPath, XPath 4.0 | Leave a comment

Recursion with anonymous (inline) functions in XPath 3.0 — Part II

In my first post about implementing recursion with anonymous functions I provided the following example:

let $f := function($n as xs:integer,
                   $f1 as function(xs:integer,
                   function()) as xs:integer
                   ) as xs:integer
             {if($n eq 0)
                 then 1
                 else $n * $f1($n -1, $f1)
              },
    $F := function($n as xs:integer) as xs:integer
            {
                $f($n, $f)
            }

   return
           $F(5)

While this technique works well with relatively small values for n, we run into problems when n becomes bigger.

Let’s see this based on another example:

  let $f := function($nums as xs:double*,
                   $f1 as function(xs:double*,
                   function()) as xs:double
                   ) as xs:double
             {
              if(not($nums[1]))
                 then 0
                 else $nums[1] + $f1(subsequence($nums,2), $f1)
              },
    $F := function($nums as xs:double*) as xs:double
            {
                $f($nums, $f)
            }
   return
           $F(1 to 10)

This calculates correctly the sum of the numbers 1 to 10 — the result is:

55

However, if we try:

$F(1 to 100)

the result is the following Saxon 9.4.6EE exception:

Error on line 22
Too many nested function calls. May be due to infinite recursion.
Transformation failed: Run-time errors were reported

So, what happens here? Most readers would have guessed by now — our old Stack Overflow (not the site) exception.

Is there any way to avoid this exception?

One could rely on the smartness of the XSLT processor to do this. A slight fraction of XSLT processors recognize a limited kind of tail recursion and implement it using iteration, thus avoiding recursion.

The above code isn’t tail-recursive. Let us refactor it into the following tail-recursive code:

let $f := function($nums as xs:double*,
                   $accum as xs:double,
                   $f1 as  
                     function(xs:double*, xs:double, function())
                              as xs:double
                  ) as xs:double
             {
              if(not($nums[1]))
                 then $accum
                 else $f1(subsequence($nums,2), $accum+$nums[1], $f1)
              },

        $F := function($nums as xs:double*) as xs:double
              {
                     $f($nums, 0, $f)
              }
    return
      $F(1 to 100)

Saxon is one of the few intelligent processors that recognizes tail recursion, however it still raises the stack-overflow exception — even for the above tail-recursive code . Why?

Here is the Wikipedia definition of tail recursion:

In computer science, a tail call is a subroutine call that happens inside another procedure as its final action; it may produce a return value which is then immediately returned by the calling procedure. The call site is then said to be in tail position, i.e. at the end of the calling procedure. If any call that a subroutine performs, such that it might eventually lead to this same subroutine being called again down the call chain, is in tail position, such a subroutine is said to be tail-recursive.

At present, the XSLT/XPath processors that do recognize some kind of tail recursion only do so if a function/template calls itself by name.

None of them handles the case when the tail call is to another function (Michael Kay, author of Saxon, shared on the Saxon mailing list that Saxon correctly handles any tail calls for templates, but not for functions).

So, what can we do in this situation? One decision is to wait until some processor starts handling any type of tail call inside functions.

Another, is to use the DVC (Divide and Conquer) technique for minimizing the maximum depth of nested recursion calls.

The idea is to split the sequence into subsequences (usually two) of roughly the same length, recursively process each subsequence, and then combine the results of processing each individual subsequence.

Here is the above code, re-written to use DVC:

let $f := function($nums as xs:double*,
                   $f1 as function(xs:double*, function()) 
                             as xs:double
                   ) as xs:double
             {if(not($nums[1]))
                 then 0
                 else if(not($nums[2]))
                        then $nums[1]
                        else
                         let $half := count($nums) idiv 2
                          return
                            $f1(subsequence($nums,1, $half), $f1)
		           +
		            $f1(subsequence($nums, $half+1), $f1)
              },
    $F := function($nums as xs:double*) as xs:double
            {
                $f($nums, $f)
            }

   return
           $F(1 to 10000)

Sure enough, this time we get the result:

5.0005E7

Using this technique, the maximum recursion depth is Log2(N) — thus for processing a sequence with 1M (one million elements) the maximum recursion depth is just 19.

Conclusion: The DVC technique is a tool that can be immediately used to circumvent the lack of intelligence of current XPath 3.0 processors when dealing with tail-call optimization.

Posted in Higher Order Functions, Performance Optimization, XPath 3.0 | 7 Comments

Word Ladders, or How to go from “angry” to “happy” in 20 steps

Acknowledgement: To Brandon, the person who attracted my attention to this problem.

From Wikipedia, the free encyclopedia:

A word ladder (also known as a doublets, word-links, or Word golf) is a word game invented by Lewis Carroll. A word ladder puzzle begins with two given words, and to solve the puzzle one must find the shortest chain of other words to link the two given words, in which chain every two adjacent words (that is, words in successive steps) differ by exactly by one letter.

As with the original game of Lewis Carroll, each step consists of a single letter substitution.

My struggle to find a solution to this problem processed into these stages:

1.   Total lack of understanding and trying to apply brute force approaches, only to realize what astronomical effort was involved and that “brute force” wasn’t readily definable.

2.   “Discovering” the idea that when playing with four-letter words we only need the 4-letter words part of a dictionary.

3.   Thinking hard how to pre-process an N-letter projection of a dictionary into a structure that would be more suitable for solving word-ladders.

4.   Finally being told that the structure I was trying to invent is called a graph

So, the main idea is that the set of N-letter words from an English dictionary is represented as a graph, where every node is a word from the dictionary, and there is an arc between two nodes exactly when the two words in these nodes differ only in a single letter.

Obviously, any such graph is undirected – if there is an arc between node N1 and N2, then there is also an arc between N2 and N1.

Finding the shortest chain of words is finding the shortest path in this graph that connects the two nodes. This problem has a well-known solution, called BFS (Breadth-First Search).

What may be new here is that the solution we have here is implemented in XSLT.

There were a number of subtasks:

ST1: Split a dictionary into a set of dictionaries, each containing only words of the same size N, where N = 1, 2, 3, 4, …

This one is quite natural and easy:

MakeLength-SizedDicts.xsl:

<xsl:stylesheet version=”2.0” xmlns:xsl=”http://www.w3.org/1999/XSL/Transform“>
 <xsl:output omit-xml-declaration=”yes” indent=”yes“/><xsl:template match=”/*“>
        <xsl:for-each-group select=”w” group-by=”string-length()“>
<xsl:sort select=”string-length()“/>
       <xsl:sort select=”lower-case(.)“/>       <xsl:result-document href=
           “file:///c:/temp/delete/dictSize{string-length()}.txt”>
             <xsl:for-each select=”current-group()“>
                 <xsl:value-of select=”lower-case(.), ‘&#0xA0; ‘“/>
             </xsl:for-each>
          </xsl:result-document>           <xsl:result-document href=
          “file:///c:/temp/delete/dictSize{string-length()}.xml”>
           <words>
            <xsl:for-each select=”current-group()“>
              <w><xsl:value-of select=”lower-case(.)“/></w>
            </xsl:for-each>
           </words>
          </xsl:result-document>
        </xsl:for-each-group>
 </xsl:template>
</xsl:stylesheet>

Thus we have produced the following separate dictionaries: dictSize1.xml, dictSize2.xml, …, dictSize24.xml .

ST2: For a given dictionary dictSizeN, produce a dictionary – graph:

AddNeighborsNaive.xsl:

<xsl:stylesheet version=”2.0
xmlns:xsl=”http://www.w3.org/1999/XSL/Transform&#8221;
   xmlns:my=”my:my
xmlns:xs=”http://www.w3.org/2001/XMLSchema&#8221;
  exclude-result-prefixes=”my xs“>
<xsl:output omit-xml-declaration=”yes” indent=”yes“/> 

 <xsl:key name=”kFindWord” match=”w” use=”.“/>

 <xsl:variable name=”vACode” as=”xs:integer
select=”string-to-codepoints(‘a’)“/>

 <xsl:variable name=”vDict” select=”/“/>

 <xsl:template match=”/*“>
   <xsl:variable name=”vPass1“>
       <words>
           <xsl:for-each select=
               “w[generate-id()=generate-id(key(‘kFindWord’, .)[1])]”>
           <word>
             <xsl:sequence select=”.“/>
             <xsl:sequence select=”my:neighbors(.)“/>
           </word>
        </xsl:for-each>
      </words>
   </xsl:variable>

  <xsl:apply-templates select=”$vPass1” mode=”pass2“/>
</xsl:template>

 <xsl:template match=”node()|@*” mode=”pass2“>
   <xsl:copy>
       <xsl:apply-templates select=”node()|@*” mode=”#current“/>
</
xsl:copy>
 </xsl:template> 

 <xsl:template match=”word” mode=”pass2“>
   <word>
      <xsl:sequence select=”w“/>
      <xsl:apply-templates select=”nb” mode=”pass2“/>
   </word>
 </xsl:template>

 <xsl:function name=”my:neighbors“>
    <xsl:param name=”pCurrentWord” as=”xs:string“/> 

    <xsl:variable name=”vWordCodes” select=”string-to-codepoints($pCurrentWord)“/> 

  <xsl:for-each select=”1 to count($vWordCodes)“>
    <xsl:variable name=”vPos” select=”.“/>
    <xsl:variable name=”vLetterCode” select=”$vWordCodes[$vPos]“/> 

    <xsl:for-each select=”(1 to 26)[. ne $vLetterCode – $vACode+1]“>
        <xsl:variable name=”vNewCode” select=”. -1 +$vACode“/>
        <xsl:variable name=”vNewWordCodes” select=
         “subsequence($vWordCodes, 1, $vPos -1),
          $vNewCode,
          subsequence($vWordCodes,$vPos +1)
         “/>

        <xsl:variable name=”vNewWord” select=”codepoints-to-string($vNewWordCodes)“/> 

        <xsl:if test=”key(‘kFindWord’, $vNewWord, $vDict)“>
         <nb><xsl:sequence select=”$vNewWord“/></nb>
      </xsl:if>
    </xsl:for-each>
  </xsl:for-each>
 </xsl:function>
</xsl:stylesheet>

This transformation is applied on an N-letter words dictionary which in the case of N = 5 looks like:

<words>

   <w>aalii</w>

   <w>aaron</w>

   <w>abaca</w>

   <w>aback</w>

   <w>abaff</w>

   <w>abaft</w>

   <w>abama</w>

   <w>abase</w>

   <w>abash</w>

   <w>abask</w>

   <w>abate</w>

   <w>abave</w>

   <w>abaze</w>

   <w>abbas</w>

   <w>abbey</w>

   <w>abbie</w>

   <w>abbot</w>

   <w>abdal</w>

   <w>abdat</w>

   <w>abeam</w>

       .   .   .   .   .   .

</words>

And produces our desired dictionary-graph, which looks like this:

<words>

   <word>

      <w>aalii</w>

   </word>

   <word>

      <w>aaron</w>

      <nb>baron</nb>

      <nb>saron</nb>

      <nb>acron</nb>

      <nb>apron</nb>

   </word>

   <word>

      <w>abaca</w>

      <nb>araca</nb>

      <nb>abama</nb>

      <nb>aback</nb>

   </word>

       .   .   .   .   .   .

</words>

ST3: Now that we have the graph, we are ready to implement the “Find Shortest Path” BFS algorithm:

findChainOfWords.xsl:

<xsl:stylesheet version=”2.0
xmlns:xsl=http://www.w3.org/1999/XSL/Transform
   xmlns:my=”my:my
xmlns:xs=http://www.w3.org/2001/XMLSchema
  exclude-result-prefixes=”my xs“>
   <xsl:output method=”text“/>

<xsl:key name=”kFindWord” match=”w” use=”.“/>  <xsl:param name=”pStartWord  select=”‘nice’” as=”xs:string“/>
    <xsl:param name=”pTargetWord” select=”‘evil’” as=”xs:string“/>     <xsl:variable name=”vDictGraph” select=”/“/>      <xsl:template match=”/*“>
        <xsl:sequence select=”my:chainOfWords($pStartWord,
$pTargetWord)
“/>
    </xsl:template>        <xsl:function name=”my:chainOfWords” as=”xs:string*“>
     <xsl:param name=”pStartWord” as=”xs:string“/>
     <xsl:param name=”pEndWord” as=”xs:string“/>        <xsl:sequence select=
         “if(not(key(‘kFindWord’, $pStartWord, $vDictGraph))
           or
               not(key(‘kFindWord’, $pEndWord, $vDictGraph))
               )
               then error((), ‘A word-argument isn`t found in the dictionary.’)
               else ()
   “/> 

  <xsl:variable name=”vStartWord” as=”xs:string
select=key(‘kFindWord’, $pStartWord, $vDictGraph)
                       [count(../*)  lt  count(key(‘kFindWord’, $pEndWord,
$vDictGraph)/../* )
]
  |
     key(‘kFindWord’, $pEndWord, $vDictGraph)
                [count(../*) le count(key(‘kFindWord’, $pStartWord,
$vDictGraph)/../*)
]
  “/> 

   <xsl:variable name=”vEndWord” as=”xs:string
       select=”($pStartWord, $pEndWord)[not(. eq $vStartWord)]“/>

   <xsl:variable name=”vStartNode” as=”element()“>
    <node>
        <value><xsl:value-of select=”$vStartWord“/></value>
    </node>
   </xsl:variable> 

   <xsl:sequence
select=”my:processQueue($vStartNode, $vEndWord, $vStartWord)“/>
 </xsl:function> 

 <xsl:function name=”my:processQueue  as=”xs:string*“>
     <xsl:param name=”pQueue” as=”element()*“/>
     <xsl:param name=”pTarget” as=”xs:string“/>
     <xsl:param name=”pExcluded” as=”xs:string*“/> 

    <xsl:sequence select=
     “if(not($pQueue))
         then ()
         else
for $vTop in $pQueue[1],
                  $vResult in my:processNode($vTop, $pTarget, $pExcluded)[1]
             return
               if($vResult/self::result)
                 then string-join($vResult/*, ‘ ==> ‘)
                 else my:processQueue((subsequence($pQueue, 2), $vResult/*),
                                                                $pTarget,
                                                                ($pExcluded, $vResult/*/value)
                                                               )“/>
 </xsl:function> 

 <xsl:function name=”my:processNode” as=”element()“>
     <xsl:param name=”pNode” as=”element()“/>
     <xsl:param name=”pTarget” as=”xs:string“/>
     <xsl:param name=”pExcluded” as=”xs:string*“/> 

     <xsl:variable name=”vCurWord
           select=”key(‘kFindWord’, $pNode/value, $vDictGraph)“/> 

     <xsl:variable name=”vNeighbors
select=”$vCurWord/following-sibling::*“/>
     <xsl:choose>
         <xsl:when test=”$pTarget = $vNeighbors“>
              <xsl:variable name=”vResult  as=”element()“>
                <result>
                  <xsl:sequence select=”my:enumerate($pNode)“/>
                  <w><xsl:sequence select=”$pTarget“/></w>
                </result>
              </xsl:variable>

               <xsl:sequence select=”$vResult“/>
         </xsl:when>
         <xsl:otherwise>
            <xsl:variable name=”vMustAdd” as=”element()“>
               <add>
                 <xsl:for-each select=”$vNeighbors[not(. = $pExcluded)]“>
                   <node>
                    <parent><xsl:sequence select=”$pNode“/></parent>
                    <value><xsl:value-of select=”.“/></value>
                   </node>
                 </xsl:for-each>
               </add>
         </xsl:variable> 

        <xsl:sequence select=”$vMustAdd“/>
      </xsl:otherwise>
   </xsl:choose>
 </xsl:function> 

 <xsl:function name=”my:enumerate” as=”element()*“>
     <xsl:param name=”pNode” as=”element()?“/> 

      <xsl:sequence select=
        “if($pNode)
            then (my:enumerate($pNode/parent/node), $pNode/value)
            else ()“/>
  </xsl:function>
</xsl:stylesheet>

And sure enough, we get this result:

 

evil ==> emil ==> emir ==> amir ==> abir ==> abie ==> able ==> aile ==> nile ==> nice

The first variant of this solution seems encouraging. But it takes a while – especially on my 9-year old Pc, which has 2GB of RAM, and a tiny cache. For example, the 8-parts chain connecting thus to else took 118 seconds (26 seconds on my newer, 2-year old PC).

Therefore, we are clearly in need of:

ST3: Optimizations:

Optimization1: Order the arc-words in increasing number of their own arcs. This means that we will first try the neighbor-node with the fewest number of neighbors.

To implement this optimization, we modify this part of the AddNeighborsNaive.xsl:

<xsl:template   match=”word”   mode=”pass2“>
<word>
<xsl:sequence select=”w“/>
<xsl:apply-templates   select=”nb”   mode=”pass2“/>
</word>
</xsl:template>

To:

<xsl:template match=”word” mode=”pass2“>
  <word>
     <xsl:sequence select=”w“/>
     <xsl:apply-templates select=”nb” mode=”pass2“>
        <xsl:sort select=”count(key(‘kFindWord’, .)/../nb)
              data-type=”number“/>
      </xsl:apply-templates>
   </word>
 </xsl:template>

Now the time for producing the 8-parts chain connecting thus to else falls from 118 sec. to 107 sec. –  10% faster.

Optimization2: When adding new words/nodes in the queue, always add them by increasing Hamming Distance to the target word:

To implement this, I added to the transformation the following function:

<xsl:function name=”my:HammingDistance” as=”xs:integer“>
    <xsl:param name=”pStr1” as=”xs:string“/>
    <xsl:param name=”pStr2” as=”xs:string“/> 

    <xsl:sequence select=
     “count(f:zipWith(f:eq(),
                                        f:string-to-codepoints($pStr1),
                                        f:string-to-codepoints($pStr2)
                                        )
                                        [not(.)]
                   )
   “/>
 </xsl:function>

Here I am using FXSL – the f:zipWith() function, as well as the function implementing the Xpath eq operator and the FXSL HOF analogue of the standard Xpath function  string-to-codepoints().

In order for this to compile, one must add the following (or similar) imports:

 <xsl:import href=”../../../CVS-DDN/fxsl-xslt2/f/func-zipWith.xsl“/>
  <xsl:import href=”../../../CVS-DDN/fxsl-xslt2/f/func-Operators.xsl“/>
  <xsl:import href=”../../../CVS-DDN/fxsl-xslt2/f/func-standardStringXpathFunctions.xsl“/>

This will not be needed if you were using XSLT 3.0 – then the above function would be:

<xsl:function name=”my:HammingDistance” as=”xs:integer“>
    <xsl:param name=”pStr1” as=”xs:string“/>
    <xsl:param name=”pStr2” as=”xs:string“/> 

  <xsl:sequence select=
     “count(map-pairs(f:eq#2,
                                          string-to-codepoints($pStr1),
                                          string-to-codepoints($pStr2)
                                         )
                                     [not(.)]
                    )
   “/>
 </xsl:function>

 But one would still need to write the f:eq() function.

The new function is used to change this code in my:processNode():

    <xsl:variable name=”vMustAdd” as=”element()“>
           <add>
                <xsl:for-each select=”$vNeighbors[not(. = $pExcluded)]“>
                   <node>
                     <parent><xsl:sequence select=”$pNode“/></parent>
                     <value><xsl:value-of select=”.“/></value>
                  </node>
               </xsl:for-each>
           </add>
    </xsl:variable>
    <xsl:sequence select=”$vMustAdd“/>

 To this new code:

    <xsl:variable name=”vMustAdd” as=”element()“>
           <add>
                <xsl:for-each select=”$vNeighbors[not(. = $pExcluded)]“>
                   <xsl:sort select=”my:HammingDistance(., $pTarget)
                                  data-type=”number“/>
                   <node>
                      <parent><xsl:sequence select=”$pNode“/></parent>
                      <value><xsl:value-of select=”.“/></value>
                   </node>
             </xsl:for-each>
           </add>
    </xsl:variable>
    <xsl:sequence select=”$vMustAdd“/>

The complete code of the transformation after this optimization becomes:

<xsl:stylesheet   version=”2.0
xmlns:xsl=”http://www.w3.org/1999/XSL/Transform
xmlns:xs=”http://www.w3.org/2001/XMLSchema
xmlns:f=”http://fxsl.sf.net/”   xmlns:my=”my:my
exclude-result-prefixes=”my   f xs“>
<xsl:import href=”../../../CVS-DDN/fxsl-xslt2/f/func-zipWith.xsl“/>
<xsl:import href=”../../../CVS-DDN/fxsl-xslt2/f/func-Operators.xsl“/>
<xsl:import href=”../../../CVS-DDN/fxsl-xslt2/f/func-standardStringXpathFunctions.xsl“/>
<xsl:output   method=”text“/>

<xsl:key   name=”kFindWord”   match=”w”   use=”.“/>

<xsl:param   name=”pStartWord”  select=”‘point’”   as=”xs:string“/>
<xsl:param   name=”pTargetWord”   select=”‘given’”   as=”xs:string“/>

<xsl:variable   name=”vDictGraph”   select=”/“/>

<xsl:template   match=”/*“>
<xsl:sequence   select=”my:chainOfWords($pStartWord,   $pTargetWord)“/>
</xsl:template>

<xsl:function   name=”my:chainOfWords”   as=”xs:string*“>
<xsl:param name=”pStartWord”   as=”xs:string“/>
<xsl:param name=”pEndWord”   as=”xs:string“/>

<xsl:sequence select=
if(not(key(‘kFindWord’,   $pStartWord, $vDictGraph))
        or
         not(key(‘kFindWord’, $pEndWord,   $vDictGraph))
        )
        then error((), ‘A word-argument isn`t   found in the dictionary.’)
        else ()
    “/>

<xsl:variable name=”vStartWord”   as=”xs:string”   select=
key(‘kFindWord’,   $pStartWord, $vDictGraph)
            [count(../*) lt count(key(‘kFindWord’, $pEndWord,
$vDictGraph)/../* )
]
  |
   key(‘kFindWord’, $pEndWord, $vDictGraph)
        [count(../*) le count(key(‘kFindWord’, $pStartWord,  $vDictGraph)/../*)]
  “/>

<xsl:variable name=”vEndWord”   as=”xs:string
select=”($pStartWord,   $pEndWord)[not(. eq $vStartWord)]“/>

<xsl:variable   name=”vStartNode”   as=”element()“>
<node>
<value><xsl:value-of   select=”$vStartWord“/></value>
</node>
</xsl:variable>

<xsl:sequence   select=
my:processQueue($vStartNode,   $vEndWord, $vStartWord)“/>
</xsl:function>

<xsl:function name=”my:processQueue”  as=”xs:string*“>
<xsl:param name=”pQueue”   as=”element()*“/>
<xsl:param name=”pTarget”   as=”xs:string“/>
<xsl:param name=”pExcluded”   as=”xs:string*“/>

<xsl:sequence select=
if(not($pQueue))
      then ()
      else
for $vTop in $pQueue[1],
         $vResult in   my:processNode($vTop, $pTarget, $pExcluded)[1]
            return
               if($vResult/self::result)
                 then string-join($vResult/*, ‘ ==>   ‘)
                 else   my:processQueue((subsequence($pQueue, 2),
$vResult/*),
                                                                    $pTarget,
                                                                    ($pExcluded,   $vResult/*/value)
                                                                   )“/>
</xsl:function>

<xsl:function name=”my:processNode”   as=”element()“>
<xsl:param name=”pNode”   as=”element()“/>
<xsl:param name=”pTarget”   as=”xs:string“/>
<xsl:param name=”pExcluded”   as=”xs:string*“/>

<xsl:variable name=”vCurWord
select=”key(‘kFindWord’,   $pNode/value, $vDictGraph)“/>

<xsl:variable name=”vNeighbors
select=”$vCurWord/following-sibling::*“/>

<xsl:choose>
<xsl:when test=”$pTarget   = $vNeighbors“>
<xsl:variable   name=”vResult”  as=”element()“>
<result>
<xsl:sequence   select=”my:enumerate($pNode)“/>
<w><xsl:sequence   select=”$pTarget“/></w>
</result>
</xsl:variable>

<xsl:sequence select=”$vResult“/>
</xsl:when>
<xsl:otherwise>
<xsl:variable name=”vMustAdd”   as=”element()“>
<add>
<xsl:for-each select=”$vNeighbors[not(.   = $pExcluded)]“>
<xsl:sort select=”my:HammingDistance(.,   $pTarget)
data-type=”number“/>
<node>
<parent><xsl:sequence   select=”$pNode“/></parent>
<value><xsl:value-of   select=”.“/></value>
</node>
</xsl:for-each>
</add>
</xsl:variable>

<xsl:sequence select=”$vMustAdd“/>
</xsl:otherwise>
</xsl:choose>
</xsl:function>

<xsl:function name=”my:enumerate”   as=”element()*“>
<xsl:param name=”pNode”   as=”element()?“/>

<xsl:sequence select=
if($pNode)
         then (my:enumerate($pNode/parent/node),   $pNode/value)
         else ()“/>
</xsl:function>

<xsl:function name=”my:HammingDistance”   as=”xs:integer“>
<xsl:param name=”pStr1”   as=”xs:string“/>
<xsl:param name=”pStr2”   as=”xs:string“/>

<xsl:sequence select=
count(f:zipWith(f:eq(),
                     f:string-to-codepoints($pStr1),
                     f:string-to-codepoints($pStr2)
                   )
                    [not(.)]
        )
  “/>
</xsl:function>
</xsl:stylesheet>

 The effect of this:

“point” to “given” (13-parts chain): 137 sec. before the optimization, 118 seconds after the optimization.

That is: another 14% increase of efficiency.  The combined effect of the two optimizations is speeding up the initial transformation with about 23%.

As I mentioned, my newer PC is about 4 times faster, which means that the maximum time for discovering a 5-letters shortest chain with Saxon 9.1.05 on a modern PC would be less than 35 seconds.

Another interesting fact is that these transformations take approximately 50% less time when run with XQSharp (XmlPrime), which lowers the maximum transformation time to 15 – 16 seconds.

Finally, here is the longest word chain I have found so far between 5-letter words (angry to happy):

angry ==> anury ==> anura ==> abura ==> abuta ==> aluta ==> alula ==> alala ==> alada ==> alida ==> alima ==> clima ==> clime ==> slime ==> slimy ==> saimy ==> saily ==> haily ==> haply ==> happy

And a few more interesting word chains:

story ==> novel :

novel ==> nevel ==> newel ==> tewel ==> tewer ==> teaer ==>
teaey ==> teary ==> seary ==> stary ==> story

fudge ==> cream:

cream ==> creat ==> crept ==> crepe ==> crape ==> grape ==>
gripe ==> gride ==> guide ==> guige ==> gudge ==> fudge

small ==> sized

sized ==> sizer ==> siver ==> saver ==> sayer ==> shyer ==>
shier ==> shiel ==> shill ==> shall ==> small

Posted in Uncategorized | Leave a comment

Recursion with anonymous (inline) functions in XPath 3.0

A few days ago Roger Costello asked at the xsl-list
forum:

Hi Folks

Is it possible to do recursion in an anonymous function?

Example: I would like to implement an "until" function. It has 
three arguments:

1. p is a boolean function
2. f is a function on x
3. x is the value being processed

Read the following function call as: decrement 3 until 
it is negative
$until ($isNegative, $decrement, 3)

where
$isNegative := function($x as xs:integer) {$x lt 0}
$decrement := function($x as xs:integer) {$x - 1}

Here's how I attempted to implement function until:

$until := function(
                   $p as function(item()*) as xs:boolean,
                   $f as function(item()*) as item()*,
                   $x as item()*
                   ) as item()*
            {
             if ($p($x))
                then
                   $x
                else
                   $until($p, $f, $f($x))  <-- RECURSE ...THIS IS
                           NOT ALLOWED, I THINK
            }

Is there a way to implement function until in XPath 3.0? 
(I know how to implement it in XSLT 3.0)

Roger is strictly speaking correct – because an anonymous function cannot call itself by name – it just doesn’t have a name …

However, we can still implement recursion using only inline function(s).

Here is my first attempt on a factorial inline function:

let $f := function($n as xs:integer,
                   $f1 as function(xs:integer) as xs:integer
                  ) as xs:integer
          {if($n eq 0)
               then 1
               else $n * $f1($n -1, $f1)
          }
     return
        $f(5, $f)

The result we get is correct:

120

OK, what happens here?

An inline function cannot call itself, because it doesn’t know its name and in this respect behaves like an Alzheimer disease victim.

The idea is to give this sick person a written note identifying the person to whom he has to pass the remainder of the task. And if he himself happens to be on this note, he will pass the task to himself (and forget immediately it was him doing the previous step).

The only special thing to notice here is how the processing is initiated:

$f(5, $f)

calling the function and passing itself to itself.

This initiation may seem weird to a client and is also error-prone. This is why we can further improve the solution so that no weirdness remains on the surface:

let $f := function($n as xs:integer,
                   $f1 as function(xs:integer,
                   function()) as xs:integer
                   ) as xs:integer
             {if($n eq 0)
                 then 1
                 else $n * $f1($n -1, $f1)

              },
    $F := function($n as xs:integer) as xs:integer
            {
                $f($n, $f)
            }

   return
           $F(5)

Now we produced an inline, anonymous function $F, which given an argument $n, produces $n!

There are two interesting sides of this story:

  1. We show how elegant and powerful XPath 3.0 HOFs are.
  2. More than ever it is clear now that XPath 3.0 is emerging as a full-pledged, stand-alone functional programming language in its own right that doesn’t need to be hosted by another language, such as XSLT or XQuery.

Some nice features we might still need, that could be just after the turn of the road (Read: In a post – 3.0 XPath version):

  1. Stand-alone XPath processors.
  2. Import/include directives for XPath-only files.
  3. Separate packaging/compilation of XPath-only programs.
  4. New data structures such as tuples.
  5. Generics – parametric data types.

I have been dreaming about this since the time I shared in this blog the XPath-only implementation of the Binary Search tree and the Set datatype.

Posted in Higher Order Functions, XPath 3.0 | Tagged , , | 6 Comments

Fizz Buzz with XPath 2.0/3.0

A few days ago Jim Fuller asked on Twitter:

@xquery fizz buzz with xquery http://bit.ly/wC6Ra5 can anyone come up with a faster version ? #xquery9:00 AM Feb 26, 2012

Here are my two answers in the categories: 1) Elegant; 2) Fast

These and other solutions can be found at: http://en.wikibooks.org/wiki/XQuery/Fizzbuzz

for $n  in (to 100),
      $fizz in not($n mod 3
),
       $buzz in not($n mod 5
)
return
         concat(“fizz”[$fizz], “buzz”[$buzz], $n[not($fizz or $buzz)])

No explicit conditional if/then/else clauses are used here. No XQuery-only cpecific constructs.

This XPath 2.0 expression can be evaluated “as-is” in any XPath 2.0 host (XSLT 2.0, XQuery 1.0 or other).

2.

for $k in 1 to 100 idiv 15 +1,
       $start in 15*($k1) +1
,
       $end in min((100, $start + 14
))
return

      let $results :=
                   (
$start, $start+1
,
                   ‘fizz’
,
                   $start+3
,
                   ‘buzz’, ‘fizz’
,
                   $start+6, $start+7
,
                   ‘fizz’
,
                   ‘buzz’
,
                   $start+10
,
                   ‘fizz’
,
                   $start+12, $start+13
,
                   ‘fizzbuzz’
)
return

             subsequence($results, 1, $end$start +1)

This expression could be efficient because there is no mod operator used at all.

It is a pure XPath 3.0 expression and can be evaluated “as-is” in any XPath 3.0 host (XSLT 3.0, XQuery 3.0 or other).

UPDATE:

Kit Wallace has collected all offered solutions and on his page one can compare all of them for running times. Here is a typical screensshot (RQ-1-CW and DN-2 seem to have almost identical speed. If you repeatedly refresh this page, half of the times one of these solutions is the fastest and half of the time — the other.):


Home fizzbuzz Timing

The Fizzbuzz problem was created as a test for prospective programmers.

Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.

Script

Author

Title

Milliseconds *

DN-2 Dimitre Novatchev A highly optimised algorithm 0.98
RQ-1-CW Rob Whitby Optimised by factoring a repeated subsequence – modified 1.01
CW-2 Chris Wallace An algorithm 2.29
JF-1 Jim Fuller A simple, clean algorithm 3.3
ML-1 Mark Lawson A simple algorithm 3.67
DN-1 Dimitre Novatchev An XPath algorithm 4.11
CW-1 Chris Wallace Strings and their modulus values parameterised to support ease of modification 6.57
RQ-1 Rob Whitby Optimised by factoring a repeated subsequence failed
Posted in Performance Optimization, XPath, XPath 3.0, XSLT 2.0, XSLT 3.0 | 5 Comments

The set datatype implemented in XPath 3.0

In my previous two posts I introduced the binary search tree datatype, implemented in XPath 3.0. In most cases the binary search tree operations find, insert and delete have efficiency of O(log(d)) and the print/serialize operation is O(N*log(d)), where d is the maximum depth of the tree and N is the total number of tree nodes.

Today, I will show a set implemented using a binary search tree. This implementation choice means that the set datatype operations have the same efficiency as the corresponding binary search tree operations. This implementation choice also results in the constraint that the set of values for any kind of item, that we want to put in such set, must have total ordering. In other words, the standard (or user-defined) lt operation must be applicable on any pair of items belonging to the set.

Here is the code:

import module namespace set-impl =http://fxsl.sf.net/data/set/implementation&#8221;
              at “implementations/set-implementation-as-binary-search-tree.xquery”;
             

declare namespace set =http://fxsl.sf.net/data/set

 

(:
  Create an empty set — (0x2205)
 :)
declare functionset:set()
         as
function() as item()*

{
  
set-impl:set()
};

declare functionset:empty($pSet as function() as item()*)
         as xs:boolean
{
  
set-impl:empty($
pSet)
};

declare functionset:size($pSet as function() as item()*)
         as xs:integer
{
  
set-impl:size($
pSet)
};

declare function set:equals
            (
$pSet1 as function() as item()*
,
            
$pSet2 as function() as item()*

            )
         as xs:boolean
{
   
set-impl:equals($pSet1, $pSet2)
};

declare function set:add
     (
$pSet as function() as item()*
,
     
$pItem as item
()
     )
         as
function() as item()*

{
  
set-impl:add($pSet, $pItem)
};

declare functionset:data($pSet as function() as item()*

)
         as
item()*

{
  
set-impl:data($pSet)
};

  The Set Theory belongs to (member of)   (0x22F2)
  operation
 :)
declare function set:member
     ($pItem as item()?,
     
$pSet as function() as item()*

      )
         as xs:boolean
{
  
set-impl:member($pItem, $pSet)
};

(:
 The Set Theory does not belong to (not member of)  (0x2209)
  operation
 :)

declare function set:not-member
     (
$pItem as item()?
,
     
$pSet as function() as item()*

      )
         as xs:boolean
{
  
not(set:member($pItem, $pSet))
};

declare function set:remove
     (
$pSet as function() as item()*
,
     
$pItem as item
()
     )
         as
function() as item()*

{
  
set-impl:remove($pSet[1], $pItem)
};

(:
  The classic set-theory U operation — union of two sets
:)
declare function set:U
     (
$pSet1 as function() as item()*
,
     
$pSet2 as function() as item()*

     )
         as
function() as item()*
{
  
set-impl:U($pSet1, $pSet2)
};

(:
  The classic set-theory set-difference operation \

:)
declare function set:diff
     (
$pSet1 as function() as item()*
,
     
$pSet2 as function() as item()*

     )
         as
function() as item()*
{
  
set-impl:diff($pSet1, $pSet2)
};

(: 
The classic set-theory ∩  (∩) /\
  set-intersection operation
:)
declare function set:intersect
     (
$pSet1 as function() as item()*
,
     
$pSet2 as function() as item()*

     )
         as
function() as item()*
{
 
set-impl:intersect($pSet1, $pSet2)
};
(:
The classic set-theory symmetric-difference operation
:)
declare function set:sym-diff
     (
$pSet1 as function() as item()*
,
     
$pSet2 as function() as item()*

     )
         as
function() as item()*
{
  
set-impl:sym-diff($pSet1, $pSet2)
};

(: 
The classic set-theory set (‘⊇’) operation
:)
declare function set:includes
     (
$pSet1 as function() as item()*
,
     
$pSet2 as function() as item()*

     )
         as xs:boolean
{
 
set-impl:includes($pSet1, $pSet2)
};

declare function set:print
     (
$pSet as function() as item()*
)
         as
element()?

{
 
set-impl:print($pSet)
};

declare function set:serialize
     (
$pSet as function() as item()*
)
         as
element()?

{
 
set-impl:serialize($pSet)
};

declare function set:deserialize
     (
$pSerialization as element()?
)
         as
function() as item()*

{
 
set-impl:deserialize($pSerialization)
};

Now, you would be right that this code shows very little, if anything, at all. However, it has at least two virtues:

  • Can serve as interface.
  • Is implementation independent – to use a different implementation just import another implementation module and bind the prefix set-impl to its namespace.

Having said that, here is the “true” implementation, which is contained in the file “implementations/set-implementation-as-binary-search-tree.xquery”.

This is a separate XQuery module with its own namespace. It imports and uses the module that implements the binary search tree:

module namespace set-impl =http://fxsl.sf.net/data/set/implementation&#8221;
              at “../../BinaryTree/bintreeModule.xquery”;

import module namespace tree =http://fxsl.sf.net/data/bintree;

The set implementation module “contains” a binary search tree. The “empty set” is nothing else than a zero-argument function that returns an empty tree:

(:
  Create an empty set — (0x2205)
 :)
declare functionset-impl:set()
         as
function() as item()*

{
  
function() {tree:tree()}   (: underlying bintree :)
};

Many of the basic set operations have their binary search tree counterparts. Here is what it means in our implementation for a set to be “empty” and how we compute the size of a set:

declare functionset-impl:empty($pSet as function() as item()*)
         as xs:boolean
{
  
tree:empty($pSet[1
]())
};

 

declare functionset-impl:size($pSet as function() as item()*)
         as xs:integer
{
  
tree:size($pSet[1
]())
};

Two sets are equal exactly when they have the same size and the sequences of their atomized values are deep-equal:

declare function set-impl:equals
            (
$pSet1 as function() as item()*
,
            
$pSet2 as function() as item()*

            )
         as xs:boolean
{
   
set-impl:size($pSet1) eqset-impl:size($pSet1)
  
and

    deep-equal(set-impl:data($pSet1), set-impl:data($pSet2))
};

Here is how we add an item to a set:

declare function set-impl:add
     (
$pSet as function() as item()*
,
     
$pItem as item
()
     )
         as
function() as item()*

{
  
function() {tree:insert($pSet[1](), $pItem)}
};

and how we atomize a set:

declare functionset-impl:data($pSet as function() as item()*)
         as
item()*

{
  
tree:data($pSet[1]())
};

When is an item member of a set:

(:
  The Set Theory belongs to (member of)   (0x22F2)
  operation
 :)
declare function set-impl:member
     (
$pItem as item()?
,
     
$pSet as function() as item()*

      )
         as xs:boolean
{
  
tree:contains($pSet[1](), $pItem)
};

and when it isn’t a member of a set:

(:
  The Set Theory does not belong to (not member of)   (0x2209)
  operation
 :)

declare function set-impl:not-member
     (
$pItem as item()?
,
     
$pSet as function() as item()*

      )
         as xs:boolean
{
  
not(set-impl:member($pItem, $pSet))
};

This is how we remove an item from a set:

declare function set-impl:remove
     (
$pSet as function() as item()*
,
     
$pItem as item
()
     )
         as
function() as item()*

{
  
function() {tree:deleteNode($pSet[1](), $pItem)}
};

The classic set theory U (union) operation returns the union of two sets. Making use of the new standard XPath 3.0 higher order function fold-left(), we add all items from the smaller of the two sets to the bigger, thus performing only the minimal possible number of add operations:

(:
  The classic set-theory U operation — union of two sets
:)
declare function set-impl:U
     (
$pSet1 as function() as item()*
,
     
$pSet2 as function() as item()*

     )
         as
function() as item()*
{
  
let$vBiggerSet :=
          
if(set-impl:size($pSet1) geset-impl:size($
pSet2))
            
then$
pSet1
            
else$
pSet2,
      
$vSmallerSet :=

           if(not(set-impl:size($pSet1) geset-impl:size($pSet2)))
            
then$
pSet1
            
else$
pSet2
   
return

       fold-left(set-impl:add#2, $vBiggerSet, set-impl:data($vSmallerSet))
};

Similarly, the classic set theory \ (set difference) operation:

(:
  The classic set-theory set-difference operation \
:)
declare function set-impl:diff
     (
$pSet1 as function() as item()*
,
     
$pSet2 as function() as item()*

     )
         as
function() as item()*
{
 
fold-left(set-impl:remove#2, $pSet1, set-impl:data($pSet2))
};

And the classic set theory  intersect operation:

(:
  The classic set-theory ∩  (∩) /\
  set-intersection operation
:)
declare function set-impl:intersect
     (
$pSet1 as function() as item()*
,
     
$pSet2 as function() as item()*

     )
         as
function() as item()*
{
  
let$vBiggerSet :=
          
if(set-impl:size($pSet1) geset-impl:size($
pSet2))
            
then$
pSet1
            
else$
pSet2,
      
$vSmallerSet :=

           if(not(set-impl:size($pSet1) geset-impl:size($pSet2)))
            
then$
pSet1
            
else$
pSet2,
      
$to-be-removed :=

          
filter(set-impl:not-member(?, $
vBiggerSet),
                     
set-impl:data($
vSmallerSet)
                     )
   
return

       fold-left(set-impl:remove#2, $vSmallerSet, $to-be-removed)
};

Two more set theory operations – symmetric difference:

(:
  The classic set-theory symmetric-difference operation
:)
declare function set-impl:sym-diff
     (
$pSet1 as function() as item()*
,
     
$pSet2 as function() as item()*

     )
         as
function() as item()*
{
 
set-impl:U(set-impl:diff($pSet1, $pSet2), set-impl:diff($pSet2, $pSet1))
};

and (non-strict) set inclusion:

(:
  The classic set-theory set (‘⊇’) operation
:)
declare function set-impl:includes
     (
$pSet1 as function() as item()*
,
     
$pSet2 as function() as item()*

     )
         as xs:boolean
{
 
empty(set-impl:data($pSet2)[set-impl:not-member(.,$pSet1)])
};

And finally, three infrastructural operations: print(), serialize() and deserialize() :

declare function set-impl:print
     (
$pSet as function() as item()*
)
         as
element()?

{
 
tree:print($pSet[1]())
};
 

declare function set-impl:serialize
     (
$pSet as function() as item()*
)
         as
element()?

{
 
set-impl:print($pSet)
};

declare function set-impl:deserialize
     (
$pSerialization as element()?
)
         as
function() as item()*

{
 
function() {tree:deserialize($pSerialization)}
};

Here is a test of our set implementation:

let$set0 :=set:add(set:set(), 2),
    
$set1 :=set:add(set:add(set:add(set:set(), 10), 5), 17
),
    
$set2 :=set:remove($set1, 5
),
    
$set3 :=set:remove($set1, 12
),
    
$set4 :=set:add(set:set(), 2
),
    
$set4 :=set:add(set:add(set:add(set:set(), 8), 4), 9
),
    
$set5 :=set:add(set:add(set:add(set:set(), 17), 10), 19
)
 
return

    (
           
‘ set:empty(set:set()): ‘,
             
set:empty(set:set
()),
             
‘ set:size(set:set()): ‘
,
             
set:size(set:set
()),
             
‘ set:size($set0): ‘
,
             
set:size($
set0),
             
‘ set:size($set1): ‘
,
             
set:size($
set1),
             
‘ set:data($set1): ‘
,
             
set:data($
set1),
             
‘ set:member(10,$set1): ‘
,
             
set:member(10,$
set1),
             
‘ set:member(17,$set1): ‘
,
             
set:member(17,$
set1),
             
‘ set:member(5,$set1): ‘
,
             
set:member(5,$
set1),
             
‘ set:member(2,$set1): ‘
,
             
set:member(2,$
set1),
             
‘ set:member(15,$set1): ‘
,
             
set:member(15,$
set1),
             
‘ set:data($set2): ‘
,
             
set:data($
set2),
             
‘ set:data($set3): ‘
,
             
set:data($
set3),
             
‘ set:data($set4): ‘
,
             
set:data($
set4),
             
‘ set:data(set:U($set1,$set4)): ‘
,
             
set:data(set:U($set1,$
set4)),
             
‘ set:print(set:U($set1,$set4)): ‘
,
             
set:print(set:U($set1,$
set4)),
             
‘ set:print(set:U($set4,$set1)): ‘
,
             
set:print(set:U($set4,$
set1)),
             
‘ set:print(set:U($set1,$set2)): ‘
,
             
set:print(set:U($set1,$
set2)),
             
‘ set:equals($set1, set:U($set1,$set2)): ‘
,
             
set:equals($set1, set:U($set1,$
set2)),
             
‘ set:data($set1): ‘
,
             
set:data($
set1),
             
‘ set:data($set4): ‘
,
             
set:data($
set4),
             
‘ set:data(set:U($set1,$set4)): ‘
,
             
set:data(set:U($set1,$
set4)),
             
‘ set:print(set:U($set1,$set4)): ‘
,
             
set:print(set:U($set1,$
set4)),
             
‘ set:data($set1): ‘
,
             
set:data($
set1),
             
‘ set:data($set2): ‘
,
             
set:data($
set2),
             
‘ set:data(set:diff($set1,$set2)): ‘
,
             
set:data(set:diff($set1,$
set2)),
             
‘ set:print(tree:print(set:diff($set1,$set2)): ‘
,
             
set:print(set:diff($set1,$
set2)),
             
‘ set:data($set1): ‘
,
             
set:data($
set1),
             
‘ set:data($set2): ‘
,
             
set:data($
set2),
             
‘ set:data(set:intersect($set1, $set2)): ‘
,
  
set:data(set:intersect($set1, $
set2)),
             
‘ set:data($set1): ‘
,
             
set:data($
set1),
             
‘ set:data($set5): ‘
,
             
set:data($
set5),
             
‘ set:data(set:sym-diff($set1, $set5)): ‘
,
  
set:data(set:sym-diff($set1, $
set5)),
             
‘ set:includes($set1, $set2): ‘
,
  
set:includes($set1, $
set2),
             
‘ set:includes($set1, $set5): ‘
,
  
set:includes($set1, $
set5),
             
‘ set:includes($set1, set:set()): ‘
,
  
set:includes($set1, set:set
()),
             
‘ set:member((), $set1): ‘
,
  
set:member((), $
set1),
             
‘ set:member((), set:set()): ‘
,
  
set:member((), set:set
()),
             
‘ set:serialize($set5): ‘
,
  
set:serialize($
set5),
             
‘ set:data(set:deserialize(set:serialize($set5))): ‘
,
  
set:data(set:deserialize(set:serialize($
set5))),
             
‘ set:size(set:deserialize(set:serialize($set5))): ‘
,
  
set:size(set:deserialize(set:serialize($
set5)))
             )

When this test is executed, the expected, correct results are produced:

set:empty(set:set()):  true
set:size(set:set()):  0
set:size($set0):  1
set:size($set1):  3
set:data($set1):  5 10 17
set:member(10,$set1):  true
set:member(17,$set1):  true
set:member(5,$set1):  true
set:member(2,$set1):  false
set:member(15,$set1):  false
set:data($set2):  10 17
set:data($set3):  5 10 17
set:data($set4):  4 8 9
set:data(set:U($set1,$set4)):  4 5 8 9 10 17
set:print(set:U($set1,$set4)):

<treeNode>

   <value>10</value>
   <treeNode>
      <value>5</value>
      <treeNode>
         <value>4</value>
      </treeNode>
      <treeNode>
         <value>8</value>
         <treeNode>
            <value>9</value>
         </treeNode>
      </treeNode>
   </treeNode>
   <treeNode>
      <value>17</value>
   </treeNode>
</treeNode>
set:print(set:U($set4,$set1)):
<treeNode>
   <value>8</value>
   <treeNode>
      <value>4</value>
      <treeNode>
         <value>5</value>
      </treeNode>
   </treeNode>
   <treeNode>
      <value>9</value>
      <treeNode>
         <value>10</value>
         <treeNode>
            <value>17</value>
         </treeNode>
      </treeNode>
   </treeNode>
</treeNode>
set:print(set:U($set1,$set2)):
<treeNode>
   <value>10</value>
   <treeNode>
      <value>5</value>
   </treeNode>
   <treeNode>
      <value>17</value>
   </treeNode>
</treeNode>
set:equals($set1, set:U($set1,$set2)):  true
set:data($set1):  5 10 17
set:data($set4):  4 8 9
set:data(set:U($set1,$set4)):  4 5 8 9 10 17
set:print(set:U($set1,$set4)):

<treeNode>
   <value>10</value>
   <treeNode>
      <value>5</value>
      <treeNode>
         <value>4</value>
      </treeNode>
      <treeNode>
         <value>8</value>
         <treeNode>
            <value>9</value>
         </treeNode>
      </treeNode>
   </treeNode>
   <treeNode>
      <value>17</value>
   </treeNode>
</treeNode>
set:data($set1):  5 10 17
set:data($set2):  10 17
set:data(set:diff($set1,$set2)):  5
set:print(tree:print(set:diff($set1,$set2)):

<treeNode>
   <value>5</value>
</treeNode>
set:data($set1):  5 10 17
set:data($set2):  10 17
set:data(set:intersect($set1, $set2)):  10 17
set:data($set1):  5 10 17
set:data($set5):  10 17 19
set:data(set:sym-diff($set1, $set5)):  5 19
set:includes($set1, $set2):  true
set:includes($set1, $set5):  false
set:includes($set1, set:set()):  true
set:member((), $set1):  true
set:member((), set:set()):  true
set:serialize($set5):
<treeNode>
   <value>17</value>
   <treeNode>
      <value>10</value>
   </treeNode>
   <treeNode>
      <value>19</value>
   </treeNode>
</treeNode>
set:data(set:deserialize(set:serialize($set5))): 10 17 19
set:size(set:deserialize(set:serialize($set5))):  3

In a few of my next posts I will show other interesting applications of the two new datatypes: the set and the binary search tree.

Useful utility: Zip Codes – Free zip code lookup and zip code database download.

Posted in functional data structures, XPath, XPath 3.0 | 2 Comments

Part 2: The Binary Search Data Structure with XPath 3.0, Deleting a node

In the first part of this post I presented a Binary Search Tree implemented in pure XPath 3.0. Some new nice features of XPath 3.0 were shown in action and other, still missing but very important features were identified and discussed.

The only operation that was missing was the deletion of a node from the tree. I show how to do this now.

This is the function that consumers will use:

(:
Delete from $pTree the node containing $pItem.
:)

declare function tree:deleteNode
         (
$pTree as (function() as item()*)*
,
         
$pItem as item
()
         )
        
as (function() as item()*)*

{
 
tree:deleteNode($pTree, $pItem, 1)
};

It simply delegates to a similar internal function that also requires a third argument – the depth of the (sub)tree on which the delete is performed. Why is the depth necessary? The answer will come later.

(:
Delete from $pTree the node containing $pItem.
$pTree has a depth of $pDepth (1 if not a subtree)
:)

declare function tree:deleteNode
         (
$pTree as (function() as item()*)*
,
         
$pItem as item
(),
         
$
pDepth as xs:integer
         )
        
as (function() as item()*)*

{
if(tree:empty($pTree))
  
then $
pTree
  
else

     let $top   := tree:top($pTree),
        
$left  := tree:left($
pTree),
        
$right := tree:right($
pTree)
     
return

         if($pItem eq $top)
          
then tree:deleteTopNode($pTree, $
pDepth)
          
else

            if($pItem lt $top)
             
then

                (
                 
function() {tree:top($pTree)},
                 
function
()
                  {
tree:deleteNode(tree:left($
pTree),
                                  
$
pItem,
                                  
$pDepth+1
)
                  },
                 
function() {tree:right($
pTree)}
                 )
             
else

                (
                 
function() {tree:top($pTree)},
                 
function() {tree:left($
pTree)},
                 
function
()
                  {
tree:deleteNode(tree:right($
pTree),
                                             
$
pItem,
                                             
$pDepth+1
)
                  }
                 )
};

We see that the deleteNode() function simply dispatches the operation to the left or right subtree and calls another function – deleteTopNode() – whenever the item to be deleted happens to be the value of the top node.

declare function tree:deleteTopNode
         (
$pTree as (function() as item()*)*
,
         
$
pDepth as xs:integer
         )
        
as (function() as item()*)*

{
 
let $left := tree:left($pTree),
     
$right := tree:right($
pTree)
   
return

      if(tree:empty($left) and tree:empty($right))
       
then tree:tree
()
       
else

         if(tree:empty($left) and not(tree:empty($right)))
          
then

             (
              
function() {tree:top($right)},
              
function() {tree:left($
right)},
              
function() {tree:right($
right)},
              
function() {tree:size($right) 1
}
              )
          
else

            if(not(tree:empty($left)) and tree:empty($right))
            
then

             (
              
function() {tree:top($left)},
              
function() {tree:left($
left)},
              
function() {tree:right($
left)},
              
function() {tree:size($left) 1
}
              )
            
else  (: both subtrees are non-empty :)

               if($pDepth mod 2 eq 0)
               
then

                  let $subtree := tree:right($pTree),
                     
$vItem := tree:top(tree:leftmost($
subtree)),
                     
$newSubtree :=

                         tree:deleteNode($subtree, $vItem, $pDepth+1)
                   
return

                      (function() {$vItem},
                      
function() {$
left},
                      
function() {$
newSubtree},
                      
function() {tree:size($pTree) 1
}
                       )
               
else

                  let $subtree := tree:left($pTree),
                     
$vItem := tree:top(tree:rightmost($
subtree)),
                     
$newSubtree:=

                         tree:deleteNode($subtree, $vItem, $pDepth+1)
                   
return

                      (function() {$vItem},
                      
function() {$
newSubtree},
                      
function() {$
right},
                      
function() {tree:size($pTree) 1
}
                       )
};

deleteTopNode() does the real deletion work. The operation is straightforward in the case when the tree is empty or either the left subtree or the right subtree is empty.

It becomes interesting when both subtrees are non-empty. In order to understand what the last 20 lines of this code mean, it is best to read a good source on the binary search tree delete operation.

In a few words:

  • If the depth is even, then we move to the top the leftmost node (without its right subtree) of the right subtree.
  • If the depth is odd, then we move to the top the rightmost node (without its left subtree) of the left subtree.

Why is this different processing done, depending on the oddness of the depth of the subtree? If we always used the leftmost node of the right subtree, then a series of deletions would create a severely imbalanced (to the right) tree. Imbalanced trees lose the advantages of the binary search tree with search and insertions becoming of linear complexity (instead of O(log(N))  ) and sorting becoming O(N^2) instead of O(N*log(N)).

Thus, in order to avoid this deteriorating effect, we need to use with a pseudo-random frequency both the left and the right subtrees. In most cases this can be achieved simply by using the oddness of the depth of the tree, whose top node is to be replaced.

Below is the complete code of the Binary Search tree. Do note that I have added a new “property” to the tree datatype – “size”. Also, note the functions tree:leftmost() and tree:rightmost() – used by tree:deleteTopNode()

Another change is that now the second argument of tree:contains()   is of type  item()?   — that is, it can be empty. By definition, any tree contains the “empty item”.

Finally, there is a new pair of functions: tree:serialize() and   tree:deserialize() . The former is a synonym of  tree:print() and the latter loads the XML representation created by  tree:serialize() into a tree.

module namespace tree = http://fxsl.sf.net/data/bintree&#8221;;

declare function tree:top
         (
$pTree as (function() as item()*)*
)
         as
item()?

{
if(empty($pTree))
  
then
()
  
else

     $pTree[1]()
};
 
 

 

declare function tree:left
         (
$pTree as (function() as item()*)*
)
        
as (function() as item()*)*

{
if(empty($pTree[2]))
then tree:tree
()
else

   $pTree[2]()
 
  
 

 

declare function tree:right
         (
$pTree as (function() as item()*)*
)
        
as (function() as item()*)*

{
if(empty($pTree[3]))
  
then tree:tree
()
  
else

     $pTree[3]()
};
  
declare function tree:size
         (
$pTree as (function() as item()*)*
)
         as xs:integer
{
if(empty($pTree[4
]))
  
then 0

   else
     $pTree[4]()
};
  
declare function tree:tree()
        
as (function() as item()*)+
{};
   (
    function() {()},   (: top() :)
    function() {()},   (: left() :)
    function() {()},   (: right() :)
    function() {0}     (: size() :)
   )
};
declare function tree:empty
         (
$pTree as (function() as item()*)*
)
         as xs:boolean
{
empty($pTree) or empty($pTree[1
]())
};

 declare function tree:contains
         (
$pTree as (function() as item()*)*,
         
$pItem as item()?
         )
         as xs:boolean
{
 
if(empty($pItem))
  
then true()
  
else
   
if(tree:empty($pTree))
     
then false()
     
else
       
let $top := tree:top($pTree)
        
return
            (
$pItem eq $top)
          
or
            (
$pItem lt $top
           
and
            
tree:contains(tree:left($pTree),$pItem)
             )
          
or
            (
$pItem gt $top
           
and
            
tree:contains(tree:right($pTree),$pItem)
             )
};

(:
Delete from $pTree the node containing $pItem.
:)
declare function tree:deleteNode
         (
$pTree as (function() as item()*)*
,
         
$pItem as item
()
         )
        
as (function() as item()*)*

{
 
tree:deleteNode($pTree, $pItem, 1)
};
declare function tree:deleteNode
         (
$pTree as (function() as item()*)*
,
         
$pItem as item
(),
         
$
pDepth as xs:integer
         )
        
as (function() as item()*)*

{
if(tree:empty($pTree))
  
then $
pTree
  
else

     let $top   := tree:top($pTree),
        
$left  := tree:left($
pTree),
        
$right := tree:right($
pTree)
     
return

         if($pItem eq $top)
          
then tree:deleteTopNode($pTree, $
pDepth)
          
else

            if($pItem lt $top)
             
then

                (
                 
function() {tree:top($pTree)},
                 
function
()
                  {
tree:deleteNode(tree:left($
pTree),
                                  
$
pItem,
                                  
$pDepth+1
)
                  },
                 
function() {tree:right($
pTree)}
                 )
             
else

                (
                 
function() {tree:top($pTree)},
                 
function() {tree:left($
pTree)},
                 
function
()
                  {
tree:deleteNode(tree:right($
pTree),
                                             
$
pItem,
                                            
$pDepth+1
)
                  }
                 )
};
 

declare function tree:deleteTopNode
         (
$pTree as (function() as item()*)*,
         
$pDepth as xs:integer
         )
        
as (function() as item()*)*
{
 
let $left := tree:left($pTree),
     
$right := tree:right($pTree)
   
return
     
if(tree:empty($left) and tree:empty($right))
       
then tree:tree()
       
else
        
if(tree:empty($left) and not(tree:empty($right)))
          
then
           
$right
          
else
           
if(not(tree:empty($left)) and tree:empty($right))
            
then
              
$left
            
else  (: both subtrees are non-empty :)
              
if($pDepth mod 2 eq 0)
               
then
                 
let $subtree := tree:right($pTree),
                     
$vItem := tree:top(tree:leftmost($subtree)),
                     
$newSubtree :=
                        
tree:deleteNode($subtree, $vItem, $pDepth+1)
                   
return
                      (
function() {$vItem},
                      
function() {$left},
                      
function() {$newSubtree},
                      
function() {tree:size($pTree) 1}
                       )
               
else
                 
let $subtree := tree:left($pTree),
                     
$vItem := tree:top(tree:rightmost($subtree)),
                     
$newSubtree:=
                        
tree:deleteNode($subtree, $vItem, $pDepth+1)
                   
return
                      (
function() {$vItem},
                      
function() {$newSubtree},
                      
function() {$right},
                      
function() {tree:size($pTree) 1}
                       )
};

 
(:
  Find/Return the leftmost node of the
  non-empty tree $pTree
:)
declare function tree:leftmost
         (
$pTree as (function() as item()*)+
)
        
as (function() as item()*)*

{
 
let $left := tree:left($pTree)
   
return

      if(tree:empty($left))
       
then $
pTree
       
else tree:leftmost($
left)
};
 

 

 
declare function tree:rightmost
         (
$pTree as (function() as item()*)+
)
        
as (function() as item()*)*

{
 
let $right := tree:right($pTree)
   
return

      if(tree:empty($right))
       
then $
pTree
       
else tree:rightmost($
right)
};
 declare function tree:insert
         (
$pTree as (function() as item()*)*
,
         
$pItem as item
()
         )
        
as (function() as item()*)+

{
  
if(tree:empty($pTree))
     
then

      (
      
function() {$pItem},        (: top()   :)
       function() {tree:tree()}, (: left()  :)
       function() {tree:tree()}, (: right() :)
       function() {1}                     (: size()  :)
       )
     
else
       if($pItem lt tree:top($pTree))
        
then

          (
          
function() {tree:top($pTree)},
          
function() {tree:insert(tree:left($pTree), $
pItem)},
          
function() {tree:right($
pTree)},
          
function() {tree:size($pTree)+1
}
           )
        
else

          if($pItem gt tree:top($pTree))
          
then

           (
           
function() {tree:top($pTree)},
           
function() {tree:left($
pTree)},
           
function() {tree:insert(tree:right($pTree), $
pItem)},
           
function() {tree:size($pTree)+1
}
           )
          
else

            $pTree
};

declare function tree:print
        (
$pTree as (function() as item()*)*
)
        as
element()?

{
  
if(not(tree:empty($pTree)))
   
then

     <treeNode>
      <value>{tree:top($pTree)}</value>
       {
       
tree:print(tree:left($pTree)),
       
tree:print(tree:right($
pTree))
       }
    
</treeNode>

    else ()
};
  

declare function tree:serialize
        (
$pTree as (function() as item()*)*)
        as
element()?
{
  
tree:print($pTree)
};

declare function tree:deserialize($pXml as element()?)
             
as (function() as item()*)*
{
 
if(empty($pXml))
 
then tree:tree()
 
else
   
let $left := tree:deserialize($pXml/treeNode[1]),
       
$right := tree:deserialize($pXml/treeNode[2])
    
return
      (
      
function() {$pXml/value/node()[1]},
      
function() {$left},
      
function() {$right},
      
function() {tree:size($left)+tree:size($right)+1}
      )
};

 

(: tree:data()
  Prints only the data — depth first.
   In effect this is sort() — quite good
   for random data.
:)
declare function tree:data
        (
$pTree as (function() as item()*)*
)
        as
item()*

{
if(not(tree:empty($pTree)))
  
then

    (
    
tree:data(tree:left($pTree)),
    
tree:top($
pTree),
    
tree:data(tree:right($
pTree))
     )
  
else
()
};

Now, let’s test just the deleteNode operation:

let $vBigTree :=
       tree:insert(tree:insert(tree:insert(tree:tree(),10),5),17),
   
$vBigTree2 :=

       tree:insert(tree:insert(tree:insert($vBigTree,3),8),14),
   
$vBigTree3 :=

       tree:insert(tree:insert(tree:insert($vBigTree2,20),6),9),
   
$vBigTree4 :=

       tree:insert(tree:insert(tree:insert($vBigTree3,12),15),7)
   
return

      (‘ tree:print($vBigTree4): ‘,
      
tree:print($
vBigTree4),
      
‘ tree:data($vBigTree4): ‘
,
      
tree:data($
vBigTree4),
      
‘ ====================================’
,
      
‘ tree:print(tree:deleteNode($vBigTree4, 7)): ‘
,
      
tree:print(tree:deleteNode($vBigTree4, 7
)),
      
‘ tree:data(tree:deleteNode($vBigTree4, 7)): ‘
,
      
tree:data(tree:deleteNode($vBigTree4, 7
)),
      
‘ ====================================’
,
      
‘ tree:print(tree:deleteNode($vBigTree4, 12)): ‘
,
      
tree:print(tree:deleteNode($vBigTree4, 12
)),
      
‘ tree:data(tree:deleteNode($vBigTree4, 12)): ‘
,
      
tree:data(tree:deleteNode($vBigTree4, 12
)),
      
‘ ====================================’
,
      
‘ tree:print(tree:deleteNode($vBigTree4, 6)): ‘
,
      
tree:print(tree:deleteNode($vBigTree4, 6
)),
      
‘ tree:data(tree:deleteNode($vBigTree4, 6)): ‘
,
      
tree:data(tree:deleteNode($vBigTree4, 6
)),
      
‘ ====================================’
,
      
‘ tree:print(tree:deleteNode($vBigTree4, 5)): ‘
,
      
tree:print(tree:deleteNode($vBigTree4, 5
)),
      
‘ tree:data(tree:deleteNode($vBigTree4, 5)): ‘
,
      
tree:data(tree:deleteNode($vBigTree4, 5
)),
      
‘ ====================================’
,
      
‘ tree:print(tree:deleteNode($vBigTree4, 10)): ‘
,
      
tree:print(tree:deleteNode($vBigTree4, 10
)),
      
‘ tree:data(tree:deleteNode($vBigTree4, 10)): ‘
,
      
tree:data(tree:deleteNode($vBigTree4, 10
))
      )

And here are the results:

tree:print($vBigTree4):

<treeNode>

      <value>10</value>

      <treeNode>

            <value>5</value>

            <treeNode>

                  <value>3</value>

            </treeNode>

            <treeNode>

                  <value>8</value>

                  <treeNode>

                        <value>6</value>

                        <treeNode>

                              <value>7</value>

                        </treeNode>

                  </treeNode>

                  <treeNode>

                        <value>9</value>

                  </treeNode>

            </treeNode>

      </treeNode>

      <treeNode>

            <value>17</value>

            <treeNode>

                  <value>14</value>

                  <treeNode>

                        <value>12</value>

                  </treeNode>

                  <treeNode>

                        <value>15</value>

                  </treeNode>

            </treeNode>

            <treeNode>

                  <value>20</value>

            </treeNode>

      </treeNode>

</treeNode>

tree:data($vBigTree4):

3 5 6 7 8 9 10 12 14 15 17 20

====================================

tree:print(tree:deleteNode($vBigTree4, 7)):

 

<treeNode>
  
<value>10</value>
  
<treeNode>
     
<value>5</value>
     
<treeNode>
        
<value>3</value>
     
</treeNode>
     
<treeNode>
        
<value>8</value>
        
<treeNode>
           
<value>6</value>
        
</treeNode>
        
<treeNode>
           
<value>9</value>
        
</treeNode>
     
</treeNode>
  
</treeNode>
  
<treeNode>
     
<value>17</value>
     
<treeNode>
        
<value>14</value>
        
<treeNode>
           
<value>12</value>
        
</treeNode>
        
<treeNode>
           
<value>15</value>
        
</treeNode>
     
</treeNode>
     
<treeNode>
        
<value>20</value>
     
</treeNode>
  
</treeNode>
</treeNode>

 

tree:data(tree:deleteNode($vBigTree4, 7)):

3 5 6 8 9 10 12 14 15 17 20

====================================

tree:print(tree:deleteNode($vBigTree4, 12)):

<treeNode>

      <value>10</value>

      <treeNode>

            <value>5</value>

            <treeNode>

                  <value>3</value>

            </treeNode>

            <treeNode>

                  <value>8</value>

                  <treeNode>

                        <value>6</value>

                        <treeNode>

                              <value>7</value>

                        </treeNode>

                  </treeNode>

                  <treeNode>

                        <value>9</value>

                  </treeNode>

            </treeNode>

      </treeNode>

      <treeNode>

            <value>17</value>

            <treeNode>

                  <value>14</value>

                  <treeNode>

                        <value>15</value>

                  </treeNode>

            </treeNode>

            <treeNode>

                  <value>20</value>

            </treeNode>

      </treeNode>

</treeNode>

tree:data(tree:deleteNode($vBigTree4, 12)):

3 5 6 7 8 9 10 14 15 17 20

====================================

tree:print(tree:deleteNode($vBigTree4, 6)):

<treeNode>

      <value>10</value>

      <treeNode>

            <value>5</value>

            <treeNode>

                  <value>3</value>

            </treeNode>

            <treeNode>

                  <value>8</value>

                  <treeNode>

                        <value>7</value>

                  </treeNode>

                  <treeNode>

                        <value>9</value>

                  </treeNode>

            </treeNode>

      </treeNode>

      <treeNode>

            <value>17</value>

            <treeNode>

                  <value>14</value>

                  <treeNode>

                        <value>12</value>

                  </treeNode>

                  <treeNode>

                        <value>15</value>

                  </treeNode>

            </treeNode>

            <treeNode>

                  <value>20</value>

            </treeNode>

      </treeNode>

</treeNode>

tree:data(tree:deleteNode($vBigTree4, 6)):

3 5 7 8 9 10 12 14 15 17 20

====================================

tree:print(tree:deleteNode($vBigTree4, 5)):

<treeNode>

      <value>10</value>

      <treeNode>

            <value>6</value>

            <treeNode>

                  <value>3</value>

            </treeNode>

            <treeNode>

                  <value>8</value>

                  <treeNode>

                        <value>7</value>

                  </treeNode>

                  <treeNode>

                        <value>9</value>

                  </treeNode>

            </treeNode>

      </treeNode>

      <treeNode>

            <value>17</value>

            <treeNode>

                  <value>14</value>

                  <treeNode>

                        <value>12</value>

                  </treeNode>

                  <treeNode>

                        <value>15</value>

                  </treeNode>

            </treeNode>

            <treeNode>

                  <value>20</value>

            </treeNode>

      </treeNode>

</treeNode>

tree:data(tree:deleteNode($vBigTree4, 5)):

3 6 7 8 9 10 12 14 15 17 20

====================================

tree:print(tree:deleteNode($vBigTree4, 10)):

<treeNode>

      <value>9</value>

      <treeNode>

            <value>5</value>

            <treeNode>

                  <value>3</value>

            </treeNode>

            <treeNode>

                  <value>8</value>

                  <treeNode>

                        <value>6</value>

                        <treeNode>

                              <value>7</value>

                        </treeNode>

                  </treeNode>

            </treeNode>

      </treeNode>

      <treeNode>

            <value>17</value>

            <treeNode>

                  <value>14</value>

                  <treeNode>

                        <value>12</value>

                  </treeNode>

                  <treeNode>

                        <value>15</value>

                  </treeNode>

            </treeNode>

            <treeNode>

                  <value>20</value>

            </treeNode>

      </treeNode>

</treeNode>

tree:data(tree:deleteNode($vBigTree4, 10)):

3 5 6 7 8 9 12 14 15 17 20

 

In my next posts I’ll show some of the most important applications of the Binary Search tree.


Posted in functional data structures, XPath, XPath 3.0 | Leave a comment