Update: Minor correction in the last two rows of the table — thanks to a comment by Michael Ludwig. I will talk about the efficiency of this and other related XPath expressions in my next post.
In my first post I provided a compact one-liner XPath expression that obtains all duplicate items in a given sequence.
Since then a number of people have contacted me, asking for an explanation what this expression means and how it is really evaluated by a compliant XPath 2.0 engine.
C. M. Sperberg-McQueen, a MIT professor and member of W3C, blogged about it:
"Dimitre’s solution is beautiful and concise: 21 characters long (longer if you use variables with names longer than singler characters), and well worth the five or ten minutes it took me to work out why it works. I won’t explain it to you, dear reader; you deserve the brief puzzlement and Aha-moment of understanding it."
"Dimitre gets my vote for this month’s best programming-language application of Strunk and White’s rule: “Omit needless words!” "
I have always admired any of Michael’s works and his high recognition is the best reward I would not have dared dreaming of.
Now that enough time has passed for anyone to try and understand how the solution works, let me explain it myself.
From the very start working on this problem I decided that a good solution must use the available standard functions on sequences.
While distinct-values() was mentioned by the OP, it is not directly useful here, as we are looking for any item that occurs more than once in the sequence.
We can eventually think of using exists() or empty(), however the one function that is most relevant to the task is index-of(). As explained in the F & O Spec.,
fn:index-of( |
$seqParam |
as xs:anyAtomicType*, |
$srchParam |
as xs:anyAtomicType) as xs:integer* | |
fn:index-of( |
$seqParam |
as xs:anyAtomicType*, |
$srchParam |
as xs:anyAtomicType, | |
$collation |
as xs:string) as xs:integer* |
Summary: Returns a sequence of positive integers giving the positions within the sequence $seqParam of items that are equal to $srchParam.
The first step in the solution is to ask the question:
How are duplicate items different from unique items (in terms of index-of() ) ?
The answer comes naturally:
For any unique item vUi, the sequence that
index-of($vSeq, $vUi)
produces, is just of length of one.
For any duplicate item vDup, the sequence that
index-of($vSeq, $vDup)
produces, is of length of two or more.
In other words,
index-of($vSeq, $vDup)
always has a second item, while
index-of($vSeq, $vUi)
never has a second item.
To express this in XPath 2.0, the expression:
index-of($vSeq, $vItem)[2]
produces the empty sequence for any unique item $vItem, and it produces an xs:integer index for any $vItem that occurs two or more times in the sequence.
Then one way to find all duplicate items in the sequence is to evaluate:
$vSeq[exists(index-of($vSeq, .)[2])]
In XPath 2.0 the expression in the [] brackets that follow a sequence is called a FilteringExpression. It is evaluated for every item of the sequence and only items that satisfy this predicate are produced.
Let $vSeq is defined as:
1, 2, 2, 3, 4, 5, 5, 6, 7
Then
$vSeq[exists(index-of($vSeq, .)[2])]
produces:
2, 2, 5, 5
This is quite close to the result we need, we just need to dedup it:
distinct-values
($vSeq[exists(index-of($vSeq, .)[2])])
produces the wanted result:
2, 5
This is fine, but the last expression seems already too long. Can we do even better?
What about this one:
$vSeq[index-of($vSeq,.)[2]]
It produces :
2, 5
exactly the wanted result and it is really tight.
The only problem is to explain how this expression "works" :)
Let me admit that on the morning after discovering this solution I found I had completely forgotten the rational behind it. After failing again and again to explain it, I finally asked the Gods for help.
Here is the explanation provided by Dr. Michael Kay:
"This expression
$vSeq[index-of($vSeq,.)[2]]
means
$vSeq [position() = index-of($vSeq,.)[2]]
Let’s evaluate the predicate for each item in $vseq:
pos |
value |
index-of($vSeq,.) |
index-of($vSeq,.)[2] |
pos=index-of($vSeq,.)[2] |
1 |
1 |
1 |
() |
false |
2 |
2 |
2, 3 |
3 |
false |
3 |
2 |
2, 3 |
3 |
true |
4 |
3 |
4 |
() |
false |
5 |
4 |
5 |
() |
false |
6 |
5 |
6, 7 |
7 |
false |
7 |
5 |
6, 7 |
7 |
true |
8 |
6 |
8 |
() |
false |
9 |
7 |
9 |
() |
false |
So the only items that match the predicate are those in positions 3 and 7, with values (2, 5)."
To summarize, the critical step in understanding how this works is to remember that if the type of the FilteringExpression is xs:integer, then
$someSequence[someFilteringExpression]
is simply an abbreviation for:
$someSequence[position() = someFilteringExpression]
Therefore,
$vSeq[index-of($vSeq,.)[2]]
means :
All items from $vSeq such that their position is equal to the value of the index of their second occurence in the sequence.
As there is at most one item at any given position, this is how we get just one "2" and one "5" in
2, 5
as contrasted to two of each in
2, 2, 5, 5
Concluding, here is one final note:
If you have ever used the Muenchian method of grouping, you may have spotted its resemblance to the current solution.
A typical expression used in Muenchian grouping is something like this:
<xsl:key match="person" name="kPersByName" use="@name"/>
<xsl:for-each select=
"*/person
[generate-id()
=
generate-id(key(‘kPersByName’, @name)[1])
]">
<!– Do something –>
</xsl:for-each>
In the above, the role of the filtering expression is played by:
generate-id()
=
generate-id(key(‘kPersByName’, @name)[1])
and this is the same as the role played by:
index-of($vSeq,.)[2]
in finding the duplicate items.
The Muenchian method uses an index of "1" in:
key(‘kPersByName’, @name)[1]
and in our solution:
index-of($vSeq,.)[2]
a similar role is played by the index "2".
"1" in the Muenchian method means: Take one node from each group of nodes with unique keys. Because any group can consist just of only one node, it is only safe here to take the first node from any group.
In the case of finding all duplicate items in a sequence, we want to pick one index from the list of indexes of any duplicate item. Having a first index does not qualify an item as duplicate — it needs to have at least a second index in the sequence. On the other side there may be duplicate items that occur only twice (not thrice or more times) and they will not have a third index.
Therefore, the only safe way to take an existing index for any duplicate item is to take its second index.
If you have successfully read up to here, you would have little difficulty solving the following problems, provided as exercises:
Let $vSeq be (1, 2,2, 3,3,3, 4,4,4,4, 5,5, 6, 7)
P1. Produce all items in $vSeq that occur only once in it.
P2. Produce all items in $vSeq that occur exactly two times in it.
P3. Produce all items in $vSeq that occur exactly three times in it (triples).
P4. Produce all items in $vSeq that occur exactly four times in it (quadruples).
P5. Produce all items in $vSeq that occur more than once but less than four times.
Hint: You may consider using the last() function in your solution.
This is extremely helpful. Thanks!
I am dying to know…What are the solutions to the practice questions? I ended up using e.g.
distinct-values($vSeq[count(index-of($vSeq,.))>1][count(index-of($vSeq,.))<4])
How do you use last()?
The expressions I have in mind are: