Assumed you wish to match a large wordlist against a huge chunk of text.
As a small test case, let
for, far, bar, foo, boofaz, boofar, boof, faz, foobaz, foobars, boofar
be your wordlist. Now, you may apply the according regualar expression:
But which way a regex engine would implement the assignment?
There are different options. The very worst algorithm would be surely to
look up every word separately in the whole text. That would be the same as
doing
Assumed you would match m words against a text consisting of n letters,
this peace of coding horror would have a runtime estimation of O(m*n).
Now, a better approach would be to run only once through the text,
using a matching stack. Thus, assume " foobar " would appear somewhere in
the text, the stack trace might look as follows then (read from bottom to top):
So, but what if the wordlist is getting large? It seems that we should run nearly
through the whole list each time a character is pushed onto the stack in order to
find out whether the current stack contents still may be matched or not.
It's clear that a considerable optimization would be to sort the word list
in advance. Moreover, instead of looking up one item after another,
a really smart approach would be to walk downwards a search tree instead.
As a tree, the wordlist above would appear like this:
Here, the "?" denotes an optional node. Remember the length of the way downwards
such a tree is in logarithmic relation to the number of nodes. Thus, loosely speeking,
we have improved the worst algorithm above up to O(n*log(m)) at least.
Actually I'm not sure whether regex engines would apply optimizations like that
when compiling. I guess they do, so it might be needless to replace the regex (1) above
by the optimized version, implementing the sorted tree of alternative and optional nodes:
Nevertheless I couldn't help to create a little Perl routine that folds a wordlist into an
optimized regex. Now, here it is:
Well, not so easy, but it works
Here, the inner recursion fold will create the actually tree, where nodes
having the form of arrays consisting of prefix, subnodes and a flag denoting optional nodes.
The second inner function toString then creates the actual regular
expression string from that tree.
So, for instance, calling
leave a comment