C++ Regular Expression Performance: GRETA and Boost.Regex

2/19/2004

The Boost and GRETA regular expression libraries have very different implementations. The Boost regex library uses a byte-code representation that is interpreted iteratively with a big switch statement. This is a typical of many finite-state machine implementations. GRETA compiles regular expressions into a directed graph of polymorphic nodes, which is executed recursively. GRETA's approach is similar to the Interpreter design pattern, and it has different performance characteristics than Boost. The tables below show how the two libraries perform in different situations.

Times were obtained on a dual 1.7GHz Pentium IV PC with 1 GB RAM running Windows XP SP1, and the code was compiled with Visual C++ 7.1 with all optimizations turned on. Most of the regular expressions are taken from the Boost regex examples, or from the Library of Regular Expressions. GRETA version 2.6.1 is used for this test, and the version of Boost.Regex is the latest released version, available in Boost 1.31.0

John Maddock (the author of Boost regex) performed his own performance comparison between GRETA and Boost, which is available here. The results, particularly when matching against very long strings, are markedly different. The discrepancy is due largely to the fact that John is using an old version of GRETA for his performance tests. Newer versions of GRETA perform better when matching against long strings.

Comparison 1: Short Search

For each of the following regular expressions the time taken to match against the text indicated was measured. For all cases GRETA outperforms Boost. John Maddock has suggested that the reason for the discrepency is that Boost must perform allocations while matching a pattern, whereas GRETA's recursive implementation allows it to match patterns without any allocations.

ExpressionTextGRETABoost
^([0-9]+)(\-| |$)(.*)$100- this is a line of ftp response which contains a message string1
(1.04e‑006s)
1.03
(1.07e‑006s)
([[:digit:]]{4}[- ]){3}[[:digit:]]{3,4}1234-5678-1234-4561
(1.22e‑006s)
1.39
(1.7e‑006s)
^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$john_maddock@compuserve.com1
(2.08e‑006s)
1.23
(2.56e‑006s)
^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$foo12@foo.edu1
(1.87e‑006s)
1.3
(2.44e‑006s)
^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$bob.smith@foo.tv1
(2.32e‑006s)
1.05
(2.44e‑006s)
^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$EH10 2QQ1
(5.35e‑007s)
1.42
(7.58e‑007s)
^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$G1 1AA1
(5.2e‑007s)
1.43
(7.44e‑007s)
^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$SW1 1ZZ1
(5.5e‑007s)
1.41
(7.73e‑007s)
^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$4/1/20011
(5.8e‑007s)
1.13
(6.54e‑007s)
^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$12/12/20011
(5.5e‑007s)
1.22
(6.69e‑007s)
^[-+]?[[:digit:]]*\.?[[:digit:]]*$1231
(5.05e‑007s)
1.41
(7.13e‑007s)
^[-+]?[[:digit:]]*\.?[[:digit:]]*$+3.141591
(5.5e‑007s)
1.41
(7.73e‑007s)
^[-+]?[[:digit:]]*\.?[[:digit:]]*$-3.141591
(5.66e‑007s)
1.34
(7.58e‑007s)

Comparison 2: Long Search

For each of the following regular expressions the time taken to find all occurrences of the expression within a long English language text was measured (mtent12.txt from Project Gutenberg, 19Mb). Boost and GRETA both come of well here.

ExpressionGRETABoost
Twain1
(0.0331s)
4.48
(0.148s)
Huck[[:alpha:]]+1
(0.0312s)
4.63
(0.144s)
[[:alpha:]]+ing3.77
(3.52s)
1
(0.935s)
^[^ ]*?Twain3.64
(1.56s)
1
(0.429s)
Tom|Sawyer|Huckleberry|Finn2.56
(0.62s)
1
(0.242s)
(Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawyer|Huckleberry|Finn)2.21
(0.843s)
1
(0.382s)


Many thanks to John Maddock for allowing me to use his regular expression test harness and HTML templates.