Regular expressions are ubiquitous in NLP, not to mention many miscellaneous text-processing tasks. People use regular expressions as a quick solution to matching and parsing. People build entire complex extraction systems with regular expressions. Some people get themselves into serious trouble by trying to apply them beyond their natural limits.

Once upon a time, regular expressions were, well, regular. They came from finite automata theory, and you could turn any expression into a ‘machine’ that could apply it in a predictable amount of space and time. However, programmers being programmers, regular expressions did not remain regular for long. You can read the historical details on Wikipedia. Once your regular expressions are not so regular, it’s not so easy to ensure that you can apply them in bounded time and space.

Sadly, the regular expression classes that come with Java suffer from this problem. A bit of lucky dipping on stackoverflow.com yielded:

“^[A-Za-z0-9]\.?([A-Za-z0-9_-]+\.?)*[A-Za-z0-9]\.[A-Za-z]{2,6}”

While applying this will terminate eventually, you will not want to pay for an Amazon Web Services node to run for the many years involved. The net result is that Java regular expressions are hard to love for large-scale processing, especially if you’d like to let your customers write their own expressions.

At Babel Street, we originally thought about regular expressions in C++. Years ago, we needed a reasonably capable, speedy, regular expression library that could handle the full Unicode repertoire. After some false starts, we settled on the TCL regular expression engine, itself an adaptation of a library written by Henry Spencer. To quote Wikipedia quoting O’Reilly & Associates:

The Tcl library is a hybrid NFA/DFA implementation with improved performance characteristics, earning praise from Jeffrey Friedl who said, “…it really seems quite wonderful.”

While the state of the art in C/C++ has arguably advanced since then, this library has served us well for years. As we ported our products into pure Java, however, the various problems with the built-in Java regular expression classes began to pinch.

We looked around the landscape; surely someone had been here before in open source? Well, actually, not so much, at least if you need non-viral licensing. We were unable to discover any Apache/MIT/BSD-licensed regular expression library that was comparable in power to Java and TCL, but which avoided the traps of unbounded runtime.

So we built one. Starting from the TCL source code, we did an adaptation into Java. The official Java API is not quite friendly to alternative implementations (it has a MatchResult interface, but the rest is all classes), but we gave ours the same shape, as well as exposing all the TCL features. And it’s out there, on github.com, with the Apache License. For now, we’re just keeping it in our corporate github organization, but if a community begins to emerge, we can move it to its own organization or even consider the Apache Incubator.

We hope other people find it useful. We hope other people pitch in and help improve it. We’re looking forward to using it to provide our customers with ever-better NLP solutions.

A Better Pure Java RegEx Engine