String searching using Aho-Corasick

At Sidelines, we frequently need to identify quickly which athletes a news article may be talking about. Most of the time, a number of machine learning classifiers make sure that the article gets labeled correctly, but sometimes we need to run the title, or even the entire text of the article, against our database of athlete names. Pre-compiled regular expressions do not scale (we have thousands of athlete and staff names in our database), and running each keyword against the target text one at a time would be too slow.

Today we talk about an algorithm whose search phase is O(n + m), where n is the length of the target text and m is the number of matches. It is ideal when you have a large number of keywords and want to search for all of them at once against a corpus of documents. The algorithm, called Aho-Corasick after its authors, first appeared in a 1975 paper titled Efficient string matching: an aid to bibliographic search. The downside is a one-time cost to construct a trie-like data structure, which is then used to search efficiently.

Let’s look at some code. Assume we have some sort of Node class/struct defined as follows:
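Here is a minimal Python sketch of such a Node; the field names follow the description below, while keeping transitions as a dict keyed by character (rather than a plain list) is an assumption made for convenient lookups.

```python
class Node:
    def __init__(self, ch=None):
        self.ch = ch            # character stored at this node
        self.transitions = {}   # ch -> Node: the nodes we can transition to
        self.results = []       # keywords that end at this node
        self.fail = None        # node to jump to when no transition matches
```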

The ch variable holds the character at the current node, transitions holds the nodes we can transition to, results is a list of the keywords that end at this node, and fail is the node to transition to in case of failure (more on this later). For example, storing the keyword Manning will create 7 nodes, and at the node where ch='g', results will hold Manning. One detail I omitted is that we also need a root node whose fail transition is itself, and to which all other nodes' fail transitions point by default. Now let's look at some very verbose code that creates the trie-like data structure, without the fail transitions for now.
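A sketch of that construction, assuming the Node class above and a module-level root whose fail transition is itself:

```python
root = Node()
root.fail = root  # the root's failure transition is itself

def add_keyword(keyword):
    node = root
    for ch in keyword:
        if ch in node.transitions:
            # Reuse the existing transition for this character.
            node = node.transitions[ch]
        else:
            # Otherwise create a new node that fails to root by default.
            child = Node(ch)
            child.fail = root
            node.transitions[ch] = child
            node = child
    # The node for the last character remembers the whole keyword.
    node.results.append(keyword)
```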

This first part is simple. We go keyword by keyword and character by character. For each character we need to add, we check whether the current node already has a transition on that character; if it does not, we create a new node and add it to the transitions. Finally, after we have gone through all the characters in a keyword, we add that keyword to the results list of the node we end up at.
The second part is where the coolness behind Aho-Corasick lies. Constructing the failure transitions is simple: we perform a breadth-first search over the trie and update the failure links as we go. Two notes on the code sample below. First, the failure transition is already set to root for all nodes whose parent is the root, as well as for the root itself (the original paper calls these the states of depth d = 1). Second, we perform the paper's "output function" update during the same pass: when a node's failure link is resolved, we merge the failure node's results into its own, since any keyword ending at the failure node also ends here; skipping this would miss keywords that are proper suffixes of longer keywords.
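A sketch of that BFS pass, assuming root and add_keyword from the snippets above; the queue starts at the depth-1 nodes, whose failure links were already pointed at root during insertion:

```python
from collections import deque

def build_failure_links():
    queue = deque(root.transitions.values())  # the depth d=1 nodes
    while queue:
        node = queue.popleft()
        for ch, child in node.transitions.items():
            queue.append(child)
            # Walk up the failure chain until some node can consume ch.
            fail = node.fail
            while ch not in fail.transitions and fail is not root:
                fail = fail.fail
            child.fail = fail.transitions.get(ch, root)
            # The output-function update: matches ending at the failure node
            # also end here. BFS order guarantees child.fail, being shallower,
            # already has its final results.
            child.results.extend(child.fail.results)
```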

Now, given our completed data structure, searching is trivial and lends itself nicely to a generator construction.
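A minimal sketch of that generator, again assuming the structures built above. Note that after yielding a match it keeps going from the current node instead of dropping to the failure link, which is exactly the pitfall the comment at the bottom of this post points out:

```python
def search(text):
    node = root
    for i, ch in enumerate(text):
        # Chase failure links until some node can consume ch (or we hit root).
        while ch not in node.transitions and node is not root:
            node = node.fail
        node = node.transitions.get(ch, root)
        # Yield every keyword ending at this position, then continue
        # from the same node so overlapping matches are not lost.
        for keyword in node.results:
            yield i - len(keyword) + 1, keyword
```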

For each character in the text, we move through our data structure: if there is a suitable transition we take it; otherwise we follow failure links until one appears (or we are back at the root). Whenever we land on a node with a non-empty results list (the output function in the original paper), we yield those matches and continue from that node. Let's see it in action:
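For example, with a few made-up keywords (Man is found on the way down the Manning path, while ning is only found thanks to the output-function merge during the BFS):

```python
for keyword in ("Manning", "Man", "ning"):
    add_keyword(keyword)
build_failure_links()

print(list(search("Eli Manning")))
# [(4, 'Man'), (4, 'Manning'), (7, 'ning')]
```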

The algorithm has been around since 1975, so there are plenty of efficient implementations out there — the Wikipedia page mentions a few. Happy searching!

Comments

  1. This algorithm has an error as you’ve coded it.

    Instead of continuing if there is a result, it acts like there was a failure. So if I was looking for ‘Pat’ and ‘Patton’ in a block of text that had the word ‘Patton’, it would only find ‘Pat’.
