At Sidelines, we frequently need to quickly identify which athletes a news article may be talking about. Most of the time, a number of machine learning classifiers make sure that the article gets correctly labeled, but sometimes we need to run the title or even the entire text of the article against our database of athlete names. Using pre-compiled regular expressions does not scale (we have thousands of athletes and staff names in our database) and running each keyword against the target text would be too slow.

Today we talk about an algorithm that is *O(n + m)*, where *n* is the length of the target text and *m* is the number of matches. This algorithm is ideal if you have a large number of keywords and you want to search for all of them against a corpus of documents. The algorithm, called Aho-Corasick after its authors, first appeared in a 1975 paper titled Efficient string matching: an aid to bibliographic search. The downside is that you have to pay a one-time hit to construct a trie-like data structure which is then used to efficiently search.

Let’s look at some code. Assume we have some sort of *Node* class/struct defined as follows

1 |
Node = MutableNamedTuple('Node', ['ch', 'transitions', 'results', 'fail']) |

The *ch* variable will hold the character at the current node, *transitions* will hold all nodes we can transition to, *results* will be a list of results in case this node is a keyword we stored and *fail* are transitions in case of failure (more on this later). For example, storing the keyword Manning will create 7 nodes and for the node where *ch=’g’*, results will also hold *Manning*. One detail that I omitted is that we also need a root node whose fail transition is itself and where all other nodes fail transitions point to by default. Now let’s look at some very verbose code that creates the trie-like data structure without the fail transitions.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
root = Node(ch=None,transitions=[],results=[],fail=None) root.fail = root queue = deque([root]) for keyword in keywords: current_node = root for char in keyword: new_node = None for transition in current_node.transitions: if transition.ch == char: new_node = transition break if new_node is None: new_node = Node(ch=char) current_node.transitions.append(new_node) if current_node is root: new_node.fail = root current_node = new_node current_node.results.append(keyword) |

This first part is simple. We go keyword by keyword and character by character. For each new character we need to add, we check if our current node already has a transition to that character (lines 9-11). If it doesn’t we create a new node and add it to the list of transitions (lines 14-18). Finally, after we have gone through all characters in a keyword, we add that keyword to the results list of the node we are currently at.

The second part is where the coolness behind Aho-Corasick lies. Constructing the failure transitions is simple, because we can simply perform a Breadth-First Search update the failure links as we go. Note that a couple of things in the code sample below differ from the original paper. First, we already set the failure transition to root for all nodes with root as the parent, as well as the root node itself — the original paper calls these states with depth d=1; also, we do not update the “output function” during our BFS as the original paper does because the results list is created during the previous step.

1 2 3 4 5 6 7 8 9 10 11 |
while queue: current_node = queue.popleft() for node in current_node.transitions: queue.append(node) fail_state_node = current_node.fail while not any(x for x in fail_state_node.transitions if node.ch == x.ch) and fail_state_node is not root: fail_state_node = fail_state_node.fail node.fail = next((x for x in fail_state_node.transitions if node.ch == x.ch and x is not node), root) return root |

Now, given our completed data structure, searching is trivial and lends itself nicely to a generator construction.

1 2 3 4 5 6 7 8 9 10 11 12 |
current_node = tree_root for c in text: transition = None while transition is None and transition is not tree_root: transition = next((x for x in current_node.transitions if x.ch == c), current_node.fail) current_node = transition if len(transition.results) > 0: for result in transition.results: yield result current_node = transition.fail |

For each character in *text*, we move through our data structure and if there is a suitable transition we make it; otherwise we transition to the failure link (line 5). If we are at a node with a non-empty results list (output function in the original paper), then we return that result and continue. Let’s see it in action:

1 2 3 4 5 6 7 8 9 10 11 12 13 |
In [1]: import tree as ac In [2]: aho = ac.build_tree(['Brady', 'Manning', 'Johnson', 'Ochochinco']) In [3]: gen = ac.search(aho, "Brady is a better QB than Manning.") In [4]: gen.next() Out[4]: 'Brady' In [5]: gen.next() Out[5]: 'Manning' In [6]: gen.next() --------------------------------------------------------------------------- StopIteration Traceback (most recent call last) projects/aho/ in () ----> 1 gen.next() StopIteration |

The algorithm has been around since 1975, so there are plenty of efficient implementations out there — the Wikipedia page mentions a few. Happy searching!

hi team, wondering if you have the full code of this script

This algorithm has an error as you’ve coded it.

Instead of continuing if there is a result, it acts like there was a failure. So if I was looking for ‘Pat’ and ‘Patton’ in a block of text that had the word ‘Patton’, it would only find ‘Pat’.

For anyone wanting a working example of this code in python, I put my version here:

http://pastebin.com/cjAsAu3e