Aho-Corasick Automaton
Key Points
- Aho–Corasick is a trie with failure links that finds all occurrences of many patterns in a single pass over the text.
- Its running time is linear in the text length plus total pattern length plus the number of matches: O(n + m + z).
- Failure links behave like KMP on a trie: when a character does not match, you jump to the longest proper suffix that is also a prefix in the trie.
- Dictionary (output) links or propagated outputs let you report multiple matches that end at the same position, including overlapping patterns.
- Building the automaton uses BFS to compute failure links and to complete transitions, making lookups O(1) per character for fixed alphabets.
- You can count occurrences of each pattern by accumulating visits and pushing counts up the failure-link tree.
- Space can be O(N · |Σ|) for dense transition tables or O(N + E) with sparse maps, where N is the number of states and |Σ| is the alphabet size.
- It is the go-to tool for fast multi-pattern searching, spam filtering, code scanning, DNA motif search, and competitive programming.
Prerequisites
- Trie (prefix tree) — Aho–Corasick builds on a trie to share common prefixes among patterns.
- KMP (Knuth–Morris–Pratt) algorithm — Failure links in Aho–Corasick generalize KMP's prefix function to a trie.
- BFS (Breadth-First Search) — Failure links are computed level by level using BFS.
- Big-O notation — Needed to understand the linear-time guarantees and memory trade-offs.
- Character encoding and alphabets — Choosing the alphabet size affects memory and transition design.
- Basic C++ STL (vector, array, queue) — The implementations use standard containers for nodes and BFS.
- Graph/tree aggregation — Counting occurrences uses a bottom-up accumulation on the failure tree.
Detailed Explanation
Overview
The Aho–Corasick automaton is a data structure and algorithm for searching many patterns at once inside a text. Instead of scanning the text separately for each word, we first insert all patterns into a trie (a prefix tree). Then we decorate this trie with extra pointers called failure links, which tell us where to continue when a character does not match—similar to the logic of KMP, but lifted to a tree of prefixes. Once built, we can process the text in a single left-to-right pass: for each character, we follow the next trie edge if it exists, otherwise we follow failure links until a match is possible, and then continue. Every time we land in a node that corresponds to the end of some pattern (or a suffix of the current path that is a pattern), we report that match. The key benefit is efficiency: the automaton never rewinds the text pointer and handles all patterns simultaneously. Its time complexity is linear in the size of the text plus the total size of all patterns, plus the number of matches reported. This makes it ideal for multi-keyword filtering and matching tasks in competitive programming and real-world applications.
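To make the single-pass idea concrete, here is a minimal sketch of the scan loop only, assuming the automaton is already built with completed transitions; the names go, out, out_link, and pat_len are illustrative stand-ins for the structures defined in the full implementations below.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Sketch of the scan loop only (illustrative names, not a full implementation):
// `go[v][c]` is the completed transition, `out[v]` lists pattern IDs ending at node v,
// `out_link[v]` is the dictionary link (-1 if none), `pat_len[id]` is the pattern length.
vector<pair<int,int>> scan_text(const string &text,
                                const vector<array<int,26>> &go,
                                const vector<vector<int>> &out,
                                const vector<int> &out_link,
                                const vector<int> &pat_len) {
    vector<pair<int,int>> matches;     // (pattern ID, 0-based start index)
    int v = 0;                         // start at the root state
    for (int i = 0; i < (int)text.size(); ++i) {
        v = go[v][text[i] - 'a'];      // one O(1) transition per character, never rewinding
        for (int u = v; u != -1; u = out_link[u])   // walk dictionary links
            for (int id : out[u])                   // every pattern ending at position i
                matches.emplace_back(id, i - pat_len[id] + 1);
    }
    return matches;
}
```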
Intuition & Analogies
Imagine you are scanning a long book to highlight several phrases at once. If you searched for each phrase separately, you would reread the book many times. Aho–Corasick builds a single guide map that tells you, for each partial phrase you might be in the middle of, how to proceed when you see the next letter. The trie is like a road network where each node is a partial phrase (a prefix of some pattern). As you read the book character by character, you try to follow a road labeled with that character. If there is no road, the failure link is an emergency detour: it jumps you to the longest suffix of your current partial phrase that still exists in the network. This is like saying: "Even if my current word stopped matching, maybe I was accidentally building the start of another phrase—let's reuse that progress." Because the detour always shortens your current partial phrase, you will not loop forever; each character triggers only a small amount of work overall. Matches can overlap because the map also connects each location to its smaller suffix phrases that are also complete patterns. That is why, when you land at a node, you may have multiple phrases ending there. In effect, the automaton turns many separate searches into one continuous, efficient stroll through the text with guaranteed linear pace.
Formal Definition
Let P = {p1, ..., pk} be the pattern set, and identify each trie node v with the prefix str(v) it spells from the root. For every node v other than the root, the failure link link(v) is the node for the longest proper suffix of str(v) that is also a prefix in the trie. A node is terminal if str(v) equals some pattern, and its output set out(v) holds the IDs of those patterns. The completed transition go(v, c) equals next(v, c) if the trie has an edge labeled c out of v, and go(link(v), c) otherwise, with go(root, c) = root when no such edge exists. The dictionary (output) link of v points to the nearest terminal node on the chain link(v), link(link(v)), and so on. After reading each text character via go, the automaton sits at the node for the longest suffix of the text read so far that is a trie prefix; the patterns ending at that position are exactly out(v) together with the outputs along v's dictionary-link chain. For example, with patterns {he, she, hers, his}, link("she") = "he", link("hers") = "s", and link("his") = "s".
When to Use
Use Aho–Corasick whenever you must search for many patterns in a single pass: keyword filters (profanity/spam/abuse detection), intrusion detection systems looking for signatures, static analysis tools scanning source for APIs, malware pattern scanning, and bioinformatics motif searches. In competitive programming, it shines when: (1) you must find all occurrences of K words in a large text; (2) patterns can overlap and you must count all matches; (3) you need to compute for each pattern whether it appears at least once, or how many times; (4) you need to process a stream online, reporting as soon as any pattern appears. It is also a foundation for DP over states (e.g., counting strings that avoid forbidden patterns by running DP over automaton states). Prefer Aho–Corasick over multiple KMP runs when the number of patterns is large or when total pattern length is comparable to text length. If you only have one pattern, use KMP; if you need approximate or regex matching with wildcards, AC alone is not enough (consider automata for regex or other specialized structures). For very large alphabets and extremely sparse patterns, consider sparse adjacency (hash maps) to reduce memory.
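As a concrete illustration of the DP-over-states idea, here is a minimal sketch that counts the strings of length L over 'a'..'z' containing none of the forbidden patterns. The sample patterns, target length L, and modulus are illustrative assumptions, and the automaton is rebuilt compactly rather than reusing the structs shown later.

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    const int ALPH = 26;
    const long long MOD = 1'000'000'007LL;      // assumed modulus for counting

    vector<string> forbidden = {"ab", "aba"};   // assumed sample patterns
    int L = 4;                                  // assumed target string length

    // Build a trie with failure links, completed transitions, and a "bad" flag.
    vector<array<int, ALPH>> nxt(1); nxt[0].fill(-1);
    vector<int> link(1, 0);
    vector<char> bad(1, 0);
    for (const string &s : forbidden) {
        int v = 0;
        for (char ch : s) {
            int c = ch - 'a';
            if (nxt[v][c] == -1) {
                nxt[v][c] = (int)nxt.size();
                nxt.push_back({}); nxt.back().fill(-1);
                link.push_back(0); bad.push_back(0);
            }
            v = nxt[v][c];
        }
        bad[v] = 1;                             // a forbidden pattern ends here
    }
    queue<int> q;
    for (int c = 0; c < ALPH; ++c) {
        if (nxt[0][c] == -1) nxt[0][c] = 0; else q.push(nxt[0][c]);
    }
    while (!q.empty()) {
        int v = q.front(); q.pop();
        bad[v] = bad[v] || bad[link[v]];        // state is bad if any suffix is forbidden
        for (int c = 0; c < ALPH; ++c) {
            int u = nxt[v][c];
            if (u != -1) { link[u] = nxt[link[v]][c]; q.push(u); }
            else nxt[v][c] = nxt[link[v]][c];
        }
    }

    // dp[v] = number of ways to reach state v after some prefix without touching a bad state.
    int N = (int)nxt.size();
    vector<long long> dp(N, 0), ndp(N, 0);
    dp[0] = 1;
    for (int step = 0; step < L; ++step) {
        fill(ndp.begin(), ndp.end(), 0);
        for (int v = 0; v < N; ++v) {
            if (dp[v] == 0 || bad[v]) continue;
            for (int c = 0; c < ALPH; ++c) {
                int u = nxt[v][c];
                if (!bad[u]) ndp[u] = (ndp[u] + dp[v]) % MOD;
            }
        }
        dp.swap(ndp);
    }
    long long total = 0;
    for (int v = 0; v < N; ++v) if (!bad[v]) total = (total + dp[v]) % MOD;
    cout << total << "\n";   // strings of length L that avoid every forbidden pattern
    return 0;
}
```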
⚠️ Common Mistakes
Common pitfalls include:
- Forgetting to report matches reachable via failure links. If you only check the current node's output, you may miss suffix patterns; fix this by using dictionary/output links or by propagating outputs along failure links during build.
- Incorrectly initializing root transitions. Ensure next[root][c] defaults to root (0) for missing edges; otherwise the search loop needs special casing and becomes error-prone.
- Miscomputing failure links. Compute links in BFS order from the root, and for a child reached via character c, set link[child] = next[link[parent]][c].
- Double counting matches by appending out[link[v]] into out[v] during build while also following output links during search; pick one approach and apply it consistently.
- Memory blow-up when using dense next tables with big alphabets; switch to sparse maps or compress the alphabet.
- Off-by-one errors in reporting positions: the starting index is end_index - |pattern| + 1.
- Counting occurrences incorrectly. With the failure-tree DP, you must aggregate counts from deeper to shallower nodes (reverse BFS or a topological order over the link tree).
- Mixing 1-based and 0-based node indices, or leaving arrays uninitialized; initialize next to -1 and links to 0 (root) to avoid undefined behavior.
Key Formulas
Aho–Corasick Total Time
O(n + m + z)
Explanation: Total matching time is linear in the text length n, plus the total pattern length m, plus the z reported matches. Each character advances the automaton with amortized O(1) work, and reporting costs O(1) per match.
State Count Upper Bound
N ≤ 1 + Σᵢ |pᵢ|
Explanation: The number of trie nodes N is at most the total number of characters inserted plus the root. Shared prefixes reduce N, but this bound always holds.
Failure Link Definition
link(v) = the longest proper suffix of str(v) that is also a prefix (node) in the trie
Explanation: The failure link of node v points to the longest suffix of its string that is still present as a trie prefix. This is the next best match after a mismatch.
Completed Transition Function
go(v, c) = next(v, c) if the edge exists, otherwise go(link(v), c), with go(root, c) = root when no edge exists
Explanation: By defining next for all characters via failure links, the automaton becomes a DFA. This enables O(1) transitions during search for fixed alphabets.
Dictionary (Output) Link
exit(v) = the nearest terminal node on the chain link(v), link(link(v)), ... (or none)
Explanation: The output link jumps from any node to the nearest terminal node reachable by following failure links. Chaining these reports all suffix pattern matches.
Failure-Tree Aggregation
cnt(v) = visits(v) + Σ cnt(u) over all u with link(u) = v
Explanation: To count occurrences per pattern after scanning, accumulate visit counts from children to parents along failure links in reverse depth order. Each node's count contributes to its suffix matches.
Space Complexity
O(N · |Σ|) dense, or O(N + E) sparse
Explanation: Dense arrays use space proportional to the number of states N times the alphabet size |Σ|. Sparse maps store only the existing edges E, saving memory for large alphabets.
Complexity Analysis
Building the trie takes O(m) work for total pattern length m. Computing failure links and completing transitions by BFS costs O(N · |Σ|) time with dense tables (N states, alphabet Σ); with sparse edge maps, transitions are left incomplete and the search follows failure links instead, which is still amortized O(1) transitions per character (each costing a map lookup). Scanning a text of length n then costs O(n + z) for z reported matches, giving O(n + m + z) in total. Memory is O(N · |Σ|) for dense transition tables or O(N + E) for sparse storage of the E existing edges.
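For a rough sense of scale (illustrative numbers, not from a specific benchmark): with patterns totaling m = 10⁴ characters over a 26-letter alphabet, the automaton has at most N ≈ 10⁴ + 1 states, so a dense transition table holds about 26 × 10⁴ ≈ 2.6 × 10⁵ integers, roughly 1 MB at 4 bytes per entry. Scanning a text of n = 10⁶ characters then costs on the order of 10⁶ constant-time transitions plus one unit of work per reported match.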
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct AhoCorasick {
    static const int ALPH = 26; // lowercase 'a'..'z'
    struct Node {
        array<int, ALPH> next; // transitions
        int link;              // failure link
        int out_link;          // dictionary (output) link to next terminal (or -1)
        vector<int> out;       // pattern IDs that end at this node
        Node() : link(0), out_link(-1) { next.fill(-1); }
    };

    vector<Node> t;      // trie + automaton
    vector<int> pat_len; // length of each pattern by ID

    AhoCorasick() { t.emplace_back(); } // root = 0

    // Map character to alphabet index
    static int idx(char c) { return c - 'a'; }

    // Add a pattern and return its ID
    int add_pattern(const string &s) {
        int v = 0;
        for (char ch : s) {
            int c = idx(ch);
            if (t[v].next[c] == -1) {
                t[v].next[c] = (int)t.size();
                t.emplace_back();
            }
            v = t[v].next[c];
        }
        int id = (int)pat_len.size();
        t[v].out.push_back(id);
        pat_len.push_back((int)s.size());
        return id;
    }

    // Build failure links, dictionary links, and complete transitions
    void build() {
        queue<int> q;
        // Initialize root transitions to 0 if missing
        for (int c = 0; c < ALPH; ++c) {
            int u = t[0].next[c];
            if (u == -1) t[0].next[c] = 0; // stay at root on missing
            else {
                t[u].link = 0;
                q.push(u);
            }
        }
        t[0].out_link = -1; // root has no output link

        // BFS
        while (!q.empty()) {
            int v = q.front(); q.pop();
            // Set dictionary link based on failure link
            if (!t[t[v].link].out.empty()) t[v].out_link = t[v].link;
            else t[v].out_link = t[t[v].link].out_link;

            for (int c = 0; c < ALPH; ++c) {
                int u = t[v].next[c];
                if (u != -1) {
                    t[u].link = t[t[v].link].next[c];
                    q.push(u);
                } else {
                    t[v].next[c] = t[t[v].link].next[c];
                }
            }
        }
    }

    // Search text and return (patternID, startIndex) for each match
    vector<pair<int,int>> search(const string &text) const {
        vector<pair<int,int>> res;
        int v = 0;
        for (int i = 0; i < (int)text.size(); ++i) {
            int c = idx(text[i]);
            if (c < 0 || c >= ALPH) { v = 0; continue; } // skip non-lowercase
            v = t[v].next[c];

            // Report all patterns ending here via out and dictionary links
            int u = v;
            while (u != -1) {
                for (int id : t[u].out) {
                    int start = i - pat_len[id] + 1; // 0-based start index
                    res.emplace_back(id, start);
                }
                u = t[u].out_link;
            }
        }
        return res;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    AhoCorasick ac;
    vector<string> patterns = {"he", "she", "hers", "his"};
    for (auto &p : patterns) ac.add_pattern(p);
    ac.build();

    string text = "ushers his sheep";
    auto matches = ac.search(text);

    // Print matches: pattern, start index
    for (auto [id, pos] : matches) {
        cout << "pattern '" << patterns[id] << "' at index " << pos << "\n";
    }
    return 0;
}
```
We build a trie of lowercase patterns, compute failure and dictionary links by BFS, complete all transitions to make a DFA, then scan the text in one pass. At each position, we report matches in the current node and along its dictionary links so we capture suffix patterns and overlaps. Positions are 0-based start indices.
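For reference, with the patterns {"he", "she", "hers", "his"} and the text "ushers his sheep" above, the program should report the matches below, in order of their ending position in the text:

```
pattern 'she' at index 1
pattern 'he' at index 2
pattern 'hers' at index 2
pattern 'his' at index 7
pattern 'she' at index 11
pattern 'he' at index 12
```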
```cpp
#include <bits/stdc++.h>
using namespace std;

struct AhoCount {
    static const int ALPH = 26;
    struct Node {
        array<int, ALPH> next;
        int link;
        Node() : link(0) { next.fill(-1); }
    };

    vector<Node> t;
    vector<int> pid_state; // terminal state of each pattern ID
    vector<int> pat_len;

    AhoCount() { t.emplace_back(); }

    static int idx(char c) { return c - 'a'; }

    int add_pattern(const string &s) {
        int v = 0;
        for (char ch : s) {
            int c = idx(ch);
            if (t[v].next[c] == -1) {
                t[v].next[c] = (int)t.size();
                t.emplace_back();
            }
            v = t[v].next[c];
        }
        int id = (int)pid_state.size();
        pid_state.push_back(v);
        pat_len.push_back((int)s.size());
        return id;
    }

    void build() {
        queue<int> q;
        for (int c = 0; c < ALPH; ++c) {
            int u = t[0].next[c];
            if (u == -1) t[0].next[c] = 0;
            else { t[u].link = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (int c = 0; c < ALPH; ++c) {
                int u = t[v].next[c];
                if (u != -1) { t[u].link = t[t[v].link].next[c]; q.push(u); }
                else t[v].next[c] = t[t[v].link].next[c];
            }
        }
    }

    // Return occurrences count for each pattern ID
    vector<long long> count_in_text(const string &text) const {
        int N = (int)t.size();
        vector<long long> visit(N, 0);
        int v = 0;
        for (char ch : text) {
            int c = idx(ch);
            if (c < 0 || c >= ALPH) { v = 0; continue; }
            v = t[v].next[c];
            // increment visits at current state
            visit[v]++;
        }
        // Build failure tree adjacency to aggregate counts
        vector<vector<int>> children(N);
        for (int vtx = 1; vtx < N; ++vtx) children[t[vtx].link].push_back(vtx);

        // Iterative DFS to order parents before children, then reverse for bottom-up accumulation
        vector<int> stk = {0};
        vector<int> order; order.reserve(N);
        while (!stk.empty()) {
            int x = stk.back(); stk.pop_back();
            order.push_back(x);
            for (int y : children[x]) stk.push_back(y);
        }
        // reverse order to accumulate from deeper to shallower
        for (int i = (int)order.size() - 1; i >= 0; --i) {
            int x = order[i];
            for (int y : children[x]) visit[x] += visit[y];
        }

        vector<long long> ans(pid_state.size(), 0);
        for (int id = 0; id < (int)pid_state.size(); ++id) ans[id] = visit[pid_state[id]];
        return ans;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    AhoCount ac;
    vector<string> patterns = {"a", "aa", "aaa"};
    for (auto &p : patterns) ac.add_pattern(p);
    ac.build();

    string text = "aaaaa"; // 'a' x5
    auto occ = ac.count_in_text(text);
    for (int i = 0; i < (int)patterns.size(); ++i) {
        cout << patterns[i] << ": " << occ[i] << "\n";
    }
    return 0;
}
```
We count how many times each automaton state is visited during the scan, then aggregate counts along failure links from children to parents. The count at a terminal state equals the number of occurrences of its pattern, including those found via suffix matches. This approach avoids per-position output traversal and yields all counts at once.
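On the sample input above ("a", "aa", "aaa" against "aaaaa"), the aggregation should yield counts that match the overlapping occurrences in a run of five 'a' characters:

```
a: 5
aa: 4
aaa: 3
```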
```cpp
#include <bits/stdc++.h>
using namespace std;

struct AhoStream {
    static const int ALPH = 128; // ASCII subset
    struct Node {
        array<int, ALPH> next;
        int link;
        bool terminal;   // any pattern ends here
        vector<int> out; // pattern IDs (optional to store strings separately)
        Node() : link(0), terminal(false) { next.fill(-1); }
    };

    vector<Node> t;
    vector<int> pat_len;
    int state = 0;      // current streaming state
    long long pos = -1; // last processed index

    AhoStream() { t.emplace_back(); }

    static int idx(unsigned char c) { return (int)c; }

    int add_pattern(const string &s) {
        int v = 0;
        for (unsigned char ch : s) {
            int c = idx(ch);
            if (t[v].next[c] == -1) { t[v].next[c] = (int)t.size(); t.emplace_back(); }
            v = t[v].next[c];
        }
        t[v].terminal = true;
        int id = (int)pat_len.size();
        t[v].out.push_back(id);
        pat_len.push_back((int)s.size());
        return id;
    }

    void build() {
        queue<int> q;
        for (int c = 0; c < ALPH; ++c) {
            int u = t[0].next[c];
            if (u == -1) t[0].next[c] = 0; else { t[u].link = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            // propagate terminal info along failure links
            if (t[t[v].link].terminal) {
                t[v].terminal = true;
                // optionally merge out lists (keep light to avoid duplicates)
                // for streaming early exit, knowing terminal is enough
            }
            for (int c = 0; c < ALPH; ++c) {
                int u = t[v].next[c];
                if (u != -1) { t[u].link = t[t[v].link].next[c]; q.push(u); }
                else t[v].next[c] = t[t[v].link].next[c];
            }
        }
    }

    // Feed one character; return optional pair(patternID, endPosition) for the first match at this character.
    // For simplicity, return only whether any banned pattern matched and its ending position.
    optional<pair<int,long long>> feed(char ch) {
        ++pos;
        state = t[state].next[idx((unsigned char)ch)];
        if (t[state].terminal) {
            // We can find any one pattern ID (first stored) and compute end position
            int pid = t[state].out.empty() ? -1 : t[state].out[0];
            return make_pair(pid, pos);
        }
        return nullopt;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    AhoStream ac;
    vector<string> banned = {"spam", "hack", "bad"};
    for (auto &p : banned) ac.add_pattern(p);
    ac.build();

    string stream = "this message might contain spam later";
    for (char ch : stream) {
        auto hit = ac.feed(ch);
        if (hit) {
            int pid = hit->first;
            long long endpos = hit->second;
            long long start = (pid == -1 ? -1 : endpos - (long long)ac.pat_len[pid] + 1);
            cout << "BANNED detected at [" << start << ", " << endpos << "]\n";
            break; // early exit on first match
        }
    }
    return 0;
}
```
We build an ASCII automaton with dense transitions for speed. During streaming, we maintain the current state and position; each feed updates the state and immediately checks if we are at any terminal node. This allows instant detection and early exit on the first banned phrase, suitable for real-time filters.
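On the sample stream above, the first terminal state is reached when the final 'm' of "spam" is fed, so the program should print BANNED detected at [27, 30] and stop. Note that because out lists are not merged during terminal propagation, a match detected only through a propagated terminal flag would report a pattern ID (and start position) of -1; merging or looking up output IDs along the failure chain fixes this when exact pattern identity is required.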