🗂️ Data Structure · Advanced

Aho-Corasick Automaton

Key Points

  • Aho–Corasick is a trie with failure links that finds all occurrences of many patterns in a single pass over the text.
  • Its running time is linear in the text length plus total pattern length plus the number of matches: O(n + \sum m_i + z).
  • Failure links behave like KMP on a trie: when a character does not match, you jump to the longest proper suffix that is also a prefix in the trie.
  • Dictionary (output) links or propagated outputs let you report multiple matches that end at the same position, including overlapping patterns.
  • Building the automaton uses BFS to compute failure links and to complete transitions, making lookups O(1) per character for fixed alphabets.
  • You can count occurrences of each pattern by accumulating visits and pushing counts up the failure-link tree.
  • Space can be O(N \sigma) for dense transition tables or O(N + E) with sparse maps, where N is the number of states and \sigma is the alphabet size.
  • It is the go-to tool for fast multi-pattern searching, spam filtering, code scanning, DNA motif search, and competitive programming.

Prerequisites

  • Trie (prefix tree): Aho–Corasick builds on a trie to share common prefixes among patterns.
  • KMP (Knuth–Morris–Pratt) algorithm: Failure links in Aho–Corasick generalize KMP's prefix function to a trie.
  • BFS (Breadth-First Search): Failure links are computed level by level using BFS.
  • Big-O notation: Needed to understand the linear-time guarantees and memory trade-offs.
  • Character encoding and alphabets: Choosing the alphabet size affects memory and transition design.
  • Basic C++ STL (vector, array, queue): The implementations use standard containers for nodes and BFS.
  • Graph/tree aggregation: Counting occurrences uses a bottom-up accumulation on the failure tree.

Detailed Explanation


01 Overview

The Aho–Corasick automaton is a data structure and algorithm for searching many patterns at once inside a text. Instead of scanning the text separately for each word, we first insert all patterns into a trie (a prefix tree). Then we decorate this trie with extra pointers called failure links, which tell us where to continue when a character does not match—similar to the logic of KMP, but lifted to a tree of prefixes. Once built, we can process the text in a single left-to-right pass: for each character, we follow the next trie edge if it exists, otherwise we follow failure links until a match is possible, and then continue. Every time we land in a node that corresponds to the end of some pattern (or a suffix of the current path that is a pattern), we report that match. The key benefit is efficiency: the automaton never rewinds the text pointer and handles all patterns simultaneously. Its time complexity is linear in the size of the text plus the total size of all patterns, plus the number of matches reported. This makes it ideal for multi-keyword filtering and matching tasks in competitive programming and real-world applications.

02 Intuition & Analogies

Imagine you are scanning a long book to highlight several phrases at once. If you searched for each phrase separately, you would reread the book many times. Aho–Corasick builds a single guide map that tells you, for each partial phrase you might be in the middle of, how to proceed when you see the next letter. The trie is like a road network where each node is a partial phrase (a prefix of some pattern). As you read the book character by character, you try to follow a road labeled with that character. If there is no road, the failure link is an emergency detour: it jumps you to the longest suffix of your current partial phrase that still exists in the network. This is like saying: "Even if my current word stopped matching, maybe I was accidentally building the start of another phrase—let's reuse that progress." Because the detour always shortens your current partial phrase, you will not loop forever; each character triggers only a small amount of work overall. Matches can overlap because the map also connects each location to its smaller suffix phrases that are also complete patterns. That is why, when you land at a node, you may have multiple phrases ending there. In effect, the automaton turns many separate searches into one continuous, efficient stroll through the text with guaranteed linear pace.

03 Formal Definition

Let P = {p_1, p_2, ..., p_k} be a set of patterns over an alphabet \Sigma. Build a trie T where each node v corresponds to some string s(v) that is a prefix of at least one pattern. Mark nodes whose s(v) is exactly a pattern as terminal and store their pattern IDs. For each node v (except the root), define its failure link link(v) to be the node u (possibly the root) such that s(u) is the longest proper suffix of s(v) that appears in T. When processing text t_1 t_2 ... t_n, maintain a current node v. For character t_i, follow the trie edge labeled t_i if it exists; otherwise repeatedly set v := link(v) until such an edge exists, then take it (or remain at the root). The output of a node v consists of all pattern IDs whose strings equal s(v); additionally, to report suffix patterns, one may follow dictionary/output links defined as out(v) = v if v is terminal, else out(v) = out(link(v)). The automaton can be extended to a DFA by defining transitions for all (v, c) via next[v][c] = child if present, else next[link(v)][c], with next[root][c] defaulting to root for missing edges. This produces linear-time matching.

04 When to Use

Use Aho–Corasick whenever you must search for many patterns in a single pass: keyword filters (profanity/spam/abuse detection), intrusion detection systems looking for signatures, static analysis tools scanning source for APIs, malware pattern scanning, and bioinformatics motif searches. In competitive programming, it shines when: (1) you must find all occurrences of K words in a large text; (2) patterns can overlap and you must count all matches; (3) you need to compute for each pattern whether it appears at least once, or how many times; (4) you need to process a stream online, reporting as soon as any pattern appears. It is also a foundation for DP over states (e.g., counting strings that avoid forbidden patterns by running DP over automaton states). Prefer Aho–Corasick over multiple KMP runs when the number of patterns is large or when total pattern length is comparable to text length. If you only have one pattern, use KMP; if you need approximate or regex matching with wildcards, AC alone is not enough (consider automata for regex or other specialized structures). For very large alphabets and extremely sparse patterns, consider sparse adjacency (hash maps) to reduce memory.

⚠️ Common Mistakes

Common pitfalls include: (1) Forgetting to report matches reachable via failure links. If you only check the current node’s output, you may miss suffix patterns; fix this by using dictionary/output links or by propagating outputs along failure links during build. (2) Incorrectly initializing root transitions. Ensure next[root][c] defaults to root (0) for missing edges, otherwise the search loop may need special casing and can be error-prone. (3) Miscomputing failure links: compute links in BFS order from the root, and for a child via character c, set link[child] = next[link[parent]][c]. (4) Double counting matches when both out[v] and out[link[v]] are appended into out[v] while also following output links during search—do one approach consistently. (5) Large memory blow-up when using dense next tables with big alphabets; switch to sparse maps or compress the alphabet. (6) Off-by-one errors in reporting positions: the starting index is end_index - |pattern| + 1. (7) Counting occurrences incorrectly: when using the failure-tree DP, you must aggregate counts from deeper to shallower nodes (reverse BFS/topological over the link tree). (8) Mixing 1-based and 0-based node indices or uninitialized arrays; initialize next to -1 and links to 0/root to avoid undefined behavior.

Key Formulas

Aho–Corasick Total Time

O(n + \sum m_i + z)

Explanation: Total matching time is linear in the text length n, plus the total pattern length, plus z reported matches. Each character advances the automaton with amortized O(1) work, and reporting costs O(1) per match.

State Count Upper Bound

N \le 1 + \sum m_i

Explanation: The number of trie nodes N is at most the total number of characters inserted plus the root. Shared prefixes reduce N, but this bound always holds.

Failure Link Definition

link(v) = u such that s(u) is the longest proper suffix of s(v) present in the trie

Explanation: The failure link of node v points to the longest suffix of its string that is still present as a trie prefix. This is the next best match after a mismatch.

Completed Transition Function

next[v][c] = child(v, c) if it exists, else next[link(v)][c]; next[root][c] = root for missing edges

Explanation: By defining next for all characters via failure links, the automaton becomes a DFA. This enables O(1) transitions during search for fixed alphabets.

Dictionary (Output) Link

out(v) = v if v is terminal, else out(link(v))

Explanation: The output link jumps from any node to the nearest terminal node reachable by following failure links. Chaining these reports all suffix pattern matches.

Failure-Tree Aggregation

cnt[link(v)] += cnt[v] in reverse BFS order; occurrences(p_i) = cnt[terminal(p_i)]

Explanation: To count occurrences per pattern after scanning, accumulate visit counts from children to parents along failure links in reverse depth order. Each node’s count contributes to its suffix matches.

Space Complexity

O(N \sigma) with dense tables, or O(N + E) with sparse maps

Explanation: Dense arrays use space proportional to the number of states times alphabet size. Sparse maps store only existing edges E, saving memory for large alphabets.

Complexity Analysis

Let k be the number of patterns, m_1, ..., m_k their lengths, and n the text length. Build time starts with inserting all patterns into a trie in O(\sum m_i), since each character creates or follows an edge. Failure links are computed via BFS: for each node v and character c, we set next[v][c] to either a child or the transition from link[v]. If we explicitly complete transitions for a fixed alphabet of size \sigma, building is O(N \sigma), where N is the number of nodes (N \le 1 + \sum m_i). Alternatively, we can avoid completing absent transitions and compute them lazily, retaining O(\sum m_i + N \sigma) worst case but often closer to O(\sum m_i) with sparse edges. Once built, matching processes each text character with amortized O(1) transitions because every mismatch strictly shortens the matched prefix via failure links, and each character causes at most a constant number of failure jumps on average. Reporting matches costs O(1) per matched pattern using dictionary links or pre-propagated outputs, so total time is O(n + \sum m_i + z), where z is the total number of matches reported. Space usage depends on representation: dense next tables take O(N \sigma), good for small fixed alphabets (e.g., lowercase letters). For large alphabets (e.g., full ASCII/Unicode) or sparse tries, use hash maps or ordered maps per node, yielding O(N + E) memory, where E is the number of edges actually present. Additional arrays for failure links, output links, terminal lists, and pattern metadata add O(N + k) space.

Code Examples

Basic Aho–Corasick: report all matches with positions and pattern IDs (lowercase a–z)
#include <bits/stdc++.h>
using namespace std;

struct AhoCorasick {
    static const int ALPH = 26; // lowercase 'a'..'z'
    struct Node {
        array<int, ALPH> next; // transitions
        int link;              // failure link
        int out_link;          // dictionary (output) link to nearest terminal (or -1)
        vector<int> out;       // pattern IDs that end at this node
        Node() : link(0), out_link(-1) { next.fill(-1); }
    };

    vector<Node> t;      // trie + automaton
    vector<int> pat_len; // length of each pattern by ID

    AhoCorasick() { t.emplace_back(); } // root = 0

    // Map character to alphabet index
    static int idx(char c) { return c - 'a'; }

    // Add a pattern and return its ID
    int add_pattern(const string &s) {
        int v = 0;
        for (char ch : s) {
            int c = idx(ch);
            if (t[v].next[c] == -1) {
                t[v].next[c] = (int)t.size();
                t.emplace_back();
            }
            v = t[v].next[c];
        }
        int id = (int)pat_len.size();
        t[v].out.push_back(id);
        pat_len.push_back((int)s.size());
        return id;
    }

    // Build failure links, dictionary links, and complete transitions
    void build() {
        queue<int> q;
        // Initialize root transitions to 0 if missing
        for (int c = 0; c < ALPH; ++c) {
            int u = t[0].next[c];
            if (u == -1) t[0].next[c] = 0; // stay at root on missing
            else {
                t[u].link = 0;
                q.push(u);
            }
        }
        t[0].out_link = -1; // root has no output link

        // BFS
        while (!q.empty()) {
            int v = q.front(); q.pop();
            // Set dictionary link based on failure link
            if (!t[t[v].link].out.empty()) t[v].out_link = t[v].link;
            else t[v].out_link = t[t[v].link].out_link;

            for (int c = 0; c < ALPH; ++c) {
                int u = t[v].next[c];
                if (u != -1) {
                    t[u].link = t[t[v].link].next[c];
                    q.push(u);
                } else {
                    t[v].next[c] = t[t[v].link].next[c];
                }
            }
        }
    }

    // Search text and return (patternID, startIndex) for each match
    vector<pair<int,int>> search(const string &text) const {
        vector<pair<int,int>> res;
        int v = 0;
        for (int i = 0; i < (int)text.size(); ++i) {
            int c = idx(text[i]);
            if (c < 0 || c >= ALPH) { v = 0; continue; } // skip non-lowercase
            v = t[v].next[c];

            // Report all patterns ending here via out and dictionary links
            int u = v;
            while (u != -1) {
                for (int id : t[u].out) {
                    int start = i - pat_len[id] + 1; // 0-based start index
                    res.emplace_back(id, start);
                }
                u = t[u].out_link;
            }
        }
        return res;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    AhoCorasick ac;
    vector<string> patterns = {"he", "she", "hers", "his"};
    for (auto &p : patterns) ac.add_pattern(p);
    ac.build();

    string text = "ushers his sheep";
    auto matches = ac.search(text);

    // Print matches: pattern, start index
    for (auto [id, pos] : matches) {
        cout << "pattern '" << patterns[id] << "' at index " << pos << "\n";
    }
    return 0;
}

We build a trie of lowercase patterns, compute failure and dictionary links by BFS, complete all transitions to make a DFA, then scan the text in one pass. At each position, we report matches in the current node and along its dictionary links so we capture suffix patterns and overlaps. Positions are 0-based start indices.

Time: O(n + \sum m_i + z). Space: O(N \sigma + N + k).

Count occurrences of each pattern using failure-tree aggregation
#include <bits/stdc++.h>
using namespace std;

struct AhoCount {
    static const int ALPH = 26;
    struct Node {
        array<int, ALPH> next;
        int link;
        Node() : link(0) { next.fill(-1); }
    };

    vector<Node> t;
    vector<int> pid_state; // terminal state of each pattern ID
    vector<int> pat_len;

    AhoCount() { t.emplace_back(); }

    static int idx(char c) { return c - 'a'; }

    int add_pattern(const string &s) {
        int v = 0;
        for (char ch : s) {
            int c = idx(ch);
            if (t[v].next[c] == -1) {
                t[v].next[c] = (int)t.size();
                t.emplace_back();
            }
            v = t[v].next[c];
        }
        int id = (int)pid_state.size();
        pid_state.push_back(v);
        pat_len.push_back((int)s.size());
        return id;
    }

    void build() {
        queue<int> q;
        for (int c = 0; c < ALPH; ++c) {
            int u = t[0].next[c];
            if (u == -1) t[0].next[c] = 0;
            else { t[u].link = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (int c = 0; c < ALPH; ++c) {
                int u = t[v].next[c];
                if (u != -1) { t[u].link = t[t[v].link].next[c]; q.push(u); }
                else t[v].next[c] = t[t[v].link].next[c];
            }
        }
    }

    // Return occurrence count for each pattern ID
    vector<long long> count_in_text(const string &text) const {
        int N = (int)t.size();
        vector<long long> visit(N, 0);
        int v = 0;
        for (char ch : text) {
            int c = idx(ch);
            if (c < 0 || c >= ALPH) { v = 0; continue; }
            v = t[v].next[c];
            visit[v]++; // increment visits at current state
        }
        // Build failure-tree adjacency to aggregate counts
        vector<vector<int>> children(N);
        for (int vtx = 1; vtx < N; ++vtx) children[t[vtx].link].push_back(vtx);

        // Preorder via an explicit stack, then reverse for bottom-up accumulation
        vector<int> stk = {0};
        vector<int> order; order.reserve(N);
        while (!stk.empty()) {
            int x = stk.back(); stk.pop_back();
            order.push_back(x);
            for (int y : children[x]) stk.push_back(y);
        }
        // Reverse order accumulates from deeper to shallower nodes
        for (int i = (int)order.size() - 1; i >= 0; --i) {
            int x = order[i];
            for (int y : children[x]) visit[x] += visit[y];
        }

        vector<long long> ans(pid_state.size(), 0);
        for (int id = 0; id < (int)pid_state.size(); ++id) ans[id] = visit[pid_state[id]];
        return ans;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    AhoCount ac;
    vector<string> patterns = {"a", "aa", "aaa"};
    for (auto &p : patterns) ac.add_pattern(p);
    ac.build();

    string text = "aaaaa"; // 'a' x5
    auto occ = ac.count_in_text(text);
    for (int i = 0; i < (int)patterns.size(); ++i) {
        cout << patterns[i] << ": " << occ[i] << "\n";
    }
    return 0;
}

We count how many times each automaton state is visited during the scan, then aggregate counts along failure links from children to parents. The count at a terminal state equals the number of occurrences of its pattern, including those found via suffix matches. This approach avoids per-position output traversal and yields all counts at once.

Time: O(n + \sum m_i + N + k). Space: O(N \sigma + N + k).

Streaming filter (ASCII) with early exit on first banned phrase
#include <bits/stdc++.h>
using namespace std;

struct AhoStream {
    static const int ALPH = 128; // ASCII subset
    struct Node {
        array<int, ALPH> next;
        int link;
        bool terminal;   // any pattern ends here
        vector<int> out; // pattern IDs (optional to store strings separately)
        Node() : link(0), terminal(false) { next.fill(-1); }
    };

    vector<Node> t;
    vector<int> pat_len;
    int state = 0;      // current streaming state
    long long pos = -1; // last processed index

    AhoStream() { t.emplace_back(); }

    static int idx(unsigned char c) { return (int)c; }

    int add_pattern(const string &s) {
        int v = 0;
        for (unsigned char ch : s) {
            int c = idx(ch);
            if (t[v].next[c] == -1) { t[v].next[c] = (int)t.size(); t.emplace_back(); }
            v = t[v].next[c];
        }
        t[v].terminal = true;
        int id = (int)pat_len.size();
        t[v].out.push_back(id);
        pat_len.push_back((int)s.size());
        return id;
    }

    void build() {
        queue<int> q;
        for (int c = 0; c < ALPH; ++c) {
            int u = t[0].next[c];
            if (u == -1) t[0].next[c] = 0; else { t[u].link = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            // Propagate terminal info along failure links
            if (t[t[v].link].terminal) {
                t[v].terminal = true;
                // Optionally merge out lists (kept light to avoid duplicates);
                // for streaming early exit, knowing terminal is enough
            }
            for (int c = 0; c < ALPH; ++c) {
                int u = t[v].next[c];
                if (u != -1) { t[u].link = t[t[v].link].next[c]; q.push(u); }
                else t[v].next[c] = t[t[v].link].next[c];
            }
        }
    }

    // Feed one character; return (patternID, endPosition) for the first match at this character.
    // For simplicity, report only whether any banned pattern matched and its ending position.
    optional<pair<int,long long>> feed(char ch) {
        ++pos;
        int c = idx((unsigned char)ch);
        if (c >= ALPH) { state = 0; return nullopt; } // non-ASCII byte: reset to root
        state = t[state].next[c];
        if (t[state].terminal) {
            // Report any one pattern ID (first stored) and the end position
            int pid = t[state].out.empty() ? -1 : t[state].out[0];
            return make_pair(pid, pos);
        }
        return nullopt;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    AhoStream ac;
    vector<string> banned = {"spam", "hack", "bad"};
    for (auto &p : banned) ac.add_pattern(p);
    ac.build();

    string stream = "this message might contain spam later";
    for (char ch : stream) {
        auto hit = ac.feed(ch);
        if (hit) {
            int pid = hit->first;
            long long endpos = hit->second;
            long long start = (pid == -1 ? -1 : endpos - (long long)ac.pat_len[pid] + 1);
            cout << "BANNED detected at [" << start << ", " << endpos << "]\n";
            break; // early exit on first match
        }
    }
    return 0;
}

We build an ASCII automaton with dense transitions for speed. During streaming, we maintain the current state and position; each feed updates the state and immediately checks if we are at any terminal node. This allows instant detection and early exit on the first banned phrase, suitable for real-time filters.

Time: build O(N \sigma); streaming O(1) amortized per character. Space: O(N \sigma + N + k).