Coder's Cat

KMP String Search Algorithm

2020-05-10

Explaination of Knuth–Morris–Pratt string-searching algorithm (or KMP).

file:img/2020_05_10_kmp-string-search.org_20200510_135023.png

Knuth–Morris–Pratt string-searching algorithm (or KMP algorithm) is a string search algorithm.

It’s named with the combined names of authors and first published at 1970.

KMP algorithm is elegant and efficient. Give any text and pattern, the overall time complexity is O(M+N), where M is the length of text and N is the length of pattern.

To understanding KMP, we need to explain with naive string search algorithm and prefix string.

Naive string search algorithm

Naive string search has two loops, we iterate each index from text and check each substring with pattern.

The overall time complexity is O(M*N).

int naive_search(char* s, char* p) {
int sl = strlen(s);
int pl = strlen(p);
int i = 0;
int j = 0;
while (i < sl && j < pl) {
if (s[i] == p[j]) {
i++;
j++;
} else {
i = i - j + 1;
j = 0; // Fallback to the beginning of pattern
}
}
return j == pl ? (i-j) : (-1);
}

The drawback of naïve algorithm is we fallback to the beginning of pattern every time. It will perform badly for such an input:

text    : a a a d e d f
pattern : a a a c

loop #1 : a a a c
loop #2 : a a a c
loop #3 : a a a c

Each time we checked almost all the characters of pattern, but the last one is different, and then we restart the checking process from the beginning.

How to solve this issue?

Prefix and suffix string sets

To understand how to optimize naive string search, we need to know what is the common prefix and suffix substring.

Given any string, a prefix string set is a set contains all the prefix substring of it. For example, the pefix string set of “hello” is:

{ "h", "he", "hel", "hell"}

Suffix string set is similar but only contains suffix sub-string, for “hello”:

{ "o", "ol", "llo", "ello"}

With these two sets, we could get the longest length of common string in two sets. For example:

text: "ababab"
Prefix set: { "a", "ab", "aba", "abab", "ababa" }
Suffix set: { "b", "ab", "bab", "abab", "babab" }

Longest common string length: length("abab") => 4

How to use a common string of prefix and suffix substring to make string search quicker?

For each index of pattern, we build a table which contains the longest length of current substring.

For pattern “ccca”, the table is:

file:img/2020_05_10_kmp-string-search.org_20200510_133533.png

With text input text of “ccccccccca”, we can skip some part of pattern:

file:img/kmp-s1.gif

We even can skip multiple characters from pattern. Given the text of “ababababca”, let’s search the pattern of “abababca”:

file:img/kmp-s2.gif

KMP algorithm

With all above explaination, we can figure out KMP is similar with naive string search algorithm, but with an optimization of next table. We use next table to skip prefix parts of pattern:

void getNext(char* p, int* next){
int i = 0, j = -1;
next[0] = -1;
while (i < strlen(p)) {
if (j == -1 || p[i] == p[j]) {
++i;
++j;
next[i] = j;
}
else j = next[j];
}
}

int KMP(char* text, char* pattern) {
int lt = strlen(text);
int lp = strlen(pattern);
int i = 0;
int j = 0;

int next[lp + 1];
getNext(pattern, next);

while (i < lt && j < lp) {
if (j == -1 || text[i] == pattern[j]) {
i++;
j++;
}
else j = next[j];
}
if (j == lp) return i - j;
return -1;
}


int main() {
char* text = strdup("ababababca");
char* pattern = strdup("abababca");
printf("res: %d\n", KMP(text, pattern));
return 0;
}