Algorithmathon Day 13: Trie Longest Common Substring

Created: 30 July 2013 Modified:

Instead of more String manipulations, as I indicated, we will be looking into Tries and the Longest Common Substring problem. A friend mentioned the problem to me and I became very interested in Tries as a result.

Trie is a type of Tree that proves itself useful in string manipulation. To find the Longest Common Substring (LCS) Suffix Trie is particularly useful. You create a Suffix tree by taking all the possible suffixes of a string and adding it to the Trie. We can find the LCS by adding all of the suffixes for two strings to a Trie with a label for each string. Then finding the longest path with both string labels will give us the LCS.

Label strings

Find all suffixes of the Strings

Add all suffixes to the trie.

Find next pointer which points to a branch with both string labels

Follow the branch until it no longer has both string labels.

Record the resulting sequence of letters

If the sequence of letters is longer than the last sequence of letters remember it.

Goto step 4

The cost of adding the strings should be n(n-1). For string S_{1} and string S_{2} this would be S_{1}(S_{1} - 1) + S_{2}(S_{2} - 1). If S_{1} represents the the longer string. The searching should have a cost of S_{1 + (26 * S2(S2 - 1)). Where S2 is the shorter of the two strings.}

Descriptions of Suffix Tries often refer to them being Compressed Trie variant. To reduce complexity I have chose not to compress the trie.

We will need several RSpec tests to verify the needed functionality for adding strings to the Trie.

trie_spec.rb

The code below satisfies these tests and provides the core functionality of our Trie. Rather than building a custom node Class we are using Arrays.

trie.rb

We now need to write a method to add all the suffixes of a string to the trie.

trie_spec.rb (excerpt)

The code to satisfy these tests is quite straightforward. We add all possible suffixes of the string with a marker.

trie.rb

Now all that is required is for us to find the LCS. Which we can do by searching for branches of the trie that have markers from both strings and returning the longest one.

trie_spec.rb (excerpt)

Finally the code to satisfy the latest RSpec tests will give us our functionality.