Finding anagrams with a trie

(0 comments)

A trie is an ordered tree data structure often used to store word lists, for example, for auto-complete applications. Each node in the tree is associated with a letter; the association of nodes with each other (parents, children, grandchildren, etc.) orders the letters into words. For example, the simple dictionary consisting only of the words "bat", "ball", "coo", "cow" and "cowl" could be represented by the trie in the figure below, in which the black circles indicate nodes which represent word terminators.

enter image description here

In Python, a trie can be constructed as a nested series of dictionaries in which the keys are letters (or None, representing the end of a word):

def make_trie(*words):
    root = {}
    for word in words:
        this_dict = root
        for letter in word:
            this_dict = this_dict.setdefault(letter, {})
        this_dict[None] = None
    return root

trie = make_trie(*("bat", "ball", "coo", "cow", "cowl"))

from pprint import pprint
pprint(trie)

Output:

{'b': {'a': {'l': {'l': {None: None}}, 't': {None: None}}},
 'c': {'o': {'o': {None: None}, 'w': {None: None, 'l': {None: None}}}}}

Checking to see if a word is in the dictionary is as simple as traversing the trie and checking that each letter is attached to a node at the appropriate place (and that the last letter is attached to the word terminator):

def in_trie(trie, word):
    this_dict = trie
    for letter in word:
        try:
            this_dict = this_dict[letter]
        except KeyError:
            return False
    else:
        return None in this_dict

print('cow in trie: ', in_trie(trie, 'cow'))
print('ban in trie: ', in_trie(trie, 'ban'))
print('bal in trie: ', in_trie(trie, 'bal'))

Returns, as expected:

cow in trie:  True
ban in trie:  False
bal in trie:  False

To generate anagrams, we first build a dictionary, letter_counts, of letter: count pairs from the sequence of letters, and then search the trie recursively with the function _anagram. Going into the recursion, for each letter in the dictionary, if it is in the current position in the trie, add it to the list path, decrease its count and call _anagram from that current position. Coming out of the recursion, remove the letter from path and increment its count in letter_counts. If we reach a state where the length of path is the same as the length of the original sequence of letters, then we yield the word from path.

To use the code below, you'll need a word list. Linux and Unix-like systems include one, usually as /usr/share/dict/words. Alternatively, the sowpods.txt word list often used for Scrabble can be found here.

import sys
from collections import Counter

DICTIONARY_FILENAME = 'sowpods.txt'

def read_words(filename):
    """Read in words from filename, making them all lower case."""

    return [line.strip().lower() for line in open(filename)]

def make_trie(words):
    """Build a trie from the iterable object words."""

    root = {}
    for word in words:
        this_dict = root
        for letter in word:
            this_dict = this_dict.setdefault(letter, {})
        this_dict[None] = None
    return root

def anagram(letters):
    """Return (yield) all anagrams of letters found in the dictionary."""

    def _anagram(letter_counts, path, root, word_length):
        """Find anagrams of the letters and counts in the dict letter_counts.

        _anagram is called recursively to search the trie for anagrams: each
        letter from the letter_counts keys which is found in the trie is added
        to the path list and its count decreased before _anagram is called
        from the current position in the trie.
        _anagram is a generator: it yields words of the same length of the
        original word (word_length) as it encounters them.

        """
        if None in root.keys() and len(path) == word_length:
            word = ''.join(path)
            yield word
        for letter, this_dict in root.items():
            count = letter_counts.get(letter, 0)
            if count == 0:
                continue
            letter_counts[letter] = count - 1
            path.append(letter)
            for word in _anagram(letter_counts, path, this_dict, word_length):
                yield word
            path.pop()
            letter_counts[letter] = count

    # Build a dictionary of letter: count pairs from the input letters sequence
    letter_counts = Counter(letters)
    for word in _anagram(letter_counts, [], trie, len(letters)):
        yield word

if __name__ == '__main__':
    # Get the letters from the command line and return all anagrams.
    try:
        letters = sys.argv[1]
    except IndexError:
        print('usage: {} <letters>'.format(sys.argv[0]))
        sys.exit(1)

    trie = make_trie(read_words(DICTIONARY_FILENAME))

    for word in anagram(letters):
        print(word)

For example,

$ python anagram.py dissenter
residents
tiredness
dissenter
Current rating: 4.9

Comments

Comments are pre-moderated. Please be patient and your comment will appear soon.

There are currently no comments

New Comment

required

required (not published)

optional

required