Learning Scientific Programming with Python (2nd edition)

P4.2.2: The 100 most frequent words in Moby Dick

Question P4.2.2

The novel Moby Dick is out of copyright and can be downloaded as a text file from Project Gutenberg. Write a program to output the 100 most frequent words in the book by storing a count of each word encountered in a dictionary.

Hints: Use Python's string methods to strip out any punctuation. It suffices to replace any instances of the following characters with the empty string: !?":;,()'.*[]. When you have a dictionary with words as the keys and the corresponding word-counts as the values, create a list of (count, word) tuples and sort it.
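The approach described in the hints can be sketched as follows. This is a minimal illustration rather than the book's solution: the function names `count_words` and `most_frequent` and the sample sentence are made up for the example, and converting everything to lower case is an assumption that the question does not state.

```python
# The punctuation characters listed in the hint.
PUNCTUATION = '!?":;,()\'.*[]'

def count_words(text):
    """Return a dictionary mapping each word to its number of occurrences."""
    # Replace every listed punctuation character with the empty string.
    for ch in PUNCTUATION:
        text = text.replace(ch, '')
    counts = {}
    # Assumption: lower-case the text so that "The" and "the" count as one word.
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def most_frequent(counts, n=100):
    """Return the n most frequent words as a list of (count, word) tuples."""
    pairs = [(count, word) for word, count in counts.items()]
    # Tuples sort by count first; ties are broken by the word itself.
    pairs.sort(reverse=True)
    return pairs[:n]

# A short made-up sample stands in for the text of the novel.
sample = 'Call me Ishmael. The whale, the WHALE! (The sea; the ship.)'
top = most_frequent(count_words(sample), 3)
for count, word in top:
    print(f'{word:>10s}: {count}')
```

To apply this to the book itself, one would pass in the contents of the downloaded text file instead of `sample`. Putting the count first in each tuple is what lets a plain sort order the list by frequency.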

Bonus exercise: compare the frequencies of the top 2000 words in Moby Dick with the prediction of Zipf's Law: $$ \log f(w) = \log C - a \log r(w), $$ where $f(w)$ is the number of occurrences of word $w$, $r(w)$ is the corresponding rank (1 = most common, 2 = second most common, etc.) and $C$ and $a$ are constants. In the traditional formulation of the law, $\log C = \log f(w_1)$ and $a=1$, where $w_1$ is the most common word, so that $r(w_1)=1$.
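One possible way to make the comparison is to fit a straight line to $\log f$ against $\log r$ and see how close the fitted $a$ is to 1. The sketch below does this with an ordinary least-squares fit using only the standard library. The function names `zipf_traditional` and `fit_zipf` are hypothetical, natural logarithms are an arbitrary choice, and the frequencies are synthetic values constructed to obey the traditional law exactly, not counts from the novel.

```python
import math

def zipf_traditional(f1, rank):
    """Predicted frequency at a rank, taking log C = log f(w_1) and a = 1."""
    return f1 / rank

def fit_zipf(frequencies):
    """Least-squares fit of log f = log C - a log r; return (log C, a).

    frequencies must be in descending order, so that index 0 has rank 1.
    """
    x = [math.log(r) for r in range(1, len(frequencies) + 1)]
    y = [math.log(f) for f in frequencies]
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    intercept = ybar - slope * xbar
    # The slope of the fitted line is -a and its intercept is log C.
    return intercept, -slope

# Synthetic frequencies following f = 1000 / r (not data from the book).
ideal = [1000 / r for r in range(1, 51)]
log_C, a = fit_zipf(ideal)
print(f'log C = {log_C:.4f}, a = {a:.4f}')
```

With real word counts, the list passed to `fit_zipf` would be the top 2000 counts in descending order, and the fitted line could be plotted alongside the data on logarithmic axes.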