Learning Scientific Programming with Python (2nd edition)

P4.3.6: Word analysis

Question P4.3.6

The Brown Corpus is a collection of 500 samples of (American) English-language text compiled in the 1960s for use in the field of computational linguistics. It can be downloaded here.

Each sample in the corpus consists of words, each tagged with its part of speech after a forward slash. For example:

The/at football/nn opponent/nn on/in homecoming/nn is/bez ,/, of/in
course/nn ,/, selected/vbn with/in the/at view/nn that/cs

Here, The has been tagged as an article (/at), football as a noun (/nn) and so on. A full list of the tags is available from the accompanying manual, although the tags themselves are presented more clearly in the Wikipedia article.
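As an illustration of the tagging format, the following minimal sketch splits one line of corpus text into (word, tag) pairs. The function name parse_tagged_line is hypothetical, and the sketch assumes that tokens are separated by whitespace and that the tag follows the last forward slash in each token.

```python
def parse_tagged_line(line):
    """Split a line of word/tag tokens into a list of (word, tag) pairs.

    Assumes whitespace-separated tokens, each of the form word/tag.
    """
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that a word containing '/' stays intact.
        word, sep, tag = token.rpartition('/')
        if sep:  # ignore any token with no slash at all
            pairs.append((word, tag))
    return pairs


sample = "The/at football/nn opponent/nn on/in homecoming/nn is/bez ,/, of/in"
pairs = parse_tagged_line(sample)
print(pairs[:3])  # [('The', 'at'), ('football', 'nn'), ('opponent', 'nn')]
```

Note that punctuation is tagged as well, so the token ,/, yields the pair (',', ','). Such tokens can be filtered out later, for example with str.isalpha().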

Write a program which analyses the Brown Corpus and, for each two-letter combination that occurs in exactly two distinct eight-letter words, reports those two words. For example, the two-letter combination pc is present only in the words topcoats and upcoming; mt is present only in the words boomtown and undreamt.
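One possible approach to the counting step is sketched below on a toy word list rather than on the corpus itself. It is not a full solution: the function names are hypothetical, and the sketch assumes that a "two-letter combination" means two adjacent letters, that case is ignored, and that repeated occurrences of the same word count only once.

```python
from collections import defaultdict


def words_by_letter_pair(words, length=8):
    """Map each adjacent two-letter combination to the set of words containing it."""
    pair_words = defaultdict(set)
    for word in words:
        word = word.lower()  # assumption: treat 'Upcoming' and 'upcoming' alike
        if len(word) != length or not word.isalpha():
            continue  # keep only purely alphabetic words of the required length
        for i in range(len(word) - 1):
            pair_words[word[i:i + 2]].add(word)
    return pair_words


def pairs_in_exactly_two_words(words, length=8):
    """Return {pair: [word1, word2]} for pairs found in exactly two distinct words."""
    pair_words = words_by_letter_pair(words, length)
    return {pair: sorted(ws) for pair, ws in pair_words.items() if len(ws) == 2}


# Toy input standing in for the words extracted from the corpus.
toy_words = ["topcoats", "upcoming", "boomtown", "undreamt", "Upcoming", "the"]
result = pairs_in_exactly_two_words(toy_words)
print(result["pc"])  # ['topcoats', 'upcoming']
print(result["mt"])  # ['boomtown', 'undreamt']
```

Using a set for each combination means a word that appears many times in the corpus is still counted once, which is what "present only in the words topcoats and upcoming" appears to require.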