Wordnet – Dr John's Tech Talk

Intro
My friend and I are in a perpetual game of Words With Friends on our smartphones these days. It suits me to a T because I like to take a looong time to come up with just the right move (although as time has passed you’ll see I’ve become less enamored with the app. You’ll see this if you have the patience to read through the whole article which was written progressively as I continued to play more games0. And you can’t knock over the board and lose all your played tiles! Yet you can still get advice from other friends by showing the board on your smartphone.

I have a good vocabulary, my friend a little less so. But he takes advantage of a peculiar quirk of playing the game in this way – he makes up words and sees if they’re accepted by the program. And…sometimes they are! So Words with Friends uses a ridiculous dictionary or dictionaries. That’s how he came up with hila. Later I turned the tables on him. I played a “word” that I didn’t believe to be a word, but rather one that I believed Words with Friends might believe to be a word: carbo. Sure enough. Accepted. 59 points. The “o” allowed me to stretch to reach the triple word tile and intersect a “T” to make “to” across. Check the American heritage dictionary. Not there, except as a prefix. My next best idea would only have been about 21 points.

I tried one of those cheating programs, www.lexicalwordfinder.com. I had the letters e,e, u, m, n, a and I think a. What does it suggest? neume. Ha? Who in the world ever heard of that? So it’s obviously plugged into those ridiculous dictionaries.You might as well let computers play other computers if you’re going to use cheats like that. My idea is to make gentle suggestions that you the educated person would have thought of on your own if you had enough time. Or maybe like me it’s on the tip of your tongue but you can’t quite find it in your brain. So I am writing a simple program which draws on a common dictionary – words every well-educated person ought to know. That’s also easier to program! Since I can’t compete with the big boys, my contribution will be to show the steps how I am writing such a program.

I’m running Ubuntu server, which is a Debian Linux variant. Initially I wasn’t sure what all I would need so I did:

# sudo apt-get install dictd dict dict-wn dict-gcide

in the hopes of getting a words file! dict-wn is the WordNet dictionary; gcide is Gnu collaborative international dictionary of english

I found the goods in /usr/share/dictd. wn.index has 87,924 unique words:

# awk '{print $1}' wn.index|uniq|wc

March, 2013 update – CentOS
I’ve since switched my hosting platform to CentOS. There there is the dictionary /usr/share/dict/linux.words. If you don’t have it install the package words-3.0-17.el6. It has 480,000 “words!” I’m not sure why the huge discrepancy, but I know that’s a lot more words than are traditionally mentioned as the number of words in the English language.

. After removing numbers and punctuation marks it’s at 80,724:

# awk '{print $1}' wn.index|uniq|egrep -v [0-9\'.-]|wc

Here are the first few words now:

# awk '{print $1}' wn.index|uniq|egrep -v [0-9\'.-]|head -25
a
aa
aaa
aachen
aah
aaland
aalborg
aalii
aalst
aalto
aar
aardvark
aardwolf
aare
aarhus
aaron
aarp
aas
aave
ab
aba
abaca
abacinate
aback
abactinal

Many of these look suspicious. I want to at least throw out the proper names. I don’t care if Words with Friends accepts them or not. I don’t believe they should be used. For instance:

# dict aar
1 definition found

From WordNet (r) 3.0 (2006) [wn]:

  Aar
      n 1: a river in north central Switzerland that runs northeast
           into the Rhine [syn: {Aare}, {Aar}, {Aare River}]

So how are we going to get rid of the words with capitals? Unfortunately in the index they are all in lower case. I only see the possibility to do a dictionary lookup on each and every one. Here’s what I came up with for that. I’m sure some wiser guy could write this as a 1-liner, but hey, it is what it is:

#!/usr/bin/perl
# DrJohn, 11/2011
# check if words are upper or lower case
# we can only learn when doing a dict lookup
# input are proposed words
$DEBUG = 0;
while() {
  chomp;
  $word = $_;
  $cnt = 0;
  open(DEF,"dict -d wn $word|");
  while() {
    $cnt++;
    if ($cnt == 5) {
# this is the line that repeats the word
      ($wordagain) = $_ =~ /(\w+)/;
      print $wordagain if $DEBUG;
      print "$word\n" if $wordagain eq $word;
      last;
    } # end line five condition stuff
  } # end loop over definition
} # end STDIN

Run it on and the first few results are now like this:

aa
aah
No definitions found for "aaland"
aalii
aardvark
aardwolf
aba
abaca
abacinate
aback
abactinal

We got rid of Aar and many others, but we also got rid of one of the most common words – “a.” Not that it matters for this game, but this points to a possibility that the choice of case in the definition is somewhat arbitrary and a word with several meanings, any of which requires upper case, say like English, is going to be written upper case. Indeed that is the case for English, which of course is a nice word and perfectly acceptable when used with the meaning (sports) the spin given to a ball by striking it on one side. Sigh, nothing’s ever easy. So lets’ modify our program to make a separate list of rejected words so we can review it by hand and add back in words which can be used in lower case. Whenever yuo do something by eye yuo want to reduce the task as much as possible. Here we can take advantage of another fact: a capitalized word with a single definition is not of interest to us because that single definition is the one for which capitalization is required, like Aar. So we only need to consider the cases of capitalized words with mutliple definitions, one of which may be typically used with the word spelled in lower case, like English.

We showed the results syntax for a word with single definition above, for Aar. Here’s an example of a capitalized word with multiple definitions:

 dict -d wn english
1 definition found

From WordNet (r) 3.0 (2006) [wn]:

  English
      adj 1: of or relating to or characteristic of England or its
             culture or people; "English history"; "the English landed
             aristocracy"; "English literature"
      2: of or relating to the English language
      n 1: an Indo-European language belonging to the West Germanic
           branch; the official language of Britain and the United
           States and most of the commonwealth countries [syn:
           {English}, {English language}]
      2: the people of England [syn: {English}, {English people}]
      3: the discipline that studies the English language and
         literature
      4: (sports) the spin given to a ball by striking it on one side
         or releasing it with a sharp twist [syn: {English}, {side}]

There’s different characteristics we could use as markers. I propose to look for a digit immediately followed by a colon as a definition marker. More than one occurrence probably means the word has multiple definitions and we should consider it. So here’s a re-worked version of our program to accomplish that:

#!/usr/bin/perl
# DrJohn, 11/2011
# check if words are upper or lower case
# we can only learn when doing a dict lookup
# input are proposed words
$DEBUG = 0;
open(CAND,">/tmp/candidates") || die "cannot open /tmp/candidates!!\n";
while() {
  chomp;
  $word = $_;
  $cnt = 0;
  $cand = 0;
  $cntdef = 0;
  open(DEF,"dict -d wn $word|");
  while() {
    $cnt++;
    if ($cnt == 5) {
# this is the line that repeats the word
      ($wordagain) = $_ =~ /(\w+)/;
      print $wordagain if $DEBUG;
      if ($wordagain eq $word) {
        print "$word\n";
        last;
      } else {
# maybe there are multiple definitions
        $cand = 1;
      }
    } elsif ($cand) {
      $cntdef++ if /\d:/;
    } # end line five condition stuff
  } # end loop over definition
# print candidate rejected word if there were multiple definitions
  print CAND "$word\n" if $cntdef > 1;
} # end STDIN

Note the regex \d: that we use to determine a definition.

Running this on the first few words, we have a, aaron and ab to consider. Ab is interesting because it’s the name of a degree (in fact I have that degree), as well as shorthand for muscles of the abdomen. So, a lower case usage exists!

Now the program is running kind of slow. So since we don’t want to run it multiple times, perhaps it’s time to turn our attention to this other the problem: all the lines like No definitions found for “aaland” that we are seeing in the meantime:

# grep ^aaland wn.index
aaland islands  OgB     Cz

So these are due to compound words which we don’t want anyways because they won’t be accepted.

So running the modified program produces an output with 63185 words, and another 2170 words to be considered for review. The first few are as follows:

aa
aah
aalii
aardvark
aardwolf
aba
abaca
abacinate
aback
abactinal
abacus
abaft
abalone
abamp
abampere
abandon

Let’s check one:

# dict aalii
1 definition found

From WordNet (r) 3.0 (2006) [wn]:

  aalii
      n 1: a small Hawaiian tree with hard dark wood

Now check American Heritage dictionary, which I consider the Bible. Not there. I believe it would be accepted by Words with Friends because it plays using the Lexical Word Finder cheating program. Just enter your tiles as A A L I I A A to see for yourself.

And here are the first few rejected words which are to be reviewed:

a
aaron
ab
abdias
abel
aberdeen
abilene
abkhas
abkhasian
abkhaz
abkhazian
abnaki
aboriginal
ac
achaean
achomawi
actinia
actium
ad
adalia
adam
adams
add

Add? How did it get there? Well, you know, ADD, the medical condition? Yes, it’s exasperating. But that’s what you get with free. 2000 is not too many to review, however.

I’m having some doubts about the whole project now. I had the letters D E E G R I U. An open D was on the board where there was room for a couple tiles above it and three tiles below. Using the three tiles below would make the play triple word. I initially came up with drug. Then edger, which works out to the same number of points because in WWF the U is two points for some reason. But I wanted to use another tile so I thought and thought. Edgier? Nope, doesn’t fit. Ridge? Fits, but too short. Then the Aha moment: ridged. And I felt a moment of pleasure realizing that most people wuold not have come up with it. But a word suggestion program? It wuold have spit it out first thing, taking away from that aspect of the game.

But then there is the other side. Successful play depends on rote memorization of all two-letter words. And what passes for a WWF word is very questionable, as I’ve said above. Like how about jo? Yup. No, not in standard dictionaries, though.

I’m still thinking about what algorithm to use for the actual program. I can see that potentially it will be expensive, computationally speaking, given all the variants that must be tested. So I thought of making the task even easier: throw out words with more than eight letters. To see how many long words we’re going to toss:

# egrep '\w{9}' betterdict|wc

where betterdict is the results of all my pruning described above. It’s 32386 words we’ll toss. That ought to help alot. And 464 candidate words less we’ll have to consider. We’ll do something like # egrep -v ‘\w{9}’ betterdict > newbetterdict to build our even more slimmed-down dictionary.

Our First Match Program
Here’s a first stab at a matching program. It actually is pretty good (meaning there aren’t an overwhelming number of results) if you have the typical combination of unruly letters. Note we use the awesome of power of regular expressions to do all the heavy lifting.

#!/usr/bin/perl
$m = $ARGV[0];
$DEBUG = 0;
print "match: $m\n";
$dict = "/usr/share/dictd/newbetterdict";
open(DICT,"$dict") || die "cannot open dict $dict!!\n";
while(<DICT>) {
  chomp;
  $word = $_;
  print "word: $word\n" if $DEBUG;
  if ($word =~ /^[$m]{2,}$/) {
# we have the beginning of a match
    $cnt++;
    print "match: word: $word\n";
  }
}
print "matched: $cnt\n";

You run it from the command line with the letters you want to try as argument:

 # match.pl fxitau
match: fxitau
match: word: aa
match: word: affix
match: word: aft
match: word: ataxia
match: word: ax
match: word: fa
match: word: fat
match: word: faux
match: word: fax
match: word: fiat
match: word: fit
match: word: fix
match: word: ft
match: word: ii
match: word: iii
match: word: ix
match: word: tat
match: word: tatu
match: word: tau
match: word: taut
match: word: tax
match: word: taxi
match: word: tiff
match: word: tit
match: word: titi
match: word: tufa
match: word: tuff
match: word: tuft
match: word: tut
match: word: tux
match: word: xi
match: word: xii
match: word: xiii
match: word: xix
match: word: xx
match: word: xxi
...

What’s wrong of course is that while we are requiring matched words to be formed from our letters and only those letters, we have not taken care to avoid duplicate use of the same letter. That regular expression does a lot, but it’s not quite doing everything at this point. Now we start having to get creative to take it to the next level. I’m not sure myself how I’m going to do it.

More on that game. Now we’re getting into the groove. And by that I mean you make up words. So our final score was 381 to 347. Mind you, I’m not claiming any kind of expertise in the game. This is just a sad reflection on the liberalness of what WWF calls a “word.” Here are our made-up words from that one game: carbo, qi, ne, deni, oi, jo, wo and da. Of these, wo is the only one in the American heritage dictionary as a variant spelling of woe. Sad, right? It completely changes the strategy of the game.

Revised Program: Almost There
I thought for awhile I could use the transliteration operator tr, but alas, it does not do interpolation, so I had to go with something a bit more clumsy. But the whole program, which is basically functional, now returns only those words which match the letters provided, which is already a great help. Here is the warts-and-all version:

#!/usr/bin/perl
$m = $ARGV[0];
$DEBUG = 0;
print "match: $m\n";
# split up match
for($i=0;$i<length($m);$i++) {
  $ltr = substr($m,$i,1);
  print "ltr: $ltr\n" if $DEBUG;
# count frequency of this letter
  $ltrhash{$ltr}++;
}
$dict = "/usr/share/dictd/newbetterdict";
open(DICT,"$dict") || die "cannot open dict $dict!!\n";
while(<DICT>) {
  chomp;
  $word = $_;
  $bad = 0;
  %ltrwordhash = ();
  print "word,m: $word,$m\n" if $DEBUG;
  if ($word =~ /^[$m]{2,}$/) {
    print "Begin word analysis\n" if $DEBUG;
# we have the beginning of a match
    for($i=0;$i<length($word);$i++) {
      $ltr = substr($word,$i,1);
      $ltrwordhash{$ltr}++;
      print "ltr,cnt: $ltr,$ltrwordhash{$ltr}\n" if $DEBUG;
# throw out words with too many letter occurences
      if ($ltrwordhash{$ltr} > $ltrhash{$ltr}) {
        print "word tossed due to excess letters. max: $ltrhash{$ltr}\n" if $DEBUG;
        $bad = 1;
        last;
      }
    }
    next if $bad;
# what remains are the good words!
    print "matched word: $word\n";
  }
}

The DEBUG statements helped me find coding errors. I set DEBUG = 1 and run the program. I had forgotten the
%ltrwordhash = (); statement initially to clear out that hash for each new word. That was not good, but a review of the debug output quickly showed what was going on. Now we run it again with my current letters plus a free one (“l”) I want to use from the board:

# match.pl liaeinpu
match: liaeinpu
matched word: ail
matched word: ain
matched word: ale
matched word: alien
matched word: aline
matched word: alp
matched word: alpine
matched word: ane
matched word: ani
matched word: anil
matched word: anile
matched word: ape
matched word: elan
matched word: en
matched word: ie
matched word: ii
matched word: il
matched word: in
matched word: inula
matched word: lane
matched word: lap
matched word: lapin
matched word: lea
matched word: lean
matched word: leap
matched word: lei
matched word: leu
matched word: li
matched word: lie
matched word: lien
matched word: lieu
matched word: lii
matched word: line
matched word: lineup
matched word: lip
matched word: lupin
matched word: lupine
matched word: nail
matched word: nap
matched word: nape
matched word: napu
matched word: neap
matched word: nil
matched word: nip
matched word: nu
matched word: pa
matched word: pail
matched word: pain
matched word: pal
matched word: pale
matched word: pan
matched word: pane
matched word: panel
matched word: pe
matched word: pea
matched word: peal
matched word: pean
matched word: pel
matched word: pen
matched word: penal
matched word: penial
matched word: pi
matched word: pia
matched word: pie
matched word: pilau
matched word: pile
matched word: pin
matched word: pine
matched word: pineal
matched word: plain
matched word: plan
matched word: plane
matched word: plea
matched word: pul
matched word: pula
matched word: pule
matched word: pun
matched word: ulna
matched word: unai
matched word: up

Cool, huh? I wanted to get a long word starting with “l” to pick up a double-word. I was coming up short. I maybe eventually would have thought of it, but before I ran the program, I was not coming up with long matches. So lineup and lupine are really helpful suggestions. Even though it’s gentle hints, it still feels like cheating, however!

And only 38 lines of code, including comments and DEBUG statements.

Next we’ll add some features. Let’s allow a starting/middle/ending letter to be specified. To support optional command-line arguments it’s nice to have more sophisticated argument parsing.

We started a new game. It’s just getting ridiculous as my friend gravitates towards a style of play which is a lot less about word knowledge than about trying all possible letter combinations to maximize the total, regardless of whehter or not it seems like a word. And in order to remain competitive I have to play that way as well, to a degree. So far our made-up “words” in this game include: qi, noh (used twice), oho, obe, fe, mm, noo, jo, oxo and deva. That deva really killed me as it was used to make a triple word play. Now we’ve played a total of 14 vertical words and 12 horizontal words. So 11/26 of the words we’ve so-far used are fabricated nonsense words!
I think it’s a mixed blessing. I welcome additional two-letter words. They’re readily memorized and only a finite number can exist (26² of course). And they really help with the play. For three-letter made-up words I’m on the edge but inclined to not encourage their use. Definitely not for four-letter words and higher. The universe of such words is just too great.

Some Nice Touches
Here’s the program which permits optional beginning letter, end letter and middle letter. I’ve introduced argument parsing with the Getopt::Std module.

#!/usr/bin/perl
use Getopt::Std;
getopts('b:m:e:l:');
usage() unless $opt_l;
$m = $opt_l;
$begltr = $opt_b ? $opt_b: ".";
$endltr = $opt_e ? $opt_e: ".";
$midltr = $opt_m ? $opt_m: ".";
$DEBUG = 0;
print "match: $m\n";
# split up match
for($i=0;$i<length($m);$i++) {
  $ltr = substr($m,$i,1);
  print "ltr: $ltr\n" if $DEBUG;
# count frequency of this letter
  $ltrhash{$ltr}++;
}
$dict = "/usr/share/dictd/newbetterdict";
open(DICT,"$dict") || die "cannot open dict $dict!!\n";
while(<DICT>) {
  chomp;
  $word = $_;
  $bad = 0;
  %ltrwordhash = ();
  print "word,m: $word,$m\n" if $DEBUG;
  if ($word =~ /^[$m]{2,}$/) {
# meet begin/middle/end conditions
    next unless $word =~ /^$begltr.*$midltr.*$endltr$/;
    print "Begin word analysis\n" if $DEBUG;
# we have the beginning of a match
    for($i=0;$i<length($word);$i++) {
      $ltr = substr($word,$i,1);
      $ltrwordhash{$ltr}++;
      print "ltr,cnt: $ltr,$ltrwordhash{$ltr}\n" if $DEBUG;
# throw out words with too many letter occurences
      if ($ltrwordhash{$ltr} > $ltrhash{$ltr}) {
        print "word tossed due to excess letters. max: $ltrhash{$ltr}\n" if $DEBUG;
        $bad = 1;
        last;
      }
    }
    next if $bad;
# what remains are the good words!
    print "matched word: $word\n";
  }
}
sub usage {
print "Usage: $0 [-b beginning_letter] [-e end_letter] [-m middle_letter] -l letters\n";
exit(1);
}

I called it match2.pl. Here’s an example using it with my awful letters (And is it just me, or does WWF have a definite proclivity to throw horrible letters at you for many turns in a row??):

# match2.pl -b b -l duttdsb
match: duttdsb
matched word: bud
matched word: bus
matched word: bust
matched word: but
matched word: butt

Note that my simplistic dictionary does not seem to have plurals. It also does not seem to have verb tenses besides present tense. Oh, well. If you realize that, it’s no big deal. It’s supposed to be gentle hints after all (which I can hide behind any time the task of doing a complete job becomes too taxing!).

How We Can Put This on the Web
The typical time it takes to run match2.pl is about 55 msec:

# time ./match2.pl -l zatrd
match: zatrd
matched word: adz
matched word: art
matched word: dart
matched word: rad
matched word: rat
matched word: tad
matched word: tar
matched word: trad
matched word: tzar

real    0m0.055s
user    0m0.050s
sys     0m0.000s

That’s pretty fast. So, anyhow, I’m thinking to take the next step and make this available on the web, at least until it becomes popular, in the form of an Ajax/Perl program. Now I’ve never done an Ajax program, but I’ve been looking for an excuse to do one and I think this fits the bill. I’m envisioning a web page where you punch in your letters and the possible word matches appear on the same page. Another way to go is to write the whole thing as a Javascript program. The dictionary isn’t that large, after all. I’ve also never done anything this ambitious in Javascript, so that might take some doing. I think we’ll tackle Ajax first.

Now gcide.index has 149,682 words, but it’s a mess and includes lots of proper names, so I will not use it.

To be continued…