Intro
My friend and I are in a perpetual game of Words With Friends on our smartphones these days. It suits me to a T because I like to take a looong time to come up with just the right move (although as time has passed you’ll see I’ve become less enamored with the app. You’ll see this if you have the patience to read through the whole article which was written progressively as I continued to play more games0. And you can’t knock over the board and lose all your played tiles! Yet you can still get advice from other friends by showing the board on your smartphone.
I have a good vocabulary, my friend a little less so. But he takes advantage of a peculiar quirk of playing the game in this way – he makes up words and sees if they’re accepted by the program. And…sometimes they are! So Words with Friends uses a ridiculous dictionary or dictionaries. That’s how he came up with hila. Later I turned the tables on him. I played a “word” that I didn’t believe to be a word, but rather one that I believed Words with Friends might believe to be a word: carbo. Sure enough. Accepted. 59 points. The “o” allowed me to stretch to reach the triple word tile and intersect a “T” to make “to” across. Check the American heritage dictionary. Not there, except as a prefix. My next best idea would only have been about 21 points.
I tried one of those cheating programs, www.lexicalwordfinder.com. I had the letters e,e, u, m, n, a and I think a. What does it suggest? neume. Ha? Who in the world ever heard of that? So it’s obviously plugged into those ridiculous dictionaries.You might as well let computers play other computers if you’re going to use cheats like that. My idea is to make gentle suggestions that you the educated person would have thought of on your own if you had enough time. Or maybe like me it’s on the tip of your tongue but you can’t quite find it in your brain. So I am writing a simple program which draws on a common dictionary – words every well-educated person ought to know. That’s also easier to program! Since I can’t compete with the big boys, my contribution will be to show the steps how I am writing such a program.
I’m running Ubuntu server, which is a Debian Linux variant. Initially I wasn’t sure what all I would need so I did:
# sudo apt-get install dictd dict dict-wn dict-gcide
in the hopes of getting a words file! dict-wn is the WordNet dictionary; gcide is Gnu collaborative international dictionary of english
I found the goods in /usr/share/dictd. wn.index has 87,924 unique words:
# awk '{print $1}' wn.index|uniq|wc
March, 2013 update – CentOS
I’ve since switched my hosting platform to CentOS. There there is the dictionary /usr/share/dict/linux.words. If you don’t have it install the package words-3.0-17.el6. It has 480,000 “words!” I’m not sure why the huge discrepancy, but I know that’s a lot more words than are traditionally mentioned as the number of words in the English language.
. After removing numbers and punctuation marks it’s at 80,724:
# awk '{print $1}' wn.index|uniq|egrep -v [0-9\'.-]|wc
Here are the first few words now:
# awk '{print $1}' wn.index|uniq|egrep -v [0-9\'.-]|head -25 a aa aaa aachen aah aaland aalborg aalii aalst aalto aar aardvark aardwolf aare aarhus aaron aarp aas aave ab aba abaca abacinate aback abactinal
Many of these look suspicious. I want to at least throw out the proper names. I don’t care if Words with Friends accepts them or not. I don’t believe they should be used. For instance:
# dict aar 1 definition found From WordNet (r) 3.0 (2006) [wn]: Aar n 1: a river in north central Switzerland that runs northeast into the Rhine [syn: {Aare}, {Aar}, {Aare River}]
So how are we going to get rid of the words with capitals? Unfortunately in the index they are all in lower case. I only see the possibility to do a dictionary lookup on each and every one. Here’s what I came up with for that. I’m sure some wiser guy could write this as a 1-liner, but hey, it is what it is:
#!/usr/bin/perl # DrJohn, 11/2011 # check if words are upper or lower case # we can only learn when doing a dict lookup # input are proposed words $DEBUG = 0; while() { chomp; $word = $_; $cnt = 0; open(DEF,"dict -d wn $word|"); while( ) { $cnt++; if ($cnt == 5) { # this is the line that repeats the word ($wordagain) = $_ =~ /(\w+)/; print $wordagain if $DEBUG; print "$word\n" if $wordagain eq $word; last; } # end line five condition stuff } # end loop over definition } # end STDIN
Run it on and the first few results are now like this:
aa aah No definitions found for "aaland" aalii aardvark aardwolf aba abaca abacinate aback abactinal
We got rid of Aar and many others, but we also got rid of one of the most common words – “a.” Not that it matters for this game, but this points to a possibility that the choice of case in the definition is somewhat arbitrary and a word with several meanings, any of which requires upper case, say like English, is going to be written upper case. Indeed that is the case for English, which of course is a nice word and perfectly acceptable when used with the meaning (sports) the spin given to a ball by striking it on one side. Sigh, nothing’s ever easy. So lets’ modify our program to make a separate list of rejected words so we can review it by hand and add back in words which can be used in lower case. Whenever yuo do something by eye yuo want to reduce the task as much as possible. Here we can take advantage of another fact: a capitalized word with a single definition is not of interest to us because that single definition is the one for which capitalization is required, like Aar. So we only need to consider the cases of capitalized words with mutliple definitions, one of which may be typically used with the word spelled in lower case, like English.
We showed the results syntax for a word with single definition above, for Aar. Here’s an example of a capitalized word with multiple definitions:
dict -d wn english 1 definition found From WordNet (r) 3.0 (2006) [wn]: English adj 1: of or relating to or characteristic of England or its culture or people; "English history"; "the English landed aristocracy"; "English literature" 2: of or relating to the English language n 1: an Indo-European language belonging to the West Germanic branch; the official language of Britain and the United States and most of the commonwealth countries [syn: {English}, {English language}] 2: the people of England [syn: {English}, {English people}] 3: the discipline that studies the English language and literature 4: (sports) the spin given to a ball by striking it on one side or releasing it with a sharp twist [syn: {English}, {side}]
There’s different characteristics we could use as markers. I propose to look for a digit immediately followed by a colon as a definition marker. More than one occurrence probably means the word has multiple definitions and we should consider it. So here’s a re-worked version of our program to accomplish that:
#!/usr/bin/perl # DrJohn, 11/2011 # check if words are upper or lower case # we can only learn when doing a dict lookup # input are proposed words $DEBUG = 0; open(CAND,">/tmp/candidates") || die "cannot open /tmp/candidates!!\n"; while() { chomp; $word = $_; $cnt = 0; $cand = 0; $cntdef = 0; open(DEF,"dict -d wn $word|"); while( ) { $cnt++; if ($cnt == 5) { # this is the line that repeats the word ($wordagain) = $_ =~ /(\w+)/; print $wordagain if $DEBUG; if ($wordagain eq $word) { print "$word\n"; last; } else { # maybe there are multiple definitions $cand = 1; } } elsif ($cand) { $cntdef++ if /\d:/; } # end line five condition stuff } # end loop over definition # print candidate rejected word if there were multiple definitions print CAND "$word\n" if $cntdef > 1; } # end STDIN
Note the regex \d: that we use to determine a definition.
Running this on the first few words, we have a, aaron and ab to consider. Ab is interesting because it’s the name of a degree (in fact I have that degree), as well as shorthand for muscles of the abdomen. So, a lower case usage exists!
Now the program is running kind of slow. So since we don’t want to run it multiple times, perhaps it’s time to turn our attention to this other the problem: all the lines like No definitions found for “aaland” that we are seeing in the meantime:
# grep ^aaland wn.index aaland islands OgB Cz
So these are due to compound words which we don’t want anyways because they won’t be accepted.
So running the modified program produces an output with 63185 words, and another 2170 words to be considered for review. The first few are as follows:
aa aah aalii aardvark aardwolf aba abaca abacinate aback abactinal abacus abaft abalone abamp abampere abandon
Let’s check one:
# dict aalii 1 definition found From WordNet (r) 3.0 (2006) [wn]: aalii n 1: a small Hawaiian tree with hard dark wood
Now check American Heritage dictionary, which I consider the Bible. Not there. I believe it would be accepted by Words with Friends because it plays using the Lexical Word Finder cheating program. Just enter your tiles as A A L I I A A to see for yourself.
And here are the first few rejected words which are to be reviewed:
a aaron ab abdias abel aberdeen abilene abkhas abkhasian abkhaz abkhazian abnaki aboriginal ac achaean achomawi actinia actium ad adalia adam adams add
Add? How did it get there? Well, you know, ADD, the medical condition? Yes, it’s exasperating. But that’s what you get with free. 2000 is not too many to review, however.
I’m having some doubts about the whole project now. I had the letters D E E G R I U. An open D was on the board where there was room for a couple tiles above it and three tiles below. Using the three tiles below would make the play triple word. I initially came up with drug. Then edger, which works out to the same number of points because in WWF the U is two points for some reason. But I wanted to use another tile so I thought and thought. Edgier? Nope, doesn’t fit. Ridge? Fits, but too short. Then the Aha moment: ridged. And I felt a moment of pleasure realizing that most people wuold not have come up with it. But a word suggestion program? It wuold have spit it out first thing, taking away from that aspect of the game.
But then there is the other side. Successful play depends on rote memorization of all two-letter words. And what passes for a WWF word is very questionable, as I’ve said above. Like how about jo? Yup. No, not in standard dictionaries, though.
I’m still thinking about what algorithm to use for the actual program. I can see that potentially it will be expensive, computationally speaking, given all the variants that must be tested. So I thought of making the task even easier: throw out words with more than eight letters. To see how many long words we’re going to toss:
# egrep '\w{9}' betterdict|wc
where betterdict is the results of all my pruning described above. It’s 32386 words we’ll toss. That ought to help alot. And 464 candidate words less we’ll have to consider. We’ll do something like # egrep -v ‘\w{9}’ betterdict > newbetterdict to build our even more slimmed-down dictionary.
Our First Match Program
Here’s a first stab at a matching program. It actually is pretty good (meaning there aren’t an overwhelming number of results) if you have the typical combination of unruly letters. Note we use the awesome of power of regular expressions to do all the heavy lifting.
#!/usr/bin/perl $m = $ARGV[0]; $DEBUG = 0; print "match: $m\n"; $dict = "/usr/share/dictd/newbetterdict"; open(DICT,"$dict") || die "cannot open dict $dict!!\n"; while(<DICT>) { chomp; $word = $_; print "word: $word\n" if $DEBUG; if ($word =~ /^[$m]{2,}$/) { # we have the beginning of a match $cnt++; print "match: word: $word\n"; } } print "matched: $cnt\n"; |
You run it from the command line with the letters you want to try as argument:
# match.pl fxitau match: fxitau match: word: aa match: word: affix match: word: aft match: word: ataxia match: word: ax match: word: fa match: word: fat match: word: faux match: word: fax match: word: fiat match: word: fit match: word: fix match: word: ft match: word: ii match: word: iii match: word: ix match: word: tat match: word: tatu match: word: tau match: word: taut match: word: tax match: word: taxi match: word: tiff match: word: tit match: word: titi match: word: tufa match: word: tuff match: word: tuft match: word: tut match: word: tux match: word: xi match: word: xii match: word: xiii match: word: xix match: word: xx match: word: xxi ...
What’s wrong of course is that while we are requiring matched words to be formed from our letters and only those letters, we have not taken care to avoid duplicate use of the same letter. That regular expression does a lot, but it’s not quite doing everything at this point. Now we start having to get creative to take it to the next level. I’m not sure myself how I’m going to do it.
More on that game. Now we’re getting into the groove. And by that I mean you make up words. So our final score was 381 to 347. Mind you, I’m not claiming any kind of expertise in the game. This is just a sad reflection on the liberalness of what WWF calls a “word.” Here are our made-up words from that one game: carbo, qi, ne, deni, oi, jo, wo and da. Of these, wo is the only one in the American heritage dictionary as a variant spelling of woe. Sad, right? It completely changes the strategy of the game.
Revised Program: Almost There
I thought for awhile I could use the transliteration operator tr, but alas, it does not do interpolation, so I had to go with something a bit more clumsy. But the whole program, which is basically functional, now returns only those words which match the letters provided, which is already a great help. Here is the warts-and-all version:
#!/usr/bin/perl $m = $ARGV[0]; $DEBUG = 0; print "match: $m\n"; # split up match for($i=0;$i<length($m);$i++) { $ltr = substr($m,$i,1); print "ltr: $ltr\n" if $DEBUG; # count frequency of this letter $ltrhash{$ltr}++; } $dict = "/usr/share/dictd/newbetterdict"; open(DICT,"$dict") || die "cannot open dict $dict!!\n"; while(<DICT>) { chomp; $word = $_; $bad = 0; %ltrwordhash = (); print "word,m: $word,$m\n" if $DEBUG; if ($word =~ /^[$m]{2,}$/) { print "Begin word analysis\n" if $DEBUG; # we have the beginning of a match for($i=0;$i<length($word);$i++) { $ltr = substr($word,$i,1); $ltrwordhash{$ltr}++; print "ltr,cnt: $ltr,$ltrwordhash{$ltr}\n" if $DEBUG; # throw out words with too many letter occurences if ($ltrwordhash{$ltr} > $ltrhash{$ltr}) { print "word tossed due to excess letters. max: $ltrhash{$ltr}\n" if $DEBUG; $bad = 1; last; } } next if $bad; # what remains are the good words! print "matched word: $word\n"; } } |
The DEBUG statements helped me find coding errors. I set DEBUG = 1 and run the program. I had forgotten the
%ltrwordhash = (); statement initially to clear out that hash for each new word. That was not good, but a review of the debug output quickly showed what was going on. Now we run it again with my current letters plus a free one (“l”) I want to use from the board:
# match.pl liaeinpu match: liaeinpu matched word: ail matched word: ain matched word: ale matched word: alien matched word: aline matched word: alp matched word: alpine matched word: ane matched word: ani matched word: anil matched word: anile matched word: ape matched word: elan matched word: en matched word: ie matched word: ii matched word: il matched word: in matched word: inula matched word: lane matched word: lap matched word: lapin matched word: lea matched word: lean matched word: leap matched word: lei matched word: leu matched word: li matched word: lie matched word: lien matched word: lieu matched word: lii matched word: line matched word: lineup matched word: lip matched word: lupin matched word: lupine matched word: nail matched word: nap matched word: nape matched word: napu matched word: neap matched word: nil matched word: nip matched word: nu matched word: pa matched word: pail matched word: pain matched word: pal matched word: pale matched word: pan matched word: pane matched word: panel matched word: pe matched word: pea matched word: peal matched word: pean matched word: pel matched word: pen matched word: penal matched word: penial matched word: pi matched word: pia matched word: pie matched word: pilau matched word: pile matched word: pin matched word: pine matched word: pineal matched word: plain matched word: plan matched word: plane matched word: plea matched word: pul matched word: pula matched word: pule matched word: pun matched word: ulna matched word: unai matched word: up
Cool, huh? I wanted to get a long word starting with “l” to pick up a double-word. I was coming up short. I maybe eventually would have thought of it, but before I ran the program, I was not coming up with long matches. So lineup and lupine are really helpful suggestions. Even though it’s gentle hints, it still feels like cheating, however!
And only 38 lines of code, including comments and DEBUG statements.
Next we’ll add some features. Let’s allow a starting/middle/ending letter to be specified. To support optional command-line arguments it’s nice to have more sophisticated argument parsing.
We started a new game. It’s just getting ridiculous as my friend gravitates towards a style of play which is a lot less about word knowledge than about trying all possible letter combinations to maximize the total, regardless of whehter or not it seems like a word. And in order to remain competitive I have to play that way as well, to a degree. So far our made-up “words” in this game include: qi, noh (used twice), oho, obe, fe, mm, noo, jo, oxo and deva. That deva really killed me as it was used to make a triple word play. Now we’ve played a total of 14 vertical words and 12 horizontal words. So 11/26 of the words we’ve so-far used are fabricated nonsense words!
I think it’s a mixed blessing. I welcome additional two-letter words. They’re readily memorized and only a finite number can exist (262 of course). And they really help with the play. For three-letter made-up words I’m on the edge but inclined to not encourage their use. Definitely not for four-letter words and higher. The universe of such words is just too great.
Some Nice Touches
Here’s the program which permits optional beginning letter, end letter and middle letter. I’ve introduced argument parsing with the Getopt::Std module.
#!/usr/bin/perl use Getopt::Std; getopts('b:m:e:l:'); usage() unless $opt_l; $m = $opt_l; $begltr = $opt_b ? $opt_b: "."; $endltr = $opt_e ? $opt_e: "."; $midltr = $opt_m ? $opt_m: "."; $DEBUG = 0; print "match: $m\n"; # split up match for($i=0;$i<length($m);$i++) { $ltr = substr($m,$i,1); print "ltr: $ltr\n" if $DEBUG; # count frequency of this letter $ltrhash{$ltr}++; } $dict = "/usr/share/dictd/newbetterdict"; open(DICT,"$dict") || die "cannot open dict $dict!!\n"; while(<DICT>) { chomp; $word = $_; $bad = 0; %ltrwordhash = (); print "word,m: $word,$m\n" if $DEBUG; if ($word =~ /^[$m]{2,}$/) { # meet begin/middle/end conditions next unless $word =~ /^$begltr.*$midltr.*$endltr$/; print "Begin word analysis\n" if $DEBUG; # we have the beginning of a match for($i=0;$i<length($word);$i++) { $ltr = substr($word,$i,1); $ltrwordhash{$ltr}++; print "ltr,cnt: $ltr,$ltrwordhash{$ltr}\n" if $DEBUG; # throw out words with too many letter occurences if ($ltrwordhash{$ltr} > $ltrhash{$ltr}) { print "word tossed due to excess letters. max: $ltrhash{$ltr}\n" if $DEBUG; $bad = 1; last; } } next if $bad; # what remains are the good words! print "matched word: $word\n"; } } sub usage { print "Usage: $0 [-b beginning_letter] [-e end_letter] [-m middle_letter] -l letters\n"; exit(1); } |
I called it match2.pl. Here’s an example using it with my awful letters (And is it just me, or does WWF have a definite proclivity to throw horrible letters at you for many turns in a row??):
# match2.pl -b b -l duttdsb match: duttdsb matched word: bud matched word: bus matched word: bust matched word: but matched word: butt
Note that my simplistic dictionary does not seem to have plurals. It also does not seem to have verb tenses besides present tense. Oh, well. If you realize that, it’s no big deal. It’s supposed to be gentle hints after all (which I can hide behind any time the task of doing a complete job becomes too taxing!).
How We Can Put This on the Web
The typical time it takes to run match2.pl is about 55 msec:
# time ./match2.pl -l zatrd match: zatrd matched word: adz matched word: art matched word: dart matched word: rad matched word: rat matched word: tad matched word: tar matched word: trad matched word: tzar real 0m0.055s user 0m0.050s sys 0m0.000s
That’s pretty fast. So, anyhow, I’m thinking to take the next step and make this available on the web, at least until it becomes popular, in the form of an Ajax/Perl program. Now I’ve never done an Ajax program, but I’ve been looking for an excuse to do one and I think this fits the bill. I’m envisioning a web page where you punch in your letters and the possible word matches appear on the same page. Another way to go is to write the whole thing as a Javascript program. The dictionary isn’t that large, after all. I’ve also never done anything this ambitious in Javascript, so that might take some doing. I think we’ll tackle Ajax first.
Now gcide.index has 149,682 words, but it’s a mess and includes lots of proper names, so I will not use it.
To be continued…