Sharpen Your Linux Command-Line Skills with Spelling Bee Puzzles

The online game Spelling Bee from the New York Times is extremely popular these days and quite addictive. Given seven letters arranged in a grid of hexagons, form as many words as you can. Each word must be four letters or longer and contain the center letter, and letters may repeat.

For example, in the puzzle shown, you can form words like TACO and CHITCHAT because they contain only letters from the grid including the central C. Some invalid words are CAT (3 letters is too short), RIOT (it’s missing the center letter, C), and TOUCH (the letter O isn’t in the grid).

Let’s sharpen your Linux command-line skills by generating and solving these spelling puzzles using a few clever commands. We’ll also analyze the results and even print hints for other puzzle solvers.

Getting started

To generate and solve puzzles, you’ll need a Linux dictionary: a text file of words, one word per line. Most Linux distros include the file /usr/share/dict/words which contains over 100,000 words in this format:

$ wc -l /usr/share/dict/words
102305 /usr/share/dict/words

Otherwise, you can download an even larger dictionary (over 450,000 words) from GitHub.

Generating a puzzle and solution

Generating a puzzle doesn’t require any Linux commands: just pick any seven letters and designate one of them as the central letter. Let’s use the preceding puzzle, which features the letters A, H, I, O, R, T, and a central C.

To generate a solution to this puzzle, you’ll need to identify words that contain a C embedded within the letters in the set, as in this regular expression:

/^[achiort]*c[achiort]*$/

The following one-line awk command selects dictionary words that match this expression and are at least four characters long:

$ awk "/^[achiort]*c[achiort]*$/ && length >= 4" /usr/share/dict/words
acacia 
achoo 
actor 
arch 
archaic 
arctic 
arthritic
 ...and so on...

To view this long list of words more conveniently, pipe the output through the column command:

$ awk "/^[achiort]*c[achiort]*$/ && length >= 4" /usr/share/dict/words | column
acacia          cahoot          chart           cocoa           ricotta 
achoo           carat           chat            cohort          roach 
actor           carrot          chic            coot            rococo 
arch            cart            chichi          critic          tacit 
archaic         cataract        chit            croci           taco 
arctic          catarrh         chitchat        crotch          tact 
arthritic       catch           choir           hatch           tactic 
attach          cathartic       circa           hitch           thatch 
attic           chair           citric          hooch           thoracic 
attract         chaotic         coach           hootch          torch 
cacao           char            coat            itch            tract 
cacti           chariot         cocci           rich            tractor

The preceding command works only for our particular puzzle. To generate any puzzle’s solution, turn the command into a shell script with two arguments: the central letter (argument $1), and the remaining six letters (argument $2). So instead of ahiortc you’d write $2$1.

#!/bin/bash
awk "/^[$2$1]*$1[$2$1]*$/ && length >= 4" /usr/share/dict/words

Add some basic error-checking, and you have a complete script that solves puzzles. Call it spellingbee.

#!/bin/bash
DICTIONARY=/usr/share/dict/words      # Variable to support other dictionary files

if [ $# -ne 2 ]; then
  >&2 echo "Usage: $0 center_letter six_other_letters"
  >&2 echo "Example: $0 g abcdef"
  exit 1
fi

awk "/^[$2$1]*$1[$2$1]*$/ && length >= 4" "$DICTIONARY"

Run the script to print all valid words in a solution. Pipe its output to wc -l to count how many words the solution contains (in this case, 60 words).

$ ./spellingbee c ahiort
acacia
achoo
actor
 ...and so on...

$ ./spellingbee c ahiort | wc -l
60

Disclaimer: In case it isn’t obvious, you could also run the spellingbee script to cheat at the New York Times game. Don’t be naughty though. You’re only cheating yourself. 🙂

Playing the game

Now, try to discover words within your seven letters according to the rules. When you’ve found a word, such as baggage, simply use grep -iw to search for it in the output of the spellingbee script. If grep prints the word, it’s in the solution. Otherwise there’s no output.

$ ./spellingbee c ahiort | grep -iw chitchat
chitchat

$ ./spellingbee c ahiort | grep -iw cat
(no output)

To keep track of your guesses, append the results to a file, guesses, using redirection. Incorrect guesses will not append anything to the file.

$ ./spellingbee c ahiort | grep -iw chitchat >> guesses 
$ ./spellingbee c ahiort | grep -iw cat >> guesses
$ ./spellingbee c taco | grep -iw chitchat >> guesses
$ cat guesses
chitchat
taco

Scoring

The object of Spelling Bee is to score points. Each four-letter word is worth 1 point, and a longer word with N letters is worth N points. (There are more scoring rules, but let’s stop there for now.) Write an awk program that adds up the points:

# total-points.awk
BEGIN {total=0}                      # Initialize total to 0                     
length($0) == 4 {total+=1}           # 4 letter words are worth 1 point          
length($0) > 4  {total+=length($0)}  # Words of length N>4 are worth N points    
END {print total}                    # When finished, print the total

To count the maximum number of points in the game, pipe the full solution to your awk program. Our puzzle is worth 297 points in total:

$ ./spellingbee c ahiort | awk -f total-points.awk
297

To calculate your own score at any time, apply the same technique, but read the guesses file instead of the output of spellingbee.

$ cat guesses
chitchat
taco

$ awk -f total-points.awk guesses
9

Pangrams

Spelling Bee has another kind of word that is worth extra points. It’s called a pangram, and it contains all seven letters at least once, in any order. You can detect pangrams in a puzzle solution by matching seven patterns in sequence, one for each letter. An awk expression for our specific puzzle would be:

/c/ && /a/ && /h/ && /i/ && /o/ && /r/ && /t/

To generalize this method, use sed to generate the above expression as a string and pass it on the command line to awk. Start with the last six letters (ahiort, but not c) and tell sed to print them surrounded by slashes (//) and preceded by a double ampersand (&&). (To avoid escaping the slashes, I use @ as sed’s delimiter character.)

$ echo ahiort | sed 's@\(.\)@\&\& /\1/ @g'
&& /a/ && /h/ && /i/ && /o/ && /r/ && /t/

We can now create a script named pangrams:

#!/bin/bash
expression=$(echo $2 | sed 's@\(.\)@\&\& /\1/ @g')
./spellingbee $1 $2 | awk "/$1/ $expression {print;}"

Run the script, and you’ll find that “ahiort” with a required “c” has two pangrams:

$ ./pangrams c ahiort
chariot 
thoracic

Each pangram is worth an additional seven points. So use wc -l to count pangrams, then multiply the count by 7 using expr to calculate their points:

$ expr 7 '*' $(./pangrams c ahiort | wc -l)
14

Add these points to the previous total to get the true maximum possible score.

$ expr $(./spellingbee c ahiort | awk -f total-points.awk) \
     + $(expr 7 '*' $(./pangrams c ahiort | wc -l))
311

Challenge: Try to calculate your score from the guesses file including pangram points.

Generating hints

Each New York Times Spelling Bee puzzle comes with two kinds of hints. The first type is a grid that shows how many words have a certain first letter and length. It looks roughly like this:

    4  5  6  7  8  9 
   ------------------ 
A:  1  3  3  2  -  1   
C:  7  12 7  3  2  1   
H:  -  3  1  -  -  -   
I:  1  -  -  -  -  -   
R:  1  1  1  1  -  -   
T:  2  3  2  1  1  -

So in this example, if you look at row H, column 5, you’ll see the value 3. It means the solution contains three words that begin with H and have length 5. List them with this command:

$ ./spellingbee c ahiort | awk '/^h/ && length($0) == 5'
hatch 
hitch 
hooch

The following awk program generates the entire grid of hints.

# hints-grid.awk
BEGIN {
    # Track the minimum and maximum word length found so far.
    # Initialize them with values that are out of any practical range.
    min=9999;
    max=0;
}
{
    # Grab and capitalize the first letter of the current word.
    first_letter = toupper(substr($1,1,1));
    # Calculate the word's length
    len = length($1);

    # Keep track of the current min and max word length.
    min = (len < min) ? len : min;
    max = (len > max) ? len : max;

    # Maintain two arrays. "all_first_letters" tracks whether or not a
    # given letter has appeared at the beginning of any word (1 = yes).
    # "grid" is a two-dimensional array that records how many words with
    # a given first letter and given length have been seen so far.
    all_first_letters[first_letter] = 1;
    grid[first_letter, len]++;
}
END {
    # Print the table heading: all values between min and max.
    printf("  ");
    for (i = min; i <= max; i++) {
        printf ("%3d", i);
    }
    print ""
    # Print a nice long border. Probably too long.
    print "    ------------------"
    
    # For each letter...
    for (letter in all_first_letters) {
        row = "";
        # For each word length...
        for (i = min; i <= max; i++) {
            # Grab the number of words with this first letter & length
            entry = grid[letter, i];
            # Store either the number or a dash for zero
            if (entry) {
                row = row sprintf("%-3d", entry);
            } else {
                row = row sprintf("%-3s", "-")
            }
        }
        # Print all rows in alphabetical order
        printf "%c:  %s\n", letter, row | "sort";
    }
}

Pipe the output of spellingbee to this awk program to produce the hint grid for a given puzzle:

$ ./spellingbee c ahiort | awk -f hints-grid.awk
    4  5  6  7  8  9 
   ------------------------ 
A:  1  3  3  2  -  1   
C:  7  12 7  3  2  1   
H:  -  3  1  -  -  -   
I:  1  -  -  -  -  -   
R:  1  1  1  1  -  -   
T:  2  3  2  1  1  -

The second type of hint shows how many words begin with two given letters. The hints are presented in a shorthand: two letters, a dash, and a number. For example, CA-10 means that 10 words begin with CA. Here’s a full set of hints for our puzzle:

AC-3 AR-4 AT-3  
CA-10 CH-11 CI-2 CO-6 CR-3  
HA-1 HI-1 HO-2  
IT-1  
RI-2 RO-2  
TA-4 TH-2 TO-1 TR-2

You can isolate the first two letters of each solution word using the cut -c1-2 command, and then count the number of occurrences using sort and uniq -c:

$ ./spellingbee c ahiort | cut -c1-2 | sort | uniq -c
      3 ac
      4 ar
      3 at
     10 ca
     11 ch
      2 ci
      6 co
      3 cr
      1 ha
      1 hi
      2 ho
      1 it
      2 ri
      2 ro
      4 ta
      2 th
      1 to
      2 tr

Now, make the output look like the real hints in the New York Times game. Add this awk program that manipulates the previous results:

# hints-two-letters.awk
{
    # Grab the first letter, capitalized
    first_letter = toupper(substr($2,1,1));
    # Produce the hint
    formatted_hint = toupper($2) "-" $1
    # Maintain an array of strings. Each element is a rows of hints.
    # Append the current hint to the right row, indexed by first letter.
    hints[first_letter] = hints[first_letter] formatted_hint " "
}
END {
    # Print all hints
    for (hint in hints) {
        print hints[hint]
    }
}

Then pipe the output of the previous pipeline to this awk program to produce properly formatted two-letter hints for a given puzzle:

$ ./spellingbee c ahiort | cut -c1-2 | sort | uniq -c | awk -f hints-two-letters.awk
AC-3 AR-4 AT-3  
CA-10 CH-11 CI-2 CO-6 CR-3  
HA-1 HI-1 HO-2  
IT-1  
RI-2 RO-2  
TA-4 TH-2 TO-1 TR-2

I hope you’ve had fun geeking out with me on spelling puzzles. Check out my book Efficient Linux at the Command Line to continue building your Linux command-line skills.