Finding words using Kap
It appears a lot of my inspiration for writing blog posts comes from me listening to podcasts. This time I was listening to A Problem Squared with Bec Hill and Matt Parker. The episode in question was Episode 103 – Heaps Splad and Paul’s Pad.
I strongly recommend listening to the episode (in fact, all of the episodes) because it's really good. But, in case you don't have time to listen to it right now, please allow me to summarise one of the problems being solved:
Are there any words which, when encrypted using a Caesar cipher, yield another valid word?
Matt Parker explained that he iterated over all the words using some (in his words, “terrible”) Python code. Whether the code actually was terrible or not will remain unanswered (it's probably not bad: if you know enough to recognise that your code is bad, you're probably better than average). But regardless of the quality of the code, I think we can write something much more concise. We can make this assumption simply by virtue of the fact that Python was used.
We'll obviously use Kap for this, because why not?
Getting the word list and cleaning it up
I don't know which word list Matt Parker uses, but I found this one on GitHub, which contains 466,550 words. That's in the same ballpark as the one mentioned in the podcast.
Let's load it into a variable, and print the size:
words ← io:read "words.txt"
≢words
466550
The list contains a lot of words with punctuation, numbers, etc. There are also uppercase letters that we want to convert to lowercase. This is how we fix it:
words ← ∪ ("^[a-z]+$" regex:match)¨⍛/ unicode:toLower¨ words
Each word is first converted to lowercase, then only the words consisting solely of the letters a to z are kept, and finally the entire list is passed through the unique (∪) function to remove any duplicates caused by upper and lower case versions of the same word existing in the list.
After doing this cleanup, we're left with 416,292 words.
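For readers who don't read Kap, here's a rough Python sketch of the same cleanup step (the sample words are made up for illustration):

```python
import re

def clean(words):
    """Lowercase every word, keep only those made purely of a-z,
    and drop duplicates while preserving order."""
    lowered = (w.lower() for w in words)
    only_alpha = (w for w in lowered if re.fullmatch(r"[a-z]+", w))
    # dict.fromkeys removes duplicates but keeps first-seen order
    return list(dict.fromkeys(only_alpha))

print(clean(["Hello", "hello", "world", "it's", "x2"]))
# ['hello', 'world']
```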
Computing the ciphered versions of a word
Our goal is to create a function which accepts a string, and returns the 25 different rotated versions of that word (26 letters in the English alphabet, but we are not interested in the 0-rotated version).
To do this, we will compute the index of each letter (by subtracting @a), add the offset, take the result mod 26, and convert back to a letter (by adding @a). Let's write a function called scramble that does this:
scramble ⇐ { @a + 26| ⍺ + ⍵-@a }
Let's try it:
1 2 3 4 scramble¨ ⊂"abcd"
┌→──────────────────────────┐
│"bcde" "cdef" "defg" "efgh"│
└───────────────────────────┘
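If you prefer Python, an equivalent scramble function might look something like this:

```python
def scramble(offset, word):
    """Rotate each letter of a lowercase word by `offset` positions, wrapping mod 26."""
    a = ord('a')
    return ''.join(chr(a + (ord(c) - a + offset) % 26) for c in word)

print([scramble(n, "abcd") for n in (1, 2, 3, 4)])
# ['bcde', 'cdef', 'defg', 'efgh']
```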
Finding all the words
Since we're going to do a lot of word lookups, we'll start by putting all the words in a hashmap. Let's call it wordsMap:
wordsMap ← map (⍪words),1
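In Python terms, this corresponds to building a dict (or a set), so that membership tests are constant time on average. The three-word list here is just a stand-in for the real one:

```python
words = ["abjurer", "nowhere", "cat"]  # toy stand-in for the full word list
words_map = {w: 1 for w in words}      # the value doesn't matter, only fast lookup
print("nowhere" in words_map)  # True
print("xyzzy" in words_map)    # False
```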
Next, we'll compute a 2-dimensional array where each row represents a word, and each column the n-scramble of that word. We can do that with an outer product with the words on the left and the numbers 0 to 25 on the right (represented using ⍳26):
allScrambledWords ← words scramble⍨⌻ ⍳26
Since this performs the computation on every word, it will take a while (about 10 seconds on my workstation, which is reasonably fast), so don't worry if you don't get a result right away.
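The outer product has no direct one-liner in plain Python, but a nested list comprehension does the same job. A sketch, again with a toy word list:

```python
def scramble(offset, word):
    a = ord('a')
    return ''.join(chr(a + (ord(c) - a + offset) % 26) for c in word)

words = ["abjurer", "cat"]  # toy stand-in for the full 416,292-word list
# one row per word, one column per rotation 0..25 (column 0 is the word itself)
all_scrambled = [[scramble(n, w) for n in range(26)] for w in words]
print(all_scrambled[0][13])  # 'nowhere' -- "abjurer" rotated by 13
```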
Now we want to find which of these scrambled words are real words, which means we need to look them up in the hashmap we created earlier.
When using bracket indexing to look up a value in a hashmap, the return value is null if the key does not exist in the hashmap. So, we'll simply do a not-match (≢) comparison against null to get a boolean array that indicates the locations where we found a word. We'll then assign this boolean array to the variable result:
result ← null ≢¨ wordsMap[allScrambledWords]
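The Python analogue of this step is a plain membership test, producing a matrix of booleans. The set here is a toy stand-in for the full word list:

```python
WORD_SET = {"abjurer", "nowhere", "cat"}  # toy stand-in for the full word list
scrambled_row = ["abjurer", "bckvsfs", "nowhere"]
# membership in the set plays the role of the hashmap lookup / null comparison
flags = [w in WORD_SET for w in scrambled_row]
print(flags)  # [True, False, True]
```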
Analysing the results
Now comes the fun part: we have all the data we need, so we can start answering questions about it.
How many words scramble into at least one word?
To solve this, we drop the left-most column (since that column represents the 0-scramble, or the word itself) and perform an OR-reduce over the rows and sum the result:
+/ ∨/ 0 1↓ result
10850
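In Python, the drop-the-first-column, OR-reduce and sum steps could be sketched like this (toy data again, so the count is 2 rather than 10,850):

```python
def scramble(offset, word):
    a = ord('a')
    return ''.join(chr(a + (ord(c) - a + offset) % 26) for c in word)

word_set = {"abjurer", "nowhere", "cat"}  # toy stand-in for the full word list
result = [[scramble(n, w) in word_set for n in range(26)]
          for w in sorted(word_set)]
# row[1:] drops the 0-rotation (the word itself); any() is the OR-reduce
count = sum(any(row[1:]) for row in result)
print(count)  # 2 -- "abjurer" and "nowhere" rotate into each other
```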
Create a table of the results
For each row in result and allScrambledWords, we'll pick the values where the row in result is 1, and then filter out any rows which only have a single entry (i.e. the word doesn't scramble into any other valid word):
groups ← (1<≢¨)⍛/ result ((⊂(/))⍤1) allScrambledWords
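The same pick-and-filter in Python, with tiny made-up inputs:

```python
# toy data: each row holds rotations of a word; flags mark which are real words
all_scrambled = [["abjurer", "bckvsfs", "nowhere"],
                 ["cat", "dbu", "ecv"]]
result = [[True, False, True],
          [True, False, False]]
# keep the words where the flag is set, then drop rows with only a single entry
groups = [[w for w, ok in zip(row, flags) if ok]
          for row, flags in zip(all_scrambled, result)]
groups = [g for g in groups if len(g) > 1]
print(groups)  # [['abjurer', 'nowhere']]
```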
We can confirm that the resulting list is 10,850 elements long:
≢groups
10850
We'll also note that if a word scrambles into N different other words, then the same set of words will be repeated N times in this list. We'll eliminate these duplicates by sorting each sublist and calling the ∪ function on the result, then assigning that result back to the groups variable:
groups ← ∪ ∧¨ groups
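The sort-then-unique trick translates directly to Python: sorting makes duplicate groups identical, and dict.fromkeys then drops the repeats (“hal”/“ibm” really is a 1-rotation pair, by the way):

```python
groups = [["abjurer", "nowhere"], ["nowhere", "abjurer"], ["hal", "ibm"]]
# sorting each group makes the duplicates identical; dict.fromkeys removes them
unique = list(dict.fromkeys(tuple(sorted(g)) for g in groups))
print(unique)  # [('abjurer', 'nowhere'), ('hal', 'ibm')]
```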
What is the longest word in the result?
To find this, we'll just sort by the length of the first word:
groups[⍒(≢↑)¨groups]
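In Python this would be a sorted with a key function, shown here over a few hand-picked groups:

```python
unique = [("ab", "bc"), ("abjurer", "nowhere"), ("hal", "ibm")]  # toy data
# sort descending by the length of the first word in each group
longest_first = sorted(unique, key=lambda g: len(g[0]), reverse=True)
print(longest_first[0])  # ('abjurer', 'nowhere')
```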
The first result is “abjurer”/“nowhere”, which is 7 letters. It's a bit disappointing that there isn't anything longer, but I was prepared for this after listening to the podcast.
Another interesting result is “unfiber”/“bumpily”, but to be honest, most of the 7-letter results are boring.
Which is the word with the most matches?
To find this, we just sort by the length of each element:
groups[⍒≢¨groups]
This gives us a very uninteresting list: the top results are all one- and two-letter words, since for some reason the word list contains every single letter and almost every two-letter combination. We'll have to filter them out:
((2<≢↑)¨)⍛/ groups[⍒≢¨groups]
The three-letter results are not interesting either. Apparently ajc is a word, which scrambles into gpi, mvo, yha, etc.
Four letters is more interesting. We'll display the first 8 rows:
8 ↑ ⍪ ((3<≢↑)¨)⍛/ groups[⍒≢¨groups]
┌→─────────────────────────────────────────────────┐
↓┌→───────────────────────────────────────────────┐│
││"anan" "bobo" "erer" "lyly" "nana" "pcpc" "vivi"││
│└────────────────────────────────────────────────┘│
│ ┌→─────────────────────────────────┐│
│ │"adad" "bebe" "fifi" "lolo" "ruru"││
│ └──────────────────────────────────┘│
│ ┌→─────────────────────────────────┐│
│ │"afcc" "chee" "diff" "joll" "purr"││
│ └──────────────────────────────────┘│
│ ┌→─────────────────────────────────┐│
│ │"cece" "gigi" "momo" "susu" "yaya"││
│ └──────────────────────────────────┘│
│ ┌→──────────────────────────┐│
│ │"abib" "dele" "novo" "stat"││
│ └───────────────────────────┘│
│ ┌→──────────────────────────┐│
│ │"actg" "cevi" "priv" "yare"││
│ └───────────────────────────┘│
│ ┌→──────────────────────────┐│
│ │"adds" "beet" "illa" "lood"││
│ └───────────────────────────┘│
│ ┌→──────────────────────────┐│
│ │"admi" "benj" "firn" "svea"││
│ └───────────────────────────┘│
└──────────────────────────────────────────────────┘
What are the takeaways from all of this?
Some readers may walk away from this wondering why one should bother with this complicated syntax that uses symbols instead of words.
But the key point here is how we could solve this problem without really writing any code. In a way that is not too dissimilar to how a good Unix user can solve almost anything on the command line without writing a program, the array languages (and not just Kap) are all very good at letting you get to the result without actually doing any “programming”.
Whether or not it's worth the effort to learn the syntax in order to be able to leverage the power of these tools is up to you. I personally believe it's worth it.