<< This is an ancillary document to my analysis of the game Wordle. >> << The full analysis is available at math.utexas.edu/~rusin/wordle/ . >>

I found a set of 6 words that splits the whole Wordle solution list -- that is, every one of the 2315 possible answers gets its own distinct set of response patterns: catty, frond, rumba, spill, verge, whack. What follows is a description of how I found it. I am mostly talking to myself, trying to write something down before I forget it. But I'm trying to be, you know, expository.

The final solution was provided by a Maple procedure I wrote ("meander"). This routine takes a near-perfect splitting set S, considers each of the pairs of words that are not split, and looks for words that (1) split that pair and (2) are neighbors of a word in S. Replacing that word of S by its neighbor, I get a new near-perfect splitting set. (I defined "near-perfect" to mean there were at least 2310 separate equivalence classes.) Starting with a single set S having 2314 classes, I created some new sets, which I then processed in the same way, etc. After about a dozen rounds of this, and after creating about 600 near-perfect sets, I found the perfect splitting set.

So how did I get the near-perfect splitting set to use as a seed? This came from a steady stream of randomly-generated candidates that had promise; many hours of CPU time were spent endlessly generating candidates and retaining the ones that had a reasonable amount of success. Specifically, I generated quintuples of words that by themselves split the word list into at least 2150 clusters; for each of these I then added every possible word as the sixth, and retained sextets that split the word list into at least 2300 clusters. (For this particular sextet S, the six constituent quintuples create between 2104 and 2257 clusters; leaving out "whack" gives the best of them.)

Well, where did those quintuples come from? I tried a few variations, but they were all based on the same idea: I had developed rankings of some key subsets of the 2315 candidate words, so I chose at least a few of the words from high up in those rankings. I also tried to eliminate a lot of chaff (without, I hoped, losing much of the wheat) by ensuring that the good words would work well together -- no two words in the list should have much in common. (In the particular set I did find, the words "whack" and "catty" do each have an "a" and a "c", but in different positions. The other fourteen pairs either shared a single "a" or "r" (in different positions) or had no letters in common at all.) After all, we expect the six words (30 letters) to include most of the alphabet, except (as in this example) perhaps not the rare letters j, q, x, z, so there is not much room for overlap, especially since it is fruitful to have single words with repeated letters (e.g. to be able to distinguish "droll" from "drool"); in this example the letters "t", "e", and "l" are doubled. That leaves 27 letters in the six words to cover the 22 non-rare letters of the alphabet, so there is not much other duplication: there's an extra "c" and two extra each of "a" and "r".

So what are these ranked subsets? I had three, actually. The first and least useful was a ranking of all 2315 words that blended the words' rankings (in an informal way) by several metrics: when each word is used alone to filter the word list, we can consider how many separate clusters it creates, how large the largest one is, and so on.
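To make those cluster counts concrete: two answers land in the same cluster exactly when every guess in hand draws the same pattern of colored tiles against both of them. My code was in Maple, but the computation is simple enough to sketch in Python (the function names here are mine, for illustration only):

    from collections import Counter, defaultdict

    def pattern(guess, answer):
        # Wordle's response to a guess: per position, 2 = green,
        # 1 = yellow, 0 = gray, with correct double-letter handling.
        result = [0] * 5
        leftover = Counter()
        for i in range(5):
            if guess[i] == answer[i]:
                result[i] = 2
            else:
                leftover[answer[i]] += 1
        for i in range(5):
            if result[i] == 0 and leftover[guess[i]] > 0:
                result[i] = 1
                leftover[guess[i]] -= 1
        return tuple(result)

    def clusters(guesses, wordlist):
        # Group the answers by the tuple of patterns they give; a perfect
        # splitting set produces len(wordlist) singleton groups.
        groups = defaultdict(list)
        for answer in wordlist:
            key = tuple(pattern(g, answer) for g in guesses)
            groups[key].append(answer)
        return groups

With a single guess, len(clusters([word], wordlist)) is the cluster count behind the blended ranking; with all six words of the sextet, it should come out to exactly 2315 on the solution list.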
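This same primitive is all that the "meander" routine described at the top needs. Roughly (again only a sketch, reusing pattern and clusters from above; the neighbors function -- which words count as "neighbors" of a given word -- is whatever notion of similarity one prefers):

    from itertools import combinations

    def unsplit_pairs(S, wordlist):
        # The pairs of answers that the current set S fails to distinguish.
        return [pair
                for group in clusters(S, wordlist).values() if len(group) > 1
                for pair in combinations(group, 2)]

    def meander(S, wordlist, neighbors, floor=2310):
        # For every unsplit pair, try swapping one word of S for a
        # neighbor that splits the pair; keep any swap that stays
        # near-perfect (at least `floor` clusters).
        found = []
        for a, b in unsplit_pairs(S, wordlist):
            for s in S:
                for n in neighbors(s):
                    if pattern(n, a) != pattern(n, b):   # n splits the pair
                        T = [n if w == s else w for w in S]
                        if len(clusters(T, wordlist)) >= floor:
                            found.append(T)
        return found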
By that blended ranking, familiar words like parse, trace, stale, later, and crate are at the top of the list; at the bottom we find mummy, queue, jazzy, fuzzy, sissy. It seems reasonably clear that we would want to avoid the words low on the list. Perhaps surprisingly, though, the words high on the list are not especially likely to be in a good splitting set. In part this is because the desire to have small overlaps between words would then prevent us from using other reasonably good words. More generally this seems to be a pattern in difficult optimization challenges: we cannot expect the best overall solution set to have exceptionally good subsets. (FWIW: the six words in my splitting set ranked between 972 and 1654 on this list; that is, none of them is a "poor" word, but none of them individually is a "great" word either. They're all "middling" words.)

The second, somewhat more useful, ranking was used to enlarge the set of possibilities in the randomizing, beyond the core set of options I'll describe in a moment: I allowed my program to sometimes pick a word from the top 100 on this second list, and more rarely to pick a word from the top 500. This list was created using some optimization software, as was the third and shortest list, which turned out to be the most important.

To describe these two lists in detail, consider the problem of distinguishing the words "booby" and "boozy". A perfect splitting set must either include a word with a "z", or a word with two "b"s, or a word with a "b" or "z" in the fourth spot. I used "rumba" to do this. There are, as it turns out, only 64 words in the entire 2315-word dictionary that satisfy one of these three conditions. Hence one of these 64 words must be included in any perfect splitting set. This particular pair (booby/boozy) is especially hard to split, but there are many other pairs of words that have a comparatively small set of words that split them (i.e. words that return different patterns of green or yellow tiles in response to the two words in the pair). I kept track of 916 of these "toughpairs" and of the sets of words that would split them. At a minimum, a perfect splitting set must include at least one word from each of these 916 splitting sets. (And that's not all, of course; there are other pairs of words that must be split as well --- these 916 were just the ones selected by an algorithm that I thought would find the most challenging tasks.)

So what must be done to split these 916 toughpairs amounts to what is called a "covering set" problem: we want a subset of the word list, of minimal size, that "covers" each of the 916 splitting sets (in the sense of containing at least one of its elements). This is in general a combinatorially difficult problem; it is hard to determine the absolute minimal size of a covering set. (Obviously we could cover all 916 splitting sets with no more than 916 words by simply choosing one word from each set!) However, there is a technique that can establish a lower bound on the size of a covering set -- for us, a lower bound on the number of words needed in a splitting set for the Wordle wordlist. It amounts to phrasing the covering-set problem as finding values for 2315 variables, one per word, each supposed to equal either 0 or 1 (is the word in the covering set or not?), with one linear constraint per splitting set (the variables for its words must sum to at least 1), and asking to minimize the sum of all the variables. If we relax the condition that each variable equal 0 or 1, and instead insist only that each variable be non-negative, then the problem has a classic and fairly easy solution method called the Simplex Algorithm.
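In code, the relaxed problem is a few lines with any linear-programming package. A minimal sketch in Python/scipy rather than Maple (splitting_sets here holds, for each of the 916 toughpairs, the indices of the words that split it; the names are my own):

    import numpy as np
    from scipy.optimize import linprog

    def covering_lower_bound(splitting_sets, n_words=2315):
        # One variable x_w per word, relaxed to 0 <= x_w <= 1; minimize
        # sum(x) subject to each splitting set summing to at least 1.
        # linprog wants constraints as A_ub @ x <= b_ub, so ">= 1" is negated.
        A = np.zeros((len(splitting_sets), n_words))
        for row, words in enumerate(splitting_sets):
            A[row, list(words)] = -1.0
        res = linprog(np.ones(n_words), A_ub=A,
                      b_ub=-np.ones(len(splitting_sets)), bounds=(0, 1))
        return res.fun   # e.g. 5.4: any true covering set needs >= 6 words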
To summarize: we want to solve a hard problem to determine the smallest possible value of a number N; we can instead solve an easy problem to determine the smallest possible value of a number X; and we know that N is at least X. In the Wordle situation, my 916 splitting sets put 916 constraints on the 2315 variables; common software solves the problem with the relaxed constraints and tells me that X=5.4 (say). Then I know that N is at least 5.4, and it's also a whole number, which means N is at least 6. In Wordle terms, that means there is no quintuple of words that can split all 916 of the toughpairs, which means any quintuple must leave some of these (and possibly other sets of words) still in indistinguishable clusters; you'll just have to guess which word in the cluster is the hidden word of the day!

Now here's the second ranking of words: for any given word, I can ask what happens if that word is included in a proposed splitting set. Each word helps split some of the toughpairs, so if that word is included, I have fewer than 916 tasks left to perform, and still have 2314 other words to use to complete them. The same linear-programming software can be used on each of these smaller problems to determine (a lower bound for) the minimum number of *additional* words needed to solve the subproblem. We expect the number to go down because there are fewer tasks to complete, but if the minimal X for the new problem only drops to X=5.1 (say), then the minimal value of N here is still 6 -- meaning we still need 6 *additional* words to split the rest of the 916 toughpairs, in addition to the one word we started with. So at least as far as the utility for creating a perfect splitting sextet is concerned, any word with an X value larger than 5.0 is useless: these are words that *cannot* be part of a perfect splitting sextet.

More generally, I used the X values as a way to rank each word's likelihood of being in a splitting sextet. If after committing to a word W we still need X=5.4 additional words, the word is terrible. If we still need X=4.99 additional words, the word is still a *candidate* to be in a splitting sextet; but presumably if I looked at more toughpairs beyond the 916 I had focused on, the covering-set problem would get harder, and perhaps that value X=4.99 would tip over to X=5.01. So such a word is "marginal" -- not a good candidate to be in a splitting sextet. On the other hand, a word with an X value of only 4.5 (say) would likely still be a candidate even if I considered the complete set of all 2,678,455 pairs of Wordle words, every one of those pairs having a splitting set that must be "covered". In short, the X values for each word's subproblem gave me a ranking of the Wordle words. Only 1828 of the words had X<5.0, ranging from "jumbo", with X=4.554, to "imbue", with X=4.99898. In retrospect, this ordering of the words was perhaps even more useful than I imagined: five of the words in my sextet were in the top 50 by this ranking, and the last ("whack") comes in at #167. (Even had I known this, the information alone is not as helpful as one might hope: there are over 2 million quintuples to be formed from the top 50 words on the list, and for each one the whole 2315-word list would have to be tested for a good sixth member.)
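Each of these X values is just one more run of the earlier LP sketch, with the chosen word committed and the toughpairs it splits discharged (again illustrative, not my actual code; forbidding re-use of the word itself is a detail I add here via its bounds):

    def X_after_choosing(w, splitting_sets, n_words=2315):
        # Drop the toughpairs that word w already splits, forbid choosing
        # w again, and re-solve the relaxed covering problem; the result
        # bounds the number of *additional* words needed.
        remaining = [s for s in splitting_sets if w not in s]
        A = np.zeros((len(remaining), n_words))
        for row, words in enumerate(remaining):
            A[row, list(words)] = -1.0
        bnds = [(0, 0) if i == w else (0, 1) for i in range(n_words)]
        res = linprog(np.ones(n_words), A_ub=A,
                      b_ub=-np.ones(len(remaining)), bounds=bnds)
        return res.fun   # X > 5.0 rules word w out of any perfect sextet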
The final ranking of the words came from the dual problem to this linear optimization question. I did at some point start to think there might not be a perfect splitting sextet. How would one prove this? One way would be to exhibit a small collection of toughpairs that already by themselves cannot be covered with six words. Or at least: can we find a small collection of toughpairs that cannot be covered with 5 words, thus giving an easy combinatorial proof that a minimal splitting set has 6 words in it?

For example, no single word can simultaneously distinguish vocal from local, magma from gamma, and inner from infer: the intersection of the splitting sets of those three toughpairs is empty. That means that, of these N=3 toughpairs, the maximum number any single word can split is K=2; hence the minimal size of a splitting set is at least N/K = 1.5, i.e. at least 2. In general, suppose we had a set of N toughpairs with the feature that any single word can split at most K of them. Then the minimal size of a splitting set would be at least (the next integer up from) N/K. So we could prove that there is no splitting set of size 3 if we can find a collection of toughpairs that meets these conditions with N>3K; there is no splitting set of size 4 if we can do it with N>4K, and so on. (Even a set with N=5K would be handy: it wouldn't quite rule out a splitting quintuple, but it would restrict us to searching among words that really do split K pairs each, with no two of the five words splitting the same toughpair. Surely that would be either easy to rule out or interesting to find!)

By hand I found a collection of N=36 toughpairs that, collectively, are hard to cover: no single word can split more than K=8 of them (and only 18 words can split that many!). So as above there cannot be a splitting set of size 4, and I was hoping to push this to N=40 or 41 to dispense with quints. I never could find quite that many, though. It's another covering-set kind of problem: we now have 916 binary variables, one per toughpair, with 2315 linear constraints -- one per word, saying that no word may split more than K=8 of the selected toughpairs -- and we want to maximize the number N of those variables that are nonzero. The same switch from N to a relaxed value X can be applied in this setting, and I can quite quickly find that the largest possible value of X is 41.7, which means N cannot be larger than 41. Maybe N=40 such toughpairs can be found (or even N=41!) but the tools that I have could not find such a set.

The way we get the X value of 41.7 is by setting 67 of the variables to certain nonzero values, so there are 67 "important" toughpairs among the 916. The exact determination of X then involves 67 "important" constraints; more precisely, there is a 67-dimensional subspace of the span of the 2315 constraints that is relevant, and I found that space within a 74-dimensional subspace spanned by coordinate axes. That's a high-falutin' way of saying there are 74 words which by themselves are limiting the effectiveness of the set of all the toughpairs. I ordered those 74 words according to the previous (X-value) ranking, so this list begins jumbo, catty, forgo, gumbo, verge, chill, refer, fungi, rumba, ... To see the importance of this last ranked set, let me observe that the word "frond" is not among the 74 words, but the other five of my sextet are, and in fact they sit at positions 2, 5, 9, 19, and 37 in that list. Again in retrospect, it would not have taken all *that* long to have located these five words by a methodical search, and of course the sixth word could have been added quickly.
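For completeness: the 41.7 bound comes from the mirror image of the covering LP, in the same sketch language as before. Here splits_per_word lists, for each of the 2315 words, which of the 916 toughpairs it splits (the transpose of the earlier data), and K=8 as in my hand-built collection:

    def packing_upper_bound(splits_per_word, n_pairs=916, K=8):
        # One variable y_p per toughpair, relaxed to 0 <= y_p <= 1;
        # maximize sum(y) subject to: no word splits more than K of the
        # chosen pairs.  linprog minimizes, so the objective is negated.
        A = np.zeros((len(splits_per_word), n_pairs))
        for row, pairs in enumerate(splits_per_word):
            A[row, list(pairs)] = 1.0
        res = linprog(-np.ones(n_pairs), A_ub=A,
                      b_ub=K * np.ones(len(splits_per_word)), bounds=(0, 1))
        return -res.fun   # e.g. 41.7: no qualifying collection exceeds 41 pairs

In this language, the 67 "important" toughpairs are the variables that come out nonzero at the optimum, and the 74 words are the ones whose constraints are doing the limiting there.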
So this, finally, was the mechanism I used to get the randomly-generated near-perfect splitting sextet. I chose five words at random from the set of 74, using a probability generator that skewed the preference slightly toward the "best" words in the ordering of the previous paragraph. This covers only a very small portion of the set of all possible quintets! To make sure I had a *chance* to step out of that small space, I would sometimes randomly replace one of the five words with a random word from the top 100 in the list of 1828; and with an even lower probability I would replace one of the words with a random selection from among the top 500 in that list. This gave me an ongoing supply of quintets of words, with each quintet composed of words that were likely to be helpful for this purpose. Combined with the screening against pairs of words with substantial overlap, I got a steady stream of quintets that did a good job splitting the wordlist all by themselves. As I stated earlier, if they did a good enough job, I looked methodically for a sixth word to join them, and eventually found a nearly-perfect sextet. The nearly-perfect sextet was then perfected by a sequence of small substitutions of one word for a very similar one, without ever letting the number of clusters drop below 2310.

Here, for the record, is where the six words fall in the various lists:

                      catty  frond  rumba  spill  verge  whack
    DAILY position      343    815   1628   1870   2194   2246
    w74 position          2      0      9     19      5     37   (frond is NOT in w74)
    SL position           5     22     18     48      9    167   (the ranking by X value)
    SW position        1519    972   1085   1535   1213   1654   (the first, blended ranking)
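One last footnote, on the randomization itself: the skewed selection amounted to nothing fancier than the following (every weight and probability in this sketch is illustrative, not a value I actually used):

    import random

    def seed_quintet(w74, top100, top500):
        # Five draws from the 74 key words, weighted toward the front of
        # the list, with an occasional wildcard from the longer rankings.
        weights = [1.0 / (rank + 10) for rank in range(len(w74))]
        picks, pool, wts = [], list(w74), list(weights)
        for _ in range(5):                  # weighted, without replacement
            i = random.choices(range(len(pool)), weights=wts)[0]
            picks.append(pool.pop(i))
            wts.pop(i)
        r = random.random()
        if r < 0.10:                        # rare wildcard from the top 100
            picks[random.randrange(5)] = random.choice(top100)
        elif r < 0.13:                      # rarer wildcard from the top 500
            picks[random.randrange(5)] = random.choice(top500)
        return picks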