<< This is an ancillary document to my analysis of the game Wordle. >> << The full analysis is available at math.utexas.edu/~rusin/wordle/ . >>

I found a set of 6 words that splits the whole Wordle solution list -- that is, every one of the 2315 possible answers gets its own distinct set of response patterns: catty, frond, rumba, spill, verge, whack. What follows is a description of how I found it. I am mostly talking to myself, trying to write something down before I forget it. But I'm trying to be, you know, expository.

The final solution was provided by a Maple procedure I wrote ("meander"). This routine takes a near-perfect splitting set S, considers each of the pairs of words that are not split, and looks for words that (1) split that pair and (2) are neighbors of a word in S. Replacing that word of S by its neighbor, I get a new near-perfect splitting set. (I defined "near-perfect" to mean there were at least 2310 separate equivalence classes.) Starting with a single set S having 2314 classes, I created some new sets, which I then processed in the same way, etc. After about a dozen rounds of this, and after creating about 600 near-perfect sets, I found the perfect splitting set.

So how did I get the near-perfect splitting set to use as a seed? This came from a steady stream of randomly-generated candidates that had promise; many hours of CPU time were spent endlessly generating candidates and retaining the ones that had a reasonable amount of success. Specifically, I generated quintuples of words that by themselves split the word list into at least 2150 clusters; for each of these I then added every possible word as the sixth, and retained sextets that split the word list into at least 2300 clusters. (For this particular sextet S, the six constituent quintuples create between 2104 and 2257 clusters; leaving out "whack" gives the best of them.)

Well, where did those quintuples come from? I tried a few variations, but they were all based on the same idea: I had developed rankings of some key subsets of the 2315 candidate words, so I chose at least a few of the words from high up in those rankings. I also tried to eliminate a lot of chaff (without, I hoped, losing much of the wheat) by ensuring that the good words would work well together -- no two words in the list should have much in common. (In the particular set I did find, the words "whack" and "catty" do each have an "a" and a "c", but in different positions. The other fourteen pairs either shared a single "a" or "r" (in different positions) or had no letters in common at all.) After all, we expect the six words (30 letters) to include most of the alphabet, except (as in this example) perhaps not the rare letters j, q, x, z, so there is not much room for overlap, especially since it is fruitful to have single words with repeated letters (e.g. to be able to distinguish "droll" from "drool"); in this example the letters "t", "e", and "l" are doubled. That leaves 27 letters in the six words to cover the 22 non-rare letters of the alphabet, so there is not much other duplication: there's an extra "c" and two extra each of "a" and "r".

So what are these ranked subsets? I had three, actually. The first and least useful was a ranking of all 2315 words that blended the words' rankings (in an informal way) by several metrics: when each word is used alone to filter the word list, we can consider how many separate clusters it creates, how large the largest one is, and so on.
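To make those cluster counts concrete: two answers land in the same cluster exactly when every guess in hand draws the same pattern of colored tiles against both of them. My code was in Maple, but the computation is simple enough to sketch in Python (the function names here are mine, for illustration only):

    from collections import Counter, defaultdict

    def pattern(guess, answer):
        # Wordle's response to a guess: per position, 2 = green,
        # 1 = yellow, 0 = gray, with correct double-letter handling.
        result = [0] * 5
        leftover = Counter()
        for i in range(5):
            if guess[i] == answer[i]:
                result[i] = 2
            else:
                leftover[answer[i]] += 1
        for i in range(5):
            if result[i] == 0 and leftover[guess[i]] > 0:
                result[i] = 1
                leftover[guess[i]] -= 1
        return tuple(result)

    def clusters(guesses, wordlist):
        # Group the answers by the tuple of patterns they give; a perfect
        # splitting set produces len(wordlist) singleton groups.
        groups = defaultdict(list)
        for answer in wordlist:
            key = tuple(pattern(g, answer) for g in guesses)
            groups[key].append(answer)
        return groups

With a single guess, len(clusters([word], wordlist)) is the cluster count behind the blended ranking; with all six words of the sextet, it should come out to exactly 2315 on the solution list.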
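This same primitive is all that the "meander" routine described at the top needs. Roughly (again only a sketch, reusing pattern and clusters from above; the neighbors function -- which words count as "neighbors" of a given word -- is whatever notion of similarity one prefers):

    from itertools import combinations

    def unsplit_pairs(S, wordlist):
        # The pairs of answers that the current set S fails to distinguish.
        return [pair
                for group in clusters(S, wordlist).values() if len(group) > 1
                for pair in combinations(group, 2)]

    def meander(S, wordlist, neighbors, floor=2310):
        # For every unsplit pair, try swapping one word of S for a
        # neighbor that splits the pair; keep any swap that stays
        # near-perfect (at least `floor` clusters).
        found = []
        for a, b in unsplit_pairs(S, wordlist):
            for s in S:
                for n in neighbors(s):
                    if pattern(n, a) != pattern(n, b):   # n splits the pair
                        T = [n if w == s else w for w in S]
                        if len(clusters(T, wordlist)) >= floor:
                            found.append(T)
        return found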
By that blended ranking, familiar words like parse, trace, stale, later, and crate are at the top of the list; at the bottom we find mummy, queue, jazzy, fuzzy, sissy. It seems reasonably clear that we would want to avoid the words low on the list. Perhaps surprisingly, though, the words high on the list are not especially likely to be in a good splitting set. In part this is because the desire to have small overlaps between words would then prevent us from using other reasonably good words. More generally this seems to be a pattern in difficult optimization challenges: we cannot expect the best overall solution set to have exceptionally good subsets. (FWIW: the six words in my splitting set ranked between 972 and 1654 on this list; that is, none of them is a "poor" word, but none of them individually is a "great" word either. They're all "middling" words.)

The second, somewhat more useful, ranking was used to enlarge the set of possibilities in the randomizing, beyond the core set of options I'll describe in a moment: I allowed my program to sometimes pick a word from the top 100 on this second list, and more rarely to pick a word from the top 500. This list was created using some optimization software, as was the third and shortest list, which turned out to be the most important.

To describe these two lists in detail, consider the problem of distinguishing the words "booby" and "boozy". A perfect splitting set must either include a word with a "z", or a word with two "b"s, or a word with a "b" or "z" in the fourth spot. I used "rumba" to do this. There are, as it turns out, only 64 words in the entire 2315-word dictionary that satisfy one of these three conditions. Hence one of these 64 words must be included in any perfect splitting set. This particular pair (booby/boozy) is especially hard to split, but there are many other pairs of words that have a comparatively small set of words that split them (i.e. words that return different patterns of green or yellow tiles in response to the two words in the pair). I kept track of 916 of these "toughpairs" and of the sets of words that would split them. At a minimum, a perfect splitting set must include at least one word from each of these 916 splitting sets. (And that's not all, of course; there are other pairs of words that must be split as well --- these 916 were just the ones selected by an algorithm that I thought would find the most challenging tasks.)

So what must be done to split these 916 toughpairs amounts to what is called a "covering set" problem: we want a subset of the word list, of minimal size, that "covers" each of the 916 splitting sets (in the sense of containing at least one of its elements). This is in general a combinatorially difficult problem; it is hard to determine the absolute minimal size of a covering set. (Obviously we could cover all 916 splitting sets with no more than 916 words by simply choosing one word from each set!) However, there is a technique that can establish a lower bound on the size of a covering set -- for us, a lower bound on the number of words needed in a splitting set for the Wordle wordlist. It amounts to phrasing the covering-set problem as finding values for 2315 variables, one per word, each supposed to equal either 0 or 1 (is the word in the covering set or not?), with one linear constraint per splitting set (the variables for its words must sum to at least 1), and asking to minimize the sum of all the variables. If we relax the condition that each variable equal 0 or 1, and instead insist only that each variable be non-negative, then the problem has a classic and fairly easy solution method called the Simplex Algorithm.
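In code, the relaxed problem is a few lines with any linear-programming package. A minimal sketch in Python/scipy rather than Maple (splitting_sets here holds, for each of the 916 toughpairs, the indices of the words that split it; the names are my own):

    import numpy as np
    from scipy.optimize import linprog

    def covering_lower_bound(splitting_sets, n_words=2315):
        # One variable x_w per word, relaxed to 0 <= x_w <= 1; minimize
        # sum(x) subject to each splitting set summing to at least 1.
        # linprog wants constraints as A_ub @ x <= b_ub, so ">= 1" is negated.
        A = np.zeros((len(splitting_sets), n_words))
        for row, words in enumerate(splitting_sets):
            A[row, list(words)] = -1.0
        res = linprog(np.ones(n_words), A_ub=A,
                      b_ub=-np.ones(len(splitting_sets)), bounds=(0, 1))
        return res.fun   # e.g. 5.4: any true covering set needs >= 6 words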
To summarize: we want to solve a hard problem to determine the smallest possible value of a number N; we can instead solve an easy problem to determine the smallest possible value of a number X; and we know that N is at least X. In the Wordle situation, my 916 splitting sets put 916 constraints on the 2315 variables; common software solves the problem with the relaxed constraints and tells me that X=5.4 (say). Then I know that N is at least 5.4, and it's also a whole number, which means N is at least 6. In Wordle terms, that means there is no quintuple of words that can split all 916 of the toughpairs, which means any quintuple must leave some of these (and possibly other sets of words) still in indistinguishable clusters; you'll just have to guess which word in the cluster is the hidden word of the day!

Now here's the second ranking of words: for any given word, I can ask what happens if that word is included in a proposed splitting set. Each word helps split some of the toughpairs, so if that word is included, I have fewer than 916 tasks left to perform, and still have 2314 other words to use to complete them. The same linear-programming software can be used on each of these smaller problems to determine (a lower bound for) the minimum number of *additional* words needed to solve the subproblem. We expect the number to go down because there are fewer tasks to complete, but if the minimal X for the new problem only drops to X=5.1 (say), then the minimal value of N here is still 6 -- meaning we still need 6 *additional* words to split the rest of the 916 toughpairs, in addition to the one word we started with. So at least as far as the utility for creating a perfect splitting sextet is concerned, any word with an X value larger than 5.0 is useless: these are words that *cannot* be part of a perfect splitting sextet.

More generally, I used the X values as a way to rank each word's likelihood of being in a splitting sextet. If after committing to a word W we still need X=5.4 additional words, the word is terrible. If we still need X=4.99 additional words, the word is still a *candidate* to be in a splitting sextet; but presumably if I looked at more toughpairs beyond the 916 I had focused on, the covering-set problem would get harder, and perhaps that value X=4.99 would tip over to X=5.01. So such a word is "marginal" -- not a good candidate to be in a splitting sextet. On the other hand, a word with an X value of only 4.5 (say) would likely still be a candidate even if I considered the complete set of all 2,678,455 pairs of Wordle words, every one of those pairs having a splitting set that must be "covered". In short, the X values for each word's subproblem gave me a ranking of the Wordle words. Only 1828 of the words had X<5.0, ranging from "jumbo", with X=4.554, to "imbue", with X=4.99898. In retrospect, this ordering of the words was perhaps even more useful than I imagined: five of the words in my sextet were in the top 50 by this ranking, and the last ("whack") comes in at #167. (Even had I known this, the information alone is not as helpful as one might hope: there are over 2 million quintuples to be formed from the top 50 words on the list, and for each one the whole 2315-word list would have to be tested for a good sixth member.)
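Each of these X values is just one more run of the earlier LP sketch, with the chosen word committed and the toughpairs it splits discharged (again illustrative, not my actual code; forbidding re-use of the word itself is a detail I add here via its bounds):

    def X_after_choosing(w, splitting_sets, n_words=2315):
        # Drop the toughpairs that word w already splits, forbid choosing
        # w again, and re-solve the relaxed covering problem; the result
        # bounds the number of *additional* words needed.
        remaining = [s for s in splitting_sets if w not in s]
        A = np.zeros((len(remaining), n_words))
        for row, words in enumerate(remaining):
            A[row, list(words)] = -1.0
        bnds = [(0, 0) if i == w else (0, 1) for i in range(n_words)]
        res = linprog(np.ones(n_words), A_ub=A,
                      b_ub=-np.ones(len(remaining)), bounds=bnds)
        return res.fun   # X > 5.0 rules word w out of any perfect sextet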
The final ranking of the words came from the dual problem to this linear optimization question. I did at some point start to think there might not be a perfect splitting sextet. How would one prove this? One way would be to exhibit a small collection of toughpairs that already by themselves cannot be covered with six words. Or at least: can we find a small collection of toughpairs that cannot be covered with 5 words, thus giving an easy combinatorial proof that a minimal splitting set has 6 words in it?

For example, no single word can simultaneously distinguish vocal from local, magma from gamma, and inner from infer: the intersection of the splitting sets of those three toughpairs is empty. That means that, of these N=3 toughpairs, the maximum number any single word can split is K=2; hence the minimal size of a splitting set is at least N/K = 1.5, i.e. at least 2. In general, suppose we had a set of N toughpairs with the feature that any single word can split at most K of them. Then the minimal size of a splitting set would be at least (the next integer up from) N/K. So we could prove that there is no splitting set of size 3 if we can find a collection of toughpairs that meets these conditions with N>3K; there is no splitting set of size 4 if we can do it with N>4K, and so on. (Even a set with N=5K would be handy: it wouldn't quite rule out a splitting quintuple, but it would restrict us to searching among words that really do split K pairs each, with no two of the five words splitting the same toughpair. Surely that would be either easy to rule out or interesting to find!)

By hand I found a collection of N=36 toughpairs that, collectively, are hard to cover: no single word can split more than K=8 of them (and only 18 words can split that many!). So as above there cannot be a splitting set of size 4, and I was hoping to push this to N=40 or 41 to dispense with quints. I never could find quite that many, though. It's another covering-set kind of problem: we now have 916 binary variables, one per toughpair, with 2315 linear constraints -- one per word, saying that no word may split more than K=8 of the selected toughpairs -- and we want to maximize the number N of those variables that are nonzero. The same switch from N to a relaxed value X can be applied in this setting, and I can quite quickly find that the largest possible value of X is 41.7, which means N cannot be larger than 41. Maybe N=40 such toughpairs can be found (or even N=41!) but the tools that I have could not find such a set.

The way we get the X value of 41.7 is by setting 67 of the variables to certain nonzero values, so there are 67 "important" toughpairs among the 916. The exact determination of X then involves 67 "important" constraints; more precisely, there is a 67-dimensional subspace of the span of the 2315 constraints that is relevant, and I found that space within a 74-dimensional subspace spanned by coordinate axes. That's a high-falutin' way of saying there are 74 words which by themselves are limiting the effectiveness of the set of all the toughpairs. I ordered those 74 words according to the previous (X-value) ranking, so this list begins jumbo, catty, forgo, gumbo, verge, chill, refer, fungi, rumba, ... To see the importance of this last ranked set, let me observe that the word "frond" is not among the 74 words, but the other five of my sextet are, and in fact they sit at positions 2, 5, 9, 19, and 37 in that list. Again in retrospect, it would not have taken all *that* long to have located these five words by a methodical search, and of course the sixth word could have been added quickly.
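For completeness: the 41.7 bound comes from the mirror image of the covering LP, in the same sketch language as before. Here splits_per_word lists, for each of the 2315 words, which of the 916 toughpairs it splits (the transpose of the earlier data), and K=8 as in my hand-built collection:

    def packing_upper_bound(splits_per_word, n_pairs=916, K=8):
        # One variable y_p per toughpair, relaxed to 0 <= y_p <= 1;
        # maximize sum(y) subject to: no word splits more than K of the
        # chosen pairs.  linprog minimizes, so the objective is negated.
        A = np.zeros((len(splits_per_word), n_pairs))
        for row, pairs in enumerate(splits_per_word):
            A[row, list(pairs)] = 1.0
        res = linprog(-np.ones(n_pairs), A_ub=A,
                      b_ub=K * np.ones(len(splits_per_word)), bounds=(0, 1))
        return -res.fun   # e.g. 41.7: no qualifying collection exceeds 41 pairs

In this language, the 67 "important" toughpairs are the variables that come out nonzero at the optimum, and the 74 words are the ones whose constraints are doing the limiting there.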
So this, finally, was the mechanism I used to get the randomly-generated near-perfect splitting sextet. I chose five words at random from the set of 74, using a probability generator that skewed the preference slightly toward the "best" words in the ordering of the previous paragraph. This covers only a very small portion of the set of all possible quintets! To make sure I had a *chance* to step out of that small space, I would sometimes randomly replace one of the five words with a random word from the top 100 in the list of 1828; and with an even lower probability I would replace one of the words with a random selection from among the top 500 in that list. This gave me an ongoing supply of quintets of words, with each quintet composed of words that were likely to be helpful for this purpose. Combined with the screening against pairs of words with substantial overlap, I got a steady stream of quintets that did a good job splitting the wordlist all by themselves. As I stated earlier, if they did a good enough job, I looked methodically for a sixth word to join them, and eventually found a nearly-perfect sextet. The nearly-perfect sextet was then perfected by a sequence of small substitutions of one word for a very similar one, without ever letting the number of clusters drop below 2310.

Here, for the record, is where the six words fall in the various lists:

                      catty  frond  rumba  spill  verge  whack
    DAILY position      343    815   1628   1870   2194   2246
    w74 position          2      0      9     19      5     37   (frond is NOT in w74)
    SL position           5     22     18     48      9    167   (the ranking by X value)
    SW position        1519    972   1085   1535   1213   1654   (the first, blended ranking)
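One last footnote, on the randomization itself: the skewed selection amounted to nothing fancier than the following (every weight and probability in this sketch is illustrative, not a value I actually used):

    import random

    def seed_quintet(w74, top100, top500):
        # Five draws from the 74 key words, weighted toward the front of
        # the list, with an occasional wildcard from the longer rankings.
        weights = [1.0 / (rank + 10) for rank in range(len(w74))]
        picks, pool, wts = [], list(w74), list(weights)
        for _ in range(5):                  # weighted, without replacement
            i = random.choices(range(len(pool)), weights=wts)[0]
            picks.append(pool.pop(i))
            wts.pop(i)
        r = random.random()
        if r < 0.10:                        # rare wildcard from the top 100
            picks[random.randrange(5)] = random.choice(top100)
        elif r < 0.13:                      # rarer wildcard from the top 500
            picks[random.randrange(5)] = random.choice(top500)
        return picks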