The Best Starting Words For

                    W  O  R  D  L  E  !

I have some things to say about the word game "Wordle" that became
popular in 2022, along with related games that are based on (roughly) 
the same dictionary (Dordle, Quordle, Octordle, etc.). I do like to
play them, but as a mathematician I wanted to do a thorough analysis
of some questions that arose as I played.

What I present here is (mostly) a discussion of the "good" sets of words
drawn from the original 2315-word Wordle vocabulary list. A set of
words is "good" if, when used as a starting set of entries in the games,
it enables a human to guess all the words in a small number of turns.
The starting set must help win all the "subgames" in a compound game; 
the player then usually in effect switches to "hard mode" in each of
the sub-games one at a time, guessing only words that are consistent
with the clues (the colored tiles that result); the player might
memorize extra rules to handle a few tricky situations.

I wish to find and to compare those initial lists of starting words.
I will use both exhaustive searches and clever optimization techniques
to find "discriminating" sets of words: sets for which each resulting
array of green/yellow/gray tiles matches relatively few vocabulary
words.  Rather than a summary of my own personal experience, this
document is intended to be a comprehensive review of sets of words
that are demonstrably better than the alternatives, according to a
variety of clearly-defined standards.

If I accomplish nothing else with these sets of words, at least I will
have generated some great passwords!

I am hardly the first person to apply a thorough mathematical analysis
to Wordle. Some open forums with information include Stack Exchange,
Reddit, and Discord. Laurent Poirrier has collected information about
"optimal" algorithms for playing the games, including results of Alex Selby. 
When applied to Wordle itself, those alternatives are more "efficient"
than anything I propose here, at least in the sense that the average
number of guesses will be higher playing the way(s) that I propose here.

But beginning with a fixed, good starting list is both simpler
for a human player (many fewer branching rules are required), and
better suited for the compound games, and those are the criteria
of interest to me. Using the solutions I present here, I will never
lose at Wordle, and can play hundreds of Quordle games before losing
once. (I take an average of about 7.4 turns for Quordle; that
includes a lot of "user error".) Xan Gregg has
an analysis applicable to these compound games, although the focus
is on perfect play (rather than something a human can aspire to).


Please let me know of corrections or additions to this document.
-- dave
(rusin@math.utexas.edu)

Index of sections:
Some caveats
Introduction: what are we doing in this document?
The best starting sets of six (and more!), and why these are interesting
Best starting quintuples and waltzing nymphs
Interlude: What does it mean for a starting set to be "best"?
Best starting quadruples: everyone can win at Wordle
Best starting triples (by various measures):
    Completing by move 5
    Closest to free guessing
    Best situation before turn 4
    Guessing without strategy
    Guessing with strategy
    A few other good triples
    Summary table of best triples
Best (and possibly best) starting pairs
The best single word to start with
Concluding remarks


==============================================================================
Some initial caveats first:

1. Except in section 9, all comments here are about playing Wordle
in "easy mode". Any starting set containing more than one word
will fail to satisfy the hard mode rules on some days. (And I
don't even know what "hard mode" would mean for a compound game.) 

2. All my analyses are built upon the word lists in the version of
Wordle that was a simple web page in February 2022 (before purchase
by the New York Times).  In particular, (almost) all uses of the
word "word" here mean "one of the original 2315 possible answers to
a Wordle puzzle". (I do make a few comments below that refer to the
larger, 12972-word, list of acceptable inputs to Wordle, but I have
made little effort to update them in response to NYT's enlargement
of that set in Summer 2022.)  I have gathered together a long list
of comments about the word list(s) that I recommend to a person who
actually wants to play the games well. It is important to know
what words are, or are not, potential Wordle answers, and in 
particular the results quoted in this document assume that the
player has perfect recall of the list of Wordle solutions!

I believe this "practice site" uses the same wordlist as Wordle
itself, and speedle offers it as an option; I recommend 
them for testing out the good word sets discussed in this
document. Some of the compound games use slightly different word
lists; these are discussed only briefly. (The game of Woodle
also uses Wordle's current wordlist, but the mode of
play is very different and will not be discussed in this document.)

3. In original Wordle, the daily hidden words were presented in
a particular (random-looking) order; since November 2022 they are
chosen by a "curator" at the Times. Our model of the games assumes
instead that the words in the word list are chosen at random, with
uniform probability, to be hidden each time we play. (This appears 
to be the mode of play in the compound games, at least in "practice
mode", except that as far as I can determine the multiple subgames
are guaranteed to have different hidden words.) One may therefore
interpret probabilistic statements in a frequentist sense: in what
fraction of the games in an entire 2315-day Wordle cycle does
such and such an event occur?

4. In some cases I am stating claims of optimality or completeness.
The proofs I give are mostly just sketches that can be fleshed out
by the reader if interested.  The only parts that do not amount to
a simple case-by-case computer check have to do with the computation
of covering sets (which I did with the linear-programming/optimization
software Gurobi). I have written up a brief introduction to that
technique available here. The key ideas are (a) to cement ideas of
"nearness" or "similarity" in the word list, (b) to identify sets
of "most-similar" words that will be problematic late in the game,
(c) to compute for each such set the collection of words to play
that will help avoid these problematic sets, then (d) to find a
"cover" for these collections --- a set of words that intersects
all or most of these collections.

==============================================================================

                INTRODUCTION: HOW DO PEOPLE PLAY WORDLE?

Let's get the terminology straight before we discuss the good word sets.
(You should, at the very least, read the end of this introduction!)

If you ask a researcher for the best way to play Wordle, they will
present a decision tree --- basically a list of if-then
statements that specifies what to do at each stage in the game, 
based on the clues being given by the day's hidden word. There must
necessarily be thousands of rules since for each of the two thousand
possible hidden words, the tree must have a separate terminal rule
(a "leaf" on the tree) that says it's time to play that word, not to
mention intermediate rules to be played mid-game.

Now ask a frequent Wordler their strategy and you'll get a variety of
answers.  "I just pick a random word to start with and run with it";
"I start with ADIEU to get a lot of vowels"; "I read somewhere that
it's best to start with CRANE + SPILT". To me, these sound like only
Phase 1 of a strategy: all these answers specify a certain number "a"
of fixed starting words (a=0, a=1, and a=2 respectively). (I should
note parenthetically that the second answer came from someone who,
unlike me, is willing to guess a word that's not a Wordle answer-word,
and the third answer came from someone who, like most of us, is
playing Wordle's "easy mode".)

This Phase 1 is about gathering information about what the hidden word(s)
might be.  And it's important for people who (like me) play the compound
games, because the hope is that these first "a" turns will simultaneously
reveal a lot of information about the multiple Wordle subgames.

But then comes Phase 2: how do we use the information gained? No human
is going to memorize thousands of individual instructions! Maybe a few,
to cover special cases ("If I still get all gray tiles then...") But at
some point, people start to enter guesses of what the hidden word might be;
unlike using SPILT after CRANE, most people at some point in the game
start to enter only words that are consistent with the clues from the
previous turns (i.e., they unconsciously switch to something like
Wordle's "hard mode").  They'll spend some number "b" of guesses in
this mode trying to guess the right word. Unlike a decision tree, this
number b is not fixed: given the same exact puzzle (same hidden word)
on a different day, the player might make different guesses, maybe
finding the hidden word sooner or later.  (So actually b is a "random
variable", in the parlance of Statistics.)  The goal is to have a+b no
larger than 6, to win the game, but we might interpret that in terms
of the expected value of  a+b,  or of its maximum value.

So humans mostly don't play as the researchers envision, and hence 
the results of most prior research are primarily of academic interest 
and not necessarily helpful to a human who is willing to learn only a
few steps and rules. That's where this document comes in.


A person playing Wordle does not need to think of the distinction
between Phase 1 and Phase 2. But it does become more important when
playing the compound games like Quordle (with N=4 subgames) in which
we are in essence playing Phase 1 just once for all N of the subgames
and then carrying out Phase 2 separately for each of them.  Thus the
total number of turns would be not a+b but rather a + N b . As N
increases, it becomes more and more important to keep b small, even
if it means a has to be a bit larger. In other words, we want to find
sets of words for Phase 1 that are really good at providing clues
about the hidden word(s), so that we spend very little time after
that guessing words consistent with the clues left after Phase 1.
Considering the optimal solutions we find in this document, the 
expected number of steps to solve an N-fold compound game need not
be higher than the lowest of these:
    6 + N
    5 + 1.0058 N
    4 + 1.0298 N
    3 + 1.1682 N
    2 + 1.6590 N
    1 + 2.8218 N
These are merely upper bounds, but the pattern is clear: smaller
values of  b  come with larger values of  a. So for sufficiently
large values of N it can be more efficient to use a larger starting set.
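
As a rough illustration of how these bounds trade off, here is a minimal
Python sketch that picks the smallest bound for a given N. The coefficients
are copied from the list above; the names BOUNDS and best_opening_size are
made up for this example and are not part of the analysis itself.

```python
# Upper bounds from the list above: opening size a -> expected extra
# turns b needed per subgame.  (Values copied from the text.)
BOUNDS = {6: 1.0, 5: 1.0058, 4: 1.0298, 3: 1.1682, 2: 1.6590, 1: 2.8218}


def best_opening_size(n_subgames):
    """Return (a, bound) minimizing a + b*N over the tabulated bounds."""
    candidates = ((a, a + b * n_subgames) for a, b in BOUNDS.items())
    return min(candidates, key=lambda pair: pair[1])


# Print the best tabulated opening size for a few compound games.
for n in (1, 2, 4, 8, 16, 32, 64, 256):
    a, bound = best_opening_size(n)
    print(f"N={n:3d}: a={a}, bound={bound:.4f}")
```

With these particular coefficients the smallest bound comes from a=2 for
N up to 2, from a=3 for N from 3 to 7, from a=4 for N from 8 to 41, from
a=5 for N from 42 to 172, and from a=6 beyond that. Since the figures are
only upper bounds, these crossover points are indicative rather than exact.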

                ----------------

Let's see how we can analyze just how good a starting set is. As we
shall see, in order to be able to compare different starting sets,
it will be important to know not just what words a player starts with
but also what exactly they will do after those starting words are
entered --- and what it is they value as the game progresses.

Let's follow one player whose Phase 1 has a=3: they use the three
starting words LOATH+MURKY+SPINE. It's a good start! But now what?
Consider what this person might do when the colored tiles show up
as in each of these examples. 
   lower case = yellow tile = right letter, wrong place;
   UPPER CASE = green  tile = right letter, right place.
Play along here: what would YOU do in each case?

     LOATH + MURKY + SPINE
 1)  .O..h   .u...   s...E
 2)  .....   .....   ....E
 3)  ..A..   ..rK.   ....E
 4)  ....H   .ur..   s....
 5)  l.A..   ...k.   ...N.
 6)  ...t.   ..r..   ..i.e
 7)  .o...   ..r..   .p..E
 8)  .....   ..r..   .pI.E
 9)  .o...   ..r..   ...n.
10)  LOA..   m...Y   .....
11)  ..a..   ..r..   ....e
12)  ..at.   .u...   ...N.

I hope for Example #1 you decided the word was "HOUSE". You're right!
That's the only word consistent with those clues, and you might as well
enter it on your next turn and win.
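
For readers who want to check such patterns by machine, here is a minimal
Python sketch that produces the tile notation used in the examples above.
The function name feedback is an illustrative choice, and the handling of
repeated letters (greens first, then yellows from left to right while
unmatched copies remain) is the usual Wordle convention, assumed here
because the text does not spell it out.

```python
from collections import Counter


def feedback(guess, hidden):
    """Tiles in the notation above: '.' gray, lower case yellow, UPPER green."""
    guess, hidden = guess.upper(), hidden.upper()
    # Letters of the hidden word that are NOT covered by a green tile.
    spare = Counter(h for g, h in zip(guess, hidden) if g != h)
    tiles = []
    for g, h in zip(guess, hidden):
        if g == h:
            tiles.append(g)            # green: right letter, right place
        elif spare[g] > 0:
            tiles.append(g.lower())    # yellow: uses up one spare copy
            spare[g] -= 1
        else:
            tiles.append(".")          # gray
    return "".join(tiles)


# Example #1: the hidden word HOUSE against the three starting words.
for starter in ("LOATH", "MURKY", "SPINE"):
    print(starter, feedback(starter, "HOUSE"))
```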

Example #2 is much harder but it turns out there is only one Wordle
word that matches this pattern: "WEDGE". Most people, I think, would
need a hint to figure this out; they might deliberately enter something
that's not a Wordle solution word, or not consistent with the colored 
tiles, just to get some information about some more letter. That's fine,
but of course it costs one turn. In this document, we will assume the
player is perspicacious enough to spot the right word without hints,
when there is only one (and more generally we will assume the player
can list all the possible words consistent with the hints). This is
NOT realistic; from time to time we will discuss ways to make things
easier for the player.  But it illustrates why a person needs to
really know the word list!

It turns out these first two examples are pretty representative of
what this player will face: Of all the words in the Wordle dictionary,
1364 (59%) are uniquely identified by the colored tiles that result
from playing LOATH+MURKY+SPINE. But for the rest, the colored tiles
have only indicated that the hidden word is one of a "cluster" of
similar-looking words.

In Example #3, there's a good chance you see the "_rake" and so you'd
enter "brake" right away. Again, not a bad plan but it turns out "drake"
is also a Wordle word. This is common: we think we know the word, so why
not enter it? But then we discover the word we entered is only one of
several possibilities. In this example we have a 50-50 chance of getting
the word right, even if we *do* know the two possibilities. The same
would happen in Example #4: this time there's a better chance you recognize
both "brush" and "crush" as possibilities, and they are the only ones, so
what else is there to do but enter one or the other and hope for the best;
half the time you'll win on the first guess, half the time on the second.

With a bit of effort you might figure out that when Example #5 shows
up, the word is either BLANK, CLANK, or FLANK. So what do we do now?
Most people would probably enter these words one at a time, especially
if at first only one or two of the possibilities comes to mind.
(Really? "clank"?!) If that's what you choose to do, you'll get the
right word in either 1, 2, or 3 more turns, each with probability 1/3.
Playing this way --- simply entering the first matching word that
comes to mind --- we might call "guess-at-will" mode, or (since
we have now slipped into playing a kind of "hard mode"!) we might
call it "free-form hard mode".

The situation in Example #6 is a little different. The possible words
now are REFIT, RIVET, and TIGER. But this is a better situation for the
player than example 5! No matter which of the three we choose to enter
as our fourth word, if it's wrong we will get enough new information
to tell which of the other two is the hidden word. So in reality we're
playing the same way, but have better odds: still a 1/3 chance of
winning on turn 4 but then a 2/3 chance of winning on turn 5.

This now leads us to look at Example #7: there are again three
possible answers: GROPE, PROBE, and PROVE. But this time the situation
is a mix of the previous two. If we guess GROPE on turn 4, then we
have no additional information to distinguish whether PROBE or PROVE
is the right word. If instead we guess one of the other two, then we
*do* get that information and can surely win on turn 5.

So in this case, the player has two choices: it's simpler just to
continue to guess whatever seems to fit, continuing in freeform hard
mode. But it's more efficient to use a "guided hard mode", in
which (in addition to memorizing the three starting words) the player
memorizes that playing PROBE when it's possible to do so is the preferred
thing to do.
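
The odds quoted for Examples 5 to 7 can be checked mechanically. The
Python sketch below assumes that the hidden word is uniformly distributed
over the cluster and that a free-form player picks uniformly at random
among the words still consistent with every clue. The names feedback and
guess_distribution, and the optional parameter first (an in-cluster word
forced as the first guess), are illustrative choices only.

```python
from collections import Counter
from fractions import Fraction


def feedback(guess, hidden):
    """Tiles: '.' gray, lower case yellow, UPPER green (usual duplicate rule)."""
    guess, hidden = guess.upper(), hidden.upper()
    spare = Counter(h for g, h in zip(guess, hidden) if g != h)
    tiles = []
    for g, h in zip(guess, hidden):
        if g == h:
            tiles.append(g)
        elif spare[g] > 0:
            tiles.append(g.lower())
            spare[g] -= 1
        else:
            tiles.append(".")
    return "".join(tiles)


def guess_distribution(cluster, first=None):
    """Map number-of-guesses -> exact probability for hard-mode guessing."""
    cluster = [w.upper() for w in cluster]
    guesses = [first.upper()] if first else cluster
    dist = Counter()
    for g in guesses:
        p_guess = Fraction(1, len(guesses))
        # Group the possible hidden words by the tiles that g would show.
        parts = {}
        for h in cluster:
            parts.setdefault(feedback(g, h), []).append(h)
        for part in parts.values():
            p_part = p_guess * Fraction(len(part), len(cluster))
            if part == [g]:
                dist[1] += p_part                      # guessed it
            else:
                for k, p in guess_distribution(part).items():
                    dist[k + 1] += p_part * p          # keep guessing
    return dict(dist)


def expected_guesses(dist):
    return sum(k * p for k, p in dist.items())


example5 = guess_distribution(["BLANK", "CLANK", "FLANK"])
example6 = guess_distribution(["REFIT", "RIVET", "TIGER"])
example7_free = guess_distribution(["GROPE", "PROBE", "PROVE"])
example7_guided = guess_distribution(["GROPE", "PROBE", "PROVE"], first="PROBE")
```

Under these assumptions Example 7 needs 16/9 additional guesses on average
when guessing freely and 5/3 when PROBE is played first, so the gain from
the extra rule is real but small.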

In that last example, taking the effort to remember an additional rule
has only a small payoff, but the same principle applies in more important
cases. Example #8 shows an array that could signal any of GRIPE, PRICE,
PRIDE, or PRIZE. It's quite possible that a player using freeform
guessing would guess GRIPE first, which unfortunately would give no
information about which other word is the hidden one (if it's not GRIPE
itself) and then no matter which other of the four words we try next,
we never get more information about the remaining candidates when we
guess wrong. In this case the player could definitely run out of turns
and lose.  By contrast, if the player takes pains to remember to keep
an eye out for PRIZE and play it when it's possible to do so, then he
or she will definitely win by turn 6 whichever of the other three is
the hidden word.
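
The danger in Example 8 is a worst-case statement, and it can be checked
with a short Python sketch. It assumes an unlucky hard-mode player whose
every guess is the least helpful consistent word; worst_case_free and
worst_case_after are hypothetical helper names introduced only for this
illustration.

```python
from collections import Counter


def feedback(guess, hidden):
    """Tiles: '.' gray, lower case yellow, UPPER green (usual duplicate rule)."""
    guess, hidden = guess.upper(), hidden.upper()
    spare = Counter(h for g, h in zip(guess, hidden) if g != h)
    tiles = []
    for g, h in zip(guess, hidden):
        if g == h:
            tiles.append(g)
        elif spare[g] > 0:
            tiles.append(g.lower())
            spare[g] -= 1
        else:
            tiles.append(".")
    return "".join(tiles)


def worst_case_after(first, cluster):
    """Most guesses needed if `first` is played, then free-form guessing."""
    parts = {}
    for hidden in cluster:
        parts.setdefault(feedback(first, hidden), []).append(hidden)
    worst = 1                      # the case where `first` is the hidden word
    for part in parts.values():
        if part != [first]:
            worst = max(worst, 1 + worst_case_free(part))
    return worst


def worst_case_free(cluster):
    """Most guesses an unlucky free-form player could need on this cluster."""
    return max(worst_case_after(g, cluster) for g in cluster)


EXAMPLE8 = ["GRIPE", "PRICE", "PRIDE", "PRIZE"]
TURNS_LEFT = 3                     # turns 4, 5 and 6 after three starters

print("free-form worst case:", worst_case_free(EXAMPLE8))
for word in ("GRIPE", "PRIZE"):
    print(word, "first:", worst_case_after(word, EXAMPLE8))
```

A first guess is safe for this cluster exactly when its worst case does
not exceed TURNS_LEFT.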

So our analysis of each starting wordset will consider two situations
separately: what will happen if the player simply guesses candidate
answer words at random, versus, what would happen if the player
maps out a strategy that resolves a cluster of candidate solutions
by (memorizing and) playing the most "efficient" of those candidates.

There's another strategy that a clever player can use, and it's again
illustrated by example 5 (the "_LANK" one). For a player not committed
to hard mode at all, there's no reason the player could not enter, say,
BRACE as the fourth word. Depending on whether the B, the C, or neither
gets a colored tile, the player knows right away which is the right word
and can enter it as the fifth word, and win. So this manner of play ---
this "out of the box thinking" --- can reduce the maximum number
of turns that it will take to resolve a situation like example 5.
That can mean the difference between winning and losing!

Out-of-the-box mode can also reduce the average number of turns
needed (i.e. the expected value of this random variable).  Example #9
demonstrates this: there are five words that could possibly be that
day's hidden word: BROWN, CROWN, DROWN, FROWN, and GROWN.  It's pretty
clear that both freeform and guided hard mode can take up to five more turns
to get the right answer, with the player losing the game.  But if the
player enters BADGE for the fourth word, for example, then there will
be a clear signal whether or not BROWN, DROWN, or GROWN is the hidden
word and can be entered to win on turn 5; otherwise the word is either
CROWN or FROWN, and we can enter one of them on turn 5 and if necessary
enter the other on turn 6 to win. So the maximum number of guesses needed
drops from 5 to 3, and the expected number drops from 3.00 to 2.20.
That's a significant improvement, but it does come at the cost of
the player having to memorize more steps to their algorithm (i.e. to
remember that if the word could be BROWN, then it's best to play BADGE).
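
To see what an out-of-cluster word buys, one can list how it splits a
cluster into groups of words that show identical tiles. The Python sketch
below does this for BRACE on the cluster of Example 5 and for BADGE on the
cluster of Example 9; the helper names feedback and split_cluster are
illustrative, not taken from any existing tool.

```python
from collections import Counter


def feedback(guess, hidden):
    """Tiles: '.' gray, lower case yellow, UPPER green (usual duplicate rule)."""
    guess, hidden = guess.upper(), hidden.upper()
    spare = Counter(h for g, h in zip(guess, hidden) if g != h)
    tiles = []
    for g, h in zip(guess, hidden):
        if g == h:
            tiles.append(g)
        elif spare[g] > 0:
            tiles.append(g.lower())
            spare[g] -= 1
        else:
            tiles.append(".")
    return "".join(tiles)


def split_cluster(probe, cluster):
    """Group the candidate words by the tiles the probe word would show."""
    groups = {}
    for word in cluster:
        groups.setdefault(feedback(probe, word), []).append(word)
    return groups


blank_split = split_cluster("BRACE", ["BLANK", "CLANK", "FLANK"])
brown_split = split_cluster(
    "BADGE", ["BROWN", "CROWN", "DROWN", "FROWN", "GROWN"])

for tiles, words in brown_split.items():
    print(tiles, words)
```

BRACE separates all three words of Example 5, while BADGE isolates BROWN,
DROWN and GROWN and leaves CROWN and FROWN together.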

With more options to choose from, it's not surprising that we can often
find out-of-cluster words that trim the set of candidates in a cluster
more often than using in-cluster ("preferred") words. In an attempt to
use fewer turns it's tempting to look outside the cluster more often.
I choose not to do so when an in-cluster word is available for two
reasons. First, the special rules to resolve a problematic cluster take
half as much memory this way! Secondly, when used in the compound games,
the preferred-word rules continue to be just as useful even when 
previously-solved subgames remove some candidates from a cluster (e.g.
the rule "play PRIZE whenever it's a candidate" continues to be at
least as good a move as making a random selection among the candidates,
irrespective of how many candidates are left in the same cluster as
PRIZE). By contrast, using up a turn to enter an out-of-cluster word
could be *less* efficient than choosing a random candidate, after some
of the candidates are removed (e.g. if DROWN and GROWN have already
been eliminated, it is a waste of a turn to enter BADGE).

So we will generally assume the player will NOT reach for an
out-of-cluster word if at least one of the words in the cluster will
lead to a guaranteed win.  In practice -- particularly in compound
games -- a player who is close to running out of turns might find it
handy to employ those tactics when they can be spotted on the fly
(potentially even using words that are not on the shorter Wordle
answer-list). But we will avoid discussion of such ad-hoc strategies.

                ----------------

For a player who has just entered  LOATH, MURKY, and SPINE there are
1,715 possible ways the colored tiles can then appear. (We've just worked
through 9 of them.) Fully 80% of them indicate precisely one word and
the player has an easy win on turn 4. But those other 20% can be tricky,
as we have seen. It's easy enough to write a computer program to alert
us to all of them and outline potential responses, but if we wish to
answer the question of how good this starting set is, we have to know
just *how* the player intends to proceed in those other 20% of the cases!

In my analyses, I will assume players play out "Phase 2" in one of two ways:
(1) Simply use a guess-at-will strategy. With a six-turn limit,
    that may mean accepting the possibility of a loss; we might want
    to compute that probability, and then value most highly the starting
    sets of words that keep this probability as low as possible. Or, we
    can imagine the freedom to continue playing as many turns as needed
    until victory, and then we can ask for the probability distribution:
    what is the probability that the player will win after 1, 2, 3, ...
    turns. From that we could compute the expected number of turns until
    a win; we would then seek starting sets to minimize that expected value.
(2) Use the same guess-at-will strategy in general, but by pre-computing
    strategies for the possible tricky situations, the player will
    use a "preferred" right answer (like PRIZE for Example 8) when
    necessary to do so to ensure a win by turn 6. And (only) if no such
    preferred word exists, the player will use an "out-of-the-box"
    solution (like playing BADGE when BROWN is indicated by the clues,
    in Example 9). In either case, I would expect the player to revert
    to free-form guessing in the very next turn. Once the set of rules
    for each exceptional cluster is in place, we can again compute a
    probability distribution. We usually look for the starting sets
    that use the fewest number of turns on average.

        To repeat: yes, one can guess the hidden word faster by using
    additional rules to play more in- and out-of-cluster words; but the
    point of this document is to discuss *simple* algorithms to play the
    games! So we will only analyze sets of such rules that refer ONLY to
    those clusters that cannot be resolved by randomly guessing their members.

There can also be situations in which neither a "preferred" in-cluster word
nor an out-of-the-cluster solution exists. That turns out not to happen with
LOATH + MURKY + SPINE but it occurs for example with LEARN + STICK + DOUGH:
when the hidden word could be "batch", it could also be any of
     batch, catch, hatch, match, patch, watch
It turns out that no matter what word you enter for your fourth turn,
from the entire Wordle answer list, you might STILL have to choose
from among a set of at least 3 words on your fifth and sixth turns;
then you might guess wrong and lose. (In fact, just this once, I even 
checked all 14,853 currently-allowed Wordle input words, as a candidate
for the fourth turn, and every one of them leaves a set of three or
more from which a player might have to guess on the last two turns.
I have to say I was surprised by this!)  That doesn't mean the player
who starts with LEARN + STICK + DOUGH cannot win by the 6th turn, but 
now he or she must use BOTH turns 4 and 5 to gain enough information
to be able confidently to enter the correct word, finally, on turn 6.
For example the player would have to know in advance to enter (say) 
BAWDY on turn 4 and CHAMP on turn 5 . So that's another rule this player
has to remember: BATCH -> BAWDY + CHAMP ; using it, the player will get
enough information to know what to enter for turn 6. Even a reasonably
good starting set might need a "two-word strategy" like this for a couple
of its most problematic clusters. With additional effort we might identify
a pair of words to be entered that fixes the problem; or alternatively
we can identify some "preferred" words in the sub-clusters that result after
turn 4, that should be played on turn 5. The very best starting sets
avoid all this messiness, but we will describe a few examples in which
these techniques are necessary or useful.
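
The two-word rule BATCH -> BAWDY + CHAMP can be verified directly: the
pair works if every word in the cluster produces a different combination
of tiles. The following Python sketch performs that check; the helper
names feedback and separates are made up for this example.

```python
from collections import Counter


def feedback(guess, hidden):
    """Tiles: '.' gray, lower case yellow, UPPER green (usual duplicate rule)."""
    guess, hidden = guess.upper(), hidden.upper()
    spare = Counter(h for g, h in zip(guess, hidden) if g != h)
    tiles = []
    for g, h in zip(guess, hidden):
        if g == h:
            tiles.append(g)
        elif spare[g] > 0:
            tiles.append(g.lower())
            spare[g] -= 1
        else:
            tiles.append(".")
    return "".join(tiles)


def separates(probes, cluster):
    """True if the probes' combined tiles differ for every cluster word."""
    signatures = {tuple(feedback(p, w) for p in probes) for w in cluster}
    return len(signatures) == len(cluster)


BATCH_CLUSTER = ["BATCH", "CATCH", "HATCH", "MATCH", "PATCH", "WATCH"]

for word in BATCH_CLUSTER:
    print(word, feedback("BAWDY", word), feedback("CHAMP", word))
print("pair separates the cluster:", separates(["BAWDY", "CHAMP"], BATCH_CLUSTER))
```

Neither word is enough on its own (BAWDY leaves CATCH, HATCH, MATCH and
PATCH together, and CHAMP leaves BATCH, HATCH and WATCH together), but
the two sets of tiles taken together identify each of the six words.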


In practice, a player using a prescribed recipe like this might sometimes
choose to go rogue. Consider the player using LOATH + MURKY + SPINE who
gets the colored-tile pattern in Example #10. It is already clear after
the first two words have been entered that the word must be LOAMY,
so there is no point to entering SPINE. This example is especially
obvious but more generally there is no point to entering SPINE if the
first two words have already revealed five different letters in the
hidden word; since SPINE does not repeat any of the letters in LOATH
and MURKY, we know in advance that the response to SPINE would just
be five gray tiles, giving us no new information. Even if only some
colored tiles had shown from LOATH and MURKY, we could probably skip
entering SPINE, if we already have "lots" of information about the
hidden word. The point is that in such cases it is likely that there
are only very few words --- maybe just one --- that fit these unusually
helpful sets of clues. A dedicated player might even make a list in
advance of any problematic situations that could arise from skipping
SPINE when there are (say) four colored tiles from LOATH and MURKY,
and then devise separate strategies for those cases.

We will not pursue this line of inquiry very far because it 
deviates from both principles we declared at the outset. On the
one hand, such shortcuts give algorithms for play that increasingly
involve branching (so they're not simple). And on the other hand they're
less useful in a compound game. To illustrate, suppose a person is
playing Dordle (N=2) and after just LOATH and MURKY are entered
the player sees  LOA..+m...Y  in one subgame but  .....+..... in
the other. Surely they can enter LOAMY next, to win the first subgame,
but this will give no new information in the other, so inevitably the
player will use SPINE again anyway. More generally, it only makes
sense to abandon the intended list of starting words if *all* of the
subgames have already given abundant clues in response to the first
couple of words.  This certainly can happen, especially in Wordle
itself (N=1) but it becomes increasingly rare as N increases.

                ----------------

To complete this introduction, we can now summarize the prospects
for the player who begins with LOATH + MURKY + SPINE .

First, we can describe the situation immediately after those three
words are entered. We start by observing the player is given a good
set of hints by the colored tiles that result from this starting triple.
It turns out that every hidden word will yield at least one colored 
tile; the word WEDGE mentioned earlier is the only one that gives
only a single green tile, and only CIVIC and VIVID do worse by 
giving only one yellow tile. At the other extreme, about 18% of
the words give five colored tiles; the average over the 2315-day
cycle is that we will get 1.35 green tiles and 2.41 yellow ones.
So this triple starts us off with pretty generous help with
constructing the hidden words.

Now, if the player has learned the list of solution words, then
each possible arrangement of colored tiles flags for them a cluster
of potential answer-words. We can count how many clusters there are
that consist of 1, 2, 3, ... words; that gives us the cluster vector
   [1364, 227, 67, 30, 13, 4, 2, 3, 3, 1, 0, 1]
So there are 1364 words that can immediately be guessed with 
confidence, 227 cases that require a coin flip, etc. These numbers
sum to 1715, the total number of clusters. The fact that there are
12 numbers here tells us the largest cluster has 12 words; in fact
that cluster is the set of possible solutions to Example #11:
   bread, cedar, debar, dread, eager, gazer,
   racer, rebar, wafer, wager, waver, zebra
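
The construction of a cluster vector can be sketched in a few lines of
Python. The full 2315-word list is not reproduced here, so the sketch runs
on a toy list made of words from the examples above; the names clusters,
cluster_vector, TOY_WORDS and DOC_VECTOR are illustrative only. The last
lines simply redo the bookkeeping on the vector quoted in the text.

```python
from collections import Counter, defaultdict


def feedback(guess, hidden):
    """Tiles: '.' gray, lower case yellow, UPPER green (usual duplicate rule)."""
    guess, hidden = guess.upper(), hidden.upper()
    spare = Counter(h for g, h in zip(guess, hidden) if g != h)
    tiles = []
    for g, h in zip(guess, hidden):
        if g == h:
            tiles.append(g)
        elif spare[g] > 0:
            tiles.append(g.lower())
            spare[g] -= 1
        else:
            tiles.append(".")
    return "".join(tiles)


def clusters(starters, words):
    """Group words by the tiles they produce against all starting words."""
    groups = defaultdict(list)
    for word in words:
        groups[tuple(feedback(s, word) for s in starters)].append(word)
    return dict(groups)


def cluster_vector(groups):
    """Entry k-1 counts the clusters that contain exactly k words."""
    sizes = Counter(len(members) for members in groups.values())
    return [sizes[k] for k in range(1, max(sizes) + 1)]


STARTERS = ["LOATH", "MURKY", "SPINE"]
TOY_WORDS = ["HOUSE", "WEDGE", "BRAKE", "DRAKE", "BLANK", "CLANK", "FLANK"]
toy_groups = clusters(STARTERS, TOY_WORDS)
toy_vector = cluster_vector(toy_groups)

# Bookkeeping on the vector quoted in the text for the full word list.
DOC_VECTOR = [1364, 227, 67, 30, 13, 4, 2, 3, 3, 1, 0, 1]
n_clusters = sum(DOC_VECTOR)
n_words = sum(size * count for size, count in enumerate(DOC_VECTOR, start=1))
print(toy_vector, n_clusters, n_words)
```

The quoted vector accounts for 1715 clusters and for all 2315 words, as it
should.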

Generally speaking, it's better to have short cluster vectors,
with entries dropping in size as quickly as possible left-to-right.
In another section we will discuss some ways to make that idea
precise; each such encapsulation can lead to a different way to
describe one starting set as "better" or "worse" than another, 
and so each can lead to a different conclusion about which is "best".

To analyze further, we have to know which of the two modes of play
the person will pursue after this starting triple is entered.

(1) The guess-at-will mode cannot guarantee this person success.
A player who enters these three starter words and then follows this
strategy will sometimes take many turns to win. We can compute the
distribution of the percentages of the time that the hidden word is
found after 1,2,3,... additional turns; this is the probability
vector and for this starting triple I calculate it to be
   [0.740821, 0.211184, 0.038993, 0.007789, 0.001194, 0.000019]
The fact that it's six numbers long reflects the fact that an unlucky
guesser could take as many as six more guesses (nine turns total) to
find the hidden word, even if properly following all the additional clues
that come from earlier incorrect guesses, and even if only guessing
legitimate Wordle answer-words. From this vector we can also easily
compute that there is a 0.9002% chance of losing the game, and that
the expected number of turns needed to find the hidden word is 4.317.
(That's a=3 starting words plus an expected value of b=1.317
additional turns.)
   Again, a good starting set would give a probability vector that's
short and front-loaded. We will discuss a few different ways to turn
that idea into a precise metric.
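
The summary figures in item (1) follow from the probability vector by
simple arithmetic, as this short Python sketch shows. The numbers are
copied from the text; the variable names are illustrative. The final line
rests on the assumption that each word of a cluster is equally likely to
be the first one guessed, in which case the chance of an immediate hit is
the number of clusters divided by the number of words.

```python
PROB_VECTOR = [0.740821, 0.211184, 0.038993, 0.007789, 0.001194, 0.000019]
STARTERS = 3        # a = 3 fixed starting words
TURN_LIMIT = 6      # Wordle allows six turns in all

# Expected number of additional turns b, and of total turns a + b.
expected_extra = sum(k * p for k, p in enumerate(PROB_VECTOR, start=1))
expected_total = STARTERS + expected_extra

# A loss means needing more additional turns than remain after the starters.
turns_left = TURN_LIMIT - STARTERS
loss_probability = sum(PROB_VECTOR[turns_left:])

# Chance of hitting the word on turn 4: one success per cluster.
first_guess_hit = 1715 / 2315

print(f"expected total turns: {expected_total:.3f}")
print(f"chance of losing:     {100 * loss_probability:.4f}%")
print(f"hit on first guess:   {first_guess_hit:.6f}")
```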

(2) On the other hand, the player who does not want to face defeat
(but still wants to begin with LOATH + MURKY + SPINE ) has the option of
using special rules for tricky situations. As it turns out, of the 1715
clusters, only 25 of them can lead to defeat if we just randomly guess
any words consistent with the clues. Thirteen of these clusters will still
give a victory if we play a "preferred" word in the cluster of consistent
words (like PRIZE) but the other 12 require an out-of-cluster word
(like BADGE) to ensure success. (See e.g. Example #12, which could
indicate any of daunt, gaunt, jaunt, taunt, or vaunt as an answer. But 
the player could enter JUDGE on turn 4 and then be sure of a victory 
after just 1 or 2 more turns.) We can summarize these additional
rules by simply listing the 13 (in-cluster) preferred words
   {algae, allot, aware, baggy, bevel, biddy,
    bitty, bobby, boxer, bread, brief, budge, gripe}
and the 12 ordered pairs
    [baste, twice], [batch, bawdy], [batty, aback], [billy, bawdy],
    [bully, badge], [baker, aback], [brown, badge], [daunt, judge],
    [coyly, fjord], [cider, ridge], [after, creed], [breed, befit]
This means: play any preferred word that might be consistent with the
clues provided by the starting triple; and for each ordered pair
[WORD1, WORD2], play the (out-of-cluster) word WORD2 if the word WORD1
is consistent with the clues. (Checking WORD1 amounts to identifying
the cluster in which the hidden word lies.) In any other case, and
after using these special rules for turn 4, just keep guessing any
word that's consistent with the clues up to that point.
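
Written out as a lookup, the turn-4 rule just described might look like
the following Python sketch. The word lists are copied from the text; the
names PREFERRED, PROBES and turn4_choice are hypothetical, and returning
the alphabetically first candidate in the fallback case is only a stand-in
for "guess any consistent word".

```python
# The 13 in-cluster preferred words listed above.
PREFERRED = {
    "ALGAE", "ALLOT", "AWARE", "BAGGY", "BEVEL", "BIDDY", "BITTY",
    "BOBBY", "BOXER", "BREAD", "BRIEF", "BUDGE", "GRIPE",
}

# The 12 ordered pairs [WORD1, WORD2]: if WORD1 is a candidate, play WORD2.
PROBES = {
    "BASTE": "TWICE", "BATCH": "BAWDY", "BATTY": "ABACK", "BILLY": "BAWDY",
    "BULLY": "BADGE", "BAKER": "ABACK", "BROWN": "BADGE", "DAUNT": "JUDGE",
    "COYLY": "FJORD", "CIDER": "RIDGE", "AFTER": "CREED", "BREED": "BEFIT",
}


def turn4_choice(candidates):
    """Pick the fourth word, given the words consistent with the clues."""
    cands = sorted(word.upper() for word in candidates)
    for word in cands:
        if word in PREFERRED:
            return word                 # play the in-cluster preferred word
    for word in cands:
        if word in PROBES:
            return PROBES[word]         # play the out-of-cluster word
    return cands[0]                     # stand-in for free-form guessing


print(turn4_choice(["brown", "crown", "drown", "frown", "grown"]))
print(turn4_choice(["daunt", "gaunt", "jaunt", "taunt", "vaunt"]))
print(turn4_choice(["brake", "drake"]))
```

The special rules apply to turn 4 only; on later turns the player goes
back to guessing any word consistent with the clues.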

For players following this strategy, I computed the probability vector 
    [0.735637, 0.228790, 0.035572]
from which we easily deduce an average of 4.300 turns to win 
(and a 100% chance of winning by the sixth turn). This represents
an improvement over the guess-at-will strategy, but comes at the 
expense of having a more complicated set of rules of play. 
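
The averages quoted here follow directly from a probability vector. As
a minimal sketch (the function names are my own, not part of any game
or library), entry k of the vector is taken to be the chance of winning
exactly k turns after the starting set has been entered:

```python
def expected_turns(num_starters, prob_vector):
    """Average game length: the starters plus the weighted extra turns."""
    extra = sum((k + 1) * p for k, p in enumerate(prob_vector))
    return num_starters + extra


def failure_rate(num_starters, prob_vector, max_turns=6):
    """Total probability of games that run past max_turns."""
    return sum(p for k, p in enumerate(prob_vector)
               if num_starters + k + 1 > max_turns)


# Probability vector quoted above for LOATH + MURKY + SPINE, strategy (2).
refined = [0.735637, 0.228790, 0.035572]
print(round(expected_turns(3, refined), 3))
print(failure_rate(3, refined))
```

The same two functions apply to any starting set and mode of play once
its probability vector is known.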

(A final variant strategy would be to take nothing to chance and to
map out in advance the best moves for each cluster, considering any
words for moves 4,5,6 that allow the player to win for certain by 
move 6. This would allow the player to lower the expected number of
moves just a bit, to 4.2894. But to do so would mean to memorize an
even longer list of pre-computed branching decisions, and that is
antithetical to the goals for this paper that we described at the outset!)


So in the end, is LOATH+MURKY+SPINE a good Phase-1 strategy for Wordle?
That's a matter of opinion. Memorizing the 13 "preferred" words, plus
recognizing the other 12 difficult clusters and remembering the
out-of-cluster word that resolves them, might be a bit much. Figuring
them out on the fly is not impossible, but taxing. Sticking instead with
mode (1) is simple, and if you're a gambler who is a "lucky guesser",
that may be sufficient.  And, well, maybe it's more fun too, even if
it means the occasional loss. It's a personal decision, of course.
But we can provide comparable data for other word sets, and let each
player make a decision separately.

                ----------------

And how good is LOATH + MURKY + SPINE as an opener for the compound games?

Those probability vectors apply only to Wordle itself but they point to
some statistics for the compound games too. If the player is playing a
compound game built from N Wordle subgames, then (if the game permits
sufficiently many turns), the player will first enter the a=3 starting
words, and then in any of the N subgames can expect to require b=1.317
additional turns to win by using a guess-at-will strategy (or 1.300
additional turns to win by memorizing what to do with the 25 anomalous
clusters). That would mean the expected total number of turns is
3 + 1.317 N.  Well... it *would* mean this expected number of turns
if (after the three starter words) the subsequent guesses are applied
only to one subgame at a time. In the games I know, this is not how the
games are played -- instead, the subsequent words entered for the first
subgame might give additional clues in the second and later subgames.
So the expected number of turns is surely smaller than 3 + 1.317 N .

Nonetheless, this gives an upper bound on the expected number of
turns for e.g. Quordle, and more broadly shows the relative importance
of the two phases of Wordle-solving. For Wordle itself (N=1), it is
only the combined number of turns taken for both the fixed initial
guesses and then the guess-at-will phase. But for increasingly
large values of  N, the size  "a"  of the starter set (here, a=3)
becomes less important than the set's effectiveness in approaching
a solution (measured here by the coefficient b=1.317).

For example, we will see later that there is a four-word starting set
that has a probability vector of [0.9702, 0.0298], so the expected
number of turns is 4 + 1.0298 for Wordle and no more than 4 + 1.0298 N
for a compound game. Already for Quordle this is a smaller number than
3 + 1.317 N. So for small-N games, we might expect LOATH+MURKY+SPINE
to be better than the four-word starting set, but for large-N games
the opposite is likely true.
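
To see where the two bounds cross, here is a small sketch that simply
tabulates  a + b N  for both starting sets. The function name is
hypothetical, and the numbers are only the upper bounds discussed
above, not true expected values for the compound games:

```python
def compound_upper_bound(a, b, n):
    """Upper bound a + b*n on expected turns for an n-fold compound game."""
    return a + b * n


for n in (1, 2, 3, 4, 8, 16, 32):
    three_words = compound_upper_bound(3, 1.317, n)
    four_words = compound_upper_bound(4, 1.0298, n)
    better = "three-word set" if three_words < four_words else "four-word set"
    print(f"N={n:2d}  3+1.317N={three_words:7.3f}  "
          f"4+1.0298N={four_words:7.3f}  smaller bound: {better}")
```

With these coefficients the three-word bound is smaller up to N=3 and
the four-word bound is smaller from N=4 (Quordle) onward.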

(More generally: for Wordle itself (and the smaller compound games)
it may be inefficient to fix more than one or two starting words to
be used every day, but for the larger compound games, it may indeed
be more efficient to begin with a starting set of three or four words.)

Also note that, while we described a way to turn LOATH + MURKY + SPINE
into a 100%-winning strategy for Wordle, it doesn't guarantee success for
the compound games. Our refined strategy (2) will surely win Wordle with
no more than 3 turns after the starting set, but for an N-fold compound
game that means the maximum number of turns needed could be 3 + 3N, in
the (rare) case that all the subgames force the player to use three
additional guesses to discover the word. (Not only would this happen
at most .0356^N of the time, according to an earlier paragraph, but it
would require that none of the words entered to complete any subgame
offer any succor in any of the other subgames -- a very rare situation!)

How else could one offer more information about how the strategies
will fare in the compound cases? Surely failure is possible for N>1
even if it is impossible for N=1. Presumably one could (at least for
very small values of N) itemize a catalogue of the *combinations* of tile
patterns in the subgames that could lead to a loss and perhaps find ways
to circumvent them, as we did above with "preferred" and "out-of-cluster"
moves, but I have not tried to do so. I have tried to run computer
simulations of thousands of randomly-selected Quordle games to see
how the different starting sets compare, but it is not clear how
representative these are, since there are over one trillion different
Quordle games, and many more for Octordle, etc.

                ----------------

To summarize all this notation: in this document we will analyze some
sets of "a" Wordle words to be entered at the start of an "N"-subgame
compound Wordle-like game.  Based on the colored tiles they would yield,
the 2,315 Wordle answer words will be split into "clusters", the sizes of
which are stored in the "cluster vector". On any one day of play, the
player would enter the starting word-set; seeing the resulting colored
tiles tells us the cluster in which the hidden word is contained. The
player will try to discover which word in the cluster is the hidden word,
either by entering candidates at random, or by using a memorized "preferred"
member of the cluster, or by using a pre-computed "out-of-cluster"
word that splits the cluster into smaller clusters (or, in extreme
cases, by using the idea of "preferred" words recursively, or by
entering some pre-determined words over the next *two* turns to resolve
ambiguity). Knowing the starter set and the intended mode of play
(and the particular preferred or out-of-cluster words to be used) we
can calculate the "probability vector" of possible lengths of the game.
In turn, that allows us to compute the expected number of turns
and the expected rate of failure, for this starting set of words.
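
For readers who want to reproduce cluster vectors themselves, here is
a minimal sketch of the computation. It assumes the usual Wordle
coloring rule (greens are assigned first, then yellows from left to
right, limited by how many copies of the letter remain in the hidden
word). The function names and the 'g' / 'y' / '.' encoding are my own
choices for illustration:

```python
from collections import Counter, defaultdict


def feedback(guess, answer):
    """Tile colors for one guess: 'g' green, 'y' yellow, '.' gray."""
    tiles = ["."] * len(guess)
    unused = Counter()          # answer letters not matched by a green
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            tiles[i] = "g"
        else:
            unused[a] += 1
    for i, g in enumerate(guess):
        if tiles[i] == "." and unused[g] > 0:
            tiles[i] = "y"
            unused[g] -= 1
    return "".join(tiles)


def clusters(starters, answers):
    """Group the answer words by the tiles the starting set would show."""
    groups = defaultdict(list)
    for word in answers:
        key = tuple(feedback(s, word) for s in starters)
        groups[key].append(word)
    return list(groups.values())


def cluster_vector(starters, answers):
    """Entry k-1 counts the clusters that contain exactly k words."""
    sizes = Counter(len(c) for c in clusters(starters, answers))
    return [sizes[k] for k in range(1, max(sizes) + 1)]


# Toy illustration only: a handful of words, not the full answer list.
sextet = ["catty", "frond", "rumba", "spill", "verge", "whack"]
sample = ["unzip", "unpin", "mummy", "yummy", "organ"]
print(clusters(sextet, sample))
print(cluster_vector(sextet, sample))
```

Run against the full 2,315-word answer list (not included here), the
same functions give the cluster vector for any starting set.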

This gives us many potential metrics by which to say one starting
set is better or worse than another: we can combine some statistics
from the cluster vector and the probability vector, and if we are using
extra rules to ensure a win, we can count how many there are and
how complicated they are. We can combine these measurements by any
formula that appeals to us -- maybe tossing in other measures too.
(How likely is it that there will be very few colored tiles, in
which case the player might need to waste a turn getting a hint?
How likely is it that there will be so many colored tiles after the
starting set is partially played, that we can jump early into a guess?)
In this document we will choose to balance the many measurements
in a few ways; the reader is invited to do so differently.

Very well, then, let's look at some very good starting sets of Wordle words.
We start with sets of  a=6  words, then progress down to smaller values of  a.

==============================================================================

                               SIX (AND UP)

The six-word set
     [catty, frond, rumba, spill, verge, whack]
is nearly perfect for playing these games. It completely distinguishes
all 2315 Wordle words (despite not including j,q,x, nor z !).
That is, any two different hidden Wordle words will generate different
patterns of green/yellow/gray tiles when these six words are entered.
So if we enter these words as Phase 1, then Phase 2 is simply: enter 
the hidden word and win; there is no need to wonder about the player's
behaviour. There is no guessing nor branching in this routine. After 
the  a=6  starting words are entered, the probability of finishing
on the very next turn is 100%.

So this six-word set is nearly ideal for the N-fold compound Wordle
games: all of them will be completed successfully in N+6 turns, 100%
of the time. Sadly, most of these games (Wordle itself is the case N=1,
then Dordle, Quordle, etc.) allow only N+5 turns to finish the game,
so this starting set has a 0% chance of actually winning the game.

One exception is Octordle's variant that requires the player to 
solve 8 simultaneous Wordle subgames *in order*; because of the
extra level of difficulty, that game allows the player 15 guesses to
try to win. This starting set of 6 words permits the player to win
every such game in only 14 guesses! (Here I use the fact that all 
the answer-words for Octordle are among the 2315 Wordle answers.)

*Almost* serving as another application is Sexaginta-quattuordle, 
which gives the player N+6=70 turns to guess N=64 words --- just enough
to use this starting sextet.  In fact, starting with this sextet gives an
excellent way to play this compound game, since then the player need
only scan the first six rows of the crowded display for each subgame.
Unfortunately, the word list for 64ordle is significantly larger
than that of Wordle. So in 64ordle there are pairs that are not
distinguished by this sextet:
    edged, egged    sided, sized    dozed, oozed
    boded, boxed    dazed, jaded    waded, waxed
including some pairs involving Wordle words* :
  unzip*,unpin    bonus*,bosun    mummy*,yummy   organ*,argon
so the player would not be assured a win with this sextet.  In fact
no sextet will suffice for the full set of 64ordle words. I suspect
there are many septets that will suffice to distinguish every one of
the game's answer words, but with only 70 turns allowed, a starting
septet will not allow the player a victory.

The sextet also definitely fails (slightly) for Dordle, whose
solution-list includes the words UNPIN, YUMMY, and ARGON; the sextet
does not distinguish these from UNZIP, MUMMY, and ORGAN, respectively.
Since the Dordle word list also *deletes* some of Wordle's words,
there may be a different discriminating sextet for Dordle. I have
not looked for one.

Of course, this discriminating sextet also works for those compound
games whose solution set is a subset of Wordle's: Quordle, Octordle,
and Duotrigordle. But it's of no practical value since those games
do not allow N+6 words to be entered.


It is comparatively easy to get more 6-word sets that *almost* split
the whole wordset, and thus it is easy to get many sets of 7 that do.
But, astoundingly, it turns out that this six-word set is the UNIQUE sextet
of Wordle answer-words that splits the Wordle dictionary into singletons
in the way I have described! (I have to say I was very surprised by this.)


A novice player who wants to practice recognizing Wordle words might
want to play with this set, since it allows only one Wordle-correct word
to be built from any set of clues. You could, for example, enter these
words into the sequential version of Sedecordle (because it allows you
to enter so many words) and then practice recognizing Wordle words.

Of course, even when playing this sextet, the player still has to
do some thinking to *recognize* the hidden word each day; knowing
that it is unique, and knowing a few letters in it, are not quite the
end of the story. Of the 30 letter tiles shown after the six words
are entered, the player may see no greens at all and as few as four
yellow tiles (both {b,i,n,o} and {i,n,p,u} can occur) and it can
take some effort to realize the hidden words are "inbox" and "unzip".

If you want to have a set of words that has the same property as the
Magic Six, but includes all 26 letters, you'll need at least eight
starting words. One such solution is
   [cross, equip, expel, flack, jumbo, razor, vodka, wight]

How much help do you need to construct the hidden word? We can even
test for each letter that's ever doubled, though to do so you'll
need at least 17 starting words, e.g.
   [affix, booby, ditto, jazzy, kappa, kayak, mimic, occur, penny,
    piggy, queue, radar, shush, slyly, undid, vivid, widow]

Oh, there are some tripled letters too; if you want those flagged as well
you'll need at least 20 words, e.g.
   [bobby, cocoa, daddy, error, fluff, heath, jazzy, knack, leggy, mamma,
    melee, ninny, pixie, puppy, queue, sassy, slyly, tatty, vivid, widow]
At that point, you know not only which letters appear but how many of each
there are! With 20 words entered, Sedecordle is leaving you space for just
one guess, but you have literally nothing to do but to permute any letters
in yellow! (On average, 3.74 of the tiles are already green; only
      abort, acorn, adorn, avian, axial, offal
have no green tiles and thus force you to consider all 120 permutations of
the five yellow tiles.)

I even offer a starting word-set that relieves the player of all thought!
We do so by finding an (optimal) solution to the game Kilordle, which
requires solving N=1000 Wordle games at once. (In Kilordle, it is not
necessary to *enter* each correct word, merely to get a green tile in
each of its five columns. As an additional assist, the 1000 subgames
are sorted to present first the ones that are closest to completion by
some metric, with the completed subgames removed from view.) But in
fact we can ignore the given subgames completely! Just treat this as
requiring a list of words that contains each letter in each position
-- 130 tasks. Actually 5 of the tasks are never presented in a Wordle
game (e.g. there is no word with an x as the first letter) so we can
always win by entering no more than 125 words. When solving Kilordle
manually, I typically need to enter about 100 words. On the other hand,
at least 26 words would be necessary because we would, among all the
subgames, eventually need to enter every letter in column 2. So the
minimum number of words needed, to be sure to solve every round of
Kilordle, is between 26 and 125.
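
The covering problem can be explored with a simple greedy heuristic.
This is only a sketch under my own naming, and greedy selection is NOT
guaranteed to find a minimum-size list; it just produces some list
that puts every needed letter in every needed column, which gives an
upper bound on the true minimum:

```python
def letter_positions(words):
    """All (column, letter) pairs that occur in the given words."""
    return {(i, ch) for w in words for i, ch in enumerate(w)}


def greedy_cover(targets, candidates):
    """Pick candidate words until every (column, letter) pair found in
    the target words has been entered at least once."""
    needed = letter_positions(targets)
    chosen = []
    while needed:
        # Take the word that supplies the most still-missing pairs.
        best = max(candidates,
                   key=lambda w: len(needed & letter_positions([w])))
        gain = needed & letter_positions([best])
        if not gain:
            raise ValueError("candidates cannot cover every pair")
        chosen.append(best)
        needed -= gain
    return chosen


# Toy illustration with three words rather than the full word lists.
toy = ["pique", "bayou", "banjo"]
print(greedy_cover(toy, toy))
```

Passing the answer list as `targets` and the list of allowed inputs as
`candidates` gives a quick feasible solution to compare against.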

Using optimization software I discovered that the minimum number of
words is actually 35. A sample solution is
    [above, affix, askew, banjo, bayou, civic, debug, eject, epoxy,
     ethos, evoke, extra, fritz, globe, howdy, igloo, imply, jazzy,
     known, leggy, maxim, nymph, ozone, pique, quasi, rajah, scrub,
     skimp, squad, tweak, udder, vinyl, whiff, yacht, zesty]

So not only does this set of words solve every game of Kilordle, it
gives a "simple" way to solve Wordle, too: just enter all 35 of these
words, and then locate the green tiles in each column to form a
5-letter word! Of course, we're now waaay past Wordle's 6-turn limit...

Other 35-word Kilordle solutions exist but all of them must contain
"pique", which is the unique word having q in position 3. They must
likewise all contain "bayou" (u5), and either "banjo" or "ninja" (j4),
"azure" or "ozone" (z2), "eject" or "fjord" (j2), etc. I have already
fiddled with the list to remove words I didn't care for. ( "squib"?
"waxen"? "twixt"?) I'm not sure what I would consider to be the most 
"normal-sounding" list of 35 words. (If you don't mind using words like
"embog" and "jambu" then the minimum drops to 30, using the 12000-word
list of possible Wordle inputs. A solution was posted to Reddit by
user "k3and". As the Times increases the pool of acceptable input
words, the size of a minimal winning Kilordle set can decrease.)


Mathematicians might want to click here for a short description of
how I first found the 6-word set; you'll probably want to read about
measures of word similarity first. The point is that we can talk in a
meaningful way about what it means for two words to be "close" to each
other. (Mathematically, we can impose a metric on the set of these
words, and all our searches for optimal word sets focus on finding
words that are within a small distance of each other, then ensuring
that Phase 1 leaves us with ways to distinguish those words.)

The claims of minimality for the 8-, 17-, 20-, and 35-word sets are
proved by covering-set arguments and computations using Gurobi.

==============================================================================

                              FIVE

With five 5-letter words we can hope to include all but one letter of the
English alphabet, and sure enough this is possible. One example that
comes immediately to mind contains all the letters but  j :
(5-1)     [waqfs, vozhd, blunk, cimex, grypt]

Hah hah, just kidding, that's a bit ridiculous. Not one of those five
words is in the Wordle wordlist, although all five of them are accepted
as input in a Wordle game.  We can get as many as three wordlist
words into the set and still have 25 letters:
(5-2)     [waltz, fjord, chunk;   vibex, gymps]
(This one misses only  q .) But having two words outside the basic
Wordle wordlist is the provable(*) minimum, if you hope to include
25 of the 26 letters, and the only two such sets are that one and
(5-3)     [waltz, fjord, nymph;   vibex, gucks]
and neither of these is particularly great as an opening play in Wordle.
The cluster vectors are [2221,41,4] for the second one and [2242,30,3,1]
for the first, which cannot distinguish {puree, purer, rupee, upper}.
So neither will guarantee you a win of the game in 6 turns.

[ (*) UPDATE: I re-verified this after the NYT increased the set of Wordle
inputs. Simply compare the list of 20-letter quadruples of Wordle words
to the new list of acceptable Wordle inputs; in every case there is
non-empty intersection. I'm waiting for them to decide that "gveck"
is a word; then
    [ fjord, nymph, squib, waltz ; gveck ]
will be a 25-letter quintuple that uses only one word from the list of
acceptable inputs. Alas, until "gveck" becomes a real word, we will have
to use sets of SIX Wordle words if we want to cover 25 -- or all 26 --
letters of the alphabet; "gecko" and "vixen" cover "gveck"+"x".]


A five-word set with 25 distinct letters is impossible if it includes only
zero or one non-wordlist words; the best we can do is 24 distinct letters.
Before I turn to the 24-letter sets formed only from answer-list words,
let me mention one example that does include just ONE non-Wordle word,
which I will do because it's actually a reasonable word. It's (almost)
a sentence, or at least a headline:
(5-4)     [quick, waltz, vexes, fjord, nymph]
(It's got two e's while missing b and g . "Vexes" is not in the Wordle
wordlist, being a third-person singular form of a verb.)  Entertaining
though it may be, it's not as perfect for Wordle as the sextet in the
previous section: the colored tiles returned from these five words are not
sufficient to distinguish "error" from "gorge", "blast" from "stall", etc.
Its cluster vector is a disappointing  [1637, 197, 49, 12, 5, 6, 4] .

                ----------------

That's the last word set I analyzed that uses non-Wordle words; for the rest of
this section, I only consider sets of words from the Wordle solution-word list.

Of all the 5-word sets made of Wordle answer words, none have 25
distinct letters. There are 58 5-word sets with 24 different letters.
Four of them include one word with a repeated letter, e.g.
(5-5)     [blitz, chump, fjord, gawky, seven]
and the rest have a pair of words sharing a letter, e.g.
(5-6)     [coven, fjord, gawky, plumb, sixth]

As starting sets for playing Wordle and the other games, I would
argue that these two are each the best in their class. But despite
revealing nearly all the letters in the hidden word (they miss q,x
and q,z respectively) they still don't quite pin down the word
unambiguously: the cluster vectors are [2260, 26, 1] for the first
(it cannot distinguish {odder, order, rodeo}) and [2253, 31] for
the second. Thus we cannot be sure to win by turn 6 with either
of these quints.

Obviously by flipping a coin for the 31 tricky clusters of the second
quint, we would have a probability vector of [0.986610, 0.013390] :
a 1.34% chance of losing and an average of 6.0134 turns to win.

The first quint (5-5) is a little different, though. That one triple
{odder, order, rodeo} does benefit from preferentially choosing
"odder" or "order" instead of "rodeo". So we can run two analyses:
(1) using a guess-at-will strategy, the probability distribution is
    [0.987904, 0.011663, 0.000432]
(2) Or we can use instead a "guided hard mode" strategy: 
    play "order" as a preferred word. Now the distribution is
    [0.987904, 0.012095]
Either way there's still a 1.21% chance of losing. But taking the
extra effort for one tough cluster lowers the *maximum* game length 
from 8 to 7, and the *average* game length from 6.0125 to 6.0121 .
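
The guess-at-will figures can be recovered from the cluster vector
alone. The sketch below uses a deliberately simple model, which is my
assumption: a wrong guess inside a cluster rules out only the guessed
word, so a cluster of k words contributes one word to each of the
turns 1, 2, ..., k. That is exact for pairs and matches the numbers
quoted above for (5-5), but it ignores any extra information that a
wrong guess might give in larger clusters:

```python
def guess_at_will_distribution(cluster_vector):
    """Probability vector under the simple model described above.
    cluster_vector[k-1] is the number of clusters holding k words."""
    total = sum(size * count
                for size, count in enumerate(cluster_vector, start=1))
    dist = []
    for turn in range(1, len(cluster_vector) + 1):
        # Every cluster with at least `turn` words has exactly one
        # word that is found on this turn.
        solved = sum(cluster_vector[turn - 1:])
        dist.append(solved / total)
    return dist


print(guess_at_will_distribution([2260, 26, 1]))   # quint (5-5)
print(guess_at_will_distribution([2253, 31]))      # quint (5-6)
```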

                ----------------

Next we set aside the desire to include 24 different letters, and just
look for ANY good set of five Wordle words.

Is there any five-word set that's as good as the six-word set of the
previous section --- one that always narrows down the set of possibilities
to just one word (and thus guarantees a win by turn 6)? The answer is
provably "no". I have an elementary argument that explains why no such
perfect quint exists.  But it's faster to simply use the Linear Programming
techniques already described in this document, especially since
(a) That technique also proves we cannot even find a perfect quintuple
    among the much larger set of Wordle's list of recognized inputs, and
(b) LP techniques helped me find the "tough pairs" that I used to create
    the elementary argument, anyway.
(The LP techniques are also used to prove the uniqueness of the six-word
set in the previous section, and that uniqueness in turn trivially proves
that no five-word set can detect every hidden word unambiguously.)

We will return to observation (a) in a moment.

                ----------------

Nonetheless, some really good sets of five Wordle solution words do exist.

A provably most-efficient-possible 5-word set is:
(5-7)     [blank, chump, goody, river, swift]
(It lacks jqxz; has double r, double o, and two i.)
Laurent Poirrier found this one and we have proved that no other
quintuple is better in the sense that this quint can distinguish
all the Wordle answer words except for these 11 pairs:
    [ample, maple]  [booby, boozy]  [bugle, bulge]  [chili, chill]
    [eagle, legal]  [gauge, gauze]  [jaunt, taunt]  [lemon, melon]
    [pasty, patsy]  [skate, stake]  [testy, zesty]
That is, its cluster vector is [2293, 11]. So there is no need for
any strategy but free guessing --- just flip a coin for those 11 clusters.
And thus we compute a distribution vector of [0.995248, 0.004752], 
and so the loss rate is 0.48% and the average number of turns is 6.0048 .

The only other quintuple that is equally efficient is
(5-8)     [bawdy, clove, furor, might, spank]
(No jqxz; has double r, two o, and two a); it has the same cluster 
vector and distribution vector, and hence the same average and loss rate.

If we sequentially ran  N  independent random processes, 99.52%
of which finished after 1 step and the remainder after 2 steps,
then the number of steps needed to complete all of them could
be anywhere between  N  and  2N ; the probability that exactly
k  of these  N  processes took that second step to complete
would be  binomial(N,k) (.9952)^(N-k) (.0048)^k, and the
expected number of steps taken would be  1.0048 N . That almost
models what happens here with an N-fold compound Wordle game
like Quordle (N=4) : the expected total number of turns needed,
if we begin with either of the starting quintuples above,
would be 5 + 1.0048 N  IF the  N  subgames were independent.
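
Here is a small sketch of that idealized independent model, with
function names of my own choosing. I take p = 0.004752, the second
entry of the distribution vector given above for (5-7), as the chance
that a single subgame needs its second step:

```python
from math import comb


def second_step_pmf(n, p):
    """P(exactly k of n independent processes need a second step)."""
    return [comb(n, k) * (1 - p) ** (n - k) * p ** k
            for k in range(n + 1)]


def expected_steps(n, p):
    """Expected total steps: n first steps plus k second steps."""
    pmf = second_step_pmf(n, p)
    return sum((n + k) * prob for k, prob in enumerate(pmf))


p = 0.004752
for n in (1, 4, 8):
    print(n, expected_steps(n, p), 5 + expected_steps(n, p))
```

The last column adds the five starting words, giving the figure that
would apply to an N-fold game if the subgames really were independent.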

But they're not! Suppose for example we are using the first of these
two quintuples. If after entering those words we have concluded that
in one of the N subgames the hidden word could either be "ample" or
"maple", and we also know that the hidden word in another subgame is
either "bugle" or "bulge", then we should indeed flip a coin to enter
either "ample" or "maple"; but then (by looking to see whether the L
is yellow or green) we would know whether the other hidden word is
"bugle" or "bulge". As it turns out, for EVERY one of the 11 pairs at
least one of the words in the pair can resolve at least one of the
other 10 ambiguous pairs, so we should preferentially play those
words and reduce the ambiguity in another subgame. In fact, it is 
impossible for a game to require more than 10 + N turns to complete
(far fewer than 5 + 2N except for the smallest N) because so many of
the ambiguous pairs include words to guess preferentially so as to
discover the hidden words in other pairs! A maximally bad example
includes N=5 subgames with the five hidden words being
    chili  gauge  jaunt  lemon  skate
Such a game would require five coin tosses, which if they are all
unlucky would cost us 10 turns to win, after the initial quintuple
is entered.

Since the subgames are very likely NOT independent, then, we can
only conclude that the expected number of turns to complete an
N-fold  compound game, when starting with one of these two quintuples,
is *at most* 5 + 1.0048  N .

When playing the typical  N-fold  compound games that allow only
N+5 turns to complete the game, then after the starting quintuple
is entered, we must finish every one of the subgames with just
one turn (each). That would happen with probability (0.9952)^N
IF the games were independent, but as above we notice that the
earlier subgames can provide additional information to help resolve
the later ones. So in fact the probability of success is, for all  N,
at least  0.9952^5 = 0.97623 (i.e. a 97.623% chance of winning),
assuming the player resolves any of the 11 ambiguous cases in
an advantageous order.

                ----------------

When I claim that the previous two starting quintuples are optimal,
what I mean is that they minimize the number of pairs of words that are
not distinguished from each other, and consequently they minimize the
expected number of turns needed until a win.  The proof of their optimality
comes from searching for sets of five words that maximize the pairs split
among a select list of pairs of similar words. Searching for maximizing
quints in that way allows us to discover other quints that are nearly as good.

The next few close contenders for "best" starting quintuple (all of which
happen to contain all letters except the rare  j,q,x, and z)  are these:
(5-9)     [bawdy, furor, month, speck, vigil]
has cluster vector [2292, 10, 1] and distribution vector [0.9948, 0.0052].
(Nothing is better than guessing at random for the triple {bobby,booby,boozy}.)
(5-10)    [flock, haven, rugby, swept, timid]
has cluster vector [2291, 12] and the same distribution vector [0.9948, 0.0052].
(5-11)    [bawdy, chump, front, skill, verge]
has cluster vector [2285, 15] and distribution vector [0.9935, 0.0065].
(5-12)    [batty, champ, furor, slink, wedge]
has cluster vector [2282, 15, 1] and distribution vector [0.9927, 0.0073].
(Same ambiguous triple as (5-9), so guess-at-will is as good as anything.)


For the curious: these are the six quints that cover the largest
numbers among the 919 hardest splitting sets, that is, I checked the
919 hardest pairs of words to differentiate, and these quints covered
the most --- at least 907 of them --- and moreover these quints *did*
distinguish any pair of words that wasn't on this list of 919 tough pairs.

For completeness' sake, I ran a similar test with the much more
accommodating set of 14,853 currently-allowed input words for Wordle.
There exist (multiple) sets of five words which successfully distinguish
all the Wordle answer-words except for 8 pairs, and  8  is the minimal
number of failures. For example, for
(5-13)    [spill, verge, dumbo, fawny, chott] 
the cluster vector is [2299, 8]: the non-singleton clusters are
     {algae,glaze}    {crock,crook}    {dried,drier}    {husky,hussy}
     {liken,linen}    {odder,order}    {piper,riper}    {rebar,zebra}
So the distribution vector must be [0.9965, 0.0035], and thus the
average number of turns is 6.0035 and the failure rate is 0.35%.

I cannot predict what "words" will someday be allowed as input for
Wordle, but I can guarantee that it will never be possible to enter
five words and unambiguously know what the hidden word is, to be
entered on turn 6. I considered 945 of the hardest pairs of words
to separate, and used Gurobi to determine the largest number that
could be distinguished by ANY combination of five strings of five
letters each. It reported that the maximum is 943, that is, at least
two of those pairs would go unsplit. (There were several sets of
five "future words" that would accomplish this, but only after some
experimentation did I find one that *also* split all pairs not on
my chosen list of 945: the starting quintuple of these "words"
     [serer, calvl, hyott, gmudn, fpibk]
has a cluster vector of [2311, 2], the only unsplit pairs being 
crook/crock and gauge/gauze. SERER is actually on the current list
of admissible inputs; none of the others is. The letters can be
permuted within columns to get other equivalent starting sets, as
long as the doubled e, l, r, t stay doubled within a word.)

If you really want to stretch the notion of the Wordle game, suppose
we start the game with these ... 5.2 words (?!):
  [spaul, flyin, doogh, crrew, mktbt, v****]
The tile colors that we get in response will almost always identify the hidden
word; its one failure is eager/gazer. But this set consists of five
complete "words", and the extra letter is right at the front of the sixth
word (so can we call it 5.2 words?). As a bonus the first two words are
actually admissible Wordle inputs, and frankly if you told me the next
two were Wordle words, too, I'd believe you. (They're not.) As you can
see there are words with two Os, two Rs, and two Ts (and those letters
need to be together in a word); there are also two Ls which you could
put into the same word but there's no need to.

(The tool for this optimization is to allow 130 variables for which
letter/position pairs are used, together with 26 variables for which
letters will be doubled within a word, and 10 more variables to
indicate which letters would be tripled within a word. Together,
these account for all the mechanisms by which a set of starting 
words can distinguish any particular pair of Wordle solution words.)


It is easy to find many starting quintuples which give success rates
over 99%, but never 100%, so perhaps we are just splitting hairs here.
But one quint of note is
(5-14)    [carve, sight, downy, plumb, fetal]
with cluster vector [2269, 20, 2] and distribution [0.9896, 0.0104].
(Both triples can be solved using freeform guessing.) The significance
of this quintuple is that it extends the best quadruple we will find
in the next section (the first four words here). So playing the 
first four words first already gives a high probability of solving a
Wordle puzzle without entering all five words; and those first four
words already use 20 different letters, making it easier to guess the
hidden word even when it is known to be unique.  Alternatively, this
quint is good as a backup plan for a player intending to use that
"best" quadruple but who has trouble discerning what word to enter
next; FETAL offers the most additional help (but does come at a cost
of a 1% failure rate that was not present for a player who uses that
quadruple and CAN discern the unique solution word!)


So to summarize, we have found several quintuples of starting words that
*almost* always enable us to know the hidden word and enter it as turn 6
and win --- but not one of them allows us to win 100% of the time.
This will change when we get to quadruples!


But first, we need a little digression...

==============================================================================

               INTERLUDE: THE DIFFERENT WAYS TO RANK PROPOSED STARTING SETS

As we discuss smaller starting sets, there will not be a single "best" starting
set because there are different ways to rank or score the candidates.
In order to rank them, you have to know what it is that you value most!

The main question to ask before ranking is: do we want to rank
the candidates based only on how the game looks immediately after
the starting set is entered? Or should we "play the long game"
and incorporate into our ranking the knowledge of how we will
proceed for the rest of the game?

We'll consider the first possibility first; we'll see there are
multiple ways to measure just how well the starting set has worked.
Later, we'll investigate rankings based on two possible strategies
the player might use to finish the game: a guess-at-will procedure,
or a procedure based on pre-computing a few ideal moves to make
in just enough cases to ensure a victory by move 6.

(One can go further and assume the player has worked out a strategy
for more than the minimal number of cases, maybe even computing an
ideal move to make at every turn. Since this article is about
*human* players, we will not pursue such advanced options.)

We can apply these rankings to any sets of candidates. Primarily
we will use them to make a systematic examination of all the
starting sets that repeat no letters --- this restriction
will give us a manageable set of candidates to examine methodically,
which generally speaking contains the best starting sets of any size.
Here are the counts of such sets (and some links to the lists):
    Sets        How many exist
 quintuples+            0
 quadruples        45,147
 triples        1,243,026
 pairs            196,175
 singletons         1,566


Note: many times our rankings will necessarily put two candidates at
a tie. One reason is that our lists of candidate starting sets include
many "perfect anagrams": pairs of starting sets that have the same
letters in the same columns and therefore will return precisely the 
same colored tiles no matter what the hidden word. For example the 
triples [crone, guilt, shady] and [crony, guide, shalt] are perfect
anagrams of each other. Some extreme examples are
     [place, trunk], [plane, truck], [plank, truce], [plunk, trace]
and
  [twang,slump,cried], [tried,swung,clamp], [tried,swamp,clung], [tramp,swing,clued]
Starting sets that are perfect anagrams of each other result in the
same progress of the game: they partition the dictionary into the same
clusters, they give the same colored tiles, etc. So when we rank
the candidate starting sets, we will mention only one of the pair, and
relegate its perfect anagram(s) to a footnote.  These sets of anagrams
will appear when we review starting quadruples, triples, and pairs.
(But no two single words can be perfect anagrams of each other!)
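
The perfect-anagram test is easy to mechanize. What follows is a
minimal Python sketch, not the code used for this article: the function
name is invented here for illustration, and it assumes five-letter
words and starting sets that repeat no letters (the case discussed
above).

```python
def are_perfect_anagrams(set_a, set_b):
    """True if both starting sets place the same letters in the same columns."""
    def column_letters(words):
        # For each of the 5 columns, the sorted letters the set puts there.
        return [sorted(word[col] for word in words) for col in range(5)]
    return column_letters(set_a) == column_letters(set_b)

print(are_perfect_anagrams(["crone", "guilt", "shady"],
                           ["crony", "guide", "shalt"]))
print(are_perfect_anagrams(["place", "trunk"], ["plunk", "trace"]))
print(are_perfect_anagrams(["trace"], ["crate"]))
```

The first two calls use examples from the text; the last shows that
ordinary anagrams such as TRACE and CRATE do not qualify, because
their letters sit in different columns.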


(A) COLOR METRIC(S)

Very well then; how can we assess how good the player's situation is,
right after entering the starting triple (i.e., making no assumptions
about the player's actions thereafter)? All the information we could
use at that moment is presented in the tiles that have been turned
green/yellow/gray by the starting set. 

Since we are assuming throughout this document that the player
knows the Wordle dictionary, that implies the player can mentally
run through the list each day to find the words that match the 
colored tiles before him: he knows each day what cluster of words
contains the day's hidden word. We will use that information in
part (B), below. But let's concede that a typical human player will
instead try to construct the hidden word candidates from just
the green and yellow letters; so he wants to have a lot of those!
For example, in the Introduction, we gave an example of a display
of colored tiles that corresponded to only one possible word (WEDGE)
yet the situation would have been difficult for a human because 
there was only one colored tile to go on.

So one of the ways we will rank different starting sets will be
in terms of how much information we get from the starting
set. Our proxy for that will be simply to count how many yellow
and green tiles they produce; more is better than fewer.

We will primarily do this for starting sets without repeated letters.
If a starting set includes two words sharing the same letter in the
same position, the counts of the green, yellow, and grey tiles will
over-estimate the amount of information obtained from the starting
set. (That's also true if the starting set includes a word with a 
repeated letter: if, say, the left-most  E  goes grey, then we gain
no new information from observing that the other  Es  are grey too.)

Of course the counts will vary depending on what the daily hidden word
is, so these should be interpreted as averages --- expected values of
a random variable. Equivalently, we can simply count the numbers
G  and  Y  of green and yellow tiles that will show up day after day,
across an entire 2315-day cycle. (Divide by 2315 if you wish to
compute averages.)
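
To make the counting concrete, here is a small Python sketch of how
G and Y could be tallied. It is an illustration under assumptions: the
function names are invented here, the tile rule is the usual Wordle
one (greens first, then yellows limited by how many copies of a letter
remain unmatched), and `toy_solutions` is a made-up three-word stand-in
for the real 2315-word list.

```python
from collections import Counter

def feedback(guess, hidden):
    """Tile colors for one guess: 'G' green, 'Y' yellow, '-' gray."""
    tiles = ["-"] * 5
    unmatched = Counter()
    for i, (g, h) in enumerate(zip(guess, hidden)):
        if g == h:
            tiles[i] = "G"
        else:
            unmatched[h] += 1          # hidden letters still available for yellows
    for i, g in enumerate(guess):
        if tiles[i] != "G" and unmatched[g] > 0:
            tiles[i] = "Y"
            unmatched[g] -= 1
    return "".join(tiles)

def color_counts(starting_set, solutions):
    """Total green and yellow tiles over one full cycle of hidden words."""
    greens = yellows = 0
    for hidden in solutions:
        for word in starting_set:
            tiles = feedback(word, hidden)
            greens += tiles.count("G")
            yellows += tiles.count("Y")
    return greens, yellows

toy_solutions = ["wedge", "crane", "sight"]   # hypothetical stand-in list
print(color_counts(["carve", "sight"], toy_solutions))
```

Dividing both totals by the length of the solution list gives the
per-day averages.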

Then each starting set can be plotted as a point in the (G,Y) plane;
the points that are farther out correspond to the starting sets that
give the player the most information. Mathematically the 
points of interest are the points that are on the boundary of the
convex hull of this set of points --- the ones that would 
snag a lasso tightening around these points.
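
As a sketch of how those boundary points might be extracted, the
Python below walks the upper convex hull (a standard monotone-chain
construction, swapped in here for illustration) and keeps only the
stretch between the point with the most yellows and the point with the
most greens. The function name is invented; the first four (G,Y) pairs
are values reported for quadruples elsewhere in this document, and the
last two are made-up interior points.

```python
def frontier(points):
    """Hull vertices that maximize G + f*Y for some weight f >= 0."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    upper = []
    for p in sorted(set(points)):          # left to right by G, then by Y
        # Drop vertices that would make the chain bend upward or run straight.
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) >= 0:
            upper.pop()
        upper.append(p)
    # Keep the stretch from the highest-Y vertex to the highest-G vertex.
    top = max(range(len(upper)), key=lambda i: (upper[i][1], i))
    return upper[top:]

points = [(4117, 6030), (4096, 6096), (4047, 6249), (2151, 8137),
          (3000, 3000), (4000, 6000)]      # last two are hypothetical
print(frontier(points))
```

The two made-up points are dropped because no choice of weights makes
either of them the best candidate.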

In order to actually rank the candidates, we have to decide how
much information we get from a yellow tile as opposed to a green one.
If each yellow is worth a fraction  f  of a green, then the metric
we would use to rank the candidates is simply  G + f Y . Natural
choices might be f=0 (if only the green tiles are of interest),
f=1/2 (if you'd be willing to swap two yellows for a green),
or f=1 (if every colored tile is equally valuable to you). 
But we can determine the ranking of the candidates for every  f.
Mathematically we can even ask about f>1 although this would
make no sense in the context of Wordle!

It's not hard to show that when f=1, the metric depends only on the
letters involved, not their positions; anagrams might replace yellow
tiles with green ones or vice versa, but no matter what the day's
hidden word, the total number of colored tiles is the same for both
permutations of the letters. This will mean that when  f=1,  there
is likely to be a large multi-way tie for "best" starting set,
each candidate set being an anagram of the others.

So this becomes our first (family of) metric(s): we can show
the highest-ranking candidates for  f=0  --- that is, the candidates
that produce the most green tiles, on average; then a list of 
ranges of f < 1 on which the ranking doesn't change, and in
each range we can show the rankings of the candidates using
the values of  G + f Y .

Jeff Dooley has proposed an interesting variation: just as we might
weight the yellow tiles differently from the green ones, we might
also weight the colored tiles differently depending on the letter
revealed. (Not only does a green letter help the player more than a
yellow one, but since J is so much rarer than E among Wordle words,
the presence of a yellow J is much more valuable than a yellow E.)
I have not pursued this very far yet. He remarks that simply
weighting consonants differently from vowels can lead to different
rankings; for example the best starting pair under such an
assumption came out to be CRONE + SHALT.

Another alternative using the color tiles is to try to maximize
the minimum number(s) of colored tiles day after day:
the better starting set is the one that never leaves the player
high and dry. We won't pursue this systematically but can make
some observations about examples.

Since these rankings depend only on the numbers G and Y, it will
happen that two candidate starting sets rank equally for every value
of  f, if they should happen to have the same values of G and of Y.
That happens for perfect anagrams of course; it can also happen "by accident".

(B) CLUSTER METRIC(S)

Next we assess the situation after the starting play a little
differently. A player who really is familiar with the set of
solution words might not need so many colored tiles to identify
the candidate words; but he might appreciate a starting set
that generally leaves little ambiguity about what the word is, 
that is, the player would prefer that the starting set partition
the dictionary into a lot of small clusters rather than fewer
larger ones.

On the assumption that all the dictionary words are equally
likely to be the hidden word, it makes no difference which
words are in the cluster: the quality of the situation
depends only on how many words are in the cluster. (In parts
(C) and (D) below, we will consider ways in which one cluster
may be viewed as better or worse than another cluster of the
same size.)

So from this perspective, all the information we need to judge
and compare candidate starting sets is the cluster vector 
v = < v1, v2, ..., v_N >  that shows the numbers v1, v2, ... 
of clusters of sizes 1, 2, ... We'd like to rank more highly
the starting sets for which the numbers of small clusters are
high and the other numbers drop off rapidly. So we will form
various metrics that can be calculated from  v  and which grow
larger or smaller for the better or worse starting sets.
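
For readers who want to compute a cluster vector themselves, here is
a minimal Python sketch. The names are invented for illustration, the
tile rule is the usual Wordle one, and `toy_solutions` is a made-up
five-word list rather than the real dictionary.

```python
from collections import Counter

def feedback(guess, hidden):
    """Tile colors for one guess: 'G' green, 'Y' yellow, '-' gray."""
    tiles = ["-"] * 5
    unmatched = Counter()
    for i, (g, h) in enumerate(zip(guess, hidden)):
        if g == h:
            tiles[i] = "G"
        else:
            unmatched[h] += 1
    for i, g in enumerate(guess):
        if tiles[i] != "G" and unmatched[g] > 0:
            tiles[i] = "Y"
            unmatched[g] -= 1
    return "".join(tiles)

def cluster_vector(starting_set, solutions):
    """v[i-1] = number of clusters containing exactly i solution words."""
    # Words that produce the same tiles for every starting word share a cluster.
    clusters = Counter(tuple(feedback(w, hidden) for w in starting_set)
                       for hidden in solutions)
    sizes = Counter(clusters.values())
    return [sizes.get(i, 0) for i in range(1, max(sizes) + 1)]

toy_solutions = ["haunt", "jaunt", "vaunt", "skate", "wedge"]  # hypothetical
print(cluster_vector(["crane"], toy_solutions))
```

In this toy case CRANE cannot tell HAUNT, JAUNT, and VAUNT apart, so
the vector shows two singletons and one cluster of three.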

In the Wordle "literature" there are multiple metrics of this type;
we will outline a few of them below. But actually, all of their
rankings can be determined in a uniform way by computing a version
of mathematicians' "L^p metric": For any number  p  we can compute
a number for each candidate starting set, based on the cluster vector:

   Lp ( v ) = sum(  v_i * i^p )
            = sum of the p-th powers of the sizes of all the clusters

We will even extend this to the values of  p = - infinity ( meaning that
i^p = 0 unless i = 1 ), and p = + infinity ( meaning we find the ranking
of the candidates that applies for all large  p ; equivalently, we
rank the candidates by the value of  lim( p -> infinity )  (Lp)^(1/p) . )
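
A direct transcription of this formula into Python might look like the
sketch below. The function name is invented; the vector used is the
one reported for quadruple (4-1) in the FOUR section. The case
p = +infinity is deliberately not handled here, since it is a rule for
ordering candidates rather than a finite sum.

```python
import math

def lp(v, p):
    """Sum of the p-th powers of all cluster sizes; v[i-1] counts clusters of size i."""
    if p == -math.inf:
        return v[0]                      # only the singleton clusters survive
    return sum(count * size ** p for size, count in enumerate(v, start=1))

v = [2182, 59, 5]                        # cluster vector of (4-1), from the text
for p in (-math.inf, 0, 1, 2):
    print(p, lp(v, p))
```

For this vector the four printed values are the number of singletons,
the number of clusters, the number of words, and the sum of squared
cluster sizes.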

Which indicates a better candidate, a larger value of  Lp  or a smaller one?
Suppose two candidates have nearly identical cluster vectors, but one
has a single cluster of  2C  elements, and the other has two clusters
of  C  elements each. Clearly the second candidate is the better one
for the player. How do their  Lp  values compare? They have all the
same summands except for a term  (2C)^p  for the first candidate, and
2 C^p  for the second. Thus if the  Lp  metric is to indicate which
candidate is better, then we must declare that  2  is better than  2^p.
This means that for  p > 1, the better candidates are the ones with
smaller values of the  Lp  metric; for  p < 1  the sense is reversed.

I don't have a ready explanation of what exactly each metric Lp
measures. But think of it this way: using different values of  p > 1 
allows us to decide just how badly we want to avoid having large
clusters; different values of  p < 1 (especially negative  p)  allow
us to decide how strongly we want to favor having small clusters.

There are key values of  p  for which the Lp ranking matches the
rankings that other people have investigated.

p = 1 : This sum  L1 = sum( v_i * i ) simply counts all the words in
all the clusters, and so is the same value for all candidates (L1 = 2315).
(Even though all candidates are tied for best when p=1, there will
of course be winners using the metrics with p near 1 . The starting
sets that already give the smallest  Lp  values when p is just larger
than 1, or the largest values when  p is just smaller than 1, are
those making the smallest changes from L1 = 2315; mathematically
this rate of change is the derivative of  Lp,  which works out to
sum( v_i * i * log(i) ). Thus it makes sense to use this expression
to give us a ranking that applies when p=1; smaller is better. )

p = 0 : Since i^p = 1  for every cluster size  i,  this sum  L0 = sum(v_i)
is simply counting the clusters, i.e. the number of possible ways the
tiles can be colored over the years. (Note that the arithmetic average
of the sizes of all the clusters is  L1 / L0 = 2315/L0, so maximizing
the number of clusters, L0,  also minimizes the average of their sizes.) 
Also, it is an elementary probability exercise to see that
the probability of successfully guessing the hidden word on the very
next turn, when using a guess-at-will strategy, is exactly  L0 / 2315:
so the starting triple with the highest value of  L0  is the one that
makes it most likely that we'll enter the hidden word by our fourth turn.

p = - infinity : Treating  i^p  as 0 for every value of  i > 1 means
that for this value of  p,  Lp  is counting the number of singleton
clusters: the number of words that are determined unambiguously
by the tile colors. Equivalently, this  Lp  counts the number of days
per cycle that we can be sure of the hidden word; subtract from
the total number of days (2315) to get the number of days we must either
select a word at random from the cluster as a guess, or determine a
different strategy for that cluster that can guarantee a win by turn 6.
We will return to those options in parts (C) and (D), below.
(OK, OK, "-infinity" is not a real number, so we are instead
using the ranking provided by all sufficiently negative  p .) 

p = 2 : Over the whole 2315-day cycle, every word will eventually be
the hidden word once; thus a cluster with  i  elements will be
recognized as the cluster containing the day's hidden word just that
many times. Hence when we tally the size of the day's cluster, day
after day, we are computing  L2 . (We could then divide  L2  by the
number of days, 2315, to obtain the "average daily ambiguity".)

p = + infinity : the last term  v_N * N^p  dominates the others for large  p,
so the candidate with the smallest value of  Lp  will be the one(s) for
which the largest cluster is as small as possible; ties are broken by
counting the number of clusters of that size. (Any remaining ties are
broken by the next-largest term, which similarly considers the second-largest
size of a cluster, and so on.) 
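
The large-p ordering never requires computing a huge power: it is just
a comparison of cluster vectors read from the top down. A hedged
Python sketch (the function name is invented; the three vectors are
ones quoted for quadruples in this document):

```python
def large_p_key(v):
    """Sort key for p -> +infinity: largest cluster size, then counts from the top."""
    v = list(v)
    while v and v[-1] == 0:              # ignore trailing zeros
        v.pop()
    return (len(v), tuple(reversed(v)))

vectors = [[2199, 50, 4, 1], [2186, 57, 5], [2148, 82, 1]]
for v in sorted(vectors, key=large_p_key):
    print(v)
```

Sorting in ascending order puts the best candidate first: a shorter
vector beats a longer one, and among vectors of equal length the one
with fewer clusters of the largest size wins.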

We recognize all these special cases as metrics of interest that provide
worthwhile rankings of candidate starting sets; but by considering
the rankings that result from all values of  p,  we can see
how the different rankings morph into each other as  p  varies.
Here is an illustration, showing the rankings of the top
single-word starting sets, for all values of  p.

Sometimes a tweak in the metric(s) may be appropriate.
For example, after a starting triple is entered, the player has
three more turns to try to enter the day's hidden word. It's great
if the clues obtained from the starting set uniquely identify
the hidden word, but even if there are two or three possibilities
we are comfortable: we know we can keep guessing until we land on the
correct one and still we will have won by turn 6. So perhaps we
should try to maximize not the number of singletons but rather the
number of days when we can confidently just try all the words in
a cluster: don't maximize  v1  (which is what happens with  p=-infinity)
but rather maximize  v1 + 2 v2 + 3 v3 . (Along the same lines, I have
also looked to maximize the percentage of clusters that are
this small; that is computed as (v1+v2+v3)/L0.  ) Of course we
would rank starting quads in this way by maximizing only v1+2v2,
and similarly adjust for starting sets of other sizes.

Parallel to the comment in (A) about candidates with the same (G,Y) 
measurements, note that all these Lp metrics are computed only from
the cluster vector. Two candidates with the same cluster vectors will
end up ranked equally for every value of  p. This happens with 
perfect anagrams, of course, but can also occur in other cases, 
particularly when the cluster vector is short (as happens with the
best starting quadruples, for example).

As you can imagine from the foregoing discussion, there is no end to
the set of ways that one may assign a "score" to every starting
set. Besides inventing new quantities to measure, we also have the
option to combine several existing measurements into one; how we mush
them all together is a personal choice.  (For example, in my files of
the pairs and triples that have no repeated letters, I sort them
according to the value of 5 L0 + 2 G + Y .) For this reason I will try
to determine not only the "best" starting set by each metric, but also
a couple of runners-up --- one of them might be "pretty good" by lots
of separate metrics, and thus become a player's go-to starting set.

(C) GUESS-AT-WILL RANKINGS

So far we have considered only the ways to compare and rank different
starting sets without regard to how the player will proceed afterward.
We can make more useful rankings if we know what the players would 
actually do in later turns; but what strategies might they use?
Of course a player may adopt a complicated playbook of their own
design, but from here forward we will assume the player will either
(a) switch entirely to a guess-at-will strategy, running a risk of losing,
or (b) pre-compute a simplest-possible winning strategy, finding a word
(or sequence of words) to play in those cases when the colored tiles
indicate a cluster that might lead to a loss if we use strategy (a).

For a player pursuing strategy (a), I can think of only two natural
measures by which to rank the starting sets. We want both these numbers
to be as low as possible:

  M6 = probability of a loss (i.e. not guessing the word by turn 6)

  M7 = average number of turns needed to win (including occasions when
        7 or more turns are used)

Both these numbers are computed readily from the probability distribution
that shows the probability that the player will guess the hidden word
after 1, 2, 3, ... additional turns. We will compute those distributions
assuming that the hidden words occur with equal probability, and assuming
that the player selects words from the cluster with equal probability.

Note that at this point, not all clusters of a given size are equally
difficult. Among clusters of three words, for example, we have seen in
previous sections clusters like {skate, stake, state}, in which any
wrong guess gives extra information to reveal what the hidden word
must be; and clusters like {haunt, jaunt, vaunt}, in which each
wrong guess provides no extra information. (So the first cluster
adds a vector (1/3, 2/3, 0) to the probability distribution, while
the second cluster adds (1/3, 1/3, 1/3) .)
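
These per-cluster vectors can be computed mechanically. The Python
sketch below is an illustration under the assumptions just stated
(hidden word uniform over the cluster, each guess uniform over the
words still consistent with the clues); the function names are
invented. It enumerates every line of play, so it is only practical
for the small clusters that arise here.

```python
from collections import Counter
from fractions import Fraction

def feedback(guess, hidden):
    """Tile colors for one guess: 'G' green, 'Y' yellow, '-' gray."""
    tiles = ["-"] * 5
    unmatched = Counter()
    for i, (g, h) in enumerate(zip(guess, hidden)):
        if g == h:
            tiles[i] = "G"
        else:
            unmatched[h] += 1
    for i, g in enumerate(guess):
        if tiles[i] != "G" and unmatched[g] > 0:
            tiles[i] = "Y"
            unmatched[g] -= 1
    return "".join(tiles)

def turn_distribution(cluster):
    """Map k -> probability that the hidden word is entered on the k-th extra turn."""
    dist = {}

    def walk(candidates, hidden, turn, prob):
        share = prob / len(candidates)   # each consistent word is equally likely
        for guess in candidates:
            if guess == hidden:
                dist[turn] = dist.get(turn, 0) + share
            else:
                pattern = feedback(guess, hidden)
                rest = [w for w in candidates
                        if w != guess and feedback(guess, w) == pattern]
                walk(rest, hidden, turn + 1, share)

    for hidden in cluster:
        walk(list(cluster), hidden, 1, Fraction(1, len(cluster)))
    return dist

print(turn_distribution(["skate", "stake", "state"]))
print(turn_distribution(["haunt", "jaunt", "vaunt"]))
```

Weighting each cluster's result by its share of the dictionary and
summing gives the full distribution from which M6 and M7 are computed.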

(D) RANKINGS THAT ASSUME A (MINIMAL) STRATEGY

The last way to rank and compare starting sets is to indeed take into
account the player's actions after entering the starting set, but
to assume the player will do something other than enter a cluster word
at random.

In order to compute such a ranking we have to decide in advance what
playbook we think the player will follow. Here there are multiple
options. In this document, we will assume that such a player will
above all want to find the hidden word by turn 6, but even beyond that,
there are multiple options. In our ranking of candidate starting sets,
we choose to review just one strategy per candidate. Then, we shall
simply rank the candidates by the average number of turns their
strategy requires until finding the hidden word.

Other analyses I have seen have opted to assume the player would
use the "optimal" playbook, that is, for each starting set, to follow
up with a full decision tree showing what actions to take for each
cluster that could contain the hidden word, the actions chosen to 
minimize the expected number of turns. We will (usually) not consider
these strategies, since they are usually too complicated for human
execution, which is our interest.

Instead, in order to pick a "simple" strategy which we will assume
the player will follow, we will assume their actions after the
starting set is entered are governed by these principles:

(1) Whenever a guess-at-will strategy is sure to bring a victory
by turn 6, use it.

(2) If playing a "preferred" cluster member will guarantee victory,
use it; more precisely, use the cluster member that will ensure
victory in the fewest turns.

(3) If not, play a single, out-of-cluster word that will ensure
victory (again choosing one that minimizes the number of turns).

(4) I have made some ad-hoc choices in the rare cases that no 
single word will allow a guess-at-will strategy afterwards.
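
To illustrate rule (3), here is a simplified Python sketch of a search
for a helpful out-of-cluster word. It is not the procedure used for
this article: the function name and the two-word candidate list are
made up, and it looks only for a word that gives a different tile
pattern for every cluster member. That is a sufficient condition for
knowing the hidden word afterwards, not a necessary one, and the
sketch makes no attempt to minimize the number of turns.

```python
from collections import Counter

def feedback(guess, hidden):
    """Tile colors for one guess: 'G' green, 'Y' yellow, '-' gray."""
    tiles = ["-"] * 5
    unmatched = Counter()
    for i, (g, h) in enumerate(zip(guess, hidden)):
        if g == h:
            tiles[i] = "G"
        else:
            unmatched[h] += 1
    for i, g in enumerate(guess):
        if tiles[i] != "G" and unmatched[g] > 0:
            tiles[i] = "Y"
            unmatched[g] -= 1
    return "".join(tiles)

def find_splitter(cluster, allowed_words):
    """First allowed word whose tiles differ for every word in the cluster, or None."""
    for word in allowed_words:
        patterns = {feedback(word, hidden) for hidden in cluster}
        if len(patterns) == len(cluster):
            return word
    return None

candidates = ["crane", "hover"]          # hypothetical list of allowed guesses
print(find_splitter(["haunt", "jaunt", "vaunt"], candidates))
```

Here CRANE is useless (all three cluster words color it identically),
while HOVER tests both H and V and so separates the three.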

But note in particular that the strategies I am assuming will include
special handling rules only for the clusters that might lead
to a loss if free guessing is used.

More finely-tuned decision trees can surely lower the expected numbers
of turns but in the interest of finding ways for ordinary humans
to play the game, their consideration is beyond the scope of this article.

With the strategy in place, it is a straightforward matter to compute
the probability distribution showing the likelihood of ending the game
after this or that many moves. We can then rank the candidates based
on the average number of moves.

Unfortunately, it is computationally intensive to work out the optimal
strategy. Therefore, I usually only work one out for the starting sets
which have proved to rank highly by the metrics in the previous subsections.
As a general rule, most of the time a player is using these strategies,
they are simply using the guess-at-will rule (1); hence the rankings
with a strategy are typically similar to the rankings without a strategy.
That gives some confidence that we have not missed a "best" candidate.


The other issue of importance in this subsection is a bit informal: we
would like to find starting sets whose win-by-turn-6 strategy is "simpler"
than those of any competing candidate. Primarily that means we rank
more highly the starting sets that have fewer anomalous clusters that
require a special rule to be memorized. Generally we prefer starting
sets that allow more use of in-cluster preferred words than out-of-cluster
words, and we prefer starting sets that do not require rules that are
used any later than immediately after the starting set. 



So there you have it! Many different metrics by which to compare and rank
candidate starting sets, some with side variations or extra parameters
to tinker with. For starting sets of four or fewer words, these different
metrics can lead to distinctly different choices of which candidate is "best".

In the next three sections we will evaluate our starting sets of different sizes
according to these different metrics.

==============================================================================

                          FOUR

Using the right starting set, we can guarantee a win of Wordle.

With a four-word starting set, it is conceivable one could win
*before* using up all the turns allowed by Wordle's rules. Indeed
we'll see in the next section that a perfect player can always win
Wordle in at most 5 turns. But in order to do so with a four-word
starting set, that quadruple would have to unambiguously identify
the hidden word every time. As discussed in the quintuple section,
that's not even possible with FIVE starting words, let alone with four.

So instead we look for sets of four starting words that can
guarantee a win by turn 6, i.e. with TWO rounds of guessing after
the four starting words are entered. That's flexibility we did not
have in the previous section, and it turns out to be just what the
doctor ordered. Let's get right to my favorite starting quadruple:

The four-word starting set
(4-1)     [carve, downy, plumb, sight]
guarantees a win at Wordle. The cluster vector is simply [2182, 59, 5];
obviously we can win on the fifth turn whenever the hidden word comes
from the many singleton clusters, and if it's in any of the 2-word
clusters, we can try one of the two words on turn 5 and (if necessary)
the other on turn 6. But as it turns out the five triples are also
easy to resolve: no matter which of the three cluster words we enter
on turn 5, it turns out to give enough information to determine which
of the other two words is the hidden word.  So there's no need to
develop a strategy: freeform guessing will end the game by turn 5
in 2246/2315 of the cases, and on turn 6 in the other 69/2315. So the
distribution vector is just [0.970194, 0.029806], and so there is
a 100% chance of victory, taking an average of 5.0298 turns.
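
The arithmetic behind those figures can be checked in a few lines of
Python. This sketch relies on the property stated above, namely that
in every cluster of this quadruple a wrong guess on turn 5 pins down
the hidden word for turn 6; the variable names are invented.

```python
from fractions import Fraction

v = [2182, 59, 5]                        # cluster vector of (4-1), from the text
words = sum(size * count for size, count in enumerate(v, start=1))
clusters = sum(v)

# One word per cluster is guessed correctly on turn 5 (on average);
# every other word is found on turn 6.
win_on_5 = Fraction(clusters, words)
win_on_6 = 1 - win_on_5
average_turns = 5 * win_on_5 + 6 * win_on_6

print(words, clusters)
print(float(win_on_5), float(win_on_6), float(average_turns))
```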

As with quintuples, we can at least estimate the performance
for compound games: an independence assumption would make the expected
number of turns be  4 + 1.0298 N , and the fact that the subgames need
not be independent only serves to lower the expected number of turns.
Similarly we can compute the probability of a win by the (N+5)th turn
under an independence assumption to be .970194^N + N*.029806*.970194^(N-1)
and again be confident that the true probability of a win is higher.
(Since the cases in which we do NOT have an instant solution are already
rare, the independence assumption is not all that far from reality.
Interestingly, we can similarly under-estimate the probability of
winning a compound game using our best starting quintuple in
the previous section, to be 0.995248^N . The two (under)estimates
agree around N=16, which suggests that for Sedecordle and the larger
games it may be better to use the 5-word starting set, but for say
Octordle and smaller games, it's the 4-word starting set that
may be the better choice.)
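
The two independence-based estimates are easy to tabulate. The Python
sketch below simply evaluates the expressions from the preceding
paragraph for a few game sizes; the function names are invented, and
the two probabilities are the ones quoted in the text.

```python
def quad_win_estimate(n, p=0.970194):
    """Estimated chance of winning n subgames by turn n+5 after a 4-word start."""
    q = 1 - p
    # Either every subgame is solved at once, or exactly one needs a second guess.
    return p ** n + n * q * p ** (n - 1)

def quint_win_estimate(n, p=0.995248):
    """Estimated chance of winning n subgames by turn n+5 after a 5-word start."""
    return p ** n                        # no spare turn: every subgame must be known

for n in (1, 2, 4, 8, 16, 32):
    print(n, round(quad_win_estimate(n), 4), round(quint_win_estimate(n), 4))
```

Both functions are lower bounds only under the independence
assumption discussed above, so the table is a rough guide rather than
an exact comparison.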


Note that this set has 20 different letters (all but k, f, and the
rare j,q,x,z) which gives the human player a lot of information
about the hidden word right away. That's handy! Intuitively, one
would expect that using 20 different letters would tend to keep
cluster sizes small. So for most of this section, we will focus
only on such starting quads. We can compare all these quadruples,
using the different rankings established in the previous section.


(A) THE BEST STARTING QUADRUPLE BY THE VARIOUS POST-START METRICS

We start with the rankings established by the color metrics G + f Y.
The best quads for various values of  f  are:
   [brave, flint, pudgy, shock]  (G,Y = [4117, 6030] ) for f < 0.31818
   [budge, flack, print, showy]  (G,Y = [4096, 6096] ) for 0.31818 < f < 0.32026
   [balmy, fudge, print, shock]  (G,Y = [4047, 6249] ) for 0.32026 < f <= 1
At f=1 there is a 1247-way tie of quads that all yield 10296 colored tiles,
namely, the quads made of the 20 most common letters in Wordle (all the
letters except v,w, and jqxz); the one yielding the most greens is again
[balmy, fudge, print, shock].

(For all f > 1.05405, the best score is held by [aglow, fetid, nymph, scrub]
because it is the quad with the most yellows: [G,Y]=[2151, 8137] .)
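
The threshold values of  f  come from setting  G + f Y  equal for two
neighboring candidates and solving for  f . A short Python sketch (the
function name is invented; the (G,Y) pairs are the ones listed above):

```python
from fractions import Fraction

def crossover(a, b):
    """Weight f at which G + f*Y is equal for the (G, Y) pairs a and b."""
    (g1, y1), (g2, y2) = a, b
    return Fraction(g1 - g2, y2 - y1)

first = crossover((4117, 6030), (4096, 6096))
second = crossover((4096, 6096), (4047, 6249))
print(first, float(first))
print(second, float(second))
```

The two results are the exact fractions 7/22 and 49/153, which is why
the middle quad is best only on a very narrow range of  f .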
                                                               
Note: none of these "best" quads (by the color metrics) has any perfect
anagrams, but for example in second place for small  f  we have a tie
(of course) between these three perfect anagrams:
    [brick, fudge, plant, showy]
    [black, fudge, print, showy] 
    [budge, flack, print, showy]
Altogether there are 28 quads that are in the top 5 when ranked by
the values of  G + f Y  , for some value of  f > 0; they form just 22
distinct (G,Y) pairs on the convex hull of all the 45174 (20-letter) quads
because 6 of these quads are perfect anagrams of others.


We next can rank the quads by the Lp metrics, for all real numbers p.
Over all real values of  p,  there are only four quads that are ever "best":
    [batch, drove, slung, wimpy]  for p > 4.826180826
    [blown, carve, dumpy, sight]  for smaller p > 3.833921198
    [chump, dying, fable, worst]  for smaller p > 2.000000000
    [bugle, champ, downy, first]  for smaller p (i.e. p<2.0)
Each of these has 0, 1, or 2 clusters with four elements, and no larger clusters.

In particular, the first of these is ranked best by all large  p  because its
largest cluster contains only 3 words, and 3 is the minimum for all these 45K
20-letter quads. (There are quads like [flown, jerky, match, squib] in this 
set that have as many as 15 words in a cluster!) Some other quads also have
just 3 elements in their largest clusters, but this one is the only one to
have only a single cluster of 3. (Its cluster vector is [2148, 82, 1].)

The last of these four quads wins on several counts: its cluster
vector is [2199, 50, 4, 1], so it has 2199 singleton clusters, which
is the maximum, and it has 2254 clusters altogether, which is also the
maximum. Its average "ambiguity" is 1.058747300 words per day, which
is the minimum, and which is accomplished only by this and the third
quad (whose cluster vector is [2193, 56, 2, 1].)

Literally using the Lp ranking with p=+infinity puts into first place
all the quads whose largest cluster has the minimal size, which among
these 45K quads is three. That gives a massive tie for first place to
the 3223 quads which have no clusters larger than 3 elements.

(We can break the tie among those 3223 quads by considering these same
metrics: By using the large-p Lp metrics (which we have done above),
this is tantamount to ranking them by the number of clusters of size 3
that they have.  The one which unambiguously identifies the most words
is [chant, dowel, rugby, skimp] with cluster vector [2186, 57, 5]. 
It's also the one with the lowest average daily ambiguity: on average,
the player is choosing the hidden word from a set with 1.062203024
candidate words in it. In fact it's the "best" one of these 3223
candidates by every Lp metric with p<3.685172 .  It's also the one
with the most clusters altogether (2248), which means it will win most
often on move 5 (2248/2315 of the time) if we pursue a strategy of
simply guessing words within clusters.)

Returning now to the full set of 45K 20-letter quads, we have already listed the ones
that are "best" by any Lp metric. The other quads that show up in the top-5 ranking
for some value of  p  are
[batch, dimly, prove, swung], [cable, fight, rowdy, spunk], [clump, grove, handy, swift]*,
[crown, dumpy, fight, salve], [comfy, diver, plush, twang], [clasp, downy, giver, thumb], 
[carve, dumpy, flown, sight], [carve, downy, fight, slump], [chant, dowel, rugby, skimp], 
[chump, dingy, fable, worst], [crump, downy, fable, sight], [dumpy, globe, ranch, swift], 
[bland, comfy, purse, wight], [bugle, cramp, downy, shift]*, [barge, clump, downy, shift], 
                              [bugle, candy, morph, swift], 

(*= plus the perfect anagrams  [crump, glove, handy, swift] and  [bugle, crimp, downy, shaft])

The quads [crown, dumpy, fight, salve] and [clump, grove, handy, swift]
have the same cluster vector [2151,79,2], so they will be tied in the
rankings from every Lp metric. However, they are not perfect anagrams of
each other, and so will be different by some of the other (non-Lp) rankings.
(For example, the numbers of green and yellow tiles they produce over a
complete cycle are [3574, 6596] for the first and [3693, 6477] for the
second, so the second quad is "better" in the sense of giving more
information to the player in the form of colored tiles.) This collision
is not rare: there are only 24587 distinct cluster vectors for the 45147 quads.

There are also many collisions for specific values of p . Because the
cluster vectors for these quads are so simple, the equations that
define the values of p for which the Lp metrics of two quads are
equal, are themselves also very simple, and likely to be repeated for
other pairs of quads. Indeed, see the chart  showing how the rankings
of these two dozen quads vary as p varies; there are multiple values
of p where more than one pair of quads exchange places in the
rankings. This is quite unusual --- it does not happen (much) for
smaller starting sets than quads, because such starting sets have
longer, more complicated cluster vectors.


(B) BEST STARTING SETS IF YOU INTEND TO JUST KEEP GUESSING

I computed the probability distribution of each of the 45,174
20-letter quadruples: if a player consistently pursues a guess-at-will
strategy with the same starting quad, what fractions of the games
will end after 1, 2, 3, ... more turns. From that probability
distribution we can compute the average number of turns needed
for victory, and the probability of a loss (i.e. failure to guess
the hidden word by turn 6).

The starting quads with the lowest average number of turns are
   [bugle, champ, downy, first]  5.02754 turns on average
   [barge, clump, downy, shift]  5.02797
   [bugle, candy, morph, swift]  5.02797
   [bland, comfy, purse, wight]  5.02840
   [chump, dying, fable, worst]  5.02840
Notice that the differences are very small, and sometimes zero!
They amount to needing a single extra turn across the entire
2315-day cycle! The list continues with very small increments
for a considerable length. Indeed, 45 thousand entries later we
reach the worst quad, [fjord, glyph, quack, vixen], which still
takes only 5.23466 turns on average; so the increments must
be small. (Perfect anagrams would have identical turn averages;
the pairs in the table which appear to have equal averages really
do, but they are not perfect anagrams of each other.)


The other metric that we use when incorporating the guess-at-will
strategy is the failure rate: which quads would lose least often
when following this strategy? The five quads listed above would all
occasionally require a seventh turn to win. But as noted at the outset
in this section, there are starting quads like (4-1) which will never
lead to a loss if the player simply guesses any Wordle word that is
consistent with the clues at each turn!

There are 230 such quads among the 45K, so they are all tied for
best by this metric. To break the tie we might invoke one of the
metrics from section (A). For example, 130 of them have a maximum
cluster size of 3. (All the others have a max cluster size of 5, except for
     [angst, birch, dumpy, vowel]
whose cluster vector is [2145, 78, 3, 0, 1]; yet even its largest
cluster, {skate, stake, state, steak, taste} can be resolved in 
two turns with free-form guessing!)

Alternatively, we could break the tie by looking at the average
number of turns needed. In that case the winners are
     [carve, downy, plumb, sight]  5.02980 turns on average
     [brawl, coven, dumpy, sight]  5.03024
     [burst, champ, dingy, vowel]  5.03024
     [carve, downy, fight, slump]  5.03067
     [covet, gland, shrub, wimpy]  5.03067
     [burst, champ, dying, vowel]  5.03110
(The last has a perfect anagram [burst, champ, dowel, vying] too.)
These all have similar, 3-term, cluster vectors.

These 230 quads are the most impressive, but really, starting with 
any 20-letter quad is sure to give satisfactory results. The very
worst of them still wins 97.58% of the time just by guessing Wordle
words after the initial quad is played. The smallest *nonzero* rate
of failure among the 45K quads is exactly 1 loss per 5 full cycles, i.e.
the daily Wordle player could expect a loss only once every 32 YEARS!


(C) BEST STARTING SETS IF YOU'LL USE A SIMPLE STRATEGY THAT FORCES A WIN

Since the guess-at-will strategy is already very successful, the
use of a strategy which dictates actions only for problematic clusters
is expected to result in only small changes in play. In particular,
we expect that the best quads now in part (C) are likely to be among
the best ones in part (B). 

So I reviewed each of the top 1000 quads, ranked by the average number of
moves until victory when playing guess-at-will. For each of them I
determined which clusters could cause a loss by turn 6, and selected
a preferred word, or if necessary an out-of-cluster word, to play on
turn 5 that would guarantee success by turn 6. (Since there are only
those two turns left after the initial quad, it's easy to see that
any preferred word that works will give the same average number of 
turns; likewise any successful out-of-cluster word will give the
same number of moves.)

Note that our decision to seek minimal rules for a win-by-turn-6
strategy limits us to using pre-determined moves only for clusters
that could otherwise lead to a loss; in particular, for any of the
230 quads that have no problematic clusters, that pattern dictates
that we will introduce no new rules into our strategy for those 
quads; the average number of turns for them will be the same here
in part (C) as it was in part (B).

When a win-by-turn-6 strategy is found for a quad, the distribution
vector will be simply of the form  [1-x, x]  where  x  is the fraction
of the time the game goes to turn 6; then the average number of moves
is  5+x . Our standard metric for starting sets having a win-by-6
strategy is to minimize this average number, which is equivalent
to minimizing  x .  Here are the best quadruples, along with the
value of  x . Also shown are the numbers of in- and out-of-cluster
words we must remember to use in order to guarantee this win by turn 6:

     x=0.027213, 1, 2  [bugle, champ, downy, first]
     x=0.027645, 1, 2  [bugle, candy, morph, swift]
     x=0.027645, 1, 2  [barge, clump, downy, shift]
     x=0.028077, 1, 2  [chump, dying, fable, worst]
     x=0.028077, 1, 1  [bland, comfy, purse, wight]
     x=0.028509, 1, 2  [bugle, crimp, downy, shaft]
     x=0.028509, 1, 2  [bugle, cramp, downy, shift] (anagram)
     x=0.028509, 1, 1  [dumpy, globe, ranch, swift]
     x=0.028509, 1, 1  [downy, farce, plumb, sight]
     x=0.028509, 1, 1  [chump, globe, randy, swift]

Again the numbers are close and we have multiple quads that
rank equally highly. 

About 22 of these good (top-1000 !) quads do not have any
winning strategy! In each of the cases of failure in that cohort,
the starting quad created a cluster of four _AUNT words such as
{haunt, jaunt, taunt, vaunt}. In such a case, there is no way to
guarantee a win by turn 6: no matter what Wordle word is entered on
turn 5, there will be at least one pair of these four words that are
scored the same, and all we can do on turn 6 is pick one of them to
enter, and face a loss if we guessed wrong.
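
The test behind this statement is easy to sketch in Python: a turn-5
word is useful only if it gives every member of the cluster a different
tile pattern. The helper names below are my own, and the two sample
guesses show the check on single words only; the claim that NO Wordle
word works requires running the same check over the whole word list.

```python
from collections import Counter

def feedback(guess, answer):
    """Tile pattern for one guess: 'G' green, 'Y' yellow, '-' gray."""
    tiles = ["-"] * len(guess)
    unused = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            tiles[i] = "G"
        else:
            unused[a] += 1
    for i, g in enumerate(guess):
        if tiles[i] == "-" and unused[g] > 0:
            tiles[i] = "Y"
            unused[g] -= 1
    return "".join(tiles)

def separates(guess, cluster):
    """True if `guess` shows a different pattern for every word in `cluster`."""
    patterns = [feedback(guess, word) for word in cluster]
    return len(set(patterns)) == len(patterns)

def separating_words(candidates, cluster):
    """All candidate guesses that split the cluster completely."""
    return [g for g in candidates if separates(g, cluster)]

three = ["jaunt", "taunt", "vaunt"]
four = ["haunt", "jaunt", "taunt", "vaunt"]
for guess in ("jetty", "trove"):
    print(guess, separates(guess, three), separates(guess, four))
```

JETTY and TROVE each split the three-word cluster, but with HAUNT
added each of them leaves a pair of words scored alike.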

Other quads could guarantee a win by turn 6 but only by fixing a
strategy for quite a few clusters, including out-of-cluster plays,
because the clusters included multiple words of tricky forms like
_AUNT, _ATCH, _IPER, CO_ER, etc. So the quads in the table above
are remarkable not only because of the low numbers of turns needed
but because of the low numbers of rules to be learned and followed.

(Of course, by any notion of "simplicity", the quads with the
most simple strategy to achieve a win by turn 6 are the 230 quads
that can accomplish it by guessing any word in the cluster!)

We will close out part (C) by discussing a few examples.

The unambiguous best quad by the standards of part (C) heads the preceding table:
    [bugle, champ, downy, first]
Its largest cluster is  {skate, stake, state, stave} . Guessing, say, STATE
on turn 5 would be a problem if the hidden word were STAKE or STAVE --- both
would get the same response from STATE. So instead, guess SKATE as the
preferred cluster member. Two of its other clusters are {piper, riper, viper}
and {jaunt, taunt, vaunt} and it is clear that guessing any member of these
clusters can lead to a loss. A suitable recipe is to guess PARER in the first
case and JETTY in the second; then the game will have to go to turn 6, but
now there is enough information to enter the correct word on turn 6 and win.
As noted in the table above, this strategy will lead to a use of turn 6
2.72% of the time, so the average number of turns needed is 5.0272. That's
a minimum over all these 20-letter quads.
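
The reasoning about these clusters can be checked mechanically. The
following Python sketch (my own helper names, not the code behind the
tables) computes the largest number of guesses that guess-at-will play
might need inside a cluster, and the worst case when a chosen word is
played first. After a quad only two turns remain, so any cluster whose
worst case is 3 needs a rule.

```python
from collections import Counter
from functools import lru_cache

def feedback(guess, answer):
    """Tile pattern for one guess: 'G' green, 'Y' yellow, '-' gray."""
    tiles = ["-"] * len(guess)
    unused = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            tiles[i] = "G"
        else:
            unused[a] += 1
    for i, g in enumerate(guess):
        if tiles[i] == "-" and unused[g] > 0:
            tiles[i] = "Y"
            unused[g] -= 1
    return "".join(tiles)

def split(guess, cluster):
    """Sub-clusters left after `guess`, if `guess` was not the hidden word."""
    groups = {}
    for word in cluster:
        if word != guess:
            groups.setdefault(feedback(guess, word), []).append(word)
    return [frozenset(g) for g in groups.values()]

@lru_cache(maxsize=None)
def worst_case(cluster):
    """Most guesses guess-at-will can need; `cluster` must be a frozenset."""
    return max(turns_if_first(g, cluster) for g in cluster)

def turns_if_first(guess, cluster):
    """Worst case if `guess` is played first and guess-at-will follows."""
    return 1 + max((worst_case(part) for part in split(guess, cluster)),
                   default=0)

sta_e = frozenset({"skate", "stake", "state", "stave"})
print(worst_case(sta_e), turns_if_first("skate", sta_e))
_iper = frozenset({"piper", "riper", "viper"})
print(worst_case(_iper), turns_if_first("parer", _iper))
```

Both clusters have a worst case of 3 under free guessing; playing SKATE
or PARER first brings that down to 2.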

In the sequel, we will summarize this strategy in just a few lines:
   startset:   [bugle, champ, downy, first]
   preferred:  { skate }
   out-of-box: [jaunt, jetty], [piper, parer]

We have already highlighted the quad
    [chant, dowel, rugby, skimp]
It uses few moves, on average, to win.  Unfortunately two of its clusters
are the sets {jaunt, taunt, vaunt} and {focal, local, vocal}, and it is
clear that any use of in-cluster words has a one-in-three chance of not
finding the hidden word until turn 7. We can guarantee a win, but that
requires using turn 5 to play "jetty" or "trove", if the first tricky cluster
shows up, and something like "fever" if the second cluster does. In that case
the distribution vector will be [0.9701, 0.0299] : 5.0299 turns on average.


The set
(4-6)     [bawdy, flung, porch, smite]
was suggested to me by a friend when I was first introduced to Wordle.
It also uses 20 different letters and has a fairly good cluster vector
[2165, 69, 4]. But it's actually not quite as good as the previous
quads. One of the four largest clusters consists of the three
words {jaunt,taunt,vaunt}. 

For this starting quad there is only the one problematic cluster so we
need only one extra rule:
    Start with [bawdy, flung, porch, smite]. Then
    If the hidden word *could* be "jaunt", play "judge".
    Otherwise, continue to guess anything consistent with the clues.
(In the notation of the introduction, this is the one rule 
    [jaunt, judge]
As it happens, JUDGE and TROVE are the only words we could use here!)

If we use no strategy with this starting quad, but just guess 
words consistent with the clues, the probability distribution
(frequency that the game lasts 1, 2, or 3 more turns) is
   [0.966739, 0.032829, 0.000432]
Using instead the one rule [jaunt, judge] changes the distribution to
   [0.966307, 0.033693].
The success rate on turn 5 has gone down, the average number of
turns is unchanged (at 5.0337), but importantly the maximum length
of a game has gone down from 7 to 6 by switching to this strategy.
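
The arithmetic behind that comparison takes only a few lines of
Python. The function names are mine; the two vectors are the ones
quoted above, with entry k giving the fraction of games that end on
turn 5+k.

```python
def average_turns(dist, first_turn):
    """Expected turn on which the game ends; dist[k] is P(end on first_turn+k)."""
    return sum((first_turn + k) * p for k, p in enumerate(dist))

def loss_probability(dist, first_turn, last_turn=6):
    """Probability that the game is still unsolved after `last_turn`."""
    return sum(p for k, p in enumerate(dist) if first_turn + k > last_turn)

guess_at_will = [0.966739, 0.032829, 0.000432]
with_rule = [0.966307, 0.033693]

for name, dist in (("guess-at-will", guess_at_will), ("one rule", with_rule)):
    print(name, round(average_turns(dist, 5), 4), loss_probability(dist, 5))
```

Both strategies average about 5.0337 turns, but only the first has a
nonzero chance of running past turn 6.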


There may also be more efficient 4-word sets that involve fewer 
than 20 letters; I haven't found any yet. (And I observe that these
may be harder to use for people who don't know the wordlist well.)
Here is one example that at least comes close:
(4-8)     [champ, flown, rugby, steed]
Its cluster vector is  [2132, 87, 3] and I compute the probability
distribution vector to be
    [0.959827, 0.039165, 0.001008]
meaning a 0.10% failure rate and an average of 5.0412 turns if
we play by guessing at random. Or we can guarantee a win by turn 6
if we add rules  {stark; [jaunt, jetty], [stake, evoke] } .
With this algorithm the probability distribution vector is
    [0.958963, 0.041037]
meaning an average of 5.0410 turns per game (and a maximum of 6 !)
Overall not as good as the 20-letter quadruples we've met, but close!


I don't claim to have examined all possible starting quadruples;
there may be more that should be listed, especially if they excel
according to some other metric than we have used so far.


Just as at the end of the last section, we can offer a Wordle
starting set that bridges two sections of this document. The quad
(4-9)     [blast, midge, porch, funky]
adds one word to one of the best starting *triples* from the next
section, for all the same reasons -- to get a hint, to backpedal
from a goal of starting with only a fixed triple, etc. As before,
this "augmented triple" won't measure up as well as the actual
(excellent) triple, but it may be easier to use. FUNKY is arguably
the best word to add to the other three. This quadruple now has
the simple cluster vector  [2156, 66, 9]. With a guess-at-will
strategy the distribution vector is
    [0.963715, 0.035133, 0.001152]
which works out to an average of 5.0374 turns. But there's still
a loss that way, so we look for additional rules for the tough
clusters. A simple choice turns out to be 
    {eager, fever, [catch, clown], [jaunt, jetty]}
For this algorithm, the distribution vector turns out to be
    [0.962851, 0.037149]
for 5.0371 turns on average, and of course 100% win rate.

==============================================================================

                          THREE

This is a long section because there are many starting triples that are
"good" for different reasons, so no single one can be called "best".
I have created a separate file that contains all the statistical data
for the triples mentioned in this section; feel free to weight the
different criteria as you wish to select your favorite starting triple!


With a set of three starting words, we can surely win by turn 6, but in
practice this can be tricky. After just three initial words there are
at least 11 letters that will not have been tested, so the player must
do more sleuthing; e.g. it is quite possible that after three initial
guesses the player has seen nothing but grey tiles! And many starting
triples cannot guarantee success by turn 6 simply because they cannot
quickly enough distinguish, say, JOKER, BOXER, FOYER, LOVER, WOOER and
the other twenty(!) _O_ER words.

Still, by choosing an appropriate starting set of three words, one
can hope to have a 100% win rate at Wordle. After all, we have already
seen in the last section that we can win 100% of the time starting
with CARVE + DOWNY + PLUMB; surely with the freedom to choose something
other than SIGHT next, we should be able to ask for something more
than just a 100% success rate in 6 turns. At the very least we should
be able to arrange a lower average number of turns until a win. What
else might we ask for? What are we willing to give up?  How do we
decide that one or another three-word starting set is "better" than
another, or even "the best"?

The question of what is a good three-word starting set arises periodically
on the Reddit forum. I fashioned a detailed response analyzing many of the
starting triples that had been proposed. In this document, we can put
that analysis into context.

What we will see is that trying to reduce the expected number of turns
needed to win will introduce more complexity in our algorithms.  To be
precise, what we had in the previous section was a starting set
[carve, downy, plumb, sight] that had two features:
    (A) It wins 100% of the time within two more turns.
    (B) It requires no added rules besides "guess any candidate".
In this section we could hope for a THREE-word starting set with both
those properties. After all, it *is* known that there are algorithms
to win Wordle in just 5 turns (although the best algorithms to do that
to my knowledge all require long lists of rules).

Sadly, I can prove that no set of three words can have BOTH
properties (A) and (B). In fact, I am pretty sure that (B) alone is
impossible (more on this below). But we can find starting triples
that have property (A)!

                ----------------

(A) CAN WE FORCE A WIN IN FIVE TURNS? YES!

I have found that there are exactly 261 starting triples with which
every game can be won by turn 5. For each of these triples, there will
be clusters of words (signalled by the pattern of the 15 colored tiles)
that could lead to a loss if we simply play with a guess-at-will mode,
that is, we will have to map out some preferred words or out-of-cluster
words to use in those cases. (See the examples of PROBE and BRACE in the
Introduction.) Unfortunately, for each of these starting triples, in
order to achieve goal (A), we need at least 54 rules of these two types,
which is perhaps too many for a human to execute while playing a game.

Which is "best" among these 261 triples is, as always, a matter of taste, but
(R3-01)   [blast, midge, porch]
is certainly a good choice. Let's discuss this triple in detail;
comparable analyses for other triples are given as a table in 
another file.

Of all 261 starting triples, this one's cluster vector
       [1597, 207, 53, 18, 9, 3, 0, 0, 0, 1]
has the highest total number of clusters, which translates into
having the highest probability of getting the word on the very
next turn after the starting triple (81.56%) just by guessing.
It also has the highest number of singleton clusters, meaning
more words are known with certainty after this opening triple
than with any of the other 260 special triples. 

The distribution vector using the guess-at-will strategy is
    [0.815551, 0.168524, 0.014825, 0.001100]
so this strategy will take an average of 4.2015 turns; more
importantly it won't complete by turn 6 about 0.11% of the time, and
definitely will not always complete by turn 5 ! The precise reason is
that there are 55 clusters where a hidden word can still be hidden
after two clue-consistent guesses from within the cluster. Of these,
39 clusters can be resolved by turn 5 by playing a "preferred" cluster
member on turn 4 (e.g. {arena, freak, raven, wafer, waver, wreak} is
such a cluster; "wafer" is the one choice that will work). The other
16 clusters can be resolved on turn 5 by using an out-of-cluster word.
(One such cluster is {jaunt, taunt, vaunt}; in order to guarantee a win
by turn 5 we must enter either "jetty" or "trove" on turn 4.)

So in toto we have 55 such rules that must be memorized, one for each
tricky cluster. A sample algorithm using this information might be this:

  After blast+midge+porch, enter any word consistent with the clues, except
  * If the word COULD be any of the following 39 words, then play it:
     allow, antic, awake, award, bevel, crown, dizzy, dowdy, dried, drone,
     eater, enter, equal, fauna, fatty, fewer, filly, finer, folly, funky,
     jelly, kitty, liner, mafia, otter, relax, safer, seize, sever, shown,
     skate, skulk, swash, taste, testy, udder, unfed, value, wafer
  * If the word COULD be any of the following 16 first-halves, play the second half:
      [anger, gawky], [catch, crown], [cinch, crown], [crane, ozone],
      [fatal, fella], [field, gawky], [fight, frown], [fizzy, ozone],
      [focal, fella], [forth, crown], [fudge, funky], [jaunt, jetty],
      [major, jetty], [rower, gawky], [snoop, frown], [stoke, funky]
  Then (if the hidden word has not already been played) there is only one
  Wordle word consistent with the clues; play it on turn 5 and win.
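
For anyone who would rather let a program remember the 55 rules, the
algorithm above is just a table lookup. Here is a hedged Python sketch:
the word lists are copied from the rules above, but the function name
and the choice to check preferred words before out-of-cluster pairs are
my own assumptions (the rules do not say which takes precedence).

```python
PREFERRED = set("""
allow antic awake award bevel crown dizzy dowdy dried drone
eater enter equal fauna fatty fewer filly finer folly funky
jelly kitty liner mafia otter relax safer seize sever shown
skate skulk swash taste testy udder unfed value wafer
""".split())

OUT_OF_CLUSTER = {
    "anger": "gawky", "catch": "crown", "cinch": "crown", "crane": "ozone",
    "fatal": "fella", "field": "gawky", "fight": "frown", "fizzy": "ozone",
    "focal": "fella", "forth": "crown", "fudge": "funky", "jaunt": "jetty",
    "major": "jetty", "rower": "gawky", "snoop": "frown", "stoke": "funky",
}

def turn4_guess(candidates):
    """Word to play on turn 4, given the words still consistent with the clues."""
    ordered = sorted(candidates)
    for word in ordered:                 # rule type 1: a preferred cluster member
        if word in PREFERRED:
            return word
    for word in ordered:                 # rule type 2: an out-of-cluster word
        if word in OUT_OF_CLUSTER:
            return OUT_OF_CLUSTER[word]
    return ordered[0]                    # otherwise any candidate will do

print(turn4_guess({"arena", "freak", "raven", "wafer", "waver", "wreak"}))
print(turn4_guess({"jaunt", "taunt", "vaunt"}))
```

The two sample calls return "wafer" and "jetty", matching the examples
discussed earlier in this part.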

The probability vector for this set of rules is [0.808639, 0.191361],
meaning the game runs 4.1914 turns on average (and has a 100% rate
of completion by turn 5). This is the lowest average turn count among
the algorithms that I checked for these 261 triples.

Other strategies for the 55 problematic clusters exist; for example,
for each of them --- indeed for all but four of the 291 non-singleton
clusters! --- one or more of the following ten words will split the cluster
completely. Make a table of which of these words you wish to use to resolve
each of the 55 clusters to create your own win-by-turn-5 algorithm:
   [crown, fewer, filly, funky, gawky, jetty, navel, skate, spunk, tawny]
(The cardinalities of the sets of ten words here and the seven used
in the previous algorithm are minimal, as determined by Gurobi.)
With this given starting triple ("Phase 1") these different algorithms
to complete the daily puzzle ("Phase 2") can have slightly different
probability vectors and thus different expected numbers of turns.

This starting triple can also be used, more easily, to win in Wordle
by turn 6. That is, a player who initially intends to follow this
algorithm so as to win by turn 5, may decide during the play that
it would be sufficient to win by turn 6, and then can forget most of
the 55 special rules listed above; the only ones still needed are
  {crown, fauna, fewer} and
  [fight, frown], [fudge, funky], [rower, gawky], [snoop, frown]
Alternatively if we're willing to wait until turn 6 to win, we can
use more preferred words and fewer out-of-cluster ones: the strategy
  {crown, fauna, fewer, fudge, rower, snoop,   [fight, frown] }
works, and gives a distribution vector of
  [0.815119, 0.172282, 0.012599]
and thus a slightly higher number of turns (4.197) than when using
the 55 rules to finish by turn 5. (All those extra rules were 
needed just to avoid these 1.26% of the days when the game took
a sixth turn to win!)

We have already mentioned a third alternative in the previous section:
we can consistently play FUNKY on turn 4 and then follow rules for
just four special clusters in (4-9); but this gives a significantly
higher expected number of turns: 5.037.


Note that by using this 3-word starter set on an N-fold compound game,
we can solve all N of the subgames in at worst 3+2N turns. For N > 2
this worst case exceeds the N+5 turns typically allowed in compound games.
But when N=2, the two are equal, meaning we have a guaranteed winning
strategy for 2-fold Wordle. Dordle is not exactly a 2-fold Wordle (it
uses a different wordset), so one does not have an a priori guarantee
that this algorithm will work for Dordle. But as it turns out, it does
still work, with minor modifications. Change the set of preferred words
to this set of 35:
  { allow, assay, awake, awash, awful, crown, dowdy,
    drone, eater, enjoy, enter, fatty, fever, finer,
    folly, funky, goner, jawed, kneed, lefty, newly,
    otter, relax, sally, seize, sever, skate, skier,
    skulk, snipe, testy, tower, value, viper, wafer }
and change the set of out-of-cluster moves to this set of 20 pairs:
    [anger, wagon], [catch, clown], [cinch, awful], [crane, anvil],
    [dizzy, dozen], [fatal, awful], [field, awful], [fifty, flank],
    [fight, flown], [focal, fever], [forth, awful], [foyer, gawky],
    [fudge, fauna], [jaunt, jetty], [liner, anvil], [lower, anvil],
    [major, agony], [snoop, flown], [staff, bonus], [stoke, ankle]
Then each of the two subgames of the Dordle game will definitely end
within 3+2 turns, i.e. the whole game will end within 3+2+2=7 turns.


Finally, for basic Wordle we may return to a point made in the
Introduction. If we enter only BLAST + PORCH, on about one-fourth of
the days we will see 4 or 5 colored tiles, or 3 greens, or 2 greens
and a yellow. In most of those cases we can still win by turn 5 by
simple guessing without entering MIDGE! We need only watch for the
following words, to be used as preferred members of their cluster:
   [blade, blond, brain, ditch, graft, gulch, mouth, plain, swash]
and if the word could be STORE or CATCH, play HYMEN.
Doing so will lower the expected number of turns to 3.9 .
(We can similarly avoid MIDGE in the compound games, but this analysis
applies only if all the subgames show such a favorable return from
just BLAST + PORCH, which becomes increasingly rare as the number of
subgames increases.)


This long analysis of BLAST + MIDGE + PORCH can be repeated for each
of the other 260 starting triples that have property (A). I have not
done so, but have collected some data about those triples and invite a
discussion of which others are, by some measure, better than this one.

                ----------------
(B) IS THERE A WINNING STARTING TRIPLE THAT REQUIRES NO EXTRA RULES? NO!

Now, what about property (B)? Surely it would be convenient to have
a starting triple that worked as easily as the starting quad of
CARVE + DOWNY + PLUMB + SIGHT : just enter the starting set and
keep guessing words that are consistent with the clues. It would be
great to know for certain that we'd find the hidden word by turn 6!

I believe I can prove that no such triple exists when using only
words from the Wordle answer-list. Just for this search, though,
I also looked at the longer 14853-word list of valid input words.
In order to speed things up I made the reasonable, but not ironclad,
assumption that such a triple would involve 15 distinct letters.
(This permitted me to do a preliminary compression to the 5,649
sets of five distinct letters that can form at least one of those
words, and then to non-intersecting triples of such letter-sets.) 
If I have done the search properly, I can report that no such perfect
triple exists: for every (15-letter) triple of allowed input words,
there is at least one cluster for which the guess-at-will strategy
can lead to a loss in standard 6-turn Wordle.
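
The compression step mentioned in parentheses above can be sketched
with bitmasks. This is only an illustration of the idea under my own
naming, run on a handful of words from this document; the real search
also needs the cluster analysis of each surviving triple, which is not
shown here.

```python
def letter_mask(word):
    """26-bit mask with one bit set for each letter that occurs in `word`."""
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord("a"))
    return mask

def distinct_letter_sets(words):
    """Map each set of five distinct letters to the words (anagrams) using it."""
    sets = {}
    for word in words:
        mask = letter_mask(word)
        if bin(mask).count("1") == 5:        # skip words with a repeated letter
            sets.setdefault(mask, []).append(word)
    return sets

def disjoint_triples(sets):
    """Yield triples of letter-sets that share no letter (15 letters in all)."""
    masks = sorted(sets)
    for i, a in enumerate(masks):
        partners = [b for b in masks[i + 1:] if a & b == 0]
        for j, b in enumerate(partners):
            for c in partners[j + 1:]:
                if (a | b) & c == 0:
                    yield a, b, c

# Hypothetical sample; DINGY and DYING share one letter-set, SPEED is dropped.
sample = ["blast", "midge", "porch", "funky", "carve", "speed", "dingy", "dying"]
sets = distinct_letter_sets(sample)
for a, b, c in disjoint_triples(sets):
    print(sets[a], sets[b], sets[c])
```

Working with letter-sets rather than words means that anagrams are
examined once, and the disjointness test is a single AND per pair.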

I did also look for near-misses, though, and found a couple of
triples for which there is only one bad cluster. The best is
(R3-02)   [bonds, glamp, fecht]
Each of the hidden words SKATE, STAKE, STARE, STATE, and STAVE will
turn the A tile green and the S, E, and T tiles yellow, so all five
land in one cluster. A guess-at-will strategy would allow the player
to guess them in, say, reverse alphabetical order, which would be a
loss if the hidden word were STAKE. So in this case the player must
remember (only) one additional rule: "prefer SKATE".

Also having just one bad cluster is
(R3-03)   [techs, glamp, rownd]
which has a cluster  {berry, eerie, ferry, fever, jerky, verve} .
This cluster can again lead to a loss from random guessing (e.g. 
the sequence "verve, jerky, ferry, berry") but again the loss can 
be prevented by playing the preferred word BERRY on turn 4 if we
get a green E and a yellow R from the starting triple.

(While triple (R3-02) has one word in the Wordle solution-list,
triple (R3-03) has none at all! I would also say that the first triple
is "better" than the second in the sense that we only have to invoke
our special rule ("guess SKATE") on five days out of a 6.5-year cycle,
as opposed to needing the other special rule ("guess BERRY") on six
days per cycle. There's also a minor technical reason. Finding these
good triples amounts to making sure they never (or rarely) permit
quadruples like {berry, ferry, jerky, eerie} to be together in a
cluster after the initial triple of words is entered. I assembled a
list of tens of thousands of these problematic quadruples and then
developed mechanisms to detect starting triples that broke most of
these quads apart. The first triple I listed only missed its one
quadruple "STA_E". The other one actually missed two: both 
  {berry, ferry, jerky, eerie} and {berry, ferry, jerky, verve}
are problematic quadruples. This is a minor distinction of course.)

I did not find any other starting triples that involved only a single
bad cluster. Both GLAND+ROMPS+FECHT and GLAND+ROMPS+WECHT 
involve just two (and each of them actually leaves three problematic
quadruples unseparated).

I make no claim about whether other equally-good triples exist. (My
method of sorting was only designed to make sure I didn't miss any
starting triples that guaranteed success *just by guessing* (with zero
special rules like "use SKATE"), and so I am fairly confident that
such a triple does not exist; but along the way I had to branch through
decision trees to trim the candidate pool, and a starting triple that
was a "near miss" might not have been good enough to survive an
early-stage pruning.)

                ----------------
Now let's return to the consideration only of triples of words drawn
from the more limited (and more reasonable!) Wordle answer-list.

I can present a number of triples that are good (by several measures); only
rarely can I truly claim that the ones I found are absolutely the best.
Using only the words in the Wordle answer list, there are over 2 billion
sets of three words that we might potentially consider as starting sets.
Just as we observed with five- and four-word starting sets, it is
certainly true that some starting sets with repeated letters can be good.
In fact, 46 of the 261 starting triples that can lead to a guaranteed win
by turn 5 have repeated letters! (Example: [crump, doubt, salve].) And
some posters on Reddit say they use with success some starting sets with
repeat letters, e.g. [blind, stare, wimpy] and [colon, right, speed].

But generally speaking it seems prudent to focus on sets of three Wordle
words that include 15 different letters. That reduces the number of candidate
triples to 1,243,026. (I have listed them all in a 26Mb zipped file.) This is 
just small enough a set that it is possible to run some quick preliminary
computations on all of them and then run longer analyses on the most
promising among them. (Some statistics take a while to compute!) I computed
assorted other statistics for some "promising" triples in this Reddit post. 

We will review some of the "best" triples that can be found this way.

                ----------------
(D) THE BEST STARTING TRIPLE BY THE VARIOUS POST-START METRICS

As discussed in a previous section, there are two types of metrics to
assess how the game "looks" immediately after the starting set is played:
those using the cluster vector, and those counting the colored tiles.
Each method splits into a whole continuum of metrics (distinguished 
by a parameter), and allows a couple of variations.

As for the cluster-vector metrics: I have worked out the rankings
of the triples based on the Lp metrics, and found the top-4 triples for
every possible value of  p. The only triples in that list are 
    [birth, model, spawn], [bland, comet, sprig], [bland, copse, mirth],
    [bland, copse, right], [blimp, dance, short], [blimp, dance, worst],
    [blind, comet, sharp], [blind, match, spore]*, [blond, march, spite],
    [blond, rivet, scamp], [chirp, golem, stand], [clamp, diner, ghost],
    [climb, donut, parse], [climb, sandy, trope], [midge, porch, slant]
(*): There is also the anagram triple  BLOND+MATCH+SPIRE  which results in
identical game play to BLIND+MATCH+SPORE, so it will not be separately mentioned.


Specifically, the very best triple is, for each  p,  one of these three:
(R3-04)   [blimp, dance, short]  (for all p < -0.7479659304)
(R3-05)   [bland, copse, right]  (for the next p < 2.166571184)
(R3-06)   [bland, comet, sprig]  (for all larger  p )

Each of these passes to second-best for some other ranges of  p ; 
the only other triples that are second best for any  p  are
(R3-07)   [bland, copse, mirth]  (for all p < -0.747965930)
(R3-08)   [clamp, diner, ghost]  (only for 2.232224367 < p < 2.282649886)
(R3-09)   [blimp, dance, worst]  (for all p >  2.282649886)
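
Threshold values of p like these can be located numerically. The
Python sketch below assumes that, for p > 1, ranking by the Lp metric
is the same as ranking by the power sum  sum( v_i * i^p )  (smaller is
better); that agrees with what is said here about p=2 and p=1, but the
exact definition is in the section on Rankings. Since I have not
quoted the triples' cluster vectors, the example uses the two quad
vectors given for (4-9) and (4-8) in the previous section.

```python
def power_sum(vector, p):
    """sum( v_i * i^p ), where v_i counts the clusters of size i."""
    return sum(count * size ** p for size, count in enumerate(vector, start=1))

def crossing(u, v, lo, hi, tol=1e-12):
    """Bisect for a value of p in [lo, hi] at which u and v tie."""
    def gap(p):
        return power_sum(u, p) - power_sum(v, p)
    if gap(lo) * gap(hi) > 0:
        raise ValueError("the two vectors do not swap order on this interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(lo) * gap(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

u = [2156, 66, 9]     # cluster vector quoted for (4-9)
v = [2132, 87, 3]     # cluster vector quoted for (4-8)
print(power_sum(u, 1), power_sum(v, 1))   # both equal the vocabulary size
print(crossing(u, v, 2.0, 3.0))
```

Every pair ties at p=1, where the sum is just the 2315 words; this
pair ties again a little above p=2.5, with (4-9) ahead below that
value and (4-8) ahead above it.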


Particular values of  p  give us the metrics of independent interest:
p=+infinity, p=2, (p=1), p=0, and p=-infinity.


When p = +infinity, the ranking by the Lp metric coincides with ranking
by metric  M1, the size of the largest cluster. It turns out that every
triple yields cluster(s) with at least six words in them. There are 
283 triples whose largest clusters have only 6 members, so all tie by metric M1.
We can break the ties to pick out the "best" among that set by using the
Lp metric for large (finite) p ; that puts (R3-06) on top, closely followed
by (R3-09); then come 
(R3-10)   [blond, rivet, scamp]
(R3-11)   [birth, model, spawn]
As it turns out, these four are the only ones of the 283 to have just a single
cluster of size 6. They also score best by the metrics we will use in parts
(E) and (F). 

I did a partial search for triples that use fewer than 15 different
letters, yet still had no "large" clusters. I still did not find any
triples whose largest clusters are smaller than 6 words, but I did
find some reasonable candidates with just a few 6-element clusters
and nothing larger. An example is BLIMP + COAST + RENEW , but it's
not quite as good as the previous four -- it uses slightly more turns,
has more problematic clusters, etc. (It has four clusters of size 6;
I have not yet found a triple with repeated letters that has at most
one cluster of size 6 and none larger.)


Recall that when  p=2, the Lp metric is equivalent to the average cluster
size (as measured day by day). According to this ranking, the best triples
are (R3-05), (R3-04), (R3-08), and (R3-06). A player using the first of these
triples day after day would on average have only 1.4501 words to choose
from in the cluster containing the word of the day!

(Those first two triples maintain top ranking for all lower p , including p=0
and p=-infinity).


When p=1, all triples are tied; the L_1 metric simply counts the words
in the Wordle dictionary (2315) no matter what the triple. Even though
the optimization process is treated a bit differently for p>1 versus
p<1, for  p  very near  1  on either side, the ranking is the same:
the best for this  p  is (R3-05), followed by (R3-04), then
(R3-12)   [blind, match, spore], then
(R3-13)   [blind, comet, sharp]
As noted in our section on Rankings, these triples are best for those
values of  p  near p=1 because they are the ones with the smallest
derivative there, sum( v_i * i * log(i) ).


Taking p=0 gives metric M2, the total number of clusters; it is
maximized by (R3-05) which spreads the 2315 words into 1954 different
clusters; (R3-04) is second best with 1952. Then follow (R3-07) and
(R3-14)   [midge, porch, slant]
with 1949; these two switch their positions in the rankings precisely at p=0.

Finally, for very negative  p  the ordering of starting triples matches
that of "p = - infinity", i.e. metric  M3 (the number of singletons).
Now (R3-04) is the best triple, putting each of 1696 words --- nearly
three-fourths of the Wordle solution list --- into its own cluster.
This is followed by (R3-07), (R3-14), and then (R3-15):
(R3-15)   [blond, march, spite]
(R3-16)   [copse, gland, mirth]


In a previous section we proposed variants of the Lp metrics that can also be
computed from the cluster vectors. We have metric M4, which counts words that
are in clusters with at most 3 words in them. The winner on this scale is
(R3-17)   [bench, midst, polar] 
It has just 78 words in clusters of sizes 4 or more (10 quads, 3 sextets,
and one cluster each of sizes 5,7,8); the other 2237 words are in clusters
small enough that we can just test all the candidates in a cluster by turn 6 !
Second and third place go to 
(R3-18)   [chirp, donut, false]
(R3-19)   [blind, comet, spray]
with 2234 and 2228 respectively; then this is followed very closely by many good ones.

Not much changes if instead we count the number of small clusters, as a fraction
of the total number of clusters. This time it's (R3-18) that ekes out a win,
with 99.21% of all clusters small; (R3-17) is just behind it with 99.16%,
then (R3-19) is third with 99.12%, and in fourth, with 99.11%, is
(R3-20)   [bench, solid, tramp]
They have only 15, 16, 17, and 17 clusters, respectively, with more than
3 words in them.


We turn next to the metrics relating to the numbers of green and yellow tiles
produced by the starting triple: M5 = G + f Y, with colored tiles counted
across a whole 2315-day cycle. 

When  f=0, we are comparing our starting triples by looking only at the
green tiles. In this case, and indeed for all  f < 0.0769, the best is 
(R3-21)   [brace, moult, shiny]
which produces 3611 green tiles per cycle (an average of 1.56 per day)
along with 5260 yellow tiles. 

For larger values of  f  up to 0.5587, including the natural choice of f=1/2,
the best is this one, with G=3605, Y=5338 :
(R3-22)   [built, crone, shady]  
(Note: [built, crony, shade]  will yield identical game play, and so need
not be mentioned separately.)

Then, for higher fractional weights (up to f = 1.0) the torch passes to
this one, which offers 3505 green tiles and 5517 yellows (total: 9022)
(R3-23)   [curly, point, shade]

Finally at f=1.0, yellow and green are counted equally. At that point,
any two triples that use the same letters will show the same number of
colored tiles, so there is a massive tie of 455 (I think) triples that
all use the letters
       {a, c, d, e, h, i, l, n, o, p, r, s, t, u, y} ,
probably the best of which is (R3-23) itself. (It also has the highest
proportion of green tiles among the colored ones.)
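
The threshold values of  f  quoted above can be recovered from the G and Y
totals alone: two triples score equally when G1 + f*Y1 = G2 + f*Y2. A short
Python sketch, using only the three (G, Y) totals given in the text (the
function names are mine):

```python
# (G, Y) totals per 2315-day cycle, as quoted in the text.
triples = {
    "R3-21": (3611, 5260),
    "R3-22": (3605, 5338),
    "R3-23": (3505, 5517),
}

def m5(name, f):
    """Metric M5 = G + f*Y for the named triple."""
    g, y = triples[name]
    return g + f * y

def crossover(a, b):
    """Value of f at which triples a and b have equal M5."""
    ga, ya = triples[a]
    gb, yb = triples[b]
    return (ga - gb) / (yb - ya)

f1 = crossover("R3-21", "R3-22")   # 6/78
f2 = crossover("R3-22", "R3-23")   # 100/179
best_at_half = max(triples, key=lambda name: m5(name, 0.5))
```

This compares only these three triples with one another; it does not show
that no other triple beats them.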


So each of those triples qualifies as "best" for a range of values of  f.
For each positive value of  f <= 1, the second- and third-best are either
one of those three triples, or one of these four :
(R3-24)   [briny, coupe, shalt] (3600 G, 5319 Y)
(R3-25)   [cruel, point, shady] (3476 G, 5546 Y)
(R3-26)   [count, plier, shady] (3466 G, 5556 Y)
(R3-27)   [crony, guide, shalt] (3547 G, 5429 Y)
(The triples [crony, guilt, shade] and [crone, guilt, shady] have
exactly the same letters in the same positions as (R3-27), so the
game proceeds identically in all three cases.)

Rather than a high average count of colored tiles, you might want
a high minimum count of colored tiles. Arguably the best triple
to use, to avoid days with few clues from the starting triple, is:
(R3-28)   [handy, slice, tumor]
Use this starting triple and you will never get an all-gray day!
There is only one day in the whole 7-year cycle when you will get just a
single colored tile (it will be a green A because the hidden word will be
"kappa"). In addition you get just two yellow tiles on 15 days, two greens
on 21 days, and one of each on 47 days. On all the other 2231 days
you will be rewarded with three or more colored tiles. (Having
these same attributes, but generally worse measures of play, are its
permutations [handy, lemur, stoic] and  [duchy, merit, salon]. All triples
made from these 15 letters will give three or more colored tiles to
all but 84 of the 2315 hidden words; no letter set can reduce this
number below 84. The main reason many of those 84 words give only 1 or 2
colored tiles is because the hidden word itself is made of few letters,
e.g. MAMMA can never get more than two colored tiles from our 15-letter
starting triples! If we restrict our attention to the hidden words made
of five distinct letters, every one of them except GAWKY will return
three or more colored tiles to triple (R3-28).)
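
Because the 15 letters of this triple are all different, each guessed
letter is colored (green or yellow) exactly when the hidden word contains
it. So the number of colored tiles is just the number of distinct letters
the hidden word shares with the triple. A small Python sketch of that
count (the function name is mine):

```python
TRIPLE = ["handy", "slice", "tumor"]
LETTERS = set("".join(TRIPLE))   # the 15 distinct letters of (R3-28)

def colored_tiles(hidden):
    """Colored tiles shown by the triple: distinct hidden letters among the 15."""
    return len(set(hidden) & LETTERS)

counts = {w: colored_tiles(w) for w in ["kappa", "mamma", "gawky", "crane"]}
```

This shortcut counts tiles without telling green from yellow, and it relies
on the triple having no repeated letters.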


A variant in the opposite direction (many colored tiles) is to ask for
the greatest number of colored tiles from just the first two words in a
triple. We mentioned in the Introduction the possibility of starting
with an intended triple but abandoning the third word if the first two
have already yielded "many" clues. (This is much less useful in the 
compound games.) In the next section we will look at starting pairs,
and given any starting pair that gives many colored tiles (by whatever
weighting we want for yellow versus green ones), we can simply adjoin a
third word to get a starting triple that would be optimal by the metric
of this paragraph. Carrying this out for some high-ranking starting
pairs we find that the best third word to add is usually dumpy, jumpy,
or pudgy.  Judging the triples that are created in this way by the
metrics for triples, the best is probably
(R3-29)   [close, train; dumpy]
Also scoring well are [scone, trial; dumpy] and [crony, slate; humid].

                ----------------

(E) BEST STARTING SETS IF YOU INTEND TO JUST KEEP GUESSING

We have already observed that no triple has property (B), that is,
for every starting triple the guess-at-will strategy can fail to find
the word by turn 6. So we begin with metric M6 : the probability
of a loss.

I believe the very best by this metric is
(R3-30)   [blond, girth, swamp]
It's actually kind of hard to make enough bad guesses that you lose
after 6 turns! The expected number of losses in an entire 2315-word
cycle is 0.446, i.e. you could reasonably expect to go *14 years*
between losses! (But yes, it can happen: For example it would be
consistent with the rules to guess the sequence "treat", "extra",
and "cater", and lose if the hidden word is "after".  Try it and see.)
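
The losing sequence above can be checked mechanically. The Python sketch
below is my own illustration, not the program used for the results in this
document: it scores a guess with the usual Wordle rule for repeated
letters and confirms that each of the three bad guesses agrees with every
clue seen so far.

```python
def feedback(guess, hidden):
    """Wordle tile colors for a guess: 'G' green, 'Y' yellow, '-' gray."""
    tiles = ["-"] * 5
    unused = []  # hidden letters not matched by a green tile
    for i, (g, h) in enumerate(zip(guess, hidden)):
        if g == h:
            tiles[i] = "G"
        else:
            unused.append(h)
    for i, g in enumerate(guess):
        if tiles[i] == "-" and g in unused:
            tiles[i] = "Y"
            unused.remove(g)  # each hidden letter can color only one tile
    return "".join(tiles)

def consistent(candidate, history, hidden):
    """True if candidate would have produced the same tiles as hidden so far."""
    return all(feedback(g, candidate) == feedback(g, hidden) for g in history)

hidden = "after"
history = ["blond", "girth", "swamp"]   # triple (R3-30)
legal = []
for guess in ["treat", "extra", "cater"]:
    legal.append(consistent(guess, history, hidden))
    history.append(guess)
```

All three entries of `legal` come out True, and after six turns the word
"after" has still not been played.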

Runners-up appear to be
(R3-31)   [blond, right, swamp]
(R3-32)   [blend, right, scamp]

That small chance of failing to finish by turn 6 can be reduced to 0
if instead of guessing cluster words at random, we remember to play a
few preferred words when a troublesome cluster appears. But actually, 
these three triples were found waiting in the list of 261 triples that
allow a win by turn 5, that is, if instead of just the few preferred
words we were to memorize dozens of rules, then we could force a win
one turn earlier even in those 2%-3% of the cases when the game would
otherwise go to the sixth round. (Triple (R3-32) can accomplish a 5-turn win
by using 41 preferred words and 13 out-of-cluster moves; this total of
54 rules is minimal within that group.) These triples respectively manage
to break up all but 7, 8, and 10 of the "problematic" quadruples  mentioned
in part (B).


Metric M7 measures the expected length of a game (playing until the hidden
word is found, potentially past the sixth turn). This is minimized by
the stellar starting triple
(R3-05)   [bland, copse, right]
which, on average, solves Wordle with this strategy in 4.1707 turns.

This triple is also excellent by other measures. We have already seen
in part (D) that it has the largest number of clusters (1954); and has
the lowest daily average of number of possible solutions (an average of
only 1.4501 words matches the tile colors). When using the guess-at-will
strategy, it is the triple giving the highest probability of winning on
the very next turn (84.4% of games will end by turn 4). And we'll see
in part (F) that it also finishes fastest, on average, when using a 
strategy designed to win by turn 6.

Nearly as good in terms of metric M7 are these four (which are
actually as good or better by some of the other metrics). So arguably
any of these could also be called the best overall starting triple:
(R3-04)   [blimp, dance, short]
(R3-07)   [bland, copse, mirth]
(R3-15)   [blond, march, spite]
(R3-14)   [midge, porch, slant]

Note also that [blond, spite] alone is a good starting pair,
as we shall see in the next section. 

(Triple (R3-18) has the distinction of being the worst of the triples
mentioned in this document by the following criterion: a very unlucky guesser
pursuing a guess-at-will strategy with any of the dozens of highlighted
triples will surely win by move 9 at the latest -- except it is possible with
this one to still require a tenth move to enter the hidden word and win!
This happens for example if the player guesses, in succession, the
completely appropriate sequence 
  rarer, baker, waver, eager, gamer, gazer, gayer
with "gayer" being the hidden word.)

                ----------------
(F) BEST STARTING SETS IF YOU'LL USE A SIMPLE STRATEGY THAT FORCES A WIN

I examined every starting triple that I believe has a reasonable chance
to excel in this section for some good resolution of its problematic
clusters. To be clear here, we are interested only in simple strategies
that will guarantee a win by turn 6. That means we use a guess-at-will
strategy whenever that will surely find the hidden word that fast; we
only need a plan for the problematic clusters for which the
guess-at-will strategy might not discover the hidden word until turn 7
or later. Rather than include the data in this narrative, I have
collected the specifics for each triple mentioned in this document into
a separate file. With this particular chosen
strategy to guarantee a win by turn 6, we can then compute the
probability distribution showing how long the game might then last.

So let us turn right away to the triples with a strategy that wins 
with the fewest turns. Using the sets of preferred in-cluster and
out-of-cluster words listed in this file, the starting triples 
which will take, on average, the fewest turns to find the hidden word are:

(R3-05)   [bland, copse, right], which needs 4.1682 turns on average
(R3-14)   [midge, porch, slant]
(R3-04)   [blimp, dance, short]
(R3-07)   [bland, copse, mirth]

It is a consistent pattern that the ranking of good starting sets
by metric M8 is very similar to their ranking by metric M7. This is
not surprising: in this subsection we are assuming the player
takes specific actions on turn 4 if the colored tiles signal
the need to do so; but the affected clusters only account for
a few percent of all the words, so with or without the extra
rules, on most days we reach the answer just with freeform guessing.
Hence the probability distributions are not greatly different.

Similarly, the next few entries in the ranking also score
very highly by metric M7. Most of the next few entries have
already appeared on other lists:
(R3-16)   [copse, gland, mirth] 4.1716 turns on average
(R3-13)   [blind, comet, sharp]
(R3-33)   [blond, right, space]
(R3-15)   [blond, march, spite]


The top four are also at the top of the list of how often the 
hidden word is discovered right away on turn 4. For example, a player
who enters BLAND, COPSE, and RIGHT on turns 1-3 and then follows this
strategy will guess the hidden word on turn 4 fully 84.4% of the time.
(The precise order for this criterion is (R3-05), (R3-04), (R3-07), and
then (R3-14), which is actually also tied with (R3-12).)

Dual to the most-wins-by-turn-4 record held by (R3-05) is the
fewest-wins-on-turn-6 triple. In English: which of these strategies,
designed to finish by turn 6, most often actually finishes by turn 5?
The best I've found is for
(R3-34)   [brown, midst, place]
So not only does this strategy guarantee a win by turn 6, it is *almost*
a strategy to win by turn 5, while using many fewer extra rules than the
261 true win-within-5 strategies discussed earlier. Second place goes to
(R3-35)   [blown, caper, midst]
which does not have a strategy that can guarantee a win by
move 5, but still can accomplish it 99.13% of the time!
(Triples (R3-09) and (R3-15) also score very well.)


Since triple (R3-05) has now appeared at the top of several lists,
let me discuss this one in greater detail.

Obviously after any starting triple, the player (assuming they know the
Wordle dictionary!) can find the hidden word by move 6 if it is contained in
a cluster with only 1, 2, or 3 elements. Triple (R3-05) doesn't have many
clusters that are larger than that! The four largest clusters have sizes
10, 7, 6, and 5;  it turns out that to prevent a loss by turn 6 when
looking within one of these clusters, we just need to play the corresponding
preferred words
    eater, femur, major, mover 
on turn 4, when they fit. There is a cluster of four _IGHT words that could
lead to a loss too, but it can be avoided by using this rule:
   if FIGHT fits, play AWFUL
That leaves only one cluster of size 5 and thirteen of size 4 and as it
turns out, all of them can be resolved by freeform guessing. That's the
playing strategy that will, over the long haul, need 4.1682 turns on average
to finish the game, and will always do so by turn 6. In fact, it will
finish on turns 4, 5, and 6 with probabilities [0.843629, 0.144557, 0.011814].
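
As a quick arithmetic check, the average of 4.1682 turns follows directly
from the three probabilities just quoted. A few lines of Python (the
variable names are mine):

```python
# Probability that the game ends on turn 4, 5, or 6, as quoted in the text.
dist = {4: 0.843629, 5: 0.144557, 6: 0.011814}

total_probability = sum(dist.values())
expected_turns = sum(turn * p for turn, p in dist.items())
```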

Experimentally this triple also minimizes the expected length of a Quordle
game, taking about 7.4 turns to solve all four subgames. This has become
the starting triple I personally use. (If you try it too, and get stuck 
on a puzzle, the fourth word MURKY is the word most likely to be helpful.)
Were it not for "user error", I would regularly be able to go hundreds of
Quordle games between losses. (The probability distribution above would
suggest that a four-round Wordle game would win in 9 moves only 97.1318%
of the time -- one loss every 35 games or so -- assuming that the four
subgames are independent; the fact that a much higher success rate
is apparently possible shows the weakness of this independence assumption.
Apparently the information gained from the easy subgames greatly
reduces the ambiguity in the harder subgames.) This is relevant as we
compare this optimal (?) starting triple to the optimal starting
quadruple, quad (4-1). An independence assumption would suggest
that already for Dordle, and surely for the larger games, there is a 
lower probability of loss if we use a larger starting set instead of
a starting triple; we made precisely this kind of claim when discussing
quad (4-1) in the previous section. But this starting triple is
seen experimentally to give fewer losses than (4-1), at least for
Quordle, because there is substantial conveyance of information from
one subgame to another now; the independence assumption is this
time far from valid.
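
The 97.1318% figure can be reproduced with a simple model, which I state
here as an assumption: after the three starting words, each of the four
subgames independently needs 1, 2, or 3 further turns (with the
probabilities given above), no guess helps more than one subgame, and the
game is won if the total stays within 9 turns. A Python sketch:

```python
from itertools import product

# Extra turns a single subgame needs after the starting triple.
extra = {1: 0.843629, 2: 0.144557, 3: 0.011814}

p_win = 0.0
for combo in product(extra, repeat=4):   # one entry per subgame
    if 3 + sum(combo) <= 9:              # triple plus all later guesses
        prob = 1.0
        for k in combo:
            prob *= extra[k]
        p_win += prob

games_per_loss = 1 / (1 - p_win)
```

The loop visits only 81 combinations, so nothing more clever than brute
force is needed.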



It's beyond the scope of this article, but nonetheless it is possible to
write out a complete decision tree showing the optimal cluster words
to choose (rather than simply choosing cluster candidates at random);
following this tree over the entire 2315-day cycle would involve
entering a total of 9629 words, an average of 4.1594 per day.
But we demur since the goal of this paper is to study simple ways
to play Wordle, and a human cannot be expected to memorize the whole tree.

Indeed, the other way to compare starting sets in this part (F) is
precisely in terms of the simplicity of their win-by-turn-6 algorithms.
Although using non-Wordle words allowed us in part (B) to find a
starting triple that required only one extra rule ("guess SKATE"),
we can't do that well when using only Wordle words in our starting triple.

It appears that three is the minimum here, that is, every starting
triple requires the player to remember at least three additional rules
if he or she wishes to ensure a win by turn 6. The two triples (R3-30)
and (R3-31) are the only ones that manage to do this using only "preferred"
words. Some other starting triples also fall just 3 clusters short of
meeting goal (B), but require one or two out-of-cluster resolutions:
(R3-36)   [copse, drawn, light]
(R3-37)   [force, glint, swamp]
(R3-38)   [blimp, cedar, ghost]
(R3-39)   [choir, gland, swept]
(R3-40)   [glint, peach, sword]

I believe there are no other starting triples that lead to just three
problematic clusters. Of those with four bad clusters, there are just
these three that can resolve them all with in-cluster "preferred" words:
(R3-20)   [bench, solid, tramp]
(R3-41)   [blend, match, sprig]
(R3-42)   [blimp, cedar, thong]

                ----------------
(G) SOME OTHER FAVORITE TRIPLES

We close with a list of a few additional starting triples that have a
well-balanced set of attributes.

(R3-43)   [blond, parse, wight]
(R3-44)   [glide, spawn, throb]
(R3-45)   [crawl, fight, spend]
These three are among the 261 starting triples that can be won on turn 5, 
have low failure rates when the player is just guessing, and have
win-by-turn-6 strategies that use no out-of-cluster words.  (The top
 and bottom ones use only 5 preferred words). The top two use all 9 of
the most common letters. I believe (R3-45) is the fourth-best triple
in terms of its failure rate when using guess-at-will.

(R3-46)   [clang, spied, throw]
This is one of the 283 triples whose largest clusters have only 6 elements;
it has several of them, unfortunately, but despite that, finds the hidden
word in fewer moves than most.

(R3-47)   [crimp, doubt, salve]
The triple appears on two short lists: all its clusters are of size 6
or less, and it is possible to create a strategy that allows the player
to ensure a win by turn 5 . It even manages to accomplish these things 
without also being on the list of 15-letter triples! (But it's not
particularly recommended for play: it takes 8 preferred words plus 4
out-of-cluster plays just to ensure a win by move 6.)

                ----------------
(H) SUMMARY OF "BEST" STARTING TRIPLES

I said at the start of this section that no one triple could be called "best".
Here I will do so anyway, for various ways to phrase the question:

Rankings based on how the game board looks right after the starting triple:
(R3-09)   [blimp, dance, worst]  lowest size of largest cluster (M1=6)
(R3-05)   [bland, copse, right]  highest number of clusters (M2=1954)/lowest average cluster size (1.1847)
(R3-04)   [blimp, dance, short]  highest number of singleton clusters (M3=1696)
(R3-17)   [bench, midst, polar]  most words in small clusters (M4=2237)
(R3-21)   [brace, moult, shiny]  most green tiles (G=3611 per 2315-day cycle)
(R3-22)   [built, crone, shady]  most colored tiles (by fractional count) (G+Y/2 = 3605 + 5338)
(R3-23)   [curly, point, shade]  most colored tiles (by total count) (G+Y= 3505 + 5517)
(R3-28)   [handy, slice, tumor]  fewest bad-tiles days (one [1,0], fifteen [0,2])
(R3-29)   [close, train; dumpy]  "many" colored tiles after first two words

Rankings based on game's end, after randomly guessing candidates:
(R3-30)   [blond, girth, swamp]  lowest chance of failure (M6 = 0.02% -- once per 5190 games)
(R3-05)   [bland, copse, right]  lowest average turn-count (M7 = 4.1707)

Rankings based on game's end, after using simplest algorithm designed to ensure a win by move 6:
(R3-05)   [bland, copse, right]  lowest average turn-count (M8 = 4.1682)
(R3-30)   [blond, girth, swamp]  smallest number of extra rules to follow (3)
(R3-34)   [brown, midst, place]  lowest use of turn 6 (0.8210%)

Rankings based on game's end, after using an algorithm designed to ensure a win by move 5:
(R3-01)   [blast, midge, porch]  lowest average turn-count (4.1914)
(R3-32)   [blend, right, scamp]  lowest number of rules needed (54)

==============================================================================
                         TWO

We next ask what might be the "best" two-word starting sets.

We saw in a previous section that there are many ways to define
"best". We'll use those definitions here too, and will keep track
of the best few as measured by each individual metric,
in case we'd like to find a starting set that isn't "best" by 
any single criterion, but is at least "good" by many.

When we looked at five-word starting sets, we discovered that
having repeated letters was the only choice; among four-word
starting sets it was a competitive choice, and among three-word
starting sets, having repeated letters was an uncommon choice.
Now, among two-word starting sets, we expect that having repeated
letters will be a poor choice. So in what follows we will
restrict our attention to pairs with no repeated letters.

It turns out that there are exactly 196,175 such pairs of Wordle words.
I have made a list of all of them, along with some basic data about each. 
(It is sorted informally to reflect a notion of expected "quality". 
At the top is [salon, trice]; at the bottom is [inbox, jumpy].)  
I have examined all these pairs against all the metrics in sections
(A) and (B); in section (C) I ignored some of the least-likely candidates
to be "best", but I suppose one of those could end up being a good one.

OK, then, let's follow the pattern of parts (D) (E) (F) of the 
previous section. As with the triples, we'll conduct our analysis
here and store all the data about the individual pairs mentioned in
a separate file. We'll summarize our list of "best" pairs at the
end of this section.

                ----------------
(A) THE BEST STARTING PAIRS BY THE VARIOUS POST-START METRICS

As with triples, we may rank all the (10-letter) pairs by their
Lp metrics, for every real number  p  including +- infinity.
The rankings of the top three only change at a few select
values of  p:

p=+infinity
                [scald, tenor], [clone, stair], [nosey, trail]
p=+6.55013012                  X
                [clone, stair], [scald, tenor], [nosey, trail]
p=+4.71796285                                                X
                [clone, stair], [scald, tenor], [coast, liner]
p=+3.82766546                                                X
                [clone, stair], [scald, tenor], [cairn, stole]
p=+3.67351974                                  X
                [clone, stair], [cairn, stole], [scald, tenor]
p=+3.54643114                                                X
                [clone, stair], [cairn, stole], [coast, liner]
p=+3.45486425                                                X
                [clone, stair], [cairn, stole], [salon, trice]
p=+2.98298632                  X
                [cairn, stole], [clone, stair], [salon, trice]
p=+2.88072579                                  X
                [cairn, stole], [salon, trice], [clone, stair]
p=+2.62287098                  X
                [salon, trice], [cairn, stole], [clone, stair]
p=+2.13659860                                                X
                [salon, trice], [cairn, stole], [close, train]
p=+0.69814188                                  X
                [salon, trice], [close, train], [cairn, stole]
p=+0.54007080                                                X
                [salon, trice], [close, train], [crane, spilt]
p=+0.50624357                                  X
                [salon, trice], [crane, spilt], [close, train]
p=+0.48959569                                                X
                [salon, trice], [crane, spilt], [price, slant]
p=+0.21339525                  X
                [crane, spilt], [salon, trice], [price, slant]
p=+0.20671891                                  X
                [crane, spilt], [price, slant], [salon, trice]
p= 0.00000000                  X
                [price, slant], [crane, spilt], [salon, trice]
p=-0.76193452                                                X
                [price, slant], [crane, spilt], [crane, split]
p=-1.30720631                                                X
                [price, slant], [crane, spilt], [crest, plain]
p=-2.97845882                                                X
                [price, slant], [crane, spilt], [lance, sport]
p=-infinity
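
Breakpoints like those in this table can be located numerically. The
Python sketch below assumes the Lp comparison reduces to the power sum
S(p) = sum( v_i * i^p ) over the cluster vector, which is consistent with
the special cases p=0, 1, 2 described in this document. It then uses plain
bisection to find a value of  p  where two starting sets swap places. Both
cluster vectors are hypothetical toys, not data for any real pair.

```python
def power_sum(v, p):
    """S(p) = sum of v[i] * i**p, where v maps cluster size i -> number of clusters."""
    return sum(count * size ** p for size, count in v.items())

def crossover(v_a, v_b, lo, hi, tol=1e-12):
    """Bisection for a p in [lo, hi] where the two power sums are equal."""
    def diff(p):
        return power_sum(v_a, p) - power_sum(v_b, p)
    if diff(lo) * diff(hi) > 0:
        raise ValueError("no sign change on this interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(lo) * diff(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

set_a = {1: 4, 4: 1}   # hypothetical: four singletons and one cluster of 4
set_b = {2: 1, 3: 2}   # hypothetical: one pair and two clusters of 3
p_star = crossover(set_a, set_b, 2.0, 3.0)
```

Here set_a has the smaller sum at p=2 and set_b the smaller sum at p=3, so
the two trade places somewhere in between.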


The rankings for the largest values of p>0 agree with the rankings
by metric M1, the size of the largest cluster.  Every starting
pair leaves some clusters of size 16 or more --- no matter what
starting pair is played, there will be days on which there are 16 or
more words that are consistent with the clues they provide. The
top three listed pairs are the only ones whose clusters all contain
at most 16 words (they have two, three, and four of them respectively).
(S2-1)    [scald, tenor]
(S2-2)    [clone, stair]
(S2-3)    [nosey, trail]
(For example, if pair (S2-1) turns the E and R tiles yellow and the
other eight gray, then the hidden word that day could be any of:
    bribe, brief, every, fibre, fiery, grief, grime, gripe,
    prime, prize, puree, purge, query, rhyme, rupee, where.
And an all-gray tile display indicates the hidden word is one of
the sixteen Wordle words that lack s,c,a,l,d,t,e,n,o, and r.)
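
To make the idea of a cluster concrete, here is a Python sketch of my own
that groups words by the tiles a starting pair would show. It is run on
the sixteen words listed above plus two extra words, not on the full
2315-word list, so it only confirms that those sixteen share one tile
pattern; it cannot confirm that no other Wordle word joins them.

```python
from collections import defaultdict

def feedback(guess, hidden):
    """Wordle tile colors for a guess: 'G' green, 'Y' yellow, '-' gray."""
    tiles = ["-"] * 5
    unused = []  # hidden letters not matched by a green tile
    for i, (g, h) in enumerate(zip(guess, hidden)):
        if g == h:
            tiles[i] = "G"
        else:
            unused.append(h)
    for i, g in enumerate(guess):
        if tiles[i] == "-" and g in unused:
            tiles[i] = "Y"
            unused.remove(g)
    return "".join(tiles)

def clusters(starting_words, vocabulary):
    """Group vocabulary words by the tile patterns the starting words produce."""
    groups = defaultdict(list)
    for word in vocabulary:
        key = tuple(feedback(g, word) for g in starting_words)
        groups[key].append(word)
    return groups

listed = ("bribe brief every fibre fiery grief grime gripe "
          "prime prize puree purge query rhyme rupee where").split()
vocab = listed + ["crane", "tenor"]   # two extras, to show other clusters
groups = clusters(["scald", "tenor"], vocab)
big = groups[("-----", "-Y--Y")]      # all gray, then yellow E and yellow R
```

The cluster vector used by the Lp metrics is then just a tally of the
lengths of these groups.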


When p=2 we are measuring the daily average cluster size;
the starting pairs that minimize this are
(S2-6)    [salon, trice]
(S2-2)    [clone, stair]
(S2-10)   [cairn, stole]
For example the first leaves us with an average of 4.3633 possible
solutions each day (but a maximum of 23). These three maintain their
leading positions for a range of values of  p  which includes the
case p=1 where all starting sets are momentarily tied.

When p=0 we are measuring the total number of clusters, metric M2.
There is a tie here between two very similar anagrams, each with 
M2=1071 clusters:
(S2-4)    [price, slant]
(S2-5)    [crane, spilt]
These two maintain their lead for all negative  p,  including
p=-infinity which is the metric M3: the number of singletons.
(They have respectively M3=634 and M3=631: after just these two
starting words are entered, there is already a unique possible
solution word on over a quarter of all days, and simply guessing
among possible solutions will score us a victory on turn 3
46% of the time.)

The other starting pairs which finish in the top three for some  p  are
(S2-7)    [crane, split]
(S2-8)    [crest, plain]
(S2-9)    [lance, sport]
          [close, train]
          [coast, liner]



We turn next to metric M4, which counts the number of words in "small"
clusters. Since we are beginning with just a starting pair, we
can confidently play through all words in any cluster of size four or
less. For the starting pair (S2-6) SALON + TRICE, those clusters include
1543 of the 2315 words --- which is almost exactly 2/3 of them, and more than
for any other pair.  Close behind is (S2-4), and then a tie for third
place between (S2-5) and (S2-10). 

(Counting by clusters instead of by words, the best pair is (S2-4)
for which over 91.5% of the clusters contain fewer than five words. Close
behind is (S2-2), and CAIRN + SLEPT is third.)

                ----------------

The other metrics that we described in a previous section
use the numbers of green and yellow tiles that result from
playing the starting pair day after day. We can determine the
pairs that produce the highest values of  G + f Y  (for various
values of  f >= 0 ) over the course of a 2315-day cycle.

Once again the conclusion of which pair is best depends on the
value of the parameter  f : how much of a green tile is a yellow
tile worth? Then the best pairs are these
(S2-11)   [crony, slate]  if f < .4518, including f=0 (only green matters)
(S2-12)   [irony, slate]  if .4519 < f < .8368
(S2-2)    [clone, stair]  if .8369 < f < .9135
(S2-13)   [route, slain]  if .9135 < f < 1

At the extreme of counting yellows and greens equally, (i.e. when f=1)
there is a 13-way tie since for f=1 the score depends only on the 
letters used, not their positions. Thus all of these pairs will
score equally (they each will produce 7062 colored tiles):
   alien,torus  arose,unlit  arose,until  arson,utile  louse,train
   noise,ultra  outer,slain  outer,snail  route,snail  sonar,utile
   solar,unite  solar,untie       and (S2-13)=route,slain  
because they are made of the same letters: a,e,i,o,u and l,n,r,s,t .

Of these thirteen pairs, the one with the best distribution vector
in the next subsection is (S2-13); for example it takes an average
of 3.8400 turns to win using the guess-at-will strategy with this pair.

These happen to be the 10 most-used letters in Wordle if we count
by *words containing the letter*. If instead we count by *appearances
of letters within words*, then C would replace U in this list;
and as it happens the words containing  AEIOCLNRST  are (tied for) 
second in this ranking, with 7053 colored tiles. (Last in this ranking
is JUMPY + WHISK, which yields only 3585 colored tiles.)
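
The two ways of counting letters differ only in how repeated letters
within a word are treated. The Python sketch below shows both counts, and
also the f=1 score of a pair: when the ten letters are all different, each
guessed letter is colored exactly when the hidden word contains it, so the
score is the sum of the "words containing the letter" counts. The
four-word sample list is a tiny stand-in for the real vocabulary, chosen
only to make the difference visible.

```python
from collections import Counter

def letter_counts(words):
    """Return (words containing each letter, total appearances of each letter)."""
    by_word = Counter()
    by_appearance = Counter()
    for w in words:
        by_word.update(set(w))      # each letter counted once per word
        by_appearance.update(w)     # each occurrence counted
    return by_word, by_appearance

def f1_score(pair, by_word):
    """Colored tiles (green plus yellow) for a pair with ten distinct letters."""
    return sum(by_word[c] for c in set("".join(pair)))

sample = ["mamma", "kappa", "after", "eager"]   # stand-in word list
by_word, by_appearance = letter_counts(sample)

score_a = f1_score(["route", "slain"], by_word)
score_b = f1_score(["alien", "torus"], by_word)
```

The two pairs use the same ten letters, so their scores agree on any word
list whatsoever.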

It makes no sense in Wordle to value a yellow tile more than a green, 
but mathematicians might ask about values of f > 1 as well. For all
f>1.02, the optimal pair is  OCTAL + RESIN, which turns out to yield the
most yellow tiles of any starting pair: 5827 of them, along with 1226 green.

Besides these four best starting pairs, the other pairs that rank
among the top three, for various values of  f < 1, are:
   [brine, soapy], [briny, slate], [cairn, stole], [corny, slate],
   [crony, saute], [irony, stale], [rainy, stole], [route, snail]



In the list of all 10-letter pairs mentioned above, I sorted all
the pairs by a composite metric (basically combining M1 and M5);
unsurprisingly, it begins with some of the pairs already mentioned
(SALON+TRICE is first) but already by the seventh entry (SCONE+TRAIL)
we meet a pair that isn't really close to "best" by any single
measure, yet is fairly good by several of them. It's quite reasonable
to balance competing metrics in this way; your results may be
different and still good!

Since I did the computations with the list of all possible
10-letter pairs, I can, just for laughs, report the worst
possible 10-letter starting pairs by the various metrics. These
include such gems as [jumpy, shock], [gawky, jumbo],
[gawky, squib], and [ethos, umbra].

                ----------------

(B) BEST STARTING SETS IF YOU INTEND TO JUST KEEP GUESSING

I believe it is true that there is no starting pair with which the player
can be assured of winning by turn 6, if the player just randomly selects
a word consistent with the clues at each step.

Metric  M6  ranks the starting pairs by the probability of a loss. For 
this search I resorted to some heuristics to trim the search space a bit,
so it is *possible* that there can be a better pair, but these appear to
be the best. (Not only is there no starting pair for which this strategy
is sure to win by turn 6, but I did not even encounter any pairs for which the
strategy is sure to win by turns 7 or 8! A win by turn 9 is guaranteed for
many, e.g. for CABLE + SNORT, but with many good starting pairs the player
might continue making logical guesses and still not discover the hidden 
word for a long time --- e.g. starting with the (good!) pair SLANG + TRICE,
it is nonetheless possible for the game to continue to the 12th move this way!)

Here are the best pairs I have found, when judged by metric M6.
Shown here is the computed (not experimental!) rate of failure
when following a guess-at-will strategy after starting with each pair.
(S2-14)   [spend, trawl] 0.002421 (1 fail per 413 games)
(S2-15)   [blond, tramp] 0.002424
(S2-16)   [blend, tramp] 0.002470
(S2-17)   [bland, swept] 0.002515
(S2-18)   [scold, tramp] 0.002596 (1 fail per 385 games)
The (low!) failure rates are quite close, and there are plenty more
good daily choices that would fail less than once per year. There is
a consistent pattern to nearly all the pairs high on this list: 
the two words each have just one vowel, right in the middle.


Metric M7 counts the average number of turns until victory (including games
that use a 7th, 8th, ... turn) when the player continually tries randomly
chosen words consistent with the clues. The best pairs by this metric are
(S2-5)    [crane, spilt], 3.700452 turns on average
(S2-4)    [price, slant], 3.702214
(S2-7)    [crane, split], 3.712983
(S2-19)   [cried, slant], 3.713729
(S2-20)   [print, scale], 3.719044
(These all use the same 10 letters, except (S2-19) uses D not P.)
Again, even further down this list, there is a pattern, and it's
very different from the previous list: now there are always
three vowels, usually AEI.

There is a definite tradeoff between metrics M6 and M7. The
best words in the second table will guess the hidden word
on turn 3 nearly half the time, while the best words in the
first table will do so less than one-third of the time. In
the other direction, the pairs at the top of the second table
usually have failure rates twice as high as the words in the
first table --- even much higher as we read just a little
beyond the listed portion of the table. Occasionally there
are pairs that are reasonably good by both metrics, e.g.
   [crisp, table] [brace, spilt] [scalp, tribe] [clasp, trend]
(We would like both the average number of turns, T, and the
percentage P of failures to be low, i.e., we want the pair (T,P)
to be close to the origin; but for all the pairs studied,
3T+2P tends to be larger than about 12, so that's a description
of the tradeoff.)

For point of reference, I took a pair at random from near the center of
the list of all 10-letter pairs ("center" according to the informal 
combination metric I mentioned in part (A) ) : RECAP + WOULD. This
unremarkable starting pair could also be used with a guess-at-will
strategy, and the results are still fairly good: one could expect
only one failure per 123.9 games, and an average of 3.9954 turns
per game. So our "best" starting pairs are distinctly better,
although for the average player playing only once per day,
it may be difficult to see the difference!

                ----------------
(C) BEST STARTING SETS IF YOU'LL USE A SIMPLE STRATEGY THAT FORCES A WIN

As with starting triples, I examined each starting pair that 
could reasonably be expected to be highly ranked in this section, and
determined a minimal strategy to ensure a win by turn 6. That
means checking each of the clusters created by the starting pair,
and using the simplest resolution of it:
(1) If that cluster will surely find the hidden word by turn 6
    using a  guess-at-will strategy, the player will pursue that.
(2) If not, but if one "preferred" word in the cluster will give enough
    extra information to finish by turn 6, then play it on turn 3.
(3) If not, but a non-cluster word will separate the cluster into 
    smaller clusters that can each be resolved by turn 6 using
    guess-at-will, then play that word on turn 3.
If in (2) or (3) there are multiple candidate words to play, then 
play the one that gives the best probability distribution.
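
The three rules above can be sketched in Python as follows. This is a
hedged illustration, not the search program used for this document:
the names worst_turns, splits_well and resolve are invented for the
example, and instead of choosing the candidate with the best
probability distribution it simply returns the first candidate in
alphabetical order.

```python
from collections import Counter, defaultdict
from functools import lru_cache


def feedback(guess, answer):
    """Tile pattern for a guess: 'g' green, 'y' yellow, '-' gray."""
    tiles = ["-"] * len(guess)
    unmatched = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            tiles[i] = "g"
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and unmatched[g] > 0:
            tiles[i] = "y"
            unmatched[g] -= 1
    return "".join(tiles)


def split(word, cluster):
    """Sub-clusters (as frozensets) created by playing `word`."""
    groups = defaultdict(list)
    for candidate in cluster:
        groups[feedback(word, candidate)].append(candidate)
    return [frozenset(group) for group in groups.values()]


@lru_cache(maxsize=None)
def worst_turns(cluster):
    """Most guesses that guess-at-will can need inside `cluster`."""
    worst = 1
    for guess in cluster:
        for sub in split(guess, cluster):
            if sub != frozenset([guess]):
                worst = max(worst, 1 + worst_turns(sub))
    return worst


def splits_well(word, cluster, turns_left):
    """True if, after playing `word`, guess-at-will always finishes in time."""
    return all(sub == frozenset([word]) or worst_turns(sub) <= turns_left - 1
               for sub in split(word, cluster))


def resolve(cluster, vocabulary, turns_left=4):
    """Apply rules (1), (2), (3) in order; turns_left=4 means turns 3 to 6."""
    cluster = frozenset(cluster)
    if worst_turns(cluster) <= turns_left:
        return ("guess-at-will", None)              # rule (1)
    for word in sorted(cluster):
        if splits_well(word, cluster, turns_left):
            return ("preferred", word)              # rule (2)
    for word in sorted(set(vocabulary) - cluster):
        if splits_well(word, cluster, turns_left):
            return ("out-of-cluster", word)         # rule (3)
    return ("unresolved", None)


atch = ["batch", "catch", "hatch", "latch", "match", "patch", "watch"]
toy_vocabulary = atch + ["clamp"]
print(resolve(atch, toy_vocabulary))
print(resolve(["batch", "hatch", "watch"], toy_vocabulary))
```

In this toy vocabulary the seven-word cluster needs the out-of-cluster
word CLAMP, which leaves BATCH/HATCH/WATCH together and isolates the
other four words; the three-word cluster can simply be guessed.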

I've worked out resolutions for all the clusters, for several
hundred of the most promising starting pairs. Unfortunately
it seems that for most pairs there are clusters that do not 
resolve in any of these three ways. In such cases we can use
two-word solutions, or a recursive use of 
"preferred" words. 

With a strategy in hand for each of the starting pairs, we can
compare them using metric M8. The five best starting pairs by this
metric are the same as the ones that are best by metric M7, and
in the same order, though the numbers of turns are reduced:
(S2-5)    [crane, spilt] 3.6591
(S2-4)    [price, slant] 3.6618
(S2-7)    [crane, split] 3.6739
(S2-19)   [cried, slant] 3.6757
(S2-20)   [print, scale] 3.6771

These are really good results --- 3.6591 is one of the lowest
expected numbers of turns of any strategy described in this document!
And again we may compare to the run-of-the-mill starting pairs: the
example RECAP + WOULD only improves to an average of 3.9072 turns
per game when using its best minimal strategy. But these results
come at a cost: let's see what is needed to achieve them.


The top example, (S2-5) CRANE + SPILT, guarantees a win by turn 6,
and most of the 1071 clusters can be dispatched by free-form guessing.
But there are 30 that cannot.

This includes 25 clusters which can be handled by using a preferred member:
    above, allow, awake, batch, bevel, blade, corer, dingy, ditch,
    ditty, dogma, dumpy, earth, foist, goody, gouge, grade, haven,
    marry, merge, otter, sewer, vomit, wager, women
An additional four clusters require ordinary out-of-cluster solutions:
   [billy, bawdy], [bound, bawdy], [bully, fjord], [daunt, judge]

Then in addition, there is the largest cluster: 33 words that yield
a yellow E and yellow R. We cannot pin down the hidden word among them
unambiguously by playing any single word on turn 3 (neither from within
nor outside the cluster). We can still guarantee a win by move 6 if 
we remember rules for turn 3 and turn 4: use the preferred word
"berry" on turn 3; then on turn 4 (only), if either "rover" or "mover"
fits, play it. Otherwise, guess at will.

Potentially this algorithm is just within a player's ability
to memorize; for such a player I have prepared a cheat sheet.

In the previous subsection we noted this starting pair also used fewer
turns on average (3.700452) than any other starting pair, when using a
guess-at-will strategy. The distribution vector with that strategy is
  [.462635, .404369, .109006, .019043,
   .003868, .001008, .000068, .000003, .00000005]
(with a 0.50% failure rate).  But with the strategy above it changes to
  [.460907, .430587, .097044, .011462]
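
As a quick check of these two vectors, the following Python sketch
recomputes the average game length and the chance of running past
turn 6. The helper name summarize is invented for this example; the
vectors are copied from the text, with the first entry corresponding
to turn 3.

```python
def summarize(dist, first_turn, limit=6):
    """Return (average turn count, probability of needing more than `limit` turns)."""
    mean = sum((first_turn + i) * p for i, p in enumerate(dist))
    fail = sum(p for i, p in enumerate(dist) if first_turn + i > limit)
    return mean, fail


guess_at_will = [.462635, .404369, .109006, .019043,
                 .003868, .001008, .000068, .000003, .00000005]
forced_win = [.460907, .430587, .097044, .011462]

mean_gaw, fail_gaw = summarize(guess_at_will, first_turn=3)
mean_forced, fail_forced = summarize(forced_win, first_turn=3)
print(f"guess-at-will: {mean_gaw:.6f} turns, failure rate {fail_gaw:.4%}")
print(f"forced win:    {mean_forced:.6f} turns, failure rate {fail_forced:.4%}")
```

Because the published entries are rounded to six places, the
recomputed averages agree with 3.700452 and 3.6591 only up to that
rounding.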


The best strategy for PRICE+SLANT is nearly the same because these
two starting pairs are not only anagrams of each other, but have seven
of the 10 letters in the same positions! So about 80% of the clusters
for one starting pair are the exact same sets of words as clusters for
the other starting pair. Please refer to the strategy cheat sheet
for more details. (For the clusters that are different for the
two starting pairs, it is often possible to make some minor adjustments
in the proposed algorithms so that we not only guarantee a win by turn 6
but reduce the average number of turns just a little.)


The top two pairs keep their ranking, and the list of also-rans changes little,
if we ask for the frequency with which the game is won by turn 3, or by turn 4.
It's a little different if we look for wins on turn 5, that is, how 
often do we really need a sixth turn? In that case the winners are
(S2-19)     [cried, slant] 0.918% of the time, a 6th turn is needed.
(S2-16)     [blend, tramp] 0.968%
            [crowd, slept] 0.978%
            [scald, trope] 0.985%
            [chide, slant] 0.986%
(For comparison, CRANE + SPILT will go to the sixth turn 1.146% of the
time if we handle the problematic clusters as above.) What is happening
here is that the ranking starts to resemble the first list in part (B),
ranking pairs by their ability to end in six turns, using the guess-at-will
strategy.

                ----------------

The other criterion we care about, when comparing candidate starting
pairs that each use a strategy designed to guarantee a win by turn 6,
is to try to optimize the simplicity of the algorithm that
guarantees a win. Clearly with 30 additional rules to learn,
the two examples above are only barely simple enough to be used
by a casual Wordle player!


The starting pair with the "simplest" algorithm that I have found is
(S2-21)    [blond, spite]
It has "only" 23 problematic clusters, and all can be resolved
without any rules that extend beyond the first turn after the pair.
It will lead to a guaranteed win by turn 6, with an average of
3.8237 turns, if we simply follow rules (2) and (3) for these
exceptional clusters. A simplest strategy involves only two
out-of-cluster words:
  { aider, cater, chain, charm, chart, crave, crest,
    fifth, folly, girly, grill, legal, mayor, money,
    scary, scree, shrew, stark, trait, twang, wager,
    [found, wharf], [mover, rocky] }

Alternatively, using the word MARCH on turn 3 resolves many
of the 23 clusters by turn 6. That is, if we forget some of the 23 rules
while playing, we could simply revert to the six rules already discussed
for this starting triple (R3-15) in the previous section.

Among the many starting pairs I have so far studied, none has fewer 
than 23 rules for exceptional clusters like this, and I have found
only one other starting pair that creates only 23 tricky clusters: 
(S2-22)   [blond, trace]
A strategy that works uses four out-of-cluster words:
  { favor, fetal, folly, gaunt, gipsy, grape, harpy, mouth, palsy, pinch,
    serif, serve, setup, shake, shift, sigma, smell, swill, vague,
    [catch, champ], [found, swamp], [gamer, gawky], [mover, whisk] }
Using these 23 rules will guarantee success by turn 6, taking on
average 3.7461 turns. 

It is by this measure that these "best" starting pairs really are
better than the run-of-the-mill starting pairs. For example, the
pair RECAP + WOULD, mentioned earlier, requires using 45 preferred
words and 5 out-of-cluster pairs to resolve its many problematic
clusters! This seems to be typical for pairs taken from the middle
of the list of all 10-letter pairs, but the "best" pairs only
need about half that many!


I also have found one starting pair that never requires an out-of-cluster
word to be played: 
(S2-23)   [scold, tramp]
Simply play these 27 preferred words if they fit the clues:
  { bagel, bugle, catty, deign, diner, feign, flake, fleck, folly,
    habit, haste, liken, nerve, novel, prone, range, rotor, rough,
    serif, shunt, skate, stone, swine, taint, vegan, white, wince }
This wins every time but takes an average of 3.8436 turns.
(SWAMP + TREND also avoids all out-of-cluster words, but needs 36
preferred words to ensure a 100% win rate and averages 3.8504 turns.)

Other examples that never require out-of-cluster words are trickier.
For example, CLASH + TRIPE can be won every time by playing
preferred elements of the 28 trickiest clusters, including
  { badge, bowel, brand, bread, broad, budge, crown, diner, dingy,
    ditto, dolly, drove, exult, filmy, frond, grade, gumbo, jaunt,
    mangy, marry, meter, modal, sewer, stung, taunt, wager, woven }
However, the largest cluster ("yellow e & r") requires recursive
use of preferred words.  Guess "derby" on turn 3 if it fits; then
(and only then) if "mower" fits, play that *on turn 4*, and then 
(only then!) if "roger" fits, play it *on turn 5*. If that's STILL
not the right hidden word, it nonetheless gives enough information
to choose between "goner", "joker", and "rover" on turn 6!
This procedure averages 3.7249 turns for a win.


An interesting candidate for "simplest" is
(S2-24)    [gland, swept]
This starting pair leaves 35 clusters that require special treatment; 
34 of them can be resolved using a preferred member of the cluster,
and the last (the one including BERRY) can be resolved using
the out-of-cluster word ROCKY. But alternatively we can use
CHOIR for 33 of the first 34 --- all except the one containing
      [arbor, armor, favor, major, mayor, razor],
since we noted in the previous section that (R3-39) CHOIR + GLAND + SWEPT
is a good starting triple, having itself just three difficult clusters,
this set being one of them. (It can be resolved using MAYOR as
a preferred word.) In other words, we have a "simple" algorithm for
winning Wordle:
   Start with GLAND + SWEPT and see which cluster contains the day's word.
      play ROCKY if the cluster contains BERRY, 
      play MAYOR if the cluster contains MAYOR, 
      play CHOIR if the cluster is any of the other *problematic* ones,
      Otherwise, guess at will.
But this is a cheat! To use this algorithm we must recognize
those 33 clusters as they arise, which is no easier than remembering
the preferred words that signal them. But this does suggest a hybrid
algorithm for Wordle -- something between a 2- and a 3-word starting set:
   Start with GLAND + SWEPT.
   Next,
      play ROCKY if the word could be BERRY, 
      play MAYOR if the word could be MAYOR, 
      play CHOIR otherwise.
   Then, guess at will.
In practice this is very much like the procedure for the starting triple
CHOIR+GLAND+SWEPT; it simply skips a step in about 1% of the cases.

                ----------------

(D) SINCE WE'RE ALREADY CONSIDERING COMPLEX STRATEGIES...

Throughout this document, we have emphasized just two strategies that
a player might use after the starting set is played: if not using
guess-at-will all the time, we at least assume the player would 
follow that strategy for the clusters which will surely lead to a
victory by move 6. Just for this starting set CRANE + SPILT, however,
I considered a couple of other options.

First of all, the player might commit to memory a preferred word
to use in every cluster, including the sub-clusters that appear 
after turns 3, 4, and 5. This would produce an entire decision tree
which, if used daily, would finish the game on the Nth turn this
many times during an entire 2315-day cycle, always choosing a word
within the clusters:
    1, 1, 1069, 1071, 153, 16, 3, 1
for a total of 8384 words entered (an average of 3.6216 per day ---
of course that's better than guessing randomly within each cluster!)
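
The totals quoted here follow directly from the list of counts, as
this short Python check shows (the variable names are invented for
the example; entry k of the list is the number of days on which the
game ends on turn k).

```python
wins_on_turn = [1, 1, 1069, 1071, 153, 16, 3, 1]   # turns 1, 2, ..., 8

days = sum(wins_on_turn)
words_entered = sum(turn * count
                    for turn, count in enumerate(wins_on_turn, start=1))
average = words_entered / days
late_days = sum(wins_on_turn[6:])                  # days that need a 7th or 8th turn

print(days, words_entered, round(average, 4), late_days)
```

Note that this decision tree, which never leaves the clusters, still
runs past turn 6 on 4 of the 2315 days.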

We can also combine the two strategies: use the preferred word from
every cluster, as in the previous paragraph, except for the five clusters
that required an out-of-cluster word or two-word strategy used earlier.
Then the probability distribution vector that replaces the ones in the
previous paragraph is
  [.461339, .465659, .066955, .006048]
to get an average of 3.617714 moves per game, which is lower than
every other algorithm we discuss in this document --- although we have
now strayed very far from our goal of "simple" ways to play Wordle!

In fact, we have at that point run almost the complete analysis (for
this starting pair) which other researchers have done; the only additional
"improvements" one could make to the algorithm at this point would be
to consider the use of out-of-cluster words even for those clusters
that can be resolved by move 6 without them; and then, finally, to
allow the use of Wordle's "other" words --- the words that are not in
the solution list but are allowed as input. 

In fact I believe these two additional refinements would add very little.
The two starting pairs that I have found to be "best" by metrics M7 and M8
are also the highest two (among the pairs that use only Wordle
answer-list words) on Alex Selby's list. He computes the average 
numbers of turns used assuming the player plays optimally after the
starting pair. That means not only memorizing a word to be played from
each cluster (and sub-cluster) but also pre-computing that word from
among all the Wordle-permitted entry words (not just the Wordle answer-list
words, as I have done, and certainly not just the words within the cluster!)
Of course with more options to select (and assuming a very compliant player!)
one can then expect the average number of moves to decrease. But it doesn't
go down by much. We can compare the average numbers of turns for the four
strategies: (a) random guessing, (b) minimal win-by-6, (c) previous paragraph,
(d) optimal (Selby):
                   (a)       (b)       (c)       (d)
   CRANE+SPILT:    3.7005    3.6591    3.6177    3.6003
   PRICE+SLANT:    3.7022    3.6618              3.6037
Since this document intends to be an analysis of a *human* player's options,
the second column is probably the limit of our investigations.


Clearly an average game length of even 3.6591 is better than anything we
obtained using fixed starting triples. But when applied to an
N-fold compound game, this would imply (an upper bound for) the length of
the game being 2 + 1.6591 N. For N=1 this is clearly better than say the
bound  3 + 1.1682 N  which we obtained in the previous section. But already
for  N=2  the advantage is nearly lost; so even for Dordle it is not clear
that we are better off with the best starting pair than we would be with
one of our good starting triples. It is unlikely to be better for Quordle
and beyond. I have found subjectively that I do less well on Quordle even
with this "best" starting pair than I do with good starting triples.
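
The comparison of the two bounds can be tabulated with a few lines of
Python. This is only the arithmetic of the two formulas quoted above;
the function names are invented for the example.

```python
def pair_bound(n):
    """Upper bound on game length for an n-fold game with the best pair."""
    return 2 + 1.6591 * n


def triple_bound(n):
    """Upper bound quoted in the previous section for a good triple."""
    return 3 + 1.1682 * n


# The two lines cross where 2 + 1.6591 n = 3 + 1.1682 n.
crossover = 1 / (1.6591 - 1.1682)

for n in (1, 2, 4, 8):   # Wordle, Dordle, Quordle, Octordle
    print(n, round(pair_bound(n), 4), round(triple_bound(n), 4))
print("crossover near N =", round(crossover, 3))
```

The bounds cross just above N=2, so by these formulas the pair is
barely ahead for Dordle and behind from N=3 onward.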

                ----------------

(E) SUMMARY OF BEST STARTING PAIRS

I have not finished an exhaustive search of word-pairs but I have looked
at all pairs drawn from what I consider the "better half" of all Wordle words.
(This informal ranking provably works well in the next section, hence my optimism.)
I am running a background process at home that sifts through promising pairs;
for each one it is necessary to identify the problematic clusters and to find
in- or out-of-cluster words that can resolve them, if possible; I can also
search for procedures to resolve the clusters which cannot be won with these
tools, and then compute the probability distribution showing the frequencies
with which this algorithm will end on turns 3, 4, 5, or 6. Over time I
may use these results to update this section. But it seems clear that these
procedures to guarantee a win with a particular starting pair are inevitably
very complicated for a human to use, and unlikely to be useful for the compound games.


As at the end of the last section we can summarize the reasons to declare a starting pair "best":

(A) Rankings based on how the game board looks right after the starting pair:
(S2-2)   [clone, stair]  lowest size of largest cluster (M1 = 16)
(S2-5)   [crane, spilt]  highest number of clusters (M2=1071)/lowest average cluster size (tie)
(S2-4)   [price, slant]  highest number of clusters (M2=1071)/lowest average cluster size (tie)
(S2-4)   [price, slant]  highest number of singleton clusters (M3=634)
(S2-6)   [salon, trice]  most words in small (< 5) clusters (M4=1543)
(S2-11)  [crony, slate]  most green tiles (G = 2692 per 2315-day cycle)
(S2-12)  [irony, slate]  most green & discounted yellow tiles (G = 2528, Y = 4494)
(S2-13)  [route, slain]  most colored tiles (G+Y = 7062) 

(B) Rankings based on game's end, after randomly guessing candidates:
(S2-14)  [spend, trawl]  lowest chance of failure (M6 = 0.24% -- once per 413 games)
(S2-5)   [crane, spilt]  lowest average turn-count (M7 = 3.700452)

(C) Rankings based on game's end, after using an algorithm designed to ensure a win by move 6:
(S2-5)   [crane, spilt]  lowest average turn-count (M8 = 3.6591)
(S2-19)  [cried, slant]  lowest use of turn 6 (0.918%)
(S2-21)  [blond, spite]  smallest number of extra rules to follow (23)

==============================================================================

                            ONE

We can address the one-word starting sets in all the multiple ways
we have looked at larger sets in the last three sections. Since there
are only 2315 candidates this time, we can apply most of our tests
comprehensively to every option. (Unsurprisingly, the starting sets
without repeat letters are again better!)

(A) THE BEST STARTING WORDS BY THE VARIOUS POST-START METRICS

Again we can find the top-ranking choices for every Lp metric;
the rankings stay constant across intervals as  p  varies across
the whole real number line. I have computed the successive
lists of top-5 words (for each  p); here
there is space just to list the single best word for each  p . It is:

      raise   for p > +0.9112781617
      slate   for +0.4922321810 < p < +0.9112781617
      trace   for -0.3589050698 < p < +0.4922321810
      parse   for -0.9281602628 < p < -0.3589050698
      filet   for -1.8523351140 < p < -0.9281602628
      brute   for p < -1.852335114

For large  p, ARISE is in second place, but it and RAISE are
tied at p=+infinity, where the Lp metric becomes metric M1, measuring
the size of the largest cluster. But even they have clusters of 168
words (in both cases, it's the set of words that contain none of
the letters A,E,I,R, and S). All other words leave a cluster that's
even bigger; ALONE has one of size 182, then AROSE, ATONE, RATIO, ...

At p=+2 we are measuring the daily average of the size of the cluster
containing the hidden word; RAISE is the winner here too, with an
average of 61.0009.  (Even though there are only 10 clusters larger
than this, the player will encounter those ten very often!) Next
best are ARISE, IRATE, AROSE, and ALTER.

At p=+1, all words give the same value to the Lp metric (namely 2315),
but at this point  RAISE is the word for which the Lp metrics are
growing most slowly, followed by SLATE, CRATE, IRATE, and TRACE.

At p=0 the Lp metric counts the number of clusters. By this point,
TRACE has claimed the lead, with 150 clusters. That gives TRACE the
smallest average size of its clusters (15.4333 words per cluster)
and the greatest likelihood for a person to guess the hidden word
on turn 2 (6.48% of the time). Runners up are, in order, CRATE,
SLATE, PARSE, and (tie) CRANE and STALE.

At p=-infinity the Lp metric simply counts the singleton clusters,
that is, the most words which are known unambiguously after the
starting word is entered. BRUTE and CHANT are tied for the most;
but "most" isn't many -- they only pin down 40 words in the Wordle
dictionary! (Next come METRO and SPILT with 39, DINER and HORDE with 38.)


Our other post-starting-set metrics are based on counting the colored
tiles they produce. The words that produce the most green tiles over
the six-year cycle are SLATE (with 1437 of them), SAUCE (1411), and
SLICE (1409). Treating a yellow as half a green, the highest-scoring
words are STARE (1326 + 2761/2), then AROSE (2670.0) and RAISE (2668.5);
SLATE drops to fifth. Treating a yellow as equal to a green
gives a tie score to words with the same letters; on top are
ALERT/ALTER/LATER (4117), then IRATE (4116), AROSE (4093),
STARE (4067), and RAISE/ARISE (also 4067).

Of course we can rank the words by the value of  G + f Y  where a
yellow is valued at a fraction  f  of a green; we just worked out
f=0, f=0.5, and f=1. Then SLATE is indeed best if  f=0  or any  f 
less than f=0.37 . At f=0.37 it's a tie between SLATE and STARE.
Then STARE is best for larger  f  until  f=0.842, when IRATE takes
the lead. At f=0.993 the best becomes LATER, which keeps its title
until f=1, as above. For  f > 1 the best is ALERT unless you for some
reason value yellow tiles more than 3.25 greens; at that point
OPERA is the best word simply because it gives the largest number
of expected yellow tiles.

                ----------------
(B) BEST STARTING WORDS IF YOU INTEND TO JUST KEEP GUESSING

The guess-at-will strategy has a distinctive feature when
applied to starting sets of size 1. Since we are starting with
just a single fixed word before starting to guess, and with this
strategy we'll guess only within clusters formed from previous guesses,
this means the player is also following a strict form of Wordle's
"hard mode"! (The game's "hard mode" is actually more permissive:
a letter that has previously come up gray can be used again even
though from our perspective the new word is now clearly not in the
same cluster as the hidden word; likewise the game allows a repeat
of a letter in the same spot after the tile is yellow, while for
us that also indicates the new word is out-of-cluster.)
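
The difference between the two notions can be illustrated in Python.
This is a sketch under stated assumptions: hard_mode_ok is a
simplified model of the game's rule (green letters stay in place,
yellow letters must be reused somewhere), not the game's actual code,
and all function names are invented for the example.

```python
from collections import Counter


def feedback(guess, answer):
    """Tile pattern for a guess: 'g' green, 'y' yellow, '-' gray."""
    tiles = ["-"] * len(guess)
    unmatched = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            tiles[i] = "g"
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and unmatched[g] > 0:
            tiles[i] = "y"
            unmatched[g] -= 1
    return "".join(tiles)


def in_cluster(candidate, history):
    """Strict test: the candidate would have produced every pattern seen so far."""
    return all(feedback(guess, candidate) == pattern
               for guess, pattern in history)


def hard_mode_ok(candidate, history):
    """Simplified hard mode: keep greens in place and reuse yellow letters."""
    for guess, pattern in history:
        for i, (letter, tile) in enumerate(zip(guess, pattern)):
            if tile == "g" and candidate[i] != letter:
                return False
            if tile == "y" and letter not in candidate:
                return False
    return True


# A gray letter reused: C was gray in CRANE, yet CRAVE passes hard mode.
gray_case = [("crane", feedback("crane", "brake"))]
print(hard_mode_ok("crave", gray_case), in_cluster("crave", gray_case))

# A yellow letter repeated in the same spot: T stays first in TRACE.
yellow_case = [("tread", feedback("tread", "water"))]
print(hard_mode_ok("trace", yellow_case), in_cluster("trace", yellow_case))
```

In both cases the candidate is permitted by the simplified hard-mode
rule but is not in the cluster of the hidden word.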

The two natural metrics to compute for a strategy of free-form
guessing are (1) the average number of turns needed to discover
the hidden word, and (2) the probability of doing so by turn 6
(i.e., "winning").


(1) A preliminary sort of the words in the Wordle dictionary works
very well to quickly locate the words which find the hidden
word fastest. From the probability distributions of these words
I can compute the average numbers of turns until the hidden word
is found. (As always, this includes counting as 7, 8, etc. turns
those rarer occasions when the player will continue past the
6-turn limit --- which is now a fairly common event when starting
with just a one-word starting set!) Ranked by average numbers
of turns, the list of 1-word starting sets begins:

    slate, 3.8218 turns on average to complete Wordle
    least, 3.8327
    trace, 3.8420
    stale, 3.8463
    crate, 3.8490
    slant, 3.8509
    leant, 3.8541
    plate, 3.8618
    dealt, 3.8659
    react, 3.8662
    ...
Note that all of these are significantly *more* than the counts
of the better two-word starting sets. "Guess-at-will" is frankly
not an efficient strategy for one-word starting sets. The problem
is that after playing just one starting word, we are left with some
very large clusters, and truly guessing candidates from them at
random is a very inefficient way to uncover the hidden words in them.

(Just so we're clear here, these are calculated probabilities,
not empirical data. The actual expected number of turns for
SLATE, for example, is exactly
    2143345083855809867374480229283246322413669919670571681198349
    -------------------------------------------------------------
     560821042520148446351329436583213214101064845220300800000000
So we do have the precision to rank these properly!)
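
For readers who want to see the decimal value, Python's standard
fractions module handles integers of this size exactly. The sketch
below only converts the fraction quoted above; it does not recompute it.

```python
from fractions import Fraction

expected_turns = Fraction(
    2143345083855809867374480229283246322413669919670571681198349,
    560821042520148446351329436583213214101064845220300800000000,
)

print(float(expected_turns))            # decimal approximation
print(round(float(expected_turns), 4))  # value as listed in the table
```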



(2) Though it may not be efficient (few turns) this strategy is
certainly simple, and *may* also be effective, as judged by
the probability of a win by turn 6. My preliminary sorting has
not been as helpful at suggesting candidates for "best" by this
metric, and so I am still computing data for more candidates,
but it appears the winners are:

    clasp, .992797, about one loss in every 138.83 runs
    scalp, .992524,                         133.76
    splat, .992047,                         125.75
    spelt, .991969,                         124.52
    slept, .991902,                         123.50
    spilt, .991801,                         121.98
    split, .991771,                         121.53

Again it's worth observing that even though we have five
more turns allowed after a single start word (compared to just 
four more after a starting pair), we will still suffer a loss
about three times as often as we do with the best starting pair.


There's also a sequence of rankings more or less intermediate
between these two lists. The ordering in table (1) is similar to
a ranking already mentioned above listing the words that
have the greatest chance of winning by turn 2 (TRACE leads the
pack with 6.48% of games ending on turn 2, then crate, slate, and
parse). We can also sort by the fraction of games finished by
turn 3; it's  SLATE at 39.42% (then trace, then crate, then least).
Sorted by the fraction completed by turn 4, it's SLATE at 78.56%
(then least, then slant, then stale). Finally sorting by the
fraction completed by turn 5, the highest is CLASP at 95.27%,
now followed by slept, spilt, and slant. The last ranking in 
this sequence would be the same as the ranking by the rate of 
victories, which is table (2) above.

More of a curiosity than anything else, I suppose: table (2) clarifies
that no matter what word a player starts with, an unlucky guesser
should expect occasionally to need more than 6 words to guess the
hidden word. But in fact, for most starting words, on the worst
days it will take 12, 13, 14, or 15 turns to find the hidden word.
The closest exceptions I have found are the starting words SPORT, 
STONY, STORM, STORK, and SCOLD, each of which always finishes by the
11th move. With SPORT as the first word, and following a guess-at-will
strategy with Wordle answer words, a player will always finish
on turns 2 through 11 with these frequencies:
   [.05097192, .29439833, .41291251, .19020249, .04183704,
    .00790584, .00158518, .00018201, .00000464, .00000004]
for an average of 3.9085 turns and a 0.97% chance of loss.
The situation with the other four is similar, although their
distributions are skewed more heavily to the right.


As we have moved through this document, looking at shorter and
shorter starting sets in each section, we have seen that the
number of moves needed has generally decreased. The information
gained from the shorter starting sets is obviously weaker, so
more turns will be needed, on average, to finish the puzzle after
the starting words are entered. But for good starting sets, those
additional turns have been few, and their number increased
only slowly as we moved from section to section. In particular,
the average number of turns taken for the best starting sets
was lower for starting triples than for starting quads, then
lower again for starting pairs. But the pattern has now 
ended: as we drop from studying starting pairs to considering
a single fixed starting word, the average game length has
increased. (Using in both cases the guess-at-will strategy,
we saw that the starting set CRANE + SPILT would take an
average of 3.7005 moves. Our best single starting word SLATE
requires 3.8218 moves on average.) A similar conclusion has
been drawn in a slightly different context.
In a nutshell: if you will simply use a guess-at-will strategy,
asking "What is the best starting word?" is pointless; it's
always better to ask for the best starting pair unless
it's especially important to you to have a fighting chance
to finish the game on the second turn. (The probability of that
is at best 6.48%, which happens if your opening move is TRACE.)

One-word starting sets can finish more quickly than the
best two-word starting sets, but only if we use more complex
follow-up strategies than "guess at will", as we shall see in
the next subsection.

                ----------------
(C) BEST STARTING WORDS IF YOU'LL USE A SIMPLE STRATEGY THAT FORCES A WIN

Well, the guess-at-random strategy of part (B) isn't looking as good for
any starting singleton as it did for some starting pairs. What about the 
win-by-turn-6 strategy? It turns out that, whatever the efficiency
of such algorithms, their complexity has risen to levels that
surely make them unsuitable for regular human use. So I don't expect
I will bother trying to compute them for each of the 2315 possible
1-word starting sets to pick a "best".  (Moreover, these strategies
of preferred cluster members and so on take considerable time to find
in the first place.)  So we will content ourselves with a single
promising example: what would be our strategy if we started with just
the one word that's most efficient for a guess-at-will strategy, SLATE?

Right away from the cluster vector we can tell that such a one-word
starting set is going to be trouble: it begins as
 [29, 20, 10, 9, 6, 10, 4, 4, 0, 4, 2, 4, 4, 3, 2, 1, 4, 1, 2, 1, 1, 0,...]
(which counts the numbers of clusters of sizes up to 22) and then is a
sparse list, as the remaining clusters have sizes
   [23, 24, 25, 27, 28, 28, 31, 31, 32, 37, 39, 39, 42,
     48, 51, 56, 58, 61, 61, 72, 86, 87, 107, 136, 165, 221 (!) ]
So the Wordle dictionary is split by SLATE into relatively few clusters
(only 147 of them), some very large -- a situation very different from
what happens with 4, 3, and even 2-word starting sets!
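
The cluster vector can be checked, and the Lp-style sums of part (A)
evaluated on it, with a short Python sketch. The data are copied from
the lists above. The helper power_sum is an assumption made for this
example: it takes the Lp metric to be built on the sum of s^p over
cluster sizes s, which matches the descriptions of p=0 and p=1 in
part (A) but may differ in normalization from the definition given
earlier in this document.

```python
# Number of clusters of size 1, 2, ..., 22 after playing SLATE.
small_counts = [29, 20, 10, 9, 6, 10, 4, 4, 0, 4, 2,
                4, 4, 3, 2, 1, 4, 1, 2, 1, 1, 0]
# Sizes of the remaining (larger) clusters.
large_sizes = [23, 24, 25, 27, 28, 28, 31, 31, 32, 37, 39, 39, 42,
               48, 51, 56, 58, 61, 61, 72, 86, 87, 107, 136, 165, 221]

sizes = [size
         for size, count in enumerate(small_counts, start=1)
         for _ in range(count)] + large_sizes


def power_sum(sizes, p):
    """Sum of s**p over the cluster sizes (assumed core of the Lp metric)."""
    return sum(s ** p for s in sizes)


n_clusters = power_sum(sizes, 0)
n_words = power_sum(sizes, 1)
print("clusters:", n_clusters, " words:", n_words, " largest:", max(sizes))
print("chance of a win on turn 2:", n_clusters / n_words)
print("average size of the cluster met:", power_sum(sizes, 2) / n_words)
```

The number of clusters divided by 2315 reproduces the first entry of
the guess-at-will distribution for SLATE.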

To amplify what we've already said about the guess-at-will approach:
if after we enter SLATE we simply begin guessing words consistent with
the growing list of clues, we will guess the word on turns 2, 3, 4, ...
with these probabilities:
  [0.0634989201, 0.3306985566, 0.3913739602, 0.1629811789, 0.0401178793,
   0.0092995402, 0.0017820878, 0.0002364096, 0.0000112207, 0.0000002431,
   0.0000000036, 0.0000000000]
That last number isn't really zero, actually; the probability of the game
ending on the THIRTEENTH turn is small but nonzero: about 2 x 10^(-11).
From this distribution we can compute that the average game will last
3.821799  turns (as in Table (1), above), and will fail to complete by
turn 6 about 1.1330% of the time. As already noted: guessing isn't a good
strategy with one-word starting sets.
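
The two summary figures just quoted follow directly from the list of
probabilities. Here is a short Python sketch of that arithmetic; the
function name summarize and its parameters are illustrative labels, not
anything from the original computation. It treats the first entry of the
list as turn 2, as the text does.

```python
# Probabilities of finishing on turns 2, 3, 4, ... (copied from above).
SLATE_GUESS_AT_WILL = [
    0.0634989201, 0.3306985566, 0.3913739602, 0.1629811789, 0.0401178793,
    0.0092995402, 0.0017820878, 0.0002364096, 0.0000112207, 0.0000002431,
    0.0000000036, 0.0000000000,
]


def summarize(probabilities, first_turn=2, last_allowed_turn=6):
    """Return (average number of turns, probability of not finishing in time)."""
    turns = list(enumerate(probabilities, start=first_turn))
    mean = sum(turn * p for turn, p in turns)
    fail = sum(p for turn, p in turns if turn > last_allowed_turn)
    return mean, fail


mean_turns, failure_rate = summarize(SLATE_GUESS_AT_WILL)
print(f"{mean_turns:.6f} turns, {100 * failure_rate:.4f}% failures")
# prints: 3.821799 turns, 1.1330% failures
```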


What about pursuing a simplest-possible strategy that can guarantee a
win by turn 6? We can do so but will have to devise recipes for the 
FORTY-FOUR(!) clusters that could fail to complete by turn 6 using freeform
guessing. We attempt this as has been done in previous sections.

The natural strategy would be to use preferred cluster members when possible
and out-of-cluster words when not; we can do this but in six cases we need
out-of-cluster pairs (as we did for CRANE + SPILT). (I have not proved that
these 6 multiple-out-of-cluster choices are optimal but at least they do work.)
Here is the strategy in the now-familiar format {preferences, [signals,actions]}:
  { aback, abide, abled, adapt, adept, agile, alarm, album,
    alien, amity, avail, befit, belch, belie, beset, bigot,
    bison, blade, bleed, blimp, bloke, boost, bused, chose,
    cutie, sandy, scare, scene, scion, scout, screw, slick,
     [baste, batch],    [scant, frisk],    [adage, crimp],
     [birth, brown],    [afoul, manic],    [binge, doing],
      [abbot, [cough, thumb]],   [abbey, [cigar, bawdy]], 
      [billy, [drill, wharf]],   [abhor, [grind, macaw]], 
      [beech, [rowdy, cabin]],   [biddy, [frond, chump]]   }

The distribution of game lengths for this procedure works out to be
   [0.058315, 0.308362, 0.477195, 0.141094, 0.015033]
giving an average of 3.746167 turns to complete and win.  That's a
notable improvement over a guess-at-will approach, but comes at the
expense of having to learn *62* words and the context in which each
one applies. (The improvement is because over three-quarters of the
Wordle words are in clusters where at least one preferred word has
been computed; unlike the situation with larger starting sets, these
many additional rules are doing much more than eliminating fringe cases!)
To add insult to injury, this strategy still does not finish faster,
on average, than the comparatively-simpler strategy for CRANE+SPILT
discussed in the last section.

Since already in this example we are confronting strategies that
could hardly be called simple, we might as well consider other 
strategies too, even if they are yet more complex, if they offer
some other benefit. The next few paragraphs will lead us in that
direction, but I don't think I've found the definitive narrative 
to follow here.

                ----------------

(D) AREN'T WE NOW DOING HARD MODE?...

Let's speak a bit more about original-Wordle's "hard mode". All the time
we have discussed starting sets with two or more words, the sequences
of words that we have been proposing will be inconsistent with the
hard mode rules on most days (unless, say, the colored tiles all come
back gray after the first word!) But now that we are discussing a
starting set of just one word, it is conceivable that we could carry
out our procedures in a hard-mode game. We have already noted that
playing a guess-at-will strategy is consistent with (a strict version
of) the hard-mode rules; the same will be true whenever we play a
"preferred" member of a cluster. Indeed, we have up to this point
barely mentioned the possibility of identifying preferred words on a
recursive basis (picking out preferred members of sub-clusters to be
used on later turns) but this also would be consistent with (a
stricter form of) the "hard mode" rules.  So it is natural to ask,
can we do that? Can we get a hard-mode strategy for playing Wordle,
starting with SLATE ?

Suppose some day you begin with SLATE, and in response you get green T,E
and yellow S,A tiles. If you are playing in Wordle's "hard mode", 
you must then play a word ending in TE that also has S and A (but
does not[*] begin with  S). The only such Wordle solution-words are
   baste, caste, haste, paste, taste, waste
and one of those is the day's hidden word. But each time you play a word
from this cluster, if it's not itself the hidden word, then you gain no
additional information about the hidden word. So it could take you six
more turns (through turn 7) to stumble on the correct word, by which
time the game would already have ended. In short: any strategy that can
start with SLATE and
guarantee a win by turn 6 must use an out-of-cluster word for this
cluster, and so is not permitted by the rules of hard mode.

[*UPDATE -- I am not a regular user of "hard mode" and only recently
learned that this is not true: apparently Wordle would also allow the
use of SAUTE, for example, which is not in the cluster that SLATE's
clues determine.
So the remarks in this paragraph and the next actually apply only to
a hypothetical "strict hard mode" that DISallows the use of letters
that have already been given gray tiles, or yellow tiles in the given
positions. But as it turns out, the conclusion for SLATE is unchanged:
using hard mode (and using only Wordle's solution-list words), it is
not possible to guarantee a win by move 6 if we start with SLATE.]
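
The trouble with the "_ASTE" cluster can be checked mechanically. The
Python sketch below is an illustration under stated assumptions, not the
code behind the numbers in this document: it assumes that a guess-at-will
player picks uniformly at random among the words still consistent with
every clue, and the names feedback and guess_distribution are made up
for this example. It computes, exactly, the chance of finding the hidden
word on the 1st, 2nd, ... guess made inside a cluster.

```python
from collections import Counter
from fractions import Fraction
from functools import lru_cache


def feedback(guess: str, answer: str) -> str:
    """Tile pattern: 'G' green, 'Y' yellow, '-' gray (assumed encoding)."""
    tiles = ["-"] * len(guess)
    unmatched = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            tiles[i] = "G"
    for i, g in enumerate(guess):
        if tiles[i] != "G" and unmatched[g] > 0:
            tiles[i] = "Y"
            unmatched[g] -= 1
    return "".join(tiles)


@lru_cache(maxsize=None)
def guess_distribution(cluster: frozenset) -> tuple:
    """Entry k-1 is the probability that the k-th in-cluster guess wins.

    Both the hidden word and each guess are taken to be uniformly
    distributed over the words still possible (an assumption).
    """
    n = len(cluster)
    dist = Counter()
    for guess in cluster:
        for hidden in cluster:
            weight = Fraction(1, n * n)
            if guess == hidden:
                dist[1] += weight
                continue
            pattern = feedback(guess, hidden)
            # Words that would have produced the same tiles as the hidden word.
            remaining = frozenset(w for w in cluster
                                  if feedback(guess, w) == pattern)
            for k, p in enumerate(guess_distribution(remaining), start=1):
                dist[k + 1] += weight * p
    return tuple(dist[k] for k in range(1, max(dist) + 1))


ASTE = frozenset(["baste", "caste", "haste", "paste", "taste", "waste"])
dist = guess_distribution(ASTE)
print([str(p) for p in dist])   # ['1/6', '1/6', '1/6', '1/6', '1/6', '1/6']
```

Since SLATE has already used turn 1, the sixth in-cluster guess would
fall on turn 7: under these assumptions the player loses one time in six
whenever the hidden word lies in this cluster.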


This situation is not unusual: there are many other "good" starting
words for which the cluster that contains "baste" could potentially go
unresolved by move 6 if we insist on playing in "hard mode"; this
includes not only SLATE but also STALE, STEAL, SLANT, STARE, TRAIL...
Other patterns that cause the same problem for other starting words
include "(h)atch" (for TRACE, CRATE, REACT, LEANT, DEALT, ...);
"(j)aunt" (for CLEAT, ALERT, LEAPT, CRANE, ...) and "(w)ound" for TRADE...

Indeed, the starting words TRACE and REACT are generally good choices,
but each of them creates one cluster of *seven* "_ATCH" words; using
hard mode obligates us to then spend up to seven turns looking for the
hidden word, so the game may not conclude until turn 8 !


Nonetheless, there ARE starting words which permit the construction of
a win-by-turn-6 strategy that involves only using preferred words within
clusters, and this gives a strategy that is permitted in "hard mode".
However, in all the cases I have checked, the strategy must be used
recursively: we have to map out some preferred words not only for turn 2
but also for later turns (that is, there are words that are preferred
for a cluster but only after a previous preferred word has been tried.)

One example I have worked out this way is PARSE. I first worked out
the consequences of a guess-at-will strategy starting just with
PARSE. The probability distribution for this strategy is
  [.06306695, .30292396, .38894339, .18507397, .04831372,
   .00951574, .00179633, .00032386, .00003878, .00000313,
   .00000016, .00000001, .00000000 ]:
so (a) a very unlucky guesser could have to go to the 14th turn
to discover the hidden word; (b) the average number of turns needed
is 3.890251; (c) there is a 1.17% chance of failing (by turn 6).

I next found a standard (non-recursive) strategy for each of the 
clusters that can lead to a failure that way, just as I did (above)
for SLATE. For PARSE there are 40 such clusters; the simplest
procedure for dealing with them is summarized as:
     { alike, carat, caste, debar, drift, gavel, grace, hoist,
       merry, metal, noose, pouty, pride, prong, rayon, reply,
       resin, salon, short, slash, slate, smart, spilt, stalk,
       steak, stink, stone, tonal, torch, tribe, truss, verge,
   [boule, build], [cable, gulch], [craft, droit], [lager, light],
 [badly,[child,tawny]], [blind,[blond,fight]], [betel,[ditch,nobly]],
                     [refit,[blown,ditch,fever] ]   }

So 32 of the clusters are resolved by using a preferred word inside
the cluster. Then, four clusters each require ONE out-of-cluster
word; three need a PAIR of out-of-cluster words, and as far as I
can tell the last one requires using THREE fixed words, to be played
on turns 2, 3, and 4! This guarantees a win by move 6; the fractions
of the time that the word is found on turn 2,3,4,5,6 are
  [0.05961123, 0.32074700, 0.49688725, 0.11121149, 0.01154303]
for an average of 3.694328 turns.

But (unlike for SLATE) for the starting word PARSE, we can write
other rules for the last 8 clusters, ones that involve only words
within the clusters. The following algorithm will work after PARSE
is played on turn 1, and will determine the hidden word by move 6:

**On turn 2, just start guessing, unless one of the above 40 fits, in which case, play it.
    (the 32 words "alike, carat, ..." and 8 first-halves "boule, cable, ...")
**On turn 3, just start guessing, unless one of these 20 fits, in which case, play it.
    [from BADLY] magma, tawny
    [from BETEL] cello, enjoy, woven
    [from BLIND] cumin, dowdy, filmy, gulch, joint, witty
    [from BOULE] undue
    [from CABLE] gauze
    [from CRAFT] board, drawn
    [from LAGER] water
    [from REFIT] brief, diner, rebel, wreck
**On turn 4, just start guessing, unless one of these 2 fits, in which case, play it.
    [from MAGMA] havoc
    [from WRECK] bluer
**On turn 5, just start guessing, unless one of these 2 fits, in which case, play it.
    [from HAVOC] taint
    [from BLUER] homer

(A special rule for turn 3 or later comes into play only if we followed
the corresponding special rule on the prior turn.) With this set of
rules, the probability distribution is
   [0.06306695, 0.37474268, 0.41821998, 0.12766482, 0.01630556]
from which we compute the average number of moves to be 3.659399 .
Again this guarantees a win by turn 6, but now this strategy can be
used when playing "hard mode"! (If you wish to play along this way,
you might appreciate a brief cheat sheet I wrote up.)
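
For readers who would rather see the turn-3-and-later rules as a lookup
table, here is one possible encoding in Python. It is a sketch only: the
names FOLLOW_UPS and next_guess are invented for this illustration, the
turn-2 list of 40 words is left out for brevity, "fits" is read as "is
among the words still consistent with every clue", and min() is just a
deterministic stand-in for "just start guessing". The candidate sets in
the usage lines are made up.

```python
# (turn, word played on the previous turn) -> preferred words, as listed above.
FOLLOW_UPS = {
    (3, "badly"): ["magma", "tawny"],
    (3, "betel"): ["cello", "enjoy", "woven"],
    (3, "blind"): ["cumin", "dowdy", "filmy", "gulch", "joint", "witty"],
    (3, "boule"): ["undue"],
    (3, "cable"): ["gauze"],
    (3, "craft"): ["board", "drawn"],
    (3, "lager"): ["water"],
    (3, "refit"): ["brief", "diner", "rebel", "wreck"],
    (4, "magma"): ["havoc"],
    (4, "wreck"): ["bluer"],
    (5, "havoc"): ["taint"],
    (5, "bluer"): ["homer"],
}


def next_guess(turn: int, previous_guess: str, candidates: set) -> str:
    """Pick the word to play on `turn`.

    `candidates` holds the words still consistent with all clues so far.
    A listed follow-up word is played if it is one of them; otherwise we
    fall back to an arbitrary consistent word.
    """
    for word in FOLLOW_UPS.get((turn, previous_guess), []):
        if word in candidates:
            return word
    return min(candidates)   # stand-in for free guessing


# Hypothetical candidate sets, for illustration only.
print(next_guess(3, "badly", {"tawny", "fanny"}))   # tawny
print(next_guess(3, "pride", {"bride", "bribe"}))   # bribe (no rule applies)
```

Keying the table on both the turn number and the previous word keeps a
rule from firing when the same word happens to be played as a free guess
on some other turn.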

As has been typical for all our larger starting sets, most of the clusters
will surely be resolved by turn 6 just by guessing candidates at random.
What's different with 1-word starting sets is that there are so few 
clusters (146 with PARSE) and while, yes, 106 out of 146 can be safely
handled by freeform guessing, those clusters account for only 488 words --
only one-fifth of the Wordle dictionary. The next largest group of
clusters is the set of 32 of them for which we have picked out a 
preferred word, to be used in both the second and third algorithms.
Just giving some guidance for these clusters accounts for the
improvement from the first to the second strategies, because nearly
half the dictionary (1014 words) are in these clusters, so anything
better than unguided guessing will affect the number of moves almost
every other day! The rest of the words are contained in just 8 clusters,
but they are large: together they hold 883 words; dispensing with 
an out-of-cluster guess --- in some sense a wasted turn --- on every
one of those 883 days accounts for why the third algorithm can be 
better than the second.

(Sadly, both the second and third procedures again require memorizing
some five dozen words, properly ordered. So they are unlikely to be of
practical use for most people.)


While it would not be suitable for a human player, I did pursue this
idea to the extreme for some of the likely best starting words: how
well could we do if we were willing to memorize preferred words to
be used on every move, until the hidden word is found? That is,
how well can we do if we are willing to construct the entire decision
tree, using only (strict) hard mode guesses (i.e. always playing words
that are in the current cluster)? I have already mentioned that 
starting with TRACE might take us until the eighth turn to finish, 
yet it's otherwise not a bad start from this perspective! I can
produce a decision tree that shows what to do on each of the 2315 days
of a full cycle, and a total of only 8204 words would have been played --
an average of 3.543845 turns per game. Actually SLATE is better,
and in fact gives the lowest total of all the words I tested in this
way: 8186 total (3.536069/day) if we follow this decision tree.
Most of these trees sometimes extend past the sixth move though;
among those that do not, the fewest moves I have found is for PLATE.
It wins every day, taking 3.562419 moves per day, 8247 total.
(SCALE and PLACE are nearly as good; PARSE takes 8282 total.)
In other words, PLATE seems to be the best starting word in the sense
of using the fewest turns on average, if we play in hard mode and
insist on winning by turn 6 (using only solution-list words).

In constructing these decision trees, my algorithm tried to minimize
something a bit odd. I have learned that Tom Sirgedas has posted
decision trees for playing the game under the same constraints that
I use: strict hard mode, using only Wordle solution words. He gives
one starting with SLATE which, like mine, will occasionally take more
than 6 turns; another starting with PLATE which always finishes by the
6th turn; and a third starting with SCAMP which always finishes by the
5th turn. His first two finish a bit faster than mine (3.51836 turns
and 3.55378 turns respectively) and his third finishes in 3.71404 turns
on average. Each of these is optimal in its class by this metric.
(My decision trees opt to lower the maximum number of turns
needed to handle each word in a cluster, before trying to lower the
average number of turns. The difference is evident only in
a few clusters.) 
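
The "maximum first, then average" ordering just described can be stated
precisely as a lexicographic comparison. The Python sketch below is an
illustration, not the program that produced the trees discussed here:
the names feedback and best_tree are invented, and the search is a plain
exhaustive recursion that is only practical for small clusters. Every
guess is drawn from the current cluster, which is the strict hard mode
of this section.

```python
from collections import Counter, defaultdict
from functools import lru_cache


def feedback(guess: str, answer: str) -> str:
    """Tile pattern: 'G' green, 'Y' yellow, '-' gray (assumed encoding)."""
    tiles = ["-"] * len(guess)
    unmatched = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            tiles[i] = "G"
    for i, g in enumerate(guess):
        if tiles[i] != "G" and unmatched[g] > 0:
            tiles[i] = "Y"
            unmatched[g] -= 1
    return "".join(tiles)


@lru_cache(maxsize=None)
def best_tree(cluster: frozenset) -> tuple:
    """Return (max_guesses, total_guesses, first_guess) for `cluster`.

    Candidates are compared on the pair (max_guesses, total_guesses), so
    the worst case is lowered first and the total (hence the average)
    only breaks ties.
    """
    best = None
    for guess in sorted(cluster):
        # Split the other words by the tiles this guess would show.
        groups = defaultdict(set)
        for hidden in cluster:
            if hidden != guess:
                groups[feedback(guess, hidden)].add(hidden)
        depth, total = 1, len(cluster)   # every word pays for this guess
        for members in groups.values():
            sub_depth, sub_total, _ = best_tree(frozenset(members))
            depth = max(depth, 1 + sub_depth)
            total += sub_total
        if best is None or (depth, total) < best[:2]:
            best = (depth, total, guess)
    return best


ASTE = frozenset(["baste", "caste", "haste", "paste", "taste", "waste"])
print(best_tree(ASTE))   # (6, 21, 'baste')
```

Comparing on (total, depth) instead would put the average first, which
appears to be the ordering that matters for the "optimal in its class"
trees mentioned above.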


                ----------------

(E) A "WORD" ABOUT REPEATED LETTERS

In previous sections it was necessary to restrict attention to starting
sets without repeated letters, to cut down on the size of the search
spaces. For single-word starting sets that is not necessary, and indeed
this gives us an opportunity to consider the consequences of not
allowing repeated letters in the previous sections. 

There are 749 Wordle solution words that repeat one or more letters.
We can rank them according to any of the metrics that we have applied
to other sets of words:

For the various Lp metrics, the best of these words is one of these:
   INANE (for all p>10.736), whose largest cluster has "only" 316 elements
   ELATE (for other p>2.525)
   ERASE (for other p>1.133) with a daily ambiguity of "only" 106.97
   TEASE (for other p>0.645)
   TERSE (for other p>-0.895) which has a "large" total of 122 clusters
   TRAIT (for other p>-1.604)
   TOTAL (for all p<-1.604) with 33 singleton clusters, which is the max.
For the color-tile metrics, the best is either
   SOOTY (for all f<0.063) with 1392 green tiles, which is maximal
   CEASE (for other f<0.183)
   TEASE (for other f<0.234)
   ERASE (for other f<0.821)
   EATER (for other f<1.840) with 3641 total colored tiles, which is maximal
   TERRA (for all f>1.840) with 2759 yellow tiles, which is the max.
For the guess-at-will strategy the best repeat-letter starting words are
   SLEET, which finishes after an average of 3.96898 turns (the min).
   SLEEP, which fails 1.00232% of the time (the min)
Also notable are ROOST, TORSO, and SHOOT, which alone are guaranteed to
finish by the 12th move this way.

(I did not compute a minimal finish-by-turn-6 strategy for any of these
starting words. Each of the words TRAIT, TOTAL, SLEEP has 28 clusters
that require an additional rule of play (i.e. clusters for which
guess-at-will can fail), and the other words listed above (probably)
have even more, so none of these words would be optimal by the
'simplest strategy' criterion, and I doubt any of the rest of the 749
repeated-letter words would be any simpler.) 

Based on this information, one might conclude that ERASE, TEASE, and TERSE
are the best words to include in a discussion that's not limited to
words with distinct letters. Besides the other words listed above, 
one might also consider words that showed up fairly highly on one or
more of the above rankings: STEEL, RESET, BERET, EASEL, LEASE, LATTE,
NERVE, NEVER, SNEER, TREAT, FLEET, STATE, SPREE, SCREE, ...

*However*, as good as the best of this lot are, they're not great.
Ranking all 2315 words by any of various criteria, even the best of these
words (measured by the same criterion) is typically out-ranked by some
10% of the words that do not repeat any letters.


One might hope that one could pair off the best of these "four-letter words"
with the best pairs to get good triples with only 14 distinct letters;
or similarly to create good 19-letter quads or 9-letter pairs.
This does not seem to be the case. For example, what 14-letter
triples can we form which include ERASE? Of all the 196175 10-letter pairs,
every one of the ones that I would consider to be in "the better half"
already includes at least one of the letters A, E, R, or S. A good example
of one that doesn't is COUNT+DIMLY, but now the triple ERASE+COUNT+DIMLY
is not competitive with the best triples we itemized in a previous section.
Likewise I scanned what is arguably the "top 20%" of all the 15-letter
triples, and every one of them shared a letter with every one of the
18 or so best words in this section.
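
The letter bookkeeping behind these scans is simple enough to sketch.
The helper names below (distinct_letters, has_repeated_letter,
shares_letter) are invented for this illustration, and the sketch covers
only the letter counting and overlap tests, not the ranking of pairs and
triples.

```python
def distinct_letters(*words: str) -> set:
    """The set of letters used anywhere in the given words."""
    return set("".join(words))


def has_repeated_letter(word: str) -> bool:
    """True if some letter occurs more than once in the word."""
    return len(set(word)) < len(word)


def shares_letter(word: str, *others: str) -> bool:
    """True if `word` has a letter in common with any of the other words."""
    return bool(set(word) & distinct_letters(*others))


print(has_repeated_letter("erase"))                          # True
print(len(distinct_letters("erase", "count", "dimly")))      # 14
print(shares_letter("erase", "count", "dimly"))              # False
print(shares_letter("erase", "crane", "spilt"))              # True
```

A scan like the one described would apply shares_letter to each
candidate pair or triple and keep only those that return False.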

The conclusion is that in order to build starting sets that repeat a
letter but remain competitive with the others we have already found,
we would have to either (a) combine moderately-good starting
sets without repeat letters, with moderately-good individual
words with a repeated letter, in the hopes that they have complementary
strengths, or (b) combine very good parts that individually have
no repeated letters, but whose letter-sets overlap. (Option (b) is
counter-intuitive, since there is more information that can be returned
from a letter repeated in a single word, than from a letter in common
between two words. Nonetheless, we have a better selection of the two
parts in this way.) 

                ----------------

(F) SUMMARY OF BEST SINGLE STARTING WORDS

So what's the single best starting word? By various criteria in this
document, we have seen it might be TRACE, or BRUTE or CHANT, or RAISE
or ARISE, or FILET, or SLATE, or CLASP, or SLEPT, or STARE or ALERT/ALTER/LATER,
or PLATE. I once informally combined some of the metrics to get an overall winner,
and it came out to be PARSE. WordleBot says CRANE, while if you set the
toggles suitably at WordleTools, it will rank TRAIL, TRAIN, or ATONE highest.
One "expert" advocates CLAMP, while someone else "proves" it's SLICE or ADEPT.

Most of these claims have some justification and computation behind them,
so it may seem odd that they can have different conclusions. Of course
it depends on what you wish to optimize and how you will continue
play after the starting word, to try to optimize it. (Most claims about
the best starting word seem to assume the player "will play optimally"
but that seems unrealistic unless a clear and simple algorithm is
presented, and then followed by the player!)


I know the very best one-word starting sets can achieve an average number
of turns as low as 3.4212 while guaranteeing a 0% failure rate. But all
the ones I know of allow use of all the non-answer-list words as well;
I don't know how much higher the minimum is if we restrict to words
from the Wordle answer list. But I do know that these "optimal" algorithms
have a significantly higher amount of branching than our simple algorithms.
(The best score in this document was 3.6590 turns per game, for CRANE + SPILT.)
Also, it bears repeating that the superiority of those algorithms applies only
to original Wordle, not the compound games.

==============================================================================

                CONCLUSION(?)

So ... what does all this tell us about how to play Wordle and the compound games?

One conclusion, surely, is that what we choose to do will depend on what we
consider important. We have seen in the examples that the most important goals 
may be at odds. We want to win as often as possible; we want to keep our numbers
of turns used as low as possible; we want to follow a procedure that is as
simple as possible; and along the way we appreciate not having to come up with
good moves in the absence of concrete hints.

In order to measure just how good or bad a particular starting set is,
we have insisted throughout that it is important to clarify just what the
player will do after the starting set is played. It is interesting to note
however, that (assuming the goal is to minimize the average number of turns)
the relative rankings are similar whether we intend to finish the game
with a guess-at-will strategy, or following a strategy that guarantees a win
within six turns.

So, since people like a firm answer, let's just guess what the reader is
really interested in, and make a recommendation: depending on what
size of set of words you want, the "best" starting set is one of these:

   catty, frond, rumba, spill, verge, whack
   blank, chump, goody, river, swift
   carve, downy, plumb, sight
   bland, copse, right
   crane, spilt

Should I give a personal, not-necessarily-scientific recommendation that takes
into account what actual human players are like? First, familiarize yourself
with the Wordle wordlist! Then
(1) If you're playing Wordle on hard mode, start with PLATE.
(2) If you're playing Wordle on easy mode, and just want to keep your "streak"
    alive, start with CARVE, DOWNY, PLUMB, SIGHT.
(3) If you're playing Wordle on easy mode, and want to minimize your number of moves,
    start with CRANE + SPILT
(4) If you're playing a compound game, start with COPSE, BLAND, RIGHT.
In all cases, memorize as many of the corresponding side rules as you can.
Then, during play, use those rules but otherwise just keep guessing Wordle
words consistent with the clues.

                ----------------

I will continue to process the datasets that I have constructed, and intend to
update this document when something new pops up. In the meantime, I welcome
corrections and suggestions for further investigation. 

Now, how about moving on to a nice game of Nerdle, hmm? :-)

--dave
rusin@math.utexas.edu