[[ This is a subsidiary document to my findings on Wordle and related games. ]]
[[ See math.utexas.edu/~rusin/wordle/ for the full analysis. ]]

Let's talk about what Wordle thinks is meant by "5-letter words".

There will come a day when the correct answer to a Wordle puzzle will be the
word "GOLEM". I know that word from a chance encounter with some Jewish 
folk literature --- a golem is a kind of Frankenstein's monster, a living being
brought to life out of a lump of clay. But most people don't know that word,
so how will they know to enter those letters into the Wordle game?

In order to win Wordle, one must be familiar with the words that Wordle will
eventually use as answers. This is a fixed, finite set of words, which I will
call "the wordlist". In this document I will discuss several aspects of this
wordlist. In particular, a Wordle player should glance through this document
just to get introduced to words that are in one way or another unusual or
unlikely to come to mind while playing.

It's also important to know what's NOT in the wordlist. The Wordle game allows
the player to enter some other "words" as input even though they are never
used as the answer of the day (there may be strategic reasons to do so); but
if the player sees the letters a,c,o,s,t and thinks "COATS" or "TACOS" is
the answer, they will be sad to learn that these words are not even in the
wordlist. (The answer that day is either "COAST" or "ASCOT".)

So we should review the actual wordlist, and see how it compares to
what a typical player might think a good list of 5-letter words is.
==============================================================================

First up: we need to establish exactly what wordlist we really are
talking about.

Josh Wardle, the inventor of the game, selected a list of words that
would be the correct answers day after day. It was a personal choice of
2315 words. You can get an alphabetical listing here. Actually
the words were included in the code for the original website version of
the game, listed in order of play, day after day, so you can download
that original web page and cheat if you like!

In addition to this wordlist, the original program included a longer
list of some ten thousand additional words, many of them obscure, that
would be accepted as input to Wordle. I suppose the inputs have to be limited
to prevent a player from using inputs like "DKLMV" when they have already
figured out that the solution has the form "S-H-A-?-E". Since the
words on this longer list are acceptable inputs to Wordle, I really
ought to run analyses that incorporate them when I discuss solution
techniques. However, it is my own preference never to use these words
(and more generally I dislike using "words" that I don't recognize as
words!) so *almost* without exception I will ignore this longer list:
the 12972-word "input list" which is the union of the 2315-word
solution wordlist and this longer set of second-rate words.
[UPDATE: In summer 2022, the NYT enlarged the set of input words;
there are now 14,853 accepted inputs to Wordle. But again, we will
mostly ignore this longest list.]

Next, it is important to recognize that the "official" word list has
changed and is subject to future change. This is the result of the
purchase of Wordle by the New York Times. While their intent was not
to change the gameplay, they did decide to tinker with the wordlist.
The only changes (so far) of which I am aware were to remove a few
words that the Times considered unfamiliar or offensive:
    agora, fibre, lynch, pupal, slave, wench
(In addition, some of the words were taken out of their original position
in the sequence of answers. This does not affect our analyses since
we invariably treat the game as if the day's hidden word is selected
randomly. I believe the list of "moderated" words is this set of 22:
    augur bobby butch eclat fanny fella fetus flack gaily harry hasty
    hydro liege octal ombre payer sooth stalk unlit unset vomit widow)
[UPDATE: Starting in November 2022, the Times allows a "curator" to
select the hidden word each day. The chosen word is now often linked
in some way to the date. It is possible that the curator will some day
choose answer words that are not on the original list; I also expect
the curator to choose the most obscure words on the original list
only reluctantly.]

I have also enjoyed playing the compound Wordle variants that challenge the
player to guess words on several Wordle games at once.  Again, the authors of
these games are free to choose their solution word lists (and input word
lists) as they see fit. Most of the games that I play have similar word
lists, but there are some differences I know of.  Here is what I know for the
N-fold compound games.

Because of the differences, I cannot guarantee that statements that I
make about the compound games will apply to all of them equally
well. Apart from Dordle and 64ordle, the word sets I propose should
work as well or better for a compound game; for those two in
particular, I can only really say that the predictions I make about
the behaviours of these games do seem to be at least approximately
correct. Let me reiterate: I am making claims about one particular
wordlist; those claims may or may not be applicable to similar sets of
5-letter words in games.


Other word lists are of course possible. For instance, I have a set of
some 8000 words that is (or was) the official set of 5-letter words
that could be played in Scrabble. One may repeat the analyses that I
have done on the Wordle wordlist, for any of these variant lists, but
I have not done so. Just be aware that any claims about a set of words
being "best" are relative to the wordlist being considered.

==============================================================================

So in the rest of this document we will have a look at the 2,315 words
in the original Wordle wordlist. There are two points of view from which
to study this set of words: as sets of actual words (that is, ways of
talking about concepts), and as strings of letters. We'll do both.


So what words did Wardle include, or exclude, when making his list?
It's an idiosyncratic choice. For example it does not include "squid",
even though it contains "squad" --- and "squib"! I have some comments
about words that did get included, and some about the words excluded
from the wordlist.


There are some strange choices that I did not expect to see in the wordlist.
Be prepared to enter these some day as a Wordle answer even if they look wrong.
"Gazer" without star- ? "Willy" without -nilly? "Outgo" (not outgoing)?
I understand "ombre" as shade, "terra" for land, and "caput" for head, but are
they English words? (Did they mean "kaput", which isn't in the list? "Fritz" is.)
"Bleep" and "clack" and "clank" and "humph" and "splat" are sounds, but are they
really words? What part of speech are they? ("Boing" did not make the cut I guess!)
Kids use "scram", "stunk", and "slunk", as well as "snuck" for a past tense,
and say someone was a "goner", but I'm not sure they are adult words. 
I think of "hydro", "hyper", "micro", "quasi", "super", and "ultra" as
being prefixes; Wordle accepts them as words (though not "multi".)
The wordlist includes some words of regional or specialized use, such
as "matey" and "caulk". Both "fiber" and "fibre" can be the word of the day,
but Brits might wonder what a "homer" is, and Yanks don't use "bobby". 
"Fella" is dialect -- spoken, not written, right?
To me, "tepee", "gipsy", "gayly", and maybe "flier" are mis-spelled. 
(But "forgo" *is* ok; it's not the same word as "forego"!)

And of course, each of us will see words in the list that we just don't know. 
For me personally I thought most of the words that made it to Wardle's list 
were words that I recognized, but even so, I --- a veteran of word games ---
drew a blank trying to define, or use in a sentence, some of these:
    boule covey crump debar droit dross eclat gawky harpy iliac junto ovate
    ovine ralph savoy squib swash tatty thrum tonga tulle utile waxen whelp
Your own personal list might include others; for example I have heard complaints
about "rupee", "swill", "agora", "cavil", and "ennui". Here's an article about
the Wordles that people found most difficult.


Conversely, it's important for game play to know what's NOT in the wordlist,
so you don't waste a guess. I have already linked to the lists of hundreds of
words that Dordle and 64ordle thought were perfectly reasonable (and I mostly
agree!) Even beyond all of those, there are words that I have tried to play
in Wordle because they look completely ordinary to me; alas, Wardle
does not include them in the answer wordlist:

      addle amino balky busty cocky ducky eider eland folic grift liane liter
      liven miter muggy pacer pinup rondo roper sedum snafu taser thine thrip

None of the elements argon, boron, radon, nor xenon is in the wordlist: so that's
a consistent pattern.  But usually, in almost every category, the word list
includes some words from that category and omits others --- foods, plants, ethnic
groups; contemporary words and outdated words; verbs and nouns; etc.  Examples:
"Sloop" and "skiff" are Wordle words; "ketch" is not. "Nylon"-yes, "rayon"-yes, "latex"-no.
"Carat" is in the wordlist, "karat" and "caret" are not (nor is "carrot"!).
"Penny" and "pound" of course; but why "rupee" but not "ruble" nor "franc"? 
"Rhino" and "hippo" are OK but "chimp" is too short?
"Torah" and "koran" are absent even though "bible" is in the list (and so is "mecca").
Both "patsy" and "pasty" are included, as is "gusty" -- but not "gutsy".
Also included are "dumpy", "jumpy", and "lumpy", but not "bumpy";
"woozy", "loopy", and "goofy" yes but "doozy" no.
I guess the "skort" has gone out of fashion?
Wordle will expect you to guess "liken" but not "liven", "wight" but not "bight",
"axion" and "quark" but not "boson", "kneed" but not "egged".


Generally the wordlist does not shy away from "politically incorrect" words
("hussy", "lynch") and it generally includes words for squeamish ideas 
("fanny", "fecal", "semen"), though it chooses not to include a few others
("feces", "naked", "penis"). On the other hand, particularly rude
words and slang are definitely not in the wordlist (bitch, boner,
fagot, whore). I ran a comparison to an online "bad word" list to see
which were or were not in the Wordle wordlist, and found plenty of each.

English happily absorbs words from other languages, especially to describe
things like the foods of other cultures, so inevitably the Wordle list 
includes many words that might be considered foreign-language words. Of
course most English words trace their origins to Greek and Latin roots, but
in some cases that English word is unchanged from the original, hence
Wordle words AGAPE and GAMMA, UMBRA and TERRA. Many words are used in
the original French form: BUTTE, ECLAT, ENNUI, PUREE. We have words from
German (BLITZ), Dutch (HOIST), Norse (FJORD), Scots (TWEED), Irish (PHONY),
Scots Gaelic (CAIRN), Spanish (JUNTA), Portuguese (CASTE), Italian (PESTO),
Czech (ROBOT), Polish/Russian (VODKA), Yiddish (BAGEL), Hebrew (RABBI),
Persian (LILAC), Arabic (NADIR), Turkish (KEBAB), Sanskrit (KARMA), Hindi (KHAKI),
Tamil (CURRY), Chinese (RAMEN), Japanese (SUSHI), Indonesian/Malay (GECKO),
Australian aboriginal (KOALA), Tongan (TABOO), Bantu (BANJO), Swahili (JUMBO),
W. African (BONGO), Inuit (IGLOO), Ojibwe (TOTEM), Lakota (TEPEE), Illinois (PECAN), 
Choctaw (BAYOU), Arawak (GUAVA), Nahuatl (CACAO), Tupi (TAPIR).

Of course, a word's history can be long and tortuous. For example, CHESS
actually came to English from Russian, but the Russians got it from the 
Persians. Other Wordle words have longer histories, e.g. look up the
etymology of CANDY or HORDE.


Then there is the question of the multiple forms of a root word.

For verbs, you will never see third-person singular forms (e.g. "talks"); not even "goeth", "lieth",
  nor "doeth". There is one second-person singular form: "shalt". (But not "goest", "doest", "liest".)
Fifteen gerunds of 2- and 3-letter verbs are included ("being", "aging", "dying", etc.)
 although "acing" and "axing" are not. (FWIW, there are 25 words that contain  I,N,G
 but do not end in -ING : 
    again align begin binge bingo deign dingo dingy feign fungi genie giant given
    glint grain grind groin hinge ingot lingo login neigh night reign singe )
A 3-letter word could take "re-" or "un-" as a prefix. Only "fit" does both. I guess
 typically "re-" is used with a present-tense verb and "un-" with a past tense, as in
 the Wordle words  "reuse" and "rerun", but "uncut", "undid", "unfed", "unlit", "unmet"
 "unset", and "unwed". But "untie" and "unzip" are also in the list while "retie" and "rezip"
 are not. (Neither "re-" nor "un-" combination appears with "buy", "dry", "fry", etc.)
 There's also the de- prefix seen in Wordle words "debar", "debug", "detox"; calling out
 other words like "debut" as being verbs with a de- prefix is a bit of an etymological stretch!
You mostly won't see the regular past tense of shorter verbs (e.g. "paced", "acted").
But there are exceptions: the words "bused", "clued", "freed", and "kneed" *are* in the
wordlist, and it looks like just about every *irregular* past form is, too. That includes
*the -ied past tenses of eight 3-letter verbs ending in -y are all in the wordlist;
*the -t variant constructions are in the list (except "blest" is not in the wordlist):
    built cleft crept dealt dwelt knelt leant leapt meant slept spelt spent spilt swept
*the -n past participle forms are accepted, including :
    begun blown borne drawn eaten flown given grown known laden risen shown sworn taken woken woven
  (The wordlist includes many *present* tense -en verbs like "widen" and "liken"; but "waken" is excluded.)
*the past tense forms that change the root vowel: at least, these examples are in the list:
    arose awoke began begat bound broke chose clung drank drove* drunk froze
    flung found heard shone shook slung smote spoke* stank stole stood stuck
    stung stunk swore swung threw undid wound wrote wrung    (*but not "spake" nor "drave").

For nouns, almost without fail, regular plurals are omitted: no "ashes" nor
"boats" nor "lives" nor "skies".  Indeed, there is not a single word ending
in -es ! (And not many -s words at all). I do see "cacti", "fungi", and "radii",
though not "genii" nor "celli". Both "goose" and "geese" are in the list, 
as well as "moose" but not ... sorry. "Media" and "opera" are in the list,
while "quora" is not, but perhaps those are not usually seen as plural nouns.
Other irregular plurals I know to be in the list include "algae", "pence",
"teeth", "women", and I suppose "those". (Strictly speaking, words like 
"sheep" can also be plural.) 
On the other hand some irregular plurals are not in the list; arguably
these are really foreign words and perhaps uncommon even in the singular:
    beaux cilia folia labia phyla sacra styli uteri vacua vitae
It's not surprising that "oases" is not in the list, since "oasis" isn't either! 

Agents formed from a verb are hit or miss: "voter", "giver" and "taker",
(and "tamer" -- as in lions?), as well as the odd-sounding "wooer" and
"aider" are there; "pacer" and "oiler" are not. "Skier" yes, "caver" no.
"Dryer" and "flyer" (and "plier"!) yes, but not "cryer", nor "fryer".
The partial agent word *ater could be "eater" or "hater" but not "rater".
{e,f,i,l,r} could be "flier" or "filer" (or "rifle"!) but not "lifer".

Proper nouns and adjectives are excluded from the Wordle solution list:
there's no Adams nor India. But trump and china are there --- they're common
nouns used in the context of card-playing and table-setting respectively.
So too do many other words in the list sound like proper nouns but they have a
common-noun usage too: frank, stein, peter, billy, patty, smith. (A few of these
are kind of circular reasoning: the teddy bear was named after Teddy Roosevelt,
the ascot tie is named after a place, and I imagine "welch" is actually an
ethnic slur.) Look it up: even "tonga" is a common noun. (I didn't know that.)
("Gauss", "hertz", and "curie" are common nouns in science, but they're not Wordle words.)
"Druid" is a Wordle word, but in practice maybe isn't treated as a proper noun.
"Mecca" is also on the list, and used lower-case to mean "a destination like Mecca".
The only exceptions I have seen to this whole pattern are "Welsh" and "Dutch":
whether before "door", "treat", or "courage", "Dutch" is regularly capitalized,
and I guess there is some kind of reference to the Netherlands and its people.
Ditto with "Welsh".

Adjectives that amount to past-tense verbs are also rare: "abled" is
included, while "tired", "faded", "timed", "bared" are not. A large set
of comparatives ("bluer", "riper", "surer", "truer", "saner", "freer",
"wider") is included (perhaps "tamer" belongs in *this* list?); some
sound strange to my ears.  Yet some others ("cuter", "icier", "lamer",
"abler") are not in the wordlist, even though I personally am more
likely to use the word "cuter" than "saner"! Wordle allows you to be 
"gayer" (and "drier") but not "shyer" (nor "shier"). I don't think
English even has any 5-letter superlatives to think about!

Every adverb that I expected seems to be included in the wordlist
(formed -- in different ways -- from adjectives like apt, coy, icy, dull,
and noble) although conversely I think most of the -ly words in the list
are not adverbs; many are adjectives formed from nouns (like manly, girly,
and wooly) but many are not (filly, imply, jolly, lowly, reply, surly).

==============================================================================

A completely separate way to analyze the wordlist is to think of the words as
just strings of letters. I've run many tabulations about the words this way; 
they can be helpful when trying to reconstruct a word from the hints that accrue
during game play.

For example, it's useful to know that the most common letters are e,a,r,o,t, while
j,q,x,z are rare. Here are the counts of the letters' usages; we can choose to 
count the words containing the letters, or the total number of occurrences of
the letters (i.e. counting repeated letters in a word multiple times, or not). 
We can also see which letters are most likely to be repeated.

Counting words  Counting repeats  Repeats
e, 1056           e, 1233          e, 177
a,  909           a,  979          o,  81
r,  837           r,  899          l,  71
o,  673           o,  754          a,  70
t,  667           t,  729          r,  62
l,  648           l,  719          t,  62
i,  647           i,  671          s,  51
s,  618           s,  669          c,  29
n,  550           n,  575          n,  25
u,  457           c,  477          i,  24
c,  448           u,  467          d,  23
y,  417           y,  425          f,  23
h,  379           d,  393          p,  21
d,  370           h,  389          m,  18
p,  346           p,  367          b,  14
g,  300           m,  316          g,  11
m,  298           g,  311          h,  10
b,  267           b,  281          u,  10
f,  207           f,  230          k,   8
k,  202           k,  210          y,   8
w,  194           w,  195          z,   5   (dizzy fizzy fuzzy jazzy pizza)
v,  149           v,  153          v,   4   (savvy valve verve vivid)
x,   37           z,   40          w,   1   (widow)
z,   35           x,   37          
q,   29           q,   29             ( q,x,j  never repeat )
j,   27           j,   27

So the frequency rankings of the letters in the first two columns are
identical except for the swapping of four pairs (u/c, h/d, g/m, x/z)
in which the second of the pair is more likely to be repeated in a word.
(Note that the numbers in the first and last columns add to the number in
the middle column. This includes counting e.g. "mamma" as one word containing
three "m"s, counting this as TWO repeats of "m".)
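
The three columns can be tabulated along the following lines. This is only a
minimal sketch: SAMPLE is a hypothetical stand-in for the full wordlist, and
the name letter_counts is invented here for illustration.

```python
from collections import Counter

# Hypothetical sample; the real tabulation would use the full 2315-word list.
SAMPLE = ["mamma", "eerie", "coast", "ascot", "widow", "pizza"]

def letter_counts(words):
    """Return (words containing letter, total occurrences, repeats) per letter."""
    words_with = Counter()
    occurrences = Counter()
    for w in words:
        words_with.update(set(w))   # each word counts once per distinct letter
        occurrences.update(w)       # every occurrence counts
    repeats = {c: occurrences[c] - words_with[c] for c in occurrences}
    return words_with, occurrences, repeats

words_with, occurrences, repeats = letter_counts(SAMPLE)
for c in sorted(occurrences, key=lambda c: (-words_with[c], c)):
    print(f"{c}, {words_with[c]:3d}   {occurrences[c]:3d}   {repeats[c]:3d}")
```

By construction, "words containing" plus "repeats" equals "total occurrences"
for every letter, which is the relation between the columns of the table.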

The way the letters are distributed among the words also varies. There
are 5x26=130 potential combinations of "letter xxx in position yyy".
Five of these never occur: there are no words that start with x, none
that end with j,q, or v, and none with a  q  in position 4.
The only word with a  q  in position 3 is "pique", and the only word
that ends with   u  is "bayou"(*). Other rare combinations are
    j4: banjo ninja
    z2: azure ozone
    j2: eject fjord
    y4: polyp satyr vinyl
    x4: epoxy proxy twixt
    j3: enjoy major rajah
    z1: zebra zesty zonal
    z5: blitz fritz topaz waltz
The list continues with q2 (in 5 words), y1 (in 6), x5(8), f2(8), h3(9), k2(10).
The most common combinations are  e5 (424 words), s1 (366), y5 (364), e4 (318),
a3 (307), and a2 (304). All other combinations occur in between 11 and 279
words. (Median: 61; average: about 93).
[(*) On 2023-04-09, the hidden word was SNAFU, the second word not on the original
     Wardle list to become a hidden word.]
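
Counts like "j4" or "z5" come from tallying {letter, position} pairs. Here is
a small sketch of that tally, assuming a hypothetical SAMPLE of words drawn
from the lists above; the function names are made up for this example.

```python
from collections import Counter

# Hypothetical sample; substitute the full wordlist for real counts.
SAMPLE = ["pique", "bayou", "banjo", "ninja", "azure", "ozone", "zebra", "blitz"]

def position_counts(words):
    """Count (letter, position) pairs, with positions numbered from 1."""
    return Counter((c, i) for w in words for i, c in enumerate(w, start=1))

def words_with(words, letter, pos):
    """List the words that have `letter` in position `pos` (1-based)."""
    return [w for w in words if w[pos - 1] == letter]

counts = position_counts(SAMPLE)
print(counts[("j", 4)], words_with(SAMPLE, "j", 4))
```

Combinations that never occur simply have a count of zero, since a Counter
returns 0 for missing keys.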

These numbers can help find pairs of words that are hard to distinguish.
For example, there are only 17 w5s and only 11 b5s; that means there are
only 28 words that can distinguish "throw" from "throb" on the basis of
their response in the last column. Here are the next "closest" pairs:
    28 = 17 w5 + 11 b5 thro*
    28 = 16 b2 + 12 g2 a*ate
    32 = 16 b2 + 16 s2 a*ide
    32 = 29 y3 +  3 j3 ma*or
    33 = 23 y2 + 10 k2 e*ing
    35 = 20 d2 + 15 v2 e*ict
    38 = 26 w3 + 12 k3 po*er
Some of these small sets of testwords can be used on multiple word pairs.
For example there are the 49 v3s and 26 w3s that together are the only 75 words
that in this way can distinguish *seven* pairs
    co*er fe*er lo*er mo*er ne*er ro*er se*er
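
The test words for such a pair can be listed mechanically. The sketch below
assumes two words that differ in exactly one position and considers only
guesses that place one of the two letters in that same column (the sense used
above); SAMPLE and the function names are hypothetical.

```python
# Hypothetical sample; the counts quoted above need the full wordlist.
SAMPLE = ["throw", "throb", "elbow", "climb", "coast", "allow"]

def differing_position(a, b):
    """Return the single 0-based index where a and b differ, else None."""
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return diffs[0] if len(diffs) == 1 else None

def column_testers(words, a, b):
    """Words whose letter in the differing column matches a's or b's letter."""
    i = differing_position(a, b)
    if i is None:
        raise ValueError("words must differ in exactly one position")
    return [w for w in words if w[i] in (a[i], b[i])]

print(column_testers(SAMPLE, "throw", "throb"))
```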


It is possible to have two sets of words that have identical collections
of {letter, position} pairs: an example is  POINT+CLUED  and  COUNT+PLIED .
We will call such pairs of wordsets "perfect anagrams" of each other.
Since each pair has no repeated letters, that means that the collections
of hints (colored tiles) provided by each pair are identical, and so the
clusters of words the pairs create (that is, the sets of hidden words that
would yield the same colored tiles from the pair) are identical too.
Consequently, for example, starting Wordle with SWARM+POINT+CLUED will
give exactly the same statistics (e.g. percentage of times failed) as if
starting Wordle with SWARM + COUNT + PLIED !  (Similar, more obvious,
examples include SPIRE+BLOND vs SPORE+BLIND .) There are also interesting
instances with triples, e.g. SPORT+MANGE+CHILD and SPILT+MANGE+CHORD, or
even multi-sets like
  {twang, slump, cried}, {tried, swung, clamp}, {tried, swamp, clung}, {tramp, swing, clued}
any two of which have the exact same results when used as a Wordle opening!
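
Checking whether two word sets are "perfect anagrams" amounts to comparing
their collections of {letter, position} pairs. A minimal sketch, with function
names invented for this illustration:

```python
from collections import Counter

def position_pairs(wordset):
    """Multiset of (letter, position) pairs over all words in the set."""
    return Counter((c, i) for w in wordset for i, c in enumerate(w))

def perfect_anagrams(set_a, set_b):
    """True if both word sets use exactly the same (letter, position) pairs."""
    return position_pairs(set_a) == position_pairs(set_b)

print(perfect_anagrams(["point", "clued"], ["count", "plied"]))

openings = [
    ["twang", "slump", "cried"], ["tried", "swung", "clamp"],
    ["tried", "swamp", "clung"], ["tramp", "swing", "clued"],
]
print(all(perfect_anagrams(openings[0], other) for other in openings[1:]))
```

Both checks print True for the examples given in the text.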


Altogether there are 1566 words that consist of five distinct letters; the other
749 words have repeated letters. Thinking of these latter as "poker hands", most 
of them (691) have only a single pair. But there are 38 words that show "two pairs":
    allay amass array assay belle booby cacao civic cocoa femme
    freer gamma kappa kayak level llama madam magma mimic minim
    motto onion papal penne queue radar refer rotor salsa sense
    shush slyly teeth tenet tooth tweet verve vivid
Also 19 are "three of a kind":
    bobby daddy eerie emcee error fluff geese mammy melee mummy
    nanny ninny poppy puppy rarer sassy sissy tatty tepee
And there is one "full house":
    mamma
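
The "poker hand" of a word depends only on how many times each of its letters
occurs. One way to sketch the classification (the labels and the function name
are chosen here for illustration):

```python
from collections import Counter

HANDS = {
    (1, 1, 1, 1, 1): "no pair",
    (2, 1, 1, 1): "one pair",
    (2, 2, 1): "two pairs",
    (3, 1, 1): "three of a kind",
    (3, 2): "full house",
}

def poker_hand(word):
    """Classify a 5-letter word by the multiplicities of its letters."""
    shape = tuple(sorted(Counter(word).values(), reverse=True))
    return HANDS.get(shape, "other")

for w in ["coast", "apple", "allay", "bobby", "mamma"]:
    print(w, poker_hand(w))
```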

Some of these repeated-letter sets come from multiple words, and some are
contained in others. That lone 2-letter combination {a,m} is contained in one
three-of-a-kind word "mammy" and five two-pair words (amass, llama, madam,
plus gamma and magma (which use the same third letter, G)).
The 38+19=57 words that use three distinct letters involve only 52 different
3-letter sets because of these overlaps:
* [cacao, cocoa] are both 2-pair hands but they double different letters
* [freer, refer] and [gamma, magma] are 2-pair hands with the same doubled letters
* [assay, sassy] and [bobby, booby] match a 2-pair hand with a three-of-a-kind.
Each of the 52 three-letter sets is part of a four-letter set used in a Wordle
word except for the letters in FEMME and KAYAK. (Each of those 3-letter
sets is contained in the letters of multiple words, e.g. FRAME and GAWKY.)
The 691 words that simply have a repeated letter give rise to only 591
4-letter sets because some single-pair words use the same sets of letters,
including four sets of four words each:
* [erode, odder, order, rodeo] each doubles a different letter
* [ester, reset, steer, terse] each doubles the same letter E
* [asset, state, taste, tease] and [puree, purer, rupee, upper]
There are also nine sets of three words built from the same four letters:
 [attic, cacti, tacit] [algae, eagle, legal] [eater, terra, treat]
 [allot, atoll, total] [belie, bible, libel] [dried, drier, rider]
 [ether, there, three] [serve, sever, verse] [strut, truss, trust]
And there are 70 more pairs of words that use the same four letters
([leper, repel], [sleep, spell], etc.). Of the 591 four-letter sets
used in Wordle words, most of them are also part of a five-letter
set (e.g. TRYST -> r,s,t,y -> RUSTY), but 137 are maximal, including
seven maximal four-letter sets that come from more than one word:
 [batty, tabby] [fatty, taffy] [elegy, leggy] [foggy, goofy]
        [loopy, polyp] [lowly, wooly] [snoop, spoon]
Finally, the 1566 words that consist of five distinct letters actually
give rise to only 1393 sets of 5 distinct letters, because as we will
see there can be as many as 4 Wordle words built from the same set
of 5 distinct letters (e.g. cater crate react trace).
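
The letter sets and the "maximal" ones can be found by grouping words on their
set of distinct letters. The sketch below runs on a hypothetical SAMPLE, so
"maximal" here means maximal within that sample only; the function names are
assumptions of this example.

```python
from collections import defaultdict

# Hypothetical sample; figures like 591 and 137 require the full wordlist.
SAMPLE = ["tryst", "rusty", "batty", "tabby", "cater",
          "crate", "react", "trace", "erode", "order"]

def letter_sets(words):
    """Map each set of distinct letters to the words built from it."""
    groups = defaultdict(list)
    for w in words:
        groups[frozenset(w)].append(w)
    return groups

def maximal_sets(groups, size):
    """Letter sets of the given size not contained in any set one letter larger."""
    bigger = [s for s in groups if len(s) == size + 1]
    return [s for s in groups if len(s) == size and not any(s < b for b in bigger)]

groups = letter_sets(SAMPLE)
for s in maximal_sets(groups, 4):
    print(sorted(s), groups[s])
```

In this sample the letters of TRYST are not maximal (they sit inside RUSTY),
while those of [batty, tabby] are.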


Some combinations of letters are rare. For example, not only are q,x,z, and j
rare individually, they are extremely rare together: the only words containing
two of these are the five double-z words shown above (including "jazzy", which
has three!)

Next rarest is "v". Apart from the four double-v words shown above, the only
word adding  v  to q,x,z,j  is "vixen".

Next rarest is "w". There is just one double-w word "widow". The only words
having  w  and some of q,x,z,j  or  v  are
    jewel twixt waltz waxen woozy     vowel waive waver weave woven

Next rarest is "k" . The double-k words are
    kayak khaki kinky kiosk knack knock   skulk skunk
The words with  k  and one of  q,x,z,j,v  are
    evoke jerky joker knave quack quake quark quick quirk vodka
and the ones with both  k  and  w  are
    askew awake awoke gawky known tweak wacky whack whisk woken wrack wreak wreck

Next rarest is "f". The double-f words are
    fifth fifty   affix offal offer  gaffe jiffy puffy taffy  and 13 ...ff 's 
The words with  f  and one of q,x,z,j,v  are
    fjord jiffy   affix fixer   fizzy fritz froze fuzzy    favor fever
Both  f  and  w :
    awful dwarf fewer flown frown swift wafer wharf whiff
Both  f  and  k : quite a few (38) --- see separate file.


The 29  q  words only use {a, b, c, d, e, h, i, k, l, m, n, o, p, q, r, s, t, u, y};
never  z,x,j,v,w, nor f nor g. Without exception, q is followed by u.
The only q words with double letters are
    queen queer quell queue quill
Note that  q  is always in position 1 or (as eq- or sq-) in position 2, except for
    pique.

Of the 35  z  words, the only ones with a double letter are
    amaze bezel booze boozy ozone plaza razor seize woozy
and the five --zz- words shown above.

The only occurrences of "h" that are not preceded by  {c,s,t,p,w,g}  either
have the h in front or are in these 8 words:
    abhor ahead khaki myrrh rajah rehab rhino rhyme
(This is useful information when  H turns yellow after I enter a set of
words that includes e.g. {c,s,t,p,g} but not  W ; I either have to start
with the H or else I know a W is missing. Moreover, the only WH pairs in
the Wordle solution list occur at the start of the word!)
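
A search of this kind (for "h" here, and equally for "k") might look like the
following sketch. SAMPLE is a hypothetical handful of words and the function
name is invented for the example.

```python
# Hypothetical sample mixing the 8 exceptions with ordinary "h" words.
SAMPLE = ["abhor", "ahead", "khaki", "myrrh", "rajah", "rehab", "rhino",
          "rhyme", "chose", "shook", "threw", "whack", "ghost", "heard", "alpha"]

def unusual_predecessor(words, letter, allowed_before):
    """Words with a non-initial `letter` not preceded by an allowed letter."""
    hits = []
    for w in words:
        for i in range(1, len(w)):
            if w[i] == letter and w[i - 1] not in allowed_before:
                hits.append(w)
                break
    return hits

print(unusual_predecessor(SAMPLE, "h", "cstpwg"))
```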

The only occurrences of "k" that are not preceded by  {c,s,n,r,l,o,a,e,i}  either
have the k in front or are in these words:
    fluke gawky vodka

Of the 417 words with a "y", 364 have a "y" in the final position; the other
positions are more rare. The only words with  "y" in position 4 are
    polyp satyr vinyl
(Note that  y  is used as a vowel here.) The only words with "y" in front are
    yacht yearn yeast yield young youth
(where  y  is used as a consonant). There are 23 words with "y" in position 2;
in all cases this  y  is used as a vowel.
    bylaw cyber cycle cynic dying eying gypsy hydro hyena hymen
    hyper lying lymph lynch lyric myrrh nylon nymph pygmy synod
    syrup tying vying
(note that "gypsy" and "pygmy" have another y at the end.)
Finally there are 29 words with "y" in the middle (including 6 with two "y"s);
I would argue this is a mix of vowel-y and consonant-y cases.
    abyss bayou buyer coyly crypt dryer dryly flyer foyer gayer
    gayly glyph idyll kayak layer loyal maybe mayor payee payer
    rayon rhyme royal shyly slyly style thyme tryst wryly


I looked for the patterns of vowels, consonants, and "y" (treated separately).
There are 53 distinct patterns, including 20 different patterns involving "y"s.
Four patterns account for just about half of all words:
    cvcvc, e.g. "coral" (391 times = 17% of all words)
    ccvcc, e.g. "bring" (349 times = 15%)
    cvccy, e.g. "manly" (219 times =  9%)
    ccvvc, e.g. "speak" (192 times =  8%)
The next few are ccvcv, cvccv, cvvcc, vccvc, ...
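
A pattern such as "cvccy" is obtained by mapping each letter to c, v, or y.
A minimal sketch, assuming the vowels are a,e,i,o,u with "y" kept as its own
class as described above (the function name is hypothetical):

```python
from collections import Counter

def cv_pattern(word):
    """Map each letter to 'v' (a,e,i,o,u), 'y', or 'c' (anything else)."""
    return "".join(
        "v" if ch in "aeiou" else "y" if ch == "y" else "c" for ch in word
    )

# Hypothetical sample; tallying the full wordlist gives the pattern counts.
SAMPLE = ["coral", "bring", "manly", "speak", "queue", "angst", "yacht"]
tally = Counter(cv_pattern(w) for w in SAMPLE)
for w in SAMPLE:
    print(w, cv_pattern(w))
```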

At the other extreme are 16 patterns used only once or twice:
    queue           (the only word with 4 consecutive vowels)
    angst           (the only word with 4 consecutive consonants)
    yacht           (the other 5 words starting with "Y" are all  yvvcc )
    hyena
    eying
    maybe
    gooey
    audio, eerie   (audio is the only word with four distinct vowels...)
    aunty, early
    abyss, idyll
    bayou, payee   (... unless you include "y", in which case bayou wins too.)
    coyly, gayly
    gypsy, pygmy
    cycle, hydro
    dryer, flyer
    spray, stray
Apart from "queue", the other words with 3 vowels in a row are:
    gooey quail queen queer quiet wooer
There are quite a few strings of three consonants, e.g. "sixth", "aptly". 
But at the front of a word only   scr- shr- spl- spr- str- thr-  appear, and
the only vcccv words are
  alpha amble ample angle ankle apple extra intro ombre ultra umbra uncle

For the most part, a word that starts with a consonant must have a vowel next,
but there are many initial consonant blends of these forms:
  *R: the first consonant must be one of  bcdfgkptw
  *L: after  bcfgps
  *H: after  cpstw
  S*: before  ckmnpqtw
Partially following those patterns are the atypical "llama", "ghost", "ghoul",
"khaki", "krill", "rhino", and "rhyme". There are also 9 words starting with TW
and 10 starting with KN, plus these oddballs:
    dwarf, dwell, dwelt, fjord, gnash, gnome, psalm
The initial strings only continue to a third consonant with the triples
SCR-, SHR-, SPR-, STR-, THR-, and SPL- .


English has those old-fashioned double-vowel symbols "ae" and "oe"
(e.g. in "foetus" or "Aesop") but they are essentially absent here. In
fact, several vowel permutations are rare or missing altogether.
There are no words containing the permutations AA UU IY YU or YY.
The other permutations that occur in fewer than 10 words are
    AE: algae
    II: radii
    IU: opium
    UY: buyer
    AO: aorta, cacao, chaos
    EO: cameo, rodeo, video
    EU: deuce, queue, reuse
    UO: quota, quote, quoth
    OE: canoe, gooey, poesy, wooer
    YA: kayak, loyal, royal, yacht
    YO: bayou, mayor, rayon, young, youth
    YI: dying, eying, lying, tying, vying, yield
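
Adjacent vowel pairs like these can be collected with a sliding window of two
letters. This sketch treats "y" as a vowel, as the list above does; SAMPLE and
the function name are hypothetical.

```python
from collections import defaultdict

VOWELS = "aeiouy"

# Hypothetical sample; the full wordlist is needed for complete lists.
SAMPLE = ["algae", "radii", "opium", "buyer", "queue", "deuce", "yield"]

def vowel_pairs(words):
    """Map each adjacent ordered vowel pair (upper case) to the words containing it."""
    found = defaultdict(list)
    for w in words:
        pairs = {w[i:i + 2] for i in range(len(w) - 1)
                 if w[i] in VOWELS and w[i + 1] in VOWELS}
        for pair in pairs:   # a set, so each word is listed once per pair
            found[pair.upper()].append(w)
    return found

found = vowel_pairs(SAMPLE)
print(found["AE"], found["EU"])
```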


Someone on Reddit posted a table of the numbers of words with all 2^6 = 64 vowel combinations:
        --- E   O   U   EO  EU  OU  EOU
    ---  0  167 124  87 154  88  69  14
    A   166 277 106  45  16  19   4
    I   125 174  64  36  11  17   3
    Y    13  45  62  58  14   2   4
    AY   88  28  14   3
    AI   97  22   9   3
    IY   61   8   3   3
    AIY  10
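
The counting behind a table like this can be sketched in a few lines of Python. This is my own illustration, not the Reddit poster's method: vowel_signature is a hypothetical name, Y is assumed to be a vowel, and the sample is a handful of words rather than the real wordlist.

```python
from collections import Counter

def vowel_signature(word, vowels="aeiouy"):
    """The distinct vowels used by a word, in a fixed order ('' if none)."""
    return "".join(v for v in vowels if v in word)

# Small sample (an assumption); substitute the real wordlist.
sample = ["crate", "slate", "trace", "queue", "audio", "crypt", "lynch"]

table = Counter(vowel_signature(w) for w in sample)
for sig, n in sorted(table.items()):
    print(sig.upper() or "---", n)
```

Each key of the Counter corresponds to one cell of the table: the letters A, I, Y in the key pick the row and the letters E, O, U pick the column.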

Up to 4 words may be made from the same set of 5 letters:
    cater crate react trace
    glare lager large regal 
    leapt petal plate pleat 
    least slate stale steal 
    resin rinse risen siren 
    ester reset steer terse (which uses only 4 *different* letters)
The same set of letters (ignoring how many times each is used) can also make
as many as four words, as occurs in (only) these cases:
    erode odder order rodeo
    asset state taste tease
    puree purer rupee upper
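
Both kinds of grouping come down to choosing a "key" for each word: the sorted letters if counts matter, the sorted distinct letters if they don't. Here is a minimal sketch; group_by, multiset_key and set_key are names I made up, and the sample is just three of the families listed above.

```python
from collections import defaultdict

def group_by(words, key):
    """Collect words into lists according to key(word)."""
    groups = defaultdict(list)
    for w in words:
        groups[key(w)].append(w)
    return dict(groups)

def multiset_key(word):
    """Same letters with the same counts, e.g. 'crate' -> 'acert'."""
    return "".join(sorted(word))

def set_key(word):
    """Same letters, counts ignored, e.g. 'odder' -> 'deor'."""
    return "".join(sorted(set(word)))

# Small sample (an assumption); substitute the real wordlist.
sample = ["cater", "crate", "react", "trace",
          "erode", "odder", "order", "rodeo",
          "ester", "reset", "steer", "terse"]

by_multiset = group_by(sample, multiset_key)
by_set = group_by(sample, set_key)
print(by_multiset["acert"])
print(by_set["deor"])
```

The length of a set_key is the number of distinct letters the word uses.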

(In our discussion of "word similarity" in a separate file, we will look
at the 54 pairs of words that have the same five letters with none in the
same location. This includes some words with two "neighbors":
    angel-glean-angle
    ester-terse-steer
    trace-cater-react
    tuber-brute-rebut         plus the longer chain
    lager-glare-regal-large   and the closed loop
    trove-overt-voter(-trove)
so the 54 pairs actually involve only 99 different words.)

The 2315 words are made from 2037 different sets of letters. There is one set
{m,a} of cardinality 2, from which we can (only) make "mamma". There are 52 sets of 
cardinality 3, including some from which we can make two words:
    {a, s, y}, assay, sassy
    {b, o, y}, bobby, booby
    {a, c, o}, cacao, cocoa
    {e, f, r}, freer, refer
    {a, g, m}, gamma, magma
There are 591 sets of cardinality 4, from which we can make as many as 4 words,
as noted in the four cases above. (There are also 9 cases when the same four letters
can make 3 different words, namely the letter-sets in these words:
    algae, allot, attic, belie, dried, eater, ether, serve, strut   )
We have already noted that the 1393 sets of 5 distinct letters can form
as many as four words each.

I did not do a search on all sets of 5 letters, but I checked all the 5-letter
sets that make a Wordle word, and all the quintuples of the 16 most common letters,
and found sets that could make a dozen or more words in the wordlist:

16: asset, easel, elate, latte, lease, LEAST, salsa, slate, sleet, stale, stall, 
    state, steal, steel, taste, tease
15: asset, eater, erase, ester, rarer, reset, STARE, start, state, steer, taste, 
    tease, terra, terse, treat
13: EARTH, eater, ether, hater, heart, heath, rarer, teeth, terra, there, 
    theta, three, treat
12: carat, CATER, crate, eater, erect, racer, rarer, react, terra, trace, tract, treat
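
A search of this kind might look like the following sketch. It is not the program I actually ran: words_within and best_letter_sets are hypothetical names, and the sample words and the 7-letter alphabet are chosen only to keep the example small.

```python
from itertools import combinations

def words_within(letters, words):
    """Words that use no letter outside the given set (repeats allowed)."""
    allowed = set(letters)
    return [w for w in words if set(w) <= allowed]

def best_letter_sets(words, alphabet, size=5):
    """Score every `size`-letter subset of `alphabet`, best first."""
    scored = []
    for combo in combinations(sorted(alphabet), size):
        scored.append((len(words_within(combo, words)), "".join(combo)))
    return sorted(scored, reverse=True)

# Small sample (an assumption); substitute the real wordlist.
sample = ["least", "slate", "state", "salsa", "stall", "crate", "earth"]

print(words_within("least", sample))
print(best_letter_sets(sample, "acelrst")[0])
```

With 16 letters there are only 4368 quintuples to try, so brute force is perfectly adequate here.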



There are cases in which a Wordle game can look almost complete but be
far from finished! I mentioned above the example of "S-H-A-?-E" that
occurred on a recent day. We can look for large sets of words that all
have the same four letters in the same places. The largest such sets are
      9 .ight    eflmnrstw  note repeat t (i.e. "tight" is one of the 9 words)
      8 .ound    bfhmprsw
      7 .ower    clmprst    repeat r
      7 .atch    bchlmpw    repeat c,h
      7 sha.e    dklmprv
      6 .illy    bdfhsw
      6 .aunt    dghjtv     repeat t
      6 .atty    bcfprt     repeat t
      6 .aste    bchptw     repeat t
      6 s.ore    chnptw
      6 sta.e    gklrtv     repeat t
      6 gra.e    cdptvz
Then there are 18 sets of 5:
     .ying .usty .unch .ully .rown .ried .over .ough .olly
     .each .ater .andy s.out s.oop s.are s.ack sto.e spi.e
There are 66 sets of size 4, and many sets of size 3 and smaller. Obviously a
player should keep this in mind and make sure to choose early some test words that 
use up a fair portion of these interchangeable missing letters!

Sets of words sharing 3 letters can of course be larger; the largest is the set of
29 words following the pattern  .a.er :
    baker baler caper cater eager eater gamer gayer gazer hater lager
    later layer maker paler paper parer payer racer rarer safer saner
    taker tamer taper wafer wager water waver
Then there are 25  .o.er
    boxer corer cover cower foyer goner homer hover joker loser lover
    lower mover mower poker poser power roger rover rower sober sower
    tower voter wooer
And 24 each of .i.er  and  s.a.e :
    aider cider diner diver fiber filer finer fixer giver liner liver
    miner miser nicer piper rider riper riser river tiger timer viper
    wider wiser
    scale scare shade shake shale shame shape share shave skate slate
    slave snake snare space spade spare stage stake stale stare state
    stave suave
The next most common patterns are ..ing (23), ..lly (22), .ra.e (20), then
    sta..  ..tch  s.o.e  sha..  .ri.e  s.i.e  ..ter  sto..  s.or.  s.ar.  .ro.e  .la.e
each match 15-19 words.

Of course with only (1 or) 2 letters fixed, the number of possibilities grows!
The largest set is ..a.e  which matches 84 words, then s.a.. and .a..y with 78 each,
followed by s...e (75), ..i.e (72), s.o.. (62), ..o.e (62), etc.
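
All of the pattern counts in this section (one, two, or three open slots) can be produced the same way: blank out every possible choice of slots in every word and group the words by the resulting pattern. A minimal sketch, with families as a made-up name and a ten-word sample standing in for the real wordlist:

```python
from collections import defaultdict
from itertools import combinations

def families(words, blanks=1):
    """Group words by every pattern obtained by blanking `blanks` slots."""
    groups = defaultdict(list)
    for w in words:
        for hidden in combinations(range(len(w)), blanks):
            pattern = "".join("." if i in hidden else ch
                              for i, ch in enumerate(w))
            groups[pattern].append(w)
    return dict(groups)

# Small sample (an assumption); substitute the real wordlist.
sample = ["shade", "shake", "shale", "shame", "shape", "share", "shave",
          "baker", "maker", "taker"]

one = families(sample, blanks=1)
two = families(sample, blanks=2)
largest = max(one, key=lambda p: len(one[p]))
print(largest, "".join(w[3] for w in one[largest]))
print(two[".a.er"])
```

Sorting the patterns by the size of their word lists gives the rankings quoted above.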


Apart from looking simply at words that share some of the letters (in particular slots),
we can look more broadly at words that seem "somehow similar" to a given word. I have a separate
file, "similarity", in which I consider this idea more thoroughly and use it in my analyses. 
I make (arbitrary but clear) definitions of what it means for words to be "neighbors",
based on how many yellow and green tiles each would show if guessed against the other
word as the hidden word-of-the-day.

For starters, we could consider the other words against which this word 
measures either 4 greens, or 3 greens and 2 yellows. It turns out that "share" 
is the unique word with as many as 15 of these (immediate) neighbors
    scare, shade, shake, shale, shame, shape, shard, shark,
    sharp, shave, shire, shore, snare, spare, stare
(In this case, all the neighbors are obtained by changing just one letter
of "share", as opposed to swapping two of them.)
No word has 14 neighbors; "store" (alone) has 13 (again, all are single-letter
changes). The numbers of words with  n  neighbors then continue as
    12, 3 (shore; stare; and stale, with a "permutation" neighbor "slate")
    11, 1 (shale)
    10, 3 (shave,stack,patty)
    9, 12 (including spine, with neighbor snipe; and stake, with neighbor skate)
    8, 43
    7, 53
    6, 92
    5, 147
    4, 203
    3, 316
    2, 401
    1, 487
    0, 552, that is, there are 552 words with no neighbors at all. Among
them I might highlight words like affix, femme, jazzy, kayak, khaki, pizza
which rarely even give 3 colored tiles against any other word.
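
To make the definition concrete, here is a sketch of the neighbor test in Python. The names score, are_neighbors and neighbors are mine, and I am assuming the usual Wordle handling of repeated letters: the total number of colored tiles equals the number of letters the two words share, counted with multiplicity.

```python
from collections import Counter

def score(guess, answer):
    """Return (greens, yellows) for a guess against a hidden answer."""
    greens = sum(g == a for g, a in zip(guess, answer))
    # Letters in common, counted with multiplicity; this includes the greens.
    common = sum((Counter(guess) & Counter(answer)).values())
    return greens, common - greens

def are_neighbors(w1, w2):
    """4 greens, or 3 greens and 2 yellows (the definition used above)."""
    return w1 != w2 and score(w1, w2) in {(4, 0), (3, 2)}

def neighbors(word, words):
    return [w for w in words if are_neighbors(word, w)]

# Small sample (an assumption); substitute the real wordlist.
sample = ["share", "shale", "shame", "stare", "store", "stale", "slate"]

print(neighbors("share", sample))
print(neighbors("stale", sample))
```

Because the two counts are symmetric in the two words, it does not matter which word is treated as the guess and which as the answer.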

We can then include neighbors of neighbors (etc.); the most neighborly is
again "share" (with 56 neighbors), then "stare" (54) and "shore" (53) and other
words of the form  "s{chpt}{ao}{lr}e" and related types; nothing really
different shows up until "clack" (34) and "lover" (33). But note that this
might not really be what we want to measure: isolates will stay isolates
even if we iterate (3-step neighbors, etc.). After all, the notion of
"neighbors" in the previous paragraph considers only pairs of words
obtained from each other by either a permutation or a single-letter
substitution. Iterating to "neighbors of neighbors" does enlarge a
neighborhood (including for words that are neighbors because of a single
permutation of their letters), but maybe what we want instead is to enlarge
the collection of "neighbors" by allowing words that change out *two* or
more letters at once. We will address this in the separate file regarding
similarity between pairs of words.
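
Iterated neighborhoods are just a breadth-first search on the "neighbor" graph. The sketch below is self-contained, so it repeats a compact version of the neighbor test (4 greens, or 3 greens and 2 yellows, with the usual handling of repeated letters assumed); within_steps is a made-up name and the five-word sample is only for illustration.

```python
from collections import Counter, deque

def are_neighbors(w1, w2):
    greens = sum(a == b for a, b in zip(w1, w2))
    common = sum((Counter(w1) & Counter(w2)).values())
    return w1 != w2 and (greens, common - greens) in {(4, 0), (3, 2)}

def within_steps(start, words, steps):
    """Words reachable from `start` in at most `steps` neighbor moves."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if dist[cur] == steps:
            continue  # do not expand past the step limit
        for w in words:
            if w not in dist and are_neighbors(cur, w):
                dist[w] = dist[cur] + 1
                queue.append(w)
    del dist[start]  # the start word is not its own neighbor
    return sorted(dist)

# Small sample (an assumption); substitute the real wordlist.
sample = ["share", "shale", "stale", "slate", "pizza"]

for k in (1, 2, 3):
    print(k, within_steps("share", sample, k))
print(within_steps("pizza", sample, 3))
```

In this sample "slate" is reached only at the third step (share, shale, stale, slate), while "pizza" stays isolated however far we iterate.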


Oh, and one more tally for the fat-fingered typists out there (like me!).
There are 373 pairs of words that differ in only one letter, where the two
differing letters are adjacent on the Wordle keyboard, e.g. WRING/WRONG.
Watch carefully as you type, or check your typing before you hit "enter"!
At worst, there are just a few Wordle words that can have as many as 4
of these typos in the Wordle list:
    dilly --> dolly, dully, filly, silly
    silly --> dilly, silky, sully, willy
    wager --> eager, wafer, water, waver

(There are 562 more such pairs that can be transformed into each other
using two such typos, so if you're especially fat-fingered, be extra careful!
This time around the worst offenders are words that can be mistyped into SIX
other words if we have exactly two errors:
    fiber --> diner, diver, fibre, giver, river, tuber
    sling --> along, aping, doing, eking, slimy, spiny
    sting --> aging, dying, eying, stony, stunt, wring
and words that can be mistyped into EIGHT other words by using at most two errors:
    dilly --> dolly, dully, filly, silly; folly, fully, silky, sully
    sting --> stint, stung; aging, dying, eying, stony, stunt, wring )
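
Here is a sketch of how such typo pairs can be detected. The adjacency rule is an assumption on my part, chosen to be consistent with the examples above: two keys count as adjacent if they sit side by side in the same row, or touch across neighboring rows when the second row is shifted half a key and the third row a key and a half. The names keys_adjacent and typos_apart are made up for illustration.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
SHIFT = [0, 1, 3]  # assumed left offset of each row, in half-key units

# (row, horizontal position in half-key units) for every letter
POS = {ch: (r, 2 * i + SHIFT[r])
       for r, row in enumerate(ROWS) for i, ch in enumerate(row)}

def keys_adjacent(a, b):
    """True if two distinct keys share an edge under the assumed layout."""
    (ra, xa), (rb, xb) = POS[a], POS[b]
    if ra == rb:
        return abs(xa - xb) == 2
    return abs(ra - rb) == 1 and abs(xa - xb) <= 1

def typos_apart(w1, w2):
    """Number of differing slots if each is an adjacent-key slip, else None."""
    diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
    if all(keys_adjacent(a, b) for a, b in diffs):
        return len(diffs)
    return None

print(typos_apart("wring", "wrong"))
print(typos_apart("fiber", "tuber"))
print(typos_apart("dilly", "hilly"))
```

Running typos_apart over all pairs of words and keeping those that return 1 or 2 gives tallies of the kind quoted above, though a different adjacency rule would change the counts.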