[[ This is a subsidiary document to my findings on Wordle and related games. ]]
[[ See math.utexas.edu/~rusin/wordle/ for the full analysis. ]]
Let's talk about what Wordle thinks is meant by "5-letter words".
There will come a day when the correct answer to a Wordle puzzle will be the
word "GOLEM". I know that word from a chance encounter with some Jewish
folk literature --- a golem is a kind of Frankenstein's monster, a living being
brought to life out of a lump of clay. But most people don't know that word,
so how will they know to enter those letters into the Wordle game?
In order to win Wordle, one must be familiar with the words that Wordle will
eventually use as answers. This is a fixed, finite set of words, which I will
call "the wordlist". In this document I will discuss several aspects of this
wordlist. In particular, a Wordle player should glance through this document
just to get introduced to words that are in one way or another unusual or
unlikely to come to mind while playing.
It's also important to know what's NOT in the wordlist. The Wordle game allows
the player to enter some other "words" as input even though they are never
used as the answer of the day (there may be strategic reasons to do so); but
if the player sees the letters a,c,o,s,t and thinks "COATS" or "TACOS" is
the answer, they will be sad to learn that these words are not even in the
wordlist. (The answer that day is either "COAST" or "ASCOT".)
So we should review the actual wordlist, and see how it compares to
what a typical player might think a good list of 5-letter words is.
==============================================================================
First up: we need to establish exactly what wordlist we really are
talking about.
Josh Wardle, the inventor of the game, selected a list of words that
would be the correct answers day after day. It was a personal choice of
2315 words. You can get an alphabetical listing here. Actually
the words were included in the code for the original website version of
the game, listed in order of play, day after day, so you can download
that original web page and cheat if you like!
In addition to this wordlist, the original program included a longer
list of some ten thousand additional words, many of them obscure, that
would be accepted as input to Wordle. I suppose the inputs have to be limited
to prevent a player from using inputs like "DKLMV" when they have already
figured out that the solution has the form "S-H-A-?-E". Since the
words on this longer list are acceptable inputs to Wordle, I really
ought to run analyses that incorporate them when I discuss solution
techniques. However, it is my own preference never to use these words
(and more generally I dislike using "words" that I don't recognize as
words!) so *almost* without exception I will ignore this longer list:
the 12972-word "input list" which is the union of the 2315-word
solution wordlist and this longer set of second-rate words.
[UPDATE: In summer 2022, the NYT enlarged the set of input words;
there are now 14,853 accepted inputs to Wordle. But again, we will
mostly ignore this longest list.]
Next, it is important to recognize that the "official" word list has
changed and is subject to future change. This is the result of the
purchase of Wordle by the New York Times. While their intent was not
to change the gameplay, they did decide to tinker with the wordlist.
The only changes (so far) of which I am aware were to remove a few
words that the Times considered unfamiliar or offensive:
agora, fibre, lynch, pupal, slave, wench
(In addition, some of the words were taken out of their original position
in the sequence of answers. This does not affect our analyses since
we invariably treat the game as if the day's hidden word is selected
randomly. I believe the list of "moderated" words is this set of 22:
augur bobby butch eclat fanny fella fetus flack gaily harry hasty
hydro liege octal ombre payer sooth stalk unlit unset vomit widow )
[UPDATE: Starting in November 2022, the Times allows a "curator" to
select the hidden word each day. The chosen word is now often linked
in some way to the date. It is possible that the curator will some day
choose answer words that are not on the original list; I also expect
the curator to choose the most obscure words on the original list
only reluctantly.]
I have also enjoyed playing the compound Wordle variants that challenge the
player to guess words on several Wordle games at once. Again, the authors of
these games are free to choose their solution word lists (and input word
lists) as they see fit. Most of the games that I play have similar word
lists, but there are some differences I know of. Here is what I know for the
N-fold compound games.
- (N=2) Dordle's solution list is about 90% similar to that of Wordle.
I have enumerated the differences, but I guess the Dordle list
is fluid: during 2022 I observed "HAPAX" as a Dordle solution, and it
is not (any longer) on the Dordle answer list.
- (N=4) The word list for Quordle is nearly the same as Wordle's.
It seems "caput" has been replaced by "kaput" in Quordle (as it may
perhaps have been replaced in Wordle itself by now); in addition
thirty-eight words from Wordle's list are "blacklisted" by Quordle.
This difference will generally not affect our conclusions.
- (N=8) The Octordle list is a subset of Wordle's list: since February 2022
Octordle removed words from that original list and is now down to 2294 words;
moreover the author advised me that the list does change periodically.
- (N=16) The list for Sedecordle is the same as Wordle's except
for the deletion of both "gypsy" and "gipsy".
- (N=32) The word list for Duotrigordle is identical to Wordle's.
- (N=64) Sexaginta-quattuordle adds 373 words to Wordle's
list and removes 4, making it the longest of the word lists I have looked at.
(In particular, it adds quite a few past-tense words "ached", "acted", etc.)
Because of the differences, I cannot guarantee that statements that I
make about the compound games will apply to all of them equally
well. Apart from Dordle and 64ordle, the word sets I propose should
work as well or better for a compound game; for those two in
particular, I can only really say that the predictions I make about
the behaviours of these games do seem to be at least approximately
correct. Let me reiterate: I am making claims about one particular
wordlist; those claims may or may not be applicable to similar sets of
5-letter words in games.
Other word lists are of course possible. For instance, I have a set of
some 8000 words that is (or was) the official set of 5-letter words
that could be played in Scrabble. One may repeat the analyses that I
have done on the Wordle wordlist, for any of these variant lists, but
I have not done so. Just be aware that any claims about a set of words
being "best" are relative to the wordlist being considered.
==============================================================================
So in the rest of this document we will have a look at the 2,315 words
in the original Wordle wordlist. There are two points of view from which
to study this set of words: as sets of actual words (that is, ways of
talking about concepts), and as strings of letters. We'll do both.
So what words did Wardle include, or exclude, when making his list?
It's an idiosyncratic choice. For example it does not include "squid",
even though it contains "squad" --- and "squib"! I have some comments
about words that did get included, and some about the words excluded
from the wordlist.
There are some strange choices that I did not expect to see in the wordlist.
Be prepared to enter these some day as a Wordle answer even if they look wrong.
"Gazer" without star- ? "Willy" without -nilly? "Outgo" (not outgoing)?
I understand "ombre" as shade, "terra" for land, and "caput" for head, but are
they English words? (Did they mean "kaput", which isn't in the list? "Fritz" is.)
"Bleep" and "clack" and "clank" and "humph" and "splat" are sounds, but are they
really words? What part of speech are they? ("Boing" did not make the cut I guess!)
Kids use "scram", "stunk", and "slunk", as well as "snuck" for a past tense,
and say someone was a "goner", but I'm not sure they are adult words.
I think of "hydro", "hyper", "micro", "quasi", "super", and "ultra" as
being prefixes; Wordle accepts them as words (though not "multi".)
The wordlist includes some words of regional or specialized use, such
as "matey" and "caulk". Both "fiber" and "fibre" can be the word of the day,
but Brits might wonder what a "homer" is, and Yanks don't use "bobby".
"Fella" is dialect -- spoken, not written, right?
To me, "tepee", "gipsy", "gayly", and maybe "flier" are mis-spelled.
(But "forgo" *is* ok; it's not the same word as "forego"!)
And of course, each of us will see words in the list that we just don't know.
For me personally I thought most of the words that made it to Wardle's list
were words that I recognized, but even so, I --- a veteran of word games ---
drew a blank trying to define, or use in a sentence, some of these:
boule covey crump debar droit dross eclat gawky harpy iliac junto ovate
ovine ralph savoy squib swash tatty thrum tonga tulle utile waxen whelp
Your own personal list might include others; for example I have heard complaints
about "rupee", "swill", "agora", "cavil", and "ennui". Here's an article about
the Wordles that people found most difficult.
Conversely, it's important for game play to know what's NOT in the wordlist,
so you don't waste a guess. I have already linked to the lists of hundreds of
words that Dordle and 64ordle thought were perfectly reasonable (and I mostly
agree!) Even beyond all of those, there are words that I have been tempted to
play in Wordle because they look completely ordinary to me; alas, Wardle
did not include them in the answer wordlist:
addle amino balky busty cocky ducky eider eland folic grift liane liter
liven miter muggy pacer pinup rondo roper sedum snafu taser thine thrip
None of the elements argon, boron, radon, nor xenon is in the wordlist: so that's
a consistent pattern. But usually, in almost every category, the word list
includes some words from that category and omits others --- foods, plants, ethnic
groups; contemporary words and outdated words; verbs and nouns; etc. Examples:
"Sloop" and "skiff" are Wordle words; "ketch" is not. "Nylon"-yes, "rayon"-yes, "latex"-no.
"Carat" is in the wordlist, "karat" and "caret" are not (nor is "carrot"!).
"Penny" and "pound" of course; but why "rupee" but not "ruble" nor "franc"?
"Rhino" and "hippo" are OK but "chimp" is too short?
"Torah" and "koran" are absent even though "bible" is in the list (and so is "mecca").
Both "patsy" and "pasty" are included, as is "gusty" -- but not "gutsy".
Also included are "dumpy", "jumpy", and "lumpy", but not "bumpy";
"woozy", "loopy", and "goofy" yes but "doozy" no.
I guess the "skort" has gone out of fashion?
Wordle will expect you to guess "liken" but not "liven", "wight" but not "bight",
"axion" and "quark" but not "boson", "kneed" but not "egged".
Generally the wordlist does not shy away from "politically incorrect" words
("hussy", "lynch") and it generally includes words for squeamish ideas
("fanny", "fecal", "semen"), though it chooses not to include a few others
("feces", "naked", "penis"). On the other hand, particularly rude
words and slang are definitely not in the wordlist (bitch, boner,
fagot, whore). I ran a comparison to an online "bad word" list to see
which were or were not in the Wordle wordlist, and found plenty of each.
English happily absorbs words from other languages, especially to describe
things like the foods of other cultures, so inevitably the Wordle list
includes many words that might be considered foreign-language words. Of
course most English words trace their origins to Greek and Latin roots, but
in some cases that English word is unchanged from the original, hence
Wordle words AGAPE and GAMMA, UMBRA and TERRA. Many words are used in
the original French form: BUTTE, ECLAT, ENNUI, PUREE. We have words from
German (BLITZ), Dutch (HOIST), Norse (FJORD), Scots (TWEED), Irish (PHONY),
Scots Gaelic (CAIRN), Spanish (JUNTA), Portuguese (CASTE), Italian (PESTO),
Czech (ROBOT), Polish/Russian (VODKA), Yiddish (BAGEL), Hebrew (RABBI),
Persian (LILAC), Arabic (NADIR), Turkish (KEBAB), Sanskrit (KARMA), Hindi (KHAKI),
Tamil (CURRY), Chinese (RAMEN), Japanese (SUSHI), Indonesian/Malay (GECKO),
Australian aboriginal (KOALA), Tongan (TABOO), Bantu (BANJO), Swahili (JUMBO),
W. African (BONGO), Inuit (IGLOO), Ojibwe (TOTEM), Lakota (TEPEE), Illinois (PECAN),
Choctaw (BAYOU), Arawak (GUAVA), Nahuatl (CACAO), Tupi (TAPIR).
Of course, a word's history can be long and tortuous. For example, CHESS
actually came to English from Russian, but the Russians got it from the
Persians. Other Wordle words have longer histories, e.g. look up the
etymology of CANDY or HORDE.
Then there is the question of the multiple forms of a root word.
For verbs, you will never see third-person singular forms (e.g. "talks"); not even "goeth", "lieth",
nor "doeth". There is one second-person singular form: "shalt". (But not "goest", "doest", "liest".)
Fifteen gerunds of 2- and 3-letter verbs are included ("being", "aging", "dying", etc.)
although "acing" and "axing" are not. (FWIW, there are 25 words that contain I,N,G
but do not end in -ING :
again align begin binge bingo deign dingo dingy feign fungi genie giant given
glint grain grind groin hinge ingot lingo login neigh night reign singe )
A 3-letter word could take "re-" or "un-" as a prefix. Only "fit" does both. I guess
typically "re-" is used with a present-tense verb and "un-" with a past tense, as in
the Wordle words "reuse" and "rerun", but "uncut", "undid", "unfed", "unlit", "unmet",
"unset", and "unwed". But "untie" and "unzip" are also in the list while "retie" and "rezip"
are not. (Neither "re-" nor "un-" combination appears with "buy", "dry", "fry", etc.)
There's also the de- prefix seen in Wordle words "debar", "debug", "detox"; calling out
other words like "debut" as being verbs with a de- prefix is a bit of an etymological stretch!
You mostly won't see the regular past tense of shorter verbs (e.g. "paced", "acted").
But there are exceptions: the words "bused", "clued", "freed", and "kneed" *are* in the
wordlist, and it looks like just about every *irregular* past form is, too. That includes
*the -ied past tenses of eight 3-letter verbs ending in -y, all of which are in the wordlist;
*the -t variant constructions are in the list (except "blest" is not in the wordlist):
built cleft crept dealt dwelt knelt leant leapt meant slept spelt spent spilt swept
*the -n past participle forms are accepted, including :
begun blown borne drawn eaten flown given grown known laden risen shown sworn taken woken woven
(The wordlist includes many *present* tense -en verbs like "widen" and "liken"; but "waken" is excluded.)
*the past tense forms that change the root vowel: at least, these examples are in the list:
arose awoke began begat bound broke chose clung drank drove* drunk froze
flung found heard shone shook slung smote spoke* stank stole stood stuck
stung stunk swore swung threw undid wound wrote wrung (*but not "spake" nor "drave").
For nouns, almost without fail, regular plurals are omitted: no "ashes" nor
"boats" nor "lives" nor "skies". Indeed, there is not a single word ending
in -es ! (And not many -s words at all). I do see "cacti", "fungi", and "radii",
though not "genii" nor "celli". Both "goose" and "geese" are in the list,
as well as "moose" but not ... sorry. "Media" and "opera" are in the list,
while "quora" is not, but perhaps those are not usually seen as plural nouns.
Other irregular plurals I know to be in the list include "algae", "pence",
"teeth", "women", and I suppose "those". (Strictly speaking, words like
"sheep" can also be plural.)
On the other hand some irregular plurals are not in the list; arguably
these are really foreign words and perhaps uncommon even in the singular:
beaux cilia folia labia phyla sacra styli uteri vacua vitae
It's not surprising that "oases" is not in the list, since "oasis" isn't either!
Agents formed from a verb are hit or miss: "voter", "giver" and "taker",
(and "tamer" -- as in lions?), as well as the odd-sounding "wooer" and
"aider" are there; "pacer" and "oiler" are not. "Skier" yes, "caver" no.
"Dryer" and "flyer" (and "plier"!) yes, but not "cryer", nor "fryer".
The partial agent word *ater could be "eater" or "hater" but not "rater".
{e,f,i,l,r} could be "flier" or "filer" (or "rifle"!) but not "lifer".
Proper nouns and adjectives are excluded from the Wordle solution list:
there's no Adams nor India. But trump and china are there --- they're common
nouns used in the context of card-playing and table-setting respectively.
Likewise, many other words in the list sound like proper nouns but have a
common-noun usage too: frank, stein, peter, billy, patty, smith. (A few of these
are kind of circular reasoning: the teddy bear was named after Teddy Roosevelt,
the ascot tie is named after a place, and I imagine "welch" is actually an
ethnic slur.) Look it up: even "tonga" is a common noun. (I didn't know that.)
("Gauss", "hertz", and "curie" are common nouns in science, but they're not Wordle words.)
"Druid" is a Wordle word, but in practice maybe isn't treated as a proper noun.
"Mecca" is also on the list, and used lower-case to mean "a destination like Mecca".
The only exceptions I have seen to this whole pattern are "Welsh" and "Dutch":
whether before "door", "treat", or "courage", "Dutch" is regularly capitalized,
and I guess there is some kind of reference to the Netherlands and its people.
Ditto with "Welsh".
Adjectives that amount to past-tense verbs are also rare: "abled" is
included, while "tired", "faded", "timed", "bared" are not. A large set
of comparatives ("bluer", "riper", "surer", "truer", "saner", "freer",
"wider") is included (perhaps "tamer" belongs in *this* list?); some
sound strange to my ears. Yet some others ("cuter", "icier", "lamer",
"abler") are not in the wordlist, even though I personally am more
likely to use the word "cuter" than "saner"! Wordle allows you to be
"gayer" (and "drier") but not "shyer" (nor "shier"). I don't think
English even has any 5-letter superlatives to think about!
Every adverb that I expected seems to be included in the wordlist
(formed -- in different ways -- from adjectives like apt, coy, icy, dull,
and noble) although conversely I think most of the -ly words in the list
are not adverbs; many are adjectives formed from nouns (like manly, girly,
and wooly) but many are not (filly, imply, jolly, lowly, reply, surly).
==============================================================================
A completely separate way to analyze the wordlist is to think of the words as
just strings of letters. I've run many tabulations about the words this way;
they can be helpful when trying to reconstruct a word from the hints that accrue
during game play.
For example, it's useful to know that the most common letters are e,a,r,o,t, while
j,q,x,z are rare. Here are the counts of the letters' usages; we can choose to
count the words containing the letters, or the total number of occurrences of
the letters (i.e. counting repeated letters in a word multiple times, or not).
We can also see which letters are most likely to be repeated.
Counting words    Counting repeats    Repeats
e, 1056           e, 1233             e, 177
a, 909            a, 979              o, 81
r, 837            r, 899              l, 71
o, 673            o, 754              a, 70
t, 667            t, 729              r, 62
l, 648            l, 719              t, 62
i, 647            i, 671              s, 51
s, 618            s, 669              c, 29
n, 550            n, 575              n, 25
u, 457            c, 477              i, 24
c, 448            u, 467              d, 23
y, 417            y, 425              f, 23
h, 379            d, 393              p, 21
d, 370            h, 389              m, 18
p, 346            p, 367              b, 14
g, 300            m, 316              g, 11
m, 298            g, 311              h, 10
b, 267            b, 281              u, 10
f, 207            f, 230              k, 8
k, 202            k, 210              y, 8
w, 194            w, 195              z, 5   (dizzy fizzy fuzzy jazzy pizza)
v, 149            v, 153              v, 4   (savvy valve verve vivid)
x, 37             z, 40               w, 1   (widow)
z, 35             x, 37
q, 29             q, 29               ( q,x,j never repeat )
j, 27             j, 27
So the frequency rankings of the letters in the first two columns are
identical except for the swapping of four pairs (u/c, h/d, g/m, x/z)
in which the second of the pair is more likely to be repeated in a word.
(Note that the numbers in the first and last columns add to the number in the
middle column, e.g. 1056 + 177 = 1233 for "e". This includes counting e.g.
"mamma" as one word containing three "m"s, counting this as TWO repeats of "m".)
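A tally of this kind can be reproduced in a few lines. The following is a
minimal Python sketch, not the program used to build the table above; the
function name letter_tallies and the four-word sample list are assumptions
made for illustration, and the real wordlist would have to be substituted to
recover the actual counts.

```python
from collections import Counter

def letter_tallies(words):
    """Return (words_containing, total_occurrences, repeats) as Counters."""
    containing = Counter()   # number of words that contain each letter
    total = Counter()        # number of occurrences, repeats included
    for word in words:
        total.update(word)            # every letter, every time it appears
        containing.update(set(word))  # every letter, once per word
    # Counter subtraction drops zero entries, so only repeated letters remain.
    repeats = total - containing
    return containing, total, repeats

# Stand-in sample (an assumption); use the real wordlist for the real table.
sample = ["mamma", "widow", "pizza", "crane"]
containing, total, repeats = letter_tallies(sample)
print(containing["m"], total["m"], repeats["m"])  # 1 3 2
```

On the sample, "m" occurs in one word, three times in all, and so counts as
two repeats, which is the convention described for "mamma".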
The way the letters are distributed among the words also varies. There
are 5x26=130 potential combinations of "letter xxx in position yyy".
Five of these never occur: there are no words that start with x, none
that end with j,q, or v, and none with a q in position 4.
The only word with a q in position 3 is "pique", and the only word
that ends with u is "bayou"(*). Other rare combinations are
j4: banjo ninja
z2: azure ozone
j2: eject fjord
y4: polyp satyr vinyl
x4: epoxy proxy twixt
j3: enjoy major rajah
z1: zebra zesty zonal
z5: blitz fritz topaz waltz
The list continues with q2 (in 5 words), y1 (in 6), x5 (8), f2 (8), h3 (9), k2 (10).
The most common combinations are e5 (424 words), s1 (366), y5 (364), e4 (318),
a3 (307), and a2 (304). All other combinations occur in between 11 and 279
words. (Median: 61; average: about 93).
[(*) On 2023-04-09, the hidden word was SNAFU, the second word not on the original
Wardle list to become a hidden word.]
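The "letter in position" counts are easy to recompute for any wordlist.
Here is a minimal Python sketch under stated assumptions: the function names
position_counts and words_with are hypothetical, positions are numbered 1 to 5
as in the text, and the five-word sample merely stands in for the real list.

```python
from collections import Counter

def position_counts(words):
    """Count each (letter, position) combination, positions numbered from 1."""
    counts = Counter()
    for word in words:
        for pos, letter in enumerate(word, start=1):
            counts[(letter, pos)] += 1
    return counts

def words_with(words, letter, pos):
    """List the words that have `letter` in 1-based position `pos`."""
    return [w for w in words if w[pos - 1] == letter]

# Stand-in sample (an assumption), chosen from the rare cases quoted above.
sample = ["banjo", "ninja", "pique", "bayou", "zebra"]
counts = position_counts(sample)
print(counts[("j", 4)], words_with(sample, "j", 4))
print(counts[("x", 1)])  # a combination that never occurs counts as 0
```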
These numbers can help find pairs of words that are hard to distinguish.
For example, there are only 17 w5s and only 11 b5s; that means there are
only 28 words that can distinguish "throw" from "throb" on the basis of
their response in the last column. Here are the next "closest" pairs:
28 = 17 w5 + 11 b5 thro*
28 = 16 b2 + 12 g2 a*ate
32 = 16 b2 + 16 s2 a*ide
32 = 29 y3 + 3 j3 ma*or
33 = 23 y2 + 10 k2 e*ing
35 = 20 d2 + 15 v2 e*ict
38 = 26 w3 + 12 k3 po*er
Some of these small sets of testwords can be used on multiple word pairs.
For example there are the 49 v3s and 26 w3s that together are the only 75 words
that in this way can distinguish *seven* pairs
co*er fe*er lo*er mo*er ne*er ro*er se*er
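To see which test words can separate such a pair on the basis of that one
column, one could filter the wordlist as follows. This is only a sketch: the
function name separators and the six-word sample are assumptions, and the
function insists that the two words differ in exactly one position.

```python
def separators(words, first, second):
    """Words that carry one of the two competing letters in the differing column.

    Assumes `first` and `second` differ in exactly one position.
    """
    diffs = [i for i, (a, b) in enumerate(zip(first, second)) if a != b]
    if len(diffs) != 1:
        raise ValueError("words must differ in exactly one position")
    i = diffs[0]
    return [w for w in words if w[i] in (first[i], second[i])]

# Stand-in sample (an assumption); the real counts need the real wordlist.
sample = ["throw", "throb", "allow", "climb", "crane", "elbow"]
print(separators(sample, "throw", "throb"))
```

The length of the returned list is the figure in the left-hand column of the
table above (17 + 11 = 28 for thro* on the full wordlist).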
It is possible to have two sets of words that have identical collections
of {letter, position} pairs: an example is POINT+CLUED and COUNT+PLIED .
We will call such pairs of wordsets "perfect anagrams" of each other.
Since each pair has no repeated letters, that means that the collections
of hints (colored tiles) provided by each pair are identical, and so the
clusters of words the pairs create (that is, the sets of hidden words that
would yield the same colored tiles from the pair) are identical too.
Consequently, for example, starting Wordle with SWARM+POINT+CLUED will
give exactly the same statistics (e.g. percentage of times failed) as if
starting Wordle with SWARM + COUNT + PLIED ! (Similar, more obvious,
examples include SPIRE+BLOND vs SPORE+BLIND .) There are also interesting
instances with triples, e.g. SPORT+MANGE+CHILD and SPILT+MANGE+CHORD, or
even multi-sets like
{twang, slump, cried}, {tried, swung, clamp}, {tried, swamp, clung}, {tramp, swing, clued}
any two of which have the exact same results when used as a Wordle opening!
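Whether two wordsets are "perfect anagrams" in this sense can be checked
mechanically by comparing their collections of {letter, position} pairs. A
minimal Python sketch follows; the function names position_signature and
perfect_anagrams are hypothetical, chosen only for this illustration.

```python
from collections import Counter

def position_signature(wordset):
    """Multiset of (position, letter) pairs used by a collection of words."""
    return Counter((i, ch) for word in wordset for i, ch in enumerate(word))

def perfect_anagrams(set_a, set_b):
    """True if the two wordsets use exactly the same (position, letter) pairs."""
    return position_signature(set_a) == position_signature(set_b)

print(perfect_anagrams(["point", "clued"], ["count", "plied"]))  # True
print(perfect_anagrams(["spire", "blond"], ["spore", "blind"]))  # True
print(perfect_anagrams(["point", "clued"], ["spire", "blond"]))  # False
```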
Altogether there are 1566 words that consist of five distinct letters; the other
749 words have repeated letters. Thinking of these latter as "poker hands", most
of them (691) have only a single pair. But there are 38 words that show "two pairs":
allay amass array assay belle booby cacao civic cocoa femme
freer gamma kappa kayak level llama madam magma mimic minim
motto onion papal penne queue radar refer rotor salsa sense
shush slyly teeth tenet tooth tweet verve vivid
Also 19 are "three of a kind":
bobby daddy eerie emcee error fluff geese mammy melee mummy
nanny ninny poppy puppy rarer sassy sissy tatty tepee
And there is one "full house":
mamma
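The "poker hand" of a word is just the sorted list of its letter
multiplicities. A minimal Python sketch of that classification is below; the
names HAND_NAMES and hand are assumptions made for this example.

```python
from collections import Counter

# Shape of the letter multiplicities, largest first, mapped to a hand name.
HAND_NAMES = {
    (1, 1, 1, 1, 1): "no repeats",
    (2, 1, 1, 1): "one pair",
    (2, 2, 1): "two pairs",
    (3, 1, 1): "three of a kind",
    (3, 2): "full house",
}

def hand(word):
    """Classify a 5-letter word by its repeated letters."""
    shape = tuple(sorted(Counter(word).values(), reverse=True))
    return HAND_NAMES.get(shape, "other")

for w in ["crane", "apple", "allay", "bobby", "mamma"]:
    print(w, hand(w))
```

Counting the results of hand() over a wordlist with a Counter would give
totals of the kind quoted above.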
Some of these repeated-letter sets come from multiple words, and some are
contained in others. That lone 2-letter combination {a,m} is contained in one
three-of-a-kind word "mammy" and five two-pair words (amass, llama, madam,
plus gamma and magma (which use the same third letter, G)).
The 38+19=57 words that use three distinct letters involve only 52 different
3-letter sets because of these overlaps:
* [cacao, cocoa] are both 2-pair hands but they double different letters
* [freer, refer] and [gamma, magma] are 2-pair hands with the same doubled letters
* [assay, sassy] and [bobby, booby] match a 2-pair hand with a three-of-a-kind.
Each of the 52 three-letter sets is part of a four-letter set used in a Wordle
word except for the letters in FEMME and KAYAK. (Each of those 3-letter
sets is contained in the letters of multiple words, e.g. FRAME and GAWKY.)
The 691 words that simply have a repeated letter give rise to only 591
4-letter sets because some single-pair words use the same sets of letters,
including four sets of four words each:
* [erode, odder, order, rodeo] each doubles a different letter
* [ester, reset, steer, terse] each doubles the same letter E
* [asset, state, taste, tease] and [puree, purer, rupee, upper]
There are also nine sets of three words built from the same four letters:
[attic, cacti, tacit] [algae, eagle, legal] [eater, terra, treat]
[allot, atoll, total] [belie, bible, libel] [dried, drier, rider]
[ether, there, three] [serve, sever, verse] [strut, truss, trust]
And there are 70 more pairs of words that use the same four letters
([leper, repel], [sleep, spell], etc.) Of the 591 four-letter sets
used in Wordle words, most of them are also part of a five-letter
set (e.g. TRYST -> r,s,t,y -> RUSTY), but 137 are maximal, including
seven maximal four-letter sets that come from more than one word:
[batty, tabby] [fatty, taffy] [elegy, leggy] [foggy, goofy]
[loopy, polyp] [lowly, wooly] [snoop, spoon]
Finally, the 1566 words that consist of five distinct letters actually
give rise to only 1393 sets of 5 distinct letters, because as we will
see there can be as many as 4 Wordle words built from the same set
of 5 distinct letters (e.g. cater crate react trace).
Some combinations of letters are rare. For example, not only are q,x,z, and j
rare individually, they are extremely rare together: the only words containing
two of these are the five double-z words shown above (including "jazzy", which
has three!)
Next rarest is "v". Apart from the four double-v words shown above, the only
word adding v to q,x,z,j is "vixen".
Next rarest is "w". There is just one double-w word "widow". The only words
having w and some of q,x,z,j or v are
jewel twixt waltz waxen woozy vowel waive waver weave woven
Next rarest is "k" . The double-k words are
kayak khaki kinky kiosk knack knock skulk skunk
The words with k and one of q,x,z,j,v are
evoke jerky joker knave quack quake quark quick quirk vodka
and the ones with both k and w are
askew awake awoke gawky known tweak wacky whack whisk woken wrack wreak wreck
Next rarest is "f". The double-f words are
fifth fifty affix offal offer gaffe jiffy puffy taffy and 13 ...ff 's
The words with f and one of q,x,z,j,v are
fjord jiffy affix fixer fizzy fritz froze fuzzy favor fever
Both f and w :
awful dwarf fewer flown frown swift wafer wharf whiff
Both f and k : quite a few (38) --- see separate file.
The 29 q words only use {a, b, c, d, e, h, i, k, l, m, n, o, p, q, r, s, t, u, y};
never z,x,j,v,w, nor f nor g. Without exception, q is followed by u.
The only q words with double letters are
queen queer quell queue quill
Note that q is always in positions 1 or (as eq- or sq-) position 2, except for
pique.
Of the 35 z words, the only ones with a double letter are
amaze bezel booze boozy ozone plaza razor seize woozy
and the five --zz- words shown above.
The only occurrences of "h" that are not preceded by {c,s,t,p,w,g} either
have the h in front or in these 8 words:
abhor ahead khaki myrrh rajah rehab rhino rhyme
(This is useful information when H turns yellow after I enter a set of
words that includes e.g. {c,s,t,p,g} but not W ; I either have to start
with the H or else I know a W is missing. Moreover, the only WH pairs in
the Wordle solution list occur at the start of the word!)
The only occurrences of "k" that are not preceded by {c,s,n,r,l,o,a,e,i} either
have the k in front or in these words:
fluke gawky vodka
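Lists like the two above (for "h" and for "k") can be produced by one small
search. The following Python sketch is an illustration only: the function
name unusual_predecessor and the sample words are assumptions, and a letter
in the first position is skipped, as in the text.

```python
def unusual_predecessor(words, letter, usual):
    """Words where `letter` occurs after position 1 with an unusual letter before it."""
    found = []
    for word in words:
        for i in range(1, len(word)):
            if word[i] == letter and word[i - 1] not in usual:
                found.append(word)
                break  # report each word once
    return found

# Stand-in samples (assumptions); the real lists need the real wordlist.
h_sample = ["abhor", "chair", "hatch", "rhino", "whale", "myrrh", "light"]
k_sample = ["fluke", "vodka", "black", "knock", "gawky"]
print(unusual_predecessor(h_sample, "h", "cstpwg"))
print(unusual_predecessor(k_sample, "k", "csnrloaei"))
```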
Of the 417 words with a "y", 364 have a "y" in the final position; the other
positions are more rare. The only words with "y" in position 4 are
polyp satyr vinyl
(Note that y is used as a vowel here.) The only words with "y" in front are
yacht yearn yeast yield young youth
(where y is used as a consonant). There are 23 words with "y" in position 2;
in all cases this y is used as a vowel.
bylaw cyber cycle cynic dying eying gypsy hydro hyena hymen
hyper lying lymph lynch lyric myrrh nylon nymph pygmy synod
syrup tying vying
(note that "gypsy" and "pygmy" have another y at the end.)
Finally there are 29 words with "y" in the middle (including 6 with two "y"s);
I would argue this is a mix of vowel-y and consonant-y cases.
abyss bayou buyer coyly crypt dryer dryly flyer foyer gayer
gayly glyph idyll kayak layer loyal maybe mayor payee payer
rayon rhyme royal shyly slyly style thyme tryst wryly
I looked for the patterns of vowels, consonants, and "y" (treated separately).
There are 53 distinct patterns, including 20 different patterns involving "y"s.
Four patterns account for just about half of all words:
cvcvc, e.g. "coral" (391 times = 17% of all words)
ccvcc, e.g. "bring" (349 times = 15%)
cvccy, e.g. "manly" (219 times = 9%)
ccvvc, e.g. "speak" (192 times = 8%)
The next few are ccvcv, cvccv, cvvcc, vccvc, ...
At the other extreme are 16 patterns used only once or twice:
queue (the only word with 4 consecutive vowels)
angst (the only word with 4 consecutive consonants)
yacht (the other 5 words starting with "Y" are all yvvcc )
hyena
eying
maybe
gooey
audio, eerie (audio is the only word with four distinct vowels...)
aunty, early
abyss, idyll
bayou, payee (... unless you include "y", in which case bayou wins too.)
coyly, gayly
gypsy, pygmy
cycle, hydro
dryer, flyer
spray, stray
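The pattern of a word, with "y" treated as its own class, is simple to
compute. Here is a minimal Python sketch; the names cv_pattern and
pattern_counts are hypothetical, and lowercase input is assumed.

```python
from collections import Counter

def cv_pattern(word):
    """Map a,e,i,o,u to 'v', the letter y to 'y', and everything else to 'c'."""
    return "".join(
        "v" if ch in "aeiou" else "y" if ch == "y" else "c" for ch in word
    )

def pattern_counts(words):
    """Tally how many words share each pattern."""
    return Counter(cv_pattern(w) for w in words)

for w in ["coral", "bring", "manly", "speak", "queue", "angst", "yacht"]:
    print(w, cv_pattern(w))
```

Calling pattern_counts on a full wordlist and sorting by count would give
the most and least common patterns.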
Apart from "queue", the other words with 3 vowels in a row are:
gooey quail queen queer quiet wooer
There are quite a few strings of three consonants, e.g. "sixth", "aptly".
But at the front of a word only scr- shr- spl- spr- str- thr- appear, and
the only vcccv words are
alpha amble ample angle ankle apple extra intro ombre ultra umbra uncle
For the most part, a word that starts with a consonant must have a vowel next,
but there are many initial consonant blends of these forms:
*R: the first consonant must be one of bcdfgkptw
*L: after bcfgps
*H: after cpstw
S*: before ckmnpqtw
Partially following those patterns are the atypical "llama", "ghost", "ghoul",
"khaki", "krill", "rhino", and "rhyme". There are also 9 words starting with TW
and 10 starting with KN, plus these oddballs:
dwarf, dwell, dwelt, fjord, gnash, gnome, psalm
The initial strings only continue to a third consonant with the triples
SCR-, SHR-, SPR-, STR-, THR-, and SPL- .
English has those old-fashioned double-vowel symbols "ae" and "oe"
(e.g. in "foetus" or "Aesop") but they are essentially absent here. In
fact, several vowel permutations are rare or missing altogether.
There are no words containing the permutations AA UU IY YU or YY.
The other permutations that occur in fewer than 10 words are
AE: algae
II: radii
IU: opium
UY: buyer
AO: aorta, cacao, chaos
EO: cameo, rodeo, video
EU: deuce, queue, reuse
UO: quota, quote, quoth
OE: canoe, gooey, poesy, wooer
YA: kayak, loyal, royal, yacht
YO: bayou, mayor, rayon, young, youth
YI: dying, eying, lying, tying, vying, yield
Someone on Reddit posted a table of the numbers of words with all 2^6 = 64 vowel combinations:
       ---    E    O    U   EO   EU   OU  EOU
---      0  167  124   87  154   88   69   14
A      166  277  106   45   16   19    4
I      125  174   64   36   11   17    3
Y       13   45   62   58   14    2    4
AY      88   28   14    3
AI      97   22    9    3
IY      61    8    3    3
AIY     10
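A table of this kind can be rebuilt by recording, for each word, which of
the six letters a,e,i,o,u,y it contains. The Python sketch below shows one
way to do that; it is not the Reddit poster's code, and the names vowel_set
and vowel_table and the sample words are assumptions.

```python
from collections import Counter

def vowel_set(word):
    """The distinct vowels (y included) present in a word, in a fixed order."""
    return "".join(v for v in "aeiouy" if v in word)

def vowel_table(words):
    """Tally the words by the set of vowels they contain."""
    return Counter(vowel_set(w) for w in words)

# Stand-in sample (an assumption); use the real wordlist for real counts.
sample = ["audio", "bayou", "crypt", "queue", "angst", "eerie"]
print(vowel_table(sample))
```

Each key corresponds to one cell of the 8-by-8 table: the letters among
a,i,y give the row and the letters among e,o,u give the column.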
Up to 4 words may be made from the same set of 5 letters:
cater crate react trace
glare lager large regal
leapt petal plate pleat
least slate stale steal
resin rinse risen siren
ester reset steer terse (which uses only 4 *different* letters)
The same set of letters (ignoring how many times each is used) can also form as many as four words,
as occurs in (only) these cases:
erode odder order rodeo
asset state taste tease
puree purer rupee upper
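Both kinds of grouping are easy to compute by giving each word a key and collecting the words that share it. The following is a sketch with made-up names (group_by, multiset_key, set_key), assuming the wordlist is a list of lowercase strings.

```python
from collections import defaultdict

def group_by(words, key):
    """Collect words sharing the same key; keep only groups with more than one word."""
    groups = defaultdict(list)
    for w in words:
        groups[key(w)].append(w)
    return [sorted(g) for g in groups.values() if len(g) > 1]

def multiset_key(word):
    """Same letters with the same counts (true anagrams)."""
    return "".join(sorted(word))

def set_key(word):
    """Same letters, counts ignored."""
    return "".join(sorted(set(word)))

sample = ["cater", "crate", "react", "trace", "erode", "odder", "order", "rodeo"]
anagram_groups = group_by(sample, multiset_key)
letter_set_groups = group_by(sample, set_key)
print(anagram_groups)
print(letter_set_groups)
```

With multiset_key only the cater group survives in this sample; with set_key the erode group appears as well, since those four words use d, e, o, r in different quantities.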
(In our discussion of "word similarity" in a separate file, we will look
at the 54 pairs of words that have the same five letters with none in the
same location. This includes some words with two "neighbors":
angel-glean-angle
ester-terse-steer
trace-cater-react
tuber-brute-rebut plus the longer chain
lager-glare-regal-large and the closed loop
trove-overt-voter(-trove)
so the 54 pairs actually involve only 99 different words.)
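The condition "same five letters with none in the same location" can be stated in a couple of lines. This is a sketch under the assumption that "same five letters" means the same letters with the same counts; the function names are hypothetical.

```python
from itertools import combinations

def is_deranged_anagram(a, b):
    """Same multiset of letters, and no position holds the same letter in both words."""
    return sorted(a) == sorted(b) and all(x != y for x, y in zip(a, b))

def deranged_pairs(words):
    """All unordered pairs of words that are anagrams with no letter in a shared slot."""
    return [(a, b) for a, b in combinations(sorted(words), 2)
            if is_deranged_anagram(a, b)]

loop = deranged_pairs(["trove", "overt", "voter"])
print(loop)
```

On the three words of the closed loop this returns all three possible pairs, while "angel" and "angle" fail the test because they agree in their first three letters.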
The 2315 words are made from 2037 different sets of letters. There is one set
{m,a} of cardinality 2, from which we can (only) make "mamma". There are 52 sets of
cardinality 3, including some from which we can make two words:
{a, s, y}, assay, sassy
{b, o, y}, bobby, booby
{a, c, o}, cacao, cocoa
{e, f, r}, freer, refer
{a, g, m}, gamma, magma
There are 591 sets of cardinality 4, from which we can make as many as 4 words,
as noted in the four cases above. (There are also 9 cases when the same four letters
can make 3 different words, namely the letter-sets in these words:
algae, allot, attic, belie, dried, eater, ether, serve, strut )
We have already noted that the 1393 sets of 5 distinct letters can form
as many as four words each.
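The breakdown of letter sets by cardinality can be recomputed with a few lines. Here is a minimal sketch, assuming the wordlist is a list of lowercase strings; letter_set_sizes is an invented name.

```python
from collections import Counter

def letter_set_sizes(words):
    """Count the distinct letter sets according to how many different letters they hold."""
    letter_sets = {frozenset(w) for w in words}
    return Counter(len(s) for s in letter_sets)

# "mamma" gives a 2-letter set; assay/sassy and gamma/magma share 3-letter sets.
sizes = letter_set_sizes(["mamma", "assay", "sassy", "gamma", "magma", "coast", "ascot"])
print(sizes)
```

Applied to the full wordlist, the four counts should add up to the total number of distinct letter sets.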
I did not do a search on all sets of 5 letters, but I checked all the 5-letter
sets that make a Wordle word, and all the quintuples of the 16 most common letters,
and found sets whose letters alone suffice to spell a dozen or more words in the
wordlist (the capitalized word shows the five letters in question):
16: asset, easel, elate, latte, lease, LEAST, salsa, slate, sleet, stale, stall,
state, steal, steel, taste, tease
15: asset, eater, erase, ester, rarer, reset, STARE, start, state, steer, taste,
tease, terra, terse, treat
13: EARTH, eater, ether, hater, heart, heath, rarer, teeth, terra, there,
theta, three, treat
12: carat, CATER, crate, eater, erect, racer, rarer, react, terra, trace, tract, treat
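A search of this kind might look like the sketch below. It assumes that a word "can be made" from a letter set when every letter of the word belongs to the set (repeats allowed), which is what the lists above suggest; the function names and the tiny sample are illustrative only.

```python
from itertools import combinations

def words_within(letters, words):
    """Words that use no letters outside the given set (repeats allowed)."""
    allowed = set(letters)
    return sorted(w for w in words if set(w) <= allowed)

def best_letter_sets(words, alphabet, size=5, top=3):
    """Try every `size`-letter subset of `alphabet`; return the `top` best (count, letters)."""
    word_sets = [frozenset(w) for w in words]
    scored = []
    for combo in combinations(sorted(alphabet), size):
        allowed = set(combo)
        n = sum(1 for s in word_sets if s <= allowed)
        scored.append((n, "".join(combo)))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:top]

sample = ["asset", "easel", "stall", "heart", "slate"]
print(words_within("least", sample))
print(best_letter_sets(sample, "aelsthr", top=1))
```

With a 16-letter alphabet there are only 4368 quintuples to try, so the brute-force loop is quick even on the full wordlist.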
There are cases in which a Wordle game can look almost complete but be
far from finished! I mentioned above the example of "S-H-A-?-E" that
occurred on a recent day. We can look for large sets of words that all
have the same four letters in the same places. The largest such sets are
9 .ight eflmnrstw note repeat t (i.e. "tight" is one of the 9 words)
8 .ound bfhmprsw
7 .ower clmprst repeat r
7 .atch bchlmpw repeat c,h
7 sha.e dklmprv
6 .illy bdfhsw
6 .aunt dghjtv repeat t
6 .atty bcfprt repeat t
6 .aste bchptw repeat t
6 s.ore chnptw
6 sta.e gklrtv repeat t
6 gra.e cdptvz
Then there are 18 sets of 5:
.ying .usty .unch .ully .rown .ried .over .ough .olly
.each .ater .andy s.out s.oop s.are s.ack sto.e spi.e
There are 66 sets of size 4, and many sets of size 3 and smaller. Obviously a
player should keep this in mind and make sure to choose early some test words that
use up a fair portion of these interchangeable missing letters!
Sets of words sharing 3 letters can of course be larger; the largest is the set of
29 words following the pattern .a.er :
baker baler caper cater eager eater gamer gayer gazer hater lager
later layer maker paler paper parer payer racer rarer safer saner
taker tamer taper wafer wager water waver
Then there are 25 .o.er
boxer corer cover cower foyer goner homer hover joker loser lover
lower mover mower poker poser power roger rover rower sober sower
tower voter wooer
And 24 each of .i.er and s.a.e :
aider cider diner diver fiber filer finer fixer giver liner liver
miner miser nicer piper rider riper riser river tiger timer viper
wider wiser
scale scare shade shake shale shame shape share shave skate slate
slave snake snare space spade spare stage stake stale stare state
stave suave
The next most common patterns are ..ing (23), ..lly (22), .ra.e (20), then
sta.. ..tch s.o.e sha.. .ri.e s.i.e ..ter sto.. s.or. s.ar. .ro.e .la.e
each match 15-19 words.
Of course with only (1 or) 2 letters fixed, the number of possibilities grows!
The largest set is ..a.e which matches 84 words, then s.a.. and .a..y with 78 each,
followed by s...e (75), ..i.e (72), s.o.. (62), ..o.e (62), etc.
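All of the pattern counts in this part (four, three, or two fixed letters) come from the same computation: blank out some slots of every word and group the words by what remains. A minimal sketch, with the hypothetical name pattern_groups and a "." marking each blanked slot:

```python
from collections import defaultdict
from itertools import combinations

def pattern_groups(words, n_blanks=1):
    """Group words by every pattern obtained by replacing n_blanks slots with '.'.

    Returns (pattern, words) pairs, largest group first.
    """
    groups = defaultdict(list)
    for w in words:
        for blanks in combinations(range(len(w)), n_blanks):
            pattern = "".join("." if i in blanks else ch for i, ch in enumerate(w))
            groups[pattern].append(w)
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

sample = ["shade", "shake", "shale", "shame", "shape", "share", "shave", "coast"]
top_pattern, top_words = pattern_groups(sample, 1)[0]
print(top_pattern, top_words)
```

Each word contributes to 5 patterns with one blank, 10 with two blanks and 10 with three, so the whole table is small enough to build in memory.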
Apart from looking simply at words that share some of the letters (in particular slots),
we can look more broadly at words that seem "somehow similar" to a given word. I have a
separate file on similarity in which I consider this idea more thoroughly and use it in my analyses.
I make (arbitrary but clear) definitions of what it means for words to be "neighbors",
based on how many yellow and green tiles each would show if guessed against the other
word as the hidden word-of-the-day.
For starters, for each word we could consider the other words against which it
measures either 4 greens, or 3 greens and 2 yellows. It turns out that "share"
is the unique word with as many as 15 of these (immediate) neighbors:
scare, shade, shake, shale, shame, shape, shard, shark,
sharp, shave, shire, shore, snare, spare, stare
(In this case, all the neighbors are obtained by changing just one letter
of "share", as opposed to swapping two of them.)
No word has 14 neighbors; "store" (alone) has 13 (again, all are single-letter
changes). The number of words with n neighbors then continues as follows (n, then the count):
12, 3 (shore; stare; and stale, with a "permutation" neighbor "slate")
11, 1 (shale)
10, 3 (shave,stack,patty)
9, 12 (including spine, with neighbor snipe; and stake, with neighbor skate)
8, 43
7, 53
6, 92
5, 147
4, 203
3, 316
2, 401
1, 487
0, 552, that is, there are 552 words with no neighbors at all. Among
them I might highlight words like affix, femme, jazzy, kayak, khaki, pizza
which rarely even give 3 colored tiles against any other word.
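The neighbor test described above can be written down directly from the tile colors. This is a sketch only: it assumes the usual Wordle coloring rule (greens first, then yellows limited by the remaining letter counts), and the names score and is_neighbor are invented here.

```python
from collections import Counter

def score(guess, answer):
    """Return one character per letter: G (green), Y (yellow) or - (gray)."""
    result = ["-"] * len(guess)
    unmatched = Counter()
    # First pass: mark greens and remember the answer letters not yet accounted for.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            unmatched[a] += 1
    # Second pass: a non-green letter is yellow if the answer still has a spare copy.
    for i, g in enumerate(guess):
        if result[i] != "G" and unmatched[g] > 0:
            result[i] = "Y"
            unmatched[g] -= 1
    return "".join(result)

def is_neighbor(a, b):
    """True for 4 greens (one letter changed) or 3 greens + 2 yellows (two letters swapped)."""
    s = score(a, b)
    return a != b and (s.count("G"), s.count("Y")) in {(4, 0), (3, 2)}

print(score("share", "shade"), score("stale", "slate"))
```

Both conditions are symmetric in the two words, so it does not matter which one plays the role of the guess.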
We can then include neighbors of neighbors (etc.); the most neighborly is
again "share" (with 56 neighbors), then "stare" (54) and "shore" (53) and other
words of the form "s{chpt}{ao}{lr}e" and related types; nothing really
different shows up until "clack" (34) and "lover" (33). But note that this
might not really be what we want to measure: isolates will stay isolates
even if we iterate (3-step neighbors, etc.). After all, the notion of
"neighbors" in the previous paragraph considers only pairs of words
obtained from each other by either a single-letter substitution or a
permutation (a swap of two letters). Passing to "neighbors of neighbors"
can enlarge a neighborhood that is already nonempty, but
instead maybe we want to enlarge the collection of "neighbors" by allowing
words that change out *two* or more letters at once. We will address this
in the separate file regarding similarity between pairs of words.
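Iterating a neighbor relation is a small breadth-first search. The sketch below takes the relation as a parameter and, for the demonstration, uses a simplified stand-in (exactly one differing letter) rather than the tile-based definition. Whether the counts quoted above include the immediate neighbors is not stated, so this version simply returns every word reachable within the given number of steps, excluding the starting word. All names are hypothetical.

```python
def neighborhood(word, words, is_neighbor, steps=2):
    """Words reachable from `word` in at most `steps` neighbor moves (excluding `word`)."""
    seen = {word}
    frontier = {word}
    for _ in range(steps):
        # Everything not yet seen that is a neighbor of something found in the last round.
        frontier = {w for w in words if w not in seen
                    and any(is_neighbor(w, f) for f in frontier)}
        seen |= frontier
    return seen - {word}

def one_letter_apart(a, b):
    """Stand-in relation for the demo: the words differ in exactly one position."""
    return sum(x != y for x, y in zip(a, b)) == 1

sample = ["share", "shore", "store", "stone", "kayak"]
print(neighborhood("share", sample, one_letter_apart, steps=2))
print(neighborhood("kayak", sample, one_letter_apart, steps=5))
```

The second call illustrates the point made above: a word with no neighbors stays isolated no matter how many steps are allowed.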
Oh, and one more tally for the fat-fingered typists out there (like me!).
There are 373 pairs of words that differ in only one letter, where the two differing
letters are adjacent on the Wordle keyboard, e.g. WRING/WRONG.
Watch carefully as you type, or check your typing before you hit "enter"!
At worst, just a few Wordle words can be turned into as many as 4 other words
in the Wordle list by a single such typo:
dilly --> dolly, dully, filly, silly
silly --> dilly, silky, sully, willy
wager --> eager, wafer, water, waver
(There are 562 more such pairs that can be transformed into each other
using two such typos, so if you're especially fat-fingered, be extra careful!
This time around the worst offenders are words that can be mistyped into SIX
other words if we have exactly two errors:
fiber --> diner, diver, fibre, giver, river, tuber
sling --> along, aping, doing, eking, slimy, spiny
sting --> aging, dying, eying, stony, stunt, wring
and words that can be mistyped into EIGHT other words by using at most two errors:
dilly --> dolly, dully, filly, silly; folly, fully, silky, sully
sting --> stint, stung; aging, dying, eying, stony, stunt, wring )
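A typo count of this kind needs a model of which keys are adjacent. The sketch below is built on an assumption, not on the definition actually used for the numbers above: the three letter rows are laid out as on the on-screen keyboard (second row shifted half a key, third row lined up under S through J), and two keys are called adjacent if they are in the same or neighboring rows and at most one key-width apart horizontally. All names are made up.

```python
from itertools import combinations

# (row letters, horizontal position of the first key's center); layout is an assumption.
ROWS = [("qwertyuiop", 0.5), ("asdfghjkl", 1.0), ("zxcvbnm", 2.0)]
POS = {ch: (r, off + i) for r, (keys, off) in enumerate(ROWS) for i, ch in enumerate(keys)}

def adjacent(a, b):
    """True if the two keys touch under the assumed layout."""
    (ra, xa), (rb, xb) = POS[a], POS[b]
    return a != b and abs(ra - rb) <= 1 and abs(xa - xb) <= 1

def typo_distance(w1, w2):
    """Number of differing slots if every difference is an adjacent-key slip, else None."""
    diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
    if all(adjacent(a, b) for a, b in diffs):
        return len(diffs)
    return None

def typo_pairs(words, n_typos=1):
    """Unordered pairs of words separated by exactly n_typos adjacent-key slips."""
    return [(a, b) for a, b in combinations(sorted(words), 2)
            if typo_distance(a, b) == n_typos]

print(typo_pairs(["dilly", "dolly", "silly", "water"]))
```

A different adjacency rule (for instance, one that ignores diagonal neighbors) would change the totals, so the counts it produces need not match the 373 and 562 quoted above.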