8.4 billion nonwords generated in landmark study of uselessness

October, 2024

A research team affiliated with The Pataphysical Society of New York has successfully generated the largest ever list of English language nonwords (i.e. imaginary words) using a specialized computer application developed by the Society as part of an inquiry into the nature of uselessness.

The team was able to identify and catalog an astounding number of nonwords — 8,352,712,501 — and analyze them using various techniques.

Finally, the team developed a novel method for calculating and expressing the information potential, entropy, and uselessness of a given system, and benchmarked the results against the nonword dataset, as well as several large, classic texts, as a first slate of experiments.

Today we are publishing their initial findings and releasing large, new datasets of English nonwords, along with the source code of the nonword generator, for fellow researchers, pataphysicians, and academics to use freely. (GitHub link)

This historic project advances our understanding of uselessness as a property of large datasets, adding new analytic tools, vocabulary and lexicographic specimens to the expanding field of pataphysical research.

Methods

We developed a C++ application to systematically find nonwords based on a given dictionary. We began with a high quality open source English language word list of 370,081 words and fed it as an input to the application, searching for incrementally longer combinations of letters to generate exhaustive lists of nonwords from 1 up to 7 letters in length. (The program can be set to output nonword lists of various lengths, as standalone files or as a single combined file.)
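The enumeration described above can be illustrated with a short Python sketch. This is not the Society's C++ application; it is a minimal, single-threaded approximation that assumes a lowercase 26-letter alphabet, and the function name `generate_nonwords` and the toy dictionary are hypothetical.

```python
from itertools import product
from string import ascii_lowercase


def generate_nonwords(real_words, length):
    """Yield every lowercase letter string of `length` that is not a real word."""
    # Only real words of the requested length can collide with candidates.
    real = {w.lower() for w in real_words if len(w) == length}
    for letters in product(ascii_lowercase, repeat=length):
        candidate = "".join(letters)
        if candidate not in real:
            yield candidate


# Tiny hypothetical dictionary, for illustration only.
toy_dictionary = ["a", "i", "o", "an", "at", "cat"]

one_letter = list(generate_nonwords(toy_dictionary, 1))
two_letter_count = sum(1 for _ in generate_nonwords(toy_dictionary, 2))
print(len(one_letter), two_letter_count)
```

Using a generator keeps memory use flat, which matters because the number of candidates grows by a factor of 26 with each added letter.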

The C++ application takes advantage of multi-threaded processing and advanced memory allocation techniques, which to our knowledge have never been employed to systematically explore the vast landscape of potential letter combinations. This allowed us to far exceed the previous attempts at the generation of nonword lists. We continued until the resulting file size became too large and unstable to use. The maximum length of the nonwords generated was 7 letters, though the resulting 64.25 GB file is prone to crashing. (The 6-letter nonword list, at 2.16 GB, is far more manageable, though the information loss is profound.) The resulting word lists were inspected for quality and analyzed using statistical methods.

A random sample of 3-letter nonwords from our dataset:

Analysis of nonword lists

In all, we extrapolated 8.352 billion (8,352,712,501) nonwords from the base list of 370,081 “real” words.
This compares with 8.353 billion (8,353,082,582) total possible combinations of letters from 1 to 7 letters, calculated as the sum of the geometric series: 26¹ + 26² + 26³ + ... + 26⁷.

We found that the percentage of real words out of the total possible combinations of 1- to 7-letter sequences is approximately 0.00443%. This means the overwhelming majority of letter combinations are nonwords.
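The counts quoted above can be checked in a few lines of Python. This sketch assumes a 26-letter alphabet and, as the published figures imply, subtracts the full 370,081-word list from the total; the constant names are our own.

```python
ALPHABET_SIZE = 26
MAX_LENGTH = 7
DICTIONARY_SIZE = 370_081  # size of the base word list

# Sum of the geometric series 26^1 + 26^2 + ... + 26^7.
total_combinations = sum(ALPHABET_SIZE ** k for k in range(1, MAX_LENGTH + 1))

# Closed form of the same series: a(r^n - 1) / (r - 1) with a = r = 26.
closed_form = ALPHABET_SIZE * (ALPHABET_SIZE ** MAX_LENGTH - 1) // (ALPHABET_SIZE - 1)

nonword_count = total_combinations - DICTIONARY_SIZE
real_word_percentage = 100 * DICTIONARY_SIZE / total_combinations

print(f"{total_combinations:,} combinations, {nonword_count:,} nonwords")
print(f"real words: {real_word_percentage:.5f}%")
```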

We observed these densities of real words by length:

1-letter: 3/26 ≈ 11.54% real words
2-letter: 427/676 ≈ 63.17% real words
3-letter: 2130/17576 ≈ 12.12% real words
4-letter: 7186/456976 ≈ 1.57% real words
5-letter: 15920/11881376 ≈ 0.13% real words
6-letter: 29874/308915776 ≈ 0.0097% real words
7-letter: 41998/8031810176 ≈ 0.00052% real words
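The densities above follow directly from dividing each real-word count by 26 raised to the word length. A brief Python sketch, using the counts listed above and a helper name of our own choosing:

```python
# Real-word counts by length, as listed above.
real_word_counts = {1: 3, 2: 427, 3: 2130, 4: 7186, 5: 15920, 6: 29874, 7: 41998}


def real_word_density(length, real_count, alphabet_size=26):
    """Fraction of all letter strings of `length` that are real words."""
    return real_count / alphabet_size ** length


densities = {n: real_word_density(n, c) for n, c in real_word_counts.items()}

for n, density in densities.items():
    print(f"{n}-letter: {density:.5%} real words")
```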

These measurements not only reveal how real words are distributed across word lengths, but also suggest the utility of the application itself. Because the program requires a list of real words to begin generating nonwords, its value (i.e. its utility) appears to increase as the density of real words decreases.

[Figure: Utility vs Density of Real Words. Axes: Density of Real Words and Application Utility; tick labels at 60%, 40%, 20%, 0%.]

This would appear to contradict the dominant paradigm, which holds that a) the utility of a given system is equal to its usefulness; and b) that its un-utility (and therefore its uselessness) is simply the inverse of its utility. Still, no conclusive proof of these relationships has ever been offered, and very little hard science has explored the correlation – if there truly is one.

To better understand the relationship, we examined certain properties of the nonwords' metadata in order to answer questions such as:

How useless is this data?
How useless is the script that generates the data?
How do we measure the intrinsic uselessness of a given program, system, or machine?

These questions lie at the heart of this research venture. Our nonword dataset serves as a prime example of high-volume, low-signal data, allowing for practical applications of information theory principles in seemingly useless contexts.

Utility and Uselessness

Standard dictionaries necessarily omit nonwords because their (usually human) users prioritize filtered lists of real words for everyday use. Nonword dictionaries are highly impractical: they take up large amounts of storage space and are difficult to search, move, and reference in other works. Simply put, they are not needed.

But are they useless? And if so, to what degree?

Recent advances in processing power and considerable drops in storage costs have made feasible the notion of a comprehensive nonword list suitable for specialized (sometimes human, often-pataphysical) uses, despite the impracticalities. (Notable users of English nonwords include J. Joyce, L. Carroll, and W. Shakespeare.) The availability of such a nonword list eliminates the inconvenience and expense of their not yet existing, reducing the friction involved in employing nonwords for actual use.

By generating extremely large lists of English nonwords, we are able to explore the hypothesized relationship between information potential and list size as factors regulating uselessness. Though nonwords are not needed today, they have the potential to become real words in the future as their uselessness decays into meaning. From this basic observation, we may infer that — notwithstanding some finite number of unpronounceable phonemes (for a given language, at a given time) — the total potential meaning for a dataset ought to be quantifiable at any given instant.

(As a next phase of research, comparative patalinguistic teams at PataNYC are preparing to evaluate dictionaries in other languages for the phonotactic differences, orthographic patterns, morphological insights, and lexical gaps that define the contours of meaning across multilingual nonword lists. Contributors are invited to participate in the project via the contact information provided. The observations will eventually be benchmarked against theoretical figures originally derived from inverted real-word dictionaries.)

Despite the obviousness of this approach, today's theoretical models are only partially capable of describing the latent potential of meaning within nonword texts. Current techniques do not account for a nonword list's inherently dynamic levels of potential real-word density. And, the rarity of longer real words predicted by entropy encoding techniques presently in use does not align with the actual, observed interplay between combinatorial possibilities and linguistic constraints present in our dataset. As a result, the full effect of entropy on uselessness remains unknown, though speculation about the phenomenon has led some researchers to wonder whether the so-called “Rube Goldberg threshold” may indeed exist.

This makes the need for a new method of analysis abundantly clear.

A novel method for the calculation of uselessness

As the final aspect of our study of nonwords, we propose a calculation that parametrizes degrees of entropy and potential meaning based on the size of the dataset and accounts for compression efficiency, retrieval speed, storage cost and processing cost, resulting in a single measure of uselessness, which may be expressed:

U = [(S + P) × D] / [I × C × R]

and applied, for illustrative purposes, to our program and resulting dataset thusly:

(I) = Information Potential

I = -Σ(p(x) × log₂(p(x))) × N

1-letter: -23 × (1/23 × log₂(1/23)) = 4.52

2-letter: -249 × (1/249 × log₂(1/249)) = 7.96

3-letter: -15446 × (1/15446 × log₂(1/15446)) = 13.91

4-letter: -449790 × (1/449790 × log₂(1/449790)) = 18.78

5-letter: -11865456 × (1/11865456 × log₂(1/11865456)) = 23.50

6-letter: -308885902 × (1/308885902 × log₂(1/308885902)) = 28.20

7-letter: -8031768178 × (1/8031768178 × log₂(1/8031768178)) = 32.90

Total I ≈ 129.77 × 8352384044 ≈ 1.08 × 10¹² bits
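Because every nonword of a given length is treated as equally likely (p = 1/n), each entropy sum above collapses to log₂(n). The Python sketch below shows this under that uniform-probability assumption; the function name `shannon_entropy` is ours, and the per-length values are rounded to two decimals before summing, as in the figures above.

```python
from math import log2


def shannon_entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


# Nonword counts for lengths 1 through 7, as used above.
nonword_counts = [23, 249, 15446, 449790, 11865456, 308885902, 8031768178]

# For a uniform distribution over n items the entropy is simply log2(n).
bits_per_length = [round(log2(n), 2) for n in nonword_counts]
total_bits_per_nonword = round(sum(bits_per_length), 2)

print(bits_per_length)
print(total_bits_per_nonword)
```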

(C) = Compression Efficiency

Actual application size: 6,154 bytes

C = 8352384044 × 8 bits / (6154 × 8 bits) = 1,357,227.50201

(R) = Retrieval Speed Ratio

Total runtime for 7-letter dataset: 37521.34 seconds

Number of 7-letter nonwords: 8031768178

Average time per nonword generation: 37521.34 / 8031768178 = 0.00000467173506 seconds

Estimating lookup time for a real word as 0.00000100000000 seconds (1 microsecond)

R = 0.00000467173506 / 0.00000100000000 = 4.67173506

(S) = Storage Cost

Storage cost: $0.023 per GB

Function to calculate total storage cost:

def calculate_storage_cost(file_sizes, cost_per_gb):
    """Return the storage cost in dollars for a list of file sizes in bytes."""
    total_size_gb = sum(size / (1024**3) for size in file_sizes)  # bytes -> GB
    return total_size_gb * cost_per_gb

script_size = 6154  # bytes
output_sizes = [  # nonword list sizes in bytes, by word length
    46,  # 1 letter
    747,  # 2 letters
    62 * 1024,  # 3 letters
    2.24 * 1024 * 1024,  # 4 letters
    71.2 * 1024 * 1024,  # 5 letters
    2.16 * 1024 * 1024 * 1024,  # 6 letters
    64.25 * 1024 * 1024 * 1024  # 7 letters
]

# Cost of storing the script plus all seven output files at $0.023 per GB
total_storage_cost = calculate_storage_cost([script_size] + output_sizes, 0.023)

S = $1.52797001625

(P) = Processing Cost

Total runtime: 37521.34 seconds

Estimated MacBook Air power consumption: 10W

Energy used: 37521.34 × 10 / 3600 = 104.23 Wh = 0.10423 kWh

NYC electricity rate (2024 estimate): $0.25/kWh

P = 0.10423 × 0.25 = $0.026057500000
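The processing-cost arithmetic above can be expressed as a small Python helper. This is a sketch using the stated runtime, the estimated 10 W draw, and the estimated $0.25/kWh rate; the function name is hypothetical, and the energy figure is rounded to two decimals to mirror the intermediate step shown above.

```python
RUNTIME_SECONDS = 37521.34   # total runtime stated above
POWER_WATTS = 10             # estimated laptop power draw
RATE_USD_PER_KWH = 0.25      # estimated electricity rate


def processing_cost(runtime_s, power_w, rate_per_kwh):
    """Return (energy in Wh, cost in dollars) for a run."""
    # Watts x seconds / 3600 gives watt-hours; rounded as in the text.
    energy_wh = round(runtime_s * power_w / 3600, 2)
    energy_kwh = energy_wh / 1000
    return energy_wh, energy_kwh * rate_per_kwh


energy_wh, cost_usd = processing_cost(RUNTIME_SECONDS, POWER_WATTS, RATE_USD_PER_KWH)
print(f"{energy_wh} Wh, ${cost_usd:.7f}")
```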

(D) = Redundancy

D = 1 + (370,081 / 8,353,082,582)

= 1 + (.00004430472)

= 1.00004430472

Final Calculation

U = [(S + P) × D] / [I × C × R]

U = [(1.52797001625 + 0.026057500000) × 1.00004430472] / [1.08 × 10¹² × 1,357,227.50201 × 4.67173506]

U = 1.55409288854 / (6.84 × 10²¹)

Making the total uselessness of our nonword dataset:

U ≈ 2.27206562 × 10^-22
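For readers who wish to experiment with the formula, it can be wrapped in a small Python function. This is a sketch only: the function and parameter names are ours, and the inputs in the example are hypothetical round numbers chosen for readability, not the study's measurements.

```python
def uselessness(storage_cost, processing_cost, redundancy,
                information_potential, compression, retrieval_ratio):
    """Compute U = [(S + P) x D] / [I x C x R]."""
    denominator = information_potential * compression * retrieval_ratio
    if denominator == 0:
        raise ValueError("I, C and R must all be non-zero")
    return (storage_cost + processing_cost) * redundancy / denominator


# Hypothetical inputs: (2.0 + 1.0) x 1.5 / (10.0 x 3.0 x 0.5)
example = uselessness(2.0, 1.0, 1.5, 10.0, 3.0, 0.5)

# Doubling the information potential halves the uselessness.
doubled_information = uselessness(2.0, 1.0, 1.5, 20.0, 3.0, 0.5)
print(example, doubled_information)
```

The structure makes the behavior of the measure easy to see: cost and redundancy push U up, while information potential, compression efficiency, and the retrieval ratio push it down.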

Relative uselessness of classic texts

To place this result into context, we then applied the same analysis to several large classic texts for comparison, and, as an experimental control, to the original list of real English words from which the application derived its output of nonwords. For the sample texts, we used a simple textual analysis script written in Python to find the number of words, unique words, and file size, to be used as inputs to the calculation. (This script is also included in the PataNYC nonwords repository on GitHub.) The results are shown in the table below.
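The kind of text statistics described above could be gathered roughly as follows. This is not the script from the repository; it is a minimal sketch that assumes words are runs of letters and apostrophes, compared case-insensitively, and that file size is the UTF-8 byte length. The function name and sample sentence are hypothetical. The final line reflects an observation about the table: its D row equals word count divided by unique words (for example, 821,496 / 33,421 ≈ 24.580).

```python
import re
from collections import Counter


def text_statistics(text):
    """Return word count, unique word count, and size in bytes for `text`."""
    # Assumption: a "word" is a run of letters or apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "word_count": len(words),
        "unique_words": len(counts),
        "size_bytes": len(text.encode("utf-8")),
    }


sample = "Call me Ishmael. Call me anything."
stats = text_statistics(sample)
ratio = stats["word_count"] / stats["unique_words"]  # words per unique word
print(stats, ratio)
```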

	Bible	War and Peace	Moby Dick	English word list	Nonword list (7 letters)
Word Count	821,496	563,286	212,032	370,081	8,031,768,178
Unique Words	33,421	41,548	33,265	370,081	8,031,768,178
File Size (bytes)	4,436,171	3,339,767	1,243,001	4,234,834	68,987,183,104
Runtime (seconds)	0.17	0.12	0.04	0.26	37,521.34
I (bits)	470,962.442	546,720.363	469,171.615	3,502,750.189	232,921,277,162.000
C	1.482	1.348	1.363	0.699	0.932
R	2,069.412	2,132.530	1,886.792	7,023.077	4,671.735
S ($)	0.000102032	0.0000768146	0.0000285890	0.0000974012	1.52797001625
P ($)	0.0000118056	0.00000833333	0.00000277778	0.0000180556	0.026057500000
D	24.580	13.557	6.374	1.000	1.000
U	2.54895 × 10^-8	4.93751 × 10^-9	2.43642 × 10^-10	4.08747 × 10^-12	2.27207 × 10^-22

Our analysis found the Bible to have the highest uselessness due to its high redundancy, while the list of 7-letter nonwords has by far the lowest uselessness, despite its impracticality. This is due to its extremely high information potential and the fact that every word is unique.

[Figure: Visualization of comparative uselessness (log scale) for The Bible, War and Peace, Moby Dick, the English word list, and the nonword list (7 letters).]

Conclusion

Though it is a patient and humble step forward in the saga of progress, we have, by creating the largest dataset of nonwords to date, advanced the study of uselessness to the frontiers of plausibility.

We hope pataphysicians and others will now have the means to more accurately account for the latent information potential and un-utility that lies in the very form of a given system, as we have attempted to describe. As words are the building blocks of ideas, so the nonword generator contains the generative power of all potential words, and therefore, all possible ideas.

We believe our model provides a more realistic (if not more intrinsically useful per se) estimate of the uselessness of these possible ideas, or any given system, compared to traditional methods. Having proffered our estimations, we must now await the slow evolution of language to prove or disprove their accuracy. With great humility, we invite other researchers to review our methods and attempt to replicate these findings so we may collectively advance this important effort to demystify uselessness.
