October, 2024
A research team affiliated with The Pataphysical Society of New York has successfully generated the largest ever list of English language nonwords (i.e. imaginary words) using a specialized computer application developed by the Society as part of an inquiry into the nature of uselessness.
The team was able to identify and catalog an astounding number of nonwords — 8,352,712,501 — and analyze them using various techniques.
Finally, the team developed a novel method for calculating and expressing the information potential, entropy, and uselessness of a given system, and benchmarked the results against the nonword dataset, as well as several large, classic texts, as a first slate of experiments.
Today we are publishing their initial findings and releasing large, new datasets of English nonwords, along with the source code of the nonword generator, for fellow researchers, pataphysicians, and academics to use freely. (GitHub link)
This historic project advances our understanding of uselessness as a property of large datasets, adding new analytic tools, vocabulary and lexicographic specimens to the expanding field of pataphysical research.
We developed a C++ application to systematically find nonwords based on a given dictionary. We began with a high-quality, open-source English word list of 370,081 words and fed it to the application, searching for incrementally longer combinations of letters to generate exhaustive lists of nonwords from 1 up to 7 letters in length. (The program can be set to output nonword lists of various lengths, as standalone files or as a single combined file.)
The C++ application takes advantage of multi-threaded processing and advanced memory-allocation techniques, which to our knowledge have never before been employed to systematically explore the vast landscape of potential letter combinations. This allowed us to far exceed previous attempts at generating nonword lists. We continued until the output files became too large and unstable to use. The maximum nonword length generated was 7 letters, though the 64.25 GB file it produced is prone to causing crashes. (The 6-letter nonword list, at 2.16 GB, is far more manageable, though the information loss is profound.) The resulting nonword lists were inspected for quality and analyzed using statistical methods.
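For illustration, the core search can be sketched in a few lines of Python. This is a deliberately simplified, single-threaded stand-in for the multi-threaded C++ application; the file name and function names below are placeholders of ours, not the application's.

```python
from itertools import product
from string import ascii_lowercase

def generate_nonwords(dictionary_path, max_length):
    """Yield every alphabetic string of 1..max_length letters
    that does not appear in the supplied word list."""
    with open(dictionary_path) as f:
        real_words = {line.strip().lower() for line in f if line.strip()}
    for length in range(1, max_length + 1):
        for combo in product(ascii_lowercase, repeat=length):
            candidate = "".join(combo)
            if candidate not in real_words:
                yield candidate

# Example: write all nonwords of up to 3 letters to a file.
# ("english_words.txt" is a placeholder for the 370,081-word input list.)
# with open("nonwords_up_to_3.txt", "w") as out:
#     for nonword in generate_nonwords("english_words.txt", 3):
#         out.write(nonword + "\n")
```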
A random sample of 3-letter nonwords from our dataset:
These measurements reveal not only the distribution of words within all lexical constraints but also the utility of the application itself. Because the program requires a list of real words to begin generating nonwords, its value (i.e. its utility) appears to increase as the density of real words decreases.
To further understand this relationship, we examined certain properties of the nonwords' metadata in order to answer questions such as:
These questions lie at the heart of this research venture. Our nonword dataset serves as a prime example of high-volume, low-signal data, allowing for practical applications of information theory principles in seemingly useless contexts.
Standard dictionaries necessarily omit nonwords because their (usually human) users prioritize filtered lists of real words for everyday use. Nonword dictionaries are highly impractical, taking up large amounts of storage space and posing difficulties in searching, moving, and referencing them in other works. Simply put, they are not needed.
But are they useless? And if so, to what degree?
Recent advances in processing power and considerable drops in storage costs have made feasible the notion of a comprehensive nonword list suitable for specialized (sometimes human, often pataphysical) uses, despite the impracticalities. (Notable users of English nonwords include J. Joyce, L. Carroll, and W. Shakespeare.) The availability of such a nonword list eliminates the inconvenience and expense of its not yet existing, reducing the friction involved in employing nonwords for actual use.
By generating extremely large lists of English nonwords, we are able to explore the hypothesized relationship between information potential and list size as factors regulating uselessness. Though nonwords are not needed today, they have the potential to become real words in the future as their uselessness decays into meaning. From this basic observation, we may infer that — notwithstanding some finite number of unpronounceable phonemes (for a given language, at a given time) — the total potential meaning for a dataset ought to be quantifiable at any given instant.
(As a next phase of research, comparative patalinguistic teams at PataNYC are preparing to evaluate dictionaries in other languages for the phonotactic differences, orthographic patterns, morphological insights, and lexical gaps that define the contours of meaning across multilingual nonword lists. Contributors are invited to participate in the project via the contact information provided. The observations will eventually be benchmarked against theoretical figures originally derived from inverted real-word dictionaries.)
Despite the obviousness of this approach, today's theoretical models are only partially capable of describing the latent potential of meaning within nonword texts. Current techniques do not account for a nonword list's inherently dynamic levels of potential real-word density. And the rarity of longer real words predicted by the entropy-encoding techniques presently in use does not align with the actual, observed interplay between combinatorial possibilities and linguistic constraints present in our dataset. As a result, the full effect of entropy on uselessness remains unknown, though speculation about the phenomenon has led some researchers to wonder whether the so-called “Rube Goldberg threshold” may indeed exist.
This makes the need for a new method of analysis abundantly clear.
As the final aspect of our study of nonwords, we propose a calculation that parametrizes degrees of entropy and potential meaning based on the size of the dataset and accounts for compression efficiency, retrieval speed, storage cost and processing cost, resulting in a single measure of uselessness, which may be expressed:
U = [(S + P) × D] / (I × C × R)
and applied, for illustrative purposes, to our program and resulting dataset thusly (where I is the information potential, C the compression ratio, R the relative retrieval speed, S the storage cost, P the processing cost, and D the redundancy):
I = -Σ(p(x) × log2(p(x))) × N
1-letter: -23 × (1/23 × log2(1/23)) = 4.52
2-letter: -249 × (1/249 × log2(1/249)) = 7.96
3-letter: -15446 × (1/15446 × log2(1/15446)) = 13.91
4-letter: -449790 × (1/449790 × log2(1/449790)) = 18.78
5-letter: -11865456 × (1/11865456 × log2(1/11865456)) = 23.50
6-letter: -308885902 × (1/308885902 × log2(1/308885902)) = 28.20
7-letter: -8031768178 × (1/8031768178 × log2(1/8031768178)) = 32.90
Total I ≈ 129.77 × 8352384044 ≈ 1.08 × 10^12 bits
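For those wishing to check the arithmetic, the per-length terms and the total above can be reproduced with a short Python sketch (the counts and the total nonword figure are copied directly from this section):

```python
import math

# Nonword counts per length, as reported above.
counts = [23, 249, 15_446, 449_790, 11_865_456, 308_885_902, 8_031_768_178]

# For a uniform distribution over n items, -sum(p * log2(p)) reduces to log2(n).
per_length_bits = [math.log2(n) for n in counts]
total_entropy = sum(per_length_bits)   # ≈ 129.8 bits

N = 8_352_384_044                      # total nonword count used above
I = total_entropy * N                  # ≈ 1.08 × 10^12 bits
```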
Actual application size: 6,154 bytes
C = 8352384044 × 8 bits / (6154 × 8 bits) = 1,357,227.50201
Total runtime for 7-letter dataset: 37521.34 seconds
Number of 7-letter nonwords: 8031768178
Average time per nonword generation: 37521.34 / 8031768178 = 0.00000467173506 seconds
Estimating lookup time for a real word as 0.00000100000000 seconds (1 microsecond)
R = 0.00000467173506 / 0.00000100000000 = 4.67173506
Storage cost: $0.023 per GB
Function to calculate total storage cost:
```python
def calculate_storage_cost(file_sizes, cost_per_gb):
    total_size_gb = sum(size / (1024**3) for size in file_sizes)
    return total_size_gb * cost_per_gb

script_size = 6154  # bytes
output_sizes = [
    46,                          # 1 letter
    747,                         # 2 letters
    62 * 1024,                   # 3 letters
    2.24 * 1024 * 1024,          # 4 letters
    71.2 * 1024 * 1024,          # 5 letters
    2.16 * 1024 * 1024 * 1024,   # 6 letters
    64.25 * 1024 * 1024 * 1024,  # 7 letters
]
total_storage_cost = calculate_storage_cost([script_size] + output_sizes, 0.023)
```
S = $1.52797001625
Total runtime: 37521.34 seconds
Estimated MacBook Air power consumption: 10W
Energy used: 37521.34 × 10 / 3600 = 104.23 Wh = 0.10423 kWh
NYC electricity rate (2024 estimate): $0.25/kWh
P = 0.10423 × 0.25 = $0.026057500000
(D) = Redundancy
D = 1 + (370,081 / 8,353,082,582)
= 1 + (0.00004430472)
= 1.00004430472
U = [(S + P) × D] / (I × C × R)
U = [(1.52797001625 + 0.026057500000) × 1.00004430472] / [1.08 × 10^12 × 1,357,227.50201 × 4.67173506]
U = 1.55409288854 / (6.84 × 10^21)
Making the total uselessness of our nonword dataset:
U ≈ 2.27206562 × 10^-22
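As a sanity check, the whole calculation can be rerun in a few lines of Python using the component values derived above (all figures copied from this section; the rounding of I accounts for small differences in the final digits):

```python
# Component values from the calculations above.
I = 1.08e12          # information potential, in bits
C = 1_357_227.50201  # compression ratio (dataset size vs. application size)
R = 4.67173506       # retrieval ratio (generation time vs. 1 microsecond lookup)
S = 1.52797001625    # storage cost, USD
P = 0.0260575        # processing (electricity) cost, USD
D = 1.00004430472    # redundancy

U = ((S + P) * D) / (I * C * R)
print(f"U ≈ {U:.5e}")  # ≈ 2.27 × 10^-22
```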
To place this result in context, we then applied the same analysis to several large classic texts for comparison and, as an experimental control, to the original list of real English words from which the application derived its output of nonwords. For the sample texts, we used a simple textual analysis script written in Python to find the number of words, unique words, and file size, which served as inputs to the calculation. (This script is also included in the PataNYC nonwords repository on GitHub.) The results are shown in the table below.
| | Bible | War and Peace | Moby Dick | English word list | Nonword list (7 letters) |
|---|---|---|---|---|---|
| Word Count | 821,496 | 563,286 | 212,032 | 370,081 | 8,031,768,178 |
| Unique Words | 33,421 | 41,548 | 33,265 | 370,081 | 8,031,768,178 |
| File Size (bytes) | 4,436,171 | 3,339,767 | 1,243,001 | 4,234,834 | 68,987,183,104 |
| Runtime (seconds) | 0.17 | 0.12 | 0.04 | 0.26 | 37,521.34 |
| I (bits) | 470,962.442 | 546,720.363 | 469,171.615 | 3,502,750.189 | 232,921,277,162.000 |
| C | 1.482 | 1.348 | 1.363 | 0.699 | 0.932 |
| R | 2,069.412 | 2,132.530 | 1,886.792 | 7,023.077 | 4,671.735 |
| S ($) | 0.000102032 | 0.0000768146 | 0.0000285890 | 0.0000974012 | 1.52797001625 |
| P ($) | 0.0000118056 | 0.00000833333 | 0.00000277778 | 0.0000180556 | 0.026057500000 |
| D | 24.580 | 13.557 | 6.374 | 1.000 | 1.000 |
| U | 2.54895 × 10^-8 | 4.93751 × 10^-9 | 2.43642 × 10^-10 | 4.08747 × 10^-12 | 2.27207 × 10^-22 |
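For reference, a minimal sketch of the kind of textual analysis used to produce the word counts, unique-word counts, and file sizes in the table above follows; the script in the PataNYC repository may differ in its tokenization details, and the file name below is only an example.

```python
import os
import re

def analyze_text(path):
    """Return word count, unique-word count, and file size (bytes) for a plain-text file."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    words = re.findall(r"[a-z']+", text)  # crude tokenizer: letters and apostrophes only
    return {
        "word_count": len(words),
        "unique_words": len(set(words)),
        "file_size_bytes": os.path.getsize(path),
    }

# Example usage with a hypothetical file:
# print(analyze_text("moby_dick.txt"))
```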
Our analysis found the Bible to have the highest uselessness, owing to its high redundancy, while the list of 7-letter nonwords has by far the lowest uselessness despite its impracticality, owing to its extremely high information potential and the fact that every entry is unique.
Though it is a patient and humble step forward in the saga of progress, we have, by creating the largest dataset of nonwords to date, advanced the study of uselessness to the frontiers of plausibility.
We hope pataphysicians and others will now have the means to more accurately account for the latent information potential and un-utility that lies in the very form of a given system, as we have attempted to describe. As words are the building blocks of ideas, so the nonword generator contains the generative power of all potential words, and therefore, all possible ideas.
We believe our model provides a more realistic (if not more intrinsically useful per se) estimate of the uselessness of these possible ideas, or of any given system, than traditional methods. Having proffered our estimations, we must now await the slow evolution of language to prove or disprove their accuracy. With great humility, we invite other researchers to review our methods and attempt to replicate these findings so we may collectively advance this important effort to demystify uselessness.
&&&