Information networks, the Oxford English Dictionary and the idea of universal knowledge
March 13, 2008
I have to preface this post with a confession: I once signed up to receive monthly newsletter emails from the Oxford English Dictionary. I still receive these emails and – unlike many of the emails that come to me from friends, family, colleagues and the communities in which I actively participate – I read the OED emails as soon as they show up in my inbox.
In my defense, the emails often provide a fascinating window onto the fast-paced world of dictionary-writing. Well, okay, maybe it’s not fast-paced. And maybe my girlfriend can’t believe how boring it is, but at least it’s fascinating to me…
This month’s edition made what I would consider a shocking announcement. For years, the OED editorial staff has revised content by proceeding alphabetically through the previous edition, adding new words and updating existing entries. The result, as you might imagine, is that a lot of words nobody knows get attention disproportionate to their usage. As of this most recent quarterly update, however, the editorial staff will begin to complement the old approach with a new method based on lexical frequency, semantic search data, and alphabetical clusters. This means that the words updated this time around look much more familiar: for example, “heaven,” “hell,” “fuck,” “computers,” “gay” and “free.”
It took a moment for me to realize the implications of this change. Although the newsletter doesn’t go into detail about the new selection process or the data on which it was based, it’s clear that the new technique suggests a radically different concept of language and information. Let me explain what I mean.
The old, alphabetical way of prioritizing updates ascribes an implicit equality to each word. This is reasonable from a bird’s-eye perspective of the lexical universe: each word is, after all, equal to its peers in the sense that they all have a place in the dictionary. They are all words. However, the assumption of lexical equality falls apart quickly upon closer inspection of the way people use words in the world. Relatively few words dominate our everyday speech and writing patterns. As with many social phenomena, the frequency of word use in any given natural language follows a “Zipfian distribution” – a.k.a. a power law – in which a word’s frequency is roughly inversely proportional to its rank in the frequency table. The new OED revision method attempts to take this reality into account. The editors will no longer leave it up to chance whether the most relevant, heavily used, and contentious words that populate everyday language undergo regular “check-ups.”
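To make the power-law idea a little more concrete, here is a minimal sketch in Python. It is only an illustration: the sample sentence, the function names and the exponent of 1 are my own assumptions, and none of it comes from the OED newsletter or reflects how the editors actually select words.

```python
from collections import Counter
import re


def rank_frequencies(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending count; break ties alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_prediction(top_count, rank, exponent=1.0):
    """Idealized Zipf estimate of the count at a given rank.

    Assumes count(rank) ~ top_count / rank**exponent, with exponent 1
    as a hypothetical default.
    """
    return top_count / rank ** exponent


# A toy sentence (an assumption for illustration, far too small to be a corpus).
sample = "the cat and the dog and the bird saw the fish"
table = rank_frequencies(sample)
top_count = table[0][2]

# Print observed counts next to the idealized Zipf estimate for the top ranks.
for rank, word, count in table[:3]:
    print(rank, word, count, round(zipf_prediction(top_count, rank), 2))
```

On a real corpus of any size, the observed counts would only loosely track the idealized curve, but the shape is the point: a handful of words at the top account for a large share of everything that gets said or written.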
Why am I making such a big deal out of this? Dictionaries – as well as their not-too-distant cousins, encyclopedias – reveal a lot about the way people think about themselves and the world around them. The original Encyclopédie was a distinctive product of the Enlightenment. Edited by Denis Diderot and Jean le Rond d’Alembert, it was intended to be a comprehensive catalogue of the entirety of human knowledge. Around the same time (i.e. the 18th century), the idea arose to make dictionaries follow an alphabetical order. Prior to that they had been organized around discrete topics. By embracing the alphabet as an abstract system of categorization, dictionarists effectively accepted the notion that language use was not as important as a totalizing system of categorization. The shift represented an attempt to make the dictionary more encyclopedic in its scope and structure. Not surprisingly, the method of dictionary production and revision would come to follow that structure.
In adopting a revision system that takes natural language use patterns into account, the OED editors have, in one sense, recognized the limits of the encyclopedic project. While this should not suggest the end of the Enlightenment or any such nonsense, it does suggest an interesting turning point in the evolution of human self-knowledge.