Notes Information Apocalypse

Mixing Semantic Vocabularies and Basic English

Microformats make sense to me immediately, in a way that RDF never has. By focusing on visible data, the markup retains a high degree of interoperability with existing web tools, without requiring complex ontology specifications, and actually reflects the way that humans think about and use written information (paragraphs, lists, and abbreviations). RDF is clearly an important and useful technology, but I think that its potential for success may be in far more specialized applications than in generally enabling the Semantic Web.

In a closed world, it is easy to relate metadata about objects directly to an ontology (or several). But URIs live in an open world, where new ontologies will constantly be forming, so any universal process aimed at matching, comparing, or inferring information about a resource will end up being far too complex and time-consuming to be realistically achieved. This leads many to believe that the Semantic Web is an impractical, misplaced dream.

Look at any standard RDF document, and you'll likely see a mixture of namespaces: the set of unique vocabularies or ontologies that are used in the resource definition. For example, a Typepad XML feed looks like:


That's six namespaces in addition to the default RSS format. In this case, the Dublin Core namespace is used to denote a timestamp for each blog entry:


OK, it's standardized and probably useful, but compare it to the approach used in the hCalendar format:

June 6th

This is also standardized: it's valid XHTML, and the title attribute holds an ISO 8601 date. But the abbreviation itself is a human-readable date. Thus, it's presentation-friendly, and with the addition of a class attribute, it becomes metadata-friendly too.
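To make the comparison concrete, here is a minimal Python sketch that reads the same timestamp both ways. The feed item, the date value (including the year), and the helper names dc_date and hcalendar_dates are assumptions made up for illustration, not Typepad's actual output. Only the Dublin Core namespace URI and the hCalendar dtstart class name come from the respective specifications.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace

# Hypothetical feed item; the date value is invented for illustration.
RSS_ITEM = """<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Example entry</title>
  <dc:date>2005-06-06T00:00:00Z</dc:date>
</item>"""

# Hypothetical hCalendar-style fragment; the year in the title is assumed.
HCAL_FRAGMENT = (
    '<p>Posted on <abbr class="dtstart" title="2005-06-06">June 6th</abbr></p>'
)


def dc_date(item_xml):
    """Return the text of the dc:date child of an item, or None if absent."""
    item = ET.fromstring(item_xml)
    node = item.find(f"{{{DC_NS}}}date")
    return node.text if node is not None else None


class AbbrDateParser(HTMLParser):
    """Collects (machine_date, human_text) pairs from abbr.dtstart elements."""

    def __init__(self):
        super().__init__()
        self.dates = []
        self._open = False   # True while inside a matching abbr element
        self._title = None   # value of its title attribute
        self._text = []      # visible text seen inside it

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "abbr" and "dtstart" in classes:
            self._open = True
            self._title = attr_map.get("title")
            self._text = []

    def handle_data(self, data):
        if self._open:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "abbr" and self._open:
            self.dates.append((self._title, "".join(self._text)))
            self._open = False


def hcalendar_dates(html):
    """Return every (machine_date, human_text) pair found in the fragment."""
    parser = AbbrDateParser()
    parser.feed(html)
    parser.close()
    return parser.dates
```

The Dublin Core reader has to know the namespace URI before it can find anything. The microformat reader only needs the class name, and it gets the human-readable text along with the machine-readable date.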

There is a long running philosophical argument about meaning, and its relationship to text. From where do we get our semantics? Words? Sentences? Languages? Social networks? Cultures? Current attempts to standardize the Semantic Web are really a bet that the appropriate unit of semantics is the vocabulary or ontology.

In A Thousand Years of Nonlinear History, Manuel De Landa references the critique of Chomsky's automata model of language by Deleuze and Guattari:

Our criticism of these linguistic models is not that they are too abstract but, on the contrary, that they are not abstract enough, that they do not reach the abstract machine that connects language to the semantic and pragmatic contents of statements, to collective assemblages of enunciation, to a whole micropolitics of the social field... There is no language in itself, nor are there any linguistic universals, only a throng of dialects, patois, slangs, and specialized languages. There is no ideal speaker-listener, any more than there is a homogeneous linguistic community. Language is, in Weinreich's words, "an essentially heterogeneous reality". There is no mother tongue, only a power takeover by a dominant language within a political multiplicity.

The Chomskyan model of a rule-applying automaton in the brain, responsible for grammatical intuition, arose from the observation that children learn language by listening to adult conversation without being explicitly told what the rules are. This challenges the idea that grammatical rules originate in, and spread from, social institutions or cultures at a high level. But we can consider sources other than institutional rules to understand the "combinatorial productivity" of language. De Landa points to the work of the linguists George Zipf and Zellig Harris, who introduced the idea of "transformation" to linguistics in terms of "combinatorial constraints". They rejected the idea of a homogeneous core that is the essence of language and instead proposed meaning in historical terms, as the cumulative result of actual usage, which gives rise to combinations and restrictions on the co-occurrence of words.

A given operator, once included in a sentence, demands an argument of a certain class. This constraint, too, adds information to the sentence: the more unfamiliar the argument supplied for a given operator, the more informative it will be.

The operator-argument constraint, when used to link verbs and nouns, adds to a sentence the meaning of "aboutness," the ability to refer not only to individual objects and events but also to complex situations.

It's worth noting that these combinatorial constraints occur at the level of statements, not of ontologies. Thus, when considering the language of the Semantic Web, standardizing completeness and correctness at the ontology level may not always enhance meaning; by forcing ideas and concepts to be more homogeneous, it could potentially destroy it. De-emphasizing the essential components of transformation and flux in language sacrifices organic potential in favor of machine readability.

Earlier, I noted the potential for a simple and meaningful vocabulary of content objects that may or may not be synthetic representations of real-world objects. I define these not by a schema, but in terms of Basic English. And as such, I can't forget H.G. Wells's prediction:

One unlooked for development of the hundred years between 2000 and 2100 was the way in which Basic English became in that short time the common language for use between nations... By 2200 almost everyone was able to make use of Basic for talking and writing.

Basic English has already become so ubiquitous that we can destabilize or decentralize terms such as "author" or "title" or "description" only in terms of their referents. The actual terms would only change under a radical restructuring of the English language itself, and this seems unlikely, in the short term at least. So if we know this much about certain nouns, surely we can build software that parses this information and connects resources in a way that doesn't require a centralized ontological system to be present?
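As a rough sketch of what such software might look like, the Python below treats a handful of plain English nouns as class names, pulls their visible text out of ordinary HTML, and links pages that share a value. Everything here is hypothetical: the BASIC_TERMS set, the example.org pages, and the function names are mine, and the parser assumes each marked-up element contains only plain text. It is an illustration of the idea, not a proposal for a format.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumed vocabulary: plain English nouns used directly as class names.
BASIC_TERMS = {"author", "title", "description"}

# Hypothetical pages keyed by URL, invented for this example.
PAGES = {
    "http://example.org/a": '<div><h1 class="title">First note</h1>'
                            '<span class="author">Jane Doe</span></div>',
    "http://example.org/b": '<div><h1 class="title">Second note</h1>'
                            '<span class="author">jane doe</span></div>',
    "http://example.org/c": '<div><h1 class="title">Other</h1>'
                            '<span class="author">John Roe</span></div>',
}


class BasicTermParser(HTMLParser):
    """Collects visible text from elements whose class is a Basic term."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # term of the element being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        term = next((c for c in classes if c in BASIC_TERMS), None)
        if term is not None:
            self._current = term

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] = self.fields.get(self._current, "") + data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field.
        self._current = None


def extract_fields(html):
    """Return {term: text} for one page, with whitespace normalized."""
    parser = BasicTermParser()
    parser.feed(html)
    parser.close()
    return {term: " ".join(text.split()) for term, text in parser.fields.items()}


def link_by_term(pages, term):
    """Group page URLs that share the same (case-insensitive) value for term."""
    groups = defaultdict(list)
    for url, html in pages.items():
        value = extract_fields(html).get(term)
        if value:
            groups[value.lower()].append(url)
    # Keep only values that actually connect two or more resources.
    return {value: urls for value, urls in groups.items() if len(urls) > 1}
```

Calling link_by_term(PAGES, "author") connects the first two pages through the shared author name, with no schema or registry consulted along the way. The matching is deliberately naive (a lowercase string comparison), which is exactly where a real tool would need to favor understanding over consistency.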

The real challenge in enabling the Semantic Web is not to prove syllogisms, or correctly make deductions or inferences about information. It's about enabling people to organize information using tools that treat understanding as more important than consistency.