Tag-a-long

Our formal attempts at understanding language tend to be hierarchically structured, but when it comes to actually organizing language, it is well known that meaning is relational and constantly fluctuates. This is perfectly fine in a verbal or written setting, where there is always a certain amount of leeway for expression and well-understood cues are present to provide context for such fluctuations. But tagging is based on isolating and detaching specific words from this flow, abstracting them as separate conceptual objects. Relationships between tags emerge through serendipitous intersection as well as planned editing, and the non-hierarchical nature of tag sets means there is usually no distinction between generality and specificity in the tags used. For example, the MIT OpenCourseWare site is perpetually popular on del.icio.us. The most popular tag for OpenCourseWare is education, which neatly captures the general subject that this link is about. But other popular tags are either totally specific, like mit, or too general to have much relevance at all, like free. There's no standard of "aboutness" or semantic association for this kind of labelling; it's a purely heterogeneous free-for-all.

This lack of classification and hierarchy drives some people crazy, while others are overjoyed and empowered. Tagging emphasises choice over clarity, with an immediacy that has already had a profound impact on popular conceptions of metadata and online information. The most basic idea of a tag is a mapping between a collection of media objects and a human concept. Here, a relational tension arises between individual patterns of usage and the groupthink of folksonomy. It's interesting to think that the human value of tagging is lost when the shared meaning of the concept becomes contested, self-referential, or overly noisy.

We can learn a lot from exploring this further - both in terms of understanding the way that tag clusters provide first impressions of collective ideas and aspirations, and also through grasping the different ways in which we can extract useful information structures based on these clustered relationships.
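One simple way to start looking at tag clusters is to count how often pairs of tags land on the same item. The sketch below is only an illustration of that idea: the bookmarks dictionary and the cooccurrence function are made-up names, and the data is invented around the education, mit and free tags mentioned earlier rather than taken from any real del.icio.us export.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: each bookmarked item maps to the set of tags attached to it.
bookmarks = {
    "opencourseware": {"education", "mit", "free"},
    "lecture-archive": {"education", "free", "video"},
    "campus-guide": {"mit", "education"},
}

def cooccurrence(items):
    """Count how often each unordered pair of tags appears on the same item."""
    pairs = Counter()
    for tags in items.values():
        # Sorting gives every pair a stable (a, b) order with a < b.
        for a, b in combinations(sorted(tags), 2):
            pairs[(a, b)] += 1
    return pairs

pairs = cooccurrence(bookmarks)
for pair, count in pairs.most_common():
    print(pair, count)
```

Pairs with high counts are candidates for a cluster. Whether raw counts are a good enough signal, or need normalising by how common each tag is, is left open here.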

What I don't understand is the claim that power law relationships found in the distribution of tags are inherently a "good thing". Beyond obsequiousness, are we actually looking at anything surprising or interesting in the metrics of these distributions? It's a nice idea to think that such "long tail" effects can pose a threat to established business and information models, but let's not miss the point that Google treats individual words on a page as "de facto tags" anyway, and such scale-free properties will always arise from the structured usage of language. I don't see Zipf's law radically changing anytime soon, and as far as I'm concerned, these distributions are a basic phenomenon of networks and relational objects per se; they don't necessarily tell us anything about the actual content of the relationships embedded within.

What web designers need to understand better is how tag clusters can be used to filter and organize streams of information. If tagging does indeed have a "sweet spot" of early adoption, beyond which information quality rapidly declines, then designers, developers and publishers need to think very carefully about both their data sources and the actual keywords they choose. This doesn't necessarily mean that tagging has outgrown its usefulness, or should be dismissed as a social fad, but it may mean that it's time to think laterally about how the tagging approach can be extended.

Instead of defining a strict up-front hierarchy for navigation, tagging encourages an organic approach to content, usually through tools that autogenerate tag or category pages based on the set of keywords attached to each content item. Can this organically evolving base be taken a step further, using what we know about adaptive design?
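To make the autogeneration step concrete, here is a minimal sketch of how a publishing tool might derive tag pages from the keywords on each item. It does not describe any particular tool: the items list, its titles and the build_tag_pages function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical content items, each with a free-form list of keywords.
items = [
    {"title": "Tag-a-long", "tags": ["tagging", "metadata"]},
    {"title": "Frameworks are extractions", "tags": ["rails", "design"]},
    {"title": "Adaptive design", "tags": ["design", "metadata"]},
]

def build_tag_pages(items):
    """Invert item -> tags into tag -> list of item titles, one list per tag page."""
    pages = defaultdict(list)
    for item in items:
        for tag in item["tags"]:
            pages[tag].append(item["title"])
    return dict(pages)

tag_pages = build_tag_pages(items)
for tag in sorted(tag_pages):
    print(tag, "->", tag_pages[tag])
```

No navigation structure is declared anywhere in this sketch. The set of pages is simply whatever the keywords happen to produce, which is the organic quality described above.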

Organically evolving design methodologies have seen their greatest success in software design. The runaway popularity of Ruby on Rails wasn't because it was the best architected, most powerful and complete web framework around; it was because it worked straight away and was simple to use. I suspect that a lot of this success stemmed from the process by which it came to be, based on the idea that frameworks are extractions. The core of Rails was built to solve useful problems, and it was only after it had been proven and tested in a specific application that elements were extracted into a more general framework. This approach is related to the idea of refactoring, which focuses on architectures that emerge from the process of problem solving and growth in development, rather than being driven by pre-planned structures. This design methodology (with its perils and pitfalls alongside its positives) is well-accepted wisdom in the development community, but it's less accepted that the potential impact of adaptive design runs far further than just the code underlying the software.

The "middle distance" view of metadata might well be a synthesis of tagging as currently practised, augmented with the concept of a vocabulary of set values based on terms and types. Two immediately obvious types are subject and topic, as well as category, which is already extremely widespread. We can come up with all sorts of bespoke combinatorial rules for building a vocabulary based on these elements from any collection of tags. The key principle is that our metadata extraction captures various hierarchical constraints that we know from experience.

Using these vocabulary types, we can allow navigation and classification systems to evolve from experience, shaped by the actual usage of our content collections, rather than our initial assumptions about how they should be structured. Relationships can be discovered and invented through plain tagging, and eventually be extracted to form more valuable forms of organization.
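As a rough illustration of what one such bespoke rule could look like, the sketch below promotes plain tags into the category, subject and topic types purely from how often they are used. The thresholds, the extract_vocabulary function and the sample data are assumptions invented for this example. They are one arbitrary rule among many possible ones, not a recommended scheme.

```python
from collections import Counter

def extract_vocabulary(tagged_items, category_share=0.5, subject_min=2):
    """Assign each tag a type from usage counts.

    Assumed rule (illustrative only): a tag on at least `category_share`
    of all items becomes a category, a tag used on at least `subject_min`
    items becomes a subject, and everything else stays a topic.
    """
    # Count each tag once per item, even if it was repeated on that item.
    counts = Counter(tag for tags in tagged_items for tag in set(tags))
    total = len(tagged_items)
    vocabulary = {}
    for tag, n in counts.items():
        if total and n / total >= category_share:
            vocabulary[tag] = "category"
        elif n >= subject_min:
            vocabulary[tag] = "subject"
        else:
            vocabulary[tag] = "topic"
    return vocabulary

# Hypothetical tag sets, one list per content item.
tagged_items = [
    ["education", "mit", "free"],
    ["education", "physics"],
    ["education", "mit"],
    ["free", "music"],
    ["education", "physics", "video"],
    ["education", "lectures"],
]

vocab = extract_vocabulary(tagged_items)
for tag in sorted(vocab):
    print(tag, "->", vocab[tag])
```

Because the types are recomputed from usage, rerunning the extraction as the collection grows lets the vocabulary shift with it, rather than being fixed up front.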

My hope is that these concepts will lead to richer and more structured web content being published with less time and effort. The key idea is the formulation of metadata in terms related to entropy rather than ontology or description.
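One possible reading of "entropy" here is Shannon entropy over the distribution of tags applied to an item or collection. That reading is an assumption, not something spelled out above. Under it, a low value means usage has converged on a few shared labels and a high value means usage is scattered. The tag_entropy function below is a hypothetical sketch of that measure.

```python
import math
from collections import Counter

def tag_entropy(tags):
    """Shannon entropy, in bits, of the frequency distribution of a list of tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    # Sum of -p * log2(p) over every distinct tag; an empty list gives 0.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical tag applications for two items.
converged = ["education", "education", "education", "education"]
scattered = ["education", "mit", "free", "video"]

print(tag_entropy(converged))
print(tag_entropy(scattered))
```

A measure like this says nothing about what the tags mean. It only describes how agreement is distributed, which is a different kind of statement from an ontology or a description.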