Sampling Twitter

From a random sample of 300,000 Twitter users, it would seem that men follow men, and nobody tweets. As you might suspect, this is a textbook example of participation inequality.

People have been saying for a while now that, at large scale, Twitter is a wasteland of unused and abandoned accounts. Perhaps what this Harvard Business School research really indicates is that it is disingenuous for companies to quote the total number of accounts on a service where the group of active users is far smaller than the group of all users. It would be unfair to suggest that this pattern is unique to Twitter: I wouldn’t be surprised if the number of abandoned MySpace accounts dwarfs the total number of active accounts on all the other social networks combined (Facebook excepted).

Taking all this into consideration, it might be a little misleading to say that there is such a thing as a typical Twitter user, since vast numbers of people are counted as users who have an account but are not actually using the service. It’s probably fair to say that what is significant about the Twitter population is not to be found in the medians, but in the outliers. It would be interesting to know how many of the massively asymmetric celebrity accounts were included in this sample. Should the researchers have dropped the inactive accounts from the sample? That, of course, raises the messy and subjective argument over what constitutes an inactive account. As an example, the HP Twitter Research defined an active account as one that had logged in and completed activity within the last 30 days.
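To make the point about medians and outliers concrete, here is a minimal Python sketch. The tweet counts are invented purely for illustration and are not taken from either study, and the "at least one tweet" rule for an active account is my own assumption, not the HP definition.

```python
from statistics import mean, median

# Hypothetical tweet counts, invented for illustration only:
# seven dormant accounts, two casual users, one very heavy user.
tweet_counts = [0, 0, 0, 0, 0, 0, 0, 3, 12, 5000]


def summarize(counts):
    """Return (median, mean) for a list of per-account tweet counts."""
    return median(counts), mean(counts)


# Summary over every account in the sample, dormant ones included.
all_median, all_mean = summarize(tweet_counts)

# Assumed definition of "active": the account has tweeted at least once.
active_counts = [c for c in tweet_counts if c > 0]
active_median, active_mean = summarize(active_counts)

print("all accounts:    median", all_median, "mean", all_mean)
print("active accounts: median", active_median, "mean", active_mean)
```

With these made-up numbers the median over all accounts is zero, so "the typical user never tweets", while the mean is dragged upward by a single heavy account. Neither figure describes anyone who is actually in the sample.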

That earlier HP research, which took place in late 2008, found that 68% of the accounts in its sample were active. This raises the question of how much the dynamics on Twitter have changed since the shark jumping of Jan-Feb 2009, or whether different sampling methods are producing drastically different pictures of Twitter usage.

This issue actually has important implications for Twitter’s design and engineering team. The recent backlash over @reply settings all started because Twitter made a very specific change to a feature, based on evidence derived from a statistical analysis of accounts using the feature. Including millions of inactive accounts in this analysis would obviously skew such percentages away from what the active users are actually doing.
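Here is a rough sketch of how the choice of denominator changes a feature-usage percentage. Everything in it is hypothetical: the `Account` class, the `feature_share` helper, and the counts are made up for illustration and say nothing about Twitter's real data or how its team ran the analysis.

```python
from dataclasses import dataclass


@dataclass
class Account:
    active: bool        # hypothetical flag: has the account been used recently?
    uses_feature: bool  # hypothetical flag: has the account changed the setting?


def feature_share(accounts, active_only=False):
    """Fraction of accounts using the feature, optionally among active ones only."""
    pool = [a for a in accounts if a.active] if active_only else list(accounts)
    if not pool:
        return 0.0
    return sum(a.uses_feature for a in pool) / len(pool)


# Invented population: 200 active accounts (60 use the feature), 800 dormant.
accounts = (
    [Account(active=True, uses_feature=True)] * 60
    + [Account(active=True, uses_feature=False)] * 140
    + [Account(active=False, uses_feature=False)] * 800
)

share_all = feature_share(accounts)                       # 60 of 1000
share_active = feature_share(accounts, active_only=True)  # 60 of 200

print(f"share of all accounts:    {share_all:.0%}")
print(f"share of active accounts: {share_active:.0%}")
```

In this toy population the same 60 accounts look like a fringe when measured against everyone and like a substantial minority when measured against the people who actually show up.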

What is definitely clear is that on the whole, Twitter is dominated by the activities of a small number of people. It remains to be seen whether Twitter activity fits harmoniously with regular power-law or Pareto distributions, or whether it lies completely off the curve in some uniquely unpredictable skew.
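One simple way to put a number on that kind of dominance is to ask what share of all tweets comes from the most active slice of accounts. The sketch below does only that. It does not test whether the data follows a power law, the `top_share` helper is my own invention, and the counts are made up for illustration.

```python
def top_share(counts, fraction):
    """Share of total activity produced by the most active `fraction` of accounts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one account in the top group.
    k = max(1, round(fraction * len(counts)))
    top = sorted(counts, reverse=True)[:k]
    return sum(top) / total


# Invented sample of 100 accounts: 80 silent, 10 with a single tweet,
# 9 with ten tweets each, and one prolific account with 900.
tweet_counts = [0] * 80 + [1] * 10 + [10] * 9 + [900]

share_top_10_percent = top_share(tweet_counts, 0.10)
share_top_1_percent = top_share(tweet_counts, 0.01)

print(f"top 10% of accounts: {share_top_10_percent:.0%} of tweets")
print(f"top 1% of accounts:  {share_top_1_percent:.0%} of tweets")
```

Running the same calculation at several cut-offs (1%, 10%, 20%) against real data would at least show how far the concentration sits from the familiar 80/20 rule of thumb.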

The most interesting aspect of this research is the analysis of the gender of followers. Men seem to follow men in far greater numbers than on other social networks, whereas the accounts women follow are more evenly balanced across genders. Does this suggest a very visible split in behavior, with men jockeying for status while women seek connection? Do men care more about increasing their follower count? What about blind following, or people using autofollow? Or companies and organizations, which are genderless?

I hope these questions show how difficult it is to draw accurate and valid social interpretations from gross aggregates. It is very trendy to use data to make sweeping inferences about online behavior, but we should always be cautious about what conclusions we can draw from damned lies and statistics.

Update: Stinky Blogging Stats illustrates exactly what happens when averages, medians, and outliers are misappropriated to make a journalistic point.