Notes Antic Disposition

Counting Accented Vowels

The Twitter API Wiki has one of the most practical explanations I’ve seen of some of the issues involving Unicode normalization. This is particularly important for Twitter, because of the 140-character limit on tweets – arsing up this character count would be a horrendous usability mistake. As the wiki puts it:

Twitter has found that accented vowels cause the most confusion because English speakers simply expect them to work. Take the following example: the word "café". It turns out there are two byte sequences that look exactly the same, but use a different number of bytes.
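To make the two representations concrete, here is a small Python sketch (my own illustration, not part of the wiki text). It builds "café" once with the precomposed é (U+00E9) and once with a plain e followed by the combining acute accent (U+0301), then prints the UTF-8 bytes of each. The helper name `utf8_hex` is made up for this example.

```python
# Two ways to spell "café" that render identically.
composed = "caf\u00e9"      # c, a, f, precomposed é (U+00E9)
decomposed = "cafe\u0301"   # c, a, f, e, combining acute accent (U+0301)


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))


print(utf8_hex(composed))    # 0x63 0x61 0x66 0xC3 0xA9
print(utf8_hex(decomposed))  # 0x63 0x61 0x66 0x65 0xCC 0x81
print(composed == decomposed)  # False: same appearance, different data
```

The first form is five bytes and the second is six, even though both display as the same four-letter word.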

To the human eye the length is clearly four characters. Depending on how the data is represented this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count "café" as four characters no matter which representation is sent.
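One way to get a representation-independent count like this is to normalize the text before counting. The sketch below assumes normalization to NFC (canonical composition) followed by counting code points; the passage above does not say which method Twitter actually uses, so treat this as an illustration of the idea rather than Twitter's implementation. The function name `tweet_length` is hypothetical.

```python
import unicodedata


def tweet_length(text: str) -> int:
    """Count code points after NFC normalization.

    Assumption: NFC is used here for illustration only; the quoted
    text does not specify the normalization form.
    """
    return len(unicodedata.normalize("NFC", text))


composed = "caf\u00e9"      # precomposed é (U+00E9)
decomposed = "cafe\u0301"   # e + combining acute accent (U+0301)

# Raw code point counts differ between the two spellings.
print(len(composed), len(decomposed))                    # 4 5

# After normalization both count as four characters.
print(tweet_length(composed), tweet_length(decomposed))  # 4 4
```

Note that this counts code points, not what a reader perceives as characters: some base-plus-accent combinations have no precomposed form, so they would still count as more than one even after NFC.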