There’s a tendency on the part of designers, researchers, and others to assume that English-language users’ behaviors in social networks generalize to those of users of other languages. But in a recent study where we examined 62 million tweets collected over a four-week period, we found significant differences in how people of different language backgrounds used features such as URLs, hashtags, mentions, replies, and retweets. Our results suggest that social network users from different languages used Twitter for different purposes… and this may have interesting implications for designing cross-cultural communication tools.
To find out more, join us at ICWSM ’11, the 5th International AAAI Conference on Weblogs and Social Media (this July 17-21 in Barcelona, Spain), where we (Gregorio Convertino, Ed Chi, and I) will be reporting our findings in “Language matters in Twitter: A large scale study”.
In the meantime, read on for more about how we identified the top 10 languages on Twitter…because methods matter, especially for examining large-scale data sets.
The “Lingua Tweeta” — top 10 languages on Twitter
We collected 62,556,331 tweets over a four-week period in spring 2010 (about 2.2 million tweets per day, representing 3-4% of all public messages) using the streaming API for the Spritzer sample feed. Altogether, we identified 104 languages from these 62 million+ tweets. Here are the top 10, which accounted for more than 95% of the tweets. [Note: 0.03% of the tweets' languages could not be identified ("!!!"). And if a user posted tweets in multiple languages, we counted her multiple times. We also found that 0.61% of the tweets contained nothing but URLs, hashtags, or mentions.]
|Language||Tweets||% of dataset||Users||Tweets/user|
How did we identify the languages of over 62 million tweets?
First, we removed URLs, hashtags, and mentions from the text content of every tweet. Then, we identified the language of each tweet using a combination of Google’s Language API and LingPipe’s text classifier (we trained LingPipe with the Leipzig corpora collection to detect Danish, English, Estonian, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Serbian, Swedish, and Turkish). We used LingPipe as the primary method for detecting English and Japanese. If LingPipe identified the language to be English or Japanese with more than 95% confidence, we accepted the result. Otherwise, we invoked Google’s Language API (which has a rate limit so couldn’t be used as the primary method). Turns out, both tools exhibited a high degree of agreement for English and Japanese tweets.
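For readers who like to see the flow in code, here is a minimal Python sketch of the two-stage idea described above. It is not our actual implementation: the regular expressions, the function names, and the `primary` and `fallback` callables (stand-ins for a trained LingPipe classifier and a rate-limited detection API) are all assumptions made for illustration. Returning `None` for tweets that are empty after cleaning is also an assumption.

```python
import re
from typing import Callable, Optional, Tuple

# Illustrative patterns (assumptions) for the three token types removed
# before language detection: URLs, mentions, and hashtags.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

# Languages for which the primary classifier's answer is accepted,
# and the confidence it must exceed (95%, as described in the post).
PRIMARY_LANGS = {"en", "ja"}
CONFIDENCE_THRESHOLD = 0.95


def strip_twitter_tokens(text: str) -> str:
    """Remove URLs, mentions, and hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def detect_language(
    text: str,
    primary: Callable[[str], Tuple[str, float]],  # hypothetical: returns (lang, confidence)
    fallback: Callable[[str], str],               # hypothetical: returns lang
) -> Optional[str]:
    """Try the primary classifier first; fall back when it is not confident."""
    cleaned = strip_twitter_tokens(text)
    if not cleaned:
        # Assumption: a tweet with nothing but URLs, hashtags, or mentions
        # has no text left to classify.
        return None
    lang, confidence = primary(cleaned)
    if lang in PRIMARY_LANGS and confidence > CONFIDENCE_THRESHOLD:
        return lang
    # Only the uncertain or non-English/Japanese cases reach the
    # rate-limited fallback.
    return fallback(cleaned)


if __name__ == "__main__":
    # Stub classifiers, for demonstration only.
    def confident_english(_text: str) -> Tuple[str, float]:
        return ("en", 0.99)

    def always_unknown(_text: str) -> str:
        return "unknown"

    print(detect_language("Reading about #nlp http://example.com @friend",
                          confident_english, always_unknown))
```

The point of ordering the two detectors this way is that the bulk of the tweets never touch the rate-limited service, so the fallback is called only where it is needed.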
By the way: it’s amazing how APIs (application programming interfaces) have changed not only the way programs interact, but also the way we do research. So thank you to all of you who make APIs freely available!
Understanding the accuracy of automatic language detection
As with many automated detection techniques in natural language processing, mistakes happen. [In fact, resolving ambiguity is one of the harder aspects of natural language processing... we know this, because PARC has long invested in natural language research -- some of our technology was acquired by Microsoft through Powerset to be part of Bing.]
And on Twitter, which has a greater concentration of idiosyncratic or uniquely shortened messages, things get even more complicated. For example, our algorithm recognized the following tweet as non-English: “Got ur dirct msg. i’m lukng 4wrd 2 twt wit u too. so,wat doing ha…”. [Though I'm not even sure a human can parse this that easily, either!]
To validate the accuracy of our automatic language detection, we conducted a small human-coding study. We drew a random sample of 2,000 tweets from our data set. For each of the top 10 languages, we recruited two human judges who were either native speakers or proficient in the language to be recognized. All judges were given the same 2,000 tweets and asked to recognize the tweets written in their language of competence. During a second round of coding, each pair of judges re-coded the tweets on which there was any disagreement and resolved their differences. When we compared the results of the human judges and the algorithm, we found:
|Language||True Positives||True Negatives||False Negatives||False Positives||Cohen’s Kappa|
For example: our judges and the algorithm agreed that 974 of the 2,000 tweets were in English (true positives) and that 971 tweets were not (true negatives); but the algorithm misclassified 20 of the human-coded English tweets as non-English (false negatives) and 35 of the non-English tweets as English (false positives). Not bad at all. Using Cohen’s Kappa as a measure of inter-rater agreement [pdf], we found that there was substantial agreement between the judges and the algorithm (the rightmost column of the above table shows that the kappa values are 0.90 or higher for the top 7 languages).
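To make the kappa calculation concrete, here is a short Python sketch that computes Cohen's Kappa from the four counts in the English example above (974, 971, 20, and 35). The function name and argument names are illustrative assumptions, and the result is derived from those quoted counts rather than copied from the table.

```python
def cohens_kappa(tp: int, tn: int, fn: int, fp: int) -> float:
    """Cohen's Kappa for two raters (human judges vs. algorithm), two classes."""
    n = tp + tn + fn + fp
    # Observed agreement: fraction of tweets where both raters gave the same label.
    observed = (tp + tn) / n
    # Marginal totals for each rater.
    human_pos, human_neg = tp + fn, tn + fp
    algo_pos, algo_neg = tp + fp, tn + fn
    # Agreement expected by chance, given those marginals.
    expected = (human_pos * algo_pos + human_neg * algo_neg) / (n * n)
    if expected == 1.0:
        # Degenerate case: both raters put every item in the same single class.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Counts for English, as quoted in the text above.
    kappa = cohens_kappa(tp=974, tn=971, fn=20, fp=35)
    print(round(kappa, 3))  # about 0.945
```

Observed agreement here is 1,945 of 2,000 tweets (97.25%), while chance agreement is almost exactly 50% because the sample is split nearly evenly between English and non-English, which is why kappa lands close to 0.945.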
In fact, the algorithm did a really good job discriminating even between similar languages: we were surprised that no tweets in the human-coded Portuguese sample were misclassified as Spanish and vice versa. This was not so true of Indonesian and Malay — 3 of the 4 Indonesian false positives were human-coded as Malay, and 9 of the Malay false positives were human-coded as Indonesian.
Those were the methods. Now for the madness — stay tuned for my next post, where I’ll share how users from these top 10 languages adapted specific Twitter features such as URLs, hashtags, mentions, replies, and retweets.