Predicting Tweet Sensitivity through Content Analysis

Gothenburg.
The next AoIR 2010 speaker is David Houghton, whose interest is also in Twitter. He starts by pointing to a range of tweets that vary in their degree of mundanity and secrecy, and is interested in examining the linguistic differences between them. What threats to personal privacy result from the spread of gossip? How can levels of self-disclosure be measured – in breadth or depth, for example – so that users can be alerted when they might be compromising themselves by oversharing?

How do we enable users to share while protecting their concerns, and how do we inform them about potential harms? David collected 250 random tweets from both Twitter itself and Secret Tweet (a site which collects tweets containing sensitive information, it seems).

Linguistic rating showed that tweets from the latter site were indeed more sensitive in nature, and content analysis identified a range of linguistic markers through which the sensitivity of a tweet could be predicted: a heightened word count and more personal pronouns, past-tense verbs, work words, family words, human words, inhibition words, and sexual words (in combination) all predicted more sensitive information disclosure, for example. Conversely, more articles, uses of ‘you’ and ‘she/he’, swear words, question marks, exclamation marks, other punctuation, and filler words tended to predict more ‘normal’ information disclosure.
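To make the idea concrete, here is a minimal sketch of how counts of such markers might be turned into a crude sensitivity score. This is not Houghton's actual method, and the word lists below are illustrative placeholders rather than the study's real linguistic dictionaries:

```python
import re

# Placeholder marker lists; the study itself drew on much richer
# linguistic categories (LIWC-style dictionaries) than shown here.
SENSITIVE_MARKERS = {
    "personal_pronouns": {"i", "me", "my", "we", "us", "our"},
    "family_words": {"mum", "dad", "mother", "father", "sister", "brother"},
    "work_words": {"boss", "job", "office", "colleague", "work"},
}
NORMAL_MARKERS = {
    "you_she_he": {"you", "she", "he"},
    "articles": {"a", "an", "the"},
    "fillers": {"like", "well", "umm", "err"},
}

def tokenize(tweet: str) -> list[str]:
    """Lower-case word tokens, keeping apostrophes (e.g. "can't")."""
    return re.findall(r"[a-z']+", tweet.lower())

def sensitivity_score(tweet: str) -> float:
    """Crude score: sensitive-marker hits minus normal-marker hits,
    nudged upwards by word count (per the heightened-word-count finding)."""
    tokens = tokenize(tweet)
    sensitive = sum(t in words for words in SENSITIVE_MARKERS.values() for t in tokens)
    normal = sum(t in words for words in NORMAL_MARKERS.values() for t in tokens)
    return sensitive - normal + 0.05 * len(tokens)

if __name__ == "__main__":
    print(sensitivity_score("I can't tell my boss why I missed work yesterday"))
    print(sensitivity_score("Have you seen the new trailer? It looks great!"))
```

The first example scores well above the second, since it combines first-person pronouns with work and family-style vocabulary, while the second leans on ‘you’, articles, and question/exclamation marks.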

This adds to the tool set for qualitative and quantitative Twitter research, and also helps system developers to make users more aware of what they’re sharing. Tweets could also be automatically flagged as more or less sensitive, and on that basis made available to a smaller or larger group of followers. This could enhance online privacy and preserve users’ autonomy while alerting them to potential harm.
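As a rough illustration of that last idea, and again purely hypothetical rather than anything proposed in the talk: a client could reuse the sensitivity_score() sketch above to route a tweet to a narrower or wider audience, with the threshold and group names below being assumptions.

```python
def audience_for(tweet: str, threshold: float = 2.0) -> str:
    """Route a tweet to a narrower audience when it scores as sensitive.

    Reuses sensitivity_score() from the sketch above; the threshold and
    group names are hypothetical, and a real client would let the user
    confirm or override the suggestion rather than applying it silently.
    """
    if sensitivity_score(tweet) >= threshold:
        return "close_friends"   # smaller, trusted circle of followers
    return "all_followers"       # default, wider audience
```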