At Queensland University of Technology in Brisbane, Australia, where we’re based, we’re currently advertising two new positions relating to our research: a Postdoctoral Research Fellow (two years full-time) and a PhD researcher (including a three-year stipend). Both positions are associated with Axel Bruns’s ARC Future Fellowship project Understanding Intermedia Information Flows in the Australian Online Public Sphere. The two projects are closely related, and will use innovative research methodologies to work with a range of big data resources (from social media platforms and other sources) on user activity in the Australian online public sphere.
If you’re interested in applying for one or the other of these positions, please see the full calls for applications at the following URLs:
Compared to the excitement of January and February, March 2015 has turned out to be a quiet month in Australian public life, in spite of the New South Wales state election campaign which culminated in the re-election of Mike Baird’s Coalition government on 28 March. The immediate heat has dissipated from the leadership debate around PM Tony Abbott: contenders Malcolm Turnbull and Julie Bishop are playing the long game, and any potential new leadership challenge looks increasingly unlikely to happen before the May budget.
For the purposes of our Australian Twitter News Index (ATNIX), which tracks the sharing of links to Australian news and opinion sites on Twitter, this period of relative calm manifests in comparatively stable, regular link sharing patterns. ABC News and the Sydney Morning Herald continue to track neck-and-neck with some 315,000 to 320,000 links shared throughout the month, and are firmly established as Twitter news market leaders in Australia, with third-placed The Age reaching only some 135,000 tweets over the same period.
The major point of heightened activity during the month occurs in the week of 16 March, especially for ABC News, as the full aftermath of tropical cyclones Nathan (off far north Queensland), Olwyn (northwestern Australia), and – most devastatingly – Pam (which caused severe destruction in Vanuatu) became known. A report that several elderly Indigenous residents in Carnarvon were denied access to a cyclone shelter ahead of Olwyn’s arrival was especially widely retweeted. Meanwhile, a particularly spectacular Aurora Australis event which was visible even from the mainland generated additional shares for the ABC.
Meanwhile, major political stories fail to emerge beyond general day-to-day sharing. The Australian records a brief spike in shares on 2 March with an article suggesting that a major ally of Indonesian President Joko Widodo had come out against the death penalty for Bali Nine drug smugglers Myuran Sukumaran and Andrew Chan, and PM Tony Abbott’s swiftly withdrawn comparison of Bill Shorten with Joseph Goebbels in parliament causes a brief flurry of outrage on 19 and 20 March, but there is little sustained engagement with either of these stories, beyond average levels.
And finally, the comparatively uneventful end to the NSW election campaign (at least by contrast to the surprising outcome of the Queensland poll, one month earlier) similarly fails to significantly affect the sharing of links to Australian news and opinion sites – indeed, 28 and 29 March are the days which see the fewest links to the Sydney Morning Herald shared during March, even compared to the already lower weekend averages for the paper.
Such a lack of sharing does not represent a lack of interest in the results of the New South Wales election, however. Turning to our Experian Hitwise data, which show the total number of user visits to the leading Australian news and opinion sites, we can see that the Sydney Morning Herald and – especially – ABC News record comparatively strong results on 28 and 29 March; with almost 770,000 visits to its site, ABC News in particular receives as many visitors on the Sunday as it usually does on weekdays. This points strongly to the ABC’s continuing role as the nation’s premier source of information on election results – similar to the patterns we observed in the previous Queensland election.
However, especially in the absence of any major election surprises, it is also evident that Twitter and (presumably) other social media users did not feel the need to specifically share the NSW election results with their followers and friends: the elevated levels of access to the ABC and other news sites on and after election day did not result in significant additional shares. Had there been any unforeseen developments, the picture would likely have been very different.
Standard background information: ATNIX is based on tracking all tweets which contain links pointing to the URLs of a large selection of leading Australian news and opinion sites (even if those links have been shortened at some point). Datasets for those sites which cover more than just news and opinion (abc.net.au, sbs.com.au, ninemsn.com.au) are filtered to exclude the non-news sections of those sites (e.g. abc.net.au/tv, catchup.ninemsn.com.au). Data on Australian Internet users’ news browsing patterns are provided courtesy of Experian Marketing Services Australia. This research is supported by the ARC Future Fellowship project “Understanding Intermedia Information Flows in the Australian Online Public Sphere”.
This post builds on the new approach to transforming Twitter datasets generated by the TCAT tracking tool for analysis in Tableau which I’ve introduced in my recent posts. Often, we will be interested in exploring the structure of Twitter communities as they form around given hashtags or keywords – for instance to examine whether they really act as communities in a narrow sense, or are rather merely groups or publics who are in some way connected to the hashtag, but barely aware of each other’s presence.
In the past, we’ve used one of our Gawk scripts, metrify.awk, to generate a range of metrics which provided detailed information on the dynamics of a dataset over time, across individual users, and across different groups of accounts as defined by their level of activity; I explained that process in a multi-part post in 2012 (1, 2, 3, 4, and follow-up). With the move from yourTwapperkeeper and Excel to TCAT and Tableau, most of this analysis can now be done within Tableau itself, working directly from the source TCAT dataset and the additional helper datasets which our TCAT-Process scripts generate. What’s still missing from the mix is a method for exploring the contributions of the different groups of accounts, though – this post outlines the steps for generating these metrics from within Tableau itself.

Introducing Percentile Groups
It’s well established that the distribution of activity levels across a given group of social media accounts will often follow a ‘long tail’ distribution: a very small number of accounts are very heavy contributors to a hashtag or a discussion, while a large number of others are contributing only very occasionally. The exact balance between these groups, and the exact nature of their respective contributions, can tell us a great deal about the dynamics of the overall Twitter public gathered around the shared hashtag or theme – the lead users may contribute in different ways from the least active users, for example by including more URLs in their tweets, or by taking a more discursive approach that features more @replies than retweets. We’ve used such observations very effectively in the past to distinguish between different types of hashtag events, and to pinpoint useful areas for further close reading of tweets.
What’s often used in this context is a 1/9/90 division between participants: ordered by their number of contributions to the conversation, the top 1% of accounts are identified as lead users; the next 9% as highly active users; and the remaining 90% as least active users. Other divisions are also possible, of course; what is most appropriate will depend on the specific dataset at hand, and on the research questions asked of it. For the purposes of this post, we’ll continue with the 1/9/90 division of accounts into three percentile groups.
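To make the mechanics concrete, here is a minimal Python sketch (not part of the original TCAT/Tableau workflow) of a 1/9/90 split over a dictionary of per-account tweet counts; the tie handling mirrors the behaviour described later in this post, where accounts sharing a boundary value all move into the higher group:

```python
import bisect

def percentile_ranking(tweet_counts):
    """Percentile rank (0..1) of each account by tweet count.

    Tied accounts share the higher percentile, so boundary accounts
    are promoted into the higher group rather than split arbitrarily.
    """
    n = len(tweet_counts)
    sorted_counts = sorted(tweet_counts.values())
    return {
        user: (bisect.bisect_right(sorted_counts, count) - 1) / (n - 1)
        for user, count in tweet_counts.items()
    }

def group_label(percentile):
    """Map a percentile rank to the 1/9/90 groups used in this post."""
    if percentile > 0.99:
        return "lead"
    if percentile >= 0.90:
        return "highly active"
    return "least active"
```

With 100 accounts posting 1 to 100 tweets each, for instance, this yields one lead user, nine highly active users, and ninety least active users.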
Happily, it is fairly straightforward to create these percentile groups in Tableau. In the following discussion, I’m building on the processes outlined in my previous posts: so, we’ve already downloaded a full dataset export from TCAT (in my example, tweets about the attempted party leadership challenge to Australian Prime Minister Tony Abbott, using the #libspill hashtag and a number of related hashtags and keywords), and we’ve processed this dataset using the TCAT-Process scripts package I’ve made available here. We’ve also loaded and combined the resulting datasets in Tableau.
Now, the first new step is to create a new calculated field called ‘Percentile Ranking’ in Tableau, using the following formula:
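The formula itself appeared as a screenshot in the original post; reconstructed from the description that follows, it would have been along these lines (treat this as an assumption rather than a verbatim copy):

```
RANK_PERCENTILE(COUNTD([Id]))
```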
As usual, we are using COUNTD(Id) as the most reliable count of unique tweets; in a given list of items in Tableau, the RANK_PERCENTILE() formula then uses this count of tweets to calculate which percentile in the list a specific item occupies. The result will be a value between 0 (lowest percentile, 0%) and 1 (highest percentile, 100%).
In Tableau, we can now graph CNTD(Id) against From User Name, order the list by CNTD(Id), and add Percentile Ranking as a label; this generates an ordered list of participant accounts and shows their percentile ranking:
By this ranking, accounts with a Percentile Ranking greater than 0.99 are in the top 1% of lead users; accounts with a ranking between 0.9 and 0.99 are in the next 9% of highly active users; and the remainder of accounts with a ranking below 0.9 are in the bottom 90% of least active users.
However, in our further analysis we cannot use the Percentile Ranking field directly, as it is always freshly calculated depending on what fields are graphed against each other; we therefore have to persistently allocate accounts to the three groups we’ve defined. This is where Tableau gets uncharacteristically cumbersome for a moment:
(The overall process will remain the same for different percentile cutoffs, of course, and is even easier if you make only a simple distinction between a lead user group and the remainder of the userbase – for a 20/80 split, for example, simply create one group for accounts ranked above .8. However, finer gradations between multiple subgroups usually generate more useful analysis.)

Analysing the Groups’ Contributions
Having created these groupings, we can now begin to use them in our analysis. First, we should determine how many accounts there are in each of our groups, by showing the count of unique user names – i.e. CNTD(From User Name) – for each of the groups (note that I have also added a Grand Total row by selecting Analysis > Totals > Show Column Grand Totals):
By default, Tableau names the groups after the combination of selection criteria that the combined Sender Groups field was constructed from – but from the membership size of the groups listed above we know that the smallest group (the first row in the image above) must be the 1% of lead users, the second the 9% of highly active users, and the third the remainder of the 90% least active users. Right-clicking on each field and selecting “Edit Alias…” allows us to rename these fields to something more user-friendly.
It is also notable that while my dataset contains a total of 134,518 unique user names, the lead user group is made up of 1,347 accounts, and the two top groups together number 14,396 accounts in total – more than the 1% or (combined) 10% of 134,518 that they should contain. This is not an error, but simply a sign that Tableau does not play favourites: if there are multiple accounts at the boundary between two groups which equally fulfil the requirements for belonging to the higher-ranked group, Tableau will include them all, rather than arbitrarily sending some of them to the lower group in order not to expand the higher percentile group beyond the top 1% or 10%. In my sample dataset, for example, the cutoff for belonging to the lead user group was a total of 43 tweets sent, and multiple accounts had reached exactly that number.
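That boundary behaviour is easy to reproduce outside Tableau. In this hypothetical Python sketch (account names and counts invented for illustration), four accounts are tied exactly on the cutoff value, and all four land in the lead group even though 1% of 200 accounts is nominally only two:

```python
import bisect

# Hypothetical distribution: 196 low-volume accounts plus four accounts
# tied on 43 tweets each -- the cutoff value for the top 1%.
counts = {f"user{i}": i % 40 + 1 for i in range(196)}
counts.update({"a": 43, "b": 43, "c": 43, "d": 43})

n = len(counts)
sorted_counts = sorted(counts.values())

# Tied accounts share the higher percentile, so all four boundary
# accounts are promoted into the lead group rather than split arbitrarily.
lead = [user for user, count in counts.items()
        if (bisect.bisect_right(sorted_counts, count) - 1) / (n - 1) > 0.99]
# lead now contains all four tied accounts, not just two
```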
Next, we might want to explore the number and types of tweets sent by each group – so here, I’ve graphed Sender Groups against CNTD(Id), and coloured by Type (again with an added Grand Total column). What becomes evident in my sample is that the 1,347 lead users contributed more tweets than each of the other two groups, and that they were especially active in sending @mentions and retweets:
Replacing Type with Hashtag as the field determining colour, we can also identify highly divergent hashtagging practices – the lead users almost always included a hashtag, while almost two thirds of the tweets posted by the least active users did not contain hashtags (note again that the tweet numbers here are higher than in the previous graph, and the percentages add up to more than 100%, because tweets can contain two or more hashtags). Incidentally, I’ve displayed the percentages by adding CNTD(Id) to Label and calculating its value as a percentage of total, using Table (Down) as the calculating method:
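The over-100% arithmetic is worth spelling out: a tweet carrying several hashtags is counted once under each hashtag, while the percentage base remains the number of distinct tweets. A hypothetical three-tweet example:

```python
from collections import Counter

# Hypothetical sample: three tweets, one of which carries two hashtags.
tweets = {
    1: ["#libspill"],
    2: ["#libspill", "#auspol"],
    3: ["#auspol"],
}

total = len(tweets)  # distinct tweets, as COUNTD(Id) would report
tag_counts = Counter(tag for tags in tweets.values() for tag in set(tags))
percentages = {tag: 100 * count / total for tag, count in tag_counts.items()}

# Each hashtag appears in two of the three tweets, so the two percentages
# sum to roughly 133%: the doubly-tagged tweet is counted under both.
```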
Many further permutations of these analyses are also possible, of course – we might explore, for example, whether there are differences in the URLs each group is sharing (are they using distinctly different domains as their sources of information?), or whether they are tweeting from notably different devices (as indicated by the Source field).
Further, we can also examine the contributions by these groups over time:
In my example, this shows that the lead users are responsible for a volume of tweets that usually closely matches that contributed by the (much larger) group of highly active users, while the least active users are less engaged during peak times, but make up for this by maintaining greater levels of activity outside of peak periods. This could also indicate that the least active group contains a range of users whose tweets show up in our dataset as false positives (e.g. because they use the term ‘spill’ in non-#libspill-related contexts), which could be a good argument for excluding this group from the analysis altogether.
Using Percentile Groups Elsewhere
While this post has focussed on defining groups of accounts based on their active contributions to a dataset (i.e. the number of tweets they posted), the same approach can also be used for other distributions where grouping may be useful. For example, we might instead list the accounts being @mentioned (based on the To User field which the TCAT-Process scripts generate – not the unreliable To User Name field which the Twitter API itself provides) against the number of tweets mentioning them (via CNTD(Id)), and again calculate their percentile ranking. In fact, we can use the Percentile Ranking field we defined at the start of this post – it performs a new calculation for any list of items it is being applied to:
We should exclude “Null” from this list (which collects all the tweets which did not @mention or retweet another user), and can then again define a number of percentile groups following the process outlined above. For my sample dataset, this results in a group of 454 “Most Visible” accounts (the 1% of accounts who received the most @mentions and retweets), 4,596 “Highly Visible” accounts (the next 9%), and 40,272 “Least Visible” accounts (the remaining 90%). Note here, though, that by default Least Visible will contain the “Null” recipient (as Least Visible is simply a collection of all recipients that are not included in the other two groups), so we will need to manually filter out this recipient from all further analysis.
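In sketch form (again hypothetical Python rather than the Tableau workflow itself), the only extra step compared to the sender ranking is dropping the empty recipient before calculating percentiles:

```python
import bisect
from collections import Counter

# Hypothetical (sender, recipient) pairs; None marks tweets that
# @mention or retweet nobody -- the "Null" recipient in Tableau.
tweets = [("a", "x"), ("b", "x"), ("c", None), ("d", "y"), ("e", None)]

# Filter out the Null recipient before ranking, so that it cannot
# distort the least visible group.
mentions = Counter(recipient for _, recipient in tweets
                   if recipient is not None)

n = len(mentions)
sorted_counts = sorted(mentions.values())
visibility = {
    user: (bisect.bisect_right(sorted_counts, count) - 1) / (n - 1)
    for user, count in mentions.items()
}
# "x" (mentioned twice) ranks above "y" (mentioned once); None is absent.
```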
I’ve combined these groups into a Receiver Groups field, which we can now also use for some interesting analysis. First, for example, as expected the most visible accounts command the majority of @mentions and retweets:
Second, they are also especially popular with the most active participants in the discussion. Note especially how the least visible accounts are mainly mentioned by the least active participants – it seems that there are several separate discussion circles here:
And finally, turning the interactions between the various sender and receiver groups into a matrix and adding some further Tableau functionality into the mix, here’s a nice graph to end on. This shows the volume of activity from each sender to each receiver group, and breaks it down into @mentions and retweets:
Again, similar rankings can of course also be created for many of the other fields in our dataset – for example for the most frequently shared URLs (at the domain level, or for each fully qualified URL), the most prominent hashtags, even the most widely used tweeting platforms. Given the approaches I’ve outlined here, I hope these will be relatively easy to calculate now.