
Rebooting ATNIX, using MySQL and Tableau

Fri, 05/09/2014 - 15:33

During 2012 and 2013 I published a more-or-less-weekly overview of the news sharing patterns in the Australian Twittersphere, the Australian Twitter News Index (ATNIX), which I also crossposted to my column at The Conversation. Technical issues and the call of other commitments have forced me to put ATNIX on the backburner for some time, but it’s time now to update and restart the index. This post is about the methods used to generate ATNIX – and soon I’ll also post a first new analysis of the data.

First, a reminder: ATNIX builds on the fact that, given a domain name like abc.net.au, the Twitter search API will return all tweets which contain a link to that domain, even if the link has been shortened by Twitter’s mandatory URL shortener t.co and/or one of the other common URL shorteners (bit.ly, ow.ly, etc.). This makes it possible to use yourTwapperkeeper or other tracking tools to capture all of the tweets that link to one or more given domains, on an ongoing basis. As always with Twitter, there are limits to this approach, of course. First, strangely, the Twitter streaming API doesn’t provide identical functionality: tracking a term like abc.net.au there does precisely nothing. This means that (for the time being) we’ll continue to generate ATNIX using our trusty yourTwapperkeeper rather than more powerful tools like DMI-TCAT, since for tracking tweets the latter relies on the streaming API alone. Second, data gathering using the standard Twitter API is subject to Twitter’s 1% rule: we’ll miss tweets if the search results we should be getting add up to more than 1% of the total, global tweet volume at the time. In our present context, however, this seems unlikely unless an article on an Australian news Website goes viral globally.

Those limitations aside, we’re able to track Australian news sharing patterns on Twitter by tracking a list of Australian domain names. Our current list includes some 35 domains of news and commentary sites, ranging from abc.net.au to watoday.com.au, and from crikey.com.au to thesaturdaypaper.com.au. There are some problematic cases, though. Since its move from theconversation.edu.au to a .com address, and its establishment of a UK and international contributor base, The Conversation has a much broader audience than just Australian users, so its numbers may be somewhat inflated by comparison with Australian-only opinion sites. And comparatively recent international entries into the Australian news environment, such as The Guardian and Daily Mail, continue to operate their Australian editions under .com and .co.uk domains, respectively, with no way to distinguish specifically Australian content on the basis of the URL alone.

Given that The Conversation started in Australia and retains a very strong readership here, we’ll continue to include it in ATNIX (though with strong caveats); I’m afraid The Guardian and Daily Mail won’t be covered by ATNIX for the time being, however. In addition, we’ll also filter the data to exclude some sections of Websites which cover more than news alone – for example, abc.net.au’s TV guide, or sbs.com.au’s (very popular) Pop Asia section. We’re also using the full resolved URL to distinguish between Yahoo!7 News (au.news.yahoo.com) and The West Australian, which is also hosted by yahoo.com (at au.news.yahoo.com/thewest). In the past, ATNIX also used such URL paths to distinguish between news and opinion content on some of the leading news sites, but recent site redesigns at the ABC and on Fairfax sites mean that news and opinion content can no longer be distinguished in this way. Instead of this distinction, we’ll add a new table that focusses specifically on the opinion-only sites.
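To illustrate the kind of URL-based filtering this involves, here is a rough SQL sketch – the table name (tweets), column name (resolved_url), and section paths are placeholders for illustration only, not the actual structure of our dataset:

-- hypothetical example: exclude a non-news section of a broader site
SELECT *
FROM tweets
WHERE resolved_url LIKE '%abc.net.au%'
  AND resolved_url NOT LIKE '%abc.net.au/tv/%';   -- section path is illustrative only

-- hypothetical example: separate The West Australian from the rest of Yahoo!7 News
SELECT *
FROM tweets
WHERE resolved_url LIKE '%au.news.yahoo.com/thewest%';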

From CSV and Excel to MySQL and Tableau

So far, so good – this method of gathering and processing our data hasn’t really changed much since we started ATNIX. However, our previous method of storing and analysing these datasets (as comma- or tab-separated files, and in Excel) has become more and more unsustainable the longer we’ve continued ATNIX: the longitudinal analysis of these datasets rapidly runs into Excel’s antediluvian limit of ~1 million data rows per spreadsheet, and ATNIX often pulls in more tweets than that in a single month. The solution is the popular ‘big data’ analytics tool Tableau, which places no such limits on the size of its datasets, and is also able to connect to a range of database solutions rather than being limited to working with static data files.

It’s tempting to move immediately to a NoSQL solution for storing our ATNIX data, but for the moment (with some tens of millions of tweets in our current ATNIX dataset) a standard MySQL setup will still be sufficient – and once the data are in MySQL, transferring them to a more powerful cloud-based solution will be comparatively easy, anyway. So, the first step is to import our raw tab-separated data files (originally in yourTwapperkeeper format, and now with a few more columns added following the URL resolution process) into MySQL. Once we’ve set up a table with the right field structure, this is comparatively easy to do with a LOAD DATA statement in MySQL – but there’s a catch: as it turns out, MySQL takes a somewhat unusual approach to parsing input files for escape characters.
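For illustration, a much simplified version of such a table might look as follows – this is a sketch only, with placeholder table and column names rather than the full yourTwapperkeeper field structure:

CREATE TABLE `atnix_tweets` (
  `tweet_id`     BIGINT UNSIGNED NOT NULL,  -- Twitter's numeric tweet ID
  `from_user`    VARCHAR(255),              -- sending account's screen name
  `text`         VARCHAR(500),              -- tweet text
  `created_at`   DATETIME,                  -- time the tweet was sent
  `tweeted_url`  VARCHAR(2000),             -- (shortened) URL as it appeared in the tweet
  `resolved_url` VARCHAR(2000),             -- full URL after resolution
  `domain`       VARCHAR(255),              -- domain extracted from the resolved URL
  PRIMARY KEY (`tweet_id`)
) CHARACTER SET 'latin1';

Declaring the tweet ID as the primary key also means that the IGNORE keyword in the LOAD DATA statement below will simply skip any rows we have already imported.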

We’re using tab-separated files in order to ensure that commas and quotation marks in the raw tweet data don’t cause any problems in identifying the start and end of fields and rows correctly. (My modifications to the original yourTwapperkeeper export functions, available here, have already stripped out any stray tab or newline characters from the original tweets.) But when it imports data from a file, MySQL also pays attention to the \ character, treating any backslash as an escape character rather than a literal backslash. This is especially problematic if the backslash is followed by a letter like ‘n’ or ‘t’: in a tweet, \n/ would be treated as a newline character followed by a slash, and break the data being imported. Similarly, a \ at the very end of a tweet would escape the field-separating tab character which follows, so that all subsequent columns in that row shift one field to the left.

This behaviour can be fixed with a simple Gawk helper script, though, which escapes all backslashes in our import file by doubling them:

# escapebackslashes.awk - replaces any single backslash \ with an escaped
# backslash \\ ahead of MySQL importing
#
# usage: gawk -F , -f escapebackslashes.awk original.csv >escaped.csv
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

{
  gsub(/\\/, "\\\\")
  print
}

After importing the TSV file, all those double backslashes are single again.

Loading the escaped data file into MySQL is easy once a table containing the correct columns has been set up. The exact SQL query will depend on the table structure, but it will look something like this:

LOAD DATA LOCAL INFILE '…filename…'
  IGNORE INTO TABLE `…table…`
  CHARACTER SET 'latin1'
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
  IGNORE 1 LINES
  ( field1, field2, field3, … );
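Two details worth noting – and these are standard MySQL behaviour, not anything specific to our setup: the first IGNORE keyword tells MySQL to silently skip any rows that would duplicate an existing unique key (which makes it safe to re-import files that overlap with earlier imports), while IGNORE 1 LINES skips the header row at the top of the TSV file. A quick sanity check after each import, using the placeholder table name from the sketch above, is simply:

SELECT COUNT(*) FROM `atnix_tweets`;

The result should match the number of data rows in the source file, minus the header line and any duplicates.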

Moving our entire ATNIX dataset since mid-2012 into a single MySQL database has taken me some time, but we now finally have a full set, and the capability to continue to update it with new data as data gathering for ATNIX continues. In this database, each row contains a single tweet, including the tweet text, all the metadata gathered by yourTwapperkeeper, and the additional fields for the resolved URL which our URL resolution process has added. With this, we’re ready to explore the data using Tableau.

Introducing Tableau

In addition to reading CSV, TSV, and Excel files, Tableau (which is free to license for students, incidentally) is able to plug directly into MySQL and other databases – all that’s required are the server and login details. Where it surpasses Excel (other than in not being limited to just over one million rows per dataset) is in its ability to abstract from the raw data on the fly: advancing well beyond what Excel has to offer, Tableau is all pivot tables, all the time. We’re able to feed it the raw ATNIX data, tweet for tweet, and from this directly generate day-to-day, week-to-week, and month-to-month activity patterns, distributions of attention across the various news sites, and rankings of the most active news sharers, without the tedious steps of data processing, extraction, and aggregation which would have been required in Excel; we’re even able to define standard analytics dashboards which can be re-used time and again. And because we’re still working directly with the raw data in doing so, we retain the ability to drill down immediately to the individual tweets which are responsible for specific phenomena showing up in the analysis.
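To give a sense of the aggregation Tableau performs for us on the fly, here is the kind of query it effectively runs behind the scenes when we build a week-by-week view – again a sketch only, using the placeholder table and column names from the earlier examples:

-- tweets and unique users per news domain per week:
-- the sort of pivot Tableau builds interactively
SELECT domain,
       YEARWEEK(created_at) AS week,
       COUNT(*) AS tweets,
       COUNT(DISTINCT from_user) AS unique_users
FROM atnix_tweets
GROUP BY domain, YEARWEEK(created_at)
ORDER BY week, tweets DESC;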

The purpose of this post has been to document the processing steps we’re taking with ATNIX, from the raw yourTwapperkeeper data to our analysis using Tableau. As we complete the transition to this setup, I’ll also soon start to post more or less regular ATNIX analysis updates again. Down the track, I’m also hoping to compare the Australian patterns with similar data we’ve been gathering for Germany, Norway, and Sweden for some time now – and to examine how news sharing on Twitter in Australia compares with news engagement patterns on Facebook, and with more general patterns of access to Australian news sites. More to come!