I’ve previously introduced my TCAT-Process package of helper scripts (written in Gawk), which take exports of Twitter data from the Twitter Capture and Analysis Toolkit (TCAT), developed by the Digital Methods Initiative at the University of Amsterdam, and convert them to a format that is best suited to using the data in the analytics software Tableau. This post is an update that provides the latest version of TCAT-Process, and outlines some of the alternative options that are now available through further developments in TCAT itself.
But beyond these updates, the major new addition to this setup that I’ll describe here is a process for uploading the data to Google BigQuery. BigQuery is a commercial but very affordable platform for working quickly with large datasets, and uploading the processed Twitter data to BigQuery can considerably speed up the analytics process, especially because Tableau is able to connect natively to BigQuery databases. For large Twitter datasets – and when you’re dealing with major global events, they can easily grow to millions of tweets and gigabytes of data – this is a crucial advance: it outsources much of the computationally intensive processing to Google’s servers, rather than doing it locally on a desktop machine.
None of this is entirely easy – partly because recent changes to TCAT, while welcome where they’ve added new functionality, have at times also changed the underlying data formats. But hopefully the following outline of steps will help you get your data from TCAT into Tableau without too many problems.
Please note: if your version of TCAT has options to export hashtag and mention tables, if your dataset is small enough for these exports to work reliably, and if you’re not interested in resolving the URLs contained in your data, you won’t need to use TCAT-Process at all. In that case, skip the section below that discusses TCAT-Process, and load your TCAT exports directly into BigQuery, following the steps described below.

Downloading TCAT Data
The first step in the process is necessarily to export the data from TCAT. Recent versions of TCAT have added an export option in tab-separated format (as opposed to comma-separated), and this is by far the preferred format for our further work.
TCAT-Process is able to work with CSV files exported from TCAT, too, but if at all possible, choose TSV:
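TSV is preferable largely because tweet texts are full of commas and quote marks, which force CSV writers into quoting and escaping that downstream parsers can mishandle; tabs, by contrast, essentially never occur inside a tweet. A quick Python illustration of the difference, using a made-up tweet:

```python
import csv
import io

# A hypothetical tweet whose text contains a comma and quote marks --
# exactly the kind of content that complicates CSV escaping.
row = ["123456789012345678", 'She said "hello, world" and left']

# CSV has to quote the text field and double up its quote marks ...
csv_buf = io.StringIO()
csv.writer(csv_buf).writerow(row)
print(csv_buf.getvalue().strip())

# ... while TSV can store the text verbatim, and a plain split on
# tabs recovers the original fields without any escaping rules.
print("\t".join(row))
```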
Then, select the desired export parameters (dataset, timeframe, etc.), and export all tweets from your selection:
If your version of TCAT is recent enough, you’ll also have options to export hashtag and mention tables. Depending on your server setup, these may not work very well for very large datasets (of a million tweets or more): TCAT may time out or produce empty files if you try them. So export hashtags and mentions if you can, but if this doesn’t work we’ll use TCAT-Process to create them in a later step.
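For context, the hashtag and mention tables are ‘long-format’ tables: one row per hashtag (or mention) per tweet, extracted from the tweet text, so that each occurrence can be counted separately. A minimal Python sketch of the idea, using hypothetical rows and column names rather than TCAT’s actual field layout:

```python
import re

# Hypothetical mini-dataset: (tweet id, tweet text) pairs standing in
# for two columns of the full TCAT export.
tweets = [
    ("1", "Watching #Eurovision with @friend1 and @friend2"),
    ("2", "No hashtags here"),
    ("3", "#Eurovision #finals tonight"),
]

# One row per (tweet id, hashtag) occurrence, lowercased for counting.
hashtag_rows = [
    (tid, tag.lower())
    for tid, text in tweets
    for tag in re.findall(r"#(\w+)", text)
]

# Likewise, one row per (tweet id, mentioned user) occurrence.
mention_rows = [
    (tid, user.lower())
    for tid, text in tweets
    for user in re.findall(r"@(\w+)", text)
]

print(hashtag_rows)  # [('1', 'eurovision'), ('3', 'eurovision'), ('3', 'finals')]
print(mention_rows)  # [('1', 'friend1'), ('1', 'friend2')]
```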
Processing the Data with TCAT-Process

(As noted, skip this section if you’ve managed to export the full, hashtag, and mention tables from TCAT and are not interested in resolving the URLs in your data.)
First, download the latest version of TCAT-Process from here: tcat-process.zip. This contains a directory named _scripts that should be placed in your central Twitter data folder, alongside the folders that your various datasets are stored in.
If necessary, you also need to install Gawk and cURL, the tools which TCAT-Process relies on – see my previous post for more information on how to install these.
I’ve already outlined the inner workings of TCAT-Process in my previous post, and won’t repeat all of this here. However, there are a number of key additions to the processing options that are worth pointing out. Here is the full set of option switches that TCAT-Process now uses:
TCAT-Process will work best if you call the script from within the folder that contains your dataset. So, once you have downloaded your full dataset as filename.tsv, open a terminal window and make the dataset folder your current directory. To process the data file completely using TCAT-Process, use the following command:
gawk -f ..\_scripts\tcat-process.awk file=filename.tsv tcattype=tsv
If you’ve successfully downloaded the hashtag and mention tables that TCAT generates, you can keep TCAT-Process from generating those tables again from the main dataset:
gawk -f ..\_scripts\tcat-process.awk file=filename.tsv tcattype=tsv nohashtypes=1
(In each case, use ../_scripts/tcat-process.awk if you’re on a Mac. I’m assuming here that the _scripts folder sits within the directory that also contains the folder with your dataset.)
Press return when prompted, and wait until TCAT-Process completes. This may take some time depending on the size of your dataset – for datasets of more than 100MB, it could stretch to hours or days. URL processing takes up the vast majority of this time, so use nourls=1 if you’re happy to skip that part. Also, while the URLs are still processing you can already get started on uploading the full, hashtags, and mentions datasets to BigQuery.

Uploading Your Data to BigQuery
First, follow the instructions provided by Google to set up a BigQuery account. Note that BigQuery isn’t free – for very extensive datasets, you will eventually generate costs that would be charged to your credit card. That said, for the volume of Twitter data most of us will be working with, those costs are largely negligible, and will remain at a level of cents rather than dollars. See the pricing information provided by Google for more details, and use the History tab on the Billing page of the Google Cloud Platform to keep track of your current costs.
Once you are set up for Google BigQuery, we’ll first need to create a new project to hold your data. Go to the Google Cloud Console, and from the pull-down menu in the header select the Create a project… option. In the dialogue that pops up, choose a meaningful name for your project:
Next, use the Dashboard option to create a Cloud Storage bucket. A ‘bucket’ is a container for your uploaded data files. Give this a meaningful name, e.g. tcat_exports:
You should now see an empty file list for your new bucket:
To upload your data files to the Cloud Storage bucket, simply drag them onto the file list in your browser, and an upload progress indicator will appear. For each dataset, you should have up to four files to upload: your full data export from TCAT (filename …fullExport….tsv), the mentions and hashtag tables from TCAT if you have them (filenames …mentionExport….tsv and …hashtagExport….tsv), and/or the mentions, hashtags, and URL files you’ve created using TCAT-Process (filenames …-tweettypes.tsv, …-hashtags.tsv, and …-urlpaths.tsv). There is no need to upload any of the other files that TCAT-Process has created (and in fact you’re welcome to delete these at this point). Depending on the size of your datasets, and on the quality of your network connection, the upload may take some time – keep your browser window open while it progresses:
The uploaded files will appear in the bucket browser – and once they’re all there, we’re ready to import them into BigQuery. Use the Google Cloud Platform menu sidebar to go to Google BigQuery (the menu has many options, so you may want to search for ‘bigquery’):
The Twitter data project you’ve created should already be preselected. In the left sidebar, click on the pull-down arrow next to your project title, and create a new dataset. Give this a meaningful name that describes the contents of your data (for my example below, I’ll be using ‘eurovision’, since that’s what my TCAT dataset is about):
When the dataset appears under your project, mouse over it and click the + icon that appears to create a new data table:
In the Create Table dialogue that now appears, choose the following settings (start with the full export file, and repeat this for each subsequent file):
Next, we need to define an import schema for the new table. This schema describes the table you’re importing, and is therefore different for each table type. I’ve created a number of presets that describe each type – download and unpack the following file: bq-import-schemas.zip. Use a text editor to open the file that describes the table you’re importing:
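For reference, BigQuery’s Edit as Text box accepts a comma-separated list of name:type pairs (or, in newer interfaces, a JSON array of field definitions). The actual schemas are in the downloaded files; purely as an illustration – column names beyond id and time are indicative only – the start of a full-export schema looks something like:

```text
id:STRING,time:TIMESTAMP,created_at:STRING,from_user_name:STRING,text:STRING,...
```

Note that id is deliberately declared as a string and time as a timestamp – the reasons for this are discussed in the Tableau section below.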
Select the entire schema as shown in your text editor, and copy it to your clipboard. Back in BigQuery, below the Schema settings, click on Edit as Text and paste the schema into the textbox that appears:
Finally, in the Options settings choose Comma or Tab depending on the format of your datafiles, set Header rows to skip to 1, and tick Allow quoted newlines and Allow jagged rows:
When you’re all set, click Create Table and wait until the process completes. BigQuery will automatically switch to the Job History list, which shows the progress of your current data import job; click on the job to see more details. Once the job is complete, the icon to the left of the job will turn green.
If it turns red to indicate an error, check for what went wrong and click Repeat Load Job. In case there are any problems with your data files, you might want to adjust the Number of errors allowed setting in Options (say to a value of 10), or tick Ignore unknown values. (Such errors may be caused by particularly unusual characters in tweet texts: I’ve noticed that in its current version, the character combination \” is incorrectly encoded by TCAT as it exports the data, and confuses BigQuery as it parses the source file. But these errors should be very rare.)
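If those rare rows keep tripping up the import, one possible workaround – not part of TCAT-Process itself, so treat this as a hypothetical pre-cleaning step – is to rewrite the problematic backslash-quote sequence in each line before uploading:

```python
def clean_line(line: str) -> str:
    """Replace the mis-encoded backslash-quote sequence (\\")
    that can confuse BigQuery's parser with a plain quote mark."""
    return line.replace('\\"', '"')

# Hypothetical example line from a TCAT export, with the \" sequence:
raw = '123456789012345678\tShe tweeted: \\"hello\\"\t2016-03-06'
print(clean_line(raw))
```

Applied line by line over a whole export file (reading the original and writing a cleaned copy), this should remove the need to raise the allowed-errors threshold.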
If there is a persistent problem with importing your full export file from TCAT, you may be running an older version of TCAT which exports a different number of data fields, in a different order. In that case, try the import schema in TCAT BQ import schema (TCAT full export, 35 fields).txt, rather than the schema in TCAT BQ import schema (TCAT full export).txt. (To check which version you are running, export a small dataset from TCAT and open the export file in Excel or a text editor: the older TCAT version exports 35 fields, while the later version exports 36. There is also a much older version which exported only 22 fields, but hopefully most people will have moved on from this by now.)
Repeat these import steps – with the appropriate import schema in each case – for each of your data files, so that in the end you have up to four BigQuery tables (e.g. fullexport, tweettypes, hashtags, urlpaths – or, if you’re using the original TCAT exports of mentions and hashtags, fullexport, mentionexport, hashtagexport, urlpaths).
Note that if your source data are split across multiple data files – for instance because you’ve exported your dataset from TCAT every week or every month – you can also import multiple data files of the same type into the same table: the new data will simply be appended to the table, and the order in which we import the data into the table doesn’t matter because we’re accessing the content through Tableau anyway. So, for instance, if you have TCAT …fullExport… files for April, May, and June, use the approach described above for each of these files, but direct them all to a single fullexport table. Then, load the corresponding tweettypes files for the April, May, and June exports all into the same tweettypes table on BigQuery – and do the same for hashtags and urlpaths. You could also repeat this process for every future month, and so gradually add new data to your dataset as you continue your data-gathering activities.
At the conclusion of this process, all of our data are now in BigQuery and ready for use – which means that it’s time to switch to Tableau and connect to our data from there. Alternatively, of course, you could also query your datasets by using BigQuery’s own SQL query tools, or any other software that connects to BigQuery.

Accessing the Data through Tableau
Connecting to the data in Tableau is now very straightforward. Open Tableau, and select Google BigQuery from the Connect menu:
Now select your Project and Dataset – in my example, that’s TCAT Data and eurovision:
Drag the table that contains your full export data onto the Drag tables here canvas first – I’ve named it fullexport:
Now drop your other tables onto the same canvas. If you’ve processed everything using TCAT-Process, the results should look like this, and Tableau will already have guessed how you want to link these tables (it’s not quite correct in its guess, but we’ll address this in a second step):
For each of the linked tables, click on the overlapping circles and choose Left Join rather than Inner Join:
The end result should look as follows (though of course you may have chosen different table names, and there may not be a urlpaths table if you’ve chosen not to resolve URLs):
Alternatively, if you’re using the hashtagexport and mentionsexport tables from TCAT itself, Tableau will want you to select how these should be linked. Choose a Left Join, and link the tables on Id = Tweet Id in each case:
Then add the urlpaths table (if you’ve resolved the URLs), and make sure it’s connected with the full export table as a Left Join as well. The end result should look like this:
If you’re familiar with using Tableau to analyse the data in offline CSV and TSV files, you may be tempted to also create a data extract at this stage. This is not necessary when we’re working with BigQuery, because all of the data processing happens on Google’s servers. Creating an extract on the client side would in fact slow down your data processing considerably, especially when working with very large datasets – so stick with a Live connection:
When you’ve completed this setup, click Sheet 1 to go to a blank worksheet:
You’re now almost ready to analyse your data as you see fit. Two more setup steps will be useful, though: first, in analysing tweet volumes you’ll usually want to count the number of distinct tweet IDs, rather than simply the number of records in your dataset, as this provides the most accurate count. So, in the sidebar on the left, drag Id from Dimensions to Measures, where it will automatically be set to use Count (Distinct):
(To trigger this behaviour, our BigQuery import schema declared Id to be a string rather than a number, as you may have noticed earlier. While in principle, tweet IDs are numbers, of course, in our analysis we’ll never actually use them as such – what would be the point of calculating the sum or average of a bunch of IDs, after all?)
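The reason record counts overstate tweet volumes is that the left joins we set up earlier produce one row per joined hashtag or mention, so a tweet with several hashtags appears several times in the joined data. A small Python sketch of the difference, using hypothetical joined rows:

```python
# Hypothetical result of a left join between tweets and hashtags:
# (tweet id, hashtag) pairs. Tweet 1 carried two hashtags, so it
# appears twice; tweet 2 had none, but a left join still keeps it.
joined = [
    ("1", "eurovision"),
    ("1", "esc2016"),
    ("2", None),
    ("3", "eurovision"),
]

record_count = len(joined)                          # counts joined rows: 4
distinct_tweets = len({tid for tid, _ in joined})   # what COUNT(DISTINCT Id) gives: 3

print(record_count, distinct_tweets)
```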
Second, the Time field contains the timestamp for each tweet, but is set to UTC by default (it’s calculated automatically by BigQuery from a numerical Unix timestamp). To adjust this to the local time most appropriate for your further analysis, create a calculated field that adds or subtracts a number of hours from Time. Hover your mouse pointer over the Time field, click the down arrow that appears, and select Create > Calculated Field…:
Give the field a name (for instance the code for the timezone you’re shifting to), and enter the following formula:
In my example, to shift UTC to Australian Eastern Standard Time (AEST, or UTC+10), this is the formula to use:
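In Tableau’s calculation language, a fixed UTC+10 shift of this kind is typically written with the DATEADD function – something like DATEADD('hour', 10, [Time]), though this is my reconstruction rather than a formula quoted from the original screenshots, so check it against your own field names. The same logic in Python, as a quick sanity check:

```python
from datetime import datetime, timedelta

def to_aest(utc_time: datetime) -> datetime:
    """Shift a UTC timestamp forward ten hours to AEST (UTC+10).
    Note: a fixed offset deliberately ignores daylight saving."""
    return utc_time + timedelta(hours=10)

print(to_aest(datetime(2016, 5, 14, 12, 0)))  # 2016-05-14 22:00:00
```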
Click OK, and at the bottom of the Dimensions sidebar you should now find a new field with the name you’ve given it (e.g. AEST). Use this rather than Time itself for any further time-based analyses.
And with this final step, you’re ready to analyse your data, and benefit from the vastly improved analysis speeds that Google BigQuery offers, especially when you’re dealing with very large datasets. Some ideas for the range of analytical approaches you might want to pursue are outlined in my previous posts on using Tableau to analyse TCAT data – so perhaps start here or here?
In spite of my best intentions, I’m afraid the Australian Twitter News Index continues to be a somewhat irregular affair for the moment, and so this latest update once again covers a number of months: in this case, it’s reporting on news sharing patterns in Australia for the first quarter of 2016. We begin, therefore, with some of the overall trends in the data. Most notable, perhaps, is that The Conversation has advanced to become Australia’s third most widely shared news and opinion site: the more than 260,000 tweets linking to its content during January to March 2016 bested even the performance of such established mainstream news sites as news.com.au, The Age, and The Australian.
It is important to remember here, though, that this figure captures the global volume of tweets linking to each site: The Conversation’s continuing expansion into new territories (now including the UK, US, France, and southern Africa) no doubt accounted for a substantial portion of all tweets linking to the site, and – following ABC News and the Sydney Morning Herald – it is now well on its way to becoming a globally recognised news and opinion site. By comparison, we may also assume that those sites it has leapfrogged since our December update (when it was ranked sixth) continue to be popular mostly with a domestic audience, and largely fail to make much of an international impact.
While there is not enough space to cover all the specific news events contained in the present dataset, a handful of other observations are also worth making. First, somewhat hidden in the data is a spike in link sharing activity for a number of sites around Australia Day – and as we will see from our Hitwise data below, this translates into a substantial increase in site visitors especially for commentary site New Matilda: its article on that day’s Google doodle – a stunning artwork commemorating the Stolen Generations – accounts for some three quarters of the 2,000 tweets linking to New Matilda that day, and drives the number of total visits to the site to more than 500,000, when on normal days it struggles to break 50,000. Not all of the link sharing spikes on Australia Day are related to that story, however: the most widely shared piece on Nine News, by contrast, is about a US dog “accidentally” running a half-marathon, and also gains some 2,000 tweets.
Later in the quarter, SBS (and its subsidiary channel, National Indigenous Television) generates substantial impact, by comparison with its average level of visibility on Twitter, with a collection of articles addressing International Women’s Day. In total, its series of articles highlighting a range of inspiring women – including, in particular, Indigenous Australian leaders – as well as addressing continuing sexism and injustice towards women more than doubles the number of tweets linking to SBS content, to nearly 3,400 tweets on 6 March 2016.
Meanwhile, the most shared articles on the leading news sites paint a widely divergent picture of day-to-day politics. On ABC News, they reflect a strong focus on the environment: a 28 March story on the large-scale coral bleaching in the Great Barrier Reef received some 2,100 tweets, a 31 January report on the devastating impact of Tasmania’s summer fires gained 1,500 shares, and a 21 January article about the International Whaling Commission’s highly critical report on Japan’s continued illegal whaling programme received nearly 1,400 tweets.
At the Sydney Morning Herald, several issues vie for attention. Leading the pack is the paper’s 23 March report that the New South Wales Liberal Party ‘concealed’ illegal donations before the 2011 state election, attracting some 2,500 shares; but a 26 March article on “why Finland has the best schools” is also in the running, with 1,800 tweets linking to it (including quite possibly some from Finnish Twitter users). In third place, finally, is the SMH’s 18 January coverage of an Oxfam report on the growing global inequality between rich and poor; it received some 1,600 tweets.
As always, our Hitwise data on the total number of visits by Australian Internet users to each of the news and opinion sites we track paint a somewhat different picture, both in total numbers and in the distribution of attention. To begin with, here The Conversation is not ranked quite so highly, since the Hitwise figures do not include international visitors to the site; nonetheless, The Conversation ranks a strong second in the opinion category in Australia, following the Huffington Post’s Australian operation. HuffPo Australia, by contrast, is further ahead of competitors such as The New Daily than its link-sharing performance on Twitter would lead us to believe: New Daily readers seem more willing to promote its content in their tweets at this stage.
The overall ranking of the mainstream news sites in Australia has remained largely stable, and a notable gap in the number of total visits has developed between the top six sites and the remainder of all Australian news and opinion sites. Amongst that group, the Australian operation of the UK’s Daily Mail has dropped back again behind The Age and ABC News, following its excursion to fourth place during the November/December 2015 period; perhaps the particularly Australian focus of the news during the summer period (including the coverage around Australia Day and its various ceremonies and debates) has contributed to this renewed focus on more home-grown sites.
Also clearly visible in the Hitwise data is the substantial spike for New Matilda on Australia Day that we have already discussed above. Here, we see a clear demonstration of a site advancing – suddenly and briefly – well above its long-term baseline, but it is also notable that this does not have any lasting effect on New Matilda’s overall visitor numbers. It is quite possible, incidentally, that this increase was driven at least in part by Google itself: Google will often link to further information about the stories behind its doodles, and New Matilda’s story about it may well have been picked up as an article to link to, creating a feedback loop of attention.
What is striking about the key themes during the first quarter of 2016, then, is especially their focus on fundamental long-term topics, from the environment to reconciliation. As we return to the day-to-day politicking of a federal election year, we’re likely to see this replaced again by a considerably narrower focus on short-term issues.
Standard background information: ATNIX is based on tracking all tweets which contain links pointing to the URLs of a large selection of leading Australian news and opinion sites (even if those links have been shortened at some point). Datasets for those sites which cover more than just news and opinion (abc.net.au, sbs.com.au, ninemsn.com.au) are filtered to exclude the non-news sections of those sites (e.g. abc.net.au/tv, catchup.ninemsn.com.au). Data on Australian Internet users’ news browsing patterns are provided courtesy of Hitwise, a division of Connexity. This research is supported by the ARC Future Fellowship project “Understanding Intermedia Information Flows in the Australian Online Public Sphere”.