The final Web Science 2016 keynote for today is by Daniel Olmedilla, whose work at Facebook is to police the ads being posted on the site. Ads are the only part of Facebook where inherently unsolicited content is pushed to users, so the quality of those ads is crucial – users will want relevant and engaging content, while advertisers need to see a return on investment. Facebook itself must ensure that its business remains scalable and sustainable.
Key problem categories are legally prohibited content (e.g. ads for illegal drugs); shocking and scary content; sexually suggestive material; violent and confronting content; offensive before-and-after images; ads with inappropriate language; and images containing a large amount of text.
The review of ads commences immediately after their submission; it breaks down the ad into its constituent components (text, image, audio, video, linked Websites, etc.), scores these components using a number of computational models, and based on such scores initiates immediate approval, rejection, or further human review.
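To make this concrete, a minimal sketch of such a score-and-route step might look as follows (the component models, thresholds, and routing logic here are illustrative assumptions, not Facebook's actual system):

```python
# Hypothetical sketch of a score-and-route ad review step.
# Component model names and thresholds are invented for illustration.

from dataclasses import dataclass

@dataclass
class Ad:
    text: str
    image_bytes: bytes
    landing_page_url: str

def score_components(ad: Ad, models: dict) -> dict:
    """Run each constituent component of the ad through its own policy model."""
    return {
        "text": models["text"](ad.text),
        "image": models["image"](ad.image_bytes),
        "landing_page": models["landing_page"](ad.landing_page_url),
    }

def route(scores: dict, approve_below: float = 0.2, reject_above: float = 0.9) -> str:
    """Auto-approve clearly good ads, auto-reject clearly bad ones,
    and send the uncertain middle band to human review."""
    worst = max(scores.values())
    if worst >= reject_above:
        return "reject"
    if worst <= approve_below:
        return "approve"
    return "human_review"
```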
The challenges in such assessments are the fact that there is a large class imbalance (with the majority of all ads being of good quality); limited human reviewer capacity and accuracy (sometimes people get their assessment wrong); feature engineering on ad content (how to assess the different elements of a complex ad); the continuous arms race with bad advertisers (whose latest tricks the automated system still needs to learn); the global nature of Facebook (which operates in a wide range of cultures and languages); and the scalability of the systems (given the vast size of Facebook and its contents).
Machine learning plays an immense role in such processes, and sampling data from the totality of all Facebook data is a major challenge here. Selecting the training data is a non-trivial problem, especially across multiple languages: the vast majority of ads are fine, and random sampling would be a very inefficient approach to identifying inappropriate ads.
What is required are accurate estimates of the fraction of unsuitable ads, and finding them means optimising the time commitments of human reviewers: the more ads these reviewers need to look at, the worse their actual assessments become. Stratified sampling can help here at least to some extent; machine learning-assisted sampling is even more effective for this purpose. But this does not help with new types of unsuitable ads, and the model must constantly be recalibrated by introducing a small random sample of ads into the assessment process.
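A hedged sketch of what such model-assisted sampling with a small random calibration slice might look like (the budget split and the idea of a `risk_score` model are assumptions for illustration):

```python
import random

def select_for_review(ads, risk_score, budget, random_fraction=0.05):
    """Spend most of the review budget on ads a model flags as risky,
    but reserve a small random slice so the model can be recalibrated
    and genuinely new kinds of bad ads still surface."""
    n_random = max(1, int(budget * random_fraction))
    n_ranked = budget - n_random

    ranked = sorted(ads, key=risk_score, reverse=True)[:n_ranked]
    remaining = [ad for ad in ads if ad not in ranked]
    calibration = random.sample(remaining, min(n_random, len(remaining)))
    return ranked, calibration
```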
50% of the Facebook community does not speak English, so language diversity adds yet further complications. What is necessary here is to build a reliable translation model; this draws on large training corpora that generate quality word co-occurrence patterns and phrase translation pairs, as well as probability scores for the grammatical correctness of any given sentence.
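In the classic phrase-based framing, this amounts to combining phrase translation probabilities with a language-model fluency score; a toy sketch (the phrase table and stand-in language model below are invented, not Facebook's models):

```python
import math

# Toy phrase table: P(target_phrase | source_phrase); values are invented.
PHRASE_TABLE = {
    ("guten morgen", "good morning"): 0.8,
    ("guten morgen", "good tomorrow"): 0.1,
}

def lm_log_prob(sentence: str) -> float:
    """Stand-in for a language model scoring grammatical fluency."""
    return -0.5 * len(sentence.split())

def translation_score(source: str, candidate: str) -> float:
    """Noisy-channel style score: translation probability plus fluency."""
    p_translate = PHRASE_TABLE.get((source, candidate), 1e-6)
    return math.log(p_translate) + lm_log_prob(candidate)
```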
The models based on these data must be optimised and tuned for best performance, of course – and this is further complicated by the diversity in modes of expression from formal to informal, the presence of spelling errors and variations, and the lack of reliable foundational corpora for less widely used languages. This may be addressed at least in part by drawing on Facebook's own data, where users post the same content in multiple languages or link to content that exists in multiple language versions. But this, in turn, requires a reliable assessment of whether two given texts are direct translations of each other.
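One plausible (and purely hypothetical) way to make that assessment is to compare the two texts in a shared multilingual embedding space, for example:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def likely_translation(text_a: str, text_b: str, embed, threshold: float = 0.85) -> bool:
    """`embed` maps a sentence in any language into a shared vector space
    (e.g. a multilingual sentence encoder, assumed here); high cosine
    similarity suggests the two texts say the same thing."""
    return cosine(embed(text_a), embed(text_b)) >= threshold
```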
Yet another way to identify bad actors on Facebook is to utilise network analysis. The structure of the network around a given account seeking to post an ad reveals a great deal about how likely that account is to post unsuitable ads. Malicious accounts tend to artificially accumulate their networks, ideally mimicking what they believe genuine networks to look like; they may create clusters of fake users, and eventually attempt to infiltrate the networks of genuine users. These bad network regions can be detected by random walks from verified genuine regions, but this becomes more difficult once a genuine network has been infiltrated by bad accounts.
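A minimal sketch of the random-walk idea, in the spirit of SybilRank-style trust propagation (the graph representation and iteration count are assumptions, not the talk's specifics):

```python
def propagate_trust(graph, seeds, iterations=10):
    """graph: {account: [neighbours]}; seeds: verified genuine accounts.
    Trust mass starts on the seeds and spreads along friendship edges;
    regions that receive very little mass after a few short walks are
    suspicious, because fake clusters have few edges back into the
    genuine network."""
    trust = {node: 0.0 for node in graph}
    for seed in seeds:
        trust[seed] = 1.0 / len(seeds)

    for _ in range(iterations):
        next_trust = {node: 0.0 for node in graph}
        for node, neighbours in graph.items():
            if not neighbours:
                next_trust[node] += trust[node]
                continue
            share = trust[node] / len(neighbours)
            for neighbour in neighbours:
                next_trust[neighbour] += share
        trust = next_trust
    return trust  # low-trust accounts are candidates for closer scrutiny
```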
A further challenge is the analysis of images and videos included in ads. (The results of such image and video classification are shown in Facebook's LiveMap, incidentally.) This is an immense computational challenge, of course, and draws on the company's 40-petaflop GPU cluster. The system makes it possible to assess image similarity, and thereby to identify inappropriate image subjects. Now that Facebook has also launched its live video functionality, such detection of inappropriate content has become even more important, of course. Audio analysis is somewhat less urgent by comparison, largely because Facebook ads are muted by default – but the company monitors videos for specific sounds that may point to inappropriate material (e.g. gunshots, porn sounds).
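As a rough illustration of the image similarity step (the embedding model and the index of previously rejected images are assumptions for the sketch):

```python
import numpy as np

def nearest_known_bad(image_vec: np.ndarray, bad_index: np.ndarray) -> float:
    """Cosine similarity between an ad image's embedding and the closest
    embedding in an index of previously rejected images; a high value
    would flag the ad for rejection or human review."""
    norms = np.linalg.norm(bad_index, axis=1) * np.linalg.norm(image_vec)
    sims = (bad_index @ image_vec) / np.clip(norms, 1e-9, None)
    return float(sims.max())
```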
Finally, text detection in images and videos is also important, and there are some standard optical character recognition tools already available for this – but these do not work particularly well for complex backgrounds. The principal aim here is not to understand the text itself, though, but simply to determine how much text there is in an image.
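One simple way to approximate that "how much text" measure with off-the-shelf OCR tooling would be to sum the areas of detected word boxes (a sketch using pytesseract, which, as noted, struggles with complex backgrounds):

```python
from PIL import Image
import pytesseract

def text_coverage(path: str, min_conf: int = 60) -> float:
    """Estimate the fraction of an image's area covered by text by summing
    the bounding boxes of OCR-detected words; only a rough proxy."""
    img = Image.open(path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    text_area = 0
    for width, height, conf, word in zip(data["width"], data["height"],
                                         data["conf"], data["text"]):
        if str(word).strip() and int(float(conf)) >= min_conf:
            text_area += width * height
    return text_area / float(img.width * img.height)
```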
And again, with all of this the question of scalability is paramount. Facebook data are complex and vast, and the company has developed its own workflow language to connect a broad range of indicators into one assessment of the content. The models continue to get more complex, and so the workflow language needs to be flexible and powerful. There is a constant need to experiment, and given the enormous size of Facebook much of this process needs to be automated, with limited human oversight.
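Facebook's internal workflow language was not shown in detail, but the general shape of such a declarative pipeline might look something like this (entirely hypothetical indicator names and thresholds):

```python
# Hypothetical, simplified stand-in for a declarative review workflow:
# each step names an indicator, a threshold, and an action, so that new
# models can be wired in without rewriting the pipeline code.
WORKFLOW = [
    {"indicator": "image_text_coverage", "above": 0.20, "action": "reject"},
    {"indicator": "network_trust",       "below": 0.01, "action": "human_review"},
    {"indicator": "policy_risk",         "above": 0.90, "action": "reject"},
]

def run_workflow(indicators: dict, workflow=WORKFLOW) -> str:
    """Apply each rule in order; fall through to approval if none fires."""
    for step in workflow:
        value = indicators.get(step["indicator"], 0.0)
        if "above" in step and value > step["above"]:
            return step["action"]
        if "below" in step and value < step["below"]:
            return step["action"]
    return "approve"
```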