Introduction
One of the biggest challenges faced by social media networks today is content moderation. Users are free to upload text, images and videos that become visible and accessible to millions of other users. This becomes a problem when potentially offensive content is posted, such as content depicting violence, sexually explicit imagery, or the advocacy of weapons. While there are human content reviewers, the sheer volume of data produced makes this a problem that needs to be solved at scale, and thus a good candidate for machine learning techniques. There are numerous issues with content moderation [2]. One of the main problems is that the process is both inconsistent and lacks transparency about why a certain piece of content is flagged. Any robust automated system should therefore consistently capture the content that needs to be flagged and clearly explain the reason for each flagging decision. This matters because we want to scale with ease (automated flagging) while also providing feedback (explainability) both to the user uploading the content and to any human moderators auditing the moderation results, for example to prevent images from being mislabelled as offensive or to investigate appeals from users.
Goals and Objectives
The main objective of our project is to robustly classify images as offensive or non-offensive and flag the offensive ones. Images can be offensive due to the presence of violence, blood, weapons, sexual content or other graphic content. Our model should be able to discriminate between such inappropriate images and all other images.
The second objective of our project is to explain which regions in an image led to it being classified as offensive. The classification decision made by the system should also be explainable, to ensure that the system does not harbour biases such as flagging all images containing humans as offensive or tagging all images with red liquids as offensive. We will highlight the salient regions that contribute most to the decision taken by the system. This will help us understand whether the model has learned the discriminating factors accurately, and give us insight into the kinds of regions or objects that tend to cause an image to be mislabelled as offensive.
The input to our system will be images, and the output will be a binary classification labelling each image as offensive or non-offensive. Images labelled offensive will also be marked with the salient regions that caused them to be classified as offensive.
System Architecture
Approach
We will approach content moderation as a binary classification problem with two broad categories: offensive images and non-offensive images.
We propose to use Convolutional Neural Networks (CNNs) to perform a supervised binary classification of images into these two categories. We will be fine-tuning an existing CNN using transfer learning approaches to suit our domain.
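As a concrete illustration, the sketch below shows what this transfer-learning setup could look like in PyTorch. The ResNet-50 backbone, the ImageNet weights and the decision to freeze the backbone are assumptions for illustration; the actual network and framework are still to be finalized.

```python
import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on ImageNet (assumes a recent torchvision);
# the ResNet-50 backbone is an illustrative choice, not a final decision.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the convolutional backbone so that, at least initially,
# only the new classification head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-class ImageNet head with a 2-class head
# (offensive vs. non-offensive).
model.fc = nn.Linear(model.fc.in_features, 2)
```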
We intend to use Grad-CAM [1], a class-discriminative localization technique, to generate visual explanations of the salient regions that caused an image to be classified as offensive.
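A rough sketch of how Grad-CAM can be computed with forward/backward hooks on the last convolutional block is shown below. It continues the ResNet example above (so `model` and `model.layer4` are assumptions tied to that sketch) and is an illustration rather than our final implementation.

```python
import torch
import torch.nn.functional as F

activations, gradients = {}, {}

def save_activation(module, inp, out):
    activations["value"] = out.detach()

def save_gradient(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()

# Hook the last convolutional block (layer4 for a ResNet backbone).
model.layer4.register_forward_hook(save_activation)
model.layer4.register_full_backward_hook(save_gradient)

def grad_cam(image, class_idx):
    """image: (1, 3, H, W) tensor; returns an (H, W) heatmap in [0, 1]."""
    model.eval()
    # Make gradients flow through the (possibly frozen) backbone
    # so the backward hook fires.
    image = image.clone().requires_grad_(True)
    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()

    acts = activations["value"]                       # (1, C, h, w)
    grads = gradients["value"]                        # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]
```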
In performing the classification and generating visual explanations, we build what would form the basis (a minimum viable product version) of a content moderation system for offensive images. The system could help end-users / human moderators identify images to be flagged as well as ‘see’ what caused the image to be flagged as offensive. In further experiments, we aim to explore how augmenting an image and/or replacing certain parts of the image with specific offensive/non-offensive content changes the behaviour.
Why are we approaching the problem as a classification problem and proposing the use of CNNs? What about object detection or any other techniques?
We expect that there are latent factors involved in identifying an image as potentially offensive. For instance, when differentiating between images portraying violence and other images, we may not always be able to look for specific objects (such as knives) or cues like fire or blood in the scene in a manner that scales. Classifying the images with a fine-tuned Convolutional Neural Network should help us capture the nuances involved in flagging them, and the proposed approach also lets us understand and visualize the reasoning behind those decisions. We believe such a model is more helpful in a real-world setting, where there is usually a combination of human content reviewers, rule-based systems and ML systems working together. It could help reduce the gap between the machine's reasoning about a piece of content and that of a human.
Experiments and Results
Datasets
- Kaggle video dataset [3]: This dataset contains a balanced mix of violent and non-graphic videos from YouTube. The real-life violent content comes mostly from street fights; the remaining videos show regular activities such as eating and playing football. We intend to extract frames from each video and treat them as separate, labelled images (see the frame-extraction sketch after this list). The wide range of activities in the dataset should help make our model robust and handle corner cases and false positives.
- Violent Scenes Dataset [4]: This dataset is derived from video clips of violent films. Apart from physical fights, the scenes include the presence of blood, fire, guns, cold arms, car chases and other gory imagery. We found this dataset suitable because it covers different aspects of violence, in varying intensities and amounts, which should let our model explain a wide range of salient violence features. We have contacted the owners of this dataset and have been granted permission to use it for our project.
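As referenced above, a minimal sketch of the frame-extraction step using OpenCV; the sampling rate, file names and directory layout here are placeholder assumptions.

```python
import os
import cv2

def extract_frames(video_path, out_dir, every_n_frames=30):
    """Sample every n-th frame from a video and save it as a labelled image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example (hypothetical paths): frames from a violent clip inherit the
# 'offensive' label of the source video.
# extract_frames("fight_001.mp4", "data/offensive/fight_001")
```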
System Model
Our model will use Convolutional Neural Networks (CNNs) for the classification problem. Since we lack an extensive dataset, we will use transfer learning on a network pre-trained for scene classification. Even with limited data, transfer learning has been shown to improve accuracy and training speed compared to training a model from scratch. We intend to fine-tune the existing model to suit our context.
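A minimal fine-tuning loop is sketched below. It assumes the frozen-backbone setup from the earlier sketch, an `ImageFolder`-style dataset with `non_offensive/` and `offensive/` subdirectories, and ImageNet normalization statistics; all of these are placeholder assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumed layout: data/train/non_offensive/*.jpg, data/train/offensive/*.jpg
train_ds = datasets.ImageFolder("data/train", transform=preprocess)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

criterion = nn.CrossEntropyLoss()
# Only the new classification head has requires_grad=True, so only it is updated.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

model.train()
for epoch in range(5):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```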
We will integrate Grad-CAM within our model to explain its classification decisions and identify salient regions in the image. This will enable us to create a model that transparently explains the decisions taken, and will also help us identify the context the system has learned.
Measures of Success
The upfront measure of success for our project is being able to classify violent images as offensive and to clearly point to the localized region in the image that caused the image to be classified as such. A further measure of success is being able to classify scenes with different types of violent activity (from fist-fights to gory accident scenes) and varying intensities of violence (two people fighting versus mob violence). A crucial metric for our project is to perform classification with as few false negatives as possible. We also want the model to learn context-specific appropriateness for images, i.e., a knife present in the image of a kitchen should not cause the image to be classified as offensive.
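Since false negatives carry the highest cost, recall on the offensive class is the number we will watch most closely. A small sketch of how we might compute these metrics with scikit-learn follows; the label convention (1 = offensive) and the toy predictions are assumptions for illustration only.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Assumed label convention: 1 = offensive, 0 = non-offensive.
# Toy values standing in for real model predictions.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", precision_score(y_true, y_pred))
print("recall (controls false negatives):", recall_score(y_true, y_pred))
print("false negatives:", fn)
```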
Classification Outputs
| Input | Output (Classification + Salient Regions) |
|---|---|
| Images with blood, fire or violence | Offensive (True Positive) |
| Normal images, such as people playing, people eating, or objects on a table | Non-Offensive (True Negative) |
| Images that get mislabelled due to the presence of violent elements taken out of context | Offensive (False Positive) |
| Images that are offensive but do not get categorized as such; this has the highest misclassification cost for our system | Non-Offensive (False Negative) |
List of Experiments
| Experiment | Motivation Behind the Experiment | Expected Result | Uncertainties in Results/Process |
|---|---|---|---|
| Transfer learning to classify images as offensive or not | Identify the latent factors involved in marking an image as potentially offensive | Binary classification and salient-region detection | |
| Using an augmented dataset to train the model | Improve the accuracy of our model | Higher precision and accuracy | Obtaining/annotating an augmented dataset |
| Replacing salient regions detected by Grad-CAM with non-offensive objects (e.g. replacing a gun in an image with a flower pot) | Check whether the classification depends only on the objects or also on the context of the scene | The image should now be classified as non-offensive, given that other parts of the image have no offensive content | Is replacing one salient object in the image enough? |
| Replacing salient regions in violent images with a blurred version (a blurring sketch follows this table) | Automatically filter out potentially offensive images | Possible extension to auto-moderate and blur offensive sections | |
| Testing the model with degrees of violence (e.g. replacing one gunman with a mob) | See how the classification/explanation changes if the replacement is (relatively) less or more violent | Images should be labelled correctly according to their context | The model may not pick up heavily suppressed instances of potential violence |
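For the blurring experiment referenced in the table, one possible implementation blends a blurred copy of the image back in wherever the Grad-CAM heatmap exceeds a threshold. The threshold and kernel size below are placeholder choices, and `blur_salient_regions` is a hypothetical helper, not part of any library.

```python
import cv2
import numpy as np

def blur_salient_regions(image_bgr, heatmap, threshold=0.5, ksize=51):
    """image_bgr: HxWx3 uint8 image; heatmap: HxW float in [0, 1] (e.g. from Grad-CAM)."""
    blurred = cv2.GaussianBlur(image_bgr, (ksize, ksize), 0)
    # Binary mask of the salient (offensive) regions, broadcast over channels.
    mask = (heatmap >= threshold).astype(np.float32)[..., None]
    out = mask * blurred + (1.0 - mask) * image_bgr.astype(np.float32)
    return out.astype(np.uint8)
```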
Possible Extensions to our Approach:
- Replacing salient regions detected by Grad-CAM with non-offensive objects and checking whether the classification output changes. This tests whether the classification depends only on the objects or also on the context of the scene.
- Data augmentation by adding offensive objects to non-graphic images and checking whether the classification of the scene changes (a compositing sketch follows this list).
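One way we might implement this compositing is with a simple alpha-paste of an offensive object cutout onto a benign scene using Pillow. The file names, position and scale below are placeholders, and `paste_object` is a hypothetical helper written for this sketch.

```python
from PIL import Image

def paste_object(scene_path, object_path, position=(50, 50), scale=0.3):
    """Paste an RGBA object cutout (e.g. a weapon) onto a non-graphic scene."""
    scene = Image.open(scene_path).convert("RGB")
    obj = Image.open(object_path).convert("RGBA")
    # Resize the cutout relative to the scene width, preserving aspect ratio.
    new_w = int(scene.width * scale)
    new_h = int(obj.height * new_w / obj.width)
    obj = obj.resize((new_w, new_h))
    scene.paste(obj, position, mask=obj)  # alpha channel used as the paste mask
    return scene

# Example with hypothetical files:
# augmented = paste_object("kitchen.jpg", "knife_cutout.png")
# augmented.save("kitchen_with_knife.jpg")
```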
Team
Shalini Chaudhuri ⋅ Rohit Mujumdar ⋅ Sushmita Singh ⋅ Sreehari Sreejith