January 3, 2020

Do your content moderators agree on their decisions?

Sami Virpioja, D.Sc. (Tech.), Lead Data Scientist

Utopia AI Moderator is an automated moderation service that learns each customer’s moderation policy from annotations made by human moderators. In this post, we consider variation in human moderation. We discuss why moderators disagree on their decisions and how to measure the disagreement, explain how it affects the evaluation of an AI model trained on the human decisions, and finally show how we use our tools to help our customers monitor and improve the quality of their moderation.

Inter-annotator agreement in natural language processing

Disagreement between human annotators is a well-known issue in machine learning and computational linguistics, as well as in other fields that rely on human-annotated data. The technical terms for the agreement between human annotators are inter-rater reliability and inter-annotator agreement. The simplest measure of agreement is accuracy: the number of samples annotated in the same way divided by the total number of annotated samples. A more robust measure is Cohen’s kappa coefficient, which normalizes the observed agreement of two annotators by the probability that they agree by chance, so that the value is 1.0 for complete agreement and 0.0 if the observed agreement is no better than chance. Unfortunately, the standard measures require that the same set of samples be annotated by all the annotators you want to compare, which is both expensive and impractical in many situations.
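To make the two measures concrete, here is a minimal Python sketch for two annotators who have labeled the same messages. The function names and the toy labels are made up for illustration and do not come from any particular library or from Utopia AI Moderator.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Share of samples that two annotators labeled identically."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement normalized by chance agreement."""
    n = len(labels_a)
    p_observed = observed_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label if each
    # annotator labels independently according to their own label frequencies.
    p_chance = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Toy data, made up for illustration.
moderator_1 = ["accepted", "accepted", "improper", "improper"]
moderator_2 = ["accepted", "improper", "improper", "improper"]

print(observed_agreement(moderator_1, moderator_2))  # 0.75
print(cohens_kappa(moderator_1, moderator_2))        # 0.5
```

In the toy data the annotators agree on three of four messages, but half of that agreement would be expected by chance, so kappa is considerably lower than the plain accuracy.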

How high an inter-annotator agreement you can expect depends on the nature of the task. For language-related tasks, one has to keep in mind that the use and interpretation of language is subjective: different people can interpret the same text in different ways. Of course, even within natural language processing, some tasks are less subjective than others. In a task like named-entity recognition (locating named entities such as “John Smith”, “Acme Corp.”, or “White House” in a text), the decisions are rather consistent: for example, Desmet and Hoste (2010)1 report Cohen’s kappa values above 0.9 for Dutch named-entity recognition. On the other hand, a task related to the semantics of language, such as sentiment analysis (labeling the different sentiments in a given text), is prone to lower agreement. For example, Bobicev and Sokolova (2017)2 report an average Cohen’s kappa of 0.46 over four sentiment labels.

How subjective is content moderation?

At Utopia Analytics, we deal a lot with content moderation. Although the majority of the text content posted online may be easy to classify as either improper (e.g. spam, swearing, illegal content) or accepted, there are many borderline cases where human moderators may (and will) disagree. Moreover, if the moderation decisions are also based on more general quality criteria, such as how much a message contributes to the ongoing discussion, things get even more subjective.

Typically, the stricter the moderation policy, the more variation there is between the moderators. Of course, several other factors also play a role: for example, how experienced the moderators are, how much time they take, whether they have any incentives for accuracy over speed, whether the moderation policy is clear to everyone, and whether the moderators discuss the policy together.

What does this mean for using AI models?

The subjectivity of the task and the disagreement of human annotators are a challenge for the evaluation of the predictive AI models applied by Utopia Analytics. A single AI model cannot perfectly predict the decisions of multiple human annotators who disagree with each other. Thus, regardless of how sophisticated the machine learning models in use are, it is impossible to reach 100% accuracy on inconsistently annotated data.

For a simple example, consider two moderators who do the same amount of work and agree with each other on 80% of the decisions. When an AI model tries to learn the moderation policy, the remaining 20% of the messages are, from the model’s point of view, randomly accepted or improper. If the model correctly predicts everything that the two moderators agree on, and gets half of the remaining 20% right by chance, the accuracy based on the pooled decisions will be 90%. (Note that this is still better than the 80% accuracy we get if we evaluate the decisions of the first moderator against the decisions of the second moderator, or vice versa!)
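The arithmetic behind this example can be written out as a small Python sketch. The function name and the `chance_hit_rate` parameter are hypothetical; the calculation only restates the assumptions above (two moderators with equal workloads, a model that is always right on the agreed messages and right half the time on the disputed ones).

```python
def pooled_accuracy(agreement_rate, chance_hit_rate=0.5):
    """Expected accuracy of a model against the pooled labels of two moderators.

    Assumes the model is always right where the moderators agree and matches
    the label on a disputed message with probability `chance_hit_rate`.
    """
    disputed = 1 - agreement_rate
    return agreement_rate + disputed * chance_hit_rate


print(round(pooled_accuracy(0.8), 3))  # 0.9
```

With full agreement the ceiling is 100%, and it drops by half a percentage point for every percentage point of disagreement between the two moderators.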

However, while conflicting annotations are a challenge for training and evaluating the AI models, the same models make it possible for us to give our customers insight into the consistency of their moderation.

Example: Visualizing moderators’ decisions

Here we look at an example case in which eight human moderators have annotated messages over a two-week period as either accepted (published) or improper (rejected).

We start by dividing the time range into one-hour windows and checking the number of messages in each hour:

At least a few messages are sent every hour, and the peak is around 250 messages at noon on the first Friday.

Next, let’s take a look at the proportion of messages moderated by each moderator. Here, each color represents one moderator (named from A to H), and the height of the color at each hour represents the proportion of messages processed by the moderator. We can see how the moderators mostly work in shifts, one at a time:

Now we can make a similar presentation of the annotation results, i.e. how many messages are annotated as accepted and improper each hour:

We can see that there is considerable variation over time in the proportion of improper messages, or improper ratio for short. Interestingly, the improper ratio looks correlated with the shifts of the human moderators. For example, whenever moderator H (grey color) is active, the improper ratio in this example data is comparatively low, roughly 20%, while whenever moderator F (brown color) is active, the improper ratio is twice as high.
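The hourly statistics behind these figures (message count, share of messages per moderator, and improper ratio) could be computed roughly as follows. This is only a sketch: the record layout `(timestamp, moderator, label)`, the function name, and the toy records are assumptions made for illustration, not the actual data or tooling.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hourly_stats(records):
    """Aggregate (timestamp, moderator, label) records into one-hour windows."""
    totals = Counter()
    improper = Counter()
    by_moderator = defaultdict(Counter)
    for timestamp, moderator, label in records:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[hour] += 1
        by_moderator[hour][moderator] += 1
        if label == "improper":
            improper[hour] += 1
    return {
        hour: {
            "messages": n,
            "improper_ratio": improper[hour] / n,
            "moderator_share": {m: c / n for m, c in by_moderator[hour].items()},
        }
        for hour, n in sorted(totals.items())
    }


# Toy records, made up for illustration.
records = [
    (datetime(2020, 1, 6, 12, 5), "F", "improper"),
    (datetime(2020, 1, 6, 12, 20), "F", "accepted"),
    (datetime(2020, 1, 6, 12, 40), "H", "accepted"),
    (datetime(2020, 1, 6, 12, 55), "F", "improper"),
    (datetime(2020, 1, 6, 13, 10), "H", "accepted"),
]

stats = hourly_stats(records)
for hour, row in stats.items():
    print(hour, row)
```

Plotting `messages`, `moderator_share`, and `improper_ratio` against the hour would give three charts of the kind described above.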

There can be several explanations for different moderators having different improper ratios. If the moderators can select what they moderate, they may choose different areas or topics, and some topics draw more heated discussions. Or if the moderators often work at different times of the day, it could be, for example, that users write more improper content during the nighttime. Neither of these is observable in our example data. Still, on different days the users may discuss different topics, some more controversial than others, resulting in a higher improper ratio. So how can we tell if the moderators have significant differences in their moderation decisions?

A direct way is to prepare a separate evaluation setup where the moderators moderate exactly the same set of messages and then compare their decisions, as mentioned at the beginning. Unfortunately, this may require a large number of examples, and it takes significant resources away from moderating the new content posted to the service.

An indirect way to find an answer would be to look at the behavior of the moderators over a longer time period, as random variations in the types of the moderated messages should then even out. However, this will not reveal which kinds of messages the moderators disagree on. And neither this nor the separate evaluation setup allows you to follow the variations in real time.

Using Utopia AI Moderator to measure annotator disagreement

Utopia AI Moderator offers an easy and robust way to analyze inter-annotator agreement. At the core of the Utopia AI Moderator service is an AI model trained on the examples from all human moderators, so it learns to predict decisions according to the general moderation policy. We can compare the improper ratio of the model to the improper ratio of a human moderator for the same sample of messages. If there is a more heated discussion on a particular day, the improper ratio of the Utopia AI Moderator will also reflect that. In this way, it does not even matter if several moderators work simultaneously, which would make it difficult to see the variation by just inspecting the improper ratio over time.

First, let’s take a look at a figure similar to the earlier one for the improper ratio of our data set, but now for the predicted ratio, that is, the decisions of the Utopia AI Moderator:

This figure clearly has smaller peaks and valleys than the previous one, indicating that the share of improper messages being sent does not vary much over time.

Next, let’s observe the improper rates separately for the subsets of the messages moderated by each moderator. We can draw the human (blue) and predicted (orange) improper rates side-by-side to easily observe the differences:

Now we can confirm that moderator F does have the highest improper rate among the moderators and that, for example, moderator H, whom we spotted before, has a significantly lower improper rate. Moreover, moderator E has an even lower improper rate than H, which was not as easy to spot from the previous graphs because of E’s shorter periods of active moderation time.

There is some variation in the predicted improper rates between the subsets of messages, which indicates differences in the messages checked by each moderator. However, the variation between the subsets is small compared to the variation among the moderators. And if we consider the two moderators with extreme improper rates, E and F, the model actually predicts a lower improper rate for the subset of messages moderated by F, who has the highest human improper rate.
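The per-moderator comparison of human and predicted improper rates can be sketched in a few lines of Python. The record layout `(moderator, human_label, predicted_label)`, the function name, and the toy records are assumptions for illustration only; they do not reflect the actual interface of Utopia AI Moderator.

```python
from collections import defaultdict


def improper_rates_by_moderator(records):
    """Human vs. predicted improper rate on each moderator's own messages."""
    counts = defaultdict(lambda: {"n": 0, "human": 0, "predicted": 0})
    for moderator, human_label, predicted_label in records:
        c = counts[moderator]
        c["n"] += 1
        c["human"] += human_label == "improper"
        c["predicted"] += predicted_label == "improper"
    rates = {}
    for moderator, c in sorted(counts.items()):
        human = c["human"] / c["n"]
        predicted = c["predicted"] / c["n"]
        # A positive gap means the moderator rejects more than the model
        # predicts for the same messages; a negative gap means fewer.
        rates[moderator] = {
            "human": human,
            "predicted": predicted,
            "gap": human - predicted,
        }
    return rates


# Toy records, made up for illustration.
records = [
    ("E", "accepted", "improper"),
    ("E", "accepted", "accepted"),
    ("E", "improper", "improper"),
    ("E", "accepted", "accepted"),
    ("F", "improper", "accepted"),
    ("F", "improper", "improper"),
    ("F", "accepted", "accepted"),
    ("F", "improper", "accepted"),
]

rates = improper_rates_by_moderator(records)
for moderator, row in rates.items():
    print(moderator, row)
```

Because the predicted rate is computed on exactly the messages each moderator handled, the gap separates differences in the moderators' decisions from differences in the messages they happened to see.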

While we cannot measure the exact agreement of the moderators without having them moderate the same set of messages, we can still estimate it. Moderator E had an improper ratio of 12.8% and moderator F 39.2%, so on a shared set of messages they would disagree on at least 39.2% - 12.8% = 26.4% of the messages.3 It may sound surprising that two moderators would make a different decision for more than every fourth message, but it happens.
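The estimate above, together with the upper limit for Cohen’s kappa given in footnote 3, can be reproduced with a short Python sketch. The function names are made up for illustration, and the calculation assumes that both moderators would keep their improper ratios on a shared set of messages.

```python
def min_disagreement(rate_a, rate_b):
    """Lower bound on disagreement implied by two improper ratios."""
    return abs(rate_a - rate_b)


def chance_agreement(rate_a, rate_b):
    """Probability that two independent annotators pick the same label."""
    return rate_a * rate_b + (1 - rate_a) * (1 - rate_b)


def max_kappa(rate_a, rate_b):
    """Upper bound on Cohen's kappa given only the two improper ratios."""
    p_observed_max = 1 - min_disagreement(rate_a, rate_b)
    p_chance = chance_agreement(rate_a, rate_b)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed_max - p_chance) / (1 - p_chance)


rate_e, rate_f = 0.128, 0.392  # improper ratios of moderators E and F

print(round(min_disagreement(rate_e, rate_f), 3))   # 0.264
print(round(chance_agreement(rate_e, rate_f), 3))   # 0.58
print(round(max_kappa(rate_e, rate_f), 3))          # 0.371
```

These are bounds only: the true disagreement can be larger, and the true kappa lower, if the two moderators also reject different messages.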

Apart from studying the agreement of the moderators, we can now take the individual messages where the Utopia AI Moderator and each of the human moderators disagree, and present them to our customer, who can then use them as examples to clarify their moderation policy. Once the moderation decisions become more consistent, the AI model can be trained again to improve its accuracy.


In content moderation, unclear goals and the general subjectivity of the task may cause large variations in how different moderators apply the moderation policy. Such inconsistent decisions present challenges for building predictive AI models. However, the same models can also be used to study and monitor the variation between the moderators, to provide feedback for the moderators, and finally to improve the overall quality of the moderation.

1 Bart Desmet and Véronique Hoste: Towards a Balanced Named Entity Corpus for Dutch. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta, 2010.

2 Victoria Bobicev and Marina Sokolova: Inter-Annotator Agreement in Sentiment Analysis: Machine Learning Perspective. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria, 2017.

3 Similarly, we can get an upper limit for Cohen’s kappa coefficient, mentioned at the beginning. Normalizing the upper limit for the observed agreement of annotators E and F (1 - 0.264 = 0.736) by the probability that they agree by chance (0.580), we get a coefficient of 0.371.
