Learning The Difference That Makes A Difference With Counterfactually-Augmented Data
Syed Saad Naseem
This paper addresses the problem of building models for NLP tasks that are robust to spurious correlations in the data. The authors tackle this problem with a human-in-the-loop method in which annotators are hired to revise a text, with minimal changes, so that it matches a different target label. For example, if a text expresses positive sentiment, the annotators edit it so that it expresses negative sentiment. They refer to this process as counterfactual augmentation. The authors apply this method to the IMDb sentiment dataset and to SNLI, and show that many models trained only on the original dataset do not perform well on the augmented dataset.
The authors used Amazon’s Mechanical Turk, a crowdsourcing platform, to recruit editors, and hired these editors to revise each document. For sentiment analysis, they directed the annotators to revise, for example, a negative movie review to make it positive, without making any gratuitous changes. For the NLI task, which is a 3-class classification task, the annotators were asked to modify the premise while keeping the hypothesis intact, and vice versa.
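The sentiment revision step described above can be illustrated with a minimal sketch. It assumes a binary label set and that each original review has exactly one human-written revision; the names `Review`, `flip_label`, and `augment` are hypothetical and not taken from the paper, and the example texts are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Review:
    text: str
    label: str  # assumed to be "positive" or "negative"


def flip_label(label: str) -> str:
    """Return the opposite sentiment label (binary case only)."""
    if label not in ("positive", "negative"):
        raise ValueError(f"unexpected label: {label}")
    return "negative" if label == "positive" else "positive"


def augment(originals: List[Review], revised_texts: List[str]) -> List[Review]:
    """Combine each original review with its human-written revision.

    The revision is assumed to carry the opposite label, because the
    annotator was asked to flip the sentiment with minimal edits.
    """
    if len(originals) != len(revised_texts):
        raise ValueError("each original needs exactly one revision")
    combined: List[Review] = []
    for original, revised_text in zip(originals, revised_texts):
        combined.append(original)
        combined.append(Review(revised_text, flip_label(original.label)))
    return combined


# Invented toy example, not data from the paper.
originals = [Review("The plot was dull and the acting was awful.", "negative")]
revisions = ["The plot was gripping and the acting was superb."]
augmented = augment(originals, revisions)
```

Keeping each original next to its revision makes it easy to build either the original-only or the combined training set from the same list.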
After the data collection, a different set of workers was employed to verify whether the given label accurately described the relationship between each premise-hypothesis pair. Each pair was presented to 3 workers, and the pair was accepted only if all 3 workers agreed that the label was accurate. This entire process cost the authors about $10,778.
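The acceptance rule as summarized here (3 workers, all must approve) can be written as a small filter. This is only a sketch of the rule as stated in this summary; the function names and the tuple layout are assumptions, and the example pairs are invented.

```python
from typing import List, Tuple

# Hypothetical record layout: (premise, hypothesis, label)
Pair = Tuple[str, str, str]


def is_accepted(votes: List[bool], num_workers: int = 3) -> bool:
    """Accept only if exactly num_workers votes were cast and all approve."""
    return len(votes) == num_workers and all(votes)


def filter_verified(candidates: List[Tuple[Pair, List[bool]]]) -> List[Pair]:
    """Keep the pairs whose label every verifying worker approved."""
    return [pair for pair, votes in candidates if is_accepted(votes)]


# Invented toy examples, not data from the paper.
candidates = [
    (("A man is sleeping.", "A man is awake.", "contradiction"), [True, True, True]),
    (("A dog runs outside.", "An animal is indoors.", "entailment"), [True, False, True]),
]
verified = filter_verified(candidates)
```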
The authors carried out experiments on a total of 5 models: Support Vector Machines (SVMs), Naive Bayes (NB) classifiers, Bidirectional Long Short-Term Memory networks (Bi-LSTMs), ELMo models with LSTM, and fine-tuned BERT models. Furthermore, they evaluated their models on Amazon reviews datasets aggregated over six genres, on a Twitter sentiment dataset, and on Yelp reviews released as part of a Yelp dataset challenge. They showed that in almost all cases, models trained on the counterfactually-augmented IMDb dataset perform better than models trained on comparable quantities of original data, as shown in the table below.