ltl.uni-due at SemEval-2019 Task 5: Simple but Effective Lexico-Semantic Features for Detecting Hate Speech in Twitter

In this paper, we present our contribution to SemEval 2019 Task 5, Multilingual Detection of Hate, specifically to Subtask A (English and Spanish). We compare different configurations of shallow and deep learning approaches on the English data and use the best-performing system for both languages. The resulting SVM-based system with lexico-semantic features (n-grams and embeddings) is ranked 23rd out of 69 on the English data and beats the baseline system. On the Spanish data, our system is ranked 25th out of 39.


Introduction
Hateful, abusive, or offensive statements that target individuals or groups on the basis of characteristics such as gender, nationality, or sexual orientation are called hate speech (Basile et al., 2019). Social media is particularly affected by hate speech, which is known to poison the communication climate, build up negative sentiment towards groups of people, or even lead to real-life consequences (Warner and Hirschberg, 2012; Waseem and Hovy, 2016; Schmidt and Wiegand, 2017; Wojatzki et al., 2018; Benikova et al., 2017; Ross et al., 2017).
In this work, we present our submission to the SemEval 2019 Task 5: Multilingual Detection of Hate (Subtask A) for English and Spanish. The objective in Subtask A was to build a system which is able to predict whether given tweets in English or in Spanish are hateful or not hateful towards women or immigrants.
We develop a hate speech detection system by experimenting with a range of classifiers that are based either on engineered features or on neural network architectures. We systematically compare the performance of these detection systems, and for our final submission (for both English and Spanish) we use the model that performs best on the English training data. Our best system is an SVM equipped with n-gram features and fastText (Mikolov et al., 2018) embeddings. It obtains the 23rd rank (out of 69) on the English dataset and the 25th rank (out of 39) on the Spanish dataset.

System Description
For our submission, we compare a wide range of neural and non-neural systems and submit the one that performs best in this evaluation. We developed and evaluated all approaches on the English dataset and applied the best-performing system as-is to the Spanish data. Below, we first briefly describe the provided data and then discuss the neural network and feature-engineering approaches for detecting whether tweets are hateful towards women or immigrants (Subtask A).
Dataset In Subtask A, the English training set consists of 9,000 tweets and the development set consists of 1,000 tweets. The Spanish training set consists of 5,000 tweets and the development set of 500 tweets. The test data contains 2,971 English and 1,600 Spanish tweets. For each tweet, the task organizers provided a binary annotation indicating whether the tweet is hateful or not hateful towards a given target (i.e. women or immigrants). An example of the label hateful (towards immigrants) is the tweet: This immigrant should be hung or shot! Period! Animal. https://t.co/wFcGoLCqJ5 An example of the label not hateful is the following tweet: Don't mess with these migrant dads #SkimmLife https://t.co/swVmkTlFRz via @theSkimm.
For more details on the dataset and its creation, we refer to the overview paper of the shared task (Basile et al., 2019).
Preprocessing In almost all of our classification approaches, we vectorize the tweets based on word occurrences. Hence, we tokenize the tweets with the Twitter-specific tokenizer provided by Owoputi et al. (2013). We decided not to remove or normalize social-media-specific phenomena such as @-mentions, #-hashtags, URLs, and emojis, as we hypothesize that these phenomena may provide useful signals for classification. For example, it is conceivable that a reference to the Twitter handle of Donald Trump (@realDonaldTrump) may indicate hatred towards immigrants.
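To illustrate the decision to keep social-media tokens intact, the following is a minimal regex-based sketch. It is not the tokenizer of Owoputi et al. (2013); the pattern and the function name `tokenize_tweet` are assumptions made purely for illustration.

```python
import re

# Hypothetical, simplified pattern; NOT the Owoputi et al. (2013) tokenizer.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"      # URLs are kept as single tokens
    r"|[@#]\w+"          # @-mentions and #-hashtags keep their prefix
    r"|\w+(?:'\w+)?"     # words, including simple contractions such as "don't"
    r"|[^\w\s]"          # any other single non-space character (punctuation, emoji code points)
)


def tokenize_tweet(text):
    """Split a tweet into tokens without removing or normalizing anything."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize_tweet(
    "Don't mess with these migrant dads #SkimmLife https://t.co/swVmkTlFRz via @theSkimm"
)
print(tokens)
```

Because nothing is stripped, the hashtag, the URL, and the handle each survive as one token and can later become n-gram features.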

Feature Engineering Approaches
We now report on the approaches that are based on traditional machine learning algorithms and that represent the training and test instances using manually engineered features. The explored machine learning algorithms are SVM (LibSVM by Chang and Lin, 2011), XGBoost (Chen and Guestrin, 2016), RandomForest (Witten et al., 2016), and Vowpal Wabbit. 1 We implement the classifiers using the text classification framework DKPro TC (Daxenberger et al., 2014), which includes all of the above-mentioned classifiers. We use the following features to represent the tweets:
N-grams As a baseline feature, we represent the tweets using word and character n-grams. We experiment with n-gram sizes in the range of 1-3 for word n-grams and 2-5 for character n-grams. To reduce the feature space, we only use the n-grams that are most common in the (English and Spanish) training data. We experiment with frequency cut-off values of 200, 500, and 1,000.
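The n-gram features can be sketched in a few lines of plain Python. This is an illustration, not the DKPro TC implementation used for the submission. It assumes that the cut-off k means "keep the k most frequent n-grams" and that each kept n-gram becomes a binary presence feature; all function names are hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=3):
    """All word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, n_min=2, n_max=5):
    """All character n-grams with n_min <= n <= n_max."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def top_k_vocabulary(docs_ngrams, k):
    """Keep the k most frequent n-grams over all training documents."""
    counts = Counter(g for grams in docs_ngrams for g in grams)
    return [gram for gram, _ in counts.most_common(k)]


def binary_vector(grams, vocabulary):
    """One 0/1 feature per vocabulary entry (assumed feature encoding)."""
    present = set(grams)
    return [1 if gram in present else 0 for gram in vocabulary]


# Toy training data; the paper uses cut-offs of 200, 500, and 1,000.
train = [["you", "are", "great"], ["you", "are", "not"]]
vocab = top_k_vocabulary([word_ngrams(t) for t in train], k=3)
vec = binary_vector(word_ngrams(["you", "are"]), vocab)
```

The vocabulary is built on the training documents only, so unseen n-grams in a test tweet are simply ignored.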
Hateword lists We hypothesize that the presence of specific hate or insult words indicates whether a tweet constitutes hate speech. Hence, we check whether the words in a tweet occur in lists of hate or insult words. We use the word lists provided by Wiegand et al. (2018), which comprise a basic word list and an extended word list. The basic list contains 1,650 words with binary labels (abusive or not), and the extended list contains 8,478 words with a numeric weight. We extract the abusive words and use them in the following features: a) a boolean hateful feature that indicates whether a posting contains any word from the basic list, b) a hatefulness ratio of total words to hateful words, and c) the sum of the hatefulness weights based on the extended list.
1 https://github.com/VowpalWabbit/vowpal_wabbit
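The three hateword features can be sketched as follows. The word lists below are toy placeholders, not the lists of Wiegand et al. (2018). Lowercasing the tokens and computing the ratio as hateful words divided by total words (which avoids division by zero for tweets without hateful words) are assumptions of this sketch, as is the function name.

```python
def hateword_features(tokens, basic_list, extended_weights):
    """Return (contains_hateword, hateful_ratio, weight_sum) for one tweet.

    basic_list: set of words labeled abusive in the basic list.
    extended_weights: dict mapping word -> numeric weight (extended list).
    """
    lowered = [t.lower() for t in tokens]  # assumption: case-insensitive lookup
    hateful = [t for t in lowered if t in basic_list]
    contains = len(hateful) > 0
    # Assumption: ratio = hateful / total, defined as 0.0 for empty tweets.
    ratio = len(hateful) / len(lowered) if lowered else 0.0
    weight_sum = sum(extended_weights.get(t, 0.0) for t in lowered)
    return contains, ratio, weight_sum


# Toy placeholder lists for illustration only.
basic = {"idiot"}
extended = {"idiot": 0.9, "stupid": 0.6}
feats = hateword_features(["You", "stupid", "idiot", "!"], basic, extended)
```

Each tweet thus contributes one boolean and two numeric values to the feature vector.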
Sentiment We also suspect that the tone in which a tweet is composed can be an indication of hate speech. For instance, we assume that tweets with a strong positive sentiment are rarely hate speech. To measure the overall sentiment of a tweet, we use the tool by Socher et al. (2013) to compute a sentiment score. The computed score uses a five-point scale from very positive to very negative.
Word embeddings We use pre-trained word embeddings to enhance our tweet representation with a semantic component. To compute semantic features, we first average the 300-dimensional (Spanish or English) word embeddings provided by Mikolov et al. (2018) over all words of a tweet. Next, we use every dimension of the averaged vector as a feature.
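A minimal sketch of the averaging step is given below, using toy 3-dimensional vectors in place of the 300-dimensional fastText vectors. How out-of-vocabulary tokens and tweets without any known token are handled is not specified in the text; skipping unknown tokens and falling back to a zero vector are assumptions of this sketch.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of all tokens that have an embedding.

    Assumption: out-of-vocabulary tokens are skipped, and a tweet without
    any known token maps to the zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy vectors standing in for the pre-trained embeddings.
toy = {"good": [1.0, 0.0, 2.0], "day": [3.0, 2.0, 0.0]}
features = average_embedding(["good", "day", "xyz"], toy, dim=3)
```

Every entry of the returned list is then used as one feature, so a tweet contributes as many embedding features as the vectors have dimensions.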

Neural Network Approaches
Besides traditional machine learning approaches, we also experiment with neural network architectures: multilayer perceptrons (MLP), convolutional neural networks (CNN), bi-directional LSTMs, and a combination of LSTMs and CNNs (LSTM + CNN). We initialize all setups with the 300-dimensional word embeddings provided by Mikolov et al. (2018), which were trained on the Common Crawl corpus. Furthermore, in all setups, we use a dropout of 0.25 after the embedding layer and update the network weights using the Adam optimizer (Kingma and Ba, 2014). For all architectures, we optimized the hyperparameters (e.g. number and size of layers) on a held-out development set. We report only the best parameterization found.
MLP Besides the final softmax layer, our MLP has a total of 6 densely connected layers. Starting from the input, the layers have 256, 128, 64, 32, 16, and 8 nodes. We use ReLU as the activation function in all layers.
CNN Our CNN uses three stacked convolutional layers that use a filter size of two. The first layer has 128 nodes, the second 64 and the third 32. Subsequently, we apply max pooling, a dense layer with ten nodes and the final softmax classification layer.
LSTM At the core of our LSTM is a bidirectional LSTM layer with 128 nodes. This layer is followed by two dense layers (40 and 10 nodes) and the softmax layer.
LSTM + CNN For the combination of LSTM and CNN, we put our CNN model on top of the LSTM model.
All of the above-described architectures are implemented using deepTC with the Keras (Chollet et al., 2015) and TensorFlow (Abadi et al., 2015) backend.
BERT We also experiment with Bidirectional Encoder Representations from Transformers (BERT), which recently excelled in a number of NLP tasks (Devlin et al., 2018). For our experiments, we use the provided pre-trained multilingual-cased BERT-Base model, 2 a maximum sequence length of 128, and batches of 32 instances. In this configuration, BERT yields an accuracy of 0.66 after the second round of fine-tuning. As we observe that the performance of BERT begins to decrease from the third round of fine-tuning, we do not fine-tune the model any further.

Model Selection and Results
We evaluate each of the proposed prediction approaches in a 10-fold cross-validation on the English training dataset to determine the best-performing one. As a baseline, we use an SVM equipped with word unigram features.
For all our approaches, we optimize the hyperparameters (e.g. the SVM's slack variable or the number of layers in neural networks) and feature configurations (e.g. frequency cut-offs for n-gram features) on the training data and report the best performance for each approach. We start by tuning the n-gram features. We test a wide range of combinations of n-gram sizes and frequency cut-offs with different classifiers. We report the results in Table 1, which uses abbreviations for word n-grams and character n-grams. We find that the SVM has the best overall performance in cross-validation, and we continue our experiments with the remaining features (hateword lists, sentiment, word embeddings) using LibSVM with the best n-gram setup. We compare this best feature-engineered system with the neural approaches in Table 2.
Overall, we observe that the approaches based on feature engineering tend to outperform the neural approaches. As our SVM classifier performs best, we select it as our official submission and also apply it to the Spanish data. Interestingly, in our experiments, BERT and the LSTM perform worst by a considerable margin. However, the combination of LSTM and CNN proves to be competitive with the feature-engineering approaches.
In Table 3, we show how our system performs on the official test data. We observe a dramatic drop of 30.5 percentage points between the performance on the English training set and the test set. We attribute this loss to over-fitting to the training data. Nevertheless, our system outperforms the most frequent class baseline substantially, and especially on the Spanish data the absolute difference to the top-scoring system is low (about 3 percentage points). This means that our system is indeed effective in the task at hand, but also that hate speech detection is a very challenging task.
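The drop discussed here is stated in terms of macro-F1 in the analysis of hate indicators. As a reminder of what this metric computes, the following is a minimal sketch with made-up toy labels; it is not the official evaluation script of the shared task.

```python
def macro_f1(gold, predicted, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 if the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: 1 = hateful, 0 = not hateful (made-up labels).
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
score = macro_f1(gold, pred)
```

Because both classes contribute equally, a classifier that misses many hateful tweets is penalized even when the not-hateful class dominates.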
Feature ablation To understand how important the individual features are for our system's performance, we conduct an ablation test on our feature set. We show the results of this ablation in Table 4. The results show that removing any feature other than n-grams and word embeddings improves performance. Consequently, we only use n-grams and word embeddings in our final model. The results also show that n-grams are the most important feature for our model.
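An ablation of this kind can be sketched as a leave-one-group-out loop. The sketch assumes that each feature group is removed in turn; `evaluate` stands for any scoring routine (for example a cross-validation run), and `toy_evaluate` with its numbers is entirely made up for illustration and does not reproduce the results in Table 4.

```python
def ablation(feature_groups, evaluate):
    """Score the full feature set and each set with one group left out."""
    results = {"all": evaluate(list(feature_groups))}
    for left_out in feature_groups:
        kept = [g for g in feature_groups if g != left_out]
        results["without " + left_out] = evaluate(kept)
    return results


def toy_evaluate(groups):
    """Hypothetical scorer with MADE-UP contributions, not results from the paper."""
    gain = {"ngrams": 0.20, "embeddings": 0.05, "hatewords": -0.01, "sentiment": -0.01}
    return round(0.5 + sum(gain[g] for g in groups), 2)


groups = ["ngrams", "embeddings", "hatewords", "sentiment"]
table = ablation(groups, toy_evaluate)
for name, value in table.items():
    print(name, value)
```

A group whose removal raises the score over the full set is a candidate for dropping, while the group whose removal hurts most is the most important one.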

Distribution of Hate Indicators
When comparing the performance of our system on the training data and the test data, we notice a dramatic drop of 30.5 percentage points in macro-F1. To better understand this drop, we examine the distribution of words that we suspect to be good indicators of hate speech, i.e. words that both occur frequently in the data and are commonly seen as highly offensive. We examine a frequency distribution of all words and find that the word 'bitch' meets these criteria. However, the distribution of this word differs significantly between the training data and the test data. To see whether this is a special case, we examine another high-frequency word, 'fuck'. The result is shown in Table 5. Furthermore, we inspect how these words are distributed across the classes hate speech and not hate speech in both the training and the test set. We show this analysis in Table 6.
For the word 'bitch', we observe that its occurrence in the training data is strongly correlated with the class hate speech (the probability is about 0.8). In the test set, however, this correlation is considerably weaker. As a result, it is very likely that our classifier learns that 'bitch' is strong evidence for hate speech. As the correlation is different in the test data, this heuristic is likely to lead to misclassification. We conclude that our classifier, which makes strong use of lexical features, is too sensitive to such distributions. Note that we do not find such a shift for the word 'fuck'.
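The two quantities compared in this analysis, how often an indicator word occurs and how strongly it is associated with the class hate speech, can be computed as sketched below. The data is a made-up toy sample with a neutral placeholder token `w`, and the function name is hypothetical.

```python
def indicator_stats(tweets, labels, word):
    """Return (share of tweets containing word, P(hateful | word occurs)).

    tweets: list of token lists; labels: 1 = hate speech, 0 = not hate speech.
    """
    hits = [label for tokens, label in zip(tweets, labels) if word in tokens]
    share = len(hits) / len(tweets) if tweets else 0.0
    p_hate = sum(hits) / len(hits) if hits else 0.0
    return share, p_hate


# Made-up toy data; "w" is a placeholder for an indicator word.
train_tweets = [["w", "a"], ["w", "b"], ["c"], ["w"]]
train_labels = [1, 1, 0, 0]
share, p_hate = indicator_stats(train_tweets, train_labels, "w")
```

Running the same function on the training and the test set and comparing the two conditional probabilities exposes the kind of shift described above.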

Conclusion
We present ltl.uni-due, our submission to SemEval 2019 Task 5, Multilingual Detection of Hate. To build our system, we systematically compare a wide range of approaches, including neural network approaches such as LSTMs and BERT as well as approaches based on feature engineering. In our experiments, a comparatively simple classifier, an SVM equipped with lexico-semantic features (n-grams and word embeddings), outperforms all other approaches. A comparison between the performance on training and test data, together with a quantitative analysis of the dataset, shows that this comparatively simple classifier is prone to over-fitting but nevertheless delivers competitive performance in this highly challenging task.