Do LSTMs really work so well for PoS tagging? – A replication study

A recent study by Plank et al. (2016) found that LSTM-based PoS taggers considerably improve over the current state-of-the-art when evaluated on the corpora of the Universal Dependencies project that use a coarse-grained tagset. We replicate this study using a fresh collection of 27 corpora of 21 languages that are annotated with fine-grained tagsets of varying size. Our replication confirms the result in general, and we additionally find that the advantage of LSTMs is even bigger for larger tagsets. However, we also find that for the very large tagsets of morphologically rich languages, hand-crafted morphological lexicons are still necessary to reach state-of-the-art performance.


Introduction
Part-of-Speech (PoS) tagging is an important processing step for many NLP applications. When researchers want to use a PoS tagger, they would ideally choose an off-the-shelf PoS tagger which is optimized for a specific language. If a suitable tagger is not available, two options remain: a) implementing your own tagger, which requires technical knowledge and experience, or b) using an existing tagger and hoping that the resulting model will be sufficiently accurate. One can assume that many taggers fit more languages than the one for which they were originally constructed. Ideally, researchers should be able to fall back on a well-evaluated language-independent tagger if no reference implementation for a language is available.
A recent study by Plank et al. (2016) evaluated an LSTM PoS tagger and compared the results to Conditional Random Field (CRF) (Lafferty et al., 2001) and Hidden Markov Model (HMM) implementations on corpora of various languages. Their evaluation concludes that the LSTM tagger reaches better results than the CRF and HMM taggers. The evaluation corpora were all annotated with a coarse-grained tagset of 17 tags. Thus, this LSTM tagger seems to be a well-performing, language-independent choice for learning models on coarse-grained tagsets. While a coarse-grained tagset might be sufficient for many tasks, some tasks require more fine-grained tagsets.
We, thus, consider it worthwhile to explore whether the results are reproducible using corpora with fine-grained tagsets. We use the LSTM tagger provided by Plank et al. (2016) and likewise compare the results to CRF and an off-the-shelf HMM tagger implementation. We compile a fresh set of 27 corpora of 21 languages, each of which uses the commonly used fine-grained tagset of the respective language. We suggest these corpora as an evaluation set for tasks which require fine-grained PoS tags, as all corpora are freely available for research purposes. Our intention is to replicate the findings of Plank et al. (2016), which were achieved on a coarse-grained tagset, and to investigate whether they transfer to fine-grained tagsets.

PoS Tagger Paradigms
We distinguish two PoS tagger paradigms which can be used to implement a tagger. The first one is Feature Engineering, in which a classifier learns a mapping from human-defined features to a PoS tag. Defining good features is often a non-trivial task, which furthermore requires a lot of experience. For instance, a suffix feature which checks whether a word ends in "ing" is highly discriminative for English gerunds, but might not provide any useful information for other languages. The details of the feature implementation might render a tagger unsuited for learning models for other languages or tagsets. We will, thus, experiment with features and their configurations, and investigate how well they perform in combination for learning fine-grained tagsets of various languages. We implement those experiments using CRFs, which are frequently used for PoS tagging (Remus et al., 2016; Ljubešić et al., 2016). The second paradigm is Architecture Engineering, which relies on methods that learn the input representation by themselves. The challenge lies in finding an architecture that supports this self-learning process. The most recent representatives of this paradigm are neural networks, of which we use the LSTM tagger provided by Plank et al. (2016).
In our experiments, we will focus on how to provide word- and character-level information to the classifiers, as these two types of information are most relevant and most frequently used for training PoS tagger models. Furthermore, we will evaluate the performance on Out-Of-Vocabulary (OOV) words to learn whether the taggers generalize to unseen words.
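The OOV evaluation can be sketched as follows. This is a minimal illustration, assuming that a token counts as OOV when its surface form does not occur in the training data; the function name and the case-sensitive matching are assumptions made for this example, not details of the evaluation code used in the experiments.

```python
def accuracy_all_and_oov(train_tokens, test_tokens, gold_tags, predicted_tags):
    """Return (accuracy on all test tokens, accuracy on OOV test tokens)."""
    # Assumption: a test token is OOV if its exact form never occurs in training.
    vocabulary = set(train_tokens)
    correct_all = correct_oov = total_oov = 0
    for token, gold, predicted in zip(test_tokens, gold_tags, predicted_tags):
        hit = gold == predicted
        correct_all += hit
        if token not in vocabulary:
            total_oov += 1
            correct_oov += hit
    acc_all = correct_all / len(test_tokens) if test_tokens else 0.0
    acc_oov = correct_oov / total_oov if total_oov else 0.0
    return acc_all, acc_oov
```

Reporting both values side by side makes it visible whether a tagger's overall accuracy is mainly driven by memorized word forms.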
To provide a reference value to a well-known PoS tagger, we will compare all results to the HMM-based HunPos (Halácsy et al., 2007) tagger, which is a freely available re-implementation of the TNT tagger (Brants, 2000). HunPos has been used before for training models of various languages and tagsets (Seraji, 2011;Attardi et al., 2010;Hládek et al., 2012) which is why we consider this tagger to be a suitable baseline. Table 1 shows the fine-grained annotated corpora we collected by screening the literature. We do not claim that this list is complete, but the provided corpora are all reasonably easy to access and can be freely used for research purposes.

Evaluation Corpora
Dataset Selection
To ensure reproducibility, we preferably selected corpora which are directly available via the Internet, except German-3, Hungarian and Swedish-2. We intentionally exclude languages such as Chinese or Japanese, which do not provide whitespace delimiters to mark word boundaries. Tagging those languages requires a morphological analysis, which is a different task than the tagging task on which we are focusing here. Most corpora are manually annotated or were at least human-verified. There are four exceptions which we decided to add anyway to increase the number of languages represented in our setup. The tagset granularity of the corpora ranges from coarse (12 tags) to morphologically fine (1574 tags) to evaluate all taggers on various stages of granularity.
[Figure 1; corpus labels: Danish, Dutch, English, German-1, German-2, German-3, Icelandic, Norwegian, Swedish-1, Swedish-2, B-Portug., French-1, French-2, Italian, Spanish, Croatian-1, Croatian-2, Czech, Polish, Russian, Slovak, Slovene-1, Slovene-2, Afrikaans, Finnish, Hebrew, Hungarian]

Language & Corpora Diversity
We analyzed the distribution of PoS tags in the corpora by mapping all tags to the 17 coarse-grained PoS tags of the Universal Dependencies (UD) project (Nivre et al., 2015); the result is shown in Figure 1. The mappings to the UD tagset have been manually created. The partly large differences between the syntactic classes help to better understand the challenge in constructing a tagger that is suited for all those languages. For instance, Germanic and Romanic languages have a lot of determiners, while determiners do not occur at all in Slavic languages.

Corpus Size & Tagset
The corpora have varying sizes, which makes a direct comparison between corpora difficult. To run our experiments under fully controlled conditions, we extract a randomized sub-sample of sentences from each corpus, which accounts for 50k tokens, and run all our experiments with 10-fold cross-validation (CV). Results reported use the fine-grained tagset of the respective corpus.
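The sampling and cross-validation setup can be sketched as follows. This is a minimal illustration under our own assumptions: sentences are lists of tokens, whole sentences are drawn at random until the token budget is reached (so the sample may exceed the budget by part of one sentence), and folds are formed by striding over the sample. The function names and the fixed seed are hypothetical and not taken from the original experiment code.

```python
import random


def subsample_sentences(sentences, token_budget=50_000, seed=0):
    """Draw random sentences until the token budget is reached."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    sample, n_tokens = [], 0
    for sentence in shuffled:
        if n_tokens >= token_budget:
            break
        sample.append(sentence)
        n_tokens += len(sentence)
    return sample


def cross_validation_folds(sentences, k=10):
    """Yield (train, test) splits; each sentence is in exactly one test fold."""
    for i in range(k):
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test
```

Sampling whole sentences rather than individual tokens keeps the word context intact, which the word-level features and the LSTM both depend on.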
We deliberately do not use the corpora from the UD Treebank project in order to provide results on a fresh dataset. Additionally, UD uses a coarse-grained tagset for all its corpora. While this granularity is sufficient for many tasks, linguistic analysis often requires more fine-grained tagsets, and it is not clear whether results achieved on coarse-grained tagsets transfer well to more fine-grained tagsets. The collected corpora, thus, also represent an alternative dataset, which we suggest in case the UD tagset is too coarse-grained.

CRF Experiments
We reviewed the recent literature to determine the most commonly used features for training PoS taggers. As recurring features, we found word-ngrams, fixed character sequences focusing on either pre-, in-, or suffixes of words, and word distributional knowledge for PoS taggers of various languages (Brants, 2000; Horsmann and Zesch, 2016; Ljubešić et al., 2016). Word- and character-ngrams have been used with various parametrizations depending on the language, and there is no agreement on which parameters are most advisable. We will, hence, run a series of parameter-search experiments over the word- and character-ngram parametrization to determine a configuration applicable to all languages. For this, we evaluate all combinations of the subsequently introduced feature configurations with 10-fold cross-validation. The objective is to find a configuration that works well on all corpora, languages, and tagsets.

Word Features
We experiment with adding the 1, 2, or 3 words to the left and right of the current word as lower-cased string features.
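A word-context feature extractor of this kind might look as follows. This is a sketch under assumptions: the feature names, the inclusion of the current word itself, and the `<PAD>` placeholder for positions beyond the sentence boundary are hypothetical choices made for illustration.

```python
def word_context_features(tokens, index, window=1):
    """Lower-cased string features for the current word and its neighbours."""
    # Assumption: the current word is itself included as a feature.
    features = {"w[0]": tokens[index].lower()}
    for offset in range(1, window + 1):
        left, right = index - offset, index + offset
        # Assumption: positions outside the sentence get a padding symbol.
        features[f"w[-{offset}]"] = tokens[left].lower() if left >= 0 else "<PAD>"
        features[f"w[+{offset}]"] = (
            tokens[right].lower() if right < len(tokens) else "<PAD>"
        )
    return features
```

Setting `window` to 1, 2, or 3 corresponds to the three context sizes explored in the parameter search.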
Character Features
Which character-ngram is discriminative for a PoS tag strongly depends on the language. To avoid a language bias, we use a frequency-based approach in which we select the N most frequently occurring character-ngrams of length 1, 2, 3, 4 from the training dataset. We experiment with the frequency cut-off values N ∈ {250, 500, 750, 1000} to select only frequent and potentially informative character-ngrams as features. These N features are boolean and are set to 1 if the respective character-ngram occurs in the current word.
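The frequency-based selection can be sketched as follows. The sketch assumes that the cut-off N applies to one pooled ranking over all n-gram lengths and that n-grams are counted per token occurrence; neither detail is specified above, and all function names are hypothetical.

```python
from collections import Counter


def char_ngrams(word, min_n=1, max_n=4):
    """All character n-grams of the word with length min_n..max_n."""
    return [
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    ]


def select_frequent_ngrams(training_words, top_n=750):
    """Pick the top_n most frequent character n-grams in the training words."""
    # Assumption: one pooled ranking over all n-gram lengths.
    counts = Counter(g for w in training_words for g in char_ngrams(w))
    return [gram for gram, _ in counts.most_common(top_n)]


def ngram_feature_vector(word, selected_ngrams):
    """Boolean vector: 1 if the selected n-gram occurs in the word, else 0."""
    present = set(char_ngrams(word))
    return [1 if gram in present else 0 for gram in selected_ngrams]
```

Because the selection is driven only by training-set frequencies, no language-specific affix list is needed.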

Semantic Features
We use Brown clustering (Brown et al., 1992) to create word clusters from unlabelled text. The unlabelled text is obtained from the Leipzig Corpus Collection (Quasthoff et al., 2006), which provides large text quantities crawled from the web for many languages. For all languages, we create the clusters from the same amount of text, 15 · 10^6 tokens. We provide the cluster ids in substrings of varying length to the classifier (Owoputi et al., 2013).
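Providing cluster ids as substrings of varying length might be realized as in the following sketch. It assumes that the cluster id is a bit string and that the substrings are prefixes; the prefix lengths (2, 4, 6, 8) and the feature names are hypothetical, as the exact lengths are not stated here.

```python
def cluster_prefix_features(cluster_id, prefix_lengths=(2, 4, 6, 8)):
    """Map a Brown cluster bit string to prefix features of several lengths."""
    # Prefixes longer than the cluster id itself are skipped (a design choice).
    return {
        f"cluster[:{n}]": cluster_id[:n]
        for n in prefix_lengths
        if len(cluster_id) >= n
    }
```

Shorter prefixes correspond to coarser groups in the cluster hierarchy, so the classifier can back off to a more general class when the full id is too specific.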

Results
In Figure 2, we show the results of our parameter search experiment. The triangles mark the results of the various feature configurations. The diamond symbol shows the configuration which works best over all corpora. We refer to this best working configuration as Best CRF subsequently. It uses a word-context window of 1 word to the left and right and the 750 most frequent character [1..4]-grams, and additionally adds word clusters. Especially for morphologically rich languages, the spread is quite large, which is caused by the lower number of character-ngrams in those configurations. For corpora such as Slovene-1, we see that more accurate configurations exist than Best CRF, but more importantly, the selected configuration is always among the best working ones.
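The parameter search over word windows and character-ngram cut-offs can be sketched as follows. The sketch covers only the two parameters with explicitly listed values, and ranking configurations by their unweighted mean accuracy over corpora is an assumption; the criterion behind "works best over all corpora" is not spelled out here. All names are hypothetical.

```python
from itertools import product

# Parameter values named in the paper.
WORD_WINDOWS = (1, 2, 3)
NGRAM_CUTOFFS = (250, 500, 750, 1000)


def feature_configurations():
    """Enumerate every (word window, n-gram cut-off) combination."""
    return list(product(WORD_WINDOWS, NGRAM_CUTOFFS))


def best_overall_configuration(accuracy_by_config):
    """Return the configuration with the highest mean accuracy over corpora.

    accuracy_by_config maps a configuration to {corpus id: accuracy}.
    Ranking by the unweighted mean is an assumption of this sketch.
    """
    def mean_accuracy(config):
        scores = accuracy_by_config[config].values()
        return sum(scores) / len(scores)

    return max(accuracy_by_config, key=mean_accuracy)
```

A mean-based ranking can select a configuration that is not the best on any single corpus, which is consistent with the observation for Slovene-1.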
We show the results of Best CRF and the performance of the individual features for each language in Table 2 and compare the results to HunPos; the highest accuracies are highlighted in grey. When evaluating the features separately, the character-ngrams reach the highest accuracy on OOV words. Especially on the Slavic language family, the character-ngrams perform much better than using only word-ngrams or clusters. Furthermore, using only character-ngrams is often competitive with using only word-ngrams. Hence, a rather naïve strategy for achieving decent performance on almost any language is to just use all kinds of character-ngrams. The cluster feature also performs better than the word-ngrams. Considering that we had to limit the amount of data for creating the clusters for comparability, this feature presumably has more potential when using larger data sizes (Derczynski et al., 2015). The combination of all features in the column Best CRF shows that the features address quite different information and add up well, so, unsurprisingly, this configuration reaches the overall best accuracies. The difference to HunPos is only small, often less than one percentage point. Off-the-shelf taggers do, hence, not necessarily have a disadvantage compared to constructing an own tagger. In the remainder of this work, we will use the Best CRF configuration when discussing CRF tagger results.

LSTM Experiments
When using neural networks, the details of how word and character information is provided greatly influence the learning success of the network. We will reproduce network setups which have also been used in Plank et al. (2016) to ensure comparability to the coarse-grained results to which we compare our results:
Word In this setup, we train a network on the word embeddings only and provide them to a bidirectional LSTM. This setup will serve as baseline.
Char The character embeddings of a word are provided to a bidirectional LSTM. The last state of the forward and the backward character LSTM are combined (Ling et al., 2015) and provided to another bidirectional LSTM layer.
Word-Char This architecture is a combination of the previous two architectures. The last state of the character LSTMs is added to the word embedding information before it is provided to the next LSTM layer.
Word-Char+ The architecture by Plank et al. (2016) combines word- and character-level information and additionally considers the log frequency of the next word during training. This tagger reported state-of-the-art results, and we use the provided reference implementation of this tagger in our setup.
LSTMs have the reputation of requiring large amounts of training data. With the 50k tokens we use, this requirement is barely fulfilled; however, Plank et al. (2016) find this sensitivity to be less severe and set a corpus size of 60k tokens as the lower bound for their coarse-grained tagging experiments. We will come back to this data size issue in Section 7, where we evaluate using all tokens in a corpus (arriving at the same conclusions as for our 50k token datasets). Furthermore, in many cases only smaller datasets are available, sometimes with even fewer than 50k tokens. It is, thus, important to know whether considering neural network taggers makes sense at all on fine-grained tagsets, which is why we train LSTM models on smaller dataset sizes.
We implement the LSTM taggers in DyNet (Neubig et al., 2017) and use the hyper-parameter settings by Plank et al. (2016), i.e. we train for 20 epochs using stochastic gradient descent with a learning rate of 0.1 and add Gaussian noise of 0.2 to the embedding layer. We train word embeddings with fastText (Bojanowski et al., 2016) on the data we already used for the semantic feature in the CRF experiments. The character-level embeddings are trained on-the-fly.
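The two training details named above, embedding noise and the gradient step, can be illustrated in plain Python independently of DyNet. This is a sketch under assumptions: the value 0.2 is interpreted as the standard deviation of zero-mean noise added per dimension, and the function names are hypothetical; the actual DyNet calls used in the experiments are not shown here.

```python
import random


def add_gaussian_noise(embedding, sigma=0.2, rng=random):
    """Return a copy of the embedding with zero-mean Gaussian noise added."""
    # Assumption: sigma is the standard deviation of the per-dimension noise.
    return [value + rng.gauss(0.0, sigma) for value in embedding]


def sgd_step(weights, gradients, learning_rate=0.1):
    """One plain stochastic gradient descent update: w <- w - lr * grad."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]
```

Noise on the embeddings acts as a regularizer during training and would be switched off at prediction time.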

Results
In Figure 3, we show the results for the LSTM architectures. The Word-Char+ tagger performs best, followed by Word-Char, which is not surprising as Word-Char+ is based on this architecture. For the Germanic and Romanic languages, the accuracy of the various architectures is similar, but for Slavic languages, which use much more fine-grained tagsets, the differences are rather large. For instance, the Char architecture reaches only small improvements over the Word baseline on Croatian or Czech, while on Spanish or Hungarian the character architecture is clearly better than the baseline. Table 3 shows the detailed results and additionally reports the accuracy values on OOV words, with the best results highlighted in grey. The Char architecture is in many cases competitive with the HunPos reference system. This shows that the performance of many off-the-shelf taggers is rather easy to approximate by relying only on character-level information.
The results of the Char architecture also explain why the Word-Char architecture performs so well although the amount of syntactic information is quite limited with 50k tokens. A large part of the necessary information is already obtained by the character model, which requires a lot less training data than a model on the word level. Thus, the results of Plank et al. (2016) on coarse-grained tagsets are reproducible for fine-grained tagsets, with the Word-Char architecture being the essential property for achieving high accuracy.

Influence of Tagset Size
A researcher who works with morphologically rich languages will often be interested in additional morphological details such as case or gender. This drastically complicates the task, as a few hundred instead of a few dozen PoS tag distinctions have to be learned. In this experiment, we will examine the impact of an increasing number of PoS tags on the accuracy of the taggers to provide reference values of how much performance a tagger loses with an increasing tagset size.

Results
In Figure 4, we show a comparison of the tagging accuracy in relation to the number of PoS tags. We show the best performing LSTM tagger Word-Char+, the CRF tagger and HunPos. Each data point represents the averaged CV result on one corpus with the respective tagger. We see a certain clustering of the data points for the small tagset sizes, which shows that the taggers tend to perform highly similarly for many languages. This means that the tagset size has a larger effect on the accuracy than the language of the corpus.
For each PoS tagger, a regression trendline is plotted which indicates the average loss in accuracy with an increasing tagset size. For one hundred additional PoS tags, Word-Char+ loses 0.35 points in accuracy, while CRF and HunPos have a much steeper decay of 0.45 points. Hence, with growing tagset size the tagger choice becomes increasingly important. Furthermore, the benefit of more sophisticated tagger architectures only becomes apparent on large PoS tagsets.
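The loss per one hundred additional tags can be read off the slope of such a trendline. The following sketch assumes an ordinary least-squares line fitted to (tagset size, accuracy) pairs, one pair per corpus; whether the plotted trendlines were fitted in exactly this way is an assumption, and the function names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loss_per_hundred_tags(tagset_sizes, accuracies):
    """Accuracy points lost per 100 additional tags (positive = loss)."""
    slope, _ = least_squares_line(tagset_sizes, accuracies)
    return -100.0 * slope
```

Under this reading, a slope of -0.0035 accuracy points per tag corresponds to the reported loss of 0.35 points per one hundred tags.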

Comparison with Reference Taggers
In this experiment, we compare our results to reference taggers from the literature that are tailored towards certain languages. Our experiments until now were limited to the fixed dataset size that we set at the beginning for comparability. Especially for the morphologically fine-grained tagsets this might have been problematic, as it is doubtful whether all PoS tags of a morphological tagset even occur in 50k tokens. Thus, in order to evaluate the taggers using all available data, we will reproduce setups reported in the literature and compare the performance of the taggers to those results.
This experiment limits the number of comparisons we can make drastically, as we need to have the same corpora as used in the literature. We, thus, reproduce for Czech the setup by Spoustová et al. (2009) with training on 10^6 and evaluation on 2 · 10^5 tokens, for German-2 the setup by Giesbrecht and Evert (2009) and for Swedish-2 the setup by Östling (2013), which both use 10-fold cross-validation over the full corpus size.
[Table 3; columns: Lang. Group, Corpus Id, and All / OOV accuracy for Word, Char, Word-Char, Word-Char+, HunPos]
Figure 4: Influence of tagset size on accuracy
Taggers for Slavic languages often make use of additional resources such as morphological dictionaries, which we intentionally do not include to avoid human-crafted resources that are not available for all languages. Thus, we do not expect to reach state-of-the-art performance, but we want to quantify the size of the gap.

Results
In Table 4, we show a comparison of our results to the results reported in the literature. On German-2 and Swedish-2, the Word-Char+ tagger reaches better results than the reported reference values. The exception is Czech, which uses a morphologically fine-grained tagset. Thus, language-fitted PoS taggers reach better results than neural networks when training models on corpora with extremely fine-grained PoS tagsets. However, for smaller tagset sizes the need for language-fitting is negligible.

                                 ∆ to reference tagger
Corpus Id   # Tags   Acc (%)   HunPos   CRF    Word-Char+
Czech       1,574    95.9      -4.7     -3.2   -1.5
German-2    54       97.6      -0.1     -0.2    0.9
Swedish-2   153      96.1       0.0     -0.6    0.1
Table 4: Results of reproducing setups in the literature using the full corpus size
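The absolute accuracies implied by Table 4 can be recovered with simple arithmetic. The sketch below assumes that the "Acc (%)" column holds the accuracy of the reference tagger from the literature and that each ∆ is our tagger's accuracy minus that reference, in percentage points; this reading is consistent with the text but is not stated explicitly. The dictionary layout and function name are hypothetical.

```python
# Values copied from Table 4: reference accuracy (%) and the difference of
# each tagger to that reference, in percentage points.
TABLE_4 = {
    "Czech": {"reference": 95.9, "HunPos": -4.7, "CRF": -3.2, "Word-Char+": -1.5},
    "German-2": {"reference": 97.6, "HunPos": -0.1, "CRF": -0.2, "Word-Char+": 0.9},
    "Swedish-2": {"reference": 96.1, "HunPos": 0.0, "CRF": -0.6, "Word-Char+": 0.1},
}


def implied_accuracy(corpus, tagger):
    """Reference accuracy plus the reported difference, rounded to 0.1."""
    row = TABLE_4[corpus]
    return round(row["reference"] + row[tagger], 1)
```

For Czech, for example, the Word-Char+ tagger would then reach 95.9 - 1.5 = 94.4 percent, compared with 95.9 - 4.7 = 91.2 percent for HunPos.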

Conclusion
We replicated a study in which LSTM PoS taggers are compared to CRF and HMM taggers on corpora with a coarse-grained tagset. Our replication focused on whether results reported for coarse-grained tagsets also hold when training models on fine-grained tagsets. Therefore, we collected a large set of 27 evaluation corpora of 21 languages that are annotated with the commonly used fine-grained tagset of the respective language. The replication confirmed the superior performance of the LSTM tagger reported by Plank et al. (2016) also on fine-grained tagsets. However, we also found that for smaller tagset sizes the differences between the LSTM, our self-implemented CRF and the HMM tagger are often only small. The advantages of the LSTM tagger over other taggers grow proportionally with the tagset size of the corpus. On morphologically fine tagsets, even the LSTM tagger fails to reach results reported in the literature when reproducing those setups.