LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text
We present a detailed description of our submission to the EmpiriST 2015 shared task on tokenization and part-of-speech tagging of German social media text. As relatively little training data is provided, neither tokenization nor PoS tagging can be learned from the data alone. For tokenization, our system uses regular expressions for general cases and word lists for exceptions. For PoS tagging, adding unsupervised knowledge beyond the available training data is the most important factor for reaching acceptable tagging accuracy. Furthermore, a learning curve experiment shows that more in-domain training data would very likely increase accuracy further.
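The tokenization strategy described above (regular expressions for general cases, word lists for exceptions) could be sketched roughly as follows. This is an illustrative sketch only, not the submitted system: the `EXCEPTIONS` list, the `TOKEN_RE` pattern, and the `tokenize` function are all hypothetical names and contents chosen for the example.

```python
import re

# Hypothetical exception list: strings that should stay a single token
# (abbreviations, emoticons). The real system's word lists are not shown here.
EXCEPTIONS = {"z.B.", "usw.", "d.h.", ":-)", ";-)"}

# Hypothetical general-case pattern, tried left to right at each position.
TOKEN_RE = re.compile(
    r"https?://\S+"        # URLs
    r"|[@#]\w+"            # @mentions and #hashtags
    r"|\d+(?:[.,]\d+)*"    # numbers such as 3,5 or 1.000
    r"|\w+(?:[-']\w+)*"    # words, allowing internal hyphens/apostrophes
    r"|\S"                 # any other single non-space character
)

def tokenize(text):
    """Split text on whitespace, then apply exceptions before the regex."""
    tokens = []
    for chunk in text.split():
        if chunk in EXCEPTIONS:
            # Word-list lookup wins over the general pattern.
            tokens.append(chunk)
        else:
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens
```

Checking the exception list before applying the pattern keeps items such as `z.B.` or `:-)` intact, which the general pattern would otherwise split into several symbols.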