Conference paper CC BY 4.0
Published

LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text

We describe our submission to the EmpiriST 2015 shared task on tokenization and part-of-speech (PoS) tagging of German social media text. Because relatively little training data is provided, neither tokenization nor PoS tagging can be learned from the data alone. For tokenization, our system uses regular expressions for general cases and word lists for exceptions. For PoS tagging, adding unsupervised knowledge beyond the available training data is the most important factor in reaching acceptable tagging accuracy. A learning curve experiment further shows that more in-domain training data would very likely increase accuracy further.
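The tokenization approach named in the abstract (regular expressions for general cases, word lists for exceptions) can be illustrated with a minimal sketch. This is not the authors' system: the exception list `EXCEPTIONS`, the pattern `TOKEN_RE`, and the function `tokenize` are hypothetical names and rules chosen only to show the general idea.

```python
import re

# Hypothetical exception list: strings kept as single tokens even though
# the generic pattern below would split them (abbreviations, emoticons).
EXCEPTIONS = {"z.B.", "usw.", "bzw.", ":-)", ";-)"}

# Generic pattern (an assumption, not the paper's rules), tried in order:
# URLs, @mentions and #hashtags, words with optional internal hyphens,
# and finally any single non-whitespace character (punctuation).
TOKEN_RE = re.compile(
    r"https?://\S+"
    r"|[@#]\w+"
    r"|\w+(?:-\w+)*"
    r"|\S"
)

def tokenize(text):
    """Split text on whitespace, then apply exceptions before the regex."""
    tokens = []
    for chunk in text.split():
        if chunk in EXCEPTIONS:
            # Word-list lookup wins over the general rule.
            tokens.append(chunk)
        else:
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens

print(tokenize("Das ist z.B. toll :-) @anna"))
```

In this sketch the word list is consulted only for whole whitespace-delimited chunks, so an abbreviation directly followed by punctuation (such as `z.B.,`) would still be split by the generic pattern; a fuller implementation would need to handle that case.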

Rights

Use and reproduction:
This work may be used under a Creative Commons Attribution 4.0 License (CC BY 4.0).