Reliable Part-of-Speech Tagging of Low-frequency Phenomena in the Social Media Domain

Horsmann, Tobias; Beißwenger, Michael; Zesch, Torsten

doi:10.5281/zenodo.1041880

Conference paper, Sat., 30 Sept. 2017, CC BY 4.0

Published

We present a series of experiments on adapting a part-of-speech (PoS) tagger to extremely infrequent PoS tags for which only a limited amount of training data is available. The objective is a tagger that tags such a low-frequency phenomenon with a high degree of correctness, so that it can serve as a corpus query tool on plain-text corpora and make new instances of the phenomenon easy to find. We focused on avoiding manual annotation as much as possible and experimented with altering the frequency weight of the PoS tag of interest in our small training data set. We compared this approach to adding machine-tagged training data in which only the phenomenon of interest is manually corrected. We find that adding more training data is unavoidable, but that machine-tagging the data and hand-correcting only the tag of interest suffices. Furthermore, the choice of tagger plays an important role, as some taggers are better equipped to deal with rare phenomena than others. The best trade-off between precision and recall for the phenomenon of interest was achieved by splitting the tagging into two steps. An evaluation of this phenomenon-fitted tagger on plain-text social media data confirmed that it serves as a useful corpus query tool that retrieves instances of the phenomenon, including many unseen ones.
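The abstract does not say how the frequency weight of the tag of interest was altered. One common way to do this, shown here purely as an illustrative sketch and not as the authors' procedure, is to repeat every training sentence that contains the tag. The function names `upweight_tag` and `tag_count`, the tag label `RARE`, and the toy sentences are all hypothetical.

```python
from typing import List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def upweight_tag(corpus: List[Sentence], tag: str, factor: int) -> List[Sentence]:
    """Return a new corpus in which every sentence containing `tag`
    appears `factor` times; all other sentences appear once.

    Assumption: duplication is only one possible way to raise the
    frequency weight of a tag; the source does not specify the method.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    weighted: List[Sentence] = []
    for sentence in corpus:
        has_tag = any(t == tag for _, t in sentence)
        weighted.extend([sentence] * (factor if has_tag else 1))
    return weighted


def tag_count(corpus: List[Sentence], tag: str) -> int:
    """Count how many tokens in the corpus carry `tag`."""
    return sum(1 for sentence in corpus for _, t in sentence if t == tag)


# Toy training data; "RARE" is a placeholder label, not a real tagset entry.
corpus: List[Sentence] = [
    [("I", "PRON"), ("sleep", "VERB")],
    [("*grin*", "RARE")],
    [("You", "PRON"), ("laugh", "VERB")],
]

weighted = upweight_tag(corpus, "RARE", factor=3)
```

Under this scheme the tagger sees the rare tag more often during training without any new manual annotation. The abstract reports that such reweighting alone was not sufficient and that additional training data was unavoidable.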

Classification

Conference: 5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17), 3-4 October 2017, Bolzano, Italy
Publication date: 30 September 2017
URN: urn:nbn:de:hbz:464-20211019-112758-8
DOI: 10.5281/zenodo.1041880
Language: English
Resource type: Text
Keywords: Part-of-speech; Social Media; CMC; Rare Phenomena
Collection: E-Publications
Subject groups of the Deutsche Nationalbibliographie: 004 Computer science
Link URL: https://cmc-corpora2017.eurac.edu/
Institution: Faculty of Engineering, Computer Science and Applied Cognitive Science, Computer Science, Language Technology
Institution: Faculty of Humanities, Institute of German Studies
Information on first publication: Horsmann, T., Beißwenger, M., Zesch, T., 2017. Reliable Part-of-Speech Tagging of Low-frequency Phenomena in the Social Media Domain. In: Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17), pages 39-43. DOI: https://doi.org/10.5281/zenodo.1041880