Hierarchy Identification for Automatically Generating Table-of-Contents

A table-of-contents (TOC) provides a quick reference to a document's content and structure. We present the first study on identifying the hierarchical structure for automatically generating a TOC using only textual features instead of structural hints such as HTML tags. We create two new datasets to evaluate our approaches for hierarchy identification. We find that our algorithm performs at a level that is sufficient for a fully automated system. For documents without given segment titles, we extend our work by automatically generating segment titles. We make the datasets and our experimental framework publicly available in order to foster future research in TOC generation.

Rights