A probabilistic description-oriented approach for categorising Web documents

Goevert, Norbert; Fuhr, Norbert; Lalmas, Mounia

doi:10.1145/319950.320053

Published

The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. We are facing a new challenge because web documents have a rich structure and are highly heterogeneous.
Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier.
Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation of web documents, and (2) a theoretical interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
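
As a rough illustration of the kind of classifier the abstract refers to, the following is a minimal sketch of a similarity-weighted k-nearest neighbour categoriser. It is not the authors' formulation: the cosine similarity, the normalisation of neighbour weights into scores between 0 and 1, and the names `cosine` and `knn_category_scores` are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_category_scores(query, training, k=3):
    """Score categories for `query` from its k most similar training documents.

    `training` is a list of (term_weights, set_of_categories) pairs.
    Each neighbour contributes its similarity, normalised by the total
    similarity of the k neighbours, to every category it belongs to.
    This normalisation is an illustrative assumption, not the paper's model.
    """
    ranked = sorted(
        ((cosine(query, doc), cats) for doc, cats in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in ranked)
    scores = {}
    if total == 0.0:
        return scores  # no neighbour shares any term with the query
    for sim, cats in ranked:
        for category in cats:
            scores[category] = scores.get(category, 0.0) + sim / total
    return scores
```

Under these assumptions, a query whose nearest neighbours all belong to one category receives a score close to 1 for that category, and categories supported only by dissimilar neighbours receive scores close to 0.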


Classification

Conference:

CIKM99: Conference on Information and Knowledge Management; November 2–6, 1999; Kansas City, USA

Date of creation:

16 April 1999

Date of publication:

2004

URN:

urn:nbn:de:hbz:464-duett-04232004-1430249

PURL:

http://purl.oclc.org/NET/duett-04232004-143024

DOI:

10.1145/319950.320053

Language:

English

Resource type:

Text

Keywords:

automatic categorisation; web documents

Collection:

E-publications

Dewey Decimal Classification:

000 Computer science, knowledge, systems

Subject groups of the German National Bibliography:

000 General works, science

Institution:

Faculty of Engineering, Computer Science and Applied Cognitive Science

Scope note:

First published in: Susan Gauch (ed.): CIKM '99: Proceedings of the Eighth International Conference on Information and Knowledge Management, New York: Association for Computing Machinery, 1999, pp. 475–482.

ISBN: 978-1-58113-146-8

Online at: https://doi.org/10.1145/319950.320053
