A Machine Learning Approach to Web Mining

ESPOSITO, Floriana; MALERBA, Donato
2000-01-01

Abstract

This paper presents WebClass, a Web mining tool for content-based classification of Web pages that can be used for resource discovery. The system considers both the textual content of Web pages and the layout structure defined by HTML tags. Web pages are represented as bags of words, with the words selected from training documents by means of a novel scoring measure. Three classification models are compared empirically on a classification task: decision trees, centroids, and k-nearest-neighbor. Experimental results are reported, and conclusions are drawn on the relevance of the HTML layout structure for classification, on the significance of the words selected by the scoring measure, and on the performance of the different classifiers.
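To make the bag-of-words representation and the centroid model mentioned in the abstract concrete, here is a minimal sketch. It is not the WebClass implementation: the abstract does not specify the scoring measure or how HTML layout is used, so this sketch assumes raw word frequencies over all words, discards the tags, and compares pages to class centroids by cosine similarity. All function names and the toy training pages are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def bag_of_words(html):
    """Return a Counter of lowercase word frequencies for one page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return Counter(re.findall(r"[a-z]+", " ".join(parser.chunks).lower()))


def centroid(bags):
    """Average the word-frequency vectors of the pages in one class."""
    total = Counter()
    for bag in bags:
        total.update(bag)
    return {word: count / len(bags) for word, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_pages):
    """Build one centroid per class from (html, label) pairs."""
    by_class = defaultdict(list)
    for html, label in labeled_pages:
        by_class[label].append(bag_of_words(html))
    return {label: centroid(bags) for label, bags in by_class.items()}


def classify(centroids, html):
    """Assign the page to the class whose centroid is most similar."""
    bag = bag_of_words(html)
    return max(centroids, key=lambda label: cosine(bag, centroids[label]))


# Toy training data, invented purely for illustration.
training = [
    ("<html><title>Stars</title><body>telescope galaxy star orbit</body></html>", "astronomy"),
    ("<html><body><h1>Planets</h1> orbit planet star</body></html>", "astronomy"),
    ("<html><body>recipe flour oven bake</body></html>", "cooking"),
    ("<html><body><b>bake</b> bread with flour and yeast</body></html>", "cooking"),
]
model = train(training)
print(classify(model, "<p>a star in a distant galaxy</p>"))
```

A k-nearest-neighbor variant could reuse `bag_of_words` and `cosine`, comparing a new page against every training page instead of against one centroid per class.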
Year: 2000
ISBN: 3-540-67350-4

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/5570