English
 
Help Privacy Policy Disclaimer
  Advanced SearchBrowse

Item

ITEM ACTIONSEXPORT

Released

Conference Paper

Mind the Gap: Large-scale Frequent Sequence Mining

MPS-Authors
/persons/resource/persons104381

Miliaraki,  Iris
Databases and Information Systems, MPI for Informatics, Max Planck Society;

/persons/resource/persons44119

Berberich,  Klaus
Databases and Information Systems, MPI for Informatics, Max Planck Society;

/persons/resource/persons44484

Gemulla,  Rainer
Databases and Information Systems, MPI for Informatics, Max Planck Society;

/persons/resource/persons45816

Zoupanos,  Spyros
Databases and Information Systems, MPI for Informatics, Max Planck Society;

External Resource
No external resources are shared
Fulltext (restricted access)
There are currently no full texts shared for your IP range.
Fulltext (public)
There are no public fulltexts stored in PuRe
Supplementary Material (public)
There is no public supplementary material available
Citation

Miliaraki, I., Berberich, K., Gemulla, R., & Zoupanos, S. (2013). Mind the Gap: Large-scale Frequent Sequence Mining. In K. Ross, D. Srivastava, D. Papadias, & S. Papadopoulos (Eds.), SIGMOD'13 (pp. 797-808). New York, NY: ACM. doi:10.1145/2463676.2465285.


Cite as: https://hdl.handle.net/11858/00-001M-0000-0015-1D76-9
Abstract
Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are suffciently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose PFSM, a scalable algorithm for frequent sequence mining on MapReduce. PFSM can handle so-called ``gap constraints'', which can be used to limit the output to a controlled set of frequent sequences. At its heart, PFSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a ``projected database'' used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our extensive experimental study in the context of text mining suggests that PFSM is significantly more efficient and scalable than alternative approaches.