Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data 
Networks

Ntarmos, Nikos; Triantafillou, Peter; Weikum, Gerhard; Liu, Ling; Reuter, Andreas; Whang, Kyu-Young; Zhang, Jianjun

Item

ITEM ACTIONSEXPORT

Add to Basket

Local TagsRelease HistoryDetailsSummary

Released

Conference Paper

Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks

MPS-Authors

/persons/resource/persons45636

Triantafillou, Peter
Databases and Information Systems, MPI for Informatics, Max Planck Society;

/persons/resource/persons45720

Weikum, Gerhard
Databases and Information Systems, MPI for Informatics, Max Planck Society;

External Resource

No external resources are shared

Fulltext (restricted access)

There are currently no full texts shared for your IP range.

Fulltext (public)

There are no public fulltexts stored in PuRe

Supplementary Material (public)

There is no public supplementary material available

Citation

Ntarmos, N., Triantafillou, P., & Weikum, G. (2006). Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks. In Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006) (pp. 1-10). Los Alamitos, USA: IEEE.

Cite as: https://hdl.handle.net/11858/00-001M-0000-000F-2274-7

Abstract

Counting in general, and estimating the cardinality of (multi-) sets in particular, is highly desirable for a large variety of applications, representing a foundational block for the efficient deployment and access of emerging internetscale information systems. Examples of such applications range from optimizing query access plans in internet-scale databases, to evaluating the significance (rank/score) of various data items in information retrieval applications. The key constraints that any acceptable solution must satisfy are: (i) efficiency: the number of nodes that need be contacted for counting purposes must be small in order to enjoy small latency and bandwidth requirements; (ii) scalability, seemingly contradicting the efficiency goal: arbitrarily large numbers of nodes nay need to add elements to a (multi-) set, which dictates the need for a highly distributed solution, avoiding server-based scalability, bottleneck, and availability problems; (iii) access and storage load balancing: counting and related overhead chores should be distributed fairly to the nodes of the network; (iv) accuracy: tunable, robust (in the presence of dynamics and failures) and highly accurate cardinality estimation; (v) simplicity and ease of integration: special, solution-specific indexing structures should be avoided. In this paper, first we contribute a highly-distributed, scalable, efficient, and accurate (multi-) set cardinality estimator. Subsequently, we show how to use our solution to build and maintain histograms, which have been a basic building block for query optimization for centralized databases, facilitating their porting into the realm of internet-scale data networks.