Scaling DBSCAN towards exascale computing for clustering of big datasets
Journal Article | FZJ-2018-05447
Katlenburg-Lindau : Soc., 2018
Please use a persistent id in citations: http://hdl.handle.net/2128/19727
Abstract: Progress in sensor technology allows us to collect environmental data in more detail and with better resolution than ever before. One example is 3D laser scanners that generate 3D point-cloud datasets for land survey. Clustering can then be performed on these datasets to identify objects such as buildings, trees, or rocks in the unstructured point-clouds. Segmenting huge point-clouds (of whole cities or even whole countries) into objects is a computationally expensive operation and therefore requires parallel processing. Density-based spatial clustering of applications with noise (DBSCAN) is a popular clustering algorithm, and HPDBSCAN is an efficient parallel implementation of it running on supercomputing clusters. Tomorrow’s supercomputers will be able to provide exascale computing performance by exploiting specialised hardware accelerators; however, existing software needs to be adapted to make use of the best-fitting accelerators. To address this problem, we present a mapping of HPDBSCAN to a pre-exascale platform currently being developed by the European DEEP-EST project. It is based on the Modular Supercomputer Architecture (MSA), which provides a set of accelerator modules that we exploit in novel ways to enhance HPDBSCAN towards exascale performance. These MSA modules include: a Cluster Module (CM) with powerful multicore CPUs; the Extreme Scale Booster (ESB) module with manycore CPUs; the Network Attached Memory (NAM) module, which stores datasets and provides extremely fast access to them; and a fast interconnect fabric that speeds up inter-process message passing together with the Global Collective Engine (GCE), which includes a multi-purpose Field Programmable Gate Array (FPGA) for, e.g., summing up values transmitted in collected messages.
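For readers unfamiliar with the base algorithm, a minimal serial DBSCAN can be sketched as follows. This is an illustrative toy version only, assuming 3D points and a naive O(n²) neighbourhood search; HPDBSCAN itself is a grid-based, distributed implementation and does not work this way.

```python
# Minimal serial DBSCAN sketch (illustrative only, not HPDBSCAN's
# parallel grid/MPI implementation). Points are (x, y, z) tuples.
from collections import deque

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (naive scan)."""
    xi, yi, zi = points[i]
    return [j for j, (xj, yj, zj) in enumerate(points)
            if (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Return one cluster ID per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:           # already visited
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:       # not a core point
            labels[i] = NOISE
            continue
        labels[i] = cluster_id              # start a new cluster
        queue = deque(neighbours)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:          # former noise becomes border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # core point: expand the cluster
                queue.extend(j_neighbours)
        cluster_id += 1
    return labels
```

Two dense groups of points end up with distinct cluster IDs, while an isolated point is labelled as noise; the parallel version in the paper partitions the space so each node runs this expansion locally and cluster IDs are merged afterwards.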
HPDBSCAN exploits the above accelerator modules as follows: the data to be clustered can be stored in the NAM; it is subsequently distributed and load-balanced across the CPU nodes of the CM, a step accelerated by the GCE; the parallel clustering itself is performed by the powerful CPUs of the CM, which also merge the obtained cluster IDs; and the merged cluster IDs are stored in the NAM for further level-of-detail (LoD) studies, i.e. zooming in and out based on continuous, instead of fixed, levels of importance for each point, which can be regarded as an added dimension. The ESB module (supported by the GCE) is most suitable for calculating these continuous level of importance (cLoI) values and adding them to the dataset in the NAM. Based on the added cLoI data, the LoD studies can then be performed by re-clustering as described previously, i.e. distribution and load balancing of the cLoI-value-enriched dataset followed by parallel clustering. The described approach will allow HPDBSCAN-based clusterings to scale towards exascale performance on tomorrow’s hardware.
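The LoD step described above can be illustrated with a small sketch. The function name, the example values, and the idea of thresholding cLoI values to select points for re-clustering are our illustrative assumptions, not an API from the paper; the actual cLoI computation on the ESB is not reproduced here.

```python
# Hypothetical sketch of cLoI-based level-of-detail selection: each point
# carries a continuous importance value, and "zooming out" keeps only the
# more important points, which would then be re-clustered with DBSCAN.

def lod_select(points, cloi, threshold):
    """Return the subset of points whose cLoI value meets the threshold."""
    return [p for p, c in zip(points, cloi) if c >= threshold]

# Example: four points with assumed continuous importance values.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
cloi = [0.9, 0.2, 0.7, 0.1]

# A zoomed-out view keeps only the points above the importance threshold;
# re-clustering this subset yields the coarser level of detail.
coarse_view = lod_select(points, cloi, threshold=0.5)
```

Because cLoI is continuous rather than a fixed set of levels, the threshold can be varied smoothly, which matches the paper's framing of importance as an added dimension of the dataset.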