Distributed large-scale data storage and processing

Papailiopoulos, Dimitrios

Distributed large-scale data storage and processing

Access full-text files

PAPAILIOPOULOS-DISSERTATION-2014.pdf (3.35 MB)

Date

2014-05

Authors

Papailiopoulos, Dimitrios

Abstract

This thesis makes progress towards the fundamental understanding of heterogeneous and dynamic information systems and the way that we store and process massive data-sets. Reliable large-scale data storage: Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. However, traditional erasure codes are associated with high repair cost that is often considered an unavoidable price to pay. In this thesis, we show how to overcome these limitations. We construct novel families of erasure codes that are optimal under various repair cost metrics, while achieving the best possible reliability. We show how these modern storage codes significantly outperform traditional erasure codes. Low-rank approximations for large-scale data processing: A central goal in data analytics is extracting useful and interpretable information from massive data-sets. A challenge that arises from the distributed and large-scale nature of the data at hand, is having algorithms that are good in theory but can also scale up gracefully to large problem sizes. Using ideas from prior work, we develop a scalable lowrank optimization framework with provable guarantees for problems like the densest k-subgraph (DkS) and sparse PCA. Our experimental findings indicate that this low-rank framework can outperform the state-of-the art, by offering higher quality and more interpretable solutions, and by scaling up to problem inputs with billions of entries.