Design and Evaluation of an SVM Framework for Scientific Data Applications

Glock, Philipp

Master Thesis

FZJ-2015-06512


Glock, P. (Corresponding author) FZJ*

2015

ix, 58 p. (2015) = Maastricht University, Master's thesis, 2015

Please use a persistent id in citations: http://hdl.handle.net/2128/9412

Abstract: Support vector machines (SVMs) are a popular classification method due to their good accuracy and broad range of uses in scientific applications. Their computational complexity lies between O(n²) and O(n³), where n is the number of training samples. Scalability to larger data sets is therefore a problem for SVMs, and this disadvantage becomes more significant as the number of large data problems grows. To overcome these scalability issues, this thesis designs and implements a parallel and scalable framework that realizes the cascade SVM approach, including specific improvements. A fundamental speed-up and increased scalability are gained by splitting the data set into several subsets that can be worked on in parallel. The framework is designed to run in modern High Performance Computing (HPC) environments, which provide the massively parallel resources (e.g. large clusters with good node interconnects) needed to solve large data problems. The framework also works on a single computer for smaller problems if needed. To keep the interface usable for domain scientists who are not technically savvy, Python is used.

The standard cascade SVM approach is improved with a standardized file format and parallel I/O, both of which improve I/O performance, which besides computation is often observed to be a bottleneck for large problems. To enable a greater training speed-up as well as better accuracy, further improvements such as distance filters and cross-feedback options are realized and evaluated. The resulting improved cascade SVM approach and the parallel and scalable framework design are then evaluated on a real-world remote sensing data set and compared to another parallel implementation called pi-SVM. The parallelization strategies of the two implementations differ: the cascade SVM is a data-processing approach, whereas pi-SVM follows a primarily algorithm-driven approach.
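The scaling argument in the abstract can be illustrated with a small back-of-the-envelope cost model. The sketch below is not taken from the thesis: the function names, the sample count, and the subset count are hypothetical, constant factors are ignored, and only the first step (training on k equal subsets) is modelled. Any later steps that merge partial results and retrain add cost, so real speed-ups would be smaller than these idealized ratios.

```python
def training_cost(n, exponent=2):
    """Idealized cost of training one SVM on n samples: n**exponent.

    The exponent is assumed to lie between 2 and 3, as stated in the
    abstract; constant factors are ignored.
    """
    return n ** exponent


def first_layer_costs(n, k, exponent=2):
    """Compare training on all n samples with training on k equal subsets.

    Returns a tuple (serial, total_split, parallel_wall):
      serial        - cost of one SVM trained on all n samples
      total_split   - summed cost of k SVMs, each trained on n // k samples
      parallel_wall - cost of one subset, i.e. the wall-clock cost if all
                      k subsets are trained at the same time
    """
    subset = n // k
    serial = training_cost(n, exponent)
    total_split = k * training_cost(subset, exponent)
    parallel_wall = training_cost(subset, exponent)
    return serial, total_split, parallel_wall


# Hypothetical example: 16,000 samples split into 16 subsets, quadratic cost.
serial, total_split, parallel_wall = first_layer_costs(n=16000, k=16)
print(serial / total_split)    # reduction in total work
print(serial / parallel_wall)  # idealized wall-clock reduction
```

Under this model, splitting helps twice: the total work drops by a factor of k because the cost is superlinear in n, and the subsets can additionally be processed at the same time.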

Note: Maastricht University, Master's thesis, 2015

Contributing Institute(s):