Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations
Abstract
Initially driven by a strong need for increased computational performance in science and
engineering, heterogeneous systems have become ubiquitous and they are getting increasingly
complex. The single-processor era has given way to multi-core processors,
which have quickly been surrounded by satellite devices aiming to increase the throughput
of the entire system. These auxiliary devices, such as Graphics Processing Units, Field-Programmable
Gate Arrays, or other specialized processors, have very different architectures.
This puts an enormous strain on programming models and on software developers trying to take full
advantage of the computing power at hand. Because of this diversity, and because the
flexibility and portability needed to optimize for each target individually are not achievable
in practice, heterogeneous systems typically remain vastly under-utilized.
In this thesis, we explore two distinct, complementary ways to tackle this problem:
providing automated, non-intrusive methods in the form of compiler tools, and implementing
efficient abstractions that automatically tune parameters for a restricted domain. Both
approaches aim to better utilize compute resources in heterogeneous systems.
First, we explore a fully automated, compiler-based approach, where a runtime system
analyzes the computation flow of an OpenCL application and optimizes it across multiple
compute kernels. This method can be deployed on any existing application transparently
and replaces significant software engineering effort spent to tune an application for a particular
system. We show that this technique achieves speedups of up to 3x over unoptimized
code and an average of 1.4x over manually optimized code for highly dynamic applications.
Second, a library-based approach is designed to provide a high-level abstraction for
complex problems in a specific domain: stencil computation. Using domain-specific techniques,
the underlying framework optimizes the code aggressively. We show that even in
a restricted domain, automatic tuning mechanisms and robust architectural abstraction are
necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling
of various applications to multiple GPUs with a speedup of up to 1.9x on two GPUs
and 3.6x on four.