Reactive Scheduling of DAG Applications on Heterogeneous and Dynamic Distributed Computing Systems
Date: 04/12/2008
Author: Hernandez, Jesus Israel
Abstract
Emerging technologies enable a set of distributed resources across a network to be linked together and used in a coordinated fashion to execute a single parallel application. Such applications are often abstracted as directed acyclic graphs (DAGs), in which vertices represent application tasks and edges represent data dependencies between tasks.
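The DAG abstraction above can be sketched concretely. The following minimal example (task names and graph shape are illustrative, not taken from the thesis) represents tasks and dependencies, and orders the tasks so that every task appears after all of its predecessors:

```python
# Vertices are tasks, edges are data dependencies (parent -> child).
from collections import defaultdict, deque

def topological_order(tasks, edges):
    """Return tasks in dependency order, or raise if the graph has a cycle."""
    indegree = {t: 0 for t in tasks}
    succ = defaultdict(list)
    for parent, child in edges:
        succ[parent].append(child)
        indegree[child] += 1
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in succ[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(tasks):
        raise ValueError("graph contains a cycle, not a DAG")
    return order

tasks = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]  # diamond-shaped DAG
print(topological_order(tasks, edges))  # -> ['A', 'B', 'C', 'D']
```

A list scheduler typically starts from such an ordering, prioritising among the ready tasks before mapping them to processors.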
Effective scheduling mechanisms for DAG applications are essential to exploit the tremendous potential of such computational resources. The core issue is that the availability and performance of resources, which are by nature heterogeneous, can be expected to vary dynamically, even during the course of an execution. In this thesis, we first consider the problem of scheduling DAG task graphs onto heterogeneous resources with changeable capabilities. We propose a list-scheduling heuristic, the Global Task Positioning (GTP) scheduling method, which addresses the problem by allowing rescheduling and migration of tasks in response to significant variations in resource characteristics. We observed in experiments with GTP that, in an execution with relatively frequent migration, the results of some tasks may over time have been copied to several other sites, so that a subsequently migrated task may have several possible sources for each of its inputs. Some of these copies may now be more quickly accessible than the original, owing to dynamic variations in communication capabilities. To exploit this observation, we extended our model with a Copying Management (CM) function, resulting in a new version, the Global Task Positioning with copying facilities (GTP/c) system. The idea is to reuse such copies during subsequent migration of placed tasks, in order to reduce the impact of migration cost on makespan.
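The core of the copy-reuse idea can be sketched as follows. This is a hypothetical illustration, not the thesis's actual CM function: given the sites currently holding a replica of some input and the current bandwidth from each site, the migrated task fetches from whichever replica is estimated to be cheapest:

```python
# Hypothetical sketch of copy management: when a migrated task needs an
# input, pick the replica whose transfer is estimated to be cheapest
# under the current (dynamic) bandwidths. All names are illustrative.
def cheapest_source(data_size, replicas, bandwidth):
    """replicas: sites holding a copy; bandwidth: current MB/s from each site."""
    return min(replicas, key=lambda site: data_size / bandwidth[site])

bandwidth = {"origin": 2.0, "site1": 10.0, "site2": 5.0}  # measured at migration time
best = cheapest_source(data_size=100.0,
                       replicas=["origin", "site1", "site2"],
                       bandwidth=bandwidth)
print(best)  # -> site1: a copy is now faster to fetch than the original
```

The point is that the original producer of a value need not be the best source once communication capabilities have shifted.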
Finally, we believe that fault tolerance is an important issue in heterogeneous and dynamic computational environments, as the availability of resources cannot be guaranteed. To address the problem of processor failure, we propose a rewinding mechanism which rewinds the progress of the application to a previous state, thereby preserving the execution in spite of the failed processor(s).
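A much-simplified sketch of what rewinding must accomplish (the abstract does not specify the mechanism, so the semantics here are assumed): when a processor fails, any completed task whose results survived only on that processor is returned to a pending state so it can be redone elsewhere, while tasks with surviving results remain completed:

```python
# Assumed semantics, for illustration only: result_sites maps each
# completed task to the set of processors holding its results.
def rewind(completed, result_sites, failed_proc):
    """Return (still_completed, rewound) after losing failed_proc."""
    still, rewound = set(), set()
    for task in completed:
        surviving = result_sites[task] - {failed_proc}
        (still if surviving else rewound).add(task)
    return still, rewound

result_sites = {"A": {"p1", "p2"}, "B": {"p2"}, "C": {"p3"}}
still, rewound = rewind({"A", "B", "C"}, result_sites, failed_proc="p2")
print(sorted(rewound))  # -> ['B']: only B's result was lost with p2
```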
We evaluate our mechanisms through simulation, since this allows us to generate repeatable patterns of resource performance variation. We use a standard benchmark set of DAGs, comparing performance against that of competing algorithms from the scheduling literature.
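One way such repeatable variation patterns can be produced (a generic sketch, not the thesis's simulator; parameter values are illustrative) is to drive the resource model from a seeded random generator, so every run of an experiment sees an identical bandwidth trace:

```python
import random

def bandwidth_trace(seed, steps, base=10.0, jitter=0.5):
    """Generate a repeatable sequence of bandwidth values around `base`."""
    rng = random.Random(seed)  # fixed seed -> identical trace on every run
    return [base * (1.0 + rng.uniform(-jitter, jitter)) for _ in range(steps)]

# The same seed reproduces the identical pattern of variation.
assert bandwidth_trace(42, 5) == bandwidth_trace(42, 5)
```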