Speaker: Christian Engelmann, Oak Ridge National Laboratory
Distributed Peer-to-Peer Control in Harness
Abstract: Harness (http://www.csm.ornl.gov/harness/) is an adaptable, fault-tolerant virtual machine environment for next-generation heterogeneous distributed computing, developed as a follow-on to PVM. In addition, it enables the assembly of applications from plug-ins and provides fault tolerance. This work describes the distributed control, which manages global state replication to ensure high availability of service. Group communication services achieve agreement on an initial global state and on a linear history of global state changes at all members of the distributed virtual machine. This global state is replicated to all members so that the system can easily recover from single, multiple and cascaded faults. A peer-to-peer ring network architecture and tunable multi-point failure conditions provide heterogeneity and scalability. Finally, the integration of the distributed control into the multi-threaded kernel architecture of Harness offers a fault-tolerant global state database service for plug-ins and applications.
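The replication idea in the abstract can be illustrated with a minimal sketch. It is not the Harness API: the names Member, Ring, broadcast, fail and join are hypothetical, and a simple in-order loop stands in for the group communication services that achieve agreement in the real system. It only shows why a linear history of state changes, replicated at every member, lets a surviving member bring a newcomer up to date after multiple faults.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """One member of the virtual machine (hypothetical name, not Harness code)."""
    name: str
    state: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def apply(self, change):
        # A change is a (key, value) pair; record it, then update the state.
        key, value = change
        self.history.append(change)
        self.state[key] = value


class Ring:
    """Toy stand-in for the peer-to-peer ring (an assumption for illustration)."""

    def __init__(self, names):
        self.members = [Member(n) for n in names]

    def broadcast(self, change):
        # Stand-in for agreement: every surviving member receives the same
        # change in the same order, so all histories stay identical.
        for member in self.members:
            member.apply(change)

    def fail(self, name):
        # Simulate a fault by dropping the member from the ring.
        self.members = [m for m in self.members if m.name != name]

    def join(self, name):
        # A newcomer rebuilds the global state by replaying the linear
        # history held by any surviving member.
        newcomer = Member(name)
        if self.members:
            for change in list(self.members[0].history):
                newcomer.apply(change)
        self.members.append(newcomer)
        return newcomer


ring = Ring(["a", "b", "c"])
ring.broadcast(("plugin", "loaded"))
ring.fail("a")
ring.fail("b")                      # multiple faults; one member survives
ring.broadcast(("task", 42))
d = ring.join("d")                  # recovers the state from the survivor
print(d.state)
```

In this sketch the global state survives as long as at least one member remains, which is the property that full replication is meant to provide.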
Short Bio: Christian Engelmann is currently sponsored by Oak Ridge Associated Universities (ORAU) as a Research Associate in the Network and Cluster Computing Group (NCC) of the Computer Science and Mathematics Division (CSM) at Oak Ridge National Laboratory (ORNL). His research focuses on fault tolerance in large-scale distributed systems for scientific computing. He is involved in the Harness project and in the ORNL-IBM BlueGene research initiative for super-scalable algorithms.
His primary research interest is distributed scientific computing with a focus on large-scale systems; it covers distributed algorithms, scalable fault tolerance, distributed systems management and collaborative computing. His secondary research interest centers on software engineering technologies, including object-oriented analysis, architectures and design.
Host: Frank Mueller, Computer Science, NCSU