Speaker:
Thomas Bressoud, Ascend Communications, Inc.
Abstract: An important objective of software fault-tolerant systems should be to provide a fault-tolerance infrastructure in a manner that minimizes the effort required of the application developer and/or that works for existing applications. In the limit, the objective is to provide fault tolerance transparently to the application.
TFT, the work presented in this talk, provides fault tolerance at a higher interface than prior solutions. TFT coordinates replicas at the system call interface, interposing a supervisor agent between the application and the operating system. Moving replica coordination to this interface allows uncorrelated faults within the operating system and below to be tolerated, and also admits the possibility of online operating system and hardware upgrades.
To accomplish its task, TFT must enforce a deterministic computation above the system call interface. The potential sources of non-determinism addressed include non-deterministic system calls, delivery of asynchronous events, and the representation of operating system abstractions that differ between the replicas.
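As a rough illustration of the first source of non-determinism (non-deterministic system calls), the sketch below shows one common way a supervisor could keep replicas in step: the primary executes the call and logs the result, and the backup consumes the logged result instead of asking its own operating system. This is not taken from TFT itself; the class `Supervisor`, the function `application`, and the in-memory `deque` standing in for the replica channel are all hypothetical names chosen for the example.

```python
import os
import time
from collections import deque


class Supervisor:
    """Hypothetical agent interposed between an application and the OS."""

    def __init__(self, role, log):
        self.role = role  # "primary" or "backup"
        self.log = log    # deque standing in for a primary-to-backup channel

    def syscall(self, name, real_call):
        if self.role == "primary":
            # The primary performs the real call and records its result.
            result = real_call()
            self.log.append((name, result))
            return result
        # The backup never performs the call; it replays the primary's result.
        logged_name, result = self.log.popleft()
        if logged_name != name:
            raise RuntimeError(f"replica divergence: {logged_name} != {name}")
        return result


def application(sup):
    """Toy application whose state depends on non-deterministic calls."""
    now = sup.syscall("gettimeofday", time.time)
    nonce = sup.syscall("read_random", lambda: os.urandom(4))
    return (now, nonce)


def run_pair():
    """Run the application on a primary, then replay it on a backup."""
    log = deque()
    primary_state = application(Supervisor("primary", log))
    backup_state = application(Supervisor("backup", log))
    return primary_state, backup_state, log
```

Calling `run_pair()` returns two identical application states even though the clock and the random bytes would differ if each replica queried its own operating system. Asynchronous event delivery and differing operating system abstractions would need additional mechanisms that this sketch does not attempt to show.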
Short Bio: Thomas C. Bressoud received a B.S. in Mathematics and Computer Science from Muskingum College in 1983, an M.S. in Computer Science from Boston University in 1987, and an M.S. and a Ph.D. in Computer Science from Cornell University in 1992 and 1996, respectively. He began his industry career at MIT Lincoln Laboratory, where he worked from 1983 to 1989 before pursuing his Ph.D. He joined Isis Distributed Systems in 1994 and moved with the company to Stratus Computer in 1995; Stratus was in turn acquired by Ascend Communications in 1998, where he is currently a Senior Software Technical Consultant. He is also an adjunct Assistant Professor at WPI in Worcester, MA. His research interests include operating systems, distributed systems, and, most specifically, transparent fault tolerance and availability techniques.