Fast Fault-Tolerant Concurrent Access to Shared Objects

We consider a synchronous model of distributed computation in which $n$ nodes communicate via point-to-point messages, subject to the following constraints: (i) in a single ``step'', a node can only send or receive $O(\log n)$ words, and (ii) communication is unreliable in that a constant fraction of all messages may be lost at each step due to node and/or link failures. We design and analyze a simple local protocol for providing fast concurrent access to shared objects in this faulty network environment. In our protocol, clients use a hashing-based method to access shared objects. When a large number of clients attempt to read a given object at the same time, the object is rapidly replicated to an appropriate number of servers. Once the necessary level of replication has been achieved, each remaining request for the object is serviced within $O(1)$ expected steps. Our protocol has practical potential for supporting high levels of concurrency in distributed file systems over wide area networks.
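
The $O(1)$ expected service time can be illustrated with a small simulation. The sketch below is not the protocol analyzed in this paper. It assumes that the object already has `replicas` copies, that a client picks a copy uniformly at random and locates its server by hashing the pair (object, replica index), and that each message is lost independently with probability `loss`. This is a simplification of the failure model described above. The names `server_for`, `steps_to_read`, and `mean_steps` are hypothetical. Under these assumptions a round trip succeeds with probability $(1-\mathrm{loss})^2$, so the expected number of attempts is $1/(1-\mathrm{loss})^2$, a constant independent of $n$.

```python
import hashlib
import random


def server_for(object_id: str, replica: int, n: int) -> int:
    """Map (object, replica index) to one of n servers with a fixed hash.

    Hypothetical helper: the hash function used by the paper is not specified here.
    """
    digest = hashlib.sha256(f"{object_id}:{replica}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


def steps_to_read(object_id: str, n: int, replicas: int, loss: float,
                  rng: random.Random) -> tuple[int, int]:
    """Retry until one request/reply round trip survives message loss.

    Assumes independent loss with probability `loss` per message.
    Returns (number of attempts, server that finally answered).
    """
    steps = 0
    while True:
        steps += 1
        replica = rng.randrange(replicas)        # choose a copy uniformly at random
        server = server_for(object_id, replica, n)
        request_ok = rng.random() >= loss        # request reaches the server
        reply_ok = rng.random() >= loss          # reply reaches the client
        if request_ok and reply_ok:
            return steps, server


def mean_steps(n: int, replicas: int, loss: float, trials: int, seed: int = 0) -> float:
    """Average number of attempts over `trials` independent reads."""
    rng = random.Random(seed)
    total = sum(steps_to_read("obj", n, replicas, loss, rng)[0] for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    loss = 0.25
    print("simulated mean attempts:", mean_steps(1024, 16, loss, 20000))
    print("predicted 1/(1-loss)^2: ", 1 / (1 - loss) ** 2)
```

The simulation models only the retry behavior after replication is complete. It does not model how replicas are created or the $O(\log n)$ per-step bandwidth limit.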
