|
|
John Douceur
Publications
| 16th SOSP |
Distributed Schedule Management in the Tiger Video File Server |
| MMCN 1999 |
Improving Responsiveness of a Stripe-Scheduled Media Server |
| SIGMETRICS '99 |
A Large-Scale Study of File-System Contents |
| 17th SOSP |
Progress-Based Regulation of Low-Importance Processes |
| SIGMETRICS 2000 |
Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs |
| 4th WSS |
Single Instance Storage in Windows© 2000 |
| 4th MASCOTS |
Large-Scale Simulation of Replica Placement Algorithms for a Serverless Distributed File System |
| 9th ESA |
Modeling Replica Placement in a Distributed File System: Narrowing the Gap between Analysis and Simulation |
| 15th DISC |
Competitive Hill-Climbing Strategies for Replica Placement in a Distributed File System |
| 4th SRDS |
Optimizing File Availability in a Secure Serverless Distributed File System |
| 1st IPTPS |
The Sybil Attack |
| USENIX 2002 |
Cooperative Task Management without Manual Stack Management |
| 22nd ICDCS |
Reclaiming Space from Duplicate Files in a Serverless Distributed File System |
| 5th OSDI |
FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment |
| 18th ACSAC |
A Secure Directory Service based on Exclusive Encryption |
| PER 31 (3) |
Is Remote Host Availability Governed by a Universal Law? |
| 5th IPTPS |
Byzantine Fault Isolation in the Farsite Distributed File System |
| 1st EuroSys |
The SMART Way to Migrate Replicated Stateful Services |
| 7th OSDI |
Distributed Directory Service in the Farsite File System |
| 5th FAST |
A Five-Year Study of File-System Metadata |
| OSR 41 (2) |
The Farsite Project: A Retrospective |
| OSR 41 (2) |
MapCruncher: Integrating the World's Geographic Information |
| NOSSDAV 2007 |
Enhancing Game-Server AI with Distributed Client Computation |
| SPAA 2007 |
Maximizing Total Upload in Latency-Sensitive P2P Applications |
| SIGCOMM 2007 |
Lottery Trees: Motivational Deployment of Networked Systems |
Details
| Conference |
16th Symposium on Operating Systems Principles (SOSP) |
| Title |
Distributed Schedule Management in the Tiger Video File Server |
| Authors |
William J. Bolosky, Robert P. Fitzgerald, and John R. Douceur |
| Abstract |
Tiger is a scalable,
fault-tolerant video file server constructed from a collection
of computers connected by a switched network. All
content files are striped across all of the computers and
disks in a Tiger system. In order to prevent conflicts for a
particular resource between two viewers, Tiger schedules
viewers so that they do not require access to the same
resource at the same time. In the abstract, there is a
single, global schedule that describes all of the viewers in
the system. In practice, the schedule is distributed
among all of the computers in the system, each of which has a
possibly partially inconsistent view of a subset of the
schedule. By using such a relaxed consistency model for
the schedule, Tiger achieves scalability and fault tolerance
while still providing the consistent, coordinated service
required by viewers.
|
| Citation |
W. J. Bolosky, R. P. Fitzgerald, J. R. Douceur; Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, 1997, pp. 212-223 |
| Definitive Version |
|
| Download |
|
| Notice |
© ACM, (1997). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGOPS Operating Systems Review, {VOL#31, ISS#5, (1997)} |
| Conference |
Multimedia Computing and Networking 1999 |
| Title |
Improving Responsiveness of a Stripe-Scheduled Media Server |
| Authors |
John R. Douceur and William J. Bolosky |
| Abstract |
Thrifty scheduling is
an algorithm that improves the responsiveness of a
stripe-scheduled multimedia server. It increases the
determinism of the data-distribution service, reduces the
likelihood of high startup delays, and enables an increase in
the rated load of the system. A stripe-scheduled media
server is a distributed video-on-demand system that
load-balances by striping video data across multiple computer
nodes and cyclically scheduling the distribution of the
data. The server displays highly variable startup delays
in response to requests for data streams. These delays
are due to clusters of allocated slots in the distribution
schedule, which form naturally as the system load
increases. Thrifty scheduling is a scalable algorithm
that improves responsiveness by allocating streams to schedule
slots in a way that reduces the clustering in the
schedule. This algorithm has been incorporated into the
Tiger video fileserver.
|
| Keywords |
Video server, video-on-demand, latency reduction, distribution, responsiveness, deterministic service, cyclical scheduling, thrifty, greedy, startup delay. |
| Citation |
J. R. Douceur, W. J. Bolosky; Multimedia Computing and Networking 1999, Dilip D. Kandlur, Kevin Jeffay, Timothy Roscoe, Editors, Proceedings of SPIE Vol. 3654, pp. 192-203 (1998) |
| Download |
|
| Notice |
Copyright SPIE - The International Society for Optical Engineering |
| Conference |
SIGMETRICS '99 International Conference on Measurement and Modeling of Computer Systems |
| Title |
A Large-Scale Study of File-System Contents |
| Authors |
John R. Douceur and William J. Bolosky |
| Abstract |
We collect and analyze a snapshot of data from
10,568 file systems of 4801 Windows personal computers in a commercial
environment. The file systems contain 140 million files totaling 10.5 TB
of data. We develop analytical approximations for distributions of file
size, file age, file functional lifetime, directory size, and directory
depth, and we compare them to previously derived distributions. We find
that file and directory sizes are fairly consistent across file systems,
but file lifetimes vary widely and are significantly affected by the job
function of the user. Larger files tend to be composed of blocks sized
in powers of two, which noticeably affects their size distribution.
File-name extensions are strongly correlated with file sizes, and
extension popularity varies with user job function. On average, file
systems are only half full.
|
| Keywords |
File-system contents, directory hierarchy, static data snapshot, workload characterization, analytical modeling. |
| Citation |
J. R. Douceur, W. J. Bolosky; Proceedings of the international conference on Measurement and modeling of computer systems, 1999, pp. 59-70 |
| Definitive Version |
|
| Download |
|
| Notice |
© ACM, (1999). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMETRICS Performance Evaluation Review, {VOL#27, ISS#1, (1999)} |
| Conference |
17th Symposium on Operating Systems Principles (SOSP) |
| Title |
Progress-Based Regulation of Low-Importance Processes |
| Authors |
John R. Douceur and William J. Bolosky |
| Abstract |
MS Manners is a
mechanism that employs progress-based regulation to prevent
resource contention with low-importance processes from
degrading the performance of high-importance processes.
The mechanism assumes that resource contention that degrades
the performance of a high-importance process will also retard
the progress of the low-importance process. MS Manners
detects this contention by monitoring the progress of the
low-importance process and inferring resource contention from
a drop in the progress rate. This technique recognizes
contention over any system resource, as long as the
performance impact on contending processes is roughly
symmetric. MS Manners employs statistical mechanisms to
deal with stochastic progress measurements; it automatically
calibrates a target progress rate, so no manual tuning is
required; it supports multiple progress metrics from
applications that perform several distinct tasks; and it
orchestrates multiple low-importance processes to prevent
measurement interference. Experiments with two
low-importance applications show that MS Manners can reduce
the degradation of high-importance processes by up to an order
of magnitude.
|
| Keywords |
Progress-based feedback, symmetric resource contention, process priority. |
| Citation |
Douceur, W. J. Bolosky; Proceedings of the Seventeenth ACM Symposium on Operating Systems Principles, 1999, pp. 247-260 |
| Definitive Version |
|
| Download |

|
| Notice |
© ACM, (1999). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGOPS Operating Systems Review, {VOL#33, ISS#5, (1999)} |
| Conference |
SIGMETRICS 2000 International Conference on Measurement and Modeling of Computer Systems |
| Title |
Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs |
| Authors |
William J. Bolosky, John R. Douceur, David Ely, and Marvin Theimer |
| Abstract |
We consider an architecture for a serverless distributed file system that
does not assume mutual trust among the client computers. The system provides
security, availability, and reliability by distributing multiple encrypted
replicas of each file among the client machines. To assess the feasibility of
deploying this system on an existing desktop infrastructure, we measure and
analyze a large set of client machines in a commercial environment. In
particular, we measure and report results on disk usage and content; file
activity; and machine uptimes, lifetimes, and loads. We conclude that the
measured desktop infrastructure would passably support our proposed system,
providing availability on the order of one unfilled file request per user per
thousand days.
|
| Keywords |
Serverless distributed file system architecture, personal computer usage data, availability, reliability, security, trust, workload characterization, analytical modeling, feasibility analysis. |
| Citation |
W. J. Bolosky, J. R. Douceur, D. Ely, M. Theimer; Proceedings of the international conference on Measurement and modeling of computer systems, 2000, pp. 34-43 |
| Definitive Version |
|
| Download |
|
| Notice |
© ACM, (2000). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMETRICS Performance Evaluation Review, {VOL#28, ISS#1, (2000)} |
| Conference |
4th USENIX Windows Systems Symposium |
| Title |
Single Instance Storage in Windows© 2000 |
| Authors |
William J. Bolosky, Scott Corbin, David Goebel, and John R. Douceur |
| Abstract |
Certain applications, such as Windows 2000's
Remote Install service, can result in a set of files in which many
different files have the same content. Using a traditional file system
to store these files separately results in excessive use of disk and
main memory file cache space. Using hard or symbolic links would
eliminate the excess resource requirements, but changes the semantics of
having separate files, in that updates to one "copy" of a file
would be visible to users of another "copy." We describe the
Single Instance Store (SIS), a component within Windows©
2000 that implements links with the semantics of copies for files stored
on a Windows 2000 NTFS volume. SIS uses copy-on-close to implement the
copy semantics of its links. SIS is structured as a file system filter
driver that implements links and a user level service that detects
duplicate files and reports them to the filter for conversion into
links. Because SIS links are semantically identical to separate files,
SIS creates them automatically when it detects files with duplicate
contents. This paper describes the design and implementation of SIS in
detail, briefly presents measurements of a remote install server showing
a 58% disk space savings by using SIS, and discusses other possible uses
of SIS.
|
| Citation |
W. J. Bolosky, S. Corbin, D. Goebel, J. R. Douceur; Proceedings of the 4th USENIX Windows Systems Symposium, 2000, pp. 13-24 |
| Download |
|
| Notice |
The original publication of this paper was granted to USENIX. Copyright to this work is retained by the authors. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. |
| Conference |
9th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) |
| Title |
Large-Scale Simulation of Replica Placement Algorithms for a Serverless Distributed File System |
| Authors |
John R. Douceur and Roger P. Wattenhofer |
| Abstract |
Farsite is a scalable, distributed file system
that logically functions as a centralized file server but that is
physically implemented on a set of client desktop computers. Farsite
provides high degrees of reliability and availability by storing
replicas of files on multiple machines. Replicas are placed to maximize
the effective system availability, using a distributed, iterative,
randomized placement algorithm. We perform a large-scale simulation of
three candidate algorithms using machine availability data collected
from over 50,000 desktop computers. We find that algorithmic efficiency
and placement efficacy run counter to each other. We fit analytic
functions to the improvement rates and provide explanations for the
fitted curves. We explore the algorithms' properties through study of
their dynamic behavior. We visualize algorithmic placements and compare
them to theoretical worst cases. We quantify the degree of machine
failure correlation and develop a formula to approximate its effect.
|
| Citation |
J. R. Douceur, R. P. Wattenhofer; Proceedings of 9th IEEE MASCOTS, 2001, pp. 311-319 |
| Download |
|
| Notice |
© 2001 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. |
| Conference |
9th Annual European Symposium on Algorithms |
| Title |
Modeling Replica Placement in a Distributed File System: Narrowing the Gap between Analysis and Simulation |
| Authors |
John R. Douceur and Roger P. Wattenhofer |
| Abstract |
We examine the replica placement aspect of a
distributed peer-to-peer file system that replicates and stores files on
ordinary desktop computers. It has been shown that some desktop machines
are available for a greater fraction of time than others, and it is
crucial not to place all replicas of any file on machines with low
availability. In this paper we study the efficacy of three hill-climbing
algorithms for file replica placement. Based on large-scale
measurements, we assume that the distribution of machine availabilities
be uniform. Among other results we show that the MinMax algorithm is
competitive, and that for growing replication factor the MinMax and
MinRand algorithms have the same asymptotic worst-case efficacy.
|
| Citation |
J. R. Douceur, R. P. Wattenhofer; Proceedings of 9th ESA, 2001, pp. 356-367 |
| Download |
|
Tech Report (Extended Version) |
|
| Notice |
© Springer-Verlag |
| Conference |
15th International Symposium on Distributed Computing |
| Title |
Competitive Hill-Climbing Strategies for Replica Placement in a Distributed File System |
| Authors |
John R. Douceur and Roger P. Wattenhofer |
| Abstract |
The Farsite distributed file system stores
multiple replicas of files on multiple machines, to provide file access
even when some machines are unavailable. Farsite assigns file replicas
to machines so as to maximally exploit the different degrees of
availability of different machines, given an allowable replication
factor R. We use competitive analysis and simulation to study the
performance of three candidate hill-climbing replica placement
strategies, MinMax, MinRand, and RandRand, each of which successively
exchanges the locations of two file replicas. We show that the MinRand
and RandRand strategies are perfectly competitive for R = 2 and
2/3-competitive for R = 3. For general R, MinRand is at
least 1/2-competitive and RandRand is at least 10/17-competitive. The
MinMax strategy is not competitive. Simulation results show better
performance than the theoretic worst-case bounds.
|
| Citation |
J. R. Douceur, R. P. Wattenhofer; Proceedings of 15th DISC, 2001, pp. 48-62 |
| Download |
|
Tech Report (Reformatted) |
|
| Notice |
© Springer-Verlag |
| Conference |
20th Symposium on Reliable Distributed Systems |
| Title |
Optimizing File Availability in a Secure Serverless Distributed File System |
| Authors |
John R. Douceur and Roger P. Wattenhofer |
| Abstract |
Farsite is a secure, scalable, distributed file
system that logically functions as a centralized file server but that is
physically realized on a set of client desktop computers. Farsite
provides security, reliability, and availability by storing replicas of
each file on multiple machines. It continuously monitors machine
availability and relocates replicas as necessary to maximize the
effective availability of the system. We evaluate several
replica-placement methods using large-scale simulation with machine
availability data from over 50,000 desktop computers. We find that
initially placing replicas in an availability-sensitive fashion yields
pathological results, whereas very good results are obtained by random
initial placement followed by incremental improvement using a scalable,
distributed, fault-tolerant, and attack-resistant hill-climbing
algorithm. The algorithm is resilient to severe restrictions on
communication and replica placement, and it does not excessively
co-locate replicas of different files on the same set of machines.
|
| Citation |
J. R. Douceur, R. P. Wattenhofer; Proceedings of 20th IEEE SRDS, 2001, pp. 4-13 |
| Download |
|
| Notice |
© 2001 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. |
| Conference |
1st International Workshop on Peer-to-Peer Systems |
| Title |
The Sybil Attack |
| Author |
John R. Douceur |
| Abstract |
Large-scale peer-to-peer
systems face security threats from faulty or hostile remote computing
elements. To resist these threats, many such systems employ redundancy.
However, if a single faulty entity can present multiple identities,
it can control a substantial fraction of the system, thereby undermining
this redundancy. One approach to preventing these "Sybil attacks" is
to have a trusted agency certify identities. This paper shows that,
without a logically centralized authority, Sybil attacks are always
possible except under extreme and unrealistic assumptions of resource
parity and coordination among entities.
|
| Citation |
J. R. Douceur; Proceedings of 1st IPTPS, 2002 |
| Download |
|
| Conference |
USENIX 2002 Annual Technical Conference |
| Title |
Cooperative Task Management without Manual Stack Management |
| Authors |
Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur |
| Abstract |
Cooperative task management can provide program architects with
ease of reasoning about concurrency issues. This property is
often espoused by those who recommend "event-driven" programming
over "multi-threaded" programming. Those terms conflate several
issues. In this paper, we clarify the issues, and show how one
can get the best of both worlds: reason more simply about
concurrency in the way "event-driven" advocates recommend, while
preserving the readability and maintainability of code
associated with "multithreaded" programming.
We identify the source of confusion about the two programming
styles as a conflation of two concepts: task management and
stack management. Those two concerns define a two-axis space in
which "multithreaded" and "event-driven" programming are
diagonally opposite; there is a third "sweet spot" in the space
that combines the advantages of both programming styles. We
point out pitfalls in both alternative forms of stack
management, manual and automatic, and we supply techniques that
mitigate the danger in the automatic case. Finally, we exhibit
adaptors that enable automatic stack management code and manual
stack management code to interoperate in the same code base.
|
| Citation |
A. Adya, J. Howell, M. Theimer, W. J. Bolosky, J. R. Douceur; Proceedings of USENIX 2002 Annual Technical Conference |
| Download |
|
| Notice |
The original publication of this paper was granted to USENIX. Copyright to this work is retained by the authors. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. |
| Conference |
22nd International Conference on Distributed Computing Systems |
| Title |
Reclaiming Space from Duplicate Files in a Serverless Distributed File System |
| Authors |
John R. Douceur, Atul Adya, William J. Bolosky, Dan Simon, and Marvin Theimer |
| Abstract |
The Farsite
distributed file system provides availability by replicating
each file onto multiple desktop computers. Since this
replication consumes significant storage space, it is important
to reclaim used space where possible. Measurement of over 500
desktop file systems shows that nearly half of all consumed
space is occupied by duplicate files. We present a mechanism to
reclaim space from this incidental duplication to make it
available for controlled file replication. Our mechanism
includes 1) convergent encryption, which enables duplicate files
to coalesced into the space of a single file, even if the files
are encrypted with different users' keys, and 2) SALAD, a
Self-Arranging, Lossy, Associative Database for aggregating file
content and location information in a decentralized, scalable,
fault-tolerant manner. Large-scale simulation experiments show
that the duplicate-file coalescing system is scalable, highly
effective, and fault-tolerant.
|
| Citation |
J. R. Douceur, A. Adya, W. J. Bolosky, D. Simon, M. Theimer; Proceedings of 22nd IEEE ICDCS, 2002. |
| Download |
|
Tech Report (Extended Version) |
|
| Notice |
© 2002 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. |
| Conference |
5th Symposium on Operating Systems Design and Implementation |
| Title |
FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment |
| Authors |
Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, John R. Douceur, Jon Howell, Jacob R. Lorch, Marvin Theimer, Roger P. Wattenhofer |
| Abstract |
Farsite is a secure,
scalable file system that logically functions as a centralized
file server but is physically distributed among a set of
untrusted computers. Farsite provides file availability and
reliability through randomized replicated storage; it ensures
the secrecy of file contents with cryptographic techniques; it
maintains the integrity of file and directory data with a
Byzantine-fault-tolerant protocol; it is designed to be scalable
by using a distributed hint mechanism and delegation
certificates for pathname translations; and it achieves good
performance by locally caching file data, lazily propagating
file updates, and varying the duration and granularity of
content leases. We report on the design of Farsite and the
lessons we have learned by implementing much of that design.
|
| Citation |
A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, R. P. Wattenhofer; Proceedings of the 5th OSDI, December 2002. |
| Download |
|
| Notice |
The original publication of this paper was granted to USENIX. Copyright to this work is retained by the authors. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. |
| Conference |
18th Annual Computer Security Applications Conference |
| Title |
A Secure Directory Service based on Exclusive Encryption |
| Authors |
John R. Douceur, Atul Adya, Josh Benaloh, William J. Bolosky, Gideon Yuval |
| Abstract |
We describe the design
of a Windows file-system directory service that ensures the
persistence, integrity, privacy, syntactic legality, and
case-insensitive uniqueness of the names it indexes. Byzantine
state replication provides persistence and integrity, and
encryption imparts privacy. To enforce Windows' baroque name
syntax — including restrictions on allowable characters, on the
terminal character, and on several specific names — we develop a
cryptographic process, called "exclusive encryption," that
inherently excludes syntactically illegal names and that enables
the exclusion of case-insensitively duplicate names without
access to their plaintext. This process excludes entire names by
mapping the set of allowed strings to the set of all strings,
excludes certain characters through an amended prefix encoding,
excludes terminal characters through varying the prefix coding
by character index, and supports case-insensitive comparison of
names by extracting and encrypting case information separately.
We also address the issues of hiding name-length information and
access-authorization information, and we report a newly
discovered problem with enforcing case-insensitive uniqueness
for Unicode names.
|
| Citation |
J. R. Douceur, A. Adya, J. Benaloh, W. J. Bolosky, G. Yuval; Proceedings of the 18th ACSAC, December 2002. |
| Download |
|
| Notice |
© 2002 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. |
| Journal |
SIGMETRICS Performance Evaluation Review 31 (3) |
| Title |
Is Remote Host Availability Governed by a Universal Law? |
| Author |
John R. Douceur |
| Abstract |
The availability of
peer-to-peer and other distributed systems depends not only on
the system architecture but also on the availability
characteristics of the hosts participating in the system. This
paper constructs a model of remote host availability, derived
from measurement studies of four host populations. It argues
that hosts are incompletely partitioned into two behavioral
classes, one in which they are cycled on/off periodically and
one in which they are nominally kept on constantly. Within a
class, logarithmic availability generally follows a uniform
distribution; however, the underlying reason for this is not
readily apparent.
|
| Citation |
J. R. Douceur; ACM SIGMETRICS Performance Evaluation Review 31 (3), December 2003, pp. 25-29 |
| Download |
|
| Conference |
5th International Workshop on Peer-to-Peer Systems |
| Title |
Byzantine Fault Isolation in the Farsite Distributed File System |
| Authors |
John R. Douceur, Jon Howell |
| Abstract |
In a peer-to-peer system of interacting Byzantine-fault-tolerant
replicated-state-machine groups, as system scale increases, so
does the probability that a group will manifest a fault. If no
steps are taken to prevent faults from spreading among groups,
a single fault can result in total system failure. To address this
problem, we introduce Byzantine Fault Isolation (BFI), a
technique that enables a distributed system to operate with
application-defined partial correctness when some of its
constituent groups are faulty. We quantify BFI's benefit and
describe its use in Farsite, a peer-to-peer file system designed to
scale to 100,000 machines.
|
| Citation |
J. R. Douceur, J. Howell; Proceedings of 5th IPTPS, 2006 |
| Download |
|
| Conference |
1st EuroSys Conference |
| Title |
The SMART Way to Migrate Replicated Stateful Services |
| Authors |
Jacob R. Lorch, Atul Adya, William J. Bolosky, Ronnie Chaiken, John R. Douceur, and Jon Howell |
| Abstract |
Many stateful services use the replicated state machine
approach for high availability. In this approach, a service
runs on multiple machines to survive machine failures. This
paper describes SMART, a new technique for changing the
set of machines where such a service runs, i.e., migrating
the service. SMART improves upon existing techniques in
three important ways. First, SMART allows migrations that
replace non-failed machines. Thus, SMART enables load
balancing and lets an automated system replace failed machines.
Such autonomic migration is an important step toward
full autonomic operation, in which administrators play
a minor role and need not be available twenty-four hours a
day, seven days a week. Second, SMART can pipeline concurrent
requests, a useful performance optimization. Third,
prior published migration techniques are described in insufficient
detail to admit implementation, whereas our description
of SMART is complete. In addition to describing
SMART, we also demonstrate its practicality by implementing
it, evaluating our implementation's performance,
and using it to build a consistent, replicated, migratable
file system. Our experiments demonstrate the performance
advantage of pipelining concurrent requests, and show that
migration has only a minor and temporary effect on performance.
|
| Keywords |
Migration, replication, reconfiguration, Paxos, replicated state machine. |
| Citation |
J. R. Lorch, A. Adya, W. J. Bolosky, R. Chaiken, J. R. Douceur, and J. Howell; Proceedings of the First ACM EuroSys Conference, 2006, pp. 103-115 |
| Definitive Version |
|
| Download |
|
| Notice |
© ACM, (2006). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGOPS Operating Systems Review, {VOL#40, ISS#4, (2006)} |
| Conference |
7th Symposium on Operating Systems Design and Implementation |
| Title |
Distributed Directory Service in the Farsite File System |
| Authors |
John R. Douceur, Jon Howell |
| Abstract |
We present the design, implementation, and evaluation of a
fully distributed directory service for Farsite, a logically
centralized file system that is physically implemented on a
loosely coupled network of desktop computers. Prior to
this work, the Farsite system included distributed mechanisms
for file content but centralized mechanisms for file
metadata. Our distributed directory service introduces
tree-structured file identifiers that support dynamically
partitioning metadata at arbitrary granularity, recursive
path leases for scalably maintaining name-space consistency,
and a protocol for consistently performing operations on files
managed by separate machines. It also mitigates metadata hotspots
via file-field leases and the new mechanism of disjunctive leases.
We experimentally show that Farsite can dynamically partition
file-system metadata while maintaining full file-system semantics.
|
| Citation |
J. R. Douceur, J. Howell; Proceedings of the 7th OSDI, November 2006. |
| Download |
|
| Notice |
The original publication of this paper was granted to USENIX. Copyright to this work is retained by the authors. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. |
| Conference |
5th USENIX Conference on File and Storage Technologies |
| Title |
A Five-Year Study of File-System Metadata |
| Authors |
Nitin Agrawal, William J. Bolosky, John R. Douceur, Jacob R. Lorch |
| Abstract |
For five years, we collected annual snapshots of filesystem
metadata from over 60,000 Windows PC file systems
in a large corporation. In this paper, we use these
snapshots to study temporal changes in file size, file age,
file-type frequency, directory size, namespace structure,
file-system population, storage capacity and consumption,
and degree of file modification. We present a generative
model that explains the namespace structure and
the distribution of directory sizes. We find significant
temporal trends relating to the popularity of certain file
types, the origin of file content, the way the namespace is
used, and the degree of variation among file systems, as
well as more pedestrian changes in sizes and capacities.
We give examples of consequent lessons for designers of
file systems and related software.
|
| Citation |
N. Agrawal, W. J. Bolosky, J. R. Douceur, J. R. Lorch; Proceedings of the 5th USENIX Conference on File and Storage Technologies, 2007, pp. 31-45 |
| Download |
|
| Notice |
The original publication of this paper was granted to USENIX. Copyright to this work is retained by the authors. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. |
| Journal |
SIGOPS Operating Systems Review 41 (2) |
| Title |
The Farsite Project: A Retrospective |
| Authors |
William J. Bolosky, John R. Douceur, and Jon Howell |
| Abstract |
The Farsite file system is a storage service that runs on the
desktop computers of a large organization and provides the
semantics of a central NTFS file server. The motivation behind
the Farsite project was to harness the unused storage and network
resources of desktop computers to provide a service that is
reliable, available, and secure despite the fact that it runs on
machines that are unreliable, often unavailable, and of limited
security. A main premise of the project has been that building
a scalable system requires more than scalable algorithms: To be
scalable in a practical sense, a distributed system targeting
105 nodes must tolerate a significant (and never-zero)
rate of machine failure, a small number of malicious participants,
and a substantial number of opportunistic participants. It also
must automatically adapt to the arrival and departure of machines
and changes in machine availability, and it must be able to
autonomically repartition its data and metadata as necessary to
balance load and alleviate hotspots. We describe the history of
the project, including its multiple versions of major system
components, the unique programming style and software-engineering
environment we created to facilitate development, our distributed
debugging framework, and our experiences with formal system
specification. We also report on the lessons we learned during
this development.
|
| Citation |
W. J. Bolosky, J. R. Douceur, J. Howell; ACM SIGOPS Operating Systems Review 41 (2), April 2007, pp. 17-26 |
| Download |
|
| Journal |
SIGOPS Operating Systems Review 41 (2) |
| Title |
MapCruncher: Integrating the World's Geographic Information |
| Authors |
Jeremy Elson, Jon Howell, and John R. Douceur |
| Abstract |
Current large-scale interactive web mapping services such as
Virtual Earth and Google Maps use large distributed systems for
delivering data. However, creation and editorial control of their
content is still largely centralized. The Composable Virtual Earth
project's goal is to allow seamless interoperability of geographic
data from arbitrary, distributed sources.
MapCruncher is a first step in this direction. It lets users easily
create new interactive map data that can be layered on top of
existing imagery such as road maps and aerial photography.
MapCruncher geographically registers and reprojects the user's
map into a standard coordinate system. It then emits metadata
that makes it easy for anyone on the Internet to find the published
map data and import it. Interactive maps them become
distributed, seamlessly composable building blocks — similar to
images in the early days of the Web.
|
| Citation |
J. Elson, J. Howell, J. R. Douceur; ACM SIGOPS Operating Systems Review 41 (2), April 2007, pp. 50-59 |
| Download |
|
| Conference |
17th International workshop on Network and Operating Systems Support for Digital Audio & Video |
| Title |
Enhancing Game-Server AI with Distributed Client Computation |
| Authors |
John R. Douceur, Jacob R. Lorch, Frank Uyeda, Randall C. Wood |
| Abstract |
In the context of online role-playing games, we evaluate
offloading AI computation from game servers to game clients. In
this way, the aggregate resources of thousands of participating
client machines can enhance game realism in a way that would be
prohibitively expensive on a central server. Because offloading
can add significant latency to a computation normally executing
within a game server's main loop, we introduce the mechanism of
AI partitioning: splitting an AI into a high-frequency but
computationally simple component on the server, and a lowfrequency
but computationally intensive component offloaded to a
client. By designing the client-side component to be stateless and
deterministic, this approach also facilitates rapid handoff,
preemptive migration, and replication, which can address the
problems of client failure and exploitation. To explore this
approach, we develop an improved AI for tactical navigation, a
challenging task to offload because it is highly sensitive to
latency. Our improvement is based on calculating influence
fields, partitioned into server-side and client-side components by
means of a Taylor series approximation. Experiments on a
Quake-based prototype demonstrate that this approach can
substantially improve the AI's abilities, even with server-clientserver
latencies up to one second.
|
| Citation |
J. R. Douceur, J. R. Lorch, F. Uyeda, R. C. Wood; Proceedings of the 17th International workshop on Network and Operating Systems Support for Digital Audio & Video, 2007, pp. 31-36 |
| Download |
|
| Notice |
© ACM, (2007). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. |
| Conference |
19th ACM Symposium on Parallelism in Algorithms and Architectures |
| Title |
Maximizing Total Upload in Latency-Sensitive P2P Applications |
| Authors |
John R. Douceur, Jacob R. Lorch, Thomas Moscibroda |
| Abstract |
Motivated by an application in distributed gaming, we define and
study the latency-constrained total upload maximization problem.
In this problem, a peer-to-peer overlay network is modeled as a
complete graph and each node vi has an upload
bandwidth capacity ci and a set of receivers
R(i). Each sender-receiver pair (vi,
vj) where vj in R(i), is a
request that should be satisfied, i.e., vi should
send a data packet to each vj in R(i).
The goal is to find a set of at most n multicast-trees
Ti of depth at most 2, such that each node
can be part of multiple trees, all capacity constraints are met, and
the number of satisfied requests is maximized. In this paper, we
prove that the problem is NP-complete, and we present an algorithm
with approximation ratio 1 — 2/sqrt(cmin), where
cmin is the minimum upload capacity. Finally, we
also study the impact of network coding on the quality and
approximability of the solution.
|
| Keywords |
Peer-to-peer gaming, multicast, upload maximization. |
| Citation |
J. R. Douceur, J. R. Lorch, T. Moscibroda; Proceedings of the 19th ACM Symposium on Parallelism in Algorithms and Architectures, 2007, pp. 270-279 |
| Definitive Version |
|
| Download |
|
| Notice |
© ACM, (2007). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, {(2007)} |
| Conference |
SIGCOMM 2007 Conference |
| Title |
Lottery Trees: Motivational Deployment of Networked Systems |
| Authors |
John R. Douceur, Thomas Moscibroda |
| Abstract |
We address a critical deployment issue for network systems, namely
motivating people to install and run a distributed service. This work
is aimed primarily at peer-to-peer systems, in which the decision
and effort to install a service falls to individuals rather than to a
central planner. This problem is relevant for bootstrapping systems
that rely on the network effect, wherein the benefits are not felt
until deployment reaches a significant scale, and also for deploying
asymmetric systems, wherein the set of contributors is different
than the set of beneficiaries. Our solution is the lottery tree (lottree),
a mechanism that probabilistically encourages both participation in
the system and also solicitation of new participants. We define the
lottree mechanism and formally state seven properties that encourage
contribution, solicitation, and fair play. We then present the
Pachira lottree scheme, which satisfies five of these seven properties,
and we prove this to be a maximal satisfiable subset. Using
simulation, we determine optimal parameters for the Pachira lottree
scheme, and we determine how to configure a lottree system
for achieving various deployment scales based on expected installation
effort. We also present extensive sensitivity analyses, which
bolster the generality of our conclusions.
|
| Keywords |
Incentive systems, networked systems, deployment, bootstrapping, lotteries, prospect theory, desiderata, impossibility results. |
| Citation |
J. R. Douceur, T. Moscibroda; Proceedings of the ACM SIGCOMM 2007 Conference, 2007 |
| Download |
|
| Notice |
© ACM, (2007). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version will be published in Proceedings of the ACM SIGCOMM 2007 Conference, {(2007)} |
John Douceur's home page.
|