*
Quick Links|Home|Worldwide
Microsoft*
Search for



John Douceur

Publications

16th SOSP Distributed Schedule Management in the Tiger Video File Server
MMCN 1999 Improving Responsiveness of a Stripe-Scheduled Media Server
SIGMETRICS '99 A Large-Scale Study of File-System Contents
17th SOSP Progress-Based Regulation of Low-Importance Processes
SIGMETRICS 2000 Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs
4th WSS Single Instance Storage in Windows© 2000
4th MASCOTS Large-Scale Simulation of Replica Placement Algorithms for a Serverless Distributed File System
9th ESA Modeling Replica Placement in a Distributed File System: Narrowing the Gap between Analysis and Simulation
15th DISC Competitive Hill-Climbing Strategies for Replica Placement in a Distributed File System
4th SRDS Optimizing File Availability in a Secure Serverless Distributed File System
1st IPTPS The Sybil Attack
USENIX 2002 Cooperative Task Management without Manual Stack Management
22nd ICDCS Reclaiming Space from Duplicate Files in a Serverless Distributed File System
5th OSDI FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment
18th ACSAC A Secure Directory Service based on Exclusive Encryption
PER 31 (3) Is Remote Host Availability Governed by a Universal Law?
5th IPTPS Byzantine Fault Isolation in the Farsite Distributed File System
1st EuroSys The SMART Way to Migrate Replicated Stateful Services
7th OSDI Distributed Directory Service in the Farsite File System
5th FAST A Five-Year Study of File-System Metadata
OSR 41 (2) The Farsite Project: A Retrospective
OSR 41 (2) MapCruncher: Integrating the World's Geographic Information
NOSSDAV 2007 Enhancing Game-Server AI with Distributed Client Computation
SPAA 2007 Maximizing Total Upload in Latency-Sensitive P2P Applications
SIGCOMM 2007 Lottery Trees: Motivational Deployment of Networked Systems



Details

Conference 16th Symposium on Operating Systems Principles (SOSP)
Title Distributed Schedule Management in the Tiger Video File Server
Authors William J. Bolosky, Robert P. Fitzgerald, and John R. Douceur
Abstract Tiger is a scalable, fault-tolerant video file server constructed from a collection of computers connected by a switched network. All content files are striped across all of the computers and disks in a Tiger system. In order to prevent conflicts for a particular resource between two viewers, Tiger schedules viewers so that they do not require access to the same resource at the same time. In the abstract, there is a single, global schedule that describes all of the viewers in the system. In practice, the schedule is distributed among all of the computers in the system, each of which has a possibly partially inconsistent view of a subset of the schedule. By using such a relaxed consistency model for the schedule, Tiger achieves scalability and fault tolerance while still providing the consistent, coordinated service required by viewers.
Citation W. J. Bolosky, R. P. Fitzgerald, J. R. Douceur; Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, 1997, pp. 212-223
Definitive Version Link to ACM Digital Library version of paper
Download Document in PDF format Document in Postscript format Document in HTML format
Notice © ACM, (1997). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGOPS Operating Systems Review, {VOL#31, ISS#5, (1997)}


Conference Multimedia Computing and Networking 1999
Title Improving Responsiveness of a Stripe-Scheduled Media Server
Authors John R. Douceur and William J. Bolosky
Abstract Thrifty scheduling is an algorithm that improves the responsiveness of a stripe-scheduled multimedia server.  It increases the determinism of the data-distribution service, reduces the likelihood of high startup delays, and enables an increase in the rated load of the system.  A stripe-scheduled media server is a distributed video-on-demand system that load-balances by striping video data across multiple computer nodes and cyclically scheduling the distribution of the data.  The server displays highly variable startup delays in response to requests for data streams.  These delays are due to clusters of allocated slots in the distribution schedule, which form naturally as the system load increases.  Thrifty scheduling is a scalable algorithm that improves responsiveness by allocating streams to schedule slots in a way that reduces the clustering in the schedule.  This algorithm has been incorporated into the Tiger video fileserver.
Keywords Video server, video-on-demand, latency reduction, distribution, responsiveness, deterministic service, cyclical scheduling, thrifty, greedy, startup delay.
Citation J. R. Douceur, W. J. Bolosky; Multimedia Computing and Networking 1999, Dilip D. Kandlur, Kevin Jeffay, Timothy Roscoe, Editors, Proceedings of SPIE Vol. 3654, pp. 192-203 (1998)
Download Document in PDF format Document in Postscript format
Notice Copyright SPIE - The International Society for Optical Engineering


Conference SIGMETRICS '99 International Conference on Measurement and Modeling of Computer Systems
Title A Large-Scale Study of File-System Contents
Authors John R. Douceur and William J. Bolosky
Abstract We collect and analyze a snapshot of data from 10,568 file systems of 4801 Windows personal computers in a commercial environment. The file systems contain 140 million files totaling 10.5 TB of data. We develop analytical approximations for distributions of file size, file age, file functional lifetime, directory size, and directory depth, and we compare them to previously derived distributions. We find that file and directory sizes are fairly consistent across file systems, but file lifetimes vary widely and are significantly affected by the job function of the user. Larger files tend to be composed of blocks sized in powers of two, which noticeably affects their size distribution. File-name extensions are strongly correlated with file sizes, and extension popularity varies with user job function. On average, file systems are only half full.
Keywords File-system contents, directory hierarchy, static data snapshot, workload characterization, analytical modeling.
Citation J. R. Douceur, W. J. Bolosky; Proceedings of the international conference on Measurement and modeling of computer systems, 1999, pp. 59-70
Definitive Version Link to ACM Digital Library version of paper
Download Document in PDF format Document in Postscript format
Notice © ACM, (1999). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMETRICS Performance Evaluation Review, {VOL#27, ISS#1, (1999)}


Conference 17th Symposium on Operating Systems Principles (SOSP)
Title Progress-Based Regulation of Low-Importance Processes
Authors John R. Douceur and William J. Bolosky
Abstract MS Manners is a mechanism that employs progress-based regulation to prevent resource contention with low-importance processes from degrading the performance of high-importance processes. The mechanism assumes that resource contention that degrades the performance of a high-importance process will also retard the progress of the low-importance process. MS Manners detects this contention by monitoring the progress of the low-importance process and inferring resource contention from a drop in the progress rate. This technique recognizes contention over any system resource, as long as the performance impact on contending processes is roughly symmetric. MS Manners employs statistical mechanisms to deal with stochastic progress measurements; it automatically calibrates a target progress rate, so no manual tuning is required; it supports multiple progress metrics from applications that perform several distinct tasks; and it orchestrates multiple low-importance processes to prevent measurement interference. Experiments with two low-importance applications show that MS Manners can reduce the degradation of high-importance processes by up to an order of magnitude.
Keywords Progress-based feedback, symmetric resource contention, process priority.
Citation Douceur, W. J. Bolosky; Proceedings of the Seventeenth ACM Symposium on Operating Systems Principles, 1999, pp. 247-260
Definitive Version Link to ACM Digital Library version of paper
Download Document in PDF format Document in Postscript formatDocument in HTML format
Notice © ACM, (1999). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGOPS Operating Systems Review, {VOL#33, ISS#5, (1999)}


Conference SIGMETRICS 2000 International Conference on Measurement and Modeling of Computer Systems
Title Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs
Authors William J. Bolosky, John R. Douceur, David Ely, and Marvin Theimer
Abstract We consider an architecture for a serverless distributed file system that does not assume mutual trust among the client computers. The system provides security, availability, and reliability by distributing multiple encrypted replicas of each file among the client machines. To assess the feasibility of deploying this system on an existing desktop infrastructure, we measure and analyze a large set of client machines in a commercial environment. In particular, we measure and report results on disk usage and content; file activity; and machine uptimes, lifetimes, and loads. We conclude that the measured desktop infrastructure would passably support our proposed system, providing availability on the order of one unfilled file request per user per thousand days.
Keywords Serverless distributed file system architecture, personal computer usage data, availability, reliability, security, trust, workload characterization, analytical modeling, feasibility analysis.
Citation W. J. Bolosky, J. R. Douceur, D. Ely, M. Theimer; Proceedings of the international conference on Measurement and modeling of computer systems, 2000, pp. 34-43
Definitive Version Link to ACM Digital Library version of paper
Download Document in PDF format Document in Postscript format
Notice © ACM, (2000). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMETRICS Performance Evaluation Review, {VOL#28, ISS#1, (2000)}


Conference 4th USENIX Windows Systems Symposium
Title Single Instance Storage in Windows© 2000
Authors William J. Bolosky, Scott Corbin, David Goebel, and John R. Douceur
Abstract Certain applications, such as Windows 2000's Remote Install service, can result in a set of files in which many different files have the same content. Using a traditional file system to store these files separately results in excessive use of disk and main memory file cache space. Using hard or symbolic links would eliminate the excess resource requirements, but changes the semantics of having separate files, in that updates to one "copy" of a file would be visible to users of another "copy." We describe the Single Instance Store (SIS), a component within Windows© 2000 that implements links with the semantics of copies for files stored on a Windows 2000 NTFS volume. SIS uses copy-on-close to implement the copy semantics of its links. SIS is structured as a file system filter driver that implements links and a user level service that detects duplicate files and reports them to the filter for conversion into links. Because SIS links are semantically identical to separate files, SIS creates them automatically when it detects files with duplicate contents. This paper describes the design and implementation of SIS in detail, briefly presents measurements of a remote install server showing a 58% disk space savings by using SIS, and discusses other possible uses of SIS.
Citation W. J. Bolosky, S. Corbin, D. Goebel, J. R. Douceur; Proceedings of the 4th USENIX Windows Systems Symposium, 2000, pp. 13-24
Download Document in PDF format Document in Postscript format
Notice The original publication of this paper was granted to USENIX. Copyright to this work is retained by the authors. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes.


Conference 9th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)
Title Large-Scale Simulation of Replica Placement Algorithms for a Serverless Distributed File System
Authors John R. Douceur and Roger P. Wattenhofer
Abstract Farsite is a scalable, distributed file system that logically functions as a centralized file server but that is physically implemented on a set of client desktop computers. Farsite provides high degrees of reliability and availability by storing replicas of files on multiple machines. Replicas are placed to maximize the effective system availability, using a distributed, iterative, randomized placement algorithm. We perform a large-scale simulation of three candidate algorithms using machine availability data collected from over 50,000 desktop computers. We find that algorithmic efficiency and placement efficacy run counter to each other. We fit analytic functions to the improvement rates and provide explanations for the fitted curves. We explore the algorithms' properties through study of their dynamic behavior. We visualize algorithmic placements and compare them to theoretical worst cases. We quantify the degree of machine failure correlation and develop a formula to approximate its effect.
Citation J. R. Douceur, R. P. Wattenhofer; Proceedings of 9th IEEE MASCOTS, 2001, pp. 311-319
Download Document in PDF format Document in Postscript format
Notice © 2001 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.


Conference 9th Annual European Symposium on Algorithms
Title Modeling Replica Placement in a Distributed File System: Narrowing the Gap between Analysis and Simulation
Authors John R. Douceur and Roger P. Wattenhofer
Abstract We examine the replica placement aspect of a distributed peer-to-peer file system that replicates and stores files on ordinary desktop computers. It has been shown that some desktop machines are available for a greater fraction of time than others, and it is crucial not to place all replicas of any file on machines with low availability. In this paper we study the efficacy of three hill-climbing algorithms for file replica placement. Based on large-scale measurements, we assume that the distribution of machine availabilities be uniform. Among other results we show that the MinMax algorithm is competitive, and that for growing replication factor the MinMax and MinRand algorithms have the same asymptotic worst-case efficacy.
Citation J. R. Douceur, R. P. Wattenhofer; Proceedings of 9th ESA, 2001, pp. 356-367
Download Document in PDF format Document in Postscript format
Tech Report
(Extended Version)
Document in PDF format Document in Postscript format
Notice © Springer-Verlag


Conference 15th International Symposium on Distributed Computing
Title Competitive Hill-Climbing Strategies for Replica Placement in a Distributed File System
Authors John R. Douceur and Roger P. Wattenhofer
Abstract The Farsite distributed file system stores multiple replicas of files on multiple machines, to provide file access even when some machines are unavailable. Farsite assigns file replicas to machines so as to maximally exploit the different degrees of availability of different machines, given an allowable replication factor R. We use competitive analysis and simulation to study the performance of three candidate hill-climbing replica placement strategies, MinMax, MinRand, and RandRand, each of which successively exchanges the locations of two file replicas. We show that the MinRand and RandRand strategies are perfectly competitive for R = 2 and 2/3-competitive for R = 3. For general R, MinRand is at least 1/2-competitive and RandRand is at least 10/17-competitive. The MinMax strategy is not competitive. Simulation results show better performance than the theoretic worst-case bounds.
Citation J. R. Douceur, R. P. Wattenhofer; Proceedings of 15th DISC, 2001, pp. 48-62
Download Document in PDF format Document in Postscript format
Tech Report
(Reformatted)
Document in PDF format Document in Postscript format
Notice © Springer-Verlag


Conference 20th Symposium on Reliable Distributed Systems
Title Optimizing File Availability in a Secure Serverless Distributed File System
Authors John R. Douceur and Roger P. Wattenhofer
Abstract Farsite is a secure, scalable, distributed file system that logically functions as a centralized file server but that is physically realized on a set of client desktop computers. Farsite provides security, reliability, and availability by storing replicas of each file on multiple machines. It continuously monitors machine availability and relocates replicas as necessary to maximize the effective availability of the system. We evaluate several replica-placement methods using large-scale simulation with machine availability data from over 50,000 desktop computers. We find that initially placing replicas in an availability-sensitive fashion yields pathological results, whereas very good results are obtained by random initial placement followed by incremental improvement using a scalable, distributed, fault-tolerant, and attack-resistant hill-climbing algorithm. The algorithm is resilient to severe restrictions on communication and replica placement, and it does not excessively co-locate replicas of different files on the same set of machines.
Citation J. R. Douceur, R. P. Wattenhofer; Proceedings of 20th IEEE SRDS, 2001, pp. 4-13
Download Document in PDF format Document in Postscript format
Notice © 2001 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.


Conference 1st International Workshop on Peer-to-Peer Systems
Title The Sybil Attack
Author John R. Douceur
Abstract Large-scale peer-to-peer systems face security threats from faulty or hostile remote computing elements. To resist these threats, many such systems employ redundancy. However, if a single faulty entity can present multiple identities, it can control a substantial fraction of the system, thereby undermining this redundancy. One approach to preventing these "Sybil attacks" is to have a trusted agency certify identities. This paper shows that, without a logically centralized authority, Sybil attacks are always possible except under extreme and unrealistic assumptions of resource parity and coordination among entities.
Citation J. R. Douceur; Proceedings of 1st IPTPS, 2002
Download Document in PDF format Document in Postscript format


Conference USENIX 2002 Annual Technical Conference
Title Cooperative Task Management without Manual Stack Management
Authors Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur
Abstract Cooperative task management can provide program architects with ease of reasoning about concurrency issues. This property is often espoused by those who recommend "event-driven" programming over "multi-threaded" programming. Those terms conflate several issues. In this paper, we clarify the issues, and show how one can get the best of both worlds: reason more simply about concurrency in the way "event-driven" advocates recommend, while preserving the readability and maintainability of code associated with "multithreaded" programming.
We identify the source of confusion about the two programming styles as a conflation of two concepts: task management and stack management. Those two concerns define a two-axis space in which "multithreaded" and "event-driven" programming are diagonally opposite; there is a third "sweet spot" in the space that combines the advantages of both programming styles. We point out pitfalls in both alternative forms of stack management, manual and automatic, and we supply techniques that mitigate the danger in the automatic case. Finally, we exhibit adaptors that enable automatic stack management code and manual stack management code to interoperate in the same code base.
Citation A. Adya, J. Howell, M. Theimer, W. J. Bolosky, J. R. Douceur; Proceedings of USENIX 2002 Annual Technical Conference
Download Document in PDF format Document in Postscript format
Notice The original publication of this paper was granted to USENIX. Copyright to this work is retained by the authors. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes.


Conference 22nd International Conference on Distributed Computing Systems
Title Reclaiming Space from Duplicate Files in a Serverless Distributed File System
Authors John R. Douceur, Atul Adya, William J. Bolosky, Dan Simon, and Marvin Theimer
Abstract The Farsite distributed file system provides availability by replicating each file onto multiple desktop computers. Since this replication consumes significant storage space, it is important to reclaim used space where possible. Measurement of over 500 desktop file systems shows that nearly half of all consumed space is occupied by duplicate files. We present a mechanism to reclaim space from this incidental duplication to make it available for controlled file replication. Our mechanism includes 1) convergent encryption, which enables duplicate files to coalesced into the space of a single file, even if the files are encrypted with different users' keys, and 2) SALAD, a Self-Arranging, Lossy, Associative Database for aggregating file content and location information in a decentralized, scalable, fault-tolerant manner. Large-scale simulation experiments show that the duplicate-file coalescing system is scalable, highly effective, and fault-tolerant.
Citation J. R. Douceur, A. Adya, W. J. Bolosky, D. Simon, M. Theimer; Proceedings of 22nd IEEE ICDCS, 2002.
Download Document in PDF format Document in Postscript format
Tech Report
(Extended Version)
Document in PDF format Document in Postscript format
Notice © 2002 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.


Conference 5th Symposium on Operating Systems Design and Implementation
Title FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment
Authors Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, John R. Douceur, Jon Howell, Jacob R. Lorch, Marvin Theimer, Roger P. Wattenhofer
Abstract Farsite is a secure, scalable file system that logically functions as a centralized file server but is physically distributed among a set of untrusted computers. Farsite provides file availability and reliability through randomized replicated storage; it ensures the secrecy of file contents with cryptographic techniques; it maintains the integrity of file and directory data with a Byzantine-fault-tolerant protocol; it is designed to be scalable by using a distributed hint mechanism and delegation certificates for pathname translations; and it achieves good performance by locally caching file data, lazily propagating file updates, and varying the duration and granularity of content leases. We report on the design of Farsite and the lessons we have learned by implementing much of that design.
Citation A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, R. P. Wattenhofer; Proceedings of the 5th OSDI, December 2002.
Download Document in PDF format Document in Postscript format
Notice The original publication of this paper was granted to USENIX. Copyright to this work is retained by the authors. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes.


Conference 18th Annual Computer Security Applications Conference
Title A Secure Directory Service based on Exclusive Encryption
Authors John R. Douceur, Atul Adya, Josh Benaloh, William J. Bolosky, Gideon Yuval
Abstract We describe the design of a Windows file-system directory service that ensures the persistence, integrity, privacy, syntactic legality, and case-insensitive uniqueness of the names it indexes. Byzantine state replication provides persistence and integrity, and encryption imparts privacy. To enforce Windows' baroque name syntax — including restrictions on allowable characters, on the terminal character, and on several specific names — we develop a cryptographic process, called "exclusive encryption," that inherently excludes syntactically illegal names and that enables the exclusion of case-insensitively duplicate names without access to their plaintext. This process excludes entire names by mapping the set of allowed strings to the set of all strings, excludes certain characters through an amended prefix encoding, excludes terminal characters through varying the prefix coding by character index, and supports case-insensitive comparison of names by extracting and encrypting case information separately. We also address the issues of hiding name-length information and access-authorization information, and we report a newly discovered problem with enforcing case-insensitive uniqueness for Unicode names.
Citation J. R. Douceur, A. Adya, J. Benaloh, W. J. Bolosky, G. Yuval; Proceedings of the 18th ACSAC, December 2002.
Download Document in PDF format Document in Postscript format
Notice © 2002 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.


Journal SIGMETRICS Performance Evaluation Review 31 (3)
Title Is Remote Host Availability Governed by a Universal Law?
Author John R. Douceur
Abstract The availability of peer-to-peer and other distributed systems depends not only on the system architecture but also on the availability characteristics of the hosts participating in the system. This paper constructs a model of remote host availability, derived from measurement studies of four host populations. It argues that hosts are incompletely partitioned into two behavioral classes, one in which they are cycled on/off periodically and one in which they are nominally kept on constantly. Within a class, logarithmic availability generally follows a uniform distribution; however, the underlying reason for this is not readily apparent.
Citation J. R. Douceur; ACM SIGMETRICS Performance Evaluation Review 31 (3), December 2003, pp. 25-29
Download Document in PDF format Document in Postscript format


Conference 5th International Workshop on Peer-to-Peer Systems
Title Byzantine Fault Isolation in the Farsite Distributed File System
Authors John R. Douceur, Jon Howell
Abstract In a peer-to-peer system of interacting Byzantine-fault-tolerant replicated-state-machine groups, as system scale increases, so does the probability that a group will manifest a fault. If no steps are taken to prevent faults from spreading among groups, a single fault can result in total system failure. To address this problem, we introduce Byzantine Fault Isolation (BFI), a technique that enables a distributed system to operate with application-defined partial correctness when some of its constituent groups are faulty. We quantify BFI's benefit and describe its use in Farsite, a peer-to-peer file system designed to scale to 100,000 machines.
Citation J. R. Douceur, J. Howell; Proceedings of 5th IPTPS, 2006
Download Document in PDF format Document in Postscript format


Conference 1st EuroSys Conference
Title The SMART Way to Migrate Replicated Stateful Services
Authors Jacob R. Lorch, Atul Adya, William J. Bolosky, Ronnie Chaiken, John R. Douceur, and Jon Howell
Abstract Many stateful services use the replicated state machine approach for high availability. In this approach, a service runs on multiple machines to survive machine failures. This paper describes SMART, a new technique for changing the set of machines where such a service runs, i.e., migrating the service. SMART improves upon existing techniques in three important ways. First, SMART allows migrations that replace non-failed machines. Thus, SMART enables load balancing and lets an automated system replace failed machines. Such autonomic migration is an important step toward full autonomic operation, in which administrators play a minor role and need not be available twenty-four hours a day, seven days a week. Second, SMART can pipeline concurrent requests, a useful performance optimization. Third, prior published migration techniques are described in insufficient detail to admit implementation, whereas our description of SMART is complete. In addition to describing SMART, we also demonstrate its practicality by implementing it, evaluating our implementation's performance, and using it to build a consistent, replicated, migratable file system. Our experiments demonstrate the performance advantage of pipelining concurrent requests, and show that migration has only a minor and temporary effect on performance.
Keywords Migration, replication, reconfiguration, Paxos, replicated state machine.
Citation J. R. Lorch, A. Adya, W. J. Bolosky, R. Chaiken, J. R. Douceur, and J. Howell; Proceedings of the First ACM EuroSys Conference, 2006, pp. 103-115
Definitive Version Link to ACM Digital Library version of paper
Download Document in PDF format Document in Postscript format
Notice © ACM, (2006). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGOPS Operating Systems Review, {VOL#40, ISS#4, (2006)}


Conference 7th Symposium on Operating Systems Design and Implementation
Title Distributed Directory Service in the Farsite File System
Authors John R. Douceur, Jon Howell
Abstract We present the design, implementation, and evaluation of a fully distributed directory service for Farsite, a logically centralized file system that is physically implemented on a loosely coupled network of desktop computers. Prior to this work, the Farsite system included distributed mechanisms for file content but centralized mechanisms for file metadata. Our distributed directory service introduces tree-structured file identifiers that support dynamically partitioning metadata at arbitrary granularity, recursive path leases for scalably maintaining name-space consistency, and a protocol for consistently performing operations on files managed by separate machines. It also mitigates metadata hotspots via file-field leases and the new mechanism of disjunctive leases. We experimentally show that Farsite can dynamically partition file-system metadata while maintaining full file-system semantics.
Citation J. R. Douceur, J. Howell; Proceedings of the 7th OSDI, November 2006.
Download Document in PDF format Document in Postscript format
Notice The original publication of this paper was granted to USENIX. Copyright to this work is retained by the authors. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes.


Conference 5th USENIX Conference on File and Storage Technologies
Title A Five-Year Study of File-System Metadata
Authors Nitin Agrawal, William J. Bolosky, John R. Douceur, Jacob R. Lorch
Abstract For five years, we collected annual snapshots of filesystem metadata from over 60,000 Windows PC file systems in a large corporation. In this paper, we use these snapshots to study temporal changes in file size, file age, file-type frequency, directory size, namespace structure, file-system population, storage capacity and consumption, and degree of file modification. We present a generative model that explains the namespace structure and the distribution of directory sizes. We find significant temporal trends relating to the popularity of certain file types, the origin of file content, the way the namespace is used, and the degree of variation among file systems, as well as more pedestrian changes in sizes and capacities. We give examples of consequent lessons for designers of file systems and related software.
Citation N. Agrawal, W. J. Bolosky, J. R. Douceur, J. R. Lorch; Proceedings of the 5th USENIX Conference on File and Storage Technologies, 2007, pp. 31-45
Download Document in PDF format Document in Postscript format
Notice The original publication of this paper was granted to USENIX. Copyright to this work is retained by the authors. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes.


Journal SIGOPS Operating Systems Review 41 (2)
Title The Farsite Project: A Retrospective
Authors William J. Bolosky, John R. Douceur, and Jon Howell
Abstract The Farsite file system is a storage service that runs on the desktop computers of a large organization and provides the semantics of a central NTFS file server. The motivation behind the Farsite project was to harness the unused storage and network resources of desktop computers to provide a service that is reliable, available, and secure despite the fact that it runs on machines that are unreliable, often unavailable, and of limited security. A main premise of the project has been that building a scalable system requires more than scalable algorithms: To be scalable in a practical sense, a distributed system targeting 105 nodes must tolerate a significant (and never-zero) rate of machine failure, a small number of malicious participants, and a substantial number of opportunistic participants. It also must automatically adapt to the arrival and departure of machines and changes in machine availability, and it must be able to autonomically repartition its data and metadata as necessary to balance load and alleviate hotspots. We describe the history of the project, including its multiple versions of major system components, the unique programming style and software-engineering environment we created to facilitate development, our distributed debugging framework, and our experiences with formal system specification. We also report on the lessons we learned during this development.
Citation W. J. Bolosky, J. R. Douceur, J. Howell; ACM SIGOPS Operating Systems Review 41 (2), April 2007, pp. 17-26
Download Document in PDF format Document in Postscript format


Journal SIGOPS Operating Systems Review 41 (2)
Title MapCruncher: Integrating the World's Geographic Information
Authors Jeremy Elson, Jon Howell, and John R. Douceur
Abstract Current large-scale interactive web mapping services such as Virtual Earth and Google Maps use large distributed systems for delivering data. However, creation and editorial control of their content is still largely centralized. The Composable Virtual Earth project's goal is to allow seamless interoperability of geographic data from arbitrary, distributed sources.
MapCruncher is a first step in this direction. It lets users easily create new interactive map data that can be layered on top of existing imagery such as road maps and aerial photography. MapCruncher geographically registers and reprojects the user's map into a standard coordinate system. It then emits metadata that makes it easy for anyone on the Internet to find the published map data and import it. Interactive maps them become distributed, seamlessly composable building blocks — similar to images in the early days of the Web.
Citation J. Elson, J. Howell, J. R. Douceur; ACM SIGOPS Operating Systems Review 41 (2), April 2007, pp. 50-59
Download Document in PDF format Document in Postscript format


Conference 17th International workshop on Network and Operating Systems Support for Digital Audio & Video
Title Enhancing Game-Server AI with Distributed Client Computation
Authors John R. Douceur, Jacob R. Lorch, Frank Uyeda, Randall C. Wood
Abstract In the context of online role-playing games, we evaluate offloading AI computation from game servers to game clients. In this way, the aggregate resources of thousands of participating client machines can enhance game realism in a way that would be prohibitively expensive on a central server. Because offloading can add significant latency to a computation normally executing within a game server's main loop, we introduce the mechanism of AI partitioning: splitting an AI into a high-frequency but computationally simple component on the server, and a lowfrequency but computationally intensive component offloaded to a client. By designing the client-side component to be stateless and deterministic, this approach also facilitates rapid handoff, preemptive migration, and replication, which can address the problems of client failure and exploitation. To explore this approach, we develop an improved AI for tactical navigation, a challenging task to offload because it is highly sensitive to latency. Our improvement is based on calculating influence fields, partitioned into server-side and client-side components by means of a Taylor series approximation. Experiments on a Quake-based prototype demonstrate that this approach can substantially improve the AI's abilities, even with server-clientserver latencies up to one second.
Citation J. R. Douceur, J. R. Lorch, F. Uyeda, R. C. Wood; Proceedings of the 17th International workshop on Network and Operating Systems Support for Digital Audio & Video, 2007, pp. 31-36
Download Document in PDF format Document in Postscript format
Notice © ACM, (2007). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.


Conference 19th ACM Symposium on Parallelism in Algorithms and Architectures
Title Maximizing Total Upload in Latency-Sensitive P2P Applications
Authors John R. Douceur, Jacob R. Lorch, Thomas Moscibroda
Abstract Motivated by an application in distributed gaming, we define and study the latency-constrained total upload maximization problem. In this problem, a peer-to-peer overlay network is modeled as a complete graph and each node vi has an upload bandwidth capacity ci and a set of receivers R(i). Each sender-receiver pair (vi, vj) where vj in R(i), is a request that should be satisfied, i.e., vi should send a data packet to each vj in R(i). The goal is to find a set of at most n multicast-trees Ti of depth at most 2, such that each node can be part of multiple trees, all capacity constraints are met, and the number of satisfied requests is maximized. In this paper, we prove that the problem is NP-complete, and we present an algorithm with approximation ratio 1 — 2/sqrt(cmin), where cmin is the minimum upload capacity. Finally, we also study the impact of network coding on the quality and approximability of the solution.
Keywords Peer-to-peer gaming, multicast, upload maximization.
Citation J. R. Douceur, J. R. Lorch, T. Moscibroda; Proceedings of the 19th ACM Symposium on Parallelism in Algorithms and Architectures, 2007, pp. 270-279
Definitive Version Link to ACM Digital Library version of paper
Download Document in PDF format Document in Postscript format
Notice © ACM, (2007). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, {(2007)}


Conference SIGCOMM 2007 Conference
Title Lottery Trees: Motivational Deployment of Networked Systems
Authors John R. Douceur, Thomas Moscibroda
Abstract We address a critical deployment issue for network systems, namely motivating people to install and run a distributed service. This work is aimed primarily at peer-to-peer systems, in which the decision and effort to install a service falls to individuals rather than to a central planner. This problem is relevant for bootstrapping systems that rely on the network effect, wherein the benefits are not felt until deployment reaches a significant scale, and also for deploying asymmetric systems, wherein the set of contributors is different than the set of beneficiaries. Our solution is the lottery tree (lottree), a mechanism that probabilistically encourages both participation in the system and also solicitation of new participants. We define the lottree mechanism and formally state seven properties that encourage contribution, solicitation, and fair play. We then present the Pachira lottree scheme, which satisfies five of these seven properties, and we prove this to be a maximal satisfiable subset. Using simulation, we determine optimal parameters for the Pachira lottree scheme, and we determine how to configure a lottree system for achieving various deployment scales based on expected installation effort. We also present extensive sensitivity analyses, which bolster the generality of our conclusions.
Keywords Incentive systems, networked systems, deployment, bootstrapping, lotteries, prospect theory, desiderata, impossibility results.
Citation J. R. Douceur, T. Moscibroda; Proceedings of the ACM SIGCOMM 2007 Conference, 2007
Download Document in PDF format Document in Postscript format
Notice © ACM, (2007). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version will be published in Proceedings of the ACM SIGCOMM 2007 Conference, {(2007)}



John Douceur's home page.


©2008 Microsoft Corporation. All rights reserved. Terms of Use |Trademarks |Privacy Statement