17th ACM Symposium on Operating Systems Principles (SOSP'99)
Published as Operating Systems Review 34(5): 247–260, Dec. 1999
Progress-based regulation of low-importance processes
John R. Douceur and William J. Bolosky
Microsoft Research
Redmond, WA 98007
{johndo,
bolosky}@microsoft.com
Abstract
MS Manners is a mechanism that employs
progress-based regulation to prevent resource contention with low-importance
processes from degrading the performance of high-importance processes. The mechanism assumes that resource
contention that degrades the performance of a high-importance process will also
retard the progress of the low-importance process. MS Manners detects this contention by monitoring the progress of
the low-importance process and inferring resource contention from a drop in the
progress rate. This technique
recognizes contention over any system resource, as long as the performance
impact on contending processes is roughly symmetric. MS Manners employs statistical mechanisms to deal with stochastic
progress measurements; it automatically calibrates a target progress rate, so no
manual tuning is required; it supports multiple progress metrics from
applications that perform several distinct tasks; and it orchestrates multiple
low-importance processes to prevent measurement interference. Experiments with two low-importance applications
show that MS Manners can reduce the degradation of high-importance processes by
up to an order of magnitude.
Categories and subject descriptors: D.4.1 [Operating systems]: Process management – scheduling; G.3 [Probability and statistics]; G.4 [Mathematical software]
General terms: Algorithms, Measurement, Performance
Additional key words and phrases: progress-based feedback, symmetric resource contention, process priority
We have developed a method that prevents low-importance processes from interfering with the performance of high-importance processes. Such interference is generally caused by contention over shared resources, so our method relieves this contention by suspending the low-importance process whenever it detects resource contention. Our method detects resource contention by means of progress-based feedback: A control system monitors the progress of the low-importance application and regards a drop in the progress rate as an indication of resource contention. We refer to this control system as MS Manners, because it determines when a low-importance process should politely defer to a high-importance process.
The need for such a method arose in the development of the SIS
Groveler, a Microsoft® Windows® 2000 system utility that finds and merges
duplicate files on a file system. We
needed a mechanism that would allow our low-importance utility to use idle
resources without degrading other processes.
The Groveler is I/O-bound, so CPU priority is insufficient to control
its resource usage. The Windows 2000
kernel does not support priority for other resources, and we did not wish to
modify the kernel. The Groveler runs in
server environments with continuously running applications and unpredictable
workload schedules, which disqualifies all prior solutions of which we are
aware.
We found that this method worked well for the Groveler, so we expected
it might work well for other long-running, low-importance applications, such as
a backup system, a disk defragmenter, a file archiver, a virus scanner, or
other housekeeping utilities.
Therefore, we re-packaged our method as a general application in two
different forms: a library that can be called by low-importance applications,
and an executable program that externally monitors and regulates low-importance
applications that report progress through a standard Windows NT® mechanism.
Performance measurements indicate that MS Manners is very effective at
reducing the impact of low-importance processes on high-importance
processes. In two example
low-importance applications, MS Manners reduced the performance hit on
high-importance applications by as much as an order of magnitude: Running a low-importance process increased
the execution time of a concurrent high-importance process by 90%; applying MS
Manners reduced this increase to 7 – 12%.
In the following sections, we review previous approaches to
low-importance process regulation, we explain the assumptions and limitations
of our approach, we describe our technique in detail, and we present
experimental results on its effectiveness.
Many
approaches to low-importance process regulation have been proposed and
implemented, such as scheduling for specific times, running as a screen saver,
scanning the system process queue, and various resource-specific methods. These approaches vary considerably in their
generality, complexity, and invasiveness.
Our approach is not strictly better than any of the others; it is merely
another design point in the overall problem space.
If system activity is at least coarsely predictable, one simple
approach is to schedule a low-importance process to run during expected periods
of little or no system activity, such as the middle of the night. This technique fails to exploit
unanticipated idle times, and it fails to regulate during periods of
unanticipated activity. Moreover, if
system activity is not predictably periodic, this technique does not work at
all, and a mechanism to dynamically determine idle periods is needed.
One method for determining idle periods is to use the approach of a
screen saver: activate the low-importance process in the absence of keyboard or
mouse activity. This approach relies on
the assumption that a lack of user input indicates that the system is
unused. This assumption may be reasonable
for a desktop workstation, but it is not valid for a server, which is often
busy but which rarely receives direct user input.
Another approach is to run low-importance processes only when no
high-importance processes are in the system process queue [26]. However, a
high-importance process may be in the process queue without consuming
significant resources. For example, a
database-server application might run continuously but only require resources
when given a workload. In such a scenario,
this approach would never allow a low-importance process to run.
A time-honored method [12] is to use CPU priority. When assigned a low CPU priority, a low-importance process is
prevented from using the CPU when any normal priority process is using it. This works well if the CPU is the limiting
resource. If another resource – such as
disk or memory – is the limiting factor, the CPU is uncontended, so CPU
priority is ineffective.
Priority can be extended to other resources besides the CPU. In general, this requires modifying the OS
kernel. For example, Stealth [11] is a process scheduler that prioritizes CPU, virtual
memory, and file system cache. This
extension is still limited to a few specific resources, and the techniques are
resource-specific.
A similar but more general approach is performance isolation [27], which supports various allocation and sharing
policies for multiple resources. If a
low-importance process is entitled to zero resources, it can only use shares of
resources that are unused by other processes.
Although the performance isolation framework is general, each resource
requires a specific isolation technique.
Another resource-specific approach comes from the domain of real-time
systems. Resource kernels [19] schedule multiple system resources among concurrent
processes. All processes must make
explicit reservations for resources, and actual resource usage is monitored and
enforced by the kernel.
Our approach is resource-independent, and it requires no kernel
modifications. It works in server
environments in which high-importance applications run continuously and receive
workloads at unpredictable times. These
features differentiate it from previous approaches.
Like the
earlier approaches described above, MS Manners relies on several assumptions,
primarily the symmetry of performance impact from resource contention and the
accessibility of progress measures that accurately reflect resource consumption.
Our driving assumption is that the performance impact from resource
contention is roughly symmetric. If a
low-importance process is substantially degrading a high-importance process,
then the low-importance process will also be substantially degraded, so our
method will observe the effect and appropriately suspend the low-importance
process. On the other hand, if the
degradation of the low-importance process is too small to observe, then the
impact on the high-importance process is likely to be small as well, so the
low-importance process can continue.
This assumption of symmetry can be violated if the operating system
does not aim for fairness in sharing resources among processes. For example, a disk scheduler might favor
small transfers over large ones, so a low-importance process making small disk
transfers will degrade the performance of a high-importance process making
large disk transfers, without itself experiencing any reciprocal degradation.
One noteworthy example of resource asymmetry is physical memory. If the combined memory requirement of two
processes exceeds the available physical memory, operating systems tend to
drastically favor one process over another [23], in order to avoid page thrashing. This is reasonable behavior, but it
invalidates our key assumption for this important resource.
For the general case, we have no solution to the problem of resource
asymmetry. For resources with
user-settable priority, the problem can be averted by lowering the resource
priority of the low-importance process.
A second critical assumption is that the regulator has access to some
progress metric for the low-importance process. This means either that the application must export one or more
progress metrics via a standard interface, or that the application can be
modified to indicate its progress via a library call.
A malicious application might provide false progress information in
order to avoid being regulated. Our
method assumes that progress information is correct and reasonably accurate,
and it makes no attempt to detect or suspend malicious processes.
Beyond these assumptions, MS Manners has several limitations that
follow from its premises. Since it
automatically calibrates its tuning parameters (see section 4.3), it requires
significant periods of resource idleness.
If resources are continually busy, the calibrator cannot determine
correct parameter values. Even given
significant idle periods, calibration can require many hours or even days if
execution begins on a heavily loaded system.
If the calibration is started on a relatively idle system, this time can
be shortened to a few minutes. This
limitation generally restricts progress-based regulation to processes that are
long-running.
Since MS Manners is completely resource-independent, it does not
discriminate between various classes of resources, such as those internal and
external to a machine. For example, a
web crawler’s progress rate will degrade when the network is loaded, triggering
MS Manners to suspend the process, which may not be as desired. Solving this problem requires either making
MS Manners resource-aware or modifying the application to adjust its progress
metrics for external delays. These
fixes oppose our goals of resource independence and noninvasiveness.
Since MS Manners can regulate unmodified applications (see section
7.2), it may suspend a process that is holding a shared system resource, such
as a lock or an exclusive file handle.
This resource will remain unavailable to any high-importance process
that requires it, causing a priority inversion. For applications that are modified to use MS Manners directly,
suspension occurs only at well defined points in the application code, so this
problem can be avoided by not holding a resource at such a point. However, for unmodified applications, we
have no solution.
|
Figure 1. Application environment |
Figure 1
illustrates the application environment of a standard operating system, in
which multiple applications may run concurrently. Some system resources may be used exclusively by a single
application, whereas others may be shared by more than one application. If a shared resource is limited, then use of
that resource by one process will degrade the performance of other processes
that use that same resource.
If a user explicitly designates one or more applications as having low
importance, then the MS Manners control system prevents the process from using
resources that are in use by any normal, high-importance application. The control system monitors the progress of
the low-importance application. When
the control system sees a drop in the rate of progress, it infers that the
application is experiencing resource contention, so it suspends its execution.
|
Figure 2. System architecture |
Figure 2 illustrates the main architectural components of MS
Manners. Periodically, a low-importance
process provides an indication of its progress, through either a library call
or a standard reporting interface. A
rate calculator combines this progress indication with temporal information
from a system clock to determine the process’s progress rate. This progress rate is used for two
purposes: First, it is fed into a
target calibrator, which analyzes many progress rate measurements to determine
a target rate for the process. Second,
the progress rate is fed into a rate comparator, which compares it against the
target rate from the target calibrator.
The rate comparator judges whether the current progress rate is less
than the target progress rate. This
judgment is used for two purposes:
First, it is fed into a suspension time calculator, which maintains a
suspension time value; the calculator increases this value when the progress is
judged to be below target, and it decreases this value when the progress is
judged to be at or above target.
Second, the judgment is fed into an execution regulator; if the progress
is below target, the regulator suspends the process for the suspension time.
The following sections describe these components in the abstract. For implementation details, see section 7.
MS Manners’
core components are measuring the application’s rate of progress, comparing
this rate against a target rate, and suspending the process when the rate falls
below target. This section describes
these components in their simplest form; more sophisticated versions are
described in subsequent sections.
Periodically, at times known as testpoints, the control system
acquires metrics of the application’s progress. (For the moment, we ignore the mechanism by which progress
metrics are conveyed from the application to the control system. Two such mechanisms are described in section
7.) These progress metrics can be
expressed in virtually any unit that is meaningful to and easily tracked by the
application. For example, a file
compressor might indicate quantity of data compressed, whereas a content
indexer might indicate the number of directory entries scanned. Section 5 discusses properties of good
progress metrics.
Testpoints should be made fairly frequently, at least once per few
hundred milliseconds, so the process can be suspended promptly when
necessary. At each testpoint, MS
Manners calculates the elapsed time and the progress made since the previous
testpoint. It then calculates the
progress rate as the ratio of these two values.
MS Manners compares this progress rate to a target progress rate. The target rate is the progress rate expected
when the application is not contending for any resources (see section
4.3). If the actual progress rate is at
least as good as the target, MS Manners judges the progress rate to be good;
otherwise, it judges it to be poor.
If the progress rate is good, the control system allows the process to
continue immediately. If the progress
rate is poor, the control system suspends the process for a period of time
before allowing it to continue. The
execution is not stopped entirely, or else there would be no way to determine
when it is okay to continue.
The time a process is suspended depends on how many successive
testpoints indicate poor progress. On
each testpoint that indicates poor progress, the suspension time is doubled, up
to a set limit. Once a testpoint
indicates good progress, the process is allowed to continue, and the suspension
time is restored to its initial value.
The exponential increase makes the low-importance process adjust to the
time scale of other processes’ execution patterns. Following short periods of activity by a high-importance process,
the low-importance process will resume promptly, but during long periods of
high-importance activity, the low-importance process makes only infrequent
execution probes. The limit on
suspension time places a bound on the worst-case resumption time.
These components are necessary for progress-based regulation, but they
are not always sufficient. For example,
if progress measurements are stochastic, directly comparing them to the target
rate may yield an incorrect judgment of the progress rate. Also, these components do not include a
method for determining a target progress rate.
The rate calculation cannot cope with an application whose progress is
naturally measured along two or more dimensions. Finally, if multiple low-importance threads execute at the same
time, they can interfere with each other’s progress measurements if they use
any common resources. The following
sections describe additional components that deal with each of these complications.
Progress
rate can fluctuate due to several factors, such as variable I/O timing [29], coarse progress measures, and clock
granularity. If the control system
directly compares progress rate to target rate, it may frequently make
incorrect progress-rate judgments, causing inappropriate suspension or
execution of the process.
MS Manners copes with noisy measurements by using a statistical rate
comparator. Rather than making an
immediate judgment about the progress rate, the comparator continues to collect
progress-rate measurements until it has enough data to confidently make a
judgment.
The comparator feeds each progress-rate measurement into a statistical
hypothesis test (see section 6.1). The
test determines whether the progress rate is below the target rate, whether it
is at or above the target rate, or whether there is not enough data to make
such a judgment. In the latter case,
the process is allowed to continue until its next testpoint, but the current
value of the suspension time is preserved.
In this manner, the process is repeatedly allowed to continue, and the
progress rate is repeatedly measured, until the hypothesis test determines that
there is enough data to make a judgment.
At that point, a good judgment will reset the suspension time, or a poor
judgment will double the suspension time and suspend the process.
This technique assumes that the variability in an application’s
measured progress rate is not serially correlated. For example, a disk-bound application may, even on an unloaded
system, encounter some very lengthy seeks.
As long as these lengthy seeks are interspersed with short seeks, the
statistical comparator will correctly recognize that the progress rate is
good. However, a correlated series of
long seeks will inappropriately trigger suspension.
Progress-based
regulation requires a target progress rate for the regulated process. Ideally, this target rate represents the
expected progress rate when the process is not contending for resources. This ideal target rate may change over time
as properties of the resources change; for example, file fragmentation [24] may reduce the ideal target rate for a process that
reads files. Therefore, it is necessary
to track changes in the ideal target rate over time.
To determine the ideal target rate, the process must run for a while
without resource contention. We could
require the user of a low-importance application to perform a calibration
procedure, during which no other process runs on the system. However, this is a burden for the user, especially
since the calibration would have to be re-run periodically to track changes in
resource characteristics.
Instead, MS Manners automatically establishes a target rate as the
exponential average of the measured progress rate at each testpoint (see
section 6.2). Clearly, this approach
tracks changes over time, but it is not clear that it reflects uncontended progress. The key insight is that the averaging
procedure gives equal weight to each testpoint’s progress-rate
measurement. Since the process is
usually suspended when the progress rate is poor, few testpoints reflect poor
progress. Since the process is usually
executing when the progress rate is good, many testpoints reflect good
progress. Thus, the average tends to
converge to the rate of good progress.
This procedure is self-perpetuating, but it requires some way to get
started. Our method begins by allowing
the process to execute briefly with no true regulation. During this time, the calibrator averages
the progress rate measurements to bootstrap the calibration procedure.
If this initial execution is performed on a relatively idle system, the
initial value for target rate will be close to the ideal target rate, so
correct regulation will immediately commence.
However, if the process experiences resource contention during its
initial execution, the target progress rate will be set too low. Eventually, once the low-importance process
executes without resource contention, the calibrator will increase the target
rate, but in the mean time, the control system will not prevent the
low-importance process from running when it should not. To deal with this problem, MS Manners limits
the maximum execution rate for a probationary period, reducing the impact on
other processes.
A major weakness of this approach is that there may never be a
significant length of time when resources are uncontended, so the target rate
may never be set correctly. This
weakness is inherent in any approach that does not require the user to establish
an idle system.
This calibration procedure relies on the suspension behavior of a
regulated process. However, if a
process becomes critical, it might choose to ignore regulation. For example, a utility that archives old
files might, when disk space becomes scarce, choose to execute even if it
contends with a high-importance process.
When a process runs more aggressively than the regulator dictates, the
calibrator subsamples the progress data, ignoring measurements from testpoints
that would not have executed if the thread were following regulation strictly.
To preserve target values across restarts, calibration data is
maintained persistently (see section 7.1).
The
progress of some applications is not easily measured along a single
dimension. Some applications execute in
sequential phases with a different type of progress in each phase. For example, a garbage collector’s progress
might be measured by its mark rate during its mark phase and by its sweep rate
during its sweep phase. Furthermore,
some applications progress along multiple dimensions concurrently. For example, a content indexer might measure
progress in both bytes of content scanned and the count of indices added to its
database. Over a long term, separate
metrics may be positively correlated, because one type of progress may be a
precursor to another, as scanning is to indexing for a content indexer. However, over a short term, separate metrics
may be negatively correlated, because time spent progressing along one axis is
time not spent progressing along another.
For such applications, there is no single scalar value that accurately
reflects the progress rate.
Our method includes two ways of dealing with multiple progress
metrics. For applications that execute
in discrete phases, any given testpoint will occur during some specific
phase. If the phase is known, our
method compares the progress rate measured at each testpoint with a target rate
specific to the phase in which the testpoint occurs. For the garbage collector example, when a testpoint occurs during
the mark phase, the measured mark rate is compared to a target mark rate, and
when a testpoint occurs during the sweep phase, the measured sweep rate is
compared to a target sweep rate. Each
target rate is calibrated separately.
If the number of testpoints in each phase is very small, the
statistical rate comparator may not accrue sufficient data within a phase to
make a judgment. If the comparator were
unable to combine measurements from separate phases, the control system would
never be able to judge the progress rate, and the process could not be
regulated. However, the hypothesis test
described in section 6.1 compares each sample against a separate target and
combines these separate comparisons into a single judgment. This enables the rate comparison to span
multiple phases, permitting regulation even when the execution phases are very
brief.
The second way of dealing with multiple progress metrics is used for
applications that progress along multiple dimensions concurrently, or whose
phases are not available to the control system. To accommodate this situation, we must change the way progress
rates are compared. Section 4.1 stated
that the control system calculates the progress rate from the measured progress
and the measured duration since the previous testpoint, and it compares this
calculated rate to a target rate. The
modification is to calculate a target duration based on the measured progress
and the target rate, and to compare this calculated duration to the measured
duration since the last testpoint. For
a single progress metric, these formulations produce equivalent results, but
the latter allows an extension to support multiple progress metrics.
For multiple progress metrics, the control system calculates a target
duration as follows: Each progress
metric is combined with its corresponding target rate to yield a target duration
for the progress along that metric.
These separate target durations are added together to produce an overall
target duration, and this is compared against the measured duration since the
last testpoint.
As an example, consider a content indexer that scans data at a target
rate of 750 kB/sec and adds indices to its database at a target rate of 120
indices/sec. If a testpoint indicates
that it scanned 60 kB of data and added 5 indices to its database in 120
milliseconds, then its target durations are 80 msec for scanning and 42 msec
for indexing. The sum of these
durations, 122 milliseconds, is the overall target duration. When the rate comparator compares this value
against the actual duration of 120 milliseconds, it determines that the
progress rate is good.
This method assumes that the time to make progress along multiple
dimensions is equal to the sum of the times to make progress along each
separate dimension. This assumption can
be rendered incorrect by overlapping operations.
The calibrator establishes a target rate for each progress metric. For phased execution, this is straightforward, since each testpoint reports progress for a single metric. For multiple concurrent progress, the calibrator uses linear regression to infer the contribution of each metric to the overall duration between testpoints (see section 6.3); in particular, it uses a technique called “ridge regression” that is not thwarted by even highly correlated metrics. Rather than exponentially averaging the progress rate over time, the system exponentially averages the state information needed for the linear regression.
When
multiple threads or processes run concurrently, every low-importance process
should defer to any high-importance process.
If no high-importance process is contending for system resources, the
low-importance processes should share the resources fairly.
If multiple low-importance processes or threads were to execute concurrently,
they might contend with each other over resources, reducing each other’s
progress rates and causing the control system to suspend them. Mutually induced suspension with binary
exponential back-off can lead to unfairness [20] or instability [5, 21], fully suspending the processes even on an idle
system.
To address this problem, MS Manners allows only one low-importance
process or thread to execute at a time.
If multiple low-importance processes or threads run concurrently, the
control system multiplexes among them, allowing each to execute until its next
testpoint before suspending it and executing another. This “time-multiplex isolation” is somewhat inefficient, since it
prevents low-importance processes from overlapping usage of different resources
with each other. However, since these
processes are not critical, we consider this an acceptable cost.
Each thread is regulated separately.
There is one progress-rate comparator per thread, so threads that use
different system resources will not be implicitly coupled. For example, if only one disk on a computer
is being used by a high-importance process, a low-importance thread that is
using that disk will be suspended, but another low-importance thread using
another disk may not be.
A subset of an application’s threads can be designated as
low-importance. Any unregulated threads
in a process will not be time-multiplex isolated, so they will reduce the
progress rate of the regulated threads if they contend for resources. However, because this contention is always
present, it will also reduce the target rate and thus not interfere with
progress-based regulation.
In the
abstract, progress-based regulation can be based on any unit of progress that
is meaningful to the application being regulated. However, since our method assumes that a drop in progress rate
indicates resource contention, any progress metric that is used for regulation
should have an approximately constant target rate over the life of the
application. For example, in a
numerical solver, estimated solution accuracy is a poor metric, because its
rate of change decreases as the solution converges. A better metric for this example is the count of iterated solution
steps, because its expected rate of change is constant, barring interference
due to resource contention.
It is also important to use metrics that provide sufficient coverage of
all progress that the application might make.
For example, consider a file archive utility that scans through files
and only archives those older than a certain date. It is not sufficient to regulate based on count of files scanned,
because this rate will drop when scanning old files, since time will be
consumed archiving them. Similarly, it
is not sufficient to regulate based on count of files archived, because this
rate will drop when scanning new files, since time will be consumed scanning
but not archiving them.
To help convey good choices of progress metrics for various applications, we present a representative but non-exhaustive list of low-importance applications that could profitably use progress-based regulation, along with a suggested set of progress metrics for each:
·
A file compressor might
indicate the quantity of data it compresses.
This would account for resources consumed reading data, writing data,
and compressing data. It could also
indicate the count of files it compresses, if the overhead in opening and closing
a file is significant relative to reading, writing, and compressing.
·
A content indexer might
indicate both the quantity of content it scans and the count of indices it adds
to its database.
·
A file archive utility,
as mentioned above, might indicate the count of files it scans and the count of
files it archives; it should also indicate quantity of data it archives, since
there is likely some resource cost per byte as well as some resource cost per
file.
·
The SETI@home [2] program, which downloads and analyzes radio telescope
data, currently runs under a screen saver.
It could instead use progress-based regulation by indicating the
quantity of data it transfers and the number of computation steps it performs.
·
A backup system might
indicate the quantity of data it uploads.
This would account for both disk and network resources.
·
A virus scanner might
indicate the count of files and the quantity of data it scans.
·
A synchronization engine
for a distributed file system, such as Coda [9], scans files and uploads copies of those that have
been modified since a certain date. It
might indicate the count of files it scans, the count of files it uploads, and
the quantity of data it uploads.
·
The disk defragmenter
described in section 8 indicates the count of file blocks it moves and the
count of move operations it performs.
·
The Single Instance
Store Groveler, as described in section 8, finds and merges duplicate
files. It reports the count of read operations
it performs and the quantity of data it reads.
The following sections provide mathematical details of the statistical hypothesis test used by the statistical comparator, the exponential averaging technique used by the automatic calibration procedure, and the linear regression technique used by the automatic calibration procedure for multiple progress metrics.
The
statistical comparator described in section 4.2 makes use of a statistical
hypothesis test to determine whether the progress rate is below the target
rate, whether it is at or above the target rate, or whether there is not enough
data to make such a judgment.
The comparator uses a paired-sample sign test [4], which is a non-parametric hypothesis test. A non-parametric test makes no assumptions
about the distribution of the data and is therefore very robust. The test depends on n, the sample
size, and r, the count of sample rates that are below their
corresponding target values (or the count of sample durations that are above
their target values). If r is
greater than a threshold value that is a function of both n and a
control parameter a, the progress rate is judged to be poor. If r is less than a different
threshold value that is a function of both n and another control
parameter b, the progress rate is judged to be good. If r falls between the two threshold
values, the progress rate is indeterminate given the current data.
The control parameters a and b determine the sensitivity of the comparator. The parameter a is the probability of making a type-I error, judging
the progress rate to be poor when it is actually good. The parameter b is the probability of making a type-II error, judging
the progress rate to be good when it is actually poor. Increasing a improves the system’s responsiveness, decreasing b improves the system’s efficacy, and increasing b relative to a improves the system’s efficiency, as follows:
Increasing a allows faster reaction to poor progress, because a is negatively related to m, the minimum number
of samples for the sign test to recognize poor progress:
|
|
|
(1) |
Decreasing b reduces the performance impact on high-importance
processes, because b is – by definition – the probability that a
marginally poor progress rate will be judged incorrectly to be good.
Increasing b relative to a improves the stability of process execution, because
when progress is good, the process suspension state is a birth-death system
that is isomorphic to a bulk service queue [10] of infinite group size with an arrival rate of a and a bulk service rate of b. Thus, the
steady-state probability that k judgments of poor progress have occurred
since the last judgment of good progress is given by:
|
|
|
(2) |
With a
probability of a, the next judgment will yield poor progress. m testpoints are required for such a
judgment, and the suspension time will be 2k. Thus, the mean steady-state fraction of time
suspended is:
|
|
|
(3) |
This system
is unstable unless a < b. Increasing b relative to a increases the duty cycle of the background process
when its progress rate is good.
We have selected values of a = 0.05 and b = 0.2 for our experiments. Theoretically, with a testpoint every few hundred milliseconds,
these values yield a reaction time of a few seconds and a 1% performance
degradation on the low-importance process.
Empirically (see section 9), these values demonstrate a prompt reaction
to high-importance activity, a very stable suspension state, and a fairly low
impact on a high-importance process.
The
automatic calibration procedure described in section 4.3 uses exponential
averaging to track changes in the target progress rate over time. Each time a testpoint occurs, the duration d
since the previous testpoint and the amount of progress Dp since the
previous testpoint are used to update the target progress rate r according
to the following rule:
|
|
r ¬ x r +
(1 – x) Dp / d |
(4) |
The value
of x is determined by the
following equation:
|
|
x = (n
– 1) / n |
(5) |
where n
is selected by its effect on the following values:
|
|
ts = n ´ expected time between testpoints |
(6) |
|
|
tl = n / m ´ maximum suspension time |
(7) |
ts is the time constant for smoothing out short-term
variations in progress, so it should be large enough to maintain a steady
target rate. tl is the time constant for tracking long-term changes
in the target progress rate, so it should be small enough to respond to changes
in resource performance characteristics.
For our performance experiments, we have set n to 10,000. Given our other parameters, this will smooth
out short-term variation with a time constant of 20 – 30 minutes and track
long-term changes with a time constant of 7 days.
The method
for dealing with multiple progress metrics described in section 4.4 uses linear
regression to infer the contributions of separate progress metrics to the
duration between testpoints. To track
changes in target progress rates over time, it exponentially averages the state
information needed for linear regression.
The multiple-metric calibration procedure determines target rate values
rk for each progress metric k. It assumes that the duration d since
the last testpoint equals the sum of the times to make each type of progress,
where each of these times is the inverse of the target progress rate rk
times the measured progress Dpk:
|
|
|
(8) |
The calibration procedure performs least-squares linear regression [4] on Equation 8 to estimate the regression
coefficients 1/rk.
Since Equation 8 has a zero offset, the regression is constrained to
have no bias term. The minimum data
needed to solve the regression are known as the “sufficient statistics,” which
in this case are the matrix x and the vector y, defined as
follows:
|
|
|
(9) |
|
|
|
(10) |
To track changes in resource performance characteristics over time, the calibrator exponentially averages these sufficient statistics. A