This document was made by
OCR from a scan of the technical report. It has not been edited or proofread
and is not meant for human consumption, but only for search engines. To see the
scanned original, replace OCR.htm with Abstract.htm or Abstract.html in the URL
that got you here.
NOTES ON THE DESIGN OF
EUCLID
G. J. Popek
UCLA Computer Science Department
Los Angeles, California 90024
J. J. Horning
Computer Systems Research Group
University of Toronto
Toronto, Canada
B. W. Lampson, J. G.
Mitchell
Xerox Palo Alto Research
Palo Alto, California 94304
R. L. London
USC Information Sciences Institute
Marina del Rey, California 90291
Euclid is a language
for writing system programs that are to be verified. We believe that
verification and reliability are closely related,
because if it is hard to reason about programs using a language feature, it
will be difficult to write programs that use it
properly. This papei discusses a number of issues in the design of Euclid,
including such topics as the scope of names,
aliasing, modules, type-checking, and the confinement of machine dependencies;
it gives some of the reasons for our expectation tuat
programming in Euclid will be more reliable (and will produce more reliable
programs) than programming in Pascal, on which Euclid is based.
Key Words and Phrases:
Euclid, verification, systems programming language, reliability,
Pascal, aliasing, data encapsulation, parameterized
types, visibility of names, machine dependencies, legality assertions, storage allocation.
CR Categories: 4.12, 4.2, 4.34, 5.24 Introduction
Euclid is a programming
language evolved from Pascal [Wirth 1971] by a series of changes
intended to make it more suitable for verification and for system programming.
We
expect many of these changes to improve the reliability
of the programming process, firstly by enlarging
the class of errors that can be detected by the compiler, and
secondly by making explicit in the program text more of the
information needed for understanding and maintenance. In
addition, we expect that effort expended in program verification will directly
improve program reliability. Although Euclid is intended
for a rather restricted class of applications, much of what
we have done could surely be extended to
languages designed for more general purposes.
Like all designs, Euclid represents a
compromise among conflicting goals,
reflecting the skills, knowledge, and tastes (i.e., prejudices) of its designers. Euclid was conceived as an attempt to integrate into a practical language the
results of several recent
developments in programming methodology and program verification. As Hoare [1973] has pointed out, it is considerably more difficult to design a good
language than it is to select one's favorite set of good language
features or to propose new ones. A language
is more than the sum of its parts,
and the interactions among its features are often more
important than any feature considered
separately. Thus this paper does not
present many new language features. Rather, it
discusses several aspects of our design that, taken together, should improve the reliability of programming in Euclid.
We believe that the goals
of reliability, understandability, and verifiability are
mutually reinforcing. We never consciously
sacrificed one of these in Euclid to achieve another. We
had a tangible measure only for the third (namely, our ability to write
reasonable proof rules
[London et al. 1977]), so we frequently
used it as the touchstone for all three. Much of this
paper is devoted to decisions motivated by the problems of verification.
Another important goal of Euclid, the construction of acceptably
efficient system programs, did not seem attainable without some sacrifices in
the preceeding three goals. Muchof the language design effort was expended in
finding ways to allow the precise control of
machine resources that seemed to be necessary, while narrowly confining
the attendant losses of reliability,
understandability, and verifiability. These aspects of the language are
discussed in more detail by [Barnard and Elliott
1977]. The focus here is on features that contribute
to reliability.
Goals, History, And Relation To Pascal
The chairman
originally charged the committee as follows: "Let me
outline our charter as I understand it. We are
being asked to make minimal changes and extensions to Pascal in order to make the
resulting language one that would be suitable for systems programming while
retaining those characteristics of the language that are attractive for good
programming style and verification. Because it is highly
desirable that the language and appropriate compilers be available
in a short time, the language definition effort is to
be quite limited: only a month or two in duration. Therefore, we should not attempt to design a
significantly different language, for that, while highly
desirable, is a research project in itself. Instead, we
should aim at a 'good' result, rather than the superb."
[Popek 1976] We defer to the Conclusions a discussion of our current feelings
about these goals and how well we have met them.
The design of Euclid
took place at four two-day meetings of the authors in 1976,
supplemented by a great deal of individual
effort and uncounted Arpanet messages. Almost all of the basic changes
to Pascal were agreed upon during the first
meeting; most of the effort since then has been devoted to smoothing out unanticipated interactions among the changes and to developing a suitable
exposition of the language. Three versions of the Euclid Report have
been widely circulated for comment and
criticism; the most recent appeared in the February 1977 Sigplan Notices [Lampson
et
al. 1977]. Proof rules are currently
being prepared for
publication [London et al. 1977].
The System Development Corporation is currently implementing Euclid [Lauer 1977]. Since the implementation
is incomplete and no sizable Euclid programs have been written, our
expectations are still untested. Further experience may dictate changes in the
language.
We developed Euclid by modifying Pascal only where we saw "sufficient reason." We see it as a
(perhaps eccentric) step along one of
the main lines of current programming language development: transferring more and more of the work of producing a correct program, and verifying that it
is consistent with its specification,
from the programmer and the verifier
(human or mechanical) to the language and its compiler.
Our changes to Pascal generally took the form of
restrictions,
which allow stronger statements about the properties of programs to be based on
the rather superficial, but quite reliable, analysis that the compiler can
perform. In some cases, we
introduced new constructions whose meaning could be explained by expanding them
in terms of existing
Pascal
constructions. These were not merely "syntactic sugaring": we had to introduce
them, rather than leaving the expansion to
the programmer, because the expansion would have been forbidden by our
restrictions. Because the new constructions were sufficiently restrictive in
some other way, breaking our own restrictions
in these contolled ways did not break the protections they offered.
The
main differences between Euclid and Pascal are
Visibility of names
Euclid
provides explicit control over the visibility of names by requiring the program to list all the
names imported into a routine (i.e., procedure or function) or module body, or
exported from a module. The
imported names must be
accessible in every scope in which the routine
or
module name is used.
Variables: Euclid guarantees that two names in the same scope can never refer to the same
or overlapping variables. There is a single,
uniform mechanism for binding a name to a variable in a procedure call, on block entry (replacing the Pascal with statement),
or in a variant record discrimination.
Pointers. The avoidance of overlapping is extended to
pointers by allowing dynamic variables to be partitioned into collections,
and guaranteeing that two pointers into different collections can never
point to overlapping
variables.
Storage allocation: The program can control the
allocation of storage for dynamic variables explicitly, in a way that narrowly confines
knowledge about the allocation scheme used and opportunities for making
type errors. It is
also possible to declare that the
dynamic variables in a collection should be reference-counted
and automatically deallocated when no pointers to them remain.
Types: Type declarations are generalized to allow formal parameters,
so that arrays can have bounds that are fixed only when they are allocated, and variant records can be handled in a
type-safe manner.
Records are
generalized to include
constant
components.
Modules: A new type-constructor, module, provides
a
mechanism for packaging a collection of logically
related declarations (including
variables, constants,
routines, and types) together with initialization and finalization
components that are executed whenever instances of the type are created or
destroyed. This provides some of the advantages of abstract data types.
Constants: Euclid defines a constant to be a
literal or a name whose value is fixed throughout the scope in which it is declared, but not
necessarily at compiletime. A constant whose
value is fixed at compile time (as in
Pascal) is called a manifest constant.
For statements: The parameter of the for statement is a controlled
constant in Euclid. A module can be used as a generator to enumerate a
sequence of values for the controlled constant.
Loopholes:
Features of the
underlying machine can be accessed, and the
type-checking can be overridden, in a
controlled way. Except for these explicit loopholes, Euclid is designed
to be type-safe.
Assertions. The syntax allows
assertions to be
supplied at convenient
points to assist in verification
and to provide useful documentation. Some
assertions
can be compiled into run-time checks to assist in the
debugging of programs whose verification
is
incomplete.
Deletions. Several Pascal
features have been omitted from Euclid: input-output, real numbers,
multi-dimensional arrays, labels and go to's,
and
functions and procedures as parameters.
The only new features which can make it
hard to convert a Euclid program into a valid Pascal program by straightforward
rewriting are parameterized type declarations, storage allocation, finalization, and some of the
loopholes.
The balance of this paper presents the
motivations and consequences
of several of the changes.
Verification And Legality
One of our fundamental assumptions is that
(in principle)
all Euclid programs are to be verified before use. That is, we expect formal proofs of the
consistency between
programs and their specifications. These proofs may be
either
manual or automatic; we expect similar considerations to apply in either
case. We used the axiomatic method of [Hoare and Wirth 1973] for guidance.
Perhaps the most obvious consequence of this
assumption
is the provision within the language of syntactic means for including specifications and
intermediate
assertions. Routines are
specified by pre- and post-assertions; modules,
by a pre-assertion, an invariant, an abstraction function, and specifications for exported routines and types. In addition, assertions may be placed at any point
in the flow of control. (Most
verifiers require at least one such assertion to "cut" each loop.) Effort invested in writing such
assertions
should pay off in more understandable, better-structured
programs, even before the verification process is begun.
The basic assertion language consists of the Boolean expressions of Euclid. Most verifiers will require
somewhat richer languages,
containing, for example, quantifiers, ghost variables, and specification
routines. Rather than picking a particular form for this extended language, we
decided that extended assertions would be
bracketwd as comments; each verifier may choose a private syntax,
without affecting Euclid compilers. (Indeed, a program might be augmented with
two distinct sets of assertions, intended for different verifiers.)
Most programs presented to verifiers are actually wrong; considerable time can be wasted looking for proofs
of incorrect programs before discovering that debugging is still
needed. This
problem can be reduced (although not
eliminated) by judicious testing, which is generally the most
efficient
way to demonstrate the presence of bugs. To assist in the testing process, any scope in
Euclid can be prefixed by checked, which will cause the compilation of run-time
checks for all basic assertions (Boolean expressions not enclosed
in comment brackets) within the scope; this
includes all legality
assertions, which
will be discussed later. If any assertion
evaluates to False when it is reached in
the program,
execution
will be aborted with a suitable message.
Because we expect all Euclid programs to be verified, we
have not made special provisions for exception handling
[Melliar-Smith
and Randell 1977][McClaren 1977]. Run-time software errors should not
occur in verified programs (correctness is a compile-time property), and we
know of no efficient general mechanisms by which software can recover from unanticipated
failures of current hardware. Anticipated conditions can be dealt with
using the normal
constructs of the
language; most proposals for providing
special
mechanisms for exception handling would add
considerable
complexity to the language [Goodenough 1975].
We have also been led to a somewhat unorthodox
position on
uninitialized variables and dangling pointers. We do not forbid
these syntactically (cf. [Dijkstra 1976] for a rather elaborate proposal), nor, for reasons of
efficiency, do
we supply a
default initialization (e.g., to "undefined"). Our
reasoning is as
follows: verification generally places
stronger
constraints
on variables (pointers) than that they merely have
valid values when they are used--they must have suitable
values. However, if a program can be verified without
reference
to the initial value of a variable (current variable to
which a pointer points), then any value (variable) is
acceptable.
Relying
so heavily on verification has an obvious pitfall: suppose that the formal
language definition and the implementation don't agree. (Indeed, for Pascal,
they do not.) We could then be in the embarrassing situation of having failures in programs
that have formally been proved "correct" [Gerhart and
Yelowitz 1976]. Aside from some omissions
and known technical difficulties (e.g., [Ashcroft
1976]), the
major
discrepancies between the Pascal definition and
implementation
take the form of restrictions imposed by the
definition, but
not enforced by the implementation. For
example,
"The axioms and rules of inference...explicitly forbid the
presence of certain 'side-effects' in the evaluation of functions and execution
of statements: Thus programs which invoke such side-effects are, from a formal
point of view, undefined. The absence of such side-effects can in principle be checked by a textual
(compile-time) scan of the
program. However, it is not obligatory for a Pascal
implementation to
make such checks." [Hoare and Wirth
1973, p.337]
In the design of Euclid,
we have made a major effort to ensure that there are no gaps between what is required
by the definition and what must be enforced
by any implementation (and that such
enforcement is a reasonable task). Gaps have been eliminated by a variety of means: removing features from the
language, extending the formal definition, placing more definite requirements on the implementation,
and finally, introducing legality assertions as
messages from the compiler to the verifier about necessary checking.
There
are many language-imposed restrictions that must be satisfied by every legal
Euclid program. In addition to syntactic constraints, many of them (e.g.,
declaration of identifiers before use) are easily checked by the
compiler, and
it would be silly to ask the verifier to
duplicate this effort.
Others (e.g.,
type constraints) can usually be checked rather easily by the compiler, but may
occasionally depend on dynamically generated values. Still others (e.g., array
indices within bounds, arithmetic overflow) will usually depend on
dynamic
information, although the compiler can often use declared ranges or flow analysis to do partial
checking. (For example, i := i + 1
will obviously never assign a value that is too small if i was
previously in range.) Our philosophy is that the verifier should rely as much as possible on the checking done by the compiler. In fact, unless the
compiler indicates differently, the
verifier is entitled to assume that the program is completely legal. The
compiler is to augment the program
with a legality assertion (which the verifier is to prove)
whenever it has
not fully checked that some constraint is satisfied. Any program whose legality
assertions can all be verified
is a legal program, with well-defined semantics.
The compiler may produce legality assertions only for certain
conditions specifically indicated in the Euclid Report. They always take the
form of Boolean expressions, and are usually quite simple (e.g., i < 10,
i = j, p not= C.nil). Note that legality is a more fundamental
property than correctness, since (a) it is defined as consistency with the
language specification, rather than consistency with a particular program
specification (a program could be consistent with one
specification, and inconsistent with another), and (b) a program that is
illegal has no defined meaning, and hence cannot be said to be consistent with any specification. Also
note that a particular program is not sometimes legal and
sometimes illegal (e.g., depending on whether i = j on
some run): the verifier must show that the legality assertions
are valid (always true).
Later sections of this paper discuss some of the
non-obvious
legality conditions of Euclid.
Names And Scopes
In "Algol-like"
languages the rules connecting names (identifiers)
to what they denote (e.g., variables) give rise to some subtle, but
troublesome, problems for both programmers and verifiers. Some
variables, for example those passed as
variable
parameters, may be accessible by more than one name. Thus, assignment to x may
change y we will call this aliasing. Access to a global variable can
accidentally be lost in a scope by the interposition of a new declaration
involving the same name
(the "hole in scope" problem). Conversely,
failure to
declare a variable locally may result in a more global access than was intended.
(Such problems are generally not detected by compilers.) The intimate connection
between a variable's
lifetime and its scope frequently forces variables
to be declared
outside the local scopes in which they are intended to be used. Finally, the
automatic importation of all names in outer scopes into contained scopes, unless
redeclared, tends to
create large name spaces with
correspondingly
large opportunities for error. For more complete discussions of these problems, and some suggested solutions, see [Wulf and Shaw 1973] and [Gannon
and Horning 1975].
Several
Euclid features are intended to remove these problems;
they are discussed here and in the following two sections. Unlike the designers of Gypsy [Ambler et al. 1977], we did not discard the Algol notion of nested
scopes, which seems to us to be a
natural representation of hierarchy, and a good first approximation to the necessary name control. Rather, we have chosen to strengthen it by a number
of restrictions.
The
first restriction requires the programmer to control the "flow" of names between
levels of abstraction by means
of an import list. Every closed scope (routine or module
body) must be accompanied by such a list specifying those names
accessible in the containing scope that are to be accessible
within the closed scope, and, in the case of variables, whether the access is to be
read-only or read-write. Other names are
simply inaccessible. An open scope (e.g., an Algol-like block)
may access any name accessible within the scope that contains it.
The control supplied by import lists allows us to place a
further restriction: no name accessible in a scope may be redeclared in that scope. Such a
restriction would probably be intolerable in
Pascal, where a scope has no "protection" against unwanted
names from the outside, but it seems
sensible in Euclid. In
fact, it is generally a programming
error to redeclare an
imported name. Undiagnosed holes in
scopes would
certainly cause problems for the reader
and
maintainer,
and for the human verifier.
In practice, we found it desirable to relax slightly the requirement of
explicit importation. We do not wish every routine that uses built-in types, such as integer, or routines, such as abs(x), to import them explicitly. Many
programs will have user-defined types and routines that
are almost as
widely
used. Therefore, we have provided an
overriding
mechanism: constant,
routine, and type names may be declared
pervasive in a scope, which means that they will be implicitly
imported into all contained scopes (and hence may
not be
redeclared). The standard Euclid types
are all
pervasive: therefore, a
program cannot override them.
Euclid prohibits "sneak access" to
variables by means of procedure
calls. The name of a closed scope may not be imported (or used) if the names that are imported into
its body are not also imported
(accessible) at the point of use. It
is this
restriction that
simplifies the enforcement of a
complete
ban on side-effects in functions (and
hence in
expressions). Functions
cannot have variable parameters or
import variables. Although
they may import and call procedures, they cannot change any nonlocal variables
by doing so: thus, they behave like mathematical
functions. The
possibility
of side-effects in functions and expressions
complicates the verifier's task, and we believe
that their use is
rather
error-prone. We are willing to
sacrifice a few
well-known programming
tricks that rely on "benign"
side-effects
in order to simplify life for the readers,
maintainers, and verifiers of programs, and to
open up new
optimization possibilities
for the implementors of the language.
Programs involving functions with side-effects can be rewritten to use procedures instead.
Import lists are intended to make the interface
to each closed scope explicit. However, the list supplied by the
programmer is incomplete
(for the reader) in two respects: 1)
only
names are given, not complete
declarations, and 2)
pervasive names do not
appear. The compiler is expected to
complete the interface description from its symbol table. It
must augment the listing with information
from the
declarations of the imported
names, and the user-defined
pervasive
declarations for that scope. Requiring
the
programmer to supply this
information (which is mere
duplication) would invite
error, for no identifiable gain.
Aliasing And Collections
The disadvantages of aliasing (for programmers, readers, verifiers and implementors) have been
well-documented
[Hoare
1973, 1975] [Fischer and LeBlanc 1977]. If
assignment to x has the
"side-effect" of changing the value of
y, it is likely to cause surprise and
difficulty all around.
However, programmers and
language designers have been
reluctant
to eliminate all features that can give rise to
aliasing, e.g., passing
parameters by reference, and pointer variables. In designing Euclid, we took a slightly different approach: we kept the language features, but
banned aliasing. Essentially, we examined
each feature that could give rise to
aliases,
and imposed the minimum restrictions necessary
to
prevent them. Every
variable starts with a single name: if no
aliases can be created, then,
by induction, aliasing will not
occur.
The
case of variable parameters to procedures is typical, and easily generalized to
import lists and binding lists. All of the actual var parameters in a call must be nonoverlapping. If the actual parameters are
simple names ("entire variables"), this requirement merely means that they must all be
distinct However, we must also prohibit passing a
structured variable
and one
of its components (e.g., A and A(1)). What about two
components of the same variable? This is allowed
if they are
distinct (e.g., A(1) and A(2)), and disallowed if they are the
same (e.g., A(1) and A(1)). Since subscripts may be
expressions, it may be necessary to
generate a legality
assertion
(e.g., I not= J in the case
of A(I)
and AM) to
guarantee their distinctness.
It may appear that arrays already violate our rule that assignment to one
entire variable can never change another. After all, assignment to A(I) may change A(J). However, these are not entire variables. We adopt the
view of [Hoare
and
Wirth 1973, p.345] that an "assignment to an array component" is
actually an assignment to the containing array. Thus A(1) := 1 is an assignment to A, and
can be expected to change AM if J = 1.
Pointers appeared to pose a more difficult problem. Assignment to pt
(i.e., to the variable to which p refers) may change the value of qt (if p and q happen to point to the same variable, i.e., if p = q), or
may even change the value of
x (if
pointer variables are allowed to point at program
variables). We avoided the latter problem by retaining
Pascal's
restriction that pointers may only point to
dynamically generated
(anonymous) variables. (This is enforced
by not providing an "address of" operator or coercion.) The usual treatment of the former
problem is to consider pointers
as indices into "implicit arrays" (one for
each
type of dynamic
variable), and dereferencing as
subscripting [Luckham and
Suzuki 1976, Wegbreit and Spitzen
1976]. Thus pt is merely a shorthand for C(p), where C denotes p's implicit array, and the proof rules
for arrays can be carried over
directly. In particular, assignment via a dereferenced pointer is considered to be an assignment to its implicit array. From the verifier's standpoint,
the situation is slightly
better than that for arrays, since the decision of
whether two subscripts are
equal may involve arbitrary arithmetic
expressions, while the decision of whether two pointers are equal reduces to
the question of whether they resulted from the same dynamic variable generation
("New" invocation).
We have not yet discussed dereferenced pointers as variable parameters. If pt and qt (really C(p) and
C(q)) are both passed, the
nonoverlapping requirement demands p not= q. Passing
both p and pt (really p and C(p)) is not a problem
unless the formal parameter
corresponding to p is dereferenced, which can only
happen if C is accessible (i.e., imported). But then there would be an overlap
between C(p) and C, which makes the call clearly illegal. Passing
pointers themselves as parameters
(like passing array indices) does not
create aliasing
problems, since dereferenced pointers (like subscripted arrays) are not entire
variables; assignment to one of
them is considered as assignment to its implicit array.
Although the solution in the previous paragraph is formally complete, it is unsatisfactory in
practice. The minor difficulty is that Euclid provides no way of naming
implicit arrays for purposes of
importation. The major problem is that it is too restrictive. It prohibits passing a dereferenced pointer as a variable parameter to any procedure
that is allowed to dereference pointers to variables of
the same type
(i.e., that imports the implicit array for that type). We
have
solved both these problems
by introducing the notion of collections, which are explicit program variables that act
like the "implicit arrays" indexed by
pointers. Each pointer is
limited to a single
collection, and pt is still an acceptable shorthand for C(p). where C is now the collection name. pt is only allowed where C is accessible. Note that
this makes it possible to pass pointers as parameters to procedures that are not allowed to dereference them, although they can copy them.
We allow any number of collections to have elements of the same type, with no more difficulty than
arises from multiple arrays of
the same type. Thus, the programmer can partition his dynamic variables and
pointers into separate collections
to indicate some of his knowledge about how they will
be used; the verifier is assured that pointers in different
collections can never point
to overlapping variables. The astute
reader will have noted that we have returned to the "class variables" that were in the
original Pascal, but dropped in the revised version.
Collections also
provide convenient units for storage
management. We
have chosen to associate with each
collection
both the decision of whether to reference-count, and the selection of the (system- or user-supplied)
storage management module (called a zone) to provide the space.
One consequence of our complete elimination of
aliasing is that
"value-result" and "reference" are completely
equivalent implementation
mechanisms for variable parameters,
and a compiler is free to choose between them strictly on the basis of efficiency.
Modules
Since the introduction of "classes" by Simula 67 [Dahl et al. 1968], several programming languages have
introduced mechanisms for
"data encapsulation" or "information hiding"
[Parnas 1971]. A survey
of desirable properties of such
mechanisms is given in
[Horning 1976]. For Euclid, we chose
something less powerful than "classes," "forms" [Shaw et al. 1977], or "clusters" [Liskov et
al. 1977]. Our modules are closely akin to, but somewhat more complex than, the
"modules"
of Modula [Wirth 1977]. Adjusting the details of
modules satisfactorily has been more difficult
than expected.
Perhaps
this is because we still have an
imperfect
understanding, but it may also be because we
violated our
usual practice, and started from implementation
considerations, rather than
from verification issues.
The basic idea is that a
module should "package up" a
data structure and a related
set of routines for its
manipulation,
and should hide its internal details from the
outside world. We
originally viewed it as a sort of glorified
record, with some extra
components (routines, types, initialization,
finalization) and some control over the external visibility of its names (an export list). Like record,
module is a type constructor, and a module type can be used to create
many instances; this is the major source