A proposal for records in Haskell

Simon Peyton Jones and Greg Morrisett
Last updated: 24 Feb 2003

Haskell lacks a serious record system. (The existing mechanism for named fields in data types was always seen as a stop-gap measure.) At the 1999 Haskell Workshop, Mark Jones and Simon Peyton Jones proposed a more fully-fledged record system, closely modelled on the TRex system implemented in Hugs. But Simon never got around to implementing it in GHC. Why not? Mainly because the implementation cost turned out to be relatively high. (We elaborate later.)

In this note we suggest another design for a Haskell record system. It is a little less expressive than the earlier proposal, but (a) it is considerably simpler to implement, and (b) it is rather simpler to explain. In particular, we can translate source programs using records into System Fw, GHC's existing, strongly-typed intermediate language. This is a major benefit, because it means that the existing transformations and optimisations in GHC's middle end can remain unchanged.

In common with the earlier proposal, this one strictly replaces Haskell's existing named-field design. That is important, because we need to re-use the syntactic forms; but it is also undesirable, because existing Haskell 98 programs will not work unchanged.

I'd be very interested to hear from Haskell aficionados about this design.

The basic design

Records are like tuples: they do not need to be pre-declared. For example, here are some tuple expressions and their corresponding types:
  ('a', True) :: (Char, Bool)
  ("foo", False, 'x') :: (String, Bool, Char)
Here are some record expressions and their types:
  { name = "Fred", married = True } :: { name :: String, married :: Bool }
  { x = 3, y = 4 } :: { x,y :: Int }
Needless to say, the same field name can occur in many different record types, thereby avoiding one of the most painful features of Haskell's existing system: that record field names cannot be re-used.

Pattern matching works on records, of course:

  f1 :: { x,y :: Int } -> Int
  f1 { x = x1, y = y1 } = x1 + y1
Unlike Haskell 98, we allow punning:
  f2 :: { x,y :: Int } -> Int
  f2 { x, y } = x + y
Field selection uses an infix '.', so
  f3 :: { x,y :: Int } -> Int
  f3 pt = pt.x + pt.y
The '.' is distinguished from function composition in exactly the same way that Haskell distinguishes the qualified name "M.x" from the composition "M . x"; if you want function composition, put spaces round the '.'. Record-selection '.' is left-associative, and binds more tightly even than function application, so that f pt.x means f (pt.x).

Mutter mutter. Is (f 3).y OK? Is the rule "no spaces before the y", or "no spaces on either side of the ."?

Record update uses a "<-" syntax for the update specification:

  g :: { x,y :: Int } -> { x,y :: Int }
  g pt = pt { x <- pt.y }
An alternative syntax would be this:
  g pt = { pt | x <- pt.y }
I think I like this more: the curly braces say "here comes a record", while the innards say how the record is defined.

Contrast this with Haskell's existing syntax for record update:

  g pt = pt { x = y pt }    -- The Haskell 98 form
In our new system, the right hand side would be interpreted as the application of the function pt to a singleton record { x = y pt }. Indeed, the ease of applying a function to a record makes the "record as keyword parameters" idiom very useful:
  print :: { width :: Int, mode :: Mode } -> Tree -> String
  print { width, mode } t = ....

  main = .....(print { mode = LaidOut, width = 80 } my_tree).......
Optionally, like ML, we could regard tuples as records with special labels '#1', '#2' etc; the only cost is gobbling up more name space.
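If tuples were treated this way, tuple selection could use the ordinary '.' syntax. A sketch only, assuming the '#1', '#2' labels:

```haskell
-- Sketch: assumes ('a', True) :: { #1 :: Char, #2 :: Bool }
first :: (Char, Bool) -> Char
first p = p.#1   -- equivalent to: case p of (c, _) -> c
```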

Data types

Records can, of course, appear as fields of a data type:
  data Point = Point { x,y :: Int }
Haskell allows strictness annotations in data type declarations, thus:
  data Point = Point !{ x,y :: Int }
With this annotation, GHC can lay out the constructor with two fields, just as if it had been declared like this:
  data Point = Point Int Int
The "!" is not especially nice, but it's not necessary for a newtype:
  newtype Point = Point { x,y :: Int }
To extract a field from a Point (with either declaration), you can't just use '.', thus:
  f :: Point -> Int
  f p = p.x + p.y
because p's type isn't a record type. Instead, we need to get from Point to the record type. Like the earlier proposal, we suggest that every unary constructor C implicitly defines a postfix operator .C which does this. So one could write:
  f :: Point -> Int
  f p = p.Point.x + p.Point.y
Similarly, one could write x.Just to extract the contents of a value (Just v); it would fail if the value was Nothing.

Of course, you could always define your own selector functions, in the style of Haskell 98:

  x :: Point -> Int
  x p = p.Point.x

  y :: Point -> Int
  y p = p.Point.y
But if you had many data types, each with an 'x' field, you couldn't do this. And that is one of the things that is broken about Haskell 98 right now.

Polymorphism

A proper record system must support some form of sub-typing. We do this in the spirit of Haskell's type-class system, via ordinary polymorphism together with a new form of type constraint, the "has predicate". Contrast these two definitions, both of which are well-typed:
  f3 :: { x,y :: Int } -> Int
  f3 pt = pt.x + pt.y

  f4 :: (r <: { x,y :: Int }) => r -> Int
  f4 pt = pt.x + pt.y
Here, f3 can only be applied to an argument of exactly the type {x,y::Int}. But f4 can be applied to an argument of any type r, provided that r is a record type possessing the specified field. The constraint, or predicate, (r <: {x,y::Int}) expresses the requirements on r. The lexeme "<:" is a new keyword, in types anyway. It can be read "has" or "is a subtype of", as you prefer. The constraint (r <: {x,y::Int}) is equivalent to the pair of constraints (r <: {x::Int}, r <: {y::Int}).
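Under this reading, f4 accepts an argument of any record type containing at least the fields x and y. A sketch, in the proposed syntax:

```haskell
a1 = f4 { x = 3, y = 4 }                -- r = { x,y :: Int }
a2 = f4 { x = 3, y = 4, col = "red" }   -- r = { x,y :: Int, col :: String }
```

Both calls are well-typed, because each argument type satisfies the constraint (r <: { x,y :: Int }).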

We can express polymorphic update (a standard awkward case) quite nicely:

  swapXY :: (r <: { x,y :: Int }) => r -> r
  swapXY pt = pt { x <- pt.y, y <- pt.x }
Our function f4, above, used the '.' operator, but we can use pattern matching too:
  f2 :: { x,y :: Int } -> Int
  f2 {x,y} = x + y

  f5 :: (r <: { x,y :: Int }) => r -> Int
  f5 {x,y, ..} = x + y
The ".." indicates the possibility of more fields.

And that is really it! We don't need to talk about row variables or any of that stuff; polymorphism does all we need.

Type synonyms and extension

It would be quite convenient to be able to say
  type Point = { x,y :: Int }
  type ColouredPoint = Point + { col :: Colour }
Note the "+" which stands for record union. Just as the type-lambdas that 'type' declarations introduce are really just macros, which are expanded before the type checker gets to work, so is "+". That is, "+" is reduced away before the type checker does anything serious. You could not write
  data T a = MkT (a + { x,y :: Int })
because now the "+" could not be reduced away. Is this a hack?
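To be concrete, here is how the synonyms above would expand under the macro reading; a sketch:

```haskell
type Point         = { x,y :: Int }
type ColouredPoint = Point + { col :: Colour }
-- "+" is expanded away before type checking, so ColouredPoint
-- is precisely the record type:
--     { x,y :: Int, col :: Colour }
```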

It should be possible to use these type synonyms in constraints, thus:

 move :: (r <: Point) => r -> r
 move p = p { x <- p.x + 1, y <- p.y - 1 }
(At this point the subtyping notation (using "<:") begins to look much more attractive than "has".)

Show, Eq, and Ord

A nasty problem with records (and indeed with tuples) is how to generate the Read/Show/Eq instances for them. I'm not sure how best to tackle this. I think the best thing is simply to bite the bullet and provide a built-in mechanism. Read/Show/Ix/Eq are really supposed to be available for all tuples, but GHC hacks it by defining a fixed bunch of instances in a library. That hack runs out of steam with records. Maybe the solution is to abolish the hack, by building more into the compiler.

Perhaps we can take a leaf from the derivable-type-classes paper too. Suppose the compiler provided built-in support for the following instance schemes:

  instance Eq (a,(b,c)) => Eq {l::a,m::b,n::c}

  instance ShowR (a :*: b :*: One c) => Show {l::a,m::b,n::c}
    -- Calls showR on a suitably-transformed argument

  class ShowR r where
    showR :: [String] -> r -> ShowS
    -- The [String] are the labels
    -- Instances for :*:, One, defined by user
Whether the additional programmability is worth it isn't clear.

Polymorphic record fields

We really, really want to have polymorphic record fields:
  newtype T m = T { return :: forall a. a -> m a, ...etc... }
So if x :: T m, then x.T.return must be a polymorphic function. The orthogonal way to achieve this is to allow records with polymorphic fields (and tuples too, presumably).

It's not quite clear what this means for inference. I guess that if you see 'x.l', it's only polymorphic if we know the record type for 'x' from the context.
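For instance (a sketch, using the T declaration above): when the record type of the argument is known from a signature, the field can be used polymorphically:

```haskell
-- Fine: t's record type is known, so t.T.return keeps its polymorphic type
g :: T m -> (m Int, m Bool)
g t = (t.T.return 3, t.T.return True)
```

whereas a use of '.return' on an argument of unknown record type would give the field only a monomorphic type, via a has-constraint.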

Comparison with Haskell 98

One very convenient feature of Haskell 98 is the ability to have the same field in many constructors of the same type:
  data Decl = DeclType { declName :: String, ... }
            | DeclData { declName :: String, ... }
            | ...
You can still do this, of course, and you can define a selector too, like this:
	declName :: Decl -> String
	declName (DeclType r) = r.declName
	declName (DeclData r) = r.declName
	...etc...
Now the function declName is just like the current Haskell 98 selector. But we'd have lost the convenience of having the compiler generate the code for the selectors. Enter Template Haskell? Or "deriving( declName )"?

Comparison with TRex

Here's what we miss relative to TRex:

Implementation

A major benefit of our approach is that we can translate it into System F. Initially we give a full translation; then we explain how it can be optimised.

We say this is a "major benefit" for two reasons:

Improvement

Type inference gathers constraints as usual. There is the "improvement" rule that if we have the two constraints (r <: {l:t1}) and (r <: {l:t2}) then t1=t2. This is very like having a functional dependency (r,l -> t), so the technology exists already.
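For example, consider this ill-typed definition (a sketch):

```haskell
k r = (r.l && True, r.l + (1 :: Int))
```

Inference gathers (r <: { l :: Bool }) from the first component and (r <: { l :: Int }) from the second; improvement then demands Bool = Int, so k is rejected.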

Basic translation

The basic idea is best explained by example. Consider our running examples:
  f3 :: { x,y :: Int } -> Int
  f3 pt = pt.x + pt.y

  f4 :: (r <: { x,y :: Int }) => r -> Int
  f4 pt = pt.x + pt.y
Here are their translations:
  f3 :: (Int,Int) -> Int
  f3 pt = (fst pt) + (snd pt)

  f4 :: (Has r Int, Has r Int) -> r -> Int
  f4 (hx, hy) pt = get hx pt + get hy pt
Notice that the monomorphic version, f3, gives more efficient code, because the compiler can bake in the selection operation. In contrast, f4 has to use a selector that is passed in. Update is similar:
  swapXY :: (r <: { x,y :: Int }) => r -> r
  swapXY pt = pt { x <- pt.y, y <- pt.x }
translates to:
  swapXY :: (Has r Int, Has r Int) -> r -> r
  swapXY (hx,hy) pt = set hy (set hx pt (get hx pt))
                             (get hy pt)
The two big points to notice are

Optimising the translation

In the basic translation we represent the evidence that r has a particular field by a pair of functions. But of course we would rather represent it by an integer, the offset of the field.

What we cannot do is to expose the Int representation. Consider the function

  h :: (r <: {a::Int, b::Bool}) => r -> Int
  h r = if r.b then r.a else r.a + 1
We cannot translate it to
  h :: (Int, Int) -> r -> Int
  h (ha, hb) r = if (get hb r) then get ha r else get ha r + 1
where ha is the offset of a and hb is the offset of b. Reason: we would then have no way to type-check the translation, at least not without getting into dependent types, because we need (get ha r) and (get hb r) to have types Int and Bool respectively, but plainly we can't deduce that from the translated code of h. No, no, no.

Instead, just imagine that Has is an abstract type, known only to the code generator, with primitive operations get and set. The code generator can then generate good (inline) code for applications of get and set. The other thing we need is a way to construct values of type Has. For example:

  (h {a=3, b=True})
might translate to:
  (let ha = has_1_(Int,Bool)
       hb = has_2_(Int,Bool)
   in h (ha,hb) (3,True))
Here, has_i_type is an indexed family of constants. So far as most of GHC is concerned, it's just an opaque constant of the appropriate type, but the code generator knows that it is represented by the Int index. The type of has_i_(t1,...,tn) is just Has (t1,...,tn) ti, of course. GHC already has an indexed family of constants for foreign calls, so this is not entirely new.

Finally, we need some simple transformations (= rewrite rules):

   get has_i_(t1..tn) e
     ====>
   case e of (x1::t1, ..., xn::tn) -> xi
Similarly for set. This is what makes things efficient when we know the types and there is no polymorphism involved.
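For instance (a sketch), instantiating f4 at the record type { x,y :: Int } makes the evidence concrete, and the rule fires:

```haskell
get has_1_(Int,Int) pt
  ====>
case pt of (x1::Int, x2::Int) -> x1
```

so the specialised f4 optimises to exactly the code of f3.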

Open questions