Koka language specification

This is the draft language specification of the Koka language, version 0.1.
Currently only the lexical and context-free grammar and the layout rule are specified. The standard libraries are documented separately.

Lexical syntax

We define the grammar and lexical syntax of the language using standard BNF notation where non-terminals are generated by alternative patterns:

nonterm ::= pattern1 | pattern2

In the patterns, we use the following notations:

terminal A terminal symbol
x0A A character with hexadecimal code 0A
a..f The characters from a to f
 
( pattern ) Grouping
[ pattern ] Optional occurrence of pattern
{ pattern } Zero or more occurrences of pattern
pattern1 | pattern2 Choice: either pattern1 or pattern2
 
pattern<diff> Difference: elements generated by pattern except those in diff
nonterm[lex] Generate nonterm by drawing lexemes from lex

Care must be taken to distinguish meta-syntactic symbols such as | and ( ) from occurrences of the same characters as concrete terminal symbols.

Source code

Source code consists of a sequence of 8-bit characters. Valid characters in actual program code consist strictly of ASCII characters, which range from 0 to 127 and can be encoded in 7 bits. Only comments, string literals, and character literals may contain extended 8-bit characters. This means that identifiers and operators cannot use Unicode symbols.

Encoding

A program source is assumed to be UTF-8 encoded, which allows comments, string literals, and character literals to contain (encoded) Unicode characters. Moreover, the grammar is designed such that a lexical analyzer and parser can work directly on source files without UTF-8 decoding or Unicode category identification. To further facilitate the processing of UTF-8 encoded files, the lexical analyzer ignores an initial byte-order mark that some UTF-8 encoders insert: any program source may start with the three byte-order mark bytes 0xEF, 0xBB, and 0xBF, which are ignored.
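For instance, a lexer front end might skip the byte-order mark as follows. This is a minimal illustrative sketch; skip_bom is a hypothetical name and not part of the Koka toolchain:

```python
# Hypothetical helper: drop the optional UTF-8 byte-order mark
# (0xEF 0xBB 0xBF) before handing the source to the lexer.
def skip_bom(source: bytes) -> bytes:
    if source[:3] == b"\xEF\xBB\xBF":
        return source[3:]
    return source
```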

Lexical grammar

In the specification of the lexical grammar, all white space is explicit, and there is no implicit white space between juxtaposed symbols. The lexical token stream is generated by the non-terminal lex, which consists of lexemes and whitespace.

Before lexical analysis, a linefeed character is inserted at both the start and the end of the input, which makes it easier to specify line comments and directives.

Lexical tokens

lex ::= { whitespace | lexeme }
lexeme   ::= conid | qconid
| varid | qvarid
| op | opid | qopid | wildcard
| natural | float | string | char
| reserved | reservedop | special

Identifiers

qconid ::= modulepath conid
qvarid ::= modulepath lowerid
modulepath ::= lowerid / { lowerid / }
 
conid ::= upperid
varid ::= lowerid<reserved>
 
lowerid ::= lower { idchar }
upperid ::= upper { idchar }
idchar ::= letter | digit | _
 
wildcard ::= _ { idchar }
typevarid ::= letter { digit }
 
reserved ::= infix | infixr | infixl | prefix
| type | cotype | rectype | alias
| forall | exists | some
| fun | function | val | var | con
| if | then | else | elif | match | return
| module | import | as
| public | private | abstract
| interface | instance | with
| external | inline | include
| yield | qualified | hiding (future reserved words)

Operators and symbols

qopid ::= modulepath opid
opid ::= ( symbols )
 
op ::= symbols<reservedop | typeop>
 
symbols ::= symbol { symbol } | /
symbol ::= $ | % | & | * | +
| " | ! | \ | ^ | #
| = | . | : | - | ?
| | | < | >
 
special ::= { | } | ( | ) | [ | ] | ; | , | `
 
reservedop ::= = | . | : | ->
typeop ::= > anglebar { anglebar }
| < anglebar { anglebar }
| | angle { symbol }
| ->< { symbol }
| :? { symbol }
 
anglebar ::= | | angle
angle ::= < | >

Literals

string ::= @" { graphic<"> | utf8 | space | tab | newline | "" } " (raw string)
| " { graphic<"|\> | utf8 | space | escape } "
char ::= ' ( graphic<'|\> | utf8 | space | escape ) '
 
escape ::= \ ( charesc | hexesc )
charesc ::= n | r | t | \ | " | '
hexesc ::= x hexdigit2 | u hexdigit4 | U hexdigit4 hexdigit2
hexdigit4 ::= hexdigit hexdigit hexdigit hexdigit
hexdigit2 ::= hexdigit hexdigit
 
float ::= decimal . decimal [exponent]
exponent ::= (e | E) [- | +] decimal
natural ::= decimal | 0 (x | X) hexadecimal
decimal ::= digit { digit }
hexadecimal ::= hexdigit { hexdigit }
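The float and natural productions translate directly into regular expressions. The sketch below is illustrative only; the names DECIMAL, FLOAT, and NATURAL are not part of any Koka tool:

```python
import re

# Direct transcription of the literal grammar:
#   float   ::= decimal . decimal [exponent]
#   natural ::= decimal | 0 (x|X) hexadecimal
DECIMAL = r"[0-9]+"
FLOAT   = re.compile(rf"{DECIMAL}\.{DECIMAL}([eE][-+]?{DECIMAL})?\Z")
NATURAL = re.compile(rf"(0[xX][0-9a-fA-F]+|{DECIMAL})\Z")
```

Note that a float always requires a decimal point, so 1e9 is not a valid float literal under this grammar.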

White space

whitespace ::= white { white }
white ::= newline | space
| linecomment | blockcomment
| linedirective
 
linecomment ::= // { linechar } newline
linedirective ::= newline # { linechar } newline
linechar ::= graphic | utf8 | space | tab
 
blockcomment ::= /* blockpart { blockcomment blockpart } */ (allows nested comments)
blockpart ::= blockchars<blockchars (/*|*/) blockchars>
blockchars ::= { blockchar }
blockchar ::= graphic | utf8 | space | tab | newline

Character classes

letter ::= upper | lower
upper ::= A..Z
lower ::= a..z
digit ::= 0..9
hexdigit ::= a..f | A..F | digit
 
newline ::= [return] linefeed (Windows- or Unix-style end of line)
 
space ::= x20 (a space)
tab ::= x09 (a tab (\t))
linefeed ::= x0A (a line feed (\n))
return ::= x0D (a carriage return (\r))
graphic ::= x21..x7E (a visible character)
 
utf8 ::= xC0 x80 (encoded 0 character)
| (xC2..xDF) cont
| xE0 (xA0..xBF) cont
| (xE1..xEC) cont cont
| xED (x80..x9F) cont
| (xEE..xEF) cont cont
| xF0 (x90..xBF) cont cont
| (xF1..xF3) cont cont cont
| xF4 (x80..x8F) cont cont
cont ::= x80..xBF
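The utf8 production can be checked mechanically. The following sketch is a hypothetical helper, not part of the Koka implementation; it tests whether a byte sequence matches one of the alternatives above:

```python
def matches_utf8(bs: bytes) -> bool:
    """Check a byte sequence against the `utf8` production."""
    def cont(b):
        return 0x80 <= b <= 0xBF
    if len(bs) == 2:
        if bs == b"\xC0\x80":                      # the encoded 0 character
            return True
        return 0xC2 <= bs[0] <= 0xDF and cont(bs[1])
    if len(bs) == 3:
        lead = bs[0]
        if lead == 0xE0:
            return 0xA0 <= bs[1] <= 0xBF and cont(bs[2])
        if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
            return cont(bs[1]) and cont(bs[2])
        if lead == 0xED:                           # excludes UTF-16 surrogates
            return 0x80 <= bs[1] <= 0x9F and cont(bs[2])
        return False
    if len(bs) == 4:
        lead = bs[0]
        if lead == 0xF0:
            return 0x90 <= bs[1] <= 0xBF and cont(bs[2]) and cont(bs[3])
        if 0xF1 <= lead <= 0xF3:
            return cont(bs[1]) and cont(bs[2]) and cont(bs[3])
        if lead == 0xF4:                           # caps code points at U+10FFFF
            return 0x80 <= bs[1] <= 0x8F and cont(bs[2]) and cont(bs[3])
        return False
    return False
```

The leading-byte restrictions mirror well-formed UTF-8: overlong encodings (other than the special 0xC0 0x80 for the 0 character), UTF-16 surrogates, and code points above U+10FFFF are all excluded.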

Semicolon insertion

Like programming languages such as Haskell, Python, JavaScript, Scala, and Go, Koka has a layout rule that automatically inserts semicolons at appropriate places. This enables the programmer to leave out most semicolons.

Koka will insert semicolons automatically for any statements and declarations that are aligned between curly braces ({ and }). For example, in the following program:

function eq1( x : int, y : int ) : io bool 
{  
  print("calculate equality")
  result = if (x == y) then True            
              else False
  result
}  

we get semicolons before each statement that was aligned between the braces:

function eqSemi( x : int, y : int ) : io bool 
{;  
  print("calculate equality");
  result = if (x == y) then True            
              else False;
  result;
}  

Since semicolons are only inserted for aligned statements, we can write a long statement on multiple lines by using more indentation:

function eq2( x : int, y : int ) : io bool 
{  
  print("calculate " +
         "equ" +
         "ality")
  result = if (x == y)
            then True 
            else False
  result
}  

In contrast to token-based layout rules, as in Scala or Go for example, this allows you to put line breaks at any point in a statement by just indenting more. Moreover, it means that the visual indentation of a program corresponds directly to how the compiler interprets the statements. Many tricky layout examples in other programming languages are often based on a mismatch between the visual representation and how a compiler interprets the tokens. With Koka's layout rule, there is no such mismatch.

To still allow for “block-style” layout, the layout rule does not insert a semicolon before an aligned statement if it starts with then, else, elif, or one of {, ), or ].

function foo()
{  
  val xs = [ 
    "list",
    "elements",
  ]
  if (odd(randomInt())) 
  {
    print("odd")
  }
  else 
    print("even")
}  

Of course, it is still allowed to use semicolons explicitly which can be used for example to put multiple statements on a single line:

function equalLine( x : int, y : int ) : io bool {
  print("calculate equality"); (x == y) 
}  

The layout algorithm also checks for invalid layouts where the layout would not visually correspond to how the compiler interprets the tokens. In particular, it is illegal to indent less than the layout context or to put comments into the indentation (because of tabs or potential Unicode characters). For example, the program:

function equal( x : int, y : int ) : io bool {   
    print("calculate equality")
  result = if (x == y) then True   // wrong: too little indentation
  /* wrong */      else False
    result
}  

is rejected. To facilitate code generation or source-code compression, compilers are also required to support a mode in which the layout rule is not applied and no semicolons are inserted. The recognized command-line flag for that mode should be --nolayout.

The layout algorithm

The layout rule is natural to humans since semicolon insertion corresponds directly to the visual two-dimensional layout of the program. The formal specification of the layout rule is a bit more technical though since a compiler interprets source code as a linear one-dimensional sequence. To define the layout algorithm formally, we first establish some terminology:

Because braces can be nested, we use a layout stack of strictly increasing indentations. The top indentation on the layout stack holds the layout indentation. The initial layout stack contains the single value 0 (which is never popped). The operations on the layout stack are always performed before the semicolon insertion.

As defined, semicolons are inserted whenever statements or declarations are aligned, unless the lexeme happens to be a clear statement continuation. To simplify the grammar specification, a semicolon is also always inserted before a closing brace and at the end of the source. This allows us to specify many grammar elements as ended by semicolons rather than separated by semicolons, which is more difficult to specify for an LALR(1) grammar.

Semicolon insertion can be easily implemented as part of the lexer, but could also be implemented as a straightforward transformation on the lexical token stream. In particular, there are no intricate dependencies with the parser that lead to bizarrely complex layout rules, as is the case in languages like Haskell or JavaScript.
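The description above can be sketched as a small line-based transformation. This is an illustrative simplification only, assuming one statement per line and ignoring strings, multi-line comments, and the invalid-layout checks; the actual rule operates on the lexical token stream, and all names here are hypothetical. Semicolons are rendered as a prefix of each aligned statement, which is equivalent to the `{;` placement shown in the examples above:

```python
# Statement continuations never receive an inserted semicolon.
CONTINUATIONS = ("then", "else", "elif", "{", ")", "]")

def insert_semicolons(lines):
    stack = [0]               # layout stack; the initial 0 is never popped
    open_brace = False        # just saw '{': the next line sets the new layout
    out = []
    for line in lines:
        stripped = line.rstrip()
        body = stripped.lstrip()
        if not body:
            out.append(stripped)
            continue
        indent = len(stripped) - len(body)
        if open_brace:
            stack.append(indent)      # push the indentation of the new block
            open_brace = False
        if body.startswith("}") and len(stack) > 1:
            stack.pop()               # pop before inserting the semicolon
        if indent == stack[-1] and not body.startswith(CONTINUATIONS):
            body = "; " + body        # semicolon before each aligned statement
        if body.endswith("{"):
            open_brace = True
        out.append(" " * indent + body)
    return out
```

Statements indented more than the layout indentation are left untouched, which is what allows a long statement to span multiple lines.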

Implementation

There is a full Flex (Lex) implementation of lexical analysis, which includes an implementation of the layout rule. Ultimately, the Flex implementation serves as the specification, and this document and the Flex implementation should always be in agreement.

Context-free syntax

The grammar specification starts with the non-terminal module, which draws its lexical tokens from lex, where all whitespace tokens are implicitly ignored.

Modules

module[lex] ::= [moduledecl] modulebody
 
moduledecl ::= [semis] [visibility] module moduleid
moduleid ::= qvarid | varid
 
modulebody ::= { [semis] declarations } [semis]
| [semis] declarations
 
visibility ::= public | private
semis ::= ; { ; }

Top level declarations

declarations ::= { import semis } { fixitydecl semis } { topdecl semis }
 
import ::= [visibility] import [moduleid =] moduleid
 
fixitydecl ::= [visibility] fixity natural identifier { , identifier }
fixity ::= infixl
| infixr
| infix
 
topdecl ::= [visibility] puredecl
| [visibility] aliasdecl
| [visibility] typedecl
| [visibility] externdecl
| abstract typedecl

Type declarations

aliasdecl ::= alias typeid [typeparams] [kannot] = type
typedecl ::= typesort typeid [typeparams] [kannot] [typebody]
| struct typeid [typeparams] [kannot] [conparams]
typesort ::= type | cotype | rectype
 
typeid ::= varid
| []
| ( { , } )
| < >
| < | >
 
typeparams ::= < [tbinders] >
tbinders ::= tbinder { , tbinder }
tbinder ::= varid [kannot]
typebody ::= { [semis] {constructor semis } }
 
constructor ::= [con] [equantifier] conid [conparams]
conparams ::= ( [conparam { , conparam }] )
conparam ::= [paramid] : paramtype [= expr]

Value and function declarations

puredecl ::= val valdecl
| (fun | function) fundecl
 
valdecl ::= binder = expr
binder ::= identifier [: type]
 
fundecl ::= [quantifiers] funid fundef (block | = blockexpr)
fundef ::= parameters [annotres] [sigqualifier] [annotfull]
funid ::= identifier
| [ { , } ] (indexing operator)
 
parameters ::= ( [parameter { , parameter }] )
parameter ::= paramid [ : paramtype] [= expr]
paramid ::= identifier | wildcard
paramtype ::= type
| ? type
 
annotres ::= : tresult
annotfull ::= :: typesig
 
qidentifier ::= qvarid | qidop | identifier
identifier ::= varid | idop
 
qoperator ::= op | ` (qidentifier | qconstructor) `
 
qconstructor ::= conid | qconid

Statements

block ::= { [semis] { statement semis } }
 
statement ::= nofunexpr
| decl
 
decl ::= (fun | function) fundecl
| val pattern = valexpr (local values can use a pattern binding)
| valdecl (local values have an optional val keyword)
| var binder := valexpr

Expressions

blockexpr ::= funexpr
| nofunexpr (implicitly wrapped in a block)
 
expr ::= funexpr
| nofunexpr (returnexpr not allowed)
 
nofunexpr ::= matchexpr
| opexpr
| ifexpr
| returnexpr
 
noifexpr ::= matchexpr
| opexpr
| returnexpr
| funexpr
 
ifexpr ::= if atom then { elif } [else noifexpr]
then ::= [then] noifexpr
elif ::= elif atom then
 
matchexpr ::= match atom { [semis] { matchrule semis } }
 
returnexpr ::= return noifexpr (but no returnexpr allowed)
 
funexpr ::= (fun | function) [quantifiers] fundef block
| block (zero-argument anonymous function)

Operator expressions

For simplicity, we parse all operators as if they were left-associative with the same precedence. We assume that a separate compiler pass will use the fixity declarations in scope to properly associate all operators in an expression.

opexpr ::= fappexpr { qoperator fappexpr }
fappexpr ::= appexpr { funexpr }
 
appexpr ::= appexpr ( [arguments] ) (regular application)
| appexpr [ [arguments] ] (index operation)
| appexpr . prefix
| prefix
prefix ::= { qoperator } atom
 
arguments ::= argument { , argument }
argument ::= [identifier =] expr

In the grammar we use the dot (.) both for qualifying identifiers with a module name and for “dot” expressions, where e.f(x) is syntactic sugar for the application f(e,x). When we combine module qualification with dot expressions, we can get expressions like List.primes.List.map(sqr), which maps the squaring function over the list of prime numbers. Note that the grammar is still LALR(1) since qualified names are returned as single tokens by the lexical analyzer, i.e. List.primes and List.map.

Atomic expressions

atom ::= qidentifier
| qconstructor
| literal
| ( ) (unit)
| ( annexpr ) (parenthesized expression)
| ( annexprs ) (tuple expression)
| [ [annexprs [,]] ] (list expression)
 
literal ::= natural | float | char | string
 
annexprs ::= annexpr { , annexpr }
annexpr ::= expr [ : typescheme ]

Matching

matchrule ::= patterns [guard] -> branchexpr
guard ::= | compound
 
pattern ::= identifier
| wildcard
| qconstructor [( [patargs] )]
| ( [patterns] ) (unit, parenthesized pattern, tuple pattern)
| [ [patterns] ] (list pattern)
| pattern as identifier (named pattern)
| literal
 
patterns ::= pattern { , pattern }
patargs ::= patarg { , patarg }
patarg ::= [identifier =] pattern (possibly named parameter)

Type schemes

typesig ::= [quantifiers] tarrow [sigqualifier]
typescheme ::= [quantifiers] tarrow [qualifier]
type ::= [aquantifier] tarrow [qualifier]
 
quantifiers ::= [squantifier] [aquantifier]
aquantifier ::= forall < tbinders >
squantifier ::= some < tbinders >
equantifier ::= exists < tbinders >
 
sigqualifier ::= with predicates | qualifier
qualifier ::= with ( predicates )
 
predicates ::= predicate { , predicate }
predicate ::= typeapp (interface)

Types

tarrow ::= tatom [-> tresult]
tresult ::= tatom [tbasic]
 
tatom ::= tbasic
| < anntype { , anntype } [ | tatom ] >
| < >
 
tbasic ::= typeapp
| ( ) (unit type)
| ( tparam ) (parenthesized type or type parameter)
| ( tparam { , tparam } ) (tuple type or parameters)
| [ anntype ] (list type)
 
typeapp ::= typecon [< [anntype { , anntype }] >]
 
typecon ::= varid | qvarid
| wildcard
| ( , { , } ) (tuple constructor)
| [ ] (list constructor)
| ( -> ) (function constructor)
 
tparam ::= [varid : ] anntype
anntype ::= type [kannot]

Kinds

kannot ::= :: kind
 
kind ::= ( kind { , kind } ) -> kind
| katom -> kind
| katom
 
katom ::= V (value type)
| X (effect type)
| E (effect row)
| H (heap type)
| P (predicate type)

Implementation

As a companion to the Flex lexical implementation, there is a full Bison (Yacc) LALR(1) implementation available. Again, the Bison parser functions as the specification of the grammar, and this document should always be in agreement with that implementation.

Appendix

Complete syntax definition