May 10-11, 2010, Microsoft Research, Roger Needham Building, Cambridge, United Kingdom
Drinks in the Eagle Pub, Bene't Street, from 21:30. No organised dinner. You will probably find dinner companions at the Arundel House (try meeting at 19:00).
SQL is a language with arbitrary syntax and dubious semantics; its documentation now amounts to over 3500 pages and costs over £1,500. This is, to say the least, worrying for a language that was intended for "use by people who have need for interaction with a large database but are not trained programmers" and is generally held to be the most widely used of all programming languages. Despite its ungraceful nature, this organic language has incorporated a large number of features that are essential for work with data of all kinds. In this talk I shall describe some of the ways in which SQL remains influential in our development of models of data manipulation, constraints, and query languages.
Wikis make certain kinds of data (text, HTML) easy to create, edit and share on the Web, and facilitate flexible and distributed collaboration to create valuable resources. But wikis have limitations, especially for managing or querying large amounts of data with natural structure. Curated databases in bioinformatics often start out as wikis but soon need to migrate to a three-tier Web application based on a relational database. This transition is painful, in large part because the elegant mathematical foundations of relational databases have been buried under decades of cruft and language designs, mediated by expensive database administrators and programmers. Functional programming languages make it easy to abstract away from the low-level details of programming, from side-effects and I/O to CGI and Web database programming. However, while relational databases and functional programming ought to work well together, they are barely on speaking terms. In this talk I'll give an overview of the (ongoing) design of a Database Wiki, that is, a Web application for creating and managing both Web content (HTML) and Web data (relations, XML, what have you). The key to this design is the use of ideas from functional Web programming languages for defining Wiki pages with embedded queries, along with forms and interactions that update the database. A significant advantage of this design is that advanced features such as versioning, historical queries, provenance tracking, and data sharing can be implemented once in the Database Wiki system and re-used by many applications. The eventual goal of this work is to put the power of relational databases and functional programming into the hands of users, encapsulating the convoluted three-tier Web development process and sidestepping the false dichotomy between unstructured Web content and structured relations.
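To make the embedded-query idea concrete, here is a minimal sketch (in Haskell, not the system described above) of a wiki page whose content mixes markup with a query that is run against the database at render time; the gene table is invented for illustration.

import Data.List (sortBy)
import Data.Ord (comparing)

type Row = [(String, String)]          -- a relational tuple, keyed by column name

data Page
  = Text String                        -- ordinary wiki markup
  | Query ([Row] -> [Row])             -- an embedded query over a table
  | Seq Page Page

-- Rendering a page runs its embedded queries against the database.
render :: [Row] -> Page -> String
render _  (Text s)    = s
render db (Query q)   = unlines (map show (q db))
render db (Seq p1 p2) = render db p1 ++ render db p2

-- Example: a page listing genes sorted by name (hypothetical data).
genePage :: Page
genePage = Seq (Text "== Genes ==")
               (Query (sortBy (comparing (lookup "name"))))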
'M' is a textual language for building executable models. In this talk we construct a DSL for a small domain (a model of text), show how information from that model can be stored in a relational database, and finally we augment the model with type annotations and some calculations.
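Since no M code appears here, the following rough Haskell analogue (not M syntax) sketches the recipe the talk follows: a tiny model of text, a derived relational storage schema, and a calculated field added on top; all names are illustrative.

data Document = Document
  { docTitle :: String
  , docBody  :: [Paragraph]
  }

newtype Paragraph = Paragraph { paraText :: String }

-- A straightforward relational encoding: one table per entity, with a
-- foreign key from paragraphs to their document.
docSchema :: [String]
docSchema =
  [ "CREATE TABLE Document  (id INTEGER PRIMARY KEY, title TEXT)"
  , "CREATE TABLE Paragraph (id INTEGER PRIMARY KEY, doc_id INTEGER REFERENCES Document(id), text TEXT)"
  ]

-- A calculation over the model, of the kind one might add as an annotation.
wordCount :: Document -> Int
wordCount = sum . map (length . words . paraText) . docBody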
We study a first-order functional language with the novel combination of the ideas of refinement type (the subset of a type whose values satisfy a Boolean expression) and type-test (a Boolean expression testing whether a value belongs to a type). Our small core calculus can express a rich variety of typing idioms; for example, intersection, union, negation, singleton, nullable, variant, recursive, and algebraic types are all derivable. We formulate a semantics in which expressions denote terms, and types are interpreted as first-order logic formulas. Subtyping is defined as valid implication between the semantics of types. The formulas are interpreted in a specific model that we axiomatize using standard first-order theories. On this basis, we present a novel type-checking algorithm able to eliminate many dynamic tests and to detect many errors statically. The key idea is to rely on an SMT solver to compute subtyping efficiently. Moreover, interpreting types as formulas allows us to call the SMT solver at run-time to compute instances of types.
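As a toy illustration of the semantic approach (not the calculus itself), one can read a type as the predicate its values satisfy; the derived idioms listed above then fall out as Boolean combinations, and subtyping becomes validity of an implication, which the paper discharges with an SMT solver. A Haskell sketch:

-- A toy semantic model: interpret a type as the predicate its values
-- satisfy, so the derived connectives are just Boolean combinations.
type Ty a = a -> Bool

refine :: Ty a -> (a -> Bool) -> Ty a   -- refinement: x : T where e(x)
refine t e x = t x && e x

intersect, union :: Ty a -> Ty a -> Ty a
intersect t u x = t x && u x            -- x : T and x : U
union     t u x = t x || u x            -- x : T or  x : U

negation :: Ty a -> Ty a -> Ty a        -- x : T but not x : U
negation t u x = t x && not (u x)

singleton :: Eq a => a -> Ty a          -- the type containing exactly v
singleton v = (== v)

-- Subtyping T <: U becomes validity of the implication T(x) ==> U(x);
-- the paper discharges exactly such implications with an SMT solver.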
We discuss some type-related problems for the Relational Algebra (the canonical query language for the relational database model) and the Named Nested Relational Calculus (the canonical query language for the complex object database model). Both query languages are equipped with a natural and standard static type system. In particular, we first discuss the complexity of the typability problem for the relational algebra. This problem consists of deciding, for a given relational algebra expression, whether there exists an assignment of types to variables occurring in the expression such that the expression is well-typed under the assignment. We show that the problem is NP-complete in general, but can be solved in PTIME for restricted fragments. Most of these results follow from a close connection of the typability problem to non-uniform constraint satisfaction. Next, we turn our attention to type inference for the Nested Relational Calculus. We solve this problem with a logic-based approach, from which we obtain that typability is again NP-complete. The move from the flat relational to the complex object data model hence does not yield higher complexity. Finally, we discuss the impact of the presence of a static type system on the expressiveness of the query languages: are there queries that are (1) well-defined (i.e., do not cause runtime errors), (2) expressible in the Nested Relational Calculus without static type system, but (3) cannot be expressed in a well-typed way? We show that such queries do not exist. As such, we rest assured that the static typing discipline ensures well-definedness while not limiting expressiveness.
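To make "well-typed under an assignment" concrete, here is a hedged Haskell sketch of a type checker for a small relational-algebra fragment, with schemas as attribute sets; the typability problem then asks whether some environment makes the check succeed, and the talk shows that this search is NP-complete in general.

import qualified Data.Set as Set
import Data.Set (Set)

type Schema = Set String

data RA
  = Var String                -- a relation variable
  | Union RA RA               -- requires equal schemas
  | Project (Set String) RA   -- requires the attributes to be present
  | Join RA RA                -- natural join; always well-typed

-- Given a candidate assignment of schemas to variables, check typability.
typeOf :: [(String, Schema)] -> RA -> Maybe Schema
typeOf env (Var v) = lookup v env
typeOf env (Union e1 e2) = do
  s1 <- typeOf env e1
  s2 <- typeOf env e2
  if s1 == s2 then Just s1 else Nothing
typeOf env (Project as e) = do
  s <- typeOf env e
  if as `Set.isSubsetOf` s then Just as else Nothing
typeOf env (Join e1 e2) =
  Set.union <$> typeOf env e1 <*> typeOf env e2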
Language-integrated database access today tends to fall into one of two camps: either we represent a subset of a language like SQL directly, giving up static checking (e.g., in mainstream web frameworks), the ability to employ safe higher-order abstraction over queries (e.g., in LINQ), or both; or we employ automatic translation into the database's language from some subset of our main programming language, such that the translation is either partial or never takes advantage of some database features. In this talk, I will present an alternative solution, which represents SQL queries explicitly, with full static checking, and supports safe abstraction over families of queries. In the Ur/Web language, the SQL interface appears as "just another library." The base type system of the language involves no SQL-specific features; rather, a very general and flexible notion of type-level computation with records makes it possible to express the typing rules for SQL syntax as a signature in the tradition of the ML module system. Alternate interfaces like query comprehensions may be implemented as libraries, entirely within the language, with static checking that no abstraction can ever build an invalid query.
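Ur/Web's type-level record machinery is beyond a short sketch, but the weaker idea that a query's row type can be tracked statically is easy to illustrate in Haskell with a GADT; this is an approximation for intuition, not Ur/Web's design.

{-# LANGUAGE GADTs #-}
-- Index each query fragment by its row type, so that building an
-- ill-typed query is a compile-time error.
data Query a where
  Table  :: [a] -> Query a                  -- a base table with rows of type a
  Where  :: (a -> Bool) -> Query a -> Query a
  Select :: (a -> b) -> Query a -> Query b

-- A reference semantics over in-memory tables; a real implementation
-- would emit SQL while keeping the same static guarantees.
run :: Query a -> [a]
run (Table rows) = rows
run (Where p q)  = filter p (run q)
run (Select f q) = map f (run q)

-- Example: safe abstraction over a family of queries, statically checked.
byAge :: Int -> Query (String, Int) -> Query String
byAge n q = Select fst (Where ((>= n) . snd) q)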
Most large-scale applications integrate remote services and/or transactional databases. Yet building software that efficiently invokes distributed services or accesses relational databases is still quite difficult. Existing approaches to these problems are based on the Remote Procedure Call (RPC), Object-Relational Mapping (ORM), or Web Services (WS). RPCs have been generalized to support distributed object systems. ORM tools generally support a form of query sublanguage for efficient object selection, but it is not well-integrated with the host language. Web Services may seem to be a step backwards, yet document-oriented services and REST are gaining popularity. The last 20 years have produced a long litany of technologies based on these concepts, including ODBC, CORBA, DCE, DCOM, RMI, DAO, OLEDB, SQLJ, JDBC, EJB, JDO, Hibernate, XML-RPC, WSDL, Axis and LINQ. Even with these technologies, complex design patterns for service facades and/or bulk data transfers must be followed to optimize communication between client and server or client and database, leading to programs that are difficult to modify and maintain. While significant progress has been made, there is no widely accepted solution or even agreement about what the solution should look like. In this talk I present a new unified approach to invocation of distributed services and data access. The solution involves a novel control flow construct that partitions a program block into remote and local computations, while efficiently managing the communication between them. The solution does not require proxies, an embedded query language, or construction and decoding of service requests. Although the result itself is elegant and useful, what is more significant is that it overturns our deep-seated assumption that the original problems can be solved using existing programming language constructs and libraries.
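The talk's construct is not spelled out here, so the following Haskell sketch shows one plausible shape of the idea, not its actual design: an applicative that first collects every remote request in a block, ships them in a single round trip, and then resumes the local computation over the responses. It assumes the transport returns responses in request order.

-- Batch collects requests up front and consumes responses afterwards.
data Batch req resp a = Batch [req] ([resp] -> a)

instance Functor (Batch req resp) where
  fmap f (Batch rs k) = Batch rs (f . k)

instance Applicative (Batch req resp) where
  pure x = Batch [] (const x)
  Batch rs1 k1 <*> Batch rs2 k2 =
    Batch (rs1 ++ rs2)
          (\resps -> let (xs, ys) = splitAt (length rs1) resps
                     in k1 xs (k2 ys))

remote :: req -> Batch req resp resp
remote r = Batch [r] head

-- One round trip for the whole block, given a bulk transport function.
runBatch :: ([req] -> [resp]) -> Batch req resp a -> a
runBatch send (Batch rs k) = k (send rs)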
Local resource reasoning provides modular reasoning about programs, based on the analysis of a program's use of resource. This style of reasoning was first introduced in Separation Logic, a program logic for reasoning about C programs. Separation Logic was extended with abstract predicates to provide coarse-grained reasoning about program modules, and generalised to Context Logic to provide fine-grained reasoning about program modules. In this talk, I will describe recent work on reasoning about program modules, using the W3C XML update library DOM as the running example. In particular, I will present an axiomatic specification of DOM using Context Logic, and verify DOM implementations by introducing locality-preserving and locality-breaking translations between the abstract DOM specification and the concrete implementations.
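The essence of local resource reasoning is the frame rule of Separation Logic, which licenses reasoning about a command using only the resource it touches; stated in the standard form (this is background, not the DOM axioms of the talk):

% The frame rule: if C runs correctly in a state satisfying P, it runs the
% same way in any larger state, and the untouched part R is preserved.
\[
\frac{\{P\}\; C\; \{Q\}}
     {\{P \ast R\}\; C\; \{Q \ast R\}}
\quad (\mathrm{mod}(C) \cap \mathrm{fv}(R) = \emptyset)
\]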
Direct support for querying is becoming a "must have" for programming languages targeting Web 2.0 and Cloud development. Most of those languages (Microsoft's LINQ, University of Edinburgh's Links, EPFL's Scala, Yahoo!'s Pig Latin, IBM's Thorn, etc.) rely on the classic notion of comprehensions over collections. At the language level, comprehensions are a perfect choice, being well understood programming constructs, and capturing the expressive power of SQL iterators. At the compiler level, however, they are at odds with database optimizers, which mostly rely on relational (or nested-relational) algebras. That mismatch was clearly on display during the design of XQuery, whose semantics is based on comprehensions, and for which most implementations target relational backends. We propose an alternative functional formulation of the XQuery semantics to the one proposed by the W3C, which is also based on comprehensions but has the benefit of corresponding precisely to compilation into a typed algebra that supports traditional database optimizations. First, this provides a formal foundation for XQuery implementations that want to ensure semantic consistency with the standard, along with modern database optimization techniques. Also, it provides key insights into the nature of database compilers that we believe are essential for the integration of database and programming languages technology. We notably discover that type systems for database algebras require an original solution to the old problem of subtyping with record concatenation, and that such a type system can eliminate the need for complex side conditions used in query language optimization.
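The mismatch is easy to see on a one-line query: the comprehension form is what the languages above expose, while the equivalent algebraic form (maps and flattenings) is what an optimizer wants to rewrite. A small Haskell illustration over invented data:

employees :: [(String, Int)]    -- (name, deptId); sample data
employees = [("ada", 1), ("bob", 2), ("cyd", 1)]

-- The comprehension a language designer loves.
comprehension :: [String]
comprehension = [ name | (name, d) <- employees, d == 1 ]

-- The algebraic form a database optimizer wants: a selection followed by
-- a projection, expressed with concatMap, ready for algebraic rewriting.
algebraic :: [String]
algebraic =
  concatMap (\(name, d) -> if d == 1 then [name] else []) employees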
Much useful work applying data constraints to programming languages and type systems focuses on achieving benefits at compile time. I will present (at lightning speed) a few complementary ideas focusing on the runtime aspects of data constraints, including possible synergies with debuggers and transactional memories. I will also briefly argue for the importance of these runtime approaches in pursuit of pluggable (a.k.a. optional) type systems.
Access to persistent data is not easily controlled by standard database permission schemes in the context of web applications. Three-tier architectures force developers to implement ad-hoc security policies explicitly in the middle-tier logic and to access the database at lower security levels, regardless of the actual end-user's profile. This leads to rigid security policies and, even worse, programming errors may grant unwanted access to the wrong users. We present a type-based model to express security policies and control access to database entities. We use refinement types to represent and extend the usual database permissions, and present a runtime system that supports an agile programming style, with flexible upgrading of modules and self-healing of access control policies.
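The talk uses refinement types; as a much weaker Haskell approximation, plain phantom types can already tag a connection with the end-user's privilege so that a statement requiring administrative rights cannot even be issued on a guest connection (subsumption between levels, which refinements would give, is elided):

{-# LANGUAGE GADTs #-}
data Guest
data Admin

newtype Conn level = Conn String       -- a connection tagged with a privilege

data Stmt level where
  SelectPublic :: Stmt Guest
  DeleteUser   :: String -> Stmt Admin

runAs :: Conn level -> Stmt level -> IO ()
runAs (Conn who) _ = putStrLn (who ++ ": statement accepted")

demo :: IO ()
demo = runAs (Conn "root" :: Conn Admin) (DeleteUser "ada")
-- runAs (Conn "visitor" :: Conn Guest) (DeleteUser "ada")   -- rejected statically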
Next generation network architectures will benefit from the many years of practical experience that have been gained in designing, using and operating network protocols. Over time, the networking community has gradually improved its understanding of networked systems in terms of architecture, design, engineering and testing. However, as protocols and networked systems become more complex, it is our contention that it will be necessary for programming techniques to evolve similarly so that they better support the design, implementation and testing of both the functional and the non-functional requirements for the network protocols that will be used. We therefore envisage new levels of programming language support that permit: (a) the design and implementation of new protocols with provably correct construction; (b) inline testing; and (c) the expression of protocol behaviour within the design. Based on our ongoing work with both network protocols and programming language design, we believe that exploiting the capabilities of recent work in Domain Specific Languages (DSLs) will allow us to meet such requirements, allowing straightforward and "correct-by-construction" design and implementation of next generation network protocols.
For the last couple of years, my research group has been exploring next-generation programming language design for high assurance systems. In particular, we want high-level languages that give us the option to develop and maintain proofs of properties of our programs, including correctness. One of the artifacts that we've built is a simple, fully verified relational database in the spirit of MySQL. Databases are an attractive target for verification because they (ideally) have simple specifications based on well known abstractions (e.g., finite relations). At the same time, their implementations include many clever techniques, representations, and optimizations that are hard to verify, particularly in a modular fashion. I will describe what we've managed to accomplish so far, and point to the hard problems that still remain.
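The "simple specification" side of this story can be stated in a few lines: a relation is a finite set of tuples and each operator has an evident mathematical meaning, as in the Haskell reference semantics below. The verification effort lies in proving that B-trees, caches, and other clever implementation machinery refine this reference semantics.

import qualified Data.Set as Set
import Data.Set (Set)

type Relation a = Set a

select :: (a -> Bool) -> Relation a -> Relation a
select = Set.filter

union :: Ord a => Relation a -> Relation a -> Relation a
union = Set.union

project :: Ord b => (a -> b) -> Relation a -> Relation b
project = Set.map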
With recent steps toward mechanically verified RDBMS implementations, the time is right for a thorough study of mechanized query compilation. Mechanization provides behavioral guarantees and increases human understanding; the discipline demanded by mechanization ensures the modularity and compositionality of the resulting end-to-end formal model of query compilation. In this talk we outline our plans for building such a mechanized query compiler, and seek feedback from the community.
Given recent advances in automated theorem proving, we revisit the old idea of applying static program analysis to determine whether database updates preserve integrity constraints. We consider integrity constraints on SQL table declarations, extended with formulas of first-order logic. We work in the setting of a functional multi-tier language, where some functional code is compiled to SQL to query and update a relational database. We use refinement types, types qualified by logical formulas, to track constraints on data and the underlying state of the database; our static analysis uses a refinement-type checker, which relies on recent highly efficient SMT algorithms to check proof obligations. Our method is based on a list-processing semantics for an SQL fragment within the functional language, and is illustrated by a series of programming examples. The static analysis of database transactions is an alternative to costly runtime integrity checking. Our analysis could be applied either at the time of code development, to catch bugs early, or later at the time of code deployment on the database server, to allow only integrity-preserving stored procedures to be accepted, in the manner of proof-carrying code.
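A hedged Haskell sketch of the ingredients described above: the SQL fragment is given a list-processing semantics (SELECT as filter and map, INSERT as consing a row), and the table's integrity constraint becomes a predicate that every update must be shown to preserve; the table and constraint are invented for illustration.

type Person = (String, Int)                 -- (name, age)

-- The constraint attached to the table declaration: all ages non-negative.
invariant :: [Person] -> Bool
invariant = all (\(_, age) -> age >= 0)

-- SELECT name FROM persons WHERE age >= 18, as list processing.
adults :: [Person] -> [String]
adults db = map fst (filter (\(_, age) -> age >= 18) db)

-- INSERT: a refinement-type checker would demand a proof that the new row
-- satisfies the constraint, discharging the obligation with an SMT solver.
insertPerson :: Person -> [Person] -> [Person]
insertPerson p@(_, age) db
  | age >= 0  = p : db
  | otherwise = error "violates table constraint"   -- statically ruled out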
The world is full of ad hoc data sets --- non-standard, semi-structured collections of data files for which robust data processing tools are not available. Examples of ad hoc data sets include the billions of log files that are generated by web servers, file servers, billing systems, network monitors, content distribution systems, and other applications. The data analysts and programmers who find themselves working with ad hoc data formats waste significant amounts of time attempting to understand, organize, parse, validate, query and transform these data sources. In this talk, I will sketch two new systems designed to improve the productivity of the programmers who work with such data sources on a regular basis. The first system, Anne, developed with Qian Xi (Princeton), aims to speed up the process of generating grammars and data processing tools directly from example data. The system operates by giving programmers an annotation language that can be used to tag raw ad hoc data files with structural information. The annotations, together with the raw data, are then used to compute a grammar for the data, and from there, a parser and XML converter. We will explain the design of the annotation language and illustrate its features through a variety of examples. The semantics of the language has some interesting connections to relevance logic. The second system is called Forest and is currently being developed with Kathleen Fisher (AT&T), Nate Foster (Princeton) and Kenny Zhu (Shanghai Jiao Tong University). Forest is a language for specifying and documenting the environmental assumptions necessary for an application or suite of applications to run correctly. Such assumptions might include the presence or absence of collections of files and directories with particular names, owners, permissions, timestamps, or formats. In exchange for such documentation, Forest will provide a suite of tools for validating, manipulating, and maintaining the application-specific environment. Forest will also provide a programmatic interface to the described environment, which will further serve to lower the barrier between external data and in-memory data processing.
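To convey the flavour of the environmental assumptions Forest targets (this is an invented Haskell sketch, not Forest's actual notation), one can declare the files an application expects and validate a directory against the declaration:

import System.Directory (doesFileExist)
import Control.Monad (forM)

data EnvSpec = File FilePath          -- this file must exist
             | Optional FilePath     -- this file may be absent
             | All [EnvSpec]

appEnv :: EnvSpec                     -- hypothetical application layout
appEnv = All [ File "config.ini", File "logs/access.log"
             , Optional "cache.db" ]

validate :: FilePath -> EnvSpec -> IO Bool
validate root (File p)     = doesFileExist (root ++ "/" ++ p)
validate _    (Optional _) = return True
validate root (All specs)  = and <$> forM specs (validate root)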
Dependent grammars extend context-free grammars by allowing semantic values to be bound to variables and used to constrain parsing. Dependent grammars can cleanly specify context-sensitive features, like length fields in data formats and significant whitespace in programming languages, that cannot be handled in tools limited to context-free grammars. In this talk, we discuss how to extend a generic context-free parsing engine with dependent parsing. In our technique, the parsing engine communicates with a coroutine, which handles all of the dependency computations. The coroutine is produced automatically by a source-to-source transformation of a dependent grammar, and it handles all of the necessary environment manipulations using the native binding mechanisms of the target language. A related source-to-source transformation implements a notion of lazy execution of semantic actions, which is necessary in the presence of speculative parsing.
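The canonical dependent-grammar example is a length field: the value bound by one nonterminal constrains how much input the next may consume. A self-contained Haskell parser (hand-rolled combinators, not the coroutine machinery of the talk) makes the dependency explicit:

import Data.Char (isDigit)

newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(a, r) -> (f a, r)) (p s)
instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> do
    (f, s')  <- pf s
    (a, s'') <- pa s'
    return (f a, s'')
instance Monad Parser where
  Parser p >>= k = Parser $ \s -> do
    (a, s') <- p s
    runParser (k a) s'

number :: Parser Int
number = Parser $ \s -> case span isDigit s of
  ("", _)    -> Nothing
  (ds, rest) -> Just (read ds, rest)

-- Read a length n, then read exactly n characters: the value bound by the
-- first nonterminal constrains the parse of the second.
lengthPrefixed :: Parser String
lengthPrefixed = do
  n <- number
  Parser $ \s -> if length s >= n
                 then Just (splitAt n s)
                 else Nothing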
A venerable line of research aims to provide a general-purpose programming language with a well-defined subset that compiles into efficient queries, perhaps by translation into SQL or some other suitable query language. This talk discusses some older and more recent advances in this direction, including the languages Kleisli, LINQ, Ferry, M, and Links.
Programs formulated using LINQ and Links may have selected program fragments compiled for execution by an accompanying relational database back-end. We demonstrate a LINQ provider and a variant of the Edinburgh Links system, both of which are based on our Ferry compiler. Ferry implements a compositional translation scheme for a rich set of operations over lists that faithfully preserves order and nesting: with Ferry, more LINQ and Links programs may be considered database-executable. Ferry's code generator emits compact bundles of standard SQL:1999 queries and avoids the infamous "n+1 query problem" that may easily overwhelm database back-ends.
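For readers unfamiliar with the "n+1 query problem": a naive nested loop issues one query per outer row, while a Ferry-style compilation emits a constant number of bulk queries and regroups locally. The Haskell sketch below simulates each round trip with a print statement; the data and API are invented for illustration.

departments :: [Int]
departments = [1, 2, 3]

employees :: [(Int, String)]
employees = [(1, "ada"), (1, "bob"), (2, "cyd")]

-- One round trip per department: the "+n" (fetching the departments
-- themselves is the initial "1").
queryEmployeesOf :: Int -> IO [String]
queryEmployeesOf d = do
  putStrLn ("-- round trip: employees of dept " ++ show d)
  return [ e | (d', e) <- employees, d' == d ]

naive :: IO [(Int, [String])]          -- 1 + n round trips
naive = mapM (\d -> (,) d <$> queryEmployeesOf d) departments

bulk :: IO [(Int, [String])]           -- one bulk query, regrouped locally
bulk = do
  putStrLn "-- round trip: all employees, joined and ordered"
  return [ (d, [ e | (d', e) <- employees, d' == d ]) | d <- departments ]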
Type systems are effective tools for verifying the security of cryptographic protocols and implementations. In this talk, I present two approaches for verifying secrecy and authenticity properties using refinement types, first relying on a symbolic model of cryptography, then relying on concrete computational assumptions. I illustrate these approaches on F# programs verified by F7, a typechecker coupled with Z3, an SMT solver.
Methods for quantification of corruption—that is, damage to information integrity—have received little attention to date, whereas quantification of information leakage—damage to confidentiality—has been a topic of research for over twenty years. This talk introduces two kinds of integrity measures: contamination and suppression. Contamination measures how much untrusted information reaches trusted outputs; it is the dual of leakage. Suppression measures how much information is lost from outputs; it does not have a confidentiality dual. Two forms of suppression are considered: program suppression and channel suppression. Program suppression measures how much information about the correct output of a program is lost because of attacker influence and implementation errors. Channel suppression measures how much information about inputs to a noisy channel is missing from channel outputs. For programs (such as statistical database anonymizers) that declassify secret information, channel suppression is an inverse of leakage. Database privacy conditions from the literature are examined using this theory of quantitative integrity and quantitative confidentiality.
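One plausible information-theoretic reading of these measures, in the style the abstract suggests (the talk's precise definitions may differ): contamination is the information that untrusted inputs contribute to trusted outputs, given the trusted inputs, mirroring the standard conditional mutual-information definition of leakage:

% U = untrusted, T = trusted, S = secret, P = public; I is mutual information.
\[
\mathrm{contamination} \;=\; I(U_{\mathrm{in}}\,;\,T_{\mathrm{out}} \mid T_{\mathrm{in}}),
\qquad
\mathrm{leakage} \;=\; I(S_{\mathrm{in}}\,;\,P_{\mathrm{out}} \mid P_{\mathrm{in}})
\]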
Security views are a flexible mechanism for controlling access to confidential information. Rather than giving untrusted users unrestricted access to data sources, data owners instead provide them with a limited view, from which all confidential information has been removed. Security views are an effective means of ensuring confidentiality but they have a significant drawback: users typically cannot update the information in the view. In this talk, I will describe how lenses can be used to build updatable security views and discuss three ways of ensuring the integrity of sources in the presence of updates: one based on an information-flow analysis, another using provenance, and a third formulated in terms of a simple algebra of edit operations.
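The structure underlying an updatable view is a lens: a get function that extracts the view from the source and a put function that pushes an updated view back, subject to round-tripping laws. A minimal Haskell sketch with an invented record:

data Lens s v = Lens
  { get :: s -> v
  , put :: s -> v -> s }

-- The classic laws that make a lens well-behaved:
--   get (put s v) == v          (PutGet: updates are reflected in the view)
--   put s (get s) == s          (GetPut: a no-op edit changes nothing)

-- Example: a record whose confidential field is hidden from the view.
data Patient = Patient { name :: String, ssn :: String } deriving Show

nameView :: Lens Patient String
nameView = Lens
  { get = name
  , put = \p n -> p { name = n } }   -- ssn is preserved, never exposed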