The essay that follows was sent to r6rs-discuss@r6rs.org
on Tue Jul 10 06:52:31 2007.  On 25 July, I noticed that
the version of this essay that was posted to the official
archive of that list had been truncated; see
http://lists.r6rs.org/pipermail/r6rs-discuss/2007-July/003080.html

I am therefore reposting the complete essay, in exactly
the form that I originally sent to r6rs-discuss@r6rs.org.
In case the problem reoccurs, the complete essay is also
available on my personal web site at
http://www.ccs.neu.edu/home/will/R6RS/essay.txt

I apologize for failing to notice this problem earlier.

                                * * *

an essay on language design:
fixing the syntactic record layer


Introduction
============

More than twenty years have passed since I wrote this [1]:

    Programming languages should be designed not by
    piling feature on top of feature, but by removing the
    weaknesses and restrictions that make additional
    features appear necessary.  Scheme demonstrates that
    a very small number of rules for forming expressions,
    with no restrictions on how they are composed,
    suffice to form a practical and efficient programming
    language that is flexible enough to support most of
    the major programming paradigms in use today.

I still believe that first sentence, and I still believe
Scheme ought to demonstrate what is claimed in the second
sentence, but the draft we are being asked to ratify does
not always do that.

This shortcoming of the candidate draft can be seen in the
modularity and interoperability problems that beset the
syntactic and procedural record layers.  As I will show,
these problems are caused by artificial restrictions that
have been imposed upon the syntactic layer.  Removing those
weaknesses would remove the problems.

A last-minute change in the 5.97 draft attempted to fix
things by piling yet another feature, a parent-rtd clause,
on top of the syntactic layer [2].  The presumed purpose of
that parent-rtd clause was to address Andre van Tonder's
observation that incompatibilities between the syntactic
and procedural record layers create a modularity problem:
You cannot define a new record type that inherits from
an existing record type without knowing whether the base
type was defined by the syntactic or by the procedural
layer [3,4].

That also implies that record definitions are brittle:
Unless a record type is sealed, its definition cannot
be changed from using the syntactic layer to using the
procedural layer, or vice versa, without breaking all
record types that inherit from it.

Although the editors acted with the best of intentions,
their addition of the parent-rtd clause did not solve
the problems it was intended to solve.  Even with the
parent-rtd feature, you *still* have to know whether
the base record type was defined using the syntactic
or procedural layer, and you *still* can't change a record
definition from one layer to the other without running
the risk of breaking client code.
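
As a minimal sketch of that asymmetry, the following program
defines the same base type twice, once with each layer, and then
extends each one.  The names point, point2-rtd, and so on are
hypothetical and chosen only for illustration; the parent-rtd
clause is written with the (parent-rtd <parent rtd> <parent cd>)
shape described in this essay.

```scheme
(import (rnrs))

;; Base type defined by the syntactic layer.
(define-record-type point (fields x y))

;; A child of a syntactic base type uses the parent clause.
(define-record-type cpoint (parent point) (fields color))

;; The "same" base type defined by the procedural layer.
(define point2-rtd
  (make-record-type-descriptor
   'point2 #f #f #f #f '#((immutable x) (immutable y))))
(define point2-cd
  (make-record-constructor-descriptor point2-rtd #f #f))
(define point2-x (record-accessor point2-rtd 0))

;; A child of a procedural base type cannot use the parent clause.
;; It has to use parent-rtd and name a constructor-descriptor.
(define-record-type cpoint2
  (parent-rtd point2-rtd point2-cd)
  (fields color))
```

The two child definitions differ only because of how their
parents happened to be defined, which is exactly the knowledge
a client should not need.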

To make matters worse, the 5.97 draft added a couple
of questionable statements that attempt to excuse the
interoperability problems while asserting privileged
status for the draft's syntactic layer.  One of those
statements is based upon a patently false claim.

The editors have submitted this draft to the Steering
Committee as a candidate for ratification, so there is
no meaningful technical review of these last-minute
changes apart from the ratification vote itself.

Abstract
========

I will summarize the interoperability problems mandated
by library chapter 8 of the 5.97 draft, trace them to
their root cause, show how they could easily be fixed
by removing artificial restrictions that are imposed by
the syntactic layer, and conclude by showing that the
two exculpatory statements of that chapter are partly
false and thoroughly misleading.

Symptoms
========

That the syntactic and procedural record layers do not
interoperate well has been known for a while now, and had
been acknowledged by the editors, who had declared their
intention not to do anything about it [5].  I did not
consider that to be an absolute barrier to ratification,
because better syntactic layers would have been proposed
as SRFIs, and one of those alternatives might eventually
have replaced the R6RS syntactic layer.  That would have
been a better outcome than piling on still more features
without fixing the fundamental problem.

The last-minute addition of parent-rtd addressed the
most obvious of the interoperability problems, which was
first mentioned in public by my formal comment 90 [5],
but left these others in place:

  * Record types defined by the syntactic layer are not
    interchangeable with record types defined by the
    procedural layer.

  * In consequence, the code you write for a record type
    definition that inherits from some base type depends
    upon whether that base type was defined using the
    syntactic or procedural layer.

  * Both layers are complex, which makes it hard for a
    casual reader to understand their relationships.

  * The procedural layer is the more expressive layer,
    so the draft's new warnings that try to frighten
    programmers into preferring the syntactic layer
    would have limited impact even if they were true.

The procedural layer is more expressive because it can
do everything the syntactic layer can do, and it can
also be used to create multiple constructor-descriptors
for a single record type descriptor [6].
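
A small sketch of that extra expressiveness, using only the
procedural layer: one rtd with two constructor-descriptors.
The names point-rtd, make-point, and make-diagonal-point are
hypothetical.

```scheme
(import (rnrs))

(define point-rtd
  (make-record-type-descriptor
   'point #f #f #f #f '#((immutable x) (immutable y))))

;; First constructor-descriptor: the default protocol, which
;; takes one argument per field.
(define default-cd
  (make-record-constructor-descriptor point-rtd #f #f))

;; Second constructor-descriptor for the same rtd: a custom
;; protocol that takes one argument and uses it for both fields.
(define diagonal-cd
  (make-record-constructor-descriptor
   point-rtd #f
   (lambda (new) (lambda (v) (new v v)))))

(define make-point (record-constructor default-cd))
(define make-diagonal-point (record-constructor diagonal-cd))
(define point? (record-predicate point-rtd))
(define point-x (record-accessor point-rtd 0))
(define point-y (record-accessor point-rtd 1))
```

Both constructors produce records of the same type, so point?
and the accessors work on the results of either one.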

That, of course, is a cue for someone to jump up and
say "We can fix that by adding a new clause to the
syntactic layer!"  Adding yet another feature would be
exactly the wrong thing to do.  We ought to fix the
problem, not try to cover it with still more sterile
adhesive strips.

The proper course of action is to understand why these
problems matter, why they arose, and how to fix them.
Then we should fix them.

The Impending Records War
=========================

By specifying two barely interoperable record systems,
and advocating the more complex and less expressive of
the two, the 5.97 draft would create an unnecessary
dilemma for organizations that use Scheme.  Most will
deal with incompatibilities between the two record
layers as they arise.  After dealing with several
instances of the problem, some organizations will
standardize on one or the other of the record layers.

Some will choose the procedural layer, because it is
more expressive or because it is more in keeping with
Scheme's roots as a higher-order procedural language.
Others will choose the syntactic layer because that
is what the 5.97 draft suggests, or because Scheme's
macro system is really cool.  When these organizations
import code that uses the "wrong" record layer, they
will rewrite it to use their organization's standard
layer.  When they get tired of rewriting code, they
will clamor for the "wrong" record layer to be expunged
from the standard.

That conflict is unnecessary.  We do not have to fight
over which record layer is wrong, because we could fix
things so both are right.  That is not hard.  We should
do it.

The Root Cause
==============

The root technical problem is easy to understand.
I'll digress for a few paragraphs to give you a
chance to figure it out before I do.

A friend of mine remarked that it is impossible to
design a record system for Scheme that won't lead
to interoperability problems.  This is Scheme, after
all.  Any Scheme programmer can define a new
syntactic layer for records, and its notion of a
record type might be different from the standard
notion, so programmers shouldn't expect to be able
to define a record type that inherits from any other
programmer's record type.

That's true, up to a point.  The point, of course,
is that we should be able to define records that
inherit from any record system that uses the
standard notion of a record type.

The 5.97 draft doesn't have a standard notion of
a record type.  It has *two* standard notions of
a record type, with context-restricted coercions
between them.

That is the root technical cause of the modularity
and interoperability problems.  The solution is to
define a single standard notion of a record type,
and to use that one notion as the basis for both
the syntactic and the procedural layers.

To do that, of course, the standard notion of a
record type will have to be a first-class object.
The syntactic layer can deal with first-class values
by deferring them to run time, but the procedural
layer can't reach back in time to deal with macro
or expand-time values.

This has been a source of controversy among the
editors.  The 5.96 and earlier drafts fudged by
saying a record type is an "expand-time or run-time
description".  The 5.97 draft changed that phrase
to "expand-time representation of the record-type",
thereby institutionalizing the interoperability
problems even as it pretended to do something about
them.

In the 5.97 draft, the procedural layer's notion of
a record type is an rtd (record type descriptor).
The syntactic layer's notion of a record type is an
expand-time representation that bundles an rtd with
a preferred constructor-descriptor.

I will now describe a straightforward solution to
this muddle, based upon the following standard
notion of record type:

    A record type is an rtd.

To maintain compatibility with the syntactic layer
of the 5.97 draft, and for that reason only, every
non-opaque rtd will be associated with a preferred
constructor-descriptor.  The preferred
constructor-descriptor is the one associated with
the rtd in a special global table or, if that table
contains no preferred constructor-descriptor for the
rtd, the one computed by

        (make-record-constructor-descriptor
         rtd <parent-preferred> #f)

where <parent-preferred> is the parent's preferred
constructor-descriptor, or #f if there is no parent.
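
The lookup rule can be sketched as an ordinary run-time
procedure.  This is only an illustration under assumptions:
the names preferred-table, register-preferred-cd!, and
preferred-cd are hypothetical, the table is modelled as an
eq-hashtable, and the default case mirrors the expression
shown above.

```scheme
(import (rnrs))

;; Hypothetical global table from rtds to preferred
;; constructor-descriptors.
(define preferred-table (make-eq-hashtable))

;; Hypothetical registration procedure.  In a real system it
;; would be known to the syntactic layer but not exported.
(define (register-preferred-cd! rtd cd)
  (hashtable-set! preferred-table rtd cd))

;; Returns the registered constructor-descriptor if there is
;; one.  Otherwise builds a default one on top of the parent's
;; preferred constructor-descriptor (a fresh descriptor on each
;; call; this sketch does not cache it).
(define (preferred-cd rtd)
  (or (hashtable-ref preferred-table rtd #f)
      (let ((parent (record-type-parent rtd)))
        (make-record-constructor-descriptor
         rtd
         (and parent (preferred-cd parent))
         #f))))

;; Usage with two procedurally defined types.
(define base-rtd
  (make-record-type-descriptor
   'base #f #f #f #f '#((immutable a))))
(define child-rtd
  (make-record-type-descriptor
   'child base-rtd #f #f #f '#((immutable b))))
(define make-child (record-constructor (preferred-cd child-rtd)))
(define child-b (record-accessor child-rtd 0))
```

Because neither type is registered in the table, make-child
gets the default protocol and takes the parent's field followed
by its own.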

Note that the global table is a run-time object that
holds run-time constructor-descriptors.  Note also
that any implementors who would like to maintain an
expand-time or compile-time table of (conservative
approximations to) the information contained within
that run-time table are welcome to do so.

How does an rtd become associated with its preferred
constructor-descriptor?  By having the two be passed
as arguments to a special procedure that is known to
the macro/library/compiler/whatever system, but is
not exported by any of the standard libraries.  In
other words, only the syntactic layer can associate
an rtd with a preferred constructor-descriptor other
than the default.

I understand that the preferred constructor-descriptors
are an ugly hack.  They would not be present in any
record system I would design from scratch.  Why then
am I proposing these preferred constructor-descriptors?

Because I am taking a lesson from C++, which caught
on in part because it was bug-compatible with C.
The system I am about to describe is, in one of Mike
Sperber's favorite phrases, a conservative extension
of the 5.97 record system.

That means everything that would work in the 5.97 system
would work in the system I am about to describe, and a
number of things that wouldn't work in the 5.97 system,
but should, will indeed work in the system I describe.

How do we arrange that?  By removing the artificial
restrictions mandated by the 5.97 draft.

(We'll keep the artificial restriction that limits the
procedural layer's preferred constructor-descriptors to
default constructor-descriptors.  That restriction would
be easy to remove also, but removing it might complicate
the optional expand-time or compile-time bookkeeping that
appears to have been the driving force behind the 5.97
design.)

Proposal
========

To avoid still more discussion of the API for the R6RS
record layers, I propose we keep the syntax and almost
all of the semantics of the 5.97 syntactic layer, and
keep all the procedures and all the semantics of the
5.97 procedural and inspection libraries.

I further propose we extend the syntactic layer by
eliminating certain weaknesses and restrictions.
We will:

  * Require define-record-type to bind the <record name>
    to the rtd, in the same group of definitions that binds
    the constructor, predicate, accessors, and mutators.

  * Allow the <parent rtd> and <parent cd> of a
    parent-rtd clause to be arbitrary expressions,
    as in the 5.97 draft.  (Notice, however, that
    the <record name> bound by a define-record-type
    is now an ordinary variable and can serve as the
    <parent rtd> without having to resort to a use
    of record-type-descriptor).

  * Extend the parent clause to allow any expression,
    which must of course evaluate to an rtd.

  * Extend record-type-descriptor to allow any
    expression as its <record name>, provided the
    expression evaluates to an rtd; in other words,
    record-type-descriptor would become a procedure.

  * Extend record-constructor-descriptor to allow
    any expression as its <record name>, provided
    the expression evaluates to an rtd; it would
    then evaluate to the rtd's preferred
    constructor-descriptor.  In other words,
    record-constructor-descriptor would become a
    procedure.
    
I might have missed something, but I believe that's
all it takes.
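
To make the first and third items concrete, here is a rough
sketch of a tiny syntactic layer that binds the record name to
the rtd and accepts any expression as the parent.  It is not
the 5.97 define-record-type and not a full implementation of
this proposal: the macro names define-record/rtd and
define-accessors are hypothetical, every name must be spelled
out, fields are immutable, at least one field is required, and
preferred constructor-descriptors are ignored.

```scheme
(import (rnrs))

;; Hypothetical helper: defines one accessor per field, counting
;; field indices with the run-time expression k.
(define-syntax define-accessors
  (syntax-rules ()
    ((_ rtd k (accessor))
     (define accessor (record-accessor rtd k)))
    ((_ rtd k (accessor more ...))
     (begin
       (define accessor (record-accessor rtd k))
       (define-accessors rtd (+ k 1) (more ...))))))

;; Hypothetical syntactic layer.  NAME becomes an ordinary
;; variable whose value is the rtd, and PARENT-EXPR is any
;; expression that evaluates to an rtd or to #f.
(define-syntax define-record/rtd
  (syntax-rules ()
    ((_ (name constructor predicate) parent-expr
        (field accessor) ...)
     (begin
       (define name
         (make-record-type-descriptor
          'name parent-expr #f #f #f '#((immutable field) ...)))
       (define constructor
         (record-constructor
          (make-record-constructor-descriptor name #f #f)))
       (define predicate (record-predicate name))
       (define-accessors name 0 (accessor ...))))))

(define-record/rtd (point make-point point?) #f
  (x point-x) (y point-y))

;; The parent is an ordinary expression, here a variable
;; reference.  It would work the same way if point had been
;; created directly with make-record-type-descriptor.
(define-record/rtd (cpoint make-cpoint cpoint?) point
  (color cpoint-color))
```

In this sketch a client that extends point neither knows nor
cares which layer defined it, because point is just a variable
holding an rtd.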

Note that record-type-descriptor has become unnecessary.
It is nothing more than the identity function restricted
to record type descriptors.  If I weren't trying to
describe a conservative extension of the 5.97 draft,
I would urge removal of record-type-descriptor from
the language [7].

Note that both the scope and semantics of a <record name>
bound by the syntactic layer have become clearer.  The
<record name> is no longer a name for some mysterious
"expand-time representation" that is neither a run-time
object nor a macro.  It is now an ordinary variable that
obeys ordinary scope rules, can be exported or imported
for run time in the usual way, and has a first-class
object as its value.

I'm not going to claim this is a good record system,
but it offers all the features of the 5.97 draft,
all of the performance (for all use cases that can
even be expressed using that draft), and none of the
modularity and interoperability problems associated
with the record layers of that draft.

Performance
===========

The 5.97 draft contains a couple of new paragraphs
that attempt to justify its limitations by appeal
to matters of performance.

Page 16 says:

    However, the record operations provided through
    the procedural layer may be significantly less
    efficient than the operations provided through
    the syntactic layer.  Therefore, alternative
    implementations of syntactic record-type
    definition [sic] should, when possible, expand
    into the syntatic [sic] layer rather than the
    procedural layer.

To put that in perspective, let me point out that
the map procedure may be significantly less efficient
than using a do loop.  Indeed, there have been many
implementations of Scheme in which do loops are more
efficient than calls to map.  Despite that fact, none
of the Scheme reports has ever advocated using do
loops instead of map.  To advocate such things would
be inappropriate for an implementation-neutral
standard.

In typical uses of records, the base record type
will be defined at the top level of a library, where
the variable that holds the rtd will be immutable,
as will all of the other top-level variables that
are defined in terms of the rtd.

That makes it almost as easy to optimize code written
using the procedural layer as code written using the
syntactic layer.  Sure, some compilers may optimize
one without bothering to optimize the other, but most
would optimize neither or both.

In any case, it is obvious that any program that can
be written under the restrictions of the 5.97 draft
is also a program under my proposal.  If some macro
expander and/or compiler were written to record some
expand-time information when the syntactic layer of
the 5.97 draft is used, then they can record exactly
the same information for the syntactic layer of my
proposal.  The only additional complication of my
proposal is that the macro expander and/or compiler
would have to recognize when the <record name> is
an expression other than a variable that was bound
by define-record-type.  Recognizing that is trivial.

My proposal would not require any new flow analysis.
The advanced optimizations that require flow analysis
would use essentially the same flow analysis under my
proposal as they would under the 5.97 draft.

Consider, for example, that the 5.97 draft allows the
rtd associated with a <record name> to escape via the
record-type-descriptor syntax.  That means the rtd of
a <record name> that is exported by a library, whether
explicitly or implicitly, may escape within some
importing library [8].  Hence any optimizations that
require flow analysis of the rtd must either defer the
optimization until a whole-program analysis can be
performed, or else assume that the rtd of an exported
<record name> will flow into arbitrary contexts.  In
other words, the rtd-flow analysis required by the
5.97 draft is already as bad as it could be, so my
proposal can't possibly make it any worse.
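
A short sketch of how the rtd escapes under the existing
syntactic layer.  For brevity everything is in one program
rather than split across an exporting and an importing library,
and the names escaped-rtd, get-x, and make-point-procedurally
are hypothetical.

```scheme
(import (rnrs))

(define-record-type point (fields x y))

;; record-type-descriptor turns the <record name> into a
;; first-class rtd that can be stored or passed anywhere.
(define escaped-rtd (record-type-descriptor point))

;; Once it has escaped, the procedural layer can build
;; accessors and constructors for the syntactically defined
;; type.
(define get-x (record-accessor escaped-rtd 0))
(define make-point-procedurally
  (record-constructor
   (make-record-constructor-descriptor escaped-rtd #f #f)))
```

Any analysis that wants to know where the rtd of point can flow
already has to account for values like escaped-rtd.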

From page 18:

    Note:  Use of the parent-rtd clause generally
    forces an implementation to delay the generation
    of constructor, accessor, and mutator code until
    the record-type definition is evaluated at run
    time, since the type of the parent is not generally
    known until then.

That is a false statement.  The editors might as well
claim that the code for a lambda expression cannot be
generated until run time, since the values of its
free variables will not be known until then.

Even in the current release of Larceny, all of the code
generated for constructors, accessors, and mutators is
generated at compile time.  None of that code is ever
generated at run time.

In future releases, an unoptimized record access will
consist of a procedure call, a double tag check, an
indirect load, an eq? check, and a load.  Twobit's
existing optimizations, or easy extensions of them,
will eliminate any or all of that code when it is safe
to do so.

The code that isn't eliminated by optimization will
be generated at compile time.  No code will ever be
generated at run time.

And that's for the procedural layer.  There is no
earthly reason for a compiler to generate worse code
for the syntactic layer than for the procedural layer,
or to generate it any later.

    The parent clause should therefore be used instead
    whenever possible.

This recommendation is based upon a false premise.

So What?
========

The substantive changes that were made in the 5.97
draft are immune to meaningful technical review, so
why did I write this?  Partly to blow off steam, of
course, but there were at least three other reasons
as well.

As Andre van Tonder wrote, the only way for us to
register disagreement with changes made in the 5.97
draft is to vote against ratification [9].  Under
the rules of that vote, any negative vote must be
accompanied by an explanation, so I had to write
something like this anyway.

I am told that, if this draft is not ratified, the
Steering Committee intends to pay a lot of attention
to the reasons cited in those explanations.  If you
vote against ratification for reasons that include
some of the issues I have discussed, then you may be
able to save some writing by citing this essay.

The second reason has to do with what happens after
the vote.  As I see it, there are three possible
outcomes:

    1.  The vote is negative, which would give the
        editors an opportunity to get it right.

    2.  The draft is ratified, and everyone pretends
        to live happily ever after.

    3.  The draft is ratified, and the unhappy folk
        design alternative syntactic layers, probably
        written up as SRFIs, that build upon the R6RS
        procedural layer.

This little essay of mine might be of some use, or
at least have some influence, in the event of outcomes
1 or 3.  I don't think outcome 2 is stable in the long
run.  I think it would evolve into outcome 3.

Thirdly, writing this essay gave me a chance to consider
whether I still believe what I wrote so long ago.

Conclusion
==========

Programming languages should be designed not by piling
feature on top of feature, but by removing the weaknesses
and restrictions that make additional features appear
necessary.

R6RS Scheme should demonstrate that a very small number
of rules for forming expressions, with no restrictions
on how they are composed, suffice to form a practical
and efficient programming language.

William D Clinger
5-9 July 2007

--------

[1] Jonathan Rees and William Clinger [editors].
Revised^3 report on the algorithmic language Scheme.
ACM SIGPLAN Notices 21(12), December 1986, pages 37-79.

[2] Michael Sperber et al.  Revised^5.97 report on the
algorithmic language Scheme -- standard libraries.
http://www.r6rs.org/versions/r5.97rs-lib.pdf
http://www.r6rs.org/document/lib-html-5.97/r6rs-lib.html

[3] Andre van Tonder.  Rationale issues.  Posted to
r6rs-discuss, 26 June 2007.
http://lists.r6rs.org/pipermail/r6rs-discuss/2007-June/002825.html

[4] William D Clinger.  Response to [3], 27 June 2007.
http://lists.r6rs.org/pipermail/r6rs-discuss/2007-June/002889.html

[5] William D Clinger.  Record layers are not orthogonal.
Formal comment #90, 13 November 2006.
http://www.r6rs.org/formal-comments/comment-90.txt

[6] It doesn't matter whether the descriptor was created
using the syntactic or the procedural layer.  This is an
example of the interoperability we should have throughout
the record system.

[7] It is analogous to endianness, buffer-mode, et cetera.

[8] Whether the 5.97 draft allows a <record name> to be
exported from a library may not be entirely clear, but
disallowing such exports would be disastrous, so I assume
the 5.97 draft is meant to allow such exports.

[9] Andre van Tonder.  parent-rtd clauses in records.
Posted to r6rs-discuss, 3 July 2007.
http://lists.r6rs.org/pipermail/r6rs-discuss/2007-July/003071.html
