1. The SPARC assembler
1.1. Structure of the SPARC assembler
1.2. Native code generation
1.3. The SPARC native assembler
1.4. Peephole optimization
2. The Programming model
2.1. Registers
2.2. Scheme-to-Scheme calling convention
2.3. Scheme-to-Millicode calling convention
The assembler procedure picks the instruction apart and calls procedures in the code generator to emit machine code. Those procedures, found largely in Asm/Sparc/gen-msi.sch and Asm/Sparc/gen-prim.sch, decide on the instructions to generate, and call procedures in the native assembler. The native assembler creates the machine code for each instruction and performs certain local optimizations on the instruction stream such as branch delay slot filling and instruction scheduling.
The peephole optimizer is invoked on the instruction stream by the target-independent part of pass5, and coalesces common patterns of MacScheme instructions into instructions specialized for the SPARC.
    taddcc %R1, %R2, %R1
    bvc,a  L1
    nop
    sub    %R1, %R2, %R1
    ... call millicode ...
L1:

The instruction at L1 will usually be lifted into the branch delay slot. If the destination of the addition is neither R1 nor R2, or if R1 and R2 are the same register, then a temporary must be used in order to recover, and three instructions are necessary (FIXME: this has been improved):
    taddcc %R1, %R2, %TMP0
    bvc,a  L1
    mov    %TMP0, %R3
    ... call millicode ...
L1:

Larceny never uses the trapping version of the tagged add instruction.
    ble,a .+8
    mov   VALUE, %R1

To generate a boolean object, the following three-instruction sequence is used:
    mov   FALSE_CONST, %R1
    ble,a .+8
    mov   TRUE_CONST, %R1

On SPARC version 9, a conditional move instruction could be used instead. Larceny currently uses no SPARC-9-specific instructions.
L0: call .+8
    nop

after which register o7 contains the address of L0. The branch delay slot can be filled in the usual manner, or it can be used to adjust the value of o7 if L0 is not at the call instruction.
It has been suggested that this kind of code may cause the processor to slow down; an alternative way of computing the value of L0 is to use the following code (in generated code only; a different method could be used in code that won't move):
L0: ld  [ %REG0 + 3 ], %TMP0   ! get code vector
    sub %TMP0, 9, %TMP0        ! 1 for tag, 8 for address of _add_
    add %TMP0, L0, %o7

This code depends on the assembler resolving L0 as the relative address within the code vector (which is a feature). The memory reference is undesirable, but may be scheduled for earlier execution; the control dependence on the call instruction is gone.
All the mnemonics are defined at the end of Asm/Sparc/sparcasm.sch. There are also some abstractions there; for example, on the SPARC a move is expressed as a logical or with %g0, but this low-level implementation obscures the meaning of the instruction; therefore, a sparc.move instruction is provided.
Additionally, there is a sparc.label pseudo-operation, used for defining branch targets, and a sparc.slot instruction, which can be used in the branch delay slot of annulled branches.
In addition to creating the machine code for instructions and resolving branch targets and other expressions, the native assembler performs a limited amount of optimization. Currently, it fills branch delay slots using a simple algorithm: if the branch target is an instruction that can go in a delay slot, then insert the target instruction in the delay slot and increment the branch offset by 4.
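As a rough illustration, the rule just described can be sketched in Python. The dict-based instruction representation and the fits_delay_slot flag are invented for this sketch; Larceny's assembler uses its own data structures.

```python
# Sketch of the simple delay-slot-filling rule: if the branch target is an
# instruction that can sit in a delay slot, copy it into the slot and bump
# the branch offset by 4 (one instruction).  Note that this never shrinks
# the code -- it only makes a taken branch slightly faster.

def fill_delay_slot(branch, code):
    """branch: dict with "target" (a label) and "offset" (a byte offset).
    code: dict mapping each label to the first instruction at that label."""
    target_insn = code[branch["target"]]
    if target_insn.get("fits_delay_slot"):
        branch["delay_slot"] = dict(target_insn)  # duplicate the instruction
        branch["offset"] += 4                     # branch past the copy
    else:
        branch["delay_slot"] = {"op": "nop"}      # nothing useful to do
    return branch
```

A branch to a label whose first instruction is slot-eligible gets that instruction duplicated into its slot; otherwise the slot stays a nop and the offset is unchanged.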
A more sophisticated delay slot filling algorithm is desirable, because the above algorithm does not decrease code size -- it only makes a taken branch slightly faster. A better algorithm would work at least on basic blocks and try to move instructions across branches, into the slot following the branch, whenever possible. It would not be hard to implement this efficiently.
The native assembler could also perform instruction scheduling, but that is unlikely to pay off much until we get rid of most dynamic type checks.
The peephole optimizer is currently a simple hand-coded decision tree. This is fast (certainly fast enough), but makes maintenance harder than it could be with a more pattern-based approach. Clinger has written a generic pattern-based peephole optimizer and this should be adapted to be used with Larceny.
Furthermore, a number of the names introduced by the SPARC assembler are unintelligible for no good reason. This should be cleaned up.
For example, a sequence such as
    reg     1
    op2imm  +, 7
    setreg  2

that adds 7 to the value in R1 and stores it into R2 is peephole-optimized into the single instruction (FIXME: new names now)
    dresop2imm +, 1, 7, 2

This optimization is possible because the semantics of setreg specify that it destroys RESULT; hence, the use of RESULT can be avoided altogether in many cases.
The peephole optimizer has a table of primitives that it applies this optimization to. At the time of writing, these primitives are car, cdr, CELL-REF, +, -, eq?, and cons. Others could easily be introduced.
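The shape of this coalescing can be sketched as a pattern check over a three-instruction window. The tuple encoding and the FUSIBLE table below are illustrative only; the real optimizer is a hand-coded decision tree, as noted above.

```python
# Sketch: fuse "reg n; op2imm op, k; setreg m" into one instruction when
# op is in the table of fusible primitives.  Because setreg destroys
# RESULT, the fused form need not touch RESULT at all.

FUSIBLE = {"car", "cdr", "cell-ref", "+", "-", "eq?", "cons"}

def peephole3(window):
    """window: list of (mnemonic, operand...) tuples."""
    if (len(window) == 3
            and window[0][0] == "reg"
            and window[1][0] == "op2imm" and window[1][1] in FUSIBLE
            and window[2][0] == "setreg"):
        rs = window[0][1]                  # source register
        _, op, imm = window[1]             # primitive and immediate
        rd = window[2][1]                  # destination register
        return [("dresop2imm", op, rs, imm, rd)]
    return window                          # no match: leave unchanged
```

For the sequence above, `peephole3([("reg", 1), ("op2imm", "+", 7), ("setreg", 2)])` yields the single fused tuple.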
    reg     1
    op1     null?
    branchf Ln

Here, null? performs a test on the object in RESULT and sets RESULT to either #t or #f. The branchf instruction then compares RESULT with #f to determine whether to branch or not. The complete sequence of machine instructions looks something like this:
    mov   %r1, %result
    cmp   %result, 10    ! 10 = empty list
    mov   2, %result     ! 2 = true
    bne,a .+8
    mov   6, %result     ! 6 = false
    cmp   %result, 6
    be    Ln
    nop

In this case, the creation of a boolean object can be avoided; instead, code can be generated that checks the condition bits directly. The above sequence of MacScheme instructions is collapsed into
    optbreg1 null?, 1, Ln

for which the generated code is the much simpler
    cmp %r1, 10
    bne Ln
    nop
Again, this optimization is possible because the semantics of branchf specify that it destroys RESULT.
The primitives for which this optimization is performed are currently null?, pair?, zero?, eq?, <, <=, >, >=, =, char=?, char<?, char<=?, char>?, and char>=?. More should be added, notably many type primitives that are used by system code.
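In the same illustrative style as before, the branchf coalescing can be sketched as a check over a three-instruction window; the tuple encoding and the BRANCH_TESTS table are invented for this sketch.

```python
# Sketch: fuse "reg n; op1 test; branchf L" into a single compare-and-branch
# instruction when the test is one of the eligible predicates.  Because
# branchf destroys RESULT, no boolean object need ever be created.

BRANCH_TESTS = {"null?", "pair?", "zero?", "eq?", "<", "<=", ">", ">=", "=",
                "char=?", "char<?", "char<=?", "char>?", "char>=?"}

def fuse_branchf(window):
    """window: list of (mnemonic, operand...) tuples."""
    if (len(window) == 3
            and window[0][0] == "reg"
            and window[1][0] == "op1" and window[1][1] in BRANCH_TESTS
            and window[2][0] == "branchf"):
        test, rs, label = window[1][1], window[0][1], window[2][1]
        return [("optbreg1", test, rs, label)]
    return window
```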
    (make-vector 7 #f)

In general make-vector can take any expression for the length and initialization arguments, and the above call is compiled as
    const  #f
    setreg 1
    const  7
    op2    make-vector, 1

When the assembler translates the last instruction to native code, it does not know the value in RESULT, and must therefore translate it as a call to the generic make-vector millicode (actually, mem_alloci). That code allocates memory and then initializes it in a loop. For short vectors of known length it is possible to do much better, and the peephole optimizer translates the above pattern into
    const  #f
    setreg 1
    op2    make-vector:7, 1

The SPARC assembly code for the primitive now looks something like this:
    set  6, %r1                          ! 6 = false
    jmpl [ %millicode + M_ALLOC ], %o7   ! allocate
    set  32, %result                     ! 8 words
    set  0x1ca2, %g1
    st   %g1, [ %result ]                ! header
    st   %r1, [ %result+4 ]              ! initialize elt 0
    st   %r1, [ %result+8 ]
    st   %r1, [ %result+12 ]
    st   %r1, [ %result+16 ]
    st   %r1, [ %result+20 ]
    st   %r1, [ %result+24 ]
    st   %r1, [ %result+28 ]
    add  3, %result                      ! insert tag

Additionally, the allocation code can be in-lined.
The optimization is performed only for make-vector, and for all vectors of known length less than a cut-off point. The current cut-off is 10, which is not completely arbitrary: there's a trade-off between fast initialization and blow-up in code size. More sophisticated techniques are possible for longer vectors, for example specialized unrolled loops, but in those cases, generic code with an unrolled initialization loop is likely to do nearly as well.
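The known-length rewrite and its cut-off can be sketched as one more window pattern; the tuple encoding and the CUTOFF constant's name are illustrative, though the value 10 comes from the text above.

```python
# Sketch: rewrite "const k; op2 make-vector, r" into the specialized
# "op2 make-vector:k, r" when k is a known length below the cut-off.
# Longer (or unknown) lengths fall through to the generic millicode path.

CUTOFF = 10   # trade-off: fast in-line initialization vs. code-size blow-up

def specialize_make_vector(window):
    """window: list of (mnemonic, operand...) tuples."""
    if (len(window) == 2
            and window[0][0] == "const" and isinstance(window[0][1], int)
            and window[1][:2] == ("op2", "make-vector")
            and 0 <= window[0][1] < CUTOFF):
        length, reg = window[0][1], window[1][2]
        return [("op2", "make-vector:%d" % length, reg)]
    return window
```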
    global the-procedure
    setrtn Lx
    invoke 3       ! 3 arguments
    .align 4       ! ignored on the SPARC
Lx:

Splitting the setrtn and the invoke is convenient for the compiler and reduces the size of the intermediate-code instruction set. However, on the SPARC a setrtn requires some nontrivial machinery; in the above case it would look something like this:
    call .+8
    add  %o7, k, %o7       ! k reflects the distance from 'call' to Lx
    st   %o7, [ %stkp+4 ]  ! store in the stack frame

and the invoke ends with
    jmpl %tmp0 - 1, %g0    ! jump and discard return address
    set  12, %result       ! setup argument count

But since the return point immediately follows the call in this case, the return address thrown away in the invoke is the return address we want, so we can fold the above MacScheme instructions into
    global the-procedure
    invoke-with-setrtn 3, Lx
Lx:

where invoke-with-setrtn will end in the code sequence
    set  12, %result       ! setup argument count
    jmpl %tmp0 - 1, %o7    ! jump, recording the return address
    st   %o7, [ %stkp+4 ]  ! store return address
The same optimization applies to local non-tail calls, which use branch rather than invoke. In that case, the SPARC's call instruction can be used as a PC-relative branch that saves the return address.
As it happens, this peephole optimization speeds up local non-tail calls but slows down non-local non-tail calls, on a SPARC Ultra 1. The problem with the non-local calls is probably that a pipeline dependency or memory bottleneck is created between the jump instruction and the store instruction in its delay slot. Certainly, in the case of a branch, the target is known to the processor well before the branch instruction is executed, hence the target instruction has probably been fetched already, and the memory system is available for the store.
The optimization has therefore been disabled for non-local calls. Once the run-time system and assembler are rewritten to allow the return address to be kept in a register, the dependency will go away, and this optimization should pay off also for non-local calls.
Generated code must maintain a number of target-independent invariants. The target-dependent invariants are explained in this section.
The mapping from logical to hardware registers is defined in Rts/regs.cfg.
The register GLOBALS points to a table of run-time system variables and values. The globals table is generated automatically from Rts/globals.cfg. Part of the GLOBALS table is a jump table into millicode routines (see below).
Larceny does not use the SPARC's register windows. Instead, generated code uses a set of virtual machine registers that is kept partly in processor registers, partly in the GLOBALS table. Virtual machine registers that are kept in hardware registers also have shadow locations in the GLOBALS table.
The predicate hardware-mapped?, defined in Asm/Sparc/sparcutil.sch, determines whether a virtual machine register is kept in a processor register.
The global Scheme variable *nregs*, which is defined in Compiler/sparc.imp.sch, contains the number of general-purpose virtual registers. Parts of the run-time system know the value of this "variable", so changing it is non-trivial.
In addition to the general-purpose registers, several other values are kept in registers:
    jmpl [ %GLOBALS + offset ], %o7
    nop

where offset is the table offset for the millicode routine in question, as defined in Rts/globals.cfg.
The address passed in %o7 should be the address of the return point minus 8 (as per the normal SPARC conventions, in case you're puzzled), so if the millicode procedure should not return to the point following the nop above, then the slot must be used to adjust the return address. For example,
    jmpl [ %GLOBALS + offset ], %o7
    add  %o7, L0 - (. - 4) - 8, %o7
    ...
L0:

Above, (.-4) is the address of the jmpl instruction, which is also the address in o7, and L0 is the address of the return point, so L0-(.-4)-8 is the quantity to add to o7 to set up a return point at L0-8.
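The arithmetic can be checked with concrete numbers. The addresses below are invented for illustration; the only fixed fact is the SPARC convention that a return lands at o7 + 8.

```python
# Worked example of the return-address adjustment in the delay slot.
jmpl_addr = 0x1000               # "." - 4 in the text: the address in %o7
L0        = 0x1040               # the return point we actually want
adjust    = L0 - jmpl_addr - 8   # the "L0 - (.-4) - 8" operand of the add

o7 = jmpl_addr + adjust          # %o7 after the delay-slot add executes
assert o7 == L0 - 8              # a return point at L0 - 8 ...
assert o7 + 8 == L0              # ... so control returns exactly at L0
```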