Ra implementation notes               For Ra version 1.2.4

The idea is to make it trivially easy for users to speed up their
arithmetic code --- they just add a jit() statement.

In the current implementation life is not quite that simple --- a
jit() statement has to be added to each function to be jitted (because
jitted functions can't be nested, meaning you can't call a jitted
function from a jitted function).

I have tried to keep the implementation simple by focusing only on the
problem of slow arithmetic in loops in R.  The implementation is not a
general solution which tries to speed up all of R --- such a solution is
probably not worth the complexity.  It's an attempt to get the most
bang for the buck and is a lot less ambitious than, say, Python's Psyco.

Using just-in-time compilation is roughly equivalent to declaring
variables before use.  In a jit block, the jit compiler knows the type
and size of a variable the first time it is used.  Thereafter the
memory address and type are known, and will not change while the code
is in the jit block.  That is, we don't allow users to write R code
that changes the address or type of a variable in a jit block --- they
will get an error message if they try.  This allows jitted code to
access symbol values directly by memory address, rather than with an
expensive call to findVar.  It also means that most memory allocation
and deallocation is unnecessary in jitted code --- the memory is
preallocated just once.

Example of jitted code

    jit(jit=1, trace=3)

    for(i in 1:100)
        x <- x + 3 + atan2(2,3)  # artificial code


This will produce the following on execution.  The "trace=3" output is
quite verbose and some of it has been deleted below.

> for(i in 1:100)
+    x <- x + 3 + atan2(2,3)

# Begin JIT compilation: for (i in 1:100) x <- x + 3 + atan2(2, 3)
#   Did not compile x <- x + 3 + atan2(2, 3)
#   Did not compile x + 3 + atan2(2, 3)
#   Compiled x + 3
#       to JIT type double[1] expression x + 3
        pushsym_r       operand symbol "x"
        push_r          operand double length 1 [1] 3
        add_r1_r1       result  double length 1
# End JIT compilation: for (i in 1:100) x <- JITTED(x + 3) + atan2(2, 3)

On the first iteration through the loop, the jitter tries to compile
the entire expression
    x <- x + 3 + atan2(2, 3)

It fails because it cannot compile the call to "atan2(2, 3)".  (This
is just a limitation of the current implementation but it does provide
a good example here.)

On the next iteration, it tries to compile the subexpression
    x + 3 + atan2(2, 3)
It fails again, for the same reason.

On the next iteration it successfully compiles the subexpression
    x + 3

The implementation

The jit generation code is called from eval() and from routines it calls.
The R front end knows nothing about jitting.

There is a new SEXP type JITSXP which points to a JIT record:

struct JIT_RECORD {
    SEXP        original;  /* expression before it was jitted     */
    SEXP        ans;       /* buf to store the result of jit eval */
    JIT_OP      *ops;      /* array of jit instructions           */
};

As R code in a loop in a jit block is being evaluated by eval, the
jitter appends jit instructions into the ops buffer in the above
structure.  That is, it keeps a record of the operations performed by
eval.  A state machine in the jitter tracks progress through the R
expression.

A compilation is successful if the entire expression can be
converted to jitted form.

If the compilation was successful, we change the evaluated LANGSXP
expression to the JITSXP we have just created.  Thereafter, eval()
will execute the JITSXP instead of the original code.

If the compilation is not successful, we mark the expression as
not-jittable.  On the following iterations through the R loop, we try
to compile sub-expressions, if any, of the not-jittable expression.
Ultimately, all sub-expressions will be jitted or marked as
not-jittable, assuming there are enough iterations through the loop.

At the end of the jit block we restore the expression to the
LANGSXP it was before it was jitted.

See the diagrams of the jit data structures and state machine
on the Ra homepage www.milbo.users.sonic.net/ra

SxpInfo gp bits

There are two new sxpinfo gp bits:

#define JITTED_BIT     0x1000   /* set if symbol binding is jitted */
#define CANNOT_JIT_BIT 0x2000   /* set if LANGSXP is not jittable */

The JITTED_BIT marks symbol bindings that have been jitted. If the bit
is set then the binding cannot be moved in memory, and its length and
type cannot be changed.  The bit is used in a SYMSXP or a
LISTSXP in an environment frame.

If an expression cannot be jitted then we set the CANNOT_JIT bit so we
don't keep retrying to jit it.  In this case, the bit is used in a
LANGSXP.

The R function nojit(sym) sets the CANNOT_JIT bit for the current
binding of sym.  Here the bit is used in a SYMSXP or a
LISTSXP in a frame.

Risks to existing R functionality

I have tried to keep the changes to existing code minimal.  On the
whole, the jit code lies dormant and doesn't interfere unless the
user invokes jit(1).

An exception is for/repeat/while loops.  The C code has been
optimized so loops run a little faster even when jitting is not
enabled.

Another exception is eval() which has been slightly optimized and
also refactored as a consequence of the new JIT code.

Since it is not easy to know what the existing C code does
for every kind of R expression, I have taken a conservative approach
and heavily instrumented the new C code with assertions.  Many of
these assertions could be removed in due course.  For now it's better
to be a bit paranoid.


There is a new file "test-jit.R" with extensive tests for jitted code and for
regression testing the changes made to the loop code.

Running "make check-all" with the new build is clean except for a
warning "Undocumented code object: jit" when checking base.  For now
we will live with the warning (fixing it is more complicated than
it's worth).


New C code that will be revisited is marked with a "RA_TODO" comment.

Code in existing files that was changed is also marked RA_TODO meaning
"I changed this code but realize my change could be questioned".

There is some code in the Ra sources (e.g. #if'ed code to force entry
into the debugger in error.c) that is just for development and will
ultimately be removed.  For now, keeping it in makes the Ra release
process a little simpler.  All such code is marked with RA_TODO.

The jitted code interpreter is a big switch statement.  It is also
bloated with assertions.  The interpreter could easily be made more
efficient.

Note that eval.c is now split into several files:
$ wc -l *eval*.c
  1675 bceval.c         luke's byte code compiler
   344 eval.c           core of the R evaluator
   588 evalloops.c      for/while/repeat loops and if
   290 evalprof.c       R profiler
  1415 evalutils.c      everything else that was in eval
   556 jiteval.c        jit evaluator
  1624 jiteval1.c       jit instructions (mostly cookie cutting)

It would be great if the new code could be reviewed by someone with
experience in R expression evaluation and memory management.

Relationship to the existing byte code compiler

See also Bill Venables' and Luke Tierney's wise words at

The byte compiler and the jitter could dovetail together nicely,
because they are doing quite separate things.  The compiler uses a
top-down approach on static code, and gets rid of overhead that can be
determined by static analysis (mostly interpretation overhead).  The
jitter uses a bottom-up approach on running code and gets rid of the
overhead of runtime type determination and symbol lookup.  It can be
seen as a way of providing variable declarations without bothering
the user.

Speculating, hooks to the jitter could be added to bcEval.  The jitter
would then generate jitted code for selected operations as bcEval
executes, in the same way as it does now for eval i.e. it would
convert selected sequences of compiled bytecode instructions to
sequences of jitted code instructions.  This would make jitted
bytecode arithmetic faster because there would be no runtime symbol
lookup and no runtime case statements based on type (after the first
iteration).

Thus we would get the fastest code using both the bytecode compiler and
the jitter.

There is some duplication of effort in that the jitter and the byte
compiler both have an execution machine.

Wouldn't it be simpler to add declarations to R instead of a jitter?

Maybe, but there are some issues.  For fast code, declarations need to
specify both the type and length of variables.

Footnote: The length is needed so we can preallocate buffers.  For
example, to evaluate
    a = x + y + z
R first adds y and z and saves the result.  It needs a temporary
buffer to save the result.  If the expression is in a loop, standard R
allocates a temporary buffer every time the expression executes.
That's expensive.  Jitted code allocates the buffer once on the first
iteration through the loop and uses the same buffer for subsequent
iterations.

Remember that R variables are SEXPs with a 4 word header, not single
word scalars as they often are in C or Fortran.  You can't save the
value of an R variable in a machine register.

But declaring the length of a variable is too restrictive.  For
example, the convolution example would then only work with fixed-length
vectors.

The argument above falls away if we can find a way around the buffer
allocation problem.  We could just say that specifying the length in
declarations is optional and accept the fact that for variables
declared without lengths we would need to allocate buffers on the fly
--- but that would be slower than jitted code.

An additional disadvantage of the declarations approach is that a
symbol's binding would still have to be looked up at runtime, which is
slow.

A hybrid declarations/just-in-time approach may be worth
considering, especially if we want to jit code that is not in loops.

Speed versus C

It is unlikely that the jitter will ever produce code as fast as raw
C.  There are several things that get in the way:

(1) The jitted code is interpreted by a byte code interpreter.

(2) Jitted code operates on SEXPs, not directly on C ints or doubles.
    Even simple addition 1+1 requires SEXP manipulation.

(3) The jitted code still has to take care of things which you often
    can ignore in C, like the possibility of an out-of-range index
    or div-by-zero.

(4) The jitted code has to be compatible with R, as much as possible.
    For example, %% on integer arguments has to check the signs of the
    args and type convert to double and invoke floor if necessary.  In C
    you would usually just use "x % y", although that can give unexpected
    results for negative arguments.

(5) On the first pass through a loop, JIT compiling the code takes time
    (although this is quite quick and is asymptotically zero relative to
    the total loop time).

(6) "For" loop overhead is high.  The context must be changed and
    setjmp invoked, the iteration vector must be allocated, and the
    iteration variable must be defined.  This is a huge overhead compared
    to C.  The overhead would disappear if "for" loops could be jitted but
    there appears to be no simple way to jit them.  jit(2) as opposed to
    jit(1) is a first attempt, but you have to be more careful when you
    use it than one would like (see the jit help page).

(7) Copying of assigned objects (NAMED handling) causes a lot of overhead.

About compiling code that is not in loops

Currently only code in loops gets jitted.  It would be better if whole
functions could be jitted, so a jitted function would be fast if
called repeatedly.

A way of implementing this would be to add preconditions to jitted code.
That is, at JIT compile time attach a list to the jitted block
which specifies the symbols used in the block and their expected
binding types and lengths.

On starting execution of the jitted block, if the preconditions
weren't met the jitted code would be scrapped and the expression would
be evaluated in its original LANGSXP form.  As a side effect, this
evaluation would cause the expression to be re-jitted under the new
set of conditions.

This would only work if (a) evaluating both the preconditions and the
jitted code is faster than just evaluating the original R code, and
(b) the usual case is that the preconditions are met.  When the
preconditions aren't met the total execution time is the time to
evaluate the preconditions plus the time to evaluate the
expression as a standard R expression.

Additionally, when evaluating the preconditions we would patch the
jitted code so references to bindings would point to the correct place
for the current environment.

This would all be transparent to users, except that they would see
faster execution (and would have to add jit(1) calls to their code).

Invoking the C compiler from within Ra

Another tack would be to add an option for the JIT compiler to
generate C instead of bytecode.  The C code would then be
automatically compiled on the fly using the techniques in the
"inline" package.  The generated machine code would be executed from
within Ra.

This would give us C-like speeds at the cost of a more complicated
setup for users.  They would need to have the C compiler available but
would not have to actually write any C.  I would imagine this will
cause user education issues, especially for people running R under
Windows.  This could be circumvented to a certain extent by the brute
force approach of actually including the mingw gcc compiler and a few
other header files in the Windows version of the jit package.
(Nothing else from the Rtools toolset would be needed I think. Maybe
the linker too.)

However this approach has disadvantages.  The C compiler could only be
invoked if the entire loop was jittable.  It wouldn't be called if the
loop body contained a non-jittable construct (for example, in the
current version of the jitter if the colon operator 1:3 was used in
the loop body).  We would issue a warning message "could not C-compile
the loop because of such and such statement".

The existing jit implementation, on the other hand, quite happily
compiles loop bodies in a piecemeal fashion. What can be compiled,
gets compiled; what can't be compiled, isn't, and coexists with
compiled code.

Calling the C compiler on the fly from within the jitter differs
from Scompile in the following ways (I use the present tense but
none of this has been implemented):

1. The user does not have to explicitly invoke the C compiler.

2. Type definitions etc. in the C code are deduced by the jit
   compiler, relieving the user of making sure that he or she
   is using doubles etc. appropriately.

3. Keeping track of the generated binary files is taken care of by
   the jitter, relieving the user of that.

4. It is less quirky than Scompile --- the entire R language can be
   used, although not everything is compiled.  The generated
   C code never causes the C compiler to issue an error message
   or to generate code that is not equivalent to the R code.

5. It is not as fast as Scompile:

   On the first invocation of the loop the C compiler gets
   invoked.  Ra and Scompile are about the same speed
   at this step because C compilation time dominates
   everything else.

   On subsequent invocations of the loop:

   (i) with Scompile the compiled code gets invoked directly

   (ii) with Ra, the jitter first checks that the types of the
   arguments to the compiled function have not changed and only then
   calls the compiled code.  There is a small overhead to do this
   (it uses techniques described in the above section "About compiling
   code that is not in loops").  But it does mean that using
   the C compiler is safe and you won't shoot yourself in the foot by
   changing the way the function is called.  If the types of your arguments
   change, the jitter will re-invoke the C compiler to generate
   new machine code.

It's worth remembering that there is still lots of room for
improvement in speed in the current approach. The current jitted
bytecode execution machine is very slow compared to what it could
be (it's a big C case statement).

Summing up, my feeling is that invoking the C compiler from the
jitter, although easier to implement (maybe), is not as clean as
making the existing jit bytecode execution machine faster.  It may be
a satisfactory interim solution.  If someone takes the time to
implement it, people will use it.

Stephen Milborrow
May 03, 2008
