The Ra just-in-time compiler: a case study
Ra is functionally identical to R
but provides just-in-time (JIT) compilation of loops and arithmetic expressions in loops.
The result is faster code.
In this document we look at some variations of the same piece
of R code to see some of the strengths and weaknesses of the JIT compiler.
The code is the "Distribution of a determinant" on page
154 in Venables and Ripley
S Programming.
This example is also mentioned in
An Introduction to R.
It is introduced in Venables and Ripley as follows:
"Consider the problem of finding the distribution of the determinant
of a 2 x 2 matrix where the entries are independent and uniformly
distributed digits 0, 1, ..., 9.
This amounts to finding all
possible values of ac-bd where
a, b, c and d are digits."
The times, summarized
The table below summarizes timing results for the various solutions we will consider.
Function RTime RaTime Ratio Notes
ms ms Ra/R
dd.for.c 240 230 0.96 original
dd.for.c1 240 err err c() prevents compile
dd.for.c2 240 201 0.84 nojit(val)
dd.for.prealloc 132 10 0.08 preallocate val
dd.for.tabulate 128 6 0.05 tabulate replaces table
dd.fast.tabulate 1.47 1.43 0.97 no loops
The tests were made on a 3GHz Pentium G on a Windows XP system using R 2.6.2 and Ra 1.0.6.
The times are in milliseconds.
Smaller time ratios mean that Ra is faster.
The standard deviation of each number in the table is less than 5%.
The code to reproduce these timing results is here.
Variations on a theme
A first solution to the problem is
dd.for.c <- function()
{
val <- NULL
for (a in 0:9) for (b in 0:9) for (d in 0:9) for (e in 0:9)
val <- c(val, a*b - d*e)
table(val)
}
We enable just-in-time compilation by adding a jit(1) statement:
dd.for.c1 <- function()
{
jit(1) # enable just-in-time compilation
val <- NULL
for (a in 0:9) for (b in 0:9) for (d in 0:9) for (e in 0:9)
val <- c(val, a*b - d*e)
table(val)
}
but run into a problem. Under Ra, we get the error message:
Error: cannot change the length of a jitted variable
Tried to change "val" from length 1 to length 2
The problem is that
the c() statement in the loop body is changing the length of val,
and you are not allowed to change the size or type of a jitted variable.
The environment has to be stable for the JIT compiler to optimize the code.
A way around this is to specify nojit(val).
Everything will get compiled as usual except expressions containing val:
dd.for.c2 <- function()
{
jit(1)
val <- NULL
nojit(val) # inhibit compilation of val
for (a in 0:9) for (b in 0:9) for (d in 0:9) for (e in 0:9)
val <- c(val, a*b - d*e)
table(val)
}
This function is about 20% faster under Ra.
That's not much of an improvement.
The for loops are indeed running faster but that improvement
is swamped by the slowness of the call to c().
(In general, functions called from a JIT block --- such as c() --- are not compiled.
Exceptions are mathematical functions such as sin.)
So we get rid of the c() by preallocating val:
dd.for.prealloc <- function()
{
jit(1)
val <- double(10000) # preallocate val
nval <- 0
for (a in 0:9) for (b in 0:9) for (d in 0:9) for (e in 0:9)
val[nval <- nval + 1] <- a*b - d*e
table(val)
}
The function is now faster under both R and Ra,
but with Ra about 13 times faster than R because of JIT compilation.
Replacing table with tabulate
We can replace the call to table()
with the faster function tabulate():
dd.for.tabulate <- function()
{
jit(1)
val <- double(10000)
nval <- 0
for (a in 0:9) for (b in 0:9) for (d in 0:9) for (e in 0:9)
val[nval <- nval + 1] <- a*b - d*e
tabulate(val) # was table()
}
With this, Ra runs about 21 times faster than R.
Avoiding loops
Venables and Ripley give a solution that avoids loops altogether:
dd.fast.tabulate.jit <- function()
{
jit(1)
val <- outer(0:9, 0:9, "*")
val <- outer(val, val, "-")
tabulate(val)
}
This solution is the fastest of all.
The Ra time is about the same as the R time because there are no loops to optimize.
The jit(1) statement has no effect.
Some conclusions
We should not draw too many conclusions from one case study, but it seems
safe to say the following.
"Vectorization" (i.e. getting rid of loops) usually gives you the fastest code.
The JIT compiler can give substantial performance improvements for looped arithmetic.
It is useful when using loops is a natural or convenient expression of the problem,
or when code cannot be vectorized.
See the Ra homepage for more information.