Writing Efficient R Code

April 22, 2020

Outline

What we’ll cover today

Intro & general tips
R performance tips: patterns to use and avoid
Vectors & Matrices
Memory management, large tables
Profiling and Benchmarking
Loops

General advice

If you don’t understand something, try some experiments
Browse the documentation, learn its jargon and quirks
Break your code into functions when appropriate
Use functions to reduce the need for global variables
Write tests for your functions
Use git to keep track of changes
Learn to distribute your code as a package

Is R slow?

Sometimes, but well written R programs are usually fast enough.

Designed to make programming easier
- Speed was not the primary design criteria
Slow programs often a result of bad programming practices or not understanding how R works
There are various options for calling C or C++ functions from R

R performance before you start

Premature optimization is the root of all evil – Donald Knuth

Become familiar with R’s vector and apply functions
Consider specialized performance packages
- E.g. data.table, bigmemory, dplyr, RSQLite, snow, multicore, parallel
Consider using external optimizations (OpenBLAS/MKL)
Don’t use an R GUI when performance is important

R tuning advice

Be methodical but don’t get carried away with micro-optimizations
Use monitoring tools such as top, Activity Monitor, Task Manager
Use vector functions
Avoid duplication of objects
Pre-allocate result vectors
Profile your code and run benchmarks
Byte-compile with cmpfun, or call a compiled language (e.g. C, C++)

Vectors & Matrices

Vectors are central to good R programming

Fast, since implemented as a single C or Fortran function
Concise and easy to read
Can often replace for loops
However, heavy use can result in high memory usage

Useful vector functions

math operators: +, -, *, /, ^, %/%, %%
math functions: abs, sqrt, exp, log, log10, cos, sin, tan, sum, prod
logical operators: &, |, !
relational operators: ==, !=, <, >, <=, >=
string functions: nchar, tolower, toupper, grep, sub, gsub, strsplit
conditional function: ifelse (pure R code)
misc: which, which.min, which.max, pmax, pmin, is.na, any, all, rnorm, runif, sprintf, rev, paste, as.integer, as.character

Dynamic features of vectors

initialize `x` and fill it with zeros

n <- 10
x <- double(n)
x

 [1] 0 0 0 0 0 0 0 0 0 0

Extend `x` by assignment

x[15] <- 100
x

 [1]   0   0   0   0   0   0   0   0   0   0  NA  NA  NA  NA 100

Dynamic features of vectors

Resize/truncate `x`

length(x) <- 5
x

[1] 0 0 0 0 0

rnorm vector function

x <- rnorm(10)
x

 [1] -0.4056235  1.3891037  0.2174550  2.3246995  1.2380935 -1.4947794
 [7]  0.2385808 -0.5493807 -1.3513489 -0.2154774

Vector indexing

Extract subvector

x[3:6]

[1]  0.217455  2.324700  1.238094 -1.494779

Extract elements using result of vector relational operation

x[x > 0]

[1] 1.3891037 0.2174550 2.3246995 1.2380935 0.2385808

Vector indexing

You can also use an index to assign values

x[is.na(x)] <- 0

Matrix indexing

Make a new matrix

m <- matrix(rnorm(100), 10, 10)

Extract 2X3 submatrix (non-consecutive columns)

m[3:4, c(5,7,9)]

           [,1]       [,2]      [,3]
[1,] -2.6358462 -0.1964796 -1.439384
[2,]  0.9059187 -0.8001697 -1.865229

Matrix indexing

Extract arbitrary elements as vector

m[cbind(3:6, c(2,4,6,9))]

[1]  0.2120675  0.5209789 -1.1947893  1.0693606

Extract elements using result of vector relational operation

head(m[m > 0])

[1] 0.6022579 0.3916312 1.9208050 0.1392003 0.2249831 0.6924349

Matrix indexing

You can also use a matrix index to assign values

m[is.na(m)] <- 0

Memory Considerations

Memory in R

Avoid duplicating objects, especially big ones or those in loops
Look into memory efficient libraries
Look into other formats to store data

Beware of object duplication

R uses pass by value semantics for function arguments
In general, this requires making copies of objects
- Functions must return modified object
R tries to avoid copying unless necessary

Example of object duplication

tracemem reports when an object is duplicated, which is very useful for debugging performance problems.

In this example, object duplication is expected and helpful.

x <- double(10)
tracemem(x)

[1] "<0x55a6d0c827d8>"

y <- x
y[1] <- 10

tracemem[0x55a6d0c827d8 -> 0x55a6d0c81cd8]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>

Example of object duplication

.Internal(inspect(x))

@55a6d0c827d8 14 REALSXP g0c5 [NAM(7),TR] (len=10, tl=0) 0,0,0,0,0,...

.Internal(inspect(y))

@55a6d0c81cd8 14 REALSXP g0c5 [NAM(1),TR] (len=10, tl=0) 10,0,0,0,0,...

x[1] <- 50

tracemem[0x55a6d0c827d8 -> 0x55a6d04f5a78]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>

Example of unexpected object duplication

Passing matrix to non-primitive function such as `nrow` poisons for duplication

> m <- matrix(0, 3, 3)
> tracemem(m)
[1] "<0x7fc168d29df0>"
> m[1,1] <- 1
> nrow(m)
[1] 3
> m[1,1] <- 2
tracemem[0x7fc168d29df0 -> 0x7fc168d21f58]:

Splitting problem into smaller tasks

R makes it easy to read entire data sets in one operation, but reading it in parts can be much more efficient.

Splitting the problem into smaller tasks is compatible with parallel computing techniques
The foreach & iterators packages provide tools to split inputs into smaller pieces
Use Linux commands (split, awk, etc) or other languages (e.g. Python to preprocess
- Split data files and remove unneeded fields

Beware of `read.table`

The read.table function is commonly used for reading data files, but it can be very slow on large files

Use of the colClasses argument can improve performance
colClasses can be used to skip a column, using less memory
It can be faster to read a file in smaller chunks using the nrows argument
The scan function can be faster
Consider using similar functions, such as readr, data.table, sqldf, and bigmemory

Example csv

head -n 3 sample_people.csv

seq,name/first,name/last,age,street,city,state,zip,dollar,pick,date
1,Leona,Burton,61,Afjoj Loop,Dibemato,VT,57577,273.47,YELLOW,05/26/1946
2,Nina,Patrick,48,Tana Square,Namumene,MT,55670,6774.89,GREEN,02/14/1936

`colClasses` Comparison

Read in 99,999 rows of sample data

system.time(read.csv('sample_people.csv'))

   user  system elapsed 
  1.300   0.021   1.320

system.time(read.csv("sample_people.csv", header=TRUE, colClasses = 
    c("integer", "character", "character", "integer", "character", "character", 
      "character", "character", "numeric", "factor", "character")))

   user  system elapsed 
  0.248   0.000   0.248

An alternate universe

`readr`

Part of tidyverse
Generally faster than read.table

library(readr)
options(readr.num_columns = 0)
system.time(read_csv("sample_people.csv"))

   user  system elapsed 
  0.203   0.008   0.211

More alternative libraries

`data.table`

Like data.frame, but without row labels

`bigmemory`

Defines mutable matrix objects that aren’t automatically duplicated

`big.matrix`

Can use a backing file that is memory mapped

`biganalytics`

apply, biglm, bigglm, bigkmeans, colmax

Save data in binary format

Saving data in a binary format can make it much faster to read the data later. There are a variety of functions available to do that:

save/load
writeBin/readBin
write.big.matrix/read.big.matrix (from bigmemory)

SQLite in R

Consider putting data into an SQLite database.

RSQLite package is easy to use
Easy to get subsets of the data into a data frame
Command line tool very useful for experimenting with queries
Database can be accessed from many different languages
The sqldf package may be useful also
Can be quite slow

Speed up your Basic Linear Algebra Subprograms

Intel Math Kernel Libraries (MKL) also fast, trickier to get working from scratch

MacOS

brew install openblas
brew install r --with-openblas

Linux (Ubuntu)

sudo apt-get install libopenblas-base

Vanilla vs OpenBLAS

My Machine: Ubuntu Bionic, i9-8950HK CPU, R 3.4.4, OpenBLAS 0.2.20 15 tests from http://r.research.att.com/benchmarks/R-benchmark-25.R

Default R: ~29 Seconds

R with OpenBLAS: ~3 seconds

Profiling and Benchmarking

Find your marble, then start chiseling

Profiling vs Benchmarking

Profiling

If you’ve decided you need your code to perform better, profile first
Profiling helps isolate hot spots
Time spent here will likely yeild best ROI

Benchmarking

With hot spots in hand, examine the code and propose alternatives
While ensuring the reesults are the same, ask which performs best

Profiling

R profiling tools

R has builtin support for profiling, but there are additional

packages available:

proftools
profvis (RStudio integration)

Basic profiling with `proftools`

f <- function(a) { g1(a) + g2(2 * a) }

g1 <- function(a) { h(a) }

g2 <- function(a) { sqrt(a) }

h <- function(a) {
  b <- double(length(a))
  for (i in seq_along(a)) {
    b[i] <- sqrt(a[i])
  }
  b
}

Basic profiling with `proftools`

x <- 1:1000000
Rprof('prof.out')
for (i in 1:10) {
  y <- f(x)
}
Rprof(NULL)
summaryRprof("prof.out")$by.self

         self.time self.pct total.time total.pct
"h"           1.78       89       1.86        93
"g2"          0.10        5       0.10         5
"double"      0.08        4       0.08         4
"f"           0.02        1       1.98        99
"paste"       0.02        1       0.02         1

Basic profiling with `profvis`

Can also do this in RStudio, e.g. Profile -> Start Profile

profvis({
for (i in 1:10) {
  y <- f(x)
}
})

Benchmarking

Knowing where code is slow via profiling, use benchmarking tools

Put problem code into a functions
Benchmark different versions of code for comparison
system.time is useful for long running code
microbenchmark package is useful for analyzing short running code

Loops

Wherefore art thou performant… or not?

Are for loops in R slow?

Not all for loops are bad
Most common mistakes involve for loops.
The classic mistake is not preallocating a result vector.

Example 1

Create a vector of length n where all values are x

Example 1: a bad for loop

bad.for <- function(n,x) {
  result <- NULL
  for (i in 1:n) {
    result[i] <- x
  }
  result
}

Large number of iterations
Tiny amount of computation per iteration
Item result vector is reallocated and copied on each iteration
Triggering garbage collection periodically

Example 1: a better for loop

okay.for <- function(n,x) {
  result <- double(n)
  for (i in 1:n) {
    result[i] <- x
  }
  result
}

Improvement over the previous example, but it’s still slow because of the many tiny iterations.

Example 1: a puzzle loop

strange.for <- function(n, x) {
  result <- NULL
  for (i in n:1) {
    result[i] <- x
  }
  result
}

Is this loop faster or slower than the previous two?

Example 1: using a vector function

# use of vector assignment
vector.assn <- function(n, x) {
  result <- double(n)
  result[] <- x
  result
}

We can also use vector assignment

Example 1: using R built-in function

built.in <- function(n, x) {
  rep(x, n)
}

Or, we could read the fine manual and use a built-in function

Example 1: testing

Make sure functions produce identical output

n <- 10000
x <- 7
bad.result        <- bad.for(n, x)
okay.result       <- okay.for(n, x)
strange.result    <- strange.for(n, x)
vector.result     <- vector.assn(n, x)
built.result      <- built.in(n, x)
c(identical(bad.result, okay.result),
identical(bad.result, strange.result),
identical(bad.result, vector.result),
identical(bad.result, built.result))

[1] TRUE TRUE TRUE TRUE

Example 1: benchmark results

res <- microbenchmark(bad=bad.for(n,x), okay=okay.for(n,x), strange=strange.for(n,x),
                      vector=vector.assn(n,x), builtin=built.in(n,x))
kable(summary(res, unit="relative"))

expr	min	lq	mean	median	uq	max	neval
bad	199.156045	196.233778	93.472189	186.788741	160.062897	6.475202	100
okay	95.191405	92.625413	43.237866	87.084876	74.758367	3.359628	100
strange	95.790237	92.912893	42.203484	87.622573	75.647121	1.327639	100
vector	2.993742	2.930285	2.186368	2.798131	2.424152	1.573075	100
builtin	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	100

Example 1: benchmark plot

autoplot(res)

Example 2

Create a matrix with n rows and x columns

Each value in the matrix is sampled from normal distribution, \(\mu=0 , \sigma=1\)

Example 2: another bad for loop

Generate a matrix of values sampled from normal distribution n rows, x columns

bad.norm <- function(n,x) {
  m <- NULL
  for (i in 1:n) {
    m <- rbind(m, rnorm(x))
  }
  m
}

Example 2: preallocation of result vector

Just like before, we build a matrix and populate it with a for loop

ok.norm <- function(n,x) {
  m <- matrix(0, nrow=n, ncol=x)
  for (i in 1:n) {
    m[i,] <- rnorm(100)
  }
  m
}

Example 2: use lapply and rbind

don’t need to preallocate matrix

lapply.norm <- function(n,x) {
  do.call('rbind', lapply(1:n, function(i) rnorm(x)))
}

Example 2: Compute all rows at once

best.norm <- function(n,x) {
  m <- rnorm(x * n)
  dim(m) <- c(x, n)
  t(m)
}

Example 2: Test data

n <- 600
x <- 100
# Verify correct results
set.seed(123); bad.result <- bad.norm(n,x)
set.seed(123); ok.result <- ok.norm(n,x)
set.seed(123); lapply.result <- lapply.norm(n,x)
set.seed(123); best.result <- best.norm(n,x)

c(identical(bad.result, ok.result),
identical(bad.result, lapply.result),
identical(bad.result, best.result))

[1] TRUE TRUE TRUE

Example 2: benchmarks

res <- microbenchmark(bad=bad.norm(n,x), ok=ok.norm(n,x),
                        lapply=lapply.norm(n,x), best=best.norm(n,x))
kable(summary(res, unit="relative"))

expr	min	lq	mean	median	uq	max	neval
bad	12.046303	12.618224	12.047673	12.733923	12.155410	13.131735	100
ok	1.470018	1.501110	1.467030	1.489001	1.469674	1.136864	100
lapply	1.466219	1.471615	1.498865	1.486366	1.489676	1.103843	100
best	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	100

Example 2: benchmark plot

autoplot(res)

Just in Time (JIT) compiling your functions

Results in your function as bytecode

enableJIT(0)

[1] 3

fun.for <- function(x, seed=1423) {
  set.seed(seed)
  y <- double(length(x))
  for (i in seq_along(x)) {
    y[i] <- rnorm(1) * x[i]
  }
  y
}
fun.for.compiled <- cmpfun(fun.for)

Benchmarking JIT

x <- 10000
res <- microbenchmark(fun.for=fun.for(x),
                      fun.for.compiled=fun.for.compiled(x))
kable(summary(res, unit="relative"))

expr	min	lq	mean	median	uq	max	neval
fun.for	1.089921	1.089702	1.212659	1.072328	1.069257	1.89812	100
fun.for.compiled	1.000000	1.000000	1.000000	1.000000	1.000000	1.00000	100

Benchmarking JIT

autoplot(res)

END

Resources

This presentation: github.com/ycrc/efficient-R
Advanced R Book: adv-r.had.co.nz
R Manuals: cran.r-project.org/manuals.html
R Inferno: burns-stat.com/documents/books/the-r-inferno
Efficient R programming: csgillespie.github.io/efficientR
RStudio Webinars: resources.rstudio.com

Outline

What we’ll cover today

General advice

Is R slow?

Is R slow?

Sometimes, but well written R programs are usually fast enough.

R performance before you start

Premature optimization is the root of all evil – Donald Knuth

R tuning advice

Vectors & Matrices

Vectors are central to good R programming

Useful vector functions

Dynamic features of vectors

initialize x and fill it with zeros

Extend x by assignment

Dynamic features of vectors

Resize/truncate x

rnorm vector function

Vector indexing

Extract subvector

Extract elements using result of vector relational operation

Vector indexing

You can also use an index to assign values

Matrix indexing

Make a new matrix

Extract 2X3 submatrix (non-consecutive columns)

Matrix indexing

Extract arbitrary elements as vector

Extract elements using result of vector relational operation

Matrix indexing

You can also use a matrix index to assign values

Memory Considerations

Memory in R

Beware of object duplication

Example of object duplication

Example of object duplication

Example of unexpected object duplication

Passing matrix to non-primitive function such as nrow poisons for duplication

Splitting problem into smaller tasks

Beware of read.table

Example csv

colClasses Comparison

An alternate universe

readr

More alternative libraries

data.table

bigmemory

big.matrix

biganalytics

Save data in binary format

SQLite in R

Speed up your Basic Linear Algebra Subprograms

Vanilla vs OpenBLAS

Profiling and Benchmarking

Find your marble, then start chiseling

Profiling vs Benchmarking

Profiling

Benchmarking

Profiling

R profiling tools

Basic profiling with proftools

Basic profiling with proftools

Basic profiling with profvis

Can also do this in RStudio, e.g. Profile -> Start Profile

Benchmarking

Loops

Wherefore art thou performant… or not?

Are for loops in R slow?

Example 1

Example 1: a bad for loop

Example 1: a better for loop

Example 1: a puzzle loop

Example 1: using a vector function

Example 1: using R built-in function

Example 1: testing

Make sure functions produce identical output

Example 1: benchmark results

Example 1: benchmark plot

Example 2

Example 2: another bad for loop

initialize `x` and fill it with zeros

Extend `x` by assignment

Resize/truncate `x`

Passing matrix to non-primitive function such as `nrow` poisons for duplication

Beware of `read.table`

`colClasses` Comparison

`readr`

`data.table`

`bigmemory`

`big.matrix`

`biganalytics`

Basic profiling with `proftools`

Basic profiling with `proftools`

Basic profiling with `profvis`