May 22, 2019

Outline

What we’ll cover today

  • Intro & general tips
  • R performance tips: patterns to use and avoid
  • Vectors & Matrices
  • Memory management, large tables
  • Profiling and Benchmarking
  • Loops

General advice

  • If you don’t understand something, try some experiments
  • Browse the documentation, learn its jargon and quirks
  • Break your code into functions when appropriate
  • Use functions to reduce the need for global variables
  • Write tests for your functions
  • Use git to keep track of changes
  • Learn to distribute your code as a package

Is R slow?

Sometimes, but well written R programs are usually fast enough.

  • Designed to make programming easier
    • Speed was not the primary design criterion
  • Slow programs are often the result of bad programming practices or of not understanding how R works
  • There are various options for calling C or C++ functions from R

R performance before you start

Premature optimization is the root of all evil – Donald Knuth

  • Become familiar with R’s vector and apply functions
  • Consider specialized performance packages
    • E.g. data.table, bigmemory, dplyr, RSQLite, snow, multicore, parallel
  • Consider using external optimizations (OpenBLAS/MKL)
  • Don’t use an R GUI when performance is important

R tuning advice

  • Be methodical but don’t get carried away with micro-optimizations
  • Use monitoring tools such as top, Activity Monitor, Task Manager
  • Use vector functions
  • Avoid duplication of objects
  • Pre-allocate result vectors
  • Profile your code and run benchmarks
  • Byte-compile with compiler::cmpfun (done automatically by the JIT since R 3.4.0), or call a compiled language (e.g. C, C++)
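
To make the pre-allocation and vector-function advice above concrete, here is a minimal sketch comparing three ways of building the same result. The function names grow, prealloc and vectorized are made up for this illustration, and the timings will vary by machine and R version, so none are quoted.

```r
n <- 10000

# Grow the result one element at a time: R may reallocate as the vector grows.
grow <- function(n) {
  res <- c()
  for (i in seq_len(n)) res[i] <- i^2
  res
}

# Pre-allocate the result vector once, then fill it in place.
prealloc <- function(n) {
  res <- double(n)
  for (i in seq_len(n)) res[i] <- i^2
  res
}

# Fully vectorized version for comparison.
vectorized <- function(n) seq_len(n)^2

# All three approaches return the same values.
stopifnot(isTRUE(all.equal(grow(n), prealloc(n))),
          isTRUE(all.equal(prealloc(n), vectorized(n))))

# Rough timings only; use a benchmarking package for serious comparisons.
system.time(grow(n))
system.time(prealloc(n))
system.time(vectorized(n))
```

Checking that the variants agree before timing them is a cheap way to avoid optimizing a function into giving wrong answers.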

Vectors & Matrices

Vectors are central to good R programming

  • Fast, since each vector operation is typically implemented as a single C or Fortran function
  • Concise and easy to read
  • Can often replace for loops
  • However, heavy use can result in high memory usage
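
As an illustration of the points above, the following sketch replaces a for loop with vector operations. The seed and the variable names (total_loop, total_vec) are arbitrary choices for this example.

```r
set.seed(1)          # arbitrary seed so the run is reproducible
x <- rnorm(1000)

# Loop version: sum the square roots of the positive values.
total_loop <- 0
for (v in x) {
  if (v > 0) total_loop <- total_loop + sqrt(v)
}

# Vectorized version: one logical index, one sqrt call, one sum.
total_vec <- sum(sqrt(x[x > 0]))

# Compare with a tolerance rather than exact equality.
all.equal(total_loop, total_vec)
```

Note the memory trade-off: x > 0 and x[x > 0] each create a temporary vector, which is where the higher memory usage of heavily vectorized code comes from.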

Useful vector functions

  • math operators: +, -, *, /, ^, %/%, %%
  • math functions: abs, sqrt, exp, log, log10, cos, sin, tan, sum, prod
  • logical operators: &, |, !
  • relational operators: ==, !=, <, >, <=, >=
  • string functions: nchar, tolower, toupper, grep, sub, gsub, strsplit
  • conditional function: ifelse (pure R code)
  • misc: which, which.min, which.max, pmax, pmin, is.na, any, all, rnorm, runif, sprintf, rev, paste, as.integer, as.character
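
A few of the functions listed above, applied to a small made-up vector. This is only a sketch; the values in x were chosen to include a negative number, a zero and an NA.

```r
x <- c(-2, 5, NA, 0, 3)

na_mask   <- is.na(x)                 # TRUE only at position 3
positive  <- which(x > 0)             # positions 2 and 5; which() drops the NA
clipped   <- pmax(x, 0)               # element-wise maximum against 0; NA stays NA
filled    <- ifelse(is.na(x), 0, x)   # replace NA with 0, keep other values
has_neg   <- any(x < 0, na.rm = TRUE) # TRUE, because of the -2
largest   <- which.max(x)             # position of the maximum, ignoring NA
labels    <- paste("v", seq_along(x), sep = "")
formatted <- sprintf("%.1f", x[!is.na(x)])
```

Each call processes the whole vector at once, so no explicit loop or index variable is needed.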

Dynamic features of vectors

Initialize x and fill it with zeros

n <- 10
x <- double(n)
x
 [1] 0 0 0 0 0 0 0 0 0 0

Extend x by assignment

x[15] <- 100
x
 [1]   0   0   0   0   0   0   0   0   0   0  NA  NA  NA  NA 100

Resize/truncate x

length(x) <- 5
x
[1] 0 0 0 0 0

rnorm vector function

x <- rnorm(10)
x
 [1] -0.2719217 -0.6587842  0.3845842  1.1994035  0.8682213 -0.8759726
 [7]  1.8429973 -2.1819561 -0.8345673 -0.6641136

Vector indexing

Extract subvector

x[3:6]
[1]  0.3845842  1.1994035  0.8682213 -0.8759726

Extract elements using result of vector relational operation

x[x > 0]
[1] 0.3845842 1.1994035 0.8682213 1.8429973

You can also use an index to assign values

x[is.na(x)] <- 0

Matrix indexing

Make a new matrix

m <- matrix(rnorm(100), 10, 10)

Extract a 2x3 submatrix (non-consecutive columns)

m[3:4, c(5,7,9)]
          [,1]      [,2]       [,3]
[1,] 0.2154793 0.8835384 -2.0350827
[2,] 1.4431649 1.1231730 -0.2943935

Extract arbitrary elements as vector

m[cbind(3:6, c(2,4,6,9))]
[1] -1.0015788  0.8170496 -0.4317668 -0.7052398

Extract elements using result of vector relational operation

head(m[m > 0])
[1] 0.7602174 0.4952584 0.4411250 1.2670114 1.2284761 0.7516520
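
Because the random matrix above makes the results hard to check by eye, here is the same set of indexing forms on a small deterministic matrix. The matrix and the names sub, idx, picked and big are made up for this sketch.

```r
m <- matrix(1:20, nrow = 4, ncol = 5)   # filled column by column

# Submatrix: rows 2-3, columns 1, 3 and 5.
sub <- m[2:3, c(1, 3, 5)]

# A two-column matrix index selects individual (row, column) pairs:
# here (1,1), (2,3) and (4,5).
idx <- cbind(c(1, 2, 4), c(1, 3, 5))
picked <- m[idx]

# A logical matrix index returns the matching elements as a vector,
# in column-major order.
big <- m[m > 15]
```

The cbind form is the one to reach for when the wanted elements do not form a rectangular block.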

You can also use a matrix index to assign values

m[is.na(m)] <- 0

Memory Considerations

Memory in R

  • Avoid duplicating objects, especially big ones or those in loops
  • Look into memory efficient libraries
  • Look into other formats to store data
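
One way to see when R duplicates an object is sketched below. It relies on R's copy-on-modify behavior and uses tracemem, which only works when R was built with memory profiling, so the calls are guarded with capabilities("profmem"). The vector size is an arbitrary choice.

```r
x <- double(1e6)
y <- x                 # no copy yet: x and y refer to the same memory

# tracemem() reports when the traced object is copied.
if (capabilities("profmem")) tracemem(x)

y[1] <- 1              # modifying y forces a copy of the whole vector
x[1]                   # x itself is unchanged

if (capabilities("profmem")) untracemem(x)
```

Inside a loop, a copy like this on every iteration can dominate the run time, which is why the advice above singles out big objects and loops.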

Beware of object duplication