- Intro & general tips
- R performance tips: patterns to use and avoid
- Vectors & Matrices
- Memory management, large tables
- Profiling and Benchmarking
- Loops
April 22, 2020
General tips:

- Use git to keep track of changes
- Packages that help with large data sets and parallel computing: data.table, bigmemory, dplyr, RSQLite, snow, multicore, parallel
- Watch resource usage with top, Activity Monitor, or Task Manager
- Byte-compile functions with cmpfun, or call a compiled language (e.g. C, C++)
Many operators and functions in R are vectorized, operating on entire vectors in a single call:

- Arithmetic operators: `+`, `-`, `*`, `/`, `^`, `%/%`, `%%`
- Math functions: abs, sqrt, exp, log, log10, cos, sin, tan, sum, prod
- Logical operators: `&`, `|`, `!`
- Comparison operators: `==`, `!=`, `<`, `>`, `<=`, `>=`
- String functions: nchar, tolower, toupper, grep, sub, gsub, strsplit
- ifelse (pure R code)
- Others: which, which.min, which.max, pmax, pmin, is.na, any, all, rnorm, runif, sprintf, rev, paste, as.integer, as.character
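A minimal sketch (not from the original slides) contrasting an explicit loop with two vectorized equivalents for clipping negative values to zero:

```r
x <- c(-2, -1, 0, 1, 2)

# Loop version: handle one element at a time
clipped <- double(length(x))
for (i in seq_along(x)) {
  clipped[i] <- if (x[i] > 0) x[i] else 0
}

# Vectorized versions: one call processes the whole vector
identical(clipped, pmax(x, 0))           # TRUE
identical(clipped, ifelse(x > 0, x, 0))  # TRUE
```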
Create a vector x and fill it with zeros:

```r
n <- 10
x <- double(n)
x
[1] 0 0 0 0 0 0 0 0 0 0
```

Extend x by assigning past its end; the gap is filled with NA:

```r
x[15] <- 100
x
[1] 0 0 0 0 0 0 0 0 0 0 NA NA NA NA 100
```

Truncate x by shortening its length:

```r
length(x) <- 5
x
[1] 0 0 0 0 0
```
Index a vector by position or by a logical condition:

```r
x <- rnorm(10)
x
[1] -0.4056235  1.3891037  0.2174550  2.3246995  1.2380935 -1.4947794
[7]  0.2385808 -0.5493807 -1.3513489 -0.2154774
x[3:6]
[1]  0.217455  2.324700  1.238094 -1.494779
x[x > 0]
[1] 1.3891037 0.2174550 2.3246995 1.2380935 0.2385808
x[is.na(x)] <- 0      # replace any missing values with zero
```
The same patterns work for matrices, plus indexing with a matrix of (row, column) pairs:

```r
m <- matrix(rnorm(100), 10, 10)
m[3:4, c(5,7,9)]           # rows 3-4; columns 5, 7, and 9
           [,1]       [,2]      [,3]
[1,] -2.6358462 -0.1964796 -1.439384
[2,]  0.9059187 -0.8001697 -1.865229
m[cbind(3:6, c(2,4,6,9))]  # elements (3,2), (4,4), (5,6), (6,9)
[1]  0.2120675  0.5209789 -1.1947893  1.0693606
head(m[m > 0])
[1] 0.6022579 0.3916312 1.9208050 0.1392003 0.2249831 0.6924349
m[is.na(m)] <- 0
```
tracemem reports when an object is duplicated, which is very useful for debugging performance problems. R copies on modification: `y <- x` does not copy the vector, but the first write to y duplicates it so that x is unaffected. In this example, that duplication is expected and helpful.

```r
x <- double(10)
tracemem(x)
[1] "<0x55a6d0c827d8>"
y <- x        # no copy yet; y and x share the same memory
y[1] <- 10    # the first write to y triggers the copy
tracemem[0x55a6d0c827d8 -> 0x55a6d0c81cd8]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
.Internal(inspect(x))
@55a6d0c827d8 14 REALSXP g0c5 [NAM(7),TR] (len=10, tl=0) 0,0,0,0,0,...
.Internal(inspect(y))
@55a6d0c81cd8 14 REALSXP g0c5 [NAM(1),TR] (len=10, tl=0) 10,0,0,0,0,...
x[1] <- 50
tracemem[0x55a6d0c827d8 -> 0x55a6d04f5a78]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
```
Surprisingly, merely passing the matrix to nrow "poisons" it for duplication: the next modification copies the whole matrix, even though earlier writes were made in place.

```r
> m <- matrix(0, 3, 3)
> tracemem(m)
[1] "<0x7fc168d29df0>"
> m[1,1] <- 1    # modified in place, no copy reported
> nrow(m)
[1] 3
> m[1,1] <- 2    # now the write duplicates m
tracemem[0x7fc168d29df0 -> 0x7fc168d21f58]:
```
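A practical consequence, sketched below (the function is hypothetical, and this reflects the NAMED semantics of the R 3.x versions used in these examples; R 4.0's reference counting avoids this particular copy): query dimensions once before a sequence of writes, not between them.

```r
scale.rows <- function(m, w) {
  nr <- nrow(m)          # query the dimension BEFORE the writes...
  for (i in 1:nr) {
    m[i, ] <- m[i, ] * w[i]
    # ...not between them: a call such as nrow(m) here would mark m
    # as shared again, forcing the next write to copy the whole matrix
  }
  m
}
```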
R makes it easy to read an entire data set in one operation, but reading it in parts can be much more efficient (a sketch follows this list):

- The foreach and iterators packages provide tools to split inputs into smaller pieces
- Command-line tools (split, awk, etc.) or other languages (e.g. Python) can preprocess the data
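A minimal sketch of chunked reading with base R connections (the file name, chunk size, and processing step are assumptions):

```r
con <- file("sample_people.csv", open = "r")
header <- readLines(con, n = 1)   # consume the header line
repeat {
  chunk <- tryCatch(read.csv(con, nrows = 10000, header = FALSE),
                    error = function(e) NULL)   # NULL at end of file
  if (is.null(chunk)) break
  # ... process this chunk, e.g. update running totals ...
  if (nrow(chunk) < 10000) break                # last, partial chunk
}
close(con)
```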
The read.table function is commonly used for reading data files, but it can be very slow on large files:

- Specifying column types via the colClasses argument can improve performance
- colClasses can also be used to skip a column, using less memory
- Limit how much is read at once with the nrows argument
- The scan function can be faster
- Alternative packages to consider: readr, data.table, sqldf, and bigmemory
```
head -n 3 sample_people.csv
seq,name/first,name/last,age,street,city,state,zip,dollar,pick,date
1,Leona,Burton,61,Afjoj Loop,Dibemato,VT,57577,273.47,YELLOW,05/26/1946
2,Nina,Patrick,48,Tana Square,Namumene,MT,55670,6774.89,GREEN,02/14/1936
```
colClasses comparison: read in 99,999 rows of sample data.

```r
system.time(read.csv('sample_people.csv'))
   user  system elapsed
  1.300   0.021   1.320
system.time(read.csv("sample_people.csv", header=TRUE,
                     colClasses = c("integer", "character", "character",
                                    "integer", "character", "character",
                                    "character", "character", "numeric",
                                    "factor", "character")))
   user  system elapsed
  0.248   0.000   0.248
```
readr vs. read.table:

```r
library(readr)
options(readr.num_columns = 0)
system.time(read_csv("sample_people.csv"))
   user  system elapsed
  0.203   0.008   0.211
```
- data.table: similar to a data.frame, but without row labels (see the fread sketch after this list)
- bigmemory: provides the big.matrix type
- biganalytics: functions that operate on big.matrix objects, such as apply, biglm, bigglm, bigkmeans, colmax
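A quick sketch of data.table's fast reader on the same file (not timed in the original):

```r
library(data.table)
dt <- fread("sample_people.csv")  # column types are detected automatically
```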
Saving data in a binary format can make it much faster to read the data later. There are a variety of functions available to do that:

- save / load
- writeBin / readBin
- write.big.matrix / read.big.matrix (from bigmemory)
Consider putting data into an SQLite database:

- The RSQLite package is easy to use
- The sqldf package may be useful also
package may be useful alsoSee also: http://brettklamer.com/diversions/statistical/faster-blas-in-r/ https://cran.r-project.org/doc/manuals/r-release/R-admin.html#BLAS
Intel Math Kernel Libraries (MKL) also fast, trickier to get working from scratch
MacOS:

```
brew install openblas
brew install r --with-openblas
```

Linux (Ubuntu):

```
sudo apt-get install libopenblas-base
```
My machine: Ubuntu Bionic, i9-8950HK CPU, R 3.4.4, OpenBLAS 0.2.20. 15 tests from http://r.research.att.com/benchmarks/R-benchmark-25.R:

- Default R: ~29 seconds
- R with OpenBLAS: ~3 seconds
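A quick way to see the effect on your own machine (the sizes below are arbitrary): dense matrix multiplication calls straight into BLAS, so an optimized library shows a dramatic difference. In recent versions of R, sessionInfo() also reports which BLAS/LAPACK libraries are loaded.

```r
m <- matrix(rnorm(2000 * 2000), 2000)
system.time(m %*% m)   # dominated by the BLAS matrix-multiply routine
```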
R has built-in support for profiling, but there are additional packages available:

- proftools
- profvis (RStudio integration)

Example functions to profile:

```r
f <- function(a) {
  g1(a) + g2(2 * a)
}
g1 <- function(a) {
  h(a)
}
g2 <- function(a) {
  sqrt(a)
}
h <- function(a) {
  b <- double(length(a))
  for (i in seq_along(a)) {
    b[i] <- sqrt(a[i])
  }
  b
}
```

Collect a profile with Rprof and summarize it:

```r
x <- 1:1000000
Rprof('prof.out')
for (i in 1:10) {
  y <- f(x)
}
Rprof(NULL)
summaryRprof("prof.out")$by.self
         self.time self.pct total.time total.pct
"h"           1.78       89       1.86        93
"g2"          0.10        5       0.10         5
"double"      0.08        4       0.08         4
"f"           0.02        1       1.98        99
"paste"       0.02        1       0.02         1
```
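The profile shows nearly all the time in h, which re-implements a vectorized operation with an element-by-element loop; a sketch of the obvious fix:

```r
h <- function(a) {
  sqrt(a)   # sqrt is already vectorized; the explicit loop was pure overhead
}
```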
profvis:

```r
profvis({
  for (i in 1:10) {
    y <- f(x)
  }
})
```
Knowing (via profiling) where code is slow, use benchmarking tools to compare alternative implementations:

- system.time is useful for long-running code
- The microbenchmark package is useful for analyzing short-running code

Task: create a vector of length n where all values are x.
A first attempt grows the result one element at a time:

```r
bad.for <- function(n, x) {
  result <- NULL              # no preallocation: result grows each iteration
  for (i in 1:n) {
    result[i] <- x
  }
  result
}
```

Preallocating is an improvement over the previous example, but it's still slow because of the many tiny iterations:

```r
okay.for <- function(n, x) {
  result <- double(n)         # preallocate the full vector
  for (i in 1:n) {
    result[i] <- x
  }
  result
}
```

Is this loop faster or slower than the previous two?

```r
strange.for <- function(n, x) {
  result <- NULL
  for (i in n:1) {            # filling backwards: the first assignment,
    result[i] <- x            # result[n] <- x, allocates the full length
  }
  result
}
```

We can also use vector assignment:

```r
vector.assn <- function(n, x) {
  result <- double(n)
  result[] <- x               # one vectorized assignment, no loop
  result
}
```

Or, we could read the fine manual and use a built-in function:

```r
built.in <- function(n, x) {
  rep(x, n)
}
```
```r
n <- 10000
x <- 7
bad.result     <- bad.for(n, x)
okay.result    <- okay.for(n, x)
strange.result <- strange.for(n, x)
vector.result  <- vector.assn(n, x)
built.result   <- built.in(n, x)
c(identical(bad.result, okay.result),
  identical(bad.result, strange.result),
  identical(bad.result, vector.result),
  identical(bad.result, built.result))
[1] TRUE TRUE TRUE TRUE
```
```r
library(microbenchmark)
library(knitr)     # kable
library(ggplot2)   # autoplot
res <- microbenchmark(bad     = bad.for(n, x),
                      okay    = okay.for(n, x),
                      strange = strange.for(n, x),
                      vector  = vector.assn(n, x),
                      builtin = built.in(n, x))
kable(summary(res, unit = "relative"))
```
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
bad | 199.156045 | 196.233778 | 93.472189 | 186.788741 | 160.062897 | 6.475202 | 100 |
okay | 95.191405 | 92.625413 | 43.237866 | 87.084876 | 74.758367 | 3.359628 | 100 |
strange | 95.790237 | 92.912893 | 42.203484 | 87.622573 | 75.647121 | 1.327639 | 100 |
vector | 2.993742 | 2.930285 | 2.186368 | 2.798131 | 2.424152 | 1.573075 | 100 |
builtin | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 100 |
```r
autoplot(res)
```
Task: generate a matrix with n rows and x columns, where each value is sampled from the normal distribution, \(\mu=0, \sigma=1\).
Just like before, we can build the matrix with a for loop, growing it one row at a time:

```r
bad.norm <- function(n, x) {
  m <- NULL                   # no preallocation: rbind copies m every time
  for (i in 1:n) {
    m <- rbind(m, rnorm(x))
  }
  m
}
```

Preallocate the full matrix, then fill it one row per iteration:

```r
ok.norm <- function(n, x) {
  m <- matrix(0, nrow = n, ncol = x)
  for (i in 1:n) {
    m[i, ] <- rnorm(x)
  }
  m
}
```

Build the rows with lapply and bind them all at once:

```r
lapply.norm <- function(n, x) {
  do.call('rbind', lapply(1:n, function(i) rnorm(x)))
}
```

Generate all the values in a single call, then reshape:

```r
best.norm <- function(n, x) {
  m <- rnorm(x * n)
  dim(m) <- c(x, n)           # fill column-by-column...
  t(m)                        # ...then transpose so draws run along rows
}
```
```r
n <- 600
x <- 100
# Verify correct results
set.seed(123); bad.result    <- bad.norm(n, x)
set.seed(123); ok.result     <- ok.norm(n, x)
set.seed(123); lapply.result <- lapply.norm(n, x)
set.seed(123); best.result   <- best.norm(n, x)
c(identical(bad.result, ok.result),
  identical(bad.result, lapply.result),
  identical(bad.result, best.result))
[1] TRUE TRUE TRUE
```
```r
res <- microbenchmark(bad    = bad.norm(n, x),
                      ok     = ok.norm(n, x),
                      lapply = lapply.norm(n, x),
                      best   = best.norm(n, x))
kable(summary(res, unit = "relative"))
```
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
bad | 12.046303 | 12.618224 | 12.047673 | 12.733923 | 12.155410 | 13.131735 | 100 |
ok | 1.470018 | 1.501110 | 1.467030 | 1.489001 | 1.469674 | 1.136864 | 100 |
lapply | 1.466219 | 1.471615 | 1.498865 | 1.486366 | 1.489676 | 1.103843 | 100 |
best | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 100 |
```r
autoplot(res)
```
Functions can be byte-compiled with cmpfun from the compiler package. First disable just-in-time compilation so the interpreted and compiled versions can be compared; the return value is the previous JIT level:

```r
library(compiler)
enableJIT(0)
[1] 3
```

```r
fun.for <- function(x, seed = 1423) {
  set.seed(seed)
  y <- double(length(x))
  for (i in seq_along(x)) {
    y[i] <- rnorm(1) * x[i]
  }
  y
}
fun.for.compiled <- cmpfun(fun.for)
```
```r
x <- rnorm(10000)   # a long vector, so the loop body runs many times
res <- microbenchmark(fun.for = fun.for(x),
                      fun.for.compiled = fun.for.compiled(x))
kable(summary(res, unit = "relative"))
```
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
fun.for | 1.089921 | 1.089702 | 1.212659 | 1.072328 | 1.069257 | 1.89812 | 100 |
fun.for.compiled | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 100 |
```r
autoplot(res)
```
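Afterwards, JIT compilation can be restored to the level reported by the earlier enableJIT(0) call:

```r
enableJIT(3)   # 3 was the previous level reported above
```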