Beyond Single Core R: Parallel Data Analysis

I was recently asked to give a short presentation on parallel computing in R for the Greater Toronto R Users Group. My slides can be seen below or on GitHub, where the complete materials can be found.

I covered much of the same ground as in a half-day workshop I gave a couple of years earlier (though, obviously, without the hands-on component):

How to think about parallelism and scalability in data analysis;
The standard parallel package, including what used to be the snow and multicore facilities, using airline data as an example;
The foreach package, using airline data and simple stock data;
A summary of best practices;

with some bonus material tacked on the end touching on a couple of advanced topics.

I was quite surprised at how little had changed since late 2014, other than further development of SparkR (which I didn’t cover) and the interesting but seemingly little-used future package. I was also struck by how hard it is to find similar materials online covering a range of parallel computing topics in R. They are rare enough that even this simple effort made it to the HPC task view on CRAN (under “related links”). R continues to grow in popularity for data analysis; is this all desktop computing? Is Spark siphoning off the clustered-dataframe usage?

(This was also my first time using RPres in RStudio. Wow, not a fan: RPres was not ready for general release. And I say that as a big fan of RMarkdown.)