Beyond Single Core R: Parallel Data Analysis

I was recently asked to give a short presentation to the Greater Toronto R Users Group on parallel computing in R. My slides can be seen below or on GitHub, where the complete materials can be found.

I covered much of the same ground as in a half-day workshop I gave a couple of years earlier (though, obviously, without the hands-on component):

  • How to think about parallelism and scalability in data analysis
  • The standard parallel package, including the facilities that were formerly the snow and multicore packages, using airline data as an example
  • The foreach package, using airline data and simple stock data
  • A summary of best practices

There was also some bonus material tacked on at the end, touching on a couple of advanced topics.
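For a flavour of the two styles that live inside the parallel package, here is a minimal sketch. It is not taken from the slides: the `simulate_mean` function, the seeds, and the worker count of 2 are all made up for illustration, and a toy simulation stands in for the airline data.

```r
library(parallel)

# Hypothetical stand-in for a per-chunk analysis: the mean of a
# reproducible random sample. Not from the talk materials.
simulate_mean <- function(seed) {
  set.seed(seed)
  mean(rnorm(1000))
}

seeds <- 1:8

# Serial baseline for comparison.
serial <- lapply(seeds, simulate_mean)

# snow-style: an explicit cluster of worker processes (works on all platforms).
cl <- makeCluster(2)
snow_style <- parLapply(cl, seeds, simulate_mean)
stopCluster(cl)

# multicore-style: forked workers. Forking is unavailable on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
fork_style <- mclapply(seeds, simulate_mean, mc.cores = cores)
```

Because each task sets its own seed, all three versions should produce the same values; only the way the work is distributed differs.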

I was quite surprised at how little had changed since late 2014, other than further development of SparkR (which I didn’t cover) and the interesting but seemingly little-used future package. I was also struck by how hard it is to find similar materials online covering a range of parallel computing topics in R. It’s rare enough that even this simple effort made it to the HPC task view on CRAN (under “related links”). R continues to grow in popularity for data analysis; is all of that growth in desktop computing? Is Spark siphoning off the clustered-dataframe usage?

(This was also my first time using RPres in RStudio. Wow, not a fan; RPres was not ready for general release. And I’m a big fan of RMarkdown.)