Hadoop For HPCers

My colleague Mike Nolta and I have put together a half-day tutorial on Hadoop for an HPC audience, briefly covering HDFS, MapReduce, Pig, and Spark, and have put the materials on GitHub.

The Hadoop ecosystem continues to grow rapidly, and now includes tools like Spark and Flink that are well suited to iterative numerical computation, whether simulation or data analysis. These tools, and the underlying technologies, are (or should be) of real interest to the HPC community, but most materials are written for audiences with web application or perhaps machine-learning backgrounds, which makes it harder for HPC practitioners to see how the tools are useful and how they might be applied.

Most of the source code is Python. Included on GitHub are all sources for the examples, a Vagrantfile for a VM to run the software on your laptop, and the presentation in Markdown and PDF. Feel free to fork, send pull requests, or use the materials as you see fit.

hpc
hadoop
spark
tutorial