Over the past several years, as research computing centres and others who run HPC clusters tried to accommodate other forms of computing for data analysis, much effort went into incorporating Hadoop jobs into the scheduler alongside more traditional HPC jobs. It never went especially well, which is a shame, because those past unsuccessful attempts seem to have discouraged experimentation with related next-generation technologies that are a much better fit for large-scale technical computing. Hadoop v1 was always going to be...
I recently taught a one-day machine learning workshop for scientists for the good folks at SciNetHPC. There was enough interest (nearly forty people signed up for a day-long session near the end of term) that we had to book a large-ish classroom. That level of interest in the topic might even be surprising, given that a lot of the material is either familiar or pretty easy to digest for those who spend a lot of their time doing scientific data analysis. But...
I’ve posted a half-day “The Shell for Scientists” tutorial that I’ve given variants of a number of times; the motivating problem, provided by Greg Wilson for a two-day set of tutorials at the University of Toronto, was cleaning up a bunch of auditory lab data on people’s cochlear implants. The focus is on productivity and automation; PDF slides are available here (although I really should translate them into a markdown-based format to make them more reusable). Covered are a number of basic shell commands...
Write data files in a binary format, unless you’re actually going to be reading the output - and you’re not going to be reading millions of data points. The reasons for using binary are threefold, in decreasing order of importance: accuracy, performance, and data size. Accuracy concerns may be the most obvious. When you convert a (binary) floating point number to a string representation of a decimal number, you inevitably truncate at some point. That’s OK if you are sure that when you...
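To make the accuracy point concrete, here is a minimal sketch in Python (the value and the six-digit format are just illustrative): a handful of decimal digits is not enough to round-trip a double, while writing the raw eight bytes is exact.

```python
# Minimal sketch: text output truncates a double, binary output round-trips it.
import struct

x = 2.0 / 3.0

# Text with six decimal digits: the round trip does not recover x.
assert float("%.6f" % x) != x

# Binary: eight bytes, bit-for-bit identical after the round trip.
blob = struct.pack("<d", x)          # little-endian IEEE 754 double
assert struct.unpack("<d", blob)[0] == x
```

You can round-trip a double through text if you print enough significant digits (`"%.17g"`), but then each value costs roughly two to three times the eight bytes of the binary representation and has to be parsed back on read - which is where the performance and data-size arguments come in.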
My colleague Mike Nolta and I have put together a half-day tutorial on Hadoop - briefly covering HDFS, MapReduce, Pig, and Spark - for an HPC audience, and put the materials on GitHub. The Hadoop ecosystem of tools continues to grow rapidly, and now includes tools like Spark and Flink that are very good for iterative numerical computation - either simulation or data analysis. These tools, and the underlying technologies, are (or should be) of real interest to the HPC community, but most materials...
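What makes Spark-style tools good for iteration is that the working set can stay cached in memory across passes, where classic MapReduce re-reads everything from disk on each pass. A minimal PySpark sketch of the idea (the toy data and gradient-descent loop are mine, not from the tutorial materials):

```python
# Toy iterative computation in PySpark: fit y ~ w*x by gradient descent.
# The cached RDD is reused every iteration rather than re-read from disk.
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-sketch")

# x in [0, 1), y = 2x exactly, so the fit should converge to w = 2.
points = sc.parallelize([(i / 1000.0, 2.0 * i / 1000.0)
                         for i in range(1000)]).cache()

w = 0.0
for _ in range(30):
    # Gradient of the squared error, averaged over the cached points.
    grad = points.map(lambda p: (p[1] - w * p[0]) * p[0]).mean()
    w += 1.0 * grad

print(w)  # ~2.0
sc.stop()
```

In vanilla MapReduce the equivalent loop is a chain of separate jobs, each paying a full read and write of HDFS; the `.cache()` is the whole trick.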
R is a great environment for interactive analysis on your desktop, but when your data needs outgrow your personal computer, it’s not clear what to do next. I’ve put together material for a day-long tutorial on scalable data analysis in R. It covers: a brief introduction to R for those coming from a Python background; the bigmemory package for out-of-core computation on large data matrices, with a simple physical sciences example; the standard parallel package, including what used to be the snow and multicore facilities, using airline...
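The out-of-core idea behind bigmemory - keep the matrix on disk and touch it in pieces - has a rough Python analogue in numpy.memmap; a minimal sketch of that idea (the file name and sizes are made up for illustration, and this is not the tutorial’s R code):

```python
# Out-of-core column sums over a file-backed matrix, one row block at a time.
import numpy as np

n_rows, n_cols, block = 200_000, 50, 20_000   # ~80 MB of float64 on disk

# File-backed matrix: the OS pages pieces in and out as they are touched.
m = np.memmap("big_matrix.dat", dtype="float64", mode="w+",
              shape=(n_rows, n_cols))

col_sums = np.zeros(n_cols)
for start in range(0, n_rows, block):
    chunk = m[start:start + block]            # a view; only this block is hot
    chunk[:] = np.random.default_rng(start).random(chunk.shape)
    col_sums += chunk.sum(axis=0)

m.flush()
print(col_sums[:5])
```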