Beyond Single Core R: Parallel Data Analysis

I was recently asked to give a short presentation for the Greater Toronto R Users Group on parallel computing in R; my slides can be seen below or on github, where the complete materials can be found. I covered some of the same material as in a half-day workshop a couple of years earlier (though, obviously, without the hands-on component): how to think about parallelism and scalability in data analysis; the standard parallel package, including what were the snow and multicore facilities, using airline data as...

Continue...

MPI's Place in Big Computing

Jonathan Dursi

The organizers of EuroMPI 2016 were kind enough to invite me to give a keynote and participate in a panel at their meeting, which was held at the end of September in beautiful Edinburgh. The event was terrific, with lots of very interesting work going on in MPI implementations and with MPI. The topic of my talk was “MPI’s Place in Big Computing”; the materials from the talk can be found on github. The talk, as you might expect, included discussion of high-productivity big data...

Continue...

Jupyter Notebooks for Performing and Sharing Bioinformatics Analyses

Jonathan Dursi

R
tutorial

I was asked to do a half-day tutorial at the Great Lakes Bioinformatics conference Workshop session. The focus was mainly on R, with some Python as well. We covered: the basics of Jupyter notebooks, what they are and how they work; how to install and run Jupyter notebooks on a laptop, in R and Python; how to perform interactive analyses in a web browser using Jupyter; using Markdown and LaTeX to; how to “port” an R bioinformatics workflow from some scripts into a Jupyter...

Continue...

Spark, Chapel, TensorFlow: Workshop at UMich

Jonathan Dursi

The kind folks at the University of Michigan’s Center for Computational Discovery and Engineering (MICDE), which is just part of the very impressive Advanced Research Computing division, invited me to give a workshop there a couple of months ago about the rapidly-evolving large-scale numerical computing ecosystem. There’s lots that I want to do to extend this to a half-day length, but the workshop materials — including a VM that can be used to play with Spark, Chapel and TensorFlow, along with Jupyter notebooks for each...

Continue...

Approximate Mapping of Nanopore Squiggle Data with Spatial Indexing

Jonathan Dursi

Over at the Simpson Lab blog, I have a post describing a novel method for Directly Mapping Squiggle Data, using k-d trees to map segmented kmers; a simple proof of concept is available on github.

Continue...

On Random vs. Streaming I/O Performance; Or seek(), and You Shall Find --- Eventually.

Jonathan Dursi

At the Simpson Lab blog, I’ve written a post on streaming vs random access I/O performance, an important topic in bioinformatics. Using a very simple problem (randomly choosing lines in a non-indexed text file) I give a quick overview of the file system stack and what it means for streaming performance, and reservoir sampling for uniform random online sampling.
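
As a rough illustration of the reservoir sampling idea mentioned above, here is a minimal Python sketch of the classic single-pass approach (often called Algorithm R). It is not taken from the linked post; the function name `reservoir_sample`, its parameters, and the in-memory stand-in for a text file are all assumptions made for this example.

```python
import io
import random


def reservoir_sample(lines, k, rng=None):
    """Pick k items uniformly at random from an iterable in one streaming pass.

    Hypothetical helper for illustration: the total number of lines does not
    need to be known in advance, and no seeking is required.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Keep line i with probability k / (i + 1) by drawing a slot
            # uniformly from 0..i (inclusive) and replacing only if it
            # falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


# Stand-in for a non-indexed text file: 1000 lines streamed from memory.
text = io.StringIO("".join(f"line {n}\n" for n in range(1000)))
sample = reservoir_sample(text, 5, random.Random(42))
print(sample)
```

The same function works on an open file handle, since it only iterates forward; that is the property that lets it take advantage of streaming reads rather than random seeks.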

Continue...