Buckle up, CPUs are going to get weirder

The M1 is a good test run, let’s get ready

(Note: This post is adapted from last week’s issue 51 of the resarch computing teams newsletter)

The big news of the past month has been Apple’s new M1 CPU. The M1’s specs in and of themselves kind of interesting, but more important to us in research computing is that the M1 is an example of how CPUs are going to get more different as time goes on, and that will have impacts on our teams. The M1 going to be a trial run for a future of more diverse computing architectures that we’d do well to get ready for.

Large-scale research computing systems have all been about “co-design” for ages, but the truth is that in the mainstream, big-picture CPU design choices have been pretty fixed, with most of co-design being about choice of accelerators or mix and match between CPU, memory. and acceleration. Now that the market has accepted ARM as a platform — and with RISC-V on its way — we can expect to start seeing bolder choices for CPU design being shipped, with vendors making very different tradeoffs than have been made in the past. So whether or not you see yourself using Apple hardware in the future, M1’s choices and their consequences are interesting.

M1 makes two substantially different trade-offs. The first is having DRAM on socket. This sacrifices extensibility — you can’t just add memory — for significantly better memory performance and lower power consumption. Accurately moving bits back and forth between chips takes a surprising amount of energy, and doing it fast takes a lot of power! The results are striking:

M1 MacBook AirでLINPACK動かして電力測定をしてみた。USB PD電力計＋iOS用Linpackという謎アプリのため参考値だが34.01 GFlops/W。まともに測るべきだしスケールしないやり方なので比べられる値ではないが、点灯したLCD込みでGreen 500の1位は超えていることに… うーん、正しいのか？ pic.twitter.com/ldEroByfxt
— Ohtsuji (@ohtsuji) November 17, 2020

LINPACK - solving a set of linear equations - is a pretty flawed benchmark, but it’s widely understood. The performance numbers here are pretty healthy for a chip with four big cores, but the efficiency numbers are startling. They’re not unprecedented except for the context; these wouldn’t be surprising numbers for a GPU, which also have DRAM-on-socket, and are similarly non-extensible. But they are absurdly high for something more general-purpose like a CPU.

Having unified on-socket memory between CPU and integrated GPU also makes possible some great Tensorflow performance, simultaneously speeds up and lowers power consumption for compiling code, and does weirdly well at running postgreSQL.

The second tradeoff has some more immediate effects for research computing teams. Apple, as is its wont, didn’t worry too much about backwards-looking compatibility, happily sacrificing that for future-looking capabilities. The new Rosetta (x86 emulation) seems to work seamlessly and is surprisingly performant. But if you want to take full advantage of the architecture of course you have to compile natively. And on the day of release, a lot of key tools and libraries didn’t just “automatically” work the way they seemed to when most people first started using other ARM chips. (Though that wasn’t magic either; the ecosystem had spent years slowly getting ready for adoption by the mainstream.)

“Freaking out” wouldn’t be too strong a way to describe the reaction in some corners; one user claimed that GATK would “never work” on Apple silicon (because a build script mistakenly assumed that an optional library that had Intel-specific optimizations would be present - they’re on it), and the absence of a free fortran compiler on the day of hardware release worried other people (there’s already experimental gfortran builds). Having come of computational science age in the 90s when new chips took months to get good tooling for, the depth of concern seemed a bit overwrought.

This isn’t to dismiss the amount of work that’s needed to get software stacks working on new systems. Between other ARM systems and M1, a lot of research software teams are going to have to put in a lot of time porting new low-level libraries and tools to the new architectures. Many teams that haven’t had to worry about this sort of thing before are going to have to refactor architecture-specific optimizations out and into libraries. Some code will simply have to be rewritten - some R code has depended on Intel-specific NaN handling to implement NA semantics (which are similar to but different from NaN) that M1 does not honour, so natively compiled R needs extra checks on M1.

It’s also not to dismiss the complexity that people designing and running computing systems will have to face. Fifteen years ago, the constraints on a big computing system made things pretty clear — you’d choose a whackload of x86 with some suitably fast (for your application) network. The main question were how fat are the nodes, what’s the mix of low, medium, and high-memory nodes, and what your storage system is like. It’s been more complex for a while with accelerators, and now with entirely different processor architectures in the mix, it will get harder. Increasingly, there is no “best” system; a system has to be tuned to favour some specific workloads. And that necessarily means disfavouring others, which centres have been loathe to do.

So the point here isn’t M1. Is M1 a good choice for your research computing support needs? Almost certainly not if you run on clusters. And if you’re good with your laptop or desktop, well, then lots of processors will work well enough.

But even so, a lot of software is going to now have to support these new chips. And this is just the start of “weird” CPUs coming for research computing.

CPUs will keep coming that will make radically different tradeoffs than choices than seemed obvious before. That’s going to make things harder for research software and research computing systems teams for a while. A lot of “all the world’s an x86” assumptions - some that are so ingrained they are currently hard to see - are going to get upended, and setting things back right is going to take work. The end result will be more flexible and capable code, build systems, and better-targeted systems, but it’ll take a lot of work to get there. If you haven’t already started using build and deployment workflows and processes that can handle supporting multiple architectures, now is a good time to start.

But the new architectures, wider range of capabilities, and different tradeoff frontiers are also going to expand the realm of what’s possible for research computing. And isn’t that why we got into this field?