Blue Gene/P First Impressions - Hardware

The institution at which I work recently acquired a Blue Gene/P (BG/P) system as the first of three phases of building an advanced supercomputing center here, and as a computationalist in materials science, I've been given the opportunity to attend a three-day crash course on using it.

Getting access to a BG/P is in itself pretty exciting--having an account (albeit a temporary evaluation one) is like being a kid in a candy store.  I've become quite the computer nerd over the last decade, but even as a professional computationalist, my work is done entirely on commodity hardware.  I've never had access to anything as exotic as BG/P, and I've certainly never had the opportunity to be taught how such a machine is designed by the people who developed it.  The guys IBM sent to conduct this course really know their stuff and speak a language I don't often get to hear, so I've been trying to soak up as much as I can in the three days they're here.

So here are my first impressions.


Hardware

Blue Gene/P, contrary to my initial expectations, turns out to be a heterogeneous system.  Our head node is an IBM Power 550 Express with two dual-core 4.2 GHz POWER6 processors (with SMT, IBM's equivalent of hyperthreading) and 16 GB of DDR2, running SUSE Linux Enterprise Server 10 SP1, although it sounds like it isn't uncommon for the head nodes to be x86-based Linux systems.  The BG/P compute nodes are sufficiently different from anything capable of acting as a head node that all scientific code needs to be cross-compiled anyway, and cross-compiling for BG/P on Linux/x86 is no different than doing so on Linux/POWER6.
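
Just to make the cross-compiling point concrete, here is the sort of trivial MPI hello-world I've been building on the head node.  The code itself is completely ordinary; the only BG/P-specific part is that it has to be built with the cross-compiler wrappers (something like mpixlc on our system--the exact wrapper names are a site detail, so treat that as an assumption) so that the binary targets the PPC450 compute nodes rather than the POWER6 head node.

    /* hello_bgp.c -- a minimal MPI hello-world.
     *
     * On the head node this would be built with the BG/P cross-compiler
     * wrappers (something like mpixlc here -- check your site's docs),
     * not the native compiler, since the binary has to run on the PPC450
     * compute nodes rather than on the POWER6 head node.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        printf("Hello from rank %d of %d\n", rank, nprocs);

        MPI_Finalize();
        return 0;
    }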

As in a conventional cluster, the bulk of the work is done by individual compute "nodes," and each node has its own set of CPU cores, its own memory, and its own system image.  Beyond that basic description, though, these "nodes" are really nothing like the nodes of a conventional Beowulf cluster.

Physically, each node is little more than a six-inch-long card that plugs into a "node board."  Each node board kind of resembles what I would consider a "node" in a conventional cluster--a pizza box that slides into the rack.  Here is what a node board looks like (from Wikipedia):

[Photo of a BG/P node board, from Wikipedia]

Each copper heat sink is a single "node," which means a single node board has 16+16=32 nodes on it.  On BG/P, each node has four cores, meaning each node board has 128 cores.  Each BG/P rack takes 32 node boards, providing a pretty beefy 4096 cores per rack.  Although I'd imagine 72U racks loaded with those 16-core Opteron Interlagos processors can approach this density, such a machine would probably run into major memory contention issues due to the extensive sharing of memory bandwidth and caches.

Each BG/P node is based around a PowerPC 450 SoC.  Specifically, the nodes all have
  • one 32-bit PowerPC 450 CPU with four cores
  • 32KB instruction cache + 32KB data cache per core (cache coherent via snoop filtering)
  • a 16-line (128 bytes per line) L2 data prefetch unit per core
  • 8 MB of shared L3 cache (eDRAM, in two 4 MB banks)
  • 2 GB of DDR2 featuring uniform memory access (13.6 GB/s)
  • and that's about it.  No scratch space, nowhere for a page file to go, nothing.
The PPC450 processors were based on a design meant for embedded applications, which means they run slowly (a fixed 850 MHz--no power saving, C-states, stepping, etc.) but are very energy-efficient.  The BG/P processors in particular feature what's called a "double hummer" FPU, a two-way SIMD unit that operates on pairs of 64-bit floats and can do two fused multiply-add (MAF) operations in real*8 at once.  This unit is one of the core components of BG/P's performance capabilities, but it is also a bit finicky--it requires data aligned on 16-byte boundaries, and it really needs fused multiply-add ops.  IBM says that their compilers will handle the fusing of multiply-add ops during optimization, but the data alignment is supposed to be a big deal that needs to be considered when designing programs.  I was unable to make much use of this double FPU with any of the applications I've run.
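
To make the alignment and multiply-add business concrete, here's a sketch of the kind of loop the double FPU wants to see.  I'm assuming IBM XL C here; the aligned attribute and the __alignx hint are how I understand you tell that compiler about 16-byte alignment, so take this as a sketch of the idea rather than tuned code.

    /* A daxpy-like loop written to give the compiler a shot at the double
     * FPU: 16-byte-aligned real*8 data and a multiply-add body that the
     * optimizer can fuse into FMA instructions.
     */
    #include <stdio.h>

    #define N 1024

    /* static arrays aligned on 16-byte boundaries (GCC-style attribute,
     * which XL C also accepts, as far as I know) */
    static double x[N] __attribute__((aligned(16)));
    static double y[N] __attribute__((aligned(16)));

    void daxpy(double a)
    {
    #ifdef __IBMC__
        __alignx(16, x);   /* hint to XL C that the data is 16-byte aligned */
        __alignx(16, y);
    #endif
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];   /* multiply-add: a candidate for fusing */
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }
        daxpy(2.0);
        printf("y[10] = %f\n", y[10]);
        return 0;
    }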

In the above photo of the node board, you can see that there are single-heatsink node cards all the way at the bottom.  These are I/O nodes, and they handle most of the system calls generated by user code, since the compute nodes' lightweight kernel can't service those calls itself and forwards them instead.  BG/P systems have a configurable I/O-to-compute-node ratio; lower ratios are better suited for I/O-intensive applications but carry significantly higher price tags.  The pictured node board is configured with the top-of-the-line 1:16 ratio; the BG/P that we have is 1:64, one step above the cheapest 1:128 configuration.  Since a single I/O node handles the system calls for 16 to 128 compute nodes, multiple jobs can end up sharing one I/O node and, as you'd expect, this can become a bottleneck if the system is running a lot of tiny jobs.
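
One practical consequence (my takeaway, not an official IBM recommendation) is that it pays to be gentle with the I/O nodes: rather than having every rank open its own file, have all ranks write into one shared file through MPI-IO so that the traffic funneling through the shared I/O node stays manageable.  A minimal sketch using only standard MPI-IO calls--the filename and buffer size are placeholders I made up for illustration:

    /* Sketch: every rank writes its block of data into one shared file
     * via MPI-IO instead of opening one file per rank, which keeps the
     * load on the shared I/O nodes down.  "output.dat" and COUNT are
     * placeholders for illustration only.
     */
    #include <mpi.h>

    #define COUNT 1024

    int main(int argc, char **argv)
    {
        int rank;
        double buf[COUNT];
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < COUNT; i++)
            buf[i] = rank;                  /* dummy data */

        /* each rank writes to a disjoint region of the same file */
        offset = (MPI_Offset)rank * COUNT * sizeof(double);

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }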

The interconnect is the final major unique component of BG/P--it uses a proprietary, low-latency 3D torus topology, in which every node is directly linked to its six nearest neighbors, rather than a switched network.  This topology gives non-uniform latencies, ranging from roughly 0.5 microseconds (one hop) to 5 microseconds (farthest node) in theory, or about 2 to 10 microseconds through MPI.  Each node therefore has six bidirectional links on this 3D torus (twelve one-way connections), each carrying 3.4 Gb/s, and all of these links do DMA.  The I/O nodes serve as the gateway to the outside world; each I/O node has a 10 Gb/s Ethernet link for this purpose.
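
Since the torus makes nearest-neighbor communication cheap, the natural way to write for it is with MPI's Cartesian topology routines, which (as I understand it) give the library a chance to map ranks sensibly onto the physical torus.  Here's a minimal sketch using only standard MPI calls--nothing BG/P-specific--that arranges the ranks into a periodic 3D grid and finds each rank's neighbors along one axis:

    /* Sketch: map MPI ranks onto a periodic 3D grid and look up nearest
     * neighbors, the natural communication pattern for a 3D torus.
     * Standard MPI only; the grid dimensions come from MPI_Dims_create.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int nprocs, rank;
        int dims[3] = {0, 0, 0};       /* let MPI pick a factorization */
        int periods[3] = {1, 1, 1};    /* periodic in all three dimensions */
        int coords[3], left, right;
        MPI_Comm torus;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_Dims_create(nprocs, 3, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 3, coords);
        MPI_Cart_shift(torus, 0, 1, &left, &right);   /* neighbors along x */

        printf("rank %d at (%d,%d,%d): x-neighbors %d and %d\n",
               rank, coords[0], coords[1], coords[2], left, right);

        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }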

What dazzled me is that, in addition to this 3D torus for point-to-point message passing, there are two separate, dedicated networks: a tree network used exclusively for reductions and collective calls, with 850 MB/s of bandwidth, and another network dedicated solely to barriers.  What's more, all of this networking is rolled up into the BG/P MPI implementation (a derivative of Argonne's MPICH2), so that proper use of the tree versus the torus is generally transparent to the developer.
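
In practice, "transparent" just means you call the ordinary MPI collectives and the library decides which network carries them.  Something as plain as this does the job (standard MPI; the comments about which network gets used reflect my understanding, not anything the code controls):

    /* Sketch: an ordinary MPI reduction and barrier.  On BG/P these are,
     * as I understand it, routed over the dedicated collective and barrier
     * networks by the MPI library, with no change to the code.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local = (double)rank;          /* each rank contributes one value */

        /* global sum over all ranks -- a candidate for the tree network */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* synchronization -- a candidate for the barrier network */
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks = %f\n", global);

        MPI_Finalize();
        return 0;
    }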

Software

I had intended to write up my notes on the software end of BG/P as well, but it's getting late, so that will have to wait for another day.  Suffice it to say that I found the software side of BG/P just as fascinating.  Compute nodes run CNK, a lightweight "Linux-like" kernel, the I/O nodes run real Linux, and none of that is directly accessible to the end user.  Individual nodes can be run in different modes ("virtual node," full SMP, or a hybrid dual mode), the MPI implementation has a bunch of neat extensions, and you can even play with DCMF, the lightweight message-passing API that underlies MPI on Blue Gene.
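
As a teaser, the full-SMP mode is where a hybrid MPI+OpenMP code fits naturally: one MPI rank per node with an OpenMP thread on each of the four cores.  A minimal sketch of that structure, using nothing BG/P-specific:

    /* Sketch: hybrid MPI + OpenMP structure of the sort you'd run in SMP
     * mode -- one MPI rank per node, one OpenMP thread per core.
     */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* only the master thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* with 4 threads per rank this would cover all four PPC450 cores */
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }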