How Did a Personal Computer Get Into the Top Ranks of the World's Supercomputers?
The TOP500 List:
In June and November each year, a list of the world's top supercomputers, called the "TOP500", is published jointly by the University of Tennessee and the University of Mannheim (Germany). The Virginia Tech G5 cluster is number three on the list issued on November 15, 2003, a great feather in Apple's cap. Fine. But what is even more astonishing is the price of the VT cluster compared to clusters of similar power: a very low $5M, while competitors cost from $10 million to $100 million. How is this possible? What is it that allows the G5 to compete on performance and price so successfully against the Xeon, Power3, Itanium, Sparc, and above all Opteron?
Why Apple's G5 Dual does well in clusters:
(1.) Virginia Tech's 1,100-box cluster of G5s belongs to a class of supercomputers called "Beowulf clusters": standard off-the-shelf computer boxes pieced together by means of fast interconnects, with the overall system controlled by open-source programs. Xeons and Opterons are also clustered this way, and they started the astonishing movement to extremely low cost for supercomputers. But Apple's G5 carries this process further by being a mass-produced desktop computer, not a high-priced server machine, and so it is cheaper per unit of performance. (Cheap boxes: Economies of scale)
(2.) As with other Beowulf clusters, the use of off-the-shelf hardware and existing cluster-controller software enabled Virginia Tech to build their machine and make it operational in just three months (preceded by three months of planning). Old-style supercomputers often take as long as three years to plan, construct, debug, and test. In those three years, though, Moore's Law has done its work, so that the three-year-old design suffers from mediocre performance per dollar compared to new desktop machines. Old-style, tediously constructed supercomputers are thus overpriced by the time they are ready to do work. Constructing a cluster in a short time results in a very low cost for a given performance level. (This year's technology, not that of three years ago: Reverse tortoise and hare effect)
(3.) Apple's Dual-G5 computer has an additional advantage from the fact that it uses the IBM-970 microprocessor, a derivative of IBM's extremely powerful Power4 processor. The 970 allows the computer to excel in the number of operations it can perform per clock cycle (IPC).
Now, the TOP500 list's "R-Peak" and "R-Max" values reflect this IPC in some way. "R-Peak" is the calculated upper limit on TFlops (TeraFlops, or trillion floating-point operations per second) that the cluster can possibly attain, based on the internal structure of its individual processors. "R-Max", on the other hand, is what the cluster itself can measurably attain when it operates with optimal data. R-Max is always less than R-Peak, being somewhere between 50% and 80% of that upper limit. Think of R-Peak as the cluster's theoretical performance and R-Max as its actual performance. The ratio of R-Max to R-Peak is its "efficiency". The TOP500 supercomputers are ranked according to their R-Max, or actual performance.
Now, how do we calculate the R-Peak of a cluster?
R-Peak = number of float-units X each unit's float ops per clock cycle X clock frequency X number of processors per box X number of boxes
The numbers for the Virginia Tech cluster are these:
R-Peak (VT cluster) = 2 fp-units X 2 flops/hertz/unit X 2 GHz X 2 CPUs/box X 1,100 boxes = 17.6 TeraFlops
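The R-Peak arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `r_peak_gflops` and its parameter names are made up for illustration, and the five inputs are simply the figures quoted for the Virginia Tech cluster.

```python
def r_peak_gflops(fp_units, flops_per_cycle_per_unit, clock_ghz,
                  cpus_per_box, boxes):
    """Theoretical peak in GFlops: the product of the five R-Peak factors.

    clock_ghz is in billions of cycles per second, so the result is in
    billions of floating-point operations per second (GFlops).
    """
    return (fp_units * flops_per_cycle_per_unit * clock_ghz
            * cpus_per_box * boxes)


# Figures quoted in the article for the Virginia Tech cluster.
vt_peak = r_peak_gflops(fp_units=2, flops_per_cycle_per_unit=2,
                        clock_ghz=2.0, cpus_per_box=2, boxes=1100)
print(f"{vt_peak / 1000:.1f} TFlops")  # prints "17.6 TFlops"
```

Changing any single factor scales the result linearly, which is why doubling the flops counted per cycle doubles a cluster's R-Peak.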
Regarding the number of flops/hertz/unit, or ops/cycle: Each of the 970's two float units executes, as is normal, just one float operation per clock cycle. But the Power4/970 instruction set (and the Itanium's also) contains a special non-RISC instruction, "Multiply with Add", which, like the rest, is performed in just one cycle, yet which counts as TWO flops in the Linpack scoring. Now, since the Linpack benchmark's test code consists predominantly of a continuous series of multiplies, each followed by an add (this being simultaneous-equation solving), the Power4 and derivatives (and the Itaniums) need only half the number of clock cycles that other chips require to get the same amount of Linpack work done. Hence double speed on Linpack. On the other hand, since "Multiply with Add" seldom appears in the famous SPEC2000 test, the IBM-970 has no such advantage on that benchmark.
How much of this R-Peak advantage of the 970 is translated into actual performance excellence (its efficiency) is determined by numerous factors: the quality of the interconnect technology and software, plus the limitations or advantages of the particular structures inside the CPU. The Big Mac cluster currently has an efficiency of 62%, after starting a month ago at 40%. A bit more tweaking of the operational parameters by the Virginia Tech team may raise this percentage a little more, and thus lift the system's R-Max rating above its current value of 10.3 TeraFlops. (Advantageous floating-point IPC on Linpack)
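The effect of "Multiply with Add" on the flop count can be illustrated with a toy model. It assumes each float unit issues exactly one instruction per cycle and ignores pipelining, latency, and memory stalls; the function names `cycles_needed` and `flops_per_cycle` are invented for this sketch and do not come from any benchmark code.

```python
def cycles_needed(pairs, has_multiply_with_add):
    """Cycles one float unit needs for `pairs` multiply-then-add steps.

    Toy assumption: one instruction issues per cycle. With a fused
    multiply-add instruction each pair is one instruction; without it,
    the multiply and the add are two separate instructions.
    """
    instructions_per_pair = 1 if has_multiply_with_add else 2
    return pairs * instructions_per_pair


def flops_per_cycle(pairs, has_multiply_with_add):
    """Linpack-style scoring: every multiply and every add counts as one flop."""
    flops = 2 * pairs
    return flops / cycles_needed(pairs, has_multiply_with_add)


pairs = 1_000_000
print(flops_per_cycle(pairs, True))   # 2.0 flops per cycle per unit
print(flops_per_cycle(pairs, False))  # 1.0 flop per cycle per unit
```

Under these assumptions the chip with the fused instruction finishes the same multiply-add work in half the cycles, which is where the factor of 2 flops/hertz/unit in the R-Peak figure comes from.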
So put all these factors together, along with the enthusiasm and experience of a team of professors and graduate students who had built two previous smaller clusters, and the $5M price is explained.
16 Nov 2003

