ars technica, Posted January 30, 2004 22:00 by davechen.

Srinidhi Varadarajan spoke at the Johns Hopkins Applied Physics Lab about System X, the 1100 node PowerMac G5 cluster. He's an apt speaker, and the talk seemed very well attended; there were probably 200 people there. Below are my notes from the talk. Here are some slides pretty similar to the ones he used today, but made before the machine was built.

Friday 2 pm, Jan 30, 2004
Hopkins Applied Physics Lab, Columbia, MD

Dr Srinidhi Varadarajan

System X (Ten)
Building Va Tech's Supercomputer

I.      Motivation & Goals

    1990's: built 16 node cluster, grew to 200 nodes
    By 2000-1 system was way oversubscribed
    Build computational science engineering (CSE) program.  
    Build world class facility to complement CSE

    National Lambda Rail.  National optical network.  10-14 GBs

    Computing facility in academia, not just affiliated with academia.
    Support both experimental and production research.  Production 
    supercomputing facilities don't support research.  If you asked to
    run a research OS that might crash the system, they'd laugh at you.

    Second system 2005-6.  They expect a long line of machines.


II.     Hardware

    Price/performance.  Goal of 10 teraflops sustained.  
    Total cost 5.2 million.  Budget was set long before system designed.
        - Facilities upgrade 2 mil
            - 1 mill facilities
            - 1 mill UPS and backup power generators
        - Cheapest world class supercomputer

                  ppc                 opteron       itanium         ultrasparc
    IBM           jan 2004 turnkey
    HP                                              9-10 million
    Apple         early sept
    white box                         mid-august
    Dell                                            mid-august
    Sun                                                             ???

    All other options cost ~10 million.
    Picked Apple because of the PowerPC 970.
    Opteron doesn't do fused multiply-add

    1100 dual G5's
        - 8 GFlop/proc double precision
            - fastest on any CPU
        - 4 GB RAM, 160 GB disk
            - 176 TB disk total
        - 4 head nodes for compilation/job startup
        - 1 management node
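
As a rough sanity check on the figures above (my arithmetic, not something shown in the talk), the per-processor and total numbers can be reproduced in a few lines of Python. The 2.0 GHz clock and the two floating-point units per CPU are assumptions that do not appear in these notes; a fused multiply-add is counted as two floating-point operations.

```python
# Back-of-envelope check of the hardware figures in the notes.
NODES = 1100              # from the notes
CPUS_PER_NODE = 2         # "dual G5's"
DISK_PER_NODE_GB = 160    # from the notes
RAM_PER_NODE_GB = 4       # from the notes

# Assumptions NOT stated in the notes:
CLOCK_GHZ = 2.0           # assumed clock speed of each CPU
FPUS_PER_CPU = 2          # assumed floating-point units per CPU
FLOPS_PER_FMA = 2         # one fused multiply-add = 1 multiply + 1 add


def peak_gflops_per_cpu():
    """Theoretical double-precision peak of one CPU, in GFlops."""
    return CLOCK_GHZ * FPUS_PER_CPU * FLOPS_PER_FMA


def cluster_peak_tflops():
    """Theoretical peak of the whole cluster, in TFlops."""
    return NODES * CPUS_PER_NODE * peak_gflops_per_cpu() / 1000.0


def total_disk_tb():
    """Aggregate local disk, in TB (1 TB taken as 1000 GB)."""
    return NODES * DISK_PER_NODE_GB / 1000.0


def total_ram_tb():
    """Aggregate RAM, in TB (1 TB taken as 1000 GB)."""
    return NODES * RAM_PER_NODE_GB / 1000.0


if __name__ == "__main__":
    print("peak per CPU (GFlops):", peak_gflops_per_cpu())
    print("cluster peak (TFlops):", cluster_peak_tflops())
    print("total disk (TB):      ", total_disk_tb())
    print("total RAM (TB):       ", total_ram_tb())
```

Under those assumptions the per-CPU peak matches the 8 GFlop figure and the disk total matches the 176 TB figure. It also shows why fused multiply-add matters: without it, the same clock and FPU count would give half the peak.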

    Upgrade in the next month to Xserves.  3000 sq ft -> 1000 sq ft.  
        2/3 power.  Allows for expansion.  It definitely sounds like
        they want to expand.

    Primary Communications Architecture
        Infiniband by Mellanox
            Switched network.  Each node 20 Gbps full duplex bandwidth
                into network
            24 96-port switches.  Fat tree topology
            8 us latency
            Each network card (HCA) 2 ports.  They only use one port
            4 times more bw than #2 machine

    Secondary Communication
        Gigabit ethernet
            NFS, control, job startup

III.    Facilities

    Data Center - 1980's machine room
        9000 sq ft total
        Scrunched other machines to get space
        3000 reserved for research
        2.5 miles from main campus

        3 MegaWatts of power.
            1.5 MW reserved for TCF.  Over-speced by factor of 3.
        2+ million BTU's of cooling
            Liebert's extreme density cooling.  They custom built the racks too:
                non-standard height/width/depth.
            Traditional AC systems -> 60 mph winds under the raised floor.
            Hot aisle / cool aisle design, like other clusters.
            Empty air 3 times/minute.  Typical center does it 3 times/hour.
            2 water chillers. 125 tons? 45-60 degrees
            5 weeks of physical plant work
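
The power and cooling figures above are in mixed units, so here is a small conversion sketch (again my own arithmetic, not from the talk). It assumes that the "2+ million BTU's" means BTU per hour, that "125 tons" is per chiller, and that "over-speced by factor of 3" means the expected load is one third of the reserved 1.5 MW. The conversion factors themselves are standard: 1 W is about 3.412 BTU/hr, and 1 ton of refrigeration is 12,000 BTU/hr.

```python
# Unit conversions for the power and cooling figures in the notes.
BTU_PER_HR_PER_WATT = 3.412    # 1 W of heat is about 3.412 BTU/hr
BTU_PER_HR_PER_TON = 12_000    # 1 ton of refrigeration = 12,000 BTU/hr

RESERVED_POWER_W = 1.5e6       # "1.5 MW reserved" (from the notes)
OVERSPEC_FACTOR = 3            # "over-speced by factor of 3" (from the notes)
CHILLERS = 2                   # from the notes
TONS_PER_CHILLER = 125         # "125 tons?" (uncertain; assumed per chiller)


def heat_load_btu_per_hr(power_w):
    """Heat to remove, assuming all electrical power ends up as heat."""
    return power_w * BTU_PER_HR_PER_WATT


def chiller_capacity_btu_per_hr(chillers, tons_each):
    """Combined chiller capacity in BTU/hr."""
    return chillers * tons_each * BTU_PER_HR_PER_TON


def air_change_ratio(changes_per_minute, changes_per_hour):
    """How many times more often the room air is replaced."""
    return (changes_per_minute * 60) / changes_per_hour


if __name__ == "__main__":
    load_w = RESERVED_POWER_W / OVERSPEC_FACTOR   # assumed actual load
    print("estimated load (W):        ", load_w)
    print("heat load (BTU/hr):        ", heat_load_btu_per_hr(load_w))
    print("chiller capacity (BTU/hr): ",
          chiller_capacity_btu_per_hr(CHILLERS, TONS_PER_CHILLER))
    print("air changes vs typical:    ", air_change_ratio(3, 3))
```

With these assumptions the estimated heat load comes out a bit under 2 million BTU/hr and the chillers at 3 million BTU/hr, which is at least consistent with the "2+ million" figure. Emptying the air 3 times a minute instead of 3 times an hour is a factor of 60.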

            On Oct 22 or 23 lost 15 machines because of sun spots.

        160 student volunteers.  4 hr shifts.  3 shifts on 3 different days
            210 nodes/hr.  Finished one shift early.  Had 200 volunteers
            but didn't need last shift.
        Machine setup:
            Boot machine.  Put card in.  Boot machine.  Put in rack.  Boot again.
            Not one machine died.  One was DOA from Apple.
        Paid in pizza/coke.

IV.     Software

    OS X, 10.2.7
    Mellanox developed Infiniband drivers and network monitoring/management tools
    MPI
        - MVAPICH from Panda at OSU.  Provides optimized Infiniband support
            for MPICH

    C, C++
        IBM's xlc and gcc 3.3
    Fortran
        IBM's xlf

    Performance Enhancements
        New cache-optimized memory manager.
            Written as kext

        Ported MVAPICH to OSX

        Scalable job startup system for MVAPICH

    Reliability
        Difference between 6 sigma reliability and none is not much.
            One failure per month versus one per day.  Either way you
            have to handle failure.
        Deja vu - fault tolerance system
            NSF funded project
            Working on patent application
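
To put numbers on the "either way you have to handle failure" point, here is a minimal sketch of my own (not from the talk). It assumes failures arrive independently at a constant rate (a Poisson process) and that any single failure kills a job that spans the whole machine. Neither assumption was stated by the speaker.

```python
import math

NODES = 1100  # from the notes


def p_job_survives(failures_per_day, job_hours):
    """Probability of zero failures during the job (Poisson assumption)."""
    expected_failures = failures_per_day * job_hours / 24.0
    return math.exp(-expected_failures)


def node_mtbf_days(failures_per_day, nodes=NODES):
    """Implied mean time between failures of a single node, in days."""
    return nodes / failures_per_day


if __name__ == "__main__":
    rates = [("one per month", 1.0 / 30.0), ("one per day", 1.0)]
    for label, rate in rates:
        for hours in (24, 168):
            p = p_job_survives(rate, hours)
            print(f"{label:14s} {hours:4d} h job: P(no failure) = {p:.3f}")
    print("per-node MTBF at one failure/day:", node_mtbf_days(1.0), "days")
```

Under this model a day-long full-machine job survives only about 37% of the time at one failure per day. Even at one failure per month, a week-long job still has roughly a one in five chance of being interrupted, so some form of checkpointing or fault tolerance is needed in both cases.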

V.      Performance Results
    Infiniband driver 800 MBps with MPI performance at 700 MBps
        - MPI latency 8-14 us
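
The latency and bandwidth numbers can be combined with the usual "latency plus size over bandwidth" cost model to see which message sizes are latency-bound. This is a generic textbook model that I am applying to the quoted figures, not something presented in the talk, and it assumes MBps means 10^6 bytes per second.

```python
MPI_BANDWIDTH_BPS = 700e6   # "MPI performance at 700 MBps" (from the notes)
LATENCY_LOW_S = 8e-6        # "MPI latency 8-14 us" (from the notes)
LATENCY_HIGH_S = 14e-6


def transfer_time_s(n_bytes, latency_s, bandwidth_bps=MPI_BANDWIDTH_BPS):
    """Modelled time to send one message of n_bytes."""
    return latency_s + n_bytes / bandwidth_bps


def effective_bandwidth_bps(n_bytes, latency_s, bandwidth_bps=MPI_BANDWIDTH_BPS):
    """Bytes per second actually achieved for a message of n_bytes."""
    return n_bytes / transfer_time_s(n_bytes, latency_s, bandwidth_bps)


def half_bandwidth_size_bytes(latency_s, bandwidth_bps=MPI_BANDWIDTH_BPS):
    """Message size at which half of the peak bandwidth is reached."""
    return latency_s * bandwidth_bps


if __name__ == "__main__":
    for lat in (LATENCY_LOW_S, LATENCY_HIGH_S):
        size = half_bandwidth_size_bytes(lat)
        print(f"latency {lat * 1e6:.0f} us: half bandwidth at ~{size:.0f} bytes")
    for n in (64, 1024, 65536, 1 << 20):
        bw = effective_bandwidth_bps(n, LATENCY_LOW_S)
        print(f"{n:8d} bytes -> {bw / 1e6:7.1f} MB/s")
```

In this model, messages smaller than a few kilobytes are dominated by latency, so a code that exchanges many small messages sees far less than 700 MBps.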

    BLAS
        DGEMM optimized by Kazushige Goto.  84.1% efficiency
            He works in Japanese patent? office
            For single precision he got in the 90's
            On itanium he got 99%
    System of equations with 520k variables

    Complete BLAS (lvl 1,2,3) available in a few weeks.

    Powered up Sept 21, benchmark date Oct. 1?
        Rmax = 10.28 teraflops
        Nmax = 520K
        N1/2 = 152K (5.178 TF).  More interesting number; shows communication
            performance.  N1/2 is the size of matrix to get half of Rmax.
            Probably could have gotten similar Rmax with just Gigabit
            ethernet, but not the N1/2.
        Got 10.9 later
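
A few derived numbers help put the benchmark results in context. The sketch below is my own arithmetic, not from the talk. It assumes 8-byte double-precision matrix entries, uses only the leading-order operation count for solving a dense system (2/3 N^3), and takes the theoretical peak as 1100 nodes x 2 CPUs x 8 GFlops from the hardware section.

```python
N_MAX = 520_000          # from the notes
RMAX_TFLOPS = 10.28      # from the notes
N_HALF = 152_000         # half-performance problem size (from the notes)
R_HALF_TFLOPS = 5.178    # from the notes

NODES = 1100             # from the notes
CPUS_PER_NODE = 2        # from the notes
PEAK_GFLOPS_PER_CPU = 8  # from the notes
RAM_PER_NODE_GB = 4      # from the notes


def matrix_bytes(n):
    """Storage for an n x n matrix of 8-byte doubles."""
    return 8 * n * n


def solve_flops(n):
    """Leading-order operation count for a dense solve: (2/3) n^3."""
    return (2.0 / 3.0) * n ** 3


def run_time_s(n, tflops):
    """Approximate run time at a sustained rate given in TFlops."""
    return solve_flops(n) / (tflops * 1e12)


def peak_tflops():
    """Theoretical peak of the cluster in TFlops."""
    return NODES * CPUS_PER_NODE * PEAK_GFLOPS_PER_CPU / 1000.0


if __name__ == "__main__":
    total_ram_bytes = NODES * RAM_PER_NODE_GB * 1e9
    print("matrix size (TB):   ", matrix_bytes(N_MAX) / 1e12)
    print("fraction of RAM:    ", matrix_bytes(N_MAX) / total_ram_bytes)
    print("run time Nmax (h):  ", run_time_s(N_MAX, RMAX_TFLOPS) / 3600)
    print("run time half (min):", run_time_s(N_HALF, R_HALF_TFLOPS) / 60)
    print("Rmax / peak:        ", RMAX_TFLOPS / peak_tflops())
```

On these assumptions the 520K matrix takes roughly 2.2 TB, about half of the aggregate RAM, the full-size run takes around two and a half hours, and 10.28 TF is a bit under 60% of theoretical peak. The 152K run finishes in under ten minutes, so communication has much less computation to hide behind, which fits the comment that this number says more about the network.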

VI.     CSE Research

        Groups at vatech
            nanoscale electronics
            quantum chemistry
            computational chemistry/biochemistry
            aerodynamics through multidisciplinary design optimization
            cell cycle modeling
            molecular statics
            computational acoustics
            computational fluid dynamics
            computational electromagnetics
            optimal design and control
            wireless systems modeling
            microarray experiment management
            large-scale network emulation <- Dr. V's work

        Experimental systems
            fault tolerance and migration
            queuing systems, schedulers
            distributed OS/ DSM
            parallel file systems
            middleware for computational grids
            authentication / security systems

Questions
    How many failures?
        1-2 failures per day.  expected 2-3
    Not in full production yet.
    ECC?
        Under NDA's.  Very careful about answer.
        Xserves were a planned upgrade.  On schedule.
    Monitoring
    Wrote lots of code/scripts to administer cluster 

He talked more about the facilities stuff (power, cooling) than I noted. I didn't really know that stuff, so I didn't take that much down. It definitely sounded pretty complicated. They did CFD simulations of the heat dissipation in designing the cooling.

He didn't talk much at all about Deja Vu. Probably because of their patent application.

Re: ECC, all he would say was that the Xserves were a planned upgrade. My guess is that they knew about it pretty much from the start.

Overall it's quite an impressive feat that they've accomplished. To build the #3 supercomputer for such a low cost in such a short time is really something.