Friday 2 pm, Jan 30, 2004
Hopkins Applied Physics Lab, Columbia, MD
Dr Srinidhi Varadarajan
System X (Ten)
Building Va Tech's Supercomputer
I. Motivation & Goals
1990s: built a 16-node cluster, which grew to 200 nodes
By 2000-1 the system was badly oversubscribed
Build a computational science and engineering (CSE) program.
Build world class facility to complement CSE
National Lambda Rail. National optical network. 10-14 GBs
Computing facility in academia, not just affiliated with academia.
Support both experimental and production research. Production
supercomputing facilities don't support research. If you asked to
run a research OS that might crash the system, they'd laugh at you.
Second system 2005-6. They expect a long line of machines.
II. Hardware
Price/performance. Goal of 10 teraflops sustained.
Total cost 5.2 million. Budget was set long before system designed.
- Facilities upgrade 2 million: 1 million facilities, 1 million UPS and backup power generators
- Cheapest world class supercomputer
PPC, Opteron, Itanium, UltraSPARC
IBM: Jan 2004, turnkey
HP: 9-10 million
Apple: early Sept
White box: mid-August
Dell: mid-August
Sun: ???
All other options cost ~10 million.
Picked Apple because of the PowerPC 970 (G5).
Opteron doesn't do fused multiply-add
1100 dual-processor G5s
- 8 GFlop/proc double precision
- fastest on any CPU
- 4 GB RAM, 160 GB disk per node
- 176 TB disk total
- 4 head nodes for compilation/job startup
- 1 management node
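A quick sanity check of the node counts above, as a rough sketch. It assumes the 1100 figure counts compute nodes only, that the RAM and disk figures are per node, and decimal units (1 TB = 1000 GB). The variable names are mine, not from the talk.

```python
# Figures taken from the notes above; names are illustrative only.
NODES = 1100            # dual-processor G5 nodes
PROCS_PER_NODE = 2
GFLOPS_PER_PROC = 8.0   # double precision, per processor
RAM_GB_PER_NODE = 4     # assumed to be per node
DISK_GB_PER_NODE = 160  # assumed to be per node

# Theoretical peak, assuming every processor sustains its 8 GFlops.
peak_tflops = NODES * PROCS_PER_NODE * GFLOPS_PER_PROC / 1000.0

# Aggregate memory and disk, in decimal terabytes.
total_ram_tb = NODES * RAM_GB_PER_NODE / 1000.0
total_disk_tb = NODES * DISK_GB_PER_NODE / 1000.0

print(f"Peak:  {peak_tflops:.1f} TFlops")
print(f"RAM:   {total_ram_tb:.1f} TB")
print(f"Disk:  {total_disk_tb:.0f} TB")
```

The disk total comes out to the 176 TB quoted in the talk, which suggests the 160 GB figure is indeed per node.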
Upgrade in the next month to Xserves. 3000 sq ft -> 1000 sq ft.
2/3 power. Allows for expansion. It definitely sounds like
they want to expand.
Primary Communications Architecture
Infiniband by Mellanox
Switched network. Each node 20 Gbps full duplex bandwidth
into network
24 96-port switches. Fat tree topology
8 us latency
Each network card (HCA) 2 ports. They only use one port
4 times more bandwidth than the #2 machine
Secondary Communication
Gigabit ethernet
NFS, control, job startup
III. Facilities
Data Center - 1980's machine room
9000 sq ft total
Scrunched other machines to get space
3000 reserved for research
2.5 miles from main campus
3 MegaWatts of power.
1.5 MW reserved for TCF (Terascale Computing Facility). Over-specced by a factor of 3.
2+ million BTU's of cooling
Liebert's extreme density cooling. They custom built the racks too:
non-standard height/width/depth.
Traditional AC systems -> 60 mph winds under the raised floor.
Hot aisle / cool aisle design, like other clusters.
Exchange the air 3 times/minute. A typical center does it 3 times/hour.
2 water chillers. 125 tons? 45-60 degrees
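To put the power and cooling figures in the same units, here is a small conversion sketch. The conversion factors are standard (1 W is about 3.412 BTU/hr, 1 ton of refrigeration is 12,000 BTU/hr). Reading "over-specced by a factor of 3" as "actual draw is about a third of the 1.5 MW" is my assumption, and the 125 ton figure was already uncertain in my notes.

```python
BTU_PER_HR_PER_WATT = 3.412    # 1 W is about 3.412 BTU/hr
BTU_PER_HR_PER_TON = 12_000    # 1 ton of refrigeration = 12,000 BTU/hr


def watts_to_btu_per_hr(watts):
    """Heat produced by an electrical load, in BTU/hr."""
    return watts * BTU_PER_HR_PER_WATT


def tons_to_btu_per_hr(tons):
    """Cooling capacity of a chiller rated in tons, in BTU/hr."""
    return tons * BTU_PER_HR_PER_TON


reserved_btu = watts_to_btu_per_hr(1.5e6)   # full 1.5 MW reservation
assumed_draw_btu = reserved_btu / 3         # assumption: actual load is 1/3
chiller_btu = tons_to_btu_per_hr(2 * 125)   # two chillers, 125 tons each (?)

print(f"1.5 MW reserved:     {reserved_btu:,.0f} BTU/hr")
print(f"Assumed actual load: {assumed_draw_btu:,.0f} BTU/hr")
print(f"2 x 125 ton chillers: {chiller_btu:,.0f} BTU/hr")
```

Under those assumptions the load and the chiller capacity both land in the same range as the "2+ million BTU" figure, so the numbers in my notes at least hang together.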
5 weeks of physical plant work
On Oct 22 or 23 lost 15 machines because of sunspots.
160 student volunteers. 4 hr shifts. 3 shifts on 3 different days
210 nodes/hr. Finished one shift early. Had 200 volunteers
but didn't need last shift.
Machine setup:
Boot machine. Put card in. Boot machine. Put in rack. Boot again.
Not one machine died. One was DOA from Apple.
Paid in pizza/coke.
IV. Software
OS X, 10.2.7
Mellanox developed Infiniband drivers and network monitoring/management tools
MPI
- MVAPICH from Panda's group at OSU. Provides optimized Infiniband support
for MPICH
C, C++
IBM's xlc and gcc 3.3
Fortran
IBM's xlf
Performance Enhancements
New cache-optimized memory manager.
Written as kext
Ported MVAPICH to OSX
Scalable job startup system for MVAPICH
Reliability
Difference between 6 sigma reliability and none is not much.
One failure per month versus one per day. Either way you
have to handle failure.
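The point about failure rates can be illustrated with a small model. This is only a sketch: it assumes failures are independent and exponentially distributed, that a job uses the whole machine, and that a month is 30 days. None of this came from the talk; the function names are mine.

```python
import math

NODES = 1100  # node count from the hardware section


def node_mtbf_days(cluster_failures_per_day, nodes=NODES):
    """Per-node mean time between failures implied by a cluster-wide rate."""
    return nodes / cluster_failures_per_day


def p_no_failure(job_hours, cluster_failures_per_day):
    """Chance a full-machine job finishes with no node failure
    (assumes independent, exponentially distributed failures)."""
    rate_per_hour = cluster_failures_per_day / 24.0
    return math.exp(-rate_per_hour * job_hours)


for label, rate in [("one failure/day", 1.0), ("one failure/month", 1.0 / 30.0)]:
    years = node_mtbf_days(rate) / 365.0
    week = p_no_failure(7 * 24, rate)
    print(f"{label}: node MTBF ~{years:.0f} years, "
          f"P(week-long job survives) = {week:.3f}")
```

Even at one failure per month, a long full-machine run has a real chance of being hit, which is the argument for handling failure in software either way.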
Deja vu - fault tolerance system
NSF funded project
Working on patent application
V. Performance Results
Infiniband driver: 800 MBps. MPI on top of it: 700 MBps
- MPI latency 8-14 us
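To see how the latency and bandwidth figures interact, here is the usual first-order model: time = latency + size / bandwidth. This is a generic textbook model, not something from the talk, and it assumes decimal megabytes (so 700 MBps is 700 bytes per microsecond). The function names are hypothetical.

```python
def transfer_time_us(nbytes, latency_us=8.0, bandwidth_mbps=700.0):
    """Estimated time to send one message, in microseconds.
    With decimal megabytes, 1 MB/s equals 1 byte/us."""
    return latency_us + nbytes / bandwidth_mbps


def effective_bandwidth_mbps(nbytes, latency_us=8.0, bandwidth_mbps=700.0):
    """Achieved bandwidth (MB/s) for a message of nbytes under the model."""
    return nbytes / transfer_time_us(nbytes, latency_us, bandwidth_mbps)


def half_bandwidth_size(latency_us=8.0, bandwidth_mbps=700.0):
    """Message size (bytes) at which half the peak bandwidth is reached."""
    return latency_us * bandwidth_mbps


for size in (64, 1024, 65536, 1048576):
    print(f"{size:>8} bytes: {transfer_time_us(size):9.1f} us, "
          f"{effective_bandwidth_mbps(size):6.1f} MB/s")
print(f"Half-bandwidth message size: {half_bandwidth_size():.0f} bytes")
```

Under this model, messages smaller than a few kilobytes are dominated by latency rather than bandwidth.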
BLAS
DGEMM optimized by Kazushige Goto. 84.1% efficiency
He works in the Japanese patent? office
For single precision he got into the 90s (percent)
On Itanium he got 99%
520K-variable system of equations
Complete BLAS (lvl 1,2,3) available in a few weeks.
Powered up Sept 21, benchmark date Oct. 1?
Rmax = 10.28 teraflops
Nmax = 520K
N1/2 = 152K (5.178 TF). More interesting number: shows communication
performance. N1/2 is the matrix size needed to reach half of Rmax.
Probably could have gotten similar Rmax with just Gigabit
ethernet, but not the N1/2.
Got 10.9 later
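A back-of-the-envelope look at what the Rmax and Nmax figures above imply. This is a sketch under stated assumptions: peak is taken as 1100 nodes x 2 processors x 8 GFlops, Nmax is taken as exactly 520,000, the matrix is dense double precision (8 bytes per entry), and only the leading (2/3)N^3 term of the LU operation count is used.

```python
RMAX_TFLOPS = 10.28               # from the talk
RPEAK_TFLOPS = 1100 * 2 * 8 / 1000.0  # assumed: all 1100 dual nodes at 8 GFlops/proc
NMAX = 520_000                    # "520K", assumed exact
TOTAL_RAM_TB = 1100 * 4 / 1000.0  # 4 GB per node, decimal TB

# Fraction of theoretical peak achieved on the benchmark.
efficiency = RMAX_TFLOPS / RPEAK_TFLOPS

# Dense N x N matrix of 8-byte doubles, in decimal terabytes.
matrix_tb = 8 * NMAX**2 / 1e12

# Leading-order LU factorization cost: (2/3) N^3 floating point operations.
flops = (2.0 / 3.0) * NMAX**3
run_hours = flops / (RMAX_TFLOPS * 1e12) / 3600.0

print(f"Efficiency vs. peak: {efficiency:.1%}")
print(f"Matrix size: {matrix_tb:.2f} TB of {TOTAL_RAM_TB:.1f} TB total RAM")
print(f"Estimated run time at Rmax: {run_hours:.1f} hours")
```

So the benchmark matrix fills roughly half of the aggregate memory, and a single run at that size takes a couple of hours.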
VI. CSE Research
Groups at Va Tech
nanoscale electronics
quantum chemistry
computational chemistry/biochemistry
aerodynamics through multidisciplinary design optimization
cell cycle modeling
molecular statics
computational acoustics
computational fluid dynamics
computational electromagnetics
optimal design and control
wireless systems modeling
microarray experiment management
large-scale network emulation <- Dr. V's work
Experimental systems
fault tolerance and migration
queuing systems, schedulers
distributed OS / DSM
parallel file systems
middleware for computational grids
authentication / security systems
Questions
How many failures?
1-2 failures per day. Expected 2-3.
Not in full production yet.
ECC?
Under NDAs. Very careful about answer.
Xserves were a planned upgrade. On schedule.
Monitoring
Wrote lots of code/scripts to administer cluster
He talked more about the facilities stuff (power, cooling) than I noted. I didn't really know that stuff, so I didn't take that much down. It definitely sounded pretty complicated. They did CFD simulations of the heat dissipation in designing the cooling.
He didn't talk much at all about Deja Vu. Probably because of their patent application.
Re: ECC, all he would say was that the Xserves were a planned upgrade. My guess is that they knew about it pretty much from the start.
Overall it's quite an impressive feat that they've accomplished. To build the #3 supercomputer for such a low cost in such a short time is really something.