The first exascale supercomputer has a hardware failure every day

In short: Frontier, the world’s most powerful supercomputer, is online but still far from operational. Its director has confirmed reports that it is experiencing a system failure every few hours, but insists that is par for the course.

Frontier is in a class of its own. It has 9,408 HPE Cray EX235a nodes, each powered by an AMD Trento 7A53 Epyc 64-core CPU equipped with 512 GB of DDR4, and four AMD Instinct MI250X GPUs/accelerators, each outfitted with 128 GB of HBM2e. In total, the system has 602,112 CPU cores and 8,138,240 GPU cores, and 4.6 PB of both DDR4 and HBM2e.
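For readers who want to check the totals, the CPU core count and memory figures follow directly from the per-node numbers. The short Python sketch below reproduces them. It assumes the quoted "GB" and "PB" figures are binary units (1 PB = 1024 × 1024 GB), which is the reading under which the totals round to 4.6 PB. The GPU core total is not derived here, since it depends on a counting convention the article does not spell out.

```python
# Per-node figures quoted in the article.
NODES = 9_408
CPU_CORES_PER_NODE = 64
DDR4_GB_PER_NODE = 512
GPUS_PER_NODE = 4
HBM2E_GB_PER_GPU = 128

# Assumption: binary units, so 1 PB = 1024 * 1024 GB.
GB_PER_PB = 1024 * 1024

cpu_cores = NODES * CPU_CORES_PER_NODE
ddr4_pb = NODES * DDR4_GB_PER_NODE / GB_PER_PB
hbm2e_pb = NODES * GPUS_PER_NODE * HBM2E_GB_PER_GPU / GB_PER_PB

print(f"CPU cores: {cpu_cores:,}")
print(f"DDR4:      {ddr4_pb:.1f} PB")
print(f"HBM2e:     {hbm2e_pb:.1f} PB")
```

Each node carries 512 GB of DDR4 and 4 × 128 GB = 512 GB of HBM2e, which is why the two memory totals come out the same.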

In May, Frontier joined the Top500 as the first supercomputer to break the exascale barrier after it completed the HPL benchmark with a score of 1.102 Exaflop/s. Since then, the Oak Ridge National Laboratory in Tennessee, which manages the supercomputer, has been readying it for scientific research scheduled to begin in January.

However, there have been reports that the launch of Frontier could be waylaid by excessive hardware failures. Looking for answers, InsideHPC arranged an interview with the program director at Oak Ridge, Justin Whitt. In the interview, he confirmed Frontier was experiencing daily system failures but asserted that was inevitable in such a large system.

“Mean time between failure on a system this size is hours, it’s not days,” he said. “So you need to make sure you understand what those failures are and that there’s no pattern to those failures that you need to be concerned with.” Whitt added that going a day without a failure “would be outstanding.”

“Our goal is still hours.”
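Whitt's point about scale can be illustrated with a back-of-the-envelope calculation. The sketch below uses a simplified textbook model in which nodes fail independently at a constant rate, so the system-wide mean time between failures is the per-node figure divided by the node count. The five-year per-node value is purely hypothetical, chosen for illustration, and is not a measurement from Frontier or Oak Ridge.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF when nodes fail independently at a constant rate.

    Failure rates add across nodes, so the system-wide MTBF is the
    per-node MTBF divided by the number of nodes.
    """
    return node_mtbf_hours / node_count


NODES = 9_408                 # node count quoted in the article
HOURS_PER_YEAR = 24 * 365

# Hypothetical: each node fails on average once every five years.
node_mtbf = 5 * HOURS_PER_YEAR

system_mtbf = system_mtbf_hours(node_mtbf, NODES)
print(f"System-wide MTBF: {system_mtbf:.1f} hours")
```

Under these assumptions, even nodes that individually run for years between faults add up to a machine-wide failure every few hours.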

There had been rumors that the hardware issues were being caused by the new AMD Instinct MI250X, but Whitt refuted them. The MI250X is AMD’s most powerful GPU/accelerator, and AMD only sells it to select partners. It has 220 CUs containing 14,080 cores clocked at 1700 MHz in a 500 W package.

“The issues span lots of different categories, the GPUs are just one,” Whitt remarked. “It’s been a fairly good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products,” he added.

“We’re dealing with a lot of the early-life kind of things we’ve seen with other machines that we’ve deployed, so it’s nothing too out of the ordinary.”

Whitt conceded that the unprecedented scale of Frontier had made fine-tuning it “a little bit harder” but said they were still following the schedule set back in 2018-19 despite delays caused by the pandemic.

Head over to InsideHPC to read the full interview.
