There is no doubt that supercomputers are becoming more powerful as technology advances, but they are also becoming more vulnerable to failure. That was the conclusion some researchers presented at the SC12 conference in Salt Lake City. They also outlined a few possible solutions to a problem that could slow the pace of future technology and research.
More powerful, more exposed to failure
Today's ultra-high-performance computers can contain as many as 100,000 nodes. Each node is built from components such as memory, an internal system bus, processors and other microchips. No component lasts forever: at some point each one will fail, halting the work running on the supercomputer. This is why it is critical to find a proper solution to such failures now, before performance scales up to the exascale level.
The problem isn't new. ASCI White, a 600-node supercomputer built in 2001, had an MTBF of only 5 hours. MTBF stands for mean time between failures. A supercomputer that suffered a component failure every 5 hours was unacceptable, and scientists later managed to raise the MTBF to 55 hours. The concern now is that scientists expect supercomputers to be 10 times as powerful in 10 years, and they expect the failure rate to climb steeply along with that growth.
For example, exascale computers are expected to be built from millions of components, so the reliability of the entire system will have to improve 100-fold just to maintain the current mean time between failures, let alone improve it.
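The arithmetic behind that estimate can be sketched under a simplifying assumption that the talk did not state: components fail independently at a constant rate, so the system MTBF is roughly the per-component MTBF divided by the number of components. The function name and all the numbers below are hypothetical and chosen only to illustrate the scaling.

```python
def system_mtbf(component_mtbf_hours: float, n_components: int) -> float:
    """Approximate system MTBF, assuming independent, constant-rate failures."""
    return component_mtbf_hours / n_components


# Hypothetical figures, not measurements from any real machine.
today_components = 10_000
exascale_components = 1_000_000      # 100 times as many parts
component_mtbf = 550_000.0           # hours per component (assumed)

today = system_mtbf(component_mtbf, today_components)
future = system_mtbf(component_mtbf, exascale_components)

# Component MTBF needed at exascale to keep the system MTBF unchanged.
required = component_mtbf * exascale_components / today_components

print(f"system MTBF today:          {today} h")
print(f"same parts, 100x as many:   {future} h")
print(f"required improvement:       {required / component_mtbf:.0f}x")
```

Under these assumptions, a hundredfold increase in part count cuts the system MTBF by a factor of 100, which is why each part would have to become 100 times more reliable just to stand still.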
What are the solutions?
David Fiala, a Ph.D. student at North Carolina State University, said his research has produced a method that could improve the reliability of supercomputers. The approach runs multiple clones of a program in parallel. The software, called RedMPI, intercepts all the MPI messages that the application sends and transmits copies of them to the clones. If the clones running in parallel compute different results, the numbers can be rechecked on the fly.
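The idea of comparing clones can be illustrated with a small sketch. This is not RedMPI and does not use MPI at all; it only simulates running the same computation several times and voting on the result. The helper names (`run_clones`, `vote`) and the injected fault are hypothetical.

```python
from collections import Counter


def run_clones(compute, x, n_clones=3, faulty=None):
    """Run the same computation n_clones times.

    `faulty` maps a clone index to a corrupted result, simulating a
    silent error in that clone (an assumption made for this demo).
    """
    faulty = faulty or {}
    return [faulty.get(i, compute(x)) for i in range(n_clones)]


def vote(results):
    """Return the majority result and the indices of clones that disagree."""
    value, hits = Counter(results).most_common(1)[0]
    if hits <= len(results) // 2:
        # No strict majority: the error is detected but cannot be corrected.
        raise RuntimeError("clones disagree with no majority; recompute")
    disagreeing = [i for i, r in enumerate(results) if r != value]
    return value, disagreeing


def square(x):
    return x * x


results = run_clones(square, 12, faulty={1: 145})  # clone 1 is corrupted
value, disagreeing = vote(results)
print(value, disagreeing)  # 144 [1]
```

With three clones a single bad result can be outvoted. With only two, a mismatch can be detected but not resolved, so the value has to be recomputed.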
Rechecking on the fly avoids the rerunning that the checkpoint recovery method requires. With that method, the program's state is written to disk at certain points and, when a failure occurs, the job is relaunched from the last checkpoint on disk. However, RedMPI might not be the best answer either, given how much network traffic the redundancy generates.
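Checkpoint recovery itself can be sketched in a few lines. The example below is a toy, single-process illustration, not how any particular supercomputer implements it: the job, the file name, the checkpoint interval and the simulated failure are all assumptions made for the demo.

```python
import os
import pickle
import tempfile


def run_job(total_steps, checkpoint_path, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..total_steps-1, saving state to disk periodically."""
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            # Write to a temporary file first, then rename it into place,
            # so a crash during the write cannot corrupt the last checkpoint.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
    return state["acc"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run_job(100, path, fail_at=57)   # crashes partway through
except RuntimeError:
    pass
result = run_job(100, path)          # restarts from step 50, not step 0
print(result)  # 4950
```

In this run the work done between step 50 and the failure at step 57 is lost and repeated. That repeated work, plus the time spent writing checkpoints, is the overhead the method carries.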
Ana Gainaru, a Ph.D. student at the University of Illinois, suggested that studying the logs and interpreting them correctly could make it possible to predict when a failure is about to happen. Her approach uses signal analysis to characterize normal behavior and data mining to find common elements among different failures, since research has shown that failures are correlated.
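A much simplified sketch of both ideas follows. It is not Gainaru's method: it stands in a basic deviation-from-baseline check for the signal analysis and a simple count of events seen shortly before failures for the data mining. The event names, window counts and thresholds are invented for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def anomalous_windows(counts, baseline_len, k=3.0):
    """Flag time windows whose event count strays from normal behavior.

    The first `baseline_len` windows define "normal"; later windows are
    flagged if they differ from the baseline mean by more than k
    standard deviations.
    """
    baseline = counts[:baseline_len]
    mu, sigma = mean(baseline), stdev(baseline)
    return [
        i
        for i, c in enumerate(counts[baseline_len:], start=baseline_len)
        if abs(c - mu) > k * sigma
    ]


def precursor_counts(events, failure_type="FAILURE", lookback=2):
    """Count which event types occur within `lookback` events before a failure."""
    seen = Counter()
    for i, event in enumerate(events):
        if event == failure_type:
            seen.update(set(events[max(0, i - lookback):i]))
    return seen


# Hypothetical log data: messages per time window, then a stream of event types.
counts = [4, 5, 6, 5, 4, 6, 5, 5, 21, 5]
print(anomalous_windows(counts, baseline_len=8))  # [8]

events = ["ok", "ecc_warn", "link_retry", "FAILURE",
          "ok", "ok", "ecc_warn", "ok", "FAILURE"]
print(precursor_counts(events).most_common(1))  # [('ecc_warn', 2)]
```

An event type that shows up before many failures, such as the made-up `ecc_warn` here, is the kind of correlation that could serve as an early warning.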