Talk – Mar 11: Fault-Tolerant Algorithms and Frameworks for Extreme-Scale Computing
Speaker: Linda Stals, Australian National University
Time: Wednesday, March 11, 2020, 14:15
Room: 01.150-128 – Seminarraum (Cauerstraße 11)
Abstract
On future extreme-scale computers, faults will become increasingly common as the number of individual components grows without a compensating improvement in reliability. Achieving resilience is expensive, since it inevitably requires redundancy and therefore more system resources and additional energy. Traditional checkpoint techniques regularly collect data from all compute nodes and store it in backup memory, but this will be too expensive and too slow in extreme-scale computing.