Improving resilience of scientific software through a domain-specific approach

Reguly, I. Z., Mudalige, G. R., Giles, M. B. and Maheswaran, S. (2019) Improving resilience of scientific software through a domain-specific approach. Journal of Parallel and Distributed Computing, 128. pp. 99-114. ISSN 0743-7315

Abstract

In this paper, we present research on improving the resilience of the execution of scientific software, an increasingly important concern in High Performance Computing (HPC). We build on an existing high-level abstraction framework, the Oxford Parallel library for Structured meshes (OPS), developed for the solution of multi-block structured mesh-based applications, and implement an algorithm in the library to carry out checkpointing automatically, without the intervention of the user. The target applications are a hydrodynamics benchmark application from the Mantevo Suite, CloverLeaf 3D, the sparse linear solver proxy application TeaLeaf, and the OpenSBLI compressible Navier–Stokes direct numerical simulation (DNS) solver. We present (1) the basic algorithm that OPS relies on to determine the optimal checkpoint in terms of size and location, (2) improvements that supply additional information to improve the decision, (3) techniques that reduce the cost of writing the checkpoints to non-volatile storage, and (4) a performance analysis of the developed techniques on a single workstation and on several supercomputers, including ORNL’s Titan. Our results demonstrate the utility of the high-level abstractions approach in automating the checkpointing process and show that performance is comparable to, or better than, the reference in all cases.

Item Type: Article
Uncontrolled Keywords: Domain Specific Language, High Performance Computing, Checkpointing, Resilience, Parallel I/O
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
SWORD Depositor: MTMT SWORD
Depositing User: MTMT SWORD
Date Deposited: 18 Sep 2019 12:55
Last Modified: 18 Sep 2019 12:55
URI: http://real.mtak.hu/id/eprint/99795
