    • A study of checkpointing in large scale training of deep neural networks 

      Rojas, Elvis; Kahira, Albert Njoroge; Meneses, Esteban; Bautista-Gomez, Leonardo; Badia, Rosa M (arXiv.Org, 2021-03-29)
      Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems. While significant effort has been put into facilitating distributed ...