Sede Regional Brunca

Permanent URI for this community: http://10.0.96.45:4000/handle/11056/14884

Showing 1 - 3 of 3

A study of checkpointing in large scale training of deep neural networks
(arXiv.Org, 2021-03-29) Rojas, Elvis; Kahira, Albert Njoroge; Meneses, Esteban; Bautista-Gomez, Leonardo; Badia, Rosa M
Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems. While DL frameworks have put significant effort into facilitating distributed training, fault tolerance has been largely ignored. Checkpoint-restart is a common fault tolerance technique in HPC workloads. In this work, we examine the checkpointing implementation of popular DL platforms. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide take-away points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
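As a rough illustration of the checkpoint-restart technique this abstract refers to, the following is a minimal sketch using only the Python standard library. It is not the checkpointing code of Chainer, PyTorch, or TensorFlow, and it is not taken from the paper: the names `save_checkpoint`, `load_checkpoint`, and `train`, the pickle file format, and the toy "weight" update are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # a crash mid-write never leaves a partial file


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Toy training loop (hypothetical) that resumes from the last checkpoint."""
    state = load_checkpoint(path) or {"step": 0, "weight": 0.0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["weight"] += 0.5  # stand-in for one optimizer update
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

In this sketch, a run that fails after the last checkpoint loses only the steps taken since that checkpoint; calling `train` again with the same path picks up from the saved step rather than from zero.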
Exploring the effects of silent data corruption in distributed deep learning training
(Institute of Electrical and Electronics Engineers (IEEE), 2022-11-02) Rojas, Elvis; Pérez, Diego; Meneses, Esteban
The profound impact of recent developments in artificial intelligence is unquestionable. The applications of deep learning models are everywhere, from advanced natural language processing to highly accurate prediction of extreme weather. Those models have been continuously increasing in complexity, becoming much more powerful than their original versions. In addition, data to train the models is becoming more available as technological infrastructures sense and collect more readings. Consequently, distributed deep learning training is often necessary to handle intricate models and massive datasets. Running a distributed training strategy on a supercomputer exposes the models to all the considerations of a large-scale machine; reliability is one of them. As supercomputers integrate a colossal number of components, each fabricated at an ever-decreasing feature size, faults are common during the execution of programs. A particular type of fault, silent data corruption (SDC), is troublesome because the system does not crash and does not immediately give an evident sign of an error. We set out to explore the effects of that type of fault by inspecting how distributed deep learning training strategies cope with bit-flips that affect their internal data structures. We used checkpoint alteration, a technique that permits the study of this phenomenon on different distributed training platforms and with different deep learning frameworks. We evaluated two distributed learning libraries (Distributed Data Parallel and Horovod) and found that Horovod is slightly more resilient to SDCs. However, fault propagation is similar in both cases, and the model is more sensitive to SDCs than the optimizer.
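To make the notion of a bit-flip concrete, here is a minimal sketch of flipping one bit of a 32-bit floating-point value using Python's standard `struct` module. It is not the authors' checkpoint alteration tool; the function name `flip_bit` and the choice of single-precision values are assumptions made for the example.

```python
import struct


def flip_bit(value, bit):
    """Return value, stored as float32, with one bit flipped (bit 0 = least significant)."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 32)")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR toggles exactly one bit; then reinterpret the bytes as float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


print(flip_bit(1.0, 31))  # sign bit: -1.0
print(flip_bit(1.0, 23))  # lowest exponent bit: 0.5
print(flip_bit(1.0, 30))  # highest exponent bit: inf
```

The three calls show why the position of the corrupted bit matters: the same single-bit error can change the sign, halve the value, or turn it into infinity.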
Influence of continuing training on the work of English teachers at the Centro de Idiomas of the Universidad Nacional, Sede Regional Brunca, Campus Pérez Zeledón
(Universidad Internacional San Isidro Labrador, 2021-12) Vargas Barboza, Cristina Melissa
Continuing training enables teachers' professional development and keeps them up to date, which has become even more relevant with the COVID-19 pandemic. Moreover, in second-language teaching, the role of continuing training is even more decisive in this new context. For this reason, the present study sought to investigate the influence of continuing training on the work of English teachers in the CI-UNA program in Pérez Zeledón. The study used a qualitative approach to gather information on participants' experiences and opinions through an online questionnaire. Notable findings included the teachers' commitment to staying up to date, the significant increase in the number of training sessions received after the onset of the COVID-19 pandemic, and the positive influence that training has on the way they teach.

Browsing Sede Regional Brunca by type "http://purl.org/coar/resource_type/c_816b"