Analyzing a Five-year Failure Record of a Leadership-class Supercomputer

Rojas, Elvis; Meneses, Esteban; Jones, Terry; Maxwell, Don

dc.contributor.author	Rojas, Elvis
dc.contributor.author	Meneses, Esteban
dc.contributor.author	Jones, Terry
dc.contributor.author	Maxwell, Don
dc.date.accessioned	2020-02-12T17:10:31Z
dc.date.available	2020-02-12T17:10:31Z
dc.date.issued	2019-10-18
dc.identifier.uri	http://hdl.handle.net/11056/17204
dc.description.abstract	Extreme-scale computing systems are required to solve some of the grand challenges in science and technology. From astrophysics to molecular biology, supercomputers are an essential tool to accelerate scientific discovery. However, large computing systems are prone to failures due to their complexity. It is crucial to develop an understanding of how these systems fail to design reliable supercomputing platforms for the future. This paper examines a five-year failure and workload record of a leadership-class supercomputer. To the best of our knowledge, five years represents the vast majority of the lifespan of a supercomputer. This is the first time such analysis is performed on a top 10 modern supercomputer. We performed a failure categorization and found out that: i) most errors are GPU-related, with roughly 37% of them being double-bit errors on the cards; ii) failures are not evenly spread across the physical machine, with room temperature presumably playing a major role; and iii) software errors of the system bring down several nodes concurrently. Our failure rate analysis unveils that: i) the system consistently degrades, being at least twice as reliable at the beginning, compared to the end of the period; ii) Weibull distribution closely fits the mean-time-between-failure data; and iii) hardware and software errors show a markedly different pattern. Finally, we correlated failure and workload records to reveal that: i) failure and workload records are weakly correlated, except for certain types of failures when segmented by the hours of the day; ii) several categories of failures make jobs crash within the first minutes of execution; and iii) a significant fraction of failed jobs exhaust the requested time with a disregard of when the failure occurred during execution.	es_ES
dc.language.iso	en	es_ES
dc.publisher	Institute of Electrical and Electronics Engineers, Incorporated (IEEE)	es_ES
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	*
dc.subject	Fault tolerance, resilience, failure analysis, high performance computing.	es_ES
dc.title	Analyzing a Five-year Failure Record of a Leadership-class Supercomputer	es_ES
dc.type	http://purl.org/coar/resource_type/c_6501	es_ES

Files in this item

Name:: paper_IEEE.pdf
Size:: 1.184Mb
Format:: PDF
Description:: Scientific article

Name:: license_rdf
Size:: 1.203Kb
Format:: application/rdf+xml

This item appears in the following Collection(s)

Scientific articles [75]
Scientific articles [73]

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States