Incorporating Error Detection and Recovery into Hierarchically Semi-Separable Matrix Operations
Brian Austin
Alex Druinsky, Osni Marquez, Eric Roman, Sherry Li (LBNL)
April 8, 2015

Towards Optimal Order Resilient Solvers at Extreme Scale (TOORSES)
Linear solvers are ubiquitous in scientific computing
• Performance
  • HSS matrix format reduces computational complexity
• Resilience
  • Error rates may increase on extreme-scale systems.
  • Increased concurrency means more parts that might fail
  • Potentially lower part reliability (smaller transistors, near-threshold voltage)

Outline
• Hierarchically Semi-Separable (HSS) decomposition
• Algorithm-based fault tolerance (ABFT) for dense matrices
• Error detection for HSS matrix-vector multiplication
• Error recovery using Containment Domains
• Performance results

Hierarchically Semi-Separable (HSS) Matrix Decomposition
• Exploits low numerical rank of matrix.
• Structured block sparsity
• Factorization has bounded error.
A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*

HSS Matrix-Vector Multiplication
HSS matrix-vector multiplication: b = A.x
A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*

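The nested form above can be applied to a vector without ever forming A. The sketch below is a minimal illustration under simplifying assumptions: each D, U, B, V is treated as one small dense real matrix (so V* is just a transpose), whereas a real HSS code stores block-structured factors per tree level. The function names (`nested_matvec`, `dense_from_factors`) and the toy data are hypothetical.

```python
def matvec(M, v):
    """Dense matrix-vector product; M is a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Dense matrix-matrix product (used only to build a reference A)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def mat_add(A, B):
    return [vec_add(ra, rb) for ra, rb in zip(A, B)]

def nested_matvec(D, Us, Bs, Vs, x):
    """b = (D + U3 (B2 + U2 (B1 + U1 B0 V1*) V2*) V3*) x using only matvecs.

    Us = [U1, U2, U3], Vs = [V1, V2, V3], Bs = [B0, B1, B2].
    """
    levels = len(Us)
    # First sweep: z[k] = Vk* z[k+1], starting from z[levels+1] = x.
    z = {levels + 1: x}
    for k in range(levels, 0, -1):
        z[k] = matvec(transpose(Vs[k - 1]), z[k + 1])
    # Second sweep: w_k = Bk z[k+1] + Uk w_(k-1), starting from w_0 = B0 z[1].
    w = matvec(Bs[0], z[1])
    for k in range(1, levels):
        w = vec_add(matvec(Bs[k], z[k + 1]), matvec(Us[k - 1], w))
    return vec_add(matvec(D, x), matvec(Us[levels - 1], w))

def dense_from_factors(D, Us, Bs, Vs):
    """Form A explicitly from the same factors, for comparison only."""
    M = Bs[0]
    for k in range(1, len(Us)):
        M = mat_add(Bs[k], matmul(matmul(Us[k - 1], M), transpose(Vs[k - 1])))
    return mat_add(D, matmul(matmul(Us[-1], M), transpose(Vs[-1])))

# Toy integer factors with shrinking inner dimensions 4 -> 3 -> 2 -> 1.
D = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
Us = [[[1], [2]], [[1, 0], [0, 1], [1, 1]], [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]]
Vs = [[[1], [1]], [[1, 0], [1, 1], [0, 1]], [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]]
Bs = [[[3]], [[1, 1], [0, 2]], [[1, 2, 0], [0, 1, 0], [2, 0, 1]]]
x = [1, 2, 3, 4]

b = nested_matvec(D, Us, Bs, Vs, x)
b_reference = matvec(dense_from_factors(D, Us, Bs, Vs), x)
```

Integer data keeps the comparison exact; with floating-point factors the two results would agree only up to roundoff.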
Algorithm-Based Fault Tolerance (ABFT) for Dense Matrices (Huang & Abraham, 1984)
• Checksum protection for individual matrices
  • Recovers up to one error per row/column
  • e^T.A = [e^T A]
  • A.e = [A e]
• Matrix multiplication preserves checksums
  • [e^T A].B = e^T.[A B]
  • A.[B e] = [A B].e
• Checksum relationships can be derived from associative properties.
[Figure: A (with checksums [A e] and [e^T A]) times B (with checksums [B e] and [e^T B]) equals C, where A.[B e] = C.e and [e^T A].B = e^T.C]

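As a rough illustration of the checksum relations on this slide, the sketch below checks a product C = A.B against [e^T A].B and A.[B e], locates a single corrupted entry from the failing row and column, and repairs it. It is a toy version under stated assumptions: integer data so equality is exact (floating-point data would need a tolerance), and hypothetical helper names (`find_mismatches`, `correct_single_error`).

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(vi * m for vi, m in zip(v, col)) for col in zip(*M)]

def col_checksum(M):
    """e^T M: one sum per column."""
    return [sum(col) for col in zip(*M)]

def row_checksum(M):
    """M e: one sum per row."""
    return [sum(row) for row in M]

def find_mismatches(A, B, C):
    """Return (bad_rows, bad_cols) where C disagrees with the input checksums."""
    want_cols = vecmat(col_checksum(A), B)   # [e^T A].B should equal e^T.C
    want_rows = matvec(A, row_checksum(B))   # A.[B e] should equal C.e
    bad_cols = [j for j, (got, want) in enumerate(zip(col_checksum(C), want_cols))
                if got != want]
    bad_rows = [i for i, (got, want) in enumerate(zip(row_checksum(C), want_rows))
                if got != want]
    return bad_rows, bad_cols

def correct_single_error(A, B, C):
    """Repair C in place if exactly one entry is wrong; return True on success."""
    bad_rows, bad_cols = find_mismatches(A, B, C)
    if not bad_rows and not bad_cols:
        return True
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        want_row_sum = matvec(A, row_checksum(B))[i]
        C[i][j] -= row_checksum(C)[i] - want_row_sum
        return True
    return False  # more than one error: this simple scheme cannot recover

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
C[1][0] += 5                                  # inject one error into the result
located = find_mismatches(A, B, C)
repaired = correct_single_error(A, B, C)
```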
Intermediate Error Checking for HSS-MV
• Observation: between each pair of parentheses, there is an implicit (i.e., not explicitly stored) matrix.
• Many invariant conditions can be constructed using associativity.
• For example:
  y . [ U(3) . U(2) . U(1) . e ] = [ y . U(3) . U(2) . U(1) ] . e
• Many options for error checking at different stages of HSS-MV
A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*

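The example invariant can be evaluated with matrix-vector products only, so the implicit product U(3).U(2).U(1) is never formed. The sketch below assumes that the left-hand checksum U(3).U(2).U(1).e is computed once and stored, and that a factor is corrupted afterwards; that scenario, the function names, and the toy integer factors are illustrative assumptions, not taken from the slides.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(vi * m for vi, m in zip(v, col)) for col in zip(*M)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def chain_times_e(Us, e):
    """U(3).U(2).U(1).e, evaluated right to left; Us = [U3, U2, U1]."""
    v = e
    for U in reversed(Us):
        v = matvec(U, v)
    return v

def y_times_chain(y, Us):
    """y.U(3).U(2).U(1), evaluated left to right."""
    r = y
    for U in Us:
        r = vecmat(r, U)
    return r

def invariant_holds(y, Us, stored_checksum, e):
    """Compare y.[U3.U2.U1.e] (stored) with [y.U3.U2.U1].e (recomputed)."""
    return dot(y, stored_checksum) == dot(y_times_chain(y, Us), e)

U3 = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
U2 = [[1, 0], [0, 1], [1, 1]]
U1 = [[1, 0], [2, 1]]
Us = [U3, U2, U1]
e = [1, 1]
y = [1, 2, 3, 4]

stored = chain_times_e(Us, e)          # computed once, while the data is trusted
ok_before = invariant_holds(y, Us, stored, e)
U2[0][1] = 7                           # later corruption of one stored entry
ok_after = invariant_holds(y, Us, stored, e)
```

A single test vector y can miss an error whose effect happens to cancel in the dot product, so a check of this kind detects most corruptions rather than all of them.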
ABFT for HSS-MV
Error checking with adjustable granularity
• Coarse + CD
  • [e.A_HSS].x = e.[A_HSS.x]
• Medium + CD
  • e.[V(L).x] = [e.V(L)].x
  • e.[V(0)V(L).x] = [e.V(0)V(L)].x
  • [e.A_HSS].x = e.[A_HSS.x]
• Fine + CD
  • Detect errors in each MV
• Encoded
  • Detect & correct errors in each MV
[Figure: HSS matrix-vector multiplication, b = A.x]

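The coarse check compares one scalar per matrix-vector product: the stored checksum row [e.A] applied to x against the sum of the entries of b. A minimal sketch follows, assuming a small dense matrix as a stand-in for A_HSS and a relative tolerance to absorb floating-point roundoff; the tolerance values and the name `coarse_check` are illustrative choices, not from the slides.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def column_sums(M):
    """e.M: the checksum row, computed once and stored."""
    return [sum(col) for col in zip(*M)]

def coarse_check(checksum_row, x, b, rel_tol=1e-9, abs_tol=1e-12):
    """Test [e.A].x == e.[A.x] up to roundoff."""
    lhs = sum(c * xi for c, xi in zip(checksum_row, x))
    rhs = sum(b)
    return math.isclose(lhs, rhs, rel_tol=rel_tol, abs_tol=abs_tol)

A = [[2.0, 0.5, 0.0], [0.25, 3.0, 1.0], [0.0, 1.5, 4.0]]
x = [1.0, -2.0, 0.5]
checksum_row = column_sums(A)

b = matvec(A, x)
clean_ok = coarse_check(checksum_row, x, b)

faulty_b = list(b)
faulty_b[1] += 1e-3                    # simulated soft error in one output entry
faulty_ok = coarse_check(checksum_row, x, faulty_b)
```

The tolerance sets the trade-off: errors smaller than the tolerance pass unnoticed, while a tolerance that is too tight flags ordinary roundoff as a fault.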
Error Recovery by Containment Domains (CDs)
Error Detection
• Classical ABFT cannot recover all errors.
  • Multiple errors per row.
  • Errors in both A and B.
  • Redesign for every algorithm.
• Containment Domains provide more robust recovery techniques.
  • Users supply validation tests.
  • Remote safe store
  • Composable (nested, ...)
  • Automatic escalation
CD pseudocode
CD_Begin()
  // first pass: store safe copies of A, B
  // second pass: restore A, B
  CD_Preserve(A, [e^T A], [A e])
  CD_Preserve(B, [e^T B], [B e])
  Compute: C = A.B
  CD_Assert(e^T.C == [e^T A].B)
CD_Complete()

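The preserve / compute / assert / re-execute pattern in the pseudocode can be mimicked in a few lines. The sketch below is not the Containment Domains library API: `run_domain`, `EscalateError`, and the fault injector are hypothetical stand-ins that only illustrate the control flow, with a deep copy playing the role of the safe store.

```python
import copy

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def col_checksum(M):
    return [sum(col) for col in zip(*M)]

def vecmat(v, M):
    return [sum(vi * m for vi, m in zip(v, col)) for col in zip(*M)]

class EscalateError(Exception):
    """Raised when retries are exhausted; a parent domain would take over."""

def run_domain(state, compute, validate, max_retries=2):
    """Preserve the inputs, compute, validate, and re-execute on failure."""
    preserved = copy.deepcopy(state)          # analogue of CD_Preserve
    for attempt in range(max_retries + 1):
        result = compute(state)
        if validate(state, result):           # analogue of CD_Assert
            return result, attempt            # analogue of CD_Complete
        state = copy.deepcopy(preserved)      # restore safe copies, then retry
    raise EscalateError("validation kept failing")

def make_flaky_matmul(faults):
    """Return a multiply that corrupts its result for the first `faults` calls."""
    remaining = [faults]
    def compute(state):
        C = matmul(state["A"], state["B"])
        if remaining[0] > 0:
            remaining[0] -= 1
            C[0][0] += 1                      # simulated transient error
        return C
    return compute

def checksum_ok(state, C):
    """e^T.C == [e^T A].B, using the preserved checksum of A."""
    return col_checksum(C) == vecmat(state["eTA"], state["B"])

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
state = {"A": A, "B": B, "eTA": col_checksum(A)}

C, retries = run_domain(state, make_flaky_matmul(faults=1), checksum_ok)

try:
    run_domain(state, make_flaky_matmul(faults=5), checksum_ok)
    escalated = False
except EscalateError:
    escalated = True
```

Preserving the checksums together with A and B matters here: the validation test is only trustworthy if the checksum it compares against comes from the safe copy.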
Runtime overhead without error injection
• Overhead is less than 2%
• Comparable to natural performance variation.
[Figure: bar chart of time per HSS-MV iteration (s), y-axis from 0.234 to 0.246, one bar per scheme, with memory (GB) in parentheses: None (1148.2), Coarse (2290.4), Medium (2290.9), Fine (2292.2), Encoded + Coarse (2295.9), Encoded (1164.2)]

... injection.
[Figure: time per HSS-MV iteration (s), y-axis from 0.20 to 0.45, versus error rate (#/s), x-axis from 1.0E-3 to 1.0E+0, for the Coarse, Medium, Fine, and Encoded schemes]

Conclusions & Future Work
• Identified checksum relationships to validate HSS-MV operations.
• Fine-grained error checking:
  • has very low overhead
  • maintains excellent efficiency at high error rates.
• Containment Domains
  • Fine-grained preservation incurs minimal runtime overhead.
  • Preservation doubles memory capacity requirements.
• Merge fault-tolerance branch into main (parallel) HSS code.
• Incorporation into linear solver

Acknowledgement
• TOORSES (LBNL)
  • Sherry Li (PI)
  • Eric Roman
  • Osni Marquez
  • Alex Druinsky
• STRUMPACK HSS library
  • Francois-Henry Rouet
• Containment Domains (UT)
  • Mattan Erez
  • Kyushick Lee
• Support
  • This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract number DE-AC02-05CH11231.
  • This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

National Energy Research Scientific Computing Center


Editor's Notes

  • #4: Meeting notes (3/13/15 16:33): The audience likely knows Huang and Abraham, so zip past this. Fix the date on the slide. Fix colors on plots. Make relative time clearer. Explanation of CD.
  • #10: Meeting notes (3/13/15 16:37): The CD slide is really dense. What is a containment domain? Walk through the pseudocode as slowly as I did during Q&A. Make it clear that checksums are also being preserved.
  • #12: Meeting notes (3/13/15 16:33): Need a better introduction, not necessarily a slide: why am I here and what am I going to talk about?