Application Checkpoint and Power Study on Large Scale Systems
YUPING FAN
Department of Computer Science
Illinois Institute of Technology
10 West 35th Street, Chicago, IL, 60616
USA
Abstract: - Power efficiency is critical in high performance computing (HPC) systems. To achieve high power
efficiency at the application level, it is vitally important to understand and efficiently distribute the power consumed by application checkpoints.
In this study, we analyze the relationship between application checkpoints and their power consumption. The observations
can guide the design of power management policies.
Key-Words: - High Performance Computing (HPC), power management, performance analysis
Received: August 25, 2021. Revised: April 18, 2022. Accepted: May 19, 2022. Published: June 8, 2022.
1 Introduction
As HPC systems rapidly grow in size, the amount of electrical power they consume keeps increasing. As a result, power has become the leading constraint in the design of next-generation HPC systems. Several existing solutions, such as power capping and dynamic voltage and frequency scaling (DVFS), address these power constraints. To distribute power efficiently, it is crucial to analyze application behaviors and provide customized power policies for different applications.
Checkpointing is a widely used fault tolerance mechanism. Checkpoints can be categorized into system-level and application-level checkpoints. Application-level checkpoints aim to recover failed applications; applications can choose their own checkpoint frequency and other checkpoint-related parameters.
The power and time consumed by checkpointing are not negligible. Therefore, this work studies how checkpoints and faults affect power and energy consumption. We utilize MonEQ [1] to gather power information on Mira. MonEQ is a user-level profiling library that collects power information at the node card level. Besides the node card power as a whole, the power information breaks down into six domains: DRAM voltage, link chip voltage, SRAM voltage, optics voltage, PCI Express voltage, and link chip core voltage. Three benchmarks, NPB [2], Flash [3] and STREAM [4], are investigated in this study. For the NPB and STREAM benchmarks, we add checkpoints and faults using FTI [5]. FTI is a fault tolerance interface that provides application-level checkpointing for large-scale supercomputers. FTI offers four configurable checkpoint levels, which provide different degrees of protection for applications. The checkpoint frequencies can be configured and optimized [6] to balance execution time and program correctness. The Flash benchmark has its own checkpointing strategy, but there is no option to inject faults.
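The paper does not list its instrumentation code, but the following minimal sketch illustrates the usual FTI workflow: initialize FTI, register the data to protect, and call FTI_Snapshot() in the main loop so that checkpoints are written at the intervals set in the FTI configuration file (and restored after a failure). The problem size, iteration count, update loop and configuration file name are illustrative assumptions, not the authors' actual setup.

/* Minimal FTI instrumentation sketch (illustrative only). */
#include <mpi.h>
#include <fti.h>

#define N     (1 << 20)   /* hypothetical problem size */
#define STEPS 1000        /* hypothetical number of iterations */

static double state[N];   /* application state to protect */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* config.fti holds, among other settings, the per-level checkpoint intervals. */
    FTI_Init("config.fti", MPI_COMM_WORLD);
    /* Application communication should use FTI_COMM_WORLD after this point. */

    int step = 0;
    /* Register the variables FTI should include in every checkpoint. */
    FTI_Protect(0, &step, 1, FTI_INTG);
    FTI_Protect(1, state, N, FTI_DBLE);

    for (; step < STEPS; step++) {
        /* Writes a checkpoint when a configured interval expires,
           or restores the protected data when restarting after a failure. */
        FTI_Snapshot();
        for (long i = 0; i < N; i++)          /* stand-in for the real computation */
            state[i] = 0.5 * state[i] + step;
    }

    FTI_Finalize();
    MPI_Finalize();
    return 0;
}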
The observations in this study can guide HPC job schedulers to intelligently schedule jobs [7-17] and to set checkpoint strategies wisely in order to achieve high power efficiency.
The remainder of this paper is organized as follows.
We start by introducing background and related work in §2. §3 presents the analysis and observations. We conclude the paper in §4.
2 Background
This section introduces HPC power management in §2.1 and application checkpointing in §2.2. Then, we introduce the benchmarks analyzed in this study (§2.3).
2.1 HPC Power Management
As HPC systems rapidly grow in size, the limited power budget becomes one of the most crucial challenges. Several hardware power management mechanisms, such as dynamic voltage and frequency scaling (DVFS) and power capping,
have been developed. Power can be managed either at the system level or at the application level. Studies show that managing power at the application level is more efficient, because the effects of power capping vary across applications and even across different stages of the same application. Therefore, to manage power efficiently, it is crucial to comprehensively analyze the effects of power management on individual applications.
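As a concrete illustration of power capping (not the mechanism available on the Blue Gene/Q system studied here), the sketch below sets a package power limit through the Linux powercap sysfs interface exposed by Intel RAPL; the sysfs path and the assumption that the node exposes intel-rapl zones are platform-specific, and writing the limit requires root privileges.

/* Hedged sketch: set a CPU package power cap via the Linux powercap (RAPL) interface. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <cap_in_watts>\n", argv[0]);
        return 1;
    }
    long cap_uw = atol(argv[1]) * 1000000L;   /* the interface expects microwatts */

    /* Long-term power constraint of CPU package 0 (present only on RAPL-capable Linux nodes). */
    const char *path = "/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw";
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    fprintf(f, "%ld\n", cap_uw);
    fclose(f);
    return 0;
}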
2.2 Application Checkpointing
Fault tolerance is a serious problem in HPC systems. HPC systems consist of millions of components, and a single point of failure can lead to application failure or even system failure. As HPC system sizes rapidly increase, failures happen more frequently. Checkpointing is one of the most widely used fault tolerance techniques. HPC checkpointing can be implemented at two levels: system level and application level. System-level checkpointing can prevent catastrophic system failures, but it is very time- and memory-intensive. On the other hand, application checkpointing is more lightweight and can prevent application failures. Typically, application checkpointing is scheduled more often than system checkpointing.
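As background on how checkpoint frequencies are typically chosen (this rule is not used or evaluated in this paper; the multi-level checkpoint model of [6] refines the idea), the classic first-order approximation attributed to Young balances the cost of writing checkpoints against the expected recomputation after a failure:

% Young's first-order approximation of the checkpoint interval.
% C: time to write one checkpoint, M: mean time between failures (MTBF).
\tau_{\mathrm{opt}} \approx \sqrt{2\,C\,M}
% Illustrative numbers (assumed, not measured in this study):
% C = 60 s, M = 86400 s (24 h)  =>  \tau_{\mathrm{opt}} \approx \sqrt{2 \cdot 60 \cdot 86400} \approx 3220 s \approx 54 min.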
2.3 Benchmarks
To comprehensively evaluate various types of application performance, three benchmarks are studied in this work.
2.3.1 NPB Benchmarks
The NAS Parallel Benchmarks (NPB) are a small set
of programs designed to help evaluate the
performance of highly parallel supercomputers. The
benchmarks are derived from computational fluid
dynamics (CFD) applications and consist of five
kernels and three pseudo-applications. We study
three representative programs from NPB
benchmarks:
1. CG: Conjugate Gradient, irregular memory access
and communication
2. FT: discrete 3D fast Fourier Transform, all-to-all
communication
3. LU: Lower-Upper Gauss-Seidel solver
2.3.2 Flash
The Flash benchmark is a multiphysics, multiscale simulation code. We select the Sedov explosion problem in this study. The Sedov explosion problem is a hydrodynamical test that checks the code's ability to handle strong shocks and non-planar symmetry.
2.3.3 STREAM
STREAM is a simple, synthetic benchmark designed to measure sustainable memory bandwidth (in MB/s) and a corresponding computation rate for four simple vector kernels. STREAM is representative of memory-intensive jobs.
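For reference, the four STREAM kernels are simple vector operations; the sketch below shows only their structure (the array length and scalar are assumed), while the official benchmark additionally times repeated runs and reports bandwidth as bytes moved per second.

/* Structure of the four STREAM vector kernels (illustrative sketch, not the official code). */
#include <stddef.h>

#define N 10000000          /* hypothetical array length */
static double a[N], b[N], c[N];

void stream_kernels(double scalar) {
    for (size_t j = 0; j < N; j++) c[j] = a[j];                 /* Copy  */
    for (size_t j = 0; j < N; j++) b[j] = scalar * c[j];        /* Scale */
    for (size_t j = 0; j < N; j++) c[j] = a[j] + b[j];          /* Add   */
    for (size_t j = 0; j < N; j++) a[j] = b[j] + scalar * c[j]; /* Triad */
}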
3 Analysis
3.1 NPB Benchmarks
Figure 1: The execution time of NPB benchmarks. The
green bars represent the execution time without
checkpoints; the red bars represent the execution time
with checkpoints but without fault injections; the blue
bars represent the execution time with checkpoints and
fault injections.
For the NPB benchmarks, we use the largest problem size, class E, in all experiments.
To show the influence of fault tolerance on power consumption, each application was run three times with identical settings except for the checkpoints and faults. The first run records the power information without checkpoints or faults. The second run adds checkpoints but no injected faults. The third run adds both checkpoints and injected faults.
Figure 2: The energy consumption of NPB benchmarks.
Figures 1 and 2 present the effects of checkpoints on execution time and energy consumption, respectively.
The following subsections give a more detailed analysis of the three benchmarks.
Figure 3: NPB: CG power consumption on 1024, 2048, 4096 and 8192 nodes respectively. "w/o FTI" denotes the experiment without checkpoints and faults. "w/ FTI(1,2,3,4)" denotes the experiment with FTI, where the checkpointing frequencies for levels 1 to 4 are 1, 2, 3 and 4 respectively. "w/ FTI(1,2,3,4) I" denotes the experiment with FTI and injected faults.
3.1.1 CG
Figures 3 and 4 show the CG program's power consumption on 1024, 2048, 4096 and 8192 nodes, i.e., 1k, 2k, 4k and 8k MPI ranks, respectively. Since the chip core, DRAM and network domains consume most of the energy, and checkpoints and faults do not significantly influence the other domains, we omit the other domains.
3.1.2 FT
Figures 5 and 6 show FT's power consumption at different scales. The FT experiment ran on 1k, 2k and 4k nodes. We skip the 8k-node experiment because its runtime is too short to gather enough power consumption information.
Figure 4: Box plot comparison of average power
consumption at the node card level for NPB: CG.
Table 1: Incremental percentage of average power, execution time and energy. A negative percentage denotes a decrease; for example, the power consumption of NPB: CG with FTI on 2048 nodes decreases 1.71% compared to that without FTI.
Figure 5: NPB: FT power consumption on 1024, 2048 and 4096 nodes respectively.
Figure 6: Box plot comparison of average power consumption at the node card level for NPB: FT.
Figure 8: Box plot comparison of average power consumption at the node card level for NPB: LU.
Figure 9: The execution time and energy of Flash: Sedov application.
Figure 10: Flash: Sedov power consumption on 512 nodes.
Figure 11: The execution time and energy of STREAM benchmark.
Figure 12: STREAM power consumption on 32 nodes.
3.1.3 LU
Figures 7 and 8 show LU's power consumption at different scales.
3.2 Flash
The built-in checkpoint option in the Flash code enables checkpointing at regular time or step intervals. In this experiment, we compare the power consumption of Sedov run on 512 nodes with checkpoints to that without checkpoints.
Figure 9 presents the execution time and energy consumption of the Flash: Sedov application. Figure 10 shows Flash's power consumption on 512 nodes.
3.3 STREAM
Note that for STREAM, FTI is used to perform regular checkpoints and random fault injection. Figure 11 presents the execution time and energy consumption of STREAM.
Figure 12 shows STREAM's power consumption on 32 nodes.
3.4 Key Observations
Based on the comprehensive analysis of the NPB, Flash and STREAM benchmarks, we summarize the key observations in the following five categories.
3.4.1 With checkpoint or without checkpoint
In most cases, the power consumption remains the same or decreases slightly (from -3.58% to -0.24%) when checkpoints are added without injecting faults. A possible reason is that checkpointing is not a computation-intensive task; hence the chip core power is reduced in most cases, which reduces the total power as well. There are only four exceptions, luE1024, luE2048, ftE2048 and cgE1024, but the differences are trivial and do not exceed 1% in any of them.
The execution time and energy increase after adding checkpoints. Adding checkpoints with the same frequencies has different effects on different applications, and even on the same application running at different scales. For the NPB benchmarks, the increases in time range from 6.41% to 38.85% and the increases in energy range from 4.17% to 27.95%. We can see from Figure 7 that the energy cost is closely related to the execution time and the two follow the same trend. For the STREAM benchmark, adding checkpoints increases the execution time by 33.15% and the energy by 31.99%. STREAM is more sensitive to checkpoints because it is a memory-intensive application. Since there is no local disk on the nodes, FTI saves the level 1, 2 and 3 checkpoints in memory, which causes competition between checkpointing and the progress of STREAM; we also observe higher DRAM power than for the other benchmarks in this study. Hence, checkpoints have a more significant influence on STREAM than on the NPB benchmarks.
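Because energy is the product of average power and execution time, a small drop in power is easily outweighed by a longer runtime. A hypothetical worked example (the numbers are assumed for illustration and are not taken from Table 1):

% E = \bar{P} \cdot t
% If checkpointing lowers the average power by 1% but lengthens the run by 30%:
% E_{ckpt} / E_{base} = (1 - 0.01)(1 + 0.30) = 0.99 \cdot 1.30 \approx 1.287
% i.e. roughly 28.7% more energy despite the slightly lower power draw.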
3.4.2 Inject fault or not
In most cases, injecting faults causes the power consumption to increase or remain the same (from -0.68% to 3.76%) compared to only adding checkpoints without faults. The main increase comes from the chip core power, because the recovery process is computationally intensive. From Table 1, we observe that the average power of CG and LU increases at all four scales. Injecting faults seems to reduce the power consumption for FT; this is because the FT benchmark runs for such a short time that the two experiments perform different numbers of the four checkpoint levels, and the difference in the number of checkpoints has a great influence on short jobs. The STREAM and Flash benchmarks show the same trend.
Another finding is that injecting faults makes the power consumption fluctuate. This is visible in the box plots: there are clearly more outliers and spikes in the power consumption with fault injections. The recovery process first retrieves information from disk and then recomputes from the latest checkpoint. The retrieval is not computationally intensive, but the recomputation is. This interruption explains why there are more ups and downs when faults are injected.
Randomly injected faults have different effects on execution time and energy. Figure 7 shows that injected faults significantly increase the execution time of cgE1024, ftE4096 and luE8192. The logs of these runs show that all of them experienced a fatal error and recovered from a level 4 checkpoint, which is the most time-consuming recovery process. Another observation is that as the number of nodes increases, the recovery process takes more time. For example, luE8192 with a fatal error takes 90.69% more time than the run without errors.
If we ignore the cases that experienced a fatal error, FTI recovery is very efficient. The incremental percentage of execution time ranges from 0.30% to 6.27%, and that of energy from 0.31% to 4.79%. In reality, the failure rate is lower than the rate in our experiments, because we injected faults. Hence, we can say that if there are no errors requiring all nodes to roll back to the last checkpoint, faults have
a trivial effect on execution time and energy. STREAM shows the same behavior: injecting faults without a fatal error does not affect the execution time and energy much.
3.4.3 Number of nodes
For CG and FT, increasing the number of nodes seems to further lower the power with checkpointing compared to that without checkpointing. The difference may be the result of communication delays introduced by checkpointing. Level 2 and level 3 checkpoints are stored on other nodes, which involves the network. More nodes mean more communication and more data to transfer in the system, which may delay the computation. As a result, the chip core power goes down as the number of nodes increases.
3.4.4 Energy
From the energy figures and Table 1, we can see that running an application without checkpoints consumes less energy than running the same application with checkpoints. Likewise, an application with fault injection consumes more energy than one without fault injection.
3.4.5 Time
The column plots show that running an application without checkpoints takes less time, and running an application with faults takes more time than running it without faults.
Time plays a dominant role in energy. Although a larger scale results in a shorter execution time, the total energy consumed by an application increases as the scale grows. From an energy-usage point of view, a smaller scale helps save energy.
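A simple accounting shows why: the total energy is roughly the node count times the average node power times the runtime, so unless adding nodes yields an ideal speedup, energy grows with scale. The numbers below are assumed for illustration only:

% E_{total} \approx N \cdot \bar{P}_{node} \cdot t(N)
% Doubling N from 1024 to 2048 nodes with a (hypothetical) speedup of only 1.6x:
% E_{2048} / E_{1024} = 2 \cdot (t / 1.6) / t = 2 / 1.6 = 1.25
% i.e. about 25% more energy even though the run finishes sooner.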
4 Conclusion
In this paper, we have analyzed the effect of application checkpoints on power consumption. The observations provide insights for designing power- and checkpoint-aware scheduling policies.
References:
[1] Wallace, S., Zhou, Z., Vishwanath, V., Coghlan, S., Tramm, J., Lan, Z., and Papka, M. E., "Application power profiling on IBM Blue Gene/Q," in [IEEE International Conference on Cluster Computing (CLUSTER)], (2013).
[2] “NAS Parallel Benchmarks.”
https://www.nas.nasa.gov/software/npb.html.
(Accessed: 27 August 2021).
[3] “Flash.” http://flash.uchicago.edu/site/.
(Accessed: 27 August 2021).
[4] “Stream.” https://www.cs.virginia.edu/stream/.
(Accessed: 27 August 2021).
[5] “Fault Tolerance Interface (FTI).”
http://leobago.com/projects/. (Accessed: 27
August 2021).
[6] Di, S., Bautista-Gomez, L., and Cappello, F.,
“Optimization of multi-level checkpoint model
with uncertain execution scales,” in
[Proceedings of the International Conference for
High Performance Computing, Networking,
Storage and Analysis (SC)], (2014).
[7] Fan, Y., Rich, P., Allcock, W., Papka, M., and
Lan, Z., “Trade-Off Between Prediction
Accuracy and Underestimation Rate in Job
Runtime Estimates,” in [CLUSTER], (2017).
[8] Qiao, P., Wang, X., Yang, X., Fan, Y., and Lan,
Z., “Preliminary Interference Study About Job
Placement and Routing Algorithms in the Fat-
Tree Topology for HPC Applications,” in
[CLUSTER], (2017).
[9] Allcock, W., Rich, P., Fan, Y., and Lan, Z., "Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne," in [JSSPP], (2017).
[10] Li, B., Chunduri, S., Harms, K., Fan, Y., and
Lan, Z., “The Effect of System Utilization on
Application Performance Variability,” in
[ROSS], (2019).
[11] Fan, Y., Lan, Z., Rich, P., Allcock, W., Papka,
M., Austin, B., and Paul, D., “Scheduling
Beyond CPUs for HPC,” in [HPDC], (2019).
[12] Yu, L., Zhou, Z., Fan, Y., Papka, M., and Lan,
Z., “System-wide Trade-off Modeling of
Performance, Power, and Resilience on
Petascale Systems,” in [The Journal of
Supercomputing], (2018).
[13] Fan, Y. and Lan, Z., “Exploiting Multi-Resource
Scheduling for HPC,” in [SC Poster], (2019).
[14] Qiao, P., Wang, X., Yang, X., Fan, Y., and Lan,
Z., “Joint Effects of Application
Communication Pattern, Job Placement and
Network Routing on Fat-Tree Systems,” in
[ICPP Workshops], (2018).
[15] Fan, Y., Lan, Z., Childers, T., Rich, P., Allcock,
W., and Papka, M., “Deep Reinforcement Agent
for Scheduling in HPC,” in [IPDPS], (2021).
[16] Fan, Y. and Lan, Z., “DRAS-CQSim: A
Reinforcement Learning based Framework for
HPC Cluster Scheduling,” in [Software
Impacts], (2021).
[17] Fan, Y., Rich, P., Allcock, W., Papka, M., and
Lan, Z., “ROME: A Multi-Resource Job
Scheduling Framework for Exascale HPC
System,” in [IPDPS poster], (2018).
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative
Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US