a trivial effect on execution time and energy. Stream exhibits the same behavior: injecting faults that do not cause a fatal error has little impact on execution time and energy.
3.4.3 Number of nodes
For CG and FT, increasing the number of nodes appears to further lower the power of runs with checkpointing relative to runs without checkpointing. The difference may stem from the communication delay introduced by checkpointing: level 2 and level 3 checkpoints are stored on other nodes, which involves the network. More nodes mean more communication and more data to transfer across the system, which can stall the computation. As a result, the chip core power drops as the number of nodes increases.
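To make the source of this network traffic concrete, below is a minimal sketch of multi-level checkpointing with FTI [5] in an MPI application. The configuration file name, checkpoint interval, array size, and level choice are illustrative assumptions, not the settings used in our experiments.

/* Minimal FTI sketch (illustrative settings, not our experimental
 * configuration). Level 2 copies each checkpoint to a partner node
 * and level 3 encodes it across nodes, so both generate network
 * traffic that grows with the node count. */
#include <fti.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    /* "config.fti" (an assumed file name) defines the checkpoint
     * levels and their storage targets. */
    FTI_Init("config.fti", MPI_COMM_WORLD);

    double state[1024] = {0};
    int i = 0;

    /* Register the data that every checkpoint must capture. */
    FTI_Protect(0, &i, 1, FTI_INTG);
    FTI_Protect(1, state, 1024, FTI_DBLE);

    /* On restart after a failure, reload the protected data. */
    if (FTI_Status() != 0) {
        FTI_Recover();
    }

    for (; i < 10000; i++) {
        /* ... computation on state (application communication
         * should use FTI_COMM_WORLD) ... */
        if (i % 100 == 0) {
            FTI_Checkpoint(i / 100 + 1, 2); /* level-2 checkpoint */
        }
    }

    FTI_Finalize();
    MPI_Finalize();
    return 0;
}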
3.4.4 Energy
From the energy figures and Table 1, we can see that running an application without checkpoints consumes less energy than running the same application with checkpoints. Likewise, an application with fault injection consumes more energy than the same application without fault injection.
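For clarity, the energy reported here is the time integral of power. The sketch below shows one common way to compute it from a sampled power profile using the trapezoidal rule; the sample values are made up for illustration and are not our measurements.

/* Energy as the time integral of power, approximated with the
 * trapezoidal rule over (time, power) samples. Illustrative only. */
#include <stdio.h>

double energy_joules(const double *time_s, const double *power_w, int n) {
    double e = 0.0;
    for (int k = 1; k < n; k++)
        e += 0.5 * (power_w[k] + power_w[k - 1]) * (time_s[k] - time_s[k - 1]);
    return e;
}

int main(void) {
    double t[] = {0.0, 1.0, 2.0, 3.0};      /* seconds */
    double p[] = {95.0, 100.0, 98.0, 97.0}; /* watts (made up) */
    printf("energy = %.1f J\n", energy_joules(t, p, 4));
    return 0;
}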
3.4.5 Time
The column plots show that running an application without checkpoints takes less time, and that running an application with injected faults takes more time than running it without faults.
Execution time plays the dominant role in energy consumption. Although running at a larger scale shortens execution time, the total energy consumed by the application increases as the scale grows. From an energy-usage perspective, a smaller scale helps save energy.
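One way to see why time dominates, assuming for illustration a roughly constant average per-node power \bar{P}:

% Total energy at scale n (n nodes, execution time t(n)):
E(n) = n \cdot \bar{P} \cdot t(n)
% With sublinear speedup, t(n) > t(1)/n, hence
E(n) > \bar{P} \cdot t(1) = E(1),
% so total energy grows with scale even though t(n) shrinks.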
4 Conclusion
In this paper, we have analyzed the effect of application checkpointing on power consumption. The observations provide insight for designing power- and checkpoint-aware scheduling policies.
References:
[1] Wallace, S., Zhou, Z., Vishwanath, V., Coghlan, S., Tramm, J., Lan, Z., and Papka, M. E., “Application Power Profiling on IBM Blue Gene/Q,” in [IEEE International Conference on Cluster Computing (CLUSTER)], (2013).
[2] “NAS Parallel Benchmarks.”
https://www.nas.nasa.gov/software/npb.html.
(Accessed: 27 August 2021).
[3] “Flash.” http://flash.uchicago.edu/site/.
(Accessed: 27 August 2021).
[4] “Stream.” https://www.cs.virginia.edu/stream/.
(Accessed: 27 August 2021).
[5] “Fault Tolerance Interface (FTI).”
http://leobago.com/projects/. (Accessed: 27
August 2021).
[6] Di, S., Bautista-Gomez, L., and Cappello, F., “Optimization of Multi-Level Checkpoint Model with Uncertain Execution Scales,” in [Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC)], (2014).
[7] Fan, Y., Rich, P., Allcock, W., Papka, M., and
Lan, Z., “Trade-Off Between Prediction
Accuracy and Underestimation Rate in Job
Runtime Estimates,” in [CLUSTER], (2017).
[8] Qiao, P., Wang, X., Yang, X., Fan, Y., and Lan,
Z., “Preliminary Interference Study About Job
Placement and Routing Algorithms in the Fat-
Tree Topology for HPC Applications,” in
[CLUSTER], (2017).
[9] Allcock, W., Rich, P., Fan, Y., and Lan, Z., “Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne,” in [JSSPP], (2017).
[10] Li, B., Chunduri, S., Harms, K., Fan, Y., and
Lan, Z., “The Effect of System Utilization on
Application Performance Variability,” in
[ROSS], (2019).
[11] Fan, Y., Lan, Z., Rich, P., Allcock, W., Papka,
M., Austin, B., and Paul, D., “Scheduling
Beyond CPUs for HPC,” in [HPDC], (2019).
[12] Yu, L., Zhou, Z., Fan, Y., Papka, M., and Lan,
Z., “System-wide Trade-off Modeling of
Performance, Power, and Resilience on
Petascale Systems,” in [The Journal of
Supercomputing], (2018).
[13] Fan, Y. and Lan, Z., “Exploiting Multi-Resource
Scheduling for HPC,” in [SC Poster], (2019).
[14] Qiao, P., Wang, X., Yang, X., Fan, Y., and Lan,
Z., “Joint Effects of Application
Communication Pattern, Job Placement and
Network Routing on Fat-Tree Systems,” in
[ICPP Workshops], (2018).
[15] Fan, Y., Lan, Z., Childers, T., Rich, P., Allcock,
W., and Papka, M., “Deep Reinforcement Agent
for Scheduling in HPC,” in [IPDPS], (2021).
[16] Fan, Y. and Lan, Z., “DRAS-CQSim: A
Reinforcement Learning based Framework for
HPC Cluster Scheduling,” in [Software
Impacts], (2021).
[17] Fan, Y., Rich, P., Allcock, W., Papka, M., and Lan, Z., “ROME: A Multi-Resource Job Scheduling Framework for Exascale HPC System,” in [IPDPS Poster], (2018).