Energy-efficient algebra kernels in FPGA for High Performance Computing
Keywords: dense and sparse NLA, FPGA, HLS, energy consumption
The dissemination of multi-core architectures and the subsequent emergence of massively parallel devices have revolutionized High-Performance Computing (HPC) platforms over the last decades. In this context, Field-Programmable Gate Arrays (FPGAs) are re-emerging as a versatile and more energy-efficient alternative to other platforms. Traditional FPGA design relies on low-level Hardware Description Languages (HDLs) such as VHDL or Verilog, which follow a programming model entirely different from that of standard software languages and require specialized knowledge of the underlying hardware. In recent years, manufacturers have made significant efforts to provide High-Level Synthesis (HLS) tools that enable a greater adoption of FPGAs in the HPC community.
Our work studies the use of multi-core hardware and different FPGAs to address Numerical Linear Algebra (NLA) kernels such as the general matrix multiplication (GEMM) and the sparse matrix-vector multiplication (SpMV). Specifically, we compare the behavior of fine-tuned kernels on a multi-core CPU with HLS implementations on FPGAs. We experimentally evaluate our implementations on a low-end and a cutting-edge FPGA platform, in terms of runtime and energy consumption, and compare the results against the Intel MKL library on the CPU.
Copyright (c) 2021 Federico Favaro, Ernesto Dufrechou, Pablo Ezzatti, Juan Pablo Oliver
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.