#4 – Exploiting the Capabilities of Distributed Multi-Core Intel Processors for Accelerating Dense Linear Algebra

Fatma S. Ahmed and Mostafa I. Soliman. Exploiting the Capabilities of Distributed Multi-Core Intel Processors for Accelerating Dense Linear Algebra. Neural, Parallel, and Scientific Computations 28 (2020), No. 3, 198–222

https://doi.org/10.46719/npsc20202834

Abstract.
This paper exploits the capabilities of distributed multi-core Intel processors to accelerate dense linear algebra, which is used in most scientific computing calculations. Some kernels from BLAS (applying Givens rotation, rank-1 update, and matrix multiplication) and SVD are implemented and evaluated on the target system (a cluster of Fujitsu Siemens CELSIUS R550 multi-core Intel processors). On a quad-core Intel Xeon E5410 processor running at 2.33 GHz, the maximum performance of applying Givens rotation (Level-1 BLAS) is improved from 2.10 to 8.08, 3.00, and 3.64 GFLOPS using SIMD, multi-threading, and multi-threading SIMD techniques, respectively. However, using MPI on multiple nodes degrades the performance because the network overhead of sending data and receiving results dominates the overall execution time. For the same reason, the performance of the rank-1 update (Level-2 BLAS) with multi-threading, SIMD, and blocking techniques degrades from 2.33 to 4.37×10⁻² GFLOPS when eight nodes are used for parallel processing. The speedups of the traditional matrix-matrix multiplication (Level-3 BLAS) on a single quad-core Intel Xeon E5410 processor over the sequential execution when applying SIMD, multi-threading, multi-threading SIMD, and multi-threading SIMD blocking techniques are 3.76, 3.91, 7.37, and 12.65, respectively. Moreover, on ten nodes, the performance of traditional matrix-matrix multiplication reaches 99.73 GFLOPS. Finally, executing block Jacobi and hierarchical block Jacobi on eight nodes with SIMD and multi-threading techniques gives performances of 206.97 and 515.62 GFLOPS, respectively. The corresponding speedups over sequential one-sided Jacobi are 49.8 and 124.

Keywords – ILP/DLP/TLP; MPI; SVD; SIMD; multi-threading.
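
For readers unfamiliar with the three BLAS kernels named in the abstract, the following is a minimal sequential sketch of what each one computes. It is written in plain Python for readability only and is not the paper's implementation, which relies on SIMD, multi-threading, and MPI. All function names (`givens`, `apply_givens`, `rank1_update`, `matmul`) are hypothetical and chosen for illustration.

```python
import math

def givens(a, b):
    """Return (c, s) such that rotating the pair (a, b) yields (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0  # nothing to rotate; use the identity rotation
    return a / r, b / r

def apply_givens(x, y, c, s):
    """Level-1 BLAS style: rotate vectors x and y in place.

    Each element pair costs 4 multiplications and 2 additions.
    """
    for i in range(len(x)):
        xi, yi = x[i], y[i]
        x[i] = c * xi + s * yi
        y[i] = c * yi - s * xi

def rank1_update(A, x, y, alpha=1.0):
    """Level-2 BLAS style: A += alpha * x * y^T, updating A in place."""
    for i, xi in enumerate(x):
        axi = alpha * xi  # hoisted so the inner loop is one multiply-add
        row = A[i]
        for j, yj in enumerate(y):
            row[j] += axi * yj

def matmul(A, B):
    """Level-3 BLAS style: C = A * B using the classic triple loop."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = A[i][p]
            for j in range(m):
                C[i][j] += aip * B[p][j]
    return C
```

The three kernels differ in how much arithmetic they perform per data element: the Givens rotation and the rank-1 update do a constant amount of work per element touched, while matrix multiplication reuses each element many times. This is consistent with the abstract's observation that network overhead dominates the Level-1 and Level-2 kernels on multiple nodes, whereas matrix multiplication continues to scale.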