Cache optimization and performance modeling of batched, small, and rectangular matrix multiplication on Intel, AMD, and Fujitsu processors
From MaRDI portal
Publication: Q6601380
DOI: 10.1145/3595178
MaRDI QID: Q6601380
Rio Yokota, George Bosilca, Sameer Deshmukh
Publication date: 10 September 2024
Published in: ACM Transactions on Mathematical Software
Cites Work
- Title not available
- Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions
- High-performance matrix-matrix multiplications of very small matrices
- Communication lower bounds for distributed-memory matrix multiplication
- When cache blocking of sparse matrix vector multiply works and why
- BLIS: a framework for rapidly instantiating BLAS functionality
- Hierarchical Matrices: Algorithms and Analysis
- Cache-Oblivious Algorithms
- Stabilized rounded addition of hierarchical matrices
- Anatomy of high-performance matrix multiplication
- BLASFEO
- The BLAS API of BLASFEO
- Strategies for the Vectorized Block Conjugate Gradients Method
- Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs
- A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization
- Analytical Modeling Is Enough for High-Performance BLIS
- FLAME