Chetan Gupta · Janne H. Korhonen · Jan Studený · Jukka Suomela · Hossein Vahidi

Low-bandwidth matrix multiplication: Faster algorithms and more general forms of sparsity

SIROCCO 2025 · 32nd International Colloquium on Structural Information and Communication Complexity, Delphi, Greece, June 2025

Authors’ version · arXiv.org

Abstract

In prior work, Gupta et al. (SPAA 2022) presented a distributed algorithm for multiplying sparse $n \times n$ matrices, using $n$ computers. They assumed that the input matrices are uniformly sparse—there are at most $d$ non-zeros in each row and column—and the task is to compute a uniformly sparse part of the product matrix. The sparsity structure is globally known in advance (this is the supported setting). As input, each computer receives one row of each input matrix, and each computer needs to output one row of the product matrix. In each communication round each computer can send and receive one $O(\log n)$-bit message. Their algorithm solves this task in $O(d^{1.907})$ rounds, while the trivial bound is $O(d^2)$.
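To make the trivial $O(d^2)$ bound concrete, here is a hedged sequential sketch (an illustration only, not the paper’s algorithm): computer $i$ holds row $i$ of $A$ and row $i$ of $B$, and to produce row $i$ of $X = AB$ it can fetch, one entry per round, each entry $B[j][k]$ with $A[i][j] \neq 0$ and $k$ in the supported output pattern of row $i$. The function name and input encoding below are hypothetical.

```python
# Hypothetical sketch of the trivial strategy: count how many entries of B
# computer i must fetch to compute its output row of X = AB.
# A_rows[i] is row i of A as a list of numbers; X_support[i] is the list of
# column indices k where X[i][k] may be non-zero (the supported pattern).

def trivial_round_count(A_rows, X_support, i):
    """Entries B[j][k] that computer i needs, fetched one per round."""
    fetched = set()
    for j, a_ij in enumerate(A_rows[i]):
        if a_ij == 0:
            continue  # only non-zero A[i][j] contribute to row i of X
        for k in X_support[i]:
            fetched.add((j, k))  # entry B[j][k] is held by computer j
    return len(fetched)  # at most d * d when A and X are uniformly d-sparse
```

With at most $d$ non-zeros in row $i$ of $A$ and at most $d$ supported entries in row $i$ of $X$, the count is at most $d^2$, which is the trivial round bound the improved algorithms beat.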

We improve on the prior work in two dimensions: First, we show that we can solve the same task faster, in only $O(d^{1.832})$ rounds. Second, we explore what happens when matrices are not uniformly sparse. We consider the following alternative notions of sparsity: row-sparse matrices (at most $d$ non-zeros per row), column-sparse matrices, matrices with bounded degeneracy (we can recursively delete a row or column with at most $d$ non-zeros), average-sparse matrices (at most $dn$ non-zeros in total), and general matrices.
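Of these notions, bounded degeneracy is the least standard, so a small sketch may help. A matrix has degeneracy at most $d$ if rows and columns can be deleted one at a time, each having at most $d$ non-zeros among the still-surviving lines, until nothing remains; greedy peeling decides this, since deletions never increase the non-zero count of a remaining line. The function below is a hypothetical illustration (not from the paper), taking the sparsity pattern as a list of lists.

```python
# Hypothetical check: does a sparsity pattern have degeneracy at most d?
# Greedily peel any row or column with at most d non-zeros among the
# surviving lines; the pattern has degeneracy <= d iff everything peels.

def has_degeneracy_at_most(pattern, d):
    rows = set(range(len(pattern)))
    cols = set(range(len(pattern[0]))) if pattern else set()
    changed = True
    while changed:
        changed = False
        for i in list(rows):
            if sum(pattern[i][j] != 0 for j in cols) <= d:
                rows.discard(i)  # row i has <= d surviving non-zeros
                changed = True
        for j in list(cols):
            if sum(pattern[i][j] != 0 for i in rows) <= d:
                cols.discard(j)  # column j has <= d surviving non-zeros
                changed = True
    return not rows and not cols
```

For example, the identity matrix has degeneracy 1, while the all-ones $n \times n$ matrix has degeneracy $n$: no line can be peeled until the threshold reaches $n$. A uniformly $d$-sparse matrix trivially has degeneracy at most $d$, so bounded degeneracy strictly generalizes uniform sparsity.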

We present a near-complete classification of the complexity of matrix multiplication for all combinations of these notions of sparsity. We show that almost all cases fall into one of the following four classes:

  1. We have an upper bound of $O(d^{1.832})$ rounds for computing $X = AB$. An example is the case where $X$ and $A$ are uniformly sparse but $B$ is average-sparse; this is a generalization of the prior work beyond the uniformly-sparse case.
  2. We have a lower bound of $\Omega(\log n)$ and an upper bound of $O(d^2 + \log n)$ rounds for computing $X = AB$. An example is the case where $X$, $A$, and $B$ have bounded degeneracy.
  3. We have a lower bound of $n^{\Omega(1)}$ rounds for computing $X = AB$. An example is the case where $X$ and $A$ have bounded degeneracy but $B$ is a general matrix.
  4. We have a conditional lower bound: a fast algorithm for computing $X = AB$ would imply major improvements in dense matrix multiplication. An example is the case where $X$, $A$, and $B$ are average-sparse matrices.

Our work highlights the role that bounded degeneracy has in the context of distributed matrix multiplication: it is a natural class of sparse matrices that is restrictive enough to admit very fast algorithms—much faster than what we can expect for the average-sparse case.

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.