Cluster-wise SpGEMM

Introduction

Sparse matrix-sparse matrix multiplication (SpGEMM) is a key kernel in many scientific applications and graph workloads. Unfortunately, SpGEMM is bottlenecked by data movement due to its irregular memory access patterns. We address these issues with hierarchical clustering for SpGEMM that leverages both row reordering and cluster-wise computation to improve reuse in the second input (B) matrix with a novel row-clustered matrix format and access pattern in the first input (A) matrix.

This library provides a shared memory parallel implementation of different Cluster-wise SpGEMM kernels. For more information, please read our preprint paper "Improving SpGEMM Performance Through Matrix Reordering and Cluster-wise Computation" on arXiv.

Requirements

Hardware

The experiments were executed using one CPU node on the Perlmutter supercomputer at NERSC, utilizing 64 threads. Each Perlmutter CPU node consists of two AMD EPYC 7763 (Milan) processors, with a total of 512 GB DDR4 memory. Each processor features 64 cores, 204.8 GB/s memory bandwidth, a 64 MiB L2 cache, and access to the full memory pool. Similar results are expected on other hardware platforms.

Software

The following software is required to compile and run the code (mentioned versions are those used in our experiments):

Intel C++ Compiler (icpc) (version 2024.1.0)
OpenMP
Intel Threading Building Blocks (TBB) (version 2019.9)
GNU Make (version 4.2.1)

Using Cluster-wise SpGEMM

All the SpGEMM implementations use hashtable as the sparse accumulator.

SpGEMM Kernels

hash_mult.h: Row-wise SpGEMM implementation.
hash_mult_flengthcluster.h: Fixed-length cluster-wise SpGEMM implementation.
hash_mult_vlengthcluster.h: Variable-length cluster-wise SpGEMM implementation.

Data structure

CSC.h: Compressed Sparse Column format implementation
CSR.h: Compressed Sparse Row format implementation
CSR_FlengthCluster.h: CSR_Cluster format implementation for fixed-length clusters
CSR_VlengthCluster.h: CSR_Cluster format implementation for variable-length clusters
BIN.h: the data structure for managing load balance in row-wise SpGEMM
BIN_FlengthCluster.h: the data structure for managing load balance in fixed-length cluster-wise SpGEMM
BIN_VlengthCluster.h: the data structure for managing load balance in variable-length cluster-wise SpGEMM

Sample codes

Files under sample directory: Sample codes with main function reading matrix data files, and then compute SpGEMM.

Implementation	Description	Input Parameter
RowSpGEMM	Row-wise SpGEMM	`text <matrix-A> <matrix-B>`
ReorderedRowSpGEMM	Row-wise SpGEMM (with reordering)	`text <matrix-A> <row-ordering of matrix-A>`
FlengthClusterSpGEMM	Fixed-length cluster-wise SpGEMM	`text <matrix-A> <matrix-B>`
ReorderedFlengthClusterSpGEMM	Fixed-length cluster-wise SpGEMM (with reordering)	`text <matrix-A> <row-ordering of matrix-A>`
VlengthClusterSpGEMM	Variable-length cluster-wise SpGEMM	`text <matrix-A> <matrix-B>`
ReorderedVlengthClusterSpGEMM	Variable-length cluster-wise SpGEMM (with reordering)	`text <matrix-A> <row-ordering of matrix-A>`
HierarchicalClusterSpGEMM	Hierarchical cluster-wise SpGEMM	`text <matrix-A> <matrix-B> <candidate close-pairs in matrix-A>`
tsRowSpGEMM	Row-wise SpGEMM for tall-skinny matrix	`text <matrix-A> <directory path of tall-skinny matrix-B>`
tsReorderedRowSpGEMM	Row-wise SpGEMM (with reordering) for tall-skinny matrix	`text <matrix-A> <row-ordering of matrix-A> <directory path of tall-skinny matrix-B>`
tsHierarchicalClusterSpGEMM	Hierarchical cluster-wise SpGEMM for tall-skinny matrix	`text <matrix-A> <directory path of tall-skinny matrix-B> <candidate close-pairs in matrix-A>`

Two utility codes are also placed in the sample directory. These code is used to generate candidate close-pairs and randomly shuffle the row-ids of an input matrix.

Implementation	Description	Input Parameter
GenerateCandidatePairs	Candidate close-pair generator using AxA^T (keep top-k per-row)	`text <matrix-A> <matrix-A> <top-k> -s`
ShuffleRowIDs	Randomly shuffling the row-ids of a input matrix	`text <matrix-A> -s`

Performance Tip

We recommend configuring your environment for OpenMP based on your hardware by setting the environment variables, including but not limited to OMP_NUM_THREADS, OMP_PROC_BIND, and OMP_PLACES.

How to use in your program

Copy all the header files (*.h) in this directory to your program. Below is an example of how to use the hierarchical cluster-wise SpGEMM in your program:

#include "utility.h"
#include "cluster_utility.h"
#include "multiply.h"
#include "CSR.h"
#include "CSR_VlengthCluster.h"
#include "hash_mult_vlengthcluster.h"

Replicating Our Benchmarks

###Get the reordering code:

> git clone https://github.com/PASSIONLab/clusterwise-spgemm.git
> export CLUSTERWISE_ROOT=$(pwd)/clusterwise-spgemm

> cd $CLUSTERWISE_ROOT

Build the code:

Before compiling the code, first modify the makefile with correct path to Intel Compiler.

> make clean && make sample_hw

Get the datasets:

If you haven’t downloaded the datasets yet, visit reordering-spgemm for instructions on downloading them and generating different matrix reorderings.

> export DATA_PATH=/pscratch/sd/r/raqib/dataset-spgemm

Get random reordering by shuffling the order of rows:

> export SHUFFLED_DATA_PATH=$DATA_PATH/reordering/shuffle_order
> mkdir -p $SHUFFLED_DATA_PATH
> sh scripts/reordering/reorder_random.sh > scripts/reordering/reorder_random.out

Validate if all the reordering paths are correctly set in the environment.

> export RABBIT_DATA_PATH=$DATA_PATH/reordering/rabbit_order
> export AMD_DATA_PATH=$DATA_PATH/reordering/amd_order
> export RCM_DATA_PATH=$DATA_PATH/reordering/rcm_order
> export ND_DATA_PATH=$DATA_PATH/reordering/nd_order
> export GP_DATA_PATH=$DATA_PATH/reordering/gp_order
> export HP_DATA_PATH=$DATA_PATH/reordering/hp_order
> export GRAY_DATA_PATH=$DATA_PATH/reordering/gray_order
> export DEGREE_DATA_PATH=$DATA_PATH/reordering/degree_order
> export SLASHBURN_DATA_PATH=$DATA_PATH/reordering/slashburn_order

Run benchmark on row-wise SpGEMM:

> sh scripts/rowwise/1_rowwise_original.sh > scripts/rowwise/1_rowwise_original.out
> sh scripts/rowwise/2_rowwise_random.sh > scripts/rowwise/2_rowwise_random.out
> sh scripts/rowwise/3_rowwise_rabbit.sh > scripts/rowwise/3_rowwise_rabbit.out
> sh scripts/rowwise/4_rowwise_amd.sh > scripts/rowwise/4_rowwise_amd.out
> sh scripts/rowwise/5_rowwise_rcm.sh > scripts/rowwise/5_rowwise_rcm.out
> sh scripts/rowwise/6_rowwise_nd.sh > scripts/rowwise/6_rowwise_nd.out
> sh scripts/rowwise/7_rowwise_gp.sh > scripts/rowwise/7_rowwise_gp.out
> sh scripts/rowwise/8_rowwise_hp.sh > scripts/rowwise/8_rowwise_hp.out
> sh scripts/rowwise/9_rowwise_gray.sh > scripts/rowwise/9_rowwise_gray.out
> sh scripts/rowwise/10_rowwise_degree.sh > scripts/rowwise/10_rowwise_degree.out
> sh scripts/rowwise/11_rowwise_slashburn.sh > scripts/rowwise/11_rowwise_slashburn.out

Run benchmark on fixed-length cluster-wise SpGEMM:

> sh scripts/flength_clusterwise/1_flength_original.sh > scripts/flength_clusterwise/1_flength_original.out
> sh scripts/flength_clusterwise/2_flength_random.sh > scripts/flength_clusterwise/2_flength_random.out
> sh scripts/flength_clusterwise/3_flength_rabbit.sh > scripts/flength_clusterwise/3_flength_rabbit.out
> sh scripts/flength_clusterwise/4_flength_amd.sh > scripts/flength_clusterwise/4_flength_amd.out
> sh scripts/flength_clusterwise/5_flength_rcm.sh > scripts/flength_clusterwise/5_flength_rcm.out
> sh scripts/flength_clusterwise/6_flength_nd.sh > scripts/flength_clusterwise/6_flength_nd.out
> sh scripts/flength_clusterwise/7_flength_gp.sh > scripts/flength_clusterwise/7_flength_gp.out
> sh scripts/flength_clusterwise/8_flength_hp.sh > scripts/flength_clusterwise/8_flength_hp.out
> sh scripts/flength_clusterwise/9_flength_gray.sh > scripts/flength_clusterwise/9_flength_gray.out
> sh scripts/flength_clusterwise/10_flength_degree.sh > scripts/flength_clusterwise/10_flength_degree.out
> sh scripts/flength_clusterwise/11_flength_slashburn.sh > scripts/flength_clusterwise/11_flength_slashburn.out

Run benchmark on variable-length cluster-wise SpGEMM:

> sh scripts/vlength_clusterwise/1_vlength_original.sh > scripts/vlength_clusterwise/1_vlength_original.out
> sh scripts/vlength_clusterwise/2_vlength_random.sh > scripts/vlength_clusterwise/2_vlength_random.out
> sh scripts/vlength_clusterwise/3_vlength_rabbit.sh > scripts/vlength_clusterwise/3_vlength_rabbit.out
> sh scripts/vlength_clusterwise/4_vlength_amd.sh > scripts/vlength_clusterwise/4_vlength_amd.out
> sh scripts/vlength_clusterwise/5_vlength_rcm.sh > scripts/vlength_clusterwise/5_vlength_rcm.out
> sh scripts/vlength_clusterwise/6_vlength_nd.sh > scripts/vlength_clusterwise/6_vlength_nd.out
> sh scripts/vlength_clusterwise/7_vlength_gp.sh > scripts/vlength_clusterwise/7_vlength_gp.out
> sh scripts/vlength_clusterwise/8_vlength_hp.sh > scripts/vlength_clusterwise/8_vlength_hp.out
> sh scripts/vlength_clusterwise/9_vlength_gray.sh > scripts/vlength_clusterwise/9_vlength_gray.out
> sh scripts/vlength_clusterwise/10_vlength_degree.sh > scripts/vlength_clusterwise/10_vlength_degree.out
> sh scripts/vlength_clusterwise/11_vlength_slashburn.sh > scripts/vlength_clusterwise/11_vlength_slashburn.out

Generate candidate close-pairs for hierarchical cluster-wise SpGEMM:

> export CLOSE_PAIR_DATA_PATH=$DATA_PATH/reordering/close_pairs
> mkdir -p $CLOSE_PAIR_DATA_PATH
> sh scripts/h_clusterwise/1_generate_close_pairs.sh > scripts/h_clusterwise/1_generate_close_pairs.out

Run benchmark on hierarchical cluster-wise SpGEMM:

> sh scripts/h_clusterwise/2_hierarchical_spgemm.sh > scripts/h_clusterwise/2_hierarchical_spgemm.out

Run benchmark on Tall-skinny matrices:

If the tall-skinny matrices haven’t been generated yet, follow the steps below to produce them from the BC frontiers output by CombBLAS.

Tall-skinny Row-wise SpGEMM:

> sh scripts/tall_skinny/rowwise/1_ts_original.sh > scripts/tall_skinny/rowwise/1_ts_original.out
> sh scripts/tall_skinny/rowwise/2_ts_random.sh > scripts/tall_skinny/rowwise/2_ts_random.out
> sh scripts/tall_skinny/rowwise/3_ts_rabbit.sh > scripts/tall_skinny/rowwise/3_ts_rabbit.out
> sh scripts/tall_skinny/rowwise/4_ts_amd.sh > scripts/tall_skinny/rowwise/4_ts_amd.out
> sh scripts/tall_skinny/rowwise/5_ts_rcm.sh > scripts/tall_skinny/rowwise/5_ts_rcm.out
> sh scripts/tall_skinny/rowwise/6_ts_nd.sh > scripts/tall_skinny/rowwise/6_ts_nd.out
> sh scripts/tall_skinny/rowwise/7_ts_gp.sh > scripts/tall_skinny/rowwise/7_ts_gp.out
> sh scripts/tall_skinny/rowwise/8_ts_hp.sh > scripts/tall_skinny/rowwise/8_ts_hp.out
> sh scripts/tall_skinny/rowwise/9_ts_gray.sh > scripts/tall_skinny/rowwise/9_ts_gray.out
> sh scripts/tall_skinny/rowwise/10_ts_degree.sh > scripts/tall_skinny/rowwise/10_ts_degree.out
> sh scripts/tall_skinny/rowwise/11_ts_slashburn.sh > scripts/tall_skinny/rowwise/11_ts_slashburn.out

Tall-skinny Cluster-wise SpGEMM:

> sh scripts/tall_skinny/h_clusterwise/1_ts_hierarchical_spgemm.sh > sh scripts/tall_skinny/h_clusterwise/1_ts_hierarchical_spgemm.out

Generating Tall-skinny Matrix

In our experiments, we generate the tall-skinny matrices from BFS frontiers produced by CombBLAS during Betweenness Centrality (BC) computations. As the number of frontiers varies among datasets, we only take the first 10 forward frontier matrices.

Here is the example code to add in the CombBLAS/Applications/BetwCent.cpp to save the first 10 forward frontiers:

SpParHelper::Print("Exploring via BFS...\n");
int iter = 0;
while( fringe.getnnz() > 0 )
{
    nsp += fringe;
    Dist<bool>::MPI_DCCols * level = new Dist<bool>::MPI_DCCols( fringe ); 
    bfs.push_back(level);
    
    if(iter < 10) {
      string frontier_filename = 
          frontier_dir + "/" + string("batch_") + std::to_string(i) + 
          string("_forwarditer_") + std::to_string(iter) + string(".mtx");
      fringe.ParallelWriteMM(frontier_filename, true);
      iter++;
    }

    fringe = PSpGEMM<PTBOOLINT>(AT, fringe);
    fringe = EWiseMult(fringe, nsp, true);
}

The full implementation for saving the first ten frontier matrices (forward and backward) is available in the following commits: forward and backward.

To generate all the tall-skinny matrices used in our experiments, follow these steps:

# Step-1: Download the CombBLAS library (with the saving forward frontier matrices code)
> git clone https://github.com/biqar/CombBLAS.git
> export COMBBLAS_ROOT=$(pwd)/CombBLAS
> cd $COMBBLAS_ROOT

# Step-2: Checkout to the commit that contains the saving frontier matrices code
> git checkout save_forward_frontier_mtx

# Step-3: Compile and Install CombBLAS
> mkdir _build
> mkdir _install
> cd _build
> cmake .. -DCMAKE_INSTALL_PREFIX=../_install
> make
> make install

# Step-4: Generate the tall-skinny matrix for the input graphs used in our experiments
> export TS_DATA_PATH=$DATA_PATH/tall_skinny
> mkdir -p $TS_DATA_PATH
> sh scripts/tall_skinny/generator/1_save_bc_frontier_matrices.sh

Citation and Acknowledgments

If you find this code useful, please cite our paper:

@article{cluster-wise-spgemm,
  title={Improving SpGEMM Performance Through Matrix Reordering and Cluster-wise Computation},
  author={Islam, Abdullah Al Raqibul and Xu, Helen and Dai, Dong and Bulu{\c{c}}, Ayd{\i}n},
  journal={arXiv preprint arXiv:2507.21253},
  year={2025}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Cluster-wise SpGEMM

Introduction

Requirements

Hardware

Software

Using Cluster-wise SpGEMM

SpGEMM Kernels

Data structure

Sample codes

Performance Tip

How to use in your program

Replicating Our Benchmarks

Build the code:

Get the datasets:

Run benchmark on row-wise SpGEMM:

Run benchmark on fixed-length cluster-wise SpGEMM:

Run benchmark on variable-length cluster-wise SpGEMM:

Generate candidate close-pairs for hierarchical cluster-wise SpGEMM:

Run benchmark on hierarchical cluster-wise SpGEMM:

Run benchmark on Tall-skinny matrices:

Tall-skinny Row-wise SpGEMM:

Tall-skinny Cluster-wise SpGEMM:

Generating Tall-skinny Matrix

Citation and Acknowledgments

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
GTgraph		GTgraph
sample		sample
scripts		scripts
.gitignore		.gitignore
BIN.h		BIN.h
BIN_FlengthCluster.h		BIN_FlengthCluster.h
BIN_VlengthCluster.h		BIN_VlengthCluster.h
CSC.h		CSC.h
CSR.h		CSR.h
CSR_FlengthCluster.h		CSR_FlengthCluster.h
CSR_VlengthCluster.h		CSR_VlengthCluster.h
IO.h		IO.h
LICENSE.md		LICENSE.md
README.md		README.md
Triple.h		Triple.h
cluster_utility.h		cluster_utility.h
hash_mult.h		hash_mult.h
hash_mult_flengthcluster.h		hash_mult_flengthcluster.h
hash_mult_vlengthcluster.h		hash_mult_vlengthcluster.h
makefile		makefile
multiply.h		multiply.h
utility.h		utility.h

License

PASSIONLab/clusterwise-spgemm

Folders and files

Latest commit

History

Repository files navigation

Cluster-wise SpGEMM

Introduction

Requirements

Hardware

Software

Using Cluster-wise SpGEMM

SpGEMM Kernels

Data structure

Sample codes

Performance Tip

How to use in your program

Replicating Our Benchmarks

Build the code:

Get the datasets:

Run benchmark on row-wise SpGEMM:

Run benchmark on fixed-length cluster-wise SpGEMM:

Run benchmark on variable-length cluster-wise SpGEMM:

Generate candidate close-pairs for hierarchical cluster-wise SpGEMM:

Run benchmark on hierarchical cluster-wise SpGEMM:

Run benchmark on Tall-skinny matrices:

Tall-skinny Row-wise SpGEMM:

Tall-skinny Cluster-wise SpGEMM:

Generating Tall-skinny Matrix

Citation and Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages