CandidateVectorSearch 1.7.2
Searching for peptide candidates using sparse matrix + matrix/vector multiplication.
Benchmarks

The following are benchmarks of the different sparse matrix/vector multiplication methods of Eigen and cuSPARSE.

These benchmarks are designed as worst-case scenarios for candidate search: they assume that every peptide yields 100 ions and every spectrum contains 1000 peaks, while also performing normalization and Gaussian peak modeling.

We ran benchmarks for different database sizes (different numbers of candidate peptides to be considered) to assess how database size influences the performance of the different methods. Furthermore, every benchmark was run five times to get a more comprehensive overview of computation times. The averages are plotted below, with error bars denoting the standard deviation.

For all benchmarks we search 1001 spectra (this number is specifically chosen to check whether batched multiplication influences performance) and return the top 100 candidates. All benchmarks were conducted under light background usage (open browser, text editor, etc.).
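
Conceptually, the operation being compared is a single (sparse) matrix product followed by a top-100 selection per spectrum. Below is a minimal, illustrative Eigen sketch of the dense-vector variant (f32CPU_DV); the function and parameter names are made up for this example and the actual CandidateVectorSearch implementation differs in its details.

```cpp
// Illustrative sketch only (not the actual implementation): score all candidate
// peptides against one spectrum via sparse matrix * dense vector multiplication
// and keep the indices of the top 100 candidates.
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <algorithm>
#include <numeric>
#include <vector>

std::vector<int> searchSpectrum(const Eigen::SparseMatrix<float, Eigen::RowMajor>& candidates, // rows = peptides, cols = m/z bins
                                const Eigen::VectorXf& spectrum,                               // dense, length = number of m/z bins
                                int topN = 100)
{
    // One SpMV: scores(i) is the dot product of candidate i with the spectrum.
    Eigen::VectorXf scores = candidates * spectrum;

    // Partially sort candidate indices by descending score and keep the top N.
    std::vector<int> idx(static_cast<std::size_t>(scores.size()));
    std::iota(idx.begin(), idx.end(), 0);
    const int n = std::min<int>(topN, static_cast<int>(idx.size()));
    std::partial_sort(idx.begin(), idx.begin() + n, idx.end(),
                      [&](int a, int b) { return scores(a) > scores(b); });
    idx.resize(static_cast<std::size_t>(n));
    return idx;
}
```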

Abbreviations

The following abbreviations are used throughout the document (a short sketch of the underlying vector/matrix representations follows the list):

  • f32CPU_SV: Float32-(CPU-)based sparse matrix * sparse vector search (using Eigen)
  • i32CPU_SV: Int32-(CPU-)based sparse matrix * sparse vector search (using Eigen)
  • f32CPU_DV: Float32-(CPU-)based sparse matrix * dense vector search (using Eigen)
  • i32CPU_DV: Int32-(CPU-)based sparse matrix * dense vector search (using Eigen)
  • f32CPU_SM: Float32-(CPU-)based sparse matrix * sparse matrix search (using Eigen)
  • i32CPU_SM: Int32-(CPU-)based sparse matrix * sparse matrix search (using Eigen)
  • f32CPU_DM: Float32-(CPU-)based sparse matrix * dense matrix search (using Eigen)
  • i32CPU_DM: Int32-(CPU-)based sparse matrix * dense matrix search (using Eigen)
  • f32GPU_DV: Float32-(GPU-)based sparse matrix * dense vector search (using cuSPARSE)
  • f32GPU_DM: Float32-(GPU-)based sparse matrix * dense matrix search (using cuSPARSE)
  • f32GPU_SM: Float32-(GPU-)based sparse matrix * sparse matrix search (using cuSPARSE)
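
To make the SV/DV distinction concrete, here is an illustrative sketch of how a spectrum can be stored as a dense versus a sparse Eigen vector. The 500 000-bin assumption mirrors the benchmark dimensions below; the binning scheme and helper names are made up for this example.

```cpp
// Illustrative sketch only: the *_DV/*_DM variants store the spectrum densely
// (one float per m/z bin), the *_SV/*_SM variants store only the non-zero peaks.
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <utility>
#include <vector>

constexpr int kBins = 500000; // number of discretized m/z bins (assumption)

// Dense representation: mostly zeros, but enables fast sparse * dense products.
Eigen::VectorXf makeDenseSpectrum(const std::vector<std::pair<int, float>>& peaks)
{
    Eigen::VectorXf v = Eigen::VectorXf::Zero(kBins);
    for (const auto& [bin, intensity] : peaks)
        v(bin) = intensity;
    return v;
}

// Sparse representation: only the ~1000 peaks are stored explicitly.
Eigen::SparseVector<float> makeSparseSpectrum(const std::vector<std::pair<int, float>>& peaks)
{
    Eigen::SparseVector<float> v(kBins);
    v.reserve(static_cast<int>(peaks.size()));
    for (const auto& [bin, intensity] : peaks)
        v.coeffRef(bin) += intensity; // accumulate in case two peaks fall into the same bin
    return v;
}
```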

System 1 - Standard Office PC

The first system we tested this on was a standard office laptop with the following hardware:

  • Model: Dell Precision 3560
  • CPU: Intel Core i7-1185G7 [4 cores @ 1.8 GHz base / 3.0 GHz boost]
  • RAM: 16 GB DDR4 RAM [3200 MT/s, NA CAS]
  • GPU: Nvidia T500 [2 GB VRAM]
  • SSD/HDD: 512 GB NVMe SSD
  • OS: Windows 10 Education 64-bit (10.0, Build 19045)

10 000 Candidates

A * B = C where A[10000, 500000] and B[500000, 1001]
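
As a rough back-of-the-envelope estimate (assuming the stated worst case of 100 ions per candidate and 1000 peaks per spectrum, 4-byte values, and ignoring index/pointer overhead and implementation details), the raw array sizes for this case are approximately:

```latex
\mathrm{nnz}(A) \approx 10\,000 \cdot 100 = 10^{6}
\qquad
\mathrm{nnz}(B) \approx 1001 \cdot 1000 \approx 10^{6}
\qquad
\text{dense } B:\; 500\,000 \cdot 1001 \cdot 4\,\mathrm{B} \approx 2\,\mathrm{GB}
```

Note that the dense *vector* variants only ever need one 500 000-element column of B (about 2 MB) at a time, which is one reason they require considerably less memory than the dense *matrix* variants.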

Using a database of 10 000 peptide candidates the methods yield the following runtimes:

Figure 1: Float32-based sparse matrix * dense vector search using Eigen yields the fastest computation time of only 1.02 seconds.

Expand for raw data!

Method Candidates Run 1 Run 2 Run 3 Run 4 Run 5 Min Max Mean SD Rank Spectra Top-N (all times in seconds)
f32CPU_SV 10000 3.96232 3.99317 4.16333 4.12433 4.03925 3.96232 4.16333 4.05648 0.0854273 8 1001 100
i32CPU_SV 10000 4.20677 4.21627 4.18454 4.21334 4.30658 4.18454 4.30658 4.2255 0.0469989 10 1001 100
f32CPU_DV 10000 1.02714 0.999038 1.04962 1.01544 1.03139 0.999038 1.04962 1.02453 0.0188148 1 1001 100
i32CPU_DV 10000 1.088 1.17937 1.17109 1.1531 1.18244 1.088 1.18244 1.1548 0.0390465 4 1001 100
f32CPU_SM 10000 1.16123 1.14092 1.08636 1.17035 1.1552 1.08636 1.17035 1.14281 0.0333204 3 1001 100
i32CPU_SM 10000 1.01817 1.06418 1.01925 1.07144 1.13448 1.01817 1.13448 1.0615 0.0476856 2 1001 100
f32CPU_DM 10000 1.8242 1.77216 1.74569 1.715 1.77249 1.715 1.8242 1.76591 0.040254 5 1001 100
i32CPU_DM 10000 1.91169 1.86213 1.79263 1.82148 1.81984 1.79263 1.91169 1.84156 0.0463954 6 1001 100
f32GPU_DV 10000 4.03647 4.09389 4.05512 4.07632 4.05695 4.03647 4.09389 4.06375 0.0219723 9 1001 100
f32GPU_DM 10000 3.62518 3.74288 3.75778 3.71924 3.73217 3.62518 3.75778 3.71545 0.0524091 7 1001 100
f32GPU_SM 10000 9.95502 10.0398 10.1103 10.1644 10.0673 9.95502 10.1644 10.0674 0.0784879 11 1001 100

100 000 Candidates

A * B = C where A[100000, 500000] and B[500000, 1001]

Using a database of 100 000 peptide candidates the methods yield the following runtimes:

Figure 2: Float32-based sparse matrix * sparse matrix search using Eigen yields the fastest computation time of only 5.08 seconds. Note that f32GPU_SM has been excluded from the plot since its computation times exceeded those of all other methods by more than 10-fold. The raw data is available below.

Expand for raw data!

Method Candidates Run 1 Run 2 Run 3 Run 4 Run 5 Min Max Mean SD Rank Spectra Top-N (all times in seconds)
f32CPU_SV 100000 35.304 34.8771 31.7219 33.6381 28.2473 28.2473 35.304 32.7577 2.87956 9 1001 100
i32CPU_SV 100000 41.3168 42.1746 35.1852 33.7421 31.1516 31.1516 42.1746 36.7141 4.82477 10 1001 100
f32CPU_DV 100000 9.8869 9.8668 9.57659 7.34046 6.65369 6.65369 9.8869 8.66489 1.54662 3 1001 100
i32CPU_DV 100000 9.78072 9.80233 9.30471 7.98685 9.94904 7.98685 9.94904 9.36473 0.807484 4 1001 100
f32CPU_SM 100000 5.92302 5.56398 4.88576 4.40602 4.63187 4.40602 5.92302 5.08213 0.639863 1 1001 100
i32CPU_SM 100000 5.36173 5.56918 5.83226 4.43903 4.73719 4.43903 5.83226 5.18788 0.581964 2 1001 100
f32CPU_DM 100000 13.9166 14.7445 14.933 11.0524 11.2453 11.0524 14.933 13.1783 1.89294 6 1001 100
i32CPU_DM 100000 14.0893 14.2498 14.913 11.2577 10.6276 10.6276 14.913 13.0275 1.9409 5 1001 100
f32GPU_DV 100000 19.6112 20.1476 19.9083 19.1877 18.8013 18.8013 20.1476 19.5312 0.542965 7 1001 100
f32GPU_DM 100000 26.7439 26.9163 26.7714 26.4168 26.3571 26.3571 26.9163 26.6411 0.241999 8 1001 100
f32GPU_SM 100000 880.093 919.047 807.312 792.249 774.371 774.371 919.047 834.615 61.9812 11 1001 100

1 000 000 Candidates

A * B = C where A[1000000, 500000] and B[500000, 1001]

Using a database of 1 000 000 peptide candidates the methods yield the following runtimes:

Figure 3: Int32-based sparse matrix * sparse matrix search using Eigen yields the fastest computation time of only 45.04 seconds. Note that f32GPU_SM has been excluded from the plot since the method ran out of memory. The raw data is available below.

Expand for raw data!

Method Candidates Run 1 Run 2 Run 3 Run 4 Run 5 Min Max Mean SD Rank Spectra Top-N (all times in seconds)
f32CPU_SV 1000000 292.024 305.725 275.855 283.298 308.572 275.855 308.572 293.095 14.0837 9 1001 100
i32CPU_SV 1000000 337.896 330.922 293.953 328.767 351.387 293.953 351.387 328.585 21.2805 10 1001 100
f32CPU_DV 1000000 87.1387 78.9427 73.9562 82.5062 88.0868 73.9562 88.0868 82.1261 5.8669 4 1001 100
i32CPU_DV 1000000 88.2644 76.9449 70.4682 81.2829 92.4659 70.4682 92.4659 81.8853 8.77158 3 1001 100
f32CPU_SM 1000000 59.1796 42.2678 38.3327 43.238 61.2774 38.3327 61.2774 48.8591 10.5662 2 1001 100
i32CPU_SM 1000000 41.7158 45.5913 38.0705 43.2118 56.596 38.0705 56.596 45.0371 7.0145 1 1001 100
f32CPU_DM 1000000 105.11 106.617 95.4387 105.418 114.833 95.4387 114.833 105.483 6.88718 5 1001 100
i32CPU_DM 1000000 113.402 105.918 96.2186 109.205 113.995 96.2186 113.995 107.748 7.23534 6 1001 100
f32GPU_DV 1000000 166.4 165.727 165.672 167.374 169.121 165.672 169.121 166.859 1.4387 7 1001 100
f32GPU_DM 1000000 301.266 299.136 254.768 257.244 256.796 254.768 301.266 273.842 24.0922 8 1001 100

2 500 000 Candidates

A * B = C where A[2500000, 500000] and B[500000, 1001]

Using a database of 2 500 000 peptide candidates the methods yield the following runtimes:

Figure 4: Float32-based sparse matrix * sparse matrix search using Eigen yields the fastest computation time of only 101.42 seconds. Note that f32GPU_DM has been excluded from the plot since its computation time exceeded that of the other methods by more than 10-fold, and f32GPU_SM has been excluded since it ran out of memory. The raw data is available below.

Expand for raw data!

Method Candidates Run 1 Run 2 Run 3 Run 4 Run 5 Min Max Mean SD Rank Spectra Top-N (all times in seconds)
f32CPU_SV 2500000 692.117 829.181 695.621 706.285 1380.83 692.117 1380.83 860.808 296.247 9 1001 100
i32CPU_SV 2500000 800.103 811.7 764.923 766.638 999.444 764.923 999.444 828.561 97.6981 8 1001 100
f32CPU_DV 2500000 204.586 207.268 187.071 185.345 202.865 185.345 207.268 197.427 10.3792 3 1001 100
i32CPU_DV 2500000 190.583 220.49 185.285 196.224 232.609 185.285 232.609 205.038 20.4678 4 1001 100
f32CPU_SM 2500000 104.16 103.163 96.9828 96.1288 106.669 96.1288 106.669 101.421 4.63096 1 1001 100
i32CPU_SM 2500000 106.88 159.144 93.4166 95.3154 101.387 93.4166 159.144 111.229 27.3045 2 1001 100
f32CPU_DM 2500000 278.991 334.042 240.312 242.37 266.467 240.312 334.042 272.436 38.1112 5 1001 100
i32CPU_DM 2500000 302.466 292.033 243.411 245.654 279.659 243.411 302.466 272.644 26.9143 6 1001 100
f32GPU_DV 2500000 455.415 438.436 422.096 423.759 426.902 422.096 455.415 433.322 13.901 7 1001 100
f32GPU_DM 2500000 8169.59 7931.93 7467.55 7840.16 7491.93 7467.55 8169.59 7780.23 299.621 10 1001 100

5 000 000 Candidates

A * B = C where A[5000000, 500000] and B[500000, 1001]

Using a database of 5 000 000 peptide candidates the methods yield the following runtimes:

Figure 5: Float32-based sparse matrix * sparse matrix search using Eigen yields the fastest computation time of only 210.98 seconds. Note that all GPU-based methods have been excluded from the plot since their computation times exceeded those of the CPU-based methods by more than 10-fold or because they ran out of memory. The raw data is available below.

Expand for raw data!

Method Candidates Run 1 Run 2 Run 3 Run 4 Run 5 Min Max Mean SD Rank Spectra Top-N (all times in seconds)
f32CPU_SV 5000000 1488.95 1753.58 1409.96 1405.52 1433.23 1405.52 1753.58 1498.25 146.545 7 1001 100
i32CPU_SV 5000000 1456.77 2199.68 1443.93 1433.08 1640.18 1433.08 2199.68 1634.73 327.082 8 1001 100
f32CPU_DV 5000000 362.758 434.276 371.356 371.242 396.402 362.758 434.276 387.207 29.1716 4 1001 100
i32CPU_DV 5000000 360.054 429.113 362.396 354.354 383.947 354.354 429.113 377.973 30.7108 3 1001 100
f32CPU_SM 5000000 202.057 253.796 195.927 197.155 205.942 195.927 253.796 210.975 24.2692 1 1001 100
i32CPU_SM 5000000 196.972 247.733 238.983 217.433 192.904 192.904 247.733 218.805 24.4611 2 1001 100
f32CPU_DM 5000000 495.787 543.992 501.467 506.691 542.057 495.787 543.992 517.999 23.1783 5 1001 100
i32CPU_DM 5000000 494.032 519.314 542.015 496.312 542.956 494.032 542.956 518.926 23.6736 6 1001 100
f32GPU_DV 5000000 13753.4 13738.6 13777.2 13396.6 14214 13396.6 14214 13775.9 290.558 9 1001 100
f32GPU_DM 5000000 14965.1 15271.3 15013.6 14908.9 14943.8 14908.9 15271.3 15020.5 145.243 10 1001 100

System 2 - High Performance PC

The second system we tested this on was a more powerful desktop PC with the following (more recent) hardware:

  • MB: ASUS ROG Strix B650E-I
  • CPU: AMD Ryzen 9 7900X [12 cores @ 4.7 GHz base / 5.6 GHz boost]
  • RAM: Kingston 64 GB DDR5 RAM [5600 MT/s, 36 CAS]
  • GPU: ASUS Dual [Nvidia] GeForce RTX 4060 Ti OC [16 GB VRAM]*
  • SSD/HDD: Corsair MP600 Pro NH 2 TB NVMe SSD [PCIe 4.0]
  • OS: Windows 11 Pro 64-bit (10.0, Build 22631)

*Note: "Dual" is part of the product name; this is a single graphics card!

10 000 Candidates

A * B = C where A[10000, 500000] and B[500000, 1001]

Using a database of 10 000 peptide candidates the methods yield the following runtimes:

Figure 6: Float32-based sparse matrix * dense matrix search using cuSPARSE yields the fastest computation time of only 0.59 seconds. The raw data is available below.

Expand for raw data!

Method Candidates Run 1 Run 2 Run 3 Run 4 Run 5 Min Max Mean SD Rank Spectra Top-N (all times in seconds)
f32CPU_SV 10000 2.43146 2.55636 2.39993 2.42731 2.39449 2.39449 2.55636 2.44191 0.0660161 9 1001 100
i32CPU_SV 10000 2.7085 2.70474 2.70086 2.729 2.68543 2.68543 2.729 2.70571 0.0156976 10 1001 100
f32CPU_DV 10000 0.6167 0.599691 0.595198 0.584888 0.591805 0.584888 0.6167 0.597657 0.0119388 2 1001 100
i32CPU_DV 10000 0.667608 0.636094 0.642741 0.637341 0.622594 0.622594 0.667608 0.641275 0.0164842 4 1001 100
f32CPU_SM 10000 0.792935 0.773152 0.773728 0.765361 0.756675 0.756675 0.792935 0.77237 0.0134243 7 1001 100
i32CPU_SM 10000 0.772883 0.768502 0.759843 0.765107 0.778391 0.759843 0.778391 0.768945 0.00711509 6 1001 100
f32CPU_DM 10000 0.632086 0.619624 0.622482 0.604109 0.602857 0.602857 0.632086 0.616231 0.0125278 3 1001 100
i32CPU_DM 10000 0.772297 0.69401 0.687996 0.690859 0.703034 0.687996 0.772297 0.709639 0.0354791 5 1001 100
f32GPU_DV 10000 0.813919 0.76518 0.758183 0.770478 0.765139 0.758183 0.813919 0.77458 0.0224207 8 1001 100
f32GPU_DM 10000 0.595171 0.590653 0.585184 0.582419 0.581002 0.581002 0.595171 0.586886 0.00592235 1 1001 100
f32GPU_SM 10000 6.10897 5.90136 5.95603 5.97876 6.05967 5.90136 6.10897 6.00096 0.0829823 11 1001 100

100 000 Candidates

A * B = C where A[100000, 500000] and B[500000, 1001]

Using a database of 100 000 peptide candidates the methods yield the following runtimes:

Figure 7: Float32-based sparse matrix * dense matrix search using cuSPARSE yields the fastest computation time of only 1.75 seconds. Note that GPU-based sparse matrix * sparse matrix search has been excluded from the plot since its computation time exceeded that of all other methods by almost 20-fold. The raw data is available below.

Expand for raw data!

Method Candidates Run 1 Run 2 Run 3 Run 4 Run 5 Min Max Mean SD Rank Spectra Top-N (all times in seconds)
f32CPU_SV 100000 17.2153 17.4153 17.2214 17.1595 17.0898 17.0898 17.4153 17.2203 0.12123 9 1001 100
i32CPU_SV 100000 19.5383 19.6478 19.4686 19.5157 19.8642 19.4686 19.8642 19.6069 0.158141 10 1001 100
f32CPU_DV 100000 2.20965 2.22161 2.24419 2.32234 2.24583 2.20965 2.32234 2.24872 0.0439058 3 1001 100
i32CPU_DV 100000 2.24745 2.25166 2.25541 2.25549 2.28039 2.24745 2.28039 2.25808 0.0128992 4 1001 100
f32CPU_SM 100000 2.54573 2.57403 2.53357 2.51469 2.69642 2.51469 2.69642 2.57289 0.0723424 6 1001 100
i32CPU_SM 100000 2.37684 2.38338 2.35434 2.37147 2.51973 2.35434 2.51973 2.40115 0.0671557 5 1001 100
f32CPU_DM 100000 2.86038 2.87027 2.86148 2.88828 2.88353 2.86038 2.88828 2.87279 0.0126856 7 1001 100
i32CPU_DM 100000 3.00192 2.9941 2.98285 2.98436 3.07926 2.98285 3.07926 3.0085 0.0403058 8 1001 100
f32GPU_DV 100000 1.89176 1.9011 1.79977 1.81026 1.86596 1.79977 1.9011 1.85377 0.0464785 2 1001 100
f32GPU_DM 100000 1.72806 1.74136 1.74271 1.74044 1.79692 1.72806 1.79692 1.7499 0.0269343 1 1001 100
f32GPU_SM 100000 368.121 372.694 366.824 367.272 374.929 366.824 374.929 369.968 3.62661 11 1001 100

1 000 000 Candidates

A * B = C where A[1000000, 500000] and B[500000, 1001]

Using a database of 1 000 000 peptide candidates the methods yield the following runtimes:

Figure 8: Float32-based sparse matrix * dense vector search using cuSPARSE yields the fastest computation time of only 13.53 seconds. Note that GPU-based sparse matrix * sparse matrix search was not measured due to its extremely long computation time already evident from the 100 000 candidate benchmark. The raw data is available below.

Expand for raw data!

Method Candidates Run 1 Run 2 Run 3 Run 4 Run 5 Min Max Mean SD Rank Spectra Top-N (all times in seconds)
f32CPU_SV 1000000 164.331 169.159 169.282 169.57 167.153 164.331 169.57 167.899 2.21262 9 1001 100
i32CPU_SV 1000000 187.277 194.068 192.426 187.627 193.715 187.277 194.068 191.023 3.31853 10 1001 100
f32CPU_DV 1000000 25.8363 25.9068 25.906 25.8961 25.7537 25.7537 25.9068 25.8598 0.0660915 6 1001 100
i32CPU_DV 1000000 24.3092 24.337 24.3 24.6825 24.4221 24.3 24.6825 24.4101 0.159677 5 1001 100
f32CPU_SM 1000000 20.5885 21.4535 21.15 21.1785 21.5341 20.5885 21.5341 21.1809 0.371133 4 1001 100
i32CPU_SM 1000000 19.2455 18.6396 19.5012 19.5133 19.917 18.6396 19.917 19.3633 0.470585 3 1001 100
f32CPU_DM 1000000 26.7471 26.6318 26.5303 27.5435 26.7253 26.5303 27.5435 26.8356 0.404922 8 1001 100
i32CPU_DM 1000000 26.506 26.3391 26.0741 26.1196 26.3898 26.0741 26.506 26.2857 0.183399 7 1001 100
f32GPU_DV 1000000 13.2421 13.7711 13.4819 13.5797 13.5924 13.2421 13.7711 13.5334 0.193436 1 1001 100
f32GPU_DM 1000000 14.4925 14.5667 14.4004 14.7976 14.7297 14.4004 14.7976 14.5974 0.164561 2 1001 100

2 500 000 Candidates

A * B = C where A[2500000, 500000] and B[500000, 1001]

Using a database of 2 500 000 peptide candidates the methods yield the following runtimes:

Figure 9: Float32-based sparse matrix * dense vector search using cuSPARSE yields the fastest computation time of only 33.93 seconds. Note that GPU-based sparse matrix * sparse matrix search was not measured due to its extremely long computation time already evident from the 100 000 candidate benchmark. The raw data is available below.

Expand for raw data!

Method Candidates Run 1 Run 2 Run 3 Run 4 Run 5 Min Max Mean SD Rank Spectra Top-N (all times in seconds)
f32CPU_SV 2500000 407.479 407.826 419.703 421.906 418.15 407.479 421.906 415.013 6.85122 9 1001 100
i32CPU_SV 2500000 467.131 478.789 475.488 481.399 479.534 467.131 481.399 476.468 5.63997 10 1001 100
f32CPU_DV 2500000 63.38 63.4611 64.0814 62.9425 62.8804 62.8804 64.0814 63.3491 0.483433 6 1001 100
i32CPU_DV 2500000 59.1485 60.1111 59.7992 58.7428 58.3854 58.3854 60.1111 59.2374 0.717179 5 1001 100
f32CPU_SM 2500000 50.3486 51.7864 51.3886 52.1593 50.9307 50.3486 52.1593 51.3227 0.710964 4 1001 100
i32CPU_SM 2500000 47.4197 46.9971 46.5492 48.0035 46.3684 46.3684 48.0035 47.0676 0.663803 3 1001 100
f32CPU_DM 2500000 64.9918 65.6541 64.0161 65.3734 65.5497 64.0161 65.6541 65.117 0.665061 7 1001 100
i32CPU_DM 2500000 64.8904 65.8225 64.7754 65.5182 65.5602 64.7754 65.8225 65.3134 0.455636 8 1001 100
f32GPU_DV 2500000 34.4445 34.2919 33.3516 33.3692 34.1864 33.3516 34.4445 33.9287 0.526903 1 1001 100
f32GPU_DM 2500000 37.2584 37.9242 36.1633 36.3446 37.3268 36.1633 37.9242 37.0035 0.734356 2 1001 100

5 000 000 Candidates

A * B = C where A[5000000, 500000] and B[500000, 1001]

Using a database of 5 000 000 peptide candidates the methods yield the following runtimes:

Figure 10: Float32-based sparse matrix * dense vector search using cuSPARSE yields the fastest computation time of only 68.27 seconds. Note that GPU-based sparse matrix * sparse matrix search was not measured due to its extremely long computation time already evident from the 100 000 candidate benchmark. The raw data is available below.

Expand for raw data!

Method Candidates Run 1 Run 2 Run 3 Run 4 Run 5 Min Max Mean SD Rank Spectra Top-N (all times in seconds)
f32CPU_SV 5000000 826.765 849.757 821.71 820.768 821.078 820.768 849.757 828.015 12.3965 9 1001 100
i32CPU_SV 5000000 939.623 947.093 952.345 941.501 954.725 939.623 954.725 947.057 6.57454 10 1001 100
f32CPU_DV 5000000 130.432 125.811 126.793 126.478 125.721 125.721 130.432 127.047 1.94481 6 1001 100
i32CPU_DV 5000000 117.315 117.658 118.168 118.61 117.866 117.315 118.61 117.923 0.493645 5 1001 100
f32CPU_SM 5000000 124.313 104.291 104.264 101.747 102.933 101.747 124.313 107.51 9.45234 4 1001 100
i32CPU_SM 5000000 90.7281 95.7376 95.3606 92.9142 94.5655 90.7281 95.7376 93.8612 2.06021 3 1001 100
f32CPU_DM 5000000 127.834 133.133 134.49 131.903 132.968 127.834 134.49 132.066 2.53797 8 1001 100
i32CPU_DM 5000000 128.806 133.237 131.651 133.81 132.751 128.806 133.81 132.051 1.97979 7 1001 100
f32GPU_DV 5000000 67.4017 68.5841 68.0146 68.9836 68.3605 67.4017 68.9836 68.2689 0.599003 1 1001 100
f32GPU_DM 5000000 71.757 73.6774 73.7368 74.6015 73.9481 71.757 74.6015 73.5442 1.0642 2 1001 100

Beyond 5 000 000 Candidates

Here are some single-run benchmarks (no repetitions) of the best-performing CPU-based search (i32CPU_SM) and the best-performing GPU-based search (f32GPU_DV):

Method Candidates Time (s)
i32CPU_SM 10 000 000 192.9012858
f32GPU_DV 10 000 000 136.936034
i32CPU_SM 15 000 000 294.8084179
f32GPU_DV 15 000 000 222.3788737
i32CPU_SM 20 000 000 402.716955
f32GPU_DV 20 000 000 281.232755
i32CPU_SM 21 474 835 439.2972796
f32GPU_DV 21 474 835 failed

You might notice that the maximum number of tested candidates is 21 474 835. This is because, at 100 ions per candidate, this results in 2 147 483 500 non-zero elements, which is close to the maximum value of a signed int32 (2 147 483 647). Going beyond that is impossible with the provided implementation. In case you really need to go beyond that, please adapt the implementation accordingly (e.g. by using unsigned int32 or int64 data types instead). See also this issue.
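
The limit can be sanity-checked with a few lines of code; this is a minimal sketch assuming 100 non-zero elements (ions) per candidate as described above, and the actual index type and encoding in the implementation may differ:

```cpp
// Illustrative check of the signed 32-bit non-zero element limit.
#include <cstdint>
#include <limits>

constexpr std::int64_t kIonsPerCandidate = 100; // worst-case assumption from the benchmark setup

constexpr bool fitsInt32Indexing(std::int64_t candidates)
{
    const std::int64_t nonZeros = candidates * kIonsPerCandidate;
    return nonZeros <= std::numeric_limits<std::int32_t>::max(); // 2 147 483 647
}

static_assert(fitsInt32Indexing(21'474'835));  // 2 147 483 500 non-zeros -> still representable
static_assert(!fitsInt32Indexing(22'000'000)); // 2 200 000 000 non-zeros -> exceeds int32
```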

Conclusions

CPU-based sparse matrix * sparse matrix search is generally a good choice, no matter the system configuration. Choosing an Int32- or Float32-based approach usually does not make a considerable difference; we recommend the Int32 variant as it ensures better reproducibility of results, eliminating any deviations due to floating point shenanigans. If a decent GPU (e.g. anything comparable to an Nvidia GeForce RTX 4060 Ti 16 GB or better) is available, running a GPU-based search is the more performant choice. Here we recommend the sparse matrix * dense vector approach as it requires less GPU memory. The GPU-based sparse matrix * sparse matrix search appears to be the worst of the tested methods, yielding very long computation times and exceedingly high memory usage. This is most likely because the algorithm assumes that the resulting matrix is also sparse, which is almost never the case.
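
To illustrate why the Int32 variant is the more reproducible choice: floating-point addition is not associative, so reordering or parallelizing a summation can change the low-order bits of a score, whereas integer accumulation always gives the same result. A minimal standalone example:

```cpp
// Floating-point addition is not associative; integer addition is.
#include <cstdio>

int main()
{
    const float a = 1e8f, b = -1e8f, c = 1.0f;
    std::printf("%.1f\n", (a + b) + c); // prints 1.0
    std::printf("%.1f\n", a + (b + c)); // prints 0.0 (c is lost next to the large magnitude of b)

    const int x = 100000000, y = -100000000, z = 1;
    std::printf("%d\n", (x + y) + z);   // prints 1
    std::printf("%d\n", x + (y + z));   // prints 1 (always identical)
    return 0;
}
```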