> What do you mean by "something completely different"? I mean compute the exact...

> What do you mean by "something completely different"?

I mean compute the exact thing you need to compute, optimizing for performance.

I’m pretty sure your current code is spending vast majority of time doing malloc / memcpy while moving input elements from the input array into per-bucket collections.

To aggregate the data the way you want, you do not have to do any of that. You only need a single number per bucket to compute the minimum. Updating that accumulator is very cheap as there’s no memory allocations involved: just load, compare, and store the updated number.

Moreover, that algorithm is easily to parallelize. Compute an array with these per-bucket minimums on different threads, wait for completion of all tasks, then aggregate multiple arrays computed by each thread into a single array with global minimums for each bucket. Then compute the final sum.