Independent threads run as fast as a single core for as long as they stay independent; they only slow down overall once synchronization is introduced. Amdahl's law means the sequential part of the work eventually dominates total time, no matter how much you speed up the parallel part.
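Rough back-of-the-envelope sketch of what Amdahl's law implies here (my own illustration, the numbers are made up, not measured):

    // Amdahl's law: with serial fraction s, speedup on n threads is
    // capped at 1 / (s + (1 - s)/n).
    public class Amdahl {
        static double speedup(double serialFraction, int threads) {
            return 1.0 / (serialFraction + (1.0 - serialFraction) / threads);
        }
        public static void main(String[] args) {
            // e.g. 5% serial work on 11 threads gives ~7.3x, and the ceiling as
            // threads -> infinity is 1 / 0.05 = 20x, however many cores you add.
            System.out.printf("%.2fx%n", speedup(0.05, 11));
        }
    }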
My design is to shard by thread. Each thread or machine maintains a sorted shard of the data. Erlang takes a similar approach. If you want a totally ordered view, you do an N-way merge sort across the shards.
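The total order view is roughly this shape; the List<Long> shards here are just placeholders for whatever the per-thread structure actually is, this only sketches the merge idea:

    import java.util.*;

    // N-way merge over per-thread sorted shards using a min-heap of shard cursors.
    public class NWayMerge {
        public static List<Long> merge(List<List<Long>> shards) {
            // Heap ordered by each shard's current head element.
            PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparingLong((Cursor c) -> c.head));
            for (List<Long> shard : shards) {
                Iterator<Long> it = shard.iterator();
                if (it.hasNext()) heap.add(new Cursor(it));
            }
            List<Long> out = new ArrayList<>();
            while (!heap.isEmpty()) {
                Cursor top = heap.poll();          // smallest current head wins
                out.add(top.head);
                if (top.advance()) heap.add(top);  // re-insert while the shard has more
            }
            return out;
        }

        // Wraps an iterator so the heap can compare the next (head) element.
        private static final class Cursor {
            final Iterator<Long> it;
            long head;
            Cursor(Iterator<Long> it) { this.it = it; this.head = it.next(); }
            boolean advance() {
                if (!it.hasNext()) return false;
                head = it.next();
                return true;
            }
        }
    }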
I have a lock-free algorithm that can do about 1.2 million synchronizations a second across 11 threads. My lock benchmark does 3,444,152 (~3.4 million) requests per second with 11 threads, so my lock-free algorithm isn't as good as locks. My lock-free algorithm doesn't block, though.
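To be concrete about the two shapes being compared, this is the general pattern (a simplified Java sketch, not the actual benchmark code):

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.locks.ReentrantLock;

    // A lock-free CAS retry loop vs. the same update guarded by a mutex.
    public class SyncShapes {
        private final AtomicLong lockFreeCounter = new AtomicLong();
        private long lockedCounter;
        private final ReentrantLock lock = new ReentrantLock();

        // Lock-free: retry compare-and-set until it wins; never blocks,
        // but every failed retry is wasted work under heavy contention.
        void lockFreeIncrement() {
            long current;
            do {
                current = lockFreeCounter.get();
            } while (!lockFreeCounter.compareAndSet(current, current + 1));
        }

        // Lock-based: a thread may block waiting for the lock, but once inside
        // the critical section the update itself is a plain write.
        void lockedIncrement() {
            lock.lock();
            try {
                lockedCounter++;
            } finally {
                lock.unlock();
            }
        }
    }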
I'm thinking of calibrating the message rate: accept some extra latency to increase throughput and limit interference on the lock. If I raise the batch to 2,500-25,000 messages, then more messages get transferred per synchronization event.
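Something like this, where each producing thread buffers locally and a single lock acquisition hands over the whole batch (the batch size, message type and shared queue here are placeholders, not my actual numbers or data structures):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;

    // One BatchedSender per producing thread: buffer locally, transfer a whole
    // batch per synchronization event.
    public class BatchedSender {
        static final int BATCH_SIZE = 2_500;   // tune somewhere in the 2,500-25,000 range
        private final List<String> localBuffer = new ArrayList<>(BATCH_SIZE);
        private final ArrayDeque<List<String>> shared = new ArrayDeque<>();

        // Called by the producing thread; cheap until the buffer fills.
        void send(String msg) {
            localBuffer.add(msg);
            if (localBuffer.size() >= BATCH_SIZE) {
                flush();
            }
        }

        // One lock acquisition moves BATCH_SIZE messages, so the per-message
        // synchronization cost drops by roughly that factor, at the price of
        // added latency for messages sitting in the buffer.
        void flush() {
            List<String> batch = new ArrayList<>(localBuffer);
            localBuffer.clear();
            synchronized (shared) {
                shared.add(batch);
            }
        }
    }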
I think this is to be expected. If you have a lock anywhere, the risk is that the lock becomes the bottleneck at some point as you scale to more and more concurrently running processes wanting to acquire it.
Why merge them at all? Can a lazy merge increase performance, given that you are usually iterating over the shards anyway, so at that point most of the work is done by the iteration and not by the merge? That gives you some time to work out which thread/machine to pull from next while the main thread works on the result of the last iteration (i.e., stay one step ahead of the iterator). You could even buffer/stream values from each thread much faster than any code could do real work on them.
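Something like a streaming k-way merge: each shard thread pushes its sorted values into a bounded queue and the consumer only pulls the next smallest element on demand. The queue type, sentinel value and element type here are made up for illustration:

    import java.util.*;
    import java.util.concurrent.BlockingQueue;

    // Lazy merge over shards streamed through bounded queues; the consumer pulls
    // one element per next() call instead of materializing a merged result.
    public class LazyMerge implements Iterator<Long> {
        private static final long POISON = Long.MAX_VALUE;   // end-of-shard marker

        private final PriorityQueue<long[]> heap =            // entries: {value, shardIndex}
            new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[0]));
        private final List<BlockingQueue<Long>> queues;

        public LazyMerge(List<BlockingQueue<Long>> queues) throws InterruptedException {
            this.queues = queues;
            for (int i = 0; i < queues.size(); i++) {
                long v = queues.get(i).take();                // first element of each shard
                if (v != POISON) heap.add(new long[]{v, i});
            }
        }

        @Override public boolean hasNext() { return !heap.isEmpty(); }

        @Override public Long next() {
            long[] top = heap.poll();
            try {
                long replacement = queues.get((int) top[1]).take();   // refill from same shard
                if (replacement != POISON) heap.add(new long[]{replacement, top[1]});
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return top[0];
        }
    }

Each shard thread just put()s its sorted values followed by the POISON marker, so producers can run ahead of the consumer up to the queue capacity while the main thread is busy with the last result.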
In my experience, lock-free solutions tend not to do well when threads share a physical core (i.e., run on sibling virtual cores rather than separate physical ones), but YMMV.