Hacker News

This is the problem I'm working on.

Independent threads each run as fast as a single core, and overall throughput drops when synchronization is introduced. Amdahl's law means that as the parallel part speeds up, the sequential fraction of the work comes to dominate total time.
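The effect is easy to quantify with Amdahl's formula: speedup = 1 / (s + (1 - s)/N), where s is the serial fraction and N the core count. A quick sketch (the percentages are illustrative, not measured from my benchmark):

```java
// Amdahl's law: speedup = 1 / (s + (1 - s) / n)
// where s is the serial (synchronized) fraction of the work.
public class Amdahl {
    static double speedup(double serialFraction, int cores) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / cores);
    }

    public static void main(String[] args) {
        // Even 5% serial work caps speedup well below the core count.
        System.out.printf("12 cores, 5%% serial: %.2fx%n", speedup(0.05, 12));
        System.out.printf("12 cores, 0%% serial: %.2fx%n", speedup(0.00, 12));
        // As cores grow without bound, speedup approaches 1/s = 20x at 5% serial.
        System.out.printf("10000 cores, 5%% serial: %.2fx%n", speedup(0.05, 10000));
    }
}
```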

My design is to shard by thread. Each thread or machine maintains a sorted shard of the data. Erlang takes a similar approach. If you want a totally ordered view, you do an N-way merge.
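A minimal sketch of the shard-per-thread idea (names are mine, not from the linked repo): every thread owns one sorted shard, so inserts need no locking at all, and a total-order view is only materialized on demand. The merge here reuses a sorted set for brevity; a heap-based N-way merge would avoid re-sorting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Shard-per-thread sketch: threads never write to each other's shard,
// so inserts are contention-free; total order is built only when asked.
public class ShardPerThread {
    // N-way merge of already-sorted shards into one total order.
    static List<Integer> totalOrder(List<TreeSet<Integer>> shards) {
        TreeSet<Integer> merged = new TreeSet<>();
        for (TreeSet<Integer> shard : shards) merged.addAll(shard);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) throws InterruptedException {
        final int nThreads = 4;
        List<TreeSet<Integer>> shards = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            TreeSet<Integer> shard = new TreeSet<>();
            shards.add(shard);
            final int base = t;
            // Each worker touches only its own shard: zero synchronization.
            workers.add(new Thread(() -> {
                for (int i = 0; i < 5; i++) shard.add(base + i * nThreads);
            }));
        }
        for (Thread w : workers) w.start();
        for (Thread w : workers) w.join();
        System.out.println(totalOrder(shards)); // 0..19 in ascending order
    }
}
```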

I have a lock free algorithm that can do about 1.2 million synchronizations a second across 11 threads. My lock benchmark does 3,444,152 (~3.4 million) requests per second with 11 threads, so my lock free algorithm isn't as fast as locks here. My lock free algorithm doesn't block, though.
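To illustrate the distinction (this is not the actor2 algorithm, just the generic contrast): a lock-based counter makes waiters block, while a CAS retry loop never blocks, so some thread always makes progress even if another is preempted mid-operation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Blocking vs lock-free progress on a shared counter.
public class SyncStyles {
    static final ReentrantLock lock = new ReentrantLock();
    static long lockedCount = 0;
    static final AtomicLong casCount = new AtomicLong();

    static void lockedIncrement() {
        lock.lock(); // a preempted holder stalls every waiter
        try { lockedCount++; } finally { lock.unlock(); }
    }

    static void casIncrement() {
        long seen;
        do {
            seen = casCount.get(); // read current value
        } while (!casCount.compareAndSet(seen, seen + 1)); // retry on conflict
        // No thread ever blocks: this is the lock-free progress guarantee.
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[11];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) { lockedIncrement(); casIncrement(); }
            });
        }
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        System.out.println(lockedCount + " " + casCount.get()); // 1100000 1100000
    }
}
```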

I'm thinking of calibrating the message rate by trading latency for throughput, to limit interference on the lock. If I batch 2,500-25,000 messages, then more messages get transferred per synchronization event.
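The arithmetic behind batching: if each synchronization event costs roughly the same regardless of payload, then messages per second scale with batch size, while worst-case latency grows with it too. A sketch with illustrative numbers (the 1.2M figure is from my benchmark; the batch sizes are hypothetical):

```java
public class Batching {
    // If syncs/sec stays roughly constant, messages/sec scales with the
    // batch size; the cost is that a message may wait for a full batch.
    static long messagesPerSecond(long syncsPerSecond, int batchSize) {
        return syncsPerSecond * batchSize;
    }

    public static void main(String[] args) {
        long syncs = 1_200_000L; // ~1.2M synchronizations/sec measured
        System.out.println(messagesPerSecond(syncs, 1));     // 1200000
        System.out.println(messagesPerSecond(syncs, 10));    // 12000000
        // With 2,500-message batches the ceiling moves far beyond what
        // any per-message lock acquisition could reach.
        System.out.println(messagesPerSecond(syncs, 2_500)); // 3000000000
    }
}
```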




I did some tweaking with my benchmark.

I picked 11 threads due to my CPU having 12 logical cores.

With 100 threads and incrementing 10 at a time, the LockBenchmark achieves 3,482,053 requests per second.

With 100 threads, sending 10 messages per synchronization event, the lock free benchmark achieves 17,754,858 requests per second.

In other words, the lock free algorithm scales better.

LockBenchmark code https://github.com/samsquire/multiversion-concurrency-contro...

Lock free actor2 algorithm https://github.com/samsquire/multiversion-concurrency-contro...


I think this is to be expected. If you have a lock anywhere, the risk is that the lock will become the bottleneck at some point, as more and more concurrently running processes want to acquire it.


Why merge them at all? Can a lazy merge increase performance? Since you are usually iterating over the results anyway, most of the work is done by the iteration, not by the merge. That gives you time to work out which thread/machine to pull from next while the main thread works on the result of the last iteration (i.e., pre-empt the iteration). You could even buffer/stream them from each thread much faster than any code could do real work.
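A lazy merge along these lines can be sketched as an iterator over a priority queue of per-shard heads (assuming each shard exposes an ordinary sorted iterator): each next() call pulls exactly one element, so merge work is interleaved with the consumer's iteration instead of paid up front.

```java
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Lazy N-way merge: yields one element per next() call instead of
// merging everything eagerly. Assumes each source is already sorted.
public class LazyMerge implements Iterator<Integer> {
    // Heap entry: the current head of one source plus its iterator.
    private static final class Head {
        int value;
        final Iterator<Integer> rest;
        Head(int value, Iterator<Integer> rest) { this.value = value; this.rest = rest; }
    }

    private final PriorityQueue<Head> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a.value, b.value));

    public LazyMerge(List<List<Integer>> sortedSources) {
        for (List<Integer> src : sortedSources) {
            Iterator<Integer> it = src.iterator();
            if (it.hasNext()) heap.add(new Head(it.next(), it));
        }
    }

    public boolean hasNext() { return !heap.isEmpty(); }

    public Integer next() {
        Head h = heap.poll();    // smallest current head across all sources
        int out = h.value;
        if (h.rest.hasNext()) {  // refill the heap from the same source
            h.value = h.rest.next();
            heap.add(h);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> shards = List.of(
            List.of(1, 4, 7), List.of(2, 5, 8), List.of(3, 6, 9));
        LazyMerge m = new LazyMerge(shards);
        StringBuilder sb = new StringBuilder();
        while (m.hasNext()) sb.append(m.next()).append(' ');
        System.out.println(sb.toString().trim()); // 1 2 3 4 5 6 7 8 9
    }
}
```

The same shape works for streaming from remote machines: the heap refill step is the natural place to prefetch the next buffer from whichever shard just yielded.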

In my experience, lock-free solutions tend to not do well when threads share a physical core (vs virtual cores), but ymmv.


How'd you pick 11 threads? How do the algorithms scale as you change # of threads? T=2, 10, 100, 1000, 10k


https://news.ycombinator.com/item?id=34126789

I replied to your comment here.



