saman_b's comments

The results of our soon-to-be-published paper confirm the claims of this paper and show that, if implemented well, threads can perform and scale as well as events, without much memory overhead.

I worked on this subject during my PhD, and the result is a paper that will be published at SIGMETRICS 2020.

We developed an M:N user-level threading library and exhaustively tested it against event-based alternatives and pthread-based solutions.

We used both memcached and web servers to test it on 32-core and 64-core servers.

Even thread-per-connection with pthreads looks promising in terms of performance.

You can find the paper and source files here: https://cs.uwaterloo.ca/~mkarsten/papers/sigmetrics2020.html


Well, no, you are just hacking it to support your claims. A split-stack work-stealing implementation is OK, nothing special, basically goroutines. But you are not addressing the most important difference between events and shared-memory multithreading: synchronizing access to shared memory. Idiomatic event-driven applications don't do synchronization and have to be sharded to scale to multiple cores. Choosing memcached and running it multithreaded is particularly bad, as memcached is not a decent event-driven application; it mixes threads and events and suffers from all that synchronization overhead. At the very least you should run it in a one-process-per-core configuration [1]. But it's much worse than that: there is serious research in this area that addresses those problems, in particular the Anna paper [2]. It kills any possibility for shared-memory multithreading as a concurrency model to compete with anything; it's just too broken on a fundamental level.

[1] https://github.com/scylladb/seastar/wiki/Memcached-Benchmark

[2] https://dsf.berkeley.edu/jmh/papers/anna_ieee18.pdf


I believe one should not confine "event-driven" to applications that don't do synchronisation; that is part of the misconception that leads to thinking event-driven has higher performance. Part of the problem with threads (as you mentioned) is synchronisation, but event-driven applications face the same problem. We have a web server experiment that shows performance comparable to an event-driven web server (ulib), with its various hacks to make it faster, which is consistently near the top of the TechEmpower list.

Regarding memcached, the first reference you posted is from 2015. Yes, in 2015 memcached was in very bad shape in terms of synchronisation, and things have changed significantly since that version: locks per hash bucket rather than a global lock, avoiding try_lock, and .... So those results are too old to rely on. Seastar moves the network stack to user level and, if I remember correctly, its scheduler consisted of multiple ever-looping threads even when there was no work to do. Considering that in our experiments 40-60% of memcached's overhead came from network I/O, their results are no surprise. I would call this a hack for sure.

I have not read the Anna paper, so I cannot comment in depth, but it seems they are building a key-value store using the actor model. I briefly skimmed through the paper and did not find anything that points to "killing any possibility for shared memory multi-threading as a concurrency model"; that is a very bold claim, and if they do make it, they should have really strong results.

But my guess is that Anna's whole actor model was implemented on top of threads and shared-memory multi-threading, which would be in contrast with that bold claim. I have worked with actor models and implemented one as well; it is a perfect fit for many use cases, but threads, at least for now, are the bread and butter of multicore programming.

Having said that, all these models have their respective place in the software world. What we are trying to show in our paper, through thorough experiments, is that the misconception that event-driven has higher performance than threaded programming is not fundamentally true. Therefore, falling into asynchronous programming and creating hard-to-maintain applications [1] solely for better performance has no merit.

[1] https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-k...




You know what's really silly? Conflating "event driven vs. threads" with "shared memory vs. non-shared". They're two different things. In fact, "silly" is too charitable a word to describe your abuse of terminology to support a clearly-refuted point.


We've asked you many times to stop breaking the site guidelines in comments here. If you keep doing it, we are going to have to ban you. Please review the rules and stick to them from now on: https://news.ycombinator.com/newsguidelines.html.


I think it was "Flash Boys" by Michael Lewis that mentioned a Russian programmer who was very good at programming with pen and paper. The reason was that in the old days there were not many computers around, or you had to use punch cards to program, so the only option was to be able to write the code on paper and make sure everything was well thought out before running it on an actual computer.

It only made sense because the interview process followed the same pattern in those days. The difference is there was not much information to remember; most probably, knowing a few data structures and algorithms was enough.

I feel the tech interview process, like so many other things, is still stuck in the past. It seems the monks are still tying the cat to the tree without knowing why. Everyone seems unhappy about it, but nothing changes.


Since I/O multiplexing in Go is based on edge-triggered epoll, you always need to do a read/write syscall (it happens underneath) before blocking the goroutine, in order to arm epoll to poll that specific FD. Also, this approach relies on parking/unparking the goroutine, and there is no way around it, since it is part of the runtime.

One way Go could make it non-blocking is to do the read/write and return the result immediately, but such a call requires a callback function: instead of parking/unparking the goroutine, the callback is invoked (either directly or via "go cb_function"). It can be tricky to make this work, since the initial goroutine might not be allowed to call read on the FD anymore (perhaps exiting the goroutine if the read was unsuccessful). In addition, you still need to provide a buffer to read into; the 1-byte trick works, but the buffer management has to happen either when the read succeeds or in the callback. This approach also makes the programming asynchronous and similar to event programming.


Thanks for mentioning this; it is indeed very related and very interesting. I am not sure why neither I nor the people around me were aware of Qthreads: it has very good support for various architectures and provides many interesting features. It has many similarities with what I have in mind for a concurrency library, and some of its research goals seem very close to mine, especially the notion of affinity and locality, which is what I am focusing on in uThreads. I am going through the papers and the source code at the moment to see what the similarities and differences are.

uThreads is still a work in progress, and I have specific plans for it that might differ from what Qthreads is trying to accomplish. For now my focus is more on providing auto-tuning of Clusters based on the workload. I will also explore the pros and cons of uThread migration based on Cluster placement (NUMA and cache locality); from the 2008 paper, it seems that is what you are studying as well. I am open to collaboration if there is an ongoing study around this topic.


Ooh, ok. You might also check out OpenSHMEM (http://openshmem.org/site/) and OpenUCX (http://www.openucx.org). I'd been planning on integrating both into RaftLib, just not enough cycles to get it done yet. Combined, they would make it much easier to maintain a relatively portable yet performant back end. My thesis research was all about locality, memory placement, and throughput for big-data systems. Current work is similar, but I'm not neck-deep in the hardware dev world; my research efforts outside of work mostly center around the raftlib.io platform.

Before I forget, you might also want to check out graph-partitioning frameworks like METIS and Scotch; both are used in some MPI frameworks for more optimal partitioning. To get topology data, you might want to look at the hwloc framework: it's cross-platform and provides input such as NUMA/cache/PCIe topology for optimization. I haven't had a chance to integrate this hook into RaftLib, but it's just a few lines away once I find the time.

If you're wondering, I started out writing my own fiber library for RaftLib, with ports for both IBM Power and Intel, but it gets a bit tiring maintaining/optimizing for every new architecture. Qthreads and the like have been used in HPC circles for quite a while, so it made sense: there was no way I was beating them on dev time, so I might as well join them.

Based on your auto-tuning discussion...RaftLib aims to do something similar, but for big-data stream processing workloads. Here's my 2016 IJHPCA paper: http://hpc.sagepub.com/content/early/2016/10/18/109434201667...

It looks like it's behind a paywall, so if you don't have access I'll update my website with the "author archive" copy sometime today; it will be at ( http://jonathanbeard.io/media ) once I update it. Bottom line: if there's intersecting interest, definitely open to collaboration :).


Thanks, so much information to absorb; let me go through all this and get back to you. I'll shoot you an email once I've processed it all :)


Seems like a very mature library; I can't answer your question before I go through its documentation and code. Also, I find the question a bit abstract: since I am not aware of the details of the problem you are trying to solve, I cannot say why any library or tool would be better than another. In your case, TBB might fit your requirements, and it might be hard to justify switching to another library.


Good idea, I'll probably write a blog post on this later. If you take a look at [1], I explain the difference between N:1 and M:N mappings. StateThreads uses an N:1 mapping, which means you multiplex many fibers over a single kernel thread; to take advantage of multiprocessors, you can run M processes that each do N:1 mapping, which is common practice (libmill, libdill, ...). With uThreads, however, you can multiplex many fibers over many kernel threads within a single process.

---------------------------------

[1] https://github.com/samanbarghi/uThreads#what-are-uthreads


Sure, I get the M:N argument, but that wasn't something you could extend existing solutions to support?


Do you have anything specific in mind? The only code I found similar to this is uC++ [1], which has far more features and a more sophisticated scheduler. I am using that as part of my research and wanted something very simple.

For all N:1 mappings, since there is only a single process, there is no need for synchronization; there is also no need for any scheduler, as a simple queue suffices. But as soon as multiple kernel threads are introduced, they need orchestration, and that changes all other aspects of the code. Of course, I could have developed on top of an existing codebase, but I suspect I would have had to change so much that it is better to start from scratch anyway.

------------------------------------

[1] https://plg.uwaterloo.ca/usystem/uC++.html


> For all N:1 mappings, since there is only a single process, there is no need for synchronization, also there is no need for any scheduler as a simple queue suffice.

Wouldn't an M:N model simply amount to work stealing among N kernel-thread-local queues? This seems like it should be a pretty straightforward extension to one of the user-level C thread packages.

Or are you doing something more elaborate for your research?


Well, on the surface, yes! When I started I expected the same. For now there is no work stealing among kThreads. But let me dig into the problem a bit deeper so you get an idea of why it might not be very straightforward (it also depends very much on the use case).

Usually these libraries either provide their own queue or rely on an underlying event system, e.g. epoll/kqueue, to manage the state of the lightweight threads. The other main part is I/O multiplexing, which can be done using select/poll/epoll/kqueue ...

Let's say they are using some sort of queue. Since there is no other thread in the system, no synchronization is required, and it's as simple as pushing to the tail of the queue and pulling from the head. Now, if I add more kernel threads, I have to decide how to synchronize among them. The straightforward way is to share the same queue among the kernel threads, guarded by a mutex and condition variable. However, mutexes have high overhead and do not scale well (pthread_mutex in particular scales poorly under contention on Linux). What about N kernel-thread-local queues, each with its own mutex? If you look at the documentation, this approach does better but still has high overhead. What is next? Replace it with some sort of lock-free queue: MPMC, MPSC, or SPSC. It takes some work to determine which has the lowest overhead; in my case I have not tested SPSC and have for now settled on MPSC. So the queue part is completely different from the simple queue these N:1 libraries use, since they did not need to care about any of this.

Next comes the I/O multiplexing. Relying on an epoll instance per kernel thread is absolutely fine, especially when uThreads stick to the same kernel thread. But as soon as I introduce the notion of migration, moving from one kernel thread to another means issuing at least two system calls (in the case of epoll: deregister from the current epoll instance and register with the new one). This has high overhead, so I need some sort of common poller that multiplexes over multiple kernel threads. But with more than one kernel thread, connections must be properly synchronized, as more than one thread might access a connection, and it all has to be done with low overhead, since migration in my research must be very cheap. Thus, the I/O multiplexing part also has to be replaced.

So the main parts need to be changed, and I believe the effort required to make fundamental changes to an existing system might exceed the effort of writing it from scratch. Also, for each part I implemented, I did performance optimisations and built on top of that, which helped keep performance at an acceptable level. It would be hard to do the same with an existing system, as it would require isolating and optimising each part separately, which takes additional effort.

I hope it makes it more clear :)


I agree there are additional complexities with M:N. I was assuming a lock-free queue, since you'll need this for scalable work-stealing.

I was also assuming a single kernel thread performs I/O via epoll/kqueue/etc. and either has its own queue from which other threads steal, or simply pushes results onto a random queue when requested I/O are complete. This accomplishes the migration I believe you were describing.

When I/O needs to be done, you enqueue the uthread on the I/O queue and invoke a reserved file descriptor to notify the I/O thread, which then reshuffles its file descriptors and again calls epoll/kqueue.

I'm not sure whether this would scale as well as what you're doing since you mentioned an epoll-per-kernel thread, but I wouldn't be surprised if it got close since it's so I/O-bound.


> I was also assuming a single kernel thread performs I/O via epoll/kqueue/etc. and either has its own queue from which other threads steal, or simply pushes results onto a random queue when requested I/O are complete. This accomplishes the migration I believe you were describing.

> When I/O needs to be done, you enqueue the uthread on the I/O queue and invoke a reserved file descriptor to notify the I/O thread, which then reshuffles its file descriptors and again calls epoll/kqueue.

If I remember correctly, what you described is similar to how golang performs I/O polling, since they use work stealing, except there is no dedicated poller thread; threads take turns doing the polling whenever they become idle.

Also, with epoll/kqueue there is no need to enqueue uThreads: you can simply store a pointer (to a struct that has reader/writer pointers to uThreads) with the epoll event you create, and recover it later from the notification. This way you set the pointer whenever a uThread needs to do I/O.

I agree a single poller thread does not scale as well as a per-kThread epoll instance; it requires some careful thought to provide a low-overhead solution that does not sacrifice performance for scalability.


With the libdill approach, you are probably limited to multiplexing connections over multiple kernel threads: once a connection is accepted on a kernel thread, all further work for it happens on that kernel thread. So it gives you some control over which cores are utilized, but beyond that you have no control over which part of the code executes on which core.

Using uThreads, you can decide which kernel thread executes each part of the code and, if taskset is used, which core. You do this by creating Clusters and using migration or uThread creation at runtime. Thus you can decide which threads multiplex connections and which run CPU-bound code, in addition to having, for example, a thread pool for asynchronous disk I/O. Ultimately, one can create a SEDA[1]-like architecture using uThreads. You can also always use uThreads as a single-threaded library.

------------------------------------------

[1] https://en.wikipedia.org/wiki/Staged_event-driven_architectu...


First time I've heard of SEDA, though I've been aware of the concerns/concepts it addresses, in various forms, for some time.

Any thoughts on Welsh's Retrospective on SEDA?

http://matt-welsh.blogspot.com/2010/07/retrospective-on-seda...


I have seen this before. Those are very good points, and I am trying to move this library towards supporting a SEDA-type architecture with dynamic control and auto-tuning of runtime parameters.


Yes, there is a specific reason behind it, but in the future I might consider a less restrictive license.


Thanks for the reply and your future consideration. I'm glad you seem to have taken the question as it was intended. I certainly didn't mean to imply that one license is superior to another.


Right; however, segmented stacks have high overhead, and stack copying is not easy in C/C++. Thus, for now uThreads only supports fixed-size stacks. I know that makes it harder to use in production, and in the future I might provide optional segmented stacks. As for why I'm not using `boost::context`: I am implementing uThreads as part of my research at UWaterloo, and I wanted full control over the code so I could optimize for performance as much as possible. Thus I tried to avoid relying on any third-party code when I started :)


Thanks! Good point; now that I look at the page, there is not a single code sample there. I'll update it soon.


Awesome! As an example, Rayon[0] does a very good job (IMHO) at this.

[0] - https://github.com/nikomatsakis/rayon


Thanks, oh, all those fancy functions. I need to improve the interface a bit; for now everything is based on using uThreads as the unit of concurrency. E.g., this is a recursive Fibonacci: https://github.com/samanbarghi/uThreads/blob/master/test/Fib... and a fork-and-join Fibonacci: https://github.com/samanbarghi/uThreads/blob/master/test/Fib...

There is no fork-and-join primitive in uThreads yet; I built it using create and join. The interface will improve in the future :)

