As someone whose last 8 years of professional work have been largely split between...C# and Ruby, I have to disagree. I do agree about C# being quite nice; it's all the things Java should be but isn't. And it's evolving quickly in all the right ways.
But it isn't really like Ruby with static typing. The language isn't Rubyish (no mixins or blocks, for example). It makes you put everything in a class, a la Java. It can be verbose and is occasionally downright clunky (though syntactically it's categorically slicker than Java). The .NET ecosystem doesn't have the Ruby characteristic of lots of small, fast-evolving libraries that are easy to use. In fact, the C# open source ecosystem is kinda poor in general and not a huge part of most developers' lives, whereas Ruby's ecosystem is vibrant and an integral part of its coding culture.
Another way to put all that is that if C# were purely dynamically typed, it wouldn't feel anything like Ruby.
I do see what you're saying: LINQ feels like a static (and lazy!) version of Ruby's Enumerable module, the lambdas look similar, C# actually does have optional dynamic typing, and C# is increasingly full of nice developer-friendly features. In general, I'm a fan. But switching between them doesn't feel like just a static/dynamic change.
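To illustrate the flavor overlap: a LINQ chain really does read a lot like Ruby's Enumerable. This is just an illustrative sketch of my own, with the rough Ruby equivalent in the comments:

```csharp
using System;
using System.Linq;

class RubyishLinq
{
    static void Main()
    {
        // Ruby: (1..10).select { |n| n.even? }.map { |n| n * n }.take(3)
        var squares = Enumerable.Range(1, 10)
            .Where(n => n % 2 == 0)   // select { ... }
            .Select(n => n * n)       // map { ... }
            .Take(3);                 // take(3), but lazy in C#

        Console.WriteLine(string.Join(", ", squares)); // 4, 16, 36
    }
}
```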
Yeah, I mostly agree with that and sort of said so in my post. Not sure it has much impact on my central point, though.
One really key difference with LINQ is that it doesn't produce arrays (or dictionaries, as in your example); it produces lazy enumerables, which you then have to call ToList() or ToDictionary() on. That laziness is actually an awesome feature and my favorite thing about LINQ, because it can massively improve performance by short-circuiting work and not creating intermediate arrays. You can even work on infinite sequences with it. Besides performance, it's just tastier. It's so great I actually wrote a Ruby library to imitate it: https://github.com/icambron/lazer
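Here's a small sketch of what I mean by working on infinite sequences. Naturals() is a made-up helper; the point is that the chain does no work until it's consumed, and only ever produces the handful of items actually asked for:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LazyDemo
{
    // An infinite sequence: yields 0, 1, 2, ... forever.
    static IEnumerable<int> Naturals()
    {
        for (int i = 0; ; i++) yield return i;
    }

    static void Main()
    {
        // Nothing runs yet: this just builds a chain of wrapped enumerators.
        var query = Naturals()
            .Where(n => n % 3 == 0)
            .Select(n => n * n)
            .Take(4);

        // Work happens only now, and only 4 results are ever produced.
        Console.WriteLine(string.Join(", ", query)); // 0, 9, 36, 81
    }
}
```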
Is LINQ really fast/performant though? Wouldn't the above expression cause three sequential loops to run?
One of the biggest performance issues I've seen with modern .NET code is people abusing LINQ and lambdas. Chaining functions like this is most decidedly not fast. I once wrote a library that had to do some heavy signal processing on large data sets, and since I wanted to ship the first version as soon as possible, I just used LINQ in a lot of functions to save time. It wasn't very performant, so I later rewrote most of the functions to use plain native constructs: loops for iteration, hashmaps for caching, and all sorts of improvements like that. I got rid of LINQ completely in that version, and for many functions the runtime went down from something like 500ms-1000ms to the microsecond range.
So sure, LINQ makes development fast and it's very nice to be able to write code such as .Skip(10).Take(50).Where(x => ...). On most web projects, it won't make a huge difference. I've seen Rails "developers" use ActiveRecord in such a way that they would create double and triple nested loops and then hit the database multiple times by using enumerable functions on ActiveRecord objects without realizing how this works, what's going on behind the curtains and so on. I've seen .NET devs do similar things using EntityFramework.
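The "hitting the database multiple times" trap comes from deferred execution: a deferred query re-runs every time you enumerate it. No real ORM here, just a stand-in FetchRows() I made up to simulate a data source, but the shape of the mistake is the same:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredPitfall
{
    static int fetches = 0;

    // Stand-in for a data source: each enumeration "hits the database" again.
    static IEnumerable<int> FetchRows()
    {
        fetches++;
        for (int i = 0; i < 5; i++) yield return i;
    }

    static void Main()
    {
        var query = FetchRows().Where(r => r % 2 == 0); // deferred, not run yet

        var count = query.Count();   // first enumeration: one "fetch"
        var list  = query.ToList();  // second enumeration: another "fetch"

        Console.WriteLine(fetches);  // 2 -- the source was walked twice
    }
}
```

Materializing once with ToList() and reusing that list avoids the repeat trips.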
So yeah, it's convenient and all, but it can also be very dangerous when used by someone who doesn't understand the fundamentals behind these principles.
> Wouldn't the above expression cause three sequential loops to run?
No, it wouldn't; that's the really important point about LINQ I was, clumsily, trying to express above [1]. Take this admittedly totally contrived example:
someList
.Where(i => i % 2 == 0)
.Select(i => i + 7)
.Take(5)
This is not equivalent to a bunch of sequential loops. What it is is a bunch of nested Enumerators. Here's how it works. It gets the list's Enumerator, which is an interface that has a MoveNext() method and a Current property. In this case, MoveNext() just retrieves the next element of the list. Then the Where() call wraps that enumerator with another enumerator [2], but this time its implementation of MoveNext() calls the wrapped MoveNext() until it finds a number divisible by 2, and then sets its Current property to that. That enumerator is wrapped with one whose MoveNext() calls underlying.MoveNext() and sets Current to underlying.Current + 7. Take()'s enumerator just reports that the sequence is finished (its MoveNext() returns false) after 5 underlying MoveNext() calls.
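To make the wrapping concrete, here's a hand-rolled sketch of what a Where()-style enumerator looks like. This isn't the actual BCL implementation, just a minimal version of the idea:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// A hand-rolled version of the Where() idea: an enumerator that wraps
// another enumerator and filters items as they're pulled through.
class WhereEnumerator<T> : IEnumerator<T>
{
    private readonly IEnumerator<T> inner;
    private readonly Func<T, bool> predicate;

    public WhereEnumerator(IEnumerator<T> inner, Func<T, bool> predicate)
    {
        this.inner = inner;
        this.predicate = predicate;
    }

    public T Current => inner.Current;
    object IEnumerator.Current => Current;

    // Keep pulling from the wrapped enumerator until an item passes the filter.
    public bool MoveNext()
    {
        while (inner.MoveNext())
            if (predicate(inner.Current)) return true;
        return false;
    }

    public void Reset() => inner.Reset();
    public void Dispose() => inner.Dispose();
}
```

A Select()-style wrapper is the same shape, except its MoveNext() always advances once and its Current applies the projection.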
So all that returns an enumerable, so as written above, it actually hasn't done any real work yet. It's just wrapped some stuff in some other stuff.
Once we walk the enumerable--either by putting a foreach around it or by calling ToList() on it--we start processing list elements. But they come through one at a time as these MoveNext() calls bring through the list items; think of them as working from the inside out, with each MoveNext() call asking for one item, however that layer of the onion has defined "one item". The item is pulled up through the chain, only "leaving" the original list when it's needed. The entire list is traversed at most once, and in our example, possibly far less: the Take(5) stops calling MoveNext() after it's received 5 values, so we stop processing the list after that happens. If someList were the list of natural numbers, we'd only read the first 10 values from the list.
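You can watch the one-at-a-time flow directly by logging inside the lambdas. Note how the log interleaves per item rather than running each stage over the whole list:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PullOrder
{
    static void Main()
    {
        var log = new List<string>();

        var query = new[] { 1, 2, 3, 4 }
            .Where(i => { log.Add($"where {i}"); return i % 2 == 0; })
            .Select(i => { log.Add($"select {i}"); return i + 7; });

        foreach (var x in query) log.Add($"got {x}");

        // Each item travels through the whole chain before the next is pulled:
        // where 1, where 2, select 2, got 9, where 3, where 4, select 4, got 11
        Console.WriteLine(string.Join(", ", log));
    }
}
```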
Now, those nested Enumerator calls aren't completely free, but they're not bad either, and you certainly shouldn't be seeing a one second vs microseconds difference. If you craft the chain correctly, it's functionally equivalent to having all of the right short circuitry in the manual for-loop version, and obviously it's way nicer.
So why are you seeing such poor perf on your LINQ chains? Hard to say without looking at them, but a few pointers: (1) Never call ToList() or ToDictionary() until the end of your chain, or anything else that would prematurely "eat" the enumerable. (2) Order the chain so that filters that eliminate the most items go at the beginning of the chain, similar to how you'd put their equivalent if (...) continue; checks at the beginning of your loop body. (3) Just be cognizant of how LINQ chains actually work.
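Pointer (2) is easy to see with a counter. This is a contrived sketch where the "expensive" step is just a counter increment, but it shows how filtering first keeps the costly projection from running on items that were going to be thrown away anyway:

```csharp
using System;
using System.Linq;

class ChainOrder
{
    static void Main()
    {
        var data = Enumerable.Range(1, 1000);
        int expensiveCalls = 0;

        // Cheap, highly selective filter first: the expensive projection
        // only runs on the few items that survive.
        var results = data
            .Where(i => i % 100 == 0)                      // keeps 10 of 1000
            .Select(i => { expensiveCalls++; return i * i; })
            .ToList();

        Console.WriteLine(expensiveCalls); // 10, not 1000
    }
}
```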
[1] In the example in the parent, FindAll isn't actually a LINQ method, so there is one extra loop in there. Always use Where() if you're chaining; use FindAll() when you want a simple List -> List transformation.
[2] A detail elided here: each level actually returns an Enumerable and the layer wrapping it does a GetEnumerator() call on that.
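To make footnote [1] concrete, here's a small sketch counting predicate calls. FindAll() has to filter the whole list eagerly, while Where() stops as soon as the downstream Take() is satisfied:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class WhereVsFindAll
{
    static void Main()
    {
        var list = new List<int> { 1, 2, 3, 4, 5, 6 };
        int eagerChecks = 0, lazyChecks = 0;

        // FindAll is a List<T> method, not LINQ: it filters the whole list
        // immediately and allocates a new List<int> before Take() runs.
        var eager = list.FindAll(i => { eagerChecks++; return i % 2 == 0; })
                        .Take(2).ToList();

        // Where is lazy: filtering stops as soon as Take(2) has its 2 items.
        var lazy = list.Where(i => { lazyChecks++; return i % 2 == 0; })
                       .Take(2).ToList();

        Console.WriteLine($"{eagerChecks} vs {lazyChecks}"); // 6 vs 4
    }
}
```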
Thank you for this awesome explanation, it really clarifies how Enumerable methods work. I guess this is the real reason behind deferred execution. Still, I can't help thinking there is a big overhead involved with using such methods. We did some benchmarks in the past and the code that does the same thing manually always ended up being much faster.
The nice thing about Enumerable methods is that they can significantly speed up development and most projects won't suffer for it. However, for speed critical code it's probably not the best tool in the box.