
  $ files | narrow \.rs$ | narrow '!tests' | xargs cat | nlines
  4251
It's 4k lines of Rust. Shedding the static typing nonsense will get rid of at least 25% of that. Writing it in Lumen will buy an extra 2x in productivity. And there's nothing to discover; the algorithms are right there, and my claim is that they will run nearly as fast in a non-statically-typed language. I don't think the weekend claim is that outrageous.

You don't like putting on a show for a crowd? It's one of the funnest things.



First of all, take a look at Cargo.toml for the list of dependencies; repeat recursively. Projects like xsv and ripgrep are modular, with many components that others can and do reuse.

Second, a count of lines of code gives only the roughest idea of how hard something would be to write, and write well.

Third, interesting that you're not counting the test cases; after all, if you're not doing any static typing, surely you'll want more tests...

Fourth, hey, as long as you're getting rid of the "static typing nonsense" you might as well drop the error handling and comments while you're at it. More seriously, though, type signatures and similar are hardly a significant part of the lines of code of the average Rust program.

But in any case, you've already seen the replies elsewhere in the thread inviting you to try if you feel confident you can do so.

> You don't like putting on a show for a crowd? It's one of the funnest things.

You're certainly showing the crowd something about yourself. Whether it's what you're intending is another question.

If you want to write a replacement or alternative for a tool, especially as an exercise in learning something, by all means do; it's a fun pastime. You don't need to dismiss someone else's work or choice of language in the process.


If it sounded like I was dismissing someone else's work, you're reading too much into it. Who would be silly enough to dismiss a tool from the author of ripgrep?


Claiming you can implement a version in a weekend and match the same performance is quite dismissive.

Superficially counting the lines of code in the top-level project (ignoring everything else) and implying that it's "just" 4000 lines of code (as though that's a full description of the effort that went into it) is also quite dismissive.


It wasn't dismissive, it was foolish. The CSV parser is actually a separate project, and is around 15k lines of code. That certainly won't be done in a weekend.

Look, it's stellar, A+ software. All I was saying is that you can write it in a dynamic language without sacrificing performance. The goal wasn't to match the full functionality of XSV; that'd be absurd.

In some cases, LuaJIT is even faster than C, so it's not an outlandish claim to say that it could match Rust here.

The Python claim was in the spirit of good fun, but that probably didn't come across.

Either way, software is meant to be fun. It's a positive statement to say that a dynamic language can match the performance of a statically typed one. Isn't that a cool idea, worth exploring? Why is it true?

The reason I'm confident in that claim is that LuaJIT has withstood the test of time and has repeatedly proven itself. This reduces to the old argument of static types vs lack of types. But a lack of typing was exactly why Lisp was so powerful, back in the day, and why a small number of programmers could wipe the floor with large teams.

Either way, I've managed to stir the hive, so I'll leave this for whatever it is. To be clear: XSV is awesome software, and I never said otherwise.


The LuaJIT idea is interesting; I've certainly been impressed by it in the past, and I can agree it goes some way toward dispelling myths like "statically typed languages are always faster than unityped languages." But if you instead interpret that as a first approximation, it's fairly accurate IMO.

In the interest of cutting to the chase, I'll try to explain some of the high-level reasons why the CSV parser is fast, and typically faster than any other CSV parser I've come across.

Firstly, it is implemented as a hand-rolled DFA that is built from an NFA. The NFA is what most robust CSV parsers typically use, and it is quite fast, but it suffers from the overhead of moving through epsilon transitions and handling the case analysis that comes from the parser's configuration (delimiter, quote, escaping rules, and so on). It seems to me like this concept could be carried over to LuaJIT.
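
To give a flavor of the shape of it, here's a minimal sketch of the hot loop of a table-driven DFA. The states and transition table are hypothetical, not the actual rust-csv internals; the point is only that the configuration-dependent case analysis gets compiled into the table once, up front:

  // Hypothetical sketch, not the actual rust-csv code. The table is
  // built once from the NFA when the parser is configured, so the
  // delimiter/quote/escape case analysis is baked into the
  // transitions instead of happening on every byte.
  const NUM_STATES: usize = 8; // a real parser has a handful of states

  // trans[state][byte] => next state. All zeroes here just so the
  // sketch compiles; a real table comes from compiling the NFA.
  static TRANS: [[u8; 256]; NUM_STATES] = [[0; 256]; NUM_STATES];

  fn run(input: &[u8]) -> u8 {
      let mut state = 0u8; // start state
      for &b in input {
          // One table lookup per byte: no epsilon transitions, no
          // branching on parser options.
          state = TRANS[state as usize][b as usize];
      }
      state
  }

  fn main() {
      println!("final state: {}", run(b"a,b,\"c d\"\n"));
  }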

Secondly, the per-byte overhead of the DFA is very low, and it even special-cases[1] some transitions to get the overhead lower still. If you were doing this in pure Python or Lua or really any unityped language, I would be very skeptical that you could achieve this because of all the implicit boxing that tends to go on in those languages. Now, if you toss a JIT in the mix, I kind of throw my hands up. Maybe it will be good enough to cut through the boxing that would otherwise take place. From what I've heard about Mike Pall, it wouldn't surprise me! If the JIT fails at this, I'm not sure how I'd begin debugging it. I imagine it's like trying to convince a compiler to optimize a segment of code in a certain way, only harder.
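
As a hedged illustration of the kind of special-casing I mean (not the exact transition that [1] special-cases): when the DFA sits in a state that self-loops on almost every byte, you can skip runs of those bytes in bulk, say with memchr, instead of consulting the table for each one. A toy version:

  // Hypothetical sketch of special-casing a self-looping state. In
  // an unquoted field, only a few bytes can change the state;
  // everything else self-loops, so we can scan ahead cheaply (a real
  // implementation might use memchr here) before re-entering the
  // general table-driven loop.
  fn skip_unquoted_field(input: &[u8], mut i: usize) -> usize {
      while i < input.len() {
          match input[i] {
              b',' | b'"' | b'\r' | b'\n' => break, // state can change here
              _ => i += 1,                          // self-loop: just advance
          }
      }
      i
  }

  fn main() {
      let line = b"hello world,\"quoted\"\n";
      // Skips straight to the first delimiter at index 11.
      println!("{}", skip_unquoted_field(line, 0));
  }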

Thirdly, a critical aspect of keeping things fast, one that bubbles all the way up into the xsv application code itself, is the amortization of allocation. Namely, when xsv iterates over a CSV file, it reuses the same memory allocation for each record[2]. If you've written performance-sensitive code before, then this is amateur hour, but I personally have always struggled to get these kinds of optimizations in unityped languages because allocation is typically not a thing they optimize for. Can a JIT cut through this? I don't know; I'm out of my depth. But I can tell you one thing for sure: in languages like Rust, C or C++, amortizing allocation is a very common thing to do. It is straightforward and never relies on the optimizer doing it for you. There are different angles to take here, though. For example, unityped languages tend to be garbage collected, and in that environment allocations can be faster, which might make amortization less effective. But I'm really waving my hands here; I'm just vaguely drawing on experience.
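
For the curious, the amortized loop looks roughly like this with the csv crate (the real xsv code in [2] is more involved; this just shows the buffer-reuse pattern, and needs the csv crate as a dependency):

  use std::{error::Error, io};

  fn main() -> Result<(), Box<dyn Error>> {
      let mut rdr = csv::ReaderBuilder::new()
          .has_headers(false)
          .from_reader(io::stdin());
      // One allocation, reused for every record in the file. The
      // iterator APIs are more convenient, but hand back a freshly
      // allocated record on each step.
      let mut record = csv::ByteRecord::new();
      let mut n: u64 = 0;
      while rdr.read_byte_record(&mut record)? {
          n += 1;
      }
      println!("{}", n);
      Ok(())
  }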

Anyway, I think it's kind of counterproductive to try to play the "knows better than the hivemind" role here. There are really solid reasons why statically typed languages tend to outperform unityped languages, and a counterexample in some cases doesn't make those reasons any less important. I think I could also construct an argument around how statically typed languages make it easier to reason about performance, but I don't quite know how to phrase it. In particular, at the end of the day, both cases wind up relying on some magic black box (a compiler's optimizer or a JIT), but I'm finding it difficult to articulate why that isn't the full story.

[1] - https://github.com/BurntSushi/rust-csv/blob/546291a0095a2537...

[2] - https://github.com/BurntSushi/xsv/blob/9574d89634031259802dd...


Just wanted to say that you ought to be paid for your comments in threads about your tools, they're so good. Thanks!


My productivity doesn't come from writing software. It comes from reading its code and maintaining it. You can pry my types out of my cold dead hands. :-)

How long it takes you to do this largely depends on how much you can leverage your language's ecosystem. If you don't have a robust and fast CSV parser already written for you, then you'd need to sink many weekends into that alone.


I hope this is a joke, because I expected Python but got Unix pipes.



