
Sure.

https://gist.github.com/tqbf/4de61a3e34d2e4664044666c107abe7...

(I don't vouch for this code; I wrote it off the top of my head, and, in keeping with the exercise, I wrote it in pico).

It's not 10x longer. I'm honestly not sure why either Rust or C++ bothered with regexps for this problem.

Anyways, you get the gist of what this looks like in C now. Obviously, don't write things like this in C.



You have/had a bug in your code: the realloc() needs to multiply its second argument, csz, by sizeof *counts (i.e. sizeof(struct wordcount)).


Of course I have a bug in this code. Thanks, fixed! :)

With that fixed, this dumb program does my whole (very large) homedir in about 10 seconds, for some definition of "does" that may or may not include counting every word in every txt file. For the curious:

   0. get / 566006
   1. the / 158168
   2. pkg / 119419
   3. syscall / 105828
   4. const / 98259
   5. and / 86670
   6. ideal-int / 43078
   7. that / 31609
   8. for / 31129
   9. this / 24980
   10. text / 22907
:P


Just trying to help, not criticize. As to correctness, if you change the tokenization to:

  while (fgets(buf, 1024, fp)) {
    char *c, word[1024], *w = word;
    for (c = buf; *c; c++) {
      *c = tolower(*c);
      if (*c >= 'a' && *c <= 'z')
        *w++ = *c;
      else {
        *w = '\0';
        count(word, (size_t)(w - word));
        w = word;
      }
    }
    *w = '\0';
    count(word, (size_t)(w - word));
  }
and have that `count` function skip words with len < 2, then I can match results with the other 3 (Nim, Rust, C++) versions. Also, your C version runs in just 7 ms on Tale of Two Cities, about half the time of the PGO-optimized Nim.


Oh, no, you don't sound critical at all. I'm making fun of myself. Thank you for actually reading it!


Cool. You could probably simultaneously speed it up and simplify it by changing from linked-list collision chains to open addressing with linear probes (though you may need a larger TSZ). Besides being more cache-friendly for larger files, at the end you can then just qsort your table instead of first building a flat copy. :-) You could also use the unreduced hash value as a comparison prefix (only calling strcmp when the hashes match). But it's not bad for potato code. ;-)


I didn't do open addressing because I didn't want to write code to resize the whole table. :)

I thought about also keeping the top 10 as I go instead of copying the whole table. But I'm guessing that virtually all the time this program spends is in I/O.


I kind of suspected that might be the reason... but you do still have a scalability problem for very large vocabularies. :-)

I just did a profile and saw about 15% in strcmp in the hot cache run, but sure, if it's not in RAM then I/O is likely the bottleneck.


(You could also just save the len and use that as a comparison prefix, but the hash value is less likely to yield a false positive.)


Isn't this code kind of supporting the first point in the article? I mean, the article started by pointing out how Rust "protected" you from a "TOCTTOU" error.

Not only does this C code not protect you against that, it also seems to have little to no error checking. What happens if fopen fails? It just skips to the next file and happily ignores all the entries in the file. What if fgets fails? Again, it stops processing the file and goes on to the next.

I won't even get into what happens if calloc fails and returns NULL, or if a counter's ++ wraps around and you get a negative value (let's hope we are using 64 bits).

This program will likely work, but if it doesn't you will just get an invalid number and never find out.

Don't get me wrong, I understand this is "not serious" code, written on a Sunday and for fun. And I would be OK with it if the article weren't about language safety.


Try as hard as you like, you aren't going to find anyone to argue that C is as safe as Rust.

As for the TOCTTOU issue --- which is kind of silly, and had really nothing to do with Rust --- the diff to fix that problem is tiny:

https://gist.github.com/tqbf/4de61a3e34d2e4664044666c107abe7...

On a sane system, what's going to happen if calloc (or realloc, or strdup) returns NULL is the same as what's going to happen in the Rust program: the program will abort. Checking malloc returns is a bad idea, and causes more problems than it solves.

Similarly: this code does essentially the same thing with errors that the Rust one does: it swallows them and continues.

At any rate, while this C++ vs Rust discussion might be about language safety (I vote Rust, like any other software security person would), this subthread isn't.


So, actually, this entire TOCTTOU arc is a little bit off track, or at least incomplete. If someone replaces an ordinary file with a FIFO/named pipe "foo.txt" during the traversal (or simply if your recursive file-tree walk does not filter out FIFOs), then any program will block during the open(2) of the FIFO. Things will never advance to the fstat() that checks its file type.

I just checked and whatever default behavior dga's Rust code uses (both for traversal and for the open call) does indeed just block forever if there is a FIFO present with a name ending in .txt.

There is a way out - by adding O_NONBLOCK to the OS flags for the open call. { You could also stat() just before open(2) instead of fstat() just after, but then you still have a race condition, just with a much smaller time window. }


I love this thread so much.


Well, then you may also appreciate changing my Nim program from the `continue` to the final output `for` to:

    let mf = mopen(path)
    if mf == nil:
      stderr.write "wc: \"",path,"\" missing/irregular\n"
    else:
      var word = ""
      for c in toOpenArray(cast[ptr UncheckedArray[char]](mf.mem), 0, mf.len-1):
        let c = c.toLowerASCII
        if c in {'a'..'z'}:
          word.add c
        else:
          count.maybeCount word
          if word.len > 0: word.setLen 0
      count.maybeCount word # file ends in a word
      mf.close
You also need to add ",cligen/mfile" to the import line and install a version control head of the cligen package.

With that the Nim time goes down to 6.1 ms (no PGO) and 5.0 ms (PGO). Just a quick after dinner fix-up.

{ There is actually a trick related to ASCII where we could add `, 'A'..'Z'` to the character set and then `or` in a bit to make the letter lowercase (because 'a' - 'A' is 32), but I just did the above to test that cligen/mfile worked with FIFOs in the mix, not to maximize performance. I suspect pre-sizing the table by adding e.g., 10_000 or 20_000 to the init would make more difference.}


Beyond the above couple of ideas, the next steps to make it faster would be to get rid of almost all dynamic memory allocation by having a table of byte-offset,len,count triples (or maybe add in a hashcode for a quadruple) and rolling your own "memcasecmp" if your stdlib does not provide one. This could probably make the histograms so small that they are likely fully L2-resident... roughly (8..12) times vocabSize bytes is 192 KiB for a 16384 vocabSize. https://github.com/c-blake/adix/blob/master/tests/anaPrime.n... has an example of how to do this in Nim (for values in a table, though in our case here it would be for keys and so need `hash` and `==`).

At that point if you had a lot of data, the better way to parallelize it is also not either of the two ways DGA mentions in his original blogpost with one shared table, but rather one histogram per CPU with no locks or memory contention at all and a final merge of the CPU-private histos (after processing all files) that can probably be single-threaded, though you could also organize it as a "tree" of parallel merges (e.g. first merge every pair, then every pair of the results and so on).

Whether to parallelize over files or over big chunks of files would depend upon how big your files are. The big-chunks approach leads to more regular work balance because making all subproblems the same size is easy, but it does need "clean-up" for any words that span a chunk boundary, which is a bit annoying; still, there are only as many fix-ups as you have threads, which is small. It would be unsurprising to me if Tale of Two Cities could be done in like 0.1 ms with 32 cores, up to 500x faster than that initial Rust version (maybe omitting thread start-up overheads).

This particular problem is actually so fast that you might need to use all of Dickens cat'd together or more to get numbers less polluted by system noise and microsecond-scale thread start-up times and small L2 histo merges or do min-of-more-than-5 runs for times, but all the principles I've mentioned apply more broadly (except maybe the ASCII case trick).

Anyway, these are all sort of obvious points on a topic that was more about prog.langs, but a good systems programming language + ecosystem ought to make it "easy-ish" to get that last factor of 10..500x that is easy enough to express in a HackerNews comment.


>you aren't going to find anyone to argue that C is as safe as Rust.

That's good, I was starting to doubt humanity :)

>As for the TOCTTOU issue --- which is kind of silly, and had really nothing to do with Rust

Agreed. But well, that was what was claimed in the article.

>On a sane system...

I disagree with this one, although I know I am in the minority. A malloc failure is not an unrecoverable error, and the only reason to abort on memory errors is that, if you allow malloc failures, then virtually every line of code can fail, because it is either allocating memory or calling some other method that allocates memory. (I imagine you've already read http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p070... section 4.3, but I mention it just in case someone hasn't heard of it.)

Something similar happens with integer overflow, although we've collectively decided that we prefer to have 2,147,483,647 + 1 = -2,147,483,648 (in both C++ and Rust in release mode). But if we assumed that i++ can cause an error, it would mean that every line of code could possibly be an error.

But well, back to memory allocation: just a month ago I had a situation where using a language (C#) which just throws an exception on out-of-memory saved my day. I left a program running overnight that builds and creates setups for the programs I make. There are about 190 different setups (win32, 64, linux, and variations), all compressed using max settings in 7zip. Those builds run in parallel in a thread pool, and each 7zip compression takes a lot of memory. So much that during the night one thread raised an out-of-memory error. When I woke up the next day, I had 189 setups created correctly and one that had failed. I reran that missing setup and was ready to publish them. Had I used a "sane" language, the whole app would have aborted and I would have had to run the 190 setups again.

But seriously, it can happen a lot, especially in embedded. You are processing images and one image is too big. Do you abort the app, or report the error and move on to the next image?

>this code does essentially the same thing with errors that the Rust one does: it swallows them and continues

That was my point in the answer: Rust and C++ don't swallow the error. You have the line: reader.lines().filter_map(Result::ok)

And then: if let Err(e) = scanfiles() { println!("Error: Something unexpected happened: {:#?}", e);

So, if there is an error, it will report it to you, not swallow it. And that is to me the whole point: the guy who coded the Rust version would have to go out of his way to swallow the error, while in C it is the default.


The Rust program surfaces errors in globwalk, but swallows the fopen errors you were talking about (it filter_maps them away, which, to be fair, is what I'd do too).

None of the counters in that C program are going to wrap. If it makes you feel better, you can replace size_t with "unsigned long long", but that's already what it is on modern systems.

That Rust program will also abort on allocation failures, just like I'd expect my C program to, so recovering from memory exhaustion isn't really in the scope of the discussion here.

If you care about security, you abort a C program when allocation fails. If you ask programmers to check failures, you get malloc checking bugs. Especially in embedded systems, where offset-from-NULL pointers can get attackers something useful. There are exceptions to every rule (at least in C programming), but if the task is "count words in files", you're nuts if you do anything but blow up when malloc fails.


> it filter_maps them away, which, to be fair, is what I'd do too

OK, I don't know much Rust and I imagined filter_map reported to the main app. If it is swallowing them, then to me that is a big + for C++ in that area. It seems I will have to agree to disagree here, but swallowing the error is not what I would do at all. This program has the possibility of giving you a completely wrong result, and you would never know. You could use that result in something else, say publish a study where you claim there were only 100 words in the files you processed when there were 1000, because your program failed and didn't warn you.

> If you care about security, you abort a C program when allocation fails.

Well, that's why I personally prefer C++ and exceptions. They abort by default, let you handle if you want/can.

But I think my original point was lost. My complaint was that this code wasn't checking for failure at all. Shouldn't it be something like:

    if (!w->next) w->next = calloc(1, sizeof(*w));
    if (!w->next) abort();
    w = w->next;
I mean, in this particular case it will indeed blow up, because when it goes back to the while(w->word), w will be NULL and you will dereference a null pointer. But that's mostly luck, and might not be true anymore if you make changes to the code in the future. If you care about security, you should check all allocs and abort on NULL. And that's my beef with C: it swallows errors and continues instead of crashing or reporting them.

> but if the task is "count words in files", you're nuts if you do anything but blow up when malloc fails.

I won't disagree here, if the task is count words in files. I was just trying to explain why imo, not all "sane" programs should abort in failed malloc.


In serious C programs, you rig malloc to abort rather than return NULL. With jemalloc, for instance, you'd do it with opt.xmalloc.


> A malloc failure is not an unrecoverable error

It's not, but it's also a fairly difficult error to recover from effectively. You really need to know what you're doing to get out of this -- I think someone did a study of how many STL implementations handled std::bad_alloc correctly (i.e. unwound and didn't accidentally take down the program), and I think the number was zero.

> we've collectively decided that we prefer to have 2,147,483,647 + 1 = -2,147,483,648

Not always; overflow is often exploitable since it lets you bypass less-than checks.


>although we've collectively decided that we prefer to have 2,147,483,647 + 1 = -2,147,483,648 (in both C++ and Rust in release mode)

Signed integer overflow is undefined behavior in C++. Only unsigned integers are defined to wrap like that.

https://stackoverflow.com/a/16188846


Thanks I got the gist. Doesn't look too bad IMO.



