ajeetdsouza's comments

This is really well-written!

Shameless plug: you may want to check out Loxcraft: https://github.com/ajeetdsouza/loxcraft

I too followed the path of "ignore safe Rust for maximum performance". It got pretty close to the C version, even beating it on some benchmarks.


Nice, thanks for sharing. Definitely gonna look into how you did it; I couldn't get as close to clox as you did.


Shameless plug: zoxide recently added support for Nushell. It's a smarter cd command for your shell (similar to z or autojump).

https://github.com/ajeetdsouza/zoxide
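
Setup on Nushell looks roughly like this (from memory - check the README for the exact, current incantation):

    zoxide init nushell | save ~/.zoxide.nu
    # then, in your Nushell config:
    source ~/.zoxide.nu

Other shells work the same way: zoxide init bash, zoxide init fish, and so on.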


Fixed, thank you!


The size of a compiled Hello World program is 2 MB on my machine, so you're probably right!


The tests were run 10 times, and I used the median value. There wasn't much variance between runs, so I don't think disk caching played much of a role here.
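
If you want to reproduce it, something along these lines works (a sketch, not my actual script; wc-go and enwik9.txt are placeholder names):

    $ for i in $(seq 10); do /usr/bin/time -f '%e' ./wc-go enwik9.txt >/dev/null; done 2>&1 | sort -n

and take the middle value of the sorted output.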

Being a garbage-collected language with a runtime, Go certainly cannot match the performance of C, and it was never my point to prove otherwise - obviously, for the same algorithm, the C implementation would be faster. Instead, I was exploring Go to highlight how simple it is to write safe, concurrent code in it.


Author here. I think you should go through the article again - it's quite readable, and there are no "hand-optimizations" as you say. Also, the single-core implementation was already faster than the C version; the multithreaded version was only done to explore different methods of concurrency in Go.

Hope that clarifies things.


I tested this on a fresh install of Fedora 31, so I didn’t really see any benefit to running it on a LiveUSB. As I mentioned in the article, the wc implementation I used for comparison was compiled locally with gcc 9.2.1 and -O3 optimizations. I’ve also listed my exact system specifications there. I used the unprocessed enwik9 dataset (a Wikipedia dump), truncated to 100 MB and 1 GB.

I understand your frustrations with the previous posts, but I’ve tried to make my article as unambiguous as possible. Do give it a read; if you have any further suggestions or comments, I’d be happy to hear them!


Author here. I have addressed this in the article. The bufio-based implementation was the first one, and it was actually slower.

In the second section, I was able to surpass the performance of the C implementation - by using a read call with a buffer. As I mentioned in the article, the C implementation does the same, and in the interest of a fair benchmark, I set equal buffer sizes for both.
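
For reference, the pattern on the C side is a plain read(2) loop into a fixed buffer. Here's a minimal sketch (not the actual benchmark code; BUF_SIZE is a stand-in for whatever size both versions were given):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BUF_SIZE (16 * 1024)  /* stand-in; set equal for both versions */

    int main(int argc, char **argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        char buf[BUF_SIZE];
        long bytes = 0, lines = 0;
        ssize_t n;
        /* one syscall per chunk; all scanning happens in memory */
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            bytes += n;
            for (ssize_t i = 0; i < n; i++)
                if (buf[i] == '\n')
                    lines++;
        }
        close(fd);
        printf("%ld %ld %s\n", lines, bytes, argv[1]);
        return 0;
    }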


I don't know Go, so I am not sure what file.Read does exactly, but wc on my system (which uses the same wc.c as Penner's article) does blocking reads in a naive way, which I'd argue makes it somewhat of a paper target.

Maybe the Linux wc version you have is better - I don't know. Faster ones do exist, and Penner's article linked to one. I can't be sure, as you didn't link to the source of your wc. (Edit: I just noticed you did use the OS X wc; you're just running it on Fedora. Sorry about that.)

But in any case, using just wall-clock time can be deceptive. Here is what I mean: on my machine, a somewhat old Mac Mini with a spinning disk, user CPU is only ~1/5 of the real time; the rest is spent waiting on the OS or the disk.

  % for ((i=0;i<250;i++)); do cat /usr/share/dict/words >> a; done
  % /usr/bin/time wc -l a                                         
   58971500 a
          1.24 real         0.28 user         0.64 sys


This is not true - see my comment here: https://news.ycombinator.com/edit?id=21587907


Take the Darwin version linked from your site. Run perf record wc thefile.txt. Then run perf report and you will see iswspace in the call graph.

As I show in [1], removing this call and replacing it with a character match gives a speedup of almost 2x.

[1] https://news.ycombinator.com/item?id=21592089


Author here. This is not true - I included a link to the manpage (https://ss64.com/osx/wc.html) in the article to avoid this confusion. I did not use GNU wc; I used the OS X one, which, by default, counts single-byte characters. From the manpage:

> The default action is equivalent to specifying the -c, -l and -w options.

> -c The number of bytes in each input file is written to the standard output.

> -m The number of characters in each input file is written to the standard output. If the current locale does not support multi-byte characters, this is equivalent to the -c option.

Moreover, I also mentioned in the article that I was using us-ascii encoded text, which means that even -m would have treated it as ASCII text.

Hope that clarifies your issue.


It is not about the character count, but the word count. wc decodes characters to find non-ASCII whitespace as word separators. If you read further in the same man page:

> White space characters are the set of characters for which the iswspace(3) function returns true.

That your text is ASCII encoded does not matter, since ASCII is a subset of UTF-8. So at the very least, you need an extra branch to check that a byte's value is smaller than 128 (since any byte with the high bit set is part of a multi-byte sequence in UTF-8).
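
A sketch of what that branch looks like (illustrative only, not from any real wc):

    #include <stdbool.h>

    /* Bytes below 128 are plain ASCII and can take a cheap path; anything
       with the high bit set belongs to a multi-byte UTF-8 sequence and
       would have to go through mbrtowc()/iswspace() instead. */
    static bool byte_is_ascii_space(unsigned char c) {
        if (c < 0x80)
            return c == ' ' || c == '\n' || c == '\t' ||
                   c == '\v' || c == '\f' || c == '\r';
        return false;  /* defer to the wide-character path */
    }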

However, if you look at the implementation at

https://opensource.apple.com/source/text_cmds/text_cmds-68/w...

you can see that in this code path it actually uses mbrtowc, so there is also the function-call overhead.


It only calls mbrtowc if domulti is set (and MB_CUR_MAX > 1), i.e. only when given the option -m.
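
Schematically, the decoding step has this shape (paraphrased from memory, not the verbatim source):

    #include <stdlib.h>
    #include <wchar.h>

    /* returns the byte length of the next character, writing it to *wch;
       error handling elided */
    static size_t next_char(const char *p, size_t len, int domulti,
                            mbstate_t *mbs, wchar_t *wch) {
        if (!domulti || MB_CUR_MAX == 1) {
            *wch = (unsigned char)*p;        /* use the byte directly */
            return 1;
        }
        return mbrtowc(wch, p, len, mbs);    /* full multi-byte decode */
    }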


You are right! So that's a Darwin oddity. Still, a wide-char function, iswspace, is called in that code path, which adds function-call overhead in a tight loop.


If domulti is not set, the wide-char function is not called as far as I can tell. Why would it be? It's explicitly meant not to do wide-char stuff in that case.

FWIW, when this was going around for the first time I took this Darwin version of wc and experimented with setting domulti to const 0, statically removing all paths where it might do wide character stuff. I didn't measure any performance difference to just running it unmodified.
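
Concretely, the change was along these lines (paraphrased; not the exact patch):

    /* was: int domulti; (set from the -m flag in main) */
    static const int domulti = 0;  /* compiler can now drop every multi-byte path */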


It's about iswspace as I mentioned in the parent comment. Replace the line

    if (iswspace(wch))
by

    if (wch == L' ' || wch == L'\n' || wch == L'\t' || wch == L'\v' || wch == L'\f')
And I get a ~1.7x speedup:

    $ time ./wc ../wiki-large.txt
      854100 17794000 105322200 ../wiki-large.txt
    ./wc ../wiki-large.txt  0.47s user 0.02s system 99% cpu 0.490 total
    $ time ./wc2 ../wiki-large.txt
      854100 17794000 105322200 ../wiki-large.txt
    ./wc2 ../wiki-large.txt  0.28s user 0.01s system 99% cpu 0.293 total
Next, remove the unnecessary branching introduced by multi-byte character handling [1] (this version actually resembles the Go code pretty closely), and we get a speedup of 1.8x:

    $ time ./wc3 ../wiki-large.txt
      854100 17794000 105322200 ../wiki-large.txt
    ./wc3 ../wiki-large.txt  0.25s user 0.01s system 99% cpu 0.267 total
If we take the second table from the article and divide the C result (5.56) by 1.8, the C performance would be ~3.09, which is faster than the Go version (3.72).

Edit: for comparison, the Go version from the article:

    $ time ./wcgo ../wiki-large.txt
      854100 17794000 105322200 ../wiki-large.txt
    ./wcgo ../wiki-large.txt  0.32s user 0.02s system 100% cpu 0.333 total
So, when removing the multi-byte whitespace handling, the C version is indeed faster than the (non-parallelized) Go version.

[1] https://gist.github.com/danieldk/f8cdaed4ba255fb2954ded50dd2...


Thanks, I finally understood what you are saying. Indeed, the code uses iswspace to test all characters, wide or normal. Strange design choice. For whatever it's worth, even just changing

    if (iswspace(wch))
to something like

    if (domulti && iswspace(wch))
        ...
    else if (!domulti && isspace(wch))
        ...
got something like a 10% speedup on my machine. And replacing isspace with an explicit condition like yours is much faster still. I checked: isspace is macro-expanded to a table lookup and a mask, but apparently that's still slower than your explicit check. I'm a bit surprised by this but won't investigate further at the moment.
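
For the curious, the two shapes being compared are roughly these (illustrative; the real lookup table lives inside libc):

    #include <stdbool.h>

    static bool space_table[256];  /* assume filled in once at startup */

    /* what the isspace() macro boils down to: an indexed load (plus a mask) */
    static inline bool isspace_table(unsigned char c) {
        return space_table[c];
    }

    /* the explicit comparisons from the parent comment */
    static inline bool isspace_cmp(unsigned char c) {
        return c == ' ' || c == '\n' || c == '\t' ||
               c == '\v' || c == '\f';
    }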


> Thanks, I finally understood what you are saying.

I am sorry for the unclear comments. I'll stop commenting on a phone ;).

> Indeed, the code uses iswspace to test all characters, wide or normal. Strange design choice.

I agree, it's really strange. This seems to be inherited from the FreeBSD version, which still does this as well:

https://github.com/freebsd/freebsd/blob/8f9d69492c3da3a8c1ea...

It has the worst of both worlds: it incorrectly counts the number of words when there is non-ASCII whitespace (since mbrtowc is not used), but it pays the penalty of using iswspace. It also doesn't conform to POSIX, which states:

> The wc utility shall consider a word to be a non-zero-length string of characters delimited by white space.

[...]

> LC_CTYPE

> Determine the locale for the interpretation of sequences of bytes of text data as characters (for example, single-byte as opposed to multi-byte characters in arguments and input files) and which characters are defined as white space characters.

