
yshui · 2026-01-19T15:00:31 1768834831

If you think directly calling Windows syscalls is crazy, some applications parse binary code from ntdll.dll to figure out what the syscall numbers actually are, since they change across different Windows versions :)
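A rough sketch of what that parsing can look like, assuming the common x64 stub layout where each ntdll syscall stub starts with `mov r10, rcx` followed by `mov eax, imm32`. The function name `extract_syscall_number` is made up for illustration, and real applications may handle hooked or differently shaped stubs in other ways.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Assumed x64 stub prologue (hypothetical helper, not taken from any
 * specific application):
 *   4C 8B D1          mov r10, rcx
 *   B8 xx xx xx xx    mov eax, <syscall number>
 * Returns true and stores the number if the bytes match that shape.
 */
bool extract_syscall_number(const uint8_t *stub, size_t len,
                            uint32_t *number_out)
{
    static const uint8_t prologue[] = { 0x4C, 0x8B, 0xD1, 0xB8 };

    /* Need the prologue plus a 4-byte immediate. */
    if (len < sizeof(prologue) + 4)
        return false;

    /* A patched or hooked stub will not start with the expected bytes. */
    if (memcmp(stub, prologue, sizeof(prologue)) != 0)
        return false;

    /* The immediate is stored little-endian right after the B8 opcode. */
    const uint8_t *imm = stub + sizeof(prologue);
    *number_out = (uint32_t)imm[0]
                | ((uint32_t)imm[1] << 8)
                | ((uint32_t)imm[2] << 16)
                | ((uint32_t)imm[3] << 24);
    return true;
}
```

In practice the bytes would come from the exported function's address inside a mapped ntdll.dll; the sketch only covers the decoding step.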

yshui · 2026-01-19T14:56:26 1768834586

see also the Proton release notes: https://github.com/ValveSoftware/Proton/releases

yshui · 2026-01-19T14:54:46 1768834486

It's a bit different. WoW64 is the technology used to run 32-bit applications on 64-bit Windows. Wine has supported WoW64 for a long time. The difference is that the "old" Wine WoW64 mode required 32-bit libraries on the host side, whereas this "new" WoW64 mode doesn't.

yshui · 2026-01-19T14:44:45 1768833885

> who are the devs of Wine.

they employ many devs of Wine (including the project lead, Alexandre Julliard), but technically Wine is still an independent open-source project, and has many contributors from outside CodeWeavers.

yshui · 2025-09-23T22:05:18 1758665118

That's a cool find. I wonder if LLVM also does the reverse, where it pattern-matches handwritten CAS loops and transforms them into native ARM64 instructions.
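For reference, a handwritten CAS loop of the kind in question might look like the sketch below, here an atomic fetch-max written with C11 `<stdatomic.h>`. The function name `fetch_max_cas` is hypothetical, and the memory orderings are one plausible choice rather than anything taken from the article.

```c
#include <stdatomic.h>

/*
 * Hypothetical handwritten fetch-max: atomically sets *target to
 * max(*target, value) and returns the value observed before the update.
 */
int fetch_max_cas(_Atomic int *target, int value)
{
    int observed = atomic_load_explicit(target, memory_order_relaxed);

    /*
     * Retry only while our value is larger than what is stored.
     * On failure, the CAS writes the current contents into `observed`,
     * so the comparison is re-evaluated against fresh data.
     */
    while (observed < value &&
           !atomic_compare_exchange_weak_explicit(target, &observed, value,
                                                  memory_order_seq_cst,
                                                  memory_order_relaxed)) {
        /* nothing to do: `observed` was refreshed by the failed CAS */
    }
    return observed;
}
```

A compiler that recognized this idiom could in principle replace the whole loop with a single atomic max instruction on targets that have one.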

jerrinot · 2025-09-23T22:44:08 1758667448

That's a very good question. A proper compiler engineer would know, but I will do my best to find something and report back.

Edit: I could not find any pass that pattern-matches and replaces CAS loops. The closest thing I could find is this pass: https://github.com/llvm/llvm-project/blob/06fb26c3a4ede66755... I reckon one could write a similar pass to recognize CAS idioms, but its usefulness would probably be rather limited and not worth the effort/risks.

tialaramex · 2025-09-24T09:28:42 1758706122

The term of art for this technique is "idiom recognition" and it's proper ancient, like, APL compilers did have some idiom recognition 50+ years ago.

An example you'll see in say a modern C compiler is that if you write the obvious loop to calculate how many bits are set in an int, the actual machine code on a brand new CPU should be a single population count instruction. C provides neither intrinsics (like Rust) nor a dedicated "popcount" feature, so you can't write that, but it's obviously what you want here and yup, an optimising C compiler will do that.
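For concreteness, one form of the "obvious loop" is sketched below. The name `count_set_bits` is made up for illustration, and whether a given compiler collapses the loop into a single instruction depends on the compiler version, the optimisation level, and the target flags.

```c
/*
 * Counts the set bits in x by clearing the lowest set bit on each
 * iteration. This is a loop shape that optimising compilers may
 * recognise and replace with a population count instruction.
 */
unsigned count_set_bits(unsigned x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```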

However, LLVM is dealing with an IR generated by other compiler folk so I think it probably has less use for idiom recognition. Clang would do the recognition and lower to the same LLVM IR as Rust does for its intrinsic population count core::intrinsics::ctpop so the LLVM backend doesn't need to spot this. I might be wrong, but I think that's how it works.

toth · 2025-09-24T10:38:57 1758710337

> An example you'll see in say a modern C compiler is that if you write the obvious loop to calculate how many bits are set in an int, the actual machine code on a brand new CPU should be a single population count instruction. C provides neither intrinsics (like Rust) nor a dedicated "popcount" feature, so you can't write that, but it's obviously what you want here and yup, an optimising C compiler will do that.

C compilers definitely have intrinsics for this, for GCC for instance it is `__builtin_popcount`.

And apparently C has even had standard support for it since C23: it's `stdc_count_ones` [1], and in C++ you have `std::popcount` [2].

[1] https://en.cppreference.com/w/c/numeric/bit_manip.html

[2] https://en.cppreference.com/w/cpp/numeric/popcount.html

tialaramex · 2025-09-24T12:31:38 1758717098

The existence of platform specific hacks is not interesting. In reality what happens is that software which has at any point cared about being portable doesn't use them.

But yes stdc_count_ones is indeed the intrinsic you'd want here, and only a few years after I stopped writing C, so thanks for mentioning that.

std::popcount is C++, but it's also kinda miserable that it took until C++20 and yet they still only landed the unsigned integer types. C++20 also insists that signed integers have two's complement representation, so the signed integers do in fact have these desirable properties, but you can't use that.

articulatepang · 2025-09-24T15:13:09 1758726789

> In reality what happens is that software which has at any point cared about being portable doesn't use them.

I don't think this generalization is actually true. Fast portable software compiles conditionally based on the target platform, picking the fast platform-specific intrinsic, and falls back to a slow but guaranteed portable software implementation. This pattern is widespread in numerical linear algebra, media codecs, data compressors, encryption, graphics, etc.
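A minimal sketch of that pattern for population count, assuming a 32-bit input. The names `popcount32` and `popcount32_portable` are hypothetical; `__builtin_popcount` is the GCC/Clang builtin and `__popcnt` is the MSVC intrinsic from `<intrin.h>`.

```c
#include <stdint.h>

/* Portable fallback: classic bit-parallel (SWAR) count, no intrinsics. */
static inline unsigned popcount32_portable(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (unsigned)((x * 0x01010101u) >> 24);       /* add the bytes */
}

#if defined(__GNUC__) || defined(__clang__)
/* GCC and Clang: the builtin lowers to whatever the target supports. */
static inline unsigned popcount32(uint32_t x)
{
    return (unsigned)__builtin_popcount(x);
}
#elif defined(_MSC_VER)
#include <intrin.h>
/* MSVC: emits the POPCNT instruction, so the CPU must support it. */
static inline unsigned popcount32(uint32_t x)
{
    return (unsigned)__popcnt(x);
}
#else
/* Unknown compiler: fall back to the portable version. */
static inline unsigned popcount32(uint32_t x)
{
    return popcount32_portable(x);
}
#endif
```

Callers use `popcount32` everywhere and never see which branch was selected.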

toth · 2025-09-24T14:06:01 1758722761

Maybe we are just quibbling over semantics but the compiler intrinsic here is '__builtin_popcount'. 'stdc_count_ones' is a standard library element that presumably will be implemented using the intrinsic.

And FWIW all major C/C++ compilers have long had an intrinsic for this. In clang it even has the same name; in Visual Studio it's something like '__popcnt'. So it has long been easy to roll your own macro that works everywhere.

tialaramex · 2025-09-24T15:16:21 1758726981

Yes, just semantics. But I don't think I can agree that, because you could have ensured this works portably, people actually did. That's not been my experience.

Yesterday I watched that "Sea of Thieves" C++14 to C++20 upgrade story on YouTube. That feels much more like what I've seen: code that shouldn't have worked but did, kept alive by people whose priority is a working game.

gpderetta · 2025-09-24T14:55:00 1758725700

__builtin_popcount is not platform specific.

tialaramex · 2025-09-24T15:18:33 1758727113

OK, sure, vendor specific then. C23 does not promise this incantation, it's presumably a GCCism.

Arnavion · 2025-09-24T03:55:45 1758686145

I checked Godbolt, with RISC-V instead of ARM since I'm more familiar with that, and it doesn't look like it.

https://gcc.godbolt.org/z/b5s4WjnTG

(amomax is the atomic fetch-max instruction. lr and sc are load-reserved and store-conditional instructions; sc is like a regular store except it only succeeds if the address was not modified since the previous lr that accessed it. IOW the assembly is basically one-to-one with the C source.)

yshui · on Jan 28, 2025

CodeWeavers, which I am very lucky to be part of, is also an (almost) 30-year-old small company! We make Wine, Proton, and CrossOver, you might've heard of them. And I think we are hiring as well, please have a look!

yshui · on Jan 2, 2025

Any autoregressive model can do what you are describing. Transformers generate one token at a time too, not all at once.

intalentive · on Jan 2, 2025

True but memory requirements grow with sequence length. For recurrent models the memory requirement is constant. This is why I qualified with "low memory".
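As a back-of-the-envelope sketch of that difference, assuming a plain multi-head attention key/value cache and ignoring weights, activations, and cache-saving tricks: the function names and the example model shape in the usage note are hypothetical.

```c
#include <stdint.h>

/*
 * Rough KV-cache size for a transformer decoder: one key and one value
 * vector per layer, per head, per token seen so far. Grows linearly
 * with seq_len.
 */
uint64_t kv_cache_bytes(uint64_t layers, uint64_t heads, uint64_t head_dim,
                        uint64_t seq_len, uint64_t bytes_per_value)
{
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value;
}

/*
 * Rough state size for a recurrent model: a fixed-size state per layer,
 * independent of how many tokens have been processed.
 */
uint64_t recurrent_state_bytes(uint64_t layers, uint64_t state_dim,
                               uint64_t bytes_per_value)
{
    return layers * state_dim * bytes_per_value;
}
```

For example, with 32 layers, 32 heads of dimension 128, and 2-byte values, `kv_cache_bytes` gives 2 GiB at 4096 tokens and doubles at 8192, while `recurrent_state_bytes` takes no sequence length at all.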

whimsicalism · on Jan 2, 2025

yes but transformers are much slower than state space models

yshui · on Dec 29, 2024

further reading: xkcd's "A mole of moles" https://what-if.xkcd.com/4/

yshui · on Dec 10, 2024

i'd say it's mostly due to linking. the debug info generated is absolutely enormous.

vient · on Dec 10, 2024

LLVM_USE_SPLIT_DWARF may help with this, some recent measurements: https://www.tweag.io/blog/2023-11-23-debug-fission/#an-examp...

yshui · on Nov 17, 2024

I (kind of) know multiple languages and they all have different word orders. I find it interesting that my brain is able to switch from expecting information to be received in one order to another. Each word order feels normal to me in its respective language, mix them and they will feel weird. It's like my brain is able to process information in different orders, but there are feature flags to enable them based on the language used.