
So, a lot of what people think of when they think of an OS is the UI (GUI/TUI/CLI), but kernel APIs are the real bread and butter, and this is where the UNIX philosophy really shines.

Passing a string as a file name in C++ on macOS or Linux was, in my experience, simple. The permitted path length in ASCII characters is about four times what Windows allows (may god have mercy on your soul).

I am not here to shit on Windows, but the Windows devs clearly have a very different set of priorities (e.g. backwards compatibility) than other modern OS devs.

I guess to a large extent we all expect to be in the browser (gross) in some number of years, but Windows seems so much harder from the perspective of someone who has programmed for Unixen and studied Windows as an OS.



  Passing a string as a file name in C++ on macOS or Linux was, in my
  experience, simple. The permitted path length in ASCII characters is
  about four times what Windows allows (may god have mercy on your soul).
Macs are simple enough too if you ignore the quirks. HFS (which was never seen on a modern macOS) usually stores no information about what encoding was used for filenames. It's entirely dependent on how the OS was configured when the file was named (although some code I've seen suggests that something in System 7 would save encoding info in the FinderInfo blobs). So non-Latin stuff gets mangled pretty easily if you're not careful. Filenames are pretty short: 31 characters, because (except for the volume name) they're Pascal strings stored in 32 bytes with the length byte at the front.
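
To make that concrete, here's a minimal sketch of the Pascal-string filename layout (my own reconstruction, not Apple's actual headers):

```
// Hypothetical reconstruction of an HFS-style Str31 filename: 32 bytes
// total, length byte first, so at most 31 bytes of name. Note there is
// no field recording which text encoding the bytes are in.
#include <cstdint>
#include <cstdio>
#include <cstring>

struct HFSName {
    uint8_t length;    // number of valid bytes in text[]
    uint8_t text[31];  // raw name bytes; the encoding is simply not stored
};

int main() {
    HFSName name{};
    const char* fn = "Read Me";
    name.length = static_cast<uint8_t>(std::strlen(fn));
    std::memcpy(name.text, fn, name.length);
    std::printf("%.*s (%d bytes)\n", name.length,
                reinterpret_cast<const char*>(name.text), name.length);
}
```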

HFS+ (which is what you'll find on OS X volumes) uses UTF-16 but then mandates its own quirky normalization, with either Unicode 2.1 or 3.2 decomposition depending… which can create headaches because most HFS+ volumes are case-insensitive. It's been so long since I've touched anything Cocoa, but I assume the file APIs will do the UTF-16 dance for you, and the POSIX stuff is obviously OK with ASCII.
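
A quick sketch of why decomposition matters, using hard-coded literals (HFS+ stores names in roughly the decomposed form):

```
// The same visible name, "café", precomposed (NFC) vs decomposed (NFD,
// roughly what HFS+ stores). The code-unit sequences differ, so naive
// comparisons of what looks like the same file name can fail.
#include <cstdio>

int main() {
    const char16_t nfc[] = u"caf\u00E9";   // U+00E9: one code unit
    const char16_t nfd[] = u"cafe\u0301";  // 'e' + U+0301 combining acute
    std::printf("NFC: %zu code units\n", sizeof(nfc) / sizeof(nfc[0]) - 1);  // 4
    std::printf("NFD: %zu code units\n", sizeof(nfd) / sizeof(nfd[0]) - 1);  // 5
}
```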

And, of course, let's not forget the heavily leveraged resource forks. NTFS has forks too (alternate data streams), but nobody seems to use them.

APFS standardized on Unicode 9 w/ UTF-8.

CDs? Microsoft's long filenames (Joliet) use big-endian UTF-16 (via ISO escape sequences that could theoretically be used to offer UTF-8 support). That sounds crazy until you realize their relative simplicity (a duplicate directory structure) compared to the alternative Rock Ridge extensions, which store long filenames in the file's metadata with no defined or enforced encoding. UDF? Yeah, that's more or less UTF-16 as well.

I think we're perhaps forgetting just how young UTF-8 is.


Thanks for the comment. HFS/HFS+ is a fascinating bit of history.

It strikes me how developer ergonomics have improved as computers have become cheaper/increased in power.

As to UTF-8, we may say it's young, but in 14 months it will be old enough to purchase and consume alcohol in the United States. From other comments it seems like Microsoft doesn't think the tech debt is too great so long as they have good libraries in C#.


In fairness, not that many people (including Microsoft) write native C++ apps for Windows anymore, certainly not without tried and tested libraries.

You can write C# code dealing with reading/writing files once and compile it on Linux/Windows/Mac, and it'll work pretty much exactly the same.


Microsoft does write native C++ apps for Windows all the time.

First of all, games are apps; second, even if the apps unit keeps mostly ignoring WinUI/UWP (written in C++), whatever they do with Web widgets is mostly backed by C++ code, not C#.

One of the reasons why VSCode is mostly usable despite being Electron is precisely the number of external processes written in C++.

Applications being written in .NET is mostly on the Azure side.


“Applications being written in .NET is mostly on the Azure side.”

You are, of course, wrong about this. Most .NET/C# code is not Azure-related (yet, anyway); it is the billions of lines of enterprise application code across businesses around the world (for me, since 2001)…


You are not Microsoft's apps unit, which is the subject being discussed here.


Microsoft has literal teams with budgets of several million USD just for the file open/save dialogs in Office, which are written in C++.


But despite that, they cannot fix it. They consistently make perhaps the worst APIs of any major tech company.


Maybe for file handling in C++, but DirectX/HLSL is the best graphics API I've worked with, and C# is easily my favorite language to develop in. It's easy for us to talk shit about Win32 today, 30 years after it was initially developed, but there are myriad historical reasons why UTF-16 is used by Java, Windows, and other languages/runtime environments, and why it's not simple to just break compatibility with decades of software running at hospitals and financial trading firms because the 32-year-old armchair experts at HN said so.

According to Wikipedia:

https://en.wikipedia.org/wiki/Universal_Coded_Character_Set

> The UCS has over 1.1 million possible code points available for use/allocation, but only the first 65,536, which is the Basic Multilingual Plane (BMP), had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2006 that all software sold in its jurisdiction would have to support GB 18030. This required software intended for sale in the PRC to move beyond the BMP.
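
A small illustration of what "moving beyond the BMP" means for UTF-16 code (the code point below is just a hard-coded example):

```
// U+20BB7 (a CJK Extension B ideograph, outside the BMP) takes two UTF-16
// code units -- a surrogate pair -- where BMP characters take only one.
#include <cstdio>

int main() {
    const char16_t ch[] = u"\U00020BB7";
    std::printf("UTF-16 units: 0x%04X 0x%04X\n",
                static_cast<unsigned>(ch[0]), static_cast<unsigned>(ch[1]));
    // prints: UTF-16 units: 0xD842 0xDFB7
}
```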


I will take Win32 over anything related to X Windows, OpenGL and Vulkan, with pleasure.


True. They broke the basic Windows search functionality some time in 2007 and broke Outlook search around 2013, and neither has been fixed since.


Those file open/save dialogs are an application of their own, with multiple versions across all supported platforms.


It's not all backwards compatibility. I'm willing to bet that some of it (a large part?) is just sloppy software development.

SQL Server (2017?) breaks if you update it on a UTF-8 Windows install because it runs a T-SQL script that doesn't work with that code page. That script is a mess: some of it is indented using tabs, some with spaces, and there's trailing whitespace. Yuck.


My hot take: code quality is not measured by formatting issues, but by error resilience and the number of actual bugs.

Much of modern linting and commit hooking is dedicated to checking whitespace placement, variable naming, and function lengths, but the well-formatted, newly rewritten code is still buggy as hell - it just looks pretty.


Formatting doesn’t remove bugs, but it helps you detect them. Linted code lets you scan the code faster and provides valuable pattern recognition, allowing you to spot common mistakes.

There have been numerous bugs caused by incorrect code formatting, most notably Apple's SSL security bug from 2014: https://dwheeler.com/essays/apple-goto-fail.html
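
For reference, here's a minimal, self-contained reproduction of the control-flow pattern behind that bug (not Apple's actual code, just the same shape):

```
// The duplicated goto is unconditional, so the second check never runs
// and verify() reports success. Consistent formatting/linting flags this.
#include <cstdio>

static int check_one() { return 0; }  // passes
static int check_two() { return 1; }  // would fail, but is never reached

static int verify() {
    int err;
    if ((err = check_one()) != 0)
        goto fail;
        goto fail;  // always taken; the indentation is a lie
    if ((err = check_two()) != 0)
        goto fail;
fail:
    return err;
}

int main() { std::printf("verify() = %d (0 means success)\n", verify()); }
```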

Another reason for formatting is the “minimal diff” paradigm: if a formatting rule isn't followed, the next commit touching that code will also change the formatting, causing a larger diff than necessary.

There are other reasons for simple format linting, but the ones above are the most important.

Lastly, formatting is part of a range of static code analysis tools. Generally, formatting inconsistencies are the easiest issues to detect and resolve, compared with those that need more sophisticated tools.


True, but for me at least it is easier to find bugs in neatly formatted, consistent code than in something written in a multitude of styles.

It is kind of like "pattern matching" on the error patterns.


I've often found that people who don't care about whitespace also don't care that much about other aspects of code quality.

The inverse may not have the same correlation, since it can be automated.


I never understood what backwards compatibility is preserved by the Windows API not supporting file paths longer than 260 characters. It would work just the same if you passed a short path, and no old application expects a long path anyway.


Decades of binaries are in use doing something like

```
wchar_t filename[MAX_PATH];
CreateFileW(filename, ...);
```

in both first-party and third-party Windows code, often in deep call stacks passing file names around. Changing the length requires fixing them all.

See comments in https://archives.miloush.net/michkap/archive/2006/12/13/1275...


Your example isn't problematic API-wise, because CreateFileW doesn't need to care whether you pass in 16 characters or 1600 - if it does, that is mostly a matter of refactoring and not inherent to how the function works. The real problem is APIs that inherently assume you pass in a buffer of at most MAX_PATH characters, because you provide a pointer but no size, and the API is expected to write to that pointer. This affects several shell32 getter functions, e.g. SHGetFolderPath (its modern replacement, SHGetKnownFolderPath, sidesteps this by allocating the string for you).
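
A minimal sketch of that fixed-buffer contract:

```
// SHGetFolderPathW takes a bare output pointer with no size argument; its
// documented contract requires the buffer to hold at least MAX_PATH
// characters, so the limit is baked into every existing caller.
#include <windows.h>
#include <shlobj.h>
#include <cstdio>

int main() {
    wchar_t path[MAX_PATH];  // the API assumes exactly this much room
    if (SUCCEEDED(SHGetFolderPathW(nullptr, CSIDL_PERSONAL, nullptr,
                                   SHGFP_TYPE_CURRENT, path))) {
        std::wprintf(L"%ls\n", path);
    }
}
```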

But for functions outside of Windows itself, this is the exact reason why the long path feature is hidden behind an opt-in flag.


MAX_PATH is a #define, so its value is baked into old binaries.

In the RAM-constrained world of the past, you would stack-allocate `char buff[MAX_PATH]` and do all your strcpy/strspn in there with no problems.

Now, if such an app receives a long path into that too-short buffer, it will instantly overflow the stack and may create exploitable problems.
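
A sketch of the vulnerable pattern described above (hypothetical function, not from any real codebase):

```
// Classic fixed-buffer copy: MAX_PATH (260) is baked in at compile time,
// and strcpy does no bounds checking, so a longer path smashes the stack.
#include <windows.h>
#include <cstring>

void legacy_copy(const char* path_from_api) {
    char buff[MAX_PATH];               // 260 bytes, fixed at compile time
    std::strcpy(buff, path_from_api);  // overflows if the path is longer
}

int main() { legacy_copy("C:\\short\\path.txt"); }
```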


They have API calls that fill user-supplied buffers that have room for MAX_PATH characters.

See for example https://learn.microsoft.com/en-us/windows/win32/api/fileapi/..., which also shows how they gradually made the input argument more flexible:

- “By default, the name is limited to MAX_PATH characters. To extend this limit to 32,767 wide characters, prepend "\\?\" to the path”

- “Starting with Windows 10, Version 1607, you can opt-in to remove the MAX_PATH limitation without prepending "\\?\"”
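
As a sketch, the per-call opt-in looks like this (hypothetical path; the prefix also requires an absolute, already-normalized path):

```
// Opting into long paths per call: the \\?\ prefix bypasses the MAX_PATH
// limit but requires an absolute path with no "." or ".." components.
#include <windows.h>

int main() {
    HANDLE h = CreateFileW(LR"(\\?\C:\some\very\long\path\file.txt)",
                           GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
}
```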

I also guess there’s lots of other code that sees those paths (anti-virus software, device drivers).




