
The amount of low-level CPU architecture knowledge needed to write such a program is mind-boggling. It just goes to show how much room for improvement a lot of programs have.


FizzBuzz has many properties that make it very suitable for these kinds of optimizations, properties that might not be applicable to general-purpose code:

+ extremely small working set (a few registers' worth of state)
+ extremely predictable branching behavior
+ no I/O

These properties, however, don't diminish the achievement of leveraging AVX2 (or any vectorization) for a problem that doesn't immediately jump out as SIMD.


> no I/O

The problem description is writing out bytes, which is probably the most expensive part of this. In fact, if you read the winning solution's description, I/O is the primary problem here.

> doesn't immediately jump out as SIMD.

IDK that I agree with this assessment. Very naively, I see no reason you'd not take the 512-bit SIMD registers and split them into 16 32-bit lanes. From there, it's a relatively simple matter of using 2 registers for the divisors, pulling out the results, and transforming them into the text to print. In other words, you'd be chunking this up into 16 numbers per loop iteration (vs 1 with the naive assembly).

This is the sort of thing that jumps out as easily vectorizable.

Now, the fastest answer very obviously does not take this approach because I'm certain they realized the same thing: the difficult part here isn't the actual division, but instead pumping out the correct text at the highest speed possible. If you read through it, most of the code is dedicated to converting binary numbers into ASCII :D
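
For the divisibility half, at least, here's a minimal AVX2 sketch of the lane-wise idea (my own toy example, not the linked solution's approach): 8 x 32-bit lanes rather than AVX-512's 16, using the standard multiply-by-modular-inverse divisibility trick. The text conversion, which is the genuinely hard part, is left as scalar printf, and the 97..100 tail is omitted.

  // Hedged sketch: test divisibility of 8 consecutive integers by 3 and 5
  // at once with AVX2, using the classic modular-inverse trick:
  // n % d == 0  iff  n * inv_d (mod 2^32) <= (2^32 - 1) / d.
  // Compile with -mavx2.
  #include <immintrin.h>
  #include <stdint.h>
  #include <stdio.h>

  static __m256i divisible_by(__m256i n, uint32_t inv, uint32_t limit)
  {
      __m256i prod = _mm256_mullo_epi32(n, _mm256_set1_epi32((int32_t)inv));
      __m256i lim  = _mm256_set1_epi32((int32_t)limit);
      // unsigned prod <= limit  <=>  min_epu32(prod, limit) == prod
      return _mm256_cmpeq_epi32(_mm256_min_epu32(prod, lim), prod);
  }

  int main(void)
  {
      // Handles 1..96 in blocks of 8; the 97..100 tail is left out for brevity.
      for (uint32_t base = 1; base + 7 <= 96; base += 8) {
          __m256i n = _mm256_add_epi32(_mm256_set1_epi32((int32_t)base),
                                       _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
          uint32_t fizz[8], buzz[8], val[8];
          _mm256_storeu_si256((__m256i *)fizz, divisible_by(n, 0xAAAAAAABu, 0x55555555u));
          _mm256_storeu_si256((__m256i *)buzz, divisible_by(n, 0xCCCCCCCDu, 0x33333333u));
          _mm256_storeu_si256((__m256i *)val, n);
          for (int i = 0; i < 8; i++) {
              if (fizz[i] && buzz[i]) printf("FizzBuzz\n");
              else if (fizz[i])       printf("Fizz\n");
              else if (buzz[i])       printf("Buzz\n");
              else                    printf("%u\n", (unsigned)val[i]);
          }
      }
      return 0;
  }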


Maybe I should be more clear; no data needs to be fetched from disk/network, and the "writes" don't need to go past memory.

As for the second point, you might have a different definition of "naive" and "relatively simple", as my brain has rotted too much from only thinking about SIMD for numerical computation. While determining divisibility is relatively clear, it wasn't clear how the printing would be easily vectorizable, as the output per number is variable in length.


> transforming them into the text to print

I think this nails it. Vectorizing the math problem is "easy" (send batches of numbers to cores, do the division) but then you have to re-order it for printing (not to mention actually print it), so paradoxically, it probably makes more sense for the program to be single-threaded.


You don't even need to do division. There are only 3 patterns that repeat for every set of 10 numbers. You just need to track which one you're on.


Yes, the patterns repeat every 15 integers, so you only need to do one division operation: get the index modulo 15.

I was bored once at work and figured out a way to compute this without doing much arithmetic. It only requires 1 modulo operation.

   for (int i = 1; i <= 100; i++)
   {
       // Make a bit vector with a 1 in the (i mod 15)th bit.
       unsigned int i_as_bitvector = 1u << (i % 15);

       // Check it against the valid positions for FizzBuzz:
       // Octal 011111 is binary 1001001001001 (bits 0,3,6,9,12 -> multiples of 3)
       // Octal  02041 is binary   10000100001 (bits 0,5,10     -> multiples of 5)
       printf("%d  ", i);
       if (i_as_bitvector & 011111) printf("Fizz");
       if (i_as_bitvector &  02041) printf("Buzz");
       printf("\n");
   }

I also have a version which has NO magic constants anywhere in the program, except 3 and 5. I'll post it if someone is interested.


Also, after printing 1 to 9, you have a single pattern of 30 numbers that repeats exactly 3 times from 10 to 99, then 30 times from 100 to 999, then 300 times from 1000 to 9999, and so on. That lets you extract lots of code from the loop and run it roughly once every 10^n numbers.


Why would you think in sets of ten, when there should actually just be one pattern in 15? Then it just becomes a challenge to arrange your code to work on these blocks of 15.

We could probably code 16 versions of the block of 15 code that repeat and are nicely aligned for SIMD.


In my mind a SIMD version would work by predefining a string buffer 15 elements long with Fizz in positions 0,3,6,... and Buzz in positions 0,5,10. These are constants that don't ever change. Then vectorized code would only write to positions that change, the numbers: 1,2,4,7,8,11,13,14. Most of the time these positions would have fixed width too (lengths of large numbers don't change often) so you can use constant write addresses. So 8 SIMD threads could handle a block, and write everything blockwise.

Same idea could be used to parallelize for a GPU.
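
A rough scalar sketch of that layout (my own illustration, not anyone's actual solution): the Fizz/Buzz/FizzBuzz slots of a 15-line block are filled exactly once, and only the 8 numeric slots are rewritten per block. A SIMD or GPU version would fill those 8 slots in parallel.

  #include <stdio.h>

  int main(void)
  {
      // Slots holding plain numbers within a block of 15 (block starts at a
      // multiple of 15; slot i holds the line for base + i).
      static const int numeric_slots[8] = {1, 2, 4, 7, 8, 11, 13, 14};
      // Fizz at 0,3,6,9,12 and Buzz at 0,5,10 never change.
      char block[15][16] = {
          "FizzBuzz\n", "", "", "Fizz\n", "", "Buzz\n", "Fizz\n", "",
          "", "Fizz\n", "Buzz\n", "", "Fizz\n", "", ""
      };

      // Prints seven whole blocks (1..104); trimming to exactly 100 is omitted.
      for (int base = 0; base <= 90; base += 15) {
          for (int i = 0; i < 8; i++) {
              int slot = numeric_slots[i];
              snprintf(block[slot], sizeof block[slot], "%d\n", base + slot);
          }
          // Skip slot 0 of the very first block, since FizzBuzz starts at 1.
          for (int i = (base == 0); i < 15; i++)
              fputs(block[i], stdout);
      }
      return 0;
  }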


Because the digits increment in groups of ten.


I was thinking blocks of 10 because I can itoa(the number of tens) and then concat with the subset of 0, 1, 2, 3, 4, etc. that I care about. I guess with blocks of 15 you just need to do two itoas, and worry about two types of block vs 3.


> There are only 3 patterns that repeat for every set of 10 numbers

Ah, yep, good point!


The real question is whether you can use some of that for formatting the numbers.


Using AVX-512 is not suitable for programs that take a very short time to run. There is a penalty in the ms range to "warm up" the CPU (it's more of a cool-down, actually).

As OP stated, the limiting factor is memory access. That's why he kept saying 64B every four cycles.

But OP likely didn't use it because most CPUs lack support for AVX-512.

The new Intel CPUs introduced many changes to the frontend. This will likely improve the speed.

It might also be possible to make the CPU operate at a higher clock speed.

The code looks a bit long. Not sure if the unrolling actually helps.

EDIT: Just looked at the Agner microarchitecture doc. Ice Lake and Tiger Lake can do 64 bytes/cycle.

In theory, it could run 4x faster (on bare metal, maybe).


I think you're missing the point. The issue is that with the advent of higher-level languages, starting with Java and JavaScript and on to Python, most people have forgotten, or never learned, the skills needed to optimize code.

I'll argue that, as a percentage, the number of people who can write proper multi-threaded code has only diminished over the years.

And we see the result: despite the massive increase in computing power, software in general has become slower and more bloated.


The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.

If you have "months to write fizzbuzz" levels of resources available, sure, you can microoptimize everything. Except in practice you can't, because the amount of effort needed for this kind of microoptimization is quadratic or worse in the complexity of the problem you're actually solving.

For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster, because they'll have had a lot more time available to spend prototyping and picking good algorithms rather than debugging undefined behaviour because they used the wrong combination of integer bit lengths.


> The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.

I disagree. I believe the majority of code is slow because it's written without any consideration at all to performance. I like Casey Muratori's philosophy of non-pessimization, where true optimization (measuring, working on hot spots) is rare and rarely necessary, but massive speedups compared to the general state of the art are achievable by simply not writing code using patterns that are inherently slow. This isn't deep algorithmic stuff; it's just avoiding copies and/or pointer chasing.

Edit: https://www.youtube.com/watch?v=pgoetgxecw8 <- Casey's most recent intro to non-pessimization
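
To make the pointer-chasing point concrete, here's a small illustrative sketch of my own (not Casey's): summing the same values from a contiguous array versus from a linked list whose nodes are laid out in shuffled order. The numbers you get depend heavily on your machine and compiler flags, so treat it as a demo, not a benchmark.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  struct node { long long value; struct node *next; };

  #define N 1000000

  int main(void)
  {
      long long *arr = malloc(N * sizeof *arr);
      struct node *pool = malloc(N * sizeof *pool);
      size_t *order = malloc(N * sizeof *order);
      if (!arr || !pool || !order) return 1;

      for (size_t i = 0; i < N; i++) { arr[i] = (long long)i; order[i] = i; }

      // Shuffle the node order so that following `next` jumps around memory.
      srand(42);
      for (size_t i = N - 1; i > 0; i--) {
          size_t j = (size_t)rand() % (i + 1);
          size_t t = order[i]; order[i] = order[j]; order[j] = t;
      }
      for (size_t i = 0; i < N; i++) {
          pool[order[i]].value = (long long)i;
          pool[order[i]].next  = (i + 1 < N) ? &pool[order[i + 1]] : NULL;
      }
      struct node *head = &pool[order[0]];

      clock_t t0 = clock();
      long long sum_a = 0;
      for (size_t i = 0; i < N; i++) sum_a += arr[i];                  // contiguous walk
      clock_t t1 = clock();
      long long sum_l = 0;
      for (struct node *p = head; p; p = p->next) sum_l += p->value;   // pointer chase
      clock_t t2 = clock();

      printf("array: %lld (%.2f ms)   list: %lld (%.2f ms)\n",
             sum_a, (t1 - t0) * 1000.0 / CLOCKS_PER_SEC,
             sum_l, (t2 - t1) * 1000.0 / CLOCKS_PER_SEC);
      free(arr); free(pool); free(order);
      return 0;
  }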

> For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster

The Advent of Code runs every year and I'm not sure about C (not something I track) but there are plenty of Rust submissions and while the Rust submissions take longer to come in, it's not THAT much longer. From memory it's like 2x on a timescale of minutes. The Python programs are not faster.


Not disagreeing or anything, but an Advent of Code-sized example will not be anywhere close in complexity to your typical CRUD application. The thing is, we possibly could not even write most of these applications in a low-level language; we should not forget that while Rust indeed has good abstractions, low-level languages simply leak memory-related information up to the higher levels as well, making it necessary to deal with it at a high level. And in these cases, time to market, speed of refactors and maintainability are much more important than raw speed (which, again, may not even be that significant at all; e.g., often most of the work is done on the DB side).


> an Advent of Code-sized example will not be anywhere close in complexity to your typical CRUD application

Advent of Code-sized problems are the best-case scenario for Python. The difference in implementation time goes down as program size increases, because you spend a larger fraction of the time figuring out what you're implementing.

The reason it's not done is mainly social and not technical. Ecosystems matter and this is particularly the case in graphical desktop environments where massive amounts of time and effort are required to re-implement features matching users' expectations (e.g. Flutter).

If we're talking a server side http app server then the library requirements are significantly lower and the libraries are generally there in almost any language. To keep the language comparison the same, it's not significantly slower to implement a CRUD backend in Rust than it is in Python. Depending on your specific setup and how much pre-prep is involved it can be faster. I'm not aware of any framework in a dynamic language that can produce production grade endpoint handlers that are as terse as a Rocket app heavily leveraging request guards. The guards handle validation, conversion, session context and db connection based on the endpoint parameter types so you just write the happy path which is 1-2 lines of code for CRUD.

> speed of refactors and maintainability

Dynamic languages are significantly worse for maintainability and refactoring. I say this as someone who generally gets paid to be a frontend developer and spent years writing both Python and Clojure professionally. Despite my arguments, I do not write server side Rust for work because team coherence is more important for doing my actual job (solving a customer's problem) than programming language and I'm the only person in my company who writes Rust. I've been doing this long enough that I accept the worse solution as the cost of doing business. My personal projects don't have this constraint so perf is orders of magnitude better in exchange for a lot more knowledge and a bit more design work.


> Advent of Code-sized problems are the best-case scenario for Python. The difference in implementation time goes down as program size increases, because you spend a larger fraction of the time figuring out what you're implementing.

The difference in code size between a good and a bad language goes up as program size increases, because a large codebase makes it harder to keep it all in your head and forces you to add more layers and duplication.


>And in these cases, time to market, speed of refactors and maintainability are much more important than raw speed (which, again, may not even be that significant at all; e.g., often most of the work is done on the DB side)

About raw speed: maybe you want to take a look at these nice benchmarks? https://www.techempower.com/benchmarks/

You don't have to ditch Java or C# for your app and use C or Rust instead. But you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.

Also, if performance, throughput and stability weren't a thing, we have to wonder why Twitter switched from RoR to Java. Couldn't they just have bought more servers instead?


> you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.

In practice you only have so much attention and there are better places to spend it.

> Also, if performance, throughput and stability weren't a thing, we have to wonder why Twitter switched from RoR to Java.

If raw speed matters we have to wonder why Twitter started with RoR, and only switched (mostly to Scala, not Java, AIUI) once they had tens of millions of users.

In a tiny handful of outlier cases where a given piece of code is run literally billions of times a day it eventually becomes worth doing performance microoptimisations. But only after you've confirmed that there's absolutely massive demand for this particular code and already done your iterative refactors and figured out what the code should do.


> But you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.

Writing Java in a more performant way often means ditching OOP and using primitive types and primitive arrays everywhere instead, and also avoiding allocations to the point where all objects are reused. Such code is just as slow to write and as error-prone as C, and way worse than C++, where at least you have RAII, generics, STL algorithms and zero-cost abstractions.


You only have to do that in hot loops (in most problems), which you can easily profile with the very good observability tools that Java provides. The rest can be maintainable, good-quality OOP code.

I would even go as far as to say that OOP's encapsulation pretty much exists just for this: the "ugly" implementation can reside inside a class and you only have to care about its public API from the outside.


> I believe the majority of code is slow because it's written without any consideration at all to performance. I like Casey Muratori's philosophy of non-pessimization, where true optimization (measuring, working on hot spots) is rare and rarely necessary, but massive speedups compared to the general state of the art are achievable by simply not writing code using patterns that are inherently slow. This isn't deep algorithmic stuff; it's just avoiding copies and/or pointer chasing.

I don't/can't watch videos; I would be interested to see a written argument, but generally my view is that copies and pointer chasing are irrelevant because code is so far away from optimal already. You're right that code is mainly written without any consideration to performance, but that manifests itself first in poor choices of algorithm and/or data structure. I don't think there's anything "deep" about, say, using a hashmap rather than searching linearly through a list; indeed I'd consider it easier to understand than avoiding copying.

> The Advent of Code runs every year and I'm not sure about C (not something I track) but there are plenty of Rust submissions and while the Rust submissions take longer to come in, it's not THAT much longer. From memory it's like 2x on a timescale of minutes. The Python programs are not faster.

Advent of Code is still extremely tiny programs, and Rust is a much much better language than C for thinking in.


True. Most business apps I've seen look like people did serious research into how to write the worst code from a performance point of view: gazillions of wrappers upon wrappers, tons of indirection, data being copied and recopied thousands of times, creating FactoryFactoryFactories just for the sake of it.

If people paid more attention to how the data flows instead of to Uncle Bob and GoF, they would end up not only with a cleaner software architecture but also with better performance.


> I believe the majority of code is slow because it's written without any consideration at all to performance.

Both can be correct. I use standard data structures for many things even if some prefix tree might be better for the operations I perform on a set. I don't think about that in most cases, since computing power is cheaper than the additional work. If performance becomes a problem, I can optimize later.

Advent of Code isn't a realistic performance metric. For most software projects, working software is.


> For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster, because they'll have had a lot more time available to spend prototyping and picking good algorithms rather than debugging undefined behaviour because they used the wrong combination of integer bit lengths.

I think this is more nuanced than that. Competitive programming actually gives fairly good insight here, because skilled programmers are placed under time constraints and obviously care about performance.

What you see, usually, is that programs that don't need maximum performance are often written in Python, which allows people to quickly write code that needs less debugging. But double the allotted time and the C programmer finally has their code finished and free of bugs, and they're beating the pants off of your Python code right out of the gate. You try and come up with clever ways to avoid copies and drive down the Big-O, but the C programmer comes up with the same thing just a bit behind you and leads most of the time because they get a free 10x speedup just because of the language they're using.

It's not surprising that the highest levels of competitive programming are dominated by C (and to a greater extent, C++) programmers, with Python use extremely rare, almost always reserved for complicated algorithms that happen to bring a big-O improvement significant enough to justify writing them in Python.


The overwhelming majority of code isn't slow because it used the wrong algorithm; it's slow because the Product team can't demo or show off an increase in efficiency. It's because they aren't measured by their ability to reduce or control the growth of the AWS bill. And they probably wouldn't have the knowledge to ascertain a reasonable bill. So they push deadlines, which pushes devs to take ugly but fast-to-write solutions using high-level languages.

Also, performance improvements pay for themselves over months and years while new features usually pay for themselves faster (at least they do in forecasts).

And finally, touching working code to make a performance improvement necessarily has some risk and that risk might be worth more than a year of cost savings from the improvement.


>The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.

Take a look at The Computer Language Benchmarks Game, where all implementations use the same algorithm. An optimized C implementation can be 300 times faster than Python.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


That's exactly the kind of missing the point that I'm talking about: tiny programs where all implementations use the same algorithm, people have spent months microoptimising, and the results are utterly unreflective of realistic business problems.


> debugging undefined behaviour because they used the wrong combination of integer bit lengths.

It's a function of debugging time which is a function of the programmer's skills. I almost certainly fall into the Python subset though :/


There's a law that says that as resources become easier to consume, people consume more of them. Like when increasing the capacity of a road doesn't reduce the time spent in traffic. Wouldn't that apply to software? I feel like it's a bit unfair to call software "slower and bloated" when you wouldn't call a road "bloated" because it has a new lane. People can do more things now.


More like I start manufacturing 3-lane-wide cars because the material I'm using is cheaper than cutting and engineering it down to normal size - that's how I see the modern software analogy in this context.


I don't think that's fair, we do way more things now with computers than before.


It feels like the frustration comes because I see the analogy this way:

The roads have been given extra lanes, and the total possible throughput of the road has increased. As a result, car manufacturers have decided that they can use a cheaper manufacturing process which makes slower cars. But when anyone complains, they point to the additional lanes and say ‘but the number of cars that get from point A to point B has increased, even if each trip is slower.’

The end user isn’t getting a choice when software developers make slower code each year. They have to buy more expensive hardware to keep up with the slowdowns of software developers.


I get it. But what exactly do you want to happen here?

In terms of your analogy, do you want more expensive cars that less people have?

Software developers could spend more time optimizing their code. But that means they are spending less time elsewhere: fixing bugs, adding features, developing other products.


It should be noted that many times features come in that people do not want.

“The new format bar” for Slack and “huddles” (perhaps less antagonizingly) are examples of things that Slack has done seemingly to prove they’re still working on the program.

Despite the features not being needed in the slightest.


I actually disagree, at least about huddles. Huddles are a direct response to the low-friction nature of Discord voice channels (and arguably Teamspeak/Mumble/etc before Discord, although nothing else married text and voice channels quite as well before Discord.) It's almost silly how small of a difference there is between starting a huddle and a channel-wide "call" - the only difference is that it doesn't start out by "ringing" for everyone in the channel. But that tiny difference completely shifts the feeling from "this is a formal communication" to "we're all just hanging out and you can join if you want but no pressure."

IMO huddles happened because the pandemic moved so many more people remote that the need for a water-cooler replacement became much more obvious.


I mean, we’re programmers here and it’s the easiest damn thing to learn. We could just write the software we like. Scratch our own itch.

I, for one, don’t care that much for performance so my personal forks of projects are all feature oriented.

But if you care about performance this much then writing something high performance is the best way to find others like you.


That would imply that we do less or as much things as before with computers which I think is false. Maybe you don't agree with what everyone does, but that doesn't mean that it's worthless.


Jevons paradox.


I think the kind of optimization here is beyond that which people used to do. AVX didn't exist when it was common to write assembly.


Yes, this is extreme and way beyond my C capabilities at any point in time. I was making a general point which I do believe is valid.


Sometimes bloaty design decisions transcend the question of programming language. Electron probably has (I didn't check) a faster JavaScript engine than Qt, but I'd expect that similarly sophisticated implementations of the same program in Qt or Electron show an advantage in performance to the former. But people use Electron because it's familiar and it's got a lot of active development, and more cynically, Google likes it that way because it means that people keep coding for Chrome (cf. Microsoft's VS ecosystem, Apple's walled garden, etc).

The problem with modern software is that vendors encourage new developers to learn using heavyweight toolkits because lock-in is profitable in the long run. One day Electron will work with Go, but it will still be a plague.


>FizzBuzz has many properties that make it very suitable for these kinds of optimizations that might not be applicable to general purpose code: + extremely small working set (a few registers worth of state) + extremely predictable branching behavior + no I/O

The working set depends on how you organize the data. I/O can be done in larger chunks most of the time. You have to think about the data you need and when you need it.

As for branching, that most of the time depends on the programmer rather than on the problem being solved. Situations where you need very complicated branching to solve a problem are rare. And I've seen O(n log n) algorithms beat clever O(n) algorithms because they could be better optimized for the CPU.


> for a problem that doesn't immediately jump out as SIMD.

It's a single transformation applied to the elements of a large array. Parallelization is obvious, and while the exact SIMDization may not be obvious, one is certainly motivated to formulate it. And a scatter-like approach does spring to mind right away, I believe.


FizzBuzz is a constant.

It should never branch.


Unrolling FizzBuzz is generally not a feasible solution.


Sure it is: https://codegolf.stackexchange.com/questions/215216/high-thr...

It executes just one (well predicted) branch every 30 numbers written, and incrementing/printing the line number is branchless too.

It's not as fast as the subject of the post (40 GB/s) but it's only a few hours of work.


The parent comment asserted that FizzBuzz should never branch. This would require it to not have any backwards control flow, and thus require full unrolling of an infinite sequence. Hence the infeasibility.


FizzBuzz is limited to [1…100].

Output as an ASCII string, it is a bit over 400 bytes including \n’s.

If you’re counting GB/s, you’ve failed to understand the specification.

One of the sophisticated engineering aspects of FizzBuzz is that all optimizations are premature.

It is a boring problem. People being people invent different problems to provide a chance to be clever.


I interpreted that as no branches, only loops.


It is possible to instrument the code to see how much overhead those branches add.


It pays off to understand the CPU architecture even if you are not writing assembly: https://blog.cloudflare.com/branch-predictor/
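
As a quick, self-contained illustration of the branch predictor's effect (my own toy example, not from the linked article): counting the elements above a threshold in the same data is dramatically faster once the data is sorted, because the branch becomes predictable. Exact numbers depend on your compiler and flags (an optimizer may turn the branch into a conditional move and hide the effect), so treat it as a demo only.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  static int cmp_int(const void *a, const void *b)
  {
      return *(const int *)a - *(const int *)b;
  }

  static long count_big(const int *v, size_t n)
  {
      long hits = 0;
      for (size_t i = 0; i < n; i++)
          if (v[i] >= 128)   // unpredictable on random data, trivial when sorted
              hits++;
      return hits;
  }

  int main(void)
  {
      enum { N = 1 << 22 };
      int *v = malloc(N * sizeof *v);
      if (!v) return 1;
      srand(1);
      for (size_t i = 0; i < N; i++) v[i] = rand() % 256;

      clock_t t0 = clock();
      long a = count_big(v, N);              // random order: many mispredictions
      clock_t t1 = clock();
      qsort(v, N, sizeof *v, cmp_int);
      clock_t t2 = clock();
      long b = count_big(v, N);              // sorted: the branch is almost free
      clock_t t3 = clock();

      printf("random: %ld in %.1f ms, sorted: %ld in %.1f ms\n",
             a, (t1 - t0) * 1000.0 / CLOCKS_PER_SEC,
             b, (t3 - t2) * 1000.0 / CLOCKS_PER_SEC);
      free(v);
      return 0;
  }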

Once upon a time most software was highly optimized, with hot code paths written in assembly. If you look at the DOS or DOOM source code, you will see lots of optimization.

When CPUs got more powerful, people got lazy and thought they could spend the improvements on conveniences.

Now we are at the point that we run "desktop" apps written in Javascript on top of embedded browsers.


Hard times create strong programmers. Strong programmers create good times. Good times create lazy programmers. And, lazy programmers create hard times.


Programming is a metaphor of life.


> If you look at the DOS or DOOM source code, you will see lots of optimization

If you look at VSCode's code, you’ll see even more of these. What really changed was that more people wanted to hack things and write non-PhD code, and web tech was (sadly) popular enough to take this role. It’s not that programmers are lazy, it’s that desktop frameworks utterly failed in their non-competitive niches. MSVC libraries replaced each other almost faster than GitHub’s new web frameworks, to the point that barely anyone can remember their chronology. Gtk+ got stuck for years dealing with C legacy, added non-C styling engines too late, and then got “owned” by a particularly deteriorating DE. Qt always had a not-so-clear position on community-licensed versions, and the C++ impedance mismatch didn’t help either. UIKit/AppKit got pretty useful and advanced in the last ~ten years, but they are iOS/macOS only. Minor frameworks like FOX, wx, FLTK, etc. never got enough traction or new ideas and were just shadows of their bigger brothers. Meanwhile, with Electron, Bootstrap and a little JS, one can make a page in a few minutes, which would take a few hours with a conventional desktop toolkit.

I mean, you are correct that programming went from hardcore to relaxed mode, but there is more history to it in the UI sense.


Business code must be optimized for maintenance first.

Because it will live for many years. It will have to survive in the hands of multiple caretakers. It will have to evolve to work with the underlying foundations changing (operating system, compiler/runtime, desktop -> web -> mobile).

That's different from most video games (one single release, then little patches) and from advent calendar code.


Hot code paths are still written in assembly, it's just on a different level. DOOM is an engine; nowadays people use off the shelf engines so that they can focus on creativity instead of low level details.

I mean, I could probably hyperfocus on storing my application state in a file on disk, but why should I bother when there are off-the-shelf SQL databases right there? Which have been optimized endlessly, I might add. I don't get paid to write low-level ASM, I get paid to build applications.

Edit: And to add, my current thing I get paid for is several orders of magnitude faster than the one it replaces. Not because I spend more time optimizing, but because I write it in sensible Go instead of 10K LOC PHP files that concatenate XML strings that get converted to JSON to be rendered in Dojo-flavored HTML concatenated together in JS because of reasons.


Not lazy, sensible. The market has spoken and it wants bloated electron apps with rapid development, rather than performant C/assembly apps that hardly ever change.


Were all Electron app users demanding something that eats up their RAM, crashes and runs slowly?

The market demand seems more like:

Jack: "Our bellowed CEO wants us to deliver our wonderful app to unwashed masses still using a desktop. Our enlighted marketing team made a study which realized that for whatever weird reason, corporations and businesses still make heavy use of those boxes which come with a mouse and keyboard attached."

Joe: "Sure thing boss, we will have to hire some programmers and testers and will take about a year or so."

Jack: "I forgot to tell you that the marketing study already took a year and a very large budget because we hired the best of the best to do it. One year we don't have, money we don't have. But what about those people who wrote our web app? We still pay them. Can't they deliver?"

Joe: "We will have our glorious desktop app in two weeks, boss, I just had an idea."

Jack: "Do that and I will personally remind about you to our bellowed CEO when he will want to do some promotions."


“The market”?

The power dynamics of companies/customers are often not as dynamic as all that.

If slack is electron and I work at a company that uses slack: I must use it.

The competition in that space is all electron, you can’t choose.

It’s like saying that “the market chose non-ECC RAM”. No, Intel chose for you and you don’t get much choice except to suck it up (or pay well above the odds).

It takes a lot to avoid using a product. I mean people still use Oracle products!


That does not actually contradict the point. We got stuck in a suboptimal local maximum due to all the early design decisions of browsers and JavaScript. The original inventors did not expect anyone writing something like Google Drive for the web.

The market surely pushes against bloated Electron apps, yet the convenience of having the same app on the web as well as "native", and the amount of man-years that went into making HTML+JS the richest multi-platform UI framework on the market, are more important.


The market has spoken. It wants factories to dump waste into nearby rivers because it allows them to make cheaper products that are more competitive.

It's not really a good argument is it?


There is no market for things that just work at v1.0 and then continue to work flawlessly. You won't sell version N again and again, and there is no support contract for something that does a single thing and does it well.



Yes, thank you for mentioning it.


These days, compilers have gotten so damn smart with their optimizations that using assembly will actually make your code slower unless you really know what you're doing, like the author of the original post here.

A few years ago, I was doing Project Euler problems, and one of them had a solution that took about 45 minutes for my C program to solve. I decided that it was an incredibly simple problem, so I rewrote it in assembly.

My assembly solution took an hour to run.

I disassembled my C solution, and couldn't figure out what it was actually doing, though my assembly knowledge is pretty weak.


People might not care to learn it, but isn't there a lot more to know nowadays re: the obscurities going on deep down at the lowest levels of modern architectures?


I don't think you should be an expert in CPU architectures, but having an understanding of memory layout and how instructions are dispatched is not very hard or time-consuming.

I see many people having trouble understanding the difference between value and reference, what pointers are, why it's better for structures to contain variables of the same type and have a certain size, why it's better to call a function once for a large chunk of data instead of calling it many times for small chunks of data, and why iteration or tail-call recursion is to be preferred over simple recursive functions.

The prevailing view is that most LOB apps won't care about performance because they are waiting for I/O, so we should not care about performance but about coding speed, OOP, Uncle Bob's principles, GoF patterns, REST and the trendy architecture of the day. While I'm sure that coding speed matters and a sound architecture matters, I also think that throughput matters, that minimizing delays matters, and that some problems are better dealt with by thinking about the data and how the CPU likes to access and work with it, instead of just firing up more Kubernetes pods and hoping that scaling will get rid of performance issues. By not thinking about the data and the CPU, we didn't get rid of the complexity; we just moved it to the infrastructure and to code that has to deal with a more complex infrastructure.
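
To illustrate just the "call a function once for a large chunk of data" point, a tiny sketch (the function names are made up for illustration): both versions do the same work, but the batched one pays the call overhead once and gives the compiler a loop it can unroll or vectorize. In a single translation unit the compiler may inline both; the difference matters most across module or ABI boundaries.

  #include <stdio.h>
  #include <stddef.h>

  static void scale_one(double *x, double k)            // per-element API
  {
      *x *= k;
  }

  static void scale_all(double *x, size_t n, double k)  // batched API
  {
      for (size_t i = 0; i < n; i++)
          x[i] *= k;
  }

  int main(void)
  {
      double data[1024];
      for (size_t i = 0; i < 1024; i++) data[i] = (double)i;

      // Chatty: 1024 calls, each doing a tiny amount of work.
      for (size_t i = 0; i < 1024; i++)
          scale_one(&data[i], 2.0);

      // Batched: one call, the loop lives next to the data.
      scale_all(data, 1024, 0.5);

      printf("%f\n", data[10]);   // 10.0 again after *2 then *0.5
      return 0;
  }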


We are also at a point where people whine that $5 for an app on your phone is "vastly overcharging". You get what you pay for.


At some point I believe we will start to bump up against the limits we saw in the first three computer revolutions (tubes, transistors, integrated circuits). This will cause the pendulum to swing from commodification to optimization. What I mean is, you won't be able to snap your fingers and scale up or out, thus optimization will begin again in earnest. Clearly this isn't always the case: GPU shaders, crypto ASICs, video processing... there are plenty of places where optimization is crucial for high-performance loop tuning. But optimization hasn't been required across the board like it was just before the three big innovations I described hit.


This tends to happen on a smaller scale with gaming consoles. Towards the end of a generation, the games coming out usually perform/look a lot better than the ones at the beginning of the generation. I'm assuming that's due to much more careful optimization, since the hardware can't be changed.

I've always been curious about how far we could really push modern computers if somebody wanted to spend the time going to the lengths in the original post when creating practical software. Of course it's usually not worth the tradeoff, but it's interesting to think about.


Protip:

while you made a great comment, ND people like me, and even most NT people, have difficulty ingesting walls of text like what you just wrote.

Please, therefore, break up your text into paragraphs every 3 sentences or so. It does wonders for readability for just about everyone. :)


…it’s literally four sentences.


You have multiple paragraphs longer than mine in your history.


This is what we call shifting the goalposts and tu quoque fallacious attacks.

Just because I have erred in the past does not mean I cannot suggest to you to improve your outputs.

Or, I have had learnings previously, and I am simply trying to pass them along to you.

Here is another protip: dinae be so defensive. I was not attacking you personally, but your reply clearly was an attempt to attack me.


And imagine not coming up with this solution in your next MANGA interview!


It takes a lot to correctly explain exactly why this set of design choices is fastest, but writing it just takes a fairly simple model of the CPU internals, knowledge of the instruction set, a focus on constantly measuring performance, and the agility to iterate quickly on different approaches.


And, reading more closely, beating the competition also took low-level OS knowledge and an understanding of the peculiarities of the benchmark in question.

The benchmark was about getting the output to a pipe as fast as possible, and there's this great pipe speed hack:

  // To produce output without losing speed, the program  therefore needs
  // to avoid copies, or at least do them in parallel with calculating
  // the next block of output. This can be accomplished with the
  // `vmsplice` system call, which tells the kernel to place  a reference
  // to a buffer into a pipe (as opposed to copying the data into the
  // pipe); the program at the other end of this pipe will then be able
  // to read the output directly out of this program's memory, with no
  // need to copy the data into kernelspace and then back into
  // userspace.
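
For a feel of what that looks like in practice, here's a minimal illustrative vmsplice sketch of my own (not the actual solution's buffering scheme): it hands the kernel a reference to a user-space buffer instead of copying it into the pipe. It only works when stdout really is a pipe, and a real program must not touch the buffer until the reader has drained it, which is why the actual solution double-buffers.

  #define _GNU_SOURCE
  #include <fcntl.h>      /* vmsplice */
  #include <sys/uio.h>    /* struct iovec */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      /* Pretend this is a freshly generated block of FizzBuzz output. */
      static char buf[] = "1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\nBuzz\n";
      struct iovec iov = { .iov_base = buf, .iov_len = strlen(buf) };

      /* Give the pipe a reference to buf; no copy into the pipe buffer.
         Run as `./a.out | cat` so that stdout is actually a pipe. */
      if (vmsplice(STDOUT_FILENO, &iov, 1, 0) < 0) {
          perror("vmsplice (is stdout a pipe?)");
          return 1;
      }
      return 0;
  }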


Also goes to show how much it would cost.


And maintain and evolve and debug and work on different machines, etc.


If you have talented staff, you'd be surprised how far you can get just by giving someone who already builds that kind of application as a hobby an unlimited supply of coffee.

Obviously finding talented staff is very hard, but once you have your tribe you can go a very long way. I look at apps made by some people I work with (fast, lightweight, etc.) and then compare them with the crap pumped out by startups with literally billions in capital. I think it's a question of confidence more than competence.


It shows that abstractions are leaky as f :)


No, this simply shows that abstraction slows performance, which is usually a worthwhile tradeoff.

Leaky abstractions are a different problem altogether.


The problem with that tradeoff is that you only compare the performance with the 'top layer of abstraction' that is already implemented, and not with the baseline.

Death by a thousand cuts.


Not usually. There's a stack of a thousand abstractions between me writing this and you reading it, but it still works fine.


Yep. I suspect most people here could write a working fizzbuzz on a whiteboard in language of choice in under 5 mins during a job interview.

Sure your Python/JavaScript/Haskell/Rust version builds on a bunch of abstractions, but it’ll run on just about anything, and …

“I've spent months working on this program”

That’s not what your boss wants to hear.


You can pump out a Modern C++ version in 5 minutes too that will run loops (haha) around the higher level languages. The readability won't even be very different...


True. I bet the Rust guys would come close too.

But realistically? For anything except code golf and nerd fights, the actual client requirement is probably better met by a WordPress widget written in php/html, because what they asked for is something that'll print the fizz buzz all the way up to the person's age when they log into the company website... Nobody is even going to notice if it takes a whole second to fizz buzz all the way to 95 :-)

(Now I'm wondering if that guy's raw hyper optimised x86 assembly can get transpiled to WASM... Because nerd fights are fun.)


> Now I'm wondering if that guy's raw hyper optimised x86 assembly can get transpiled to WASM

Not really. WASM is significantly simpler and more abstract than x86 assembly, and has a JIT compile step that probably wouldn't get anywhere near as optimized. You could probably hand-write a WASM version that would JIT compile to something roughly similar and still get very good performance, but it would probably be more comparable to the other compiled versions at best, rather than the x86 ASM one.


> probably wouldn't get anywhere near as optimized.

Yeah, I'm not doing this for performance, I'm doing out for nerd fight points and the lulz :-)

The Most Hightly Optimised Fizz Buzz EVER!

In your browser!

As a service.

Join the waiting list now! Email: [_____________] [submit]


>But realistically? For anything except code golf and nerd fights, the actual client requirement is probably better met by a WordPress widget written in php/html, because what they asked for is something that'll print the fizz buzz all the way up to the person's age when they log into the company website... Nobody is even going to notice if it takes a whole second to fizz buzz all the way to 95 :-)

What if 100 million people log in from different corners of the world? Would the WordPress widget still cut it?


In that scenario, the fizzbuzz widget is gonna be just a tiny tiny part of your WordPress scaling problem…


Not only can you write a modern C++ version in 5 minutes, but by carefully choosing your subset of C++ and coding style you can do so while being as readable and maintainable as in higher-level languages. C++ isn't hard because it's low level; it is hard because it is a Frankenstein monster, and I don't think there is a living being that masters all the complexities of C++. I would love to see a C++ Next version with lots of the crap cut out of it.


A C++ Next version with lots of the crap cut out is called Rust. ;)


Judging by its age and how much crap it's accumulated already, I think in 10 years' time Rust won't be in a much better situation.

Similarly, judging by C++'s current trajectory, in 10 years it will have a simplified subset (enforced with something similar to --pedantic) which is easier to get right than Rust is today. Also, it will have a static analysis borrow checker based on clang-tidy.


Just put everything inside an Electron container.

Unless you're talking about microcontroller programming, RAM is basically free.


It could be, but in most cases it's not, due to RAM commonly being non-upgradable in laptops.

A larger amount of RAM might still be cheap to install in the first place, but that choice is not always directly up to the consumer.


Correct, but Slack is based on Electron and is wildly successful.

End users are used to tolerating a basic chat application eating an indefinite amount of RAM.

From a business POV it doesn't make sense to spend time optimizing, since most users don't seem to mind.


Cache, however, is not.



