I think a better description than “underutilized” would be “sunk capex cost” - Google (or any cloud provider) cannot run at 100% customer utilization because then they could neither acquire new customers nor service transitory usage spikes for existing customers. So they stay ahead of predicted demand, which means that they will almost always have excess capacity available.
Cloud providers pay capital costs (CapEx) for servers, GPUs, data centers, employees, etc. Utilization allows them to recoup those costs faster.
Cloud customers pay operational expenses (OpEx) for usage.
So Google generally has excess capacity, and while they would prefer revenue-generating customer usage, they’ve already paid for everything but the electricity, so it’s extremely cheap for them to run their own jobs if the hardware would otherwise be sitting idle.
As you run close to 100% utilization, waiting times grow toward infinity. You don't want that. It might be acceptable for your internal projects (the actual waiting time won't be infinite, and you'll cancel jobs if it gets too long), but it's certainly not acceptable for customers.
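For a rough sense of how sharply this bites, here's a minimal sketch using the textbook M/M/1 queueing formula; the service rate and utilization values are invented for illustration, not anything specific to any real fleet:

```python
# Rough illustration of why waiting times blow up near 100% utilization.
# Uses the textbook M/M/1 queue result: mean wait in queue Wq = rho / (mu * (1 - rho)),
# where mu is the service rate and rho is utilization. Numbers are invented.

def mean_wait(utilization: float, service_rate: float = 1.0) -> float:
    """Mean time a job spends waiting in an M/M/1 queue, in units of the service time."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) -- at 1.0 the wait diverges")
    return utilization / (service_rate * (1 - utilization))

for rho in (0.50, 0.80, 0.90, 0.95, 0.99, 0.999):
    print(f"utilization {rho:6.1%} -> mean wait {mean_wait(rho):8.1f}x the service time")

# utilization  50.0% -> mean wait      1.0x the service time
# utilization  99.9% -> mean wait    999.0x the service time
```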
There is a genre of game called "time management games" which will hammer this point home if you play them. They're not really considered 'serious' games, so you can find them in places where the audience is basically looking to kill time.
3. The way a task gets done is that you click on it; the next time a worker is available, the worker starts on that task, which occupies the worker for some fixed amount of time until the task is complete.
4. Some tasks can't be queued until you meet a requirement such as completing a predecessor task or having enough resources to pay the costs of the task.
You will learn immediately that having a long queue means flailing helplessly while your workers ignore hair-on-fire urgent tasks in favor of completely unimportant ones that you clicked on while everything seemed relaxed. It's far more important that you have the ability to respond to a change in circumstances than to have all of your workers occupied at all times.
> You will learn immediately that having a long queue means flailing helplessly while your workers ignore hair-on-fire urgent tasks in favor of completely unimportant ones that you clicked on while everything seemed relaxed.
In practice it's more complicated than this: Borg isn't actually a queue, it's a priority-based system with preemption, although people layered queue systems on top. Further, granularity mattered a lot: you could get much more access to compute by asking for smaller slices (fractions of a CPU core, or a fraction of a whole TPU cluster). There was a lot of "empty crack filling" at Google.
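This isn't Borg's actual code, but a toy sketch of priority scheduling with preemption (all names, priorities, and capacities invented) shows how it differs from a plain queue: a high-priority job doesn't wait behind low-priority batch work, it evicts it, and small fractional requests can fill whatever cracks are left:

```python
# Toy sketch of priority scheduling with preemption (not Borg; everything here
# is invented). A high-priority job never waits behind low-priority work: it
# evicts the lowest-priority running jobs until it fits.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int   # higher number = more important
    cpus: float     # fractional CPUs -- the "smaller slices" mentioned above

class Cell:
    def __init__(self, capacity_cpus: float):
        self.capacity = capacity_cpus
        self.running: list[Job] = []

    def free(self) -> float:
        return self.capacity - sum(j.cpus for j in self.running)

    def submit(self, job: Job) -> list[Job]:
        """Schedule job, preempting lower-priority jobs if needed. Returns evictions."""
        evicted = []
        victims = sorted((j for j in self.running if j.priority < job.priority),
                         key=lambda j: j.priority)
        while self.free() < job.cpus and victims:
            victim = victims.pop(0)
            self.running.remove(victim)
            evicted.append(victim)
        if self.free() >= job.cpus:
            self.running.append(job)
        # (a real scheduler would restore victims if the job still doesn't fit; omitted here)
        return evicted

cell = Cell(capacity_cpus=4.0)
cell.submit(Job("batch-fill-1", priority=0, cpus=2.5))
cell.submit(Job("batch-fill-2", priority=0, cpus=1.5))
print(cell.submit(Job("serving", priority=100, cpus=3.0)))  # evicts both batch jobs
```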
TL;DR: You should think of and use queues as shock absorbers, not sinks. Also, you need to monitor them.
Queues are useful for decoupling the output of one process from the input of another when the two processes are not synchronized velocity-wise. Like a shock absorber, a queue lets both processes continue at their own pace, absorbing instantaneous spikes in producer load above the steady-state rate of the consumer. (Side note: if the queue is isolated code- and storage-wise from the consumer process, you can also use it to keep the producer running undisrupted when you need to take the consumer down for maintenance or whatever.)
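Here's a minimal simulation of that shock-absorber behavior (all rates invented): a bursty producer feeds a consumer that is only slightly faster on average, so the backlog spikes during the burst and then drains back to zero.

```python
# Minimal simulation of a queue as a shock absorber between a bursty producer
# and a steady consumer. The consumer's steady-state rate exceeds the producer's
# *average* rate, so bursts are absorbed and the backlog drains afterwards.

from collections import deque

backlog: deque[int] = deque()
CONSUMER_RATE = 3                        # items drained per tick
produced = [2, 2, 8, 6, 1, 1, 1, 1]      # burst at ticks 2-3; average ~2.75 < 3

for tick, burst in enumerate(produced):
    backlog.extend(range(burst))         # producer pushes its burst onto the queue
    for _ in range(min(CONSUMER_RATE, len(backlog))):
        backlog.popleft()                # consumer drains at its steady rate
    print(f"tick {tick}: produced {burst:2d}, backlog {len(backlog)}")
```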
Running with very small queue lengths is generally fine and generally healthy.
If you have a queue that consistently runs with a substantial backlog, then you have a mismatch between the workloads of the processes it connects: you either need to reduce the load from the producer or increase the throughput of the consumer.
Very large queues tend to hide the workload-mismatch problem, or worse. Often work put into a queue is not stored locally on the producer, or is quickly overwritten, so a consumer-end problem can mean potentially irrecoverable loss of everything in the queue, and the larger the queue, the bigger the loss. Another problem with large queues is that if your consumer process is only slightly faster than the producer process, a large backlog can take a very long time to work down, and it's even possible (admission of guilt) to configure systems using such queues so that they cannot recover from a lengthy outage, even if all the work items were stored in the queue.
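Back-of-envelope numbers make the slow-drain problem concrete (the rates and outage length here are invented for illustration):

```python
# How long a backlog takes to drain when the consumer is only slightly faster
# than the producer. All rates are invented for illustration.

producer_rate = 1000          # items/sec arriving
consumer_rate = 1050          # items/sec processed (only 5% headroom)
outage_seconds = 4 * 3600     # a 4-hour consumer outage

backlog = producer_rate * outage_seconds     # 14.4M items queued up
drain_rate = consumer_rate - producer_rate   # net 50 items/sec of catch-up
print(f"hours to recover: {backlog / drain_rate / 3600:.0f}")   # -> 80 hours
```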
If you have queues, you need to monitor your queue lengths and alarm when queue lengths start increasing significantly above baseline.
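A sketch of the kind of check that implies, with the window size and the "significantly above baseline" threshold as invented placeholders:

```python
# Sketch of a queue-depth alarm: compare the current queue length to a rolling
# baseline and fire when it climbs well past it. Window and multiplier are
# invented placeholders.

from collections import deque
from statistics import mean

class QueueDepthAlarm:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.history: deque[int] = deque(maxlen=window)   # recent depth samples
        self.threshold = threshold                        # "significantly above baseline"

    def observe(self, depth: int) -> bool:
        """Record a sample; return True if the alarm should fire."""
        baseline = mean(self.history) if self.history else 0.0
        self.history.append(depth)
        return depth > max(10, baseline * self.threshold)  # floor avoids noise near zero

alarm = QueueDepthAlarm()
for depth in [5, 6, 5, 7, 6, 40]:          # sudden growth on the last sample
    if alarm.observe(depth):
        print(f"ALERT: queue depth {depth} is well above baseline")
```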
I doubt they are doing this, but if they ran burn-in tests with 3 machines doing identical workloads, they could validate the workloads and also test new infra. Unlike customer workloads, it would be OK to retry on error.
This would be 100% free, as the electricity and "wear and tear" would be incurred anyhow.
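As a sketch of the idea (run_workload and the host names are hypothetical stand-ins): farm the same deterministic job out to a few otherwise-idle machines and flag any machine whose result disagrees with the majority; a retry or re-run costs nothing but already-sunk capacity.

```python
# Sketch of the burn-in idea: run the same deterministic workload on several
# otherwise-idle machines and flag any machine whose result disagrees with the
# majority. run_workload() is a hypothetical stand-in for the real job dispatch.

from collections import Counter
import hashlib

def run_workload(machine: str, seed: int) -> str:
    """Hypothetical: ship the job to `machine`, return a digest of its output."""
    # Stand-in computation so the sketch runs locally; a flaky machine would
    # return a digest that differs from its peers.
    return hashlib.sha256(f"{seed}".encode()).hexdigest()

def burn_in(machines: list[str], seed: int = 42) -> list[str]:
    results = {m: run_workload(m, seed) for m in machines}
    majority_digest, _ = Counter(results.values()).most_common(1)[0]
    return [m for m, digest in results.items() if digest != majority_digest]

suspect = burn_in(["rack1/host07", "rack3/host12", "rack9/host01"])
print("suspect machines:", suspect or "none")
```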