Linux Performance

keepamovin · on Oct 14, 2023

I use tuned for my Debian and Ubuntu VPSs that run real-time apps, and it seems to work well. Simpler than me toggling kernel parameters (also known as sysctl settings or kernel tunables) myself.

https://access.redhat.com/documentation/en-us/red_hat_enterp...

  sudo apt install tuned

cloudengineer94 · on Oct 15, 2023

I also use Tuned for SAP workloads, love it!

spandextwins · on Oct 14, 2023

Didn't know Brendan Gregg was at Intel now.

https://en.wikipedia.org/wiki/Brendan_Gregg

He is also the star of the Shouting in the Data Center viral video

https://www.youtube.com/watch?v=tDacjrSCeq4

True genius!

PaulDavisThe1st · on Oct 14, 2023

I think it worth noting briefly that almost everything discussed in TFA concerns bandwidth (network, disk, other I/O, CPU) and not latency. That's understandable, because performance for a lot of people is about bandwidth. But there are some of us for whom bandwidth definitely takes a back seat, and you need a different set of tools for tuning latency in Linux.

dan-robertson · on Oct 14, 2023

That’s a pretty reasonable point. It’s much easier to increase bandwidth than latency too[1] so often caring about latency can be important.

0. Another thing people may want to optimise for is performance per watt, but I won’t say much more about it.

1. There are cases where bandwidth optimisations are latency optimisations, eg if you can fit more of your processes onto one box, you are reducing the average distance between the processes and whatever they talk to and hence average latency

2. A very obvious thing to do when optimising for latency is increase bandwidth enough that the bandwidth doesn’t throttle you

3. I feel like mostly if you are aggressively optimising latency, there isn’t much Linux tuning to do. Maybe I’m wrong – I don’t really know much about this – but I think it’s mostly pinning to a core, running tickles, doing user space networking, and then hardwarey things like tuning page size, SMT, power-saving settings, and other things like choice of hardware.

[1] https://pdfs.semanticscholar.org/bce7/5f78d340cac32dccd8631f...

toast0 · on Oct 14, 2023

I would say throughput rather than bandwidth, although the meaning is similar, people have been discussing throughput vs latency for a long time.

Sometimes you can improve both, but often it's a tradeoff.

meisel · on Oct 14, 2023

What are the tools and resources you would recommend with respect to latency tuning?

interroboink · on Oct 14, 2023

Not the person you asked, but generally you might want to look at "frame-based" profilers. These are typically used in video games, but the concept is general, and can apply to other applications. The "frame" could also be something like a request or transaction being processed. I like Tracy[1], myself.

Another latency metric that you'll see, often w/respect to web apps and microservices is "P99" and similar. This is the amount of time in which 99% of requests get their response. For a higher percentile, you get a better idea of worst-case performance.

[1] https://github.com/wolfpld/tracy

klysm · on Oct 14, 2023

Latency of what particular system?

SkyMarshal · on Oct 14, 2023

I'm sure there's lots of info on tuning linux servers, both for throughput and latency, so how about for desktop workstation use cases?

stefan_ · on Oct 14, 2023

With the caution that if your goal is to reduce latency for the realtime properties of your system, chances are that you can turn any number of different knobs on your out of the box Linux distribution but it will not result in a satisfactory system.

thanatos519 · on Oct 14, 2023

Waiting is also a good strategy!

As of Linux 6.5, the scheduler understands that when one SMT "core" is busy, that means it might not be the best idea to schedule something on the the other "core", since it's really just a single core with a very low cost context switch. This makes certain very-parallel things noticeably snappier for me, and I can see it on the CPU usage graphs.

YMMV due to cache coherency and NUMA issues. :D

sambazi · on Oct 14, 2023

simply disabling smt also gives the "snappier" effect for earlier kernels while thwarting some side-channel attacks and increasing power efficiency

MonaroVXR · on Oct 15, 2023

I still,don't know why my file explorer syart faster on my 4 core n95 intel vs 5700u

Same goes for my dual core laptop vs quacore laptop, both same generation, but other one runs definitely hotter (4 core)

dang · on Oct 14, 2023

Linux Performance - https://news.ycombinator.com/item?id=8205057 - Aug 2014 (22 comments)

nickdothutton · on Oct 14, 2023

If you are of a mind to change a tuneable parameter yet cannot tell me why this tuneable will have the desired effect, or why it is so-set in the first place, then I will not allow you to change it (in prod).

“Chesterton’s tuneable parameter”.

klysm · on Oct 14, 2023

Even if I could tell you a hypothetical reason, why not benchmark outside of prod anyway?

Brian_K_White · on Oct 15, 2023

Even in prod (sometimes) there is no other way to "tell me if it will have the desired effect". Changing something to see what happens, even in production, is not automatically wrong. It's not automatically right either. It's almost like there is no thoughtless simple rule that is always right.

p5a0u9l · on Oct 14, 2023

Nicely done Brendan! Thank you, knowing Brendan’s work with eBPF, I take this as a way to more easily monitor and assess performance under different types of performance. Tweaking/tuning comes with trade offs, and I usually end up optimizing one thing to the detriment of others..

Side note, I’ve found btop a super useful replacement for glances, to have an all-in-one TUI view of system performance and loading. Wonder how much those dev(s) are leveraging this, and whether anything out there’s motivated to build better TUI monitoring tools.

Every server I go on, first thing is, start up tmux, dedicate one window to btop.

umvi · on Oct 14, 2023

For me, "tuning" linux for performance = disabling spectre/meltdown mitigations (in this case compute nodes are running in a VPC with no internet access, so seems pretty low risk)

Vecr · on Oct 15, 2023

Depends on what CPU you are running, on Zen 4 it's not supported to disable the mitigations and caused bugs/crashes. I think they did fix that exact crash but I'd still not recommend it. New CPUs from both AMD and Intel are designed to be run with at least the default mitigations on.

costco · on Oct 14, 2023

Bookmarked! This will be useful to me soon for something I'm working on.

I haven't read all the slides yet but one thing I was wondering was if you ever found any significant performance increases from kernel build options. In my Gentoo days when I would play around with build flags I would change kernel Makefile to use -O3 and apply a patch for -march=native. In hindsight, looking at some Phoronix benchmarks it appears this is actually harmful to a number of workloads. Curious if you ever found any cases otherwise.

spaintech · on Oct 15, 2023

Great site! I kind of have a predisposition to summarize linux performance, be it tuning or monitoring, taking a deep breath…

This is such a depth subject, with a long list of variety of observability tools. At minimum, make sure you know deeply uptime, dmesg, and iostat. These are your friends to give you a glimpse into various system aspects like load, memory, CPU, and more, enabling a diagnostic overview of system health. This is what I call, the “let me take a look at it” check list, 1st page of 100!

When emphasizing methodologies for performance analysis I recommend careful benchmarking to holistically evaluate system behavior and workload characteristics. with before and after scenarios. Make smaller changes first, then gradually compound what you think will provide benefits. Remember, labs and production never behave the same.

This is where it gets tricky, CPU profiling with tools like “perf” and visual aids like flame graphs enable targeted analysis of CPU activity, along with tracking hardware events to optimize computational efficiency. You need to know more than “it’s the app man, was fine until the latest release from development”

When you are the admin and speaking to a developer; Linux, tools like ftrace and BPF come into play, allowing for detailed tracking of kernel function execution and system calls, which can be vital in troubleshooting and performance optimization. You can also be the developer, varying the admin’s intuition… as the saying goes, trust but verify.

When it’s your code, then you better know BPF! It not only facilitates efficient in-kernel tracing but also propels the development of advanced custom profiling tools through bcc and bpftrace, offering deeper insights into system performance.

Last comment, it’s %$$% hard! Tuning means you need to navigate through adjusting a myriad of system components and kernel parameters, from CPUs and memory to network settings, aiming to optimize performance and reliability across various system workloads, else you can blame it on the network! :D

Really, you need to have a good behavioral attitude at change management, as chasing code or kernel parameters could be a daunting task that just overwhelms everyone in a moment where you might be time constrained and the preasure could lead to a higher degree of human errors.

jtriangle · on Oct 14, 2023

Current kernel and current distro tuning is almost always folly unless there's a specific issue you're trying to work around.

Trying to squeeze a little more juice out of something is bound to come at the detriment of something else, or worse, break something else in unexpected ways.

Basically, if the tunables aren't obvious in whatever default config you're using, the issue isn't in that config, it's that you're asking too much of your hardware and just need better hardware.

jeffbee · on Oct 14, 2023

> just need better hardware.

That's .. yeah, that's completely false. I can think of dozens of things that are not right out of the box on any distro, on any hardware, in common practice. For example suppose you roll out Ubuntu on an EC2 instance, say a c6i.16xlarge, a 32C/64T, single-socket x86 server with a Nitro ENA. Where are the netrx interrupts delivered? Is RSS on/working? RPS, XPS? Interrupt coalescing? The distro can't make optimal choices for all use cases, but what they ship by default is a config that's not optimal for any use case. Literally nobody would consciously choose the defaults after thinking it over.

bravetraveler · on Oct 14, 2023

Most common example I can think of is 'net.core.somaxconn'. The default limit out of the box is laughably small. Has been for decades at this point.

Jach · on Oct 14, 2023

Is 4096 still too small over 128? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

bravetraveler · on Oct 15, 2023

Nope, that's pretty decent

... this still hasn't made it to the distributions we're using, though. More specifically: the kernel releases. There's a lot in service pre 5.4.

Hanging on by a thread. Relief is near.

jeffbee · on Oct 14, 2023

According to that message, they lifted it to 4096 8 years ago, presumably for reasons. Have your systems got any bigger or faster since 2015?

n3storm · on Oct 14, 2023

that is the free/open bsd (from what I know) thinking path and I love there are linux people that think on that terms too. not every year is the Linux on the Desktop year! :D

toast0 · on Oct 14, 2023

> Current kernel and current distro tuning is almost always folly unless there's a specific issue you're trying to work around.

Of course there's no reason to tune if stock works fine. Plenty of people buy or a rent a reasonable computer and it has more than enough capacity for their work with default tuning. That's fine.

But when you run out of CPU or memory or X, it's often a good idea to see if there's reasonable things you can do to get more out of the hardware you already have. Depending on what you're doing, there's often a lot of room for improvement.

For some networking tasks, doing proper alignment of threads and work with Receive Side Scaling or similar can make a tremendous improvement in capacity versus naive threads. In some environments, the bandwidth costs when you're using enough capacity to see that mean that machine costs of doing it well versus naively don't matter, so you may as well do it naively and spend your engineering time elsewhere. In other environments, gettingthe same work done with 10% of the nodes is valuable.

Also, in many cases, better hardware needs more tuning, rather than less. You don't need to spend a lot of time avoiding cross core communication on an 8-core single socket machine. But if you get a dual-socket, 128-core per socket machine and you're not careful about cross socket communication, you'll spend a lot of CPU on memory arbitration (which you'll have to know or learn how to look for)

simne · on Oct 14, 2023

Unfortunately for me, I have few times suffered with performance issues, and have not found good deep resource fast enough. For example, few times need to recompile FFMPEG or Unreal Engine, and have spent weeks for things, which on my hardware done in hours.

Now bookmarked this immediately.

Have not read deep, but from first view look good!

talent_deprived · on Oct 14, 2023

Great site, excellent visual layouts, and the site runs on the Apache web server so you know he knows what he's doing. :-)

anthk · on Oct 14, 2023

Often, if you tune up your settings for performance and interactivity, I/O will suffer, and viceversa. Your beast serving/copying tons of data concurrently might not be the best one to play that Vulkan/GL 4.5 game without frequent slowdowns.

alberth · on Oct 14, 2023

I wish there were optimization scripts based on your use case like: web server, database, etc … that would turn off unnecessary services & tune settings appropriately.

For both OS and services.

notetaker · on Oct 14, 2023

Thank you.

Would any one happen to know a similar set of resources for Windows tuning (preferably Windows 2019 AWS EC2s)?

metadat · on Oct 14, 2023

(2021)

dijit · on Oct 14, 2023

The issue with performance is that junior sysadmins and developers start flipping knobs thinking that the defaults are somehow holding them back.

The truth is usually there are tradeoffs and the defaults fit a broad general case.

If you want throughput there are tunables for that, if you want low latency then usually those are inversely correlated. Same for tuning for low data loss after failure and so on.

You have to spend time learning the tradeoffs, which sysadmins used to do- now nobody has time as they have been munged into one role at many places.

karmakaze · on Oct 14, 2023

This broad case is much broader than the typical servers that Linux is often used for. A simple example is file access times on a (database) server. These are largely unnecessary. It's even rare for a desktop user to actually look at these.

In the past I've been a sr dev (but a jr sysadmin) and was tasked with improving the performance of an upgraded database server. The problem turned out to be with NUMA on the larger server which a combination of reading and semi-random config fiddling of both Linux and MySQL parameters (plus a bit of BIOS/CMOS tweaking) brought up to expected levels.

There's often no better way than learning on the job as there is so much to know that you can't simply learn them up front for when you will need it. What we can do is learn what there is to know and remember to look into those if it seems relevant. I mean everyone who's well experienced now probably started config twiddling somewhere to get there.

js2 · on Oct 14, 2023

There's a right and a wrong way to learn though. The wrong way, which is the one I most often see is: smash keys until it appears to work w/o every bothering to try to understand the deeper problem or why some particular combination of key smashing appears to work. This becomes cargo culting over time. Stack Overflow and ChatGPT make this oh so much worse. Often the change then get committed with a useless message such as "make thing work" w/o any explanation of what was broken, what the change fixed, and why or how. It becomes instant tech debt.

The right way is to follow the scientific method. Collect data. Make a hypothesis with a plausible mechanism of action. Test the hypothesis. Arrive at a solution. Record how you arrived at your new found knowledge so that those who follow you understand why you made the change. The person who follows you is your future self as often as not.

unmole · on Oct 14, 2023

> A simple example is file access times on a (database) server.

This isn't a issue if you use `relatime` which is the default.

pflanze · on Oct 14, 2023

To give context for those who don't know: relatime is a relatively (I'm a dinosaur) newer introduction into the Linux kernel; mounting partitions noatime used to be the only alternative to the default which was updating the atime on every access, thus fiddling with that setting used to be important. Not any more.

BTW I used to continue to mount everything noatime anyway, since having the atime field be set upon file creation and not anymore afterwards was a way to get file creation time, and I found that more useful than the access time. This isn't necessary either anymore since the introduction of an actual file creation time.

vbezhenar · on Oct 14, 2023

The truth is that often nobody thinks too much of defaults, unless they are horribly wrong. So there are two good things about defaults: they're probably not horribly wrong and they don't require any additional work.

Some defaults are just historical curiosities, some defaults were configured 20 years ago and nobody took the crusade to update them, some might be bad, but changing them would break too much stuff in the wild.

Now I'm not suggesting that everyone should change everything. I almost never change defaults, myself. But I just don't agree than defaults are good. They're probably not bad, and that's about it.

eduction · on Oct 14, 2023

That may be, but Brendan Gregg is not a junior flipping switches, and a resource like this can help others do smart tuning.

dijit · on Oct 14, 2023

I didnt expect anyone would read my comment that way.

No, Brendan Gregg is certainly no junior. The issue is that people take his advice as law and do not read further.

I, myself, remember flipping random switches because a (book) resource said that it would unlock performance.

I’m suggesting that taking the time to understand performance properly is ideal and I am attempting to urge people to properly invest the time.

Then I am lamenting the fact that we do not actually have time over because there’s so much smushing of responsibilities.

eduction · on Oct 14, 2023

Ah fair enough, sorry if I misunderstood!

traceroute66 · on Oct 14, 2023

> Brendan Gregg is not a junior flipping switches

That's because Brendan Gregg doesn't flip switches, he shouts at hard drives[1]. ;-)

[1] https://www.youtube.com/watch?v=tDacjrSCeq4

pmcf · on Oct 14, 2023

I do the same thing to my kids to change their performance.

cinntaile · on Oct 14, 2023

Does it work?

bornfreddy · on Oct 14, 2023

Not GP, but no, it doesn't. It appears to work but then side effects get much worse. Taking the time to understand and to tweak the right settings, I mean, explain, is much better in the long run.

nimbius · on Oct 14, 2023

though i dont know much about it, i run linux and most of the documents i hear about tuning edge cases are now either avoided or boilerplated with a strong word of caution. For example: 10Gb ethernet used to require a litany of sysctl.conf chicanery to even approach half the line speed. not the case anymore, and most of the old kernel 2.4 optimisations are either nonsensical in 2023 or actively worsen the performance of the interface.

tyingq · on Oct 14, 2023

> The truth is usually there are tradeoffs and the defaults fit a broad general case.

Part of the problem is that reasonable defaults for performance is a somewhat new phenomenon. It used to be that the defaults for Linux kernel settings, Apache, MySQL, etc... were terrible for production use.

So there's a lot of history that "you have to change them" burned into people minds, documents, etc.

jeffbee · on Oct 14, 2023

> defaults fit a broad general case.

Yeah, they fit the general case of a MIPS R4400 with a 1mbps network adapter, a situation that nobody faces today. I think the most glaring example is `rmem_max` which is never sufficient to support a coast-to-coast 1gbps flow, and every individual Linux user in history has needed to independently discover this stupid sysctl.

ranting-moth · on Oct 14, 2023

The problem you're explaining can be generalized:

"Junior X starts flipping knobs thinking Y is somehow holding him back"

This is sadly how many of us have to learn. If you don't have a mentor or someone to check your work, it's a way to learn.

Experience will tell you not to make bad decision. You get experience by making the wrong decisions.

klysm · on Oct 14, 2023

Most of the time I senior folks say benchmarking is the only way to know. So I'm not sure that this piece of advice is good.

efortis · on Oct 14, 2023

I wouldn't say that that's a performance tuning, seniority, or even smart people problem. I've seen smart people blindy tuning knobs. Similarly, switching to an expensive O(1) algorithm for a tiny space. It beats me, but I think the problems are closer to lazyness or trying to discredit something.

SkipperCat · on Oct 14, 2023

Gregg's book "Systems Performance" was a real game changer for me. Helped me understand how Linux internals and system performance inter-relate. I love how he's able to take these pretty esoteric concepts and flesh them out. Truly one of the Linux GOATs

He also wrote a lot about Solaris, but I won't hold that against him /s.

avtar · on Oct 14, 2023

> He also wrote a lot about Solaris, but I won't hold that against him /s

Looks like the second edition of his book almost addresses that :P

"The second edition adds content on BPF, BCC, bpftrace, perf, and Ftrace, mostly removes Solaris" https://www.brendangregg.com/systems-performance-2nd-edition...