I prefer looking at output of "sar" which shows you nicely, in 10-minute increments, how idle the system was today (and I think usually this data captured is rotated daily for a month), and gives you also a good idea of whether you have processes waiting excessively for IO.
It also has a bunch of other options.
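For example (flags and paths as on the sysstat/RHEL setup discussed further down; adjust for your distro):
# sar -u
# sar -b
# sar -u -f /var/log/sa/sa06
The first gives today's user/system/iowait/idle split, the second the overall I/O transfer rates, and the third reads the saved file for the 6th of the month.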
On my own system, I also generally run "ps auwx" every 15 minutes, together with a scan of what queries the Postgres servers are running and a dump of web request activity (read the last 10k lines of the access logs, find out how long ago the first request was to get a rough hits-per-second figure and which areas of the application they hit). That way when someone says "hey, the system was slow around this time" I can go back and find out that some cron job had a dozen processes taking up tons of memory or blocking on IO.
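A minimal sketch of that kind of snapshotting as cron entries -- the /var/log/snapshots directory and the psql connection details are made up for the example, and percent signs have to be escaped in a crontab:
*/15 * * * * ps auwx >> /var/log/snapshots/ps-$(date +\%Y\%m\%d).log
*/15 * * * * psql -c 'select * from pg_stat_activity' >> /var/log/snapshots/pg-$(date +\%Y\%m\%d).log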
Some of those statistics also go into some RRD-based system which makes it easier to follow e.g. number of users logged in or number of Apache children based on weekday/time of day.
I'm a big fan of 'sar' as well. It's nice that it can also show you i/o wait for the sampled period. I also love vmstat, as you can use it to see everything that is happening with the system sampled every second if you like. The first two columns will show you the number of processes in the run queue as well as number of processes blocked on i/o.
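For example:
# vmstat 1
prints a line per second until interrupted ("vmstat 1 10" gives ten one-second samples and exits); watch the r and b columns.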
We must be running different versions of sar then, as "sar" by itself here (RHEL 5) shows information about the time split between user/system/waitIO/idle -- that certainly does not come from /proc/loadavg.
If you run "sar -q" you can get the load average information, but that's not particularly useful on its own, as you can't see whether a load average of 20 an hour ago was caused by heavy disk IO or a dozen CPU-bound processes.
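If you do want to line the two up for a past incident, something along these lines works (using the saved file for the 6th purely as an example; the daily files live under /var/log/sa on this box):
# sar -q -f /var/log/sa/sa06
# sar -u -f /var/log/sa/sa06
Compare the ldavg-1 column from the first against %iowait and %user from the second for the intervals in question.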
Nope, we're likely running the same version. The particular info you mentioned comes from /proc/stat (you're right that it's a different file), but again, it's read from the same sources top uses:
# lsb_release -d
Description: Red Hat Enterprise Linux Server release 5.3 (Tikanga)
# rpm -qf $(which sar)
sysstat-7.0.2-3.el5
# strace /usr/lib64/sa/sa1 1 1 &> results
# grep open results
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/lib64/libtermcap.so.2", O_RDONLY) = 3
open("/lib64/libdl.so.2", O_RDONLY) = 3
open("/lib64/libc.so.6", O_RDONLY) = 3
open("/dev/tty", O_RDWR|O_NONBLOCK) = 3
open("/proc/meminfo", O_RDONLY) = 3
open("/usr/lib64/sa/sa1", O_RDONLY) = 3
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/lib64/libc.so.6", O_RDONLY) = 3
open("/etc/localtime", O_RDONLY) = 3
open("/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
open("/proc/tty/driver/serial", O_RDONLY) = 3
open("/proc/interrupts", O_RDONLY) = 3
open("/proc/net/dev", O_RDONLY) = 3
open("/proc/diskstats", O_RDONLY) = 3
open("/var/log/sa/sa06", O_RDWR|O_APPEND) = 3
open("/proc/stat", O_RDONLY) = 4
open("/proc/meminfo", O_RDONLY) = 4
open("/proc/loadavg", O_RDONLY) = 4
open("/proc/vmstat", O_RDONLY) = 4
open("/proc/sys/fs/dentry-state", O_RDONLY) = 4
open("/proc/sys/fs/file-nr", O_RDONLY) = 4
open("/proc/sys/fs/inode-state", O_RDONLY) = 4
open("/proc/sys/fs/super-max", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/proc/sys/fs/dquot-max", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/proc/sys/kernel/rtsig-max", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/proc/net/sockstat", O_RDONLY) = 4
open("/proc/net/rpc/nfs", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/proc/net/rpc/nfsd", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/proc/diskstats", O_RDONLY) = 4
open("/proc/tty/driver/serial", O_RDONLY) = 4
open("/proc/interrupts", O_RDONLY) = 4
open("/proc/net/dev", O_RDONLY) = 4
# strace top -b -n 1 &> results
# grep open results
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/lib64/libproc-3.2.7.so", O_RDONLY) = 3
open("/usr/lib64/libncurses.so.5", O_RDONLY) = 3
open("/lib64/libc.so.6", O_RDONLY) = 3
open("/lib64/libdl.so.2", O_RDONLY) = 3
open("/proc/stat", O_RDONLY) = 3
open("/proc/sys/kernel/pid_max", O_RDONLY) = 3
open("/etc/toprc", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/root/.toprc", O_RDONLY) = -1 ENOENT (No such file or directory)
If you don't already know about it, dstat is a good one too, and lightweight enough that it can be run on heavily loaded production systems w/out as much "top problem" going on.
This is the second or third article I've read explaining load average, and sad to say I still can't explain it.
All I know is that when it's inexplicably over 3-4, you can't determine why (no processes are using much CPU), and one in ten database queries is taking up to 5 minutes under normal load, that day will not be the best of your life. Well, what I've since determined is that a full storage device, or an overloaded I/O subsystem under your disks, can produce very high load along with the accompanying performance of doom.
> All I know is that when it's inexplicably over 3-4
Depends on the box. Say you:
a) have an app that spawns worker threads on demand and doesn't have a limit
b) have a modern, cheap Nehalem X5600 CPU.
2 sockets * 6 cores * 2 threads per core means you're only fully realizing the investment when you have a load average of 24 (assuming each of those hardware threads is 0% idle, which you can check with e.g. top or the quick check below).
Heavy IO would show up as high %iowait rather than %user.
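A quick sanity check of load against what the box can actually run in parallel (the 24 is just what the hypothetical 2 x 6 x 2 box above would report):
# grep -c ^processor /proc/cpuinfo
24
# cat /proc/loadavg
For purely CPU-bound work, a 1-minute figure at or below the first number means you're not yet oversubscribed.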
That happens when you have an I/O problem somewhere, so lots of processes wind up blocked on I/O and are counted toward the load even though they aren't using any CPU. The fact that normal queries take 5 minutes is confirmation of this.
As you've figured out, this can happen because of a full storage device (writes fragment as the filesystem squeezes data into small empty spaces, so what could have been one write becomes many, adding lots of seek time) or because of an overloaded I/O subsystem, where the prefetch/caching tricks stop working and everything slows down.
The best solution is to never get into this situation in the first place. The second best is to find a way to halt I/O to the afflicted device until the system gets sorted out, figure out what went wrong, then re-add load and get back to a running state. The worst is to let things limp along and hope that diminished load eventually lets the disk work through its backlog and return to normal.
Are you talking about the measured load (as part of the load average) or a general state of the machine as 'heavily loaded'?
The short answer is: it should be smaller than the number of CPU cores you have, if the sort of tasks you're running are actually using the CPU (vs. just dispatching I/O, for example). A 24-core machine with load averages in the teens is still pretty responsive. Then again, a more precise definition would specifically say that it's the number of processes in the runnable state.
It's a measurement of the CPU as a performance bottleneck. Other bottlenecks (e.g. your disk(s)) aren't covered.
To some degree, it's voodoo. When you deal with machines of a particular OS every day, you get a feel for what the numbers mean.
> you can't determine why (no processes are using much CPU)
Roughly, the load avg is the length of the run queue (how many processes are waiting to be run or are being run, vs. how many processor cores are available) with some IO activity thrown in for good measure. It could be a lot of IO wait or a lot of small processes that are driving up the load avg -- not necessarily just one process trying to hog CPU time.
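The raw numbers come from /proc/loadavg; the output looks something like this (values made up for illustration):
# cat /proc/loadavg
0.20 0.18 0.12 1/80 11206
The first three fields are the 1-, 5- and 15-minute averages, the fourth is currently-runnable tasks over total tasks, and the last is the most recently allocated PID.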
There are a lot of good suggestions here -- many sysadmins are too reliant on load avg (myself somewhat included).
1. Load averages are not directly comparable across different Unix variants such as FreeBSD, Linux, Solaris, etc.
2. Load average is a good way to quickly check if there is anything else to look at.
However, a far better series of tools is sar plus iostat, vmstat, etc., which should help you more quickly determine whether your problem is CPU, disk, or network IO.
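For the disk side specifically, something like this (interval and count chosen arbitrarily):
# iostat -x 1 5
The extended output shows per-device await and %util, which is usually enough to tell a saturated disk from a busy CPU.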
So, what about threads? I've seen Linux versions where each thread counted toward the load just like the base process, resulting in a huge load average for a multi-threaded server application.
In Linux, running threads that are shown in 'top' (with thread mode toggled on via 'H') to be in state 'R' or 'D' are counted in the load average. You'll definitely see this in programs with a ton of threads, like MySQL or the JVM.
If you're not in thread mode in top, this is why you will sometimes see a process consuming > 100% CPU usage.
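If you want the same view without top, here's a rough sketch that counts R/D threads by command name (the stat field can carry extra modifier letters, so counts may split across lines):
# ps -eLo stat,comm | awk '$1 ~ /^[RD]/' | sort | uniq -c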
As far as I can tell, this article doesn't mention the most frequent problem people have comprehending high load averages on Linux: the inclusion of processes in uninterruptible sleep. This can result in a system with virtually no CPU load reporting a very high load average. (And, IMHO, makes the load average much less useful a metric than a pure running/run-queue-based method.)
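A quick way to check whether that's what is inflating the number is to list the D-state processes directly, e.g. (wchan may or may not resolve to a useful symbol depending on the kernel):
# ps -eo state,pid,wchan:30,comm | awk '$1 == "D"'
If that list is long while CPU utilization is low, the load average is telling you about I/O, not CPU.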
How about an article about measuring memory usage?
From what I've read, measuring the actual total memory consumption of an individual process is nontrivial on both Mac OS X and Linux, because of the way the stats are generated for things like ps, top, etc. and the way both kernels share memory between processes whenever possible.
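On Linux, one reasonable approximation for a single process is to sum the proportional set size (Pss) from smaps; this is only a sketch, assumes a kernel new enough to expose Pss (around 2.6.25), and <pid> is a placeholder:
# awk '/^Pss:/ {sum += $2} END {print sum " kB"}' /proc/<pid>/smaps
Pss divides each shared page by the number of processes sharing it, so it avoids double-counting shared libraries the way RSS does.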
My rule of thumb is that the end users of a web application will perceive that the system is "slow" if the 1 minute load average is above 4 on the application or database server.
Huh? What if I have a 48 core AMD server? A fully CPU-bound load average of 48 would be great.
Load average can tell you if the system is under-used, but it can't tell you how the system is over-used.
I regularly have a few dual core systems spike to load averages of 50+ because of suboptimal NFS mounts. The NFS issue keeps processes waiting around with nothing to do for a while. Those processes are still counted towards the "load" even though they have no CPU activity (they are blocked on IO).
Agreed, it is not accurate for huge servers and IO-bound cases, but I have found it to be a good rule of thumb for a LAMP web application stack on anything from single to 8-core machines.
It is a useful metric to start digging deeper into the machine.
The Linux kernel also checks to see if there are any tasks in a short-term sleep state called TASK_UNINTERRUPTIBLE. If there are, they are also included in the load average sample.