Understanding the top command

Example Output

load averages:  0.16,  0.11,  0.09                puffy.echinopsys.de 01:55:44
55 processes: 54 idle, 1 on processor                       up 722 days, 22:38
CPU0 states:  0.0% user,  0.0% nice,  1.0% system,  1.0% interrupt, 98.0% idle
CPU1 states:  0.2% user,  0.0% nice,  2.4% system,  0.0% interrupt, 97.4% idle
Memory: Real: 167M/861M act/tot Free: 2104M Cache: 593M Swap: 0K/3312M

  PID USERNAME PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
 7055 www        2    0 6020K 7220K onproc/1  kqread    1:27 97.24% nginx
88770 _mysql     2    0  318M   62M idle      poll     25:41  0.00% mysqld
96529 _ntp       2  -20  888K 2480K sleep/1   poll      8:31  0.00% ntpd
99818 root       2    0  105M 3612K sleep/0   kqread    4:14  0.00% php-fpm-7.0
21945 root       2    0 9392K 3200K sleep/1   kqread    4:12  0.00% php-fpm-5.6
68396 _pflogd    4    0  716K  552K sleep/1   bpf       3:30  0.00% pflogd

So let's look into these values:

CPU load

The three numbers represent averages over progressively longer periods of time (one, five and fifteen minute averages).

These three numbers are shown with the uptime(1) and top(1) commands.
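As a small illustration, the same three values can also be read from a script. This is only a sketch: it assumes a Unix-like system where Python's os.getloadavg() is available, and the function name read_load_averages is made up for this example.

```python
import os


def read_load_averages():
    """Return the 1-, 5- and 15-minute load averages as a dict."""
    # os.getloadavg() raises OSError if the load average is unobtainable.
    one, five, fifteen = os.getloadavg()
    return {"1min": one, "5min": five, "15min": fifteen}


if __name__ == "__main__":
    for period, value in read_load_averages().items():
        print(f"{period}: {value:.2f}")
```

The values should match what uptime(1) reports at the same moment, give or take rounding.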

TL;DR

lower numbers are better
higher numbers represent a problem or an overloaded machine

Long Story

Imagine a single-core CPU as a single-lane highway. The lane can carry 100 cars; with 100 cars on it, the highway is full.

Imagine that 100 cars are driving down the highway. This means the highway is exactly at capacity. That's a load of 1 (=100%).

Imagine that another car wants to enter the highway. That means 101 cars want to drive on a highway that can only handle 100 cars. So 1 car has to wait in a queue. Now the highway is overloaded by 1 car (=1%). This is a load of 1.01 (=101%).

If the highway shows a load of 0.00 then this means there's no traffic on the highway at all. In fact, between 0.00 and 1.00 means there's no queue, and an arriving car will just go right on.

So we can say the "load of this highway" is

the number of cars driving on the highway plus
the number of cars waiting to enter the highway.

In analogy to that the "CPU load" is

the sum of the number of processes that are currently running plus
the number of processes that are waiting (queued) to run.
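The two definitions above boil down to the same small calculation. The sketch below is a simplification for illustration only: the function name relative_load and its capacity parameter are assumptions of this example, not something top(1) exposes.

```python
def relative_load(running, waiting, capacity=1):
    """Return (running + waiting) / capacity.

    capacity is how many units can run at once: 1 for a single-core CPU,
    100 for the 100-car lane in the highway analogy.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    return (running + waiting) / capacity


if __name__ == "__main__":
    print(relative_load(100, 0, capacity=100))  # highway exactly at capacity
    print(relative_load(100, 1, capacity=100))  # one car waiting in the queue
    print(relative_load(1, 2))                  # one process running, two queued
```

With the default capacity of 1 the result is simply the number of running plus queued processes, which is how the load of a single-core machine is described here.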

As we have seen in the example 1.00 means the highway is exactly at capacity. All is still good, but if traffic gets a little heavier, things are going to slow down.

So, your CPU load should ideally stay below 1.00. You are still okay if you get some temporary spikes above 1.00 ... but when you're consistently above 1.00, you need to worry.

But this means the ideal load is not 1.00. The problem with a load of 1.00 is that you have no certainty that your system runs well in the near future.

These are rules of thumb when checking CPU load:

CPU Load	Rule of Thumb
0.70	Need to Look into it
> 0.70	It's time to investigate before things get worse
5.0	You could be in serious trouble

But wait, why say "could" at a load of 5.00?

Because on a multi-processor system, the load is relative to the number of processor cores available. The "100% utilization" mark is 1.00 on a single-core system, 2.00 on a dual-core, 4.00 on a quad-core, etc.

Thinking of the highway example above the load of 1.00 means that the highway is at 100% capacity. But if the highway had two lanes then 1.00 would mean 50% capacity.

Same with CPUs: a load of 1.00 is 100% CPU utilization on a single-core box. On a dual-core box, a load of 2.00 is 100% CPU utilization.
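One way to make load values comparable across machines is to divide them by the number of cores before applying the rule of thumb. The sketch below does that; applying the 0.70 and 5.0 thresholds to the per-core value is an assumption of this example, and the function names and the returned labels are invented for illustration.

```python
import os


def per_core_load(load, cores):
    """Scale a raw load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def rule_of_thumb(load, cores):
    """Classify a load average using the 0.70 / 5.0 thresholds, per core."""
    scaled = per_core_load(load, cores)
    if scaled >= 5.0:
        return "could be in serious trouble"
    if scaled >= 0.70:
        return "investigate"
    return "ok"


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    one_minute = os.getloadavg()[0]
    print(f"{one_minute:.2f} on {cores} core(s): {rule_of_thumb(one_minute, cores)}")
```

A raw load of 2.00 on a dual-core box scales to 1.00, which matches the "100% utilization" mark described above.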

CPU states

The CPU(s) row shows CPU state percentages based on the interval since the last refresh. Each value has a label.

Depending on your version of top(1), the labels may appear in a different order and be spelled either as a word or as a two-letter abbreviation.

us, user : time running un-niced user processes (CPU time spent in userspace)
ni, nice : time running niced user processes (CPU time spent on low priority processes)
sy, system : time running kernel processes (CPU time spent in kernelspace)
id, idle : CPU time spent idle (CPU has nothing to do)
wa, IO-wait : time waiting for I/O completion (CPU time spent in wait (e.g. disk))
hi : time spent servicing/handling hardware interrupts
si : time spent servicing/handling software interrupts
st : time stolen from this vm by the hypervisor
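To work with these percentages in a script, a CPU states line can be split into label/value pairs. This is a minimal sketch that assumes the "N% label" layout of the example output at the top of this page; other versions of top(1) print the row differently and would need a different pattern. The name parse_cpu_states is made up for this example.

```python
import re

# Matches pairs such as "98.0% idle" or "1.0% interrupt".
STATE_RE = re.compile(r"(\d+(?:\.\d+)?)%\s+([A-Za-z-]+)")


def parse_cpu_states(line):
    """Turn a 'CPUn states: ...' line into {label: percentage}."""
    # Drop the "CPU0 states" prefix so the digit in "CPU0" is never matched.
    _, _, values = line.partition(":")
    return {label: float(pct) for pct, label in STATE_RE.findall(values)}


if __name__ == "__main__":
    sample = ("CPU0 states:  0.0% user,  0.0% nice,  1.0% system,"
              "  1.0% interrupt, 98.0% idle")
    for label, pct in parse_cpu_states(sample).items():
        print(f"{label:>10}: {pct:5.1f}%")
```

A dictionary keyed by label keeps the code independent of the order in which a given top(1) prints the states.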

What is the Meaning of the 3 CPU states?

24.8% user: This tells us that the processor is spending 24.8% of its time running user space processes. A user space program is any process that doesn't belong to the kernel. Shells, compilers, databases, web servers, and the programs associated with the desktop are all user space processes.
73.6% idle: This tells us that the processor was idle just over 73% of the time during the last sampling period.
0.5% system: This is the amount of time that the CPU spent running the kernel. When a user space process needs something from the system, for example when it needs to allocate memory, perform some I/O, or it needs to create a child process, then the kernel is running.

How to interpret the Values?

If the processor isn't idle, it is quite normal that the majority of the CPU time should be spent running user space processes.
The total of the user space percentage, the niced percentage and the idle percentage should be close to 100%. If the CPU is spending more time in the other states then something is probably wrong.
The amount of time spent in the kernel (n% system) should be as low as possible. This number can peak higher, especially when there is a lot of I/O happening.
n% nice shows how much time the CPU spent running user space processes that have been niced. On a system where no processes have been niced then the number will be 0.
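The "close to 100%" check can be expressed as a short calculation over the state percentages. The sketch below takes a plain dictionary of label to percentage; the function names and the 10% default threshold are assumptions chosen for illustration, since no exact limit is given here.

```python
def other_states_share(states):
    """Percentage of CPU time spent outside user, nice and idle."""
    expected = ("user", "nice", "idle")
    return sum(pct for label, pct in states.items() if label not in expected)


def looks_unusual(states, threshold=10.0):
    """True if the other states exceed the (hypothetical) threshold."""
    return other_states_share(states) > threshold


if __name__ == "__main__":
    # Values of the CPU1 row in the example output.
    cpu1 = {"user": 0.2, "nice": 0.0, "system": 2.4,
            "interrupt": 0.0, "idle": 97.4}
    print(other_states_share(cpu1), looks_unusual(cpu1))
```

Pick a threshold that fits your own workload; a server doing heavy I/O will legitimately spend more time in the system state than a mostly idle one.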

Use these Values to examine System Health

On a busy server you can expect the amount of time the CPU spends in idle to be small. However, if a system rarely has any idle time then it is either overloaded, or something is wrong.

If a system suddenly jumps from having spare CPU cycles to running flat out, then the first thing to check is the amount of time the CPU spends running user space processes. If this is high then it probably means that a process has gone crazy and is eating up all the CPU time, like nginx(8) in the example output at the top of this page.

Sometimes a high kernel usage is acceptable. For example a program that does lots of console I/O can cause the kernel usage to spike. However if it remains higher for long periods of time then it could be an indication that something is wrong. A possible cause of such spikes could be a problem with a driver/kernel module.

High waiting on I/O means that there are some intensive I/O tasks running on the system that don't use up much CPU time. If this number is high for anything other than short bursts then it means that

the I/O performed by the task is very inefficient, or
the data is being transferred to a very slow device, or
there is a potential problem with a hard disk

Another indication of a broken peripheral could be high interrupt processing. Some hardware device is causing lots of hardware interrupts or a process is issuing lots of software interrupts.

On virtual machines a value for stolen time shows how long the virtual CPU has spent waiting for the hypervisor to service another virtual CPU running on a different virtual machine. A large stolen time means that the host system running the hypervisor is too busy.
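The checks in this section can be condensed into a lookup from the busiest non-idle state to a first guess at the cause. This is a rough sketch rather than a diagnostic tool: the hint texts paraphrase the paragraphs above, and the names HINTS and busiest_state as well as the state labels used as keys are assumptions of this example.

```python
# First-guess explanations, keyed by state label (labels vary between
# versions of top(1), so adjust the keys to what your top prints).
HINTS = {
    "user": "a user space process may be eating up the CPU time",
    "system": "heavy kernel activity, possibly a driver or kernel module",
    "IO-wait": "inefficient I/O, a very slow device or a failing disk",
    "interrupt": "a peripheral may be causing lots of interrupts",
    "stolen": "the host running the hypervisor may be too busy",
}


def busiest_state(states):
    """Return (label, hint) for the largest non-idle state, or None."""
    busy = {label: pct for label, pct in states.items() if label != "idle"}
    if not busy:
        return None
    label = max(busy, key=busy.get)
    return label, HINTS.get(label, "no hint for this state")


if __name__ == "__main__":
    sample = {"user": 92.0, "nice": 0.0, "system": 5.0,
              "interrupt": 1.0, "idle": 2.0}
    print(busiest_state(sample))
```

Treat the hint as a starting point for investigation only; a single refresh of top(1) says little, so watch the values over several intervals.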

Memory Usage

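This row includes information about physical and virtual memory allocation. The exact fields depend on your version of top(1): some versions classify physical memory as total, used, free and buffers, while the example output above shows Real (act/tot), Free, Cache and Swap.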

Physical memory is your RAM, physical pieces of hardware that provide Random Access Memory.

Swap is virtual memory which can be a file or a partition on your hard drive that is essentially used as extra RAM. It is not a separate RAM chip though, it resides on your hard drive.
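For monitoring scripts it can help to turn the Memory row into numbers. The sketch below only handles the layout of the example output at the top of this page. Several things are assumptions of this example: that K, M and G are binary units, that the two Swap values mean used/total, and the function and key names.

```python
import re

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}  # assumed binary units
SIZE_RE = re.compile(r"(\d+)([KMG])")
MEMORY_RE = re.compile(
    r"Real:\s*(\S+)/(\S+)\s+act/tot\s+Free:\s*(\S+)\s+"
    r"Cache:\s*(\S+)\s+Swap:\s*(\S+)/(\S+)"
)


def to_bytes(text):
    """Convert a size such as '167M' to a number of bytes."""
    match = SIZE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def parse_memory(line):
    """Extract the fields of the example 'Memory:' row, in bytes."""
    match = MEMORY_RE.search(line)
    if match is None:
        raise ValueError("line does not look like the example Memory row")
    keys = ("real_active", "real_total", "free",
            "cache", "swap_used", "swap_total")
    return dict(zip(keys, map(to_bytes, match.groups())))


if __name__ == "__main__":
    row = ("Memory: Real: 167M/861M act/tot Free: 2104M "
           "Cache: 593M Swap: 0K/3312M")
    for key, value in parse_memory(row).items():
        print(f"{key:>12}: {value}")
```

Raising ValueError on an unexpected layout is deliberate: a silent wrong parse would be worse than a loud failure when top(1) prints the row differently.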

Process List

The last section provides information about the currently running processes. It consists of the following columns:

Column	Content
PID	Process Id : This is a unique number used to identify the process
USERNAME	The username of whoever launched the process
PRI	Priority : The priority of the process. Processes with higher priority will be favored by the kernel
NICE	Nice value : The nice setting that adjusts the process' priority, in the range -20..+20
SIZE	The total amount of memory size of the process incl. text, data, and stack segments
RES	Resident Memory Size: The non-swapped physical memory a task has used
STATE	The current state of the process and the CPU number (only on multiprocessor)
WAIT	If the process is asleep it displays the title of the wait channel
TIME	The CPU time the process has spent in system and user mode
CPU	The CPU usage (default sort field)
COMMAND	The name of the process (In angle brackets if the process is swapped)
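The columns map naturally onto a small record type. The sketch below parses one row of the process list in the layout of the example output; it assumes the eleven whitespace-separated columns listed above, and the names Process and parse_process are invented for this example.

```python
from typing import NamedTuple


class Process(NamedTuple):
    pid: int
    username: str
    pri: int
    nice: int
    size: str
    res: str
    state: str
    wait: str
    time: str
    cpu: float
    command: str


def parse_process(row):
    """Parse one row of the process list into a Process record."""
    # Split at most 10 times so COMMAND keeps any embedded spaces.
    f = row.split(None, 10)
    if len(f) != 11:
        raise ValueError(f"expected 11 columns, got {len(f)}")
    return Process(int(f[0]), f[1], int(f[2]), int(f[3]), f[4], f[5],
                   f[6], f[7], f[8], float(f[9].rstrip("%")), f[10].strip())


if __name__ == "__main__":
    rows = [
        " 7055 www        2    0 6020K 7220K onproc/1  kqread    1:27 97.24% nginx",
        "96529 _ntp       2  -20  888K 2480K sleep/1   poll      8:31  0.00% ntpd",
    ]
    # Sort by CPU usage, highest first, like the default sort field.
    for proc in sorted(map(parse_process, rows), key=lambda p: p.cpu, reverse=True):
        print(proc.pid, proc.username, proc.state, proc.cpu, proc.command)
```

SIZE, RES and TIME are kept as strings here to keep the example short; convert them only if you need to compare or sum them.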

What's next?

For further information, see the excellent top(1) manpage.