For any performance tuning issue, our strategy for diagnosing the problem is to take a “top-down” approach: follow the trail of highest CPU consumption down to the offending processes, and eventually to the offending sessions and queries.

Verifying CPU consumption on the server is the first step any DBA should take in a top-down approach to resolving performance issues.

  • Verify the observation of high CPU consumption on the affected server and collect top and vmstat data for a period of time (from 30 minutes up to a few hours).
  • Find which processes are using the most CPU.
  • If the top CPU consumers are Oracle processes, then find out which database sessions correspond to these processes.

Before we can start looking at why this problem is occurring, we must first understand the problem and verify what the customer is reporting. This can be done by collecting OS statistics during a recent spike in activity due to a mass mailing.
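
As a rough sketch of how such a collection might be started (the file names and the one-hour window are assumptions; adjust them to your environment):

# Sample once a minute for an hour; run both collectors in the background.
vmstat 60 60 > /tmp/vmstat_$(date +%Y%m%d_%H%M).log &
# top in batch mode at the same interval, so per-process CPU can be reviewed later.
top -b -d 60 -n 60 > /tmp/top_$(date +%Y%m%d_%H%M).log &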

Here is the vmstat output captured during one of the performance issues…

Each line represents a sample collected at 1-minute intervals.

Notice how the run queue [the run queue is a queue of processes that are ready to run but must wait for their turn on a CPU; a run queue of 20 means that 20 processes are currently waiting to execute] rapidly increases from 3 to 25 as the CPU utilization increases from about 50 percent to 100 percent. This shows the importance of knowing the length of the run queue as well as the CPU utilization. When the CPU is pegged at 100 percent utilization, the severity of the CPU starvation won’t be reflected in the utilization percentage (it’s pegged at 100 percent), but the run queue will clearly show the impact. Similarly, knowing only the run queue tells you neither the exact CPU usage nor how that usage is split between system and user modes.
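
To see both numbers side by side from a collected vmstat log, an awk one-liner along these lines can be used; the column positions (run queue in column 1, us/sy/id in columns 13-15) and the log path are assumptions based on a typical Linux vmstat layout, so verify them against the header line first:

# Print the run queue and CPU utilization for every sample in a vmstat log.
# Column positions assume a typical Linux vmstat layout; check the header line.
awk '$1 ~ /^[0-9]+$/ { printf "runq=%-3s us=%-3s sy=%-3s id=%-3s\n", $1, $13, $14, $15 }' /tmp/vmstat.log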

CPU utilization is classified into three types: system (sy), user (us), and idle (id). System mode CPU utilization occurs whenever a process requires resources from the system, e.g., I/O or memory allocation calls. In typical OLTP systems, the percentage of system mode CPU utilization is often less than 10 percent; in data warehouses, more I/O calls are performed and a higher percentage of system mode CPU utilization is common. User mode CPU utilization accounts for the rest of the time the CPU is busy and not running in system mode. Idle CPU is essentially the time that the CPU is not busy and is waiting for work to do.

In our case, we can quickly see that the percentage of CPU utilization in system mode is usually less than 10 percent. This indicates that we are probably not suffering from memory shortages or doing excessive I/O. We can further rule out memory shortages by looking at the paging and swapping statistics in vmstat (only swap-ins (si) and swap-outs (so) are shown here). If no pages are being swapped out during peak time, a memory shortage is very unlikely.
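
A quick way to confirm this from the same vmstat log is to count samples that show any swap activity; the si/so column positions (7 and 8) and the log path are again assumptions about a typical Linux vmstat layout:

# Count samples with any swap-in or swap-out activity; zero supports "no memory shortage".
# si/so column positions assume a typical Linux vmstat layout; the log path is a placeholder.
awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0)' /tmp/vmstat.log | wc -l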

If we were seeing system-mode CPU utilization higher than 15 percent and we weren’t seeing any memory problems, we might begin to suspect excessive I/O calls, very frequent database connections/disconnections, or some anomaly (we would keep this in mind as we collect more data for individual processes; e.g., using the truss command or while looking at RDBMS metrics like logons cumulative per second).
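
On Linux, strace plays the role of truss; as a sketch (the PID is a placeholder, the 60-second window is arbitrary, and you need sufficient privileges to attach), a per-call summary for one suspect process could be gathered like this:

# Attach to one suspect process and summarize its system calls for about a minute.
# strace is the Linux counterpart of truss; -c writes a per-call summary on detach.
timeout 60 strace -c -f -o /tmp/strace_summary.txt -p <suspect_pid>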

The fact that most CPU utilization is occurring in user mode tells us that whatever is going on is happening within the application code the process is running, not as part of a system call or kernel activity. The question now is: which processes are using the CPU?

To answer this question, we can look at the output of the top command. This command is useful because it shows at a glance the overall CPU statistics and the processes consuming the most CPU (ranked in descending order of CPU usage). The following shows a typical output of the top command during the performance problem:

Notice that the load average for the last minute has been around 22 (load averages are reported for the last 1, 5, and 15 minutes). The load average is essentially an average of the run queue length and means that around 22 processes were ready to run; however, the machine has only four virtual CPUs (two physical Intel Xeon CPUs with Hyper-Threading). The lines beginning with “CPUx states” show that Linux believes there are 4 CPUs. This means that, on average, each CPU had one process running and four to five more waiting to run (22 / 4 = 5.5).

It’s clear from the output of the top command that Oracle database processes are using nearly all of the CPU. We can identify these processes more precisely by looking at the output of the ps command (or by pressing “c” in top’s interactive mode to show full command lines):

The ps command shows the accumulated CPU time for each process as well as the full name for the process associated with the process ID (PID). The process IDs are important for moving along to the next stage of the diagnostic process where we examine and trace the database sessions that are associated with these PIDs. The full name tells us what kind of process it is. In this case they were all Oracle database processes.
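
For reference, a ps invocation along the following lines produces that view; the exact format and sort options are an assumption about a Linux (procps) ps:

# PID, owner, accumulated CPU time, current %CPU, and full command line,
# sorted by current CPU usage and limited to the top consumers.
ps -eo pid,user,time,pcpu,args --sort=-pcpu | head -20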

At this point we have verified the following:

  • The system is under extreme CPU starvation and very high run queues have formed
  • Oracle database processes are responsible for the high CPU consumption

If we had found non-Oracle processes using most of the CPU, then we would need to drill down into those processes and investigate the reason for such high CPU consumption. Here, however, Oracle processes were clearly the culprits.

Now we have to drill down into what Oracle is doing. I will cover the database drill-down for diagnosing this issue in my next post, as I am currently heavily loaded with my present project… watch this space.
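
In the meantime, here is a minimal sketch of the very first step of that drill-down: mapping an OS PID taken from top or ps to its database session through v$process and v$session (the PID literal is a placeholder, and the snippet assumes SYSDBA access or at least SELECT privileges on both views):

# Map an OS process ID from top/ps to the corresponding Oracle session.
# The PID below is a placeholder; s.sql_id assumes Oracle 10g or later.
sqlplus -s / as sysdba <<'EOF'
SELECT s.sid, s.serial#, s.username, s.status, s.sql_id
  FROM v$session s, v$process p
 WHERE p.addr = s.paddr
   AND p.spid = '<os_pid_from_top>';
EOF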

Readers who are interested in doing hands-on practice, please feel free to contact us.
