VCAP5-DCA – Objective 3.4 – Utilize Advanced vSphere Performance Monitoring Tools

For this objective I used the following documents:

  • Documents listed in the Tools section

Objective 3.4 – Utilize Advanced vSphere Performance Monitoring Tools

Knowledge

**ITEMS IN BOLD ARE TOPICS PULLED FROM THE BLUEPRINT**

  • Identify hot key fields used with resxtop / esxtop
    • Some hot keys switch between the different displays in resxtop / esxtop, and others perform functions within those displays. Let’s start with the displays (the sort commands are listed underneath each display and are case sensitive):
    • resxtop / esxtop Displays
      • c: cpu – used to display CPU statistics
        • U – will sort by the %USED metric
        • R – will sort by the %RDY metric
        • N – will sort by the GID (group ID)
      • i: interrupt – used to display the interrupt handler statistics (I really don’t know what that even means)
      • m: memory – used to display memory statistics
        • M – will sort by the MEMSZ metric
        • B – will sort by the MCTLSZ metric
        • N – will sort by the GID (group ID)
      • n: network – used to display network statistics
        • T – will sort by MbTX/s
        • t – will sort by PKTTX/s
        • N – will restore the default sort order (which I believe is the PORT-ID field)
        • R – will sort by MbRX/s
        • r – will sort by PKTRX/s
      • d: disk adapter – used to display disk adapter statistics
        • r – will sort by READS/s
        • R – will sort by MBREADS/s
        • N – will restore the default sort order (which I believe is the ADAPTR field)
        • w – will sort by WRITES/s
        • T – will sort by MBWRTN/s
      • u: disk device – used to display disk device statistics
        • r – will sort by READS/s
        • R – will sort by MBREADS/s
        • N – will restore the default sort order (which I believe is the DEVICE field)
        • w – will sort by WRITES/s
        • T – will sort by MBWRTN/s
      • v: disk VM – used to display VM disk statistics
        • r – will sort by READS/s
        • R – will sort by MBREADS/s
        • N – will restore the default sort order (which I believe is the GID field)
        • w – will sort by WRITES/s
        • T – will sort by MBWRTN/s
      • p: power mgmt – used to display power statistics
    • Some other hot keys that might be useful:
      • h from any screen will display the help menu. The help menu lists the hot keys for the other screens, along with other options
      • f from any screen will bring up a list of available fields for that particular display. To turn a field on or off, press the letter corresponding to the field name
      • s from any screen brings up a prompt where you enter the number of seconds between screen refreshes (the lowest is 2)
      • q from any screen will quit the resxtop / esxtop utility

 

  • Identify fields used with vscsiStats
    • Fields are a bit difficult to identify with vscsiStats because the utility generates data that is best used when put into a histogram. Some of the different metrics it pulls:
      • I/O Command Length – overall commands, read commands and write commands
      • Distance between successive commands (in LBNs) – overall distance, read distance and write distance
      • Distance between each command from the 16 closest previous commands – overall, read and write commands
      • Latency (in microseconds) – overall, read and write latency
      • Number of outstanding I/Os – when a new I/O is issued, new read I/O and new write I/O
      • I/O Interarrival Time – overall interarrival time, I/O read interarrival time and I/O write interarrival time
    • I’ll go over how to run vscsiStats in a later section

Skills and Abilities

  • Configure esxtop / resxtop custom profiles
    • Creating a custom profile in esxtop / resxtop is pretty simple. This procedure is the same with both esxtop and resxtop. Just remember with resxtop that you need to either connect to a server first, or specify the server when running the utility
    • Configure esxtop / resxtop Custom Profiles
      • SSH to an ESXi host or the vMA (vSphere Management Assistant)
      • Type esxtop (use resxtop if you are connected to a vMA)
      • Go through each display and customize it to your liking. Examples of this would be which fields to display, field order, refresh interval, etc.
      • Once you have made all of your customizations, press W
      • The default location is <current working directory>/.esxtop50rc. You can use this or specify your own path and filename
      • When finished, press enter to save the file (for this example I’ve used /tmp/.vcap5esxtopconf)


      • You have saved the configuration successfully
    • Now you can load esxtop / resxtop using the -c parameter
      • esxtop -c /tmp/.vcap5esxtopconf
      • Press enter
      • Now all of the customizations you made and saved previously should be set
  • Determine use cases for and apply esxtop / resxtop Interactive, Batch and Replay modes
    • Interactive Mode
      • esxtop / resxtop interactive mode is for real-time analysis/troubleshooting of a particular host. For example, if you are trying to nail down a certain performance issue (Compute, Network or Storage) then interactive mode is for you
      • Using Interactive mode is as simple as typing esxtop from the command line, either from the console of a host or SSH’d to the host. Use resxtop if you are connected to the vMA
    • Batch Mode
      • Batch mode can be useful if you want to track certain metrics over a period of time. You can do some of the same things with the performance charts in vCenter, but there you are limited to intervals of 20 seconds or longer, whereas esxtop / resxtop can go as low as 2 second intervals
      • To use Batch mode use the following command, which applies to both esxtop and resxtop
[sourcecode language=”bash”] # -b stands for batch mode
# -d stands for delay (in seconds), which I’ve set to 2
# -n is the number of iterations to complete, which I’ve set to 400
# 400 iterations at a 2 second delay means it will record all metrics
# over an 800 second period
# the output is piped through gzip and redirected (>) to a compressed
# CSV file named vcap5esxtopbatch.csv.gz

esxtop -b -d 2 -n 400 | gzip -9c > vcap5esxtopbatch.csv.gz
[/sourcecode]

      • Once batch mode is complete you can copy the CSV file over to another system and decompress it
      • You can then load it into a utility called esxplot, which is awesome BTW. esxplot is a VMware Labs Fling and can be found here
      • You can also load the results into the Windows perfmon utility and analyze the capture
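
Before loading a capture into esxplot or perfmon, it can help to see which counters it actually contains. Here is a minimal sketch for doing that on the decompressed CSV. It assumes the first line of the file is a header of double-quoted counter names separated by commas, and that no counter name contains a comma itself. The name list_columns is just a helper I made up for illustration, it is not part of esxtop.

```shell
# list_columns FILE [PATTERN]
# Print "index<TAB>name" for every header column whose name contains PATTERN
# (case-insensitive, plain string match). With no PATTERN, list every column.
# Assumption: the header is the first line and names contain no commas.
list_columns() {
    head -n 1 "$1" | tr ',' '\n' | tr -d '"' |
        PAT="${2:-}" awk '
            BEGIN { pat = tolower(ENVIRON["PAT"]) }
            pat == "" || index(tolower($0), pat) > 0 { printf "%d\t%s\n", NR, $0 }
        '
}
```

For example, list_columns vcap5esxtopbatch.csv "util time" would print the position and full name of each counter whose name mentions Util Time. The pattern is passed through the environment rather than with awk -v so that the backslashes in counter names are not treated as escape sequences.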
    • Replay Mode
      • Replay mode is a pretty cool feature of esxtop / resxtop. It lets you take a bundle generated by vm-support and run esxtop / resxtop in Replay mode against it, allowing you to look at snapshots of an environment
      • A big use case for this is when you need someone who does not have access to your host(s) to analyze these metrics. You can generate a support bundle and send it to whomever you need. They can then extract that bundle and run esxtop / resxtop in Replay mode against it to see what’s been going on
      • To generate a support bundle with performance snapshots run the following command directly from the host or using the vMA
[sourcecode language=”bash”] # the -p parameter specifies you want to collect performance snapshots
# the -i parameter specifies the interval (in seconds) between collecting
# performance snapshots
# the -d parameter specifies the duration of which the performance snapshots
# should be taken

vm-support -p -i 10 -d 60
[/sourcecode]

      • Check out VMware KB 1967 for additional information
      • Once complete the location of the support bundle will be displayed on the screen


      • Before you can use this newly generated bundle with Replay mode, you must first decompress it. Change directory to /var/tmp and extract the bundle
[sourcecode language=”bash”] # change to the directory that holds the bundle
cd /var/tmp

# the -x parameter means you want to extract the files
# the -z parameter filters it through gzip
# the -f parameter specifies the name of the TAR file

tar -xzf esx-vlabs-vmhost03.prod01.local-2012-09-05--02.03.tgz

# before continuing you may need to reconstruct files that were fragmented
# by running the following script from the support directory

# change to the support directory
cd /var/tmp/esx-vlabs-vmhost03.prod01.local-2012-09-05--02.03

# run the reconstruct script
./reconstruct.sh
[/sourcecode]

      • Now enter in the following command to run Replay mode against the extracted bundle
[sourcecode language=”bash”] # -R specifies Replay mode
# the path is the location of your extracted support bundle

esxtop -R /var/tmp/esx-vlabs-vmhost03.prod01.local-2012-09-05--02.03
[/sourcecode]
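
The extract, reconstruct and replay steps above can be rolled into one small function. This is only a sketch under a couple of assumptions: the bundle ends in .tgz and unpacks into a directory with the same name as the bundle (minus the extension), and reconstruct.sh only needs to run when the bundle ships one. The name replay_bundle is something I made up for illustration.

```shell
# replay_bundle BUNDLE.tgz
# Extract a vm-support bundle next to itself, run reconstruct.sh if the
# bundle contains one, then open the result in esxtop Replay mode.
# Assumption: the archive unpacks into a directory named after the bundle.
# The body runs in a subshell, so the cd does not affect your session.
replay_bundle() (
    cd "$(dirname "$1")" || exit 1
    name=$(basename "$1" .tgz)
    tar -xzf "$name.tgz" || exit 1
    dir=$(pwd)/$name
    if [ -x "$dir/reconstruct.sh" ]; then
        (cd "$dir" && ./reconstruct.sh) || exit 1
    fi
    esxtop -R "$dir"
)
```

You would call it with the path to the bundle that vm-support printed on the screen when it finished.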

  • Use vscsiStats to gather storage performance data
    • To run vscsiStats you’ll need to know the worldGroupID of the VM you want to gather performance data on. To get the worldGroupID of the VMs on a specific host run the following command
[sourcecode language=”bash” padlinenumbers=”true”] vscsiStats -l
[/sourcecode]
    • You’ll be presented with a list of VMs for that host. Find the VM you want to gather data on and find the worldGroupID. In this example I’m getting the worldGroupID for the NAP02 VM, which is 811625

worldGroupID

    • Now that we have the worldGroupID we need to start the collection. If the VM you are gathering data for has multiple disks, you can specify a particular disk with the -i parameter. If you don’t specify a handle ID then the collection will be for all disks attached to the VM
[sourcecode language=”bash”] # run vscsiStats on all disks for a VM with the worldGroupID of 811625
# the -w parameter specifies the worldGroupID
# the -s parameter tells vscsiStats to start the collection

vscsiStats -w 811625 -s

# run vscsiStats on a specific disk on the VM. The worldGroupID for the VM is
# 811625 and the handleID for the specific disk is 8422
# the -i parameter is used to specify a specific disk (handleID)

vscsiStats -w 811625 -i 8422 -s
[/sourcecode]

    • In this example I’m starting a collection against the VM with a worldGroupID of 811625 and a handleID of 8422

vscsiStats_collection_start

    • Once you’ve started the collection you can look at the data it has collected via histograms. The -p option is used to specify a histogram. The following histogram types can be specified:
      • all
      • ioLength
      • seekDistance
      • outstandingIOs
      • latency
      • interarrival
    • By default this will be displayed on the screen, but what you really want is to be able to import it into Excel so you can analyze the data. To make the output comma delimited, use the -c option. Here’s an example that exports a histogram of type all, comma delimited, to a file named vcap5vscsiStats.csv for a VM with a worldGroupID of 811625 and a handleID of 8422
[sourcecode language=”bash” padlinenumbers=”true”] # the -w parameter specifies the worldGroupID
# the -i parameter specifies the handleID
# the -p parameter specifies the type of histogram you want
# the -c parameter specifies the output be comma delimited

vscsiStats -w 811625 -i 8422 -p all -c > /tmp/vcap5vscsiStats.csv
[/sourcecode]

    • To stop the vscsiStats collection execute the following command
[sourcecode language=”bash”] # the -w parameter specifies the worldGroupID
# the -x parameter tells vscsiStats to stop collecting
# if you are collecting on multiple disks you can use the -i parameter
# to stop collection on only a specific disk

vscsiStats -w 811625 -x
[/sourcecode]

    • To view data collected in really cool 3-D surface charts check out this site. It requires you to type up a small script (example provided) to get 21 thirty second samples. You can then take the output of that and import it into a template that will build all of the surface charts for you. Very cool stuff.
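
If you want to take a series of timed samples yourself, the start, export and stop commands from this section can be strung together in a loop. To be clear, this is my own rough sketch and not the script from the site mentioned above. The function name sample_vscsistats, the output file naming and the DRY_RUN switch are all made up for illustration. It only uses the -w, -s, -p, -c and -x parameters covered earlier, and since it never resets the histograms, each file holds everything collected since the start.

```shell
# sample_vscsistats WORLD_GROUP_ID SAMPLES INTERVAL_SECONDS OUT_DIR
# Start a collection, write one comma-delimited "all" histogram dump per
# interval, then stop the collection. Histograms are not reset between
# dumps, so each file is cumulative since the collection started.
# Set DRY_RUN=1 to print the commands instead of running them.
sample_vscsistats() (
    wgid=$1 samples=$2 interval=$3 outdir=$4
    run() {
        if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
    }
    run vscsiStats -w "$wgid" -s || exit 1
    i=1
    while [ "$i" -le "$samples" ]; do
        run sleep "$interval"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "vscsiStats -w $wgid -p all -c > $outdir/sample_$i.csv"
        else
            vscsiStats -w "$wgid" -p all -c > "$outdir/sample_$i.csv"
        fi
        i=$((i + 1))
    done
    run vscsiStats -w "$wgid" -x
)
```

Running DRY_RUN=1 sample_vscsistats 811625 21 30 /tmp first is a cheap way to check the commands it would issue for 21 thirty second samples before pointing it at a real VM.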
  • Use esxtop / resxtop to collect performance data
    • There are a few ways to view performance data within esxtop / resxtop: interactive mode, batch mode and replay mode. I covered these modes earlier so I won’t go into them again here. I would assume that to collect data means to capture it over a period of time. To do that you have to use batch mode
    • Batch mode allows you to collect performance data with esxtop / resxtop over a period of time. You can specify a custom configuration file that contains only the views and fields that are pertinent, and you specify the delay between captures and the number of iterations you want to capture
      • For example, say you want to collect data using esxtop / resxtop every 5 seconds for 10 minutes. To do this you specify a delay of 5 seconds and set the number of iterations to ((minutes x 60) / delay). For this example that is ((10 x 60) / 5) = 120
    • Here is the command you need to execute
[sourcecode language=”bash”] # the -b parameter tells esxtop / resxtop to run in batch mode
# the -d parameter specifies the delay between captures
# the -n parameter specifies the number of iterations to perform
# I’m outputting this to a CSV file for import into another tool

esxtop -b -d 5 -n 120 > /tmp/vcap5esxtop.csv
[/sourcecode]
    • Once this completes copy the CSV file to a system where you have esxplot. Open esxplot and import the CSV file. Now you can analyze the performance data you just collected
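
The iteration math is easy to get wrong when you are in a hurry, so here is a small sketch that does the ((minutes x 60) / delay) calculation for you. The function name esxtop_iterations is made up for illustration. One thing I added that is not in the formula: it rounds up when the delay doesn’t divide evenly, so the capture never ends early.

```shell
# esxtop_iterations MINUTES DELAY_SECONDS
# Print the -n value needed to cover MINUTES at one sample every
# DELAY_SECONDS, rounding up so the capture never ends early.
esxtop_iterations() {
    echo $(( ($1 * 60 + $2 - 1) / $2 ))
}
```

You can then drop it straight into the command, for example esxtop -b -d 5 -n "$(esxtop_iterations 10 5)" > /tmp/vcap5esxtop.csv, which works out to the same 120 iterations as the example above.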

 

  • Given esxtop / resxtop output, identify relative performance data for capacity planning purposes
    • When planning for future capacity, you need to see where you stand now. Are you oversubscribed? Do you currently have enough CPU, Memory, Disk? If so, what levels are you at? If not, how do you tell what is oversubscribed? Well, I am not going to go over every metric that exists within esxtop / resxtop, but I will go into a few metrics that can easily let you know if you have a problem
    • CPU
      • The CPU load average at the top of the screen can be a quick way to determine if your physical CPUs are being hammered on that particular host. The three values are the averages over the last 1, 5 and 15 minutes, from left to right, based on 6 second samples. The CPU load takes into account the ready time and run time for all groups on the host
      • Based on the below screenshot you’ll see that the CPU load average for 1 minute is 0.23, for 5 minutes is 0.22 and 15 minutes is 0.23

cpu_load_average

      • The PCPU UTIL(%) statistic can also let you know if you are in an overcommitted state. If PCPU UTIL(%) is high across all PCPUs then there is a good chance you are overcommitting your CPU resources on that host
      • You can see here that the AVG across all PCPUs is only 9.9%, so all is well

pcpu_util

    • Memory
      • The state metric is an easy one to look at and understand. The state metric has the following possible values
        • High – free memory is greater than 6%
        • Soft – free memory is 4% – 6%
        • Hard – free memory is 2% – 4%
        • Low – free memory is less than 2%
      • If your host is in a High state then there isn’t memory pressure. If your host is in any other state then you need to start monitoring closely, and think about adding more capacity if it’s in a Hard or Low state
      • As you can see this particular host is in a High State

high_mem_state

      • Another statistic to look at is SWAP /MB. This will tell you whether the host is currently swapping memory, and at what rate memory is being swapped in from (r/s) and out to (w/s) disk. If r/s or w/s is high then you have a problem. Most likely, if these two are high, your memory state is either Hard or Low
      • As you can see from the screenshot below, the r/s and w/s are at 0.00, which is good

swap_mb
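
Going back to the state metric for a second: if you are working from a free memory percentage rather than the esxtop screen, the thresholds listed above are easy to turn into a quick check. This is a sketch that uses the percentages exactly as I listed them. How the exact boundaries (6%, 4% and 2%) are treated is my assumption, and mem_state is a made-up helper name.

```shell
# mem_state FREE_PERCENT
# Map a host's free-memory percentage to the state names listed above.
# Assumption: exactly 6% counts as Soft, 4% as Soft, and 2% as Hard.
mem_state() {
    awk -v free="$1" 'BEGIN {
        if (free > 6)       print "High"
        else if (free >= 4) print "Soft"
        else if (free >= 2) print "Hard"
        else                print "Low"
    }'
}
```

For example, mem_state 5 prints Soft, which is your cue to start watching that host more closely.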

    • Disk
      • Using esxtop / resxtop you can’t really determine if you have enough disk in terms of capacity (GB), but you may be able to determine if you have enough capacity in terms of number of spindles as it relates to IOPS (I/Os per second). IOPS is an important metric for storage performance and, as a result, for the application performance of your VMs
      • These statistics are per-VM instead of per-host (which is what we’ve focused on for CPU and memory), but you might need more IOPS capacity for only one VM, and the per-VM statistics will show you that. If you have an application/workload that requires a certain number of IOPS, you can use esxtop / resxtop to see what IOPS you are currently getting to make sure you are where you need to be, or to identify a deficiency. Here are the counters you can look at
        • READS/s – shows the number of reads per second
        • WRITES/s – shows the number of writes per second
      • You can use the metrics identified above in concert with esxtop / resxtop Batch mode to see if you are getting the required number of sustained IOPS over a certain period of time
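
To check a sustained rate from a batch capture without opening a graphing tool, you can average a single column of the CSV. This is a minimal sketch and it assumes a few things: the first line is the header, the values are double-quoted numbers, and no field contains a comma. The name column_average is made up for illustration, and the column number is whatever position the reads or writes per second counter for your VM has in your capture’s header line.

```shell
# column_average FILE INDEX
# Print the mean of column INDEX (1-based) over all data rows of a batch
# CSV, skipping the header line and any empty values.
# Assumption: values are quoted numbers and no field contains a comma.
column_average() {
    awk -F',' -v col="$2" '
        NR > 1 && $col != "" {
            v = $col
            gsub(/"/, "", v)
            sum += v
            n++
        }
        END { if (n > 0) printf "%.2f\n", sum / n; else print "no data" }
    ' "$1"
}
```

Compare the number it prints against the IOPS your application needs. Keep in mind an average hides spikes and dips, so it is still worth graphing the capture if the result is close to your requirement.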

Tools
