Dec 26 2012
 

For this objective I used the following documents:

  • Documents listed in the Tools section

Objective 6.4 – Troubleshoot Storage Performance and Connectivity

Knowledge

**ITEMS IN BOLD ARE TOPICS PULLED FROM THE BLUEPRINT**

  • Identify logs used to troubleshoot storage issues
    • Log files that can be used to troubleshoot storage issues on an ESXi host all reside in the /var/log folder on the host:
      • vmkernel.log — you could see the host disconnecting/reconnecting to devices
      • storagerm.log — storage I/O control information
      • vobd.log — observations made by the vmkernel
  • Describe the attributes of the VMFS-5 file system
    • A single extent can now be up to 64TB (VMFS-3 extents were limited to 2TB)
    • Single 1MB block size (VMFS-3 had 4 different sizes)
    • Maximum file count is >100K
    • VMDK files up to 2TB (this has not changed from VMFS-3)
    • VMFS-5 uses GPT as its partition table (VMFS-3 used an MBR partition table)
    • RDMs can be up to 64TB (pass-through RDMs only)
    • The Sub-block size is 8KB (VMFS-3 used 64KB sub-block size)

Skills and Abilities

  • Use esxcli to troubleshoot multipathing and PSA-related issues
    • There are a few different contexts within esxcli that can be used for multipathing and PSA-related issues. I will list those contexts below and denote where they apply.
      • esxcli storage core path – this context allows you to look at the paths that exist for devices connected to the host. Use esxcli storage core path list with the -d option and a given device ID (e.g. naa.60060160891227…) to list all paths to a particular device
      • esxcli storage nmp psp — this context allows you to look at the various PSPs on the system and modify their properties
      • esxcli storage nmp satp — this context allows you to look at the various SATPs installed on the system and modify their properties
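When a device has a lot of paths, the raw path listing gets long, so a small helper that counts paths per state can save some scrolling. This is only a sketch: path_states is a name I made up, and it assumes each path in the esxcli storage core path list output carries an indented "State: <value>" line.

```shell
# path_states (hypothetical helper): read `esxcli storage core path list`
# output on stdin and print one "<state> <count>" line per path state.
# Assumes every path entry contains an indented "State: <value>" line.
path_states() {
    awk -F': ' '/^ +State:/ { n[$2]++ } END { for (s in n) print s, n[s] }' |
        LC_ALL=C sort
}

# Usage on a host (replace the placeholder with a real device ID):
#   esxcli storage core path list -d <device ID> | path_states
```

Anything other than active (or standby, depending on the array) in that summary is a good place to start digging.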
  • Use esxcli to troubleshoot VMkernel storage module configurations
    • esxcli storage core plugin registration list — this command lists out all registered modules for the host, to include SATPs and PSPs
    • esxcli storage core plugin registration <add|remove> – allows you to add or remove module registrations as needed
  • Use esxcli to troubleshoot iSCSI related issues
    • The esxcli iscsi context allows you to perform most configuration aspects related to iSCSI, and lets you list all configurations, which can be useful when troubleshooting
    • Here are a few commands to get you started:
      • esxcli iscsi networkportal list — this will list the vmkernel ports you have configured
      • esxcli iscsi logicalportal list — this command gives you a summary of all vmkernel ports to include the MAC address
      • esxcli iscsi session list — this command will list all active iSCSI sessions
      • esxcli iscsi adapter target portal list — this command will list all iSCSI targets the host is connected to
      • esxcli iscsi adapter capabilities get -A <vmhba##> – this command will list the capabilities of the given iSCSI HBA
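If you find yourself running the same handful of list commands over and over, they can be wrapped in a function. The sketch below is one way to do that; iscsi_report is a name I made up, and it assumes each of these commands accepts -A to limit the output to a single adapter.

```shell
# iscsi_report (hypothetical helper): run several of the iSCSI list commands
# from above against a single adapter and label each section of output.
iscsi_report() {
    hba="$1"
    [ -n "$hba" ] || { echo "usage: iscsi_report <vmhba##>" >&2; return 1; }
    for sub in "networkportal list" "session list" \
               "adapter target portal list" "adapter capabilities get"; do
        echo "### esxcli iscsi $sub -A $hba"
        # $sub is left unquoted on purpose so it splits into separate words
        esxcli iscsi $sub -A "$hba"
    done
}

# Usage on a host (the adapter name is an example):
#   iscsi_report vmhba33
```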
  • Use esxtop / resxtop and vscsiStats to identify storage performance issues
    • You can see a brief introduction to using vscsiStats in a previous VCAP5 blog post, Objective 1.1 – Implement and Manage Complex Storage Solutions (look about a quarter of the way down the page)
    • Here is a list of storage metrics you can look at in esxtop / resxtop to troubleshoot storage performance issues
    • For storage monitoring there are three panels within esxtop that you will want to be intimately familiar with (the letters at the end correspond to the esxtop hotkeys for those panels)
      • Storage Adapter Panel (d)
      • Storage Device Panel (u)
      • Virtual Machine Storage Panel (v)
      • Some key metrics you want to look at for the panels above
        • MBREAD/s — megabytes read per second
        • MBWRTN/s — megabytes written per second
        • KAVG — latency generated by the ESXi kernel
        • DAVG – latency seen at the device driver level (the round trip through the HBA to the storage and back)
        • QAVG – latency from time spent in the VMkernel queue (counted as part of KAVG)
        • GAVG — latency as it appears to the guest VM (KAVG + DAVG)
        • AQLEN – storage adapter queue length (number of I/Os the storage adapter can queue)
        • LQLEN – LUN queue depth (number of I/Os the LUN can queue)
        • %USD – percentage of the queue depth being actively used by the ESXi kernel (ACTV / QLEN * 100%)
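The two derived metrics above (GAVG and %USD) are simple arithmetic, and working them by hand once helps when sanity-checking what esxtop shows. Here is a minimal sketch; gavg and pct_used are names I made up, and the numbers in the usage lines are illustrative rather than taken from a real host.

```shell
# gavg <KAVG> <DAVG>: guest-visible latency in ms (KAVG + DAVG)
gavg() {
    awk -v k="$1" -v d="$2" 'BEGIN { printf "%.2f\n", k + d }'
}

# pct_used <ACTV> <QLEN>: percentage of the queue depth in use (ACTV / QLEN * 100)
pct_used() {
    awk -v a="$1" -v q="$2" 'BEGIN {
        if (q == 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.0f\n", a / q * 100
    }'
}

gavg 0.5 12.25     # prints 12.75
pct_used 16 32     # prints 50
```

To work from captured numbers instead of the live screen, esxtop can also be run in batch mode (for example esxtop -b -d 5 -n 12 > /tmp/esxtop.csv) and the CSV reviewed afterwards.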
  • Configure and troubleshoot VMFS datastores using vmkfstools
    • Here is an example of how to create a VMFS-5 datastore using vmkfstools
[sourcecode language="powershell" padlinenumbers="true"]
# you’ll need the device ID (esxcli storage core device list)

# this command gets the current partition information; note the total sector count on the first line
partedUtil get /vmfs/devices/disks/naa.5000144f60f4627a

# sample results "1305 255 63 20971520" (cylinders, heads, sectors per track, total sectors)

# in this case the disk has 20971520 sectors, so the partition has to end before that. To create the partition we’ll use 20971500 as the last sector
# this command creates partition number 1, starting at sector 128 and ending at sector 20971500, with a type of 251 (VMFS)
partedUtil set /vmfs/devices/disks/naa.5000144f60f4627a "1 128 20971500 251 0"

# this command creates the VMFS 5 volume with a label of "vmkfstools_vcap5_volume"
vmkfstools -C vmfs5 -S vmkfstools_vcap5_volume /vmfs/devices/disks/naa.5000144f60f4627a:1

# if you want to remove this volume via the command line you can delete the underlying partition
partedUtil delete /vmfs/devices/disks/naa.5000144f60f4627a 1

# perform a rescan of the adapter and the volume will no longer be present
esxcli storage core adapter rescan -A vmhba35
[/sourcecode]
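The sector numbers in the example above are easy to get wrong by hand, so here is a small sketch of the arithmetic. The function names are made up for illustration, and the size calculation assumes 512-byte sectors.

```shell
# disk_sectors "<geometry line>": print the fourth field, the total sector count
disk_sectors() { echo "$1" | awk '{ print $4 }'; }

# sectors_to_mib <sectors>: size in MiB, assuming 512-byte sectors (2048 per MiB)
sectors_to_mib() { echo $(( $1 / 2048 )); }

# ends_within_disk <end sector> <total sectors>: succeeds if the end sector is
# valid (sectors are numbered from 0, so the last one is total - 1)
ends_within_disk() { [ "$1" -lt "$2" ]; }

geom="1305 255 63 20971520"          # geometry line from the sample results above
total=$(disk_sectors "$geom")
sectors_to_mib "$total"              # prints 10240 (a 10 GiB disk)
ends_within_disk 20971500 "$total" && echo "end sector fits"
```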

  • Analyze log files to identify storage and multipathing problems
    • The log files you’ll look at are the same log files I listed above. Here they are again:
      • vmkernel.log — you could see the host disconnecting/reconnecting to devices
      • storagerm.log — storage I/O control information
      • vobd.log — observations made by the vmkernel
    • Look for errors in each of these logs, and correlate events across them by comparing the timestamps on the log entries. Doing this should give you a better picture of what exactly was going on at a certain time and, hopefully, allow you to determine the root cause
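The timestamp correlation described above can be roughed out with awk and sort. This is a minimal sketch, not a polished tool: correlate is a name I made up, and it assumes every log line starts with an ISO 8601 timestamp (e.g. 2012-12-26T14:03:11.120Z), so plain string comparison and a plain sort give chronological order.

```shell
# correlate <start> <end> <logfile>...: print the lines from all given logs
# whose leading timestamp falls between <start> and <end>, tagged with the
# file they came from and merged into chronological order.
correlate() {
    start="$1"; end="$2"; shift 2
    awk -v s="$start" -v e="$end" '
        {
            ts = $1 ""                                 # force string comparison
            if (ts >= s && ts <= e) {
                rest = substr($0, index($0, ts) + length(ts) + 1)
                print ts, "[" FILENAME "]", rest
            }
        }
    ' "$@" | LC_ALL=C sort
}

# Usage on a host (the time window is an example):
#   correlate 2012-12-26T14:00 2012-12-26T14:30 \
#       /var/log/vmkernel.log /var/log/vobd.log /var/log/storagerm.log
```

Seeing the entries from all three logs interleaved makes it much easier to tell which event came first.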

Tools
