Jumbo Frames and Multi-NIC vMotion Performance over 10GbE

Chris Wahl recently wrote a blog post titled Do Jumbo Frames Improve vMotion Performance? Chris ran his tests on a 1GbE network, and the results indicated that you get slightly better performance by NOT using jumbo frames. There were a few comments and questions about what this would look like on a 10GbE infrastructure. I had the same question, so I decided to run my own tests.

The Equipment

  • Two Dell R820s, each with two dual-port Broadcom 57810MF CNAs
  • Two Cisco Nexus 2248 Fabric Extenders
  • Two Cisco Nexus 5596UP switches
  • Cisco Nexus 1000v switch

The Setup

The setup is pretty standard: two CNA ports, one from each CNA, go to FEX A, which connects to Nexus 5K A, and the other two CNA ports go to FEX B, which connects to Nexus 5K B. The Cisco Nexus 1000v is deployed as an HA pair. To see how the 1000v is configured for multi-NIC vMotion, check out my recent blog post, Configuring Multi-NIC vMotion with Cisco 1000v.

  • The 1000v is configured with two uplink port profiles, one with the MTU set to 1500 and another with the MTU set to 9000
  • The Nexus 5K ports aren’t explicitly configured for a 9000 MTU; they adjust dynamically as jumbo frames are sent/received via a QoS policy map
  • There are four VMkernel interfaces on each host for vMotion, with their MTUs set to either 1500 or 9000 depending on the test being performed
  • Both R820s were configured identically
  • Each VM used was configured for 2 vCPUs and 24GB of RAM, running Windows Server 2008 R2 SP1
  • All VMs were ALWAYS on the same host at the start of each test.

The Tests

For testing, I used the same tool as Chris: Prime95. The following three tests were performed using a PowerCLI script (more on the script below):

  • Test 1
    • vMotion of 1 powered-on VM loaded with Prime95 and running a 20GB memory workload
  • Test 2
    • vMotion of 4 powered-on VMs loaded with Prime95 and each running a 20GB memory workload
  • Test 3
    • vMotion of 8 powered-on VMs loaded with Prime95 and each running a 10GB memory workload

Each test was performed with three separate configurations with regard to the MTU (a PowerCLI sketch for switching the VMkernel MTU between them follows the list):

  • Configuration 1:
    • vmk interfaces: 1500
    • 1000v uplink port profile: 1500
  • Configuration 2:
    • vmk interfaces: 1500
    • 1000v uplink port profile: 9000
  • Configuration 3:
    • vmk interfaces: 9000
    • 1000v uplink port profile: 9000
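
The uplink port profile MTU lives on the 1000v itself (covered in the Multi-NIC vMotion post linked above), while the VMkernel side can be switched with PowerCLI. Below is a minimal sketch, assuming hypothetical host names and that the vMotion-enabled VMkernel ports on each host are the ones whose MTU changes:

```powershell
# Minimal sketch: switch the vMotion VMkernel MTU between test configurations.
# Host names are placeholders, not the ones used in these tests.
$mtu = 9000   # or 1500, depending on the configuration under test

foreach ($esx in Get-VMHost -Name "esxi-a.lab.local", "esxi-b.lab.local") {
    # The four vMotion VMkernel ports on each host
    Get-VMHostNetworkAdapter -VMHost $esx -VMKernel |
        Where-Object { $_.VMotionEnabled } |
        Set-VMHostNetworkAdapter -Mtu $mtu -Confirm:$false
}
```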

The Script

Below is a copy of the script I used. For each test, lines were removed or added depending on the number of VMs being tested for that particular configuration. The script iterates through the vMotion process 10 times. During tests that included multiple VMs, all vMotions were run asynchronously EXCEPT for the last one. Once the last vMotion finished, there is a wait period of 45 seconds to ensure all vMotions completed. The elapsed time of each vMotion in that iteration is then measured, and the one that took the longest is recorded on screen and appended to a text file. Not very scientific (even when started asynchronously, the vMotions don’t all begin at EXACTLY the same time), but since all tests are performed in this manner, I figured it would do. At the end of each iteration the script pauses for 30 seconds. The script below is what I used for test 3 (the 8 VM test).
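
As a rough illustration of that flow, here is a minimal PowerCLI sketch of the logic described above for the 8 VM case. The vCenter address, VM names, host names, and log path are placeholders, not the originals.

```powershell
# Rough sketch of the test loop: 10 iterations, async vMotions except the last,
# a 45-second settle period, record the slowest vMotion, then a 30-second pause.
Connect-VIServer -Server "vcenter.lab.local" | Out-Null   # hypothetical vCenter

$vmNames = 1..8 | ForEach-Object { "prime95-vm$_" }       # hypothetical VM names
$hostA   = Get-VMHost -Name "esxi-a.lab.local"            # hypothetical hosts
$hostB   = Get-VMHost -Name "esxi-b.lab.local"
$logFile = "C:\vmotion-results.txt"

for ($i = 1; $i -le 10; $i++) {
    # Alternate the destination so every iteration moves all VMs off one host
    if ($i % 2 -eq 1) { $dest = $hostB } else { $dest = $hostA }

    # Start every vMotion asynchronously except the last one
    $tasks = foreach ($name in $vmNames[0..($vmNames.Count - 2)]) {
        Move-VM -VM (Get-VM -Name $name) -Destination $dest -RunAsync
    }

    # The last vMotion runs synchronously so the loop blocks until it finishes
    $syncStart = Get-Date
    Move-VM -VM (Get-VM -Name $vmNames[-1]) -Destination $dest | Out-Null
    $syncSeconds = ((Get-Date) - $syncStart).TotalSeconds

    # Give any remaining async vMotions time to complete
    Start-Sleep -Seconds 45

    # Record the slowest vMotion of this iteration (refresh tasks for finish times)
    $durations = @($syncSeconds)
    $durations += $tasks | ForEach-Object { Get-Task -Id $_.Id } |
                  ForEach-Object { ($_.FinishTime - $_.StartTime).TotalSeconds }
    $longest = ($durations | Measure-Object -Maximum).Maximum

    Write-Host ("Iteration {0}: {1:N2} seconds" -f $i, $longest)
    Add-Content -Path $logFile -Value ("{0},{1:N2}" -f $i, $longest)

    # Pause before the next iteration
    Start-Sleep -Seconds 30
}
```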

The Results

Once all tests were complete, I removed the highest and lowest times from each and averaged the rest. The results are interesting, to say the least. All results are in seconds:
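
The averaging step is simple enough to script as well. A minimal sketch, assuming the comma-separated "iteration,seconds" log format from the script sketch above:

```powershell
# Minimal sketch: drop the single highest and lowest of the ten recorded times,
# then average the remaining eight. Assumes "iteration,seconds" lines.
$times   = Get-Content "C:\vmotion-results.txt" |
           ForEach-Object { [double](($_ -split ',')[-1]) }
$sorted  = $times | Sort-Object
$trimmed = $sorted[1..($sorted.Count - 2)]
"{0:N2}" -f ($trimmed | Measure-Object -Average).Average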

Test 1 — 1 VM, 20GB mem workload

  • Configuration 1 (1500/1500): 30.24
  • Configuration 2 (1500/9000): 28.81
  • Configuration 3 (9000/9000): 26.31


In test 1, jumbo frames performed better: 12.99% faster.

 

Test 2 — 4 VMs, 20GB mem workload

  • Configuration 1 (1500/1500): 89.69
  • Configuration 2 (1500/9000): 92.54
  • Configuration 3 (9000/9000): 75.71


In test 2 the results are much more conclusive, with jumbo frames blowing the competition out of the water: 15.58% faster.

 

Test 3 — 8 VMs, 10GB mem workload

  • Configuration 1 (1500/1500): 203.82
  • Configuration 2 (1500/9000): 204.71
  • Configuration 3 (9000/9000): 252.14


Test 3 overwhelmingly has jumbo frames as the loser: 23.70% slower.

 

Conclusions

First, I have to ask: what in the world happened in test 3? After looking at the results from the first two tests, I was shocked by what test 3 produced. Back to that in a moment…

  • The MTU setting on the Cisco Nexus 1000v uplink port profile makes a negligible difference by itself, so it might as well be set to 9000
  • Whether configuring jumbo frames for vMotion in a 10GbE multi-NIC vMotion environment pays off is workload dependent
  • Jumbo frames are not the clear winner by any stretch of the imagination
  • More testing needs to be performed to determine the pros and cons of utilizing jumbo frames for vMotion in a multi-NIC vMotion environment

The second conclusion in the list is loose and based only on multi-NIC vMotion with four NICs.

I’m really confused about why moving 4 VMs with an aggregate memory workload of 80GB is so different from moving 8 VMs with the same aggregate. Is it the way the algorithm balances packets across the physical NICs? Or is it that moving 8 VMs generates more jumbo frames (which means more overhead), and there’s a tipping point where the number of jumbo frames tips the cost/benefit scale?

More testing and a better understanding of the algorithm may offer better conclusions than what I’ve come up with. Please feel free to share your thoughts in the comments section.

Comments (8)

  1. Give the script a go with 7 VMs at 10GB.

    A 10GbE link will not launch more than 8 vMotions at the same time; maybe it starts limiting vMotion at 8 parallel VMs.

    Or you’re running out of CPU juice at the 8 VM threshold.

    Just suppositions as I don’t have a 10GbE lab to test it with.

    Thanks for taking up the test.
    Erik

  2. Good testing. I would be interested to see the maximum concurrent vMotions limited to 2, then 4, then 6, then 8, and to perform vMotions of 8 VMs with the same mem workloads. I suspect either 2 or 4 concurrent will be fastest, but I haven’t had a chance to test it for myself yet. If you have time, I’d love to hear the results.

  3. Post author

    Erik,

    I don’t think I’m running out of CPU, each host is a quad socket, 8 core box. 8 VMs are only using 16 vCPUs total.

    Josh,

    Originally I was only doing the 1 and 4 VM configurations, but later decided to max out concurrent vMotions; however, each host only has 128GB of RAM, so I couldn’t allocate 20GB each. I definitely like your idea of a 2, 4, 6 and 8 test, all with the same workloads. I will also do another 1 VM test with the same workload. I should be able to get this done over the next few weeks and I’ll post the results.

  4. Pingback: Technology Short Take #31 - blog.scottlowe.org - The weblog of an IT pro specializing in virtualization, storage, and servers

  5. Pingback: Jumbo Frames and Multi-NIC vMotion Performance over 10Gbe — Part 2 » ValCo Labs

  6. Guys, jumbo frames definitely are of benefit – they were not created “out of nothing”. The main things you need to understand here are:

    1) VM migration isn’t the best workload for this kind of test; you can see it clearly in the 4 VM × 20GB versus 8 VM × 10GB workloads, whose results are insanely different. 10Gbit throughput has to be measured in a very, very different way.

    2) I have some doubts about that dynamic Nexus “adaptation” to jumbo frames. I never liked anything “too dynamic”; I’d rather do things manually and have precise, detailed control over what I do. What if Cisco discovers a bug in that mechanism in two months?

    3) Broadcom and several other manufacturers have shown many performance issues recently, so I wouldn’t be surprised if they have something “misaligned” on 10Gbit. You think I’m off? Well, there was one very hot and very recent announcement last week; google “HP ProLiant DL980 G7/DL580 G7 Server – NC375i and NC375T Performance Will Decrease Significantly when Using NIC Driver Version 4.7.17.926”.

    4) Yes, Intel has been reliable and predictable, at least in my book. http://download.intel.com/support/network/sb/fedexcasestudyfinal.pdf Surprising results, at least for some, showing that not all “protocols” or “ways to transfer data” are equal. I guess this is the case here: VM migration simply isn’t fully optimized for, or suited to, 10Gbit networks, and it’s not the best case you can throw at a 10Gbit+ infrastructure. For sure you get better performance than 1Gbit will achieve, but how much better?

    5) I don’t remember what the feature is called, but there is a new generation of CNA chips and drivers that lets the adapter put data directly into the virtual machine’s memory address space, bypassing the classic bus->cpu->cache->memory path. The performance gains have been amazing; this feature alone pumps approximately 4Gbit/s MORE than its predecessors. Before it, crossing 6Gbit/s of throughput in virtual environments was difficult, whereas with it, close to 9.5Gbit/s is easily achieved on the same platform with the same test methodology. I can’t google it quickly now; I saw it somewhere in Myricom adapter material [?].

    Enjoy !

    1. Post author

      Thanks for the comment, Lubomir. I’m definitely not implying that jumbo frames don’t provide gains. This post wasn’t about jumbo frames in general; it was geared towards the specific use case of enabling jumbo frames on interfaces used only for vMotion.

  7. Oh sorry, it seems the Broadcom-HP material is internal only and not released to the public yet 🙂

    As said, extremely fresh from the oven…
