This document summarizes Keith Brennan's implementation of a VMware View virtual desktop infrastructure (VDI) solution at Delano Regional Medical Center using Nexenta storage. Key points include:
1) The initial VDI deployment using an IBM N3600 storage array struggled with performance issues due to simultaneous updates and reboots overwhelming the storage.
2) Brennan migrated the VDI workload to a Nexenta storage solution using SSDs, achieving a 20x improvement in IOPS and reducing latency from 18ms to 2ms.
3) The Nexenta implementation provided reliable performance for 175 active VMs using caching, compression, and avoiding deduplication due to use of a golden image.
OSS Presentation DRMC by Keith Brennan
1. More IOPS Please
DRMC's VMware View Implementation Using
Keith Brennan
October 2011
2. Delano Regional Medical Center
S 156-bed community hospital in
central California.
S Four satellite clinics.
S Only hospital in a 30 mile
S Serves approximately 60,000
people spread over several
S 80%+ of our patients are Medi-Cal or Medicare.
S Government doesn't pay
4. The Great Directive of 2009
S Need to deploy 150 new desktops in support
of a Clinical Documentation implementation.
S Do it as cheaply as possible.
S Oh, by the way, you're losing an FTE due to
budget cuts.
5. Never let a good crisis go to waste. (Rahm Emanuel)
S Used this opportunity to justify moving to VDI.
S Users resistant to using something other than a traditional
S Perceived lack of freedom.
S Perceived increase in Big Brother.
S Why I wanted the transition to VDI
S Ease of management.
S We had a set, well-defined, integrated desktop experience.
S Wanted a way to deliver the same experience in a controlled
manner to a myriad of devices: iOS, Android, etc.
6. I Need Storage!
S My Existing EMC CX500 was barely cutting it for 3 ESX
hosts w/ a combined 32 VMs.
S Lots of people on the Virtualization forums liked NetApp.
S NetApp had just published a white paper on a 750 View
virtual desktop deployment on a FAS 2050a.
S Near normal desktop load times.
S Seamless user experience.
7. Well, That's Timely!
S The next week another vendor calls letting me know that
IBM is running a huge storage sale.
S It includes their N series of network attached storage.
S Rebadged NetApps.
S Three weeks later an N3600, a rebadged NetApp 2050a,
S It is set up identically to the VDI white paper's setup.
8. Implementation Guidelines
S Linked clones are to be used whenever possible.
S Ease of maintenance
S Ease of provisioning
S No user data to be stored on the VMs.
S Significant patching shall be done through the Golden Image
and VMs will be re-provisioned using the updated image.
S AV will run on the VMs but only in real-time scan mode. No
scheduled system scans.
9. Initial Testing
S Two Hosts with 25 VMs each.
S One connected to the N3600 via iSCSI.
S The other via NFS.
S Test lab of 25 thin clients.
S Good performance.
S Equivalent to a desktop of the previous generation.
S Quick user logins due to the VMs being always on and waiting.
S The N3600 is maintaining low utilization.
S NFS and iSCSI exhibit similar speed.
10. Go Live!
S Five additional ESX Hosts are deployed.
S Each hosts ~25 VMs
S Current setup gives me N+2 host redundancy.
S For the first week everything looks good.
S User complaints are primarily with the clinical application.
S N3600 is handling it well. Running at about 35%
S ~1.5k IOPS of regular background chatter.
S VMs report average latency of 12ms.
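As a rough sanity check on the go-live numbers, the background load can be expressed per desktop. The sketch below assumes all seven hosts (the two test hosts plus the five added at go-live) carry about 25 VMs each, which is an inference from the slides rather than a stated figure; the function name is made up for illustration.

```python
def iops_per_vm(total_iops: float, hosts: int, vms_per_host: int) -> float:
    """Average steady-state IOPS per desktop VM."""
    vm_count = hosts * vms_per_host
    if vm_count <= 0:
        raise ValueError("need at least one VM")
    return total_iops / vm_count


# Assumption: 2 test hosts + 5 go-live hosts, ~25 VMs each,
# and ~1.5k IOPS of background chatter across the whole array.
per_vm = iops_per_vm(total_iops=1500, hosts=7, vms_per_host=25)
print(f"{per_vm:.1f} IOPS per VM")
```

Under those assumptions the idle chatter works out to under 10 IOPS per desktop, which is an average and says nothing about bursts such as updates or reboots.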
11. Disaster!
For me they seem to happen in threes.
S First AV engine update happens 1 week after go live.
S AV server pushes it to all clients at once.
S The simultaneous update of all the View VMs forces the
SAN to a crawl for 3 hours.
S Users complain that the Virtual Desktops are unusable.
S Temporarily corrected the problem by only allowing the AV
to update 3 machines at once.
S This worked like a champ until a dot version update on the
AV server a month later broke that setting.
S Another 3 hour downtime.
12. Disaster (cont)
S Three days later a helpdesk tech forces the simultaneous
reprovisioning of 60 of the View VMs at once.
S Was applying an application patch.
S Was trained not to restart more than 5 VMs at once.
S That obviously didn't stick!
S That was another hour of the SAN crawling.
S Once again, users complain that the system was unusable
during this time.
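The "no more than 5 VMs at once" rule from the slide above is easier to keep when it lives in a script instead of in a tech's memory. This is only a minimal sketch of the batching idea: `rolling_operation`, the VM names, and the `action` callback are hypothetical and are not part of VMware View or any vendor tool.

```python
import time
from typing import Callable, Iterator, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rolling_operation(
    vms: Sequence[str],
    action: Callable[[str], object],
    batch_size: int = 5,
    settle_seconds: float = 0.0,
) -> None:
    """Apply `action` to a few VMs at a time, pausing between batches.

    `action` stands in for whatever actually restarts or reprovisions
    a desktop; the pause gives the storage time to catch up.
    """
    for group in batches(vms, batch_size):
        for vm in group:
            action(vm)
        time.sleep(settle_seconds)


# Hypothetical example: 60 desktops, handled 5 at a time.
vms = [f"view-{n:03d}" for n in range(60)]
done = []
rolling_operation(vms, done.append, batch_size=5)
```

A real version would wait until each batch has finished booting before starting the next one, rather than sleeping for a fixed interval.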
13. Disaster! (yet again)
S .NET 3.5 service pack is approved for deployment.
S SP is large: >100 MB.
S Set to deploy starting at 2am and only on restart.
S At 04:15 four VMs restart within one minute of each other.
S N3600 starts to lag.
S Users, seeing their systems running slow, decide to restart.
S At 5am I get the call regarding the issue.
S I immediately disabled the SP deployment.
S Still took an hour for the N3600 to catch up.
15. What's Going On???
S Oh $41+
S General use chatter is
eating my bandwidth.
S N3600 CPU utilization is
regularly now above
S Disk utilization rarely
drops below 40%.
S Average disk latency
16. I Have a Problem
S I'm maxing performance with just day-to-day operations.
S IBM has verified that the appliance is functioning
S In other words, this is all I'm going to get out of it.
S Adding disks might help some, but too costly!
S Additional Tray would be $15k!
S SAS drives to populate it are almost $1k each!
S Still have CPU limitations.
S NIC limitations (two 1 GbE links per head).
S Did I mention that I have no money left in the budget?
17. Nexenta to the Rescue
S Had just installed Nexenta Core for my home file server.
S Time to find some hardware:
S Pulled a box out of the View cluster.
S Installed six Intel SSDs.
S Installed Nexenta Core. (yeah, I know.. EULA..)
S Created the volume and shared via NFS.
S The next day my poor brain figured out that I could have just
done a Nexenta VM. Doh!
S Over the next week I migrated half the virtual desktops over.
18. It's Like Night and Day
S Average latency drops
from 18ms to 2ms.
S Write throughput
S Read throughput
S 20x improvement on 4k
20. Time For a Full Nexenta
S I was able to secure $45k capital for the next year.
S Normally this would just draw laughter when talking about
S I also intend on replacing the existing EMC.
S Annual maintenance too costly.
S I despise the fact that I have to call them out every time I want
to connect a new piece of hardware to it.
S Still some questioning from higher-ups on this whole open-storage thing.
21. Final Solution Hardware
S 2x Supermicro dual Xeon servers with 96 GB RAM.
S 1x DataOn 1600 JBOD
S Houses twenty-one 1 TB nearline SAS drives.
S 1x DataOn 1620 JBOD
S Houses seventeen 300 GB 10k RPM SAS drives.
S 2x STEC ZeusRAM
S 8x 160 GB Intel 320 SSDs
23. Why DataOn?
S Disk Shelf Manager
S One thing Nexenta lacked
was a way to monitor the
S How could one of my techs
know which drive to
S Intuitive slot lighting.
S They're responsive even
after the sale is made!
24. Why Nexenta?
S It's good to have on-demand support.
S I am the only member of our technical staff that has a basic
understanding of storage architectures.
S I like to have the ability to go on vacation from time to time!
S It's good to have experts for unique problems.
S Regular, tested bug fixes.
S It's always nice to have someone's neck to wring!
25. The End Result
S 2ms latency.
S 500 MB/s reads
S 200 MB/s writes
S Happy Users!
S Note: Benchmark was
done on production
system with 175 active
26. To Dedup or Not to Dedup
S Dedup can give you huge storage savings.
S I had 14x Dedup ratio on my VDI volume.
S Inline dedup saves on disk write IO.
S It'll still hit the ZIL, but won't be written to disk if it is
determined to be duplicated data.
S Instead of a 4+ KB write you get a sub-256-byte metadata write.
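The write-saving claim above can be put into numbers. The sketch below takes the slide's two figures at face value (4 KB as the data write, 256 bytes as the metadata write for a duplicate) and ignores ZIL traffic entirely; the 1,000-block workload with 900 duplicates is a made-up example, not a measurement from this deployment.

```python
BLOCK_BYTES = 4 * 1024   # "4+ KB" data write from the slide (lower bound)
METADATA_BYTES = 256     # "sub-256-byte" metadata write (upper bound)


def pool_bytes_written(total_blocks: int, duplicate_blocks: int) -> int:
    """Bytes that reach the pool disks with inline dedup enabled.

    Unique blocks are written in full; duplicates cost only a
    metadata update. ZIL writes are deliberately left out.
    """
    if not 0 <= duplicate_blocks <= total_blocks:
        raise ValueError("duplicate_blocks must be between 0 and total_blocks")
    unique_blocks = total_blocks - duplicate_blocks
    return unique_blocks * BLOCK_BYTES + duplicate_blocks * METADATA_BYTES


# Hypothetical workload: 1,000 block writes, 900 of them duplicates.
with_dedup = pool_bytes_written(1000, 900)
without_dedup = pool_bytes_written(1000, 0)
print(with_dedup, without_dedup, without_dedup / with_dedup)
```

In this made-up case the pool sees 640,000 bytes instead of 4,096,000, a 6.4x reduction. The saving depends entirely on how much of the write stream is really duplicate data.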
27. To Dedup or Not to Dedup
S Ram Hog!
S For good performance you need enough ram to store the
dedup table.
S Uses ARC for this, which means you will have less room for
cached data.
S Potential for hash collision.
S Odds are astronomical, but still a chance for data corruption.
S Dedup performance penalty.
S Small IOPS suffer.
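To get a feel for why dedup is a RAM hog, the size of the dedup table can be estimated from the amount of unique data and the block size. The sketch below assumes roughly 320 bytes of memory per table entry, a commonly quoted ZFS rule of thumb and not a figure from this deck; the 1 TiB pool and both block sizes are hypothetical.

```python
DDT_ENTRY_BYTES = 320  # assumed rule-of-thumb size of one in-memory entry

GIB = 1024 ** 3
TIB = 1024 ** 4


def dedup_table_bytes(unique_data_bytes: int, block_bytes: int) -> int:
    """Estimate dedup table size: one entry per unique block."""
    if block_bytes <= 0:
        raise ValueError("block size must be positive")
    entries = -(-unique_data_bytes // block_bytes)  # ceiling division
    return entries * DDT_ENTRY_BYTES


# Hypothetical pool holding 1 TiB of unique data.
small_blocks = dedup_table_bytes(1 * TIB, 4 * 1024)    # 4 KiB blocks
large_blocks = dedup_table_bytes(1 * TIB, 128 * 1024)  # 128 KiB blocks
print(small_blocks / GIB, large_blocks / GIB)
```

With small blocks the estimate lands in the tens of GiB per TiB of unique data, all of it competing with cached data for ARC space, which is the trade-off the slide describes.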
29. Is Dedup Worth it?
S If you're using a Golden Image - No.
S VMDC Plugin provides great efficiency by only storing one
copy of the Golden Image vs one for each pool of VMs.
S Compression is virtually free and will do a good job of
making up the difference in the new blocks.
S Disk is cheap.
S If you're doing a bunch of P2V desktop migrations -
S If the desktops are poorly configured, or have other aspects
that can cause excessive I/O, then no.
S If the desktops are similar and large, then sure.
30. Compression
S Use it. Unless you're using a 5-year-old processor, there
will be no noticeable performance hit.
S On by default in Nexenta 3.1
S Compresses before write. Saves disk bandwidth!
31. Cache is Key!
S Between the 70 GB of ARC and 640 GB of L2ARC, the read cache
is hit almost 98% of the time!
S This equates to sub-2ms average disk latency to the end user.
S Beats the crud out of the >15ms average latency of the N3600!
S Know your working set. You could get away with a lot
smaller or need a lot larger cache.
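The link between hit rate and latency in the slide above follows from a simple weighted average: hits are served at cache speed, misses at disk speed. In the sketch below the 0.5 ms hit latency and 15 ms miss latency are placeholder values chosen for illustration, not measurements from this system.

```python
def effective_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Average read latency for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# Placeholder latencies: 0.5 ms from cache, 15 ms from spinning disk.
at_98 = effective_latency_ms(0.98, hit_ms=0.5, miss_ms=15.0)
at_90 = effective_latency_ms(0.90, hit_ms=0.5, miss_ms=15.0)
print(f"98% hits: {at_98:.2f} ms, 90% hits: {at_90:.2f} ms")
```

With these placeholder numbers, dropping from 98% to 90% hits more than doubles the average latency, which is why knowing the working set matters more than the raw cache size.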
34. Gig-E vs TenGig-E
S Obvious differences in maximum throughput.
S Small IOP differences are mainly attributable to network
latency differences.
S If you're stuck with Gig-E, use 802.3ad trunk groups.
S Still stuck with ~100 MB/s throughput per host, but no one ESX host
will saturate the link for the rest.
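The reason a trunk group does not raise per-host throughput is that link aggregation pins each conversation to a single member link, typically by hashing the addresses involved. The sketch below is a simplified illustration of that idea using an XOR of the source and destination IPv4 addresses; real switches and ESX teaming policies differ in detail, and every address shown is made up.

```python
import ipaddress


def pick_uplink(src_ip: str, dst_ip: str, uplinks: int) -> int:
    """Choose a member link for a source/destination pair.

    Simplified illustration only: XOR the two addresses and take the
    result modulo the number of links in the trunk group.
    """
    if uplinks < 1:
        raise ValueError("need at least one uplink")
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % uplinks


# Hypothetical addresses: two ESX hosts talking to one storage head
# over a two-link trunk.
storage = "10.0.0.50"
for host in ("10.0.0.11", "10.0.0.12"):
    print(host, "->", storage, "uses link", pick_uplink(host, storage, 2))
```

A given host always lands on the same link for the same storage address, so it tops out at one link's bandwidth, while different hosts can spread across the group.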
35. Gig-E vs TenGig-E - User
S Average time from the Power On VM command being
issued until the user is able to log in:
S 10gbe: 23 seconds
S 1gbe: 32 seconds
S Time from when user presses login button until the
desktop is ready to use:
S 10gbe: 5 seconds
S 1gbe: 9 seconds
*Windows 7, 2 procs, 2 GB RAM, DRMC's Standard Clinical Image
36. Final Thought: All SSD
S For deployments of Linked Clones or VMs off of a Golden
S Allows you to get rid of the L2ARC.
S Use a good ZIL Device (STEC ZeusRam, DDRDrive)
S Allows for sequential writes to the SSDs in the pool.
S Saves on write wear, which is an SSD killer.
S My first test box with the X25-M SSDs started suffering after
about 3 months.
S If you want HA you have to use SAS drives.