It’s the last day of VMworld and boy does my head hurt. Way too much information out there. Only 5 more sessions to present at today and then I’m done. I did get a small break to check email and such today and ran across a rather interesting blog posting from Marathon. In this posting they try to tell customers why VMware FT (fault tolerance) is so horrible. I’m fine with people talking bad about VMware as long as it’s accurate. In this case it’s nowhere close and obviously was written by someone that just doesn’t understand VMware or virtualization. I thought I’d take a second to make some corrections here. Go read the source article first. Most of it is quoted here for reference.
1. No component-level fault tolerance. The most common failures that result in unplanned downtime are component failures such as storage, NIC or controller failures. Yet VMware Fault Tolerance doesn’t do anything to protect against I/O, storage or network failures. By not addressing these primary sources of failures, VMware appears to be saying that you/the customer are on your own to figure out how to protect your storage and network connections. This may be okay for the very largest IT staffs in the world, but for the other 98%, it will not be sufficient.
VMware already has features to protect against component failure. If your NIC fails you’ve got NIC teaming built into the system. To set it up simply plug in both NICs to the server, go into the network panel and attach both of them to the same virtual switch. Done. 4 clicks. Same thing for storage with the built-in SAN multipathing drivers. I absolutely agree with the author that component failures are the cause of most crashes and that’s why VMware added these features in 2002. VMware FT is not designed for component failure because there’s no sense in moving the VM to another host if you’ve simply lost a NIC uplink. NIC teaming will take care of that with ease and is a LOT cheaper than using CPU and memory resources on another host to overcome the failure.
2. Complexity on top of complexity. In order to use VMware Fault Tolerance, you’ll first have to install both VMware HA and DRS. No small feat in and of themselves. Then, because VMware FT requires NIC teaming, you’ll also have to manually install paired NICs. Then you’ll need to manually setup dual storage controllers (with the software to manage them) because it requires multi-pathing. And to top it all off, you’re required to use an expensive, and often complicated, SAN.
This is where it’s pretty obvious the author has never configured HA or DRS. Let me show you a picture of how hard this is.
See those two check boxes? Click them and you’ve just enabled HA and DRS. If that’s too hard then please comment and let me know how it could possibly be easier. Even my dog has figured out how to do this now. Granted, it’s a pretty smart dog.
As for setting up the dual NICs and dual HBAs, well yes, you have to actually plug the physical devices in. After you’ve done that the **built-in** NIC teaming and HBA drivers will take over and configure most everything for you. The NIC teaming does require 4 extra clicks. The HBA drivers actually figure out the failover paths, match them up, and set up the appropriate form of failover all auto-magically. They’ve been doing this since ESX 1.5 (6 years ago).
Lastly, yes, this requires shared storage. Pretty sure that most environments that want FT (no downtime what-so-ever because our business could lose millions) already have a SAN to take advantage of other things virtualization related such as DRS and VMotion.
UPDATE: VMware FT does not require dual NICs or dual HBAs. This is something you **should** have in every virtualization setup that’s running VMs you care anything about but it’s not a requirement to get VMware FT running.
3. Limited CPU fault tolerance. With VMware FT, you’ll need to setup what VMware refers to as a “record/replay” capability on both a primary and secondary server. If something happens to the primary server, the record is stored on the SAN and then restarted on the secondary server. Two things to point out here. First, the whole thing depends on the quality of the SAN. Second, in the words of the VMware engineer who presented at VMworld, “this can take a couple of seconds.” So what happens to your application state in those couple of seconds?
So we’re back to the SAN argument. If you’re the type of company that requires absolutely no downtime for an app – if the app is just that critical – then I’m pretty sure you’re going to have a decent SAN. What’s a decent SAN? From many performance tests I’ve run it’s a broad category depending on the app but it ranges from small NAS appliances to high-end FC. Yes, iSCSI and NFS work great for most applications – even I/O intensive apps. But we’re back to the apps in question that are requiring FT. Those are the apps that usually are sitting on high-end storage. If you’re having so many problems with your SAN that you don’t trust it for FT then you have much bigger issues at hand that VMware or Marathon or any of the other virtualization related vendors aren’t going to help you with. It’s time for new arguments besides “be afraid of the SAN”. Even my father’s insurance business with 3 servers and 15 employees has shared storage in play.
UPDATE: There is some confusion here on what is stored on the SAN. VMware FT requires shared storage (NAS, iSCSI, or FC) to store the virtual disk for the VM. There is no actual “snapshot” for VMware FT. CPU instructions and memory are constantly streamed to the secondary server where they are consumed in real time. This is why the VMs stay in lock step with each other. No CPU or memory instructions are written to a SAN and resumed or anything like that. The virtual disk is stored on shared storage for a few different reasons. First, it’s already there if you’re using VMotion or DRS or VMware HA. Second, it’s a huge waste of disk space to replicate the actual disk file. Third, it takes a long time and a lot of bandwidth to constantly keep disk files in sync. Really the shared storage is the better architecture in this case.
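To make the lock-step idea a little more concrete, here is a toy sketch in Python. It is not VMware's implementation and not any VMware API; `ToyVM`, `LoggingChannel`, and `run_lockstep` are made-up names for illustration only. The assumption is simply that if two deterministic machines start in the same state and see the same inputs in the same order, they end up in the same state, and the inputs travel over a direct channel rather than through the SAN.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical deterministic state machine standing in for a guest VM."""
    state: int = 0
    steps: int = 0

    def apply(self, event: int) -> None:
        # Same state plus same event always yields the same next state.
        self.state = (self.state * 31 + event) % 1_000_003
        self.steps += 1


@dataclass
class LoggingChannel:
    """In-memory stand-in for the link between primary and secondary hosts."""
    queue: deque = field(default_factory=deque)

    def send(self, event: int) -> None:
        self.queue.append(event)

    def drain(self):
        # Hand events to the secondary in the order they were sent.
        while self.queue:
            yield self.queue.popleft()


def run_lockstep(events):
    """Feed events to the primary and replay them on the secondary."""
    primary, secondary = ToyVM(), ToyVM()
    channel = LoggingChannel()
    for event in events:
        primary.apply(event)   # primary executes the input
        channel.send(event)    # and streams it to the secondary
        for replayed in channel.drain():
            secondary.apply(replayed)  # secondary consumes it right away
    return primary, secondary


if __name__ == "__main__":
    p, s = run_lockstep([3, 1, 4, 1, 5])
    print("in lock step:", p.state == s.state)
```

Nothing in this sketch is ever written to shared storage and "resumed" later; the secondary is already running with the same state, which is the point of the update above.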
4. For VMware virtual environments only. VMware FT will only work in VMware environments. It won’t work with other hypervisors, and most importantly, you can’t use it for business critical and mission critical applications that you want to keep on physical server platforms (i.e., non-virtualized environments which still represent the vast majority of customer use cases). Oh well, only the vast majority of critical applications run in physical environments anyway.
This is a funny argument. You’re complaining that a VMware feature works only with VMware environments. I guess I could see that as a valid argument if you’re Marathon and want to play with everyone. Only problem is the Marathon stuff doesn’t work with VMware (Citrix only) so the same argument could be reversed in this case.
The bottom line with all of this is try to make some valid, accurate statements when you’re talking about competitors. At least then people might believe the rest of what you say. Hopefully the author here will take some time to play with the VMware setups to see what he’s really competing against. There are FREE evals here just in case.
UPDATE: This wasn’t talked about but Marathon’s virtualization FT only works with Windows 2003 Standard or Enterprise SP1 today. VMware FT works with any of the over 70 certified guest operating systems that run on Virtual Infrastructure. The Marathon solution also sits deeply embedded within the OS. From their FAQ:
What is involved in migrating our existing applications to a Marathon environment?
The servers to be used for the Marathon environment will need to be configured with just a base Windows OS installed. The Marathon software is then installed on top of these environments to create the virtual Windows environment, on which applications can then be installed. For existing servers, Marathon and its partners can work with you to develop a migration plan that assures minimal impact to users.
This also impacts your ability to patch systems using Marathon products since some patches could impact these deep integration points. Again from the FAQ:
How does Marathon qualify Windows security patches?
Because of their critical nature, we screen and test Microsoft Security Updates that apply to Windows 2000 or Windows 2003 and are posted on Microsoft’s automatic update website area. In the majority of cases, Windows security updates are fully compatible with Marathon products. In the rare cases where an issue is found, we post an advisory on our support website knowledgebase and provide an update to resolve the issue.
With VMware’s solution, on the other hand, the operating system is untouched and can be installed, patched, and operated normally. Obviously it’s time for a much deeper dive into the real differences between these two solutions.
**CAUTION**: The commenter with the name TopGun below actually works for Stratus – a competitor (sort of) to VMware FT. You can pretty much ignore his rants as a “customer” since that’s obviously a lie. Read the whole story here.



September 18th, 2008 at 4:35 pm
great reply!
September 18th, 2008 at 9:21 pm
*sarcasm on* Last time I checked I couldn’t carry a Marathon server on my USB keychain either. Is portability in their next generation, er, hardware release? *sarcasm off*
Excellent reply!
September 19th, 2008 at 12:54 am
On a side note (and yes, I do work for Stratus) you can achieve true fault-tolerance with Stratus fault-tolerant ftServers running VMware without any compromise.
September 19th, 2008 at 2:45 am
Marathon, who ?? Thanks for the detailed view.
September 19th, 2008 at 2:12 pm
Stratus ftServers?
But is there any sense?
I think ftServers are better used in situations where for some reason the application MUST be installed on a physical server and a hypervisor CANNOT be used.
I don’t see any advantage in running ESX on Stratus (after all, VMware FT protects from HW crashes only anyway). It either does that or it does not. Stratus does the same, in a more expensive way.
Also, Stratus will not work in situations where the server just experiences a hard shutdown without any warning at all, for any reason. In VMware FT’s case, if the 2nd server and SAN array are in another physical location, all is good. Stratus cannot provide that.
P.S. I hope FT will not be priced too high as a ‘feature’; it’s too important even for small companies.
September 19th, 2008 at 8:38 pm
I sat in on the session on VMware’s FT product which was held right after Mr. Maritz’s keynote. The engineer who gave the presentation (to a packed room) was very open about what this product can do. In fact there seem to be more limitations than their marketing would lead you to believe. First, it’s not available yet; second, they gave no pricing. But the real limitations come from the fact that you can only scale to a single core. Not a single socket, but a single core! Add to that, VMware strongly recommends a single FT VM per physical server. To make matters even worse, the engineer said the overhead is about 20%. I don’t know of many applications that you would need fault tolerance for that can run on a single core with 20% of that core/CPU being dedicated to the FT software. And I agree with the previous poster, my site was one of those that were down due to the license “foul up”. Not sure I’d want to trust my most critical apps to this software, especially version 1.0.
CAUTION: This poster works for Stratus. Take what he’s saying with a grain of salt. He is not a customer that was impacted by the license issues. Read more: http://mikedatl.typepad.com/mikedvirtualization/2008/09/time-for-some-r.html.
September 20th, 2008 at 10:39 am
TopGun:
- VMWare FT works only for 1-VCPU guests.
- But you can have many such guests running on your physical host. I’m pretty sure Dan did _not_ say in his presentation that “you can have only one FT VM per physical server”, don’t know where you got that from.
- Yes, there is overhead, which varies depending on workload. It’s not a constant 20% for all workloads. Note, though: with VMware FT, you can use the latest Intel/AMD processors on the day Intel/AMD ship the processor, unlike with stratusFT.
- Sorry to hear that you were affected by the license foul-up.
September 21st, 2008 at 10:24 am
Timo,
Absolutely you could use Stratus. I wasn’t leaving you all out on purpose – this was more a response to Marathon mis-information. I know plenty of customers who are using VMware on Stratus servers today and are very happy with that setup. Different strokes for different folks I guess, so that’s why both VMware FT and Stratus will continue to have their markets and purposes.
September 22nd, 2008 at 10:42 am
I think the VMware FT and Stratus argument is an “apples to oranges” discussion. One is hardware, one is software. One has been doing FT for 30 years, one is new to FT. There are still a lot of unanswered questions regarding VMware FT, like availability and pricing. Also, I’m always a bit leery of rev. 1.0 of any software product. That being said, I think running VMware HA on a Stratus box would fit pretty much any need I would ever have when it comes to availability.
CAUTION: This poster works for Stratus. Take what he’s saying with a grain of salt. Read more: http://mikedatl.typepad.com/mikedvirtualization/2008/09/time-for-some-r.html.
September 22nd, 2008 at 10:55 am
TopGun,
You bring up valid points. Stratus has been doing FT longer than just about anyone. That solution might fit a lot of use cases. However, VMware FT will work for any OS on any hardware and it will even open the door to smaller customers that don’t typically acquire Stratus servers. It’s more of a case-by-case basis to determine if Stratus or VMware FT meets the needs. Both are good solutions. Even after VMware FT is released there will be a market for Stratus and they will continue to be a great partner for VMware.
VMware FT is something new to Virtual Infrastructure although the mechanics behind it (record and replay) have been in VMware Workstation for over a year. VMware Workstation is usually where new features like this show up first since it’s a good proving ground seeing as how workstation is run by millions of people around the globe. Usually if things work out there then they get rolled up to the other products as a direct copy like thin provisioning of disk or as some basis for another feature such as VMware FT. While VMware FT may be a 1.0 feature the technology under it has been in the hands of users for a while. Of course we could just go the Citrix route and skip 1.0 version numbers and start with 3.0. Maybe that would make people feel better.
As for pricing and availability, we can’t really talk about that yet. We did show previews at VMworld in presentations, in demos, in the booth, and in the hands-on lab. That should tell you it’s close. The SEC regulations prohibit VMware (or any public company) from publishing dates or version numbers for future releases. Actually, you can publish them but then you can’t recognize any revenue from impacted products until you deliver that future. This is why you don’t get a lot of “it will ship with version X.Y on ZZ date”. Pricing will be determined when the product is ready to ship. I could say more under NDA but not in a public forum so it’s always best to ping your VMware rep or partner for more information.
As always, the comments are interesting. Keep them coming!
September 26th, 2008 at 4:22 pm
Well, the question is: if you want to sell your product, sell it, but please do not do cheap marketing. Where VMware is right now, MS or any other vendor will take ages to reach.
October 19th, 2008 at 7:23 am
The biggest advantage, I think, for VMware FT against EverRun FT is that VMware FT is configured for ESX. EverRun relies on only two servers, and if both of them fail the protection is lost. With VMware FT we can have, for example, 3 ESX hosts in the cluster. Two of them provide continuous availability for a VM, and resources are consumed only on these two. The third ESX host can be seen as an extra backup, because if one of the first two servers fails, a new backup machine will be created on the third host. What do you think, isn’t that a lot better than EverRun?
November 3rd, 2008 at 12:41 pm
Firstly, Marathon has a significant play in the physical world where they can provide FT services for Windows servers, and HA services that are not cluster dependent. The Marathon HA offering provides much better availability than traditional clusters, as a component level failure is simply redirected over a high speed interconnect to a secondary server. The OS and application will not fail over to the secondary server unless there is a complete server failure. It is pretty elegant software, born of the DEC VAX technologies, so give credit where credit is due. In fact many airports run their security systems on it, and even the Pentagon has deployments where five nines availability is critical.
So in reference to your arguments, yes I agree that a ‘well architected’ VMware environment can be very available. Here are the issues and some arguments for component level availability.
Many deployments are on blade servers or Dell 2950 class servers where slot real-estate is an issue. So NIC teaming will not buy you anything if you can’t team across multiple interface cards due to slot limitations. HP BL465 and Dell 2950 do not have enough slots to properly architect availability, yet they are very common for VMware deployments. Ask the Dell reps… Some customers have created NIC teams across all ports in a single 4 port NIC card for SC, VMotion traffic and general purpose traffic… glaring SPOF. Component level redirection would support this configuration and provide availability.
VMware HA is easy to configure… a few mouse clicks away, however a component level failure can cause a service level outage of a VM(s) without taking an entire server down and initiating an HA failover event. Also, not all VMware administrators are created equally. Many are doing heads down P2V conversions and not carefully architecting HA to support newly added VMs. Poorly configured reservations and limits can cause VM(s) to not have sufficient resources to restart after an HA event.
On the storage side… no argument that enterprise SAN architectures support redundant switches and SP’s and are highly available. That buys nothing when a Blade connects to redundant switches from a single 2 or 4 port HBA… Many customers are just not that smart.
If the infrastructure has SPOF’s VMware FT will too.
Lastly, FT services will be available on XenServer as it is co-developed by Citrix and Marathon. As much as we all would like to think that the hypervisor is not a commodity, many customers are doing their due diligence in evaluating XenServer and Hyper-V. While VMware is far superior today, the gap will close. And yes, the free distribution of 3i Installable validates the commodity statement. Many customers who would have purchased ESX for smaller environments, ROBO or test and dev are deploying 3i in production as they do not need DRS, HA and VMotion.
November 3rd, 2008 at 1:12 pm
Thanks for the comments, MTC (is that Marathon Technologies Center or something?). You place a lot of your arguments on the component level failover. I rarely see people implementing virtualization without taking into account basic IT fundamentals of not architecting a single point of failure in the physical world. That means NIC teaming, HBA path failover, etc. To suggest that the IT world is full of a bunch of idiots that don’t know how to do the basic tasks and so they should buy your (I mean Marathon’s) technologies is insulting. Even those deploying blades realize the enclosure is a single point of failure and architect accordingly with extra capacity in a different enclosure.
I guess what you haven’t mentioned is the fact that with the Marathon component level solution you actually need to set up another VM on another host which is using resources and decreasing your consolidation ratio and thus extending the ROI – all to get around buying an extra NIC. Sounds like a lot of over-engineering and cost for nothing.
Good luck pitching this to people. I really hope you can find a bunch of ignorant IT people to buy off on the value of component level failover.
November 3rd, 2008 at 2:06 pm
No insults intended… Sorry you took it that way.
If you read my post, nowhere did I slam VMware. Nor did I slam IT people.
So I guess the inverse is true. If IT people choose Marathon technology based on its merits, they are idiots?
I think VMware is an awesome product and said nothing to the contrary. I also think that Marathon is an awesome product and only gives XenServer competitive functionality… Competitive choices are not a bad thing even for stupid IT people.
I agree that a blade chassis is a SPOF… That does not stop people from deploying blades. The blade I mentioned is commonplace, as is the Dell 2950. So 2 on-board NICs and 3 slots is limited real-estate if you want to configure 6 NICs (best practice) and 2 HBAs.
So purchasing an extra chassis is not extending the ROI? They are not cheap.
And yes, Marathon does dynamically create a small appliance (5 seconds) on the secondary server which is 128MB and does nothing more than intercept and redirect I/O in the event of a component failure.
And VMware FT is only a single VM on a single host??? How does that work?
My point is, that SMBs may not have teams of architects, and even enterprise customers are learning best practices for virtualization. That comment does not say that they are idiots, but there are a lot of VMware consultants making a good living helping IT to do things right… and some customers simply do not, based on budgets and such… I think that is a safe assessment and not insulting anyone.
February 8th, 2009 at 3:05 pm
[...] really got me thinking about this again is the Dilbert cartoon below. Several months ago I had touched a button with some of the guys over at Marathon after calling them out on a few things. Stratus came into [...]