By Ron Breault
Having been exposed to OpenStack for more than five years now, I think it’s fair to say that I’ve grown blasé about the Live Migration feature; it just does what it’s supposed to do and it does it well. But when you stop and think about it, OpenStack Live Migration is really an extraordinary feature. That’s especially true now with all the recent improvements that have been made to it. Read on to learn more.
Why do I call Live Migration extraordinary? Because of everything that happens “under the covers” to make it work, and because of what Live Migration enables. With just a few clicks of the mouse in Horizon, a VM running on one physical server can be automatically moved to another physical server. “Automatically” makes it sound simple, but there’s a whole lot of work going on to pull it off: replicating all the VM’s static and dynamic memory – while the VM is running; copying and establishing the VM’s complete network infrastructure on the target node; copying local block storage (if used) to the target node; and briefly pausing and then resuming the VM to complete the process. Depending on the size of the VM, the overall migration interval can be measured in seconds to minutes.
What does Live Migration enable? All sorts of things important to the operation of an always-on production cloud! Live Migration enables physical servers to be gracefully powered off and upgraded without taking hosted virtual servers offline. In a similar way, important host security updates or bug fixes can be deployed across servers without stopping any of the hosted VMs. For example, when using the in-service upgrade feature in Wind River’s Titanium Cloud virtualization software products, a complete cloud infrastructure can be upgraded from one release to the next, live and in production, thanks in part to the capabilities of Live Migration.
The OpenStack Innovation Center recently issued a report titled “High Availability of Live Migration.” It makes for good reading, and details the thorough study and testing they performed on OpenStack’s Live Migration capability. While I don’t want to dissuade you from reading the full report, the key line from the summary was, for me, this statement: “In conclusion, we were able to prove that Live Migration works.” If they had asked me first, I could have saved them a lot of work: Titanium Cloud has been successfully leveraging Live Migration for many years. While the vanilla OpenStack distribution still has a few kinks to work out, Titanium Cloud has Live Migration down to an art form, and we’ve been pushing changes upstream to make it even better. As validated through independent third-party testing, Titanium Cloud can execute a Live Migration with under 150ms of VM downtime. I haven’t heard of another commercial implementation that comes anywhere close to that figure.
Now, with the most recent release of Titanium Cloud, which includes both upstream OpenStack work and Wind River updates, Live Migration is better than ever! Here are just two improvements that I think warrant particular attention:
Performance Increase. Our testing shows that Live Migration throughput has increased significantly under the latest Titanium Cloud release. In our labs, we’ve seen throughput improve by as much as five times over prior releases! That kind of change can make a big difference with large VMs, resulting in a substantially shorter Live Migration interval. Faster migrations can also mean shorter planned maintenance windows, because the operator simply spends less time waiting for Live Migrations to complete.
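To get a feel for why throughput matters so much for large VMs, here is a rough back-of-the-envelope model of iterative memory copying. It is only a sketch under my own assumptions, not the Titanium Cloud or OpenStack implementation: the function name, the megabyte-per-second figures, and the 0.15 second pause budget (borrowed from the 150ms figure above purely for illustration) are all hypothetical. Each round copies whatever memory is still outstanding, and the VM dirties more memory while that copy is in flight.

```python
def precopy_migration_seconds(ram_mb, dirty_mb_per_s, link_mb_per_s,
                              max_downtime_s=0.15, max_rounds=100):
    """Estimate total migration time for a simple pre-copy model.

    Returns the estimated time in seconds, or None if the outstanding
    memory never gets small enough to finish within max_rounds.
    All parameters are illustrative assumptions, not measured values.
    """
    if link_mb_per_s <= 0:
        raise ValueError("link_mb_per_s must be positive")

    remaining_mb = float(ram_mb)  # first round copies all of the VM's memory
    elapsed_s = 0.0
    for _ in range(max_rounds):
        round_s = remaining_mb / link_mb_per_s
        if round_s <= max_downtime_s:
            # Small enough to copy during the brief pause-and-resume step.
            return elapsed_s + round_s
        elapsed_s += round_s
        # Memory dirtied while this round was copying must be sent again.
        remaining_mb = min(float(ram_mb), dirty_mb_per_s * round_s)
    return None


if __name__ == "__main__":
    slow = precopy_migration_seconds(8192, dirty_mb_per_s=100, link_mb_per_s=200)
    fast = precopy_migration_seconds(8192, dirty_mb_per_s=100, link_mb_per_s=1000)
    print(f"200 MB/s link:  {slow:.2f} s")
    print(f"1000 MB/s link: {fast:.2f} s")
```

In this toy model, a fivefold increase in throughput cuts the migration interval by more than five times, because each round is shorter and so leaves less newly dirtied memory for the next round.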
Auto-Convergence. The new Auto-Convergence feature is an especially cool innovation. Some VMs can take a long time to migrate due to heavy memory write activity: as fast as OpenStack is able to copy the ‘dirty’ memory contents of the VM from the source to the target, the VM is able to ‘dirty’ its memory again. This means OpenStack might barely keep up, or in some cases, might never catch up, because the VM is simply too busy writing to memory. Auto-Convergence changes that by intelligently slowing down the VM’s virtual CPUs so that the VM can’t dirty its pages as quickly. Since its memory writes are slower, Live Migration proceeds without stalling and is able to stay ahead of the VM. Very smart. This feature is optional, so if you don’t want to use it with certain VMs, the feature can be turned off; flexibility is key.
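The idea behind Auto-Convergence can be sketched with a small simulation. This is a minimal illustration under assumptions of my own, not the actual OpenStack or Titanium Cloud logic: the function name, the rule for when to throttle (a round that re-dirties more than half of what it just copied), and the 20% starting throttle with 10% steps up to 99% are all hypothetical values chosen for the example.

```python
def migrate_with_auto_converge(ram_mb, dirty_mb_per_s, link_mb_per_s,
                               auto_converge=True, max_downtime_s=0.15,
                               initial_throttle_pct=20, throttle_step_pct=10,
                               max_throttle_pct=99, max_rounds=50):
    """Simulate pre-copy rounds with optional vCPU throttling.

    Returns (rounds_used, final_throttle_pct), or None if the migration
    does not converge within max_rounds. Every threshold here is an
    illustrative assumption rather than a real product default.
    """
    remaining_mb = float(ram_mb)
    throttle_pct = 0  # integer percent by which the vCPUs are slowed down
    for round_no in range(1, max_rounds + 1):
        round_s = remaining_mb / link_mb_per_s
        if round_s <= max_downtime_s:
            return round_no, throttle_pct  # final pause-and-copy fits the budget

        # A throttled VM dirties memory proportionally more slowly.
        effective_dirty = dirty_mb_per_s * (100 - throttle_pct) / 100
        new_remaining_mb = min(float(ram_mb), effective_dirty * round_s)

        # Not making enough progress: slow the VM down a little more.
        if auto_converge and new_remaining_mb > 0.5 * remaining_mb:
            if throttle_pct == 0:
                throttle_pct = initial_throttle_pct
            else:
                throttle_pct = min(max_throttle_pct,
                                   throttle_pct + throttle_step_pct)
        remaining_mb = new_remaining_mb
    return None


if __name__ == "__main__":
    # A VM that dirties memory faster than the link can copy it.
    busy_vm = dict(ram_mb=4096, dirty_mb_per_s=300, link_mb_per_s=200)
    print("without throttling:", migrate_with_auto_converge(auto_converge=False, **busy_vm))
    print("with throttling:   ", migrate_with_auto_converge(auto_converge=True, **busy_vm))
```

Without throttling, the simulated VM re-dirties its memory faster than it can be copied and the migration never finishes. With throttling, the slowdown ratchets up only until the copy starts gaining ground, which is the trade-off to weigh when deciding whether to enable the feature for a latency-sensitive VM.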
There are other interesting changes as well, to name a few: the ability to dynamically update the maximum Live Migration interval (some VMs always take longer to migrate than others, so this helps to avoid timeouts); periodic logging of Live Migration throughput and estimated downtime; and a reduction of the default maximum timeout from 800 seconds to 180 seconds.
Taken together, these changes make Live Migration under the latest release of Titanium Cloud the best we’ve delivered to date. If you manage critical infrastructure using the cloud, Live Migration is an indispensable feature. If you’re not using Titanium Cloud, you’re not getting the best performance possible out of Live Migration. Contact your local Wind River sales manager or visit us to learn more about Titanium Cloud and our Live Migration performance.