As one might expect, I spend quite a bit of time talking to prospective customers about the joys of ownership that will accrue from the decision to install an on-premises cloud. To those who are charged with the operation and management responsibilities for a data center, the self-service notion is appealing, but for many of them, it is not new. What is new is the speed with which such requests can be filled by an on-premises cloud automatically, without system administrator intervention. We at Eucalyptus sometimes take for granted that the enterprise cloud must be fast. I'm occasionally surprised, however, by how important speed is to IT administrators and cloud users.
100 Nodes in 90 Seconds
By way of illustration, at a recent customer visit, one of the data center architects asked me:
"If one of my users wanted 100 nodes, each running a VM, how long would that take?"
It turns out that we incorporate speed tests into the Eucalyptus release schedule, and when 3.1 shipped last July, one of the tests (coincidentally) included a timed 100-node instance start.
In this graph, the x-axis shows the number of nodes on which Eucalyptus started an instance. The y-axis shows the average number of seconds, over multiple runs, for all instances (one per node) to start; that is, the total time a simulated user waits for every requested instance to come up. The test runs 50 times for each node count so that it is possible to compute confidence bounds on each average start time (95% confidence bounds shown as error bars in the figure).
Thus, the average time to start 100 VMs on 100 nodes with one VM per node (rightmost artifact on the plot) is 84 seconds, plus or minus 3 seconds or so.
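For readers who want to reproduce this kind of error bar, here is a minimal sketch of how a mean and a 95% confidence bound could be computed from repeated timings. It is not the script we use. The function name `mean_with_confidence` and the sample timings are made up for illustration, and it assumes a normal approximation (with 50 runs, a Student's t interval would be slightly wider).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_with_confidence(samples, confidence=0.95):
    """Return (mean, half_width) of a two-sided confidence interval.

    Assumption: uses the normal approximation, not Student's t.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate variance")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * stdev(samples) / sqrt(n)
    return mean(samples), half_width


# Made-up start times in seconds, for illustration only.
example_times = [80.0, 82.0, 84.0, 86.0, 88.0]
avg, half = mean_with_confidence(example_times)
print(f"{avg:.1f} s +/- {half:.1f} s")
```

Fed with the 50 measured times for a given node count, the two returned values correspond to a plotted point and the length of its error bar.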
The test configuration is designed to emulate a typical customer deployment. We use Dell R210 servers with 8GB of RAM in each, Gigabit Ethernet connected via a Netgear managed switch, and a 1GB Ubuntu test image. Each node has a 200GB local disk that the Eucalyptus node controller uses for image caching. Before the timed runs begin, the test loads the local image caches (images are stored initially in Walrus, the Eucalyptus object store). The client script (which is run on a workstation located on the company production network) issues run commands and then calls euca-describe-instances in a polling loop to determine instance status. Thus, the purpose of the test is to determine the end-to-end performance as experienced by an enterprise user accessing fairly standard data center class hardware.
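The polling loop described above can be sketched roughly as follows. This is an illustration and not the actual client script. `time_until_running` and `get_states` are hypothetical names, and `get_states` stands in for whatever calls euca-describe-instances and extracts one state string per instance. The sketch also assumes a launched instance reports the state "running".

```python
import time


def time_until_running(get_states, expected, poll_interval=1.0, timeout=600.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until `expected` instances report "running"; return elapsed seconds.

    get_states is a hypothetical callable returning a list of state strings.
    clock and sleep are injectable so the loop can be exercised without a cloud.
    """
    start = clock()
    while True:
        states = get_states()
        running = sum(1 for state in states if state == "running")
        elapsed = clock() - start
        if running >= expected:
            return elapsed
        if elapsed >= timeout:
            raise TimeoutError(f"only {running} of {expected} instances running")
        sleep(poll_interval)


# Simulated demo: three instances become "running" on the third poll.
_polls = iter([["pending"] * 3, ["pending"] * 3, ["running"] * 3])
elapsed = time_until_running(lambda: next(_polls), expected=3,
                             sleep=lambda seconds: None)
print(f"all instances running after {elapsed:.3f} s (simulated)")
```

Because the clock starts before the first status check and stops only when every instance is up, the measured time includes client and network latency, which is the end-to-end view the test is after.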
Behold! The Hyper Center
Returning to the customer, the discussion was focused on the difference between a virtualized data center and a data center managed as an enterprise cloud. Even with virtualization management tools (this customer uses several virtualization technologies), the provisioning time that users experience is still several days because the process is neither automated nor self-service. That Eucalyptus could turn that time into about a 90-second delay for 100 nodes may have come as a shock, I think, because I was met with a bit of a silent stare. I said, "No. Really. It's like hyperdrive for your data center. It makes it a Hyper Center." Or something.
Yeah. Well. Or something, but they seemed to get the point.
Reliable Performance as a Diagnostic
Part of the reason this interaction caught me off guard is that the purpose of the test is not actually to measure the speed of Eucalyptus prior to a release, but to look for signs that there might be intermittent or transient problems. The test runs a sequence of 1, 3, 6, 12, 50, and 100 VM instance launches (one per node). Each count is timed, and the instances are terminated between counts. One pass through the sequence launches 172 instances, so fifty iterations of this test amount to 8600 VM starts and stops. If any of them fail, the test terminates with an error condition. Further, because each group of instances comes from a single invocation of euca-run-instances, the system experiences a highly correlated, "bursty," and periodic workload, making the test a reasonable part of churn testing. The other important part of the test results is the error bars. Eucalyptus attempts to keep itself running "at all costs," which means simple functionality testing can mask intermittent internal problems. By observing the confidence bounds on the average times, it is possible to spot high-variance runs that may signal the presence of a transient bug even when the user requests are correctly satisfied.
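The arithmetic behind that workload, and the idea of using the error bars as a diagnostic, can be sketched as below. The launch counts and iteration count come from the test described above. The `flag_high_variance` helper, its 10% threshold, and the 50-node numbers are assumptions made up for illustration, not our actual acceptance criteria.

```python
# Launch sizes and iteration count as described in the test.
launch_counts = [1, 3, 6, 12, 50, 100]
iterations = 50

starts_per_iteration = sum(launch_counts)          # 172 instances per pass
total_starts = starts_per_iteration * iterations   # 8600 starts (and stops)
print(starts_per_iteration, total_starts)


def flag_high_variance(bounds, max_relative_width=0.10):
    """Return launch counts whose confidence half-width is large relative to the mean.

    bounds maps launch count -> (mean_seconds, half_width_seconds).
    The 10% default threshold is a hypothetical choice for this sketch.
    """
    return [count for count, (avg, half) in sorted(bounds.items())
            if avg > 0 and half / avg > max_relative_width]


# 100-node figures are from the post; the 50-node figures are invented
# to show what a suspicious, high-variance result would look like.
example_bounds = {100: (84.0, 3.0), 50: (40.0, 9.0)}
print(flag_high_variance(example_bounds))
```

A launch count that shows up in the flagged list has not necessarily failed any request; it is simply a run worth inspecting for a transient problem.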
In short, we use performance testing to bolster our confidence in the software releases. The test numbers we were discussing are not even for an optimized installation (we were testing the "generic" install). It is good to see, however, the extent to which an on-premises enterprise cloud can streamline IT by turning a virtualized data center into a "Hyper Center."