Eucalyptus engineering has gone through a number of changes, as Tim Cramer describes in his excellent blog posting; the changes excite me, and I'm quite proud of them. The other day, I was speaking with Vic Iglesias (@vicnastea), who heads up our QA team, about the upcoming 3.2 release, and it occurred to me that the release intersects a couple of hackneyed ideas in a surprising way -- surprising, at least, to me.
To begin with, Eucalyptus has moved to an Agile style of software engineering practice. I say "style" because what we practice is an adaptation of Agile software engineering to the coordinated development (using multiple technologies) of both code and configuration specifications. Agile has been wildly successful for developing applications where the principal application technology both supports test-driven development and prescribes a specific architecture. For example, the Agile methodology works brilliantly with Ruby-on-Rails for web development, where the architecture is "model-view-controller" and the language system supports test-driven development.
In contrast, Eucalyptus depends on a large collection of software dependencies that conform to several different architectural models. While it does have an overarching architecture, that architecture is not embodied in any single language system, execution framework, or IDE. Thus one of the cool things, I think, about Eucalyptus engineering is the way in which the engineers have adapted Agile principles like continuous integration and date-driven release cycles to the problem of co-developing code and dependency configuration.
Looks great, but does it work?
Another cool thing about Eucalyptus engineering that is beginning to bear fruit is the emphasis the team has put on QA. One of my many personal character flaws is that I really hate code that doesn't work. I'd rather have a few features implemented well than a large pile of partial functionality. Really. Every bug report bugs me.
I like to think the engineering team takes QA so seriously not because I do, but, in reality, they do because we as a company made a commitment to high-quality code when we commercialized and that sentiment is now indelibly etched into the company culture. QA is critical and everyone participates, contributes, and takes it to be one of the most important requirements in any release.
These two predilections intersected the other day in the following way. The next release of Eucalyptus includes a new reporting and charge-back facility that stores usage and accounting data, temporarily, in the internal cloud state management database. Periodically (under the control of an administrator-settable parameter) that data is aggregated in an external data warehouse for report generation and querying and subsequently purged from the internal database.
This architecture permits the accounting data to be used both for quota control via EUARE (the Eucalyptus implementation of identity management) as well as for charge-back. At the same time, it reduces the performance impact of report generation on the overall responsiveness of the system by offloading complex queries to the data warehouse.
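To make the aggregate-then-purge cycle concrete, here is a minimal sketch of the pattern in Python. The table names, schema, and SQLite backend are my own illustrative assumptions, not Eucalyptus's actual implementation; the point is only the shape of the operation: copy records older than a cutoff into the warehouse, then delete them from the internal database.

```python
import sqlite3

def aggregate_and_purge(internal: sqlite3.Connection,
                        warehouse: sqlite3.Connection,
                        cutoff: float) -> int:
    """Move usage rows recorded before `cutoff` into the external
    warehouse, then purge them from the internal database.
    Returns the number of rows moved. (Hypothetical schema.)"""
    rows = internal.execute(
        "SELECT ts, resource, amount FROM usage WHERE ts < ?", (cutoff,)
    ).fetchall()
    warehouse.executemany(
        "INSERT INTO usage_archive (ts, resource, amount) VALUES (?, ?, ?)",
        rows,
    )
    internal.execute("DELETE FROM usage WHERE ts < ?", (cutoff,))
    internal.commit()
    warehouse.commit()
    return len(rows)
```

In the real system this would run on a timer governed by the administrator-settable purge parameter; the sketch just shows why the internal footprint stays bounded by roughly one purge interval's worth of data.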
However, there is also a potential performance impact in a high-availability (HA) configuration. The new reporting system includes usage data that updates frequently (e.g., network usage information). While the internal database can easily handle the update load, after a failure of the Cloud Controller (CLC) and a successful failover, the length of the recovery period for restoring the system to an HA configuration is determined by the size of the database footprint at the time of the recovery.
Essentially, there is no performance issue when the system is running. There is no performance issue when the system is running as an HA system. There is no performance issue when the system fails over due to a failure of the CLC. There is, however, a possible performance problem when the system attempts to automatically restore itself after the failure has been cleared.
In an HA configuration, Eucalyptus maintains redundant databases. When one of them comes back on line, it must be synchronized with the currently active database. There is a period of time during this synchronization when the system must pause to allow the active database to become quiescent, and the length of this pause is, in part, determined by the size of the data.
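The trade-off can be captured in a back-of-envelope model: the worst-case internal footprint grows roughly linearly with the purge interval, and the synchronization pause grows with the footprint. This is purely illustrative; the function and every parameter in it are my own hypothetical stand-ins, not measured Eucalyptus behavior.

```python
def estimated_sync_pause(update_rate_rows_per_min: float,
                         purge_interval_min: float,
                         seconds_per_row: float) -> float:
    """Rough model: rows awaiting purge accumulate at the update rate
    for up to one purge interval, and the HA resynchronization pause
    scales with that footprint. All parameters are hypothetical."""
    worst_case_footprint_rows = update_rate_rows_per_min * purge_interval_min
    return worst_case_footprint_rows * seconds_per_row
```

Even this toy model makes the administrator's dilemma visible: purging rarely keeps reporting cheap day to day but lengthens the pause after a failure clears, which is exactly the kind of guideline the QA work described below is meant to pin down with real numbers.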
I was concerned that the new reporting system might introduce a performance problem in HA recovery because of the additional accounting information. I was also worried that we would not detect this problem until one of our production customers, after a long period of stability, had a failure and then an unduly lengthy pause after the failure cleared. This kind of latent "corner case" is the kind of problem that can be easily overlooked when new features are introduced rapidly.
However, the QA team incorporates its test development into the Agile software development pipeline so that the test plan and the development plan proceed in parallel. This activity isn't test-driven development. Rather, the test plan is an independent product that "ships" the day the team begins final release QA, in a sense becoming an upstream software dependency of the final release.
Or at least, that's the idea.
Turns out to be a good idea, in this case. When Vic and I chatted about the potential issue, not only did he have a test plan ready for HA recovery with heavy reporting activity (we are about to go into final QA as I write this) but his team also plans to try and determine guidelines for administrators (as a by-product of QA) for setting the purge frequency.
It sounds like a small thing, perhaps, but as the software has grown in complexity and users have come to depend on Eucalyptus as a piece of critical infrastructure, it is exciting to see how the effort we have been putting into building dependability into the software and the software process is beginning to pay off. That, for me, is a big thing. I like software that works.