By Rich Wolski | March 11, 2010
Recently, Thorsten Von Eicken, founder of RightScale, and Adrian Cole, founder of jclouds, both offered interesting insights regarding cloud APIs -- in effect, defining some of the "dos and don'ts" of API design. Since we (at Eucalyptus) struggle with the implementation of different APIs somewhat regularly, I felt like I could shed some light on the perspective with which we work during these struggles.
From our perspective, it is important to understand the distinction between the design of an API and the abstractions to which it is an interface. That is, when we look at an API we see two aspects of it:
-- the operations on abstractions referred to by the API calls
-- the syntax and operational behaviors of the API calls themselves
For example, an EC2 Elastic IP address is an abstraction. The user of an Elastic IP does not know (or, hopefully, care) how that abstraction is implemented. Rather, he or she cares about the operations that can be performed to create, manipulate, and destroy an Elastic IP as well as what an Elastic IP address "does" in between these operations. The user also cares about how long each operation or command may take to complete and the response of the system if and when an operation fails. This part is the abstraction part.
The API and its specification also define the language that the user is required to use to perform the various operations. They specify not only the syntax (what characters or bytes occur in what order for each command) but also the information content carried in each API call or response, and their respective timings. This part is the syntax part.
Thus an API consists of things (objects, logical constructs, etc.) that can be talked about and a language (syntax, grammar, etc.) for talking about them.
Decomposed this way, it seems possible to consider the "sins" of API design (in the spirit of Thorsten's biblical allegory) as being either venial or mortal. Venial sins, which I want to emphasize are still indeed sins, are ones that make the API difficult to use, but do not limit its scalability or harm the robustness of applications using it. For example, in Thorsten's post, polling for system state is, in my opinion, a venial sin.
Here's why. He's right -- having the client poll for state changes is a real pain in the biblical reference. On the other hand, either the client needs to poll the cloud, or alternatively the cloud needs to remember that the client is waiting for a "signal" or "call back" and must send the necessary signal when the API call has completed. Notice that from the cloud platform designer's perspective, this state management is potentially a nightmare to implement. What if the client is a web browser and the user has closed it? How long should the platform wait before it gives up? How often should it retry? How does the platform "know" that the browser has received the call back? Now multiply that problem by thousands or hundreds of thousands of users and connect it to the management of hundreds of thousands of resources. It can be done, but at the cost of complicating the platform's state management problem substantially.
Moreover, the client side needs to be made "asynchronous." That is, a user must correctly set up a call back or a signal handler for the calls he or she makes. If the "user" in this case is a tool, this interface design can work, but regular, garden-variety users (and many programmers) struggle with the notion of asynchrony, making the user experience potentially underwhelming.
Alternatively, if the client polls, the state management burden is distributed among the clients. Moreover, the system is one-way transactional in that the polling requests come through the same API processing "engine" as imperative commands do, making it easier for the platform to optimize for request throughput. Thus, polling is a pain, but one that makes sense in terms of complexity management and scale, at least from the platform perspective.
One possible way to address these lesser evils is to consider having two different syntaxes for a single set of abstractions: one for end-users/polling clients and one for more sophisticated asynchronous clients. The difficulty with this approach is in verifying that both sets of API calls are equivalent for each abstraction.
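To make the trade-off concrete, here is a minimal sketch of what the polling burden looks like once it lands on the client. Everything in it is an assumption for illustration: `poll_until`, its parameters, and the simulated "pending"/"running" states are hypothetical and do not correspond to any real cloud API.

```python
import time
from typing import Callable


def poll_until(
    describe: Callable[[], str],
    target: str,
    interval_s: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call describe() until it returns `target`.

    Returns the number of describe() calls that were needed, or raises
    TimeoutError once max_attempts calls have been made without success.
    """
    for attempt in range(1, max_attempts + 1):
        if describe() == target:
            return attempt
        if attempt < max_attempts:
            # The client, not the platform, remembers that it is waiting.
            sleep(interval_s)
    raise TimeoutError(f"state never reached {target!r} after {max_attempts} polls")


# Simulated platform (hypothetical): the resource reports "pending" twice,
# then "running". The sleep is stubbed out so the example returns at once.
_states = iter(["pending", "pending", "running"])
attempts_needed = poll_until(lambda: next(_states), "running", sleep=lambda _s: None)
```

Note that the platform side of this exchange holds no per-client waiting state at all: each `describe()` is just another request, which is the property argued for above.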
Resource identification, however, is another matter. It is a serious deficiency, for example, not to return a resource identifier on create, even if the system needs to poll to determine when the created resource is operational. The potential mortality lies in the inability of the client to match up create calls with the resources they create, especially in the face of failures. Not only does the application logic become complex, but it may even be impossible to disambiguate the behavior of different application components, leaving only a full restart as a fault remediation strategy.
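As a rough illustration of the point, the sketch below shows a create call that hands back an identifier immediately, while the resource itself is still "pending". `ToyCloud`, its methods, and the identifier format are all hypothetical stand-ins, not a real platform's interface.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class ToyCloud:
    """In-memory stand-in for a cloud platform (hypothetical)."""

    _counter: Iterator[int] = field(default_factory=lambda: itertools.count(1))
    _state: Dict[str, str] = field(default_factory=dict)

    def create(self, kind: str) -> str:
        # The identifier is returned right away, before the resource is
        # operational; the caller can poll describe() with it afterwards.
        resource_id = f"{kind}-{next(self._counter):04d}"
        self._state[resource_id] = "pending"
        return resource_id

    def describe(self, resource_id: str) -> str:
        return self._state[resource_id]


cloud = ToyCloud()
# Each application component records the identifier its own create returned,
# so two concurrent creates can never be confused with one another.
owned = {"web": cloud.create("vm"), "db": cloud.create("vm")}
```

If `create` returned nothing, the "web" and "db" components would have to guess which of two pending VMs belonged to whom, which is exactly the ambiguity described above.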
On the abstraction side, the severity of shortcomings is harder to identify. One rule of thumb we tend to use is that the degree of aggregation each abstraction requires is inversely proportional to the scalability and predictability with which it can be implemented. For example, a VM abstraction that includes block-level network-attached storage and persistent IPs in its API call for "create" is harder to implement at scale and more error prone than three separate abstractions that must be composed: one for VM create, one for volume attach, and one for persistent IP attach. Notice that the burden of composition is again shifted to the client, but the error semantics are easier to parse. In the aggregate case, internally there are 8 possible outcomes for the aggregate command: VM yes/no, volume yes/no, and IP yes/no. From an API perspective, then, there are seven possible error conditions for each command and one success condition. It is possible to handle some of the cases together (e.g. all cases with VM=no are covered by a single code path) but logically all seven cases come back through the API. Alternatively, as three separate abstractions, the sequencing of requests makes the error logic easier. A failed VM create obviates the need for an attach call at all, for example.
Disclaimer: these musings are guidelines at best and delusions at worst. We certainly haven't yet seen a definitive set of cloud abstractions nor are the hard and fast rules for cloud API design even visible on the horizon. Like many working in this space, we have developed an internal "sense" of some of these issues and this post is intended to try to codify that sense in some way.
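The counting argument can be checked mechanically. The sketch below enumerates the outcomes of the aggregate call and then shows one way a client might sequence the three separate calls. The `provision` function and its stop-at-first-failure policy are assumptions made for illustration; a real client might reasonably still attempt the IP attach after a volume attach fails.

```python
from itertools import product
from typing import Callable

PARTS = ("vm", "volume", "ip")

# Aggregate create: every yes/no combination of the three parts can occur.
outcomes = [dict(zip(PARTS, flags)) for flags in product((True, False), repeat=3)]
failures = [o for o in outcomes if not all(o.values())]
# 8 outcomes in total: 7 error conditions plus the single all-yes success.


def provision(
    create_vm: Callable[[], bool],
    attach_volume: Callable[[], bool],
    attach_ip: Callable[[], bool],
) -> str:
    """Composed form: one call per abstraction, stopping at the first failure."""
    if not create_vm():
        # No VM, so neither attach call is ever issued.
        return "vm create failed"
    if not attach_volume():
        return "volume attach failed"
    if not attach_ip():
        return "ip attach failed"
    return "ok"
```

Under this (assumed) sequencing the client sees at most one failing call at a time, so each error maps to a single, known step rather than to one of seven combined conditions.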