One of the much-touted benefits of cloud computing is elasticity: computing resources can be quickly scaled up or down on demand, with nearly unlimited resources available for rent. Cloud computing makes it (almost) as easy to use 1,000 machines to compute something in 1 hour as it is to use 1 computer to compute something in 1,000 hours, or any point in between. Node configuration can also be changed easily: for example, Amazon EC2 instance types range from “small” (1 compute unit, 1.7GB RAM) to “extra large” (8-20 compute units, 15GB RAM).
Since cloud computing allows such flexible cluster configuration, what is the right choice for a given problem? Intuitively, the more amenable the task is to parallelization, the more effective it will be to employ many small nodes; the more centralized or communication-intensive the task, the better it will be to use relatively few large nodes. Furthermore, if I need the answer immediately, it might make sense to keep adding nodes until the query runtime stops improving, regardless of cost — but if the query is a low-priority one, I might want to run it slowly and cheaply.
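The diminishing returns from adding nodes can be made concrete with Amdahl's law: if a fraction p of the work parallelizes, the speedup on n nodes is 1 / ((1 − p) + p/n), so the serial portion eventually dominates. A toy model (nothing EC2-specific; the 95%-parallelizable job is an illustration):

```python
def runtime(base_hours: float, parallel_fraction: float, nodes: int) -> float:
    """Estimated runtime on `nodes` machines for a job that takes
    `base_hours` on one machine, assuming `parallel_fraction` of the
    work splits perfectly across nodes (Amdahl's law)."""
    serial = base_hours * (1 - parallel_fraction)
    parallel = base_hours * parallel_fraction / nodes
    return serial + parallel

# A 100-hour job that is 95% parallelizable: going from 10 to 1000 nodes
# barely helps, because the 5-hour serial portion never shrinks.
for n in (1, 10, 100, 1000):
    print(n, runtime(100, 0.95, n))
```

The 100x jump from 10 to 1000 nodes buys less than a 3x speedup here, which is exactly why "just add more machines" stops being worth paying for at some point.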
As a database guy, I immediately ask: why do I need to make these decisions manually? It should be possible to build a query optimizer that understands the pricing model of my cloud computing provider and the nature of my query workload, and automatically provisions the resources needed to run my query at the cheapest total cost. In the past, this kind of provisioning decision had to be made once, based on guesses about the cluster’s expected workload. With cloud computing, we can now do provisioning on the fly, choosing the best cluster configuration for the task at hand. That’s a powerful capability that hasn’t yet been widely exploited.
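A minimal sketch of what such a provisioning optimizer might look like. The prices, relative speeds, and Amdahl-style runtime model below are made-up placeholders (a real system would plug in actual EC2 rates and a real cost model for the query); the point is the shape of the search: enumerate (instance type, node count) configurations, estimate runtime and dollar cost for each, and pick the cheapest one that meets the deadline.

```python
import math

# Hypothetical hourly prices and relative speeds -- NOT actual EC2 rates.
INSTANCE_TYPES = {
    "small":       {"price_per_hour": 0.10, "speed": 1.0},
    "extra_large": {"price_per_hour": 0.80, "speed": 8.0},
}

def estimate_runtime(work_hours, parallel_fraction, speed, nodes):
    """Amdahl-style estimate: the serial portion runs on a single node."""
    serial = work_hours * (1 - parallel_fraction) / speed
    parallel = work_hours * parallel_fraction / (speed * nodes)
    return serial + parallel

def cheapest_config(work_hours, parallel_fraction, deadline_hours):
    """Return (type, nodes, cost, runtime) minimizing cost subject to
    the deadline, searching up to 256 nodes per instance type."""
    best = None
    for name, spec in INSTANCE_TYPES.items():
        for nodes in range(1, 257):
            t = estimate_runtime(work_hours, parallel_fraction,
                                 spec["speed"], nodes)
            if t > deadline_hours:
                continue
            # Clouds typically bill each node by the (partial) instance-hour.
            cost = spec["price_per_hour"] * nodes * math.ceil(t)
            if best is None or cost < best[2]:
                best = (name, nodes, cost, t)
    return best

# A 100-hour, 95%-parallelizable query due in 8 hours:
print(cheapest_config(100, 0.95, 8))
```

Even this toy version captures the intuition above: for a mostly-serial workload it prefers a couple of fast nodes, while a perfectly parallel one pushes it toward many cheap small ones; relaxing the deadline lets it trade runtime for dollars.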
Update: I think this idea becomes even more compelling when you add the ability to schedule periodic analysis tasks using EC2 Spot Instances.