Wednesday, December 20, 2006

The Power of Generators, Part One

In this inaugural edition of informixdb.connect, I'd like to show you the powerful concept of generators. This is part one, in which I'll give you a hopefully gentle introduction to the general idea and explain why generators are particularly useful in the context of database applications.

Generators, which are a special kind of iterator, are essentially functions that return multiple results, one result at a time. To illustrate how useful and powerful this concept is, suppose a client asked us to write a function that calculates and prints the first n elements of the well-known Fibonacci sequence. In Python, that would look something like this:
def print_fibonacci(n):
    # Print the first n Fibonacci numbers, one per line.
    (a,b) = (0,1)
    for i in xrange(n):
        (a,b) = (b,a+b)
        print a

So far so good. Now the client changes his mind, as clients often do, and instead of having the numbers printed to the console, he wants the numbers inserted into a database table, or written to a file, or added together, or involved in any arbitrary operation that the caller specifies.

The first couple of requests could be handled by making multiple, almost identical copies of this function, but that is the worst possible method of code reuse, and it is completely unsuited for the last request, where the required operation is not known at design time. We need to physically separate the code that generates the sequence from the code that uses the elements of the sequence.

One possible solution is the functional approach: Have the caller specify a callback function that will be invoked for each element, along the following lines:
def process_fibonacci(n, callback):
    # Invoke callback once for each of the first n Fibonacci numbers.
    (a,b) = (0,1)
    for i in xrange(n):
        (a,b) = (b,a+b)
        callback(a)

The caller code would then look like this:
def my_callback(elem):
    # process elem...
    print elem

process_fibonacci(10, my_callback)

This gets the job done, but the caller code hides the fact that it processes a sequence. The code would be clearer if the caller could write a for loop whose body performs the processing of each element. This can be achieved by building and returning a list:
def list_fibonacci(n):
    # Build and return a list of the first n Fibonacci numbers.
    result = []
    (a,b) = (0,1)
    for i in xrange(n):
        (a,b) = (b,a+b)
        result.append(a)
    return result

It would be called like this:
for elem in list_fibonacci(10):
    # process elem...
    print elem

Now, this looks nice, but there's a catch. The entire list has to be built in memory before the caller gets to process any elements. If the number of elements is large, this consumes a lot of memory. It also requires that the caller know in advance how many elements are needed. It would be difficult, for example, to produce all Fibonacci numbers up to one million, because we don't know in advance how many there are. If we underestimate n, we don't get all the elements; if we overestimate n, we waste time and memory calculating a bunch of elements we don't need.

What we really need is something that will calculate and return the elements one by one, and that is precisely what a generator does. The generator version of our Fibonacci algorithm looks like this:
def gen_fibonacci(n):
    # Generate the first n Fibonacci numbers, one at a time.
    (a,b) = (0,1)
    for i in xrange(n):
        (a,b) = (b,a+b)
        yield a

The yield keyword is what makes this function a generator. It makes the function return the value of a to the caller, then suspends the function until the next value is needed. Execution then resumes immediately after the yield until the next yield, and so on. If you've done SPL programming, this concept might sound familiar, because this is how the SPL statement RETURN WITH RESUME behaves.
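
To see this suspend-and-resume behavior in action, we can drive the generator by hand instead of through a for loop. This is just an illustrative snippet, assuming the Python 2 environment that the xrange calls above imply, where each call to a generator's next() method resumes the function until it reaches the next yield:

g = gen_fibonacci(3)   # no code runs yet; we only get a generator object
print g.next()         # runs up to the first yield and prints 1
print g.next()         # resumes after the yield and prints 1
print g.next()         # prints 2
print g.next()         # the loop is exhausted, so this raises StopIteration

A for loop does exactly this behind the scenes: it calls next() repeatedly and stops cleanly when StopIteration is raised.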

The corresponding caller code is the spitting image of the list example from above:
for elem in gen_fibonacci(10):
    # process elem...
    print elem

The key difference that makes the generator approach much more powerful than the list approach is that the processing of elements is interleaved with the calculation of elements. This means that no storage is required to hold the entire list, and the calculation can be aborted between elements at any time. Thus, producing all Fibonacci numbers up to one million is as easy as this:
more_than_we_will_ever_need = 1<<31
for elem in gen_fibonacci(more_than_we_will_ever_need):
    if elem > 1000000:
        break
    print elem

Note that there is no penalty for grossly overestimating the number of elements we're requesting, since the elements are calculated one by one, and we'll break out of the loop when we've seen enough. We could even extend the generator to have no limit at all, thus producing a sequence that's infinitely long.
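
For instance, an unbounded variant (I'll call it all_fibonacci here; the name is just for illustration) simply replaces the counted loop with an endless one, leaving it entirely up to the caller to decide when to stop:

def all_fibonacci():
    # Generate Fibonacci numbers forever; the caller decides when to stop.
    (a,b) = (0,1)
    while True:
        (a,b) = (b,a+b)
        yield a

for elem in all_fibonacci():
    if elem > 1000000:
        break
    print elem

Of course, such a generator must never be exhausted with list(all_fibonacci()) or with a for loop that has no break, since that would run forever.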

I hope I have made it clear that generators are an excellent tool for producing and processing arbitrarily large sequential sets of data. This makes generators well suited for database applications, since the result of querying a database is exactly that: an arbitrarily large sequential set of data. Consequently, InformixDB has a number of features that leverage the power of generators. In part two, after you've had some time to digest this article, I will show you some examples of using generators with InformixDB.

To be continued...