Friday, April 27, 2007

The Power of Generators, Part Two

In the previous installment of informixdb.connect, I demonstrated that generators are a powerful tool for dealing with arbitrarily large sets of sequential data. In this installment, I will show some concrete examples of leveraging this power in Informix applications with the InformixDB module.

By the way, this blog does not pretend to be an introduction to Python. Python is a very readable and mostly straightforward language, and the code examples in this blog should be easy enough to follow if you don't know Python. If you are interested in learning Python, one good place to start is the Python Tutorial. For more information on developing Informix applications in Python, the InformixDB documentation is a useful resource.

The first example shows how InformixDB uses generators to offer a natural way of dealing with query results. This makes sense because generators allow you to handle arbitrarily large sets of sequential data, and that's precisely what you get from a database query. Consider the following code for connecting to a database and executing a select query:
import informixdb

# Establish a database connection
conn = informixdb.connect("stores_demo")

# Ask the connection for a cursor object
cur = conn.cursor()

# Execute a query
cur.execute("select fname, lname from customer")

At this point, we have executed the query, but we haven't fetched any of its result rows. One possibility is to fetch all result rows into a list and loop through that list:
all_rows = cur.fetchall()
for row in all_rows:
    print row[0], row[1]

This works, but it has the same drawbacks as the list_fibonacci example from part one. If the result set is large, reading it all at once will use a lot of memory. Also, all rows have to be fetched before any of the rows can be processed.

The memory and time overhead can be avoided by explicitly fetching and processing individual rows:
while True:
    row = cur.fetchone()
    if row is None:
        break
    print row[0], row[1]

Using a result set generator would give us the best of both worlds: We get to use a more concise "for" loop without incurring the overhead of fetching all rows into a list. Fortunately, such a generator is trivial to obtain with InformixDB, because the cursor object itself acts as a generator that yields all of its result rows. Therefore, the following piece of code does exactly what we want:
for row in cur:
    print row[0], row[1]

The above "for" statement will fetch the rows from "cur" one by one and execute the loop body for each fetched row. If you've programmed in SPL or Informix-4GL before, this concept should feel familiar and natural; it's Python's equivalent of a FOREACH statement.

That example showed the very common case of InformixDB generating a sequence that is consumed by our application code. In the next example, we will turn the tables and generate a sequence that InformixDB consumes. To see how this can be useful, we'll consider the following naive data loader:
# Get a database connection and a cursor
import informixdb
conn = informixdb.connect("stores_demo")
cur = conn.cursor()

# Open a file
f = open("names.unl")
# Loop through the lines of the file
for line in f:
    # Split into columns
    data = line.split("|")
    # Insert the data row
    cur.execute("insert into customer(fname,lname) values (?,?)",
                data)

This inserts a row of data into the customer table for each row in the file. If there are many rows in the file, the same insert statement will be submitted to the database many times over, just with different parameters, and that's not very efficient.

To execute the same query many times with different parameters, the DB-API provides the executemany() method. executemany() is given two arguments: the query to execute, and a sequence of parameter sets. That sequence could be a list, but as we know by now, that would use a lot of memory, so we will provide a generator instead. We'll replace the for loop with a generator definition like this:
def gen_data():
    f = open("names.unl")
    for line in f:
        data = line.split("|")
        yield data

This could be optimized by combining the last two lines into yield line.split("|"), but the more verbose form makes the relationship to the naive approach clearer. The code has the same structure, but instead of passing the data to the execute call, we "yield" the data to the consumer. The consumer is the following executemany() call:
cur.executemany(
    "insert into customer(fname,lname) values (?,?)",
    gen_data())

This approach has several speed advantages over the naive approach. For one, the loop under the hood in executemany is coded in C, which eliminates the overhead of interpreter-level looping and function calling. More importantly, executemany employs an insert cursor when it executes insert statements, which boosts performance even more.

In addition to providing a non-trivial example of using executemany, this example also illustrates a useful code pattern that I like to call generator transformation. gen_data consumes the lines that are generated by the file object f. It processes the lines by splitting them into columns, and then yields the results as a new generator, thus transforming a generator into another generator.

The concept of generator transformations is powerful because it improves your productivity by encouraging code reuse. Instead of having to write similar special-purpose generators over and over again, Python programmers often reuse existing general-purpose generators and transform them into special-purpose generators. In the last example we will see just how powerful this approach can be.
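To illustrate the pattern outside of a database context, here is a small self-contained sketch of the same transformation that gen_data performs. The sample lines are made up for the illustration; any iterable of strings would do, whether it's a file object, a list, or another generator:

```python
def split_records(lines, sep="|"):
    # Transform a generator of delimited text lines into a generator
    # of column lists -- the same pattern gen_data uses.
    for line in lines:
        yield line.rstrip("\n").split(sep)

# Made-up sample data standing in for the lines of names.unl
names = split_records(iter(["Ludwig|Pauli\n", "Carole|Sadler\n"]))
```

Because split_records is itself a generator, the lines are read and split lazily, one at a time, no matter how large the input is.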

The last example demonstrates how to create a simple grouped report. Making a grouped report is easy in a specialized language such as Informix 4GL, but in a 3GL this can be tedious: You need code to keep track of the group key and execute group header and footer code when the group key changes from one data row to the next. Python is no exception in the sense that the language itself does not have any special syntax for producing grouped reports. However, the necessary code for grouping data has already been written, and it is part of Python's standard library, where it can easily be reused.

To show how easy it is to create a grouped report, we will start with a flat non-grouped report which will be transformed into a grouped report later.
import informixdb

# Connect to the database
conn = informixdb.connect("stores_demo")

# Obtain a cursor that will return result rows as objects that
# allow "dot" notation for accessing columns by name.
cur = conn.cursor(rowformat=informixdb.ROW_AS_OBJECT)

# Execute the query for obtaining the report data.
cur.execute("""
    select c.customer_num, o.order_date, o.order_num
    from customer as c, orders as o
    where c.customer_num = o.customer_num
    order by c.customer_num, o.order_date
""")

# Print column headers
print "  Cust. No Order Date   Order No"
print "---------- ---------- ----------"

# Format and print each result row
for row in cur:
    row.format_date = row.order_date.strftime("%d/%m/%Y")
    print "%(customer_num)10d %(format_date)10s %(order_num)10d" \
        % row

This code snippet is a simple, yet complete Python program for producing a basic order history report. Note that we're using the cursor as a data generator with the for row in cur idiom that was introduced in the first example. We will now turn this flat report into a grouped report by transforming cur into a generator that generates groups of data.

This transformation will be performed by the groupby function in the itertools module that is part of Python's standard library. We simply give groupby the generator to transform and a callback function that determines the key by which the elements should be grouped. In return, groupby gives us the transformed generator that yields the desired groups. All the tedious work of keeping track of the group key and detecting group breaks is done under the hood in groupby.
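Before applying groupby to our cursor, here's what it does in isolation, using a short list of made-up (customer_num, order_num) tuples in place of a result set. Note that, just like the ORDER BY clause in our query, the input must already be sorted by the grouping key, because groupby only detects breaks between adjacent elements:

```python
import itertools

def orders_per_customer(rows):
    # groupby yields (key, group) pairs; each group is itself an
    # iterator over the consecutive rows that share that key.
    return [(key, [order for (_, order) in group])
            for key, group in itertools.groupby(rows, lambda row: row[0])]

# Made-up sample data, already sorted by customer_num
orders = [(101, 1001), (101, 1002), (104, 1003)]
```

Here orders_per_customer(orders) collects each group into a list so we can look at it, but in the report below we will simply iterate over each group as it is generated.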

Modifying the code example to make use of this transformation produces the following end result:
# Import the modules we need.
import informixdb, itertools

# Connect to the database, get a cursor, and execute the query
# as before
conn = informixdb.connect("stores_demo")
cur = conn.cursor(rowformat=informixdb.ROW_AS_OBJECT)

cur.execute("""
    select c.customer_num, o.order_date, o.order_num
    from customer as c, orders as o
    where c.customer_num = o.customer_num
    order by c.customer_num, o.order_date
""")

# Define the group key "callback" function
def get_customer_num(row):
    return row.customer_num

# Transform into grouping by that key
cust_grouping = itertools.groupby(cur, get_customer_num)

# Loop over the groups
for (group_key, group_contents) in cust_grouping:
    # Print the group header
    print "Customer Number:", group_key
    print
    print "Order Date   Order No"
    print "---------- ----------"
    # Format and print each row in the group
    for row in group_contents:
        row.format_date = row.order_date.strftime("%d/%m/%Y")
        print "%(format_date)10s %(order_num)10d" % row
    # Print the group footer
    print

Note that instead of iterating over cur directly, we now pass it into a call to itertools.groupby. The result, cust_grouping, is a new generator that consumes cur's data and yields pairs of group keys and group contents. The single for loop that iterated through the flat data has become a nested loop. The outer loop iterates through the groups, and the inner loop iterates through the data rows in each group.

If this is not mind-bending enough, consider this: group_contents, which is generated by cust_grouping, is also a generator. Therefore, we could apply any kind of generator transformation to it, such as calling groupby on it, which allows us to group the report into sub-groups, sub-sub-groups, and so on, to any desired nesting level.
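As a quick sketch of this idea, here is a two-level grouping using made-up (customer_num, order_year, order_num) tuples in place of result rows, first by customer and then by order year. As always with groupby, this assumes the data is already sorted by both keys:

```python
import itertools

def nested_groups(rows):
    # Group by customer, then sub-group each customer's rows by year.
    result = []
    for cust, cust_rows in itertools.groupby(rows, lambda r: r[0]):
        years = [(year, [r[2] for r in year_rows])
                 for year, year_rows in itertools.groupby(cust_rows, lambda r: r[1])]
        result.append((cust, years))
    return result

# Made-up sample data, sorted by customer_num and then by year
rows = [(101, 2006, 1001), (101, 2007, 1002),
        (101, 2007, 1003), (104, 2007, 1004)]
```

The inner groupby consumes the group iterator produced by the outer one, so each level of grouping is just another generator transformation.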

That example concludes this edition of informixdb.connect. It turned out a bit lengthy, but I think this was necessary to show how useful iterators in general and generators in particular can be. I promise the next edition will be shorter and less mentally taxing.