Sunday, July 22, 2007

Filling In The Blanks

When you develop a database application, sooner or later you will have to write queries in which parts depend on external inputs. These external inputs can come from a number of sources, including user input, results from other queries, or data from a configuration file. What they all have in common is that you can't predict at design time what this data is going to be. Unfortunately, many beginners don't know how to handle this problem properly, and they write code that performs poorly and can crash or allow unauthorized system access if the user supplies mischievous input.

Suppose for example you are developing an application that has access control based on a username and password. When the user supplies the username and password, you have to check this combination against a database table that stores the usernames and passwords of all authorized users. If you didn't know any better, you might use string formatting to build a query that contains the user-supplied data, along these lines:

def authenticate_user(conn, username, password):
    cur = conn.cursor()
    cur.execute("""
        select count(*) from authorized_users
        where username = '%s'
        and password = '%s'
        """ % (username, password))
    row = cur.fetchone()
    return (row[0] >= 1)

(The % operator is Python's equivalent of the sprintf() function in C.)

On the surface, this works, but this approach is bad for a number of reasons. For one, if the username or password contains an apostrophe, substituting them into the query will produce a syntax error. Worse, if a black-hat hacker enters ' or 1=1-- into the password field, he circumvents the verification logic and can access any user's account without knowing the password. This kind of attack is widely known as an SQL injection attack, and any application that fills user input into a query unchecked and unmodified is vulnerable to it.
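To make the danger concrete, here is a small illustration, using nothing but plain Python string formatting (no database needed), of what the query text becomes when that input is substituted. The username is a made-up value for illustration:

```python
# Reproduce the vulnerable query construction from the example above.
username = "alice"
password = "' or 1=1--"

query = """
select count(*) from authorized_users
where username = '%s'
and password = '%s'
""" % (username, password)

print(query)
# The password check has become:  password = '' or 1=1--'
# "or 1=1" is always true, and "--" comments out the dangling apostrophe,
# so the where clause matches every row in the table.
```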

You could take measures against an SQL injection attack by quoting apostrophes that appear in the user's input, or by rejecting suspicious inputs, but you have to remember to do this for every single query, which can become tedious, and there may be ways to exploit your query that you haven't thought of. Schneier's Law applies here: "Any person can invent a security system so clever that she or he can't think of how to break it."
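For illustration only, here is what such a quoting helper might look like (the function name is made up; this is a sketch of the manual approach, not something InformixDB provides). It doubles embedded apostrophes, which defuses the example attack above, but you would have to remember to route every single input through it:

```python
def quote_sql_literal(value):
    # Naive manual escaping: double any embedded apostrophes and wrap
    # the result in quotes. Shown only to illustrate how tedious and
    # error-prone this approach is compared to parameter placeholders.
    return "'" + value.replace("'", "''") + "'"

quote_sql_literal("' or 1=1--")
# returns "''' or 1=1--'" -- the apostrophe is now an escaped literal
```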

Even if you could develop an ironclad protection against SQL injection attacks, another problem remains: Every time the query is executed with different inputs, the database sees a different query. This means that the database has to go through the hard work of parsing the query and devising a query plan every time the query is executed. If the query is executed often enough, this will have a measurable negative impact on your application's performance.

An Informix-specific drawback of the string formatting approach is that you need to plug in a literal value for the supplied data. Certain data types, notably simple large objects, cannot be represented by literal values, so you have no way of passing a simple large object value to a query using the string formatting approach.

The authors of the SQL standard realized that application developers need a reliable way to execute queries with variable inputs, and they agreed on a special syntax for designating so-called parameters in SQL statements. This syntax uses question marks as placeholders for parameters, and the actual values for parameters are supplied separately at execution time.

In Python with the InformixDB module, that looks like this:

def authenticate_user(conn, username, password):
    cur = conn.cursor()
    cur.execute("""
        select count(*) from authorized_users
        where username = ?
        and password = ?
        """, (username, password))
    row = cur.fetchone()
    return (row[0] >= 1)

Note that there are no quotation marks around the question marks. If there were, we'd be comparing the username and password to the character string containing one question mark, which is not what we want. Instead, the question marks signal to the database that they represent variable input parameters that are supplied separately.

There are two significant differences in the code between this approach and the first approach. The first difference is the use of question marks instead of '%s'. The second difference is a bit harder to spot. In the string-formatting approach, we passed one argument to the execute() method, namely, the result of substituting the tuple of username and password into the "command template" with the % operator. In the question mark approach, we're passing two arguments to the execute() method. The first argument is the query string with parameter markers. The second argument is the tuple of username and password, which is supplied to the database as the actual parameter values for the query.

Using parameter placeholders eliminates both the performance problems and the stability and security problems of the string-formatting approach. The danger of SQL injections is rendered moot, because the parameters are never actually plugged into the query. This doesn't necessarily mean that your application is 100% secure, because the underlying database might have vulnerabilities such as buffer overflow exploits, but as long as your database engine is kept up to date on security patches, an application with parameter placeholder queries is much harder to break than an application that uses string formatting.

Using parameter placeholders improves performance since the parameter contents are supplied separately, so the database always sees the same query every time this query is executed. This means that the database can skip the costly step of parsing and planning the query for repeated executions of the same query. To illustrate the amount of improvement that can be achieved, consider the following timing experiment:

# querytest.py
class Tester(object):
    def __init__(self):
        import informixdb
        conn = informixdb.connect("ifxtest")
        self.cur = conn.cursor()
        self.cur.execute("create temp table t1(a int, b int)")
        self.counter = 0
    def with_params(self):
        self.counter += 1
        self.cur.execute("insert into t1 values(?,?)",
                         (self.counter, self.counter*2))
    def without_params(self):
        self.counter += 1
        self.cur.execute("insert into t1 values(%s,%s)" %
                         (self.counter, self.counter*2))

This sets up a Tester class that creates a temporary table and supplies methods for inserting rows into this table, one using string formatting and one using parameter placeholders.

We then invoke the timeit module to measure how long repeated calls to those methods take:

$ python -mtimeit -s "from querytest import Tester; t=Tester()" \
't.without_params()'
1000 loops, best of 3: 415 usec per loop
$ python -mtimeit -s "from querytest import Tester; t=Tester()" \
't.with_params()'
10000 loops, best of 3: 150 usec per loop

This shows that, on my computer, filling in the blanks using parameter placeholders speeds up this test by a factor of almost 3. Add the speed improvements to the security improvements and the choice between string formatting and parameter placeholders becomes a no-brainer.

In addition to the standard question mark placeholders, InformixDB allows two other styles of placeholders. The following examples show how alternative placeholder styles can be used to improve the readability of your application.

The first alternative style is referred to as numeric. With numeric style, parameter placeholders are a colon followed by a number:

def authenticate_user(conn, username, password):
    cur = conn.cursor()
    cur.execute("""
        select count(*) from authorized_users
        where username = :1
        and password = :2
        """, (username, password))
    row = cur.fetchone()
    return (row[0] >= 1)

The advantage of this style is that you can rearrange the query without having to rearrange the parameters. Also, if the same parameter value appears multiple times in the query, you can repeat the same placeholder instead of having to supply the same parameter value twice:

# question marks:
cur.execute("""
    select * from orders
    where order_date between ? and ? + 1 units month
    """, (some_date, some_date))

# numeric:
cur.execute("""
    select * from orders
    where order_date between :1 and :1 + 1 units month
    """, (some_date,))

While this is nice, the code would be even more readable if you could give the parameters meaningful names. That's possible with the parameter style called "named":

def authenticate_user(conn, username, password):
    cur = conn.cursor()
    cur.execute("""
        select count(*) from authorized_users
        where username = :username
        and password = :password
        """, {'username': username, 'password': password})
    row = cur.fetchone()
    return (row[0] >= 1)

Note that this requires that the second argument be a dictionary that maps the parameter names to their corresponding values. Writing out this dictionary is tedious if you have a lot of parameters, which is why I use a neat trick instead. InformixDB allows the parameter dictionary to contain more keys than it needs, as long as it contains at least the keys that correspond to the parameter names in the query. If your parameter placeholders have the same names as the local variables that supply the corresponding values, which is a good idea anyway in the interest of writing self-documenting code, you can use the locals() function to build the parameter dictionary:

def authenticate_user(conn, username, password):
    cur = conn.cursor()
    cur.execute("""
        select count(*) from authorized_users
        where username = :username
        and password = :password
        """, locals())
    row = cur.fetchone()
    return (row[0] >= 1)

Any similarities of this last example to Informix 4GL host variables are entirely intentional, since simulating host variables was my main motivation when I added the ability to handle named parameters to InformixDB.

In case you're wondering, you don't have to specify which parameter style you're using. The execute() method detects automatically which style you're using, and you can switch between styles as long as you don't mix different styles in the same query.

As a final note I'd like to mention that InformixDB needs to parse the query to translate numeric and named placeholders to the native question marks. This takes time, so you will incur a small performance hit. I haven't measured this performance hit, but it should be much smaller than the performance gain from using parameter placeholders in the first place.

Also note that the placeholder parser is not a full SQL parser, so it can be fooled into seeing parameter placeholders where there aren't any, as in the following example by the evil genius of Jonathan Leffler: select * from somedb :sometable will cause the parser to think that :sometable is a parameter placeholder. To prevent this kind of false positive, you need to be careful when you write queries involving remote tables. Either put spaces on both sides of the colon or use no spaces at all, and the parser will understand that the colon is not part of a parameter placeholder.

I hope that you have enjoyed this edition of informixdb.connect and that you will be back for the next installment.

Friday, April 27, 2007

The Power of Generators, Part Two

In the previous installment of informixdb.connect, I demonstrated that generators are a powerful tool for dealing with arbitrarily large sets of sequential data. In this installment, I will show some concrete examples of leveraging this power in Informix applications with the InformixDB module.

By the way, this blog does not pretend to be an introduction to Python. Python is a very readable and mostly straightforward language, and the code examples in this blog should be easy enough to follow if you don't know Python. If you are interested in learning Python, one good place to start is the Python Tutorial. For more information on developing Informix applications in Python, the InformixDB documentation is a useful resource.

The first example shows how InformixDB uses generators to offer a natural way for dealing with query results. This makes sense because generators allow you to handle arbitrarily large sets of sequential data, and that's precisely what you get from a database query. Consider the following code for connecting to a database and executing a select query:
import informixdb

# Establish a database connection
conn = informixdb.connect("stores_demo")

# Ask the connection for a cursor object
cur = conn.cursor()

# Execute a query
cur.execute("select fname, lname from customer")

At this point, we have executed the query, but we haven't fetched any of its result rows. One possibility is to fetch all result rows into a list and loop through that list:
all_rows = cur.fetchall()
for row in all_rows:
    print row[0], row[1]

This works, but it has the same drawbacks as the list_fibonacci example from part one. If the result set is large, reading it all at once will use a lot of memory. Also, all rows have to be fetched before any of the rows can be processed.

The memory and time overhead can be avoided by explicitly fetching and processing individual rows:
while True:
    row = cur.fetchone()
    if row is None: break
    print row[0], row[1]

Using a result set generator would give us the best of both worlds: We get to use a more concise "for" loop without incurring the overhead of fetching all rows into a list. Fortunately, such a generator is trivial to obtain with InformixDB, because the cursor object itself acts as a generator that yields all of its result rows. Therefore, the following piece of code does exactly what we want:
for row in cur:
    print row[0], row[1]

The above "for" statement will fetch the rows from "cur" one by one and execute the loop body for each fetched row. If you've programmed in SPL or Informix-4GL before, this concept should feel familiar and natural; it's Python's equivalent of a FOREACH statement.

That example showed the very common case of InformixDB generating a sequence that is consumed by our application code. In the next example, we will turn the tables and generate a sequence that InformixDB consumes. To see how this can be useful, we'll consider the following naive data loader:
# Get a database connection and a cursor
import informixdb
conn = informixdb.connect("stores_demo")
cur = conn.cursor()

# Open a file
f = open("names.unl")
# Loop through the lines of the file
for line in f:
    # Split into columns
    data = line.split("|")
    # Insert the data row
    cur.execute("insert into customer(fname,lname) values (?,?)",
                data)

This inserts a row of data into the customer table for each row in the file. If there are many rows in the file, the same insert statement will be submitted to the database many times over, just with different parameters, and that's not very efficient.

To execute the same query many times with different parameters, the DB-API provides the executemany() method. executemany() is given two arguments: the query to execute, and a sequence of parameter sets. That sequence could be a list, but as we know by now, that would use a lot of memory, so we will provide a generator instead. We'll replace the for loop with a generator definition like this:
def gen_data():
    f = open("names.unl")
    for line in f:
        data = line.split("|")
        yield data

This could be optimized by combining the last two lines into yield line.split("|"), but the more verbose form makes the relationship to the naive approach clearer. The code has the same structure, but instead of passing the data to the execute call, we "yield" the data to the consumer. The consumer is the following executemany() call:
cur.executemany(
    "insert into customer(fname,lname) values (?,?)",
    gen_data())

This approach has several speed advantages over the naive approach. For one, the loop under the hood in executemany is coded in C, which eliminates the overhead of interpreter-level looping and function calling. More importantly, executemany employs an insert cursor when it executes insert statements, which boosts performance even more.

In addition to providing a non-trivial example of using executemany, this example also illustrates a useful code pattern that I like to call generator transformation. gen_data consumes the lines that are generated by the file object f. It processes the lines by splitting them into columns, and then yields the results as a new generator, thus transforming a generator into another generator.
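The same pattern can be shown without a database at all. In this self-contained sketch (the function names are made up for illustration), a source generator is wrapped in a transforming generator, just like the file object f is wrapped by gen_data:

```python
def numbers():
    # Source generator; stands in for a file object or a cursor.
    for n in (1, 2, 3):
        yield n

def doubled(source):
    # Generator transformation: consume values from another generator,
    # process each one, and yield the results lazily.
    for n in source:
        yield n * 2

result = list(doubled(numbers()))
# result is [2, 4, 6]; no intermediate list is ever built by doubled()
```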

The concept of generator transformations is powerful because it improves your productivity by encouraging code reuse. Instead of having to write similar special-purpose generators over and over again, Python programmers often reuse existing general-purpose generators and transform them into special-purpose generators. In the last example we will see just how powerful this approach can be.

The last example demonstrates how to create a simple grouped report. Making a grouped report is easy in a specialized language such as Informix 4GL, but in a 3GL language this can be tedious: You need code to keep track of the group key and execute group header and footer code when the group key changes from one data row to the next. Python is no exception in the sense that the language itself does not have any special syntax for producing grouped reports. However, the necessary code for grouping data has already been written, and it is part of Python's standard library that can easily be reused.

To show how easy it is to create a grouped report, we will start with a flat non-grouped report which will be transformed into a grouped report later.
import informixdb

# Connect to the database
conn = informixdb.connect("stores_demo")

# Obtain a cursor that will return result rows as objects that
# allow "dot" notation for accessing columns by name.
cur = conn.cursor(rowformat=informixdb.ROW_AS_OBJECT)

# Execute the query for obtaining the report data.
cur.execute("""
    select c.customer_num, o.order_date, o.order_num
    from customer as c, orders as o
    where c.customer_num = o.customer_num
    order by c.customer_num, o.order_date
    """)

# Print column headers
print "Cust. No Order Date Order No"
print "---------- ---------- ----------"

# Format and print each result row
for row in cur:
    row.format_date = row.order_date.strftime("%d/%m/%Y")
    print "%(customer_num)10d %(format_date)10s %(order_num)10d" \
        % row

This code snippet is a simple, yet complete Python program for producing a basic order history report. Note that we're using the cursor as a data generator with the for row in cur idiom that was introduced in the first example. We will now turn this flat report into a grouped report by transforming cur into a generator that generates groups of data.

This transformation will be performed by the groupby function in the itertools module that is part of Python's standard library. We simply give groupby the generator to transform and a callback function that determines the key by which the elements should be grouped. In return, groupby gives us the transformed generator that yields the desired groups. All the tedious work of keeping track of the group key and detecting group breaks is done under the hood in groupby.

Modifying the code example to make use of this transformation produces the following end result:
# Import the modules we need.
import informixdb, itertools

# Connect to the database, get a cursor, and execute the query
# as before
conn = informixdb.connect("stores_demo")
cur = conn.cursor(rowformat=informixdb.ROW_AS_OBJECT)

cur.execute("""
    select c.customer_num, o.order_date, o.order_num
    from customer as c, orders as o
    where c.customer_num = o.customer_num
    order by c.customer_num, o.order_date
    """)

# Define the group key "callback" function
def get_customer_num(row):
    return row.customer_num

# Transform into grouping by that key
cust_grouping = itertools.groupby(cur, get_customer_num)

# Loop over the groups
for (group_key, group_contents) in cust_grouping:
    # Print the group header
    print "Customer Number:", group_key
    print
    print "Order Date Order No"
    print "---------- ----------"
    # Format and print each row in the group
    for row in group_contents:
        row.format_date = row.order_date.strftime("%d/%m/%Y")
        print "%(format_date)10s %(order_num)10d" % row
    # Print the group footer
    print

Note that instead of iterating over cur directly, we now pass it into a call to itertools.groupby. The result, cust_grouping, is a new generator that consumes cur's data and yields pairs of group keys and group contents. The single for loop that iterated through the flat data has become a nested loop. The outer loop iterates through the groups, and the inner loop iterates through the data rows in each group.

If this is not mind-bending enough, consider this: group_contents, which is generated by cust_grouping, is also a generator. Therefore, we could apply any kind of generator transformation to it, such as calling groupby on it, which allows us to group the report into sub-groups, sub-sub-groups, and so on, to any desired nesting level.
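Here is a self-contained sketch of that idea, using plain tuples of made-up data instead of database rows, grouping first by customer and then by year within each customer:

```python
import itertools

# Pre-sorted sample data: (customer, year, order) tuples stand in for
# the report rows; the values are made up for illustration.
rows = [
    (101, 2006, "A"), (101, 2006, "B"), (101, 2007, "C"),
    (102, 2007, "D"),
]

report = []
for cust, cust_rows in itertools.groupby(rows, lambda r: r[0]):
    # cust_rows is itself a generator, so it can be grouped again
    # to produce sub-groups by year.
    for year, year_rows in itertools.groupby(cust_rows, lambda r: r[1]):
        report.append((cust, year, [r[2] for r in year_rows]))

# report is:
# [(101, 2006, ['A', 'B']), (101, 2007, ['C']), (102, 2007, ['D'])]
```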

That example concludes this edition of informixdb.connect. It turned out a bit lengthy, but I think this was necessary to show how useful iterators in general and generators in particular can be. I promise the next edition will be shorter and less mentally taxing.