Sunday, July 22, 2007

Filling In The Blanks

When you develop a database application, sooner or later you will have to write queries in which parts depend on external inputs. These external inputs can come from a number of sources, including user input, results from other queries, or data from a configuration file. What they all have in common is that you can't predict at design time what this data is going to be. Unfortunately, many beginners don't know how to handle this problem properly, and they write code that performs poorly and can crash or allow unauthorized system access if the user supplies mischievous input.

Suppose for example you are developing an application that has access control based on a username and password. When the user supplies the username and password, you have to check this combination against a database table that stores the usernames and passwords of all authorized users. If you didn't know any better, you might use string formatting to build a query that contains the user-supplied data, along these lines:

def authenticate_user(conn, username, password):
cur = conn.cursor()
cur.execute("""
select count(*) from authorized_users
where username = '%s'
and password = '%s'
""" % (username, password) )
row = cur.fetchone()
return (row[0]>1)

(The % operator is Python's equivalent of the sprintf() function in C.)

On the surface, this works, but this approach is bad for a number of reasons. For one, if the username or password contains an apostrophe, substituting them into the query will produce a syntax error. Worse, if a black-hat hacker enters ' or 1=1-- into the password field, he circumvents the verification logic and he can access any user's account without knowing the password. This kind of attack is widely known as SQL Injection Attack, and any application that fills user input into a query unchecked and unmodified is vulnerable to such an attack.

You could take measures against an SQL injection attack by quoting apostrophes that appear in the user's input, or by rejecting suspicious inputs, but you have to remember to do this for every single query, which can become tedious, and there may be ways to exploit your query that you haven't thought of. Schneier's Law applies here: "Any person can invent a security system so clever that she or he can't think of how to break it."

Even if you could develop an ironclad protection against SQL injection attacks, another problem remains: Every time the query is executed with different inputs, the database sees a different query. This means that the database has to go through the hard work of parsing the query and devising a query plan every time the query is executed. If the query is executed often enough, this will have a measurable negative impact on your application's performance.

An Informix-specific drawback of the string formatting approach is that you need to plug in a literal value for the supplied data. Certain data types, notably simple large objects, can not be represented by literal values, so you have no way of passing a simple large object values to a query using the string formatting approach.

The authors of the SQL standard realized that application developers need a reliable way to execute queries with variable inputs, and they agreed on a special syntax for designating so-called parameters in SQL statements. This syntax uses question marks as placeholders for parameters, and the actual values for parameters are supplied separately at execution time.

In Python with the InformixDB module, that looks like this:

def authenticate_user(conn, username, password):
cur = conn.cursor()
cur.execute("""
select count(*) from authorized_users
where username = ?
and password = ?
""", (username, password) )
row = cur.fetchone()
return (row[0]>1)

Note that there are no quotation marks around the question marks. If there were, we'd be comparing the username and password to the character string containing one question mark, which is not what we want. Instead, the question marks signal to the database that they represent variable input parameters that are supplied separately.

There are two significant differences in the code between this approach and the first approach. The first difference is the use of question marks instead of '%s'. The second difference is a bit harder to spot. In the string-formatting approach, we passed one argument to the execute() method, namely, the result of substituting the tuple of username and password into the "command template" with the % operator. In the question mark approach, we're passing two arguments to the execute() method. The first argument is the query string with parameter markers. The second argument is the tuple of username and password, which is supplied to the database as the actual parameter values for the query.

Using parameter placeholders eliminates both the performance problems and the stability and security problems of the string-formatting approach. The danger of SQL injections is rendered moot, because the parameters are never actually plugged into the query. This doesn't necessarily mean that your application is 100% secure, because the underlying database might have vulnerabilities such as buffer overflow exploits, but as long as your database engine is kept up to date on security patches, an application with parameter placeholder queries is much harder to break than an application that uses string formatting.

Using parameter placeholders improves performance since the parameter contents are supplied separately, so the database always sees the same query every time this query is executed. This means that the database can skip the costly step of parsing and planning the query for repeated executions of the same query. To illustrate the amount of improvement that can be achieved, consider the following timing experiment:

# querytest.py
class Tester(object):
def __init__(self):
import informixdb
conn = informixdb.connect("ifxtest")
self.cur = conn.cursor()
self.cur.execute("create temp table t1(a int, b int)")
self.counter = 0
def with_params(self):
self.counter += 1
self.cur.execute("insert into t1 values(?,?)",
(self.counter,self.counter*2) )
def without_params(self):
self.counter += 1
self.cur.execute("insert into t1 values(%s,%s)" %
(self.counter,self.counter*2) )

This sets up a Tester class that creates a temporary table and supplies methods for inserting rows into this table, one using string formatting and one using parameter placeholders.

We then invoke the timeit module to measure how long repeated calls to those methods take:

$ python -mtimeit -s "from querytest import Tester; t=Tester()" \
't.without_params()'
1000 loops, best of 3: 415 usec per loop
$ python -mtimeit -s "from querytest import Tester; t=Tester()" \
't.with_params()'
10000 loops, best of 3: 150 usec per loop

This shows that, on my computer, filling in the blanks using parameter placeholders speeds up this test by a factor of almost 3. Add the speed improvements to the security improvements and the choice between string formatting and parameter placeholders becomes a no-brainer.

In addition to the standard question mark placeholders, InformixDB allows two other styles of placeholders. The following examples show how alternative placeholder styles can be used to improve the readability of your application.

The first alternative style is referred to as numeric. With numeric style, parameter placeholders are a colon followed by a number:

def authenticate_user(conn, username, password):
cur = conn.cursor()
cur.execute("""
select count(*) from authorized_users
where username = :1
and password = :2
""", (username, password) )
row = cur.fetchone()
return (row[0]>1)

The advantage of this style is that you can rearrange the query without having to rearrange the parameters. Also, if the same parameter value appears multiple times in the query, you can repeat the same placeholder instead of having to supply the same parameter value twice:

# question marks:
cur.execute("""
select * from orders
where order_date between ? and ? + 1 units month
""", (some_date, some_date) )

# numeric:
cur.execute("""
select * from orders
where order_date between :1 and :1 + 1 units month
""", (some_date,) )

While this is nice, the code would be even more readable if you could give the parameters meaningful names. That's possible with the parameter style called "named":

def authenticate_user(conn, username, password):
cur = conn.cursor()
cur.execute("""
select count(*) from authorized_users
where username = :username
and password = :password
""", {'username':username, 'password':password} )
row = cur.fetchone()
return (row[0]>1)

Note that this requires that the second argument be a dictionary that maps the parameter names to their corresponding values. Writing out this dictionary is tedious if you have a lot of parameters, which is why I use a neat trick instead. InformixDB allows the parameter dictionary to contain more keys than it needs, as long as it contains at least the keys that correspond to the parameter names in the query. If your parameter placeholders have the same names as the local variables that supply the corresponding values, which is a good idea anyway in the interest of writing self-documenting code, you can use the locals() function to build the parameter dictionary:

def authenticate_user(conn, username, password):
cur = conn.cursor()
cur.execute("""
select count(*) from authorized_users
where username = :username
and password = :password
""", locals() )
row = cur.fetchone()
return (row[0]>1)

Any similarities of this last example to Informix 4GL host variables are entirely intentional, since simulating host variables was my main motivation when I added the ability to handle named parameters to InformixDB.

In case you're wondering, you don't have to specify which parameter style you're using. The execute() method detects automatically which style you're using, and you can switch between styles as long as you don't mix different styles in the same query.

As a final note I'd like to mention that InformixDB needs to parse the query to translate numeric and named placeholders to the native question marks. This takes time, so you will incur a small performance hit. I haven't measured this performance hit, but it should be much smaller than the performance gain from using parameter placeholders in the first place.

Also note that the placeholder parser is not a full SQL parser, so it can be fooled into seeing parameter placeholders where there aren't any, as in the following example by the evil genius of Jonathan Leffler: select * from somedb :sometable will cause the parser to think that :sometable is a parameter placeholder. To prevent this kind of false positive, you need to be careful when you write queries involving remote tables. Either put spaces on both sides of the colon or use no spaces at all, and the parser will understand that the colon is not part of a parameter placeholder.

I hope that you have enjoyed this edition of informixdb.connect and that you will be back for the next installment.

15 comments:

grendal said...

Hey!
The named parameters is probably the best form since it appears to be a "standard" across multiple databases.

Love that use of locals(). My solution was to pass a dictionary in as one of my parameters and use the values as needed.

Carsten Haese said...

"Standard" is to be taken with a grain of salt. sqlite supports named, as does cx_Oracle, although I heard cx_Oracle doesn't allow the dictionary to be over-specified. If that's true, the locals() trick goes out the window.

The DB-SIG mailing list has recently reached a consensus for making named parameter support mandatory in a future version of the API spec. Until that happens, you should always check with the documentation of the API in question which parameter style(s) it supports.

Tomade said...

Great article, Carsten. Thanks.

Pajton said...

Regarding reading sblob object, here is the code example:

import informixdb
import datetime
def main(argv=None):
if argv is None:
argv = sys.argv

aud = audit()
aud.init_connection()
aud.analiza()
aud.close_connection()
class audit():
def init_connection(self):
print 'init connection'
audit.conn = informixdb.connect('database_name@instance'
,user='a'
,password='a')
audit.cursor = audit.conn.cursor()
return 1
def analiza(self):
print 'analiza'
audit.cursor.execute("""select message from poll_manager
where appid='user_name'
and to_char (time, '%Y%m%d')>='20080112'
and to_char (time, '%Y%m%d')<'20080113'""")
hl7 = audit.cursor.fetchone()[0]
hl7.open()
f_hl7 = hl7.read(10000)
print f_hl7
def connection(self):
print 'close connection'
self.cursor.close()
self.conn.close()
if __name__ == '__main__':
main()

OctopusGrabbus said...

Is there an example of passing variables, instead of constants, into the connect statement?

Carsten Haese said...

@OctopusGrabbus: I'm not sure I understand your question. The "connect statement" is a regular Python function call. The function just needs string objects for the database name, username, and password. Whether those string objects are literal strings or names for string objects you've read from some configuration file doesn't matter.

Here's a concrete example of reading those values from an INI file:

#####################################
import informixdb
from ConfigParser import ConfigParser

cp = ConfigParser()
cp.read("database.ini")
dbname = cp.get("Database", "dbname")
user = cp.get("Database", "username")
passwd = cp.get("Database", "password")
conn = informixdb.connect(dbname, user, passwd)
#######################################

HTH,

Carsten.

OctopusGrabbus said...

Will Python 3 work with informixdb? If so, what environment variables do I need to set to use informixdb with Python 3?

tnx

OctopusGrabbus said...

Carstem: I believe my problems with passing variables has to do with using Python 3 and my probably not having set up something correctly.

OctopusGrabbus said...

Sorry for misspelling Carsten; key slipped.

Carsten Haese said...

No, informixdb does not support Python 3 yet. I started a branch of the code for Python 3 back when Python 3 was in alpha, but it'll probably need some work to play nice with the final release of Python 3, and I haven't had the necessary time to make that happen.

OctopusGrabbus said...

Thanks for the update. Again, the code is excellent, and I can live for now with python 3 to achieve what I need.

OctopusGrabbus said...

I have a fully formed "row" ready to insert into a 25-column table. Do I have to list out each column, or is there a broader form of the insert?

Carsten Haese said...

@OctopusGrabbus: If the row object is a dictionary (or dictionary-like object), you can use named parameters in your query and simply pass the row object as the second argument to the execute() function. Send me an email if you want to see an example.

Cristian Quagliozzi said...

Hi, sorry by my poor english. I've a serious problem installing informixDB in Debian Squeeze. I guess that I need ESQL/C installed too, isn't? When I want to run python setup.py build_ext it throw an error saying to me that "don't find esql". Can you help me please?

Carsten Haese said...

esql/c is part of the Informix client SDK (CSDK). You need to have the CSDK installed in order to compile InformixDB.