Home
entries friends calendar user info Scribbles in the Dark
Handwriting on the Sky
stranger than the night where black stars rise
glyf
Add to Memories
Tell a Friend
I didn't put the full title of this article into the "subject" because it was super long.

It Is Pretty Much a Bad Idea to Expose Raw SQL Through Your Database Access Layer

or

Fun Things I Found Out About Your Company With Administrator Access to your Database

I was originally inspired to write something about this when I read Jonathan Ellis remarking that ORMs should include direct SQL access because Django recently added a different, but still dodgy, 'OR'-operator support to a syntactically disappointing ORM query syntax. However, it really came to a boil for me when I found that Divmod had been running some third-party software with an SQL injection vulnerability. (Yes, we have since patched it, no harm done.)

Security experts have long known that code-injection attacks are pretty easy on many popular programming platforms, and you should take steps to prevent them. It's easy to find commentary on this. Stephen Thorne has had many amusing and insightful things to say about PHP's vulnerability to SQL injection attacks, as well as the occasional dig about just injecting PHP code itself. If you're looking for something more serious, Steve Friedl has written a fairly comprehensive guide to understanding, executing, and preventing against SQL injection.

An ORM's job is to provide an alternative interface to a database. Interfaces should be complete things, not broken fragments of utility which require manual crank-turning to function. If you have to use a different mechanism to access the database within your application, the ORM is incomplete and should be fixed. Sure, many programmers who use ORMs also know SQL, and that is a useful skill, because today these are in closely related problem domains, but they should not have to use SQL within the same context that is using the ORM.

The python "os" module provides (among other things) an alternative interface to a large portion of the POSIX C API. As I said, interfaces should be complete things. If you use a different mechanism to access the POSIX C API, the "os" module is incomplete and should be fixed. Again, sure: many programmers who use Python's "os" module also know the POSIX C API, and that is a useful skill, because these are related problem domains, but they should not have to use the POSIX C API within the same context that is using the "os" module.

In both cases, you can generate the underlying code yourself, and in both cases, people sometimes really need to, so the fact that you can is important. However, Few C programmers ever want to drop back down to C when they're using Python, and will rightly avoid it (as a complexity cost) when they can; yet many Python programmers who use ORMs frequently and loudly declare that they want to use SQL all the time.

Not to pick on Mr. Ellis. The syntax he's reacting to really is abhorrent (although that's no comment on Django as a whole), and the tremendous Ruby on Rails movement seems to largely agree with his point. There are good reasons that people want SQL access from within ORMs. It's simple: most ORMs are really, really awful. They are heavy on the "object" and not so much on the "database". There are a lot of features that SQL provides which they don't expose.

If you use an open-source ORM and find yourself wanting to use SQL to get at one of those features, consider not clamoring for (or not using, if one exists) an SQL execution back-door included in the library. Instead, consider ways to integrate that SQL feature with the existing structure of the ORM, or an extension which wraps that SQL feature on top of the ORM and only generates the SQL you need in one small place. Obviously this goes double if you are a user of Axiom - I do my best to accept any patches that expose new database features that were previously obscured.

Don't just accept the status quo and generate SQL strings from within your application. Originally this post was going to be longer and talk about API structure and communication between programmers and preconditions and postconditions and all kinds of fancy computer-science garbage, but I think it would be better to leave you with just this one thought - the security implications alone are more than enough reason to be extremely sparing, and careful, with the places that your code generates SQL. Isolate it, test it, audit it, and don't make it a habit.

Update: This article is confusingly titled. In fact, Axiom does provide an API for getting at SQL. Store.executeSQL. The point I am stressing here is that axiom does not "expose" it in that it is not a supported, public API, and if you have to call it, Axiom is broken and you should let Divmod know what you needed it for. To attempt to totally deny access to that layer would be unwise; as I said earlier in the article, "you can generate the underlying code yourself, and in both cases, people sometimes really need to, so the fact that you can is important".

Tags:
Current Music: Every Planet We Reach Is Dead (by Gorillaz on "Demon Days")

glyf
Add to Memories
Tell a Friend
Mr. Bicking wants to change his default encoding. Since there is some buzz about this I figure it would be a good opportunity to answer something that has already emerged as a FAQ during Axiom's short life, about its treatment of strings.

Axiom does not have strings. It has 2 attribute types that look suspiciously like strings: text() and bytes().

However, 'text()' does not convert a Python str to text for you, and never, ever will. This is not an accident, and it is not because guessing at this sort of automatic conversion is hard. Lots of packages do it, including Python - str(unicode(x)) does do something, after all.

However, in my mind, that is an unfortunate coincidence, and I avoid using the default encoding anywhere I can. Let me respond directly to part of his post, point-by-point:
Are people claiming that there should be no default encoding?
That's what I would say, yes. The default encoding is a process-global variable that sets you up for a lot of confusion, since encoding is always context and data-type dependent. Occasionally I get lazy and use the default encoding, since I know that regardless of what it is it probably has ASCII as a subset (and I know that my data is something like an email address or a URL which functionally must be ASCII), but this is not generally good behavior.
As long as we have non-Unicode strings, I find the argument less than convincing, and I think it reflects the perspective of people who take Unicode very seriously, as compared to programmers who aren't quite so concerned but just want their applications to not be broken; and the current status quo is very deeply broken.
I believe that in the context of this discussion, the term "string" is meaningless. There is text, and there is byte-oriented data (which may very well represent text, but is not yet converted to it). In Python types, Text is unicode. Data is str. The idea of "non-Unicode text" is just a programming error waiting to happen.

The fact that English text, the sort that programmers commonly use to converse with, code with, identify network endpoints with and test program input with, looks very similar in its decoded and encoded forms, is an unfortunate and misleading phenomenon. It means that programs are often very confused about what kind of data they are processing but appear to work anyway, and make serious errors only when presented with input which differs in encoded and decoded form.

SQLite unfortunately succumbs to this malady as well, although at least they tried. Right now we are using its default COLLATE NOCASE for case-insensitive indexing and searches. This is defined according to the docs as "The same as binary, except the 26 upper case characters used by the English language are folded to their lower case equivalents before the comparison is performed." Needless to say, despite SQLite's pervasive use of Unicode throughout the database, that is not how you case-insensitively compare Unicode strings.

Using the default encoding and Unicode only worsens this. Now the program appears to work, and may in fact be correct in the face of non-English, or even non-human-language input, but breaks randomly and mangles data when moved to a different host environment with a different locally-specified default encoding. "Everybody use UTF-8" isn't a solution either; forgetting the huge accidental diversity in this detail of configuration, In Asian countries especially, the system's default encoding implies certain things to a lot of different software. It would be extremely unwise to force your encoding choice upon everyone else.

I don't think that Ian has an entirely unreasonable position; the only reason I know anything about Unicode at all was that I was exposed to a lot of internationalization projects during my brief stint in the game industry, and mostly on projects that had taken multilingual features into account from the start.

The situation that I describe, where text and bytes are clearly delineated and never the twain shall meet, is a fantasy-land sort of scenario. Real-world software still handles multilingual text very badly, and encoding and decoding properly within your software does no good and is a lot of extra work when you're interfacing with a system that only deals with code points 65-90. Forcing people to deal with this detail is often viewed as arrogance on the part of the system designer, and in many scenarios the effort is wasted because the systems you're interfacing with are already broken.

Still, I believe that forcing programmers to consider encoding issues whenever they have to store some text is a very useful exercise, since otherwise - this is important - foreign language users may be completely unable to use your application. What is to you simply a question-mark or box where you expected to see an "é" is, to billions of users the world over, a page full of binary puke where they expected to see a letter they just typed. Even pure English users can benefit: consider the difference between and . Finally, if you are integrating with a crappy, non-Unicode-aware system (or a system that handles Unicode but extremely poorly) you can explicitly note the nature of its disease and fail before passing it data outside the range (usually ASCII) that you know it can handle.

Consider the other things that data - regular python 'str' objects - might represent. Image data, for example. If there were a culture of programmers that expected image data to always be unpacked 32-bit RGBA byte sequences, it would be very difficult to get the Internet off the ground; image formats like PNG and JPEG have to be decoded before they are useful image data, and it is very difficult to set a 'system default image format' and have them all magically decoded and encoded properly. If we did have sys.defaultimageformat, or sys.defaultaudiocodec, we'd end up with an upsetting amount of multi-color snow and shrieking noise on our computers.

That is why Axiom does not, will not, and can not, automatically decode and encode your strings for you. Your string could be a chunk of oscilloscope data, and there is no Unicode encoding for that. If you need to store it, store it unencoded, as data, and load it and interpret it later. There are good reasons why people use different audio and image codecs; there are perhaps less good, but nevertheless valid reasons why people use different Unicode codecs.

To avoid a similar common kind of error, I don't think that Axiom is going to provide a 'float' type before we've implemented a 'money' type - more on why money needs to be encoded and decoded just like Unicode in my next installment :)

Tags: , ,
Current Mood: quixotic
Current Music: Tutelary Genius (by Universal Hall Pass on "Mercury")

glyf
Add to Memories
Tell a Friend
If you've ever tried to use __slots__, you might have noticed that it comes with a pleasant side-effect: it makes it impossible to assign extra, garbage attributes to your objects. While normally not a serious problem, objects with extra, unintentional crud stuck to them can make object databases hell to work with as you struggle to figure out where it came from or why it is in your database but not in memory. Therefore, this semantic feature is handy in addition to reducing the memory footprint of objects if you have large numbers of them lying around.

If you've noticed this, you've also probably noticed that it is damn near impossible to use __slots__ because you have no control over what slots are used from your base classes.

Here is a solution to this problem that I have been working on for a while: SlotMachine. While it breaks isinstance - you weren't using isinstance anyway, right? - it does do the sane thing that you would expect a cooperative __slots__ implementation to do; you can subclass from random other classes (provided that you properly specify all their attributes as slots on your object) and other SlotMachines, and even other objects that define __slots__, without giving up the explicit specification of attributes or the ability to inherit from things.

I'll add some more comments to that file later tomorrow, I think.

Tags: , , ,
Current Mood: weird
Current Music: Only (by Nine Inch Nails on "With Teeth")

profile
Glyph Lefkowitz
User: [info]glyf
Name: Glyph Lefkowitz
calendar
Back April 2008
12345
6789101112
13141516171819
20212223242526
27282930
page summary
tags