Scribbles in the Dark (Glyph Lefkowitz, glyf)

Encoding.
Mr. Bicking wants to change his default encoding. Since there is some buzz about this I figure it would be a good opportunity to answer something that has already emerged as a FAQ during Axiom's short life, about its treatment of strings.

Axiom does not have strings. It has two attribute types that look suspiciously like strings: text() and bytes().

However, 'text()' does not convert a Python str to text for you, and never, ever will. This is not an accident, and it is not because guessing at this sort of automatic conversion is hard. Lots of packages do it, including Python - str(unicode(x)) does do something, after all.

However, in my mind, that is an unfortunate coincidence, and I avoid using the default encoding anywhere I can. Let me respond directly to part of his post, point by point:
Are people claiming that there should be no default encoding?
That's what I would say, yes. The default encoding is a process-global variable that sets you up for a lot of confusion, since encoding is always context and data-type dependent. Occasionally I get lazy and use the default encoding, since I know that regardless of what it is it probably has ASCII as a subset (and I know that my data is something like an email address or a URL which functionally must be ASCII), but this is not generally good behavior.
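To make the hazard concrete, here is a minimal sketch in modern Python terms (this post predates Python 3; the example illustrates the principle rather than quoting any code from the post): decoding bytes with whatever default happens to be configured can silently produce garbage, while decoding with the encoding the data actually uses is unambiguous.

```python
# A minimal sketch of why a process-global default encoding is a trap:
# the same bytes "work" under the wrong encoding, they just decode to garbage.

data = "café".encode("utf-8")        # the bytes actually on the wire

# Decoding with a wrong-but-plausible default raises no error at all:
mojibake = data.decode("latin-1")    # silently wrong: 'cafÃ©'
assert mojibake != "café"

# Decoding with the encoding the data really uses is the only safe move:
assert data.decode("utf-8") == "café"
```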
As long as we have non-Unicode strings, I find the argument less than convincing, and I think it reflects the perspective of people who take Unicode very seriously, as compared to programmers who aren't quite so concerned but just want their applications to not be broken; and the current status quo is very deeply broken.
I believe that in the context of this discussion, the term "string" is meaningless. There is text, and there is byte-oriented data (which may very well represent text, but is not yet converted to it). In Python types, Text is unicode. Data is str. The idea of "non-Unicode text" is just a programming error waiting to happen.

The fact that English text, the sort that programmers commonly use to converse with, code with, identify network endpoints with and test program input with, looks very similar in its decoded and encoded forms, is an unfortunate and misleading phenomenon. It means that programs are often very confused about what kind of data they are processing but appear to work anyway, and make serious errors only when presented with input which differs in encoded and decoded form.

SQLite unfortunately succumbs to this malady as well, although at least they tried. Right now we are using its default COLLATE NOCASE for case-insensitive indexing and searches. This is defined according to the docs as "The same as binary, except the 26 upper case characters used by the English language are folded to their lower case equivalents before the comparison is performed." Needless to say, despite SQLite's pervasive use of Unicode throughout the database, that is not how you case-insensitively compare Unicode strings.
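The difference can be illustrated with a small sketch (in Python, not SQLite's actual implementation): an A-to-Z-only fold, reimplemented directly from the quoted NOCASE docs, matches ASCII but fails on anything accented, while a Unicode-aware fold does not.

```python
def nocase_fold(s):
    # Reimplements the quoted NOCASE rule: fold only the 26 English
    # upper-case letters; every other character is compared as-is.
    return "".join(c.lower() if "A" <= c <= "Z" else c for c in s)

# ASCII text compares case-insensitively, as expected:
assert nocase_fold("HELLO") == nocase_fold("hello")

# Accented text does not -- the upper-case É is never folded:
assert nocase_fold("ÉTÉ") != nocase_fold("été")

# A Unicode-aware fold gets it right (str.casefold in modern Python):
assert "ÉTÉ".casefold() == "été".casefold()
```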

Using the default encoding and Unicode only worsens this. Now the program appears to work, and may in fact be correct in the face of non-English, or even non-human-language input, but breaks randomly and mangles data when moved to a different host environment with a different locally-specified default encoding. "Everybody use UTF-8" isn't a solution either; setting aside the huge accidental diversity in this detail of configuration, in Asian countries especially the system's default encoding implies certain things to a lot of different software. It would be extremely unwise to force your encoding choice upon everyone else.

I don't think that Ian has an entirely unreasonable position; the only reason I know anything about Unicode at all was that I was exposed to a lot of internationalization projects during my brief stint in the game industry, and mostly on projects that had taken multilingual features into account from the start.

The situation that I describe, where text and bytes are clearly delineated and never the twain shall meet, is a fantasy-land sort of scenario. Real-world software still handles multilingual text very badly, and encoding and decoding properly within your software does no good and is a lot of extra work when you're interfacing with a system that only deals with code points 65-90. Forcing people to deal with this detail is often viewed as arrogance on the part of the system designer, and in many scenarios the effort is wasted because the systems you're interfacing with are already broken.

Still, I believe that forcing programmers to consider encoding issues whenever they have to store some text is a very useful exercise, since otherwise - this is important - foreign language users may be completely unable to use your application. What is to you simply a question-mark or box where you expected to see an "é" is, to billions of users the world over, a page full of binary puke where they expected to see a letter they just typed. Even pure English users can benefit: consider the difference between and . Finally, if you are integrating with a crappy, non-Unicode-aware system (or a system that handles Unicode but extremely poorly) you can explicitly note the nature of its disease and fail before passing it data outside the range (usually ASCII) that you know it can handle.
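That last point can be sketched as a small boundary check (a hypothetical helper, not an Axiom API): encode explicitly at the edge of the broken system, and fail loudly instead of shipping mangled bytes downstream.

```python
def to_legacy_ascii(text):
    # Hypothetical boundary helper: the downstream system is known to
    # handle only ASCII, so refuse anything else up front.
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError("legacy backend is ASCII-only; cannot send %r" % text)

assert to_legacy_ascii("hello") == b"hello"

try:
    to_legacy_ascii("héllo")
except ValueError:
    pass  # fails at the boundary, before any data is corrupted
else:
    raise AssertionError("expected the boundary check to reject non-ASCII")
```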

Consider the other things that data - regular python 'str' objects - might represent. Image data, for example. If there were a culture of programmers that expected image data to always be unpacked 32-bit RGBA byte sequences, it would be very difficult to get the Internet off the ground; image formats like PNG and JPEG have to be decoded before they are useful image data, and it is very difficult to set a 'system default image format' and have them all magically decoded and encoded properly. If we did have sys.defaultimageformat, or sys.defaultaudiocodec, we'd end up with an upsetting amount of multi-color snow and shrieking noise on our computers.

That is why Axiom does not, will not, and can not, automatically decode and encode your strings for you. Your string could be a chunk of oscilloscope data, and there is no Unicode encoding for that. If you need to store it, store it unencoded, as data, and load it and interpret it later. There are good reasons why people use different audio and image codecs; there are perhaps less good, but nevertheless valid reasons why people use different Unicode codecs.

To avoid a similar common kind of error, I don't think that Axiom is going to provide a 'float' type before we've implemented a 'money' type - more on why money needs to be encoded and decoded just like Unicode in my next installment :)
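The parallel is that binary floats, like byte strings, are an encoded form. A sketch of why, using Python's Decimal (the 'money' type mentioned above did not exist yet when this was written):

```python
from decimal import Decimal

# Binary floats cannot represent most decimal amounts exactly,
# just as bytes cannot represent text without a stated encoding:
assert 0.1 + 0.2 != 0.3

# Store money in an exact form (e.g. an integer count of cents) and
# "decode" it to a display value only at the edges:
price_cents = 1999
display = Decimal(price_cents) / 100
assert display == Decimal("19.99")
```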

Current Mood: quixotic
Current Music: Tutelary Genius (by Universal Hall Pass on "Mercury")

Comments
From: ianbicking Date: August 4th, 2005 03:54 pm (UTC)

I Wish It Were So...

I believe that in the context of this discussion, the term "string" is meaningless. There is text, and there is byte-oriented data (which may very well represent text, but is not yet converted to it). In Python types, Text is unicode. Data is str. The idea of "non-Unicode text" is just a programming error waiting to happen.
This would be nice, except it isn't true. In Python, both the core language and a bulk of library code, str-as-text is common. For instance, I consider class names and attribute names to be text. The result of repr() is text. 99% of string literals are text, and yet most string literals aren't unicode.

I'd be okay if we dropped strings. And really there's nothing stopping Python from adding the bytestring-not-like-str object right now, since I think there's consensus that we need a more distinct byte object.

As I think about it, that might help the process of unicodification even before str is eliminated (which is a ways off). At least a person could make a hard distinction (duck-type-capable distinction) in their libraries, without just sprinkling everything with asserts or setting the encoding to undefined.

"Everybody use UTF-8" isn't a solution either; forgetting the huge accidental diversity in this detail of configuration, In Asian countries especially, the system's default encoding implies certain things to a lot of different software. It would be extremely unwise to force your encoding choice upon everyone else.
I'm most interested in server processes, where any local default encoding is itself a bug, because there's no "local". The configuration issue is important, but it's not a blocker -- a couple assertions, a couple little hacks, and you can at least confirm the encoding is set correctly. This is something best done as part of a larger programming environment -- for documentation reasons if nothing else -- as it does create a somewhat incompatible environment.

But really my assertion is that, for most practical cases, a UTF-8 default encoding is a much more compatible environment than the one we have. I agree that all text in Unicode is the right solution ultimately, but it just isn't an option now. There are some significant issues with that default encoding; the most concerning one to me is that equal str/unicode strings don't hash equally even if they compare equally, except in ASCII. I dunno... I'm open to other ideas, but the only ideas out there have involved walled cities of code, and that's not that appealing to me. When the majority of programmers get something wrong -- and at least for US programmers I'm pretty sure that's true -- then we have to look at what's wrong with the environment that causes that.

From: glyf Date: August 4th, 2005 04:34 pm (UTC)

Re: I Wish It Were So...

The idea of "non-Unicode text" is just a programming error waiting to happen.
This would be nice, except it isn't true.

I think that it is true. I don't think we're really disagreeing - I'm saying "it's an error", you're saying "it's common". Later on you say "the majority of programmers get something wrong" and thereby seem to be agreeing with my central premise - it's a common error ;-).
there's nothing stopping Python from adding the bytestring-not-like-str object right now

In fact, it has one. Several, even: read-write buffers, which are how SQLite deals with data; character arrays, which are in every way I can see completely identical to read-write buffers; and read-only buffers. In cases where the bytes/text distinction has to be particularly stark, I do use these. Unfortunately, str's API is still usually the most convenient of all of these.
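For readers coming to this later: the containers named above are Python 2 era (buffer objects and character arrays); Python 3 eventually built the distinction in directly. A sketch of the present-day equivalents:

```python
# Modern equivalents of the byte containers discussed above:
raw = bytes([0x68, 0x69])    # immutable byte string: b'hi'
buf = bytearray(raw)         # mutable, read-write byte buffer
view = memoryview(buf)       # zero-copy view onto the buffer

buf[0] = 0x48                # mutate in place: buffer now holds b'Hi'
assert bytes(buf) == b"Hi"
assert view[0] == 0x48       # the view sees the mutation

# And bytes never silently coerce to text: the comparison is just False.
assert (b"hi" == "hi") is False
```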

a couple assertions, a couple little hacks, and you can at least confirm the encoding is set correctly


Yes, but that means that your program is now dependent upon site configuration. Even if we discount the fact that you may want to re-use logic on a client that you wrote for a server, there is another purely server-side environment that makes this mistake - magic_quotes_gpc, anyone? ;-) Just read that post imagining that it said instead "I know the encoding has to be utf-8 for Webware to work. But our local content management system needs it to be Shift-JIS to work; what do I do!?"

a UTF-8 default encoding is a much more compatible environment than the one we have

Only compatible with other UTF-8 environments ;-)

strings don't hash equally even if they compare equally, except in ASCII

What?
>>> u'\u1234'.encode('utf-8') == u'\u1234'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 0: ordinal not in range(128)


That seems like a pretty straightforward error - what do you mean "except in ASCII"?
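For what it's worth, this exchange can be restated in later terms (a sketch; the original discussion is Python 2, where a str and a unicode string could compare equal under a custom default encoding while hashing differently). Python 3 dissolves the problem by making bytes and str simply unequal, so hashed containers can never be fooled:

```python
# bytes and str never compare equal in Python 3, so a dict keyed by
# text can never be accidentally probed with encoded bytes:
d = {"key": 1}
assert (b"key" == "key") is False
assert b"key" not in d
```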

the only ideas out there have involved walled cities of code

I'm not sure what you mean by a 'walled city'. At some point, code is entering your system from some data source as bytes. At that point you should be told, or guess, what the encoding of those bytes are. I suppose that you could consider those data-verification chokepoints "walls" around your code, but I feel safer with those kinds of walls around me :).

From: ianbicking Date: August 4th, 2005 04:58 pm (UTC)

Re: I Wish It Were So...

I think that it is true. I don't think we're really disagreeing - I'm saying "it's an error", you're saying "it's common".
No, we aren't really disagreeing; but I'm pointing out that Python has this error built into the language.
In fact, it has one. Several, even. read-write buffers, which is how SQLite deals with data. character arrays, which are in every way I can see completely identical to read-write buffers. read-only buffers. In cases where the bytes/text distinction has to be particularly stark, I do use these. Unfortunately str's API is still usually the most convenient out of all of these.
Right, there should be one best structure for byte strings in the standard library. In builtins even -- it should be more convenient than str.
Only compatible with other UTF-8 environments ;-)
Yes, everyone should use UTF-8. I know there's some controversy around Unicode in general, but frankly I think claims of Western racism are just covers for a controversy motivated by the much more significant racism between the people of those countries. But now I'm getting way off topic, and it doesn't even matter because Python doesn't support translating character sets, only encoding and decoding, so if you don't like Unicode you're stuck.

But I digress... UTF-8 is in some ways a side issue... but in some ways I guess I'm hoping that a walled city that accepts UTF-8 as okay will be a much larger, more inclusive, easier-to-work-with walled city than the one that uses only Unicode. I want UTF-8 as a default encoding because things like unicode(str(v)) == unicode(v) work, not so I can save a little effort decoding external data sources. Actually, right now in some cases I'm wary of decoding external sources, because by doing so I introduce unicode into the system and a huge number of errors will start popping up.

That seems like a pretty straightforward error - what do you mean "except in ASCII"?
The funniness starts when you change the default encoding. I'll probably post the details in another blog post.
At some point, code is entering your system from some data source as bytes.
If that was the only place where a wall had to be put up, that would be fine. But right now you need a wall between every library you use that isn't carefully vetted as being Unicode safe.
From: mesozoic Date: August 4th, 2005 10:53 pm (UTC)

Re: I Wish It Were So...

glyf: At some point, code is entering your system from some data source as bytes.
ianbicking: If that was the only place where a wall had to be put up, that would be fine. But right now you need a wall between every library you use that isn't carefully vetted as being Unicode safe.
If you exchange text with six different libraries which aren't Unicode-safe, aren't you going to have to treat their text delicately anyways, to make sure they're handing you text that is stored in an encoding you are able to use?