At work, I've been thinking about the problem of versioning data quite a bit. It's a nasty problem, but I think I've gotten it down to a nice, simple paradox:
"The easier you make version migration, the harder version migration will be."
This is a very weird and particularly intense microcosm of "worse is better". In this case, the difficulty involved in using a tool actually makes the task that the tool has to perform easier. In other words, subjective quality improves as objective quality degrades. Or something like that. Here's a case in point:
In Python, you can easily serialize any object regardless of its layout, as long as it doesn't contain something completely nonsensical like an open file or a graphical button. It takes no extra code.
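A minimal sketch of that claim, using the standard pickle module. The Meeting class and its attributes are made up for illustration; the point is only that the class contains no persistence code at all.

```python
import pickle


class Meeting:
    """A plain object with no persistence code of its own."""

    def __init__(self, title, people):
        self.title = title
        self.people = people


original = Meeting("standup", ["alice", "bob"])

# Round-trip through pickle: the instance's attributes are stored as they are.
restored = pickle.loads(pickle.dumps(original))
print(restored.title, restored.people)
```

Nothing in Meeting mentions saving or loading, which is exactly what makes this style so tempting.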
In Twisted, and by extension in Quotient, any object that is even theoretically persistent in this manner can be upgraded by inheriting a single superclass (twisted.persisted.styles.Versioned) and writing a single method (upgradeToVersion1... or upgradeToVersion7; write as many as you like and they'll be run in order) which shuffles things around inside the object until they're consistent with the current world-view. This is about as easy as upgrading from one version to another can get - the upgrade function is almost always completely self-explanatory:
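from twisted.persisted.styles import Versioned

class Meeting(Versioned):
    persistenceVersion = 1

    def upgradeToVersion1(self):
        # stored IDs instead of references before, oops!
        # (reference() is assumed to be an application helper defined elsewhere)
        self.people = [reference(self.database, x) for x in self._peopleIDs]
        del self._peopleIDs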
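To make the "run in order" behaviour concrete, here is a rough, self-contained sketch of how such a base class could work. It is not Twisted's implementation: VersionedSketch, the second upgrader and the "person-N" strings are all assumptions made for illustration, and loading is simulated by calling __setstate__ directly, which is what pickle does when it restores an instance.

```python
class VersionedSketch:
    """Hypothetical stand-in for a Versioned-style base class (not Twisted's code)."""

    persistenceVersion = 0

    def __getstate__(self):
        # Stamp the saved state with the version of the class that wrote it.
        state = dict(self.__dict__)
        state["persistenceVersion"] = self.persistenceVersion
        return state

    def __setstate__(self, state):
        state = dict(state)
        # State saved before any version stamp existed counts as version 0.
        saved = state.pop("persistenceVersion", 0)
        self.__dict__.update(state)
        # Run each missing upgrader in order: saved + 1, ..., current.
        for version in range(saved + 1, type(self).persistenceVersion + 1):
            getattr(self, "upgradeToVersion%d" % version)()


class Meeting(VersionedSketch):
    persistenceVersion = 2

    def upgradeToVersion1(self):
        # Version 0 stored bare IDs; the strings stand in for real references.
        self.people = ["person-%d" % i for i in self._peopleIDs]
        del self._peopleIDs

    def upgradeToVersion2(self):
        # Version 2 added a title, so old data gets a default.
        self.title = "untitled"


# Simulate a load of version-0 data: make an empty instance, then hand it
# the state dict that an old version of the class would have saved.
old_state = {"_peopleIDs": [1, 2]}
meeting = Meeting.__new__(Meeting)
meeting.__setstate__(old_state)
print(meeting.people, meeting.title)
```

Note that nothing forces anyone to write upgradeToVersion2 when they add the title attribute; new objects work fine without it, and only old data breaks.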
This is the MIT style of persistence design. (Oddly enough, written while I'm living at MIT.) It is complete, it takes every case into account (there is even code for upgrade ordering dependencies, if you care about such things) and it values simplicity for the user (the "business object" author) rather than the implementor (the maintainer of the framework).
Now, for an example of the New Jersey approach, I will refer to some code that I actually wrote in New Jersey. (This is code that I have done my best to erase from the internet's collective memory. If any of you offer up the ridicule that it so richly deserves, so help me I will erase you in the same fashion.)
Unfortunately I haven't had much experience with this style of persistence, although I am aware that many popular systems use it, including software that costs thousands of dollars per copy and does Very Important Things Indeed for Fortune 500 companies.
The style which I am speaking of is explicit persistence: in other words, you have to write a new method for every new object you want to persist, even if it's something dirt simple like two ints and a string. Then, whenever you want to change anything about an object, you have to modify some code that saves and loads that object. In the code in question - this was the original Twisted Reality codebase - there were two methods of note for saving and loading objects. One was public String content() in Thing. The other was [package] boolean handleIt(String) in ThingFactory. These methods were 110 and 283 lines long, respectively. If you wanted to add a new attribute to anything, you had to add some code that looked like this:
else if (tok == "component")
{
    thi.setComponent(true);
    Age.log("Component bit set",Age.EXTREMELY_VERBOSE);
    return true;
}
and like this:
if(isComponent())
{
    r.append("\n\tcomponent");
}
to each of those methods. If you actually wanted to change something about the structure of an object, you had to create magic, temporary declarations in the persistence and then interpret them later. Certain kinds of changes weren't even possible.
Now, when a designer who is thus far ignorant of the subtle mysteries of persistence comes across these views of the world, the choice is obvious. The former is so clearly superior that the latter seems like a cruel joke. It breaks encapsulation! It adds a huge cost to change! It binds unrelated aspects of the code together inextricably! It creates arbitrary, artificial, and unnecessary limitations!
Not all these problems are endemic to explicit persistence, but I wanted to present an obvious straw man here so that it would be really surprising when I couldn't knock it down.
Don't get me wrong - the MIT-style strategy is certainly nicer to work with when one is writing programs. Given the explicit task of upgrading a few simple objects (of the style present in the code from which the New Jersey example is taken) to a new and better representation, the Right Thing wins hands down. This scales up, too - there were no complicated objects in the NJ code specifically because the persistence code pretty much fell over whenever you tried to do anything complicated, so writing and upgrading complicated objects MIT-style is certainly easier than impossible.
But there is a curious phenomenon that takes place when looking at the larger issues pertaining to codebases using these two approaches. In about 2 years of maintaining TwistedReality/Java, when there were bugs in the persistence, they were obvious and immediate - you could dump an object and identify the problem with its representation in a few moments. More importantly, pretty much every version was backwards-compatible with old persistent data. You never had to "blow away" your map files, because they would just load without the spiffy features available in the new map format. And finally, no contributor to the project ever checked in code which broke these constraints - every data-layout change included appropriate persistence changes.
In only about 6 months of maintaining a Right Thing codebase (Quotient) this is certainly not the case. Close to shipping now, we are still wrestling with a system which requires the database be destroyed on every other run. Nobody wants to write persistence code, and hey, the system works if you don't, so why bother? We don't have a policy in place that mandates that everyone must write an upgrader every time they change anything, and again, nobody wants to write persistence code, so since they don't have to they won't. This includes me, so I understand the impulse quite well.
This isn't entirely a fair comparison, of course. TR/J included a lot less persistence-sensitive data than Q does. It had a far simpler charter. It didn't use a database layout, just pure object persistence. However, from experience with the Twisted 'tap' format, those issues are peripheral - Twisted devs generally don't like taking the time to write persistence upgrade functions either. There are periodically upgrade snafus. What really matters, of course, is that nobody trusts taps to stay stable worth a damn, even though we try really hard to make sure they will be.
Also, towards the end of its life (although there is some question as to whether it is really dead) TR/J began inheriting some characteristics of the Right Thing model (in particular, dynamic properties of arbitrary type), which in turn began creating the same syndrome of problems. In that case, it manifested itself as certain features breaking on particular objects from version to version and requiring operator interventions to fix the data rather than whole-system upgrade explosions, but nevertheless, we couldn't quite shoehorn all the features we needed into a static, single-object model of persistence.
Python has tempted us, we have taken the bite from the Pickle, and we can't ever go back again. A persistence strategy as clearly brain-dead as the one featured in TR/J just isn't going to cut it with the feature-set that we need to support in Quotient. However, we desperately need to encourage the developmental behaviors which that system encouraged, especially keeping a running system going with the same data for an indefinite period of time.
What did the Jersey style do right, then?
- Forced Consideration of Impact
Every time that a programmer made a change to an object that might affect that object's persistence, they had to make a change to the persistence code as well, or they effectively couldn't use their new feature. The data just wouldn't load. This meant that, when faced with a potentially complicated new data structure, they would always ask themselves, "Do I really need to add that?" This might seem like an artificial burden, but in reality it more closely reflected the real cost of change while keeping the actual requirements satisfied, rather than making the cost of change seem artificially low while constantly violating the requirements in the name of expediency.
- Immediate Feedback and Testability
The persistence format was also the introspection format, because it was so simple. Whenever a developer made a change to the persistence code, they could immediately see that change in a very direct way, making it easy to see if they made a mistake. If that code had had any tests (NOT A WORD, I SAID!) then writing them would have been relatively easy too. With an implicit persistence mechanism, the only way to write such a test is to keep an exact, unreadable copy of an old object's data (and of course, all the context that object kept around).
- Programmer-Valuable Data Associated with the System
This is more specific to TR - as we were working on the code, we were also developing a companion dataset, stored in TR's own format, which was equally important to the project as the code itself. It was absolutely 100% imperative to every developer to keep that data working in every minor iteration of the code, because otherwise we couldn't test. I think that ultimately every data-storing project needs something like this to make the developers truly care about versioning.
- Separation of Run-Time and Saved State
All that grotty string-mashing code in TR actually served a purpose - it stripped implementation-dependent run-time data out of the saved file. This meant that we were free to change the implementation of lots of structures (for example, switching a list of strings to a dictionary of string:int, or vice versa) without persistence changes, as long as the data could be represented in the same way. In an automatic system, these implementation details are indistinguishable from the core abstract state of an object.
Oddly enough, although it is brought about through forced duplication of effort (manual specification of the persistence format), it reduced the amount of upgrade code necessary. Because the persistence format was very abstract, you never had to write an upgrader to go from one implementation of behavior for the same persistent data to another. While changing persistence can be frequent, changing implementation is almost by definition more frequent.
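The testability and separation points above can be sketched in Python rather than the original Java. This is only an illustration under assumptions: the Thing class here, its attributes, the load() method and the exact text layout are invented, loosely modelled on the content() and "component" snippets quoted earlier.

```python
class Thing:
    """Hypothetical object with an explicit, line-oriented saved form."""

    def __init__(self, name):
        self.name = name
        self.component = False
        self.synonyms = []   # implementation detail: a list today
        self._cache = {}     # run-time only, never written out

    def content(self):
        # Write only the abstract state, one declaration per line.
        lines = ["thing " + self.name]
        if self.component:
            lines.append("\tcomponent")
        for word in sorted(self.synonyms):
            lines.append("\tsynonym " + word)
        return "\n".join(lines)

    @classmethod
    def load(cls, text):
        lines = text.splitlines()
        thing = cls(lines[0].split(" ", 1)[1])
        for line in lines[1:]:
            tokens = line.split()
            if not tokens:
                continue
            if tokens[0] == "component":
                thing.component = True
            elif tokens[0] == "synonym":
                thing.synonyms.append(tokens[1])
            else:
                # No loading code for this declaration: the data won't load.
                raise ValueError("unknown declaration: " + line.strip())
        return thing


# An old saved object is just readable text, so it doubles as a test fixture.
OLD_SAVED = "thing lamp\n\tcomponent\n\tsynonym light"
lamp = Thing.load(OLD_SAVED)
print(lamp.content() == OLD_SAVED)
```

If synonyms later became a dictionary, only content() and load() would need to agree on the same text; the saved data, and any test built on OLD_SAVED, would not change.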
I think that's a relatively complete summary of the advantages of manual persistence work, although I'd love to hear comments upon it.
How can we replicate these advantages?
I think that an important first step is to find some simple, lightweight way to completely express the necessary information for the persistence of an object. Even if we still use Pickle to store this data, an explicit specification of what it should look like would be a good mental exercise for us. It would also provide a means to test upgrading and to represent the format of old versions without having to copy their entire implementation. In short, the "schema" that Twisted World provided and New PB is about to provide again. The outstanding work in my sandbox on indexed and referenced properties in Atop is an important first step here.
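One possible shape for such a specification, as a sketch only: a class-level dictionary naming every persistent attribute and its type. The schema attribute, schemaProblems() and the Meeting fields are assumptions for illustration, not anything Twisted World, New PB or Atop is known to provide.

```python
class Meeting:
    # Hypothetical convention: the complete persistent form, declared in one place.
    schema = {"title": str, "people": list}

    def __init__(self, title, people):
        self.title = title
        self.people = people
        self._socket = None  # run-time only; deliberately not in the schema

    def __getstate__(self):
        # Save exactly what the schema names, nothing else.
        return {name: getattr(self, name) for name in self.schema}

    def __setstate__(self, state):
        problems = schemaProblems(self.schema, state)
        if problems:
            raise ValueError("; ".join(problems))
        self.__dict__.update(state)
        self._socket = None


def schemaProblems(schema, state):
    """Return readable mismatches between a saved state and a schema."""
    problems = []
    for name, kind in schema.items():
        if name not in state:
            problems.append("missing " + name)
        elif not isinstance(state[name], kind):
            problems.append(name + " is not " + kind.__name__)
    for name in state:
        if name not in schema:
            problems.append("unexpected " + name)
    return problems


state = Meeting("standup", ["alice"]).__getstate__()
print(state)
# A state written by an older layout shows up as a list of concrete complaints.
print(schemaProblems(Meeting.schema, {"title": "standup", "_peopleIDs": [1]}))
```

Because pickle honours __getstate__ and __setstate__, the data could still be stored with Pickle, while an old version's format is represented by nothing more than its old schema dictionary.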
We also need some critical data to be stored in the database that can't be exported, re-loaded, or otherwise passed through some external crutch versioning mechanism. We need to care about our core data's dependability as we move forward.
We also need to decide what kinds of data we really care about. There are certain aspects of the Quotient application which are developing so fast that it's impossible to effectively represent them persistence-wise, and it would really be a waste of time. Such objects should probably never be persisted in the first place - just provide an 'experimental' flag or somesuch, indicating that the object should never touch an on-disk database. When this becomes burdensome, the programmer can un-set it and manually start performing updates.
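A rough sketch of how such a flag might be enforced with pickle, assuming a common base class. Persistable and SearchIndex are hypothetical names; the only real API used is the __reduce_ex__ hook that pickle calls on each instance it is asked to save.

```python
import pickle


class Persistable:
    """Hypothetical base class; 'experimental' is an assumed convention."""

    experimental = False

    def __reduce_ex__(self, protocol):
        # pickle calls this hook for every instance it tries to save.
        if self.experimental:
            raise pickle.PicklingError(
                "%s is experimental and must not be persisted" % type(self).__name__
            )
        return super().__reduce_ex__(protocol)


class SearchIndex(Persistable):
    experimental = True  # still changing too fast to be worth versioning

    def __init__(self):
        self.terms = {}


try:
    pickle.dumps(SearchIndex())
except pickle.PicklingError as error:
    print(error)
```

This version fails loudly rather than silently omitting the object, so an experimental object that sneaks into persistent data is caught at save time. Un-setting the flag restores ordinary pickling, at which point upgraders become the programmer's job.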
There's more to say, but this has already been quite a ramble! I hope that you've enjoyed reading it. Please keep in mind that I would like feedback and more ideas about how to perform the transitions I've suggested over a relatively large existing system. (Quickly, of course. And cheaply. ^_^)
Until next time.
December 14 2003, 07:24:02 UTC 8 years ago
Clarity is good
This is why I have always been enamored of the idea of repr/eval persistence. Introspection is important for understanding of the program's operation, in both the debugger/interpreter and in the code. I have been writing some C and there's *no way* for the debugger to let you reconcile the value of a variable set to an enum and the symbolic enum -- the only way to do it is val == kOSASelectStore. If you write a repr for everything, introspection becomes 100 times easier. Unless you have exceptions in your reprs, and exceptions in your reprs also help you create more consistent objects -- what do you mean that object sometimes doesn't have a "foo" attribute?
Using repr/eval helps you care less about object identity. It makes you think about what other objects need to be included in the representation in order for the evalled object to be considered "equivalent" to the original. It helps you write objects which are less interdependent -- if you add another object to the repr (%r yo) and recursively pull in half of your program, you have a problem.
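For readers who haven't seen the idea, a small sketch of repr/eval persistence. Point and Segment are made-up classes; each repr is valid Python that rebuilds an equivalent object, and equality is defined by value because identity does not survive the round trip.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # The repr is Python source that reconstructs an equivalent object.
        return "Point(%r, %r)" % (self.x, self.y)

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)


class Segment:
    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __repr__(self):
        # %r pulls in the reprs of the child objects recursively.
        return "Segment(%r, %r)" % (self.start, self.end)

    def __eq__(self, other):
        return isinstance(other, Segment) and (self.start, self.end) == (other.start, other.end)


saved = repr(Segment(Point(0, 0), Point(3, 4)))
print(saved)
# Only eval text you wrote yourself: eval runs arbitrary code.
restored = eval(saved)
print(restored == Segment(Point(0, 0), Point(3, 4)))
```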
Also, I think nevow might be a source of inspiration here. In fact, I called the main interface ISerializable (although will probably rename it IRenderable). It has a few key points which make it very easy to express complex nested hierarchies as a flat string:
- It uses adapters for serialization of types.
- It attempts to further adapt the return value from serializers until it is presented with a single contiguous string.
- It allows serializers to be expressed recursively by letting those serializers yield serializable adapters around child attributes.
While the implementation moves slightly into the realm of the complex (flatten being the most complex example) most serialization code is incredibly simple, clear, and reusable.
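The three points can be approximated in a few lines. This sketch is not nevow's API: it uses functools.singledispatch as a stand-in for adapters, and serialize() and flatten() are names chosen here for illustration.

```python
from functools import singledispatch


@singledispatch
def serialize(obj):
    # No serializer has been registered for this type.
    raise TypeError("no serializer for %s" % type(obj).__name__)


@serialize.register(str)
def _(obj):
    return obj


@serialize.register(int)
def _(obj):
    return str(obj)


@serialize.register(list)
def _(obj):
    # Yield children as they are; flatten() adapts each of them in turn.
    yield "["
    for index, child in enumerate(obj):
        if index:
            yield ", "
        yield child
    yield "]"


def flatten(obj):
    """Keep adapting results until everything is one contiguous string."""
    result = serialize(obj)
    if isinstance(result, str):
        return result
    return "".join(flatten(part) for part in result)


print(flatten([1, [2, "three"]]))
```

Each serializer only knows about its own type, and the nesting is handled entirely by flatten().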
dp
December 16 2003, 07:54:39 UTC 8 years ago
Re: Clarity is good
It's unfortunate that Pickle being in C means that this approach is so much less efficient than what we've already got. In a language with better low-level facilities we could perform a LOT of optimizations by building things this way.
December 15 2003, 02:31:09 UTC 8 years ago
So, if I understand it correctly....
the major problem you are dealing with in persistence is dealing with differences in objects as a system is upgraded. Someone changes something in a new release of Quotient and the object is now different than it was before and the persistence becomes broken. Is that what you are talking about?
So the real problem then is providing means for handling these changes that might be automatic, to make the developer's job easier, because developers are lazy? :)
December 15 2003, 06:18:08 UTC 8 years ago
Straw Men
Having maintained several NJ-style persistence systems myself, I can relate to this worse-is-better perspective. Without exception, they were terrible systems that never had problems with changes to our code, to our internal object representation, or to new features we added.
They did present some inhibition to our actually adding features that posed new or different persistence requirements though, which I have mixed feelings about. As you said, it can be a boon to the system in general if developers are required to keep things simple. Mainly though, I don't want my persistence layer to restrict the features I consider adding to a project. With Python (the systems in question were in C), and with just plain better designed code (I inherited the systems in question, and while they worked, they seemed to try to be as horrible as possible and still do so) this may be less of a problem.
More transparency in the persistence format is a good thing, but going for complete transparency is unreasonable. Sticking with pickle is probably a necessity, but avoiding using its arbitrary graph traversal features might be a good step in the right direction. I can't think of a way to have it magically omit "experimental" objects, but that's probably more of an implementation detail than a conceptual problem, and it certainly sounds like a good idea (assuming the system can handle certain objects going missing for no particular reason; since most of these experimental objects will probably inhabit the system as elements in a homogenous list, it probably can).
Another key feature of the NJ-style persistence system you have described, as well as the ones I have worked on, is the ability to trivially *manually* remove an object from the persistence system (at least, while the system is offline). With opaque data sitting in a database, this is not trivial, and manually impossible. Plaintext data files are great -- we have a plethora of tools available for working with them. If pickle is the way forward, as it appears to be, tools for manipulating pickles are probably a necessity.
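Python's standard library already offers a starting point for such tools in the pickletools module. The sketch below uses a made-up dictionary as sample data; pickletools.dis and pickletools.genops inspect a pickle without loading it.

```python
import io
import pickle
import pickletools

data = pickle.dumps({"title": "standup", "people": ["alice", "bob"]}, protocol=2)

# Disassemble the pickle into a readable opcode listing without loading it.
listing = io.StringIO()
pickletools.dis(data, out=listing)
print(listing.getvalue())

# genops() yields (opcode, argument, position), which is enough to scan a
# pickle for the strings it contains.
strings = [arg for opcode, arg, pos in pickletools.genops(data) if isinstance(arg, str)]
print(strings)
```

This only reads pickles; editing or removing an object inside one would still need tooling built on top.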
July 4 2004, 03:57:55 UTC 7 years ago
I ranted about it on my blog, here: I HATE PERSISTENCE.
-- radix
October 13 2004, 20:06:12 UTC 7 years ago
Twisted folks and E folks agree
Hm. I don't have a strong opinion about this myself, but apparently the E folks came to similar conclusions against the make-it-easy-on-the-programmer orthogonal approach.
This is fascinating, as both Twisted and E folks probably started with the assumption that one should always make it easier on the programmer.
Some refs that I haven't really read yet:
http://www.erights.org/elib/persist
http://erights.org/data/serial/jhu-pape