Saturday, September 01, 2007

Query Errors and Identity Map

I’ve encountered a “gotcha” in O/R Mapping. I’ll describe it here, and outline some of the possible solutions. To make matters worse, the default solution in nHibernate has side effects which can cause problems of their own. I’ll cover those side effects too.

The Problem

Imagine doing this in one Unit of Work (aka Session in Hibernate-speak):

1. Load customer “Bob Smith”
2. Change his surname to “Jones” (but don’t save yet)
3. Now, ask the ORM to load all customers named “Smith”.

Depending on how your ORM works, the list of Smiths may include one guy named Jones!

Why? Because he is still “Bob Smith” in the database. The ORM grabs the database row for Bob Smith, sees that the corresponding object has been loaded already (via the Identity Map pattern), and so returns the already-loaded object – even though that object is now named Jones.
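To make the mechanics concrete, here is a minimal sketch in plain Python (toy Customer/Session classes of my own invention, not any real ORM's API) of how an identity map produces this result:

```python
class Customer:
    def __init__(self, cid, surname):
        self.cid = cid
        self.surname = surname

class Session:
    """Toy unit of work: an identity map in front of a dict 'database'."""
    def __init__(self, db):
        self.db = db                # cid -> surname, as stored in the database
        self.identity_map = {}      # cid -> already-loaded Customer

    def load(self, cid):
        if cid not in self.identity_map:
            self.identity_map[cid] = Customer(cid, self.db[cid])
        return self.identity_map[cid]

    def find_by_surname(self, surname):
        # The query matches against *database* rows, but each matching row
        # is swapped for the already-loaded object if one exists.
        return [self.load(cid) for cid, stored in self.db.items()
                if stored == surname]

session = Session({1: "Smith"})
bob = session.load(1)
bob.surname = "Jones"               # changed in memory, not yet flushed

smiths = session.find_by_surname("Smith")
print([c.surname for c in smiths])  # ['Jones'] -- a Jones among the Smiths
```

The identity map is doing exactly what it is designed to do (one object per row per session); the surprise comes from combining it with unflushed changes.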

There are several possible solutions to this problem, as follows:

1. Automatic “eager updates”. This is the default solution used in Hibernate and nHibernate. When FlushMode is set to its default value (Auto), nHibernate will send updates to the database early (i.e. before you ask it to), just to make sure that your in-memory changes also exist in the database, and are therefore reflected in the results of your queries. In the example above, nHibernate would automatically save Jones before querying for the list of Smiths. This solution guarantees accurate results, but it has a problem (which I’ll describe below).

2. Cross-check database results with in-memory state. This solution would involve scanning all objects already in the identity map, in addition to running each database query. Due to the obvious performance implications, particularly if there are a lot of objects in the identity map, I won’t discuss this option further. [See comments below for discussion of performance implications]

3. “Don’t worry about it”. Simply don’t solve the problem at all. Go ahead and return Jones in the list of Smiths. Rely on the application programmer to know when this might be dangerous. In cases where the programmer thinks this will cause a problem, the programmer can either:

3.a. Explicitly save changed objects before running the problematic query – this is a manual version of option 1 above, but it only kicks in when the programmer thinks they need it. I believe this is the approach taken by LINQ-to-SQL.(*)
3.b. Explicitly run two queries: the first runs against the database and loads the objects into the unit of work (aka session). The second runs in memory, scanning over all the objects in the unit of work, returning those that match. Only results from the memory query are actually used. In our example above, Jones would be processed by the first (database) query, but not by the second (memory) one. I’m not aware of any ORM that implements this approach.(**)

4. Solve the worst half of the problem. There are actually several types of error that may occur. Arguably, the worst type of error is the one I showed above: you query for all "Smith"s, then proceed to perform some operation on each object returned, an operation which is only valid for people named Smith. It's not valid for people named Jones, so you may execute it in error when the ORM returns a Jones mixed in with the Smiths. This problem can be solved by a variant of option 2: query the database, let the identity map do its thing, and then execute the criteria again in memory on the final result set, stripping out all objects that fail the in-memory test. This option performs better than option 2 because it doesn't scan all objects in memory, just those that were returned by the database query and were already in the identity map (i.e. just Jones, in the example above). This approach guarantees to exclude all bad data, but it does not guarantee to include all good data. (E.g. it won't include new Smiths which have not yet been saved.) I implemented this option in ActiveSharp. [See comments below for discussion of why this is worth doing and why, unlike option 2, it can be implemented efficiently]
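Option 4's re-check can be sketched with the same kind of toy Python classes (this illustrates the idea; it is not ActiveSharp's actual code):

```python
class Customer:
    def __init__(self, cid, surname):
        self.cid, self.surname = cid, surname

class Session:
    """Toy unit of work: an identity map in front of a dict 'database'."""
    def __init__(self, db):
        self.db, self.identity_map = db, {}

    def load(self, cid):
        if cid not in self.identity_map:
            self.identity_map[cid] = Customer(cid, self.db[cid])
        return self.identity_map[cid]

def find_by_surname_checked(session, surname):
    """Run the query against the database, then re-apply the criteria in
    memory -- but only to objects that were already in the identity map."""
    results = []
    for cid, stored in session.db.items():
        if stored != surname:
            continue                          # the database-side test
        was_loaded = cid in session.identity_map
        obj = session.load(cid)
        if was_loaded and obj.surname != surname:
            continue                          # stale in memory: strip it out
        results.append(obj)
    return results

session = Session({1: "Smith", 2: "Smith"})
session.load(1).surname = "Jones"     # modified in memory, not yet saved

smiths = find_by_surname_checked(session, "Smith")
print([c.cid for c in smiths])        # [2] -- the renamed "Jones" is excluded
```

Note that the extra in-memory test runs only for rows whose objects were already loaded; freshly-loaded objects reflect the database row that just matched, so they need no re-check.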

The Problem with the nHibernate Approach

Let’s introduce the problem with a quote from the nHibernate documentation:

Many business processes require a whole series of interactions with the user interleaved with database accesses…[This is called] a long running application transaction. A single application transaction usually spans several database transactions. It will be atomic if only one of these database transactions (the last one) stores the updated data, all others simply read data.
But, when automatic flushing is enabled, any transaction may update data! Even if you roll back the last one (the one which is supposed to do all the updates) Hibernate may have already committed some “eager” updates in earlier transactions. Therefore, automatic flushing is incompatible with long application transactions.

In summary, you have two choices in nHibernate:

(a) Automatic flushes, usually in the session-per-request pattern.
Pro: accurate query results;
Con: must use one database transaction for entire session (to prevent unwanted commits of automatic flushes)

(b) Manual flushes, usually in the session-per-conversation pattern (aka session-per-application-transaction).
Pro: can use multiple database transactions per session, flushing only in the last one;
Con: query accuracy may suffer as described above.
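The trade-off shows up clearly in a toy model (plain Python, hypothetical classes; not nHibernate's actual implementation): auto-flush makes the query accurate, but only by writing to the database before you asked it to.

```python
class Customer:
    def __init__(self, cid, surname):
        self.cid, self.surname = cid, surname

class Session:
    """Toy unit of work with an optional auto-flush before each query."""
    def __init__(self, db, auto_flush=True):
        self.db = db
        self.auto_flush = auto_flush
        self.identity_map = {}

    def load(self, cid):
        if cid not in self.identity_map:
            self.identity_map[cid] = Customer(cid, self.db[cid])
        return self.identity_map[cid]

    def flush(self):
        # Push every in-memory change back to the database.
        for cid, obj in self.identity_map.items():
            self.db[cid] = obj.surname

    def find_by_surname(self, surname):
        if self.auto_flush:
            self.flush()    # the "eager update": the DB is written early
        return [self.load(cid) for cid, stored in self.db.items()
                if stored == surname]

db = {1: "Smith"}
session = Session(db, auto_flush=True)
session.load(1).surname = "Jones"

print(len(session.find_by_surname("Smith")))  # 0 -- accurate result...
print(db[1])                 # 'Jones' -- ...but the update already hit the DB
```

With `auto_flush=False` the query would (wrongly) return the renamed object, but nothing would touch the database until an explicit flush; that is exactly choice (b) above.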

I am disappointed to see nHibernate documentation which says that “the ISession.Find(..) methods will never return stale data; nor will they return the wrong data”, without specifically spelling out that this is only true with automatic flushes. Likewise I am disappointed to see the session-per-conversation pattern described, with a perfectly valid recommendation to turn off automatic flushes, but with no mention of the adverse effect on query accuracy.

[Update 26 Aug 08: Looks like nHibernate 2 has tightened things up somewhat, since it has "No AutoFlush outside a transaction - Database transactions are never optional, all communication with a database has to occur inside a transaction, no matter if you read or write data." So at least if it is going to send your updates to the database, it will be in a transaction that you have created yourself. Still no guarantee (as far as I can tell) that the transaction used for the updates will be the "last" one - i.e. the one where you intend all updates to happen]


With manual flushing, accuracy of query results is a potential problem. Admittedly, it will be a rare problem in practice, but sometimes the rare problems are the most dangerous, because they’re the hardest to find in testing.

I can’t find much about the problem on Google. Perhaps I’m not searching well enough… or perhaps the information is out there but I can’t find it because Martin Fowler hasn’t given it a catchy name yet ;-) …

Certainly, I agree with the usual recommendation to make sessions as short as practically possible. That may minimize the problem – but it won’t 100% eliminate it, particularly for application transactions that span multiple database transactions.

I think it’s important to understand when an ORM may give “incorrect” results. It’s also important to understand the side effects of Hibernate’s eager-update solution.

Finally, I’m relatively new to nHibernate. If I’ve missed something important, please let me know.

Hibernate transaction patterns:

Stale data in LINQ-to-SQL:
(at the time of writing this post, my own question in the thread has not been answered)

Another post on LINQ issues, which mentions the stale data issue:

Uniquing (aka Identity Map):

(*)The programmer will generally need to make sure that the explicit "early" save is in the same database transaction as the rest of the intended updates. This is for the same reasons outlined above with regard to automatically-flushed updates in Hibernate.

(**)This option is achievable because, with LINQ, it is relatively easy to run the same query once against the database and once against the objects in memory. It is basically a manual version of option 2 above, except that it only kicks in when the programmer thinks they need it.
Importantly, to be truly accurate, options 2 and 3b must include new objects that have been added to the unit of work but not yet saved to the database. This is not possible in most (all?) true POCO ORMs, since their persistence-by-reachability code only kicks in when objects are saved. That’s too late for options 2 and 3b, which require the ORM to know about objects as soon as they become reachable from any other object in the Unit of Work, even if they have not been saved yet.


Blogger Trant said...

Very interesting, I hadn't gone much beyond considering what LINQ does to what it should do.

The problem with post-processing the identity map to reject non-matching records is that you'd also have to scan them to include now-matching ones too. There would no doubt be performance implications here.

Every new technology to solve problems brings a few new ones to the party.


Mon Oct 08, 07:22:00 AM PDT  
Blogger John Rusk said...

True. That's what I was trying to get at with "option 2" above. I do agree that scanning to include all now-matching ones (ones which did not actually come back from the DB) would indeed have performance issues.

That's why I added "option 4", which is deliberately only half the solution. It is only about rejecting "bad" records, weeding out the ones which did come back from the DB but which (in memory) no longer match the query that retrieved them.

That can be done efficiently, as follows:

foreach row in result set
    if row already in identity map
        check that the in-memory object still matches the query
    else
        // it was not already in memory, so no need to do an in-memory test

In other words, the in-memory test only runs for objects which were in fact in memory before the query runs. In many situations, that will only be a small minority of the rows.

The value of this approach is that you can ask for a list of "Smiths" and then process it safe in the knowledge that they are indeed "Smiths". They might not be all the Smiths that exist (since others may also exist in memory, as you said); but they will all be Smiths. Not Joneses.

To me, there is a big difference between accidentally running Smith logic on a Jones; versus missing out some Smiths when I run my Smith logic. The former is plain erroneous; while the latter is akin to someone else saving a new Smith while you are working - it's not very different from the "normal" concurrency issues in multi-user systems.

Tue Oct 09, 12:25:00 AM PDT  
