An Agreeable Procrastination – and the blog of Niels Kühnel

Code, the universe and everything

Posts Tagged ‘Database stuff

The Eager Fork – Taming Cartesian explosions in databases

with one comment

UPDATE: For almost any practical purpose batch loading is faster. The exotic cases are not worth the effort. This post is only relevant if you are obsessed with ORM performance tuning and want inspiration from some crazy ideas.

After reading this excellent post by Ayende it made be think of a fun solution I once came up with to the problem that arises when you try to extract a big object graph from a relational database with one query; a Cartesian explosion. That’s what this blog post is about.

Basically a “Cartesian exploision” happens when a database query results in a lot of Cartesian products. It means that when you write a query that joins in multiple, unrelated tables the database server will give you A LOT of rows and far more than you need. A very common situation where this occurs is when you’re using an ORM tool like NHibernate, and you’re loading a list of entities and their related entities eagerly (i.e. in one query). You may not know it, but in the background a lot of network bandwidth and CPU cycles are wasted to absolutely no avail.

To keep things simple, let’s consider an example with the three tables A, B and C where A has one-to-many relations to B and C. Specifically we have 3 As, 4 Bs and 3 Cs in total where

  • A1 has 3 Bs and 3 Cs
  • A2 doesn’t have any Bs or Cs
  • A3 has 1 B

If you wanted to extract all As with all its Bs and Cs in one query you would do something like: (for simplicity only the ID’s are included in the query)

SELECT A.ID AS A, B.ID AS B, C.ID AS C FROM A
  LEFT JOIN B ON B.A = A.ID
  LEFT JOIN C ON C.A = A.ID;

With the example data this results in

A B C
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
1 3 1
1 3 2
1 3 3
2 NULL NULL
3 4 NULL

Because B and C are unrelated each C is repeated for each B (the infamous Cartesian product) yielding a result set that may be several orders of magnitude greater than the one needed. In this limited example we get 11 rows for 9 entities but if each A had 30 Bs and 30 Cs you would get 2,700 rows (3*30*30) for 183 entities (3 As, 90 Bs and 90 Cs).

The Eager Fork

Now, it’s actually possible to reduce the number of rows dramatically by forcing some additional structure into the query that separates B and C. The very same Cartesian product that’s causing the trouble can be used to add “forks” (or axes, if you like) to each A. These can then be used when joining Bs and Cs.

Consider the query:

SELECT A.ID AS A, Forks.Fork FROM A
INNER JOIN (SELECT 1 AS Fork UNION ALL SELECT 2) AS Forks ON 1=1

This will explicitly make a Cartesian product between A and the subquery “Fork” (by joining on 1=1). “Fork” is a very simple table containing two rows with the numbers “1” and “2”, hence the result is each A repeated twice with the numbers 1 and 2 like this:

A Fork
1 1
1 2
2 1
2 2
3 1
3 2

This structure can be used to separate the Bs and Cs, as we can now join the Bs on the “1” forks and Cs on the “2” forks. In this way the initial query becomes

SELECT A.ID AS A, Forks.Fork, B.ID AS B, C.ID AS C FROM A
	INNER JOIN (SELECT 1 AS Fork UNION ALL SELECT 2) AS Forks ON 1=1
	LEFT JOIN B ON B.A = A.ID AND Forks.Fork = 1
	LEFT JOIN C ON C.A = A.ID AND Forks.Fork = 2;

This will result in

A Fork B C
1 1 1 NULL
1 1 2 NULL
1 1 3 NULL
1 2 NULL 1
1 2 NULL 2
1 2 NULL 3
2 1 NULL NULL
2 2 NULL NULL
3 1 4 NULL
3 2 NULL NULL

As it can be seen, the multiplication between the number of Bs and Cs is now a simple addition instead. In this limited example it only reduces the number of rows from 11 to 10, but if each A had 30 Bs and 30 Cs (as before) the 2,700 rows would be reduced to 180(!) which is actually less than the 183 entities needed to be extracted. More specifically, the upper bound for the row count in the result set is the number of As times the number of forks plus the number of each A’s Bs and Cs minus one for each (the first B or C are included in a row that would otherwise have NULLs in its place).

The concept can easily be extended to related entities’ related entities as fork tables can be added on nested joins as well. Assume that each B further had relations to D and E. The query needed to extract the full object graph would then be:

SELECT A.ID AS A, Forks.Fork, B.ID AS B, C.ID, D.ID, E.ID FROM A
	INNER JOIN (SELECT 1 AS Fork UNION ALL SELECT 2) AS Forks ON 1=1
	LEFT JOIN B ON B.A = A.ID AND Forks.Fork = 1
	--Nesting
		LEFT JOIN (SELECT 1 AS Fork UNION ALL SELECT 2) AS NestedForks ON B.ID IS NOT NULL AND 1=1
		LEFT JOIN D ON D.B = B.ID AND NestedForks.Fork = 1
		LEFT JOIN E ON E.B = B.ID AND NestedForks.Fork = 2
	LEFT JOIN C ON C.A = A.ID AND Forks.Fork = 2;

Obviously, the number of forks must equal the number of related tables, i.e. if A had Bs, Cs, Ds and Es four forks would be needed instead of two (i.e., “… UNION ALL SELECT 3 UNION ALL SELECT 4”).

Implementing this in NHibernate (or similar)

So far “Eager forks” are only an SQL centric concept, but it shouldn’t be that hard to implement them in an ORM (especially one like NHibernate where the source is available). For that I think the best solution is to transform the SQL just before it’s executed by the database engine by means of some kind of filter on the final NHibernate generated SQL. Assuming that the original query only contains left joins it should be pretty easy to extend it with the required fork tables and fork criteria on the joins. Also considering inner joins, right joins and nested joins is a bit more intricate but definitely also doable. The rationale behind this approach is that NH uses some kind of hash table when hydrating the entities and only does something the first time it sees an entity in the result set. Thus, in theory, it shouldn’t be bothered at all if the same entities are included in a smaller, less redundant result set. By adding a layer that transforms SQL between NH and the database engine no changes need to be made to NH, and it can carry on doing what it does the way it’s used to. The developer, on the other hand, can start adding as many joins as he/she likes to a query without worrying about that “it’s probably bad” and later be bitten in the ass by a massive Cartesian explosion when the amount of data increases.

In the examples the “Fork” column is included for illustrative purposes but of course that shouldn’t be added to the NH queries.

(A small note on further reducing the row count)

I’ve tried with another approach where the Bs and Cs are joined to the fork table first as in this (less readable) query:

SELECT A.ID AS A, Forks.Fork, B.ID AS B, C.ID AS C FROM A
	LEFT JOIN ((SELECT 1 AS Fork UNION ALL SELECT 2) AS Forks
		LEFT JOIN B ON Forks.Fork = 1
		LEFT JOIN C ON Forks.Fork = 2
	) ON A.ID = B.A OR A.ID = C.A;

This further reduces the row count, but it seems to be very hard on the query optimizer so I think it’s better to go with a little extra rows.

Written by niels.kuhnel

October 15, 2010 at 8:47 am

Posted in ASP.NET

Tagged with , ,