One thing to keep in mind: CTEs are defined as being optimisation boundaries and databases are not allowed to optimise across CTEs. So CTEs always have to be resolved first.
If you later join that CTE against some other table, eliminating most of the rows fetched by the CTE, you will have wasted all the time for retrieving these rows.
This means that you might get a worse plan by using a CTE instead of a direct join.
It also means that you might convince the system to give you a much better plan by using a CTE (if you can limit the total amount of rows by the rows in the CTE instead of the tables you join it against).
Just keep that in mind when you're debugging query speed issues, but in general, CTEs are good for very readable queries that are much easier to understand and maintain than if you join inline.
The additional readability makes it worth trying to solve the problem with a CTE first, and only converting the query to a more traditional join (in the sense of "what the various open source RDBMSs provided until two years ago, when Postgres introduced CTEs") if you run into the above problem and the performance is unacceptable.
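To make the trade-off concrete, here is a minimal sketch (Python with sqlite3 and an invented two-table schema, purely so it's runnable; the materialisation behaviour described above is Postgres-specific, so this only demonstrates that the two forms are semantically equivalent and you can switch freely when performance demands it):

```python
import sqlite3

# Hypothetical schema: tasks joined against users. In (pre-12) Postgres the
# CTE form would be materialised in full before the join; the inline form
# lets the planner push the restriction down. The results are identical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO tasks VALUES (1, 1, 'write'), (2, 1, 'review'), (3, 2, 'deploy');
""")

# CTE form: conceptually "fetch all tasks, then join".
cte_rows = conn.execute("""
    WITH all_tasks AS (SELECT id, user_id, title FROM tasks)
    SELECT t.title FROM all_tasks t
    JOIN users u ON u.id = t.user_id
    WHERE u.name = 'alice'
    ORDER BY t.id
""").fetchall()

# Inline form: the planner is free to filter before joining.
inline_rows = conn.execute("""
    SELECT t.title FROM tasks t
    JOIN users u ON u.id = t.user_id
    WHERE u.name = 'alice'
    ORDER BY t.id
""").fetchall()

print(cte_rows == inline_rows)  # same answer; only the plan may differ
```

In Postgres you'd compare the two with EXPLAIN ANALYZE: the CTE form may scan all of tasks before joining, while the inline form can filter first.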
CTEs are defined as being optimisation boundaries and databases are not allowed to optimise across CTEs.
Wait, no. I know for a fact that SQL Server will happily optimize across CTEs. They are effectively syntactic sugar, and handled much like a view or derived table.
So, either the SQL standard defines this optimization boundary and SQL Server ignores it, or this is an idiosyncrasy of the Postgres implementation, or this is an idiosyncrasy of some other implementation that you have assumed applies to all implementations.
I honestly don't know. I wouldn't mind a syntactic construct that did define optimization boundaries, but to my mind CTEs are not that construct.
Unfortunately I haven't read the SQL standard(s) and I also seem to have trouble finding them in full on the internet.
However I was told on IRC in #postgres that not optimising across CTEs was something mandated by the standard.
Of course I might have been told something that's not quite correct or I might have misunderstood.
However in the context of postgres (which is what the article is talking about), I know for a fact that CTEs are first fully resolved. This is what I'm seeing in my query plans and it's what the manual says: http://www.postgresql.org/docs/9.3/static/queries-with.html (at the end of 7.8.1)
If I'm wrong as far as the standard is concerned, then I apologise.
Yes, as far as the standard is concerned you are wrong. See http://dba.stackexchange.com/questions/27425/is-the-optimisa.... But it is a very common misconception you can hardly be faulted for, because it is the way postgres does it and many people claim it is due to the SQL standard.
One of the only annoying features in postgres, imho. You have a large query, you try to break it up into smaller self-contained CTE parts, and voila, performance goes crap.
As per discussions on the email lists, PostgreSQL does it this way because this is the safest way to adhere to the standard. Presumably over time cross-CTE optimizations will be added. The problem though is guaranteeing the stability the standard requires without doing so. That's a technical issue, not a standards issue though, and you are quite right for pointing that out.
"Of course I might have been told something that's not quite correct or I might have misunderstood."
I think it was a partial truth. In many cases, you can't optimize because you need to produce the same results as if you had evaluated the CTE in full exactly once.
CTEs are also a convenient place to have an optimization fence, which are sometimes useful. Yes, this conflates the logical and the physical, but that's the way it is in postgres.
Yes, it's quite apparent in practice that WHERE constraints are not being propagated back into CTEs in cases where they clearly could be, so that's something to look out for.
"CTEs are defined as being optimisation boundaries and databases are not allowed to optimise across CTEs."
No. You can optimize them however you want and still comply with the SQL standard; you just can't produce different results.
Different results are really only a problem when the CTE is non-deterministic or has side effects. If you know that the CTE is deterministic and has no side effects, you can just treat it like a macro.
> If you know that the CTE is deterministic and has no side effects, you can just treat it like a macro.
With PostgreSQL you can't really assume this. Consider this:
CREATE FUNCTION stable_random() RETURNS double precision LANGUAGE sql STABLE
AS $$ select random() $$;
Now what I have done is wrapped a volatile function in a stable function. Now the planner thinks it is deterministic and will fold it into the plan in a different way. I could further mark it as immutable and allow the planner to run it once per query even if it occurs multiple times. Now select stable_random() will be treated as fully deterministic, so select stable_random() a from my_table will return a list of the same singular random number (all values of a will be the same).
So how is the planner to know this? Obviously we could just let the planner decide based on function markers but then you'd run into the case where a stable function above would produce different results, while a volatile or an immutable one would produce the same result. That seems to make very little sense to me.....
I may have gotten folding wrong. The question is whether the subquery gets run once at execution time and folded in, or whether it gets run on every row. I will have to double check the specific behavior but it's a query folding issue not a plan vs execution time issue.
As I think about it I may have gotten it wrong. It might need to be in a subquery to fold in that way.
It's also worth noting that in systems that don't optimise across CTE boundaries (I believe Postgres is one of these), they usually do still optimise across view boundaries. CREATE TEMPORARY VIEW is your friend
... that seems thoroughly backwards to me, at least in terms of expectations. I mean, I know that views that were optimization boundaries wouldn't be useful, but just from an intuitive perspective I'd expect that the stand-alone database objects would be optimization boundaries while the query-supporting statements would be optimized as part of the query.
CTEs are defined as being optimisation boundaries and databases are not allowed to optimise across CTEs.
What sort of brain-dead language design would conflate a performance construct with a maintainability construct? Even C isn't that bad (though it sure comes close).
I, on the other hand, am perfectly capable of believing that SQL is a pretty terrible language that merely persists because every attempt to replace it has been even worse because they threw out the baby with the bath-water.
A data-storage language based on relational theory is a great idea. We should make a new one at some point.
Date and Darwen's rants were exactly what I had in mind when I posted that. Although I didn't know an actual implementation of D existed in the form of D4, which is interesting.
I wonder whether it would be possible to write a D interpreter as a front-end for an existing database? If everyone with a PostgreSQL installation suddenly found that Tutorial D was only a 'yum install' away, it might catch on.
I mean, strictly speaking, that would make PostgreSQL a noSQL database ...
Given that the language is different, you'd probably need a custom background listener and its own parser (and you'd probably want to listen on a different port), but yes.
Someone already wrote a custom background listener that pretends to be mongodb, so I guess PostgreSQL is web-scale now ;-)
I love using small CTE's to help with the hassle of the alternative - creating temp tables and then cleaning them up. They can be particularly useful for data modification, as you can't ordinarily do a recursive delete or insert all that easily.
PS: I always cringe when I see this sort of syntax though:
"FROM
users,
tasks,
project"
Much clearer using explicit JOINS imo, especially if your goal is readability.
Not just cleaner (which might be debatable), but having a separate ON clause allows you to put criteria for selecting rows after the join has occurred in the WHERE clause, which are different from the criteria for making the join in the first place. Or so I prefer, but whatever.
Completely disagree. All the database developer "experts" I know of (including myself) hate the explicit join syntax.
I prefer to list out the tables and then do the joins in the "where" clause. For 3+ tables and mixing inner & outer joins it's especially better. I think maybe this only looks better once you become really, really good at reading and writing SQL though. Maybe it's also because I come from the Oracle world, where it is the standard.
I've worked in a mixed Oracle / PostgreSQL environment, and I prefer the explicit join syntax.
1. I hate needing to parse/use the '(+)' to denote which way an outer join goes. It's much easier to read "left outer join".
2. Separation of concerns. This way all of the conditions that are in the WHERE clause act as filters on the data, and all of the JOIN conditions sit right next to the relevant tables in the FROM clause.
3. I don't even know what the WHERE clause syntax for a CROSS JOIN would be.
4. When a single JOIN has multiple conditions, it's much nicer to have them all together in the FROM clause than to have them scattered (or even grouped) in the WHERE clause.
5. You can use USING. When you're joining two tables on a column with the same name (e.g. product_id), I always find it annoying to be required to specify which table's column to look at since they are guaranteed to be equal.
I don't consider myself a 'database expert', but I spent several years in an environment where 100+ line queries with 5-10 joined tables were not uncommon (so I'm not coming from a place of only having worked with simplistic queries).
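For anyone who hasn't used USING (point 5 above), a runnable sketch with invented tables (Python's sqlite3, which supports the same syntax) contrasting the implicit comma join with the explicit form:

```python
import sqlite3

# Two tables sharing a column name (product_id). With USING, the join
# condition sits with the join and the shared column needs no qualifier.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, name TEXT);
    CREATE TABLE prices   (product_id INTEGER, price REAL);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO prices   VALUES (1, 9.99), (2, 19.99);
""")

implicit = conn.execute("""
    SELECT p.name, pr.price
    FROM products p, prices pr          -- join condition hides in WHERE
    WHERE p.product_id = pr.product_id
    ORDER BY p.product_id
""").fetchall()

explicit = conn.execute("""
    SELECT name, price
    FROM products JOIN prices USING (product_id)  -- condition sits with the join
    ORDER BY product_id
""").fetchall()

print(implicit == explicit)  # same rows either way
```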
I'm the opposite. I hate joins in the where clause because it is semantically wrong and confuses the building the set part with the filtering the result part. I want to see how a table is joined into the query at the place it is joined and additional filtering in the where. To me it makes it much easier to follow and not miss a join item, especially if the join happens on multiple columns.
I really hope you are not suggesting going back to things like *= and =*
> I hate joins in the where clause because it is semantically wrong and confuses the building the set part with the filtering the result part.
It is not "semantically wrong": conceptually, you can compute a join by taking the Cartesian product and then applying the join predicates to the result, which is actually what the implicit join syntax suggests. Semantically, there is no difference between applying an (inner) join predicate and applying any other predicate. There is also no actual division between "building the set" and "filtering the result": different predicates might be applied in different orders, before or after different relations in the query have been accessed.
I started out writing joins in the where clause; I'm not sure why, probably because that's how it was done in the first book I read. I then worked with some old Oracle hands who also did it that way, probably because of tradition. A year or two ago, I rediscovered the explicit join notation, and switched over to it completely in a matter of days.
As you say, explicit joins are right because they separate which tables/which relationships and which rows. Those are not particularly distinct things in the pure relational algebra model, but they are fully distinct in practical use.
> Completely disagree. All the database developer "experts" I know of (including myself) hate the explicit join syntax.
[...]
> Maybe it's also because I come from the Oracle world where it is the standard.
IME, lots of Oracle experts -- who are likely to be longtime heavy users that started when Oracle didn't support explicit joins -- often prefer implicit joins because it's what they are used to.
But it's much clearer when you need to reuse and modify a query -- particularly someone else's query -- to have the join conditions clearly identified and distinct from filtering conditions.
I went from a company that generally adopted the implicit syntax to one that uses explicit, and after some adjustment, I've come to slightly prefer the latter. One thing that helps is to indent the join condition, formatting the SQL as:
select a.foo, b.bar, c.blah
from blahblah a
join whatever b on a.ID = b.A_ID
join another_table c on b.ID = c.B_ID
where 1=1
and a.foo = 'baz'
This becomes less clear with certain types of joins, but generally, I've come to prefer it.
If you have a very simple join it really makes no difference. On the other hand, when you have 10 tables involved in a 200-line query, the difficulty of debugging and maintaining the query is directly proportional to how well you can structure it. Breaking the query down into functional, related pieces which can be reviewed and managed separately is the key to maintaining large queries.
The way I describe it is this: In C or Perl, the complexity of debugging is linearly proportional to the length of the function. State accumulates during the function and consequently you have to read each line in context with those which came before and after.
With well-structured SQL, OTOH, you have something different. You have a set of sub-functions which are each somewhat independent and state does not (usually, outside of specified window functions) accumulate. Because it is declarative, a well structured query can be debugged not as a linked list of statements but as a tree. This is far more efficient. Personally I find a well-structured 200 line SQL query to be far more maintainable than a 50 line Perl function. And if it is really well structured, you could probably push this further out.
I 100% disagree with you. How is having even more conditions in the where clause helpful? It makes it very easy to accidentally cross join (Oops, I forgot to actually join that table because I have 40 lines in my where clause).
Also, I believe that in SQL Server, the old (ANSI 89) style of joining (joining in the where clause) was deprecated and isn't optimized anymore so you'll get crap performance because all of the tables are actually cross joined before the filtering happens. The explicit join forces relationships to be accounted for first which can help keep the amount of memory needed down.
One last thing: how long does it take to adopt a new standard? The explicit join has been the standard since 1992.
I can't claim to have worked with a representative sample of SQL developers, but I've worked with a lot and written huge quantities myself, and I'd like to think some of it was near the 'expert' level.
And I'm 100% with the OP, not you, sorry. Explicit joins are, IMHO, enormously clearer and easier to work with, particularly with outer joins, cross joins (the rare times they're intentional), CROSS APPLY....
Each to their own though. You write the code that's easier for you and your team to read; I just think it's unlikely we'll be happy working on a team together ;-)
I think you're either very selective with your sources or very selective about who you give the "expert" title to. I don't know that I've seen implicit joins used by a speaker at a SQL event in the past couple years, unless it's for cartesian joins.
My experience is the exact opposite. Only absolutely beginner php/mysql guys do that. And they do so because they don't understand joins. You can't even do outer joins implicitly without vendor specific extensions (which are being deprecated), so the idea that it is better when mixing join types is particularly odd. It is also far from "the standard" in the oracle world. Oracle land has more people who think using weird oraclisms are good, but there's still an awful lot of people who prefer ANSI syntax there. Just look at the answers to questions about it on any forum.
> For some reason most people throw out principles we follow in other languages such as... composability just for SQL
This is because the most common API for accessing databases is wrong and doesn't support composition.
Imagine designing a general-purpose, rich API that has only one method. The method takes one giant string argument. Insane? Absolutely.
SQL is actually awesome and amazing, but many programmers hate it because their API for accessing it is horribly broken. The whole point of relational algebra is that it's... an algebra. None of which is usefully exposed if you're just mashing strings of SQL together.
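As a minimal illustration of the point (hypothetical table, Python's sqlite3 as the DB-API): even short of a real algebraic API, at least binding parameters keeps data out of the query string instead of mashing everything into one giant string:

```python
import sqlite3

# String mashing fails (or worse) the moment a value contains a quote;
# parameter binding keeps query text and data separate at the API level.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, ?)", ("o'brien",))

name = "o'brien"

# The "one giant string" approach would be:
#   "SELECT id FROM users WHERE name = '" + name + "'"
# which breaks on the embedded quote (and invites injection).

# Parameterized: the driver binds the value, the query text never changes.
rows = conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()
print(rows)  # [(1,)]
```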
I'll one up you on that. . . the common approach of treating the database as if it were an API that you shove string arguments into is wrong.
Imagine if you developed your website without ever editing a *.js file directly, and instead just programmatically mashed together strings of JavaScript code and emitted them inline in the HTML wherever necessary. You'd probably end up generating much crummier JavaScript code, too. But should you blame JavaScript or your Web application framework for the fact that your client-side scripting is crummy? No, of course not. The real problem is that you're horribly misusing your tools.
Esqueleto is a pretty interesting take on putting together SQL without messing with strings. It doesn't quite expose the relational algebra in its full generality, but I've been finding it quite usable.
(read the PGObject::Simple docs too, and both are on github).
The idea is that you write your sql in .sql files as user defined functions, and then run them as parameterized queries in your code using a declarative interface. No SQL in your Perl, and you get the benefit of encapsulating your database in an API that is not so tightly tied to your application (you can add new optional parameters without throwing off your application).
mssql has a lot of "tricks" associated with CTE's. Using the OUTPUT clause on updates is one of them. Here's some queue code using a CTE to grab the top item by queued date.
DECLARE @Queue TABLE (QueueID INT);

WITH q AS (
    SELECT TOP (1) QueueID, StatusID, ModifiedDate
    FROM worker.Queue WITH (ROWLOCK, READPAST)
    WHERE StatusID = (SELECT TOP 1 s.StatusID FROM worker.Status s (NOLOCK)
                      WHERE s.[Status] IN ('Queued'))
    ORDER BY QueueDate ASC
)
UPDATE q SET
    StatusID = (SELECT s.StatusID FROM worker.Status s (NOLOCK)
                WHERE s.[Status] = 'Processing'),
    StatusMessage = 'New Process',
    ModifiedDate = GETDATE()
OUTPUT INSERTED.QueueID INTO @Queue;
CTEs in general are fine, just beware of using recursive CTEs to model graphs - and do graph operations - inside Postgres. My experience has been that the performance is pretty abysmal in that case, for even fairly small graphs. IMO, if you find yourself doing a lot of "graph stuff" just go ahead and store the graph in Neo4J or Titan or something.
Note that I don't mean to argue against the point of this post... CTEs are very handy. It's just that one of the more commonly cited uses for them is to do graph stuff, and my experience trying to do that, specifically, has not been positive.
Doing graph processing by explicitly populating a temp table is nearly as fast in many cases as neo4j with the added benefit of a vastly improved analytics experience if you plan to analyze the data or do range queries, etc... It is more work to do it smartly and efficiently, but it is possible. I'll try to find a python script I wrote as a PoC with SQLite.
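While I dig up that script, here is a rough sketch of the temp-table approach (not the original PoC; invented edge data, SQLite via Python): breadth-first reachability computed by repeatedly inserting the next frontier until a fixed point:

```python
import sqlite3

# Each loop iteration is one BFS level: insert every neighbour of an
# already-reached node, relying on the PRIMARY KEY + OR IGNORE to skip
# nodes we've seen. Stop when a round inserts nothing new.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('x','a'), ('a','b'), ('x','d'), ('b','x');
    CREATE TEMP TABLE reached (node TEXT PRIMARY KEY);
    INSERT INTO reached VALUES ('x');
""")

while True:
    cur = conn.execute("""
        INSERT OR IGNORE INTO reached
        SELECT e.dst FROM edges e JOIN reached r ON e.src = r.node
    """)
    if cur.rowcount == 0:       # fixed point: no new nodes this round
        break

nodes = sorted(n for (n,) in conn.execute("SELECT node FROM reached"))
print(nodes)  # ['a', 'b', 'd', 'x']
```

The win over a recursive CTE is that the intermediate set is an ordinary indexed table you can join, analyze, or range-query afterwards.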
CTEs and recursive queries are okay for many cases, but they can blow up your RDBMS temp space easily if you aren't careful. Especially useless is a non-materialized view based on them.
I'd be interested in seeing how you approach some of the things that graphs are better at. I have an understanding of creating efficient tree structures in a relational model - I haven't thought about the more general case of graphs though.
I was recently involved in a project that was using neo4j as a datastore. In the end we switched to postgres because the cognitive load was lower, the tooling was better and the relationship management was easier and more explicit. The trade off is going to be that later the relationship exploration will be a little harder. My plan was to add a graph datastore back in to mirror the relationships in the system so we could query it.
I think your big limiting factor in graph analysis with recursive cte's is that PostgreSQL is relational and therefore set-based. This effectively forces any search into a breadth-first search.
I don't see any reason why doing breadth-first searches would be any slower on PostgreSQL than Neo4j. But the question is, what happens when breadth-first is not what you want to do?
Now with some work, it should be quite possible to do depth-first or other sorts of searches, but you'd probably want to think very, very carefully about the implementation and use something besides a recursive cte.
So I would expect something like "show me everything within two hops of x" to perform pretty well if you designed your query right (and you'd want to put the starting point and the limit in your CTE so you aren't pulling the whole graph first to check). I would expect "show me the shortest route from x to y" to perform much better through a non-relational approach. However getting anything to work well would require carefully thinking through your query.
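For example, the "within two hops of x" query might look like this (sketched with Python's sqlite3 so it's runnable; the WITH RECURSIVE syntax is essentially the same in Postgres, and the hop limit lives inside the CTE so the search never pulls the whole graph):

```python
import sqlite3

# Tiny invented graph: a breadth-first expansion from 'x', bounded to
# two hops by checking the depth inside the recursive term.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES
        ('x','a'), ('a','b'), ('b','c'), ('x','d'), ('d','e'), ('e','f');
""")

rows = conn.execute("""
    WITH RECURSIVE reachable(node, depth) AS (
        SELECT 'x', 0
        UNION
        SELECT e.dst, r.depth + 1
        FROM reachable r
        JOIN edges e ON e.src = r.node
        WHERE r.depth < 2              -- the hop limit, checked per step
    )
    SELECT DISTINCT node FROM reachable WHERE node <> 'x' ORDER BY node
""").fetchall()

print([n for (n,) in rows])  # ['a', 'b', 'd', 'e']
```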
Recursive CTEs can absolutely be more performance-intensive, and even non-recursive CTEs can be an optimization bottleneck. However, from what I've seen with non-recursive CTEs, you're looking at a 1.5-2x difference in performance (in the bad cases) versus orders of magnitude. For making SQL more readable and shareable this seems a small price to pay; it's much the same as people using higher-level languages rather than dropping down to C for everything they need to do.
As a counter-point here, I would point out that there are cases where we've found (with the LedgerSMB project) drastic increases in performance in switching to CTE's, especially with recursive ones.
The reason is that if you have a stable result set which is complex to calculate (particularly if it involves recursive processing), but needs to be checked against multiple conditions in different subqueries, it is far better to materialize it once and check a few times, than to materialize it in every subquery.
For example, our menu uses a simple referential tree. Generating this with connectby() (back when we primarily had to support 8.3) usually worked, unless you wanted to exclude a large number of rows based on permissions. In that case, it would take the better part of a minute to run. Not good for a web app.
Moving to recursive CTE's cut our best-case execution time in half, and improved our worst cases by orders of magnitude.
To be honest though, for shareable code, I think views are often better because they represent a stable interface that can be documented regarding acceptable uses (should you join this view? self-join it?). What CTE's often buy you though is an ability to break a query off into multiple, independently testable pieces (where a view might be overkill due to a lack of sharing), and that allows you to better debug when something goes wrong.
When speaking of performance you always need a dataset size qualifier (what does small mean?). I have used CTEs in MSSQL to manage graphs and it performed really well. None of the relationship tables had more than 1M rows in it though.
Yeah, it depends on what you're trying to do with the graph as well. I was using a mix of Postgres operations and Java code to do a Floyd-Warshall "all shortest paths" calculation. I forget how many vertices I had, but it wasn't many... on the order of hundreds or thousands, at most. And the Postgres part just totally bogged down when using CTEs. I finally switched to storing the underlying graph itself in my own hand-rolled format (and eventually ran into different bottlenecks, but that's another story).
Now it's possible that if I were more of a Postgres expert, I could have tuned or optimized it to behave better. But at the minimum, I'll say that a naive graph implementation using CTEs didn't prove to scale well for me, in this given use-case.
"as long as the 'RECURSIVE' part of CTE is
ignored."... To me, the recursive part of CTEs is half the point! All the other stuff is syntactic sugar that can be done elsewhere, but recursion is a function that I'm unable to find another method for elsewhere in the language.
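For anyone who hasn't seen one, a minimal recursive CTE (run through Python's sqlite3 here just to make it executable; the syntax matches Postgres, and SQL Server accepts the same shape without the RECURSIVE keyword): walking a parent/child table to arbitrary depth, which a single non-recursive query can't express:

```python
import sqlite3

# Invented menu tree stored as an adjacency list. The anchor term picks
# the root; the recursive term joins children onto rows found so far.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menu (id INTEGER, parent_id INTEGER, label TEXT);
    INSERT INTO menu VALUES
        (1, NULL, 'root'), (2, 1, 'reports'), (3, 2, 'sales'), (4, 1, 'admin');
""")

rows = conn.execute("""
    WITH RECURSIVE tree(id, label, depth) AS (
        SELECT id, label, 0 FROM menu WHERE parent_id IS NULL
        UNION ALL
        SELECT m.id, m.label, t.depth + 1
        FROM menu m JOIN tree t ON m.parent_id = t.id
    )
    SELECT label, depth FROM tree ORDER BY id
""").fetchall()

print(rows)  # [('root', 0), ('reports', 1), ('sales', 2), ('admin', 1)]
```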
Note: I have far more SQL Server experience than I do with any other DBMS.
"In most cases I have seen performance differences smaller than a 2X difference, this tradeoff for readability is a nobrainer as far as I’m concerned. "
I wouldn't call a 2X difference in performance a "no brainer". Far from it. It could be the difference between a visitor trying your product and him leaving b/c it takes too long to load.
A 100% increase in query time for a query that originally takes 50ms is negligible; for a query that takes 500ms it will indeed make a vital difference.
And of course you can do it all inline. For transactional uses, that is probably best anyway. But there are clear use cases where inline queries are significantly less readable. Imagine subqueries nested 6 levels deep... CTEs allow you to flatten it so that you can linearly read the buildup of the query from top to bottom. It also allows the optimizer additional room for optimization, as it can decide based on its analyze data whether the query is best run inline or by creating temp tables.
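A small sketch of that flattening (invented toy data, sqlite3 via Python for runnability): the same two-stage filter written as nested subqueries and as a linear CTE chain that reads top to bottom:

```python
import sqlite3

# Both queries apply the same filters; the CTE names each stage so the
# pipeline reads in order instead of inside-out.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10), (2, 200), (3, 150);
""")

nested = conn.execute("""
    SELECT id FROM (
        SELECT id, amount FROM (
            SELECT id, amount FROM orders WHERE amount > 100
        ) WHERE amount < 180
    ) ORDER BY id
""").fetchall()

chained = conn.execute("""
    WITH big    AS (SELECT id, amount FROM orders WHERE amount > 100),
         in_band AS (SELECT id, amount FROM big    WHERE amount < 180)
    SELECT id FROM in_band ORDER BY id
""").fetchall()

print(nested == chained)  # both are [(3,)]
```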
Recursive CTEs are super amazing. One of the great underused features of SQL. Give me a tall PostgreSQL and a WITH RECURSIVE to steer by, and I will happily take on any graph database.
I don't know about postgres, but joining CTEs in SQL Server is a recipe for blowing up your tempdb, or at the very least waiting for ages for a trivial query to finish.
Postgres has no problems joining against any sized CTEs - at least that's what I'm seeing here with a considerably sized dataset (600 GB raw table size).
But: The optimizer will first fully resolve the CTE expression and only then join. If the join condition eliminates many of the rows selected in the CTE, this might indeed yield a much worse execution time than doing it inline.