
Saw posters on /r/Truckers complaining that truck stops were taking advantage of the situation to gouge on DEF too, on top of diesel, before someone pointed out that Iran is the second-largest producer of urea, the primary component of DEF. They are not happy right now with $5.00/gal fuel and higher DEF on top of it.

Worried the administration will use it as an excuse to roll back the NOx emissions regulations that mandated DEF usage in diesel engines. They are already not enforcing against "deletes" of the emissions systems, which is a federal crime.


Here's a trucker explaining other costs of the unnecessary Iran war.

https://youtu.be/BtAZ1jAV57s


The deletes are for recirculation systems, not DEF. Recirculation was always a bad idea, and anyone who wants a diesel for some reason should use DEF.

There are three main emissions control systems in diesels, the Exhaust Gas Recirculation (EGR), Diesel Particulate Filter (DPF) and Selective Catalytic Reduction (SCR) which uses Diesel Exhaust Fluid (DEF).

Any or all of them can be "deleted", and it is a crime to do so. All three systems add complexity and potentially reduce performance, which is why those who don't care about emissions like to get rid of them.

Before DEF, as NOx regulations steadily tightened, engine manufacturers relied on increasing amounts of EGR to control NOx until that was no longer tenable. Once DEF systems were implemented they could back off EGR, regaining performance, though not as much as ripping it all out and tuning with no regard for emissions.

There are EGR free engines that rely entirely on DEF to control NOx but they are not for on-road use in the US thus far.


Of course, but mostly it's just recirculation that causes costly engine rebuilds. If it wasn't so expensive no one would delete.

Most regulations target emissions at point of sale but don't give a toss whether such systems are practical or maintainable. It's sometimes better for the environment to accept somewhat more emissions than to generate so much waste from having to buy a new thing more frequently.


I've got a 2011 BMW X5 diesel with 189k miles on it. The DPF went out at around 150k, which is roughly the expected lifetime from my understanding. The part alone is ~$4000. Regardless of living in CA, I'd have replaced it anyway because I think it's the right thing to do. (I didn't want a non-DPF diesel that was emitting lots of PM2.5s when I made my purchase).

That said, many people would prefer the money in their bank account. And if I had decided to sell it rather than fix, no doubt it would have gone to someone looking to delete the DPF.


No, the DEF systems tend to break down, and it's mandated that the engine go into limp mode when they do. This is hated by those who do not care about emissions and is one of the biggest reasons truckers delete. The systems are also costly to fix out of warranty.

You can find instructions on how to make DEF simulators using Raspberry Pis to fool the ECU into thinking the DEF system is working. Even people who do care about emissions will carry these for emergencies so they can continue their trip and get to a shop later. The derate was overzealous for sure and was bad policy.

The DPF is also a performance issue since it restricts the exhaust to some degree, as does the catalytic converter with its DEF nozzle, so no, it's not just EGR at all. The DPF also consumes more fuel for regens.

The 2027 diesel regulations were to mandate even more NOx control but also required manufacturers to provide a 10-year/100,000-mile warranty on emissions systems; it's 5 years/50k now. I believe that's dead in the water at this point.

A diesel engine with a deleted SCR system puts out 40 times the NOx of a working one. That's the equivalent of 40 trucks going down the road instead of 1. NOx causes asthma and acid rain; it's not for the environment as much as for you directly.


And what percent of diesel emissions are from diesel passenger vehicles?

Emissions reduction efforts would be better spent ensuring repairability. The sunk-cost emissions of landfills filled with junk created by planned obsolescence are much worse.

Diesel pick ups are not practical vehicles. Let's be honest, it's a hobby. It's always going to be niche and cutting down the last 10% is always the hardest. General aviation still uses lead.


Not sure what your point is, I was talking primarily about Class 8 heavy duty commercial trucks (semi) and other medium duty commercial use.

All the stuff that makes up emissions gear is highly recyclable and in fact some of it very desirable which is why people are getting catalytic converters stolen. So I do not worry about it filling up a land fill.

I also don't worry about EV batteries filling landfills, because again they are very high-grade ore for new batteries; once we have enough in circulation we no longer need to mine much lithium or rare earths.

I agree it should be reliable and repairable, and forcing the manufacturers to provide very long warranties seems like a good way to do that. Having followed the various generations of DEF systems over the last decade, the manufacturers have been making big strides, because it costs them otherwise and already has.

I also think airplanes using lead is stupid, but that is a fraction of even private diesel pickup usage, let alone commercial trucking. Diesel pickups are at least 10% of all pickup sales nowadays.


Wars open a lot of opportunities, especially for the worst people.

The non-sargability is an optimizer deficiency IMO. It could attempt the cast just like this article is doing manually in code: if that succeeds, use the index; if it fails, fall back to scanning and casting a million times the other way.

Implicit casts should only widen, to avoid quiet information loss. If the optimizer behaved as you suggest, the query could return incorrect results, and potentially more rows than expected, with even worse consequences.

It should not return incorrect results. If the nvarchar only contains ASCII it will cast perfectly; if it doesn't, do the slow scan path. It's a simple check, the same work it's doing for every row in the current behavior, except done once and more restricted. Can you give me an example of an incorrect result here?

I am not talking about the default cast behavior from nvarchar to varchar, but a specific narrow check the optimizer can use to make a decision in the plan: ASCII or not, with no information loss, because it will do the same thing as before if the parameter does not pass the one-time check.

By far the most common cause of this situation is storing only ASCII in an nvarchar, because, as in this example, the client language uses an nvarchar equivalent for all strings, which is pretty much universal nowadays, and that is the default conversion when using a SQL client library. One must remember to cast explicitly; people expect the DB to do it for them, which is the source of much confusion.

This would be purely an optimization fast-path check, otherwise falling back to the current slow path: correct results always, and much faster ones if only ASCII is present in the string.
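To make the proposal concrete, here's a toy Python sketch of the one-time parameter check (the names and data structures are mine for illustration; this is nothing like the real optimizer internals):

```python
# Sketch of the proposed fast path: check the nvarchar parameter once, and
# only fall back to the per-row scan-and-cast when the check fails.

def seek_or_scan(param, index, rows):
    """index maps varchar values to row lists; rows is the full table."""
    if param.isascii():
        # One-time check passed: the parameter casts losslessly to
        # varchar, so an index seek is safe.
        return index.get(param, [])
    # Slow path: compare against every stored value, widening each one
    # (what the engine effectively does today for every row).
    return [r for r in rows if r == param]

index = {"ABC": ["row1"], "XYZ": ["row2"]}
rows = ["ABC", "XYZ", "ÆØÅ"]
print(seek_or_scan("ABC", index, rows))   # fast path via index seek
print(seek_or_scan("ÆØÅ", index, rows))   # slow path scan
```

The point being that the extra cost is a single `isascii`-style check on the parameter, while the saving is eliminating a cast per row.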


The vast majority of text fields I see are coded values that are perfectly fine using ASCII, but I deal mostly with English-language systems.

Text fields that users can type into directly especially multiline tend to need unicode but they are far fewer.


English has plenty of Unicode — claiming otherwise is such a cliché…

Unicode is a requirement everywhere human language is used, from Earth to the Boötes Void.


I am talking about coded values, like Status = 'A', 'B' or 'C'

Taking double the space for this stuff is a waste of resources, and nobody usually cares about extended characters here, in English-language systems at least; they just want something more readable than integers when querying and debugging the data. End users will see longer descriptions joined from code tables or from app caches, which can have Unicode.
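A quick Python check of the size claim, using `utf-16-le` and `latin-1` encodings as stand-ins for nvarchar and single-byte varchar storage:

```python
# nvarchar is UTF-16 (2 bytes per ASCII character); a varchar with a
# single-byte collation stores 1 byte per character.
for code in ["A", "NY", "ACTIVE"]:
    utf16 = len(code.encode("utf-16-le"))   # nvarchar-style storage
    single = len(code.encode("latin-1"))    # single-byte varchar storage
    print(code, utf16, single)              # UTF-16 is exactly double here
```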


It's way better to just use a DBMS that supports enums. I know SQL server isn't one of those but I still don't store my coded values as strings.

How do you store them? Also enums are not user configurable normally. It would be a good feature to have them, but they don't work well in many cases.

Typical code tables with code, description and anything else needed for that value which the user can configure in the app.

Sure, you can use integers instead of codes; now all your results look like 1, 2, 3, 4 for all your coded columns when trying to debug or write ad-hoc stuff. Also, ints are not variable length, so you're wasting space for short codes, and you have to know ahead of time whether it's going to be 1, 2, 4, or 8 bytes.


Enums are for non user-configurable values.

For configurable values, obviously you use a table. But those should have an auto-integer primary key and if you need the description, join for it.

Ints are by far the more efficient way to store and query these values -- the length of the string is stored as an int, and variable-length values really complicate storage and access. If you think strings save space or time, that is not right.


>Enums are for non user-configurable values

In the systems I work with most coded values are user configurable.

>But those should have an auto-integer primary key and if you need the description, join for it.

Not ergonomic when querying data or debugging; things like postal state are 11 instead of 'NY'.

select * from addresses where state = 11, no thanks.

Your whole result set becomes a bunch of ints that can be easily transposed, causing silly errors. Of course I have seen systems that use GUIDs to avoid collisions; boy is that fun. Just use varchar, or char if you're penny-pinching and OK with fixed sizes.

>the length of the string is stored as an int

No, it's stored as a smallint (2 bytes). So a single-character code is 3 bytes rather than a 4-byte int, and 2 chars is the same as an int. They do not complicate storage access in any meaningful way.

You could use smallint or tinyint for your primary key and I could use char(2) and char(1) and get readable codes if I wanted to really save space.


> They do not complicate storage access in any meaningful way.

Sure they do, because now your row / index is variable length rather than fixed length. Way more complicated. Even 3 bytes is way more complicated to deal with than 4 bytes.

> select * from addresses where state = 11, no thanks.

I will agree that isn't fun. Is it still the trade-off I make? Absolutely. And it's not really that big of a problem; I just do a join. It also helps prevent people from using codes instead of querying the database for the correct value -- what's the point of user-configuration if someone hard-codes 'NY' in a query or in the code?


>Sure they do, because now your row / index is variable length rather than fixed length. Way more complicated.

Come on, it's literally a 2-byte per-column header in the row, so it just sums the column lengths to get the offset; it does the same thing for fixed-length columns except it gets the column length from the schema.

It's not much more complicated than a fixed-length column; only the column length is stored in the row vs. the schema. I am not sure where you are getting the idea that it's way more complicated, nor the 3 vs 4 byte thing. The whole row is a variable-length structure and designed as such; null values change the row length for fixed or variable data types and have to be accounted for, since a null takes up no space in the column data, only in the null bitmap.
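Sketched in Python, heavily simplified (this is the summing idea, not SQL Server's actual on-disk record format):

```python
import struct

# Each variable-length column is prefixed with a 2-byte length; the reader
# sums lengths to reach a column's offset within the row.

def pack_row(values):
    """values: list of bytes, one entry per variable-length column."""
    return b"".join(struct.pack("<H", len(v)) + v for v in values)

def read_column(row, col):
    """Walk length prefixes until we reach column `col`, then slice it."""
    offset = 0
    for i in range(col + 1):
        (length,) = struct.unpack_from("<H", row, offset)
        offset += 2
        if i == col:
            return row[offset:offset + length]
        offset += length

row = pack_row([b"NY", b"pending", b""])
print(read_column(row, 1))  # b'pending'
```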

> what's the point of user-configuration of someone hard-codes 'NY' in a query or in the code

Because it doesn't matter: 'NY' isn't changing, just like 11 the int wouldn't change, but 'NY' is way easier to understand, to catch mistakes with, and to search for in code without hitting a bunch of nonsense, and to distinguish when 10 columns next to each other in a result set are all coded.

I prefer my rows to be a little more readable than 1234, 1, 11, 2, 15, 1 ,3 and the users do too.

I have had my fill of transposition bugs where someone accidentally uses the wrong int on a PK id from a different table and still gets a valid but random result that passes a foreign key check. It's almost enough to make me want to use GUIDs for PKs. At least with coded values it's easier to spot, because even with single-character codes people tend to pick things that make sense for the column's values, you know, 'P' for pending, 'C' for complete, etc., vs. 1, 2, 3, 4 used over and over across every different column with an auto-increment.


> Come on its literally...

You're the one saying a 2-character string is somehow a space savings. If we're going to split hairs that finely, then you have to know that any row with a variable-length string makes the entire row/index variable length, and that is a net storage and performance loss. It's worse in every way than a simple integer. I will admit that it ultimately doesn't matter. But I'd also argue using an nvarchar in place of varchar for this also doesn't matter. It's not just premature optimization; it's practically useless optimization.

> Because it doesn't matter, 'NY' isn't changing just like 11 the int wouldn't change, but 'NY

That's not what happens; what happens is that somebody renames New York to New Eburacum, and now your code doesn't match the value and it just adds more confusion.

But I'll grant you that it's totally fine. It's even more fine if you don't use varchar and instead use char(x).


>You're the one saying a 2 character string is somehow a space savings. If we're going to split hairs that finely then you have to know that any row with a variable length string makes the entire row/index variable length and that is a net storage and performance loss.

The row is always a variable-length structure; it has flags noting how many columns have values and whether there is a variable-length section. Only rows with no variable-length fields at all have no variable-length section, and that is a bit-flag check in the header.

You are making a non-argument: variable-length fields can be a space savings over an int with single-char codes, which is very common, and do not impact performance in any meaningful way. Besides that, one could use fixed-length chars and still get the other benefits I mentioned while having exactly the same space usage and processing as fixed-length ints.

>That's not what happens but what happens is that somebody renames New York to New Eburacum

Changing the descriptive meaning of an entry causes all sorts of problems, even more so if it is an int, because it's completely opaque; it's much harder to see an issue in the system when everything is a bunch of ints that do not correlate in any way to their meaning.

Changing the description to something that has the same meaning worded differently is usually not an issue and still gives good debug visibility to the value. If you and your users consider New Eburacum synonymous with New York, then having the code stay 'NY' should not be an issue and still be obvious when querying the data.

Unless someone is using the short code in a user visible way and it has to be updated. State is a common one that does this and nobody is changing state names or codes because it is a common settled natural key.

In the rare situation this actually needs to be done, one can update the existing data; this is not an issue in practice. You have to be extremely cautious updating the description of a code, because much data was entered under the previous description and the meaning it carries. Having the code carry some human meaning makes it more obvious to maintainers that this should be done with care; many times it would involve deprecating the old one and creating a new one with a different code, because they have different meanings. Having a table instead of an enum allows other columns to hold this metadata.

This is not the same issue as say using a SSN for a person ID.


Please take literally one course.

Do NOT use mnemonics as primary keys. It WILL bite you.


https://en.wikipedia.org/wiki/Natural_key you should have learned this in your courses.

Calm down, I am not suggesting using this for actual domain entity keys; these are used in place of enums and have their advantages. I have been doing this a long time and it has not bitten me, and I have seen many other systems designed this way working just fine.

Using an incrementing surrogate key for, say, postal state code serves no purpose other than making things harder to use and debug. Most systems have many coded values like this, and using surrogate keys would lead to a bunch of overlapping, hard-to-distinguish int data that causes all sorts of issues.


The way to do enums in SQL (generally, not just MSSQL) is another table. It's better that they don't offer several ways to do the same thing.

Mostly agree. Separate tables can have multiple attributes besides a text description and can easily be exposed for modification to the application, so users or administrators can add and modify codes.

A common extra attribute for a coded value is something for deprecation / soft delete, so that it can be marked as no longer valid for future data while existing data keeps that code; also date ranges it is valid for, parent/child code relationships, etc.

Enums would be a good feature, but they have a much more limited use case: static values you know ahead of time that will have no other attributes, where values cannot be removed even if never used, and old data cannot be migrated to new values.

Common real world codes like US postal state can take advantage of there being agreed upon codes such as 'NY' and 'New York'.


While I generally would prefer lookup tables, it's much easier to sell dev teams on "it looks and acts like a string - you don't have to change anything."

Those are all single byte characters in UTF-8.

We are talking nvarchar here; yes, UTF-8 solves this issue completely, and MSSQL supports it nowadays with varchar.

But nvarchar is UTF-16

No. Look closer.

Just to be pedantic, those characters are in 'ANSI'/CP1252 and would be fine in a varchar on many systems.

Not that I disagree — Win32/C#/Java/etc. have 16-bit characters; your entire system is already 'paying the price', so it's weird to get frugal here.
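A quick Python way to check which strings survive CP1252 (illustrative only; `fits_cp1252` is just a helper name):

```python
# CP1252 is the common single-byte "ANSI" code page: it covers Western
# European accents and the euro sign, but not arrows, Greek, math symbols...

def fits_cp1252(text):
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False

print(fits_cp1252("Seán café naïve €"))  # True: all in CP1252
print(fits_cp1252("∞ → ∆"))              # False: needs Unicode
```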


My comment contains two glyphs that are not in CP1252.

Also less awkward to make it right the first time, instead of explaining why someone can’t type their name or an emoji

Specifically not talking about a name field

> Unicode is a requirement everywhere human language is used

Strange then how it was not a requirement for many, many years.


Oh, it was. It was fun being unable to type a euro sign or the name Seán without it being garbled. Matched quotation marks weren't possible either, and arguably computer limitations killed off naïve and café too.

Don’t confuse people groaning and putting up with limitations as justifying those limitations.


In Portugal it always was; that is why we got used to writing eh for é, ah for á, he for è, c, for ç, and many other tricks.

Shared by other European languages, like ou for ö in German, kalimera for καλημέρα, and so on all around the world in non-English speaking countries during the early days of computing.


Or rather, computers had inadequate support.

It was a mess back then though. Unicode fixed that.

I'm not convinced that Unicode fixed anything. I was kind of hoping, way back when, that everyone would adopt ASCII, as a step to a more united world. But things seem to have got more differentiated, and made things much more difficult.

The options were never ASCII or unicode though. Before unicode we had ASCII + lots of different incompatible encodings that relied on metadata to be properly rendered. That's what unicode fixed

Besides I like being able to put things like →, €, ∞ or ∆ into text. With ascii a lot of things that are nowadays trivial would need markup languages


For whom? Certainly not any of the humans trying to use the computer.

Some examples of coded fields that may be known to be ascii: order name, department code, business title, cost center, location id, preferred language, account type…

Yes I have run into this regardless of client language and I consider it a defect in the optimizer.

I wouldn't consider it a defect in the optimizer; it's doing exactly what it's told to do. It cannot convert an nvarchar to varchar -- that's a narrowing conversion. All it can do is convert the other way and lose the ability to use the index. If you think that there is no danger converting an nvarchar that contains only ASCII to varchar then I have about 70+ different collations that say otherwise.

Can you give an example of what's dangerous about converting an nvarchar containing only ASCII (0-127) and then using the index, otherwise falling back to a scan?

If we simply went to UTF-8 collation using varchar then this wouldn't be an issue either, which is why you would use varchar in 2026, best of both worlds so to speak.


For a literal/parameter that happens to be ASCII, a person might know it would fit in varchar, but the optimizer has to choose a plan that stays correct in the general case, not just for that one runtime value. By telling SQL Server the parameter is an nvarchar value, you're the one telling it that it might not be ASCII.

Making a plan that works for the general case, but is also efficient, is rather trivial. Here's pseudocode from spending two minutes on the problem:

    # INPUT: lookfor: unicode
    var lower, upper: ascii
    lower = ascii_lower_bound(lookfor)
    upper = ascii_upper_bound(lookfor)
    for candidate:ascii in index_lookup(lower .. upper):
        if expensive_correct_compare_equal(candidate.field, lookfor):
            yield candidate
The magic is to have functions ascii_lower_bound and ascii_upper_bound, that compute an ASCII string such that all ASCII strings that compare smaller (greater) cannot be equal to the input. Those functions are not hard to write. Although you might have to implement versions for each supported locale-dependent text comparison algorithm, but still, not a big deal.

Worst case, 'lower' and 'upper' span the whole table - could happen if you have some really gnarly string comparison rules to deal with. But then you're no worse off than before. And most of the time you'll have lower==upper and excellent performance.
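For the simplest possible collation, binary (code-point) order, the bound functions above collapse nicely: an all-ASCII needle bounds itself exactly, and a non-ASCII needle yields an empty range since no ASCII string can equal it. A toy Python sketch under that assumption (`ascii_bounds` and `index_lookup` are my names, not a real API; locale-aware collations need much cleverer bounds):

```python
import bisect

def ascii_bounds(lookfor):
    if lookfor.isascii():
        return lookfor, lookfor   # lower == upper: an exact index seek
    return None                   # empty range: no ASCII string can match

def index_lookup(sorted_keys, lookfor):
    """Seek into a sorted (binary-collated) index using the bounds."""
    bounds = ascii_bounds(lookfor)
    if bounds is None:
        return []
    lo, hi = bounds
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]

keys = ["AL", "NY", "TX"]
print(index_lookup(keys, "NY"))   # ['NY']
print(index_lookup(keys, "Ñ¥"))   # []
```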


optimizer can't inspect the value? pretty dumb optimizer, then.

It's not "the value", it's "the values".

Running the optimizer for every execution of the same query is... not very optimal.

It can run it for a range of values: https://learn.microsoft.com/en-us/sql/relational-databases/p...

Also, the simpler and maybe better approach is to just make the decision every time as an operation in the plan: attempt the cast; if it fails, scan and cast many times the other way; if it succeeds, use the index. This isn't hard and adds one extra cast attempt on the slow path; otherwise it does what everyone currently has to do manually in their code, like this article, but transparently.

The adaptive join operator does something much more complex: https://learn.microsoft.com/en-us/sql/relational-databases/p...


I'm not sure it makes sense to add more checks and another operation to every single query just for the case where the user explicitly mislabels the types. You're going to slow down everything everywhere (slightly) for a pretty obscure case. I suspect, in the long term, this would be a bad choice.

The check is only added if the optimizer sees a varchar column with an nvarchar parameter predicate on it.

It currently just does a scan in that situation, which is orders of magnitude more expensive, with a cast for every row, vs. a single extra cast check on the one parameter value that may avoid all those per-row casts in a common situation.

There is no planning overhead; it's already detecting the situation. The execution overhead is a single extra cast on top of the cast per row, so n+1 vs. n, with the potential to eliminate the n entirely for a very common charset.


To complicate matters, SQL Server can do nvarchar compression, but they should have just done UTF-8 long ago:

https://learn.microsoft.com/en-us/sql/relational-databases/d...

Also UTF-8 is actually just a varchar collation so you don't use nvarchar with that, lol?


UTF-8 is a relatively new thing in MSSQL and had lots of issues initially, I agree it's better and should have been implemented in the product long ago.

I have avoided it and have not followed if the issues are fully resolved, I would hope they are.


> UTF-8 is a relatively new thing in MSSQL and had lots of issues initially, I agree it's better and should have been implemented in the product long ago.

Their insistence on making the rest of the world go along with their obsolete pet scheme would be annoying if I ever had to use their stuff for anything ever. UTF-8 was conceived in 1992, and here we are in 2026 with a reasonably popular database still considering it the new thing.


I would be more critical of Microsoft choosing to support UCS-2/UTF-16 if Microsoft hadn't completed their implementation of Unicode support in the 90s and then been pretty consistent with it.

Meanwhile Linux had a years long blowout in the early 2000s over switching to UTF-8 from Latin-1. And you can still encounter Linux programs that choke on UTF-8 text files or multi-byte characters 30 years later (`tr` being the one I can think of offhand). AFAIK, a shebang is still incompatible with a UTF-8 byte order mark. Yes, the UTF-8 BOM is both optional and unnecessary, but it's also explicitly allowed by the spec.


It's not really a Linux vs MS thing though. When Unicode first came out, it was 16-bit, so all the early adopters went with that. That includes Java, Windows, JavaScript, the ICU libraries, LibreOffice and its predecessors, .NET, the C language (remember wchar_t?), and probably a few more.

UTF-8 turned out to be the better approach, and it's slowly taking over, but it was not only Linux/Unix that pushed it ahead; the entire networking world did, especially HTTP. Props also to early Perl for jumping straight to UTF-8.

Still... UTF-8's superiority was clear enough by 2005 or so; MS could and should have seen it by then instead of waiting until 2019 to add UTF-8 collations to its database. Funny to see SQL Server falling behind good old MySQL on such a basic feature.


Database systems are inherently conservative -- once you add something you have to support it forever. Microsoft went hog wild on XML in the database and I haven't seen it used in over a decade now.

In 92 it was a conference talk. In 98 it was adopted by the IETF. Point probably stands though.

The data types were introduced with SQL Server 7 (1998), so I'm not sure it's accurate to state that it's considered the new thing.


thanks. now i see the point that the poster was making.

The parameter type is part of the cache identity, nvarchar and varchar would have two cache entries with possibly different plans.

Nice to have some higher yield options.

There are banks out there that will do business savings accounts not much below this (2.85%) while keeping things safe (FDIC insured) and liquid.

https://www.liveoak.bank/business-savings/


Good find- 2.85% is great for a business savings account.

All that is to say: businesses shouldn't treat all their cash the same way, especially when they have significant runway. The exact breakdown depends on the business, but typically you can think of it as three different buckets:

1) You have short-term cash, which you need immediately. This is where you'd use a checking account. This pays very close to 0% but you have immediate access. Most businesses might keep a few weeks' cash here.

2) You have short-term reserves, which is what you'd use in the next couple of months. This is where most companies might use a savings account (or even put it in a money market fund), where you know you can get the cash into your checking account in ~1 day. This pays between 2.5% and up to maybe 3.75%. Each business will structure their cash differently, and some might not even bother having this bucket.

3) Long term reserves, which you won't touch for months. This is where companies try to optimize yield, and where Palus is valuable. Even here, your money is safe, and in Palus's case can usually be in your checking account within a couple of days, but getting extra yield on long-term reserves can be super valuable.


Looks like they just recreated a tuple layout in Rust, null bitmap and all; next up would be storing them in pages and mmapping the pages.

https://www.postgresql.org/docs/current/storage-page-layout....


Absolutely, it's a very common technique :)

I wasn't sure about writing the article in the first place because of that, but I figured it may be interesting anyways because I was kind of happy with how simple it was to write this optimization when it was all done (when I started out with the task I wasn't sure if it would be hard because of how our code is structured, the libraries we use etc.). I originally posted this in the rust community, and it seems people enjoyed the post.


I think it's a good article and I enjoyed learning a little more about Rust, but it would have been nice to point out that this is a common technique used for tuple storage in databases, for those not familiar.

It comes off as a novel solution rather than connecting it to a long tradition of DB design. I believe PG, for instance, has used a null bitmap since the beginning, 40 years ago.
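A minimal Python sketch of the null-bitmap idea being discussed (illustrative only; this is neither the article's Rust code nor PG's actual tuple format):

```python
# Instead of storing an Option/NULL marker inline per field, store one bit
# per column saying whether the value is present, followed by only the
# non-null values.

def encode(row):
    bitmap, values = 0, []
    for i, v in enumerate(row):
        if v is not None:
            bitmap |= 1 << i      # bit i set => column i is non-null
            values.append(v)
    return bitmap, values

def decode(bitmap, values, ncols):
    out, it = [], iter(values)
    for i in range(ncols):
        out.append(next(it) if bitmap & (1 << i) else None)
    return out

row = ["NY", None, 42, None]
bitmap, values = encode(row)
print(bin(bitmap), values)        # 0b101 ['NY', 42]
print(decode(bitmap, values, 4))  # round-trips to the original row
```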


That would be surprising to me if anyone would think this is novel.

Using bitmaps to indicate whether things are in-use or not is very common in systems programming. Like you said PG does it, but most other systems do this too. It's not specific to databases: in an operating system, one of the first thing it needs is an allocator, the allocator most likely will use some bitmap trick somewhere to indicate what's free vs. what's available etc.


Interesting, so you don't find it odd that an article about the storage engine of a SQL database system explains the solution to a problem without once mentioning that this is the way most other SQL database engines solve it? It is mentioned several times that SQL tables are typically sparse, but not that this is how it's typically solved, only that this is how you solved it...

>The fix is simple: we stop storing Option<T> and instead we store a bitmap that records which fields are None.

Right here would have been the opportunity to let those not familiar with database internals know. Something like "This technique has been widely used in many RDBMS systems over the years, you can see the PG version of this in their documentation for page layout".

Instead you go into detail on what a null bitmap is and how to implement it, calling it a "trick". Which is strange if you think your target audience is assumed to already know this common technique.

I mean not one mention of the origin of the trick or even calling it a common solution to this problem...


> not one mention of the origin of the trick

And who or what would you say is the origin? The "trick" is so old I'm afraid it is lost to time to say who invented the bitmap. It was used in MULTICS or THE long before PostgreSQL was invented.


There is a lot of space between no acknowledgment at all, like this article, and tracing down the exact origin of who invented the null bitmap and in what system.

Your version comes off as self-promotion: going into detail about how to implement the technique without once calling it a common technique or even mentioning it's widely used in the industry to solve the problem you are solving.

It would be like saying finding rows in storage is slow, so we implemented a b-tree in a separate structure for certain columns, and going into detail on how to create a b-tree, calling it a "trick", while never once acknowledging that it is the most common form of indexing in RDBMSs. Would you understand how that would seem strange?


PostgreSQL uses a process-per-connection model and has no way to serialize a query plan into a form that can be shared between processes, so the time it takes to make the plan, including JIT, is very important.

Most other DBs cache query plans, including JITted code, so they are basically precompiled from one request to the next for the same statement.
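The kind of cache being described can be sketched as a map keyed on the normalized statement text (names and normalization here are hypothetical; real engines also key on database, schema version, session settings, etc.):

```python
import hashlib

class PlanCache:
    """Toy plan cache keyed by query text: the same statement text
    reuses the already-compiled plan instead of re-planning."""
    def __init__(self):
        self.plans = {}
        self.compiles = 0   # count how often we actually had to plan

    def get_plan(self, sql):
        # crude normalization; real engines are far more careful
        key = hashlib.sha256(sql.strip().lower().encode()).hexdigest()
        if key not in self.plans:
            self.compiles += 1
            self.plans[key] = f"plan-for:{key[:8]}"  # stand-in for a real plan
        return self.plans[key]

cache = PlanCache()
p1 = cache.get_plan("SELECT * FROM t WHERE id = ?")
p2 = cache.get_plan("select * from t where id = ?")
print(cache.compiles)  # 1 -- the second lookup hit the cache
```

The whole point is that the expensive step (planning/JIT) runs once per distinct statement, not once per execution.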


What do you mean? Because the obvious thing is a shared cache, and if there is one thing the writers of a DB know, it is locking.

Sharing executable code between processes is not as easy as sharing data. AFAIK, unless something's changed recently, PG shares nothing about plans between processes and can't even share a cached plan between sessions/connections.

Executable code is literally just data that you mark as executable. It did JIT the code, and the idea that it can't then share it between processes is incomprehensible.

I was actually confused by this submission, as it puts so much emphasis on initial compilation time, when every DB (apparently except for pgsql) caches that result and shares/reuses it until invalidation. Invalidation can occur for a wide variety of reasons (data composition changing, age, etc.), but still, the idea of redoing it on every query, when most DBs see the same queries endlessly, is insane.
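The reuse-until-invalidation pattern is easy to sketch; here is a toy version where a cached plan is only valid while the schema version it was compiled against still holds (all names hypothetical, Python purely for illustration):

```python
class InvalidatingPlanCache:
    """Cache entries carry the schema version they were compiled
    against; a stale entry forces a recompile."""
    def __init__(self):
        self.entries = {}   # sql -> (schema_version, plan)

    def get(self, sql, schema_version, compile_fn):
        hit = self.entries.get(sql)
        if hit and hit[0] == schema_version:
            return hit[1]                      # reuse until invalidation
        plan = compile_fn(sql)                 # recompile on miss or stale hit
        self.entries[sql] = (schema_version, plan)
        return plan

cache = InvalidatingPlanCache()
compiles = []
compile_fn = lambda s: compiles.append(s) or f"plan({s})"

plan_a = cache.get("SELECT 1", schema_version=1, compile_fn=compile_fn)
plan_b = cache.get("SELECT 1", schema_version=1, compile_fn=compile_fn)  # hit
plan_c = cache.get("SELECT 1", schema_version=2, compile_fn=compile_fn)  # stale
print(len(compiles))  # 2 compiles for 3 lookups
```

Age-based or memory-pressure eviction would hang off the same lookup path; the key property is that the common case pays nothing.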


No, a lot of JITted code has pointers to addresses specific to that process, which make no sense in another process.

Making code shareable between processes takes effort and has performance tradeoffs, since it is no longer specialized to the process.

If the query plan were at least serializable, which is more like an AST, then at least that part could be reused, and then maybe each process could keep its own in-memory cache of JITted code that the plan references by some key.

DBs like MSSQL avoid the problem because they run a single OS process with multiple threads instead. This is also why they can handle more connections easily, since each connection is not a whole process.
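The "serializable AST" idea above can be sketched like this: a plan that is plain data can cross a process boundary, and each receiving process re-links it to its own local functions instead of carrying raw pointers. Everything here (the plan shape, the `Eq` operator table) is invented for illustration, not actual Postgres machinery:

```python
import json

# A plan as plain data (an AST) can be serialized and shared;
# a plan full of process-specific pointers cannot.
plan = {
    "op": "SeqScan",
    "table": "orders",   # hypothetical plan shape
    "filter": {"op": "Eq", "col": "status", "value": "open"},
}

wire = json.dumps(plan)  # what a shared plan cache would store

# In another backend process: rebuild the plan and resolve operator
# names to local callables (standing in for function pointers).
local_ops = {"Eq": lambda row, col, value: row[col] == value}

def run(node, rows):
    f = node["filter"]
    pred = local_ops[f["op"]]   # re-resolved per process, not a raw pointer
    return [r for r in rows if pred(r, f["col"], f["value"])]

restored = json.loads(wire)
print(run(restored, [{"status": "open"}, {"status": "closed"}]))
# [{'status': 'open'}]
```

The re-resolution step is exactly the extra work (and lost specialization) the parent comment is pointing at.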


What does "specialized to the process" mean? Lots of JIT tooling these days readily supports caching and precompilation. Invalidation is hard, but things like reloading global references are hardly intractable problems, especially for an org as large as pgsql.

Pointers to process-specific memory addresses: functions or other data structures that only exist in that process. I didn't say it was intractable, only that it takes more effort; other databases do it.

The current PG query plan and JIT are designed around living in memory in a single process; this would need to be extracted into something not process-specific and shared between all processes. The plan itself is just a bunch of C structs, I believe.


The emphasis on compilation time there is because the JIT provider that comes with Postgres (LLVM-based) is broken in that particular area. But you're right, JITed code can be cached if some conditions are met (it's position-independent, for one). Not all JIT providers do that, but many do. Caching is on the table, but if your JIT compilation takes microseconds, caching could be rather a burden in many cases. Still useful for some cases.

Write the binary to a file, call it `libquery-id1234.so`, and link that into whichever processes need it?

Might want to take a look at some research like this [1] that goes over the issues:

"This obvious drawback of the current software architecture motivates our work: sharing JIT code caches across applications. During the exploration of this idea, we have encountered several challenges. First of all, most JIT compilers leverage both runtime context and profile information to generate optimized code. The compiled code may be embedded with runtime-specific pointers, simplified through unique class-hierarchy analysis, or inlined recursively. Each of these "improve- ments" can decrease the shareability of JIT compiled code."

Anything's doable here with enough dev time. Would be nice if PG could just serialize the query plan itself, maybe as an .so along with non-process-specific executable code that then gets dynamically linked again in other processes.

1. https://dl.acm.org/doi/10.1145/3276494


Won't work well if it executes 20k+ queries per second. The filesystem will be a bottleneck, among other things.

You can put more than one function in one file.

Sure, but not more than one query per file

Hm, what is preventing you from putting more than one query into the same file?

The fact that you plan and execute query by query?

> and it has no way to serialize a query plan to some form that can be shared between processes

https://www.postgresql.org/docs/current/parallel-query.html

"PostgreSQL can devise query plans that can leverage multiple CPUs in order to answer queries faster."


Nothing to do with plan caching; that's just talking about parallel plan execution. Is that thread-based or process-based in PG?

If process-based, then they can already send small parts of a plan across processes.


Ah, didn't see the caching part.

Plans for prepared statements are cached though.


Yes, if the client manually prepares the statement it will be cached for just that connection, because in PG a connection is a process, but it won't survive from one connection to the next even in the same process.

Other databases like MSSQL have prepared statements too, but they are rarely used nowadays since plan caching based on query text was introduced decades ago.


Only on a per-connection basis
