
Can you give an example of what's dangerous about converting an nvarchar containing only ASCII (0-127) and then using the index, otherwise falling back to a scan?

If we simply went to a UTF-8 collation with varchar, this wouldn't be an issue either, which is why you'd use varchar in 2026: best of both worlds, so to speak.



For a literal/parameter that happens to be ASCII, a person might know it would fit in varchar, but the optimizer has to choose a plan that stays correct in the general case, not just for that one runtime value. By declaring the parameter as nvarchar, you're the one telling SQL Server that it might not be ASCII.


Making a plan that works for the general case, but is also efficient, is rather trivial. Here's pseudocode from spending two minutes on the problem:

    # INPUT: lookfor: unicode
    var lower, upper: ascii
    lower = ascii_lower_bound(lookfor)
    upper = ascii_upper_bound(lookfor)
    for candidate: ascii in index_lookup(lower .. upper):
        if expensive_correct_compare_equal(candidate.field, lookfor):
            yield candidate

The magic is to have functions ascii_lower_bound and ascii_upper_bound that compute an ASCII string such that all ASCII strings comparing smaller (respectively greater) cannot be equal to the input. Those functions are not hard to write. You might have to implement a version for each supported locale-dependent text-comparison algorithm, but that's still not a big deal.

Worst case, 'lower' and 'upper' span the whole table - could happen if you have some really gnarly string comparison rules to deal with. But then you're no worse off than before. And most of the time you'll have lower==upper and excellent performance.
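A minimal runnable sketch of that idea, assuming a plain case-insensitive, accent-sensitive ASCII comparison (real locale-aware collations would need more careful bound functions). The names `ascii_bounds` and `index_lookup`, and the sorted-list stand-in for a B-tree index, are all hypothetical, not anything SQL Server exposes:

```python
import bisect

def ascii_bounds(lookfor):
    """Return (lower, upper) such that, under case-insensitive comparison,
    any ASCII string equal to `lookfor` must sort within [lower, upper].
    Returns None when no ASCII string can possibly match."""
    if any(ord(c) > 127 for c in lookfor):
        return None  # a non-ASCII value can't equal a pure-ASCII string here
    # In codepoint order 'A' < 'a', so every case variant of `lookfor`
    # sorts between its all-uppercase and all-lowercase forms.
    return lookfor.upper(), lookfor.lower()

def index_lookup(index, lookfor):
    """Range-scan a sorted list (standing in for a B-tree index), then
    apply the expensive correct comparison to each candidate."""
    bounds = ascii_bounds(lookfor)
    if bounds is None:
        return  # empty range: nothing in the index can match
    lo = bisect.bisect_left(index, bounds[0])
    hi = bisect.bisect_right(index, bounds[1])
    for candidate in index[lo:hi]:
        if candidate.casefold() == lookfor.casefold():  # correct compare
            yield candidate
```

Against a toy index sorted(['ABC', 'AbC', 'abc', 'abd', 'xyz']), looking up 'abc' range-scans only three entries and yields the three case variants, while a non-ASCII lookup like 'Año' skips the index entirely.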


optimizer can't inspect the value? pretty dumb optimizer, then.


It's not "the value", it's "the values".


Running the optimizer for every execution of the same query is... not very optimal.


It can run it for a range of values: https://learn.microsoft.com/en-us/sql/relational-databases/p...

Also, a simpler and maybe better approach is to just make the decision on every execution, as an operation in the plan: attempt the cast; if it fails, fall back to scanning and casting many times the other way; if it succeeds, use the index. This isn't hard, and it adds one extra cast attempt on the slow path. Otherwise it does transparently what everyone has to do manually in their code, like in this article.
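That per-execution decision could be sketched like this (a toy model, not SQL Server internals; the dict-as-index and the `lookup` helper are made-up names):

```python
def lookup(index, table, param):
    """index: dict mapping ASCII values to matching rows (an index seek);
    table: every row (a scan); param: an nvarchar-style unicode string."""
    try:
        narrowed = param.encode("ascii").decode()  # the one extra cast attempt
    except UnicodeEncodeError:
        # Cast failed: the parameter can't be matched via the narrow index,
        # so fall back to the scan, comparing each row widened to unicode.
        return [row for row in table if row == param]
    # Cast succeeded: seek the index with the narrowed value.
    return index.get(narrowed, [])
```

The fast path pays one cast attempt and one seek; only a genuinely non-ASCII parameter falls through to the per-row work.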

The adaptive join operator does something much more complex: https://learn.microsoft.com/en-us/sql/relational-databases/p...


I'm not sure it makes sense to add more checks and another operation to every single query just for the case where the user explicitly mislabels the types. You're going to slow down everything everywhere (slightly) for a pretty obscure case. I suspect, in the long term, this would be a bad choice.


The check is added only if the planner sees a varchar column with an nvarchar parameter predicate on it.

It currently just does a scan in that situation, which is orders of magnitude more expensive, with a cast for every row, versus a single extra cast check on the parameter value that may avoid all those per-row casts in a common situation.

There is no planning overhead; it's already detecting the situation. The execution overhead is a single extra cast on top of the cast per row, so n+1 vs n, with the potential to eliminate the n entirely for a very common charset.
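As a back-of-envelope check on that n+1 vs n claim (toy cost model, hypothetical function names, counting casts/compares as the unit of work):

```python
def scan_cost(n_rows):
    # current behaviour: implicit cast + compare on every row
    return n_rows

def adaptive_cost(n_rows, param_fits_charset, seek_touches=1):
    # proposed behaviour: one cast attempt on the parameter, then an
    # index seek when it fits, or the full scan when it doesn't
    probe = 1
    return probe + (seek_touches if param_fits_charset else n_rows)
```

For a million-row table and an ASCII parameter that's 2 operations instead of 1,000,000; when the parameter genuinely doesn't fit, it's 1,000,001, the n+1 worst case.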



