I'm curious about what makes this implementation faster than the alternatives.
Also, if you care about function call performance, I guess you'd use PyPy. Have you tried to run the benchmarks (with appropriate warmup) on PyPy to see if the results carry over?
Mainly codegen. Most other libraries do something like `return dispatch_dictionary[tuple(map(type, args))](*args)`, whereas ovld generates a dispatch function specialized to the set of actual signatures of the overloads (still backed by a dictionary, but you remove a surprising amount of overhead just by removing varargs). The generated code is registered in the linecache, so you can step into it with pdb if you want to look at it. After that I became a bit obsessive about pushing it further and further (because it's fun). So... when dependent types are involved, it will generate custom dispatch code: if you define an @ovld with x: Literal[0], for example, it will generate an `if x == 0` check in the int dispatch path (and you can define custom codegens for new types like Regexp).
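To make the contrast concrete, here is a minimal sketch of the two strategies. This is not ovld's actual generated code, just an illustration of why dropping varargs and compiling `Literal` checks into the dispatch path helps; all function names here are made up for the example.

```python
# Overload implementations for two signatures.
def add_ints(x, y):
    return x + y

def add_strs(x, y):
    return x + " " + y

dispatch_dictionary = {
    (int, int): add_ints,
    (str, str): add_strs,
}

# Generic approach used by most dispatch libraries: a varargs wrapper
# that packs args into a tuple, builds a type key, and unpacks again.
def generic_dispatch(*args):
    return dispatch_dictionary[tuple(map(type, args))](*args)

# What codegen can emit instead, once the registered signatures are
# known: a fixed-arity function with no *args packing/unpacking and no
# tuple(map(...)) allocation on the hot path.
def specialized_dispatch(x, y):
    return dispatch_dictionary[type(x), type(y)](x, y)

# A dependent type like Literal[0] can be compiled into an explicit
# value check inside the int branch, roughly:
def dispatch_with_literal(x):
    if type(x) is int:
        if x == 0:          # the generated `if x == 0` mentioned above
            return "zero case"
        return "other int"
    raise TypeError(type(x))
```

Both dispatchers consult the same dictionary; the specialized one just avoids the per-call tuple construction and argument re-splatting.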
Regarding PyPy, I did run some benchmarks recently (I use pytest-benchmark). The advantage over other libraries remains and the magnitude is similar, but when I compare with custom if/isinstance code, that code is optimized a lot more aggressively and gains a significant edge (let's say ~5x). Now, since I'm into codegen, part of me feels like meeting the challenge and figuring out a way to generate optimal code from a set of signatures, but I think I've spent enough time on it as it is haha.
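For context, the hand-written baseline in that comparison is the kind of explicit isinstance chain below. It's a hypothetical example, not the actual benchmark code; the point is that a JIT like PyPy's can trace and specialize a fixed branch sequence like this more aggressively than a dictionary lookup.

```python
# Hypothetical hand-written dispatch: explicit isinstance checks,
# no dictionary, no indirection. PyPy's tracing JIT can often
# constant-fold these branches once the argument type is known.
def handwritten(x):
    if isinstance(x, int):
        return x * 2
    elif isinstance(x, str):
        return x.upper()
    raise TypeError(type(x))
```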