If you are seriously worried about the performance implications of a binary tree...

chc · on Nov 10, 2010

I really hate arguments along these lines. Python programs might not be as fast in the average case as, say, a C program, but that doesn't mean you don't care about performance at all. Yes, I will take a several-microsecond slowdown for the convenience of using Python over C. That's a fair tradeoff, as it really isn't noticeable from my perspective: it's the difference between, like, 10% CPU utilization and 1% utilization, and still done in the blink of an eye. But the difference between 56 seconds to process some data and 0.8 seconds? That matters.

jerf · on Nov 10, 2010

I agree. But a binary tree vs. a hash table isn't going to be that difference. That's my point. Python won't excuse an O(n^3) when you should have been using an O(log n) algorithm, but the constant factors of Python are so huge they dominate a lot of the choices you'd make between two different O(log n) algorithms. You'd never notice the difference between the existing built-in Python dict and a hypothetical built-in binary tree. It would be buried by the fundamentally cache-hostile things Python already does, bouncing all around memory with all the indirection it incurs on every line, none of which has anything to do with either structure.
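
For anyone who wants to measure this rather than take my word for it, here is a rough sketch. Python has no built-in binary tree, so a sorted list searched with `bisect` stands in for an O(log n) ordered lookup; that substitution is an assumption and not a real balanced tree. The function names are made up for illustration. Timings depend on the machine and interpreter, so none are quoted here.

```python
import bisect
import timeit


def build_structures(n):
    """Build a dict and a parallel sorted-key/value pair holding the same data."""
    keys = list(range(0, 2 * n, 2))        # even keys, already in sorted order
    values = [k * 10 for k in keys]
    table = dict(zip(keys, values))        # hash table: O(1) average lookup
    return table, keys, values


def hash_lookup(table, key):
    """Lookup through the built-in dict."""
    return table.get(key)


def tree_like_lookup(sorted_keys, sorted_vals, key):
    """O(log n) lookup by binary search, standing in for a binary tree."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_vals[i]
    return None


if __name__ == "__main__":
    table, keys, values = build_structures(100_000)
    probe = keys[len(keys) // 2]
    t_hash = timeit.timeit(lambda: hash_lookup(table, probe), number=100_000)
    t_tree = timeit.timeit(
        lambda: tree_like_lookup(keys, values, probe), number=100_000
    )
    # Compare both against the cost of the surrounding interpreted code.
    print(f"dict lookups:   {t_hash:.4f} s")
    print(f"bisect lookups: {t_tree:.4f} s")
```

Whatever numbers come out, the thing to compare them against is the cost of everything else the loop does per item.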

jemfinch · on Nov 10, 2010

> But the difference between 56 seconds to process some data and 0.8 seconds? That matters.

This must be a purely theoretical argument for you, because you wouldn't be arguing this point if you were consistently analyzing data in Python and C/C++.

I routinely analyze gigabytes of aggregated data in both Python and C++, and have occasionally rewritten the Python in C++ to be faster (as an aside, for basic processing it's surprising how simple the translation is). I routinely get between 10x and 100x improvement in computation speed, and often see similar reductions in memory used.

Python is not slower than C/C++ by a few microseconds; it's slower for most operations by two or more orders of magnitude. Whether you have a dictionary implemented by a hash table or by a balanced binary tree is not going to have any significant impact on your analysis speed.

chc · on Nov 10, 2010

It's not purely theoretical — it's just not universally applicable. Even Ruby is fast enough to do most things faster than I can perceive, but I know from practical experience that there are fast and slow ways to do those things even in a "slow" language. I've tweaked Ruby algorithms before to make programs go from "OMG Ruby is so slow!" to perceptually the same as a C program (though a few tenths of a second slower in absolute terms).

What you're arguing here is that because there are some cases where pure Python is simply not fast enough, there is no point in thinking about performance at all in a Python program. It's a false dichotomy. There's a wide range of performance options between "just forget about performance altogether" and "rewrite the entire application in C++."

jemfinch · on Nov 10, 2010

> What you're arguing here is that because there are some cases where pure Python is simply not fast enough, there is no point in thinking about performance at all in a Python program. It's a false dichotomy.

No, it's a very real dichotomy. If your complexities are right (e.g., you're not using a quadratic algorithm where you could be using a linearithmic algorithm) and Python still isn't fast enough for your task, switching from a hash table to a binary tree is not going to make it fast enough.
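
To illustrate what "getting your complexities right" means here, a minimal sketch using duplicate detection as a stand-in problem (the example and the function names are mine, picked only for illustration). The first version is quadratic, the second is linearithmic. That is the kind of change that matters in any language; swapping one logarithmic or constant-time container for another is not.

```python
def has_duplicates_quadratic(items):
    """O(n^2): compare every pair of elements."""
    n = len(items)
    for i in range(n):
        for j in range(i + 1, n):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linearithmic(items):
    """O(n log n): sort a copy, then any duplicates must be adjacent."""
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    data = list(range(2000)) + [1234]   # exactly one repeated value
    print(has_duplicates_quadratic(data))
    print(has_duplicates_linearithmic(data))
```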