
Except for one. Quick: What's the syntax for a binary tree in Python?

OK, I'll give away the answer: It's not there. I don't think binary trees are even in the standard library for Python or Ruby. It's a sadly neglected data structure because, when exposed for what it is, it's a little bit too computer science-y for most people's taste.



Balanced binary tree versus hash table is an implementation choice for the associative array abstract data type. Similarly, growable array versus linked list are implementation choices for the list abstract data type.

I don't buy that balanced binary trees are "too computer science-y for most people's taste" as an argument against having them as implementations of native types in Python. Designing a good hash table is just as subtle, in terms of both the theory and the implementation itself. But those details are not exposed through the interface.


"Balanced binary tree versus hash table is an implementation choice for the associative array abstract data type."

You can treat it as an implementation detail, but it's a good idea to let the client know that the underlying key-value collection is actually sorted by key. This allows the client to iterate over the key-value pairs in sorted order without dumping the keys to an array and then sorting it, retrieving the k smallest keys through forward iteration, retrieving the k largest keys through reverse iteration, etc. I think Java solves this nicely by creating the SortedMap subinterface of Map; if you have a SortedMap, you're assured these properties hold. (The typical implementation of SortedMap is a balanced tree, while the typical implementation of Map is a hash table.)
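A minimal Python sketch of that SortedMap-style interface (the `SortedDict` class and its method names here are hypothetical, and bisect makes inserts O(n) rather than a balanced tree's O(log n), but the interface guarantees are the point):

```python
import bisect

class SortedDict:
    """A dict-like map that keeps keys sorted, so clients get ordered
    iteration and cheap k-smallest / k-largest queries -- the SortedMap
    guarantees, minus the balanced-tree insert cost."""

    def __init__(self):
        self._data = {}
        self._keys = []

    def __setitem__(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def smallest(self, k):
        """The k smallest keys, via forward iteration."""
        return self._keys[:k]

    def largest(self, k):
        """The k largest keys, via reverse iteration."""
        return self._keys[-k:][::-1]

    def items(self):
        """Key-value pairs in sorted key order, no sort at query time."""
        return ((k, self._data[k]) for k in self._keys)
```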


A binary tree also allows efficient iteration in key order ("next" and "previous" operations).


Yes. As the other poster argued, a tree needs some kind of ordering (just as hash tables need the keys to be hashable). But trees also give you a persistent data structure. Most hash tables, including Python's dict, are ephemeral, i.e. not copy-on-write.
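A small sketch of that persistence in Python (an unbalanced BST for brevity; a real implementation would balance): insert returns a new tree that shares unchanged subtrees with the old one, so both versions remain usable.

```python
class Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(node, key):
    """Return a new tree containing key; the old tree is untouched.
    Only the path from root to insertion point is copied."""
    if node is None:
        return Node(key)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key))
    return node  # key already present: share the whole tree

def contains(node, key):
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False
```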


Python:

    import sqlite3

    conn = sqlite3.connect('foo')
    curs = conn.cursor()
    curs.execute('create table things (id INTEGER PRIMARY KEY, stuff TEXT)')
    curs.execute('CREATE UNIQUE INDEX tree ON things (id ASC)')
Now "tree" is your tree. ;)

Quite seriously, though, people (that aren't systems programmers) don't think "this is a load of data—I need a binary tree!" People instead think "this is a load of data—I need a database!" and then, when the database isn't actually helping their queries go faster, they slap their foreheads and exclaim "oh, wait, I need an indexed database!" Of course, the index just happens to be stored as a B-tree—but they don't know that, and they don't care.

So, to reinterpret, this is basically the data-structure-usage flowchart for the average app programmer:

1. Need quick random access? Use an array.

2. Need quick insertion and removal, or lazy reification (to support, e.g., infinite sequences)? Linked lists, then.

3. Need quick search by key? Hash tables have you covered. (Well, these days, any "hash table" ADT also carries around its keys, so you also get O(N) enumeration of pairs.)

4. And, if your data set makes any of the above solutions choke? Just throw a database at it, and hope it helps.
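In Python terms, the flowchart above maps roughly onto (a loose sketch — deque is really a doubly linked list of array blocks, not a classic linked list):

```python
import sqlite3
from itertools import count
from collections import deque

random_access = [10, 20, 30]             # 1. quick random access: array (list)
queue = deque([1, 2, 3])                 # 2. quick insertion/removal: deque
lazy_squares = (n * n for n in count())  # 2. lazy reification: generator
by_key = {"alice": 1, "bob": 2}          # 3. quick search by key: dict (hash table)
db = sqlite3.connect(":memory:")         # 4. everything else: throw a database at it
```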


I apologize for being a little bit pedantic here, but a B-tree is not the same thing as a balanced binary tree, and a SQL database is something different still. Even if the SQL database happens to use binary trees somewhere in its implementation, using that database is not at all the same as using a binary tree for your data structures.


I didn't mean to imply that they were the same—just that, in practice, where a systems programmer thinks "use an in-memory binary tree data structure", an application programmer thinks "use a database", and a B-tree is what they end up using under the layers of abstraction and encapsulation. Yes, a binary tree and a B-tree are bad substitutes for one another, but you'll be much, much better off substituting a B-tree for a binary tree, or vice-versa, than substituting in, say, a linked list.

A cooking analogy: you wouldn't drink cream instead of milk—but if a recipe calls for milk and you don't have any, cream is a much better call than orange soda.

For some things (implementing a virtual filesystem, say), the app programmer will make the call to use a database, and they'll be fine, because a B-tree is just what a virtual filesystem needs. For others—storing and evaluating a Lisp AST, for example—they'll have to make O(N) roundtrips through the database and create a bunch of useless indexing cruft, but if the AST is complex enough, and if they're beta-reducing and caching their results in each parent node as they go, then this is still likely to outperform an alternative implementation that naively uses any of the other three data structures.


Am I -1 points of wrong here? I made two assertions:

1. People writing code in HLLs don't really use binary trees directly;

2. When people writing code in HLLs have a problem that would be best solved with a binary tree, and they can't find a wrapped native library that internally uses binary trees, then the generic solution that 90% of developers rely on (whether it helps or not) is to use SQLite, BDB, or another library which basically acts as a complex surrogate for a tree data structure.


I think it's simpler than that: when you have arrays, linked lists and hash tables there aren't _that_ many cases when you need a binary tree, so I guess the designers felt they weren't worth baking into the language at a syntax level. I agree it would have been nice to have in the standard library though.


This is sort of a "when all you have is a hammer…" situation. You don't need binary trees, but only in the same sense most of these data structures can stand in for one or more of the others. Binary trees can actually be better than hash tables in many cases where hash tables are used in mainstream scripting languages simply due to language support. An informative post on the subject: http://enfranchisedmind.com/blog/posts/problems-with-hash-ta...


If you are seriously worried about the performance implications of a binary tree vs. the Python hashes, you are working in the wrong language. Whatever performance implications you are worried about are utterly dwarfed by the Python penalty.

I am not being sarcastic or cynical. There really isn't much point in really nailing some of these things to the wall in Python or Perl or Ruby or the other interpreted languages hanging out at the bottom of the shootout[1] because you've already thrown performance out the window as a consideration just by cracking open your editor and starting a new file. This is an argument to take up with languages much higher up the shootout.

[1]: http://shootout.alioth.debian.org/ , if you don't know what I mean by that. The benchmarks there may not see all and know all, but they aren't meaningless, either.


I really hate arguments along these lines. Python programs might not be as fast in the average case as, say, a C program, but that doesn't mean you don't care about performance at all. Yes, I will take a several-microsecond slowdown for the convenience of using Python over C. That's a fair tradeoff, as it really isn't noticeable from my perspective — it's the difference between, like, 10% CPU utilization and 1% utilization, and still done in the blink of an eye. But the difference between 56 seconds to process some data and 0.8 seconds? That matters.


I agree. But a binary tree vs. a hash table isn't going to be that difference. That's my point. Python won't excuse an O(n^3) when you should have been using an O(log n) algorithm, but the constant factors of Python are so huge they dominate a lot of the choices you'd make between two different O(log n) algorithms. You'd never notice the difference between the existing built-in Python dict and a hypothetical built-in binary tree, it would just be buried by some of the fundamentally-cache-hostile stuff Python does as it is bouncing all around memory with all the indirection it has on every line that has nothing to do with either structure.
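A hedged micro-benchmark sketch of that claim: compare the C-implemented dict against an O(log n) lookup done in pure Python (bisect over a sorted list, standing in for a tree). Exact numbers vary by machine; the point is that both finish quickly in absolute terms, and the gap between them is a constant factor dominated by interpreter overhead, not an asymptotic difference.

```python
import bisect
import timeit

n = 1 << 16
keys = list(range(n))
d = dict.fromkeys(keys, 0)

# 100k lookups each; dict wins on constant factors, but neither choice
# is where a slow Python program is actually losing its time.
t_dict = timeit.timeit(lambda: d[n // 2], number=100_000)
t_tree_ish = timeit.timeit(lambda: keys[bisect.bisect_left(keys, n // 2)],
                           number=100_000)
```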


> But the difference between 56 seconds to process some data and 0.8 seconds? That matters.

This must be a purely theoretical argument for you, because you wouldn't be arguing this point if you were consistently analyzing data in Python and C/C++.

I routinely analyze gigabytes of aggregated data in both Python and C++, and have occasionally rewritten the Python in C++ to be faster (as an aside, for basic processing it's surprising how simple the translation is). I routinely get between 10x and 100x improvement in computation speed, and often see similar reductions in memory used.

Python is not slower than C/C++ by a few microseconds; it's slower for most operations by two or more orders of magnitude. Whether you have a dictionary implemented by a hash table or by a balanced binary tree is not going to have any significant impact on your analysis speed.


It's not purely theoretical — it's just not universally applicable. Even Ruby is fast enough to do most things faster than I can perceive, but I know from practical experience that there are fast and slow ways to do those things even in a "slow" language. I've tweaked Ruby algorithms before to make programs go from "OMG Ruby is so slow!" to perceptually the same as a C program (though a few tenths of a second slower in absolute terms).

What you're arguing here is that because there are some cases where pure Python is simply not fast enough, there is no point in thinking about performance at all in a Python program. It's a false dichotomy. There's a wide range of performance options between "just forget about performance altogether" and "rewrite the entire application in C++."


> What you're arguing here is that because there are some cases where pure Python is simply not fast enough, there is no point in thinking about performance at all in a Python program. It's a false dichotomy.

No, it's a very real dichotomy. If your complexities are right (e.g., you're not using a quadratic algorithm where you could be using a linearithmic algorithm) and Python still isn't fast enough for your task, switching from a hash table to a binary tree is not going to make it fast enough.


That's a great article, and I agree with most of it (e.g., that a balanced binary tree should be the default set/map data structure in programming languages; unfortunately it seems only C++ gets that right), but its reference to FNV as a good hash function is anachronistic.

For a more modern and much faster hash, see http://sites.google.com/site/murmurhash/ . Murmurhash3 is an order of magnitude faster than FNV: http://code.google.com/p/smhasher/wiki/MurmurHash3 .
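For reference, part of FNV-1a's popularity is that it is tiny — a straightforward 32-bit version fits in a few lines, whereas MurmurHash3 is considerably more involved (and correspondingly faster on bulk data):

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: xor each byte into the state, then multiply by the
    FNV prime, keeping only the low 32 bits."""
    h = 0x811C9DC5  # FNV-1a 32-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h
```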


The article actually just says FNV is popular, not good. The point (as I understand it) was how common bad hash functions are.


Haskell also gets it right. Look at Data.Map and Data.Set.


Haskell "gets it right" because you can't write an applicative hash table with the same computational complexity as a mutable one. Haskell's option is either "ordinary balanced binary trees" or "hash table in the IO monad".


Your point about the need of certain data structures reminds me of the famous humorous text "Real programmers don't use Pascal", which said:

"As all Real Programmers know, the only useful data structure is the Array. Strings, Lists, Structures, Sets-- these are all special cases of arrays and can be treated that way just as easily without messing up your programming language with all sorts of complications."

The full text is available here and is a recommended read if you have some spare minutes: http://www.pbm.com/~lindahl/real.programmers.html


Except for two, actually: there aren't any linked lists in Python, either.

Why? Because there's only one thing that a linked list is better at than a growable array: O(1) splice. Everything else is better with a growable array.
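In concrete Python terms (collections.deque is the closest stdlib stand-in for a linked list, though under the hood it's a doubly linked list of array blocks):

```python
from collections import deque

# Growable array: front insertion shifts every element, so it's O(n).
xs = [1, 2, 3]
xs.insert(0, 0)

# Linked structure: appends and pops at either end are O(1) --
# but random access by index is where the growable array wins.
dq = deque([1, 2, 3])
dq.appendleft(0)
```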


cons


Yes, and?

Growable arrays use the same or less memory and offer random access to elements. They're not applicative, but then, if you insist on applicative data structures you're obviously not very concerned with performance anyway so there's no reason to bother with any particular data structure.


The burden of proof should be on the mutable end, not on the applicative one. Purely functional data structures are far less error prone than mutable ones. They have simpler APIs, which lead to simpler and clearer code. In other words, mutable state is the biggest source of errors. It should be avoided when possible, and insulated otherwise. http://www.loup-vaillant.fr/articles/assignment

Therefore, mutable data-structures are almost always an optimization. You shouldn't use them unless your program is demonstrably too slow for your needs. (Incidentally, the same argument applies more strongly to C itself.)

Finally, immutable data structures can sometimes be faster, for instance when you do backtracking, or keep track of undo information. With mutable data structure, you have to copy the whole thing.
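A tiny sketch of the undo point, using Python's immutable tuples (tuple concatenation copies, so this lacks the structural sharing a real persistent tree provides, but the old versions stay valid for free):

```python
# Keep every version as an immutable snapshot; "undo" is just stepping
# back to an earlier reference -- no defensive copying needed, because
# nothing is ever mutated in place.
history = [()]
for item in (1, 2, 3):
    history.append(history[-1] + (item,))  # new tuple; old tuples untouched

current = history[-1]  # the latest version
undone = history[-2]   # the previous version, still intact
```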


> The burden of proof should be on the mutable end, not on the applicative one.

On the contrary: the burden of proof is on the uncommon software development methodology, on the uncommon data structure, on the uncommon programming languages. Billions of lines of effective code have been written in imperative languages using mutable data structures. Such programs are what make entire industries possible today. Your love of abstract applicative elegance does not make up for the fact that in the world we live, imperative languages with mutable data structures rule. The burden of proof is not on the status quo, but on the challengers to the status quo.

As of yet, applicative programming has not borne that burden.


Mutable state and manual memory management are responsible for almost all bugs I have seen. There are other classes of bugs of course, but these two far out-cost them.

Mutable state and manual memory management are mostly unnecessary. That has been proven years ago. See GHC, Xmonad, and Darcs for substantial examples. Of course, many important pieces of software have to use imperative programming, but this is the minority.

Conclusion: we should all learn functional programming techniques. And we should all use them where appropriate (meaning, most of the time). All of us. No exception. And I mean it literally.

Yet almost nobody uses functional programming. That's the reality, OK. That doesn't mean, however, that what is so should be so. For the status quo to be a good thing, it must have good reasons behind it.

And I don't know of any. I know of several bad reasons though: anthropomorphism, analogies, lack of teaching, refusal to learn or to change one's mind, and the passive acceptance of the resulting network effect. Dijkstra warned us. We didn't listen. Mistakes ensued. Money was wasted. Computers were breached. People died.

I'll grant you that the lack of competence may mean that functional programming isn't the best idea for the next project. But that's no excuse for not even trying. That's no excuse for not teaching it.

Still not convinced? Then learn FP and read Okasaki's thesis, if you haven't already. If that's not enough, then we may be in an actual, interesting disagreement. In this case, I beg you to give me more details. Debunking my essay on the assignment statement would be great. And I now operate under Crocker's rules[1].

[1]: http://sl4.org/crocker.html


> Mutable state and manual memory management are responsible for almost all bugs I have seen.

I don't think this proves what you seem to think it proves. Mutable state is pervasive in most software written today. Almost all bugs can be classified as bugs of mutable state in one way or another if you so insist. In other words, your statement is akin to saying, "Computers are responsible for almost all bugs I have seen." Yes, they are. No, it's not useful to say they are.

> Mutable state and manual memory management are mostly unnecessary.

This depends on how you define "unnecessary". There is still no applicative, constant-time finite map.

> That has been proven years ago. See GHC, Xmonad, and Darcs for substantial examples.

Programs have been written in every kind of programming language with every kind of programming methodology. The existence of a few examples of such programs is not a compelling demonstration, let alone proof, of the usefulness of a paradigm. The plural of anecdote is not data.

> Of course, many important pieces of software have to use imperative programming

All software has to use imperative programming at some level. Purely applicative programs can do nothing. The question is not whether the state of the world is modified, but how it is modified and if those modifications should be somehow segregated from other parts of the program.

> Conclusion: we should all learn functional programming techniques. And we should all use them where appropriate (meaning, most of the time). All of us. No exception.

These sound like the ravings of an ideologue, not someone who's thought through the issues.

> For the status quo to be a good thing, it must have good reasons behind it.

There are several fundamentally good reasons why imperative programming is the status quo and applicative programming is not:

1. Imperative programming maps with significantly less impedance to the actual machines the code runs on, leading to a simpler and more comprehensible performance model for imperative programs.

2. Imperative programming does not require of programmers the same mathematical expertise and intuition that applicative programming requires. Whatever your blue sky hopes, the mass of programmers writing useful code today do not have the intellect or education to be as effective with applicative languages as they are with imperative languages. Programmers intuitively understand the concept of telling a computer what to do; far fewer can construct bodies of side-effect-free functions expressing their instructions to an abstract mathematical reduction machine.

3. The problems we wish to solve in the Real World are largely imperative problems. Imperative programming maps more easily to these problems of command and control than applicative programming.

> Dijkstra warned us. We didn't listen.

Dijkstra also would have been the first to oppose your applicative ideology. He considered applicative programming to be abstraction inversion, inefficient, and unnecessary. He would argue (with Hoare) that reasoning about programs is as easy with an appropriate imperative development methodology as with applicative methods.

> Still not convinced? Then learn FP

I've already learned FP. I've written tens of thousands of lines of code in OCaml. I've implemented Lisp interpreters. I understand monads and call/cc. The fact remains, when I need to get work done, imperative programming is faster and easier than applicative programming. The language I'm designing for my own edification is not an applicative language because I don't believe that side effects are the fundamental constraint on programming in the large (in fact, I would argue that side effects are required for programming in the large), and I recognize that real programming requires side effects to accurately model the real problems that real programs solve.

As for your essay on assignment, you're not entirely wrong, but your solution is misguided. Okasaki may have shown that in many cases you can implement applicatively a data structure similar to the common mutable one, and even (much of the time) get the same complexity bounds for that data structure. Both of you completely gloss over the constant factors, though, and those constant factors are massive, both in terms of the processing requirements for accurate garbage collection and in terms of the memory inefficiency induced by all the page faults applicative data structures' indirections require.

The real problem with programming today is aliasing.


Thank you. I really appreciate that you took your time to answer.

> All software has to use imperative programming at some level. Purely applicative programs can do nothing. The question is not whether the state of the world is modified, but how it is modified and if those modifications should be somehow segregated from other parts of the program.

We probably misunderstood each other, because I completely agree with that point. Now my point is that effects aren't segregated well enough. And that anyone who knows FP will have a better grasp of state, and will be able to segregate it more properly. I'm not against mutability. I'm against pervasive, rampant, unsegregated mutability, which I see way too much at work.

About your good reasons… I'd say reason 1, while still important, is invoked too often. It looks to me like premature optimization, used to justify verbose, unsafe, but familiar techniques and languages. Reason 2 is basically the lack of teaching I was talking about. I don't know how to fix it, I just believe it should be fixed. I partly deny reason 3. Command and control is of course best expressed with imperative programming, but no problem I saw is limited to that. Most problems have huge "pure" chunks, which are easily expressed applicatively. And in the large… I think modules developed by different teams should have explicit and small interfaces (if only to reduce communication overhead). Making those interfaces functionally pure where possible looks like the easiest way to achieve that. (Note that imperative inner cores don't bother me much, as long as the outer shells are kept pure.)

Apparently, I haven't read Dijkstra enough. I'll check it out.

It looks like I could learn from your code (you're definitely more experienced than I am). Could you show some you think I should look at?

> The real problem with programming today is aliasing.

Wait a minute, this sounds extremely interesting. I could say that the substitutability provided by purity makes that point moot, but I may have overlooked other problems (or solutions). Do you have a few pointers about that?


I'll come back and reply more thoroughly later, but to give you some interesting reading in the meantime, I figured I'd drop you a link: http://www.cse.ohio-state.edu/rsrg/documents/2009/09PHABKW.p... . The citations in that paper will be somewhat interesting as well.


Reading the paper. Really interesting, I'm learning something new there. You can assume I've read it thoroughly when you reply.


Indeed. The lack of persistent data structures in Python is a much bigger barrier to a functional coding style than the lack of tail call optimization.

You can mostly get around the missing tail call optimization by using combinators and avoiding naked recursion---a good idea anyway. And then implement your combinators, say map or foldr, with a loop. It's ugly, but nobody has to see it.
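For instance, a foldr combinator can present the functional interface while hiding a loop inside (a minimal sketch):

```python
def foldr(f, acc, xs):
    """Right fold: f(x0, f(x1, ... f(xn-1, acc))). Implemented with a
    loop instead of naked recursion, so Python's missing tail call
    optimization (and its recursion limit) never comes into play."""
    for x in reversed(xs):
        acc = f(x, acc)
    return acc
```

Callers write in terms of foldr and never see the imperative loop underneath.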

There's no real way around the ephemeral nature of Python lists and dicts---short of implementing your own versions. To say something positive: strings, numbers and tuples are immutable in Python, though.


The dynamic array is only amortized O(1) instead of guaranteed O(1) (do you have real-time constraints?), insertion mid-list/queue is always O(n) instead of O(1) with an iterator for the linked list variant. I was just disagreeing with "everything else is better with a growable array", the array implementation is indeed better for many common use cases, but splice is not the only exception.


Your point about amortization is good.

Insertion mid-list is just splice with a singleton; I don't bother mentioning it separately.

Those really are the only two advantages of linked lists over growable arrays.


Okay, how about removal mid-list with an iterator? Technically that could also be called "splice" after taking sublists, but that makes the term so broad as to be almost useless.



