Hacker News: mutex007's comments

I am curious how an inverted index could be implemented efficiently atop an LSM tree. Maybe batch parts of the posting list under a numeric key, and when performing an intersection between two lists, iterate using a range scan. It could work, but I wonder how fast read operations would be. Has anybody attempted this yet?
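To make the idea concrete, here is a toy sketch in Python of roughly what I have in mind (the key scheme, block size, and all function names are made up, not any real engine's API): posting lists are split into fixed-size blocks keyed by (term, block_no), so a range scan over the term's key prefix streams the list back in doc-id order, and two streams intersect with a plain merge loop.

```python
import bisect

# Model the LSM tree's sorted key space as a sorted list of
# ((term, block_no), block) pairs, where each block is a sorted run
# of doc ids.  Splitting a posting list into blocks keeps each key's
# value small; a range scan over (term, 0)..(term, inf) streams the
# whole list back in order.
BLOCK = 2  # tiny block size, just for the demo

def write_postings(store, term, doc_ids):
    for i in range(0, len(doc_ids), BLOCK):
        key = (term, i // BLOCK)
        bisect.insort(store, (key, doc_ids[i:i + BLOCK]))

def scan_postings(store, term):
    # Range scan: all keys with prefix `term`, in block order.
    lo = bisect.bisect_left(store, ((term, 0), []))
    for key, block in store[lo:]:
        if key[0] != term:
            break
        yield from block

def intersect(store, t1, t2):
    # Merge-style intersection of two streamed posting lists.
    a, b = scan_postings(store, t1), scan_postings(store, t2)
    out, x, y = [], next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            out.append(x)
            x, y = next(a, None), next(b, None)
        elif x < y:
            x = next(a, None)
        else:
            y = next(b, None)
    return out

store = []
write_postings(store, "apple", [1, 3, 5, 7])
write_postings(store, "pie", [3, 4, 7, 9])
print(intersect(store, "apple", "pie"))  # → [3, 7]
```

Read speed would then hinge on how cheap the range scan is across levels, which is exactly the part I'm unsure about.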


LSM trees are ideal for posting lists, because every time you add a new document you need to touch potentially hundreds of different lists, and LSM trees can batch all of those writes together. That's why Google originally developed BigTable, AFAIK.
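A minimal sketch of that write path (class and method names are illustrative, not any real engine's API): indexing one document logically touches one posting list per distinct term, but all of those updates land in a single in-memory buffer and get flushed together as one sorted, sequential run.

```python
# LSM-style write batching for an inverted index: instead of hundreds
# of random in-place updates to individual posting lists, all (term,
# doc_id) entries accumulate in a memtable and flush as one sorted run.
class MemTable:
    def __init__(self):
        self.entries = []          # (term, doc_id) pairs, unsorted

    def add_document(self, doc_id, text):
        for term in set(text.split()):
            self.entries.append((term, doc_id))

    def flush(self):
        # One sequential write of a sorted run; later levels merge
        # these runs in the background, as LSM trees normally do.
        run = sorted(self.entries)
        self.entries = []
        return run

mt = MemTable()
mt.add_document(1, "to be or not to be")
mt.add_document(2, "not to worry")
print(mt.flush())
```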


In fact, Lucene (the library behind Solr and Elasticsearch) is basically a special-purpose inverted-index-as-LSM-tree database engine.


https://github.com/kragen/500lines/tree/master/search-engine explains a simple version of this in Python.


I did this years ago, and it's still being used in production. In addition to the inverted lists, we keep a bitmap per level, which tells whether an id is present in that level. All bitmaps are merged in memory into a structure mapping each id to the last level it is valid in. Then the lists from different levels are merged using a binary min-heap, and entries that are not valid according to the map are kept off the merge heap.
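For readers who want the shape of this scheme, here's a small Python sketch (illustrative only, not the production code; one simplification is that superseded entries are filtered as they pop off the heap rather than being kept off it entirely):

```python
import heapq

def merge_levels(levels):
    # levels[i] is the sorted posting list of level i; lower index = newer.
    # First, merge the per-level presence info into a map from each id
    # to the newest level it is valid in (the bitmap-merge step).
    last_level = {}
    for lvl, ids in enumerate(levels):
        for doc_id in ids:
            last_level.setdefault(doc_id, lvl)   # newest level wins

    # Then merge the lists with a binary min-heap, emitting an entry
    # only if it comes from the level the id is valid in.
    heap = [(ids[0], lvl, 0) for lvl, ids in enumerate(levels) if ids]
    heapq.heapify(heap)
    out = []
    while heap:
        doc_id, lvl, i = heapq.heappop(heap)
        if last_level[doc_id] == lvl:            # skip superseded entries
            out.append(doc_id)
        if i + 1 < len(levels[lvl]):
            heapq.heappush(heap, (levels[lvl][i + 1], lvl, i + 1))
    return out

# id 3 appears in both levels; only the level-0 (newer) copy survives.
print(merge_levels([[1, 3, 8], [2, 3, 5]]))  # → [1, 2, 3, 5, 8]
```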


Elasticsearch will score no more than 7 points on our feature table. We are very aware of their product and are currently running benchmarks on it.

If you are interested, I can personally email you the results.


We picked the most popular from the RDBMS, NoSQL, and NewSQL worlds. In the future we will compare against more systems. We are actually in the process of releasing benchmarks for each of the above systems as well. We will be sure to include Postgres.


This is like comparing apples and oranges. What technique do you use for benchmarking?


Something like TodoMVC for backends would be nice... in essence, you create a backend for a TodoMVC front end, with every implementation using the same web-server platform, language, and front end. The only difference would be the backing SQL server, with as much processing pushed into the server (stored procedures, if supported) as possible. Maybe extend the example with a location and a local date/time.

Using Node.js and Angular for the server and front end, it would be easy enough to swap out the "todo-mvc-server-data" module... as long as each implementation supported the same interface(s), it could be a good test...

Set up the same hardware for each backend, and then run performance tests against a Node cluster for the front end. It would by no means be comprehensive, but it would be a nice comparison point (like TodoMVC itself).
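The "same interface, swap the backend" part might look something like this (a hypothetical sketch in Python rather than Node, with made-up names; the real module would obviously speak to an actual database):

```python
# Every candidate store implements one small Todo interface; the
# benchmark driver only ever talks to that interface.
from abc import ABC, abstractmethod

class TodoStore(ABC):
    @abstractmethod
    def add(self, text: str) -> int: ...
    @abstractmethod
    def complete(self, todo_id: int) -> None: ...
    @abstractmethod
    def list_open(self) -> list:
        ...

class InMemoryStore(TodoStore):
    # Stand-in for a real SQL/NoSQL-backed implementation.
    def __init__(self):
        self.todos, self.next_id = {}, 1

    def add(self, text):
        self.todos[self.next_id] = [text, False]
        self.next_id += 1
        return self.next_id - 1

    def complete(self, todo_id):
        self.todos[todo_id][1] = True

    def list_open(self):
        return [t for t, done in self.todos.values() if not done]

store = InMemoryStore()
first = store.add("write benchmark")
store.add("run benchmark")
store.complete(first)
print(store.list_open())  # → ['run benchmark']
```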


> We will be sure to include postgres

Yeah, right. Come back when you do.


The price is unbelievably fair. Try deploying MySQL in the cloud; while you're at it, you add Redis for caching, and then you hit the wall where you need search in your application, so you deploy Elasticsearch as a service as well. Do the math and compare it to ours.

Our pricing starts as low as $15 per month.

The cheapest instance of Amazon CloudSearch is around $79 per month, and you then have to deploy a transactional DBMS, and most likely S3 for storage as well.


It's $700/month (otherwise your product isn't needed, because without sharding PostgreSQL can do most of this).

The hosted version goes up to 8 GB of RAM; don't you think that's low?

SQL Server and Oracle pricing is also insane.


It's $700 annually. Please read carefully before you misrepresent us.

By the way, partitioning by hash and range, as we have stated, means "sharding".


The "$700/month" was mistyped.

What I meant is that the correct price is $700 and not $350, since the $350 tier doesn't include sharding/clustering.


In the case of MongoDB, it dumps a blob called BSON, which can itself be larger than the original JSON. Paradoxically, this is touted as a space-efficient binary serialization; you then read it back using an index or something.
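The size overhead is easy to see from the wire format itself. Here's a deliberately minimal hand-rolled BSON encoder for flat documents of ints and strings (a sketch for illustration; real drivers of course support many more types): each element carries a type byte, a NUL-terminated key, and length prefixes, which is why small documents come out bigger than their compact JSON.

```python
import json
import struct

def bson_encode(doc):
    # Minimal BSON encoder for flat docs of ints and strings,
    # following the public BSON spec's framing: a 4-byte total
    # length, the elements, and a trailing NUL.
    body = b""
    for key, val in doc.items():
        name = key.encode() + b"\x00"
        if isinstance(val, int):
            # 0x10 = int32 element
            body += b"\x10" + name + struct.pack("<i", val)
        elif isinstance(val, str):
            # 0x02 = string element: int32 length, bytes, NUL
            s = val.encode() + b"\x00"
            body += b"\x02" + name + struct.pack("<i", len(s)) + s
    return struct.pack("<i", len(body) + 5) + body + b"\x00"

doc = {"a": 1, "b": "hi"}
bson_bytes = bson_encode(doc)
json_bytes = json.dumps(doc, separators=(",", ":")).encode()
print(len(bson_bytes), len(json_bytes))  # → 22 16
```

So even this tiny document is 22 bytes in BSON versus 16 in compact JSON; the trade-off BSON actually buys is traversability and typed fields, not compactness.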


By the way, if you are looking for all of the above functionality, from all the DBMSs you mentioned, in a single DBMS instance, you can check out amisaserver.com. Polyglot persistence is just another fad.


Doesn't appear to have a community edition for local/private installation...


Sounds good to me. Might try it out.


Just curious, how many indexes were on the structure you were inserting? Also, I am assuming 90 ms per structure; if that is correct, then that equals about 11 inserts per second, which is unfortunately very slow.


> Also, I am assuming 90 ms per structure; if that is correct, then that equals about 11 inserts per second, which is unfortunately very slow.

You can't quite divide like that. In Rethink

  r.table('users').insert({ 'name': 'Bob' })
  r.table('users').insert({ 'name': 'Jim' })
is much slower than the equivalent batched insert

  r.table('users').insert([{ 'name': 'Bob' },
                           { 'name': 'Jim' }])
The latter isn't subject to multiple network round trips and can batch disk writes, drastically increasing performance relative to multiple individual writes. You can see a similar effect in most other database systems.

There are other nuances to this -- the complexity of measuring things properly is why we haven't published benchmarks to date. Check out this doc on insert performance for more details: http://rethinkdb.com/docs/troubleshooting/#my-insert-queries...


Assuming a TCP connection can be established in 15 ms in the worst case, plus another 10 ms for data transmission, 65 ms spent by the DBMS on a single insert is incredibly slow.

I am very aware of the performance gains from batch inserts, batched disk writes, and so on, as someone who has worked on the development of a major DBMS.


No, 90 ms including time in transit, so 20 ms of server processing time.


Assuming your 70 ms in transit is accurate (although you have not provided details on how you measured it), 20 ms of server processing time is only 50 inserts per second, which is simply untenable for a supposedly high-performing RethinkDB.

In effect, if I wanted to do 10,000 single writes per second, typical of most interactive systems, I would need 200 nodes to pull that off.


By default, RethinkDB requires an fsync to disk to guarantee data safety. So a typical default query goes like this: client request -> network trip -> wait on disk fsync -> network trip -> ack. That's going to be slow in any system, and it makes 10k writes/sec pretty much impossible on any rotational disk.

In RethinkDB you can turn on soft durability, which allows buffering writes in memory. That would drastically speed up write performance. Another option is to use concurrent clients. RethinkDB is optimized for throughput and concurrency, so if you use multiple concurrent clients, the latency won't scale linearly.
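For what it's worth, the 200-node arithmetic above assumes one request in flight at a time. With concurrent clients, Little's law (concurrency = throughput * latency) suggests the same target needs about 200 requests in flight, not necessarily 200 nodes, if the server can process them concurrently. A back-of-the-envelope sketch, not a measurement:

```python
# Little's law: average in-flight requests = throughput * latency.
target_throughput = 10_000          # desired single writes per second
server_latency = 0.020              # 20 ms of server processing per write

inflight = int(target_throughput * server_latency)
print(inflight)  # → 200 concurrent requests to sustain the target rate
```

Whether one node can actually sustain 200 concurrent in-flight writes is exactly what a benchmark would have to show.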


Shameless self-promotion. However, I have to say the SQL and your type system make me gravitate toward your platform. It can kill mongoCache for sure.

