Hacker News: mutex007's comments

I am curious how an inverted index could be implemented efficiently atop an LSM tree. Maybe batch parts of the posting list under a numeric key, and when performing an intersection between two lists, iterate using a range scan. It could work, but I wonder how fast read operations would be. Has anybody attempted this yet?
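To make the idea concrete, here is a toy sketch in Python of roughly what I have in mind (the key scheme, block size, and all function names are made up, not any real engine's API): posting lists are split into fixed-size blocks keyed by (term, block_no), so a range scan over the term's key prefix streams the list back in doc-id order, and two streams intersect with a plain merge loop.

```python
import bisect

# Model the LSM tree's sorted key space as a sorted list of
# ((term, block_no), block) pairs, where each block is a sorted run
# of doc ids.  Splitting a posting list into blocks keeps each key's
# value small; a range scan over (term, 0)..(term, inf) streams the
# whole list back in order.
BLOCK = 2  # tiny block size, just for the demo

def write_postings(store, term, doc_ids):
    for i in range(0, len(doc_ids), BLOCK):
        key = (term, i // BLOCK)
        bisect.insort(store, (key, doc_ids[i:i + BLOCK]))

def scan_postings(store, term):
    # Range scan: all keys with prefix `term`, in block order.
    lo = bisect.bisect_left(store, ((term, 0), []))
    for key, block in store[lo:]:
        if key[0] != term:
            break
        yield from block

def intersect(store, t1, t2):
    # Merge-style intersection of two streamed posting lists.
    a, b = scan_postings(store, t1), scan_postings(store, t2)
    out, x, y = [], next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            out.append(x)
            x, y = next(a, None), next(b, None)
        elif x < y:
            x = next(a, None)
        else:
            y = next(b, None)
    return out

store = []
write_postings(store, "apple", [1, 3, 5, 7])
write_postings(store, "pie", [3, 4, 7, 9])
print(intersect(store, "apple", "pie"))  # → [3, 7]
```

Read speed would then hinge on how cheap the range scan is across levels, which is exactly the part I'm unsure about.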


LSM trees are ideal for posting lists, because every time you add a new document you need to touch potentially hundreds of different lists, and LSM trees can batch all of those writes together. That's why Google originally developed BigTable, AFAIK.
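A minimal sketch of that write path (class and method names are illustrative, not any real engine's API): indexing one document logically touches one posting list per distinct term, but all of those updates land in a single in-memory buffer and get flushed together as one sorted, sequential run.

```python
# LSM-style write batching for an inverted index: instead of hundreds
# of random in-place updates to individual posting lists, all (term,
# doc_id) entries accumulate in a memtable and flush as one sorted run.
class MemTable:
    def __init__(self):
        self.entries = []          # (term, doc_id) pairs, unsorted

    def add_document(self, doc_id, text):
        for term in set(text.split()):
            self.entries.append((term, doc_id))

    def flush(self):
        # One sequential write of a sorted run; later levels merge
        # these runs in the background, as LSM trees normally do.
        run = sorted(self.entries)
        self.entries = []
        return run

mt = MemTable()
mt.add_document(1, "to be or not to be")
mt.add_document(2, "not to worry")
print(mt.flush())
```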


In fact, Lucene (the library behind Solr and Elasticsearch) is basically a special-purpose inverted-index-as-LSM-tree database engine.


https://github.com/kragen/500lines/tree/master/search-engine explains a simple version of this in Python.


I did this years ago, and it's still being used in production. In addition to the inverted lists, we keep a bitmap per level, which tells whether an id is present in that level. All bitmaps are merged in memory into a structure mapping each id to the last level it is valid in. Then the lists from different levels are merged using a binary min-heap, and entries that are not valid according to the map are kept off the merge heap.
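For readers who want the shape of this scheme, here's a small Python sketch (illustrative only, not the production code; one simplification is that superseded entries are filtered as they pop off the heap rather than being kept off it entirely):

```python
import heapq

def merge_levels(levels):
    # levels[i] is the sorted posting list of level i; lower index = newer.
    # First, merge the per-level presence info into a map from each id
    # to the newest level it is valid in (the bitmap-merge step).
    last_level = {}
    for lvl, ids in enumerate(levels):
        for doc_id in ids:
            last_level.setdefault(doc_id, lvl)   # newest level wins

    # Then merge the lists with a binary min-heap, emitting an entry
    # only if it comes from the level the id is valid in.
    heap = [(ids[0], lvl, 0) for lvl, ids in enumerate(levels) if ids]
    heapq.heapify(heap)
    out = []
    while heap:
        doc_id, lvl, i = heapq.heappop(heap)
        if last_level[doc_id] == lvl:            # skip superseded entries
            out.append(doc_id)
        if i + 1 < len(levels[lvl]):
            heapq.heappush(heap, (levels[lvl][i + 1], lvl, i + 1))
    return out

# id 3 appears in both levels; only the level-0 (newer) copy survives.
print(merge_levels([[1, 3, 8], [2, 3, 5]]))  # → [1, 2, 3, 5, 8]
```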


Elasticsearch will score no more than 7 points on our feature table. We are very aware of their product and are currently running benchmarks on it.

If you are interested, I can personally email you the results.


We picked the most popular from the RDBMS, NoSQL, and NewSQL worlds. In the future we will compare against more systems. We are actually in the process of releasing benchmarks for each of the above systems as well. We will be sure to include Postgres.


This is like comparing apples and oranges. What technique do you use for benchmarking?


Something like TodoMVC for backends would be nice... in essence, you create a backend for a TodoMVC front end, with every implementation using the same web-server platform, language, and front end. The only difference would be the backing SQL server, with as much processing pushed into the server (stored procedures, if supported) as possible. Maybe extend the example with a location and a local date/time.

Using Node.js and Angular for the server and front end, it would be easy enough to swap out the "todo-mvc-server-data" module... as long as each implementation supported the same interface(s), it could be a good test...

Set up the same hardware for each backend, and then run performance tests against a Node cluster for the front end. It would by no means be comprehensive, but it would be a nice comparison point (like TodoMVC itself).
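The "same interface, swap the backend" part might look something like this (a hypothetical sketch in Python rather than Node, with made-up names; the real module would obviously speak to an actual database):

```python
# Every candidate store implements one small Todo interface; the
# benchmark driver only ever talks to that interface.
from abc import ABC, abstractmethod

class TodoStore(ABC):
    @abstractmethod
    def add(self, text: str) -> int: ...
    @abstractmethod
    def complete(self, todo_id: int) -> None: ...
    @abstractmethod
    def list_open(self) -> list:
        ...

class InMemoryStore(TodoStore):
    # Stand-in for a real SQL/NoSQL-backed implementation.
    def __init__(self):
        self.todos, self.next_id = {}, 1

    def add(self, text):
        self.todos[self.next_id] = [text, False]
        self.next_id += 1
        return self.next_id - 1

    def complete(self, todo_id):
        self.todos[todo_id][1] = True

    def list_open(self):
        return [t for t, done in self.todos.values() if not done]

store = InMemoryStore()
first = store.add("write benchmark")
store.add("run benchmark")
store.complete(first)
print(store.list_open())  # → ['run benchmark']
```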


> We will be sure to include postgres

Yeah, right. Come back when you do.


The price is unbelievably fair. Try deploying MySQL in the cloud; while you're at it, you add Redis for caching, and then you hit the wall where you need search in your application, so you deploy Elasticsearch as a service as well. Do the math and compare it to ours.

Our pricing starts as low as $15 per month.

The cheapest instance of Amazon CloudSearch is around $79 per month, and you then have to deploy a transactional DBMS, and most likely S3 for storage as well.


It's $700/month (otherwise your product isn't needed, because without sharding PostgreSQL can do most of this).

The hosted version goes up to 8 GB of RAM; don't you think that's low?

SQL Server and Oracle pricing is also insane.


It's $700 annually. Please read carefully before you misrepresent us.

By the way, partitioning by hash and range, as we have stated, means "sharding".


The "$700/month" was mistyped.

What I meant is that the correct price is $700 and not $350, since the $350 tier doesn't include sharding/clustering.


In the case of MongoDB, it dumps a blob called BSON, which can itself be larger than the original JSON. Paradoxically, this is touted as a space-efficient binary serialization; you then read it back using an index or something.
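The size overhead is easy to see from the wire format itself. Here's a deliberately minimal hand-rolled BSON encoder for flat documents of ints and strings (a sketch for illustration; real drivers of course support many more types): each element carries a type byte, a NUL-terminated key, and length prefixes, which is why small documents come out bigger than their compact JSON.

```python
import json
import struct

def bson_encode(doc):
    # Minimal BSON encoder for flat docs of ints and strings,
    # following the public BSON spec's framing: a 4-byte total
    # length, the elements, and a trailing NUL.
    body = b""
    for key, val in doc.items():
        name = key.encode() + b"\x00"
        if isinstance(val, int):
            # 0x10 = int32 element
            body += b"\x10" + name + struct.pack("<i", val)
        elif isinstance(val, str):
            # 0x02 = string element: int32 length, bytes, NUL
            s = val.encode() + b"\x00"
            body += b"\x02" + name + struct.pack("<i", len(s)) + s
    return struct.pack("<i", len(body) + 5) + body + b"\x00"

doc = {"a": 1, "b": "hi"}
bson_bytes = bson_encode(doc)
json_bytes = json.dumps(doc, separators=(",", ":")).encode()
print(len(bson_bytes), len(json_bytes))  # → 22 16
```

So even this tiny document is 22 bytes in BSON versus 16 in compact JSON; the trade-off BSON actually buys is traversability and typed fields, not compactness.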


By the way, if you are looking for all of the above functionality, from all the DBMSs you mentioned, in a single DBMS instance, you can check out amisaserver.com. Polyglot persistence is just another fad.


Doesn't appear to have a community edition for local/private installation...


Sounds good to me. Might try it out.


Just curious, how many indexes were on the structure you were inserting? Also, I am assuming 90 ms per structure; if that is correct, then that equals about 11 inserts per second, which is unfortunately very slow.


> Also, I am assuming 90 ms per structure; if that is correct, then that equals about 11 inserts per second, which is unfortunately very slow.

You can't quite divide like that. In Rethink

  r.table('users').insert({ 'name': 'Bob' })
  r.table('users').insert({ 'name': 'Jim' })
is much slower than the equivalent batched insert

  r.table('users').insert([{ 'name': 'Bob' },
                           { 'name': 'Jim' }])
The latter isn't subject to multiple network round trips and can batch disk writes, drastically increasing performance relative to multiple individual writes. You can see a similar effect in most other database systems.

There are other nuances to this -- the complexity of measuring things properly is why we haven't published benchmarks to date. Check out this doc on insert performance for more details: http://rethinkdb.com/docs/troubleshooting/#my-insert-queries...


Assuming a TCP connection can be established in 15 ms in the worst case, plus another 10 ms for data transmission, 65 ms spent by the DBMS on a single insert is incredibly slow.

I am very aware of the performance gains from batch inserts, batched disk writes, and so on, as someone who has worked on the development of a major DBMS.


No, 90 ms including time in transit, so 20 ms of server processing time.


Assuming your 70 ms in transit is accurate (although you have not provided details on how you measured it), 20 ms of server processing time is only 50 inserts per second, which is simply untenable for a supposedly high-performing RethinkDB.

In effect, if I wanted to do 10,000 single writes per second, typical of most interactive systems, I would need 200 nodes to pull that off.


By default, RethinkDB requires an fsync to disk to guarantee data safety. So a typical default query goes like this: client request -> network trip -> wait on disk fsync -> network trip -> ack. That's going to be slow in any system, and it makes 10k writes/sec pretty much impossible on any rotational disk.

In RethinkDB you can turn on soft durability, which allows buffering writes in memory. That would drastically speed up write performance. Another option is to use concurrent clients. RethinkDB is optimized for throughput and concurrency, so if you use multiple concurrent clients, the latency won't scale linearly.
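For what it's worth, the 200-node arithmetic above assumes one request in flight at a time. With concurrent clients, Little's law (concurrency = throughput * latency) suggests the same target needs about 200 requests in flight, not necessarily 200 nodes, if the server can process them concurrently. A back-of-the-envelope sketch, not a measurement:

```python
# Little's law: average in-flight requests = throughput * latency.
target_throughput = 10_000          # desired single writes per second
server_latency = 0.020              # 20 ms of server processing per write

inflight = int(target_throughput * server_latency)
print(inflight)  # → 200 concurrent requests to sustain the target rate
```

Whether one node can actually sustain 200 concurrent in-flight writes is exactly what a benchmark would have to show.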


Shameless self-promotion. However, I have to say the SQL and your type system make me gravitate toward your platform. It can kill mongoCache for sure.

