I am curious how an inverted index could be implemented efficiently atop an LSM tree. Maybe batch parts of the posting list under a numeric key, and when performing an intersection between two lists, iterate using a range scan. Could work, but I wonder how fast read operations would be. Has anybody attempted this yet?
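A minimal sketch of the idea, with a plain dict standing in for the LSM tree's sorted key space (all names and the block size are illustrative): posting-list fragments live under composite (term, first_doc_id) keys, a "range scan" walks a term's blocks in key order, and intersection is a merge over two scans.

```python
# Minimal sketch: posting lists stored as fixed-size blocks under composite
# (term, first_doc_id) keys, the way an LSM tree's SSTables would order them.
BLOCK_SIZE = 4  # tiny for illustration; real systems use far larger blocks

class BlockedIndex:
    def __init__(self):
        # stand-in for the sorted KV store: (term, first_id) -> sorted id block
        self.kv = {}

    def add(self, term, doc_ids):
        ids = sorted(doc_ids)
        for i in range(0, len(ids), BLOCK_SIZE):
            block = ids[i:i + BLOCK_SIZE]
            self.kv[(term, block[0])] = block

    def scan(self, term):
        # "range scan": iterate all keys with this term prefix, in key order
        for key in sorted(k for k in self.kv if k[0] == term):
            yield from self.kv[key]

    def intersect(self, t1, t2):
        # merge-style intersection over two range scans
        a, b = self.scan(t1), self.scan(t2)
        out, x, y = [], next(a, None), next(b, None)
        while x is not None and y is not None:
            if x == y:
                out.append(x)
                x, y = next(a, None), next(b, None)
            elif x < y:
                x = next(a, None)
            else:
                y = next(b, None)
        return out

idx = BlockedIndex()
idx.add("apple", [1, 3, 5, 7, 9, 11])
idx.add("pear", [2, 3, 5, 8, 11])
print(idx.intersect("apple", "pear"))  # -> [3, 5, 11]
```

A real LSM-backed version would also have to merge blocks across levels on read, which is what the reply below addresses.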
LSM trees are ideal for posting lists, because every time you add a new document, you need to touch potentially hundreds of different lists. LSM trees are able to batch all of these writes together. That's why Google originally developed BigTable, AFAIK.
Did this years ago; it's still used in production. In addition to the inverted lists, we keep a bitmap per level that tells whether an id is present in that level. All bitmaps are merged in memory into a structure mapping each id to the last level in which it is valid. Lists from different levels are then merged using a binary min-heap; entries that are not valid according to the map are kept off the merge heap.
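A sketch of that scheme as I read it (sets stand in for the bitmaps, and level 0 is the newest; all names are mine, not the commenter's): first build the id-to-newest-level map, then heap-merge the per-level lists, skipping any entry whose level is not the one the map says is valid.

```python
import heapq

def valid_level(bitmaps):
    # Merge the per-level "bitmaps" into a map: id -> newest level containing it.
    last = {}
    for level in range(len(bitmaps) - 1, -1, -1):  # oldest first, newest wins
        for doc_id in bitmaps[level]:
            last[doc_id] = level
    return last

def merged_postings(lists, bitmaps):
    last = valid_level(bitmaps)
    heap = []

    def push(level, it):
        # Advance past entries superseded by a newer level: they are
        # "not valid in the map" and never enter the merge heap.
        for doc_id in it:
            if last.get(doc_id) == level:
                heapq.heappush(heap, (doc_id, level, it))
                return

    for level, lst in enumerate(lists):
        push(level, iter(lst))

    out = []
    while heap:
        doc_id, level, it = heapq.heappop(heap)
        out.append(doc_id)
        push(level, it)
    return out

levels = [[2, 5, 9], [1, 2, 7]]       # level 0 is newest
bitmaps = [{2, 5, 9}, {1, 2, 7}]      # doc 2 exists in both; level 0 wins
print(merged_postings(levels, bitmaps))  # -> [1, 2, 5, 7, 9]
```

The heap holds at most one cursor per level, so the merge is O(total entries × log levels), and superseded duplicates are filtered before they ever cost a heap operation.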
We picked the most popular systems from the RDBMS, NoSQL, and NewSQL worlds. In the future we will compare against more systems; we are actually in the process of releasing benchmarks for each of the systems above, and we will be sure to include Postgres.
Something like TodoMVC for backends would be nice... in essence, you create a backend for a TodoMVC front end, each using the same web-server platform, language, and TodoMVC front end. The only difference would be the backend SQL server, with as much processing done on the server as possible (if it supports stored procedures). Maybe extend the example with a location and a local date/time.
Using Node.js and Angular for the server/front end, it would be easy enough to swap out the "todo-mvc-server-data" module... as long as each implementation supported the same interface(s), it could be a good test...
Set up the same hardware for each backend, then run performance tests against a Node cluster for the front end. It would by no means be comprehensive, but it would be a nice comparison point (like TodoMVC itself).
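A sketch of the kind of interface such a swappable data module might expose (the comment envisions a Node.js module; this is Python for consistency with the other sketches in this thread, and every name here is invented for illustration, not part of TodoMVC):

```python
from abc import ABC, abstractmethod

# Hypothetical interface each backend-under-test would implement against its
# own database, ideally pushing logic into stored procedures where supported.
class TodoStore(ABC):
    @abstractmethod
    def list_todos(self): ...

    @abstractmethod
    def add(self, title): ...

    @abstractmethod
    def toggle(self, todo_id): ...

    @abstractmethod
    def clear_completed(self): ...

# In-memory reference implementation, useful for validating the harness itself.
class MemoryStore(TodoStore):
    def __init__(self):
        self.todos, self.next_id = {}, 1

    def list_todos(self):
        return list(self.todos.values())

    def add(self, title):
        todo = {"id": self.next_id, "title": title, "done": False}
        self.todos[self.next_id] = todo
        self.next_id += 1
        return todo

    def toggle(self, todo_id):
        self.todos[todo_id]["done"] = not self.todos[todo_id]["done"]

    def clear_completed(self):
        self.todos = {i: t for i, t in self.todos.items() if not t["done"]}

store = MemoryStore()
store.add("write benchmark")
store.add("run benchmark")
store.toggle(1)
store.clear_completed()
print([t["title"] for t in store.list_todos()])  # -> ['run benchmark']
```

Because every backend implements the same four methods, the benchmark driver and front end never change; only the module behind the interface does.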
The price is unbelievably fair. Try deploying MySQL in the cloud; while you are at it you add Redis for caching, and then you hit the wall where your application needs search, so you deploy Elasticsearch as a service as well. Do the math and compare it to ours.
Our pricing starts as low as $15 per month.
The cheapest Amazon CloudSearch instance is around $79 per month, and you then have to deploy a transactional DBMS, and most likely S3 for storage, as well.
In the case of MongoDB, you dump a blob called BSON, which can itself be larger than the equivalent JSON. Paradoxically, this is touted as a space-efficient binary serialization; you then read it back using an index or something.
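The size claim is easy to check for a small document by hand-encoding the BSON per the spec's layout, no driver needed: a document is an int32 total length, then per element a type byte, a NUL-terminated key, and the value, then a 0x00 terminator. Fixed-width values and the framing make tiny documents bigger than their JSON.

```python
import json
import struct

# Hand-encode the BSON for {"a": 1} following the BSON spec's layout.
key = b"a\x00"                                  # key "a" as a C string
element = b"\x10" + key + struct.pack("<i", 1)  # 0x10 = int32 element type
body = element + b"\x00"                        # document terminator
doc = struct.pack("<i", 4 + len(body)) + body   # length prefix includes itself

print(len(doc))                                           # 12 bytes of BSON
print(len(json.dumps({"a": 1}, separators=(",", ":"))))   # 7 bytes of JSON
```

The trade-off BSON actually buys is traversability (length prefixes let a reader skip fields without parsing them) and richer types, not compactness.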
By the way, if you are looking for all the above functionality, provided by all the DBMSs you mentioned, in a single DBMS instance, you can check out amisaserver.com. Polyglot persistence is just another fad.
Just curious, how many indexes were on the structure you were inserting? Also, I am assuming 90ms per structure; if that is correct, it equals about 11 inserts per second, which is unfortunately very slow.
The latter isn't subject to network round trips and can batch disk writes, drastically increasing performance relative to multiple individual writes. You can see a similar effect in most other database systems.
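A generic illustration of the batching effect using SQLite (not any specific DBMS from this thread): committing per row makes each insert its own transaction, while one batched transaction amortizes the commit, and on a real disk-backed database the sync, across all rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, payload TEXT)")

rows = [(i, "x" * 32) for i in range(1000)]

# Individual writes: each statement is its own transaction (its own
# durable sync on a disk-backed database).
for r in rows[:500]:
    conn.execute("INSERT INTO t VALUES (?, ?)", r)
    conn.commit()

# Batched write: one transaction, one commit, one sync.
conn.executemany("INSERT INTO t VALUES (?, ?)", rows[500:])
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # -> 1000
```

On an on-disk file the second half typically finishes orders of magnitude faster than the first, which is the effect being described above.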
Assuming a TCP connection can be established in 15ms worst case, and another 10ms goes to data transmission, 65ms spent by the DBMS on a single insert is incredibly slow.
I am very aware of the performance gains from batched inserts, batched disk writes, and so on, having worked on the development of a major DBMS.
Assuming your 70ms-in-transit figure is accurate (although you have not provided details on how you measured it), 20ms of server processing time is only 50 inserts per second, which is simply untenable for a high-performing RethinkDB.
In effect, if I wanted to do 10,000 single writes per second, typical of most interactive systems, I would need 200 nodes to pull that off.
By default RethinkDB requires an fsync to disk to guarantee data safety. So a typical default query goes like this: client request -> network trip -> wait on disk fsync -> network trip -> ack. That's going to be really slow in any system, and it makes 10k writes/sec pretty much impossible on any rotational disk.
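A rough way to feel the fsync cost on your own machine (numbers vary wildly by device; on a rotational disk each fsync costs a seek, on an SSD far less, and the comparison below only illustrates the shape of the difference):

```python
import os
import tempfile
import time

def write_records(path, n, fsync_each):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, b"record\n")
        if fsync_each:
            os.fsync(fd)   # durable before acking each write ("hard" path)
    os.fsync(fd)           # final sync either way
    os.close(fd)
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    durable = write_records(os.path.join(d, "hard"), 200, fsync_each=True)
    buffered = write_records(os.path.join(d, "soft"), 200, fsync_each=False)
    print(f"fsync per write: {durable:.4f}s, buffered: {buffered:.4f}s")
```

On a rotational disk the per-write-fsync run is bounded by seek latency per record, which is exactly why 10k sequential durable writes/sec is out of reach there.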
In RethinkDB you can turn on soft durability, which allows buffering writes in memory; that drastically speeds up write performance. Another option is to use concurrent clients. RethinkDB is optimized for throughput and concurrency, so if you use multiple concurrent clients, the latency won't scale linearly.
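A toy model of the concurrency point (this simulates a fixed 20ms service latency with a sleep; it is not RethinkDB code): one client doing sequential writes is capped at roughly 50 writes/sec, but concurrent clients overlap their waits, so aggregate throughput rises even though each individual write still takes about 20ms.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.02  # simulated 20ms of network + server time per write

def write(_):
    time.sleep(LATENCY)  # stand-in for one round trip to the database

N = 20

start = time.perf_counter()
for i in range(N):       # one client, writes issued back to back
    write(i)
sequential = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=N) as pool:  # N concurrent clients
    list(pool.map(write, range(N)))
concurrent = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, {N} concurrent clients: {concurrent:.2f}s")
```

This is why the earlier "10,000 writes/sec needs 200 nodes" arithmetic only holds if each client waits for the previous write to ack before sending the next.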