A while ago (4ish years) Jake Luciani implemented Lucandra (Cassandra as the store for Lucene indices) and Solandra (extension of Lucandra - solr on top of Cassandra). I used both of those at one point.
It was really interesting work.
OP - was Elassandra influenced by either of those in any way?
It was 7 years ago. Jake did Lucene on Cassandra, but I actually did the Solr part, initially at least[1]. I never did much with it though, and Jake deserves the credit for making it a real thing people might use. I'm pleased with the name though.
In solr vs elasticsearch, my understanding is that solr is more correct and has faster & better algorithms for certain edge cases. Though I'd be interested in hearing the details of any differences of opinion.
Before ES 1.7.x, ES would lose data all the time. However they have made a serious effort to make it much more reliable. There are still some problems, but I haven't lost a significant amount of data in a long time (currently on 2.3.3). They are fixing it.
> This is probably not true: Crate 0.54.9 uses Elasticsearch 1.7, which definitely loses updates. If Crate loses updates, it’s unlikely that you can guarantee reading the latest write, let alone reading that write ever again. Even the unreleased Elasticsearch 5.0.0 still fails its Jepsen tests according to Elastic, so claiming linearizable reads on single keys might be a bit of a stretch.
I've used Solr in production for about five years. All of the pain with Solr is up-front. Once you get it running, set up your schemas and you have your indexes built, it just hums along, doing its job and being obscenely fast.
Trying out Elasticsearch, my experience was that it really wants to be run in a cluster, but it also loses data pretty easily. I had more issues with it crashing and it's generally a lot hungrier for memory.
Both have non-obvious shortcomings. Solr's schema will make you believe that it likes deeply nested JSON documents. False! It actually wants pretty flat "documents" without nesting (you /can/ nest, but it usually doesn't do what you want without some extra legwork). ES will have you believe that it supports lots of query types and they'll all perform great on semi-structured data. My experience was that it was difficult to predict performance, but that generally the fancier the query the worse it would be.
Solr's querying functionality is not extremely powerful (though they "helpfully" made it offensively complex with different query parsers and stuff) but performance has always been excellent for me.
IMO, if you don't need clustering, Solr is definitely better. A cranky but robust piece of engineering from before scaling was everything. ES has better documentation, a better "getting started" story, and is generally a lot more user-friendly. Aphyr's posts about it have made me wary of using it without a re-indexing story.
I haven't tried Solr's scaling stuff because I haven't needed it, but I would expect it to be in pretty rough shape compared to ES because it's not a primary use case for Solr and it is for ES.
We use a SolrCloud cluster where I work. The initial setup is rather daunting and involves reading up on Zookeeper and on Solr's terminology for Collections, Cores, Shards and Nodes. But once the reading is done, clustering is effortless to execute. Solr 5 and 6 have a robust REST API for managing Collections.
The new SQL / Parallel Streaming has also made querying multiple collections a cinch.
Most of the problems that I have had are with the ZK clients and not the server. As long as you follow the operational documentation (there are a few basic rules) it hums along nicely. We have a few clusters with Kafka and have a decent process in Chef:
Compared to ES documentation on clustering it's a lot of work. ES merely requires a seed host to connect to and will gossip the rest of the cluster. No external service needed.
On the other hand, it's had its share of improperly handled split brain scenarios. I still think it has problems with partial split brain (where A and B can't talk, but C can talk to both of them).
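To make the partial split brain case concrete, here is a toy sketch in Python. It is not how ES discovery actually works; the node names, the `broken_links` set and the "direct reachability counts as a vote" rule are all assumptions for illustration. It only shows why a naive majority check does not help when A and B can't talk but C can talk to both.

```python
def visible_nodes(node, nodes, broken_links):
    """Return the set of nodes that `node` can reach directly, itself included."""
    return {
        other
        for other in nodes
        if other == node or frozenset((node, other)) not in broken_links
    }


def sees_majority(node, nodes, broken_links):
    """True if `node` can directly reach a strict majority of the cluster."""
    return len(visible_nodes(node, nodes, broken_links)) > len(nodes) // 2


# Hypothetical three-node cluster: the A<->B link is down, C reaches both.
nodes = ["A", "B", "C"]
broken_links = {frozenset(("A", "B"))}

views = {n: sorted(visible_nodes(n, nodes, broken_links)) for n in nodes}
majority = {n: sees_majority(n, nodes, broken_links) for n in nodes}

for n in nodes:
    print(n, "sees", views[n], "majority:", majority[n])
```

Every node, including A and B, still counts two or more reachable peers, so each of them can believe it sits in the majority side. A simple "do I see a quorum?" test is therefore not enough to rule out two masters in this topology.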
I may not have phrased it correctly. It was daunting to set up the first time as I had no idea (then) about running clusters. Most people graduate from running a single Solr node to a (generally) 3-node cluster. Getting a running Zookeeper ensemble is just part of that process.
Thanks for sharing the cookbook :) . Do you use Zookeeper for anything other than Kafka?
That's interesting, thanks for sharing. I've used Elasticsearch and found all the configuration tuning daunting, but I'm sure I would go through that with Solr as well. I never used ES as a primary data store and I had jobs syncing it up, so data loss was never a real concern for me.
I just clustered ES with Docker somewhat recently. It was pretty good at discovery, I must say. Well, I'm no great search engineer; I just throw these things up when I need 'em, so it's good to know all this. Thanks again.
I haven't seen it broken, but I have definitely seen it break things all the time. There are more non-backwards-compatible API changes between versions than I have seen with other infrastructure software.
When you spend lots of time fine-tuning the exact combination of filters, ranking algorithms, weighting of fields and so on, having the API completely change certain methods kind of hurts.
My favorite experience was discovering that HTTP pipelining was completely broken. Rather than respond in FIFO order, it would respond in the order that queries completed. Since most of our queries were homogeneous, we wouldn't detect any schema issues and would successfully return a completely wrong set of results to our users.
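For anyone who hasn't hit this: with pipelining, the client matches responses to requests purely by position, so the server has to answer in request order even when later queries finish first. A minimal sketch of the buffering a server (or proxy) needs, assuming each request is tagged with a sequence number on arrival; `reorder_responses` and the sample data are hypothetical, not ES code:

```python
import heapq


def reorder_responses(completions):
    """Yield responses in request (FIFO) order.

    `completions` is an iterable of (seq, response) pairs in the order the
    queries finished, where `seq` is the 0-based position of the request on
    the connection. Responses that finish early are held back until every
    earlier response has been emitted.
    """
    pending = []  # min-heap of responses that completed ahead of their turn
    next_seq = 0
    for seq, response in completions:
        heapq.heappush(pending, (seq, response))
        while pending and pending[0][0] == next_seq:
            yield heapq.heappop(pending)[1]
            next_seq += 1


# Query 2 finishes first, then query 0, then query 1.
completed = [
    (2, "result for query 2"),
    (0, "result for query 0"),
    (1, "result for query 1"),
]
in_order = list(reorder_responses(completed))
print(in_order)
```

Without that hold-back step, the client would have paired "result for query 2" with its first request, which is exactly the silent mismatch described above.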
My understanding has always been that Elasticsearch is tremendously more consistent with streaming data, while Solr has long pauses for the indexes to keep up. That was based on a detailed post by Loggly some years ago, though.
This looks very interesting, however I'm concerned about some of the implementation details...
1. It mentions using secondary indexes - it's my understanding that's a huge no-no, as they have to hit the whole cluster
2. Uses "lightweight" transactions - also another perf hit, as lightweight transactions have (anecdotally) a 6x slowdown...
I like the idea but I'm curious if these are issues and whether these uses are something the author is looking to replace...
Very interesting idea though!
(CoAuthor of cassieq here so these were things we had to learn about.)
Indexes are stored together with the data. The partition key becomes the _routing key, so you can always search 1, x, or all nodes depending on your _routing value.
Lightweight transactions are only used on schema changes (which are/should be rare).
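Roughly, the routing idea looks like the sketch below. This is only an illustration, not Elassandra's code: the node list and function names are made up, and a plain hash-modulo stands in for Cassandra's partitioner and token ranges, which is what actually decides where a partition key lives.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster


def node_for(routing_value, nodes=NODES):
    """Map a routing value (the partition key) to one node.

    Assumption: hash-modulo placement, used here only as a stand-in for
    the real token-range lookup.
    """
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def nodes_to_search(routing_values=None, nodes=NODES):
    """Return the nodes a search has to touch.

    With no routing value the query fans out to every node; with one or
    more values it only touches the nodes owning those keys.
    """
    if not routing_values:
        return list(nodes)
    return sorted({node_for(value, nodes) for value in routing_values})


print(nodes_to_search(["user-42"]))            # exactly one node
print(nodes_to_search(["user-42", "user-7"]))  # one or two nodes
print(nodes_to_search())                       # all nodes
```

So a search that carries the partition key as its routing value stays local to the owning node, which is why it does not behave like a cluster-wide secondary index lookup.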
Awesome, I watched the demo video... I will be trying this. Thanks for the info on lightweight transactions, sounds like the perfect use then!
As for the indexes - are they standard Cassandra secondary indexes? "Custom secondary indexes" - does that mean that it just looks like a secondary index, but is actually backed by Elasticsearch?
@vroyer, we are currently migrating the MSSQL + Elasticsearch backend of http://apply4u.co.uk as well as http://thulya.com to Elassandra. Our initial tests are very promising. I'll be happy to share more details soon about these production use cases.
I don't understand... Elasticsearch is built on top of Lucene indexes. Data indexed with postings lists is designed around the search use case. I don't know too much about Cassandra, but it's not a search engine?
Would like to know more about how indexing is handled.
The closest thing Cassandra has to behaving like a search engine is the new SASI indexes. Good deep dive here: http://www.doanduyhai.com/blog/?p=2058 which describes how it's different from Elasticsearch in the "SASI vs Search Engines" section.
- SASI requires 2 passes on disk to fetch data: 1 pass to read the index files and 1 pass for the normal Cassandra read path whereas search engines retrieves the result in a single pass (DSE Search has a singlePass option too). By laws of physics, SASI will always be slower, even if we improve the sequential read path in Cassandra.
- Although SASI allows full text search with tokenization and CONTAINS mode, there is no scoring applied to matched terms. SASI returns result in token range order, which can be considered as random order from the user point of view. It is not possible to ask for total ordering of the result, even when LIMIT clause is used. Search engines don't have this limitation.
- Last but not least, it is not possible to perform aggregation (or faceting) with SASI. The GROUP BY clause may be introduced into CQL in a near future but it is done on Cassandra side, there is no pre-aggregation possible on SASI terms that can help speeding up aggregation queries.
Indexing is still handled the same as in normal Elasticsearch. The "raw" document, meaning "_source", is stored in a Cassandra row as separate fields, while the Lucene indexes are stored just as they are. Cassandra implements the replication/sharding, meaning that when a token migrates from one node to another, the affected rows get reindexed on the new node and deleted on the old node.
This exists to index your Cassandra data. From the readme:
* Cassandra update are automatically indexed in Elasticsearch.
* Full-Text and spatial search on your cassandra data.
* Real-time aggregation (does not require Spark or Hadoop to group by)
* Provide search on multiple keyspace and tables in one query.
* Provide automatic schema creation and support nested document using User Defined Types.
* Provide a read/write JSON REST access to cassandra data (for indexed data)
Seems interesting, but our current Elasticsearch is a third-party service - so that means we would need a third-party Cassandra service that also supports Elassandra on top, which seems unlikely - does anyone know of any I can look at?
[1] https://github.com/tjake/Lucandra
[2] https://github.com/tjake/Solandra