Elassandra: Elasticsearch implemented on top of Cassandra

ceocoder · on July 8, 2016

A while ago (4ish years) Jake Luciani implemented Lucandra (Cassandra as the store for Lucene indices) and Solandra (extension of Lucandra - solr on top of Cassandra). I used both of those at one point.

It was really interesting work.

OP - was Elassandra influenced by either of those in any way?

[1] https://github.com/tjake/Lucandra [2] https://github.com/tjake/Solandra

isoos · on July 8, 2016

Also, Stratio's lucene index: https://github.com/Stratio/cassandra-lucene-index

nl · on July 9, 2016

It was 7 years ago. Jake did Lucene on Cassandra, but I actually did the Solr part, initially at least[1]. I never did much with it though, and Jake deserves the credit for making it a real thing people might use. I'm pleased with the name though.

[1] http://nicklothian.com/blog/2009/10/27/solr-cassandra-soland...

fjordan · on July 8, 2016

I believe he now works with DataStax on their DSE Solr product.

grizzles · on July 8, 2016

In solr vs elasticsearch, my understanding is that solr is more correct and has faster & better algorithms for certain edge cases. Though I'd be interested in hearing the details of any differences of opinion.

Therefore Stratio's Cassandra Lucene Index is worth an equal mention. https://github.com/Stratio/cassandra-lucene-index

phamilton · on July 8, 2016

Solr is operationally more complex to cluster (at least it was a few years ago) and the API is less intuitive in my opinion.

Elasticsearch is terribly broken in a lot of ways, but it's awfully easy to get up and running.

I'm not sure which I'm holding out for: that Solr will be easier to manage, or that Elasticsearch will stop doing terrible things.

rpedela · on July 8, 2016

Before ES 1.7.x, ES would lose data all the time. However they have made a serious effort to make it much more reliable. There are still some problems, but I haven't lost a significant amount of data in a long time (currently on 2.3.3). They are fixing it.

cjbprime · on July 8, 2016

I think everyone agrees it's getting better, but:

> This is probably not true: Crate 0.54.9 uses Elasticsearch 1.7, which definitely loses updates. If Crate loses updates, it’s unlikely that you can guarantee reading the latest write, let alone reading that write ever again. Even the unreleased Elasticsearch 5.0.0 still fails its Jepsen tests according to Elastic, so claiming linearizable reads on single keys might be a bit of a stretch.

https://aphyr.com/posts/332-jepsen-crate-0-54-9-version-dive...

rpedela · on July 8, 2016

Yes there are still problems: https://www.elastic.co/guide/en/elasticsearch/resiliency/cur...

joslin01 · on July 8, 2016

Would be curious to hear what Elasticsearch is doing wrong if you don't mind. I have no dog in the fight, I just always hear good things.

fusiongyro · on July 8, 2016

I've used Solr in production for about five years. All of the pain with Solr is up-front. Once you get it running, set up your schemas and you have your indexes built, it just hums along, doing its job and being obscenely fast.

Trying out Elasticsearch, my experience was that it really wants to be run in a cluster, but it also loses data pretty easily. I had more issues with it crashing and it's generally a lot hungrier for memory.

Both have non-obvious shortcomings. Solr's schema will make you believe that it likes deeply nested JSON documents. False! It actually wants pretty flat "documents" without nesting (you /can/ nest, but it usually doesn't do what you want without some extra legwork). ES will have you believe that it supports lots of query types and they'll all perform great on semi-structured data. My experience was that it was difficult to predict performance, but that generally the fancier the query the worse it would be.

Solr's querying functionality is not extremely powerful (though they "helpfully" made it offensively complex with different query parsers and stuff) but performance has always been excellent for me.

IMO, if you don't need clustering, Solr is definitely better. A cranky but robust piece of engineering from before scaling was everything. ES has better documentation, a better "getting started" story, and is generally a lot more user-friendly. Aphyr's posts about it have made me wary of using it without a re-indexing story.

I haven't tried Solr's scaling stuff because I haven't needed it, but I would expect it to be in pretty rough shape compared to ES because it's not a primary use case for Solr and it is for ES.

rrampage · on July 8, 2016

We use SolrCloud cluster where I work. The initial setup is rather daunting and involves reading up on Zookeeper, and Solr terminology on Collections, Cores, Shards and Nodes. But once the reading is done, clustering is effortless to execute. Solr 5 and 6 have a robust REST API for managing Collections.

The new SQL / Parallel Streaming has also made querying multiple collections a cinch.

johnbellone · on July 8, 2016

What's daunting about running Zookeeper?

Most of the problems that I have had are with the ZK clients and not the server. As long as you follow the operational documentation (there are a few basic rules) it hums along nicely. We have a few clusters with Kafka and have a decent process in Chef:

https://github.com/bloomberg/zookeeper-cookbook

phamilton · on July 9, 2016

Compared to ES documentation on clustering it's a lot of work. ES merely requires a seed host to connect to and will gossip the rest of the cluster. No external service needed.

On the other hand, it's had its share of improperly handled split brain scenarios. I still think it has problems with partial split brain (where A and B can't talk, but C can talk to both of them).
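To make the partial split brain case concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the node names, the link table, and the "sees at least 2 nodes" quorum rule are all hypothetical, and this is not Elasticsearch's actual discovery code. It just shows why a quorum check based on direct visibility does not rule out two masters when A and B can't talk but C can talk to both.

```python
# Hypothetical three-node cluster: the A<->B link is down,
# but C can still reach both A and B.
NODES = {"A", "B", "C"}
LINKS = {frozenset(("A", "C")), frozenset(("B", "C"))}


def visible(node):
    """Nodes that `node` can reach directly, including itself."""
    return {node} | {other for other in NODES if frozenset((node, other)) in LINKS}


def sees_quorum(node, quorum=2):
    """True if `node` can see at least `quorum` nodes (assumed majority of 3)."""
    return len(visible(node)) >= quorum


# Every node believes it is on the majority side of the partition.
quorum_holders = sorted(n for n in NODES if sees_quorum(n))
print(quorum_holders)
```

In this model A and B each count themselves plus C, so both pass the quorum check even though they cannot see each other.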

rrampage · on July 9, 2016

I may not have phrased it correctly. It was daunting to set up the first time as I had no idea (then) about running clusters. Most people graduate from running a single Solr node to a (generally) 3-node cluster. Getting a running Zookeeper ensemble is just part of that process.

Thanks for sharing the cookbook :) . Do you use Zookeeper for anything other than Kafka?

techdragon · on July 9, 2016

Because I'd rather run Consul?

Zookeeper as a hard requirement feels like asking me to supply an entire container port to unload one container from a semi-trailer.

joslin01 · on July 10, 2016

That's interesting, thanks for sharing. I've used Elasticsearch and found all the configuration tuning daunting, but I'm sure I would go through that with Solr as well. I never used ES as a primary data store and I had jobs syncing it up, so data loss was never a real concern for me.

I just clustered ES with Docker somewhat recently. It was pretty good at discovery, I must say. Well, I'm no great search engineer; I just throw these things up when I need 'em, so it's good to know all this. Thanks again.

elcapitan · on July 8, 2016

I haven't seen it being broken, but I have definitely seen it breaking things all the time. There are more non-backwards-compatible API changes between versions than I have seen with other infrastructure software.

When you spend lots of time fine-tuning the exact combination of filters, ranking algorithms, weighting of fields and so on, the API completely switching certain methods kind of hurts.

phamilton · on July 9, 2016

My favorite experience was discovering that HTTP pipelining was completely broken. Rather than respond in FIFO order, it would respond in the order that queries completed. Since most of our queries were homogeneous, we wouldn't detect any schema issues and would successfully return a completely wrong set of results to our users.
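A toy simulation of that failure mode, as a sketch only: the request names and completion times are made up, and nothing here talks to a real server. A pipelining client matches responses to requests purely by position, so a server that writes responses in completion order hands every slot the wrong answer.

```python
# Three pipelined requests on one connection, each with a
# hypothetical server-side completion time in milliseconds.
requests = [("q1", 30), ("q2", 10), ("q3", 20)]


def pair_responses(reqs, fifo):
    """Pair each request with the response the client reads in that slot."""
    sent_order = [name for name, _ in reqs]
    if fifo:
        # Correct pipelining: responses come back in request order.
        response_order = sent_order
    else:
        # Broken behaviour: responses are written as queries finish.
        response_order = [name for name, _ in sorted(reqs, key=lambda r: r[1])]
    # The client has no request IDs; it matches by position only.
    return list(zip(sent_order, response_order))


correct = pair_responses(requests, fifo=True)
broken = pair_responses(requests, fifo=False)
mismatched = [(req, resp) for req, resp in broken if req != resp]
print(mismatched)
```

Because the mismatched responses have the same shape as the expected ones, nothing fails to parse, which is why this kind of bug goes unnoticed.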

rpedela · on July 8, 2016

I think this is a great comparison between Solr and Elasticsearch from a search relevance perspective: http://opensourceconnections.com/blog/2016/01/22/solr-vs-ela...

brightball · on July 8, 2016

My understanding has always been that Elasticsearch is tremendously more consistent with streaming data, while Solr has long pauses for the indexes to keep up. That was based on a detailed post by Loggly some years ago, though.

cipherzero · on July 8, 2016

This looks very interesting, however I'm concerned about some of the implementation details...

1. It mentions using secondary indexes - it's my understanding that's a huge no-no, as they have to hit the whole cluster.

2. It uses "lightweight" transactions - also another perf hit, as lightweight transactions have (anecdotally) a 6x slowdown...

I like the idea but I'm curious if these are issues and whether these uses are something the author is looking to replace...

Very interesting idea though!

(CoAuthor of cassieq here so these were things we had to learn about.)

ddorian43 · on July 8, 2016

indexes are together with the data. the partition key becomes the _routing key, so you can always search 1, x, or all nodes depending on your _routing value

lightweight transactions are only used on schema-changes (which are/should-be rare)
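For the routing point, a minimal sketch of what such a search request could look like. The host, index name, and key values are hypothetical; the only real API assumed is the standard Elasticsearch `routing` query parameter on `_search`, which accepts a comma-separated list. How Elassandra maps partition keys onto it is taken from the description above, not verified here.

```python
from urllib.parse import urlencode


def search_url(base, index, routing=None):
    """Build a _search URL, optionally restricted to some routing values."""
    url = f"{base}/{index}/_search"
    if routing:
        # Several routing values are passed as one comma-separated parameter.
        url += "?" + urlencode({"routing": ",".join(routing)})
    return url


# Hypothetical host, index, and partition-key values.
BASE = "http://localhost:9200"
one = search_url(BASE, "users", ["alice"])            # one partition key
some = search_url(BASE, "users", ["alice", "bob"])    # a few partition keys
everything = search_url(BASE, "users")                # no routing: all nodes
print(one, some, everything, sep="\n")
```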

cipherzero · on July 8, 2016

Awesome, I watched the demo video... I will be trying this. Thanks for the info on lightweight transactions, sounds like the perfect use then!

As for the indexes - are they standard Cassandra secondary indexes? "Custom secondary indexes" - does that mean that it just looks like a secondary index, but is actually backed by Elasticsearch?

ddorian43 · on July 8, 2016

Cassandra offers a way to create your own custom secondary index. In this case, the secondary index is backed by Elasticsearch/Lucene.

Though you can't query it from Cassandra yet. You have to use the Elasticsearch REST API.

cipherzero · on July 9, 2016

That's ridiculously beautiful! Thank you!

vroyer · on July 16, 2016

Here is a typical use case of elassandra + kibana with cross datacenter replication https://github.com/vroyer/elassandra/blob/master/cross-datac...

thulya · on July 17, 2016

@vroyer, we are currently migrating the MSSQL + Elasticsearch backend of http://apply4u.co.uk as well as http://thulya.com to Elassandra. Our initial tests are very promising. I'll be happy to share more details soon about these production use cases.

cphoover · on July 8, 2016

I don't understand... Elasticsearch is built on top of Lucene indexes. Data indexed with postings lists is designed around the search use case. I don't know too much about Cassandra, but it's not a search engine?

Would like to know more about how indexing is handled.

cnlwsu · on July 8, 2016

The closest thing Cassandra has to behaving like a search engine is the new SASI indexes. Good deep dive here: http://www.doanduyhai.com/blog/?p=2058 which describes how it's different from Elasticsearch in the "SASI vs Search Engines" section.

    - SASI requires 2 passes on disk to fetch data: 1 pass to read the index files and 1 pass for the normal
    Cassandra read path whereas search engines retrieves the result in a single pass (DSE Search has a singlePass option too).
    By laws of physics, SASI will always be slower, even if we improve the sequential read path in Cassandra

    - Although SASI allows full text search with tokenization and CONTAINS mode, there is no scoring applied
    to matched terms. SASI returns result in token range order, which can be considered as random order from the
    user point of view. It is not possible to ask for total ordering of the result, even when LIMIT clause is used.
    Search engines don't have this limitation

    - last but not least, it is not possible to perform aggregation (or faceting) with SASI.
    The GROUP BY clause may be introduced into CQL in a near future but it is done on Cassandra side,
    there is no pre-aggregation possible on SASI terms that can help speeding up aggregation queries

ddorian43 · on July 8, 2016

Indexing is still handled the same as in normal Elasticsearch. The "raw" document, meaning "_source", is stored in a Cassandra row as separate fields, while the Lucene indexes are stored just as they are. Cassandra implements the replication/sharding, meaning that when a token migrates from one node to another, the affected documents get reindexed on the new node and deleted on the old node.

atombender · on July 8, 2016

This exists to index your Cassandra data. From the readme:

    * Cassandra update are automatically indexed in Elasticsearch.

    * Full-Text and spatial search on your cassandra data.

    * Real-time aggregation (does not require Spark or Hadoop to group by)

    * Provide search on multiple keyspace and tables in one query.

    * Provide automatic schema creation and support nested document using
    User Defined Types.

    * Provide a read/write JSON REST access to cassandra data (for indexed data)

jason_heo · on July 8, 2016

Good naming ;)

bryanrasmussen · on July 8, 2016

Seems interesting, but our current Elasticsearch is a third-party service - so that means we would need a third-party Cassandra service that also supports Elassandra on top, which seems unlikely. Does anyone know of any I can look at?

ddorian43 · on July 8, 2016

I don't know if there's any. One of the good things is that you don't have to overprovision shards, since each node has 1 shard for each index.

tjake · on July 8, 2016

This is close to my heart, I love these integrations!