Back in 2006 I wanted to learn more about how search engines worked, so I started porting Lucene to Common Lisp. Actually, I wrote a Common Lisp port of Ferret. Ferret is a Ruby port of Lucene. Lucene is sort of Doug Cutting's Java version of Text Database (TDB), which he and Jan Pedersen developed at Xerox PARC, and which, to complete the circle, was written in Common Lisp.
I didn't know Ruby, and I didn't know search engines, but I did know Common Lisp. It took me 7 months to create a binary-compatible, pretty dang functional port of Lucene. I called it Montezuma[1], and it's still actually used by people.
About half the time was spent implementing the text analyzer, document store, and indices. The other half was spent implementing search (and parsing the query language). It was a very rewarding experience: it was one of the largest projects I'd worked on, mostly solo (I got some help near the end), in an area I knew nothing about, working in an extremely test-driven fashion (over 2000 unit tests when I was done).
I did not, however, learn as much about indexing and search as I expected. I learned a lot, yes, but quite a bit of the Ruby code was so easy to translate to Common Lisp that I didn't have to understand everything in order to make it work.
I still recommend writing a small search engine as an interesting exercise, though.
A footnote: for the Java devs out there, the Lucene codebase is a joy to read [0], especially the APIs. I highly recommend it just for the documentation, which is top-notch. Michael McCandless, Lucene's committer-in-chief, sometimes blogs about its internals [1].
Or, instead of porting Lucene, just take its main concepts, such as analysis, tokenization, an in-memory trie or binary search tree, the query parser, term, query, and collector, and implement them the way _you_ would, using whatever bitmaps and other file formats _you_ see fit when serializing to disk. If you iterate enough times, you'll realize that, as you've grown in your understanding of these concepts, your code base has turned into quite an approximation of Lucene, with the same flaws and the same strengths.
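To make the suggestion concrete, here is a toy sketch of the first two of those concepts, an analyzer plus an inverted index with AND queries. Everything here (class and function names included) is illustrative, not Lucene's actual design:

```python
import re
from collections import defaultdict

def tokenize(text):
    """A naive analyzer: lowercase and split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

class TinyIndex:
    """A toy inverted index: term -> set of document ids."""
    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """AND query: intersect the postings lists of every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for term in terms[1:]:
            result &= self.postings[term]
        return result

idx = TinyIndex()
idx.add(1, "Lucene is a search library")
idx.add(2, "Ferret is a Ruby port of Lucene")
print(idx.search("lucene ruby"))  # {2}
```

Iterating on exactly this kind of toy (scoring, phrase queries, on-disk postings) is where the resemblance to Lucene starts to emerge.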
If you feel Lucene is close to a global, or at least a very high local, optimum, then by now you know search.
Now the real fun begins, because now you get to implement your own search model as either a Lucene model or one that is supported by your own code.
There are ten or twelve really interesting problems to solve before you can call it a day. Before you're done, you'll start to see everything, and I do mean everything, as a search problem. Use Wikipedia as your tutor, but prepare to have your current world view completely transformed, because what's a search problem, really?
What's a word? What's a phrase? What's their meaning? What's this word's meaning in the context of these other words? What's a paragraph? What's the meaning of this paragraph, in the context of these other paragraphs? What are some of the patterns we can perceive in the binary representation of our data? What are some of the patterns we can perceive in the vector space representation of our data? The answers to these questions and more lie in the search model, and implementing one is the most fascinating thing there is, because no one, not even Google's top engineer, knows what the correct set of questions is.
LMAO Ferret was sooo bad. It was written by a guy who effed off to Japan to study Jujitsu, and I remember my first startup job was porting all of its ints to size_t.
Shameless plug from someone who wants this project to flourish.
Check out https://vespa.ai as an alternative to Elasticsearch. Migrating from ES to it, I got faster search, never had to face an unhealthy node, and got native tensor support (and native ANN is coming soon: https://github.com/vespa-engine/vespa/issues/9747).
Vespa looks pretty good, at least in terms of performance and operation. I've been evaluating it myself. I'm less happy about everything else.
It's got a mishmash of odd APIs, lots of XML, several query languages, lots of weird little quirks. It doesn't feel modern. It's pretty clear that this is originally an in-house project, developed over many years by many people, where not as much effort has been spent on consistent/cohesive design or documentation.
One rough area is the approach to schemas and indexing. Rather than let you define a "clean" schema and put in your data and then have Vespa index it in all the ways it knows about, you're forced to essentially reshape your data into a format compatible with Vespa, which brings with it some severe restrictions. For example, Vespa will not index arbitrarily nested structured data. If you have something like {categories: [{id: 1}]}, Vespa will not index that. You have to flatten any array data to the top level. Nested maps and arrays are mostly not supported, although it's hard to tell from the documentation what is supported.
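The reshaping described above amounts to flattening nested structures into top-level fields. A rough sketch of the idea (the dotted-key naming is my own illustration, not Vespa's actual convention):

```python
def flatten(doc, prefix="", out=None):
    """Flatten nested dicts/lists into top-level dotted keys,
    e.g. {"categories": [{"id": 1}]} -> {"categories.id": [1]}."""
    if out is None:
        out = {}
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            flatten(value, path, out)
    elif isinstance(doc, list):
        for item in doc:
            flatten(item, prefix, out)
    else:
        out.setdefault(prefix, []).append(doc)
    return out

print(flatten({"categories": [{"id": 1}, {"id": 2}], "title": "news"}))
# {'categories.id': [1, 2], 'title': ['news']}
```

Note that this kind of flattening is lossy: you can no longer tell which `id` belonged to which array element, which is part of why the restriction feels severe.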
Vespa is also very obviously skewed toward ranking, not filtering. You can't search by exact string matching: you can't do something like "topic = 'news'". You only get case-insensitive substring search. It's got lots of ranking functions but very little that's optimized for filtering.
Overall, I'm a bit surprised that Vespa's authors position it as an Elasticsearch competitor, because you certainly cannot just port an app that uses ES over to it.
To be sure, it's got lots of interesting features such as ML integration, and, again, performance and clustering design seems good. But it still feels very much like a niche product.
I migrated from ES, and I do not agree that it doesn't feel modern. The middleware logic container and live reconfiguration are mind-blowing. About those two things:
By modern I mean the approach to configuring and running, and the myriad of languages used: Antiquated XML for some things, a homegrown DSL for others, JSON for query results, then multiple languages for expressing various parts of the query -- it's pretty chaotic.
Another thing that felt antiquated: The whole notion of uploading an "application". I can appreciate the benefits of controlling the lifecycle of the configuration and have Vespa distribute it to nodes. But when you start out, that "application" is just one or two files, and yet you have to create a whole directory structure for it, as opposed to just POSTing individual configs to REST endpoints like you can do with ES. The heavy-handedness of it feels very "Java".
The document you linked to is a different type of exact match. I've been through this, and even posted a Github issue. Mysteriously, a Vespa developer replied that nobody had ever needed exact string matching, so nobody had bothered to implement it.
Parent/child is not applicable to what I was talking about, I think. I'm not talking about hierarchical relationships.
For my part, most of my work is in structured data, not text or vector-based ranking, and Vespa really doesn't seem to be designed for that.
ES also has a very, very good aggregation API. Vespa's aggregation syntax is odd and seemingly much more limited.
I used to use Lucene back in the 1.x days, when a fuzzy search was a complete table scan. It was quite a surprise to see how your single-term fuzzy query was interpreted as one term query for each fuzzy hit, OR-ed together. The Lucene team soon realized they needed to code a Levenshtein automaton, but none of them had ever done that before. They pulled several all-nighters reading math papers and coding, and when they succeeded they were so happy they told the world about it [0]. It's a great story.
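The old behavior described above is easy to sketch: scan the whole term dictionary, keep every term within the edit-distance budget, and OR the resulting term queries together. A minimal illustration (the automaton approach replaced exactly this scan):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_expand(term, dictionary, max_edits=1):
    """Old-style fuzzy query: scan every term in the dictionary and
    keep those within max_edits; the hits would then be OR-ed together."""
    return [t for t in dictionary if edit_distance(term, t) <= max_edits]

terms = ["ferret", "feret", "parrot", "fervent", "fret"]
print(fuzzy_expand("ferret", terms))  # ['ferret', 'feret']
```

With millions of unique terms, that linear scan is the "complete table scan" cost; a Levenshtein automaton intersected with the term dictionary avoids visiting most of it.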
There's also a YouTube recording of a talk with similar content by the same author from EuroPython 2014. Helped me out when I was adopting ES at scale in that time period. (And the principles are pretty timeless to modern ES, too.)
Does Elasticsearch need to be as complicated as it is?
I was surprised to find there wasn't an Elasticsearch + Kibana competitor that is "simpler".
I just want to be able to store JSON logs with a timestamp + a bunch of fields then search them in a nice little UI later. Apparently, that's pretty hard to do right.
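The core of that use case really is small; a toy stand-in for "timestamp + fields, then filter" fits in a few lines (this is a sketch of the data model, obviously not a replacement for a real log store):

```python
import json
from datetime import datetime

class TinyLogStore:
    """Toy JSON log store: ingest JSON lines, filter by field
    values and an optional lower bound on the timestamp."""
    def __init__(self):
        self.records = []

    def ingest(self, raw):
        rec = json.loads(raw)
        rec["@timestamp"] = datetime.fromisoformat(rec["@timestamp"])
        self.records.append(rec)

    def search(self, since=None, **fields):
        hits = []
        for rec in self.records:
            if since is not None and rec["@timestamp"] < since:
                continue
            if all(rec.get(k) == v for k, v in fields.items()):
                hits.append(rec)
        return hits

store = TinyLogStore()
store.ingest('{"@timestamp": "2020-01-01T10:00:00", "level": "error", "msg": "boom"}')
store.ingest('{"@timestamp": "2020-01-01T11:00:00", "level": "info", "msg": "ok"}')
print(len(store.search(level="error")))  # 1
```

What makes the real thing hard is everything around this: full-text search over the fields, retention, sharding, and a UI, which is presumably why the "simple" competitors are thin on the ground.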
My team is using Loki + Grafana and we're pretty happy with it. It's pretty basic but it does what you expect it to do just fine.
We ditched Elastic as it was a super massive PITA to operate (& a resource hog at that). I'll admit I'm not an expert at ELK at all, but tbh I was absolutely surprised just how bad Elastic + Kibana was for our basic log uses when they tout it as one of their mainstays. Or we were just exceptionally stupid, who knows. In any case, the experience we had with it didn't motivate us to become ELK experts at all.
Our pet peeves:
- The Kibana UI needlessly wastes tons of screen space on whitespace
- makes it hard to dig down into logs
- never seems to find exact string matches when we want it to, and instead returns "helpful" fuzzy matches
- Kibana has no qualms sending requests to Elastic that will happily kill your node instead of applying sensible paging / query timeouts. I mean that's why I'm using Kibana and not writing my own elastic frontend...
But Elasticsearch has evolved into a whole bunch of things to meet everyone's needs. There's a way to do what you want and simply, but you have to find the simple path in the middle of the big product.
I mean, for the same reason that SQL isn't any simpler? It takes arbitrary data, and you can filter and aggregate it in arbitrary ways. People aren't just doing log data... it's just been heavily adopted for that purpose.
I think the "simpler" version of ES+Kibana is probably a spreadsheet.
Elasticsearch is really just clustered Lucene with some nice features wrapping it. You can probably get away with something else that also wraps Lucene. Though Elasticsearch has a dominant position precisely because it is quite full-featured and easy to run.
I'm afraid that's a bit of an oversimplification. For example, aggregations in Elasticsearch don't use Lucene facets; they just leverage basic Lucene mechanisms (the collector).
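For readers unfamiliar with the collector idea: a collector is just a callback that receives each matching document as the query runs, and an aggregation can be built by bucketing inside it. A rough Python sketch of the pattern (the names are illustrative, not Lucene's actual Java API):

```python
from collections import Counter

class TermsAggregationCollector:
    """Illustrative collector: as each matching doc is 'collected',
    bucket it by the value of one field, like a terms aggregation."""
    def __init__(self, field):
        self.field = field
        self.buckets = Counter()

    def collect(self, doc):
        self.buckets[doc.get(self.field, "(missing)")] += 1

docs = [{"status": 200}, {"status": 404}, {"status": 200}]
agg = TermsAggregationCollector("status")
for doc in docs:  # stand-in for the engine's match loop
    agg.collect(doc)
print(agg.buckets.most_common())  # [(200, 2), (404, 1)]
```

The point of the original comment stands: the aggregation logic lives above Lucene, which only supplies the stream of matches.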
> JSON logs with a timestamp + a bunch of fields then search them
There is S3+Athena for this with AWS and Google can store/query JSON with BigQuery. The nice little UI doesn't come with it, but at least you don't have to spin up an Elastic cluster.
Do you have an S3 + Athena example? When I tried it, it didn't seem to query an ad-hoc JSON file; rather, the S3 file needed to contain an array of JSON documents.
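For what it's worth, my understanding is that Athena's JSON SerDes expect newline-delimited JSON (one complete object per line) rather than arbitrary ad-hoc JSON, so data often has to be reshaped before it's queryable. A quick sketch of converting an array-of-documents file into that form:

```python
import json

def to_ndjson(array_json):
    """Convert a string containing one JSON array into newline-delimited
    JSON (one object per line), the shape Athena's JSON SerDes expect."""
    records = json.loads(array_json)
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

print(to_ndjson('[{"a": 1}, {"a": 2}]'))
# {"a":1}
# {"a":2}
```

If your experience was different, it may depend on which SerDe the table was created with.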
One thing that really bothers me about ES is that compared to Solr, some terms that have a specific meaning in Lucene are either not used for the corresponding concept or even worse re-used for a different one.
It sometimes makes explaining the underlying implementation a bit harder to people who are ES users but Lucene-agnostic, for no good reason apart from, I would guess, brand differentiation?
1. https://github.com/sharplispers/montezuma