
Author of SymSpell here. Congrats on the launch of Lexiathan.

Unfortunately, the comparison of Lexiathan vs. SymSpell on your website regarding accuracy is misleading.

1. SymSpell has two parameters that control the maximum edit distance. Once you set both to 3, terms with an edit distance of 3 are also corrected accurately:

  pronnouncaition -> pronunciation

  inndappendent -> independent

  unegspeccted -> unexpected

  soggtwaee -> software
2. SymSpell comes with dictionaries in several sizes. Once you load the 500,000-term dictionary, the two remaining terms are corrected as well:

  maggnificntally -> magnificently

  annnesteasialgist -> anesthesiologist
https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell.B...

SymSpell accurately corrects all of your examples if used properly with the correct parameters and dictionary.
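To make the role of the two parameters concrete, here is a minimal pure-Python sketch of the delete-based lookup idea behind SymSpell. This is illustrative only: the class and parameter names are hypothetical stand-ins, not SymSpell's real API. A dictionary-side edit distance controls which delete variants are precomputed at index time, and a lookup-side edit distance controls query-time candidate generation.

```python
from itertools import combinations

def deletes(term, max_distance):
    """All variants of `term` with up to `max_distance` characters removed."""
    variants = {term}
    for d in range(1, max_distance + 1):
        for idx in combinations(range(len(term)), d):
            variants.add("".join(c for i, c in enumerate(term) if i not in idx))
    return variants

def levenshtein(a, b):
    """Plain edit distance, used here to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class TinySymSpell:
    """Toy delete-based corrector (hypothetical API, not SymSpell's)."""
    def __init__(self, words, dictionary_edit_distance=2):
        self.dict_distance = dictionary_edit_distance
        self.index = {}
        for w in words:
            for v in deletes(w, self.dict_distance):
                self.index.setdefault(v, set()).add(w)

    def lookup(self, term, max_edit_distance=2):
        candidates = set()
        for v in deletes(term, max_edit_distance):
            candidates |= self.index.get(v, set())
        return sorted(w for w in candidates
                      if levenshtein(term, w) <= max_edit_distance)

# With both distances raised to 3, a distance-3 misspelling is found.
sp = TinySymSpell(["independent", "unexpected", "software"],
                  dictionary_edit_distance=3)
print(sp.lookup("inndappendent", max_edit_distance=3))  # ['independent']
```

With the default of 2 for both parameters, the same lookup returns nothing. This also shows the tradeoff: a larger dictionary edit distance precomputes many more delete variants, trading memory and indexing time for recall at higher edit distances.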

Apart from that, your methodology of comparing correction accuracy by cherry-picking specific terms on which your product seemingly performs better, without statistical significance, is questionable.

One would instead use large public corpora to measure both the percentage of accurately corrected terms and the percentage of false positives.
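That methodology is straightforward to implement. A minimal sketch, with a hypothetical toy corrector and hand-made test cases standing in for a real spell checker and a real misspelling corpus:

```python
def evaluate(correct_fn, cases, valid_words):
    """Measure correction accuracy and the false-positive rate.

    cases: (misspelled, expected) pairs, e.g. drawn from a public
    misspelling corpus; valid_words: correctly spelled inputs that
    a spell checker should leave untouched.
    """
    corrected = sum(1 for wrong, right in cases if correct_fn(wrong) == right)
    false_pos = sum(1 for w in valid_words if correct_fn(w) != w)
    return corrected / len(cases), false_pos / len(valid_words)

# Toy corrector standing in for whichever spell checker is under test.
fixes = {"teh": "the", "recieve": "receive", "wich": "which"}
toy = lambda w: fixes.get(w, w)

acc, fp = evaluate(toy,
                   [("teh", "the"), ("recieve", "receive"), ("graet", "great")],
                   ["hello", "world"])
print(acc, fp)  # 2 of 3 corrected, no false positives
```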

Because SymSpell is open source, everyone can integrate it into their applications for free, modify the code, use different dictionaries in various languages, or add terms to existing ones.

https://github.com/wolfgarbe/SymSpell

https://github.com/wolfgarbe/symspell_rs


Hi wolfgarbe,

I don't believe my benchmark of SymSpell is misleading. I used the WebAssembly repository that is listed on your GitHub: https://github.com/justinwilaby/spellchecker-wasm

Here is the code I used for my benchmark: https://gist.github.com/Eratosthenes/bf8a6d1463d2dfb907fa13c...

I reported the results faithfully and I believe these results reflect the performance that users would typically see running SymSpell in the browser, using the default configuration. If I had increased the edit distance, then the latency performance gap between Lexiathan and SymSpell would have been even larger, and then arguably I would have been gaming my metrics by not using SymSpell as it is configured.

Regarding dictionary size: the dictionary I used (as you can verify from the gist) was 82k words. I didn't specify the dictionary size I used for Lexiathan, but it was 106k words.

Lastly, three of the words in the benchmark have edit distances greater than three:

distance("pronnouncaition", "pronunciation") = 4

distance("maggnificntally", "magnificently") = 4

distance("annnesteasialgist", "anesthesiologist") = 6

So I do not believe SymSpell would correct these even with the edit distance increased to 3.


Peter Norvig shows that an edit distance of 2 covers 98.9% of spelling errors. https://impythonist.wordpress.com/2014/03/18/peter-norvigs-2...

That's the reason why the default maximum edit distance of SymSpell is 2.

Now, all 6 of your examples are chosen from the 1.1% margin that is not covered by edit distance 2, representing an improbably high number of errors within a single word.

The third-party SymSpell port from Justin Willaby, which you were using for benchmarking, clearly states that you need to set both maxEditDistance and dictionaryEditDistance to higher values if you want to correct larger edit distances. You neither used nor mentioned this. It has nothing to do with accuracy; it is a performance vs. maximum-edit-distance tradeoff one can make according to the use case at hand.

https://github.com/justinwilaby/spellchecker-wasm?tab=readme...

pronnouncaition IS within edit distance 3, according to the Damerau-Levenshtein edit distance used by SymSpell. The reason is that adjacent transpositions are counted as a single edit. https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_di...
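The difference is easy to check by putting a plain Levenshtein implementation next to the optimal-string-alignment variant of Damerau-Levenshtein (a minimal sketch):

```python
def levenshtein(a, b):
    """Plain Levenshtein: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def osa(a, b):
    """Damerau-Levenshtein (optimal string alignment): an adjacent
    transposition additionally counts as a single edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(levenshtein("pronnouncaition", "pronunciation"))  # 4
print(osa("pronnouncaition", "pronunciation"))          # 3: "ai" -> "ia" is one transposition
```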


The examples that I chose for my benchmark demonstrate that Lexiathan maintains accuracy and performance even on severely degraded input. On less corrupted input, Lexiathan runs significantly faster and is even more accurate.

Lexiathan also doesn't have any edit distance parameters that need to be configured, so there is no "tuning" required. In particular, it's worth mentioning that using a very large dictionary, e.g. 500,000 words, often degrades accuracy rather than improving it, and likely increases memory usage and latency as well.

Regarding Norvig's 98.9% figure: it seems to come from Norvig's own made-up data. In the real world, users often generate misspellings that exceed an edit distance of 2 in many use cases (OCR, non-native speakers, medical/technical terminology, etc.), and published text (often already spell-checked) doesn't reflect the same level of errors. In any case, Norvig's spell checker apparently achieves only 67% accuracy on its own chosen benchmarks, so the 98.9% figure is clearly not a realistic reflection of actual spell-checker performance, even for an edit distance of 2. Lexiathan is extremely accurate and retains high performance even on heavily degraded input, and the benchmark data (and demo) that I presented reflect that.


The stopword list in SeekStorm is purely optional; by default it is empty.

The query "to be or not to be" that you mentioned, which consists solely of stopwords, returns complete results and performs quite well in the benchmark: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#be...

Both Lucene and Elastic still offer stopword filters: https://lucene.apache.org/core/10_3_2/analysis/common/org/ap... https://www.elastic.co/docs/reference/text-analysis/analysis...


Thanks for correcting me and clarifying this.


Can the index size exceed the RAM size (e.g., via memory mapping), or are index size and document number limited by RAM size? It would be good to mention those limitations in the README.


Great work! Would be interesting to see how it compares to Lucene performance-wise, e.g. with a benchmark like https://github.com/quickwit-oss/search-benchmark-game


Thanks! Honestly, given it's hacked together in a weekend, I'm not sure it'd measure up to Lucene/Bleve in any serious way.

I intended this to be an easy on-ramp for folks who want to get a feel for how FTS engines work under the hood :)


Sure, but it says "High-performance" Full Text Search Engine. Shouldn't that claim be backed up by numbers, comparing it to the state of the art?


Not _that_ long ago Bleve was also hacked together over a few weekends.

I appreciate the technical depth of the readme, but I’m not sure it fits your easy on-ramp framing.

Keep going and keep sharing.


The most widely used DHT is Kademlia from Petar Maymounkov and David Mazières. It is used in Ethereum, IPFS, I2P, Gnutella DHT, and many other applications.

https://en.wikipedia.org/wiki/Kademlia

https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia...

https://web.archive.org/web/20120128120732/http://www.cs.ric...
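Kademlia's defining idea is its XOR metric: the distance between two node IDs is simply their bitwise XOR, and each node keeps one k-bucket per position of the highest differing bit. A minimal sketch with toy small integers as IDs (real Kademlia uses 160-bit IDs):

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia's metric: the distance between two node IDs is their XOR."""
    return a ^ b

def bucket_index(a: int, b: int) -> int:
    """k-bucket index for node `b` as seen from node `a`: the position
    of the most significant bit in which the two IDs differ."""
    return xor_distance(a, b).bit_length() - 1

print(xor_distance(0b1010, 0b1000))  # 2
print(bucket_index(0b1010, 0b1000))  # highest differing bit is bit 1
```

The XOR metric is symmetric and satisfies the triangle inequality, which is what makes the per-bit bucket routing tables and the iterative lookup procedure work.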


SeekStorm comes with an HTTP interface.

The SeekStorm server features a REST API via HTTP: https://seekstorm.apidocumentation.com

It also comes with an embedded Web UI: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#bu...

Or did you mean a web-based interface to create and manage indices, define index schemas, add documents, etc.?


>> The documentation seems a bit sparse.

We just released a new OpenAPI-based documentation for the SeekStorm server REST API: https://seekstorm.apidocumentation.com

For the library, we have the standard Rust docs: https://docs.rs/seekstorm/latest/seekstorm/


For the latency benchmarks we used vanilla BM25 (SimilarityType::Bm25f for a single field) for comparability, so there are no differences in terms of accuracy.

For SimilarityType::Bm25fProximity which takes into account the proximity between query term matches within the document, we have so far only anecdotal evidence that it returns significantly more relevant results for many queries.
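For context, here is a textbook sketch of the vanilla BM25 formula. This is a simplified, generic stand-in, not SeekStorm's SimilarityType::Bm25f implementation, and it omits the proximity weighting that Bm25fProximity adds:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score `doc` (a list of terms) against `query_terms`, where `corpus`
    is the full list of term-list documents (used for IDF and avgdl)."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)            # document frequency
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        tf = doc.count(term)                           # term frequency in doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["rust", "search", "engine"], ["rust", "lang"], ["search", "index"]]
query = ["rust", "search"]
# The document matching both query terms scores highest.
scores = [bm25_score(query, d, corpus) for d in corpus]
```

A proximity-aware variant would additionally boost documents where the matched query terms occur close together, which vanilla BM25 ignores entirely.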

Systematic relevancy benchmarks like BEIR and MS MARCO are planned.


Got it. I think the anecdotal evidence is what intrigued me a little bit. Looking forward to seeing the systematic relevancy benchmarks.


The SeekStorm library is 9 MB, and the SeekStorm server executable is 8 MB, depending on the features selected in Cargo.

You add the library to your project via 'cargo add seekstorm'; the project has to be compiled anyway.

As for the server, we may add binaries for the main OSes in the future.

WASM and Python bindings are on our roadmap.


In SeekStorm you can choose per index whether to use mmap or let SeekStorm fully control RAM access. The latter has a slight query performance advantage, at the cost of a higher index load time. https://docs.rs/seekstorm/latest/seekstorm/index/enum.Access...
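As a generic illustration of that tradeoff (plain Python, nothing SeekStorm-specific): memory mapping returns almost immediately and lets the OS page data in lazily on access, while a full load reads everything into RAM up front.

```python
import mmap, os, tempfile

# Create a small sample "index" file.
path = os.path.join(tempfile.mkdtemp(), "index.bin")
with open(path, "wb") as f:
    f.write(b"posting-list-data" * 1000)

# Option 1: memory-map -- near-instant "load"; the OS pages data in on demand.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk_mapped = mm[:17]          # touches only the first page
    mm.close()

# Option 2: full RAM load -- longer load time, but every access hits RAM.
with open(path, "rb") as f:
    data = f.read()                 # reads the whole file up front
    chunk_ram = data[:17]

assert chunk_mapped == chunk_ram == b"posting-list-data"
```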


Does SeekStorm use io_uring? Could io_uring lower load time here?

Or at least lazy loading of the index into RAM (emulating what mmap would do anyway)?


SeekStorm does not currently use io_uring, but it is on our roadmap. The challenge is cross-platform compatibility: Linux (io_uring) and Windows (IoRing) use different implementations, and other OSes don't support it at all. There is no abstraction layer over those implementations in Rust, so we are on our own.

It would increase concurrent read and write speed (index loading, searching) by removing the need to lock around seek and read/write.

But I would expect that the mmap implementations already use io_uring / IoRing.

Yes, lazy loading would be possible, but pure RAM access does not offer enough benefit to justify the effort of replicating much of what memory mapping already provides.

