Hacker News | jdnier's comments

Finally, a bracket I can enjoy (that doesn't involve basketball).

> I think Claude Shannon’s spirit is probably proud to know that his name is now being associated with such advances. Hats off to Claude!

I didn't realize Claude was named after Claude Shannon!

https://en.wikipedia.org/wiki/Claude_Shannon


Trivia: Claude Shannon proposed the idea of predicting the next token (in his case, the next letter) from statistics gathered over a training corpus back in 1950: "Prediction and Entropy of Printed English" https://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf


It goes back a bit further than that. His 1948 "A Mathematical Theory of Communication" [1] already contains (what we would now call) a Markov chain language model, page 7 onwards. AFAIK this was based on his classified WWII work, so it was probably a few years older still.

[1] https://people.math.harvard.edu/~ctm/home/text/others/shanno...


I was just reading Norbert Wiener's "The Human Use of Human Beings" (1950) and this quote gave me a good chuckle:

"One may get a remarkable semblance of a language like English by taking a sequence of words, or pairs of words, or triads of words, according to the statistical frequency with which they occur in the language, and the gibberish thus obtained will have a remarkably persuasive similarity to good English."
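The word-pair gibberish Wiener describes is easy to reproduce. A minimal sketch in Python (the toy corpus and function names here are made up for illustration): count how often each word follows each other word, then sample a chain according to those frequencies.

```python
import random
from collections import defaultdict

def train_bigrams(text):
    """Count, for each word, how often each following word occurs."""
    counts = defaultdict(lambda: defaultdict(int))
    words = text.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, n=10, seed=0):
    """Sample a word sequence according to the observed pair frequencies."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n - 1):
        successors = counts.get(out[-1])
        if not successors:  # dead end: no word ever followed this one
            break
        words, weights = zip(*successors.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

corpus = ("the cat sat on the mat and the dog sat on the log "
          "and the cat saw the dog")
model = train_bigrams(corpus)
print(generate(model, "the", n=8))
```

Every adjacent pair in the output occurs somewhere in the corpus, which is exactly why the result reads as "remarkably persuasive" gibberish: locally plausible, globally meaningless.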


A letter is not a token, is it? Redundancy could hit 75% in long sentences, but Shannon was not predicting tokens or words, he was predicting letters (characters).


It's like the diesel engine, which is named after Rudolf Engine.


:|


Is this a joke I don't get? His name was Rudolf Diesel, right?


Yes, it is a fantastic joke and I laughed for ages, well played.


Here I was assuming it was named after https://en.wikipedia.org/wiki/Claude_(alligator)


And Claude had a collection of cycles: unicycles. Unfortunately the article is about something else altogether.


Last time I asked, Claude itself didn't know either.


Wait till you hear about nvidia and their GPU architecture naming scheme :)


I had not heard of Glyphs, the tool the author used. I used to use Fontographer long ago.

https://glyphsapp.com/learn/recommendation:get-started

It's a great article!


Also a Fontographer user here. That's how you know you did font design in the late 90s.


DuckDB has a new "DuckLake" catalog format that would be another candidate to test. https://ducklake.select/


For me the issue is that DuckLake's feature of flushing inlined data to Parquet is still in alpha. One of the main problems with Parquet is that writing small batches leaves you with a lot of small Parquet files that are inefficient to work with using DuckDB. To solve this, DuckLake inlines these small writes into the DBMS you choose (e.g. Postgres), but for a while it couldn't write them back out to Parquet. Last I checked that feature didn't exist yet; it now seems to be in alpha, which is nice to see, but I'd like more solid support before I consider switching some personal data projects over. https://ducklake.select/docs/stable/duckdb/advanced_features...


Data inlining is also currently limited to the DuckDB catalog (i.e. it doesn't work with Postgres catalogs) [0]. It's improving very quickly, though, and I'm sure this will be expanded soon.

[0] https://ducklake.select/docs/stable/duckdb/advanced_features...


The DuckLake format has an unresolved, built-in chicken-and-egg conflict: it requires an SQL database to hold its catalog, but that is exactly what some people are running away from when they choose the Parquet format in the first place. Parquet = easy, SQL = hard; adding SQL to Parquet makes the resulting format hard. I would expect the catalog to be in Parquet format as well; then it becomes something self-bootstrapping and usable.


DuckLake is more comparable to Iceberg and Delta than to raw Parquet files. Iceberg requires a catalog layer too, a filesystem-based one at its simplest. For DuckLake any RDBMS will do, including file-based ones like DuckDB and SQLite. The difference is that DuckLake uses that database, with all its ACID goodness, for all metadata operations, so there is no need to implement transactional semantics over a REST or object-storage API.


It is not a chicken-and-egg problem; it is just a requirement that systems like DuckLake and Hive have an RDBMS available to store their catalogs in. Metadata is relatively small and needs ACID reads and writes, which is a great RDBMS use case.


What about file-based catalogs with Iceberg? Found one that puts it in a single json file: https://github.com/boringdata/boring-catalog


Then concurrency suffers, since you have to take locks when you update files.

That's also why DuckLake performs better than the others.

For many use cases this trade-off is worth it.


Yesterday there was a somewhat similar DuckDB post, "Frozen DuckLakes for Multi-User, Serverless Data Access". https://news.ycombinator.com/item?id=45702831


I set up something similar at work. But it was before the DuckLake format was available, so it just uses manually generated Parquet files saved to a bucket and a light DuckDB catalog that uses views to expose the parquet files. This lets us update the Parquet files using our ETL process and just refresh the catalog when there is a schema change.

We didn't find the frozen DuckLake setup useful for our use case. Mostly because the frozen catalog kind of doesn't make sense with the DuckLake philosophy and the cost-benefit wasn't there over a regular duckdb catalog. It also made making updates cumbersome because you need to pull the DuckLake catalog, commit the changes, and re-upload the catalog (instead of just directly updating the Parquet files). I get that we are missing the time travel part of the DuckLake, but that's not critical for us and if it becomes important, we would just roll out a PostgreSQL database to manage the catalog.


This also reminded me of an approach using SQLite: https://news.ycombinator.com/item?id=45748186


Looking up the etymology of "sargeable", I found this StackOverflow answer: https://dba.stackexchange.com/a/217983

And Google explains: the term "sargable" is a portmanteau of "Search ARGument ABLE," from SQL's query-optimization context.
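A quick way to see sargability in action is SQLite's EXPLAIN QUERY PLAN (the table and index names below are made up for illustration). A bare range predicate on an indexed column can seek into the index; wrapping the same column in a function hides its ordering and forces a full scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created TEXT)")
con.execute("CREATE INDEX idx_created ON orders(created)")

# Sargable: the indexed column stands alone on one side of each comparison,
# so the planner can seek into idx_created.
sargable = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE created >= '2024-01-01' AND created < '2025-01-01'"
).fetchall()

# Non-sargable: the function call on the column defeats the index,
# forcing a scan of the whole table.
non_sargable = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE substr(created, 1, 4) = '2024'"
).fetchall()

print(sargable)      # plan detail mentions SEARCH ... idx_created
print(non_sargable)  # plan detail mentions SCAN
```

The same principle applies in Postgres, SQL Server, etc., which is why "keep the column bare, move the math to the constant side" is standard indexing advice.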


If you want to do this rigorously, I suggest you read Robert D. Cameron's excellent paper "REX: XML Shallow Parsing with Regular Expressions" (1998).

https://www2.cs.sfu.ca/~cameron/REX.html
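For a flavor of the approach, here is a much-simplified sketch in Python. Cameron's paper develops a fully worked-out grammar; this version skips edge cases (e.g. ">" inside attribute values) but, like REX, splits a document into markup and text items with a single regular expression and never builds a tree.

```python
import re

# Simplified shallow-parsing pattern in the spirit of REX: alternatives are
# tried in order, so comments, PIs, and CDATA are recognized before plain tags.
XML_SHALLOW = re.compile(
    r"<!--.*?-->"            # comment
    r"|<\?.*?\?>"            # processing instruction
    r"|<!\[CDATA\[.*?\]\]>"  # CDATA section
    r"|<[^>]*>"              # start/end/empty-element tag (simplified)
    r"|[^<]+",               # intervening text content
    re.DOTALL,
)

def shallow_parse(doc):
    """Return the document as a flat list of markup and text items."""
    return XML_SHALLOW.findall(doc)

doc = '<p id="x">Hello <!-- hi --><b>world</b>!</p>'
print(shallow_parse(doc))
```

A useful invariant of shallow parsing, which Cameron emphasizes, is that concatenating the items reproduces the input exactly: nothing is consumed or normalized.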


That must be for "Dissolution and crystallization of cobalt, copper and sodium chlorides". It's quite something to watch!

https://www.nikonsmallworld.com/galleries/2025-small-world-i...


> should buy the books

Yes, I totally will, err... oh my: ebook for $91.99, paperback for $127.99. What's going on with these prices? These aren't college textbooks. I'm glad to hear about the 3rd edition, but the cost gives me pause.


I recall them being less expensive when first released.

Either there has been a new printing done overseas that was affected by tariffs (or made more expensive for some other reason), or the copies in the warehouse were taxed as inventory. Blame Congress for the latter: it was a major change in the tax law, and it created the current mess of remaindered books, no back-lists, and ever-spiraling book prices.


The paper they've submitted goes into a lot more detail. https://royalsocietypublishing.org/doi/10.1098/rspa.2025.029...

