Hacker News | lucb1e's comments

> > Pareto optimality is a situation where no […] preference criterion can be better off without making at least one […] preference criterion worse off […].

(Omissions theirs.)

Wasn't that zstandard's stated goal? It's not very surprising that it has this property, also considering it's much newer (2015) than established tools like gzip (1992), bzip2 (1996), and LZMA as used by xz utils (1999).

Edit: https://github.com/facebook/zstd/blob/4856a00164c1d7b947bd38... initial commit indeed states it's meant to have good ratios and good (de)compression speeds as compared to other tools, without sacrificing one for the other (»"Standard" translates into everyday situations which neither look for highest possible ratio (which LZMA and ZPAQ cover) nor extreme speeds (which LZ4 covers).«). So Pareto by design, just not using that name


> meant to have good ratios and good (de)compression speeds as compared to other tools

That does not mean it's Pareto optimal; Pareto-optimality forms a curve and while zstd, LZMA, LZ4, ZPAQ all want to be as close as possible to the curve, they focus on different parts of it. In particular zstd tries to stay on the middle part of the curve, while LZMA and LZ4 focus on opposite sides

          ___.--- higher throughput
        /     LZ4
       / zStd
      |
      ; LZMA
     |
     | ZPAQ
    lower size
Also, the Pareto curve is not necessarily known in advance. All you can do is add more and more algorithms or tweaks to understand what it looks like. For example, this blog post [https://insanity.industries/post/pareto-optimal-compression/] shows that prior to zstd, bzip2 and gzip were both pretty much Pareto optimal in the same area for ratio vs. compression speed. LZMA at low settings was a bit better but much slower. There was a huge gap between LZMA and LZ4, and bzip2/gzip filled it as best as they could.
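
The "dominates" relation behind that curve is easy to state in code. Here's a toy sketch in Python with made-up (ratio, throughput) numbers, purely to illustrate how a point like gzip's falls off the frontier once something else beats it on both axes:

```python
# Toy Pareto-frontier computation over (compression_ratio, speed_mbps).
# All numbers are invented for illustration, not real benchmarks.
points = {
    "lz4":  (2.1, 700.0),
    "zstd": (3.0, 400.0),
    "gzip": (2.7, 60.0),   # dominated by zstd: worse ratio AND slower
    "lzma": (3.7, 20.0),
}

def pareto_front(points):
    """Keep every point that no other point beats on both axes."""
    front = {}
    for name, (ratio, speed) in points.items():
        dominated = any(
            r >= ratio and s >= speed and (r > ratio or s > speed)
            for other, (r, s) in points.items() if other != name
        )
        if not dominated:
            front[name] = (ratio, speed)
    return front

print(sorted(pareto_front(points)))  # gzip drops out
```

With these made-up numbers, lz4, zstd, and lzma each win somewhere on the curve, while gzip is strictly worse than zstd and so is not Pareto optimal.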

The same blog post shows that zstd is an absolute speed demon at decompression; while not all zstd settings are Pareto optimal when looking at size vs compression speed (in particular LZMA wins at higher compression ratios, and even considering zstd only there's hardly a reason to use levels 11-15), zstd is pretty much Pareto optimal at all settings when looking at size vs. decompression speed. On the other hand at intermediate settings zstd is faster and produces smaller files than gzip, which therefore is not Pareto optimal (anymore).


This misses the very best compressors, by Fabrice Bellard: https://bellard.org/nncp/ and, for text, ts_zip.

Interesting approach. The fastest of the 4 presented compressors ("LSTM (small)") is 24 times slower than xz, and their best compressor ("LSTM (large1)") is 429 times slower than xz. Let alone gzip or, presumably, zstandard (not shown in paper). They also ran the models on different CPUs (a Core i5 and a Xeon E5) so the results are not even comparable within the same paper. A linked webpage lists the author's decompression times, which are even worse: xz decompresses twelve thousand times faster (50MB/s vs. 4kB/s) when nncp has an Nvidia RTX 3090 and 24GB RAM available to it, which apparently speeds it up by 3x compared to the original CPU implementation.

At half the size of xz's output, there can be applications for this, but you need to:

- not care about compression time

- not be constrained on hardware requirements (7.6GB RAM, ideally let it run on a GPU)

- not care about decompression time either

- and the data must be text (I can't find benchmarks other than from English Wikipedia text, but various sources emphasize it's a text compressor so presumably this is no good on e.g. a spacecraft needing to transmit sensor/research data over a weak connection, even if the power budget trade-off of running a GPU instead of pumping power into the antenna were the optimal thing to do)


Huh? Only if it gets decompressed a few times, I would say, because it's so extremely slow at it

Like a software installation that you do one time. I'd not even want it for updates if I need to update large data files. The only purpose I'd see is the first-time install where users are okay waiting a while, and small code patches that are distributed to a lot of people

(Or indeed if size is important, but then again bzip2 only shines if it's text-like. I don't really find this niche knowledge for a few % optimization worth teaching tbh. Better teach general principles so people can find a fitting solution if they ever find themselves under specific constraints like OP)


> bzip2 and xz are extremely slow to compress

This depends on the setting. At setting -19 (not even using --long or other tuning), Zstd is 10x slower to compress than bzip2, and 20x slower than xz, and it still gets a worse compression ratio for anything that vaguely looks like text!

But I agree if you look at the decompression side of things. Bzip2 and xz are just no competition for zstd or the gzip family (but then gzip and friends have worse ratios again, so we're left with zstd). Overall I agree with your point ("just use zstd") but not for the fast compression speed, if you care somewhat about ratios at least
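
If you want to sanity-check ratio claims like this on your own data, Python's stdlib ships the gzip, bzip2, and xz codecs (zstd itself only appears in very recent stdlibs), so a rough comparison takes a few lines. The sample text here is made up:

```python
import bz2
import gzip
import lzma

# Made-up sample: repetitive ASCII text, a friendly case for all three codecs.
data = b"GET /index.html HTTP/1.1 200 5120\n" * 5000

sizes = {
    "gzip":  len(gzip.compress(data, compresslevel=9)),
    "bzip2": len(bz2.compress(data, compresslevel=9)),
    "xz":    len(lzma.compress(data, preset=9)),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes ({size / len(data):.2%} of original)")
```

Swap in a representative file of your own instead of the toy payload; which codec wins on ratio genuinely depends on the data, which is the whole point of the thread.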


It's good, but is it "the future" when it's extra work?

Consider that you could hand-code an algorithm to recognize cats in images but we would rather let the machine just figure it out for itself. We're kind of averse to manual work and complexity where we can brute force or heuristic our way out of the problem. For the 80% of situations where piping it into zstd gets you to stay within budget (bandwidth, storage, cpu time, whatever your constraint is), it's not really worth doing about 5000% more effort to squeeze out thrice the speed and a third less size

It really is considerably better, but I wonder how many people will do it, which means less implicit marketing by seeing it everywhere like we do the other tools, which means even fewer people will know to do it, etc.


Was the name used with permission? Even if not trademarked (because open source freedom woohoo), it's a bit weird to release, say, Windows 12 without permission from the authors of Windows 11

I tried looking it up myself, but it doesn't say in the readme or the doc/ folder, there is no mention of any of the bzip2 authors, and there is no website listed, so I presume this GitHub page is canonical


Free software doesn't have a lot of trademarks, with the notable exception of Linux.

Also, the name is the algorithm. Bzip2 has versions and bzip3 is something else which has its own updated versions. Programs that implement a single algorithm often follow this pattern.


> Free software doesn't have a lot of trademarks

Hence my saying "not trademarked (because open source freedom woohoo)". I understand bzip3 is legal, but my question is whether it's a dick move

> the name is the algorithm. Bzip2 has versions and bzip3 is something else

I don't understand this. If bzip3 is "something else" compared to bzip2 then how can both be named after the algorithms they use? Nothing in bzip2's algorithm is related to the digit 2. If it used, say, powers of 2 centrally and bzip3 used pi for something, then the naming could make sense since it's indeed not a version number, but I don't remotely see this argument in these algorithms

Looking on Wikipedia and bzip2's website, there is no explanation of the name, and yesterday I read somewhere (either in the OP or in another comment) that it would stand for "better zip". It has nothing to do with PKZIP though. If they had both implemented pkzip's format, then a "spiritual successor" to bzip2 would be something like uzip, uzip2 (replacing "better" with "ultimate"), bbzip2, or any of a million other options, but that's not the pattern they chose to follow

How is this not just taking their established name for your own project with the version number +1?


I read a while back that bzip2 is named that way because the original bzip used arithmetic coding. The person who made bzip then made bzip2 to use Huffman coding due to patent problems with arithmetic coding.

This matches my experience compressing structured text btw. Bzip2 will beat every other tool out there, both on compression ratio and, sadly, decompression time

OP says decompression time is so high because it has similar properties to a memory-hard password hash: it's bandwidth-bound due to the random access requirement. Even xz decompresses 2.5x faster, and I don't find it particularly fast

This is why I switched away, also for text compression; searching for anything that isn't near the beginning of a large file is tedious. My use-case for compression is generally not like OP's, that is, compressing 100KB so that it can fit into Minecraft (if I understood their purpose correctly); I compress files because they take too much disk space (gigabytes). But if I never wanted to access them, I'd not store them, so decompression speed matters. So I kinda do agree with GP that Bzip2 has limited purposes when Zstd is just a few % more bytes to store for over an order of magnitude more speed (1GB/s instead of 45MB/s)

Edit: And all that ignores non-json/xml/code/text compression tasks, where Bzip2/LZMA doesn't give you the best compression ratio. I'd argue it is premature optimization to use Bzip2 without a very specific use-case like OP has for very good code compression ratios and a simple decoder
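
For the decompression-speed side specifically, the same stdlib codecs make a quick (unscientific) micro-benchmark possible. Absolute numbers will vary wildly by machine and input, so treat this as a sketch, with a made-up low-entropy payload:

```python
import bz2
import lzma
import time

# Made-up low-entropy payload, roughly 1 MB.
data = bytes(range(256)) * 4096

blobs = {
    "bzip2": (bz2, bz2.compress(data)),
    "xz":    (lzma, lzma.compress(data)),
}
for name, (codec, blob) in blobs.items():
    t0 = time.perf_counter()
    out = codec.decompress(blob)
    dt = time.perf_counter() - t0
    assert out == data  # sanity check: lossless round trip
    print(f"{name}: {len(blob)} bytes compressed, {dt * 1000:.2f} ms to decompress")
```

On real text corpora the gap between bzip2 and faster decompressors is usually far larger than a synthetic payload like this suggests, which is exactly the complaint above.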


I wonder what the combined speed would be for small to mid-sized text files if they were fully loaded into memory first? That swaps multiple random accesses for a single sequential read (even if "sequential" doesn't mean much on flash memory, it should still beat totally random access), plus random access within RAM, which should not be a bottleneck at these speeds.

Or perhaps this is already being done this way and it is a bottleneck?

This might not work for multi-gigabyte files, but most text content is well under 1MiB.
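
As a concrete version of the question: with Python's bz2 module you can compare streaming decompression from disk against one sequential read followed by decompressing from RAM. The file name and payload below are made up:

```python
import bz2
import os
import tempfile

# Made-up payload written to a temporary .bz2 file.
payload = b"example log line with some variety 0123456789\n" * 20_000
path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
with open(path, "wb") as f:
    f.write(bz2.compress(payload))

# Variant 1: stream from disk, decompressing as bytes arrive.
with bz2.BZ2File(path) as f:
    streamed = f.read()

# Variant 2: one sequential read into memory, then decompress from RAM.
with open(path, "rb") as f:
    blob = f.read()
buffered = bz2.decompress(blob)

assert streamed == buffered == payload
```

Both variants produce identical output; any speed difference comes down to I/O patterns and, as the reply below notes, the page cache often makes them converge in practice.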


This is essentially already being done. The kernel will keep in RAM as much of the file that the system is operating on as will fit

Code can care about cache locality and improve upon this default behavior, like if you find a heuristic where you can do another part of the decompression already while some of the data is still in a CPU cache and you don't have to hit main RAM, but whether that's possible will depend on the algorithm

Being in RAM sadly doesn't mean that random access stops mattering. Just like how hard drives (compared to, say, tape drives) are random access, you still have latency to load the data into the CPU (or arm-moving times in the case of an HDD)


Why? If the voltage spikes on the grid (that's what I understand a power surge to mean), wouldn't even more of it end up in your house (that is: the grid voltage spike even higher) if the neighbors have equipment that doesn't let their devices consume some of that energy?

Edit: wait, maybe I figured it out: those devices must be consuming the excess rather than blocking it. Is that it?


Yes, the energy dissipates. Although one still needs to look out for the distance to the neighbours, as your ground potential can differ from your neighbours'.

How did you know it was a power surge? Not doubting your comment, just interested in knowing if this ever happened to me

The comment says "again". I'm not too surprised agreements like this are not the normal procedure and can get forgotten about (it shouldn't... but if this isn't automated, humans can make mistakes), but more than once?

Even if you only read the headline, you can work out the most likely story from logic. Someone else in this thread already said it: they're not gonna go to jail for you over a small subscription fee, of course they're complying with local laws and then the Swiss people handed it over to where they got the legal assistance request from. It's also not the first time Proton cooperated with a legal request from Swiss authorities, or Signal or other similar companies for that matter. The story tells itself no matter what part of it they decide to stick in the headline
