How many times have you wanted to dedup a (text) file but definitely didn't have enough memory for the task? I ran into this one day when I had to dedup a set of .ndjson.gz files totaling 312 GB. Using the bloom filter option, I was able to dedup the records without any large investment on my part.
Anyways, runiq[1], "[an] efficient way to filter duplicate lines from input, à la uniq".
It provides several filtering strategies; I almost always default to the bloom filter implementation (`-f bloom`).
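runiq itself is written in Rust, but the bloom filter idea is easy to sketch. Roughly (this is an illustrative Python sketch, not runiq's actual code): a bloom filter answers "have I seen this line?" in a fixed-size bit array, so memory stays constant no matter how big the input is, at the cost of a small false-positive rate — which, for dedup, means a non-duplicate line can occasionally be dropped.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash probes into an m-bit array."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, item: bytes):
        # Derive k indices from two digests (Kirsch-Mitzenmacher trick).
        h1 = int.from_bytes(hashlib.md5(item).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes):
        for p in self._probes(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._probes(item))

def dedup_lines(lines, bf=None):
    """Yield each line the first time it is (probably) seen."""
    bf = bf or BloomFilter()
    for line in lines:
        key = line.encode()
        if key not in bf:
            bf.add(key)
            yield line
```

The point is that `bits` never grows: for my 312 GB case, a full hash set of every line would have blown past RAM, while a bloom filter of a few hundred MB keeps the false-positive rate tiny.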
Interestingly I was also about to post that autojump was missing, checked the comments, saw yours, and rechecked - sure enough, autojump is in there! So, your comment was useful after all.
Ah, htmlq [1] is one that's missing from the list!
Straight from the repo: "Like jq, but for HTML."
I find it useful for quickly hacking scripts together and exploring data. It's especially handy for the iterative process of finding good CSS selectors against whatever data I can get without JavaScript running.
Not sure if pup supports this, but something I use fairly often (and have copied into my own internal tooling) is the ability to filter out results via a flag on the CLI.
For example, something I usually do is:
curl --include --location https://example.com | tee /tmp/example-com.html | htmlq --base https://example.com a --attribute href --remove-nodes 'a[href*="#"],a[href^="javascript"],a[href*="?"]'
This grabs the page, shunts a copy to /tmp for subsequent, iterative testing, then tries to grab all the links while filtering out any that contain a '#' or '?' or start with 'javascript'. This is super helpful when I'm just exploring some scraped HTML and trying to build a graph of links without having to pull out a proper programming language just yet.
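The "graph of links" step can be sketched in a few lines of Python once you have the href lists the pipeline above spits out (the function and data shape here are illustrative, not part of htmlq):

```python
from collections import defaultdict
from urllib.parse import urlparse

def build_link_graph(edges):
    """Build a host-to-host adjacency map from (source_url, href) pairs,
    e.g. collected by running the htmlq pipeline against each fetched page."""
    graph = defaultdict(set)
    for src, href in edges:
        # Group by hostname so the graph stays small and readable.
        graph[urlparse(src).netloc].add(urlparse(href).netloc)
    return graph
```

From there it's easy to pick the next host to fetch and iterate, all before committing to a real crawler.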
[1] https://github.com/mgdm/htmlq
(edit: nevermind, somehow I missed it)