How many times have you wanted to dedup a (text) file but definitely didn't have enough memory for the task? I ran into this one day when I had to dedup a set of .ndjson.gz files totaling 312 GB. Using the bloom filter option, I was able to dedup the records without any large investment on my part.
Anyways, runiq[1], "[an] efficient way to filter duplicate lines from input, à la uniq".
It provides several filtering strategies; I almost always default to the bloom filter implementation (`-f bloom`).
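runiq itself is written in Rust, but the bloom filter idea is easy to sketch. Roughly (this is an illustrative Python sketch, not runiq's actual code): a bloom filter answers "have I seen this line?" in a fixed-size bit array, so memory stays constant no matter how big the input is, at the cost of a small false-positive rate — which, for dedup, means a non-duplicate line can occasionally be dropped.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash probes into an m-bit array."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, item: bytes):
        # Derive k indices from two digests (Kirsch-Mitzenmacher trick).
        h1 = int.from_bytes(hashlib.md5(item).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes):
        for p in self._probes(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._probes(item))

def dedup_lines(lines, bf=None):
    """Yield each line the first time it is (probably) seen."""
    bf = bf or BloomFilter()
    for line in lines:
        key = line.encode()
        if key not in bf:
            bf.add(key)
            yield line
```

The point is that `bits` never grows: for my 312 GB case, a full hash set of every line would have blown past RAM, while a bloom filter of a few hundred MB keeps the false-positive rate tiny.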
Interestingly I was also about to post that autojump was missing, checked the comments, saw yours, and rechecked - sure enough, autojump is in there! So, your comment was useful after all.
Ah, htmlq [1] is one that's missing from the list!
Straight from the repo: "Like jq, but for HTML."
I find it useful for quickly hacking scripts together and exploring data. It's especially handy for the iterative process of finding good CSS selectors against whatever data I can get without JavaScript running.
Not sure if pup supports this, but something I use fairly often (and have copied into my own internal tooling) is the ability to filter out results via a flag on the CLI.
For example, something I usually do is:
curl --include --location https://example.com | tee /tmp/example-com.html | htmlq --base https://example.com a --attribute href --remove-nodes 'a[href*="#"],a[href^="javascript"],a[href*="?"]'
This grabs the page, shunts a copy to /tmp for subsequent, iterative testing, then tries to grab all the links while filtering out any that contain a '#' or '?' or start with 'javascript'. This is super helpful when I'm just exploring some scraped HTML and trying to build a graph of links without having to pull out a proper programming language just yet.
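The "graph of links" step can be sketched in a few lines of Python once you have the href lists the pipeline above spits out (the function and data shape here are illustrative, not part of htmlq):

```python
from collections import defaultdict
from urllib.parse import urlparse

def build_link_graph(edges):
    """Build a host-to-host adjacency map from (source_url, href) pairs,
    e.g. collected by running the htmlq pipeline against each fetched page."""
    graph = defaultdict(set)
    for src, href in edges:
        # Group by hostname so the graph stays small and readable.
        graph[urlparse(src).netloc].add(urlparse(href).netloc)
    return graph
```

From there it's easy to pick the next host to fetch and iterate, all before committing to a real crawler.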
[1] https://github.com/mgdm/htmlq
(edit: nevermind, somehow I missed it)