How feasible would it be to store all that data on a decentralized system like I...

duskwuff · on March 26, 2021

You may not grasp just how large the Common Crawl dataset is. It's been growing steadily at 200-300 TB per month for the last few years. I'm not certain how large the entire corpus is at this point, but it's almost certainly in the tens to low hundreds of petabytes. (This is significantly larger than the capacity of the entire Sia network, for example.)

Storing a dataset of this size and making it available online is not inexpensive. Amazon has generously donated their services to handle both of these tasks; it would be foolish to turn them down.

duskwuff · on March 26, 2021

(Update: the complete Common Crawl dataset is actually a little smaller than I thought, at 6.4 PB. That's still pretty big, though.)

indolering · on March 27, 2021

> Amazon has generously donated their services to handle both of these tasks; it would be foolish to turn them down.

Amazon makes plenty of money from people using AWS to process the data.

fjeifisjf · on March 27, 2021

That's a win-win.

new_realist · on March 26, 2021

It's not clear, but it looks like the last crawl was 280 TiB (100 TiB compressed) and contains a snapshot of the web at that point; i.e. you don't need prior snapshots unless you're interested in historical content.

EDIT: the state of the crawls is summarized at https://commoncrawl.github.io/cc-crawl-statistics/.

duskwuff · on March 26, 2021

As best I can gather, the crawl is an ongoing process, not a series of independent "snapshots". There's almost no overlap in URLs between each crawl archive, although it looks as though there's some repetition on a larger scale (roughly every 2 months):

https://commoncrawl.github.io/cc-crawl-statistics/plots/craw...

gillesjacobs · on March 26, 2021

Blockchain storage is going to cost you a pretty penny if you store all of Common Crawl's petabytes, so not very feasible.

kevincox · on March 27, 2021

IPFS doesn't require any blockchain.

In fact, someone could easily set up some IPFS nodes that fetch the data from the current host when it is requested over IPFS.[1] This way people could access it via IPFS, and the nodes would provide an alternate mirror of the data.

[1] https://github.com/ipfs/go-ipfs/blob/master/docs/experimenta...

The main benefits here would be

- Even if the source is unavailable there may be other copies on IPFS which would be transparently used.

- There may be some performance benefits in rare cases.

- If you are accessing this on a bunch of machines your IPFS gateways would handle downloading the source once, then automatically using the local copy from inside your network.
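The "download once, then reuse the local copy" idea in the last point can be illustrated with a toy read-through cache. This is only a sketch of the general pattern, not the IPFS API: the `ReadThroughMirror` class, the `fake_origin` function, and the key name are all made up for illustration, and real IPFS addresses content by hash rather than by path.

```python
from typing import Callable, Dict


class ReadThroughMirror:
    """Toy mirror node: serve from the local store, fetch from the origin on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.origin_fetches = 0  # how many times we had to go back to the origin

    def get(self, key: str) -> bytes:
        if key not in self._store:
            # Cache miss: pull the object from the origin host once and keep it.
            self._store[key] = self._fetch(key)
            self.origin_fetches += 1
        return self._store[key]


def fake_origin(key: str) -> bytes:
    """Hypothetical stand-in for the current host (e.g. an HTTP download)."""
    return f"contents of {key}".encode()


if __name__ == "__main__":
    mirror = ReadThroughMirror(fake_origin)
    # Five machines on the same network ask for the same (hypothetical) object.
    for _ in range(5):
        mirror.get("crawl/segment-0001")
    print("origin fetches:", mirror.origin_fetches)
```

However many local clients ask for the same object, the origin is contacted only once per object.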

The main downside is that if Amazon is donating their resources, why bother with IPFS?

psKama · on March 26, 2021

That's not correct. When it comes to storage and transfer, blockchain alternatives are a fraction of the cost of Amazon. For example, Sia Skynet is offering storage at $5/month/TB[1]. If you skip Skynet and run your own Sia node, the price can go even lower, to $2/month/TB, depending on market conditions.

[1] https://blog.sia.tech/announcing-skynet-premium-plans-faster...

coder543 · on March 26, 2021

Amazon is hosting the Common Crawl on S3 for free, so... yes, $2/month/TB is a lot more expensive.

gillesjacobs · on March 26, 2021

It seems that, at least on Sia's plans, you can host at most 20 TB for $80/month, not even a tenth of a monthly Common Crawl.

Of course, Sia's Skynet plans are package deals right now, and I guess they're currently bootstrapping the network with users. Filecoin has no operational storage yet. Storj quotes $10/TB/month [1], so that will come out expensive.

1. https://www.storj.io/blog/2019/11/announcing-pioneer-2-and-t...
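For a rough sense of scale, here is a back-of-the-envelope sketch of what the per-TB prices quoted in this thread ($2, $5 and $10 per TB per month) would add up to for the full 6.4 PB dataset. It assumes decimal units (1 PB = 1000 TB) and flat pricing with no bandwidth fees, redundancy overhead, or volume discounts, none of which are confirmed by the providers' actual terms.

```python
# Total dataset size from the thread: 6.4 PB.
# Assumption: decimal units, so 6.4 PB = 6400 TB.
DATASET_TB = 6400

# Prices in USD per TB per month, as quoted by commenters in this thread.
PRICES_USD_PER_TB_MONTH = {
    "Sia (own node)": 2.0,
    "Sia Skynet": 5.0,
    "Storj": 10.0,
}


def monthly_cost(dataset_tb: float, price_per_tb: float) -> float:
    """Flat-rate monthly cost; ignores bandwidth, redundancy and discounts."""
    return dataset_tb * price_per_tb


def cost_table(dataset_tb: float = DATASET_TB) -> dict:
    """Monthly cost in USD for each quoted price."""
    return {
        name: monthly_cost(dataset_tb, price)
        for name, price in PRICES_USD_PER_TB_MONTH.items()
    }


if __name__ == "__main__":
    for name, cost in cost_table().items():
        print(f"{name}: ${cost:,.0f}/month")
```

Under those assumptions, even the cheapest quoted rate comes to $12,800 per month for the whole corpus.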

gloriousternary · on March 26, 2021

More so than Amazon? From my (limited) experience, blockchain storage solutions are often less expensive, although I've never worked with petabytes of data, so maybe it's different at that scale.