Hacker News
[flagged] Cloudflare, Google, Bing Destroying the Infrastructure of the Free Web (gigablast.com)
102 points by dcassett on Aug 26, 2020 | hide | past | favorite | 43 comments


The article has no substance at all. It’s like 150 words and just describes Cloudflare a bit and explains how captcha works.

Not what I’m used to seeing on HN frontpage.


Cloudflare is not unique in this regard. Other CDNs and just normal libraries offer anti-scraping capabilities.


Yet you respond in the fashion of a typical fan without any substance at all.

It's a legitimate complaint about a company that doesn't care about the negative impacts of their decisions. Do you disagree with the facts of the complaint?


The facts of the complaint are that Google, Bing, and Baidu are secretly running, or maybe influencing, Cloudflare, and that they purposely implemented a non-accessible captcha to inhibit smaller crawlers' ability to build their indexes ... no, I don’t agree. Not based on a blurb that contains zero evidence or sources.


Very click-baity title IMO.

I think they're right about Cloudflare making legitimate scraping more difficult. There's little reason to believe that it's to protect the big players in search though.


> Now, if you don't retain that cookie for each subsequent access to that same website from that same IP address, the delay repeatedly increases until around 30 seconds, at which point you will be presented with a Turing test.

So retain the cookie they give you? Is that a problem?

(Honest question, I have not dabbled in writing crawlers...)


One possible problem with this is if one crawler gets a link to the website and then a parallel crawler gets a link to the website as well.

Enough parallel crawlers and you hit the captcha.

So the crawlers would need to maintain a record of sites and their associated cookies. I don’t see why it couldn’t be done, but it definitely adds complexity.
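One way to picture that record: a thread-safe, per-site cookie store shared by parallel crawler workers. This is a minimal illustrative sketch, not any real crawler's code; all names (including the `cf_clearance` cookie value) are hypothetical.

```python
import threading

class CookieStore:
    """Thread-safe per-site cookie cache shared by parallel crawler
    workers, so a second worker hitting the same site reuses the
    challenge cookie instead of re-triggering the delay.
    (Illustrative sketch; names and structure are hypothetical.)"""
    def __init__(self):
        self._lock = threading.Lock()
        self._cookies = {}  # domain -> dict of cookie name -> value

    def get(self, domain):
        with self._lock:
            return dict(self._cookies.get(domain, {}))

    def update(self, domain, cookies):
        with self._lock:
            self._cookies.setdefault(domain, {}).update(cookies)

store = CookieStore()
# First worker to clear the challenge records the cookie:
store.update("example.com", {"cf_clearance": "token-from-first-visit"})
# Any parallel worker reuses it instead of waiting out the delay:
assert store.get("example.com")["cf_clearance"] == "token-from-first-visit"
```

The lock matters only because the workers are parallel; a single-threaded crawler could use a plain dict.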


> So retain the cookie they give you?

Ahh, but the cookie is created client-side by some JavaScript that does computationally intensive work for a second. That doesn't bother you as an end user, but if you're writing a crawler and you're not driving a headless browser (expensive), then you probably can't trivially run arbitrary JavaScript code (or else you have the work of integrating Deno or something to do that part for you).

Either way, it means you can't just curl the webpage and get it. That's obviously the point when defeating DDoS attacks is the use case, but it doesn't work for crawlers, many of which are legitimate "users" like the one in the article.

These services should offer some other easier proof-of-work mechanism.
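For example, a hashcash-style scheme — find a nonce such that hashing the server's challenge together with the nonce yields a fixed number of leading zero bytes — is cheap to verify server-side and requires no arbitrary JavaScript execution on the client. A hypothetical Python sketch (this is emphatically not Cloudflare's actual mechanism):

```python
import hashlib
from itertools import count

def solve_pow(challenge: str, difficulty: int = 2) -> int:
    """Find a nonce so that sha256(challenge + nonce) begins with
    `difficulty` zero bytes -- a hashcash-style proof of work.
    Expected cost: about 256**difficulty hash attempts."""
    target = b"\x00" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if digest.startswith(target):
            return nonce

def verify_pow(challenge: str, nonce: int, difficulty: int = 2) -> bool:
    """Server side: a single hash suffices to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return digest.startswith(b"\x00" * difficulty)

nonce = solve_pow("site-issued-challenge")
assert verify_pow("site-issued-challenge", nonce)
```

The asymmetry is the point: the client burns a moment of CPU, the server spends one hash to verify, and a crawler can implement it in any language without a browser engine.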


> computationally intensive stuff for a second. Doesn't bother you as an end-user

Actually, it does. They waste the CPU cycles of my devices for no good reason. It's environmentally unfriendly as well, and it shifts the cost onto the end user, which is not a very nice way to go about it. They probably don't describe the drawbacks to their own customers either, so everyone simply opts in thinking there are no drawbacks; but the end result is a diminished user experience and tonnes of extra CO2 emissions throughout the world.


I agree 100% and wish there were a better mechanism to prove you're not an attacker, but it's hard to think of one that isn't as annoying as a traditional CAPTCHA.


The bigger question that no one's asking is the cost to generate the page:

Does it take them 1 second of CPU time to generate the page?

* If not, isn't that a disproportionate amount of time for the client to spend on silly throwaway work?!

* If yes, why don't they improve their infrastructure so that static pages can be properly cached as they should be, and slightly stale versions served to everyone at a lower total cost than requiring even a few select users with "abnormal" parameters to solve captchas?

At the end of the day, all these DDoS protections are placed in front of pages that by all accounts should be cacheable static pages, which should take less time to produce and consume than the repeated 5-second JavaScript captchas that replace them.

The underlying issue is that one solution can be sold as a standalone one-size-fits-all product and the other cannot; that's why we face daily disappointment if our browsing setup is "abnormal" in any way.
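The "slightly stale versions" idea maps onto standard HTTP caching directives. A sketch with hypothetical values, shown here as the response headers an origin might set:

```python
# A hedged sketch of the caching alternative: instead of challenging
# each client, serve cached copies and refresh them in the background.
# The directive values below are hypothetical, not a recommendation.
CACHE_HEADERS = {
    "Cache-Control": "public, max-age=60, stale-while-revalidate=300",
    # Serve the cached copy for 60s; for the next 300s keep serving the
    # stale copy while revalidating in the background. Origin CPU is
    # then spent roughly once per minute instead of once per request.
}
assert "stale-while-revalidate" in CACHE_HEADERS["Cache-Control"]
```

With headers like these, even a burst of crawler traffic hits the cache rather than the origin, which is the cost comparison the comment above is pointing at.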


I have written crawlers with this functionality (not for Cloudflare, but for sites that implemented the same technique themselves), and it didn't seem like a big problem.


Being a well-behaved crawler and using cookies the way a browser would is simple. Especially if they're crawling from a single IP address (as described).
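With Python's standard library, for instance, cookie handling that mimics a browser is a few lines. This is a sketch only — the requests are left commented out, and `example.com` is a placeholder:

```python
import urllib.request
from http.cookiejar import CookieJar

# A cookie-aware opener behaves the way a browser does: cookies set by
# one response are automatically sent back on later requests to the
# same site, so a single-IP crawler never re-triggers the escalating
# challenge delay described above.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# opener.open("https://example.com/page1")  # challenge cookie lands in `jar`
# opener.open("https://example.com/page2")  # cookie re-sent automatically
```

Every request routed through `opener` shares the jar, which is exactly the "use cookies the way a browser would" behavior.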


> You have to be a non-blind person to pass the Turing test, as Cloudflare does not offer a handicap option.

(Edit: not sure if the above is true.)

I was quite surprised to see this. Much effort has been put into making the web more accessible; it’s a shame if an otherwise accessible site is blocked behind a non-accessible captcha wall.


Captchas have long been hated in the accessibility community though, so even if inaccurate, it seems to be a common sentiment.


https://blog.cloudflare.com/moving-from-recaptcha-to-hcaptch...

> [...] it has a robust solution for visually impaired and other users with accessibility challenges [...]


The way that works is that the now 30-year-old Americans with Disabilities Act understandably never contemplated captchas, so legal enforcement can be pursued in the courts only by employing the weaker "violates the spirit of the law" argument.[1] And so it goes.[2]

[1] https://captcha.com/accessibility/ada-captcha.html

[2] Linda Ellerbee


I don’t even think this part of the article/blurb is true.

Edit: I stand corrected, looks like Cloudflare recently moved to hCaptcha, which does not offer an “a11y” option.


"How it works: first, an accessibility user signs up at this URL, which is linked in the hCaptcha widget info page. They are given an encrypted cookie that can be used several times per day, but must be refreshed every 24 hours via login."

https://www.hcaptcha.com/accessibility


How on earth is this considered a reasonable accommodation for people with access needs? Stinks of something created with no consultation whatsoever with the accessibility community.


This is bad on so many levels, for starters that you need yet another account.


hCaptcha (the service CF switched to after Google decided to start charging for reCAPTCHA for large-volume customers[0]) has an accessibility option that bypasses their captchas, and it's available at: https://www.hcaptcha.com/accessibility

0: https://blog.cloudflare.com/moving-from-recaptcha-to-hcaptch... ( https://news.ycombinator.com/item?id=22812509 )


What large players did Google charge for reCAPTCHA other than Cloudflare?


I don't know of any specifics, however the pricing change is public:

https://cloud.google.com/recaptcha-enterprise

https://www.google.com/recaptcha/about/

> Free up to 1 million Assessments / Month


Sure, the author may not win a Pulitzer, but the point re accessibility, and the big search engines being stakeholders creating a moat against any new crawlers, is valid.


As others constantly show us, illegal is only illegal when someone actually enforces the law. And even if it weren't illegal: if it only impedes a small subset of people, or impedes people without money or resources, then Cloudflare won't give the tiniest damn.

Their record is clear on this. They only care about those who give them money. Any attempt to give you things for free is bait to make you dependent on their platform.


This is a weird article - it looks like it's pushing a conspiracy theory that Cloudflare is secretly an agent for Google, Bing, and Baidu simultaneously?


As others have said, this article is mostly nasty rhetoric with very little substance. I'm surprised that the CAPTCHA isn't accessible, but that's about it.

Websites opt-in to CloudFlare DDoS protection. If you want to be crawled, you don't have to use it. But it's very difficult to expose yourself to the open internet nowadays unless you're hosted in a cloud or have something like CloudFlare.

I have stuff I don't want crawled at all, I use CloudFlare, and it's an awesome (free!) service that helps me maintain HTTPS certs and keep Chinese and Russian IPs from hammering my server.


> But it's very difficult to expose yourself to the open internet nowadays unless you're hosted in a cloud or have something like CloudFlare.

Yet, millions of websites are working fine without cloud and without CloudFlare.


Nasty? Methinks your reaction shows you to be a fanboi who doesn't like someone speaking truths about the object of your fandom.


Not a very effective retort when GP lays out their thoughts and justifies their position...


What is it with this pseudo-doomsday clickbait "the free and open web is dead" type posts I keep seeing?

Is the web really dying or being destroyed? I don't get how it is and this article doesn't explain this either.


Is there no link to the post apart from the top-level link to the blog?


The post has a name attribute (cloudflaredestroy), so you can generate a link (a href='cloudflaredestroy'), but you can't have a direct URL to it, as it doesn't have an id.



That's how this one is: the post in question is just the first post on that page, and it appears to be the full content.


The solution is simple. Can some Cloudflare engineers here grant this search engine the same access Cloudflare gives the Chinese ones?


They failed to present data showing that exceptions are made for other search engines.


Valid point, but I do not think they are lying; I am sure they can present it to Cloudflare if required.


It's quite likely that Baidu is doing the reasonable thing and using a more sophisticated crawling mechanism that caches the cookies. You have to do this anyway to crawl much of the modern web, which is highly JS-dependent.
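Caching an expensively obtained cookie (e.g. one produced by driving a headless browser through the JS challenge) is itself simple. A hypothetical sketch, where `fetch` stands in for whatever expensive routine solves the challenge:

```python
import time

class CachedToken:
    """Cache a site's challenge cookie and recompute it only after
    expiry, so the expensive JS challenge runs once per TTL window
    rather than once per request. (Illustrative; `fetch` would
    drive a headless browser in a real crawler.)"""
    def __init__(self, fetch, ttl_seconds=1800):
        self._fetch = fetch          # expensive: runs the JS challenge
        self._ttl = ttl_seconds
        self._value = None
        self._expires = 0.0

    def get(self):
        if time.time() >= self._expires:
            self._value = self._fetch()
            self._expires = time.time() + self._ttl
        return self._value

calls = []
token = CachedToken(lambda: calls.append(1) or f"cookie-{len(calls)}",
                    ttl_seconds=3600)
assert token.get() == "cookie-1"
assert token.get() == "cookie-1"   # cached: the challenge ran only once
assert len(calls) == 1
```

Amortized over thousands of pages per site, the per-request cost of the challenge becomes negligible, which is presumably why large crawlers aren't hampered by it.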


So the solution is to complain on social media to right the wrongdoing? Shouldn't there be a better way for the stakeholders to have a voice?!


So some bizarre and absolutely insignificant search engine is ranting about being irrelevant? Good for them.


Google seems to be destroying itself, fortunately. The search results got so bad after the recent changes that for the first time duckduckgo is superior, even for technical searches.

I'm slowly making the switch now.



