Hacker News
[flagged] Cloudflare, Google, Bing Destroying the Infrastructure of the Free Web (gigablast.com)
102 points by dcassett on Aug 26, 2020 | hide | past | favorite | 43 comments


The article has no substance at all. It’s like 150 words and just describes Cloudflare a bit and explains how captcha works.

Not what I’m used to seeing on HN frontpage.


Cloudflare is not unique in this regard. Other CDNs and just normal libraries offer anti-scraping capabilities.


Yet you respond in the fashion of a typical fan without any substance at all.

It's a legitimate complaint about a company that doesn't care about the negative impacts of their decisions. Do you disagree with the facts of the complaint?


The facts of the complaint are that Google, Bing, and Baidu are secretly running, or maybe influencing, Cloudflare, and that they purposely implemented a non-accessible captcha to inhibit smaller crawlers' ability to build their indexes ... no, I don’t agree. Not based on a blurb that contains zero evidence or sources.


Very click-baity title IMO.

I think they're right about Cloudflare making legitimate scraping more difficult. There's little reason to believe that it's to protect the big players in search though.


> Now, if you don't retain that cookie for each subsequent access to that same website from that same IP address, the delay repeatedly increases until around 30 seconds, at which point you will be presented with a Turing test.

So retain the cookie they give you? Is that a problem?

(Honest question, I have not dabbled in writing crawlers...)


One possible problem with this is if one crawler gets a link to the website and then a parallel crawler gets a link to the website as well.

Enough parallel crawlers and you hit the captcha.

So the crawlers would need to maintain a record of sites and their associated cookies. I don’t see why it couldn’t be done, but it definitely adds complexity.
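One way to picture that record: a thread-safe, per-site cookie store shared by parallel crawler workers. This is a minimal illustrative sketch, not any real crawler's code; all names (including the `cf_clearance` cookie value) are hypothetical.

```python
import threading

class CookieStore:
    """Thread-safe per-site cookie cache shared by parallel crawler
    workers, so a second worker hitting the same site reuses the
    challenge cookie instead of re-triggering the delay.
    (Illustrative sketch; names and structure are hypothetical.)"""
    def __init__(self):
        self._lock = threading.Lock()
        self._cookies = {}  # domain -> dict of cookie name -> value

    def get(self, domain):
        with self._lock:
            return dict(self._cookies.get(domain, {}))

    def update(self, domain, cookies):
        with self._lock:
            self._cookies.setdefault(domain, {}).update(cookies)

store = CookieStore()
# First worker to clear the challenge records the cookie:
store.update("example.com", {"cf_clearance": "token-from-first-visit"})
# Any parallel worker reuses it instead of waiting out the delay:
assert store.get("example.com")["cf_clearance"] == "token-from-first-visit"
```

The lock matters only because the workers are parallel; a single-threaded crawler could use a plain dict.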


> So retain the cookie they give you?

Ahh, but the cookie is created client-side by some JavaScript that does computationally intensive work for a second. That doesn't bother you as an end user, but if you're writing a crawler and you're not driving a headless browser (expensive), then you probably can't trivially run arbitrary JavaScript code (or else you have the work of integrating Deno or something to do that part for you).

Either way, it means you can't just curl the webpage and get it. That's obviously the point when defeating DDoS attacks is the use case, but it doesn't work for crawlers, many of which are legitimate "users" like the one in the article.

These services should offer some other easier proof-of-work mechanism.
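For example, a hashcash-style scheme — find a nonce such that hashing the server's challenge together with the nonce yields a fixed number of leading zero bytes — is cheap to verify server-side and requires no arbitrary JavaScript execution on the client. A hypothetical Python sketch (this is emphatically not Cloudflare's actual mechanism):

```python
import hashlib
from itertools import count

def solve_pow(challenge: str, difficulty: int = 2) -> int:
    """Find a nonce so that sha256(challenge + nonce) begins with
    `difficulty` zero bytes -- a hashcash-style proof of work.
    Expected cost: about 256**difficulty hash attempts."""
    target = b"\x00" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if digest.startswith(target):
            return nonce

def verify_pow(challenge: str, nonce: int, difficulty: int = 2) -> bool:
    """Server side: a single hash suffices to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return digest.startswith(b"\x00" * difficulty)

nonce = solve_pow("site-issued-challenge")
assert verify_pow("site-issued-challenge", nonce)
```

The asymmetry is the point: the client burns a moment of CPU, the server spends one hash to verify, and a crawler can implement it in any language without a browser engine.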


> computationally intensive stuff for a second. Doesn't bother you as an end-user

Actually, it does. They waste the CPU cycles of my devices for no good reason. It's environmentally unfriendly as well, and it shifts the cost onto the end user, which is not a very nice way to go about it. They probably don't describe the drawbacks to their own customers either, so everyone simply opts in thinking there are no drawbacks; but the end result is a diminished user experience and tonnes of extra CO2 emissions throughout the world.


I agree 100% and wish there were a better mechanism to prove you're not an attacker, but it's hard to think of one that isn't as annoying as a traditional CAPTCHA.


The bigger question that no one's asking is the cost to generate the page:

Does it take them 1 second of CPU time to generate the page?

* If not, isn't that a disproportionate amount of time for the client to spend on silly throwaway work?!

* If yes, why don't they improve their infrastructure so that static pages can be properly cached as they should be, and slightly stale versions served to everyone at a lower total cost than requiring even a few select users with "abnormal" parameters to solve captchas?

At the end of the day, all these DDoS protections are placed in front of pages that by all accounts should be cacheable static pages, which should take less time to produce and consume than the repeated 5-second JavaScript captchas that replace them.

The underlying issue is that one solution can be sold as a standalone one-size-fits-all product and the other cannot; that's why we face daily disappointment if our browsing setup is "abnormal" in any way.
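The "slightly stale versions" idea maps onto standard HTTP caching directives. A sketch with hypothetical values, shown here as the response headers an origin might set:

```python
# A hedged sketch of the caching alternative: instead of challenging
# each client, serve cached copies and refresh them in the background.
# The directive values below are hypothetical, not a recommendation.
CACHE_HEADERS = {
    "Cache-Control": "public, max-age=60, stale-while-revalidate=300",
    # Serve the cached copy for 60s; for the next 300s keep serving the
    # stale copy while revalidating in the background. Origin CPU is
    # then spent roughly once per minute instead of once per request.
}
assert "stale-while-revalidate" in CACHE_HEADERS["Cache-Control"]
```

With headers like these, even a burst of crawler traffic hits the cache rather than the origin, which is the cost comparison the comment above is pointing at.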


I have written crawlers with this functionality (not for Cloudflare, but for sites that implemented the same technique themselves), and it didn't seem like a big problem.


Being a well-behaved crawler and using cookies the way a browser would is simple. Especially if they're crawling from a single IP address (as described).
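With Python's standard library, for instance, cookie handling that mimics a browser is a few lines. This is a sketch only — the requests are left commented out, and `example.com` is a placeholder:

```python
import urllib.request
from http.cookiejar import CookieJar

# A cookie-aware opener behaves the way a browser does: cookies set by
# one response are automatically sent back on later requests to the
# same site, so a single-IP crawler never re-triggers the escalating
# challenge delay described above.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# opener.open("https://example.com/page1")  # challenge cookie lands in `jar`
# opener.open("https://example.com/page2")  # cookie re-sent automatically
```

Every request routed through `opener` shares the jar, which is exactly the "use cookies the way a browser would" behavior.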


> You have to be a non-blind person to pass the Turing test, as Cloudflare does not offer a handicap option.

(Edit: not sure if the above is true.)

I was quite surprised to see this. Much effort has been put into making the web more accessible; it’s a shame if an otherwise accessible site is blocked behind a non-accessible captcha wall.


Captchas have long been hated in the accessibility community though, so even if inaccurate, it seems to be a common sentiment.


https://blog.cloudflare.com/moving-from-recaptcha-to-hcaptch...

> [...] it has a robust solution for visually impaired and other users with accessibility challenges [...]


The way that works is that the now 30-year-old Americans with Disabilities Act understandably never contemplated captchas, so legal enforcement can be pursued in the courts only by employing the weaker "violates the spirit of the law" argument.[1] And so it goes.[2]

[1] https://captcha.com/accessibility/ada-captcha.html

[2] Linda Ellerbee


I don’t even think this part of the article/blurb is true.

Edit: I stand corrected, looks like Cloudflare recently moved to hCaptcha, which does not offer an “a11y” option.


"How it works: first, an accessibility user signs up at this URL, which is linked in the hCaptcha widget info page. They are given an encrypted cookie that can be used several times per day, but must be refreshed every 24 hours via login."

https://www.hcaptcha.com/accessibility


How on earth is this considered a reasonable accommodation for people with access needs? Stinks of something created with no consultation whatsoever with the accessibility community.


This is bad on so many levels, for starters that you need yet another account.


hCaptcha (the service CF switched to after Google decided to start charging for reCAPTCHA for large-volume customers[0]) has an accessibility option that bypasses their captchas, and it's available at: https://www.hcaptcha.com/accessibility

0: https://blog.cloudflare.com/moving-from-recaptcha-to-hcaptch... ( https://news.ycombinator.com/item?id=22812509 )


What large players did Google charge for reCAPTCHA other than Cloudflare?


I don't know of any specifics, however the pricing change is public:

https://cloud.google.com/recaptcha-enterprise

https://www.google.com/recaptcha/about/

> Free up to 1 million Assessments / Month


Sure, the author may not win a Pulitzer, but the point re accessibility, and the big search engines being stakeholders creating a moat against any new crawlers, is valid.


As others constantly show us, illegal is only illegal when someone actually enforces the law. And even if it weren't illegal: if it only impedes a small subset of people, or impedes people without money or resources, then Cloudflare won't give the tiniest damn.

Their record is clear on this. They only care about those who give them money. Any attempt to give you things for free is bait to make you dependent on their platform.


This is a weird article - it looks like it's pushing a conspiracy theory that Cloudflare is secretly an agent for Google, Bing, and Baidu simultaneously?


As others have said, this article is mostly nasty rhetoric with very little substance. I'm surprised that the CAPTCHA isn't accessible, but that's about it.

Websites opt-in to CloudFlare DDoS protection. If you want to be crawled, you don't have to use it. But it's very difficult to expose yourself to the open internet nowadays unless you're hosted in a cloud or have something like CloudFlare.

I have stuff I don't want crawled at all, I use CloudFlare, and it's an awesome (free!) service that helps me maintain HTTPS certs and keep Chinese and Russian IPs from hammering my server.


> But it's very difficult to expose yourself to the open internet nowadays unless you're hosted in a cloud or have something like CloudFlare.

Yet, millions of websites are working fine without cloud and without CloudFlare.


Nasty? Methinks your reaction shows you to be a fanboi who doesn't like someone speaking truths about the object of your fandom.


Not a very effective retort when GP lays out their thoughts and justifies their position...


What is it with this pseudo-doomsday clickbait "the free and open web is dead" type posts I keep seeing?

Is the web really dying or being destroyed? I don't get how it is and this article doesn't explain this either.


Is there no link to the post apart from the top-level link to the blog?


The post has a name attribute (cloudflaredestroy), so you can generate a link (a href='cloudflaredestroy'), but you can't have a direct URL to it, as it doesn't have an id.



That's how this one is: the post in question is just the first post on that page, and it appears to be the full content.


The solution is simple. Can some Cloudflare engineers here grant this search engine the same access Cloudflare gives the Chinese ones?


They failed to present data showing that exceptions are made for other search engines.


Valid point, but I do not think they are lying; I am sure they can present it to Cloudflare if required.


It's quite likely that Baidu is doing the reasonable thing and using a more sophisticated crawling mechanism that caches the cookies. You have to do this anyway to crawl much of the modern web, which is highly JS-dependent.
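Caching an expensively obtained cookie (e.g. one produced by driving a headless browser through the JS challenge) is itself simple. A hypothetical sketch, where `fetch` stands in for whatever expensive routine solves the challenge:

```python
import time

class CachedToken:
    """Cache a site's challenge cookie and recompute it only after
    expiry, so the expensive JS challenge runs once per TTL window
    rather than once per request. (Illustrative; `fetch` would
    drive a headless browser in a real crawler.)"""
    def __init__(self, fetch, ttl_seconds=1800):
        self._fetch = fetch          # expensive: runs the JS challenge
        self._ttl = ttl_seconds
        self._value = None
        self._expires = 0.0

    def get(self):
        if time.time() >= self._expires:
            self._value = self._fetch()
            self._expires = time.time() + self._ttl
        return self._value

calls = []
token = CachedToken(lambda: calls.append(1) or f"cookie-{len(calls)}",
                    ttl_seconds=3600)
assert token.get() == "cookie-1"
assert token.get() == "cookie-1"   # cached: the challenge ran only once
assert len(calls) == 1
```

Amortized over thousands of pages per site, the per-request cost of the challenge becomes negligible, which is presumably why large crawlers aren't hampered by it.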


So the solution is to complain on social media to right the wrongdoing? Shouldn't there be a better way for the stakeholders to have a voice?!


So some bizarre and absolutely insignificant search engine is ranting about being irrelevant? Good for them.


Google seems to be destroying itself, fortunately. The search results got so bad after the recent changes that for the first time duckduckgo is superior, even for technical searches.

I'm slowly making the switch now.



