
Thanks, that's our experience exactly and that's why we built the library this way. It's not uncommon to switch from HTTP to headless back to HTTP in the lifecycle of a project as the website evolves or as you find better ways to scrape.


I'm very new to web scraping, can you explain what the use cases are for each and how you can switch between them? As far as I understand, you can use HTTP scraping for static websites and need some kind of browser/headless browser to scrape dynamically rendered websites. How would you do that with plain HTTP? By figuring out the AJAX network requests and then sending those directly?


Exactly. Dynamic websites need to pull their data from somewhere too. There's no magic behind it. Either all the data is in the initial payload in some form (not necessarily HTML), or it's downloaded later, again over HTTP.
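As a minimal sketch of the "data is in the initial payload" case: many "dynamic" pages ship their data as JSON embedded in a script tag, so you can extract it with plain HTTP and a regex, no JavaScript execution needed. The `window.__DATA__` variable name and the HTML here are hypothetical, but the pattern is common.

```python
import json
import re

# Hypothetical HTML as fetched over plain HTTP: the "dynamic" data is
# already embedded as JSON in the initial payload.
html = """
<html><body>
<div id="app"></div>
<script>window.__DATA__ = {"products": [{"name": "Widget", "price": 9.99}]};</script>
</body></html>
"""

# Pull the embedded JSON out without running any JavaScript.
match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", html, re.DOTALL)
data = json.loads(match.group(1))

for product in data["products"]:
    print(product["name"], product["price"])
```

If the data isn't in the initial payload, the browser's network tab usually reveals the JSON endpoint it's fetched from, and you can call that endpoint directly.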

Headless browsers are useful when the servers are protected by anti-scraping software you can't reverse engineer, when the data you need is generated dynamically (not downloaded, but computed), or simply when you don't have the time to understand the website on a deeper level.

Usually it's a tradeoff between development costs and runtime costs. In our case, we always try plain HTTP first. If we can't find an obvious way to do it, we go with browsers and then come back to optimize the scraper later, using plain HTTP or a combination of plain HTTP and browsers for some requests, like logins, tokens or cookies.
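A rough sketch of that hybrid approach, using only the Python standard library: a headless browser (e.g. Playwright, not shown here) handles the login once, and its cookies are handed to a plain-HTTP client for the bulk of the requests. The cookie dicts and domain below are hypothetical stand-ins for whatever the browser actually captured.

```python
import urllib.request
from http.cookiejar import Cookie, CookieJar

def cookies_to_jar(browser_cookies, jar=None):
    """Convert browser-style cookie dicts (the shape returned by e.g.
    Playwright's context.cookies()) into a stdlib CookieJar."""
    jar = jar or CookieJar()
    for c in browser_cookies:
        jar.set_cookie(Cookie(
            version=0, name=c["name"], value=c["value"],
            port=None, port_specified=False,
            domain=c["domain"], domain_specified=True,
            domain_initial_dot=c["domain"].startswith("."),
            path=c.get("path", "/"), path_specified=True,
            secure=c.get("secure", False), expires=None,
            discard=True, comment=None, comment_url=None, rest={},
        ))
    return jar

# Hypothetical cookies captured from a one-time headless-browser login.
browser_cookies = [{"name": "session", "value": "abc123", "domain": "example.com"}]
jar = cookies_to_jar(browser_cookies)

# All subsequent requests go over plain HTTP with the authenticated session,
# which is far cheaper than keeping a browser alive per request.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# opener.open("https://example.com/api/data")  # illustrative URL
```

The same idea works with any HTTP client; the point is that the expensive browser only runs for the steps that genuinely need it.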



