Web scraping is the process of extracting useful data from a website. Google built their company on this. Many websites take measures to prevent scraping for various reasons, while still allowing big search engines to crawl their content. I will share my opinion about this subject and how some websites detect web scraping.
Scraping enables automation, giving more freedom to power users. Some platforms do not like this because we won’t see their ads, but most power users block ads anyway, as ads usually cripple web apps. Other reasons to scrape may not be very ethical. For example, many bots scrape the web in search of email addresses to send spam to. Any website is free to block scraping in any way it wants; after all, the site is the one that chooses how to reply to requests. LinkedIn fought a company in court over the scraping of its public content, but the court did not rule in LinkedIn’s favor. A terms of service page hidden somewhere on a website, full of text that no one reads, is not really legally enforceable.
If websites provided a reasonable API, web scraping could be reduced. With an API they can enforce limits on clients, such as the number of requests allowed per day, track how the API is used, and even make money by charging for additional API calls. Most web applications already use internal APIs, which could be made public. Frequently, scrapers actually consume these internal APIs instead of parsing the whole website, which is more efficient for both sides.
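To make the idea of a per-day request limit concrete, here is a minimal sketch of how an API server might count requests per client. The `DailyQuota` class, its `allow` method, and the idea of identifying clients by an API key are all assumptions made for illustration; a real service would likely keep these counters in a shared store rather than in memory.

```python
from collections import defaultdict
from datetime import date


class DailyQuota:
    """Hypothetical per-client daily request counter (in-memory only)."""

    def __init__(self, limit_per_day):
        self.limit_per_day = limit_per_day
        # Maps (api_key, day) -> number of requests counted on that day.
        self.counts = defaultdict(int)

    def allow(self, api_key, today=None):
        """Count the request and return True if the client is under its limit."""
        today = today or date.today()
        key = (api_key, today)
        if self.counts[key] >= self.limit_per_day:
            return False  # over quota: the server could reply with HTTP 429
        self.counts[key] += 1
        return True
```

A request handler would call `allow(api_key)` before doing any work and reject the request when it returns `False`. Charging for additional calls could then be as simple as giving paying clients a higher `limit_per_day`.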
It is ironic that Google built its empire on web scraping and is also the one offering solutions like reCAPTCHA to prevent others from crawling. Some websites even include metadata to help Google scrape their content more accurately and easily, using JSON-LD for SEO purposes. There are many other crawlers that behave nicely and have good intentions.
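The same JSON-LD metadata that helps search engines is just as easy for any other scraper to read. As a rough sketch using only the Python standard library, the code below collects every `<script type="application/ld+json">` block from a page. The `JsonLdExtractor` class, the `extract_json_ld` function, and the sample page are made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of JSON-LD script tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so buffer them.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_json_ld(html):
    """Return a list with one parsed object per JSON-LD block in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page used only to show the call.
SAMPLE_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}
</script>
</head><body><p>Hi</p></body></html>"""

items = extract_json_ld(SAMPLE_PAGE)
```

No brittle HTML selectors are needed here, which is exactly why this metadata makes scraping more accurate and easier.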
Web scraping detection will always be a cat-and-mouse game, and some ways of detecting it were presented. However, in the end, if information is publicly available, even a real human can act as a web scraper, with or without browser extensions to help automate the process.