From the course: Web Scraping with Python


Sitemaps and robots.txt

- You may have heard of robots.txt before. It's essentially a file at the root of most domains, like wikipedia.org/robots.txt, that gives instructions to any passing bots about what they should and shouldn't scrape. The syntax of this file is defined by something called the Robots Exclusion Standard, also known as the Robots Exclusion Protocol. The cool thing is, you don't really need to worry about any of that in order for your scrapers to follow robots.txt. Check out settings.py in the news article scraper I used as the solution for the challenge in chapter two. You'll see the "Obey robots.txt rules" setting, ROBOTSTXT_OBEY, and it's set to True here. So Scrapy will automatically fetch the robots.txt for any domain you give it, check what it can and can't scrape, and then follow the rules there. And obviously, if a site's robots.txt is a big problem for your scraper, you can always set ROBOTSTXT_OBEY to False, but…
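As a rough sketch of what that looks like (the project name below is hypothetical, and other values in the chapter-two solution may differ), the relevant lines in a Scrapy project's settings.py are:

# settings.py (excerpt), a minimal sketch rather than the exact course file
BOT_NAME = "news_scraper"   # hypothetical project name

# Obey robots.txt rules: Scrapy fetches each domain's robots.txt before
# crawling and skips any URLs that file disallows for this crawler.
ROBOTSTXT_OBEY = True

# Flipping this to False tells Scrapy to ignore robots.txt entirely.
# ROBOTSTXT_OBEY = False

With ROBOTSTXT_OBEY set to True, requests to disallowed paths are simply dropped by Scrapy's robots.txt middleware, so you never have to download or parse the file yourself.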
