From the course: Technical WordPress SEO (2019)

Robots.txt

- [Instructor] A properly configured website is going to have a robots.txt file. This file is really important because it contains the set of instructions that crawlers use when they arrive on your website. These instructions indicate what the crawler should and should not crawl; essentially, the file allows or disallows certain behavior. To show you the robots.txt file for Salt and Simple, I'll simply go to saltandsimple.com/robots.txt. All robots.txt files live at the same destination, and the filename is case sensitive, so it will always be a lowercase r in robots. Here's the robots.txt file for Salt and Simple, and this is the default set of directives that WordPress provides. It says the user agent, that's the piece of software crawling the website, is an asterisk, which means all user agents must follow the directives below. The first directive says you're not allowed to crawl the folder wp-admin. That's the folder we use to log in and administer everything we're doing in WordPress, so it makes sense that we don't want crawlers going through all of those URLs. It says you are allowed, however, to visit one particular URL within that folder, admin-ajax.php.

Now, it's important to know that disallowing pages or subdirectories isn't a security feature; it won't prevent people from accessing that content. It just tells the robots not to waste their time crawling it. And that matters, because a particular crawler, say Google, has a quota, an allotment of time it will dedicate to crawling your website. Once that time has elapsed, it's done and it leaves. So if the crawler wastes time visiting content that will never be relevant to what people are searching for, content you're never going to send traffic to, there's no point in having Google, or any other crawler, visit it. That is why we use a robots.txt file: to provide directives, the instructions we want the crawler to follow. You'll also notice that within the robots.txt file we provide the sitemap, and that sitemap lists all the pages the website wants crawled.

Let's take a closer look at some sample robots.txt files and go through them together. I'm here in a text editor, Sublime, and I've created a few examples of robots.txt directives. For starters, you'll always have User-agent: and then the user agent that you're directing. You can identify a specific user agent, such as Googlebot, msnbot, and so on, and if you do a quick search for crawler user agents, you'll find a wide variety. When you're speaking to a particular user agent, you'll likely already know why and what you're doing there. Many large sites provide different directives to the crawlers that serve ads than they do to, say, Google, and this is where you'd often want to differentiate between two user agents: you want Google to access all of the content that's relevant to it, while your advertisements might serve on every piece of content, so you want that crawler to have unlimited access to particular areas of the website. After the user agent we have Disallow and then the path you do not want crawled. By default, anything that's not in the disallow list is allowed, so we don't have to explicitly call out Allow.
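For reference, the default directives described above look something like the following. The Sitemap line is the part that varies most: it usually comes from your SEO plugin or sitemap setup rather than WordPress itself, and the exact URL shown here for Salt and Simple is an assumption for illustration.

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php

    Sitemap: https://saltandsimple.com/sitemap.xml

The Allow line is more specific than the Disallow above it, which is how a crawler like Googlebot knows that admin-ajax.php is the one URL inside wp-admin it may still fetch.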
So here's an example, lines five through seven: this is what you would see if you were blocking all crawlers from all content, and this problem comes up quite a bit. If you find that your site isn't being indexed by Google, you'll want to check whether your robots.txt file has this directive, because it says you don't want anything crawled. You'll typically see this on development websites; if you have staging.yoursite.com, you would disallow all crawling there. And if that robots.txt file accidentally gets replicated to your live production site, well, you'll have a problem. Another very common out-of-the-box setup is to simply give every user agent the ability to crawl the entire site by leaving Disallow blank. You'll also provide the location of your sitemap with Sitemap: and then the full URL to that sitemap, typically ending in /sitemap.xml. It's important that you maintain case sensitivity and the space after the colon.

Now, one of the most common mistakes I encounter when reviewing robots.txt files is addressing multiple user agents with a single block of directives. Let's say we had identical directives for msnbot and Googlebot, so we simply stack the user agents: User Agent A, User Agent B, Disallow path, Disallow path2. This can create scenarios where the crawler gets confused. Perhaps we wanted User Agent B to disallow only path2 and User Agent A to disallow only path; as written, it could be read as User Agent A disallowing both, or User Agent B disallowing both, or interpreted incorrectly in a number of other ways. Crawlers aren't always the smartest; they follow a very rudimentary set of rules. A better way to manage this is to always talk to one user agent at a time. If we wanted User Agent A to disallow both of these paths, we'd set it up as such, or we would add the other user agent and disallow that path explicitly underneath it, as in the sketch below. So when you set up your robots.txt file, it should always go user agent first, then its paths, and you can have as many Disallow lines as you need: path3, path4, and so on. Your robots.txt file is incredibly important for your SEO efforts, so take the time to make sure you're using one, and evaluate whether you're using it in a way that best supports your SEO goals.
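To make those patterns concrete, here's a sketch of the three setups discussed above, shown as separate snippets rather than one file. Googlebot and msnbot are real crawler names, but /path, /path2 and so on are placeholders, not actual directories on any site.

    # Block all crawlers from all content (for example, on a staging site)
    User-agent: *
    Disallow: /

    # Allow every crawler to crawl the entire site
    User-agent: *
    Disallow:

    # Talk to one user agent at a time, each with its own Disallow lines
    User-agent: Googlebot
    Disallow: /path
    Disallow: /path2

    User-agent: msnbot
    Disallow: /path2

Because each group names exactly one user agent, there's no ambiguity about which crawler a given Disallow applies to.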
