Join Joe Marini for an in-depth discussion in this video Understanding the Sitemap and Sitemap index formats, part of Real-World XML.
Before we jump in and start designing our own XML format, I thought it would be instructive to take a look at some of the real-world XML formats that are in use today. We'll start out by looking at the Sitemap and the Sitemap Index formats. These formats provide a way for webmasters to inform search engines about the contents of their sites that are available for searching or crawling by the search engines. The Sitemap and Sitemap index currently enjoy pretty wide support. They are supported by the Google, Yahoo, and Microsoft search engines, which pretty much constitute the bulk of the search engine traffic that's out there today.
Now I want to point out that Sitemap and Sitemap Index don't affect the way that your sites appear or are ranked in the search engines. The whole point of these file formats is to tell the search engines how they can crawl your site more intelligently. This is not about Search Engine Optimization or anything like that. Each Sitemap is an XML file and that XML file lists information about each URL that is available on your site. It lists information like when it was last updated, and how often it changes, and so on and so forth. Now as I said, this does not guarantee that pages are going to be included in search results or that it's in any way going to affect how your page gets ranked. The whole idea here is that this is a way for your site to inform the search engines about the structure of your site, how they should search the site, that kind of thing.
You can find out more information about the Sitemap and the Sitemap Index formats at the URL that you see here, www.sitemaps.org. Okay, so each Sitemap file contains a collection of tags that define the URLs that the search engines should care most about. Now Sitemap files are limited to 10 Megabytes in size. So if you have to use more than one Sitemap file, then Sitemap index files are used to group multiple Sitemap files together. You can imagine for websites that have a lot of URLs, such as say a large catalog shopping site, they want to index all of the URLs that are available. That can easily exceed 10 Megabytes in size pretty quickly.
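To make the splitting idea concrete, here is a minimal sketch of how already-serialized url entries might be grouped so that each group stays under a size limit. The function name `split_by_size` and the constant `MAX_BYTES` are made up for illustration, the 10 megabyte figure is simply the one quoted in this video (sitemaps.org lists the current limits), and the sketch ignores the small overhead of the XML declaration and the enclosing root tag.

```python
# Size limit as quoted in the talk; treat it as an assumption and check
# sitemaps.org for the figure that currently applies.
MAX_BYTES = 10 * 1024 * 1024


def split_by_size(entries, max_bytes=MAX_BYTES):
    """Group serialized <url> entries so each group's total size stays within max_bytes.

    The overhead of the XML declaration and the root tag is not counted here.
    An entry that is larger than max_bytes on its own still gets its own group.
    """
    groups, current, size = [], [], 0
    for entry in entries:
        entry_size = len(entry.encode("utf-8"))
        # Start a new group when adding this entry would exceed the limit.
        if current and size + entry_size > max_bytes:
            groups.append(current)
            current, size = [], 0
        current.append(entry)
        size += entry_size
    if current:
        groups.append(current)
    return groups


# Tiny demonstration with an artificially small limit of 8 bytes.
demo_groups = split_by_size(["aaaa", "bbbb", "cccc"], max_bytes=8)
```

Each resulting group would become one Sitemap file, and the Sitemap index would then list those files.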
So the Sitemap index file is how you group multiple Sitemaps together. Ideally, you place these files at the root of your website, and you then either reference them in a robots.txt file or submit them directly to the search engines to let them know that these files exist. The sitemaps.org URL that I listed earlier has more detailed information on how to do this. These are all only just hints. The search engines don't use this to affect your site's search rankings.
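As a small illustration of the robots.txt route, a Sitemap is advertised there with a single `Sitemap:` line that gives its full URL. The helper name `robots_sitemap_line` and the example.com address below are hypothetical.

```python
def robots_sitemap_line(sitemap_url):
    """Return the robots.txt line that advertises a Sitemap or Sitemap index."""
    # The location is given as a full URL, including the protocol.
    if not sitemap_url.startswith(("http://", "https://")):
        raise ValueError("the Sitemap location should be a full URL")
    return f"Sitemap: {sitemap_url}"


# Hypothetical Sitemap placed at the root of an example site.
robots_line = robots_sitemap_line("http://www.example.com/sitemap.xml")
```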
Let's take a look at the tags available in the Sitemap file. Each Sitemap file has a set of tags, some of them are required and some are not. This table lists all of the tags that are in the Sitemap file format. So you can see there are six tags. So it's a pretty compact, pretty focused file format that does one job and does it well. The urlset tag, the one at the top here, it's required. It encapsulates the file and it references the current protocol standard. So this basically serves as the root tag in any of the Sitemap files. Urlset tags contain one or more URL tags. This is the parent tag for each URL entry. All the other tags in this list are child tags of this url tag. As you can see, it's also required.
Inside each url tag, there's one required tag and that's the loc tag right here. The loc tag stands for location and it lists the URL of the page. The URL has to begin with the protocol like HTTP. If your web server requires it, then it has to end with the trailing slash on the URL. Some web servers require it and some don't. The whole idea though is that these URLs are going to be used by the search engines to crawl your site. So if your web server requires it, then you have to include them in these tags as well.
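Here is a minimal sketch of the smallest useful Sitemap: a urlset root that holds url entries, each with only the required loc tag. It uses Python's standard `xml.etree.ElementTree` module. The function name `build_sitemap` and the example.com addresses are assumptions for illustration, not part of the course files.

```python
import xml.etree.ElementTree as ET

# Namespace of the Sitemap protocol, as published on sitemaps.org.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(locations):
    """Serialize a urlset whose url entries carry only the required loc tag."""
    # Emit the namespace as the default one (no prefix on the tags).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc in locations:
        # Each URL has to begin with the protocol.
        if not loc.startswith(("http://", "https://")):
            raise ValueError(f"URL must begin with the protocol: {loc!r}")
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URLs; note the trailing slashes for a server that requires them.
minimal_sitemap = build_sitemap(
    ["http://www.example.com/", "http://www.example.com/catalog/"]
)
```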
The rest of the tags are optional. The lastmod tag indicates, as a date, when this URL was last modified. Now this date should be in the W3C Datetime format, which you can look up on the W3.org website. If you want, you can just omit the time portion and use the format of a four-digit year, followed by a two-digit month and a two-digit day. The next tag, changefreq, indicates how frequently the page changes. It provides general information to search engines. Now this may or may not correlate exactly to how often they crawl the page. Remember, this file's purpose in life is to provide hints to the search engines; these values don't necessarily denote solid rules that the engines have to follow.
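The two lastmod shapes mentioned above, date only and full date and time, can be sketched with Python's `datetime` module. The helper names `lastmod_date` and `lastmod_datetime` are made up for this example, and the sample dates are arbitrary.

```python
from datetime import date, datetime, timezone


def lastmod_date(d):
    """Date-only form: four-digit year, two-digit month, two-digit day."""
    return d.isoformat()


def lastmod_datetime(dt):
    """Full form with a time portion and a time zone offset."""
    # A time portion without a zone would be ambiguous, so require one.
    if dt.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    return dt.isoformat(timespec="seconds")


short_form = lastmod_date(date(2009, 1, 15))
long_form = lastmod_datetime(datetime(2009, 1, 15, 18, 30, tzinfo=timezone.utc))
```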
So you can put in one of these values for this tag: always, hourly, daily, weekly, monthly, yearly, or never. So if you place always in this tag, it means that the page is always changing, it's dynamic, and it needs to be searched each and every time as if it were a new page. The never value, you should only use that in cases of pages that have been archived and don't need to be searched anymore. Ironically enough, that may or may not mean that search engines honor that value. They may choose to search pages listed as never anyway, just in case there are unexpected changes to those pages. Again, these are hints.
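A quick way to guard against typos in this tag is to check a value against the seven allowed ones before writing it out. This is only a sketch; the names `CHANGEFREQ_VALUES` and `is_valid_changefreq` are hypothetical.

```python
# The seven values listed for the changefreq tag.
CHANGEFREQ_VALUES = (
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
)


def is_valid_changefreq(value):
    """Return True when value is one of the allowed changefreq values."""
    return value in CHANGEFREQ_VALUES
```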
Then finally, there's the priority and that's also optional. This indicates the priority of this particular URL relative to the other URLs on your site. You can place values from 0, meaning least important, up to 1.0, which means most important. The default priority, if you don't specify this, is going to be 0.5. Meaning it's kind of a middle priority. Now this priority again does not affect how your page gets listed in search engine rankings. It just indicates how important the file is relative to the rest of the ones in your site.
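The priority rules, a range of 0.0 to 1.0 with 0.5 assumed when the tag is missing, can be sketched like this. The function name `effective_priority` is an assumption for illustration. It shows how a program reading a Sitemap might interpret the tag, not how any particular search engine does.

```python
DEFAULT_PRIORITY = 0.5  # used when a url entry has no priority tag


def effective_priority(raw=None):
    """Return the priority as a float, falling back to the default of 0.5."""
    if raw is None:
        return DEFAULT_PRIORITY
    value = float(raw)
    # Valid priorities run from 0.0 (least important) to 1.0 (most important).
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"priority must be between 0.0 and 1.0, got {raw!r}")
    return value
```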
So this is what a sample Sitemap looks like. You can see at the top, there's the XML declaration. In XML version 1.0, this is optional, but it's always a good idea to declare it anyway. In 1.1, this became mandatory, but in XML 1.0 the XML declaration is not needed; I always like to put it in because it's proper XML. You can see here, here is the urlset at the top of the page. It references its namespace so that if we wanted to include this in another file, we wouldn't have name collisions. Then inside the urlset, you have a collection of url tags.
You can see that each one of these guys has a location tag but not all of them have, for example, priority or last modification. It turns out that each one of them has a change frequency but again, those are optional as well. So this is a finished and complete sample Sitemap. You can see it's focused on one job. Its whole job in life is to tell search engines which URLs they should crawl on your site and how often. Okay, so moving along, let's look at the Sitemap index tags. Now Sitemap index files are even more compact. That's because their only purpose in life is to group together multiple Sitemap files. If your Sitemap files would be larger than 10 Megabytes, you have to break them down into smaller parts and then group them together using a Sitemap index.
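To see how the optional tags behave in practice, here is a sketch that reads a small Sitemap shaped like the sample described above and reports None for any tag an entry leaves out. The XML is an illustrative stand-in rather than the exact file from the course, and the names `SAMPLE` and `read_entries` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used when searching the parsed document.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Illustrative sample: the second entry omits lastmod and priority.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog/</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>"""


def read_entries(xml_text):
    """Return one dict per url entry; missing optional tags come back as None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            tag: url.findtext(f"sm:{tag}", namespaces=NS)
            for tag in ("loc", "lastmod", "changefreq", "priority")
        })
    return entries


entries = read_entries(SAMPLE)
```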
So all but one of these tags are required. The sitemapindex tag is required and it's the root tag of the document. The sitemap tag is also required. These go inside the sitemapindex root tag and there can be one or more of these. Each sitemap tag essentially encloses the location and lastmod tags about each Sitemap file. The location or loc tag indicates the URL of the Sitemap that it points to and lastmod is the time that the corresponding Sitemap file was last modified.
It does not correspond to the time that any of the pages in that Sitemap were changed. It's the file itself. Again, this should be kept in W3C style Datetime format. Here we have a sample Sitemap index. So you can see that in this case we have a sitemapindex. This is the root and here is its namespace declaration. This sitemap index file points to two different sitemaps. This one here has an example URL. This one has another one. We indicate when they were last modified. This is what the W3C Datetime format looks like. If you want to omit the time portion, which starts from the T and goes to the end, you can just use a four-digit year followed by a two-digit month and a two-digit day.
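A Sitemap index with the structure just described, a sitemapindex root containing sitemap entries that each have a loc and an optional lastmod, can be sketched as follows. The function name `build_sitemap_index`, the file names, and the dates are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Sitemap protocol, shared by Sitemap and Sitemap index files.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemaps):
    """Serialize a sitemapindex from (loc, lastmod) pairs; lastmod may be None."""
    ET.register_namespace("", SITEMAP_NS)
    root = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc, lastmod in sitemaps:
        entry = ET.SubElement(root, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = loc
        # lastmod is the one optional tag; it describes the Sitemap file itself.
        if lastmod is not None:
            ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


# Hypothetical Sitemap files: one with a full date and time, one with a date only.
index_xml = build_sitemap_index([
    ("http://www.example.com/sitemap1.xml", "2009-01-15T18:30:00+00:00"),
    ("http://www.example.com/sitemap2.xml", "2009-01-01"),
])
```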
That's essentially Sitemap and Sitemap index files in a nutshell. What we are going to do now is jump over to the code really quick, so we can look at it in the editor. Okay, so here we are in the code and if you have access to the sample files, then you have these files. I have included the example XML files for both the Sitemap and the Sitemap index formats, along with the Schema files for each of these, in case you have a tool that can use Schema files in your XML design.
So here you have the sample sitemap XML file that we looked at in the slides. You can see here the various tags. This is the corresponding schema that goes along with it. The schema file basically lays out the rules that an XML file has to follow. So you can see that this is defining what elements are allowed and where they can go inside the sitemap file. Same over here for the sitemap index. This is the sample file and here is the schema that goes along with the sitemap index file.
So that's a pretty simple example to get our feet wet with a custom real world XML format. Let's take a look now at a more complex example and that's the RSS file format.
XML Essential Training is a prerequisite for getting the most out of this course.