Google Sitemaps Deconstructed

Tuesday, 7 June 2005

Google Sitemaps seemed like a moderately good idea at first. A standard format for information is always good for interoperability. But it’s hard to see Google Sitemaps ever really being useful to web search engines. Some features seem to add little to what Google already does, and others seem entirely useless.

Google Sitemaps is a standard XML format for website maps. Google (and anybody else) can use such a sitemap to get information about site structure without having to crawl the entire site. Of course, the information in the sitemap is only a hint: Google still has to crawl your site in order to index it and to verify the map. So whether the map is useful or not depends on the kinds of information that it can contain.

For each URL on your site, the Sitemap Protocol allows you to specify when and how often the page is updated. Google can use this (in theory) to crawl your site more efficiently. You can also specify the relative importance of each URL. This seems straightforward, but I think it’s actually slightly mysterious. You can see this if you delve a little more into the structure and usage of the sitemap protocol.

The root element of the sitemap is urlset. This contains one url element for each URL in the sitemap. The url element must contain a loc element, giving the actual URL address; it can also contain metadata.
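The nesting described above (urlset, then url, then loc) can be sketched in a few lines of Python using the standard library. This is only an illustration: the build_sitemap helper and the example.com URLs are made up for this sketch, and the namespace string is an assumption about the schema version in use.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace of Google's original schema; adjust as needed.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def build_sitemap(entries):
    """Build a urlset element from a list of dicts (hypothetical helper).

    Each dict must have a "loc" key; any other keys become
    metadata child elements of that url element.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # loc is the only child that a url element must contain.
        ET.SubElement(url, "loc").text = entry["loc"]
        for name, value in entry.items():
            if name != "loc":
                ET.SubElement(url, name).text = str(value)
    return urlset


sitemap = build_sitemap([
    {"loc": "http://www.example.com/"},
    {"loc": "http://www.example.com/archive", "changefreq": "monthly"},
])
xml_text = ET.tostring(sitemap, encoding="unicode")
print(xml_text)
```

The first entry carries only the required loc; the second adds one piece of optional metadata.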

loc — the URL location
This is somewhat useful because it allows you to tell Google about URLs that aren’t linked, for example search results. However, most pages that are important enough to be in the sitemap are probably important enough to have a link already.

changefreq — how frequently the content at the URL is likely to change
This can be set to various time periods from hourly to yearly, or “always” or “never”. In most cases, Google will be able to tell how often a page changes just as accurately as website operators can. And I’m sure a great many operators will set their change frequency to “hourly” or “always” just to try to get Google to index their site more often.

lastmod — the time the content at the URL was last modified
This information is normally available in the headers for a web page anyway. And if the headers are not accurate, then the sitemap certainly won’t be. Again, I can see unscrupulous operators using sitemaps with constantly updating lastmod dates, just to try to fool Google into spidering the site more often. I’m sure this ploy won’t work, but the only way to detect such ploys is to crawl the site, which defeats the purpose of the sitemap.

priority — the priority of the page relative to other pages on the same site
This one is interesting. Google says (in a faintly self-contradictory way) that this can be used to weight search rankings of individual pages in your site: “Search engines use this information when selecting between URLs on the same site”.
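To make the four elements concrete, here is a rough Python sketch that reads a small sitemap and flags metadata values that look wrong. The sample document, the read_entries and check_entry helpers, and the example.com URLs are all invented for illustration. The namespace is left out to keep the lookups short, and the 0.0 to 1.0 range for priority is my understanding of the protocol rather than something quoted above.

```python
import xml.etree.ElementTree as ET

# Time periods from hourly to yearly, plus "always" and "never".
CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}

# Hypothetical sample; a real sitemap would also declare a namespace.
SAMPLE = """\
<urlset>
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-06-07</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/search?q=sitemaps</loc>
    <changefreq>sometimes</changefreq>
  </url>
</urlset>
"""


def read_entries(xml_text):
    """Return one dict per url element, mapping child tag to its text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in url}
        for url in root.findall("url")
    ]


def check_entry(entry):
    """Return a list of human-readable problems found in one entry."""
    problems = []
    if not entry.get("loc"):
        problems.append("missing loc")
    freq = entry.get("changefreq")
    if freq is not None and freq not in CHANGEFREQ_VALUES:
        problems.append(f"unknown changefreq: {freq}")
    priority = entry.get("priority")
    if priority is not None:
        try:
            in_range = 0.0 <= float(priority) <= 1.0
        except ValueError:
            in_range = False
        if not in_range:
            # Assumed valid range for priority: 0.0 to 1.0.
            problems.append(f"bad priority: {priority}")
    return problems


entries = read_entries(SAMPLE)
for entry in entries:
    print(entry.get("loc"), check_entry(entry))
```

A check like this only confirms that the values are well formed. Whether a declared changefreq or lastmod is truthful is exactly what cannot be known without crawling the page.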

So can this be used to adjust your page’s search ranking? Google thoughtfully provides two answers. The first answer is no, as you might expect: “The priority you assign to a page has no influence on the position of your URLs in a search engine’s result pages”. The second answer is a surprising yes: “you can use this tag to increase the likelihood that your more important pages are present in a search index.” This seems quite strange. Google already has its own arcane methods of ranking search results. When I search using Google, I don’t want Google’s carefully designed, finely tuned search results to be affected by some website operator’s finagling! I tend to believe the “no” answer.

So if the priority tag does not affect search rankings, what good is it? I would love to know.
