“We have to reinvent the wheel every once in a while, not because we need a lot of wheels; but because we need a lot of inventors.”
– Bruce Joyce
I wrote about my experience writing a site crawler in PHP in an earlier post, and I’m going to build on some of that background to make my point here, so it might help to go read it first if you haven’t already.
[Google’s crawler (Googlebot) isn’t that sophisticated / writing a crawler in PHP]
From my casual observation of the way Googlebot crawls some of the sites I work on, I have reached the conclusion that it works in much the same way that a crawler I wrote a year ago worked.
Googlebot goes page to page, gathering links from each page and tacking them onto the URL it is at right then. So why do query strings give it such a problem?
The answer is simple. Imagine this URL for an item that doesn’t exist anymore:

www.example.com/store.php?buyid=29&catid=12
When a crawler encounters this URL and tests it to see if it returns a 404 … it doesn’t.
Why?
Because www.example.com/store.php is usually still a valid page. It won’t give the crawler an error, unless you explicitly code it to.
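To see why, here is a rough sketch of what a typical store.php often looks like (the product list and field names are made up for illustration): an unknown buyid just falls through to a friendly message, but nothing ever changes the HTTP status, so the crawler still gets a 200.

<?php
// Hypothetical store.php sketch; the product table and field names
// are invented for illustration only.
$products = [
    39 => 'Blue widget',
    40 => 'Red widget',
];

$buyId = isset($_GET['buyid']) ? (int) $_GET['buyid'] : 0;

if (isset($products[$buyId])) {
    echo '<h1>' . htmlspecialchars($products[$buyId]) . '</h1>';
} else {
    // The item is gone, but we never send an error header,
    // so the response status is still 200 OK.
    echo 'Sorry, that item is no longer available.';
}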
So the crawler now tosses www.example.com/store.php?buyid=29&catid=12 onto its list of pages to be crawled. Can you see the disaster waiting to happen?
www.example.com/store.php?buyid=29&catid=12 and any other non-existent URLs like it all map to the still-valid www.example.com/store.php, but in the crawler’s mind they are all different URLs.
Now, if there are other URLs on that page (store.php), say links to related products, Google just takes each link and tacks it onto the URL it (thinks it) is at right now. So it winds up with

www.example.com/store.php?buyid=29&catid=12store.php?buyid=39&catid=11

It does that for every invalid query-string URL that has store.php as its base. It then goes back and crawls those again, and now it has

www.example.com/store.php?buyid=29&catid=12store.php?buyid=39&catid=11store.php?buyid=39&catid=11
The crawler is now in a tailspin, going around in circles trying to crawl your site, chewing up your CPU cycles and generally being a nuisance.
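To make that concrete, here is a rough sketch in PHP of the kind of naive link handling I’m describing (not Googlebot’s actual code, just the behaviour my own crawler had): relative links get concatenated onto the current URL, query string and all.

<?php
// Naive link resolution: tack the href onto the URL we're "at",
// instead of properly resolving it against the path.
function naive_resolve($currentUrl, $href)
{
    // Absolute links are kept as-is.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    // Everything else gets concatenated onto the current URL,
    // query string and all.
    return $currentUrl . $href;
}

$current = 'http://www.example.com/store.php?buyid=29&catid=12';
$related = 'store.php?buyid=39&catid=11';

echo naive_resolve($current, $related);
// http://www.example.com/store.php?buyid=29&catid=12store.php?buyid=39&catid=11

Run that output through the same function again and you get the doubled-up URL from above; that is the tailspin.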
I hope this helps you understand why Googlebot hates query strings so much.
I haven’t tried this yet, but I think it should be clear that making the bare base URL of a query-string page return a 404 error would help it out a lot.
So, as an example:

www.example.com/store.php?buyid=29&catid=12

should return a 200 OK,

and

www.example.com/store.php

should give a 404 error.
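A rough, untested sketch of that idea in PHP might look like this; the check at the top is the whole point, and the rest of store.php stays as it is:

<?php
// If store.php is hit with no query string at all, send a real 404
// so the bare "base" URL stops looking like a valid page.
if (empty($_GET)) {
    header('HTTP/1.0 404 Not Found');
    echo 'Not found.';
    exit;
}

// ...normal store.php logic for real buyid/catid requests goes here,
// returning the usual 200 OK page.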
This is just my theory; I don’t know whether it’d be practical.
PS: I hope this also helps you understand why search engine crawlers hate PHP session IDs in your URLs so much.
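If that is your problem, the usual fix (just a sketch; defaults vary by PHP version, so check yours) is to force cookie-only sessions so PHP never appends a PHPSESSID to your links:

<?php
// Force cookie-only sessions so PHPSESSID never ends up in URLs.
ini_set('session.use_only_cookies', '1');
ini_set('session.use_trans_sid', '0');
session_start();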