I spent a lot of time early last year, trying to write a crawler in php (I know, I know).
It was supposed to sit on the server and when so that when you went to the url, it’d generate a google sitemap for your entire site.
What I found out was that writing a good crawler is very hard work. Not because of the recursion involved, but because of the infinite ways link tags appear.
Now Google has validated my experience (more on this in a second).
Just a couple of things I had to consider with my crawler were
- I had to program it to look for a base tag so that I’d know if to treat the links as relative or absolute.
- I had to check each link to figure out if it was an internal or external link so I’d know whether to crawl it.
- Then I had to keep a running list of links crawled, so that I’d know if I had crawled a link before or not
- I had to let the program know that if there was a “/” in a relative link, to let it know to substitute the domain for it.
- knowing how to deal with “../” … this was a pain and a half
- I had to let it know how to deal with mailtos, javascript, and improperly written urls like <a href=”www.concept47.com”> (more on this later) … etc
My crawler worked by gathering a list of links that it continually added urls to be crawled onto. As it crawled the urls, it put them in another array that each new url had to be crosschecked with before being added to the list to be crawled.
The problem though, is the way people write html markup. As many of you know, there are some nastily written pages out there … so if someone wrote <a href=”www.concept47.com”> or <a href=”concept47.com”> or even <a href=”screwyougoogle”> my crawler had to know not to add it to the list to be crawled.
This is very difficult to do correctly and for all the time I spent on it, there is no real way to deal with it. You could write special cases for <a href=”www.concept47.com”> but what about <a href=”ww.concept47.com”> or
<a href=”w.concept47.com”> … see the problem?
Even though the urls give you a 404, they’ll make it onto your list to be crawled and waste the crawlers time. I felt like such a loser for not being able to figure out this issue, but it seems the Googlebot has the same problem.
Look at this. [click the image to make it bigger]
This is from my webmaster tools console.
The problem here was that I had a link tag on one of my blog posts that went like this
<a href=”www.unfuddle.com”>
As you can see, even the mighty Googlebot didn’t pick up on the error. It just tacked the url onto the current url, it was at and went on about its business.
Validation!
Read the next in this series [Why query strings in urls drive Googlebot and other search engine crawlers insane]