Alternative search engines have lousy crawlers

By ACM on Thursday 1 December 2011 00:08
Category: Tweakers.net

A few alternative search engines arrive every year. All claim some distinctive features, and most try to position themselves as an alternative to the big players, either as a whole or for a specific niche. Yacy is one of the latest to see a public announcement, although it has been around for much longer.

As a publisher, we see one very common denominator: they generally have lousy crawlers, especially compared to Googlebot and Bingbot. Tweakers.net offers over a million "articles" (forum topics, news articles, reviews, etc.). Many of those pages have links that tweak some aspect of the current view (sorting, paging, filtering). On top of that, we offer links to more fine-grained information, like a direct link to a comment on an article or topic.

So while we offer over a million articles, we have on the order of a hundred million unique links on our site. We don't need, nor want, all those URLs treated as important and/or distinct pages on our site. In other words, we don't want, nor need, those URLs to be followed: they're there for our visitors, but not very useful for search engines. We generally don't even consider those URLs to be different pages, and the particular content will have been indexed via some other (more important) link anyway.
So to steer the crawlers, we mark such links with rel='nofollow' and add a line to our robots.txt where possible. Apart from having popularised these markers and generally obeying them, Googlebot and Bingbot offer additional tools to reduce this problem. For instance, Google allows you to specify how GET parameters should be interpreted, so it knows how to treat links that contain them.
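
As a rough illustration of what we expect, here's a minimal Python sketch of the two checks a polite crawler should make before following a link. The bot name, robots.txt rules and URLs below are made-up examples, not our actual configuration.

    # Minimal sketch, not our real setup: the bot name, rules and URLs are
    # hypothetical. It checks robots.txt and rel='nofollow' before following a link.
    from urllib import robotparser

    ROBOTS_RULES = [                      # hypothetical robots.txt contents
        "User-agent: *",
        "Disallow: /print/",
        "Disallow: /comments/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_RULES)

    def may_follow(url, rel=""):
        """True only if neither rel='nofollow' nor robots.txt excludes the link."""
        if "nofollow" in rel.lower().split():
            return False                  # the publisher asked us to skip this link
        return rp.can_fetch("ExampleBot", url)

    print(may_follow("https://example.net/print/123"))                    # False: robots.txt
    print(may_follow("https://example.net/reviews/456", rel="nofollow"))  # False: nofollow
    print(may_follow("https://example.net/reviews/456"))                  # True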

Yes, we're one of the publishers that use rel=nofollow simply to mark URLs that should be left out of the link graph. We don't really care about the added bonuses it used to have, like 'PageRank sculpting'. We just really don't want or need those URLs to be indexed, saving "them" and us processing time and leaving room for crawlers to actually spend their time indexing the parts we do want them to index.

The new crawlers generally lack (full) support for these ways to steer crawlers, even the more basic ones. It's not uncommon for them to:
  • Ignore nofollow on links or in meta tags
  • Ignore noindex in meta tags
  • Partially or fully ignore the robots.txt
  • Fail to fully understand the robots.txt
  • Fail to limit their request rate to some decent value
  • Fail to understand how relative links work in combination with a base href (until a few years ago, Google didn't understand that too well either)
  • Fail to understand that links are given in HTML and should thus be correctly decoded, i.e. no &amp; left in the URLs (see the sketch after this list)
  • Not offer any insight into why our pages are being crawled
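
Most of these points aren't hard to get right. As another minimal Python sketch, with a made-up page and crawl delay, this is roughly the link handling the last few bullets describe: resolve hrefs against an optional base href, let the HTML parser decode entities so &amp; becomes &, and keep the request rate modest.

    # Minimal sketch with a hypothetical page: resolve links against <base href>,
    # decode HTML entities in hrefs, and throttle requests.
    import time
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, page_url):
            super().__init__()
            self.base = page_url          # overridden when a <base href> is seen
            self.links = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "base" and attrs.get("href"):
                self.base = attrs["href"]
            elif tag == "a" and attrs.get("href"):
                # HTMLParser already decodes entities in attribute values,
                # so &amp; in the source arrives here as a plain &.
                self.links.append(urljoin(self.base, attrs["href"]))

    PAGE = ('<html><head><base href="https://example.net/forum/"></head>'
            '<body><a href="topic/123?page=2&amp;sort=date">next</a></body></html>')

    extractor = LinkExtractor("https://example.net/some/listing")
    extractor.feed(PAGE)
    print(extractor.links)   # ['https://example.net/forum/topic/123?page=2&sort=date']

    CRAWL_DELAY = 2.0        # seconds between requests; keep the request rate modest
    for link in extractor.links:
        # fetch(link) would go here
        time.sleep(CRAWL_DELAY)
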
With that in mind, we are fairly quick to decide to block entire search engines, either by blocking their IP addresses or by adding some lines to our robots.txt. Given its behavior, it wouldn't surprise me if we end up blocking Yacy (for now) as well... I've already seen it request URLs that are blocked via our robots.txt, and it doesn't seem to understand nofollow too well. We don't mind being indexed by robots, but we do mind if they do it incorrectly and/or inefficiently.