Alternative search engines have lousy crawlers

By ACM on Thursday 1 December 2011 00:08 - Comments (7)
Category: Tweakers.net, Views: 10.846

A few alternative search engines pop up every year. All claim some distinctive feature, and most position themselves as an alternative to the big players, either across the board or for a specific niche. Yacy is one of the latest to get a public announcement, although it has been around much longer.

As a publisher, we see one very common denominator: they generally have lousy crawlers, especially compared to Googlebot and Bingbot. Tweakers.net offers over a million "articles" (forum topics, news articles, reviews, etc.). Many of those pages carry links that tweak some aspect of the current view (sorting, paging, filtering). On top of that, we offer links to finer-grained information, like a direct link to a single comment on an article or topic.

So while we offer over a million articles, we've got in the order of a hundred million unique links on our site. We don't need, nor want, all those URLs treated as important and/or distinct pages. That is, we don't want them followed: they're there for our visitors, but not very useful for search engines. We generally don't even consider such a URL a different page, and its content will have been indexed via some other (more important) link anyway.
So to steer the crawlers, we mark those links with rel='nofollow' and, where possible, add a line to our robots.txt. Apart from having popularised these markers and generally obeying them, Googlebot and Bingbot offer additional tools to reduce the problem. For instance, Google lets you specify how GET parameters should be interpreted, so it knows how to treat links that contain them.
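
To make that concrete, the steering looks roughly like this; the paths below are made up for illustration, not our actual rules. Note that the wildcard in Disallow is an extension supported by Googlebot and Bingbot, not part of the original robots.txt standard:

    <!-- a view-tweaking link, marked so crawlers leave it alone -->
    <a href="/forum/topic/12345?sort=date" rel="nofollow">sort by date</a>

    # robots.txt (excerpt)
    User-agent: *
    Disallow: /*?sort=
    Disallow: /*?page=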

Yes, we're one of the publishers that use rel=nofollow simply to mark URLs to be left out of the link graph. We don't really care about the added bonuses it used to come with, like 'PageRank sculpting'. We just don't want or need those URLs indexed, which saves "them" and us processing time and leaves room for crawlers to actually spend their time on the parts we do want indexed.

The new crawlers generally lack (full) support for these steering mechanisms, even the more basic ones; a sketch of what getting the basics right takes follows the list. It's not uncommon for them to:
  • Ignore nofollow on links or in meta tags
  • Ignore noindex in meta tags
  • Partially or fully ignore the robots.txt
  • Fail to fully understand the robots.txt
  • Fail to limit their request rate to some decent value
  • Fail to understand how relative links work in combination with a base href (until a few years ago, Google didn't understand that too well either)
  • Fail to understand that links are given in HTML and should thus be entity-decoded first (i.e. no literal '&amp;' left in request URLs)
  • Not offer any insight into why our pages are being crawled
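
For contrast, here is a minimal sketch of what honouring those basics takes, using only the Python standard library; the bot name and URL are hypothetical:

    import urllib.robotparser
    from html import unescape

    USER_AGENT = "ExampleBot"  # hypothetical crawler name, not a real bot

    # Fetch and parse the site's robots.txt once, before crawling anything.
    rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
    rp.read()

    def should_fetch(href: str, rel: str = "") -> bool:
        # Links come from HTML, so entity-decode them first:
        # href="/x?a=1&amp;b=2" must be requested as "/x?a=1&b=2".
        url = unescape(href)
        # Respect rel="nofollow": keep the URL out of the crawl queue entirely.
        if "nofollow" in rel.split():
            return False
        # Check every candidate URL against the robots.txt rules.
        return rp.can_fetch(USER_AGENT, url)

    # In the fetch loop itself, rate-limit as well, e.g. sleep a second
    # between requests instead of hammering the site.

None of this requires more than a few lines of standard-library code.
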
With that in mind, we are fairly quick to block entire search engines, either by blocking their IP addresses or by adding some lines to our robots.txt. Given its behaviour, it wouldn't surprise me if we end up blocking Yacy (for now) as well... I've already seen it request URLs blocked via our robots.txt, and it doesn't seem to understand nofollow too well. We don't mind being indexed by robots, but we do mind if they do it incorrectly and/or inefficiently.


Comments


By Tweakers user Luuk1983, Thursday 1 December 2011 15:18

What an interesting blog, fun to read! :)

By Tweakers user zzattack, Thursday 1 December 2011 17:12

Are crawlers forced to comply with Disallows per User-Agent in robots.txt?

By Tweakers user Kees, Thursday 1 December 2011 18:00

zzattack wrote on Thursday 01 December 2011 @ 17:12:
Are crawlers forced to comply with Disallows per User-Agent in robots.txt?
Of course no one can force anyone to comply with these standards, but if we detect a crawler that isn't following our suggestions, we'll usually ban it from the site, or take it up with the crawler's owner.

By Tweakers user Wortelsoep, Friday 2 December 2011 08:42

Interesting, but why in English? It was clearly written by a Dutch speaker, which makes it hard to read (e.g. the first sentence).
Other than that, good and logical that you ban junk like this :)

By Tweakers user ACM, Friday 2 December 2011 10:46

Wilko wrote on Friday 02 December 2011 @ 08:42:
Interesting, but why in English?
Then we can also reuse it as a reference in bug reports :)

By Tweakers user jamy015, Saturday 3 December 2011 18:25

Interesting blog, but a few small spelling mistakes:

- they're generally lousy crawlers > they generally have lousy crawlers (the search engine isn't a crawler, it uses/has a crawler)
- especially compared to [...] > especially when compared to (I think; can someone tell me whether that's right or wrong?)
- usefull > useful
- skulpting > sculpting
- we've in the order of a hundred million [...] > we've got in the order of a hundred million

By Tweakers user Snikker, Tuesday 6 December 2011 23:17

The newest one I found yesterday, NL content only: http://www.zoezoe.nl

I have to say, still a lot of work to be done, but all new initiatives seem fine to me!
