Alternative search engines have lousy crawlers
As a publisher, we see one very common denominator: they generally have lousy crawlers, especially compared to Googlebot and Bingbot. Tweakers.net offers over a million "articles" (forum topics, news articles, reviews, etc.). Many of those pages contain links that tweak some aspect of the current view (sorting, paging, filtering). On top of that, we offer links to more fine-grained information, such as a direct link to a comment on an article or topic.
So while we offer over a million articles, we have on the order of a hundred million unique links on our site. We don't need, nor want, all those URLs treated as important and/or distinct pages. In other words, we don't want or need those URLs followed: they're there for our visitors, but not very useful for search engines. We generally don't even consider those URLs separate pages, and the particular content will have been indexed via some other (more important) link anyway.
So to steer the crawlers, we mark such links with rel="nofollow" and add a line to our robots.txt where possible. Apart from having popularised and generally obeying these markers, Googlebot and Bingbot offer additional tools to reduce this problem. For instance, Google allows you to specify how GET parameters should be interpreted, so it knows how to treat links that contain them.
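For concreteness, the two markers look roughly like this (the paths and rules below are invented for illustration, not our actual configuration; note that path wildcards in robots.txt are an extension popularised by Google and Bing rather than part of the original protocol):

```
<!-- On the page: this sorting link exists for visitors,
     but shouldn't be treated as a distinct page -->
<a href="/reviews/1234?sort=date" rel="nofollow">Sort by date</a>
```

```
# In robots.txt: keep crawlers out of such parameterised views
User-agent: *
Disallow: /*?sort=
```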
Yes, we're one of the publishers that use rel="nofollow" simply to mark URLs that should be ignored in the link graph. We don't really care about the added bonuses it used to have, like 'PageRank sculpting'. We just don't want or need those URLs indexed, which saves both "them" and us processing time, and leaves crawlers room to actually index the parts we do want indexed.
The new crawlers generally lack (full) support for these ways of steering them, even the more basic ones. It's not uncommon for them to:
- Ignore nofollow on links or in meta tags
- Ignore noindex in meta tags
- Partially or fully ignore the robots.txt
- Fail to fully understand the robots.txt
- Fail to limit their request rate to a reasonable level
- Fail to understand how relative links work in combination with a base href (until a few years ago, Google didn't understand that too well either)
- Fail to understand that links are given in HTML and should therefore be decoded correctly (i.e. an &amp; in the markup becomes a plain & in the URL)
- Not offer any insight into why our pages are being crawled
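To illustrate the robots.txt and entity-decoding points, here's a minimal sketch, using only Python's standard library, of what a well-behaved crawler should do before following a link (the URLs, rules, and bot name are made up for the example):

```python
import html
from urllib import robotparser

# In HTML, "&" inside an attribute is written as "&amp;", so the raw
# href must be entity-decoded before it is used as a URL.
raw_href = "/reviews/1234?sort=date&amp;page=2"
url_path = html.unescape(raw_href)   # "/reviews/1234?sort=date&page=2"

# Honour robots.txt before fetching. parse() takes the file's lines,
# so we feed it an inline example instead of fetching a real one.
rules = robotparser.RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /forum/",
    "Crawl-delay: 10",
])

allowed = rules.can_fetch("ExampleBot", "https://example.invalid" + url_path)
blocked = rules.can_fetch("ExampleBot", "https://example.invalid/forum/topic/1")
delay = rules.crawl_delay("ExampleBot")   # seconds to wait between requests
```

A crawler that gets these two steps right, and actually honours the crawl delay, would already avoid most of the problems in the list above.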
Of course, no one can force anyone to comply with these standards, but if we detect a crawler that isn't following our suggestions, we'll usually ban it from the site, or take it up with the crawler's owner.

zzattack wrote on Thursday 01 December 2011 @ 17:12:
Are crawlers forced to comply with Disallows per User-Agent in robots.txt?
Also, it's good and logical that you ban such junk.
Then we could potentially reuse this post as a reference in bug reports as well.

Wilko wrote on Friday 02 December 2011 @ 08:42:
Interesting, but why in English?
- they're generally lousy crawlers > they generally have lousy crawlers (the search engine isn't a crawler; it uses/has a crawler)
- especially compared to [...] > especially when compared to (I think; can someone tell me whether that's right or wrong?)
- usefull > useful
- skulpting > sculpting
- we've in the order of a hundred million [...] > we've got in the order of a hundred million
I must say, there's still a lot of work to be done, but all these new initiatives sound fine to me!