As a publisher, we see one very common denominator: they're generally lousy crawlers, especially compared to Googlebot and Bingbot. Tweakers.net offers over a million "articles" (forum topics, news articles, reviews, etc.). Many of those pages contain links that tweak some aspect of the current view (sorting, paging, filtering). On top of that, we offer links to more fine-grained information, like a direct link to a comment on an article or topic.
So while we offer over a million articles, we have on the order of a hundred million unique links on our site. We don't need, nor want, all those URLs treated as important and/or distinct pages. That is, we don't want or need those URLs followed: they're there for our visitors, but not very useful to search engines. We generally don't even consider those URLs separate pages, and the underlying content will have been indexed via some other (more important) link anyway.
So to steer crawlers, we mark such links with rel='nofollow' and add a line to our robots.txt where possible. Apart from having popularised these markers, and generally obeying them, Googlebot and Bingbot offer additional tools to reduce this problem. For instance, Google allows you to specify how GET parameters should be interpreted, so it knows how to treat links that contain them.
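The robots.txt side of this is simple to honour; a minimal sketch using Python's standard robotparser (the disallow rules and hostnames below are made-up examples, not our actual robots.txt):

```python
# Sketch: how a well-behaved crawler should consult robots.txt before
# fetching a URL. The rules and URLs here are hypothetical.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /comments/
Disallow: /search
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A normal article page is fine; the fine-grained comment links are not.
print(rp.can_fetch("ExampleBot", "https://example.net/reviews/1234"))   # True
print(rp.can_fetch("ExampleBot", "https://example.net/comments/5678"))  # False
```

A crawler that fetches robots.txt once per host and runs every candidate URL through a check like this already avoids most of the noise described above.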
Yes, we're one of the publishers that use rel=nofollow simply to mark URLs to be excluded from the link graph. We don't really care about the added bonuses it used to have, like 'PageRank sculpting'. We just really don't want or need those URLs indexed, saving "them" and us processing time and leaving crawlers room to actually spend their time indexing the parts we do want indexed.
The new crawlers generally lack (full) support for these steering mechanisms, even the more basic ones. It's not uncommon for them to:
- Ignore nofollow on links or in meta tags
- Ignore noindex in meta tags
- Partially or fully ignore the robots.txt
- Fail to fully understand the robots.txt
- Fail to limit their request rate to some decent value
- Fail to understand how relative links work in combination with a base href (until a few years ago, Google didn't understand that too well either)
- Fail to understand that links are given in HTML, and should thus be correctly decoded (i.e. an &amp; in the markup should become a plain & in the URL)
- Not offer any insight into why our pages are being crawled
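The two link-parsing failures in that list are particularly easy to get right; a minimal sketch (the base href and link are hypothetical examples):

```python
# Sketch: two link-handling details the list above says new crawlers get
# wrong: resolving a relative link against a <base href>, and decoding
# HTML entities before treating an attribute value as a URL.
from html import unescape
from urllib.parse import urljoin

base_href = "https://example.net/forum/"   # hypothetical <base href> value
raw_attr = "topic?id=42&amp;page=2"        # href exactly as it appears in the HTML

# Decode the HTML entity first, then resolve against the base.
url = urljoin(base_href, unescape(raw_attr))
print(url)  # https://example.net/forum/topic?id=42&page=2
```

A crawler that skips the unescape step ends up requesting URLs containing a literal "&amp;", which the server sees as a garbage parameter.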
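Limiting the request rate doesn't take much either; a toy per-host politeness delay, purely illustrative (the class and parameter names are made up):

```python
# Sketch: a minimal per-host delay, the kind of request limiting the
# list above says many new crawlers skip. Illustrative only.
import time

class PoliteFetcher:
    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_request = {}  # host -> monotonic timestamp of last request

    def wait_for(self, host):
        # Sleep just long enough so consecutive requests to the same host
        # are at least `delay_seconds` apart.
        now = time.monotonic()
        last = self.last_request.get(host)
        if last is not None and now - last < self.delay:
            time.sleep(self.delay - (now - last))
        self.last_request[host] = time.monotonic()
```

Real crawlers would adapt the delay per host (e.g. backing off on 429/503 responses), but even this fixed delay is more courtesy than some of the bots we see offer.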