6/19/2007

Hot Spider Action

Filed under: — Mike @ 3:21 am

It seems that lately we’re paying a hosting bill for America’s Debate so that companies can spider our content. I’m really getting sick of it.

I took a look at our last 300 visitors to the site.

Of course, there are some spiders that I don’t mind: Google, which spidered 102 pages, all from the same IP, and MSNBot, which spidered 2 pages from the same IP.

But there are some spiders that are just driving me nuts:

  • SBIder/SBIder-0.8dev (http://www.sitesell.com/sbider.html) - I have no idea who these people are. They seem shady if you ask me. I almost think that they are spidering so that attentive site administrators visit their seemingly lame site. They’re more of a nuisance, spidering 3 pages from the same IP.
  • Speedy Spider (http://www.entireweb.com/about/search_tech/speedyspider/) - Not a big deal here. They seem like a new search engine. Two pages spidered from the same IP. No big deal, spider away.
  • ArabyBot (cble; Mozilla/5.0; GoogleBot; FAST Crawler 6.4; http://www.araby.com;) - An Arabic search engine. Not a big problem, spidering only 7 pages, all from the same IP. The part that I hate is that they seem to be quite unethical, listing GoogleBot and FAST Crawler in their user agent string even though they are almost certainly not affiliated with either.
  • ConveraCrawler/0.9e (+http://www.authoritativeweb.com/crawl) - These people seem shady. Their spider page is pretty vague, and they’re taking a lot of my pages: 190 pages from the same IP. Not a big deal, but still, give me a good explanation of what you’re doing with my pages.
  • Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) - Yahoo, the worst offender. Normally I don’t mind Yahoo spidering my site, but in this instance, I’m getting quite irritated. Yahoo has only taken 167 pages, which is not a lot if it means good inclusion in their engine. But the part that is driving me nuts is that they have used 146 unique IP addresses to get these pages. 146! That means that the guest count at the bottom of my forum is highly exaggerated, showing 146 more guests than it should. Shame on you, Yahoo! You need to use ONE IP for spidering, and only one IP. I’ll be emailing them.
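For anyone who wants to run the same kind of tally on their own visitor list, here is a minimal Python sketch. It assumes the recent visits are already available as (IP, user agent) pairs; the `summarize_spiders` function and the sample records are made up for illustration and are not the forum's actual log data.

```python
from collections import defaultdict


def summarize_spiders(visits):
    """Return {user_agent: (page_count, unique_ip_count)} for (ip, user_agent) pairs."""
    pages = defaultdict(int)   # pages requested per user agent
    ips = defaultdict(set)     # distinct IPs seen per user agent
    for ip, agent in visits:
        pages[agent] += 1
        ips[agent].add(ip)
    return {agent: (pages[agent], len(ips[agent])) for agent in pages}


# Hypothetical sample records (documentation-range IPs, not real traffic).
sample_visits = [
    ("192.0.2.10", "Googlebot"),
    ("192.0.2.10", "Googlebot"),
    ("198.51.100.1", "Yahoo! Slurp"),
    ("198.51.100.2", "Yahoo! Slurp"),
    ("198.51.100.3", "Yahoo! Slurp"),
]

for agent, (page_count, ip_count) in summarize_spiders(sample_visits).items():
    print(f"{agent}: {page_count} pages from {ip_count} IP(s)")
```

A spider whose IP count is close to its page count, like Yahoo above, is the one inflating the guest counter.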

Tips to spider owners: If you are spidering content, use only one IP. If you are spidering, use an honest user agent string. And lastly, if you are spidering, DO NOT request more than one page every 15-60 seconds. Why put unnecessary load on websites that are most likely running on shared hosting?
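To make those tips concrete, here is a rough Python sketch of a polite fetch loop: it sends one honest user agent string and waits between requests. The spider name, its info URL, and the `polite_crawl` helper are hypothetical, and the 15-second delay is just the low end of the range above. Running it from a single machine also keeps every request on one IP.

```python
import time
import urllib.request

# Hypothetical spider name and info page; an honest string names only yourself.
USER_AGENT = "ExampleSpider/0.1 (+http://www.example.com/spider.html)"
MIN_DELAY_SECONDS = 15  # low end of the 15-60 second range


def fetch(url):
    """Download one page, identifying the spider truthfully."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return response.read()


def polite_crawl(urls, fetch_page=fetch, delay=MIN_DELAY_SECONDS, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay` seconds between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # never more than one page per `delay` seconds
        pages.append(fetch_page(url))
    return pages
```

The `fetch_page` and `sleep` parameters exist only so the loop can be tried out without hitting a real site or actually waiting.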

That is all.
