Yahoo Spider Problems– Too many IPs!

So a suggestion I made over at the Yahoo suggestions board seems to be growing legs, so I figured I would post it here.

The long and short of it is that Yahoo has changed the way they crawl websites. Instead of a nice, slow meandering pace from limited originating IPs, Yahoo is now using a different IP for nearly every page request when crawling websites.

This inflates guest counts in nearly every script that totals or tracks guest visitors. It is also putting undue load on many servers.

Here is my Yahoo suggestion: http://suggestions.yahoo.com/detail/?prop=SiteExplorer&fid=31431

Here is the Digg: http://www.digg.com/software/Yahoo_Spider_wrecking_havoc_on_website_guest_tracking

Digg Podcast Submission Process = Slow

I submitted the America’s Debate Radio Podcast to Digg Podcasts on June 21, 2007.

It is now June 26, 2007, and they have yet to approve our podcast.

What specifically takes the folks at Digg so long? It should be simple: click the link, check whether the podcast is legit, and hit approve.

Sure, I could understand a delay like this if our podcast only had one or two episodes released at irregular intervals. But, we’ve been around for 72 weeks. Our new episodes come out every Wednesday evening / Thursday morning (depending on where you live). They are well-produced using better equipment than most podcasts out there.

I just don’t see any reason for this long of a delay.

Is there a trick to submitting a podcast to Digg to get it pushed through faster?

I’d really like to know.

Forced Spay/Neutering of Pets

So, there is an article linked off of Drudge about the proposed legislation requiring all pet cats and dogs to be spayed or neutered. Here is the article:

http://www.insidebayarea.com/ci_6218465?source=rss

My personal thoughts on the issue aside, I noticed a very flawed statistic offered by someone who is in favor of the bill:

“We kill a half-million animals every year and spend a quarter-billion dollars doing it,” said Sarah Eryavec, adoption supervisor at the Santa Cruz County SPCA shelter, referring to statewide estimates.

OK, so doing the math: $250,000,000 divided by 500,000 animals = $500 per animal.

Doesn’t that seem high? It sure does to me.

1. Euthanizing a cat costs about 85 cents (Source). Let’s say that a cat weighs about 10lbs. Let’s also say that, using proportional math, it would cost $8.50 to euthanize a 100lb dog. Hell, let’s round up– $10 per animal, regardless of weight or species.

2. Euthanizing a cat or a dog will consume about five minutes or less of a vet’s time, and five minutes of a veterinary assistant’s time (see article linked above in point 1). Let’s say, just for easy math, that both earn a full veterinarian’s wage. The average veterinarian earns $39.18 per hour (Source). $39.18 per hour divided by 12 euthanizations per hour equals about $3.27. Multiply that by two for the veterinarian and the assistant, and we’re up to $6.54 per animal. Let’s round it up to $10 as well.

3. Cremating your pet will run about $143.75. I came to this figure by averaging the costs for all animals from kitten to 120+ lb dog as quoted on this page. Let’s round up to $150, even though that is likely the price of a private cremation, and non-owned pets would more likely go into a group cremation, which costs about half.

So, $10 plus $10 + $150 = $170.

How exactly is the State of California paying 194% more to euthanize and cremate a stray animal than a super-high average cost to bulk euthanize and cremate a stray animal?

It just doesn’t add up.

That $250,000,000 figure seems more like it should be $85,000,000, and more likely around $45,000,000.
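
For anyone who wants to check the arithmetic, here is a rough sketch in Python. It only uses the figures quoted above; the variable names are mine, and the group-cremation line assumes "about half" means exactly half of the $150 estimate.

```python
# Figures quoted in the article.
total_spend = 250_000_000   # "a quarter-billion dollars"
animals = 500_000           # "a half-million animals"

claimed_per_animal = total_spend / animals

# Rounded-up per-animal estimates from points 1 to 3 above.
drug_cost = 10        # rounded up from about $8.50
labor_cost = 10       # rounded up from $6.54 (vet plus assistant)
cremation_cost = 150  # rounded up from $143.75 (private cremation)

estimated_per_animal = drug_cost + labor_cost + cremation_cost
percent_more = (claimed_per_animal / estimated_per_animal - 1) * 100
estimated_total = estimated_per_animal * animals

# Assumption: group cremation costs exactly half of the private figure.
group_per_animal = drug_cost + labor_cost + cremation_cost / 2
group_total = group_per_animal * animals

print(f"Claimed cost per animal:   ${claimed_per_animal:,.2f}")
print(f"Estimated cost per animal: ${estimated_per_animal:,.2f}")
print(f"Claimed is {percent_more:.0f}% more than estimated")
print(f"Estimated statewide total: ${estimated_total:,.0f}")
print(f"With group cremation:      ${group_total:,.0f}")
```

With exact halving, the group-cremation total lands a little above the $45,000,000 ballpark, which is close enough for a rough estimate.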

Yeah, morbid subject I know. But still, I hate bad math.

Hot Spider Action

It seems that lately we’re paying a hosting bill for America’s Debate so that companies can spider our content. I’m really getting sick of it.

I took a look at our last 300 visitors to the site.

Of course, we have some spiders that I don’t mind: Google, which spidered 102 pages, all from the same IP, and MSNBot, which spidered 2 pages from the same IP.

But, there are some spiders that are just driving me nuts:

  • SBIder/SBIder-0.8dev (http://www.sitesell.com/sbider.html) – I have no idea who these people are. They seem shady if you ask me. I almost think that they are spidering so that attentive site administrators visit their seemingly lame site. They’re more of a nuisance, spidering 3 pages from the same IP.
  • Speedy Spider (http://www.entireweb.com/about/search_tech/speedyspider/) – Not a big deal here. They seem like a new search engine. Two pages spidered from the same IP. No big deal, spider away.
  • ArabyBot (cble; Mozilla/5.0; GoogleBot; FAST Crawler 6.4; http://www.araby.com;) – An Arabic search engine. Not a big problem, spidering only 7 pages, all from the same IP. The part that I hate is that they seem to be quite unethical, listing GoogleBot and FAST Crawler in their user agent string, even though they are almost certainly not related to either.
  • ConveraCrawler/0.9e (+http://www.authoritativeweb.com/crawl) – These people seem shady. Their spider page is pretty vague, and they’re taking a lot of my pages– 190 pages from the same IP. Not a big deal, but still– give me a good explanation of what you’re doing with my pages.
  • Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) – Yahoo, the worst offender. Normally I don’t mind Yahoo spidering my site, but in this instance, I’m getting quite irritated. Yahoo has only taken 167 pages, which is not a lot if it means good inclusion in their engine. But the part that is driving me nuts is that they have used 146 unique IP addresses to get these pages. 146! That means that the guest count at the bottom of my forum is highly exaggerated, showing 146 more guests than it should. Shame on you, Yahoo! You need to use ONE IP for spidering, and only one IP. I’ll be emailing them.
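
A tally like the one above (pages and unique IPs per spider) is easy to script. Here is a minimal sketch in Python, assuming the visitor log has already been parsed into (ip, user_agent) pairs. The function name and the sample entries are made up for illustration, and the IPs come from the reserved documentation ranges, not from my real logs.

```python
from collections import defaultdict

def summarize_spiders(visits):
    """Return {user_agent: (page_count, unique_ip_count)} for (ip, user_agent) pairs."""
    pages = defaultdict(int)
    ips = defaultdict(set)
    for ip, agent in visits:
        pages[agent] += 1      # one log entry = one page request
        ips[agent].add(ip)     # a set keeps each IP only once
    return {agent: (pages[agent], len(ips[agent])) for agent in pages}

# Hypothetical sample entries.
sample = [
    ("192.0.2.1", "Googlebot"),
    ("192.0.2.1", "Googlebot"),
    ("198.51.100.7", "Yahoo! Slurp"),
    ("198.51.100.8", "Yahoo! Slurp"),
    ("198.51.100.9", "Yahoo! Slurp"),
]

for agent, (n_pages, n_ips) in summarize_spiders(sample).items():
    print(f"{agent}: {n_pages} pages from {n_ips} unique IP(s)")
```

A spider whose unique IP count is close to its page count is the kind that inflates a guest counter keyed on IP address.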

Tips to spider owners: If you are spidering content, use only one IP. If you are spidering, use an honest user agent string. And lastly, if you are spidering, DO NOT request more than one page every 15-60 seconds. Why put unnecessary load on websites that are most likely running on shared hosting?
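
To show how little it takes to follow those tips, here is a rough sketch of a polite fetch loop in Python. It is not any real spider's code. The bot name, the info URL in the user agent string, and the 15-second default are all placeholders I picked to match the tips above.

```python
import time
import urllib.request

# Hypothetical, honest user agent: says who the bot is and where to read about it.
USER_AGENT = "ExampleBot/0.1 (+http://example.com/bot.html)"
DELAY_SECONDS = 15  # low end of the 15-60 second range suggested above

def fetch(url):
    """Fetch one page, identifying the bot truthfully."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return response.read()

def crawl(urls, fetch_page=fetch, delay=DELAY_SECONDS, sleep=time.sleep):
    """Fetch pages one at a time, pausing between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # wait before every request after the first
        pages[url] = fetch_page(url)
    return pages
```

Because the loop is sequential and runs from a single machine, the requests would normally all come from one IP, and the target site never sees more than one request per delay interval.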

That is all.

Vadixbot – Look Out!

Jaime noticed heavy traffic on America’s Debate tonight, so I did some digging. It turns out we were being spidered by a bot called “VadixBot.”

Who the hell are these Vadixbot people?

In just under 7 minutes, these jerks grabbed precisely 845 of our pages, averaging about two pages per second and wasting around 10 megabytes. As far as I can tell, they had been at it for several hours, if not more.
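
Here is the quick math behind those numbers as a small Python sketch. I'm treating "just under 7 minutes" as a full 7 minutes and "around 10 megabytes" as 10 × 1024 × 1024 bytes, so both results are approximations.

```python
pages = 845
seconds = 7 * 60                  # assumption: a full 7 minutes
total_bytes = 10 * 1024 * 1024    # assumption: 10 megabytes, binary units

pages_per_second = pages / seconds
bytes_per_page = total_bytes / pages

print(f"{pages_per_second:.2f} pages per second")
print(f"{bytes_per_page:,.0f} bytes per page on average")
```

Compare that to the one page every 15 to 60 seconds that a well-behaved spider should be requesting.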

Here’s a sample of the latest visitor entry:

Host: 70.112.211.26

* /forums/index.php?s=9feb85cf271657f5d2d05b1d8f3f71bb&showuser=386
Http Code: 200 – Date: Jun 06 08:26:09 – Http Version: HTTP/1.1 – Size in Bytes: 12474
Referer: -
Agent: VadixBot

Here’s the WhoIs record on the IP:

Whois Record
IP Information 70.112.211.26
Record Type: IP Address
IP Location: United States – Texas – Austin – Road Runner Holdco Llc
Reverse DNS: cpe-70-112-211-26.austin.res.rr.com
Blacklist Status: Currently Listed (history)
Whois Record

OrgName: Road Runner HoldCo LLC
OrgID: RRSW
Address: 13241 Woodland Park Road
City: Herndon
StateProv: VA
PostalCode: 20171
Country: US

ReferralServer: rwhois://ipmt.rr.com:4321

NetRange: 70.112.0.0 – 70.127.255.255
CIDR: 70.112.0.0/12
NetName: RRSW
NetHandle: NET-70-112-0-0-1
Parent: NET-70-0-0-0-0
NetType: Direct Allocation
NameServer: DNS1.RR.COM
NameServer: DNS2.RR.COM
NameServer: DNS3.RR.COM
NameServer: DNS5.RR.COM
NameServer: DNS6.RR.COM
Comment:
RegDate: 2004-09-17
Updated: 2006-06-06

OrgAbuseHandle: ABUSE10-ARIN
OrgAbuseName: Abuse
OrgAbusePhone: +1-703-345-3416
OrgAbuseEmail: Whois Privacy and Spam Prevention by DomainTools.com

OrgTechHandle: IPTEC-ARIN
OrgTechName: IP Tech
OrgTechPhone: +1-703-345-3416
OrgTechEmail: Whois Privacy and Spam Prevention by DomainTools.com

Yeah, I know it says Virginia, but the IP is most likely out of Texas:

IP address: 70.112.211.26
Reverse DNS: cpe-70-112-211-26.austin.res.rr.com.
Reverse DNS authenticity: [Verified]
ASN: 11427
ASN Name: SCRR-11427
IP range connectivity: 1
Registrar (per ASN): ARIN
Country (per IP registrar): US [United States]
Country Currency: USD [United States Dollars]
Country IP Range: 70.96.0.0 to 70.127.255.255
Country fraud profile: Normal
City (per outside source): Austin, Texas
Country (per outside source): US [United States]
Private (internal) IP? No
IP address registrar: whois.arin.net
Known Proxy? No
Link for WHOIS: 70.112.211.26

My recommendation? Block them. These jerks didn’t read my robots.txt, and were hammering my site. They aren’t welcome back as a result. :)
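
If you want to block by user agent at the application level, the idea looks something like the sketch below. It is written as Python WSGI middleware purely for illustration. It is not what my forum software runs, and the names `BLOCKED_AGENTS`, `is_blocked`, and `block_bad_bots` are made up for this example.

```python
# Lowercase substrings of user agents to refuse (hypothetical list).
BLOCKED_AGENTS = ("vadixbot",)

def is_blocked(user_agent):
    """True if the user agent string contains a blocked bot name."""
    agent = (user_agent or "").lower()
    return any(name in agent for name in BLOCKED_AGENTS)

def block_bad_bots(app):
    """WSGI middleware: answer 403 to blocked agents, pass everyone else to app."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

Keep in mind that a bot can change its user agent string at any time, so this only stops spiders that identify themselves honestly. Blocking the offending IP at the server or firewall is the usual fallback.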

Computer Upgrades

Well, it’s that time again. Time to upgrade my computer. This particular upgrade was spurred on by the need to upgrade Jaime’s computer, which has a crappy Soyo motherboard whose bad capacitors shut the computer down at random intervals.

So here’s the rundown after the upgrades are complete:

  • My Computer
    • (New) AMD Athlon 64 X2 4600+(65W) Windsor 2.4GHz Socket AM2 Processor Model ADO4600CUBOX
    • (New) G.SKILL 2GB (2 x 1GB) 240-Pin DDR2 SDRAM DDR2 800 (PC2 6400) Dual Channel Kit Desktop Memory Model F2-6400CL5D-2GBNQ
    • (New) MSI K9N Neo-F Socket AM2 NVIDIA nForce 550 MCP ATX AMD Motherboard
    • (New) FREETECH PX6200TD-128M GeForce 6200 128MB DDR PCI Express x16 Video Card
    • (New) 2 x Seagate Barracuda 7200.10 ST3320620AS (Perpendicular) 320GB 7200 RPM 16MB Cache SATA 3.0Gb/s Hard Drive
    • M-Audio Delta 66 Professional 6-In/6-Out Audio Card w/Digital I/O
    • Logitech Cordless Desktop S510 Keyboard and Mouse Combo
    • Dell UltraSharp 1905FP 19-inch Flat Panel Monitor with Height Adjustable Stand
    • Toshiba SD-R5112 DVD-R/RW Drive
    • SkyHawk MSR-4610 Silver 1.2mm All Aluminum ATX Mini Server Case
    • Rosewill RE501-SLV ATX 12V v2.03 500W Aluminum Power Supply
  • Jaime’s Computer
    • Intel® Pentium® 4 Prescott Processor 2.80EGHz, 800MHz FSB, Socket 478, 1MB Cache
    • Patriot eXtreme Performance 1GB (2 x 512MB) 184-Pin DDR SDRAM DDR 400 (PC 3200) Dual Channel Kit Desktop Memory Model PDC1G3200LLK
    • MB MSI|PT880 NEO-LSR MS-7008 RET
    • ECS R9200-128DV Radeon 9200 128MB DDR AGP 4X/8X Video Card
    • 2 x Maxtor DiamondMax Plus 9 6Y080M0 80GB 7200 RPM Serial ATA150 Hard Drive
    • M-Audio Audiophile 2496 4-In/4-Out Audio Card w/MIDI & Digital I/O
    • KB LITE-ON|SLIM STANDARD SK-1789/BS
    • Logitech Cordless Optical Mouse
    • SAMSUNG 940BX Black 19″ 5ms DVI LCD Monitor with Height Adjustments
    • Benq 48x12x48 IDE CDRW Drive
    • Chieftec Mini Dragon Case – RED
    • Ultra 400 Watt ATX PS With 120mm Blue LED Fan
  • Utility Computer
    • AMD Athlon XP 2600+ Barton 1.917GHz Socket A Processor Model AXDA2600BBOX
    • mushkin 512MB 184-Pin DDR SDRAM DDR 400 (PC 3200) Desktop Memory Model 991093
    • PQI POWER 512MB DDR PC2700
    • MB SIS741GX ECS|741GX-M ATHLON ATX
    • 2 x Maxtor DiamondMax Plus 9 6Y160P0 IDE Ultra ATA133 160GB 7200 RPM 8MB Cache Hard Drive
    • Maxtor DiamondMax 10 6B080P0 IDE Ultra ATA133 80GB 7200 RPM 8MB Cache Hard Drive
    • Seagate 250 GB Internal Hard Drive 16 MB Cache (ST3250623A)
    • M-Audio Delta 44 Professional 4-In/4-Out Audio Card
    • RAIDMAX Astro ATX268WUP Blue 0.7mm Japanese SECC ATX Mid Tower Computer Case
    • Rosewill Value RV400S ATX 12V v1.3 400W Power Supply

After a huge hassle with PayPal and the crappy PayPal Debit Card (I’m canceling mine), it is all finally ordered and on the way. I’m really looking forward to getting it!