
Leading Community for Usability, Search Engine Marketing,
Social Networking, Site Planning & Web Site Development, Since 1998



Filthy Rotten Spammer! A Spammy Bots Rant


11 replies to this topic

#1 iamlost

iamlost

    The Wind Master

  • Admin - Top Level
  • 3991 posts

Posted 22 October 2012 - 03:31 PM

Michael Martinez lets rip a good one in The Difference Between a Good Crawler and a Spammy Rogue Crawler, SEO Theory, 20-October-2012.

If you run a Web crawler you are a filthy spammer UNLESS you do the following:

* Publish a user-agent on your Website.
* Identify the user-agent in every fetch request the crawler makes.
* Honor the Robots Exclusion Standard (Cf. http://www.robotstxt.org/).

Your classification as filthy bottom-feeding scum for running a rogue crawler is not negotiable.


!!!! :D

So the Rules for Crawlers Are Simple

Rule No. 1: Respect the Website. This means your crawler should identify itself and it should honor the Robots Exclusion Standard.

Rule No. 2: Make No Exceptions. If in your bid to become the next big tech billionaire you find that a million Websites are blocking your user-agent and IP address, STAY OFF THOSE SITES. We don’t owe you anything. You owe US.

Rule No. 3: Don’t Take it Personally. The Internet was not created so that you can make billions of dollars by stealing other people’s content and using it for your own benefit. So when you choose to do this and treat us as nothing more than ants to be crushed, we’re going to consider you to be bottom-feeding filth because that is what you are.
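For illustration, the requirements quoted above (identify the user-agent in every request, honor robots.txt) might look roughly like this on the crawler side. This is a minimal Python sketch, not code from the article: the bot name, its info URL and the sample robots.txt are hypothetical, and the rules are parsed in memory so nothing is fetched.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity; a real crawler would publish this string on its website.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot.html)"

# Hypothetical robots.txt, supplied as text so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True only if the rules allow our user-agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def build_request(url: str) -> Request:
    """Build (but do not send) a request that identifies the crawler."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

A crawler following the quoted rules would call may_fetch() before every request and skip the URL when it returns False, with no exceptions.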


While I loved his rant and generally shouted Yes!!! or Huzzah! with every sentence, I do disagree with this: "...Google honors the Robots Exclusion Standard..."
Not always. And neither do Bing and other major SEs. They do run 'cloaked' bots, headless browsers, etc. as well as fudge on the 'meaning' of bot, user-agent et al to excuse apparent rogue crawler and fetcher behaviours.

It still behooves webdevs to build in appropriate bot defences in depth - unfortunately there is still no simple plug and play version that I know of to recommend - but the parameters are only a search or three away, some even with basic code examples (caveat emptor). A typical website analysis shows 15-20% of its traffic is bot; some exceed 80-90%. That is a lot of scraped content, wasted bandwidth, diluted conversion stats, etc. I even block most Google and Bing bots and other critters because they do not offer value in return; your needs may differ.

Great read. Thanks, Michael.

#2 cre8pc

cre8pc

    Dream Catcher Forums Founder

  • Admin - Top Level
  • 13016 posts
  • Twitter:https://twitter.com/kim_cre8pc
  • Facebook:https://www.facebook.com/cre8pc

Posted 22 October 2012 - 03:55 PM

Yay! Reading this on the plane so I can't yell too loudly but this is so awesome! Michael is right on!

#3 bobbb

bobbb

    Time Traveler Member

  • 1000 Post Club
  • 1428 posts

Posted 22 October 2012 - 05:33 PM

He's right of course.

The comments are also interesting. As the first commenter said:

"you really just need to get over it and accept that your content’s going to be crawled"

But what are we to do? You would need to spend all day defending your content. So you kill IP ranges and check user-agents. The latter just gets the ones that are "honest" enough to ID themselves. How many just say they are IE?
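As a rough sketch of the "kill IP ranges and check user-agents" approach in Python: the networks below are documentation-only ranges (RFC 5737) and the user-agent substrings are made-up placeholders, so none of this is a real block list.

```python
import ipaddress

# Hypothetical block list: documentation-only ranges, not real offenders.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]

# Hypothetical substrings of user-agents that identify themselves as bots.
BLOCKED_UA_SUBSTRINGS = ("badbot", "examplescraper")


def is_blocked(ip: str, user_agent: str) -> bool:
    """Return True if the client IP or its user-agent string is on the block list."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_SUBSTRINGS)
```

The user-agent half of the check only catches bots that name themselves: a scraper on an unlisted IP that sends an ordinary IE string passes straight through.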

Another commenter states:

"if the door is closed, and if you knock that door twice and no one opens, then do not try to open it because you’re not welcome"

But the door is open at port 80. Not even a need to knock.

They are easy to spot in the logs: they just get pages and never the .css or .js or images. Amazon-AWS is a very real culprit.
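A minimal Python sketch of that log check, assuming an Apache-style access log where each line starts with the client IP and contains the quoted request. The regular expression, the list of asset suffixes and the min_pages threshold are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Matches the start of a common/combined log format line (assumed layout).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+)'
)

# File types a normal browser would also request while rendering a page.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico")


def suspected_bots(lines, min_pages=3):
    """Return IPs that fetched at least min_pages pages and no assets at all."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        path = match.group("path").split("?", 1)[0].lower()
        if path.endswith(ASSET_SUFFIXES):
            assets[match.group("ip")] += 1
        else:
            pages[match.group("ip")] += 1
    return sorted(
        ip for ip, count in pages.items()
        if count >= min_pages and assets.get(ip, 0) == 0
    )
```

Treat the result as a list of candidates, not a verdict: a returning visitor with everything cached, or a text-only browser, can show the same pattern.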

Maybe a solution is a class-action suit. How and who? Amazon? It may be a good start. If they would add a clause in their TOS about crawling as IE, then a complaint would do it.

that permission is sought otherwise the IP addresses of such Internet low lives should be blocked and reported

Are you going to spend any effort to report when it comes from .ua or .ru or .cz? Hardly! (Apologies to Kaspersky and AVG)

Edited by bobbb, 22 October 2012 - 05:46 PM.


#4 iamlost

iamlost

    The Wind Master

  • Admin - Top Level
  • 3991 posts

Posted 22 October 2012 - 06:27 PM

But what are we to do? You would need to spend all day defending your content. So you kill IP ranges and check user-agents.

No. You do need to spend some time setting up your defences but after that it should be automatic except for a couple hours (depending on requirements) a week in maintenance and analysis. Automation is a webdev's BFF. :)

Unfortunately some of the really effective, efficient methods require a dedicated server (for proper command and control), a problem for many smaller sites/businesses on shared servers. However, one can still cut out a majority of bots even on shared hosting with effective use of htaccess and scripts. It is not something that most webdevs see as having value, which is fine by me.

#5 glyn

glyn

    Sonic Boom Member

  • 1000 Post Club
  • 1858 posts

Posted 23 October 2012 - 05:33 AM

This is the reason why I host with a server that gives me unlimited bandwidth - come crawl, scrape, do what you want; it makes no difference to me.

Google doesn't respect robots exclusion protocols either....woohooo big surprise there!

Linkbait linkbait :)

#6 bobbb

bobbb

    Time Traveler Member

  • 1000 Post Club
  • 1428 posts

Posted 23 October 2012 - 04:04 PM

Google doesn't respect robots exclusion protocols either

This is interesting. Can you give examples? I'm biting.

They do run 'cloaked' bots, headless browsers, etc

I've suspected this also. Can you give examples?

Edited by bobbb, 23 October 2012 - 04:05 PM.


#7 glyn

glyn

    Sonic Boom Member

  • 1000 Post Club
  • 1858 posts

Posted 24 October 2012 - 02:53 AM

I thought I just gave you an example. I think what you are asking for is proof, and that is inside the log files of a client.

#8 iamlost

iamlost

    The Wind Master

  • Admin - Top Level
  • 3991 posts

Posted 24 October 2012 - 09:31 AM

bobbb: as glyn says, the proof eventually shows up in one's log file. If a user-agent (mis)behaves in certain ways and reverse DNS confirms its origin...
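The reverse DNS check can be sketched in Python as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it returns the same IP. The suffixes below are the domains Google documents for Googlebot; the function name and the injectable lookup parameters are choices made for this sketch, so it can be exercised without live DNS.

```python
import socket

# Hostname suffixes to trust; these are the domains Google documents for Googlebot.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip):
    """Reverse DNS: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    """Forward DNS: hostname to its list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(ip, reverse_lookup=_reverse, forward_lookup=_forward,
                        suffixes=TRUSTED_SUFFIXES):
    """Return True only if reverse and forward DNS agree on a trusted hostname."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(suffixes):
            return False
        # Forward-confirm: the claimed hostname must resolve back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        # No PTR record or a failed lookup: treat the client as unverified.
        return False
```

This only answers whether an IP that claims to be a search engine bot really belongs to that search engine. It says nothing about a crawler that arrives under a plain browser user-agent.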

#9 bobbb

bobbb

    Time Traveler Member

  • 1000 Post Club
  • 1428 posts

Posted 24 October 2012 - 10:14 AM

If a user-agent (mis)behaves in certain ways and reverse DNS confirms its origin

Oh OK. That I knew. Like when an IP from Brazil says it's googlebot. I thought you meant something more sneaky like G coming from a non-google IP (as per ARIN) and saying it is just Chrome. I expect they do that.

I thought I just gave you an example

That explains it. I have no excludes for G so I would see nothing. You said linkbait so I presumed you were preparing to drop one. I'll set up an exclude and see.

Edited by bobbb, 24 October 2012 - 10:14 AM.


#10 iamlost

iamlost

    The Wind Master

  • Admin - Top Level
  • 3991 posts

Posted 24 October 2012 - 10:44 AM

I thought you meant something more sneaky like G coming from a non-google IP (as per ARIN) and saying it is just Chrome. I expect they do that.

Google and other SEs have a history of coming from IPs that 'suddenly' have a backdated registration as being theirs. By running an analysis of the traffic between the two dates one can get a sense of the cloaked user-agent strings (and other system fingerprint data that I also log) and be reasonably confident on such sneaky behaviours. Such 'fingerprint' data can sometimes be used to identify future accesses by the same 'user'.

As some bots are designed to behave as a typical human rather than mindless slurper/followers it can be a challenge - which I enjoy some days - to design tripwires that don't return false positives. And of course Google and other SEs do employ real people to come and check out sites...or perhaps they are just employees doing personal stuff on company time...? :)

#11 EGOL

EGOL

    Eyes Like Hawk Moderator

  • Moderators
  • 4576 posts

Posted 24 October 2012 - 05:09 PM

This is the reason why I host with a server that gives me unlimited bandwidth -

Are you sure that there isn't some type of throttle on your processing, database connections, or some other resource that makes unlimited bandwidth BS?

#12 glyn

glyn

    Sonic Boom Member

  • 1000 Post Club
  • 1858 posts

Posted 25 October 2012 - 07:52 AM

Can't say for sure EGOL, never asked, never had sites go down, never had a problem with them. Always been amazing support.


