
Cre8asiteforums Internet Marketing
and Conversion Web Design



Bad Bots


11 replies to this topic

#1 nuts


    Mach 1 Member

  • Members
  • 307 posts

Posted 21 September 2011 - 10:15 AM

Hello everybody

It's been a while since I visited; I've mostly been busy with some technical work and moving back to the US of A.

This is a question I couldn't answer on the really geeky tech forums, so I am putting it out to the "generalist" body of cre8...not 100% sure of the correct forum, but security sounds good enough...

Nice new look by the way...


I have had some server-slowing periods stemming from discourteous bots originating on amazonaws.com.

I tried contacting amazonaws, but their reporting process required a lengthy online form...then it turned out the form didn't work correctly.

Ultimately I blocked all their reported IP ranges at server firewall level and completely solved the problem.

But here's the question: Am I cutting off my nose to spite my face? Throwing out the baby with the bathwater?

Is there any server traffic coming from amazonaws that is doing me any good? For example, bots that actually get me directory links and generate visitors/traffic/links/revenues?

One thing I discovered is that bit.ly uses bitlybot, hosted on amazonaws. Apparently it is used to read page titles, which are then displayed in mouseovers on bit.ly URLs. I wrote to bit.ly; they responded immediately, I sent off the list of blocked IP ranges (it had to go to their tech department), and I have not heard back since. I suspect that amazonaws assigns IP addresses at random within a geographical area, so there is no relevant "range" that might be used by a single amazonaws client such as bit.ly.

Any and all insights welcome as always...

Cheers
Mike

#2 DonnaFontenot


    Peacekeeper Administrator

  • Site Administrators
  • 3821 posts

Posted 21 September 2011 - 11:42 AM

It's probably impossible to know if you might be missing out on something great, either now or in the future. I was just reading this: http://blog.red7.com...rom-amazon-aws/ and he mentioned only blocking unidentified bots from amazonaws, which he says are the most obnoxious ones. That might be a good compromise solution.
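For what it's worth, that compromise could be sketched in .htaccess along the lines below: deny the AWS ranges by default, but let through requests whose user-agent you have decided to trust. This is only an illustration; the two ranges are examples only (a fuller AWS list appears later in this thread), and bitlybot is used purely as an example of a bot you might choose to whitelist.

# Sketch only: deny AWS address ranges, but allow requests from bots that
# identify themselves with a user-agent token you trust (example: bitlybot).
# The ranges below are examples, not a complete AWS list.
SetEnvIfNoCase User-Agent "bitlybot" trusted_aws_bot

Order Deny,Allow
Deny from 23.20.0.0/14
Deny from 50.112.0.0/16
Allow from env=trusted_aws_bot

With Order Deny,Allow the Allow wins when both match, so a request from an AWS address with the trusted user-agent still gets through, while everything else from those ranges is refused.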

#3 Michael_Martinez


    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 21 September 2011 - 12:36 PM

If you're not seeing any referrals from their site then I would say there is no point in letting them crawl your content. If you do see referrals, then you have to calculate the return on investment.

I sometimes throttle Bing on my own server and then let it back in after a week or so. I used to throttle Yahoo! much more (now I just block Slurp completely).
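For anyone curious what a complete block of a single crawler looks like at the web-server level, a minimal .htaccess sketch matching Yahoo!'s Slurp by user-agent might be the following (the environment-variable name is arbitrary; throttling, as opposed to blocking, needs something beyond plain Allow/Deny, such as robots.txt crawl-delay rules or server-side rate limiting):

# Sketch only: refuse any request whose user-agent contains "Slurp"
# (Yahoo!'s crawler). The env variable name is arbitrary.
SetEnvIfNoCase User-Agent "Slurp" block_this_bot

Order Allow,Deny
Allow from all
Deny from env=block_this_bot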

#4 iamlost


    The Wind Master

  • Site Administrators
  • 4628 posts

Posted 22 September 2011 - 10:23 AM

:wave: nuts
Welcome back.

For many of us who actively block bots, amazonaws (AWS) is perhaps the one standard default block :)
Yes, there are legitimate services on AWS, including many/most/all of Amazon's own properties and related services, e.g. Alexa, Archive.org, IMDb; which, if any, of their crawlers you consider valuable to your site is a business decision.
Personally I block them all with a vengeance, because AWS not only serves up bots by the zillion but also hosts zillions of proxy servers utilised by scrapers (and scammers).
With one exception: because my sites are Amazon Affiliates I must allow AMZNKAssocBot/4.0 to crawl.
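If it helps anyone, that exception can ride on top of the same deny-AWS-by-IP / allow-by-user-agent pattern sketched after post #2. Assuming the Associates crawler keeps announcing itself with the AMZNKAssocBot token, the extra .htaccess lines would be roughly:

# Sketch only: inside an existing Order Deny,Allow block that denies the AWS
# ranges, these two lines let the Amazon Associates crawler back through,
# identified by its user-agent token.
SetEnvIfNoCase User-Agent "AMZNKAssocBot" amazon_assoc_bot
Allow from env=amazon_assoc_bot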

Note: if you hate AWS you might also love to hate Google App Engine. If AWS is default block #1, GAE is #2.

#5 nuts


    Mach 1 Member

  • Members
  • 307 posts

Posted 22 September 2011 - 11:01 AM

Thanks Cre8's

I just realized I had to allow cre8 emails with Postini, which by the way does a fantastic job and has not only dramatically reduced spam (and the daily time needed to deal with it) but also lowered the server load from SpamAssassin.

I'm glad to see that others have taken the same tack. Since I dropped the Amazon affiliate stuff (due to zero revenues) -- along with eBay (due to their API problems plus low revenues) and pretty much all CPA advertising -- no problem there.

Donna, I read the link; however, that approach is labor-intensive and creates server load, while firewall blocking stops the traffic before it hits the server. I *love* my managed hosting (ahem) **plug** for datapipe.net.

Personally I have never seen the value of Alexa; I mean, a visitor has to have the Alexa toolbar for hits to count, right? Wayback (Archive.org) may have some value, but how much, really?

Iamlost, regarding Google App Engine -- my understanding is that it is used by individuals and the IP numbers **should not** cross over with Google Maps, right?

Cheers
Mike

#6 Michael_Martinez


    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 22 September 2011 - 12:52 PM

Personally I have never seen the value of alexa, I mean, a visitor has to have the alexa toolbar for hits to count, right?


No. They claim they gather data from other sources as well. I think most people are still looking at Alexa as if it hadn't changed the way it functions several years ago.

From How are Alexa’s traffic rankings determined?: "Alexa’s Traffic Ranks are based on the traffic data provided by users in the Alexa Toolbar panel and data collected from other, diverse sources over a rolling 3 month period."

In addition to the Alexa toolbar, they also collect data from their analytics package. And several years ago I read that like Compete and other metrics services they were buying aggregated user data from various ISPs. I don't know if they are still doing that today.

Alexa is actually a much more reliable data source than most people believe it to be, and has been for several years.

That said, all third-party metrics services operate at a disadvantage as they only have incomplete data to work with for the Web in general. Alexa points that out in one or two of their FAQ answers.

#7 nuts


    Mach 1 Member

  • Members
  • 307 posts

Posted 22 September 2011 - 01:00 PM

Michael, what I read in your comment suggests that if Alexa is getting their data from other sources, it might not matter whether I block them or not...

#8 Michael_Martinez


    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 23 September 2011 - 11:53 AM

If they collect enough data about a website they will try to rank it. I think they advise people not to put too much stock in rankings beyond the top 100,000. Their estimates for the top 100,000 sites (1..100,000) are apparently more reliable than for sites about which they have less data.

#9 nuts


    Mach 1 Member

  • Members
  • 307 posts

Posted 23 September 2011 - 07:13 PM

I just realized that Postini is a Google app, so I am probably better off not blocking Google App Engine across the board.

#10 iamlost


    The Wind Master

  • Site Administrators
  • 4628 posts

Posted 29 September 2011 - 11:57 AM

With the release of Amazon's Kindle Fire tablet upcoming at a competitor-hosing price, I expect its uptake to be swift and considerable. And that raises several problem questions...

First, as its Silk browser will leverage the AWS cloud (in a similar manner to Opera, it looks like), those of us simply blocking AWS IPs will need to reconsider our stance. How to adjust will depend on how Silk uses AWS: dedicated IPs or random ones. If dedicated, the fix is simple; if random, then one will have to consider user-agent identification (a rough sketch follows below)...and how long before scammer-scrapers are masquerading as Silk?

Second, as it will also use prefetching, webdevs such as myself who block prefetch requests will have to learn the best method of identifying and blocking this one too.
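On the user-agent route: a speculative sketch, assuming the shipping Silk browser keeps a recognisable "Silk" token in its user-agent string, would be to whitelist that token inside an existing AWS deny block so Kindle Fire visitors proxied through AWS still get through:

# Speculative sketch: within an existing Order Deny,Allow block that denies
# AWS ranges, let requests whose user-agent contains "Silk" back through so
# Kindle Fire visitors proxied via AWS are not refused.
SetEnvIfNoCase User-Agent "Silk" amazon_silk_browser
Allow from env=amazon_silk_browser

Of course a scraper can spoof that token, which is exactly the masquerading worry above, and prefetch requests would need their own header-based test once Silk's prefetch behaviour is documented.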

Running a 'clean' site for ease of calculating direct ad visitor metrics is an uphill battle on an ever-rising hill...

#11 nuts


    Mach 1 Member

  • Members
  • 307 posts

Posted 12 January 2012 - 10:26 PM

Hello

I am posting an update to the list of amazonaws IP ranges. This comes from https://forums.aws.a...jspa?annID=1252 posted Dec 14, 2011. The list is long; I blocked the older ranges months ago. Here are a few new bad actors:


23.20.0.0/14 (23.20.0.0 - 23.23.255.255) NEW
50.112.0.0/16 (50.112.0.0 - 50.112.255.255)
184.169.128.0/17 (184.169.128.0 - 184.169.255.255) NEW
176.34.64.0/18 (176.34.64.0 - 176.34.127.255) NEW
176.34.0.0/18 (176.34.0.0 - 176.34.63.255) NEW
177.71.128.0/17 (177.71.128.0 - 177.71.255.255) NEW

Cheers
Mike

#12 Michael_Martinez


    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 13 January 2012 - 01:39 PM

If people want to just cut and paste into their .htaccess files:
# Block Amazon AWS-based crawlers

deny from 23.20.0.0/14
deny from 50.112.0.0/16
deny from 184.169.128.0/17
deny from 176.34.64.0/18
deny from 176.34.0.0/18
deny from 177.71.128.0/17
# end block Amazon AWS-based crawlers



