
Cre8asiteforums

Web Site Design, Usability, SEO & Marketing Discussion and Support


Hello everybody

 

It's been a while since I visited; I've mostly been involved with some technical stuff and moving back to the US of A.

 

This is a question I couldn't answer on the really geek tech forums, so I am putting it out to the "generalist" body of cre8...not 100% sure of correct forum, but security sounds good enough...

 

Nice new look by the way...

 

 

I have had some server-slowing periods stemming from discourteous bots originating on amazonaws.com.

 

I tried contacting amazonaws, but their reporting process required a lengthy online form...then it turned out the form didn't work correctly.

 

Ultimately I blocked all their reported IP ranges at server firewall level and completely solved the problem.

 

But here's the question: Am I cutting off my nose to spite my face? Throwing out the baby with the bathwater?

 

Is there any server traffic coming from amazonaws that is doing me any good? For example, bots that actually get me directory links and generate visitors/traffic/links/revenues?

 

One thing I discovered is that bit.ly uses bitlybot, hosted on amazonaws. Apparently it is used to read the titles of pages, which are then displayed in mouseovers on the bit.ly urls. I wrote to bit.ly, they responded immediately, I sent off the list of blocked IP ranges (had to go to their tech department) and I have not heard back from them. I suspect that amazonaws assigns IP addresses at random within a geographical area, so there is no relevant "range" that might be used by one amazonaws client such as bit.ly.

 

Any and all insights welcome as always...

 

Cheers

Mike


It's probably impossible to know if you might be missing out on something great either now or in the future. I was just reading this: http://blog.red7.com/swarming-searchbots-from-amazon-aws/ and he mentioned only blocking unidentified bots from amazonaws, which he says are the ones being the most obnoxious. That might be a good compromise solution.


If you're not seeing any referrals from their site then I would say there is no point in letting them crawl your content. If you do see referrals, then you have to calculate the return on investment.

 

I sometimes throttle Bing on my own server and then let it back in after a week or so. I used to throttle Yahoo! much more (now I just block Slurp completely).


:wave: nuts

Welcome back.

 

For many of us who actively block bots amazonaws (AWS) is perhaps the one standard default block :)

Yes, there are legitimate services on AWS, including many/most/all of Amazon's properties, e.g. Alexa, Archive.org, IMDb. Which of their crawlers, if any, you consider valuable to your site is a business decision.

Personally I block them all with a vengeance because AWS serves up not only bots by the zillions but is host to zillions of proxy servers utilised by scrapers (and scammers).

With one exception:

because my sites are Amazon Affiliates I must allow AMZNKAssocBot/4.0 to crawl.
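
The block-everything-with-one-exception policy described here can be sketched roughly as follows. This is only an illustration, not the poster's actual setup: the single network in `BLOCKED_NETS` is a placeholder AWS range (one of those quoted elsewhere in this thread), and the function name `should_block` and the substring match on the user agent are assumptions made for the example.

```python
import ipaddress

# Placeholder: one AWS range quoted elsewhere in this thread.
# A real list would hold every range you have decided to block.
BLOCKED_NETS = [ipaddress.ip_network("50.112.0.0/16")]

# User-agent tokens allowed through even from a blocked range.
# AMZNKAssocBot is the exception named in the post above.
ALLOWED_UA_TOKENS = ("AMZNKAssocBot",)


def should_block(ip, user_agent):
    """Return True if a request from `ip` with `user_agent` should be refused."""
    addr = ipaddress.ip_address(ip)
    if not any(addr in net for net in BLOCKED_NETS):
        return False  # not from a blocked range, so let it through
    ua = user_agent or ""  # a missing user agent counts as unidentified
    # Inside a blocked range: allow only the explicitly listed crawlers.
    return not any(token in ua for token in ALLOWED_UA_TOKENS)


if __name__ == "__main__":
    print(should_block("50.112.1.2", "SomeScraper/1.0"))
    print(should_block("50.112.1.2", "AMZNKAssocBot/4.0"))
```

A check like this has to run in the web server or application, because a firewall-level block never sees the user agent. User-agent strings are also trivially spoofed, so an exception of this kind is a convenience rather than a guarantee.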

 

Note: if you hate AWS you might also love to hate Google App Engine. If AWS is default block #1, GAE is #2.


Thanks Cre8's

 

I just realized I had to allow cre8 emails with postini, which by the way does a fantastic job and has not only dramatically reduced spam (and the daily time needed to deal with it) but also lowered server load on spamassassin.

 

I'm glad to see that others have taken the same tack. Since I dropped amazon affiliate stuff (due to zero revenues) -- along with ebay (due to their api problems plus low revenues) and pretty much all cpa advertising -- no problem there.

 

Donna, I read the link, however that approach is labor-intensive and creates server load, while firewall blocking stops the traffic before it hits the server. I *love* my managed hosting (ahem) **plug** for datapipe.net.

 

Personally I have never seen the value of alexa, I mean, a visitor has to have the alexa toolbar for hits to count, right? Wayback (archive) may have some value, but how much really?

 

Iamlost, regarding google app engine -- my understanding is that it is used by individuals and the IP numbers **should not** cross over with google maps, right?

 

Cheers

Mike

Mike wrote: "Personally I have never seen the value of alexa, I mean, a visitor has to have the alexa toolbar for hits to count, right?"

 

No. They claim they gather data from other sources as well. I think most people are still looking at Alexa as if it never changed the way it functions several years ago.

 

From "How are Alexa’s traffic rankings determined?": "Alexa’s Traffic Ranks are based on the traffic data provided by users in the Alexa Toolbar panel and data collected from other, diverse sources over a rolling 3 month period."

 

In addition to the Alexa toolbar, they also collect data from their analytics package. And several years ago I read that like Compete and other metrics services they were buying aggregated user data from various ISPs. I don't know if they are still doing that today.

 

Alexa is actually a much more reliable data source than most people believe it to be, and has been for several years.

 

That said, all third-party metrics services operate at a disadvantage as they only have incomplete data to work with for the Web in general. Alexa points that out in one or two of their FAQ answers.


Michael, what I read in your comment suggests that if alexa is getting their data from other sources, it might not matter whether I block them or not....


If they collect enough data about a Website they will try to rank it. I think they advise people not to put too much stock in the rankings below 100,000. Their estimates for the top 100,000 sites (1..100,000) are apparently more reliable than for the sites about which they have less data.


I just realized that postini is a google app, so probably I am better off not blocking that across the board.


With the release of Amazon's Kindle Fire tablet upcoming at a competitor-hosing price, I expect its uptake to be swift and considerable. And that raises several problem questions...

 

First, as its Silk browser will leverage the AWS cloud (in a similar manner to Opera, it looks like), those of us simply blocking AWS IPs will need to reconsider our stance. How to adjust will depend on how Silk uses AWS: from dedicated IPs or random ones. If dedicated, the fix is simple; if random, one will have to consider user-agent identification...and how long before scammer-scrapers are masquerading as Silk?

 

Second, as it will also use a prefetch that means that webdevs such as myself that block prefetch will have to learn the best method of identification for blocking this one too.

 

Running a 'clean' site for ease of calculating direct ad visitor metrics is an uphill battle on an ever rising hill...


Hello

 

I am posting an update to the list of amazonaws IP ranges. This comes from https://forums.aws.amazon.com/ann.jspa?annID=1252 posted Dec 14, 2011. The list is long; I blocked the older ranges months ago. Here are a few new bad actors:

 

 

23.20.0.0/14 (23.20.0.0 – 23.23.255.255) NEW

50.112.0.0/16 (50.112.0.0 - 50.112.255.255)

184.169.128.0/17 (184.169.128.0 - 184.169.255.255) NEW

176.34.64.0/18 (176.34.64.0 – 176.34.127.255) NEW

176.34.0.0/18 (176.34.0.0 - 176.34.63.255) NEW

177.71.128.0/17 (177.71.128.0 - 177.71.255.255) NEW
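
For anyone who wants to double-check what each CIDR block above actually covers, here is a small sketch using Python's standard `ipaddress` module. The helper names `block_bounds` and `is_listed` are made up for this example; the ranges are copied from the list above.

```python
import ipaddress

# CIDR blocks copied from the list above.
AWS_BLOCKS = [
    "23.20.0.0/14",
    "50.112.0.0/16",
    "184.169.128.0/17",
    "176.34.64.0/18",
    "176.34.0.0/18",
    "177.71.128.0/17",
]


def block_bounds(cidr):
    """Return the (first, last) addresses covered by a CIDR block, as strings."""
    net = ipaddress.ip_network(cidr)
    return str(net.network_address), str(net.broadcast_address)


def is_listed(ip, blocks=AWS_BLOCKS):
    """Return True if `ip` falls inside any of the listed blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(block) for block in blocks)


if __name__ == "__main__":
    # Print each block with the address range it expands to.
    for cidr in AWS_BLOCKS:
        first, last = block_bounds(cidr)
        print(f"{cidr}: {first} - {last}")
```

Running it prints each block next to its first and last address, and `is_listed` can be pointed at an address from your access logs to see whether one of these ranges would catch it.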

 

Cheers

Mike


If people want to just cut and paste into their .htaccess files:

 

# Block Amazon AWS-based crawlers
deny from 23.20.0.0/14
deny from 50.112.0.0/16
deny from 184.169.128.0/17
deny from 176.34.64.0/18
deny from 176.34.0.0/18
deny from 177.71.128.0/17
# end block Amazon AWS-based crawlers

 

