Posted 21 September 2011 - 10:45 AM
It's been a while since I visited; I've mostly been involved with some technical stuff and moving back to the US of A.
This is a question I couldn't answer on the really geek tech forums, so I am putting it out to the "generalist" body of cre8...not 100% sure of correct forum, but security sounds good enough...
Nice new look by the way...
I have had some server-slowing periods stemming from discourteous bots originating on amazonaws.com.
I tried contacting amazonaws, but their reporting process required a lengthy online form...then it turned out the form didn't work correctly.
Ultimately I blocked all their reported IP ranges at server firewall level and completely solved the problem.
But here's the question: Am I cutting off my nose to spite my face? Throwing out the baby with the bathwater?
Is there any server traffic coming from amazonaws that is doing me any good? For example, bots that actually get me directory links and generate visitors/traffic/links/revenues?
One thing I discovered is that bit.ly uses bitlybot, hosted on amazonaws. Apparently it is used to read the titles of pages, which are then displayed in mouseovers on the bit.ly urls. I wrote to bit.ly, they responded immediately, and I sent off the list of blocked IP ranges (it had to go to their tech department), but I have not heard back from them. I suspect that amazonaws assigns IP addresses at random within a geographical area, so there is no relevant "range" that might be used by one amazonaws client such as bit.ly.
Any and all insights welcome as always...
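One rough way to approach the "is any of this traffic doing me any good?" question is to tally which user agents are actually arriving from the ranges in question before (or after) blocking them. The sketch below assumes Apache "combined" format access-log lines; the function name `tally_user_agents` and the ranges are hypothetical placeholders (reserved documentation blocks), not real amazonaws ranges.

```python
import ipaddress
import re
from collections import Counter

# Placeholder ranges (RFC 5737 documentation blocks), NOT real AWS ranges.
# Substitute the ranges you are considering blocking.
CANDIDATE_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Assumes Apache "combined" log format: client address is the first field,
# user agent is the last quoted field.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def tally_user_agents(lines, ranges=CANDIDATE_RANGES):
    """Count user-agent strings for requests whose client IP is in `ranges`."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        try:
            ip = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # first field was a hostname, not an IP address
        if any(ip in net for net in ranges):
            counts[match.group(2)] += 1
    return counts
```

Calling `tally_user_agents(open("access.log"))` and looking at `counts.most_common(20)` would show whether anything recognisable (bitlybot, for instance) sits among the noise. User-agent strings are self-reported, so this is a hint rather than proof.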
Posted 21 September 2011 - 12:12 PM
Posted 21 September 2011 - 01:06 PM
I sometimes throttle Bing on my own server and then let it back in after a week or so. I used to throttle Yahoo! much more (now I just block Slurp completely).
Posted 22 September 2011 - 10:53 AM
For many of us who actively block bots, amazonaws (AWS) is perhaps the one standard default block.
Yes, there are legitimate services on AWS, including many/most/all of Amazon's properties, e.g. Alexa, Archive.org, IMDb. Which of their crawlers, if any, you consider valuable to your site is a business decision.
Personally I block them all with a vengeance because AWS serves up not only bots by the zillions but is host to zillions of proxy servers utilised by scrapers (and scammers).
With one exception: because my sites are Amazon Affiliates, I must allow AMZNKAssocBot/4.0 to crawl.
Note: if you hate AWS you might also love to hate Google App Engine. If AWS is default block #1, GAE is #2.
Posted 22 September 2011 - 11:31 AM
I just realized I had to allow cre8 emails with postini, which by the way does a fantastic job and has not only dramatically reduced spam (and the daily time needed to deal with it) but also lowered server load on spamassassin.
I'm glad to see that others have taken the same tack. Since I dropped amazon affiliate stuff (due to zero revenues) -- along with ebay (due to their api problems plus low revenues) and pretty much all cpa advertising -- no problem there.
Donna, I read the link, however that approach is labor-intensive and creates server load, while firewall blocking stops the traffic before it hits the server. I *love* my managed hosting (ahem) **plug** for datapipe.net.
Personally I have never seen the value of alexa, I mean, a visitor has to have the alexa toolbar for hits to count, right? Wayback (archive) may have some value, but how much really?
Iamlost, regarding google app engine -- my understanding is that it is used by individuals and the IP numbers **should not** cross over with google maps, right?
Posted 22 September 2011 - 01:22 PM
"Personally I have never seen the value of alexa, I mean, a visitor has to have the alexa toolbar for hits to count, right?"
No. They claim they gather data from other sources as well. I think most people are still looking at Alexa as if it never changed the way it functions several years ago.
From How are Alexa’s traffic rankings determined?: "Alexa’s Traffic Ranks are based on the traffic data provided by users in the Alexa Toolbar panel and data collected from other, diverse sources over a rolling 3 month period."
In addition to the Alexa toolbar, they also collect data from their analytics package. And several years ago I read that, like Compete and other metrics services, they were buying aggregated user data from various ISPs. I don't know if they are still doing that today.
Alexa is actually a much more reliable data source than most people believe it to be, and has been for several years.
That said, all third-party metrics services operate at a disadvantage as they only have incomplete data to work with for the Web in general. Alexa points that out in one or two of their FAQ answers.
Posted 22 September 2011 - 01:30 PM
Posted 23 September 2011 - 12:23 PM
Posted 23 September 2011 - 07:43 PM
Posted 29 September 2011 - 12:27 PM
First, as its Silk browser will leverage the AWS cloud (in a similar manner to Opera, it looks like), those of us simply blocking AWS IPs will need to reconsider our stance. How to adjust will depend on how Silk uses AWS: from dedicated IPs or random ones. If dedicated, the fix is simple; if random, then one will have to consider user-agent identification...and how long before scammer-scrapers are masquerading as Silk?
Second, as it will also use a prefetch, webdevs such as myself who block prefetch will have to learn the best method of identification for blocking this one too.
Running a 'clean' site for ease of calculating direct ad visitor metrics is an uphill battle on an ever rising hill...
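The dedicated-versus-random distinction above can be sketched as a small decision function. Everything here is an assumption for illustration: the name `classify_request`, the placeholder range, the idea that Silk announces itself with a `Silk/` token in its user agent, and the three outcomes. How Silk actually uses AWS addresses is not established in this thread.

```python
import ipaddress

# Placeholder for the blocked AWS ranges (RFC 5737 documentation block),
# NOT a real AWS range.
AWS_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

# Assumption: the browser identifies itself with this user-agent token.
SILK_TOKEN = "Silk/"


def classify_request(remote_addr, user_agent, silk_ranges=()):
    """Return 'allow', 'review', or 'deny' for one request.

    silk_ranges: hypothetical list of networks dedicated to Silk,
    if such a list ever becomes available (the "dedicated IPs" case).
    """
    ip = ipaddress.ip_address(remote_addr)
    if not any(ip in net for net in AWS_RANGES):
        return "allow"   # not from a blocked range at all
    if any(ip in net for net in silk_ranges):
        return "allow"   # dedicated-IP case: the simple fix
    if SILK_TOKEN in user_agent:
        return "review"  # random-IP case: the user agent is only a claim
    return "deny"
```

The `review` outcome is deliberate: a user-agent string is trivially forged, so a request that merely claims to be Silk from a blocked range probably deserves rate limiting or closer inspection rather than a free pass.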
Posted 12 January 2012 - 10:56 PM
I am posting an update to the list of amazonaws bots. This comes from https://forums.aws.a...jspa?annID=1252 posted Dec 14, 2011. The list is long, I blocked the older ones months ago, here are a few new bad actors:
22.214.171.124/14 (126.96.36.199 – 188.8.131.52) NEW
184.108.40.206/16 (220.127.116.11 - 18.104.22.168)
22.214.171.124/17 (126.96.36.199 - 188.8.131.52) NEW
184.108.40.206/18 (220.127.116.11 – 18.104.22.168) NEW
22.214.171.124/18 (126.96.36.199 - 188.8.131.52) NEW
184.108.40.206/17 (220.127.116.11 - 18.104.22.168) NEW
Posted 13 January 2012 - 02:09 PM
# Block Amazon AWS-based crawlers
deny from 22.214.171.124/14
deny from 126.96.36.199/16
deny from 188.8.131.52/17
deny from 184.108.40.206/18
deny from 220.127.116.11/18
deny from 18.104.22.168/17
# end block Amazon AWS-based crawlers
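Since the published list is long and keeps growing, a deny block like the one above could be generated rather than typed by hand. This is a minimal sketch, assuming the ranges arrive as CIDR strings; `build_deny_block` is a hypothetical helper and the example ranges are reserved documentation blocks, not real AWS ranges.

```python
import ipaddress


def build_deny_block(cidrs, label="Amazon AWS-based crawlers"):
    """Validate CIDR strings, merge adjacent/overlapping ones, emit deny lines."""
    # strict=True raises ValueError if an entry has host bits set
    # (e.g. "192.0.2.1/24"), which catches copy-and-paste mistakes.
    networks = [ipaddress.ip_network(c, strict=True) for c in cidrs]
    merged = ipaddress.collapse_addresses(networks)
    lines = [f"# Block {label}"]
    lines += [f"deny from {net}" for net in merged]
    lines.append(f"# end block {label}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder ranges only; the two /25 blocks merge into a single /24.
    print(build_deny_block(["192.0.2.0/25", "192.0.2.128/25", "198.51.100.0/24"]))
```

The `deny from` form matches the Apache 2.2-style directives used in this thread; on Apache 2.4 the equivalent would be written with `Require not ip`, so the format string would need adjusting there.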