Baidu & Yandex
#1
Posted 21 April 2012 - 05:45 PM
I have been trying to refine my traffic so I don't get overloaded on 300+ sites. I have blocked all amazonaws ranges that I have found, along with other individual IP numbers that have overloaded me with bot traffic.
So the question is, do Baidu and/or Yandex do me any good in terms of human traffic -- traffic that clicks on advertising links and that I get paid for? At least in the past, clicks from North America, Western Europe and Australia have been worth a lot more than clicks from Russia or China.
I am looking at yandex and this bot is coming on pretty strong. I probably will at least disallow it in robots.txt and see if it obeys.
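For reference, a minimal sketch of what such a robots.txt could look like, written out from the shell. It assumes that the user-agent token "Yandex" is the one Yandex's crawlers honor, and DOCROOT is a hypothetical path, not something from this thread. Compliance is voluntary on the bot's side, so the access log still has to be watched afterwards.

```shell
# Write a robots.txt that asks Yandex's crawlers to stay out of the whole
# site while leaving every other crawler unrestricted.
# DOCROOT is a hypothetical location; point it at your real document root.
DOCROOT=${DOCROOT:-./docroot-example}
mkdir -p "$DOCROOT"

cat > "$DOCROOT/robots.txt" <<'EOF'
User-agent: Yandex
Disallow: /

User-agent: *
Disallow:
EOF
```

An empty Disallow line in the second record means "nothing is off limits" for all other crawlers.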
Any comment??
Cheers
Mike
#2
Posted 21 April 2012 - 09:34 PM
Yandex and Baidu are major players and are becoming more international. Personally I do allow both but only their main webcrawlers - just as with Bing and Google. That does not mean that I don't block Russian and Chinese IPs.
Note: I frequently block even whitelisted SE bots if they hammer too hard or too repetitively. In such instances it is only for an hour at a time. So far without apparent harm.
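The post does not say how the one-hour block is applied. One possible way, sketched here with iptables, is to insert a DROP rule, wait, and delete the same rule again. The helper name temp_block and the RUN switch are hypothetical; by default the sketch only prints the commands instead of running them.

```shell
# temp_block IP [SECONDS]: drop traffic from IP, then lift the block.
# Hypothetical helper. By default it is a dry run that only prints the
# commands; set RUN="" to execute them for real (needs root).
RUN=${RUN-echo}

temp_block() {
    $RUN iptables -I INPUT -s "$1" -j DROP   # insert the block at the top of INPUT
    $RUN sleep "${2:-3600}"                  # default wait: one hour
    $RUN iptables -D INPUT -s "$1" -j DROP   # delete the same rule again
}
```

Run for real, the call holds the shell for the whole hour, so it would normally be started in the background.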
#3
Posted 21 April 2012 - 09:58 PM
http://www.parkansky.com/china.htm
I was surprised at how much code this will add to an htaccess file. Does that cause performance problems?
I was hoping for a short and sweet code to block about 1/2 of a hemisphere.
#4
Posted 21 April 2012 - 10:31 PM
iptables -p tcp -I INPUT -j DROP -s 103.4.8.0/21
followed by
/etc/init.d/iptables save && /etc/init.d/sshd restart
This blocks (103.4.8.0 - 103.4.15.255) which happens to be amazonaws Asia-Pacific region.
This gives me an iptables rules list like this:
list rules 4/20/12
iptables -L -n
DROP tcp -- 103.4.8.0/21 0.0.0.0/0
Iptables can act as a Linux internal firewall. It blocks traffic before it reaches .htaccess and causes less server load. External firewalls are even better: your server never takes the hit at all.
The CIDR syntax is explained here : http://en.wikipedia....wiki/CIDR_range
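To check which addresses a CIDR block covers before dropping it, a small shell function can do the arithmetic. This is only a sketch: cidr_range is a hypothetical helper, not part of iptables, and it assumes a shell with 64-bit arithmetic and plain decimal octets.

```shell
# cidr_range CIDR: print the first and last IPv4 address of a CIDR block.
# Hypothetical helper for illustration; not part of iptables.
cidr_range() {
    ip=${1%/*}      # address part, e.g. 103.4.8.0
    bits=${1#*/}    # prefix length, e.g. 21

    # Split the dotted quad into four octets ($1..$4).
    old_ifs=$IFS
    IFS=.
    set -- $ip
    IFS=$old_ifs

    # Pack the octets into one 32-bit integer.
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))

    # Network mask: the top $bits bits set, the rest clear.
    mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))

    first=$(( addr & mask ))
    last=$(( first | (~mask & 0xFFFFFFFF) ))

    printf '%d.%d.%d.%d - %d.%d.%d.%d\n' \
        $(( (first >> 24) & 255 )) $(( (first >> 16) & 255 )) \
        $(( (first >> 8) & 255 ))  $(( first & 255 )) \
        $(( (last >> 24) & 255 ))  $(( (last >> 16) & 255 )) \
        $(( (last >> 8) & 255 ))   $(( last & 255 ))
}

cidr_range 103.4.8.0/21   # prints: 103.4.8.0 - 103.4.15.255
```

A /21 leaves 11 host bits, so the block spans 2048 addresses, which matches the range quoted for the rule above.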
Current amazonaws IP ranges are listed here :
https://forums.aws.a...jspa?annID=1408
Posted on: Mar 20, 2012 2:14 PM
Iamlost, point taken on the changing nature of the net, keeping abreast of what is valuable and what is not. Realistically, at the moment I *think* most advertisers in English are looking to reach English speaking customers -- although I have not viewed my sites recently from non-English countries -- it's possible they may be displaying ads in other languages....
In general it looks to me like Yandex is pretty active on my machines.
Cheers & good luck!
Mike
#7
Posted 22 April 2012 - 05:17 PM
Unfortunately, I learned the hard way that I cannot simply block everything on Amazon because some of the social media services where I pipe my RSS feeds are hosted on Amazon AWS. It's a very sticky wicket.
And as far as Baidu and Yandex go, I open up to their crawlers every now and then to see how relevant their traffic is to my sites. I get more traffic from Yandex than Baidu and more crawl from Baidu than Yandex. At the present time I am blocking Baidu but I'll unblock it again later this year.
Yandex is definitely trying to break into international search and they have been indexing the English Web for several years.
#8
Posted 22 April 2012 - 07:01 PM
Although I do not get much Baidu or Yandex traffic I would not block them as suddenly something could become popular and I'd lose a lot of good traffic. I guess it largely depends on your market though. I am targeting the whole world.
#9
Posted 23 April 2012 - 12:50 PM
#10
Posted 23 April 2012 - 09:40 PM
tail -f /path.../access_log
The -f argument gives you a continuing view of every file request as it comes in. If your server is busy it can be a flow of text that is too fast to read.
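When the stream is too fast to read, one option is to filter it down to a single crawler. A minimal sketch, assuming GNU grep (for --line-buffered) and a log format that records the User-Agent; only_agent is a hypothetical helper name.

```shell
# only_agent PATTERN: pass through only the log lines that mention PATTERN
# (case-insensitive). Hypothetical helper name.
only_agent() {
    # --line-buffered makes grep emit each match immediately, which matters
    # when its input is a live "tail -f" stream.
    grep --line-buffered -i -- "$1"
}

# Usage (the log path is elided here exactly as in the post):
#   tail -f /path.../access_log | only_agent yandex
```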
The apache httpd.conf line
HostnameLookups On
gives the name attached to the IP making the request,
and when combined with
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combined
where the %v gives your domain that serves the request,
causes apache to print your domain and the requesting domain to the access_log file.
It's probably not 100% foolproof, but it gives you a pretty good idea who's requesting what, especially if there is an egregiously huge number of requests from the same IP and/or host. It's also a good way to spot and stop DoS attacks.
You can also plug the IP number into maxmind.com, which gives you the real owner.
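To find the hosts with an egregious number of requests without reading the log by eye, the requests can be counted per client. A rough sketch, assuming the LogFormat above so that field 1 is %v and field 2 is %h; top_clients is a hypothetical helper name.

```shell
# top_clients LOGFILE [N]: list the N busiest client hosts in an access log
# written with the LogFormat above, where field 1 is %v (your domain)
# and field 2 is %h (the requesting host or IP).
top_clients() {
    awk '{ count[$2]++ } END { for (h in count) print count[h], h }' "$1" |
        sort -k1,1nr -k2,2 |    # highest count first, ties ordered by host
        head -n "${2:-10}"      # default: top ten
}

# Usage (the log path is elided here exactly as in the post):
#   top_clients /path.../access_log 20
```

Each output line is a request count followed by the host, so the heaviest hitters are at the top.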
Cheers
Mike
#11
Posted 30 April 2012 - 12:33 PM
iptables -p tcp -I INPUT -j DROP -s 119.63.192.0/21
iptables -p tcp -I INPUT -j DROP -s 123.125.64.0/18
iptables -p tcp -I INPUT -j DROP -s 180.76.0.0/16
iptables -p tcp -I INPUT -j DROP -s 220.181.0.0/16
The final range actually blocks China Telecom Beijing.
Ref : http://www.netnuisance.net/ip/se.php
This seems to have lowered my CPU load. I don't know how to measure the effect on website results, though I can't imagine that English language advertisers have much interest in China traffic.
I do not see such a load from Yandex ... so far it seems to be relatively polite and I am leaving it alone.
Cheers
Mike
#12
Posted 30 April 2012 - 01:09 PM
A commenter on SEO Theory recommended Cloud Flare and one of their people followed up to clarify one of my concerns (I thought you had to move your hosting to Cloud Flare but they just manage your DNS).
If I understand the concept correctly, your routing passes through Cloud Flare and they take responsibility for managing rogue crawlers. I think they are caching your content but I'm not clear on that. Haven't really had time to fully evaluate it.
#13
Posted 30 April 2012 - 01:43 PM
Currently working on understanding memcache syntax, to cache pages in ram locally, with a possible migration to couchbase later, which works about the same but also has a text-based file that allows for quicker cache recovery on reboot. This is somewhat outside the topic of blocking bad bots, but both relate to server optimization, i.e. getting the biggest bang for your server buck.
I do use postini for mail handling for my primary account. This accepts all my incoming mail and forwards to my own account only that mail which they think is real. It's pretty effective, gets about 90% of the spam with almost no false positives. Postini is activated by the MX records in my DNS, which route incoming mail to them rather than me.
#14
Posted 01 June 2012 - 11:49 AM
https://forums.aws.a...jspa?forumID=30
#15
Posted 01 June 2012 - 12:59 PM
Isn't it obvious when you land on a made-for-Adsense site that it is a junk site? And people still click on their ads? Hmmm.
Hey I have a near 100 year old bridge for sale. Crosses the St-Lawrence river into Montreal. Got 4 mini Eiffel towers on top.
I hate them too.
Edited by bobbb, 01 June 2012 - 01:00 PM.
#16
Posted 01 June 2012 - 01:12 PM