
Cre8asiteforums Internet Marketing
and Conversion Web Design



Baidu & Yandex


15 replies to this topic

#1 nuts

    Mach 1 Member

  • Members
  • 307 posts

Posted 21 April 2012 - 05:45 PM

Hi again...

I have been trying to refine the traffic to my 300+ sites so they don't get overloaded. I have blocked all the amazonaws ranges I have found, along with other individual IPs that have overloaded me with bot traffic.

So the question is, do Baidu and/or Yandex do me any good in terms of human traffic -- traffic that clicks on advertising links that I get paid for? At least in the past, clicks from North America, Western Europe and Australia have been worth a lot more than clicks from Russia or China.

I am looking at Yandex, and its bot is coming on pretty strong. I will probably at least disallow it in robots.txt and see whether it obeys.
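If you do disallow it, the robots.txt entry is short. Yandex documents that its main crawler obeys robots.txt, and it has historically also honoured the non-standard Crawl-delay directive, so throttling is an alternative to an outright block:

```text
# Block Yandex's crawler entirely
User-agent: Yandex
Disallow: /

# Or, instead of blocking, slow it down (non-standard directive Yandex honours)
# User-agent: Yandex
# Crawl-delay: 10
```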

Any comment??

Cheers
Mike

#2 iamlost

    The Wind Master

  • Site Administrators
  • 4643 posts

Posted 21 April 2012 - 09:34 PM

A lot depends on what traffic your revenue sources are looking for. If they only want those high-value click locations you mention, then you might want to block all extraneous IPs. If other geolocations are, or become, of value, open things up accordingly.

Yandex and Baidu are major players and are becoming more international. Personally I allow both, but only their main webcrawlers - just as with Bing and Google. That does not mean that I don't block Russian and Chinese IPs :) All SEs have a whole range of different crawlers, and which ones you allow is a business-model decision. Note too that most/all SEs have stealth bots, and that most/all SEs follow real and hypothetical links, citations and entities, which may ignore robots.txt... bots are only reliably recognised by behaviour, and not always then.

Note: I frequently block even whitelisted SE bots if they hammer too hard or too repetitively. In such instances it is only for an hour at a time. So far without apparent harm.

#3 EGOL

    Professor

  • Hall Of Fame
  • 5497 posts

Posted 21 April 2012 - 09:58 PM

After reading your comments I went out looking for ways to block IP ranges by country and found what I think is a good reference...
http://www.parkansky.com/china.htm

I was surprised at how much code this will add to an htaccess file. Does that cause performance problems?

I was hoping for something short and sweet to block about half a hemisphere.
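(For context on what those country-block lists actually do: the mechanism is just a long series of deny directives, as in this minimal sketch in Apache 2.2 mod_authz_host syntax; the two ranges here are examples only. It is also why .htaccess gets expensive: Apache re-reads the file on every request, so hundreds of deny lines are re-parsed per hit, which is what makes firewall-level blocking cheaper.)

```text
# .htaccess sketch, Apache 2.2 syntax - example ranges only
Order Allow,Deny
Allow from all
Deny from 180.76.0.0/16
Deny from 220.181.0.0/16
```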

#4 nuts

    Mach 1 Member

  • Members
  • 307 posts

Posted 21 April 2012 - 10:31 PM

I use iptables to block IP ranges, such as

iptables -p tcp -I INPUT -j DROP -s 103.4.8.0/21

followed by

/etc/init.d/iptables save

This blocks 103.4.8.0 - 103.4.15.255, which happens to be the amazonaws Asia-Pacific range.

This gives me an iptables rules list like this:

iptables -L -n   (rules listed 4/20/12)

DROP   tcp  --  103.4.8.0/21   0.0.0.0/0

Iptables can act as a Linux-internal firewall. It blocks traffic before it reaches Apache and .htaccess, and so causes less server load. External firewalls are even better: your server never takes the hit at all.

The CIDR syntax is explained here : http://en.wikipedia....wiki/CIDR_range
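As a sanity check on that /21 arithmetic, a CIDR block's span can be computed in plain shell before you write the rule. This is just an illustrative helper, not anything iptables needs:

```shell
# Expand a CIDR block to its first and last address - a sanity check
# before writing an iptables rule. Uses the amazonaws range from above.
cidr="103.4.8.0/21"
ip=${cidr%/*}
prefix=${cidr#*/}
IFS=. read -r a b c d <<EOF
$ip
EOF
n=$(( (a << 24) | (b << 16) | (c << 8) | d ))          # dotted quad -> integer
mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF )) # /21 -> 255.255.248.0
first=$(( n & mask ))                                  # network address
last=$(( first | (~mask & 0xFFFFFFFF) ))               # last address in block
tostr() { echo "$(($1 >> 24 & 255)).$(($1 >> 16 & 255)).$(($1 >> 8 & 255)).$(($1 & 255))"; }
echo "$(tostr $first) - $(tostr $last)"                # prints 103.4.8.0 - 103.4.15.255
```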

Current amazonaws IP ranges are listed here :
https://forums.aws.a...jspa?annID=1408
Posted on: Mar 20, 2012 2:14 PM

Iamlost, point taken on the changing nature of the net and on keeping abreast of what is valuable and what is not. Realistically, at the moment I *think* most advertisers in English are looking to reach English-speaking customers -- although I have not viewed my sites recently from non-English countries, so it's possible they are displaying ads in other languages.

In general it looks to me like Yandex is pretty active on my machines.

Cheers & good luck!

Mike

#5 EGOL

    Professor

  • Hall Of Fame
  • 5497 posts

Posted 21 April 2012 - 11:11 PM

Thank you for all of this information, Mike.

How did you learn all of this stuff? :)

Edited by EGOL, 21 April 2012 - 11:12 PM.


#6 nuts

    Mach 1 Member

  • Members
  • 307 posts

Posted 22 April 2012 - 01:40 AM

heh, one line at a time...

#7 Michael_Martinez

    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 22 April 2012 - 05:17 PM

Amazon AWS has a LOT of unpublished Class C IP address blocks. I have yet to find a complete listing.

Unfortunately, I learned the hard way that I cannot simply block everything on Amazon because some of the social media services where I pipe my RSS feeds are hosted on Amazon AWS. It's a very sticky wicket.

And as far as Baidu and Yandex go, I open up to their crawlers every now and then to see how relevant their traffic is to my sites. I get more traffic from Yandex than Baidu and more crawl from Baidu than Yandex. At the present time I am blocking Baidu but I'll unblock it again later this year.

Yandex is definitely trying to break into international search and they have been indexing the English Web for several years.

#8 jonbey

    Eyes Like Hawk Moderator

  • Moderators
  • 4432 posts

Posted 22 April 2012 - 07:01 PM

How much bandwidth are these guys eating up? I have always taken the approach that more bandwidth generally leads to more revenue, and not bothered to look at what is hitting my sites.

Although I do not get much Baidu or Yandex traffic, I would not block them, because something could suddenly become popular and I'd lose a lot of good traffic. I guess it largely depends on your market, though. I am targeting the whole world.

#9 Michael_Martinez

    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 23 April 2012 - 12:50 PM

Oddly enough, Google Analytics provides virtually no bandwidth analysis (at least, none that has leaped out at me). Webalizer, which has many drawbacks, will tell you WHO is drawing down the most bandwidth. The bandwidth reports can be ugly, and spiders cannot hide from them. It's been a while since I looked at an AWStats report, but I want to say they also assess bandwidth usage.

#10 nuts

    Mach 1 Member

  • Members
  • 307 posts

Posted 23 April 2012 - 09:40 PM

I use the linux command (as root)

tail -f /path.../access_log

The -f argument gives you a continuing view of every file request as it comes in. If your server is busy it can be a flow of text that is too fast to read.

The apache httpd.conf line

HostnameLookups On

gives the hostname attached to the IP making the request (note that this adds a reverse DNS lookup to every request, which costs some performance). Combined with

LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combined

where %v is the name of the virtual host serving the request, it makes Apache log both your domain and the requesting host in the access_log file.

It's probably not 100% foolproof, but it gives you a pretty good idea of who's requesting what, especially when there is an egregiously huge number of requests from the same IP and/or host. It's also a good way to spot DoS attacks.
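To turn that scrolling tail into a summary, counting requests per client IP is a one-pipe job. The sample log lines below are made up for illustration; point the awk at your real access_log instead:

```shell
# Count requests per client IP in an access log and show the heaviest
# hitters. The sample file here stands in for a real access_log.
cat > /tmp/sample_access_log <<'EOF'
180.76.5.1 - - [23/Apr/2012:10:00:01] "GET /a HTTP/1.1" 200 512
180.76.5.1 - - [23/Apr/2012:10:00:02] "GET /b HTTP/1.1" 200 311
66.249.66.1 - - [23/Apr/2012:10:00:03] "GET /a HTTP/1.1" 200 512
EOF
# Field 1 of the combined log format is the client IP.
awk '{print $1}' /tmp/sample_access_log | sort | uniq -c | sort -rn | head
```

The heaviest hitter sorts to the top, which makes runaway bots jump out immediately.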

You can also plug the IP number into maxmind.com, which tells you who the range is registered to.

Cheers
Mike

#11 nuts

    Mach 1 Member

  • Members
  • 307 posts

Posted 30 April 2012 - 12:33 PM

I have bitten the bullet with baidu and blocked the following:

iptables -p tcp -I INPUT -j DROP -s 119.63.192.0/21
iptables -p tcp -I INPUT -j DROP -s 123.125.64.0/18
iptables -p tcp -I INPUT -j DROP -s 180.76.0.0/16
iptables -p tcp -I INPUT -j DROP -s 220.181.0.0/16

The final range actually blocks China Telecom Beijing.

Ref : http://www.netnuisance.net/ip/se.php
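Those four rules can be generated in a loop. The sketch below just echoes the commands rather than executing them (they need root to run), so you can eyeball the output before piping it to a shell:

```shell
# Emit (not execute) iptables DROP rules for a list of CIDR ranges.
# Review the output, then pipe it to sh as root once you are happy.
block_ranges() {
    for net in "$@"; do
        echo "iptables -p tcp -I INPUT -j DROP -s $net"
    done
}
block_ranges 119.63.192.0/21 123.125.64.0/18 180.76.0.0/16 220.181.0.0/16
```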

This seems to have lowered my CPU load. I don't know how to measure the effect on website results, though I can't imagine that English-language advertisers have much interest in China traffic.

I do not see such a load from Yandex ... so far it seems to be relatively polite and I am leaving it alone.

Cheers
Mike

#12 Michael_Martinez

    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 30 April 2012 - 01:09 PM

Some people block all traffic coming out of China in an effort to impede hackers' activity. I have never tried that approach. It strikes me as pretty radical.

A commenter on SEO Theory recommended Cloud Flare and one of their people followed up to clarify one of my concerns (I thought you had to move your hosting to Cloud Flare but they just manage your DNS).

If I understand the concept correctly, your routing passes through Cloud Flare and they take responsibility for managing rogue crawlers. I think they are caching your content but I'm not clear on that. Haven't really had time to fully evaluate it.

#13 nuts

    Mach 1 Member

  • Members
  • 307 posts

Posted 30 April 2012 - 01:43 PM

I took a quick look at cloudflare.com and they describe a few things: security (attacks), an optimizer, and a CDN (content delivery network). Apparently the optimizer reduces file sizes by eliminating unnecessary HTML; it seems like they would have to cache for that. A CDN is of course a caching arrangement. I looked into that with other CDN companies and it can get pretty pricey.

Currently I am working on understanding memcached syntax, to cache pages in RAM locally, with a possible migration to Couchbase later, which works about the same but also keeps a text-based file that allows quicker cache recovery on reboot. This is somewhat outside the topic of blocking bad bots, but both relate to server optimization, i.e. getting the biggest bang for your server buck.

I do use postini for mail handling for my primary account. This accepts all my incoming mail and forwards to my own account only that mail which they think is real. It's pretty effective, gets about 90% of the spam with almost no false positives. Postini is activated by the MX records in my DNS, which route incoming mail to them rather than me.

#14 nuts

    Mach 1 Member

  • Members
  • 307 posts

Posted 01 June 2012 - 11:49 AM

Posts regarding amazonaws IP ranges keep getting deleted. This seems to be a sticky link to the most current list:

https://forums.aws.a...jspa?forumID=30

#15 bobbb

    Sonic Boom Member

  • Hall Of Fame
  • 2189 posts

Posted 01 June 2012 - 12:59 PM

Just out of curiosity, what do all those people using amazonaws IP addresses do? Are they just scrapers, or do they add value to whatever it is they do? Or are they just here to bust our asps?

Isn't it obvious when you land on a made-for-Adsense site that it is a junk site? And people still click on their ads? Hmmm. :dazed:
Hey I have a near 100 year old bridge for sale. Crosses the St-Lawrence river into Montreal. Got 4 mini Eiffel towers on top.

I hate them too.

Edited by bobbb, 01 June 2012 - 01:00 PM.


#16 nuts

    Mach 1 Member

  • Members
  • 307 posts

Posted 01 June 2012 - 01:12 PM

I tend to concur with the question "What ARE these people doing anyway?" Hackers and crackers I sort of understand; there is evil in the world. The only explanation I have for bad bots and senseless stupidity repeated ad infinitum is the opportunistic nature of all biology: if it can, it will. Weeds grow, diseases transmit. I guess these people are driven by the hope, however misplaced, that their activities will benefit them in the end.


