Cre8asiteforums Internet Marketing and Conversion Web Design


Robots Exclusion And The Fate Of Linkjuice?


19 replies to this topic

#1 EGOL

    Professor

  • Hall Of Fame
  • 5185 posts

Posted 27 July 2007 - 04:02 PM

I am trying to understand the advantages and disadvantages of excluding pages by robots.txt versus by the META robots tag.


1) If I exclude by robots.txt.....
- the page will not be indexed
- linkjuice will not flow into the page
- linkjuice will not flow out of the page


2) If I exclude by META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"
- has same effect as above


3) If I exclude by META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"
- the page will not be indexed

In situation #3, what happens to the linkjuice? If that page does have inbound links, will the linkjuice be transferred? I can guess at an answer, but I wonder if anyone knows for sure.
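
For reference, here is roughly what each of those set-ups looks like (just a sketch - the /private/ path is a placeholder, not a real directory on my site):

1) robots.txt exclusion:

User-agent: *
Disallow: /private/

2) META robots, noindex and nofollow:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

3) META robots, noindex but follow:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">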

Edited by EGOL, 27 July 2007 - 04:08 PM.


#2 JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 27 July 2007 - 05:17 PM

I checked some of the literature and couldn't find any reference to this. I assume you mean pagerank and Google, right?

Maybe we can work out some possibilities ....

Let's call our pages like so:
A: valued, known page, links to B, F
B: has robots=noindex,follow, links to C, D, E
F: links to G, H, I
C,D,E,G,H,I: normal pages

In general, for pagerank: the value passed per link is dampening * 1/(number of links) (very simplified)
- Each link on a page gets identical value passed
- The value passed through those links is reduced by the number of links on a page
- Each level of linkage reduces the value passed by the dampening factor (eg for A -> F -> G the link from F to G passes less value than the link from A to F)
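
As a rough sketch in code (the function name and the 0.85 value for d are just assumptions for illustration):

# very simplified: value passed through each link on a page
def value_per_link(d, num_links):
    return d * 1.0 / num_links

# A has 2 links, F has 3; the A -> F -> G path is dampened twice
value_A_to_F = value_per_link(0.85, 2)                  # d * 1/2 = 0.425
value_F_to_G = value_A_to_F * value_per_link(0.85, 3)   # (d*1/2) * d*1/3, about 0.12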

Let's take a quick look at the possibilities:
#1. B transparently passes the full value to C,D,E
#2. B keeps value (but is filtered), passes the dampened value to C,D,E
#3. B does not take value, does not pass value (treated like a blocked URL in the robots.txt)
#4. B is ignored; the links on B are treated like links on A.

Assuming the people at Google accept and use "noindex,follow", we can rule #3 out. This should be easy to test though (anyone?).

Let's take a look at #1: It would mean that A passes d*1/2 value to split over the 3 pages C,D,E. The link out of A cannot be treated as 3 links (it would "penalize" the other legitimate link out of A), this means:
F: gets d*1/2
C,D,E each get d*1/2 * 1/3 = d*1/6
G,H,I each get (d*1/2) * d*1/3 = (d^2) * 1/6
It would require that Google's link-tables accept more than one end-point for a link; the link would be A -> (C,D,E). I'm just guessing, but I doubt they would have a provision for that. For d<1, the value passed to C,D,E would be higher than the value passed to G,H,I.

Looking at #2: the value passed would be:
F: gets d*1/2
G,H,I each get (d*1/2) * d*1/3 = (d^2) * 1/6
B: gets d*1/2 (but is not shown)
C,D,E: each get (d*1/2) * d*1/3 = (d^2) * 1/6 (same as G,H,I)
With d < 1 (for dampening) this would mean the value passed in situation #2 to C,D,E would be lower than the value passed in situation #1.

#4 seems strange, but possible. The value passed would be:
F: gets d*1/4
C,D,E: each get d*1/4
G,H,I each get (d*1/4) * d*1/3 = (d^2) * 1/12
The value passed to C,D,E would be much higher than the value passed to G,H,I.
The value passed to G,H,I would be lower than in situations #1 and #2. These links would be "penalized" in favor of C,D,E. Taking it to the web, if A, B, F were separate domains, B could penalize the links on F by adding robots=noindex. That wouldn't make much sense (or would it?).
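
To make the comparison concrete, here is a small Python sketch that just tabulates the expressions above (d = 0.85 is an assumed dampening factor; the model is the simplified one from the start of this post):

d = 0.85  # assumed dampening factor

scenarios = {
    "#1 B passes the full value on": {"F": d / 2, "C,D,E each": (d / 2) * (1.0 / 3), "G,H,I each": (d / 2) * (d / 3)},
    "#2 B keeps value, passes dampened value": {"F": d / 2, "C,D,E each": (d / 2) * (d / 3), "G,H,I each": (d / 2) * (d / 3)},
    "#4 B ignored, links treated as A's": {"F": d / 4, "C,D,E each": d / 4, "G,H,I each": (d / 4) * (d / 3)},
}

for name, values in scenarios.items():
    print(name)
    for target, value in values.items():
        print("  %s: %.4f" % (target, value))

With those numbers, C,D,E end up with about 0.14 in #1, 0.12 in #2 and 0.21 in #4, while G,H,I get about 0.12 in #1 and #2 but only 0.06 in #4 - the ordering described above.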

#1 would require a strange format for the link tables, which I doubt is in place. #3 seems improbable (but I haven't seen any tests on it). #4 allows pages to penalize others that are on a shared parent (imagine if B had 200 links ...), which seems improbable as well. To me, that just leaves #2.

What did I miss? It's getting late, and I'm sure I overlooked something :).

John

Edited by JohnMu, 27 July 2007 - 05:19 PM.


#3 EGOL

    Professor

  • Hall Of Fame
  • 5185 posts

Posted 27 July 2007 - 06:23 PM

What generous mathematics, John! Reminds me of the trace tables that I had to prepare for a course a long time ago. Thanks!

Yes, I am thinking about Google and pagerank... but also about internal link popularity, which could be especially valuable for MSN.

If I have a blog with a thousand posts and I disallow the post pages to reduce duplicate content, trivial content and dead links, I will gladly allow them to fall from the index. My concern is that those pages contribute to site mass and internal anchor text, and blocking them could result in a loss that is larger than the one I am trying to avoid.

I don't understand what the result might be and wonder if anyone does.

#4 JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 27 July 2007 - 06:57 PM

Could it be that it comes down to a simple question: do URLs not included in the index have pagerank? (Not included because they're not yet found, are behind a robots.txt disallow, or have a meta-robots tag of "noindex".)

If they do have pagerank (or whatever the other engines use) then they will have "something" to pass to the other URLs (provided they have a meta-robots tag with "follow"). If they don't have any pagerank, they wouldn't have anything to pass to their own links.

Assuming they do still have pagerank, that could mean that all the tricks currently in vogue to trim websites through the robots.txt and the meta-robots tag would not make much sense.

Hmm..

Assuming a page has 4 links: one to a normal page, one to a page that is missing (404), one to a page behind a robots.txt and one to a page with meta robots "noindex". What value would be passed to each page?

Maybe I'm just getting off track :) but I think it might be vital to your question. If a page has no value of its own, it wouldn't be able to pass anything through links or anchor text to its own linked pages.

John

#5 Respree

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 5901 posts

Posted 27 July 2007 - 07:47 PM

I'm guessing (like you), but if a page is not indexed (as described in all three scenarios), it seems to me that the 'juice' cannot be given, taken away, or transferred. The engines simply don't know about it. I could easily be wrong, but that's my guess.

Afterthought: I also think it's entirely possible that they may be giving the 'appearance' of not knowing about it, but covertly (only a conspiracy theory) indexing the page for other (non-SERP-displayed) ranking purposes. Given the increasing sophistication of their information gathering techniques, I wouldn't be surprised if this were actually happening. Of course, they'll never admit it. [/end_conspiracy_theory]

Edited by Respree, 27 July 2007 - 07:55 PM.


#6 Black_Knight

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 27 July 2007 - 09:43 PM

In the original PageRank papers, including the "Anatomy of" one, they mention precisely what they do with links where the page linked to is not included in the database/index ... they ignore it.

The part you are looking for concerns "dangling links" (which is to say, links where the resource on the other end is not indexed and is thus an unknown quantity, and may not even exist - 404) and "orphan pages" (which are pages that are in the index without any known links pointing to them anymore). Both of these kinds of URLs are removed from the process before the iterative calculations of PageRank commence.

A page you have asked not to be in the index is obviously not in the index. Therefore it has no known content, including no known links. Links to that page will technically be 'dangling links', since the content of that URL is not in the index, and thus those specific links will not count in the PageRank iterations.

Of course, there's been a lot of water under the bridge since those original papers, and a lot has changed with how PageRank is calculated, especially in the everflux guesstimates. There's no absolute guarantee that the same thing applies anymore, but original PageRank was pretty clear.

Since the dangling link is removed from the calculations, that share of 'juice' should be passed through the remaining links instead, strengthening them a little and giving them more passable juice to pass on too.
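
A quick sketch of that effect, using the simplified per-link model from earlier in the thread (the d = 0.85 value and the four-link page are assumptions for illustration only):

d = 0.85  # assumed dampening factor

# A page with 4 outbound links, one of which is a dangling link
# (it points to a URL that is not in the index).
value_if_dangling_counted = d / 4   # the dangling link still consumes a share
value_if_dangling_removed = d / 3   # the dangling link is dropped before the iterations

print(value_if_dangling_counted)   # 0.2125
print(value_if_dangling_removed)   # roughly 0.2833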

#7 EGOL

    Professor

  • Hall Of Fame
  • 5185 posts

Posted 27 July 2007 - 10:35 PM

Thanks Ammon. That looks like the best answer that I can get. Respree's rationalization agrees with this too.

I really appreciate your going back to the early documents on PR.

Cheers to both of you! This makes sense.... now to think about what to do.

#8 Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 27 July 2007 - 11:00 PM

Interesting trains of thought I hadn't visited yet.

A no-index page will still have its URL indexed. Shouldn't it then receive a PR weighting based on inbound links as well?

Otherwise a no-index document which receives the most links with phrase XYZ would not rank for that phrase even though the links and the URL are known?

Likewise the links in the document can be considered, no?

#9 EGOL

    Professor

  • Hall Of Fame
  • 5185 posts

Posted 28 July 2007 - 12:08 AM

I started thinking about this after watching
http://www.wolf-howl...ngine-friendly/

I feel that it is important to place each post in two categories, so I should eliminate some other occurrences of that content. I've decided on killing the post pages, as most are very short and the links often go dead.

I think that is what I want to do, but I'm not sure. I hate to lose a massive number of pages, but I've gotta kill something.

#10 JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 28 July 2007 - 02:01 AM

Do you have the full posts in the summary pages (archives, main page)? If so, perhaps it would make sense to just provide the introduction for the posts there and to only have the full post in the URL for the post itself. That would give the posting more unique content which would in turn make it more valuable to the search engines (of course this depends on how you have the postings -- if they're all fairly short then it wouldn't matter much).

Assuming what you mention is still correct, wouldn't that mean that a robots meta tag of "noindex, follow" would be treated the same as "noindex, nofollow"? I have seen many cases of URLs that were not visible in the index but which were still regularly crawled as if they were -- to me that could be a sign that the URL is actually indexed but filtered from the displayed results. I know, I know, we should stop discussing and just test it :).

Can we assume that if a page passes value to its links then it has to have value itself as well?

John

#11 Ron Carnell

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 2062 posts

Posted 28 July 2007 - 03:45 AM

Let's think of duplicate content pages as being "new" pages, pages just added to the site. That's a fair perspective, I think, because (1) you can create new dup content pages very quickly by just adding a blog post to a few more categories, and (2) new pages and duplicate content pages typically share a very important characteristic: they don't have any off-site links bringing in added PR.

The average PageRank or Link Popularity of a site is not increased by adding new pages.

On the contrary, new pages, pages that don't bring any off-site links to the table, will decrease the site average and, more importantly, bleed PR from more important pages. That's a mathematical certainty and, if anyone wants, we can explore the numbers that show that PR is finite and that outbound links (even to new pages) necessarily carry a cost. Of course, everyone here knows that the cost is actually an investment, because those new pages "should" eventually bring in some good link juice. That's not, however, necessarily true of duplicate content pages.
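
Here is a toy sketch of those numbers (the three-page site, the five added pages and d = 0.85 are assumptions, and real PageRank is far more involved than this):

def pagerank(links, d=0.85, iterations=100):
    # Toy PageRank by power iteration; links maps each page to its outbound links.
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        pr = {p: (1 - d) / n + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
              for p in pages}
    return pr

# A tiny site: the home page links to two content pages, which link back to it.
site = {"home": ["a", "b"], "a": ["home"], "b": ["home"]}

# The same site after adding five new pages that bring in no off-site links.
bigger = {"home": ["a", "b", "n1", "n2", "n3", "n4", "n5"],
          "a": ["home"], "b": ["home"],
          "n1": ["home"], "n2": ["home"], "n3": ["home"], "n4": ["home"], "n5": ["home"]}

print(round(pagerank(site)["a"], 3))    # about 0.257
print(round(pagerank(bigger)["a"], 3))  # about 0.076 - page "a" has lost value

The exact figures depend on how you normalize, but the direction is the point: the established page gives up a share of its value to the new, linkless pages.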

Losing "a massive amount of pages" is only going to adversely affect Link Pop if those pages have garnered off-site inbound links that bring in PR. Pages that don't bring in external PR don't increase the PR of your important pages, they drain it. It's not much different from creating a massive number of outbound links, which inevitably bleeds PR from your own site. Your internal links, to inconsequential pages, can have the same effect.

I think you might just discover that getting rid of the dead weight (if indeed it is dead weight) will ultimately have a positive impact on your Link Pop. I don't know if that's necessarily true of anchor text advantages (there aren't any published formulas for that), but the general consensus seems to be that the strength of anchor text is a function of PR, so it stands to reason that low PR pages are of little value there, too.

In my opinion, the trick to what you're considering will be to encourage visitors to link to indexed content, not the duplicate content you intend to exclude. Do that and your PR will increase, not decrease.

#12 DianeV

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 7216 posts

Posted 28 July 2007 - 07:39 AM

I'm wondering if you're talking about a blog-like set up, such as WordPress, which displays posts in full on all archive pages, such as category, yearly, monthly and daily pages. If so, it's my opinion that, not only is this bad in terms of a ridiculous amount of duplicate content, but it's confusing to visitors — after all, why should an article or post be displayed in full on multiple pages, particularly pages that one would normally think should be simply tables of content?

For WordPress sites, I always convert archive pages to tables of content — in which case, assigning an article/post to multiple categories will only result in adding that article's link (or a link plus an excerpt) to those category pages — a good thing. This is particularly helpful for large sites, where visiting a category page can lead to that dizzying "I'm wandering around in circles" or "Where am I?" feel.

At any rate, I wrote about how to do this with WordPress at How to Make your WordPress Blog a Real Website — the code's there too. It's just a matter of updating templates.

Edited by DianeV, 28 July 2007 - 07:42 AM.


#13 EGOL

    Professor

  • Hall Of Fame
  • 5185 posts

Posted 28 July 2007 - 09:06 AM

Thanks for all of these great responses and ideas......

JohnMu: Do you have the full posts in the summary pages (archives, main page)? If so, perhaps it would make sense to just provide the introduction for the posts there and to only have the full post in the URL for the post itself.

The full posts are on all pages. This is because they are so very short. The title is a link (almost always to another website) and the description is only two to three sentences. So it makes sense to show the entire post instead of requiring the visitor to click through to a page that has trivial content. I write two to three sentences now because they are going onto pages that get indexed, and I feel that the two to three sentences are needed. However, if these pages are not indexed, I will write only one to two sentences and save a little time. The posts are not attracting a lot of genuine comments, so no loss there.


DianeV: I'm wondering if you're talking about a blog-like set up, such as WordPress, which displays posts in full on all archive pages, such as category, yearly, monthly and daily pages. If so, it's my opinion that, not only is this bad in terms of a ridiculous amount of duplicate content,

Yes, it is a blog-like set-up. I will probably block indexing of the archive pages too. Doing that will have each post appear three times on the blog... It will appear on the index page for about one week before rolling off (pagination pages will be blocked from search engine indexing)... It will also appear on two category pages for a few weeks to a few months before rolling off (again no indexing of pagination pages).... BTW... nice post on Make Your Wordpress Blog a Real Website... I will probably use something from that later. Thanks!

#14 DianeV

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 7216 posts

Posted 28 July 2007 - 09:13 AM

You're welcome. I had just wanted to put all that information somewhere where people could make use of it.

If you're using WordPress, remember that you can substitute the_excerpt for the_content on archive (category, etc.) pages, and WP will automatically display only the excerpt of the post rather than the full content. That's easy enough to change on the archive template (archive.php of your theme).
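
The loop in archive.php then ends up looking something like this (a rough sketch only; the exact markup depends on your theme):

<?php if (have_posts()) : while (have_posts()) : the_post(); ?>
  <h2><a href="<?php the_permalink(); ?>"><?php the_title(); ?></a></h2>
  <?php the_excerpt(); // excerpt only, instead of the_content() ?>
<?php endwhile; endif; ?>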

#15 EGOL

    Professor

  • Hall Of Fame
  • 5185 posts

Posted 28 July 2007 - 09:20 AM

Ron Carnell: On the contrary, new pages, pages that don't bring any off-site links to the table, will decrease the site average and, more importantly, bleed PR from more important pages.

Exactly... the post pages and archive pages that I will probably block from indexing do not gain any links... however, the category pages and blog index pages do receive good links. In the past I have added large numbers of pages to websites and noticed that the rankings of the original pages dropped a little... On this site the post pages get just a few visitors per month each... but there are core pages on the site that get thousands per month each. Although dropping the post and archive pages will result in the loss of thousands of visitors per month, I believe that a slight increase in the rankings of the core pages will gain more visitors than are lost. Also, analytics tells me that traffic into the post pages is not high quality, while traffic into the core pages is of higher value.

Ron Carnell: I think you might just discover that getting rid of the dead weight (if indeed it is dead weight) will ultimately have a positive impact on your Link Pop. I don't know if that's necessarily true of anchor text advantages (there aren't any published formulas for that), but the general consensus seems to be that the strength of anchor text is a function of PR, so it stands to reason that low PR pages are of little value there, too.

Thanks, Ron. I agree, and you are making this decision easier for me. I am concerned about the loss of internal anchor NUMBERS, but the hope is that it will increase the STRENGTH of the remaining anchors - at least for Google. My guess is that it will have a negative impact on MSN rankings, since I sense MSN pays more attention to numbers than strength.

Edited by EGOL, 28 July 2007 - 09:21 AM.


#16 DianeV

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 7216 posts

Posted 28 July 2007 - 09:25 AM

Wow; you're a daring guy, EGOL. I would have thought that people search for what's on the post pages (short tail, long tail and all that).

#17 EGOL

    Professor

  • Hall Of Fame
  • 5185 posts

Posted 28 July 2007 - 09:40 AM

lol... contrarian thinking... or I am crazy.

Edited by EGOL, 28 July 2007 - 09:41 AM.


#18 DianeV

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 7216 posts

Posted 28 July 2007 - 09:46 AM

Aw, you edited it! I saw the original in my email. :)

Anyway, I agree. I say: get both.

#19 EGOL

    Professor

  • Hall Of Fame
  • 5185 posts

Posted 28 July 2007 - 09:59 AM

oops! That will teach me to think before pushing the button.

#20 Webnauts

    Whirl Wind Member

  • Members
  • 75 posts

Posted 12 March 2009 - 05:20 AM

Nocrawl instead of Nofollow: Pros and Cons

I would like to ask your opinion about a possible alternative to the "nofollow" attribute, which I will call here "bots=nocrawl".

I have, for example, a page linking to a page called example.html.

The link looks like this:
/example.html?bots=nocrawl

In the robots.txt I add this:

User-agent: Googlebot
Disallow: *bots=nocrawl
Noindex: *bots=nocrawl

In addition, I add X-Robots-Tag directives in the .htaccess file to prevent the robots.txt itself from being indexed, followed, etc.:

<FilesMatch "\.(txt)$">
Header set X-Robots-Tag "noindex,noarchive,nosnippet"
</FilesMatch>

The targeted page also has a meta robots directive of "noindex,noarchive,nosnippet", or this is done through X-Robots-Tag.

What difference do you see between the use of the "nofollow" attribute and the "bots=nocrawl" example as set up this way?

What are the possible pros and cons of using "bots=nocrawl" instead of the "nofollow" attribute?

To take this a step further, I was wondering what would happen if I used "bots=nocrawl" in destination URLs and also added the new "canonical" element on the targeted web pages (where applicable, i.e. duplicate pages or pages with similar content).
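
For reference, the canonical element I mean would sit on the target page like this (example.com is just a placeholder):

<link rel="canonical" href="http://www.example.com/example.html" />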

I am asking all over the place and I get different answers, so I would like to hear your thoughts too.

Thanks,

John

Edited by Webnauts, 12 December 2009 - 08:23 PM.



