Cre8asiteforums Internet Marketing
and Conversion Web Design


Using Rel=Canonical On A Pdf Document

solving dupe content problem


#1 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 02 February 2012 - 11:12 AM

I have a website with a few dozen pages, but it also has hundreds of .pdf documents. These .pdf documents are downloaded and printed by a lot of people. Each pdf contains an image file that has been formatted to print at a reliable scale on anyone's printer.

The pdf files also have a small amount of text and a clickable link to the homepage of my website. They are also optimized to display an SEOed title tag and rank well in the SERPs. Lots of people have linked to these pdfs.

I think that these pdfs are causing a duplicate content problem or a trivial content problem. To solve that I need to use rel=canonical in a way that attributes them back to the html page that the visitor uses to download them. Unfortunately there is no way to place a rel=canonical in a .pdf document (or I don't know of any way to do it). So I am going to fix this with .htaccess, following instructions from Search Engine People:
http://www.searcheng...ccess-file.html

My .htaccess lines will look as follows:

<FilesMatch "brass-widget-1.pdf">
Header set Link '<http://www.mysite.co...>; rel="canonical"'
</FilesMatch>
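For clarity, with the target URL written out in full (the page name here is just made up), the whole block would be:

<FilesMatch "brass-widget-1.pdf">
# requires mod_headers; the .html URL below is only an example
Header set Link '<http://www.mysite.com/brass-widgets.html>; rel="canonical"'
</FilesMatch>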


================================


I think that the above will work fine, but I have lots of files and want to use wildcards in the .htaccess such as this...


<FilesMatch "brass-*.pdf">
Header set Link '<http://www.mysite.co...>; rel="canonical"'
</FilesMatch>


================================


I have two questions:

1) Do you think that my method of dealing with this is the right method?

2) Do you know if my use of the wildcard in the .htaccess is correct?


Thank you!



#2 jonbey

    Eyes Like Hawk Moderator

  • Moderators
  • 4293 posts

Posted 02 February 2012 - 11:22 AM

I cannot answer either question, sorry!

But, is your plan so that the pdf's are no longer indexed? Is that the point of canonicalisation? Because if so, why not shift them all into their own directory and then block the bots?

#3 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 02 February 2012 - 11:35 AM

But, is your plan so that the pdf's are no longer indexed? Is that the point of canonicalisation? Because if so, why not shift them all into their own directory and then block the bots?

These pdfs have a lot of inbound links from other websites and I want to preserve that - or pass the linkvalue into my main site. That's why I didn't put them in a folder and block the bots.

I think that they should be indexed and followed. That's why I am only trying to apply rel=canonical.

.... but I am not quite sure that is the best approach.

#4 jonbey

    Eyes Like Hawk Moderator

  • Moderators
  • 4293 posts

Posted 02 February 2012 - 11:46 AM

Ah, I see.

The HTTP header (rather than the head section) seems to be the right way for PDFs, according to Google:
http://support.googl...n&answer=139394



Indicate the canonical version of a URL by responding with the Link rel="canonical" HTTP header. Adding rel="canonical" to the head section of a page is useful for HTML content, but it can't be used for PDFs and other file types indexed by Google Web Search. In these cases you can indicate a canonical URL by responding with the Link rel="canonical" HTTP header, like this (note that to use this option, you'll need to be able to configure your server):
Link: <http://www.example.c...ite-paper.pdf>; rel="canonical"

Google currently supports these link header elements for Web Search only.


As for doing something clever in htaccess ..... hopefully a techy will turn up in a moment!
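So, if I read that right, fetching one of your pdfs would then return response headers something like this (made-up URLs, just to show the shape of it):

HTTP/1.1 200 OK
Content-Type: application/pdf
Link: <http://www.example.com/widget-page.html>; rel="canonical"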

#5 Michael_Martinez

    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 02 February 2012 - 01:30 PM

You could try iFraming the PDF files. Move them to a new directory as Jon suggests. Block the crawlers from that directory. Then create iFraming pages at the old *.pdf URLs which allow you to implement the SEO meta directives you want to use. The iFraming pages would simply link to or load the .PDFs into their content.

I have seen this done on a few Websites. I'm not entirely clear on how they do it (possibly with AJAX). It has never occurred to me before to wonder how it's done.

#6 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 02 February 2012 - 01:44 PM

Thanks for the ideas... :)

Please consider this....

These pdfs have LOTS of links from other websites and they will accumulate a lot more links over time. I have them on my site because they are linkbait.

If I move these pdfs to a new folder and block them from the crawlers then their value as linkbait evaporates.

Yes? No?



#7 iamlost

    The Wind Master

  • Admin - Top Level
  • 4455 posts

Posted 02 February 2012 - 03:07 PM

I dislike and refuse to use rel=canonical (but then iamlost :)) so can't offer suggestions on how best to utilise it in your specific situation.

Given your usage of the pdf's (I'd have made them into html pages, with the pdf's as a download/print option blocked via robots.txt and .htaccess), I would recommend:
<meta name="robots" content="noindex, follow">
This means that the SEs crawl the pdf's as normal and all values flow as usual through/via the various links and citations but the pdf's themselves are removed from the public index, i.e. will not show up in search query results.

Note: technically the 'follow' is the default and so not necessary but I believe in redundancy. :)
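Note also: a pdf can't carry a meta tag itself, so for the pdf files the equivalent would be the X-Robots-Tag response header set server-side - a minimal sketch, assuming mod_headers is available:

<FilesMatch "\.pdf$">
# same directives as the robots meta tag, sent as an HTTP header
Header set X-Robots-Tag "noindex, follow"
</FilesMatch>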

Of course the optimal solution would be to determine how/why they are causing a duplication issue and correct it. In all your spare time. :D

#8 jonbey

    Eyes Like Hawk Moderator

  • Moderators
  • 4293 posts

Posted 02 February 2012 - 03:21 PM

You know, I was thinking similar. If all pdfs were pages, then you could slap ads on them and still link to the pdf as Iamlost suggests. If each pdf also linked to the html page and it was in a follow/noindex directory, then the pages should get the pagerank?

Maybe risky, making such a big change....

Maybe do a few and see how it goes.

#9 Michael_Martinez

    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 02 February 2012 - 03:34 PM

If I move these pdfs to a new folder and block them from the crawlers then their value as linkbait evaporates.

Yes? No?

The pages that would frame the PDF files would take up the old URLs.

In other words, if you have a PDF at:

www.example.com/my-cool.pdf

You would move that to:

www.example.com/blocked-directory/my-cool.pdf

and put up a normal HTML page at

www.example.com/my-cool.pdf

in which you use an iFrame to link to

www.example.com/blocked-directory/my-cool.pdf

Hence, all the links still pointing to (www.example.com/my-cool.pdf) would still lead people to your PDFs. You're just wrapping them in HTML envelopes that allow you to control crawler access to the PDF files and set some robots meta directives.

Is this necessary? I have no idea. It's just something to consider.
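A bare-bones sketch of such a wrapper page, served at the old URL and using the paths from the example above, might be (title and sizes are just placeholders):

<html>
<head>
<title>My Cool PDF</title>
<!-- the wrapper page is indexable; the pdf itself sits in the blocked directory -->
<meta name="robots" content="index, follow">
</head>
<body>
<iframe src="/blocked-directory/my-cool.pdf" width="100%" height="800">
<a href="/blocked-directory/my-cool.pdf">Download the PDF</a>
</iframe>
</body>
</html>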

#10 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 02 February 2012 - 04:37 PM

Of course the optimal solution would be to determine how/why they are causing a duplication issue and correct it. In all your spare time. :D

In my spare time! OK!

Each of my html pages links to several pdfs... so if I can rel=canonical several pdfs to a single html page then that single html page should be kickass in the SERPs. They already are kickass but this would make it even better.

If all pdfs were pages, then you could slap ads on them and still link to the pdf as Iamlost suggests. If each pdf also linked to the html page and it was in a follow/noindex directory, then the pages should get the pagerank?

PDFs do accumulate pagerank and pass pagerank to any links that are embedded within them.

Also, you can place ads in .pdfs. Adsense does not work but you can sell ads to others... or place ads in them that link to your own product pages.

And, most shopping carts can be triggered from "buy button" links placed in pdf documents. Most people just never think to try this.

.....and put up a normal HTML page at

www.example.com/my-cool.pdf

This is a really interesting idea.... a little sneaky... but I am going to think about it. Thanks.

#11 Michael_Martinez

    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 02 February 2012 - 07:01 PM

I hope it doesn't seem sneaky. It's just a way to wrap an object in HTML. People have been doing that for Flash, so why not for PDFs?

ON EDIT: I suppose you could also set up alternative URLs for the framing pages and just implement 301-redirects from the old PDF URLs to the new framing pages, but that is not very efficient in my opinion.



#12 A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 04 February 2012 - 03:44 AM

Does the page that links to the PDFs have substantial content, or noticeably less content than the PDF files?

If not, then preserving old URLs with short HTML pages and moving and blocking the actual PDFs can potentially remove lots of long tail traffic, can it not? ;)



#13 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 04 February 2012 - 10:09 AM

Hello Yura,

The html pages have a small amount of content but the pdf documents have an image and just a few words.

Moving or blocking the pdf documents will result in thousands of lost visitors per day. The html pages that link to them receive even more traffic.

I am thinking that I could redirect the current pdfs to the html pages that link to them. That should give those pages extra ranking power and claim some of the rankings that the pdfs will lose.

I could then provide the pdfs in a way that keeps them out of the index and attracts links, likes, etc. to the html page.

... but I don't know if that is the best solution.
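If I did go the redirect route, it would be a line like this in .htaccess for each one (file and page names here are made up):

# send the old pdf URL to the html page that links to it
Redirect 301 /widgets/brass-widget-1.pdf http://www.mysite.com/brass-widgets.html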

#14 A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 04 February 2012 - 12:31 PM

Well, what would you do ideally?
Ideally, you'd have visitors use your HTML pages and link to them.

So, it makes sense to encourage them to print whatever they want from the HTML page, so they could link to it later, if necessary. Does it mean that you'd have to create HTML pages for the current PDF links or use 301 redirects? Not sure entirely.

As for PDFs, ideally, you'd:
- link to and from them to HTML pages
- noindex PDFs
- use the HTML pages as link magnets.

Since your visitors will still be linking to PDFs, it doesn't make much sense to change URLs permanently with 301 redirects instead of using the rel="canonical" header (assuming rel=canonical passes PR as efficiently as a 301 does). Otherwise, you'd have to repeat the process with the new PDF URLs. Also, redirecting would make for a slightly worse visitor experience for those who have bookmarked the files.

So, to keep the links you'd rather use rel=canonical, than 301 redirects.

IMHO ;)



#15 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 04 February 2012 - 01:13 PM

Thank you for these ideas, Yura.

I agree that the pdfs should either be noindexed or have substantive content added.

I don't want to add substantive content because that would make them two pages long and they are currently formatted to print on a single page - and I don't want people who print them to get the page they want plus a big page of text that they don't need.

I have one question... if I noindex the pdfs will PR still pass through them?

#16 iamlost

    The Wind Master

  • Admin - Top Level
  • 4455 posts

Posted 04 February 2012 - 03:54 PM

I have one question... if I noindex the pdfs will PR still pass through them?

So long as you DO NOT include nofollow, as in
<meta name="robots" content="noindex, nofollow">
NOTE: DO NOT DO THE ABOVE UNLESS YOU REALLY UNDERSTAND THE CONSEQUENCES.

As I mentioned previously:
<meta name="robots" content="noindex, follow">
could simply be
<meta name="robots" content="noindex">
because the 'follow' is an implied default. I prefer to include it because I hate to rely on others to always correctly apply implied defaults.

One caution that I would like to throw out as a sea anchor: how any SE applies anything is subject to change, often without notice. Therefore I recommend that if you decide to proceed with noindexing the pdf's, you only do so on a limited number, e.g. 10%, and wait a month to see what, if any, change occurs. If all goes as expected (and I have a zillion such on pages without a problem) then phase in the remainder.
Note: the reason that I recommend phasing in changes is that SEs, especially G, have been known to get jittery with massive wholesale changes.

#17 DonnaFontenot

    Peacekeeper Administrator

  • Admin - Top Level
  • 3705 posts

Posted 04 February 2012 - 04:00 PM

According to Matt Cutts, PR will still pass through noindexed pages. From this interview: http://www.stonetemp...att-cutts.shtml

Eric Enge: Can a NoIndex page accumulate PageRank?

Matt Cutts: A NoIndex page can accumulate PageRank, because the links are still followed outwards from a NoIndex page.

Eric Enge: So, it can accumulate and pass PageRank.

Matt Cutts: Right, and it will still accumulate PageRank, but it won't be showing in our Index. So, I wouldn't make a NoIndex page that itself is a dead end. You can make a NoIndex page that has links to lots of other pages.

For example you might want to have a master Sitemap page and for whatever reason NoIndex that, but then have links to all your sub Sitemaps.

Eric Enge: Another example is if you have pages on a site with content that from a user point of view you recognize that it's valuable to have the page, but you feel that is too duplicative of content on another page on the site

That page might still get links, but you don't want it in the Index and you want the crawler to follow the paths into the rest of the site.

Matt Cutts: That's right. Another good example is, maybe you have a login page, and everybody ends up linking to that login page. That provides very little content value, so you could NoIndex that page, but then the outgoing links would still have PageRank.


And as iamlost said, that's only if you do a noindex, follow.

#18 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 04 February 2012 - 04:13 PM

<meta name="robots" content="noindex, nofollow">
NOTE: DO NOT DO THE ABOVE UNLESS YOU REALLY UNDERSTAND THE CONSEQUENCES.


:lol:

I know!

One of my competitors had their site redesigned and the designer tossed it up with noindex. They disappeared the next day and it took them a couple of weeks to figure out what was wrong.

Also.... I accidentally added that to an article on my site. After I removed it, google did not like the article and ignored it for a couple of months before indexing it - and this is a site that gets TONS of spidering.


According to Matt Cutts, PR will still pass through noindexed pages. From this interview: http://www.stonetemp...att-cutts.shtml

Thank you!

#19 Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 05 February 2012 - 05:35 AM

I'm quite sure that the <FilesMatch> option with the HTTP header can be tweaked to include some sort of wildcard...

http://httpd.apache....html#filesmatch

Apparently it accepts regex

The <Files> directive even accepts regular wildcards. Maybe you should experiment with that one first?
Just a single example, then check web-sniffer.net for the result...
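Something along these lines, say (the target page is made up):

<Files "brass-*.pdf">
Header set Link '<http://www.mysite.com/brass-widgets.html>; rel="canonical"'
</Files>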

----------edit:

Hmm, it appears to be difficult, if not impossible, to transfer the "whatever" in whatever.pdf into the actual Link header. None of my quick&dirty little tests seem to work :(



#20 Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 05 February 2012 - 07:56 AM

Ok how 'bout something like this?

Options +FollowSymlinks
RewriteEngine On

SetEnvIfNoCase Host "(.*)" HTTP_MY_HOST=$1
SetEnvIfNoCase Request_URI "(.*)" HTTP_MY_REQUEST_URI=$1

<FilesMatch "\.pdf$">
Header set Link '<http://%{HTTP_MY_HOST}e%{HTTP_MY_REQUEST_URI}e.html>; rel="canonical"'
# this creates an HTTP Link header pointing to http://yourdomain.com/whatever.pdf.html
# ...so if you can live with the .pdf.html bit, then stop here
</FilesMatch>

RewriteBase /
RewriteCond %{REQUEST_URI} ^(.+)\.pdf\.html$
RewriteRule ^(.+)\.pdf\.html$ /$1.html [R=301,L]
# this redirects /whatever.pdf.html to /whatever.html (if needed)

# Cheers, Wit :-)



#21 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 05 February 2012 - 09:08 AM

Wow!

Thank you, very much Wit!

This is another example of how you have the answers to some of the most difficult code questions.

I really appreciate your help!

#22 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 05 February 2012 - 11:15 AM

OK.... I am going to test part of this...

I have lots of folders with five .pdf documents in them. My goal is to remove them from the search engine index and transfer their PR to an shtml page outside of that folder with rel=canonical...

The code below will be placed as an .htaccess file within the /widget/ folder - so it should only act on files within that folder.

======================

Options +FollowSymlinks
RewriteEngine On

SetEnvIfNoCase Host "(.*)" HTTP_MY_HOST=$1
SetEnvIfNoCase Request_URI "(.*)" HTTP_MY_REQUEST_URI=$1

<FilesMatch "\.pdf$">
Header set Link '<http://www.mydomain....>; rel="canonical"'
Header set X-Robots-Tag "noindex"
</FilesMatch>

======================

#23 Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 05 February 2012 - 12:15 PM

So you're not linking individual PDFs to individual HTML pages?

#24 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 05 February 2012 - 12:40 PM

So you're not linking individual PDFs to individual HTML pages?

Right, multiple .pdfs are linked from single html pages.

This is an easy situation where all of the .pdfs in a folder go to a single html page.

My more difficult situation is where there are about 100 pdfs in a folder but only about 10 html pages link to them.
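One way I could handle that, I suppose, is a separate FilesMatch block for each group of pdfs that share a filename prefix - something like this (prefixes and page names below are made up):

<FilesMatch "^brass-.*\.pdf$">
Header set Link '<http://www.mydomain.com/brass.shtml>; rel="canonical"'
Header set X-Robots-Tag "noindex"
</FilesMatch>

<FilesMatch "^copper-.*\.pdf$">
Header set Link '<http://www.mydomain.com/copper.shtml>; rel="canonical"'
Header set X-Robots-Tag "noindex"
</FilesMatch>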

#25 A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 05 February 2012 - 01:27 PM

I don't want to add substantive content because that would make them two pages long and they are currently formatted to print on a single page - and I don't want people who print them get the page they want plus a big page of text that they don't need.

Actually, by having visitors print from HTML pages instead of PDF files, you'd have to create dedicated CSS for print media, which would also give you complete control over your visitors' printing experience (a quick sketch below).

In this case, you would:
- preserve the HTML content instead of PDF
- be able to place additional content to HTML files
- allow your visitors to get the printing experience they want, but you..
- have to create the best quality printing experience via CSS and your HTML page (hint: you can hide some content from printers and it shouldn't be punishable for cloaking, but correct me, if I'm wrong about this ;) ).
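A quick sketch of the idea (file and class names are made up):

<!-- in the <head> of the html page -->
<link rel="stylesheet" href="/css/print.css" media="print">

/* print.css - applied only when the page is printed */
.no-print { display: none; }  /* hide nav, ads, anything not wanted on paper */
body { font-size: 12pt; }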

#26 Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 05 February 2012 - 01:38 PM

I should point out here that Matt C (for what it's worth) has said specifically that Google does not like canonical relations between pages that are not at all similar. I think it was my bud Jon who linked to that interview (+ transcript)

#27 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 12 March 2012 - 08:29 AM

Here is an update on this...

I used htaccess to noindex and canonical about 250 .pdf documents. This is on a site where new pages (files) are usually crawled and displayed in the Google SERPs within 24 hours.

It is working. However, this deindexing is very very slow. The rate averages just a few files per day.

I am not complaining because it is working. Just surprised that it is taking so long.

#28 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 06 April 2012 - 08:29 AM

Another update...

About 150 of those .pdf files have fallen out of the google index - still about 100 to go.

However, today the rankings of this site are seeing a little lift. I hope that this sticks.

We have moved from #2 and #3, back up to #1 and #2. Let me tell you that one position drop can really damage your sales.

Thanks to everyone who helped in this thread!

#29 EGOL

    Professor

  • Hall Of Fame
  • 5177 posts

Posted 24 May 2012 - 11:18 PM

Finally, now, about three months later, almost all of my .pdfs are out of the index and the rankings on my .html pages are up a little and bringing in more traffic.

This has worked as I hoped but it took an awful long time. Rel=canonical is a "hint" that google must find and decide to honor.


