Google Showing Binary Data in SERPs
#1
Posted 21 June 2006 - 05:28 AM
http://www.google.co...en&lr=&filter=0
To add to the mess, the illegible contents of the zip files are making it into Google Sitemaps as "common words in my website".
Riiiight.
Screenshots here.
Is it just me or is Google REALLY broken now?
Pierre
#2
Posted 21 June 2006 - 06:14 AM
Perhaps it would have been a good idea to use rel=nofollow on those links, but it's too late now (and it seems strange to have to rebuild your site around Google's bugs). They should just watch the content-type tags (instead of the file extension).
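To illustrate the idea of trusting the Content-Type header rather than the file extension, here is a minimal sketch. The function names, the example URLs and the small allow-list are assumptions made up for this illustration; nothing here reflects how Google's crawler actually decides.

```python
# Hypothetical allow-list of non-"text/*" types that still carry extractable text.
TEXT_LIKE = {"application/pdf", "application/xhtml+xml", "application/xml"}


def media_type(content_type_header):
    """Return the lower-cased media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def should_index(url, content_type_header):
    """Decide from the Content-Type header alone; the URL's extension is ignored."""
    mtype = media_type(content_type_header)
    return mtype.startswith("text/") or mtype in TEXT_LIKE


if __name__ == "__main__":
    # A dynamic download URL that serves a zip file: skipped.
    print(should_index("http://example.com/download.php?id=3", "application/zip"))
    # A URL ending in .zip that actually serves HTML: indexed.
    print(should_index("http://example.com/file.zip", "text/html; charset=UTF-8"))
```

With this approach a download script that sends application/zip or application/octet-stream is skipped regardless of what its URL looks like.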
Mind if I cross-post to the Google-Sitemaps group? (or do you want to?) The Google employees behind sitemaps watch that group pretty closely and even if they don't respond, you can be sure that they will take a look at the issue.
Could it have something to do with the MySQL error? Perhaps they indexed the error, noticed the URL was a "text-type" URL and just refreshed the cache / index afterwards, ignoring the change in content-type? Google would never want to drop the URL once it has found indexable content...
John
#3
Posted 21 June 2006 - 06:20 AM
Easiest way to avoid this situation is to use robots.txt to block access to the download script (it's just the one that does everything). That will block access, but that's not the point: Google should not be interpreting zip files as text. PDFs are binary data, but you can extract useful text from them.
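As a rough sketch of the robots.txt approach, the snippet below uses Python's standard urllib.robotparser to check a rule before deploying it. The script path /download.php and the example.com URLs are assumptions for illustration only; the real name of the download script is not given in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "/download.php" is a made-up script path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /download.php
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/download.php?file=tool.zip",
        "http://example.com/index.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

The Disallow rule is a prefix match, so it covers every query-string variant of the one script.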
Please feel free to cross-post to the Google Sitemaps group. People know you there already, so it might get more notice.
Pierre
Edit: typo
Edited by eKstreme, 21 June 2006 - 06:22 AM.
#4
Posted 21 June 2006 - 07:28 AM


Google even indexes executables:

In all cases the headers look perfect.
Is it just me or is Google REALLY broken now?
Unbeliever!
* URL edit
Edited by Ruud, 21 June 2006 - 07:32 AM.
#5
Posted 21 June 2006 - 07:29 AM
This is very interesting!
Is this something new or did you notice this one before? Did you have any proper indexing before without the binary data showing?
To make matters more interesting, when I clicked on the link you provided, Google gave me a 404 error with the comment that they suspected this query was a machine-generated link!
Google is indexing non-HTML files more and more deeply. This is one experiment I am trying at the moment! My feeling is that they have opened up the zip file, and their algo picked it up as text and indexed it. Sometimes they will list a site and mention that the file format is not recognized. Check this one out that I got by just typing some random letters into a search query! It even politely asked me if I meant '& a gthu ooi'!
Please, John, when you mention it to Google Sitemaps, do tell them that there are two engineers who lost their pages recently, and that if they remove all these junk results they will have space to re-index our sites! On a serious note, I think this is the sort of thing that is causing the overload in Google's data centers!!
Post Edit: Sorry Ruud, we essentially said the same thing; we were busy typing at the same time!
Edited by yannis, 21 June 2006 - 07:31 AM.
#6
Posted 21 June 2006 - 07:48 AM
Is this something new or did you notice this one before? Did you have any proper indexing before without the binary data showing?
Given the number of Windows executables they have indexed so far, it must have started quite recently.
They should just watch the content-type tags (instead of the file extension).
Even the extension doesn't matter. I thought it would only affect dynamic download URLs, but this is a normal path to an executable:
#7
Posted 21 June 2006 - 07:58 AM
I only noticed it late last night while looking at the sitemap in question. I didn't make much of it, but something told me to dig deeper this morning. The kicker was when I did a site: search and saw one of those binary results in the first page. Then I searched for the download URL, and you know what happened next...
The site has been online since November, and I have not seen any problems with it.
Pierre
#8
Posted 21 June 2006 - 09:32 AM
I wonder if this could be used for SEO? A direct link to an executable from Google would be amazing promotion for new software (or an entry vector for a virus....).
John
#9
Posted 21 June 2006 - 09:58 AM
http://www.google.co...mp3&btnG=Search
<tsk tsk -- google the file sharing enabler?>
Could this have something to do with Google's push to get "everything" listed in their sitemap files?
John
Edit: it seems to have mangled the URL, I'm searching for "ÿÿÿÿÿÿÿÿÿ inurl:mp3" (hope it works here :-))
Edited by softplus, 21 June 2006 - 09:58 AM.
#10
Posted 21 June 2006 - 09:59 AM
I had my cat take a walk on my keyboard and you cannot believe the results I got!
#11
Posted 21 June 2006 - 10:12 AM
Actually, it's quite fun now to take a hex editor to your binary files and see what Google has indexed. Hmmm. DLLs, fonts, executables, zip files, MP3s, AVIs/MPEGs, Corel Draw files, etc. are all indexed.
(PDFs and office formats should be indexed as well, but they show the document type and you can actually search for them explicitly in the advanced search options. This has been there for a while now.)
I wonder if we could grab a glimpse of Google's internal files this way? Imagine being able to download (through their cache) one of their crawler executables / scripts. Hmmmm :ph34r: :ph34r:
John
PS FWIW, I use the free hex editor from hhd
PPS Image search anyone: GIF87a0 or JFIF Exif II (some of them are legitimate, some are broken)
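Strings like GIF87a and JFIF show up because many binary formats begin with a fixed signature ("magic bytes"), which is exactly what a hex editor shows at the top of the file. Below is a minimal sketch of recognizing a few of the formats mentioned in this thread by their leading bytes. The signature table is deliberately tiny and the function name is made up for illustration; it is not a complete file-type detector.

```python
# Leading byte signatures for a few of the formats named in this thread.
SIGNATURES = [
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),          # JPEG start-of-image marker
    (b"PK\x03\x04", "zip"),             # zip local file header
    (b"MZ", "windows executable"),      # DOS/PE header
    (b"%PDF-", "pdf"),
    (b"Rar!", "rar"),
]


def sniff(data):
    """Return a format name if data starts with a known signature, else None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


if __name__ == "__main__":
    print(sniff(b"GIF87a0\x00\x10"))
    print(sniff(b"MZ\x90\x00\x03"))
    print(sniff(b"<html><body>hello</body></html>"))
```

A check like this, run before text extraction, would be one way to keep zip files and executables out of a text index even when the headers are misleading.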
Edited by softplus, 21 June 2006 - 10:32 AM.
#12
Posted 21 June 2006 - 10:25 AM
http://72.14.235.104...en&ct=clnk&cd=1. This is actually quite a read!
#13
Posted 21 June 2006 - 10:42 AM
#14
Posted 21 June 2006 - 10:48 AM
In case it doesn't work:
"ÿÿÿÿÿ" -inurl:htm -inurl:html -inurl:php -inurl:asp -inurl:aspx
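A likely reason this query finds binary files: the byte 0xFF maps to the character ÿ in ISO-8859-1 (Latin-1), so a run of 0xFF bytes read as Latin-1 text becomes a run of ÿ. The short sketch below shows the mapping; that Google's indexer decoded the files this way is an assumption, not something confirmed in this thread.

```python
# Five 0xFF bytes, as commonly found in padding or audio data.
raw = b"\xff" * 5

# Treat the bytes as ISO-8859-1 (Latin-1) text, one character per byte.
as_text = raw.decode("latin-1")

if __name__ == "__main__":
    print(as_text)
    print(hex(ord(as_text[0])))
```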
#15
Posted 21 June 2006 - 11:01 AM
Sure, you can index that URL but based on your header request you shouldn't attempt to index it. And that is precisely what they did. In this case (Bulletproof, 1st result) they downloaded a 5.1MB file...
Also, for them to show the header info from an executable they have to parse the file: the octet-stream itself won't contain this information....
Hmmm.... Do we see a mix-up of desktop search technology and regular search technology? Or an expansion of the desktop search technology into the regular search technology?
#17
Posted 21 June 2006 - 11:19 AM
Big Daddy rears its... head. Let's keep this polite. BD was supposed to be an infrastructure upgrade, supposedly for future ideas. Is Google planning on indexing even binary data? What will happen when we do, ooh, site:flickr.com? I just talked with someone, and these results have been appearing occasionally for at least six weeks...
Ah yes, excellent idea. Thanks. @ekstreme: for those will-possibly-be-mangled Google URLs, try www.tinyurl.com
Edited by eKstreme, 21 June 2006 - 11:22 AM.
#18
Posted 21 June 2006 - 11:47 AM
Hmmm.... Do we see a mix-up of desktop search technology and regular search technology? Or an expansion of the desktop search technology into the regular search technology?
Ruud, I think we see a case of 'male' programming. Software bugs! Missing fields, etc. Anyone with an elementary knowledge of databases knows these are not meaningful. We are not talking about pages that they threw in the 'supplemental result dustbin'; they are mainstream.
The hidden part of the web (files like .exe, PDFs, etc.) is estimated at anything from 5 to 30 times the visible web. Once you start collecting these, you are in storage trouble. So what do you do? Now and then you add a filter, push the index through it, and remove pages. If the filters are based on the 'old ranking tune', webmasters lose pages! By the time you have done this, the new algos have collected another batch of 'trusted websites garbage'. Another round of late-night Googleplex programming, new filter... ah! Let's look at the quality of incoming links, another filter... a few billion more pages disappear! These are big problems! As Sergei said some time back, the algos sometimes have a life of their own!
These are big problems. If you have Google shares get rid of them!
Yannis
PS My guess is that these bugs were introduced by routines involved in parsing foreign languages and/or complicated sequences like those found in genetics.
Edited by yannis, 21 June 2006 - 11:54 AM.
#19
Posted 21 June 2006 - 12:22 PM
Ruud I think we see a case of 'male' programming. Software bugs!
Based on the zip and rar files, I would have been tempted to agree. But this indexed executable has been parsed. That is a deliberate action, not a bug. It's not a generic parser for anything non-HTML either; otherwise you would still have seen a whole bunch of garbled text as it tries to display characters.
#20
Posted 21 June 2006 - 12:41 PM
You are right that this started to appear about six weeks ago. About that time I noticed a considerable increase in traffic to one of my sites that has a lot of PDFs. I am currently doing an experiment to see if one could optimize words in a PDF file, as well as check if links in PDFs are indexed! This, I guess, has thrown me a bit off course. I will need to rethink the experiment!
Interesting times! Have you got any thoughts as to why they would index files like this?