Cre8asiteforums Internet Marketing
and Conversion Web Design


Google Showing Binary Data in SERPs


22 replies to this topic

#1 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 21 June 2006 - 05:28 AM

I was doing some analysis on one of my sites, and I found a little broken gem: Google is showing binary data (from zip files) in the SERPs:

http://www.google.co...en&lr=&filter=0

To add to the mess, the illegible contents of the zip files are making it into Google Sitemaps as "common words in my website".

Riiiight.

Screenshots here.

Is it just me or is Google REALLY broken now?

Pierre

#2 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 21 June 2006 - 06:14 AM

What's your sitemap URL? Are you listing those files in it?

Perhaps it would have been a good idea to use rel=nofollow on those links, but it's too late now (and it seems strange to have to rebuild your site around Google's bugs). They should just watch the Content-Type headers (instead of the file extension).
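
As a rough illustration of that idea, here's a minimal sketch (not Google's actual pipeline; the URL and the list of text types are made up for the example) that asks for the headers first and only treats a URL as indexable text when the Content-Type says so:

    # Sketch: decide indexability from the Content-Type header rather than
    # the file extension. URL and TEXT_TYPES are hypothetical examples.
    import urllib.request

    TEXT_TYPES = ("text/html", "text/plain", "application/xhtml+xml")

    def looks_indexable(url):
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            ctype = resp.headers.get("Content-Type", "")
        # application/octet-stream, application/zip, etc. fall through as binary
        return ctype.split(";")[0].strip().lower() in TEXT_TYPES

    print(looks_indexable("http://www.example.com/download.php?file=archive.zip"))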

Mind if I cross-post to the Google-Sitemaps group? (or do you want to?) The Google employees behind sitemaps watch that group pretty closely and even if they don't respond, you can be sure that they will take a look at the issue.

Could it have something to do with the MySQL error? Perhaps they indexed the error, noticed the URL was a "text-type" URL and just refreshed the cache / index afterwards, ignoring the change in content-type? Google would never want to drop the URL once it has found indexable content...

John

#3 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 21 June 2006 - 06:20 AM

Well, Google Sitemaps shows the Content-Type headers it receives, and unsurprisingly it's finding a lot of octet-streams (binary data). I never thought it would try to interpret them as text, index them, and treat them as content!

The easiest way to avoid this situation is to use robots.txt to block access to the download script (there's just the one, and it handles everything). That would block access, but that's not the point: Google should not be interpreting zip files as text. PDFs are binary data too, but you can extract useful text from them.
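
For what it's worth, a rule like that can be sanity-checked with Python's standard robot parser; a minimal sketch, assuming a hypothetical script name of /download.php:

    # Sketch: confirm a robots.txt rule keeps crawlers away from a download
    # script. "/download.php" and the URLs are hypothetical examples.
    import urllib.robotparser

    rules = [
        "User-agent: *",
        "Disallow: /download.php",
    ]
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)
    print(rp.can_fetch("Googlebot", "http://example.com/download.php?file=a.zip"))  # False
    print(rp.can_fetch("Googlebot", "http://example.com/index.html"))               # True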

Please feel free to cross-post to the G sitemaps group. People know you there already, so it might get more notice :D

Pierre

Edit: typo

Edited by eKstreme, 21 June 2006 - 06:22 AM.


#4 Ruud

Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 21 June 2006 - 07:28 AM

Nice catch. You're not alone.

[screenshots]

Google even indexes executables:

[screenshot]

In all cases the headers look perfect.

Is it just me or is Google REALLY broken now?


Unbeliever! :) Well, no, it's just the new crawling priority... too many links pointing at those files :D

* URL edit

Edited by Ruud, 21 June 2006 - 07:32 AM.


#5 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 21 June 2006 - 07:29 AM

Pierre

This is very interesting!

Is this something new or did you notice this one before? Did you have any proper indexing before without the binary data showing?

To make matters more interesting, when I clicked on the link you provided, Google gave me a 404 error with a comment saying they suspected the query was a machine-generated link!

Google is indexing non-HTML files more and more deeply. This is one experiment I am trying at the moment! My feeling is that they opened up the zip file and their algo picked it up as text and indexed it. Sometimes they will list a site and mention that the file format is not recognized. Check out this one, which I got by just typing some random letters into a search query! It even politely asked me if I meant '& a gthu ooi'!

Please, John, when you mention it to the Google Sitemaps group, do tell them that there are two engineers here who lost their pages recently, and that if they remove all these junk results they'll have space to re-index our sites! On a serious note, I think this is the sort of thing that is causing the overload on Google's data centers!

Post Edit: Sorry Ruud, we essentially said the same thing; we were busy typing at the same time!

Edited by yannis, 21 June 2006 - 07:31 AM.


#6 Ruud

Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 21 June 2006 - 07:48 AM

Is this something new or did you notice this one before? Did you have any proper indexing before without the binary data showing?


Given the number of Windows executables they have indexed so far, it must have started quite recently.

They should just watch the Content-Type headers (instead of the file extension).


Even the extension doesn't matter. I thought it would only affect dynamic download URLs, but this is a normal path to an executable:

[screenshot]

#7 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 21 June 2006 - 07:58 AM

Wow, that's quite scary how extensive this is.

I only noticed it late last night while looking at the sitemap in question. I didn't make much of it, but something told me to dig deeper this morning. The kicker was when I did a site: search and saw one of those binary results on the first page. Then I searched for the download URL, and you know what happened next...

The site has been online since November, and I have not seen any problems with it.

Pierre

#8 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 21 June 2006 - 09:32 AM

It's just a little bad data push, don't worry about it. It'll get worked out in the algo and all outbound links from those executables will be discounted.

I wonder if this could be used for SEO? A direct link to an executable from Google would be amazing promotion for a new piece of software (or an entry vector for a virus....).

John

#9 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 21 June 2006 - 09:58 AM

It's also got mp3's indexed:
http://www.google.co...mp3&btnG=Search
<tsk tsk -- google the file sharing enabler?>

Could this have something to do with Google's push to get "everything" listed in their sitemap files?

John

Edit: it seems to have mangled the URL, I'm searching for " inurl:mp3" (hope it works here :-))

Edited by softplus, 21 June 2006 - 09:58 AM.


#10 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 21 June 2006 - 09:59 AM

It's actually a big problem. Do a search for '&&&&&&&&&&&&&&&&&&&&&&&' and you get a hell of a lot of results. Some of them are incomprehensible as to how they got there; for example, the website http://www.yezzz.com/auctions/1013500/ comes up in those results, but the search string is nowhere to be found! There are PDFs indexed this way, billions of them! One would think they would have made their way into the supplemental results at least.

I had my cat take a walk on my keyboard and you cannot believe the results I got!

#11 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 21 June 2006 - 10:12 AM

"Bad data push" -- perhaps Matt's cat walked over his keyboard and it started indexing from there :)

Actually, it's quite fun now to take a hex editor to your binary files and see what Google has indexed. Hmmm. DLLs, fonts, executables, zip files, mp3's, avi/mpeg's, Corel Draw files, etc., all indexed :)

(PDFs and Office formats should be indexed as well, but those show the document type and you can actually search for them explicitly in the advanced search options. That has been there for a while now.)
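
In the same spirit, here's roughly what that naive "index the bytes as text" extraction amounts to; a minimal sketch (the file name is a made-up example) that pulls out every run of printable ASCII, the way a strings tool would:

    # Sketch: extract runs of printable 7-bit ASCII from a binary file,
    # roughly the junk you'd see in a hex editor. Path is hypothetical.
    import re

    def ascii_strings(path, min_len=4):
        with open(path, "rb") as f:
            data = f.read()
        # [ -~] is the printable ASCII range, space through tilde
        return [m.decode("ascii") for m in re.findall(rb"[ -~]{%d,}" % min_len, data)]

    for s in ascii_strings("archive.zip")[:20]:
        print(s)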

I wonder if we could grab a glimpse of Google's internal files this way? Imagine being able to download (through their cache) one of their crawler executables / scripts. Hmmmm :ph34r: :ph34r:

John

PS FWIW, I use the free hex editor from hhd

PPS Image search anyone: GIF87a0 or JFIF Exif II (some of them are legitimate, some are broken)

Edited by softplus, 21 June 2006 - 10:32 AM.


#12 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 21 June 2006 - 10:25 AM

I don't know about their .exe files, but you can certainly look at some of their PDFs!

http://72.14.235.104...en&ct=clnk&cd=1. This is actually quite a read!

#13 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 21 June 2006 - 10:42 AM

I like this PDF http://www.google.co...llwardBrown.pdf (though I imagine it's linked normally on the site somewhere). Some interesting reading in those PDFs :)

#14 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 21 June 2006 - 10:48 AM

Try this: "" -inurl:htm -inurl:html -inurl:php -inurl:asp -inurl:aspx

In case it doesn't work:

"" -inurl:htm -inurl:html -inurl:php -inurl:asp -inurl:aspx

#15 Ruud

Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 21 June 2006 - 11:01 AM

Bad data push? Bad data. I don't see why this data would be there to push out to begin with. How hard is it to understand "Content-Type: application/octet-stream"?

Sure, you can index that URL, but based on the response to your header request you shouldn't attempt to index the content. And that is precisely what they did. In this case (Bulletproof, 1st result) they downloaded a 5.1MB file...

Also, for them to show the header info from an executable they have to parse the file: the raw octet-stream won't surface that information by itself....

Hmmm.... Do we see a mix-up of desktop search technology and regular search technology? Or an expansion of the desktop search technology into the regular search technology?

#16 Ruud

Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 21 June 2006 - 11:11 AM

I just talked with someone, and these results have been appearing occasionally for at least six weeks...

@ekstreme: for those will-possibly-be-mangled Google URL's, try www.tinyurl.com :)

#17 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 21 June 2006 - 11:19 AM

I just talked with someone, and these results have been appearing occasionally for at least six weeks...

Big Daddy rears its... head. Let's keep this polite. BD was supposed to be an infrastructure upgrade, reportedly in preparation for future ideas. Is Google planning on indexing even binary data? What will happen when we do, ooh, site:flickr.com?

@ekstreme: for those will-possibly-be-mangled Google URL's, try www.tinyurl.com

Ah yes, excellent idea. Thanks :)

Edited by eKstreme, 21 June 2006 - 11:22 AM.


#18 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 21 June 2006 - 11:47 AM

Hmmm.... Do we see a mix-up of desktop search technology and regular search technology? Or an expansion of the desktop search technology into the regular search technology?


Ruud, I think we see a case of 'male' programming. Software bugs! Missing fields, etc. Anyone with an elementary knowledge of databases knows these are not meaningful results. And we are not talking about pages thrown into the 'supplemental result dustbin'; these are mainstream. The hidden part of the web (files like .exe, PDFs, etc.) is estimated at anything from 5 to 30 times the size of the visible web. Once you start collecting these, you're in storage trouble. So what do you do? Now and then you add a filter, push the index through it, and remove pages. If the filters are based on the 'old ranking tune', webmasters lose pages! By the time you have done this, the new algos have collected another batch of 'trusted website' garbage. Another round of late-night Googleplex programming, a new filter... ah, let's look at the quality of incoming links... another filter, and a few billion more pages disappear! As Sergey said some time back, the algos sometimes have a life of their own!

These are big problems. If you have Google shares, get rid of them!

Yannis

PS: My guess is that these bugs were introduced by routines involved in parsing foreign languages and/or complicated sequences like those found in genetics.

Edited by yannis, 21 June 2006 - 11:54 AM.


#19 Ruud

Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 21 June 2006 - 12:22 PM

Ruud, I think we see a case of 'male' programming. Software bugs!


Based on the zip and rar files, I would have been tempted to agree. But this indexed executable has been parsed. That is a deliberate action, not a bug. It's not a generic parser for anything non-HTML either; otherwise you would still see a whole bunch of garbled text as it tried to display the characters.

#20 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 21 June 2006 - 12:41 PM

Ruud, agreed. As a matter of fact, they are all parsed. If you select strings from these files and do another search, most of them will pick up results. There are all sorts of files indexed: DLLs, .dat files, etc. That doesn't make it less of a software problem. These are not meaningful results for anyone, nor do they belong in an index, especially a main index.

You are right that this started to appear about six weeks ago. Around that time I noticed a considerable increase in traffic to one of my sites that has a lot of PDFs. I am currently running an experiment to see if one can optimize words in a PDF file, as well as check whether links in PDFs are indexed! This, I guess, has thrown me a bit off course. I will need to rethink the experiment!

Interesting times! Have you got any thoughts as to why they would index files like this?

#21 Ruud

Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 21 June 2006 - 12:52 PM

They're not all parsed. If they were, they would be showing header information for the ZIP and RAR files, and ID3 tags for the MP3s. What we see instead in those cases is just the ASCII output of the octet-stream.
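
For contrast, here's roughly what "showing header information" for a ZIP would look like; a minimal sketch using Python's standard zipfile module (the archive name is a made-up example):

    # Sketch: the kind of real, parsed data a ZIP header carries, as
    # opposed to raw ASCII from the octet-stream. Archive is hypothetical.
    import zipfile

    with zipfile.ZipFile("archive.zip") as z:
        for info in z.infolist():
            print(info.filename, info.file_size, info.date_time)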

PDFs have been indexed for a while, and yes, it can pay off to optimize them.

As to why they would consider indexing other binary formats... The easy thing with desktop search is that you can find files based on what is inside them, not just their filenames.

What if you were to index the metadata of all files on the web and make that searchable? Find an MP3 based on its ID3 tags. Find an image not by filename or surrounding text but by its IPTC tags. Get to a specific version of a DLL because you can search for the version information stored in the DLL itself.
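
A minimal sketch of that kind of metadata extraction, reading the fixed-layout ID3v1 tag from the last 128 bytes of an MP3 (the file name is a made-up example; most real-world MP3s use the more involved ID3v2 format):

    # Sketch: read an ID3v1 tag, a "TAG" marker followed by fixed
    # 30-byte title/artist/album fields at the end of the file.
    def read_id3v1(path):
        with open(path, "rb") as f:
            f.seek(-128, 2)            # ID3v1 lives in the last 128 bytes
            tag = f.read(128)
        if not tag.startswith(b"TAG"):
            return None                # no ID3v1 tag present
        def field(a, b):
            return tag[a:b].rstrip(b"\x00 ").decode("latin-1", "replace")
        return {"title": field(3, 33), "artist": field(33, 63), "album": field(63, 93)}

    print(read_id3v1("song.mp3"))      # hypothetical file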

#22 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 21 June 2006 - 01:23 PM

Hi Yannis
Like I said, they have been indexing (and ranking) PDFs for quite a while now (a few years at least). Back when my site was an SEO nightmare, my PDFs would consistently rank higher than my "real website" (even things like price lists would rank higher than the product pages!). :)
John

I agree, Ruud, it could make sense to index binary files, but only when they're explicitly crawling known file types, and even then they would only want to extract the real data, not just any 7-bit ASCII sequence they find.

I know the international part is a bit neglected in the US, but binary files can have "international" content (umlauts, Japanese characters, etc.) as well; you can't just extract anything that looks like an English string, it doesn't make sense. You can only use data that is in a known format (assuming they're doing it on purpose). It doesn't make sense to index the binary part of an image when (a) they could be using the image file for image search, and (b) any textual (7-bit) content in there is bound to be random (other than known EXIF information).
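
As a tiny illustration of that international point (the string is a made-up example): text stored as UTF-16, which is common inside Windows binaries, is invisible to a naive 7-bit ASCII extractor, because every other byte is a NUL:

    # Sketch: naive 7-bit ASCII extraction misses UTF-16 text entirely,
    # since the interleaved NUL bytes break up every printable run.
    import re

    raw = "Müller Straße 42".encode("utf-16-le")   # hypothetical embedded string
    print(re.findall(rb"[ -~]{4,}", raw))          # prints []: no run survives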

John

#23 Ruud

Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 21 June 2006 - 01:38 PM

...even then they would only want to extract the real data, not just any 7-bit ASCII sequence they find.


True. That is why the indexed executables are the odd ones out. Real data is extracted there. Maybe zip, rar, mp3, etc. are next?

Either way, if we only saw ASCII sequences it would look like a 100% fluke. But with those executables neatly parsed in between... I'm not sure this is a bug rather than something they're working towards.


