Google Code Search

Google has just made Code Search public -- pretty slick. Finally something for the programmers :) (but real programmers never have to look things up anyway, ha ha).


However, there are a few things which might influence web search as well:


  • Google searches inside of "zip" files (and tar.gz, jar, etc).
    That means that if you have content which should not be indexed, it is not safe to just place it within zip files. On the other hand, content inside your zip files can now be found through search. I wonder how it handles password-protected zip files? Or broken zip files (like those used to attack antivirus solutions that unzip files to check them)?
  • Google extracts information from your code, like which license it uses and which language it's written in
    Hmmm, where does it get that information from? Probably pattern matching for known license texts.
  • Google might be indexing your JavaScript code after all
    No more hiding stuff in JavaScript on the assumption that Google doesn't index it. I wonder how this is applied to pages that just use external JavaScript files (compared to those that explicitly link to them for a download or include them in a zip file).
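
To illustrate the guess about license detection above: a minimal sketch of what "pattern matching for known license texts" could look like. This is not how Google does it (nobody outside knows); the phrase list and the `guess_license` helper are made up for illustration, and a real detector would need many more patterns.

```python
import re

# Hypothetical marker phrases taken from well-known license headers.
# Assumption: matching one characteristic sentence is "good enough".
LICENSE_PATTERNS = {
    "GPL": re.compile(r"GNU General Public License", re.IGNORECASE),
    "MIT": re.compile(r"Permission is hereby granted, free of charge", re.IGNORECASE),
    "Apache": re.compile(r"Apache License,? Version 2\.0", re.IGNORECASE),
    "BSD": re.compile(r"Redistribution and use in source and binary forms", re.IGNORECASE),
}


def guess_license(source_text, header_chars=2000):
    """Return the first license whose marker phrase appears near the top of the file."""
    # License notices usually sit in the leading comment block,
    # so only the first few thousand characters are inspected.
    header = source_text[:header_chars]
    for name, pattern in LICENSE_PATTERNS.items():
        if pattern.search(header):
            return name
    return None


if __name__ == "__main__":
    sample = "/* Permission is hereby granted, free of charge, to any person ... */"
    print(guess_license(sample))
```

Anything without a recognizable header simply comes back as `None`, which would explain why some results show no license at all.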


One fairly problematic issue I see with a tool like this (and I'm sure they've thought of it as well) is that you can now easily search for known issues with open source (or for that matter: any indexed) code. Say there is a known exploit when a script uses certain functions in predictable ways: you can now search for that (using regular expressions), find out where it's used, and exploit the scripts. Sure, this was possible before, but you would have needed to download all those scripts and run the search manually on your own system. Now you can search all indexable scripts within a few seconds.
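
For anyone who wants to check their own downloads before someone else does: a rough sketch of the "manual" version of that search, run locally over a zip archive with a regular expression. The `search_zip` and `make_demo_zip` helpers, the file names and the example pattern are all invented for illustration.

```python
import io
import re
import zipfile


def search_zip(zip_bytes, pattern):
    """Return (filename, line number, line) for every line in the archive matching pattern."""
    regex = re.compile(pattern)
    hits = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Decode leniently so binary or oddly encoded members don't abort the scan.
            text = archive.read(info).decode("utf-8", errors="replace")
            for line_no, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append((info.filename, line_no, line.strip()))
    return hits


def make_demo_zip():
    """Build a small in-memory archive (hypothetical content) to search."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("form.php", "<?php\n$id = $_GET['id'];\neval($_GET['cmd']);\n")
        archive.writestr("safe.php", "<?php\necho 'hello';\n")
    return buffer.getvalue()


if __name__ == "__main__":
    # Example pattern: request input passed straight into eval().
    for hit in search_zip(make_demo_zip(), r"eval\(\s*\$_(GET|POST)"):
        print(hit)
```

The sketch makes no attempt to handle password-protected or deliberately broken archives, so it says nothing about how a crawler would treat those.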


How do you rank for code search? Since your code is usually only linked from very few places within your own site (and hardly ever from the outside directly), I expect your own site's general value ("PR" if you will) is a strong factor. Within the code it's hard to determine important sections (no headers, no bold, etc.), but perhaps they use term frequency? How do they determine if a piece of code is relevant for your search term or not?


How do you make sure that your "current" code is indexed and perhaps the older versions are removed? How do you keep Google from indexing your "bad examples"?


Fun stuff. Finally something for the geeks among us :huh:. Hey - look, someone used my code snippets with my original comments in them :D! No more easter eggs .. :(



Oops, now it's also all too easy to stumble upon confidential code which is accidentally online ... I wonder what it takes to get your snippets out of Google Code Search ...




PS what do you do when you see "FINDERS ARE ASKED TO DESTROY THIS DOCUMENT" :huh:?


PPS would it make sense to post vulnerable queries here, keep them to myself or try to find someone at Google who can block them (and how)?

I like the way they have the prominent message


Search public source code.


Which is basically saying that if it is on the web and not blocked by robots.txt, then you obviously intend the information to be 'public' for everybody.
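
If you want to check what your own robots.txt actually blocks, here is a small sketch using Python's standard `urllib.robotparser`. The rules, paths and the `is_crawlable` helper are made-up examples, and whether the code crawler applies robots.txt to files inside archives is an open assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are examples only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /downloads/old-versions/
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/downloads/current/tool.zip",
        "http://example.com/downloads/old-versions/tool-0.1.zip",
        "http://example.com/private/notes.txt",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

Keep in mind that robots.txt only asks well-behaved crawlers to stay away; the files themselves are still reachable by anyone who has the URL.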

Oh, great another thing to worry about.

Even as a programmer, I'm not sure how much I would utilise that.


However, like FP_Guy said:


"Oh, great another thing to worry about."

