Peeking Into Google
Here are a couple of snippets:
Google replicates the Web pages it caches by splitting them up into pieces it calls "shards." The shards are small enough that several can fit on one machine. And they're replicated on several machines, so that if one breaks, another can serve up the information. The master index is also split up among several servers, and that set also is replicated several times. The engineers call these "chunk servers."
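To make the shard-and-replica idea concrete, here is a minimal sketch in Python. It is not Google's design: the shard count, replica count, machine names, hash-based assignment, and the helper names `shard_for`, `replicas_for`, and `pick_server` are all assumptions chosen for illustration. It only shows the two properties the excerpt describes: several shards land on one machine, and each shard lives on several machines so a failure can be routed around.

```python
import hashlib

# All values below are illustrative assumptions, not figures from the article.
NUM_SHARDS = 8
REPLICAS = 3
MACHINES = ["machine-%d" % i for i in range(6)]  # hypothetical machine names


def shard_for(url):
    """Map a page URL to a shard number using a stable hash."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def replicas_for(shard):
    """Return the machines holding copies of a shard.

    Consecutive machines are used, wrapping around the list, so every
    machine ends up holding several different shards.
    """
    return [MACHINES[(shard + k) % len(MACHINES)] for k in range(REPLICAS)]


def pick_server(url, down):
    """Return the first live machine that holds the shard for this URL."""
    for machine in replicas_for(shard_for(url)):
        if machine not in down:
            return machine
    raise RuntimeError("all replicas for this shard are down")
```

If the first machine in a shard's replica list is placed in the `down` set, `pick_server` falls through to the next copy, which is the failover behavior the excerpt describes.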
The company also is applying machine learning to its system to give better results. Theoretically, Hoelzle said, if someone searches for "Bay Area cooking class," the system should know that "Berkeley courses: vegetarian cuisine" is a good match even though it contains none of the query words.
To do this, the system tries to cluster concepts into "reasonably coherent" subclusters that seem related. These clusters, some tiny and some huge, are named automatically. Then, when a query comes in, the system produces a probability score for the various clusters. This kind of machine learning has had little success in academic trials, Hoelzle said, because they didn't have enough data. "If you have enough data, you get reasonably good answers out of it."
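A toy version of the cluster scoring can be sketched in Python. Everything here is an assumption: the clusters are hand-written word sets (the system described above learns and names them automatically from far more data), and `tokenize`, `cluster_probs`, and `similarity` are hypothetical helpers. The sketch only illustrates how two texts with no words in common can still score as related once both are mapped to probabilities over concept clusters.

```python
import re

# Hypothetical, hand-written clusters for illustration only.
CLUSTERS = {
    "bay_area": {"bay", "area", "berkeley", "oakland", "francisco"},
    "cooking": {"cooking", "cuisine", "vegetarian", "recipe", "chef"},
    "education": {"class", "classes", "course", "courses", "lesson"},
    "sports": {"football", "score", "league", "team"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cluster_probs(text, smoothing=0.1):
    """Return a probability for each cluster given the words in the text.

    Each cluster's raw score is the number of words that belong to it,
    plus a small smoothing term; scores are normalized to sum to 1.
    """
    words = tokenize(text)
    raw = {
        name: smoothing + sum(1 for w in words if w in vocab)
        for name, vocab in CLUSTERS.items()
    }
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def similarity(text_a, text_b):
    """Score two texts by the overlap of their cluster probabilities."""
    pa, pb = cluster_probs(text_a), cluster_probs(text_b)
    return sum(pa[name] * pb[name] for name in CLUSTERS)


query = "Bay Area cooking class"
match = "Berkeley courses: vegetarian cuisine"
unrelated = "football league score"

# The query and the match share no words, yet score higher than the
# unrelated text because they fall into the same clusters.
print(similarity(query, match) > similarity(query, unrelated))
```

With only four tiny clusters the scores are crude, which is consistent with Hoelzle's point that the approach depends on having enough data.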
Nice to get a quick peek under the hood now and then.