Lars Eilstrup Rasmussen works in Google's Sydney office, and he was a lead engineer on the team that created Google Maps.
He and his brother, Jens Eilstrup Rasmussen, founded the mapping startup Where 2 Technologies, which Google acquired in October of 2004.
The brothers filed a patent application which describes a process for assigning geographical information to web pages. It was published earlier today.
If you find the following patent application interesting, you may also enjoy this one: System for automatically integrating a digital map system
United States Patent Application 20050182770
August 18, 2005
Assigning geographic location identifiers to web pages
A system and method for assigning geographic location identifiers to web documents may include identifying a set of web documents. A geographic location identifier included within a first web document in the set of web documents may be identified. The identified geographic location identifier may be assigned to a second web document in the set of web documents based on a relevancy of the first web document to the second web document.
Inventors: Lars Eilstrup Rasmussen, and Jens Eilstrup Rasmussen
In short, the system:
1. identifies a number of web pages;
2. looks for location information within those pages;
3. assigns locations to pages which include geographic information;
4. assigns locations to pages "relevant" to those pages that include geographical information.
Reasons for the patent:
Keyword-based search engines failed to geographically define web pages when trying to use:
1. Search engine manual assignment of locations to pages
2. Site owner manual assignment of locations to pages
3. Use of geographic meta tags
4. Search engine assignment of locations based on postal addresses appearing on the same pages as the keywords.
Assignment of geographic location identifiers
"Geographic location identifiers" found on web pages can be assigned to other pages, which might or might not include geographic identifiers of their own, after relevancy criteria are examined. This allows pages without location information to be included in a geography-based search. Those relevancy factors may include:
1. relative distance between documents,
2. the terminology used, and
3. Whether the page is on the same site.
A geographic location identifier may be:
1. a partial or complete postal address,
2. telephone number,
3. area code,
4. airport codes,
5. landmark identifiers,
6. other values tied to physical locations, such as longitude and latitude.
An identifier may also be assigned based upon hyperlinks between pages without geo information and seemingly related pages which do have location information.
Other documents, such as directories may be useful in associating location identifiers.
Pattern matching may be used to associate documents with locations by examining them for text that matches standard formats for addresses and other information that tends to describe location.
Those location identifiers may then be standardized into a common, predefined format.
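To make the pattern matching idea concrete, here is a minimal Python sketch. The two regular expressions, the names ADDRESS_RE and PHONE_RE, and the function find_location_identifiers are all assumptions made up for illustration; the patent application does not spell out its patterns, and real address formats vary far more than this.

```python
import re

# Hypothetical, deliberately simple patterns (not taken from the patent application).
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\b"
)
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")


def find_location_identifiers(text):
    """Return (kind, matched_text) pairs for address-like and phone-like strings."""
    found = [("address", m.group(0)) for m in ADDRESS_RE.finditer(text)]
    found += [("phone", m.group(0)) for m in PHONE_RE.finditer(text)]
    return found


sample = "Visit us at 123 Main Street, Springfield. Call 555-123-4567."
print(find_location_identifiers(sample))
```

A production system would likely need many more patterns (and per-country rules), but the shape of the step is the same: scan the text, collect candidate identifiers.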
Example: addresses without zip codes may have the appropriate zip code added.
Example 2: Misspellings and other possible errors that can be identified may be corrected.
These standardized formats may include a number of categories, such as:
1. street number,
2. street name,
3. street type,
8. zip code,
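The standardization step might look something like the following Python sketch. It only covers the categories named above, and everything else is hypothetical: the StandardAddress class, the standardize function, and both lookup tables (including the placeholder zip code "00000") are invented for illustration, not drawn from the patent application.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup tables; a real system would need far larger data sources.
STREET_TYPES = {"st": "Street", "stret": "Street", "street": "Street",
                "ave": "Avenue", "avenue": "Avenue"}
ZIP_BY_STREET_AND_CITY = {("Main", "Springfield"): "00000"}  # placeholder value


@dataclass
class StandardAddress:
    street_number: str
    street_name: str
    street_type: str
    zip_code: Optional[str]


def standardize(number, name, street_type, city, zip_code=None):
    """Correct the street type, normalize the name, and add a missing zip code."""
    key = street_type.lower().rstrip(".")
    canonical_type = STREET_TYPES.get(key, street_type)
    canonical_name = name.strip().title()
    if zip_code is None:
        # Supplement the address when the zip code was left out.
        zip_code = ZIP_BY_STREET_AND_CITY.get((canonical_name, city))
    return StandardAddress(number.strip(), canonical_name, canonical_type, zip_code)


print(standardize("123", "main", "St.", "Springfield"))
```

The misspelling "stret" in the table stands in for the kind of error correction the application mentions; an address whose zip code cannot be looked up simply keeps zip_code as None.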
How assignment works
After standardizing (data correction and supplementation and other standardization methods), the location identifier may be assigned to pages on which the information appears.
An identifier may be associated with documents that have no identifier yet, or with documents that already have the same identifier or a different one (some pages may be associated with more than one location).
That assignment may be made by giving each page a location associated with a page that is linked, either directly or indirectly (through a predetermined number of links), to the document.
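The link-based assignment just described can be sketched as a breadth-first walk over the link graph. This is only one plausible reading: the function name propagate_locations, the dictionary shapes, and the default limit of two links are assumptions for illustration, and the sketch ignores the relevancy tests discussed later.

```python
from collections import deque


def propagate_locations(links, locations, max_links=2):
    """Copy each page's location identifiers to pages within max_links hops.

    links:     dict mapping a page to the list of pages it links to
    locations: dict mapping a page to the set of identifiers found on it
    """
    assigned = {page: set(ids) for page, ids in locations.items()}
    for source, ids in locations.items():
        seen = {source}
        queue = deque([(source, 0)])
        while queue:
            page, depth = queue.popleft()
            if depth == max_links:
                continue  # do not follow links beyond the predetermined limit
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    # A page may collect more than one location.
                    assigned.setdefault(target, set()).update(ids)
                    queue.append((target, depth + 1))
    return assigned


links = {"contact": ["home"], "home": ["about"], "about": ["careers"]}
locations = {"contact": {"123 Main Street"}}
print(propagate_locations(links, locations))
```

In this example "home" and "about" inherit the address from "contact", while "careers" is three links away and is left without a location.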
Once an association has been made, the identifiers could be used in finding other associated pages or in ranking search results.
Alternatively, search results which include the pages may show the assigned location to users.
Associations and disassociations of locations can happen as a collection of documents is reviewed.
The first assumption is that if a page has location information on it, it is associated with that location.
The process may begin by identifying, for each page, other pages that include a geographic location identifier and are "relevant" to that page from a geographic identification standpoint.
Defining relevant documents
Documents may be defined as "relevant" where:
1. the pages are on the same web site, and
2. the anchor text appearing on the page with location information leading to the other page contains one or more terms from a small rule-based set of terms.
Those "relevant" terms may include, for example:
A document could also be considered relevant if the anchor text to it includes a complete or partial postal address.
For images or other non-text anchors, a linked page may be relevant if the URL in the link includes either a complete or partial postal address or one of the above "relevant" terms.
A page could be considered relevant by examining the contents of the page directly.
A link failing the above tests may be considered "relevant" if the HTML title of the target document includes any of the "relevant" terms, or a complete or partial postal address.
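The relevance tests above could be chained together roughly as in this Python sketch. Several things here are assumptions: the terms in RELEVANT_TERMS are invented, the ADDRESS_HINT pattern is a crude stand-in for real address detection, and applying the same-site check to every test is one interpretation of the description rather than something it states.

```python
import re
from urllib.parse import urlparse

# Hypothetical terms, invented for this sketch.
RELEVANT_TERMS = {"contact", "directions", "location"}
# Crude stand-in for detecting a partial or complete postal address.
ADDRESS_HINT = re.compile(
    r"\b\d{1,5}\s+\w+\s+(?:Street|St|Avenue|Ave|Road|Rd)\b", re.IGNORECASE
)


def mentions_location(text):
    """True if text contains a relevant term or something address-like."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & RELEVANT_TERMS) or bool(ADDRESS_HINT.search(text))


def is_relevant_link(source_url, target_url, anchor_text="", target_title=""):
    # Assumption: every test requires both pages to be on the same site.
    if urlparse(source_url).netloc != urlparse(target_url).netloc:
        return False
    if anchor_text:
        if mentions_location(anchor_text):
            return True
    elif mentions_location(urlparse(target_url).path):
        # Image or other non-text anchor: fall back to the URL in the link.
        return True
    # Last resort: the HTML title of the target document.
    return mentions_location(target_title)


print(is_relevant_link("http://example.com/", "http://example.com/a", "Contact us"))
```

The order of the checks mirrors the order in which the rules are described: anchor text first, then the URL for non-text anchors, then the title of the target page.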
These types of titles would probably be included in the first pass through of all the collected documents. Other rules may be used to determine if the target document makes a hyperlink "relevant".
Looking at distance
After a relevant page has been identified, the number of links between it and the page with the location on it is looked at. One version of the invention looks for a range of 2 to 5 links.
If the distance is greater than that, the next relevant document is reviewed. If that one is within the right number of links, it may be associated with the initial document with location information.
That process continues until all relevant documents are reviewed.
Forward links and in-bound links
That describes the process for pages linked from the page with location information on it. The same process happens with pages that link to the page with the geographical identifying information (backlinks).
A potential addition:
Relevant links and link distances are calculated for documents which don't contain the geographical location information. Each of those pages collects a measure of relevance based upon those distances, and that measure is added together for all neighboring documents that may contain geographical information. So, if a page is linked from or to by a number of pages that use relevant anchor text or URLs, it may be determined to be more relevant for that geographical information on the other pages.
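As a rough illustration of this additive measure, the Python sketch below sums a per-neighbor contribution for each location. The 1/distance weighting and the name location_scores are assumptions; the patent application, as summarized here, only says that a distance-based measure is added together across neighboring documents.

```python
from collections import defaultdict


def location_scores(neighbors):
    """Sum a distance-based measure per location.

    neighbors: iterable of (location_id, link_distance) pairs, one for each
    relevant neighboring page that carries a geographic location identifier.
    """
    scores = defaultdict(float)
    for location_id, distance in neighbors:
        if distance >= 1:
            # Assumed weighting: closer pages contribute more.
            scores[location_id] += 1.0 / distance
    return dict(scores)


print(location_scores([("A", 1), ("A", 2), ("B", 4)]))
```

Under this weighting, a page reached from several nearby pages that share a location ends up with a higher score for that location than for one mentioned only by a single distant page.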
As mentioned above, more than one location can be associated with a document.
The link above to the patent application describing Google Maps is a lot more readable after working through this patent application first. Both share a few concepts, and the Maps application includes more details on geographical location identifiers.