An Interlude: Self-Organizing Maps
On September 28, 2006, the US Patent Office published a patent application from Microsoft entitled "System and method for improving search relevance."
The inventor contemplates the following problem:
Take a collection of documents, say about the size of the Web, and try to organize them based upon textual similarities between them. Can that organization provide a useful way to index the Web?
The invention would augment keyword search. Documents would not be indexed on keywords directly. Instead, there would be a level of indirection. Documents have labels -- many labels. Just look at how documents are labeled by a typical user of del.icio.us or one of its competitors. Keywords would be related to labels, which in turn are related to documents.
Microsoft proposes an invention along these lines, using a technique called self-organizing maps.
The Self-Organizing Map (SOM) by Kohonen is motivated by the receptive fields in the human brain. High dimensional data [e.g. labeled documents where each label is a dimension] are projected in a self organizing process onto a low dimensional grid [e.g. a system of keywords that Microsoft refers to as "content tiles" in the application] analogous to sensory input in a part of the brain.
See the discussion of Emergent SOM at the website of the Databionics Research Group for a more in-depth treatment of self-organizing maps, including some nifty visualizations of the SOM process. See also my del.icio.us som.
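To make the projection idea concrete, here is a minimal sketch of SOM training in plain Python. It is a textbook-style illustration, not the method described in the patent application: the function names (train_som, best_matching_unit), the linear decay schedules, the 3x3 grid, and the toy four-label document vectors are all assumptions made for this example.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid cell whose weight vector is closest to x."""
    return min(weights,
               key=lambda cell: sum((w - xi) ** 2
                                    for w, xi in zip(weights[cell], x)))

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a small SOM; returns a dict mapping (row, col) -> weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # Every grid cell starts with a random weight vector in label space.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                    # learning rate decays to 0
            sigma = max(sigma0 * (1.0 - frac), 0.5)    # neighborhood shrinks
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                # Pull the cell toward x, more strongly near the BMU.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights

# Hypothetical documents; each dimension is one label, e.g.
# (python, code, recipe, food).
docs = [[1, 1, 0, 0], [1, 0.8, 0, 0.1], [0, 0, 1, 1], [0.1, 0, 0.9, 1]]
som = train_som(docs, rows=3, cols=3)
print(best_matching_unit(som, docs[0]), best_matching_unit(som, docs[2]))
```

After training, documents with similar labels should land on the same or neighboring grid cells, while dissimilar documents land on different cells. That grid neighborhood is what stands in for the organization of documents "based upon textual similarities".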
Meanwhile, here is the patent application abstract:
A system and method for performing context based document searching is provided. A grid of content tiles is constructed corresponding to a desired concept space. Each content tile is assigned a content tag and is associated with a series of feature values. The feature values are trained to correspond to various regions of the content space. Documents are associated with one or more content tags based on a comparison of document feature values with content tile feature values. A search query is modified to include one or more content tags based on the terms in the search query and/or user preferences. The search query is then matched to documents associated with content tags contained in the search query.
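The steps in the abstract can be read as a small pipeline: tiles carry a tag and feature values, documents are assigned the tags of the closest tiles, and a query is expanded with tags before matching. The sketch below is one loose interpretation of that reading, not the patented system. The vocabulary, the tile feature values (written by hand here rather than trained), the featurize function, and the sample documents are all hypothetical, and user preferences are ignored.

```python
# Hypothetical feature space: one dimension per vocabulary term.
VOCAB = ["python", "flight", "recipe"]

# Hypothetical content tiles; in the abstract these feature values are trained.
TILES = [
    {"tag": "programming", "features": [0.9, 0.1, 0.0]},
    {"tag": "travel", "features": [0.1, 0.9, 0.1]},
    {"tag": "cooking", "features": [0.0, 0.1, 0.9]},
]

def featurize(text):
    """Turn text into normalized counts of the vocabulary terms."""
    words = text.lower().split()
    counts = [words.count(term) for term in VOCAB]
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(VOCAB)

def nearest_tags(features, tiles, k=1):
    """Return the tags of the k tiles closest to the given feature values."""
    if not any(features):
        return []  # nothing to compare against the tiles
    def dist2(tile):
        return sum((a - b) ** 2 for a, b in zip(features, tile["features"]))
    return [tile["tag"] for tile in sorted(tiles, key=dist2)[:k]]

def build_index(documents, tiles):
    """Associate each document with content tags via feature comparison."""
    return {doc_id: nearest_tags(featurize(text), tiles)
            for doc_id, text in documents.items()}

def search(query, index, tiles):
    """Expand the query with content tags, then match tagged documents."""
    query_tags = set(nearest_tags(featurize(query), tiles))
    return sorted(doc_id for doc_id, tags in index.items()
                  if query_tags & set(tags))

documents = {
    "d1": "python list comprehension python tips",
    "d2": "recipe for lemon cake",
    "d3": "cheap flight to lisbon",
}
index = build_index(documents, TILES)
print(search("python tutorial", index, TILES))
```

Note the indirection: the query "python tutorial" is never compared with document text directly. It is mapped to the "programming" tag first, and only documents carrying that tag are returned.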