Friday, December 29, 2006

An Interlude: Self-Organizing Maps

On September 28, 2006, the US Patent Office published a patent application from Microsoft entitled "System and method for improving search relevance."

The inventor contemplates the following problem:

Take a collection of documents, say about the size of the Web, and try to organize them based upon textual similarities between them. Can that organization provide a useful way to index the web?
The invention would augment keyword search. Documents would not be indexed based on keywords directly. Instead there would be an indirection. Documents have labels -- many labels. Just look at how documents are labeled by a typical user of or one of its competitors. Keywords would be related to labels that are related to documents.

Microsoft proposes something along these lines, using a technique called self-organizing maps.
The Self-Organizing Map (SOM) by Kohonen is motivated by the receptive fields in the human brain. High dimensional data [e.g. labeled documents where each label is a dimension] are projected in a self organizing process onto a low dimensional grid [e.g. a system of keywords that Microsoft refers to as "content tiles" in the application] analogous to sensory input in a part of the brain.
See the discussion of Emergent SOM at the website of the Databionics Research Group for a more in-depth treatment of self-organizing maps, including some nifty visualizations of the SOM process. See also my som.
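
For readers who like to see the moving parts, here is a minimal sketch of SOM training in plain Python. It is not taken from the patent application or from the Databionics material. The function names, the 1 x 4 grid, the linear decay schedules and the four toy "documents" (each a vector of label weights) are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid cell whose weight vector is closest to x."""
    return min(weights, key=lambda cell: math.dist(weights[cell], x))


def train_som(data, rows, cols, epochs=100, lr0=0.5, seed=0):
    """Train a small self-organizing map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Every grid cell starts with random weights in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0      # initial neighborhood radius (assumed)
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                    # learning rate decays
            sigma = max(sigma0 * (1.0 - frac), 0.5)    # neighborhood shrinks
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Cells near the winner on the grid are pulled harder toward x.
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Toy data: each row is a document, each column a label weight (hypothetical).
docs = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.1],
    [0.0, 0.0, 1.0, 1.0],
    [0.1, 0.0, 0.9, 1.0],
]
som = train_som(docs, rows=1, cols=4)
for d in docs:
    print(d, "->", best_matching_unit(som, d))
```

The point of the sketch is the projection itself: four-dimensional label vectors end up assigned to cells of a low dimensional grid, and documents with similar labels tend to land on the same or neighboring cells.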

Meanwhile here is the patent application abstract:
A system and method for performing context based document searching is provided. A grid of content tiles is constructed corresponding to a desired concept space. Each content tile is assigned a content tag and is associated with a series of feature values. The feature values are trained to correspond to various regions of the content space. Documents are associated with one or more content tags based on a comparison of document feature values with content tile feature values. A search query is modified to include one or more content tags based on the terms in the search query and/or user preferences. The search query is then matched to documents associated with content tags contained in the search query.
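
The flow in the abstract can be sketched in a few lines of Python. This is only my reading of the abstract, not the method in the application. The tile feature values are assumed to be already trained, the tags, documents and term-to-tag table are made up, and matching is done on content tags alone, with no keyword matching.

```python
from math import dist


def tag_document(doc_features, tiles, k=1):
    """Return the k content tags whose tile feature values are nearest the document's."""
    ranked = sorted(tiles, key=lambda tag: dist(tiles[tag], doc_features))
    return set(ranked[:k])


def expand_query(terms, term_to_tags, preferred_tags=()):
    """Collect content tags implied by the query terms and/or user preferences."""
    tags = set(preferred_tags)
    for term in terms:
        tags |= term_to_tags.get(term, set())
    return tags


def search(terms, index, term_to_tags, preferred_tags=()):
    """Return documents that share at least one content tag with the expanded query."""
    query_tags = expand_query(terms, term_to_tags, preferred_tags)
    return sorted(doc for doc, tags in index.items() if tags & query_tags)


# Hypothetical content tiles: tag -> trained feature values.
tiles = {"sports": [1.0, 0.0], "finance": [0.0, 1.0]}

# Hypothetical documents: id -> feature values.
documents = {"doc_a": [0.9, 0.2], "doc_b": [0.1, 0.8]}

# Associate each document with its nearest content tag.
index = {doc: tag_document(feats, tiles) for doc, feats in documents.items()}

# Hypothetical mapping from query terms to content tags.
term_to_tags = {"score": {"sports"}, "stocks": {"finance"}}

print(search(["score"], index, term_to_tags))
```

Note the indirection: the query term never touches the document text. It reaches the document only through a shared content tag.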


Wednesday, December 27, 2006


Ting-lan wondered twice.

Would her way be vinegar or might she succeed with the many?

Generally speaking, the Americans were premature. They were premature this and premature that. Both of the sexes. Need she say more?

Ting-lan had left the pack for a few weeks now.

She was working on the usable data problem. External data had been a disappointment. Too many unanswered questions. Sometimes it was easier not to use the external data at all.

Ting-lan had the insight that would minimize the unanswered questions. She would use the off-colors. The off-colors were the very small data islands on her screen that could not be located in question space.

Ting-lan liked answers without questions. As a little girl this had always been her way.

At first Ting-lan had thought the clustering algorithm would be easy to devise. She would discover the drifts in the questions with answers and use the drifts to reel in the off-colors at random and with brute force.

Only the net she cast was too large.

Now Ting-lan was working with majors and minors. Semantics, she thought, was a dog's day afternoon.


Tuesday, December 26, 2006

Active Participation

Carl was impressed with active participation.

At first, when the concept had been presented at training, Carl had been sure that he had a personality that was averse to shaping.

Just ask the wife.

Now that he had been working with the shaper though, Carl was of a different mind.

Carl didn't think this was brainwashing. He thought of it as collaboration. True, the shaper was only semi-sentient and Web 2.0 had largely closed the door on the participation of other beings.

Indeed many hives actively challenged and tried to screen out the semi-sentients.

The company and now Carl obviously had second thoughts.

The record had to be set straight, Carl decided. So he made a mental note: when he had a little time, Carl would post the skinny on active participation.


Monday, December 25, 2006

Future Instruments -- Part IV

Velvet wasn't happy.

On the dashboard she was watching a picture develop that was nothing short of confusion.

The picture was an overlay -- a scattergram -- on top of a pipeline. The pipeline itself showed neighborhoods. Some of the neighborhoods were dark and some were light, depending on what happened after Velvet had dropped the external data sources.

External data sources made certain neighborhoods more or less light while leaving other neighborhoods in the dark.

Today this was not Velvet's problem.

Many dark neighborhoods in the overlay had been labeled "let sleeping dogs lie". The company would not pay her to ask the questions in these sections. Other neighborhoods that had only been partially illuminated as a result of the drop were now labeled "simulation candidates".

What was troubling Velvet were the neighborhoods marked "critical sections". These were empty sections that needed to be asked or not depending on the access roads. The trouble as Velvet looked at the scattergram was that it had waffled on the access roads.

Velvet, like most interviewers, wanted to see the access roads in black and white. Interviewers didn't have to enter a neighborhood connected to other neighborhoods on a black road. Being a person of color, Velvet wished the company had devised other color coding but she would muse on geek culture at another time.

Right now on the scattergram she saw dark neighborhoods labeled "critical sections" whose access roads were the color gray.

This could only mean backtracking. And backtracking always required a plan.

This had not always been the case. Before the advent of external data, interviews had been one way in and one way out for the most part. She wasn't sure why but external data sources had complicated this picture. In training they had simply told the interviewers that "now there is more than one way to skin a cat."

Velvet liked cats but she caught their drift.

On the dashboard Velvet flipped overlays to a plan called "the shortest distance between two points." This would be the plan that minimized backtracking, which of course, Velvet thought, would have the side effect of maximizing the company's profits.

It didn't take Velvet more than a few pokes at this plan to decide it was high risk. Velvet's rule of thumb was to look at gate questions that would turn access roads to white coming from neighborhoods that were lit up by external data sources. In these situations Velvet judged whether the respondent would know the answer.

Now Velvet chose a plan called "NoMax". Velvet wasn't sure what "NoMax" did. Velvet did know that "NoMax" did not maximize the use of data from external data sources. So much for double negatives.

Velvet looked at the scattergram once more and liked what she saw. Instead of lots of gray access roads, everything was coming up roses.

She counted four roses. Velvet knew that this was a borderline case that required authorization. Velvet also knew she was not about to be second-guessed by central authority. That was why, as she dragged the current pipeline she had just devised on top of the central authority resource, Velvet waited with easy confidence.


Sunday, December 24, 2006

An Interlude: The Cast of Characters

the pipeline -- an application developer's view
from the stylus studio xml pipeline
1. Advanced simulation data source -- aka "the match"
2. Active participation data source -- aka "Carl is in the house"
3. XQuery merges two streams into one XML
4. XSLT generates Ajax-enabled HTML
5. XQuery generates an XSL-FO stylesheet
6. XSL-FO prepares a PDF for ad hoc interview publishing

selected resource pool -- an application developer's view
from stylus studio's xml pipeline

respondent statuses


Future Instruments -- Part III

Carl was about to begin a new interview. He knew the drill. He would snap and shoot the respondent and an avatar would appear in the resource pool. He then would drop and drag the avatar onto the pipeline and, after a pause, the questions would begin.

Only today there had been a refusal right after the snap and shoot. Perhaps it had been stage fright. In any event he wouldn't get to see the respondent's avatar first turn busy and then progress in hues from gray scale to whatever RGB his monitor was set at as the interview unfolded. Indeed something new occurred as Carl changed the avatar's status to "refusal". Carl had been trained for this eventuality but the eventuality had never happened before.

Carl checked the resource pool on the dashboard again. He had not been mistaken: a match had occurred. In a match the actual respondent avatar and a funny one were joined at the hip: a twin was born.

Carl remembered a Starbucks around the corner. There was no telling how long working the match would take.

Once in line Carl ordered a triple short mocha and after almost no wait, he sat alone at a table ready for the eventuality.

Carl dragged the match onto the pipeline hoping against hope that now his heartaches would begin. A heartache, Carl had learned in training, was when an interviewer had to join the match.

This wasn't normal. Generally matches, the product of advanced simulation, didn't require interviewer intervention. But Carl's company, which had developed advanced simulation in the first place (patent pending), had also written the book on active participation.

Active participation was a training module. It was also an avatar in the resource pool. This avatar had Carl's face. Carl would use it if advanced simulation called on Carl to speak.

The trouble with advanced simulation, Carl had learned during training, was that sometimes a match gave answers that would fail the consistency check. There were two types of failures (called lies by experienced interviewers) that the match might be caught in. K failures had face validity. With F failures a match made a complete non sequitur.

In either event Carl would get to say the answer.

Carl kicked back, dropped the match on the pipeline and watched the match do the walking. He did steal a glance at the resource pool and saw himself. Carl smiled at his reflection. It wasn't every interviewer who could say: I am in the house. This required specialization.