Sunday, December 31, 2006

Carl's Skinny on Active Participation

Generally speaking, interviewers aren't told a lot by the company. They say that is so the interviewer won't be a source of bias -- something about interviewers shaping respondents subliminally.

Like if I were to wink my eye at the respondent here and there.

It might fill her head with ideas. I hope no one is reading this -- least of all the wife.

In any event Carl had cornered a statistician on the bus, where he proceeded to impersonate a programmer, and the two of them kicked around the nearest-neighbor hot-deck method. This was how an interview that was refused got its data.

The technical term for this is imputation.

It turns out imputation is cool except when you are collecting longitudinal data. This is when your heartaches begin.

With longitudinal data it is possible to check hotdeck data against a previous interview from the real respondent.
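The nearest-neighbor hot-deck method the two of them kicked around can be sketched in a few lines: a refused interview borrows its missing answer from the completed interview whose auxiliary variables are closest. All names and values below are invented for illustration.

```python
# A minimal sketch of nearest-neighbor hot-deck imputation. All names and
# values are invented; real systems use richer distance metrics, donor
# classes, and tie-breaking rules.

def nearest_neighbor_hotdeck(donors, recipient, keys):
    """Return the completed interview closest to the refusal on the keys."""
    def distance(donor):
        return sum((donor[k] - recipient[k]) ** 2 for k in keys)
    return min(donors, key=distance)

# Completed interviews: auxiliary variables plus the answer we need.
donors = [
    {"age": 34, "household_size": 2, "income": 52000},
    {"age": 61, "household_size": 1, "income": 31000},
    {"age": 29, "household_size": 4, "income": 78000},
]

# A refusal: the auxiliaries are known (say, from a sampling frame),
# but the income answer is missing.
refusal = {"age": 32, "household_size": 3}

donor = nearest_neighbor_hotdeck(donors, refusal, keys=["age", "household_size"])
imputed_income = donor["income"]
print(imputed_income)  # the 34-year-old donor is nearest, so 52000
```

The longitudinal heartache is exactly that `imputed_income` came from a donor, so a later wave of real answers from the same respondent can contradict it.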

Some people would have said to let sleeping dogs lie but curiosity got the better of the statistician, and now it is possible to wonder about imputed data.

Enter the heartache and the impersonator and the shaper and Carl. That's me.

Carl was feeling bigger every moment.

That's because active participation was impersonation guided by the shaper.

Carl, the impersonator, would get in the mindset of the respondent and answer questions that were flagged. He would answer the questions to the shaper's liking.

This could happen sooner or it could happen later.

You would think that a bot or semi-sentient like the shaper had all the time in the world. But she didn't. Indeed she went wobbly when the impersonator gave a good answer on the first try. And this made Carl blush.


Friday, December 29, 2006

An Interlude: Self-Organizing Maps

On September 28, 2006 the US Patent Office published a patent application from Microsoft entitled "System and method for improving search relevance."

The inventor contemplates the following problem:

Take a collection of documents, say about the size of the Web, and try to organize them based upon textual similarities between them. Can that organization provide a useful way to index the web?
The invention would augment keyword search. Documents would not be indexed based on keywords directly. Instead there would be an indirection. Documents have labels -- many labels. Just look at how documents are labeled by a typical user of or one of its competitors. Keywords would be related to labels that are related to documents.

Something like this is what Microsoft proposes, using a technique called self-organizing maps.
The Self-Organizing Map (SOM) by Kohonen is motivated by the receptive fields in the human brain. High dimensional data [e.g. labeled documents where each label is a dimension] are projected in a self organizing process onto a low dimensional grid [e.g. a system of keywords that Microsoft refers to as "content tiles" in the application] analogous to sensory input in a part of the brain.
See the discussion of Emergent SOM at the website of the Databionics Research Group for a more in-depth treatment of self-organizing maps, including some nifty visualizations of the SOM process. See also my som.
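A toy version of that projection, with tiny invented documents and a 1-D grid of three tiles standing in for the "content tiles" (real SOMs use 2-D grids and shrinking neighborhood radii):

```python
# A toy self-organizing map on a 1-D grid of three "content tiles".
# The document vectors, tile count, and training schedule are all invented
# for illustration; real SOMs use 2-D grids and shrinking neighborhoods.
import random

random.seed(0)
DIM, TILES, EPOCHS = 4, 3, 50

# Each tile starts with random feature values (its "codebook vector").
tiles = [[random.random() for _ in range(DIM)] for _ in range(TILES)]

# Documents as label-weight vectors, one dimension per label.
docs = [
    [1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0],  # cluster A
    [0.0, 0.0, 1.0, 1.0], [0.0, 0.1, 0.9, 1.0],  # cluster B
]

def nearest(vec):
    """Index of the tile whose feature values are closest to vec."""
    return min(range(TILES),
               key=lambda t: sum((tiles[t][d] - vec[d]) ** 2 for d in range(DIM)))

for epoch in range(EPOCHS):
    rate = 0.5 * (1 - epoch / EPOCHS)  # learning rate decays over time
    for doc in docs:
        best = nearest(doc)
        for t in range(TILES):
            # Pull the winning tile fully, its grid neighbors half as hard.
            pull = rate if t == best else rate / 2 if abs(t - best) == 1 else 0.0
            for d in range(DIM):
                tiles[t][d] += pull * (doc[d] - tiles[t][d])

# Similar documents now land on the same tile; dissimilar ones on different tiles.
print([nearest(doc) for doc in docs])
```

After training, each tile's feature values have "moved toward" a region of the document space, which is the sense in which the tiles could carry content tags and mediate between keywords and documents.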

Meanwhile here is the patent application abstract:
A system and method for performing context based document searching is provided. A grid of content tiles is constructed corresponding to a desired concept space. Each content tile is assigned a content tag and is associated with a series of feature values. The feature values are trained to correspond to various regions of the content space. Documents are associated with one or more content tags based on a comparison of document feature values with content tile feature values. A search query is modified to include one or more content tags based on the terms in the search query and/or user preferences. The search query is then matched to documents associated with content tags contained in the search query.


Wednesday, December 27, 2006


Ting-lan wondered twice.

Would her way be vinegar or might she succeed with the many?

Generally speaking, the Americans were premature. They were premature this and premature that. Both of the sexes. Need she say more?

Ting-lan had left the pack for a few weeks now.

She was working on the usable data problem. External data had been a disappointment. Too many unanswered questions. Sometimes it was easier not to use the external data at all.

Ting-lan had the insight that would minimize the unanswered questions. She would use the off-colors. The off-colors were the very small data islands on her screen that could not be located in question space.

Ting-lan liked answers without questions. As a little girl this had always been her way.

At first Ting-lan had thought the clustering algorithm would be easy to devise. She would discover the drifts in the questions with answers and use the drifts to reel in the off-colors at random and with brute force.

Only the net she cast was too large.

Now Ting-lan was working with majors and minors. Semantics, she thought, was a dog's day afternoon.


Tuesday, December 26, 2006

Active Participation

Carl was impressed with active participation.

At first, when the concept had been presented at training, Carl had been sure that he had a personality that was averse to shaping.

Just ask the wife.

Now that he had been working with the shaper though, Carl was of a different mind.

Carl didn't think this was brainwashing. He thought of it as collaboration. True, the shaper was only semi-sentient and Web 2.0 had largely closed the door on the participation of other beings.

Indeed many hives actively challenged and tried to screen out the semi-sentients.

The company and now Carl obviously had second thoughts.

The record had to be set straight, Carl decided. So he made a mental note: when he had a little time, Carl would post the skinny on active participation.


Monday, December 25, 2006

Future Instruments -- Part IV

Velvet wasn't happy.

On the dashboard she was watching a picture develop that was nothing short of confusion.

The picture was an overlay -- a scattergram -- on top of a pipeline. The pipeline itself showed neighborhoods. Some of the neighborhoods were dark and some were light, depending on what happened after Velvet had dropped the external data sources.

External data sources made certain neighborhoods more or less light while leaving other neighborhoods in the dark.

Today this was not Velvet's problem.

Many dark neighborhoods in the overlay had been labeled "let sleeping dogs lie". The company would not pay her to ask the questions in these sections. Other neighborhoods that had only been partially illuminated as a result of the drop were now labeled "simulation candidates".

What was troubling Velvet were the neighborhoods marked "critical sections". These were empty sections that needed to be asked or not depending on the access roads. The trouble as Velvet looked at the scattergram was that it had waffled on the access roads.

Velvet, like most interviewers, wanted to see the access roads in black and white. Interviewers didn't have to enter a neighborhood connected to other neighborhoods on a black road. Being a person of color, Velvet wished the company had devised other color coding but she would muse on geek culture at another time.

Right now on the scattergram she saw dark neighborhoods labeled "critical sections" whose access roads were the color gray.

This could only mean backtracking. And backtracking always required a plan.

This had not always been the case. Before the advent of external data, interviews had been one way in and one way out for the most part. She wasn't sure why but external data sources had complicated this picture. In training they had simply told the interviewers that "now there is more than one way to skin a cat."

Velvet liked cats but she caught their drift.

On the dashboard Velvet flipped overlays to a plan called "the shortest distance between two points." This would be the plan that minimized backtracking, which of course, Velvet thought, would have the side effect of maximizing the company's profits.

It didn't take Velvet more than a few pokes at this plan to decide it was high risk. Velvet's rule of thumb was to look at gate questions that would turn access roads to white coming from neighborhoods that were lit up by external data sources. In these situations Velvet judged whether the respondent would know the answer.

Now Velvet chose a plan called "NoMax". Velvet wasn't sure what "NoMax" did. Velvet did know that "NoMax" did not maximize the use of data from external data sources. So much for double negatives.

Velvet looked at the scattergram once more and liked what she saw. Instead of lots of gray access roads, everything was coming up roses.

She counted four roses. Velvet knew that this was a borderline case that required authorization. Velvet also knew she was not about to be second-guessed by central authority. That was why, as she dragged the pipeline she had just devised on top of the central authority resource, Velvet waited with easy confidence.


Sunday, December 24, 2006

An Interlude: The Cast of Characters

the pipeline -- an application developer's view
from the Stylus Studio XML pipeline
1. Advanced simulation data source -- aka "the match"
2. Active participation data source -- aka "Carl is in the house"
3. XQuery merges two streams into one XML
4. XSLT generates ajax-enabled HTML
5. XQuery generates an XSL:FO stylesheet
6. XSL:FO prepares a PDF for ad hoc interview publishing
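Step 3 of the pipeline -- merging the two data streams into one XML document -- might be sketched as follows, substituting Python's standard library for XQuery. The element names and source labels are invented for illustration.

```python
# A hypothetical stand-in for the merge step: two answer streams combined
# into one XML document. Python's stdlib substitutes for XQuery here, and
# the element names and source labels are invented for illustration.
import xml.etree.ElementTree as ET

simulation_xml = "<answers source='match'><q id='1'>yes</q></answers>"
participation_xml = "<answers source='carl'><q id='2'>no</q></answers>"

merged = ET.Element("interview")
for stream in (simulation_xml, participation_xml):
    root = ET.fromstring(stream)
    for q in root:
        # Remember which stream each answer came from.
        q.set("source", root.get("source"))
        merged.append(q)

print(ET.tostring(merged, encoding="unicode"))
```

The downstream steps (XSLT to HTML, XSL:FO to PDF) would then consume the single merged document.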

selected resource pool -- an application developer's view
from Stylus Studio's XML pipeline

respondent statuses


Future Instruments -- Part III

Carl was about to begin a new interview. He knew the drill. He would snap and shoot the respondent and an avatar would appear in the resource pool. He then would drop and drag the avatar onto the pipeline and, after a pause, the questions would begin.

Only today there had been a refusal right after the snap and shoot. Perhaps it had been stage fright. In any event he wouldn't get to see the respondent's avatar first turn busy and then progress in hues from gray scale to whatever rgb his monitor was set at as the interview unfolded. Indeed something new occurred as Carl changed the avatar's status to "refusal". Carl had been trained for this eventuality but the eventuality had never happened before.

Carl checked the resource pool on the dashboard again. He had not been mistaken: a match had occurred. In a match the actual respondent avatar and a funny one were joined at the hip: a twin was born.

Carl remembered a Starbucks around the corner. There was no telling how long working the match would take.

Once in line Carl ordered a triple short mocha and after almost no wait, he sat alone at a table ready for the eventuality.

Carl dragged the match onto the pipeline hoping against hope that now his heartaches would begin. A heartache, Carl had learned in training, was when an interviewer had to join the match.

This wasn't normal. Generally matches, the product of advanced simulation, didn't require interviewer intervention. But Carl's company, which had developed advanced simulation in the first place (patent pending), had also written the book on active participation.

Active participation was a training module. It was also an avatar in the resource pool. This avatar had Carl's face. Carl would use it if advanced simulation called on Carl to speak.

The trouble with advanced simulation, Carl had learned during training, was that sometimes a match gave answers that would fail the consistency check. There were two types of failures (called lies by experienced interviewers) that the match might be caught in. K failures had face validity. With F failures a match made a complete non sequitur.

In either event Carl would get to say the answer.

Carl kicked back, dropped the match on the pipeline and watched the match do the walking. He did steal a glance at the resource pool and saw himself. Carl smiled at his reflection. It wasn't every interviewer who could say: I am in the house. This required specialization.


Saturday, December 23, 2006

Future Instruments -- Part II

Maybe the future of instruments is edit mode and the future of edit mode is a control panel/dashboard.

Interviewers -- generally middle-aged women and men -- would learn to play Ender's Game.

Of course (or perhaps) this is my fantasy. So in our cold, cruel world ("we live in a time where meaning falls and splinters from our lives") what might the dashboard be like?


Thursday, December 21, 2006

Future Instruments -- Part I

Just saw the article in Amstat International's newsletter on the situation at Statistics Netherlands.

Does one think that the shift to registers and administrative files as the primary data source is a phenomenon unique to the Netherlands?

Let's say it is not and see what happens (a thought experiment).

Let's say that the use of data banks in both establishment and household surveys becomes a requirement in the US too.

At the same time, as the article suggests, there may be "initiatives for new statistics". I would also add that with this data source, as with any other, there will also be missing data.

What kind of "instrument" is required to build the record under these circumstances?

It would be an instrument that largely filled in the gaps or, as people in the market research world might say, it would be an instrument that filled in white spaces in the single source of truth.

I try to imagine such an instrument.

Could we get by using the current paradigm, which in this case would be a preload of the registers and an instrument that knew how to jump from one white space to another without interviewer intervention?

Would we not want to know before a survey went to a mode whether it was going to take five minutes or five hours depending on the missing data?

Would an interview that jumps between white spaces trash the context in which questions are normally asked?

Alternatively, if the context was preserved and an interview walked instead of jumped, wouldn't the cost of data points become prohibitive and the interview come to be seen as flat-footed?


Monday, October 23, 2006

The topography of XML

Which are the stuff of dreams -- XML documents or relational databases?

Freud might say that a relational database is polymorphous perverse. That's because through the miracle of foreign keys the same table can have many parents. Imagine that.

Actually it is hard to imagine the life of rows because they can appear in many places at once. That is, the same row might be involved in several relationships. Are you getting the perversity of all this?

In XML, elements have elements, and elements have attributes.

I can hear the train of thought now. XML is hierarchical whereas in a relational database there are more degrees of freedom.

Freedom and perversity.

When have I heard this story before.

Is XML then the representation of conservatives?

Am I a conservative who chooses XML?

And do relational database administrators really have a proclivity to sleep around?

Tune in next week for an entirely new episode of "XML Matters" only on Wolf TV...

Actually there are two things you have to know about the flavor of hierarchy that comes in XML. They are cardinality and order.

Cardinality is about the number of elements. Order is about their position. The calculus -- say integration -- uses number and order to make sequence. Sequence is like the card deck our ancestors once flipped to produce animation. Sequence is a movie. Sequence is visualization.
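A quick illustration of both properties with Python's standard library (the element names are invented):

```python
# Cardinality and order in one picture. Repeated siblings are allowed, and
# their document order survives every parse -- whereas a relational table
# is a set of rows with no position unless you add a column for it.
import xml.etree.ElementTree as ET

doc = "<deck><card>ace</card><card>king</card><card>queen</card></deck>"
root = ET.fromstring(doc)

print(len(root))               # cardinality: three <card> children
print([c.text for c in root])  # order: exactly as written, every time
```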

I'd like to see a relational database put sequence in its pipe and smoke it. It is not possible according to Codd and Date, who are two really big relational database fish.

On the other hand, shaping, shapes, movement and maybe, as we shall see, even morphing and cloning are second nature with XML.

Finnegan. Begin again.

Did you know that Finnegans Wake is an XML document?


Sunday, October 22, 2006

The faces of a solution

How does one traverse the solution space when it comes to building a platform for data collection/data dissemination?

How does one go about "surfacing" a solution?

How many surfaces does a solution have?

What is the connection of these surfaces?

Is there a glide that takes us from surface to surface?

Do we follow that glide and construct a family of surfaces? Or do surfaces fall away as new surfaces pop up, kind of like a comet?

Maybe in the one case -- call it the product line model -- we are building a house. Maybe in the other case -- call it the comet model -- we are building momentum.

Everything is legacy after the first moment.

Comets solve the problem of legacy and over-extension, which creates white elephants or, again, too much baggage with a vanishing tail.

In the comet model every surface has a sunset. The sunset is when no one comes to the surface and sticks to it like flies. That's when the surface goes into the tail of the comet and dies.

Fly. Die.

In any event what fuels a surface surfacing and any story a set of surfaces make are engines.

I want to talk about engines before hell freezes over.

Maybe we will talk if there is a tomorrow.


Saturday, October 21, 2006

What's in a "placeholder"?

Recently in the context of an agile design session I called some software we are about to add to our platform a "placeholder".

The context also contained a reference to "see emily run". This goes back to Syd Barrett's "See Emily Play." The Emily I meant was Emily Chang, Strategic Designer, and my "see emily run" was a reference to keywords Emily ran in her run-lola-run post Design 2.0: Minimalism, Transparency, and You.

So what's a placeholder?

It's probably another word for an iteration.

Usually (in agile circles) we think of iterations as steps on the way to a product.

What if we don't know what a product can become but at the same time there is a business REALITY as in "I want the world and I want it now" (The Doors)?

Maybe I am stupid not to know what this product can become. But if you are in the business of data collection and data dissemination (who isn't?) and you think you know how this is going to work five years out, you are a fool.

That's why I talk about placeholders.