Structuring Data is a Dirty Business

Monday, March 16th, 2009

Why is George W. Bush a criminal?
Why is Joe Biden a football player?
Why is Tiger Woods a journalist?
Why is Britney Spears a director?

Our goal is not to make any editorial decisions about former President Bush or anyone else; in fact, we prefer not to editorialize any of our data. Our philosophy is to let the data speak for itself.

Why it’s not what you think
We are building our knowledge base from numerous structured and semi-structured data sources, such as Freebase and Wikipedia. For example, if you head over to Freebase and check out the George W. Bush page (http://www.freebase.com/view/en/george_w_bush), you can see that, under the Crime topic, Freebase has information about his conviction for drunk driving. Although this happened years ago, in his younger days, in our system a conviction for a crime (at this time, any crime) leads an individual to be faceted as a criminal. Facets are the broad categories we use to classify people: think politician, Olympic medalist, and so forth.

Likewise, Joe Biden was the halfback for the Blue Hens football team at his alma mater, the University of Delaware. Tiger Woods writes a weekly column for Golf Digest, and Britney Spears has directed music videos. All of these data lead our system to facet these entities accordingly.
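To make the mechanism concrete, here is a minimal sketch of how rule-based faceting of this kind might work. It is not our actual implementation: the property names (`convicted_of`, `played_football_for`, and so on) and the `assign_facets` helper are hypothetical, chosen only to mirror the examples above.

```python
# Hypothetical rule table mapping a source property to a facet.
# The property names are illustrative; they are not Freebase's real schema.
FACET_RULES = {
    "held_office": "politician",
    "convicted_of": "criminal",            # any conviction triggers the facet
    "played_football_for": "football player",
    "writes_column_for": "journalist",
    "directed": "director",
}


def assign_facets(facts):
    """Return the sorted facets implied by a list of (property, value) facts."""
    return sorted({FACET_RULES[prop] for prop, _value in facts if prop in FACET_RULES})


# Illustrative facts, paraphrasing the examples in the text.
bush = [
    ("held_office", "President of the United States"),
    ("convicted_of", "drunk driving"),
]
biden = [
    ("held_office", "Vice President of the United States"),
    ("played_football_for", "Blue Hens"),
]

print(assign_facets(bush))   # ['criminal', 'politician']
print(assign_facets(biden))  # ['football player', 'politician']
```

Note that the rule looks only at whether a property is present, not at its value or date, which is exactly why a decades-old conviction is enough to produce the facet.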

Bringing order to the universe is challenging
As the curator, my main efforts involve resolving the structural differences between our sources at a high level, rather than at the individual level. For example, one data source may include individual baseball players classified according to the positions they play. Do we roll all the players together as ‘baseball player’, do we retain the source’s designations, such as ‘pitcher’ and ‘first baseman’, or do we create a hybrid of both representations?
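The three options in that question can be sketched as a single source-level setting. This is only an illustration under assumed names: the `ROLLUP` table, the `facets_for` function, and the strategy labels are hypothetical, not part of our system.

```python
# Hypothetical mapping from a source's fine-grained label to a broader facet.
ROLLUP = {
    "pitcher": "baseball player",
    "first baseman": "baseball player",
}

STRATEGIES = ("rollup", "retain", "hybrid")


def facets_for(source_label, strategy="hybrid"):
    """Map one source label to facets under a chosen curation strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    parent = ROLLUP.get(source_label)
    if parent is None or strategy == "retain":
        # No broader category known, or we keep the source's own designation.
        return [source_label]
    if strategy == "rollup":
        # Collapse every position into the broad category.
        return [parent]
    # Hybrid: keep both the broad category and the source's designation.
    return [parent, source_label]


for strategy in STRATEGIES:
    print(strategy, facets_for("pitcher", strategy))
```

Because the choice is made once per source rather than per person, a single decision here affects every individual that source describes.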

The decisions we make at the higher level can occasionally translate into seemingly odd classifications for individuals. This is why exposing our users to the context behind each classification is so important. A richer experience with the world’s information is what we are trying to build, and it is one of the most interesting aspects of this work.