Archive for the ‘Tech talk’ Category

Discovering Content by Mining the Entity Web

Wednesday, November 18th, 2009

I had a blast last night presenting to CS students at the University of Washington. For those who missed the talk, the video is embedded below.

Abstract: Unstructured natural language text found in blogs, news and other web content is rich with semantic relations linking entities (people, places and things). At Evri, we are building a system which automatically reads web content similar to the way humans do. The system can be thought of as an army of 7th grade grammar students armed with a really large dictionary. The dictionary, or knowledge base, consists of relatively static information mined from structured and semi-structured publicly available information repositories like Wikipedia, Crunchbase, and Amazon. This large knowledge base is in turn used by a highly distributed search and indexing infrastructure to perform a deep linguistic analysis of many millions of documents ultimately culminating in a large set of semantic relationships expressing grammatical SVO style clause level relationships. This highly expressive, exacting, and scalable index makes possible a new generation of content discovery applications.

The full talk: Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, and the Slides.

Is Blue, Blue, or Blue Right for You?

Wednesday, December 3rd, 2008

Historically, machines have fared poorly at understanding ambiguous terms and their meaning. For example, it is easy for you to look at the following sentences:

  • Blue is a forthcoming Indian film starring Sanjay Dutt, Akshay Kumar, Suniel Shetty, Lara Dutta and Zayed Khan.
  • Blue are an English pop boy band consisting of four members: Lee Ryan, Duncan James, Antony Costa, and Simon Webbe.
  • Blue is the 1971 album of Canadian-born singer-songwriter Joni Mitchell.

and realise each sentence is referring to a different Blue. This, and related disambiguation tasks, are not so easy for machines to perform well. Perhaps because of the difficulty of the task, most modern applications have simply shied away from the problem and relied on you, the human, to do more work. For example, if you go to your favorite search engine and search for the keyword Blue, you get results like these, or these.

If you are really after articles, images or video about the Bollywood film Blue, then it is up to you, the human, to add the disambiguating information and formulate your queries in a more exacting way. For example, you might search for Bollywood Blue, or if you remember the film stars Akshay Kumar you might search for Akshay Kumar Blue. In addition, you often need to execute these more refined queries in multiple places like your image, video, web and news search engine.

Here at Evri, we are busy getting machines to search more, so you don’t have to. At Evri, if you go to the Bollywood film Blue page, you’ll note the articles, videos, and images are all in reference to the correct film. Same thing if you go to the English pop boy band Blue page, or Joni Mitchell’s album Blue page.

Let’s take a look and compare the Evri found images when I took the screenshots:

And here is a snapshot of the video media bar taken from the Evri pages shown in the same order as above:

And just for fun, here are the top images for Blue from a favorite image search engine:

So how do we do it? Well, our approach is multi-pronged, but core to our activities is teaching machines to read documents similar to the way humans do; unlike most search engines, we don’t just treat documents like a bag of keywords with no understood inter-word meaning. So, for example, for a sentence like:

Chief Seattle spoke to his people.

our system recognizes the base subject>verb>object grammatical clause to be Seattle>speak>people. In addition, our system knows that cities do not typically speak to people whereas people do. Our system also recognizes Chief to be the prefix modifier of Seattle, as well as a title used to address a human. This understanding enables our system to recognize Seattle, in this sentence, as a person.
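Here is a minimal sketch of how that kind of evidence could be combined once a clause has been parsed. The `Clause` record, the two word lists, and `guess_subject_type` are hypothetical names chosen for illustration; this is not a description of Evri's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical list of titles that are normally used to address a human.
PERSON_TITLES = {"chief", "president", "senator", "dr."}
# Hypothetical list of verbs whose subject is typically a person.
PERSON_VERBS = {"speak", "say", "tell"}


@dataclass
class Clause:
    subject: str
    verb: str                 # lemmatized verb
    obj: str
    subject_prefix: str = ""  # prefix modifier of the subject, e.g. a title


def guess_subject_type(clause, default_type):
    """Return 'person' when the clause offers evidence that the subject is human."""
    has_title = clause.subject_prefix.lower() in PERSON_TITLES
    human_verb = clause.verb in PERSON_VERBS
    if has_title or human_verb:
        return "person"
    return default_type


# Seattle > speak > people, with "Chief" as the prefix modifier of Seattle.
clause = Clause(subject="Seattle", verb="speak", obj="people", subject_prefix="Chief")
print(guess_subject_type(clause, default_type="city"))
```

Without the title or a person-oriented verb, the function falls back to whatever type the caller passes as the default.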

In addition to building a deep NLP based grammatical understanding of text, we are busy building a very large knowledgebase of entities (people, places, or things); this knowledgebase contains structured information about each entity. For example, we know that the boy band Blue is a band, and more generally, an organization. We know the band originated in London, England, and was active between 2001 and 2005. We also know the active members of the band were: Lee Ryan, Antony Costa, Duncan James and Simon Webbe. This structured information is used in multiple aspects of our system to disambiguate between this Blue and the many others present in the world at large.
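To make the idea concrete, here is a rough sketch that picks among the three Blues by counting how many known facts about each one appear in a piece of text. The `KNOWLEDGE_BASE` layout and the `disambiguate` function are assumptions made for this example, and simple substring overlap stands in for whatever scoring a real system would use.

```python
# Hypothetical knowledge base entries, filled in with facts mentioned in this post.
KNOWLEDGE_BASE = {
    "Blue (band)": {
        "type": "band",
        "terms": {"lee ryan", "antony costa", "duncan james", "simon webbe", "london", "band"},
    },
    "Blue (film)": {
        "type": "film",
        "terms": {"akshay kumar", "lara dutta", "zayed khan", "bollywood", "film"},
    },
    "Blue (album)": {
        "type": "album",
        "terms": {"joni mitchell", "1971", "album"},
    },
}


def disambiguate(text, kb=KNOWLEDGE_BASE):
    """Return the entity whose known terms overlap most with the text, or None."""
    lowered = text.lower()
    scores = {
        name: sum(term in lowered for term in entry["terms"])
        for name, entry in kb.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("Lee Ryan says Blue will play London again"))
print(disambiguate("Joni Mitchell recorded Blue in 1971"))
```

When the text carries no supporting evidence at all, the sketch returns `None` rather than guessing.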

We also regularly leverage our NLP based text understanding in conjunction with our structured knowledgebase understanding. Here is a diagram depicting the data flow of our knowledgebase and its use at indexing and search time.

Finally, while we believe we’ve made great strides toward proving that machines can disambiguate well and help alleviate the burden from us humans, there is still a long way to go. While we have many examples of pages that work amazingly well, you will no doubt encounter those where we could use improvement. Please send us these examples and we will do our best to fix them.

I will leave you with a screen shot of a page whose accuracy I find compelling. When I think of the sheer number of pages on the web containing the word ten (more than 600 million in Google’s index) that are not about this album by Pearl Jam, it is a true feat to home in on just the right stuff.

Evri’s Garden Sprouts Some Search

Tuesday, September 30th, 2008

We thought about launching a labs site where we could showcase our latest gadgetry, but decided none of us really fancy wearing lab coats. Many of us have gardens, however, and a few of us wear overalls, so we figured we’d instead start a garden to sprout new ideas. So, voila: we have a new section of our site called Evri’s Garden where we’ll be showcasing our fresh but not fully farmed veggies. Our first garden sprout is Evri Search, which I’ll spend some time chatting about.

Evri Search exposes our text analysis infrastructure that automatically identifies and makes available linguistic links connecting people, places and things found on the web. To provide this enhanced search capability, Evri Search performs an exhaustive deep natural language processing based analysis of every sentence in our corpus. This search interface allows you to directly interact with the same back end system our scientists and engineers use every day to fine-tune the algorithms used in our applications to search on your behalf. The help section on the search page is pretty exhaustive, so I thought it would be more entertaining to just walk through some interesting queries.

One of my favorite queries is to find corporate acquisitions. To do so using the Evri query language (EQL), I can construct a query like:

[Organization/Name]>buy>[Organization/Name]

In this query, I am asking the system for all sentences containing a grammatical clause where the source of an action is a named organization (usually companies but also nonprofits and government agencies), the action is the verb buy (or similar verbs), and the target of the action is also a named organization. Here is a screen shot of the results the day I executed the query:

Note: the system has over 24,000 instances of acquisitions, and I am shown them in ranked order. One day I will chat more about how we do this ranking, but in the meantime, suffice it to say many factors impact this ordering, including, but not limited to: relevance of the document, verb condition, importance of the entities, relationship parts of speech, relationship redundancy, document timeliness, and credibility of the source.
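One way to read the acquisition query is as a pattern over stored subject, verb, object triples. The sketch below is a toy matcher for that reading, not the real EQL engine: the `Relationship` record, the `SIMILAR_VERBS` table, and the `match` function are all hypothetical, and the second sample row is made up so there is something that should not match.

```python
from typing import NamedTuple


class Relationship(NamedTuple):
    subject: str
    subject_type: str
    verb: str          # lemmatized verb
    obj: str
    obj_type: str


# Hypothetical table of verbs treated as similar to the query verb.
SIMILAR_VERBS = {"buy": {"buy", "acquire", "purchase"}}

RELATIONSHIPS = [
    Relationship("Bank of America", "organization", "buy", "Merrill Lynch", "organization"),
    Relationship("Jane Doe", "person", "buy", "house", "thing"),  # made-up row
]


def slot_matches(slot, text, entity_type):
    """A bracketed slot such as [Organization/Name] constrains the entity type;
    a bare word must equal the entity text."""
    if slot.startswith("[") and slot.endswith("]"):
        wanted_type = slot[1:-1].split("/")[0].lower()
        return entity_type.lower() == wanted_type
    return text.lower() == slot.lower()


def match(query, relationships):
    """Return the relationships matching a 'subject>verb>object' pattern."""
    subj_slot, verb, obj_slot = (part.strip() for part in query.split(">"))
    verbs = SIMILAR_VERBS.get(verb, {verb})
    return [
        r for r in relationships
        if r.verb in verbs
        and slot_matches(subj_slot, r.subject, r.subject_type)
        and slot_matches(obj_slot, r.obj, r.obj_type)
    ]


for r in match("[Organization/Name]>buy>[Organization/Name]", RELATIONSHIPS):
    print(r.subject, ">", r.verb, ">", r.obj)
```

The sketch ignores ranking entirely; it only shows how type constraints on both ends of a clause narrow the result set.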

Now also note: I’m shown the name of the acquiring company and the name of the acquired company; I’m not sent off to a web page to sift through acquisitions nor do I need to merge results from multiple websites containing acquisitions. A key goal of traditional search engines, as well as many semantic search engines, is to point users to documents, or web sites, where users are expected to read the results and assimilate the information they are after. Evri Search excels at distilling relationships, or facts, from disparate web sites — this ultimately enables users to read less, and understand more. Now let’s expand the first result:

Note: the relationship: Bank of America > buy > Merrill Lynch was extracted from multiple different sentences, or different ways of expressing the same thing. Also note: you can click on the article titles to visit the article and read the sentence containing the matched relationship in context. Let’s do a slight modification of this query now, and execute:

[Organization/Name]>buy>[Organization/Name] PREP CONTAINS [Money]

Now we are asking for the same relationships as before, except now we only want relationships where the complement of the preposition is a monetary amount. In other words, the sentence should contain language like: Company X bought Company Y for Z dollars. Here is the first result expanded:

Note: In every sentence, the monetary amount of the acquisition is mentioned. Now let’s say we want to get even more constrained. Let’s say we only care about acquisitions with the amount mentioned but in the media sector. We could constrain the query a bit more:

[Organization/Name]>buy>[Organization/Name] PREP CONTAINS [Money] CONTEXT CONTAINS media

Now we are asking the system for the same results as before, except the context (the sentence containing the relationship, the sentence before or the sentence after) must contain the word media. Note: the results are now focused on the media sector:
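As a rough illustration of the two constraints just described, the sketch below applies a money check to the complements of prepositions and a keyword check to the surrounding context. Everything here is assumed for the example: the `Match` record, the `MONEY` regular expression (which only recognizes dollar amounts), and the made-up Company X and Company A sentences.

```python
import re
from dataclasses import dataclass

# Assumed pattern for a monetary amount such as "$50 million"; dollars only.
MONEY = re.compile(r"\$\s?\d[\d,.]*(?:\s?(?:million|billion))?", re.IGNORECASE)


@dataclass
class Match:
    triple: tuple           # (subject, verb, object)
    prep_complements: list  # complements of prepositions in the clause
    context: list           # sentence before, matched sentence, sentence after


def prep_contains_money(m):
    """True if any prepositional complement looks like a monetary amount."""
    return any(MONEY.search(c) for c in m.prep_complements)


def context_contains(m, word):
    """True if the word appears in the matched sentence or its neighbors."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return any(pattern.search(s) for s in m.context)


# Made-up matches, following the "Company X bought Company Y for Z dollars" shape.
matches = [
    Match(("Company X", "buy", "Company Y"), ["$50 million"],
          ["Company X is a media group.", "Company X bought Company Y for $50 million.", ""]),
    Match(("Company A", "buy", "Company B"), ["cash"],
          ["", "Company A bought Company B for cash.", "Both are media firms."]),
    Match(("Company C", "buy", "Company D"), ["$2 billion"],
          ["", "Company C bought Company D for $2 billion.", ""]),
]

filtered = [m for m in matches if prep_contains_money(m) and context_contains(m, "media")]
print([m.triple for m in filtered])
```

Only the first made-up match satisfies both constraints: the second has no monetary amount, and the third never mentions media.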

You may, on occasion, note that the sentence matching a query does not contain the name of the entity. For example, in the query:

shark>attack>[Person/Name]

I expanded the first result (when I ran the query), and got:

Note: the shark attack victim is not mentioned in the matched sentence shown in black. This is because the pronoun she is referring to Bettina Pereira mentioned in a previous sentence. Evri Search is able to understand, similar to the way a human does, that pronouns (along with other anaphora like the company and the lawyer) refer to other named entities.
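A deliberately naive sketch of that idea: walk backwards through the entities mentioned so far and pick the most recent one whose type is compatible with the anaphor. The `ANAPHOR_TYPES` table and the `resolve` function are hypothetical, and real anaphora resolution uses much more evidence than entity type and recency.

```python
# Hypothetical mapping from an anaphor to the entity type it can refer to.
ANAPHOR_TYPES = {
    "she": "person",
    "he": "person",
    "the lawyer": "person",
    "the company": "organization",
    "the band": "band",
}


def resolve(anaphor, prior_mentions):
    """Return the most recent prior mention whose type fits the anaphor.

    prior_mentions is a list of (name, entity_type) pairs in document order.
    """
    wanted = ANAPHOR_TYPES.get(anaphor.lower())
    if wanted is None:
        return None
    for name, entity_type in reversed(prior_mentions):
        if entity_type == wanted:
            return name
    return None


# Entities seen before the matched sentence; the organization is a made-up example.
mentions = [("Bettina Pereira", "person"), ("Example Surf Club", "organization")]
print(resolve("she", mentions))
```

With the pronoun resolved, the stored relationship can carry the person's name even though the matched sentence only says she.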

I’ll now leave this post with a few queries to help you get started geeking out with Evri Search. Feel free to try these queries out on your favorite keyword or semantic search engine.

Finally, if you find any great searches you’d like to share, you’ll find the +share link in the top right of your browser that links to all your favorite bookmarking apps, else we’d love for you to drop it in the comments section here. Have fun geeking out on Evri Search.

The Grammar Student’s Guide to Radiohead

Tuesday, July 15th, 2008

Here at Evri, we talk a lot about searching less. When we talk about searching less, we are talking about you, our users with precious time — we want you to search less — we aren’t talking about our machines, because they do an awful lot of searching so you don’t have to. So how are they, our racks and racks of computers, searching so you can understand more?

Well it comes down to teaching our machines to read documents more similar to the way humans do – to basically understand more of the meaning of the documents they index. This is very different from what traditional keyword based search technology does. Typical search engines, when they encounter a document, treat the document like a bag of words — associations between the words, how they interconnect, and form actual meaning is lost. Consider the following text snippet from a Starpulse article:

Howard insists they won’t be copying Radiohead’s idea and making their disc only available on the internet. [...] He tells BBC Radio 1, “We won’t be doing the same thing as Radiohead, no.” [...] Last year, Radiohead released In Rainbows as an Internet download and allowed fans to name their own price for the album.

Now from this snippet of text, your favorite search engine will store this data something like:

Radiohead – 3
Howard – 1
Rainbows – 1
released – 1
Internet – 1

and so on. I’m simplifying things a lot for the sake of discussion, but basically, your favorite search engine is maintaining a list of words, and keeping track of how many times those words appear in a given document. This approach works quite well for finding websites, but not very well for discovering facts, or relationships describing how people, places and things interconnect.
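The word-count view is easy to reproduce. The sketch below runs Python's `collections.Counter` over the snippet quoted above (with the elided parts left out) and a simple letters-only tokenizer, which is an assumption of this example; real engines tokenize and normalize far more carefully.

```python
import re
from collections import Counter

snippet = """Howard insists they won't be copying Radiohead's idea and making
their disc only available on the internet. He tells BBC Radio 1, "We won't be
doing the same thing as Radiohead, no." Last year, Radiohead released In Rainbows
as an Internet download and allowed fans to name their own price for the album."""

# Split into runs of letters; counting is case-sensitive, so "Internet" and
# "internet" are tallied separately.
tokens = re.findall(r"[A-Za-z]+", snippet)
counts = Counter(tokens)

for word in ["Radiohead", "Howard", "Rainbows", "released", "Internet"]:
    print(word, "-", counts[word])
```

Nothing in these counts records who released what, which is exactly the information the clause-level approach is meant to keep.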

Now consider how Evri’s approach is different. For this same snippet of text, our machines will break the snippet out into multiple sentences. For each sentence, our machines will, in essence, diagram the sentence similar to what you did back in 7th grade grammar class. So, for every grammatical clause in a sentence, our system creates a data structure like that shown below.

In the last sentence of the snippet above, our system will store a relationship like: Radiohead > released > In Rainbows

In addition, our system knows that Radiohead is a band, released is a verb, and In Rainbows is an album. If a sentence said: Radiohead of Oxfordshire may release an album called In Rainbows, our system will store Oxfordshire as the suffix modifier of Radiohead, and will store the verb release as being conditional; knowing that a verb is conditional or negated is important as this information can be used to determine where in a list of results this relationship should appear. In addition, if a subsequent sentence says something like: The band’s experiment proved successful, our system will know that The band refers to Radiohead; this is because our system attempts to resolve anaphora similar to the way humans do. Finally, this triplet style data structure is searchable at web scale and web speed by searches expressible in a query language; this query language is quite flexible, but basically allows our recommendation and information navigation applications to formulate effective queries in a precise manner. For example, a query like:

[musical_artist] OR [band] > praise > Radiohead

is being used to render the right column in the entity detail page shown in the screen shot below.

When you actually click on a person or organization, like Billy Corgan, the system will execute a more refined query like:

Billy Corgan > praise > Radiohead

One of the challenges our scientists and engineers face is how to formulate these types of queries in clever ways so you, the user, do not have to; I’ll save this discussion for another day, however.
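To make the clause record described earlier a little more concrete, here is a minimal sketch of a triple that also carries a suffix modifier and conditional or negated flags. The `ClauseRecord` fields and the `rank` function are hypothetical names, and placing conditional or negated relationships after plainly stated ones is just one assumed way such flags could influence result order.

```python
from dataclasses import dataclass


@dataclass
class ClauseRecord:
    subject: str
    verb: str                  # lemmatized verb
    obj: str
    subject_suffix: str = ""   # suffix modifier, e.g. "Oxfordshire"
    conditional: bool = False  # "may release"
    negated: bool = False      # "did not release"


def rank(records):
    """Stable sort: plainly stated relationships first, hedged or negated ones after."""
    return sorted(records, key=lambda r: r.conditional or r.negated)


# "Last year, Radiohead released In Rainbows ..."
stated = ClauseRecord("Radiohead", "release", "In Rainbows")
# "Radiohead of Oxfordshire may release an album called In Rainbows"
hedged = ClauseRecord("Radiohead", "release", "In Rainbows",
                      subject_suffix="Oxfordshire", conditional=True)

for record in rank([hedged, stated]):
    print(record)
```

Because the sort is stable, relationships with the same flags keep whatever order an earlier relevance step gave them.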

Finally, we published a book chapter last year that does a more thorough job explaining our approach, and additional grammatical treatments our system performs. So if you’re interested, see the Natural Language Processing and Text Mining book chapter titled A Case Study in Natural Language Based Web Search.