Is Blue, Blue, or Blue Right for You?

Historically, machines have fared poorly at understanding ambiguous terms and their meaning. For example, it is easy for you to look at the following sentences:

  • Blue is a forthcoming Indian film starring Sanjay Dutt, Akshay Kumar, Suniel Shetty, Lara Dutta and Zayed Khan.
  • Blue are an English pop boy band consisting of four members: Lee Ryan, Duncan James, Antony Costa, and Simon Webbe.
  • Blue is the 1971 album of Canadian-born singer-songwriter Joni Mitchell.

and realise each sentence is referring to a different Blue. This and related disambiguation tasks are not so easy for machines to perform well. Perhaps because of the difficulty of the task, most modern applications have simply shied away from the problem and relied on you, the human, to do more work. For example, if you go to your favorite search engine and search for the keyword Blue, you get results like these, or these.

If you are really after articles, images or video about the Bollywood film Blue, then it is up to you, the human, to add the disambiguating information and formulate your queries more precisely. For example, you might search for Bollywood Blue, or if you remember the film stars Akshay Kumar you might search for Akshay Kumar Blue. In addition, you often need to run these refined queries in multiple places, such as your image, video, web and news search engines.

Here at Evri, we are busy getting machines to search more, so you don’t have to. At Evri, if you go to the Bollywood film Blue page, you’ll note the articles, videos, and images are all in reference to the correct film. Same thing if you go to the English pop boy band Blue page, or Joni Mitchell’s album Blue page.

Let’s compare the images Evri found for each Blue at the time I took these screenshots:

And here is a snapshot of the video media bar taken from the Evri pages shown in the same order as above:

And just for fun, here are the top images for Blue from a favorite image search engine:

So how do we do it? Well, our approach is multi-pronged, but core to our activities is teaching machines to read documents in a way similar to how humans do; unlike most search engines, we don’t just treat documents like a bag of keywords with no understood inter-word meaning. So, for example, for a sentence like:

Chief Seattle spoke to his people.

our system recognizes the base subject>verb>object grammatical clause to be Seattle>speak>people. In addition, our system knows that cities do not typically speak to people whereas people do. Our system also recognizes Chief to be the prefix modifier of Seattle, as well as a title used to address a human. This understanding enables our system to recognize Seattle, in this sentence, as a person.
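To make the reasoning above concrete, here is a minimal sketch in Python. It is a toy with hand-written rules, not Evri's actual system: the `Clause` type, the `subject_type` function, and the two small word lists are all hypothetical names made up for illustration. It assumes a parser has already produced the subject>verb>object clause and the prefix modifier.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, hand-picked word lists; a real system would need far larger
# resources or a learned model.
PERSON_TITLES = {"chief", "president", "senator", "dr"}
HUMAN_SUBJECT_VERBS = {"speak", "say", "write"}


@dataclass
class Clause:
    """A parsed subject>verb>object clause (verb given as a lemma)."""
    subject: str
    verb: str
    obj: str
    subject_modifier: Optional[str] = None  # e.g. "Chief" in "Chief Seattle"


def subject_type(clause: Clause) -> str:
    """Guess whether the subject is a person from two pieces of evidence."""
    evidence = 0
    # Evidence 1: the prefix modifier is a title used to address a human.
    modifier = clause.subject_modifier
    if modifier and modifier.lower().rstrip(".") in PERSON_TITLES:
        evidence += 1
    # Evidence 2: the verb is something people, not cities, typically do.
    if clause.verb.lower() in HUMAN_SUBJECT_VERBS:
        evidence += 1
    return "person" if evidence > 0 else "unknown"


clause = Clause("Seattle", "speak", "people", subject_modifier="Chief")
print(subject_type(clause))  # person
```

With neither a title nor a human-typical verb, the sketch returns "unknown" rather than guessing, which leaves the decision to other signals.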

In addition to building a deep NLP-based grammatical understanding of text, we are busy building a very large knowledgebase of entities (people, places, or things); this knowledgebase contains structured information about each entity. For example, we know that the boy band Blue is a band, and more generally, an organization. We know the band originated in London, England, and was active between 2001 and 2005. We also know the active members of the band were: Lee Ryan, Antony Costa, Duncan James and Simon Webbe. This structured information is used in multiple aspects of our system to disambiguate between this Blue and the many others present in the world at large.
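As a rough illustration of how structured facts can separate one Blue from another, the sketch below scores each candidate entity by how many of its known facts appear in the surrounding text. This is an assumption about one simple way to do it, not a description of our production system: the `Entity` type, the `KNOWLEDGEBASE` list, and the `disambiguate` function are hypothetical, and the facts are limited to those mentioned in this post.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    kind: str          # e.g. "film", "band", "album"
    facts: tuple = ()  # strings associated with the entity


# A tiny, hypothetical knowledgebase built from the examples in this post.
KNOWLEDGEBASE = [
    Entity("Blue", "film", ("Bollywood", "Akshay Kumar", "Lara Dutta", "Zayed Khan")),
    Entity("Blue", "band", ("London", "Lee Ryan", "Antony Costa", "Duncan James", "Simon Webbe")),
    Entity("Blue", "album", ("Joni Mitchell", "1971")),
]


def disambiguate(mention, context):
    """Return the entity whose facts overlap most with the context, or None."""
    text = context.lower()
    candidates = [e for e in KNOWLEDGEBASE if e.name.lower() == mention.lower()]
    # Score each candidate by the number of its facts found in the context.
    scored = [(sum(fact.lower() in text for fact in e.facts), e) for e in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0], default=(0, None))
    return best if best_score > 0 else None


match = disambiguate("Blue", "Lee Ryan and Simon Webbe reunite as Blue")
print(match.kind)  # band
```

Plain substring counting is crude; it is here only to show why knowing the band's members or the film's cast makes the right Blue easy to pick when that context is present, and why the function declines to answer when it is not.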

We also regularly use our NLP-based text understanding in conjunction with our structured knowledgebase understanding. Here is a diagram depicting the data flow of our knowledgebase and its use at indexing and search time.

Finally, while we believe we’ve made great strides toward proving that machines can disambiguate well and help lift the burden from us humans, there is still a long way to go. While we have many examples of pages that work amazingly well, you will no doubt encounter those where we could use improvement. Please send us these examples and we will do our best to fix them.

I will leave you with a screenshot of a page whose accuracy I find compelling. When I think of the sheer number of pages on the web containing the word ten (more than 600 million in Google’s index) that are not about this album by Pearl Jam, it is a true feat to home in on just the right stuff.

9 Responses to “Is Blue, Blue, or Blue Right for You?”

  1. Abhishek Says:

    Brilliant stuff!!! So is your Named Entity Recognition and Disambiguation engine more heuristic based or machine learning based?

  2. Jisheng Says:

    For a complex system such as entity recognition and disambiguation, it is difficult to label it simply as heuristic based or machine learning based. Our solution is a combination of many techniques, from regex patterns to maximum entropy models, and from simple stopwording to full linguistic analysis.

    At Evri, we take a data-driven approach. For each given problem, we look at the data, do prototyping, and run multiple experiments before deciding on a solution. Also, our solution must scale to very large data sizes. We often find simpler statistical models perform very well as the data size grows.

  3. NLP Advisor Says:

    Interesting blog post. What would you say was the most important factor in using NLP?

  4. Samuel Clemens Says:

    “Historically, machines have faired poorly” — perhaps humans, too. I think you meant “fared.”

  5. 4 Search Engines That Use Different Approaches to Achieve Relevancy Says:

    [...] an example from the Evri Blog, for a sentence [...]

  9. Bill Bartmann Says:

    Excellent site, keep up the good work
