Cracking the Search Category Problem

We’ve been reading all the articles, reviews, and blog postings about the new izik tablet search app with great interest. The most common questions we’ve been asked are about the categories that we divide the results into: where do they come from? How do we decide which ones to show for each query? How many are there? How do we pick which results go in each category?

[Screenshot: izik search results page]

Search engines have been trying to divide results into meaningful categories — something better than “web, images, or news” — for many years without success. A few experimental search engines showed a list of categories on the left-hand side of the screen, and users rarely clicked on them to see what was inside. Now that the iPad has enabled easy horizontal and vertical swiping and scrolling, the user interface for exploring multiple categories of results is much easier and prettier. izik takes full advantage of that opportunity. But the second problem with categories is the one that izik has really solved: picking good ones.

If you do many different queries on izik, you’ll see that there are thousands of categories. For ambiguous queries, the categories often do a good job of showing all sides of the question. (Update 30 May 2013: to work better on all browsers, the following examples have been switched from the izik user interface to the new blekko.com user interface, which shows the same information.) A query for [giants] will show both football and baseball-related categories, and a query for [organic baby food] has separate categories for buying, making, and the health issues around organic baby food. In comparison, Bing and Google show almost all links and ads for buying baby food, a single link for the health angle, and nothing about making your own.

Here are a few more examples of queries and categories:

ford mustang: muscle-cars, car-parts, ford, news, classic-cars, cars
Tom Cruise: actors, people, gossip
fiscal cliff: money, news, congress
Beyoncé: latest, gossip, music, news, lyrics, tv, movies
asteroids: science, music (a band named “The Asteroids”), video games, latest (about asteroids hitting the earth)

Selecting Categories

Some of the categories are straightforward. “Top results” are similar to the “10 blue links” you’d get on most other search engines, “Images” are images, and “Latest” shows the most recently dated results for the query. “News” shows results from our human-curated list of news sources, so while it probably isn’t quite as up-to-date or diverse as “Latest”, it may have much higher quality.

What about the other thousand categories? Well, to start with, these categories correspond to the human-curated vertical search engines that we call “slashtags” on the blekko.com search engine. blekko.com automatically applies these slashtags to fight webspam and improve the relevance of results. However, blekko’s search results are still displayed as the usual search engine result — “10 blue links” — and it’s not immediately clear what slashtags are doing for you.

With izik, we decided to show a lot more than 10 results, and show you all of the slashtags which might seem relevant to the query. As you can see from the baby food example, this can really help you explore all aspects of your query.

The two types of analysis that are commonly attempted to categorize search engine results are clustering and semantic search, both of which are hot research topics. Most semantic search algorithms use explicit knowledge about entities and words: “Tiger Woods” is a golfer, and “Giants” is the name of both a pro baseball team and a pro football team.

The heart of blekko’s Dynamic Inference Graph (DIG) algorithm doesn’t know anything about words. It uses two data sources: librarian-crafted semantic categories (slashtags), and the enormous, free dataset known as the “World Wide Web.” The web knows that Tiger Woods is a golfer: the words “Tiger Woods” frequently appear on golfing websites. These two words also frequently appear on websites that sell golfing equipment. And they don’t appear frequently on websites about sailing, football, or women’s handbags.
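To make the idea concrete, here is a minimal sketch in Python of ranking categories by how often a query phrase appears on each category's sites. It is an illustration only, not the DIG algorithm: the toy page texts, the `phrase_rate` and `rank_categories` helpers, and the `min_rate` threshold are all assumptions made up for this example.

```python
# Hypothetical toy data: each category (slashtag) maps to the text of
# pages from its human-curated sites. Real data would come from a crawl.
CATEGORY_PAGES = {
    "golf": [
        "tiger woods wins the masters",
        "tiger woods swing analysis",
        "best putting drills",
    ],
    "golf-equipment": [
        "buy the driver tiger woods uses",
        "golf balls on sale",
    ],
    "sailing": [
        "how to trim a mainsail",
        "tiger shark spotted near regatta",
    ],
}


def phrase_rate(phrase, pages):
    """Fraction of pages that contain the phrase (case-insensitive)."""
    if not pages:
        return 0.0
    needle = phrase.lower()
    hits = sum(1 for page in pages if needle in page.lower())
    return hits / len(pages)


def rank_categories(query, category_pages, min_rate=0.1):
    """Return (category, rate) pairs at or above min_rate, best first."""
    scored = [
        (name, phrase_rate(query, pages))
        for name, pages in category_pages.items()
    ]
    kept = [(name, rate) for name, rate in scored if rate >= min_rate]
    return sorted(kept, key=lambda item: item[1], reverse=True)


for name, rate in rank_categories("Tiger Woods", CATEGORY_PAGES):
    print(f"/{name}: {rate:.2f}")
```

With this toy data, “Tiger Woods” ranks the golf category first and the golf-equipment category second, while sailing drops out because the phrase never appears on its pages. Nothing in the code knows what a golfer is; the signal comes entirely from the curated site lists and the page text.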

How did we choose the categories? This process is driven by our librarians, and has evolved significantly over time. We used to have golfing websites and golf-equipment-selling websites together in a single category, until we realized that it’s very useful to be able to separate answers about sports from information about buying sports equipment. These sorts of insights have really helped us as we’ve expanded the category list from tens to hundreds to thousands of slashtags.

In addition to having names, categories contain lists of human-curated, high-quality websites. This is one thing that the data on the web, or an algorithm, can’t do without human help. No one has invented an algorithm that detects bogus medical advice, but a human can choose the right standard for a category (in the health case, that the website has all of its information reviewed by doctors), and then evaluate whether websites meet that standard or not.
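A category of this kind can be pictured as a name plus a set of approved hosts. The sketch below, in Python, filters a list of result URLs down to those hosted on a category's curated sites. The `SLASHTAGS` table, its placeholder hosts, and the helper names are hypothetical and are not blekko's actual data or code.

```python
from urllib.parse import urlsplit

# Hypothetical curated list; the hosts are placeholders, not blekko's data.
SLASHTAGS = {
    "health": {"clinic.example", "medjournal.example"},
}


def host_of(url):
    """Lower-cased hostname of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def filter_by_slashtag(urls, slashtag, slashtags=SLASHTAGS):
    """Keep only the URLs hosted on the slashtag's curated sites."""
    allowed = slashtags.get(slashtag, set())
    return [u for u in urls if host_of(u) in allowed]


results = [
    "https://www.clinic.example/headache",
    "http://miracle-cures.example/pills",
    "https://medjournal.example/migraine",
]
print(filter_by_slashtag(results, "health"))
```

The judgment about which hosts belong in the set stays with a human; the code only enforces the list once it exists.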

The web is the other dataset used by DIG. Since we’re a search engine, we already have the text from the web stored in our datacenter. We distill the text and links down to a small semantic database, and use that database to map queries to a large list of categories. This mapping currently takes about 60 milliseconds. We then run the search itself in a similar fashion to how blekko.com does it, finding results for every category in addition to generic results. Finally, we look at the parent/child relationships of the categories, and at duplicate results, in order to end with a reasonable number of categories with different results.
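The final step, trimming the candidate list, might look something like the following Python sketch. The rules here are assumptions chosen for illustration: skip a category if its parent or child is already shown, skip it if most of its results have already appeared, and stop at a fixed count. The `prune_categories` function, the `max_overlap` threshold, and the example URLs are all hypothetical.

```python
def prune_categories(candidates, parent_of, max_categories=5, max_overlap=0.5):
    """Pick a small set of categories whose results differ from each other.

    candidates: list of (category, [result urls]), best-scoring first.
    parent_of:  dict mapping a child category to its parent category.
    """
    kept = []
    seen_urls = set()
    for name, urls in candidates:
        if len(kept) >= max_categories:
            break
        if not urls:
            continue
        kept_names = {k for k, _ in kept}
        # Skip a category whose parent or child has already been kept.
        related = parent_of.get(name) in kept_names or any(
            parent_of.get(k) == name for k in kept_names
        )
        if related:
            continue
        # Skip a category that mostly repeats results already shown.
        fresh = [u for u in urls if u not in seen_urls]
        overlap = 1 - len(fresh) / len(urls)
        if overlap > max_overlap:
            continue
        kept.append((name, fresh))
        seen_urls.update(fresh)
    return kept


candidates = [
    ("muscle-cars", ["a.example/1", "b.example/2"]),
    ("cars", ["a.example/1", "c.example/3"]),
    ("car-parts", ["d.example/4", "a.example/1"]),
    ("news", ["a.example/1", "b.example/2", "e.example/5"]),
]
parent_of = {"muscle-cars": "cars"}

for name, urls in prune_categories(candidates, parent_of):
    print(name, urls)
```

In this example “cars” is dropped because its child “muscle-cars” is already shown, and “news” is dropped because two of its three results are duplicates, leaving “muscle-cars” and “car-parts”.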

When we look back now, we’ve been hard at work building this technology for 5 years. At the start, we had no idea what we were building. After building out the basic slashtag system, two years ago someone asked a key question: “Hey, why can’t we automatically add the /health slashtag to the query [cure for headaches]?” One year ago we launched “auto-boosted slashtags” on blekko.com, and today we have the Dynamic Inference Graph and izik.

Every refinement we’ve made has led us closer to a categorized tablet search engine that’s fun and easy to use. It’s been a fun ride, and we hope that you find izik to be useful!

About Greg

Greg is the CTO of blekko