Web search technology: Time for a little Q&A?

An information systems researcher at the W. P. Carey School of Business is testing a new search engine that understands natural language questions -- a system which, beyond its practical utility for the average Web searcher, has critical implications for homeland security agencies and artificial intelligence research.

To hear Dmitri Roussinov tell it, his initial reaction upon discovering Web search technology was one shared by most users — amazement. Roussinov, an assistant professor of information systems at the W. P. Carey School of Business, recalls his awe at the realization of the wealth of information revealed by Web search engines.

It wasn't just general knowledge, either. Whether you're looking for a good wine bar in lower Manhattan or a cactus farm in Phoenix, the information is out there somewhere. "That's what attracted me specifically, the fact you can use Web search engines and find all of that information," says Roussinov.

The operative word is "can." It is possible, using a search engine, to find a wine bar or the name of the CEO of a specific company, but it's seldom easy. The fix, Roussinov thought, would be a self-learning Web question answering system: "We just hope to have a better, more powerful search engine so you can find things you want."

Roussinov's system understands natural language questions, such as "Who's the CEO of IBM?", and uses the Web to come back with an answer: "The CEO of IBM is Samuel Palmisano" (the correct response as of April 2005). Beyond answering questions, systems like this one may provide a means to develop and test artificial intelligence technology.

In contrast, most existing search engines are a lot less direct. "Search engines are designed not to find answers to questions, but to find Web pages that are thought to be relevant," notes Nicholas Belkin, a professor of information science at Rutgers University.

There's a lot going on behind the screens in modern search engines. Some systems consult a database or an encyclopedia and spit out the exact answer -- if the query fits a predefined pattern with a known, exact response. Otherwise the search engine applies a series of rules to rank pages, using such criteria as how often a keyword appears on the page and where it appears. Other methods, such as link analysis, determine how important and meaningful other Web users have found the page. From these and other factors, the search engine returns a list that ranks pages based on how pertinent each appears to be relative to the keywords in the search.
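As a rough illustration of the keyword criteria described above (how often a keyword appears and where), here is a minimal Python sketch. The function names, the page format and the title weight of 3.0 are assumptions made for this example; real search engines combine many more signals, including link analysis, and do not publish their formulas.

```python
def keyword_score(title, body, keywords, title_weight=3.0):
    """Score a page by keyword frequency, counting title hits more heavily.

    The title_weight value is an arbitrary assumption for illustration.
    """
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0.0
    for keyword in keywords:
        keyword = keyword.lower()
        # "Where it appears": a hit in the title is worth more than one in the body.
        score += title_weight * title_words.count(keyword)
        # "How often it appears": every body occurrence adds one point.
        score += body_words.count(keyword)
    return score


def rank_pages(pages, keywords):
    """Return pages sorted from most to least pertinent for the keywords."""
    return sorted(
        pages,
        key=lambda page: keyword_score(page["title"], page["body"], keywords),
        reverse=True,
    )


# Made-up pages, used only to exercise the sketch.
pages = [
    {"title": "Cactus care", "body": "water your cactus rarely"},
    {"title": "Phoenix cactus farm", "body": "cactus farm in phoenix sells cactus"},
]
ranked = rank_pages(pages, ["cactus", "farm"])
print([page["title"] for page in ranked])
```

In this toy data the farm page ranks first because both keywords appear in its title as well as in its body.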

There's a bit of an arms race going on, as webmasters tweak their pages to boost rankings and search engines alter their algorithms to counter. Of course, there is also paid placement, in which a site owner pays for a Web page to appear in a prominent position. Such transactions, along with the use of targeted ads and other revenues, finance the whole setup.

Roussinov's approach, which builds upon earlier work, begins with a natural language question, such as "Who is the CEO of IBM?" The software then matches the question against question patterns it knows to create an initial query. For example, the pattern "who is" indicates that the answer will be a person. Using that formulation, the software looks at stored answer patterns. In this case, the software looks for the word pattern "The CEO of IBM" when it occurs near "is", to create a final query that is sent out to search engines. The system retrieves Web pages, which hold possible answers.
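The question-to-query step can be sketched in a few lines of Python. This is only an illustration of the general idea, not Roussinov's software: the pattern table, the function name build_query and the exact-phrase query format are all assumptions, and the sketch simplifies "occurs near" to an exact phrase.

```python
import re

# Hypothetical pattern table: each entry pairs a question pattern with the
# expected answer type and a template for the answer pattern to search for.
QUESTION_PATTERNS = [
    (
        re.compile(r"^who(?: is|'s) (?P<topic>.+?)\??$", re.IGNORECASE),
        "person",
        "{topic} is",
    ),
]


def build_query(question):
    """Turn a natural language question into a search query, or return None."""
    for regex, answer_type, template in QUESTION_PATTERNS:
        match = regex.match(question.strip())
        if match:
            topic = match.group("topic")
            # Capitalise the first letter so the phrase reads like a sentence start.
            phrase = template.format(topic=topic[0].upper() + topic[1:])
            # Quote the phrase so a search engine treats it as an exact match.
            return {"answer_type": answer_type, "query": f'"{phrase}"'}
    return None


print(build_query("Who is the CEO of IBM?"))
# {'answer_type': 'person', 'query': '"The CEO of IBM is"'}
```

A question that fits none of the known patterns returns None here; a fuller system would presumably fall back to an ordinary keyword search.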

These Web pages are then subjected to a triangulation mechanism. Triangulation, according to Roussinov, is "a term widely used in intelligence and journalism, which stands for confirming or disconfirming facts using multiple sources." Possible solutions are filtered according to the stored answer patterns to produce the final response, which is delivered to the user. In Roussinov's scheme, the stored questions and answers are honed through a training mechanism which enables the system to learn answer patterns and also improve its ability to correctly rank potential answers.
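One simple way to picture triangulation is as a vote across sources: a candidate answer confirmed by more pages ranks higher. The Python sketch below assumes that reading; the function name triangulate, the crude name-matching regular expression and the sample page text are invented for illustration and are not taken from Roussinov's system.

```python
import re
from collections import Counter


def triangulate(pages, answer_prefix):
    """Rank candidate answers by how many pages confirm them.

    A candidate is one or two capitalised words following answer_prefix,
    a deliberately crude stand-in for a stored answer pattern.
    """
    pattern = re.compile(
        re.escape(answer_prefix) + r"\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)"
    )
    votes = Counter()
    for page in pages:
        # Using a set means each page votes at most once per candidate,
        # so one page repeating itself does not count as multiple sources.
        for candidate in set(pattern.findall(page)):
            votes[candidate] += 1
    return votes.most_common()


# Made-up page text, used only to exercise the sketch.
pages = [
    "The CEO of IBM is Samuel Palmisano, who took over the post earlier.",
    "Analysts say the CEO of IBM is Samuel Palmisano.",
    "The CEO of IBM is Someone Else according to a rumor.",
]
print(triangulate(pages, "CEO of IBM is"))
# [('Samuel Palmisano', 2), ('Someone Else', 1)]
```

The answer backed by two sources outranks the one backed by a single page, which is the confirming-or-disconfirming behavior the term describes.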

Finding dangerous information

The system can be used to find factoids, but it can also be used to highlight Web pages containing dangerous information. Take the case where a site contains instructions on how to steal passwords. Law enforcement and homeland security agencies would like to find such pages, and Roussinov has tested his system's ability to do so.

"We're looking at how Question Answering technology can help to locate Web pages, like those that teach how to make weapons and bombs," he explains.

One gauge of the system's effectiveness is the mean reciprocal rank (MRR). For each test question, the score is 1 if the right answer is the first one returned, 1/2 if it is the second, 1/3 if it is the third, and so on, down to 0 if the answer isn't found at all; the MRR is the average of these scores across all questions. In looking for malicious information, Roussinov found that the QA approach boosted MRR by 25 percent in certain circumstances as compared to using a keyword-based search engine alone.
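The calculation is short enough to show in Python. The function names and the sample data below are made up for illustration; only the scoring rule comes from the definition of mean reciprocal rank.

```python
def reciprocal_rank(ranked_answers, correct_answer):
    """Return 1/position of the correct answer, or 0.0 if it is absent."""
    for position, answer in enumerate(ranked_answers, start=1):
        if answer == correct_answer:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal ranks over (ranked_answers, correct_answer) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, correct) for ranked, correct in runs) / len(runs)


# Three made-up test questions: correct answer first, second, and missing.
runs = [
    (["a", "b"], "a"),  # scores 1
    (["b", "a"], "a"),  # scores 1/2
    (["x", "y"], "a"),  # scores 0
]
print(mean_reciprocal_rank(runs))  # 0.5
```

Because a missing answer contributes 0, a system that often fails outright is pulled down sharply even if its successful answers rank first.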

Rutgers' Belkin notes that QA systems have been investigated for the last six or so years by the Text Retrieval Conference, which is jointly sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense. While Roussinov's scheme is similar to others that have been proposed and tried, Belkin says, "It has some potential." Michael McQuaid, an assistant professor at the University of Michigan School of Information, has used Roussinov's system as an aid in his own studies of information visualization.

Based on his experience with Roussinov's approach, McQuaid says, "His is a very good one."

The next phase of Roussinov's research is fine-tuning. Since part of the process involves pattern matching with questions, Roussinov is finding out from graduate students in the W. P. Carey School of Business what questions they might ask as part of day-to-day managerial activities. From this he'll develop patterns to be used for a QA system fine-tuned for such applications. The initial testing of these new question patterns won't begin until the fall of 2005, with results expected some time after that.

Beyond better search tools, there's another possible payoff for QA research. "How would you test if a computer understands you or not?" Roussinov says. "Well, you start asking a bunch of questions and you see what kind of answers you get. If the answers make sense then I would claim it is artificial intelligence."
