Searching for truth — or whatever — could be easier with recognition

Keyword searches can be a waste of valuable time, affecting productivity in a company with an extensive database. New research from an information systems professor at W. P. Carey School of Business has come up with a better idea for document storage and retrieval: dynamic visual hierarchies that tap the human searcher's ability to recognize information, rather than recall it.

Typing the term "document retrieval systems" into the Google search engine recently brought back nearly 4.9 million hits in 0.22 seconds. Not only do such bountiful results put a little "eek!" in the word "seek," they exemplify a growing problem with electronic document storage. Document management systems have not evolved with storage capabilities, and stockpiling information without re-engineering retrieval processes can impede worker productivity.

The problems of plenty

Researchers at IBM estimate that today's corporate workers spend up to 30 percent of their time searching for information. At the same time, IT-industry analyst IDC maintains that only about 20 percent of corporate knowledge gets tapped. The rest hides on company disk drives, undiscovered or scantily used.

Along with a huge frustration factor, there is a dollar cost associated with not being able to find needed information. In 2002, IDC estimated that fruitless information searches cost companies around $5,500 per worker per year. And because the amount of stored information has increased over the past few years, there's a good chance document retrieval difficulties are up, as well.

That's because the more information you have available, the more chances you have for irrelevant information to slip into search-engine results, a point demonstrated through research conducted in the 1940s and '50s by the late Harvard linguistics professor, George Zipf.

"Zipf found out two really bothersome pieces of information," says Robert St. Louis, professor of information systems at the W. P. Carey School of Business. "The first is that as you increase the number of documents in a document library, you increase the frequency of occurrences" for any specific word.

For instance, if you're keeping an eye on the word "computer," and you jump from a library with 1,000 documents to one containing 100,000 documents, Zipf's research found that the number of references to the word "computer" would increase 71-fold. That means if 100 occurrences of the word "computer" appeared in the smaller document library, there will be 7,100 computer references in the larger document collection, St. Louis notes.

Lest you think more information will be more helpful, Zipf also discovered an "Index of Ambiguity," which states that the number of different semantic meanings for most words is usually the square root of the number of usages of the word. In other words, out of 100 references to the word "computer," there are likely to be 10 specific connotations attached to the word. And, sadly, a computer doesn't know that.
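The arithmetic behind both of Zipf's observations can be sketched in a few lines of Python. This is only an illustration of the figures quoted above: the 71-fold growth factor is taken directly from St. Louis's example rather than derived from a formula, and the function and variable names are made up for this sketch.

```python
import math

# Figures quoted in the article for the word "computer".
SMALL_LIBRARY_USAGES = 100   # occurrences in the 1,000-document library
GROWTH_FACTOR = 71           # quoted increase when moving to 100,000 documents


def estimated_meanings(usages: int) -> float:
    """Index of Ambiguity as described above: roughly the square root of usages."""
    return math.sqrt(usages)


large_library_usages = SMALL_LIBRARY_USAGES * GROWTH_FACTOR

print(large_library_usages)                              # occurrences in the larger library
print(estimated_meanings(SMALL_LIBRARY_USAGES))          # meanings in the smaller library
print(round(estimated_meanings(large_library_usages)))   # approximate meanings in the larger one
```

Under these assumptions, the 7,100 references in the larger library would carry roughly 84 distinct meanings, compared with 10 in the smaller one, which is why the share of irrelevant hits tends to grow along with the collection.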

"Keyword searches are dumb in the sense that there is no meaning inherent in the search," says W. P. Carey information systems professor Karen Corral. "With keywords, you'll get back any document that has the same combination of words used somewhere in the text."

"You get more and more articles back, and more and more of them are not what you're looking for," St. Louis adds.

Shortcomings of simple storage

St. Louis maintains that the proliferation of documents now available online and in corporate intranets compounds the information retrieval problems explained by Zipf's findings. Storage devices are relatively affordable, leading to increased reliance on electronic files, he says. Yet current document storage methodologies don't add innovation to automation. Rather than re-engineer document retrieval, most organizations simply automated the manual filing process of tucking information away in folders that now reside on disk drives rather than in file cabinets.

And whether the electronic file drawers serve a corporate colossus or some small business where an inventive clerk confounds co-workers by sticking salary data into a file labeled "paychecks," workers are at each other's mercy when it comes to understanding the logic of document filing and retrieval.

No wonder Ryan LaBrie found while researching his doctoral dissertation for Arizona State University that keyword searches produce merely 10 to 30 percent of relevant documents in organizational repositories. Such poor results occur when people actually spell the keywords correctly. Typing goofs, creative abbreviations and other inconsistencies cut the effectiveness of keyword searches even further.

In calling for a solution, St. Louis points to Michael Hammer's 1990 article on business process reengineering, which promotes using computers to redesign, not simply automate, existing business processes. That is exactly what St. Louis and Corral recommend organizations do for document-retrieval systems, by adopting dynamic visual hierarchies that build on the successful design elements of dimensional data warehouses.

Recognition vs. recall

For more than 100 years, cognitive psychologists studying memory have noted two dominant methods of prying information out of that personal database called the human brain. Those methods are recall and recognition. "Trying to remember the name of someone you met at a conference is recall. Recognition is remembering that person's name when looking at a list of conference attendees," according to St. Louis.

Psychologists have found that recognition outperforms recall as a mnemonic methodology. For that reason, it's easier to click on a link to categories of information that spotlight desired research subjects than it is to think up all the terms that might lead to an effective keyword search.

That's why St. Louis and Corral have come up with what they call "dynamic visual hierarchies," which they define as "an alternative mechanism for presenting keywords that is based on a recognition paradigm that can be dynamically updated. Keywords are arranged into a tree hierarchy to facilitate links to keyword phrases and enable browsing."

"The general idea is to categorize knowledge — in this case, documents — into a hierarchy so that the person searching will recognize phrases or words that more narrowly define the search," Corral says. Fashioned with dimensional data warehouses in mind, dynamic visual hierarchies slice and dice data into "lenses," or ways of looking at the information. Like dimensions in a data warehouse, the lenses represent ways people might want to search the data.

"We are drawing off a mature technology — data management," Corral explains, adding that the lessons learned over 30 years of data-management experience have not previously been applied to document management. "With data warehouses, we've taken data collections and reorganized them based on what we think users will need. We're suggesting the same thing with document management."

Expert organization

To refine the hierarchy and make it useful, St. Louis maintains that each user group must determine its own "lenses" or categories for document management. "You're not going to find a single set of lenses that works for the whole organization," St. Louis says. Finance, supply chain, engineering: each department has different document needs and, therefore, needs different search lenses.

Such search lenses might include document subject, retrieval frequency, user-acceptance, time frames and more. A research department, for instance, might want to add a "methodology used" lens to its retrieval mechanism for research reports. The system would filter out potentially irrelevant search results by looking first for documents that fit within several lenses or categories. Meanwhile, links would do the work of jetting users from keyword to document. Consequently, the information could be mentioned — essentially cross-referenced — in several branches of the tree hierarchy even though the document itself would reside in one location only, thereby saving storage space.
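To make the lens idea concrete, here is a minimal Python sketch of how such a structure might behave. It is not the researchers' implementation: the document IDs, lens names and helper functions are all hypothetical, and a real system would more likely keep this information in database tables. Each document is stored once, while the index refers to it from every lens it belongs to.

```python
# Hypothetical catalog: each document is stored once, tagged with lens values.
DOCUMENTS = {
    "rpt-001": {"subject": "supply chain", "methodology": "survey", "year": "2004"},
    "rpt-002": {"subject": "supply chain", "methodology": "experiment", "year": "2005"},
    "rpt-003": {"subject": "finance", "methodology": "survey", "year": "2005"},
}


def build_lens_index(documents):
    """Map lens -> value -> set of document IDs (links, not copies)."""
    index = {}
    for doc_id, tags in documents.items():
        for lens, value in tags.items():
            index.setdefault(lens, {}).setdefault(value, set()).add(doc_id)
    return index


def browse(index, lens):
    """List the values a searcher can recognize and click on for one lens."""
    return sorted(index.get(lens, {}))


def search(index, **choices):
    """Return IDs of documents that fit every selected lens value."""
    matches = None
    for lens, value in choices.items():
        ids = index.get(lens, {}).get(value, set())
        matches = set(ids) if matches is None else matches & ids
    return sorted(matches) if matches else []


index = build_lens_index(DOCUMENTS)
print(browse(index, "methodology"))
print(search(index, subject="supply chain", methodology="survey"))
```

The searcher never has to recall a keyword here. Browsing a lens shows the available choices, and combining choices from several lenses narrows the results by intersection.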

The key to the dynamic-visual-hierarchy solution is user-defined organization of documents. "The people who know how documents will be used should be putting that knowledge into storage and retrieval mechanisms," Corral says. The solution "does not take any specialized hardware or software," St. Louis says. "It can be done with Microsoft SQL Server."

Does it work? Research performed at the W. P. Carey School of Business indicates that it does.

"We ran an experiment with students," St. Louis says. "With traditional keywords, they were able to find 17.67 percent of relevant articles. With visual hierarchies, which would be looking at just one dimension — and our argument is that you need to look at many dimensions — they found 28.96 percent. That was more than a 60 percent improvement."

Because online information searches are notorious for bringing up irrelevant documents to peruse and discard, St. Louis and his team also evaluated search accuracy by subtracting irrelevant results from relevant ones. By that measure, experiment subjects achieved 5.37 percent search accuracy via keywords, which rely on recall, and 20.84 percent accuracy with dynamic visual hierarchies, which draw on recognition. Recognition beat recall almost four times over.
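For readers who want to check the comparisons, the short Python sketch below reproduces them from the percentages reported above. It only restates the published figures; the helper name is invented for this illustration, and the study's own accuracy formula is not reconstructed here.

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional gain of `after` over `before`."""
    return (after - before) / before


# Share of relevant articles found (percent), as reported by St. Louis.
keyword_found = 17.67
hierarchy_found = 28.96

# Search accuracy (percent) after subtracting irrelevant results.
keyword_accuracy = 5.37
hierarchy_accuracy = 20.84

improvement_pct = relative_improvement(keyword_found, hierarchy_found) * 100
accuracy_ratio = hierarchy_accuracy / keyword_accuracy

print(round(improvement_pct, 1))   # gain in relevant articles found, in percent
print(round(accuracy_ratio, 2))    # how many times more accurate the hierarchies were
```

The results, roughly 63.9 percent and 3.88 times, are consistent with the "more than a 60 percent improvement" and "almost four times" figures cited in the article.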

The model works in the lab, but will it work in an organization? St. Louis and Corral believe it will — provided the hierarchies are set up correctly. Everyone can benefit from knowledge management, St. Louis says, and everyone should be involved in setting it up, as well. But, he cautions, the key to reducing information overload is "more data, not less," as long as it's tidily organized.

Corral adds that while there is time and cost involved in setting up hierarchical systems, the effort will more than pay for itself. She quotes Nobel Memorial Prize winner Herbert Simon, who said, "A wealth of information creates a poverty of attention."

"That sums up the trouble" with search engines today, Corral says. "Getting back too much information is more of a problem than getting too little."
