KnowIT

Solving the small dataset dilemma

New research shows how multitask learning, a method that trains algorithms to solve multiple tasks at once while exploiting commonalities and differences across tasks, offers a solution to the challenges of small datasets and improves predictive accuracy across industries.

By Betsy Loeff

2024-09-22

Most people think a million is a pretty big number. After all, if you're not already a millionaire, wouldn't you like to be one? And doesn't it feel great when someone says, "You look like a million bucks?" But in data modeling, a million tokens — or parts of words — is a small dataset, and that can present big problems when creating predictive analytics to support decision-making.

Assistant Professor of Information Systems Kan Xu found a workaround for small datasets. He and his research colleague Hamsa Bastani from The Wharton School of Business created a two-stage multitask learning solution that delivers better predictive accuracy than commonly used approaches to analytics with limited information.

Numbers count

If you doubt a million tokens equals small data, consider this: OpenAI's original ChatGPT model was trained on some 500 billion tokens. Many companies and business sectors can't come close to such informational riches.

Big data only exists in big companies like Amazon and Google. Compared to billions of tokens, even moderately sized data — millions of tokens — is not considered large.

— Kan Xu, assistant professor of information systems

The main problem with small datasets is that they inhibit accurate predictions. "If you only have a few data points, the data size may not be large enough to train a machine learning algorithm," says Xu. He likens artificial intelligence models to the human mind. "Human beings learn from their experiences. It's the same for machine learning. AI models can only learn from what they observe."

Imagine what could happen when a model tries to diagnose or predict the proper treatment protocol for a patient who doesn't fit any of the symptoms shown by other patients in a small dataset. "If the dataset is not large enough, the machine learning will give you an answer that relies too much on the existing observations," Xu says. "The model's answer can be wrong."

Health care is a setting where small data is a common problem. In part, that's because some states prohibit hospitals from sharing patient records, Xu explains. He adds that sharing datasets is often hard because the databases have different structures, although organizations can simply share the training model so that the model itself is trained on each of the datasets.

Still, that can be problematic, too, especially in health care. "The population in Pennsylvania may be very different from the population in Colorado or Arizona because of the gender proportion differences, racial proportion differences, and other things related to health conditions," Xu explains. Local health care utilization patterns, hospital diagnostics, treatment patterns, and even medical coding practices can impact modeling accuracy, too. Studies have shown that models trained on medical record data from one hospital generally perform poorly when applied to patients from other hospitals, largely because of a phenomenon called dataset shift, notes Xu and Bastani in another paper about their research.

Health care isn't the only industry in which dataset shift happens. As noted in the research write-up, retailers with store-specific data might also need a way to protect against this small-data problem. So might a meal-delivery company that operates in different regions. "Across the delivery centers, we found demand for certain foods was very different," Xu says. "The reason is like what you see in the hospital setting. The population, the source of demand, differs from one area to another."

Splitting the difference

Xu and Bastani created a multitask estimator to leverage the differences between hospitals in predictive analytics. A multitask estimator is a machine-learning model designed to perform multiple tasks simultaneously. In the research Xu and Bastani conducted, they first trained separate models to predict diabetes risk at 13 hospitals using each hospital's medical records. Then the team shared the information with each hospital and identified distinctions within each site so that the hospitals could learn from each other to improve their diagnostic acumen.

"We find naturally that any two data sources will differ in a specific way," Xu says. "In this paper, we call it a sparse difference." Sparse datasets have missing data, so most cells in the database table are empty. For example, the team found that in certain hospitals, one feature to predict a patient's diabetes risk was whether people were taking a specific drug for osteoarthritis. "It raises the blood sugar," Xu explains. All told, there were 80 different features factored into the data model, but not all were significant. Those that were didn't necessarily have an impact at all hospitals, hence the term "sparse" was used to describe this information.

After sharing the data across hospitals, the model detected the feature distinctions and used them when appropriate to improve predictions on the patient population to find people with high diabetes risk. "The hospitals didn't need to detect these distinctions because the machine algorithm automatically did it to improve analysis," Xu says. Compared with how hospitals had been calculating diabetes risk previously, this approach improved prediction accuracy by 3.7% in one test hospital and 8% in another.

The team also augmented their multitask estimator by employing bandit algorithms, which balance exploring new approaches with exploiting the best-known solution. Bandit algorithms have value in organizations facing online learning challenges, such as a newly launched online retail site that starts with no customer or purchasing data and must collect it over time. The research team had their multitask estimator leverage multiple different bandit instances to improve the estimator's predictive power even further.

In the end, the multitask estimator Xu and Bastami developed — their multitask model approach —raised the model's predictive accuracy and lowered what data analysts call regret.

Regret is the difference between what a situation's optimal outcome would be and what happens. As Xu notes, "Minimizing cumulative regret is equivalent to maximizing the cumulative welfare of your patients or retail fulfillment center. It helps these companies or nonprofit institutions prepare better for the future so they will be able to generate more revenue and help more people."

Information Systems

Solving the small dataset dilemma

Numbers count

Splitting the difference

Latest news