Clinical drug trials are designed to compare a treatment against a placebo to find the better option for patients. These randomized controlled trials are the gold standard for establishing causality: with enough participants, they can identify the best course of action, and the knowledge gained allows drug makers to confidently distribute a drug to thousands, or even millions, of patients.
This traditional way of running experiments is not without its downsides, though. The group that receives the less beneficial treatment permanently loses out.
The stakes in a drug trial differ from those of testing variations of a website or advertisement for a company, but A/B tests have the same problem. “We gain information at the cost of these opportunity costs,” explains Mohsen Bayati, an associate professor at Stanford Graduate School of Business. The opportunity cost grows as the number of variants increases; each drug combination in a chemotherapy trial, for example, requires its own treatment group.
Bayati and Stanford Ph.D. students Nima Hamidi and Khashayar Khosravi recently sought to make these tests more efficient and reduce their opportunity costs. They discovered that a “greedy algorithm” could streamline trials that explore many options while still producing conclusive results.
Bet on Bandits
Ways to reduce such costs are not new, but only in recent years have they been put into practice, by tech-savvy businesses that need to compare many variants quickly. This experimental design, known as sequential or adaptive experimentation, draws on a set of ideas called the “multiarmed bandit,” named after a row of slot machines (or one-armed bandits). Each “arm,” whether a product, a headline, or something else, represents an option being tested.
Instead of allocating users or participants to all arms at once, as in a conventional experiment (100 users to condition A, another 100 to B, and so on), a multiarmed bandit assigns only a few at a time to each arm and adjusts subsequent allocations based on which arms show the best results. By the end of an adaptive experiment, fewer people will have been exposed to the suboptimal versions. If this adaptive approach were used in a drug trial, Bayati says, more patients would end up on the better treatment.
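To make the contrast with a conventional even split concrete, here is a toy Python simulation (not the researchers' method, and with made-up conversion rates): after an even warm-up batch, each later batch is tilted toward whichever version is performing better, so most users end up on the stronger arm.

```python
import random

# Toy illustration of adaptive allocation: two versions with hypothetical
# success rates, assigned in small batches, with later batches tilted toward
# whichever version looks better so far.
TRUE_RATES = {"A": 0.05, "B": 0.08}   # hypothetical conversion rates
successes = {"A": 0, "B": 0}
pulls = {"A": 0, "B": 0}

def play(arm):
    """Simulate one user seeing one version."""
    successes[arm] += 1 if random.random() < TRUE_RATES[arm] else 0
    pulls[arm] += 1

# Warm-up batch: split users evenly, as a conventional A/B test would.
for _ in range(100):
    play("A")
    play("B")

# Adaptive batches: send 80% of each new batch to the current leader.
for _ in range(9):
    rates = {a: successes[a] / pulls[a] for a in pulls}
    leader = max(rates, key=rates.get)
    trailer = "A" if leader == "B" else "B"
    for _ in range(200):
        play(leader if random.random() < 0.8 else trailer)

print(pulls)  # most users end up on the better-looking version
```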
That is the idea, but how do you optimize it so that as many users as possible end up on the best arm? That question is at the core of multiarmed bandit research, which dates back to the 1930s. It models the trade-off between exploring options to discover their rewards and exploiting options whose rewards are already known. Different situations call for different policies to resolve this dilemma. Bayati and his co-authors asked what approach to use when there are not just a few arms but a huge number (for instance, 10,000 arms, a scenario not uncommon in online gaming).
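One classic policy for balancing that trade-off is UCB1, sketched below purely for illustration; it is not the approach the researchers advocate. Each arm is scored by its observed average plus an uncertainty bonus that shrinks as the arm gets pulled more, so arms that have been tried less keep getting explored.

```python
import math
import random

# UCB1: a standard answer to the explore/exploit dilemma, shown only to make
# the trade-off concrete. Bernoulli (yes/no) rewards are an assumption here.
def ucb1(true_means, horizon):
    k = len(true_means)
    pulls, totals = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                      # pull each arm once to start
        else:
            # observed average plus exploration bonus
            arm = max(range(k), key=lambda a: totals[a] / pulls[a]
                      + math.sqrt(2 * math.log(t) / pulls[a]))
        totals[arm] += 1.0 if random.random() < true_means[arm] else 0.0
        pulls[arm] += 1
    return pulls

random.seed(1)
print(ucb1([0.3, 0.5, 0.7], horizon=5_000))  # most pulls go to the 0.7 arm
```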
One Hundred Arms Are Better Than 10,000
Drawing on mathematical analysis and simulations with real data, the researchers presented several findings at the 2020 Conference on Neural Information Processing Systems. They first showed that subsampling is the best strategy when dealing with a large number of arms. For example, instead of exploring all 10,000 arms, it is better to examine just 100. It is possible to find the very best arm by trying all of them, but the benefit over trying only a subsample is not large enough to justify such a high cost. A random sample is enough, Bayati says. “We show that even the best 100 [rather than 10,000] are very good.”
So what should you do with that smaller set of arms? The researchers’ second finding answers this question: their simulations demonstrated the effectiveness of the greedy algorithm, which exploits with almost no exploration. It pulls every arm once and then pulls only the best-performing arm each time.
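A minimal Python sketch of that greedy rule, assuming simple yes/no (Bernoulli) rewards: pull each arm once, then always pull the arm with the best observed average, with no further deliberate exploration.

```python
import random

# Greedy bandit as described above: one pull per arm to initialize the
# estimates, then always exploit the empirical leader.
def greedy(true_means, horizon):
    k = len(true_means)
    pulls, totals = [0] * k, [0.0] * k

    def pull(arm):
        totals[arm] += 1.0 if random.random() < true_means[arm] else 0.0
        pulls[arm] += 1

    for arm in range(k):          # try every arm exactly once
        pull(arm)
    for _ in range(horizon - k):  # then pull only the best-looking arm
        best = max(range(k), key=lambda a: totals[a] / pulls[a])
        pull(best)
    return pulls

random.seed(0)
print(greedy([0.2, 0.4, 0.6, 0.8], horizon=5_000))
```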
The researchers are still working out exactly why greedy performs so well, but part of the reason seems to be that it rules out most of the suboptimal arms in the earliest stages and never tries them again. Their analysis shows that the few exploratory pulls greedy does make are enough to narrow down the set of arms. “The greedy algorithm benefits from free [costless] exploration,” Bayati says; the algorithm doesn’t actively explore, yet it still learns. “That’s what I find interesting.”
Bayati and his team also proved a lower bound that applies to any algorithm for these problems, and showed that subsampling followed by greedy is optimal when the number of arms is at least the square root of the number of users.
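Putting the two findings together gives a subsample-then-greedy recipe. The sketch below is illustrative only: taking a random subsample of roughly the square root of the number of users is a simple reading of the result above, not the paper's exact tuning, and Bernoulli rewards are again an assumption.

```python
import math
import random

# Subsample-then-greedy: with very many arms, draw a random subsample of
# roughly sqrt(#users) arms, then run the greedy rule only on that subsample.
def subsampled_greedy(true_means, horizon):
    m = min(len(true_means), max(1, math.isqrt(horizon)))   # ~sqrt(#users) arms
    arms = random.sample(range(len(true_means)), m)          # random subsample
    pulls = {a: 0 for a in arms}
    totals = {a: 0.0 for a in arms}

    def pull(a):
        totals[a] += 1.0 if random.random() < true_means[a] else 0.0
        pulls[a] += 1

    for a in arms:                 # initialize each subsampled arm once
        pull(a)
    for _ in range(horizon - m):   # then pure exploitation of the leader
        pull(max(arms, key=lambda a: totals[a] / pulls[a]))
    return pulls

random.seed(2)
all_arms = [random.random() for _ in range(10_000)]  # 10,000 hypothetical arms
pulls = subsampled_greedy(all_arms, horizon=10_000)  # 10,000 users -> ~100 arms
print(f"explored {len(pulls)} of {len(all_arms)} arms")
```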
The researchers say this set of insights will matter to AI developers, because most reinforcement-learning algorithms are built around exploration, and the multiarmed bandit is a basic instance of reinforcement learning.
“The main takeaway is that you can reduce your experimentation,” Bayati says. Amazon, Facebook, and Uber are just a few of the companies that regularly run large-scale experiments on their users. “We are telling them that if they try the options they’ve seen so far, they might learn more than they imagined.”