Analysis of Regret Function Trends of Algorithms for Multi-armed Bandit Problem
DOI: 10.23977/csic2022.013
Corresponding Author
Bingjiang Xia
ABSTRACT
The multi-armed bandit (MAB) problem has attracted interest in reinforcement learning and other areas due to its wide range of applications, and many algorithms for it have been proposed. This paper discusses the three most common algorithms for the MAB problem: the ε-greedy algorithm, the upper confidence bound (UCB) algorithm, and the Thompson sampling algorithm. Since the performance of these algorithms, in terms of expected returns, is critical when deciding which one to use, this paper analyzes the trends and properties of their regret functions through simulations in order to compare their performance. The results suggest that the Thompson sampling algorithm usually performs best among the three algorithms in the simulations conducted.
KEYWORDS
multi-armed bandit problem, Thompson sampling, regret function
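The kind of simulation comparison described in the abstract could be sketched as follows. This is not the paper's actual code: the Bernoulli arm means, the horizon, ε = 0.1, the UCB1 exploration bonus, and Beta(1, 1) priors for Thompson sampling are all illustrative assumptions. Each episode tracks cumulative expected regret, i.e., the accumulated gap between the best arm's mean and the mean of the arm actually pulled.

```python
# Illustrative sketch: comparing cumulative regret of epsilon-greedy,
# UCB1, and Thompson sampling on a Bernoulli bandit (assumed setup).
import math
import random


def simulate(policy, means, horizon, seed=0, epsilon=0.1):
    """Run one bandit episode; return the final cumulative expected regret."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k            # pulls per arm
    sums = [0.0] * k            # total reward per arm
    alpha = [1.0] * k           # Beta(1, 1) priors for Thompson sampling
    beta = [1.0] * k
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if policy == "epsilon-greedy":
            # Explore with probability epsilon (or until every arm is tried),
            # otherwise exploit the arm with the highest empirical mean.
            if rng.random() < epsilon or 0 in counts:
                arm = rng.randrange(k)
            else:
                arm = max(range(k), key=lambda i: sums[i] / counts[i])
        elif policy == "ucb":
            # UCB1: empirical mean plus a confidence-width exploration bonus.
            if 0 in counts:
                arm = counts.index(0)  # play each arm once first
            else:
                arm = max(range(k), key=lambda i: sums[i] / counts[i]
                          + math.sqrt(2.0 * math.log(t) / counts[i]))
        else:
            # Thompson sampling: pick the arm whose posterior sample is largest.
            arm = max(range(k), key=lambda i: rng.betavariate(alpha[i], beta[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        regret += best - means[arm]  # expected regret of this pull
    return regret


if __name__ == "__main__":
    means = [0.3, 0.5, 0.7]  # hypothetical Bernoulli arm means
    for p in ("epsilon-greedy", "ucb", "thompson"):
        print(p, round(simulate(p, means, horizon=5000), 1))
```

Plotting the regret at every step (rather than only its final value) over many independent runs would reproduce the kind of regret-trend curves the paper compares.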