Batched Nonparametric Contextual Bandits

Rong Jiang, Committee on Computational and Applied Mathematics, University of Chicago
Cong Ma, Department of Statistics, University of Chicago
(February 2024;  Revised June 2024)
Abstract

We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy updates are made at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose a novel batch learning algorithm that achieves the optimal regret (up to logarithmic factors). In essence, our procedure dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch size. Our theoretical results suggest that for nonparametric contextual bandits, a nearly constant number of policy updates can attain optimal regret in the fully online setting.

1 Introduction

Recent years have witnessed substantial progress in the field of sequential decision making under uncertainty. Especially noteworthy are the advancements in personalized decision making, where the decision maker uses side-information to make customized decisions for a user. The contextual bandit framework has been widely adopted to model such problems because of its applicability and elegance [35, 53, 6]. In this framework, one interacts with an environment for a number of rounds: at each round, one is given a context, picks an action, and receives a reward. One can update the action-assignment policy based on previous observations, and the goal is to maximize the expected cumulative reward. For example, in online news recommendation, a recommendation algorithm selects an article for each newly arrived user based on the user’s contextual information, and observes whether the user clicks the article or not. The goal is to maximize the number of clicks received. Apart from news recommendation, contextual bandits have found numerous applications in other fields such as clinical trials, personalized medicine, and online advertising [30, 62, 13].

At the core of designing a contextual bandit algorithm is deciding how to update the policy based on prior observations. A standard metric of performance for bandit algorithms is regret, which is the expected difference between the cumulative rewards obtained by an oracle who knows the optimal action for every context and those obtained by the actual algorithm under consideration. Many existing regret-optimal bandit algorithms require a policy update per observation (unit) [4, 1, 39, 34]. At first glance, such frequent policy updates are needed so that the algorithm can quickly learn the optimal action under each context and reduce regret. However, this kind of algorithm ignores an important concern in the practice of sequential decision making: the batch constraint.

In many real world scenarios, the data often arrive in batches: the statistician can only observe the outcomes of the policy at the end of a batch, and then decides what to do for the next batch. For example, this batch constraint is ubiquitous in clinical trials: statisticians need to divide the participants into batches, determine a treatment allocation policy before the batch starts, and then observe all the outcomes at the end of the batch [49]. Policy updates are made per batch instead of per unit. In fact, it is infeasible to apply unit-wise policy update in this case because observing the effect of a treatment takes time and if one waits for the result before deciding how to treat the next patient, the entire experiment will take too long to complete when the number of participants is huge. The batch constraint also appears in areas such as online marketing, crowdsourcing, and simulations [8, 50, 31, 15]. Clearly, the batch constraint presents additional challenges to online learning. Indeed, from an information perspective, the statistician’s information set is largely restricted since she can only observe all the responses at the end of a batch. The following questions naturally arise:

Given a batch budget $M$ and a total number of $T$ rounds, how should the statistician determine the size of each batch, and how should she update the policy after each batch? Can the statistician design batch learning algorithms that achieve regret performances on par with the fully online setting using as few policy updates as possible?

1.1 Main contributions

In this work, we address the aforementioned questions under a classical framework for personalized decision making—nonparametric contextual bandits [48, 39]. In this framework, the expected reward associated with each treatment (or arm in the language of bandits) is modeled as a nonparametric smooth function of the covariates [59]. In the fully online setup, seminal works [48, 39] establish the minimax optimal regret bounds for the nonparametric contextual bandits. Nevertheless, under the more challenging setting with the batch constraint, the fundamental limits for nonparametric bandits remain unknown. Our paper aims to bridge this gap. More concretely, we make the following contributions:

  • First, we establish a minimax regret lower bound for the nonparametric bandits with the batch constraint. Our lower bound holds even when the batch size is adaptively chosen (based on the data observed in prior batches). The proof relies on a simple but useful insight that the worst-case regret over the entire horizon is greater than the worst-case regret over the first $i$ batches for any $1 \leq i \leq M$. To exploit this insight, for each different batch, we construct different families of hard instances to target it, leading to a maximal regret over this batch.

  • In addition, we demonstrate that the aforementioned lower bound is tight by providing a matching upper bound (up to log factors). Specifically, we design a novel algorithm—Batched Successive Elimination with Dynamic Binning (BaSEDB)—for the nonparametric bandits with batch constraints. BaSEDB progressively splits the covariate space into smaller bins whose widths are carefully selected to align well with the corresponding batch size. The delicate interplay between the batch size and the bin width is crucial for obtaining the optimal regret in the batch setting.

  • On the other hand, we show the suboptimality of static binning under the batch constraint by proving an algorithm-specific lower bound. Unlike the fully online setting where policies that use a fixed number of bins can attain the optimal regret [39], our lower bound indicates that batched successive elimination with static binning is strictly suboptimal. (Footnote 1: In a certain regime, the BSE policy from [39], which uses a fixed number of bins, could lose log factors compared to the optimal fully online regret. However, we will show that the price of fixed binning is polynomial under the batch setting.) This highlights the necessity of dynamic binning in some sense under the batch setting, which is uncommon in classical nonparametric estimation.

  • Last but not least, we demonstrate the challenge of adapting to the margin parameter in the batch setting. Specifically, we show that when $M$ is small, the price of not knowing the true margin parameter for an algorithm is at least a polynomial increase in terms of the regret.

It is also worth mentioning that an immediate consequence of our results is that $M \gtrsim \log\log T$ batches suffice to achieve the optimal regret in the fully online setting. In other words, we can use a nearly constant number of policy updates in practice to achieve the optimal regret obtained by policies that require one update per round.

1.2 Related work

Nonparametric contextual bandits.

[58] introduced the mathematical framework of contextual bandits. The theory of contextual bandits in the fully online setting has been continuously developed in the past few decades. On one hand, [4, 1, 23, 6, 7, 41] obtained learning guarantees for linear contextual bandits in both low and high dimensional settings. On the other hand, [59] introduced the nonparametric approach to model the mean reward function. [48] proved a minimax lower bound on the regret of nonparametric bandits and developed an upper-confidence-bound (UCB) based policy to achieve a near-optimal rate. [39] improved this result and proposed the Adaptively Binned Successive Elimination (ABSE) policy that can also adapt to the unknown margin parameter. Further insights in this nonparametric setting were developed in subsequent works [42, 43, 45, 24, 27, 52, 25, 10, 51, 9]. The smoothness assumption is also adopted in another line of work [37, 36, 33, 11] on the continuum-armed bandit problems. However, in contrast to what we study, the reward there is assumed to be a Lipschitz function of the action, and the covariates are not taken into consideration.

Batch learning.

The batch constraint has received increasing attention in recent years. [40, 21] considered the multi-armed bandit problem under the batch setting and showed that $O(\log\log T)$ batches are adequate for achieving the rate-optimal regret of the fully online setting. [26, 47] extended batch learning to the (generalized) linear contextual bandits and [46, 56, 17] further studied the setting with high-dimensional covariates. [29, 28] established batch learning guarantees for the Thompson sampling algorithm. [18] considered the Lipschitz continuum-armed bandit problem with the batch constraint. Inference for batched bandits was considered in [60]. A related concept in the literature is delayed feedback [14, 13, 55, 19]. These works consider the setting where rewards are observed with delay and analyze the effects of delay on the regret. [32, 2] studied delayed feedback in nonparametric bandits; the key difference from batch learning is that there the batch size is given, whereas in our case it is a design choice by the statistician. The focus of batch learning differs from that of delayed feedback in the sense that the former gives the decision maker discretion to choose the batch size, which makes it possible to approximate the optimal standard online regret with a small number of batches. Finally, the notion of switching cost is intimately related to the batch constraint. [12] studied online learning with low switching cost and obtained minimax optimal regret with $O(\log\log T)$ batches. [5, 61, 20, 57, 44] developed regret guarantees with low switching cost for reinforcement learning. Low switching cost can be interpreted as infrequent policy updates, but it does not require the learner to divide the samples into batches with feedback only becoming available at the end of a batch.

2 Problem setup

We begin by introducing the problem setup for nonparametric bandits with the batch constraint.

A two-arm nonparametric bandit with horizon $T \geq 1$ is specified by a sequence of independent and identically distributed random vectors

\[
(X_t, Y_t^{(1)}, Y_t^{(-1)}), \qquad \text{for } t = 1, 2, \ldots, T, \tag{1}
\]

where $X_t$ is sampled from a distribution $P_X$. Throughout the paper, we assume that $X_t \in \mathcal{X} \coloneqq [0,1]^d$, and $P_X$ has a density (w.r.t. the Lebesgue measure) that is bounded below and above by some constants $\underline{c}, \bar{c} > 0$, respectively. For $k \in \{1, -1\}$ and $t \geq 1$, we assume that $Y_t^{(k)} \in [0,1]$ and that

\[
\mathbb{E}[Y_t^{(k)} \mid X_t] = f^{(k)}(X_t).
\]

Here $f^{(k)}$ is the unknown mean reward function for the arm $k$.

Without the batch constraint, the game of nonparametric bandits is played sequentially. At each step $t$, the statistician observes the context $X_t$, and pulls an action $A_t \in \{1, -1\}$ according to a rule $\pi_t : \mathcal{X} \mapsto \{1, -1\}$. Then she receives the corresponding reward $Y_t^{(A_t)}$. In this case, the rule $\pi_t$ for selecting the action at time $t$ is allowed to depend on all the observations strictly anterior to $t$.

In an $M$-batch game, the statistician needs to design an $M$-batch policy $(\Gamma, \pi)$, where $\Gamma = \{t_0, t_1, \ldots, t_M\}$ is a partition of the entire time horizon $T$ that satisfies $0 = t_0 < t_1 < \cdots < t_{M-1} < t_M = T$, and $\pi = \{\pi_t\}_{t=1}^{T}$ is a sequence of random functions $\pi_t : \mathcal{X} \mapsto \{1, -1\}$. The grid $\Gamma$ can be chosen adaptively, meaning that the statistician can use all information up to $t_{i-1}$ to determine $t_i$. More specifically, prior to the start of the game, she will specify the first batch $t_1$, and at the end of $t_1$, she will use all the observations she has to decide the next batch $t_2$, and this process repeats in batches.
In contrast to the case without the batch constraint, only the rewards associated with timesteps prior to the current batch are observed and available for making decisions for the current batch. Specifically, let $\Gamma(t)$ be the batch index for the time $t$, i.e., $\Gamma(t)$ is the unique integer such that $t_{\Gamma(t)-1} < t \leq t_{\Gamma(t)}$. Then at time $t$, the available information for $\pi_t$ is only $\{X_l\}_{l=1}^{t} \cup \{Y_l^{(A_l)}\}_{l=1}^{t_{\Gamma(t)-1}}$, which we denote by $\mathcal{F}_t$. The statistician’s policy $\pi_t$ at time $t$ is allowed to depend on $\mathcal{F}_t$.
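
To make the information structure concrete, the batch index of a round can be computed from the grid by a binary search. The following is a minimal Python sketch and is not part of the original paper; the function names `batch_index` and `observed_reward_count` and the example grid are hypothetical choices made for illustration.

```python
from bisect import bisect_left


def batch_index(grid, t):
    """Return the unique i with grid[i-1] < t <= grid[i].

    `grid` is the list [t_0, t_1, ..., t_M] with t_0 = 0 and t_M = T.
    """
    if not 1 <= t <= grid[-1]:
        raise ValueError("t must lie in [1, T]")
    # bisect_left finds the first position i with grid[i] >= t.
    return bisect_left(grid, t)


def observed_reward_count(grid, t):
    """Number of rewards available when acting at time t.

    Only rewards from rounds up to the end of the previous batch are visible.
    """
    return grid[batch_index(grid, t) - 1]


# Hypothetical grid with M = 3 batches and horizon T = 100.
grid = [0, 10, 40, 100]
for t in (1, 10, 11, 40, 41, 100):
    print(t, batch_index(grid, t), observed_reward_count(grid, t))
```

Under this example grid, every decision in rounds 11 through 40 is made with only the rewards of rounds 1 through 10, even though all contexts up to the current round are visible.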

The goal of the statistician is to design an $M$-batch policy $(\Gamma, \pi)$ that can compete with an oracle that has perfect knowledge (i.e., the law of $(X_t, Y_t^{(1)}, Y_t^{(-1)})$) of the environment. Formally, we define the cumulative regret as

\[
R_T(\pi) \coloneqq \mathbb{E}\left[\sum_{t=1}^{T}\left(f^{\star}(X_t) - f^{(\pi_t(X_t))}(X_t)\right)\right], \tag{2}
\]

where $f^{\star}(x) \coloneqq \max_{k \in \{1, -1\}} f^{(k)}(x)$ is the maximum mean reward one could obtain on the context $x$. Note here we omit the dependence on $\Gamma$ for simplicity.
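
As an illustration of the quantity inside the expectation in (2), the sketch below sums the per-round gaps along one realized trajectory of contexts and actions. It is a hypothetical helper: the name `realized_regret` and the toy reward functions are assumptions, not taken from the paper. Averaging its output over many independent runs would approximate the cumulative regret.

```python
def realized_regret(contexts, actions, f_pos, f_neg):
    """Sum of f_star(x_t) - f^{(a_t)}(x_t) over one trajectory.

    f_pos and f_neg are the mean reward functions of arms +1 and -1.
    """
    total = 0.0
    for x, a in zip(contexts, actions):
        means = {1: f_pos(x), -1: f_neg(x)}
        total += max(means.values()) - means[a]
    return total


# Toy instance (assumed for illustration): arm +1 has mean x, arm -1 has mean 1/2.
def f_pos(x):
    return x


def f_neg(x):
    return 0.5


contexts = [0.2, 0.9, 0.5]
actions = [1, 1, -1]
print(realized_regret(contexts, actions, f_pos, f_neg))
```

Only the first round contributes here: at the context 0.2 the better arm is -1, so pulling arm +1 costs the gap between the two means, while the other two rounds pick an optimal arm.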

2.1 Assumptions

We adopt two standard assumptions in the nonparametric bandits literature [48, 39]. The first assumption is on the smoothness of the mean reward functions.

Assumption 1 (Smoothness).

We assume that the reward function for each arm is $(\beta, L)$-smooth, that is, there exist $\beta \in (0, 1]$ and $L > 0$ such that for $k \in \{1, -1\}$,

\[
|f^{(k)}(x) - f^{(k)}(x')| \leq L \|x - x'\|_2^{\beta}
\]

holds for all $x, x' \in \mathcal{X}$.

The second assumption is about the separation between the two reward functions.

Assumption 2 (Margin).

We assume that the reward functions satisfy the margin condition with parameter $\alpha > 0$, that is, there exist $\delta_0 \in (0, 1)$ and $D_0 > 0$ such that

\[
\mathbb{P}_X\left(0 < \left|f^{(1)}(X) - f^{(-1)}(X)\right| \leq \delta\right) \leq D_0 \delta^{\alpha}
\]

holds for all $\delta \in [0, \delta_0]$.

Assumption 2 is related to the margin condition in classification [38, 54, 3] and is introduced to bandits in [22, 48, 39]. The margin parameter affects the complexity of the problem. Intuitively, a small $\alpha$, say $\alpha \approx 0$, means the two mean functions are entangled with each other in many regions and hence it is challenging to distinguish them; a large $\alpha$, on the other hand, means the two reward functions are mostly well-separated.
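
The margin condition can be checked numerically on a simple example. The sketch below is an illustration under stated assumptions, not a construction from the paper: it takes $d = 1$, a uniform covariate on $[0, 1]$, and the hypothetical pair $f^{(1)}(x) = 1/2 + (x - 1/2)/2$, $f^{(-1)}(x) = 1/2$. For this pair the gap is at most $\delta$ exactly when $|x - 1/2| \leq 2\delta$, so the margin mass is $4\delta$ for $\delta \leq 1/4$, which corresponds to $\alpha = 1$ and $D_0 = 4$.

```python
def margin_mass(gap, delta, n_grid=100_000):
    """Approximate P(0 < |gap(X)| <= delta) for X ~ Uniform[0, 1].

    Uses a midpoint grid on [0, 1] instead of random sampling,
    so the result is deterministic.
    """
    hits = 0
    for j in range(n_grid):
        x = (j + 0.5) / n_grid
        g = abs(gap(x))
        if 0 < g <= delta:
            hits += 1
    return hits / n_grid


def gap(x):
    # Hypothetical instance: f^(1)(x) - f^(-1)(x) = (x - 1/2) / 2.
    return 0.5 * (x - 0.5)


for delta in (0.2, 0.1, 0.05):
    # Second column: numerical margin mass; third column: D_0 * delta**alpha with D_0 = 4, alpha = 1.
    print(delta, margin_mass(gap, delta), 4 * delta)
```

A steeper crossing of the two reward functions would shrink the constant $D_0$, while a flatter crossing (for instance a gap behaving like $|x - 1/2|^{2}$ near the crossing point) would correspond to a smaller $\alpha$.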

From now on, we use $\mathcal{F}(\alpha, \beta)$ to denote the class of nonparametric bandit instances (i.e., distributions over (1)) that satisfy Assumptions 1-2.

Remark 1.

Throughout the paper, we assume that $\alpha\beta \leq 1$. By Proposition 2.1 from [48], when $\alpha\beta > 1$, one of the arms will dominate the other one for the entire covariate space. The instance is reduced to a multi-armed bandit without covariates, which is not the interest of the current paper. Therefore, we focus on the case $\alpha\beta \leq 1$ hereafter.

3 Fundamental limits of batched nonparametric bandits

In this section, we establish minimax lower bounds for the regret achievable by any $M$-batch policy $(\Gamma, \pi)$; see Theorem 2. To begin with, we state a minimax lower bound, together with its proof, when the grid $\Gamma$ is prespecified, that is, the statistician divides the horizon $[1:T]$ into $M$ disjoint batches $[1:t_1]$, $[t_1+1:t_2]$, $\ldots$, $[t_{M-1}+1:T]$ before the game begins; see Theorem 1. As we will soon see, the proof of the lower bound with fixed grid is not only useful for establishing the lower bound for any general $M$-batch policy $(\Gamma, \pi)$, but also instrumental in our development of an optimal policy to be detailed in Section 4.

Recall that $\mathcal{F}(\alpha, \beta)$ denotes the class of nonparametric bandit instances (i.e., distributions over (1)) that obey Assumptions 1-2. We have the following minimax lower bound for any $M$-batch policy with a fixed grid, in which we define

\[
\gamma \coloneqq \frac{\beta(1+\alpha)}{2\beta + d} \in (0, 1).
\]

Theorem 1.

Suppose that $\alpha\beta \leq 1$, and assume that $P_X$ is the uniform distribution on $\mathcal{X} = [0,1]^d$. For any $M$-batch policy $(\Gamma, \pi)$ where $\Gamma$ is prespecified, there exists a nonparametric bandit instance in $\mathcal{F}(\alpha, \beta)$ such that the regret of $(\Gamma, \pi)$ on this instance is lower bounded by

\[
\mathbb{E}[R_T(\pi)] \geq \tilde{D}\, T^{\frac{1-\gamma}{1-\gamma^{M}}},
\]

where $\tilde{D} > 0$ is a constant independent of $T$ and $M$.

See Section 3.1 for the proof of this lower bound.

As a sanity check, one sees that as $M$ increases, the lower bound decreases. This is intuitive, as the policy is more powerful as $M$ increases. As a result, the problem of batched nonparametric bandits becomes easier.
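
The monotonicity in the number of batches can be checked numerically. The following Python sketch is an illustration only: the function names and the parameter values ($\alpha = \beta = 1$, $d = 1$, $T = 10^6$) are assumptions chosen for the example. It evaluates the exponent of $T$ in Theorem 1 for several batch budgets and compares it with its limit $1 - \gamma$ as the number of batches grows. It also picks a budget of order $\log\log T$, namely the smallest integer with $\gamma^{M} \leq 1/\log T$, and reports how far the bound is from that limit.

```python
import math


def gamma_rate(alpha, beta, d):
    """gamma = beta * (1 + alpha) / (2 * beta + d), as defined before Theorem 1."""
    return beta * (1 + alpha) / (2 * beta + d)


def batched_exponent(alpha, beta, d, M):
    """Exponent (1 - gamma) / (1 - gamma**M) of T in the Theorem 1 lower bound."""
    g = gamma_rate(alpha, beta, d)
    return (1 - g) / (1 - g ** M)


def excess_over_limit(T, alpha, beta, d, M):
    """Ratio T**batched_exponent / T**(1 - gamma), i.e. the factor above the M -> infinity limit."""
    g = gamma_rate(alpha, beta, d)
    return T ** (batched_exponent(alpha, beta, d, M) - (1 - g))


# Assumed example values (not from the paper).
alpha, beta, d, T = 1.0, 1.0, 1, 10**6
g = gamma_rate(alpha, beta, d)
# Smallest M with gamma**M <= 1 / log(T); this M is of order log log T.
M_star = math.ceil(math.log(math.log(T)) / math.log(1 / g))
for M in (1, 2, 3, M_star):
    print(M, batched_exponent(alpha, beta, d, M), excess_over_limit(T, alpha, beta, d, M))
```

With a single batch the exponent equals 1, so the bound is linear in $T$. Once $\gamma^{M} \leq 1/\log T$, the excess factor is at most $e$, because the extra exponent $(1-\gamma)\gamma^{M}/(1-\gamma^{M})$ is then at most $1/\log T$.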

3.1 Proof of Theorem 1

Let $(\Gamma, \pi)$ be the $M$-batch policy under consideration, with

\[
\Gamma = \{t_0 = 0, t_1, t_2, \ldots, t_M = T\}.
\]

Throughout this proof, we consider Bernoulli reward distributions, that is, $Y_t^{(1)}, Y_t^{(-1)}$ are Bernoulli random variables with mean $f^{(1)}(X_t)$ and $f^{(-1)}(X_t)$, respectively. In addition, we fix $f^{(-1)}(x) = \frac{1}{2}$. Let $f$ be the mean reward function of the first arm. To make the dependence on the reward instance clear, we write the cumulative regret up to time $n$ as $R_n(\pi; f)$.

Our proof relies on a simple observation: the worst-case regret over $[T]$ is at least as large as the worst-case regret over the first $i$ batches. Formally, we have

\[
\sup_{(f, \frac{1}{2}) \in \mathcal{F}(\alpha, \beta)} R_T(\pi; f) \geq \max_{1 \leq i \leq M} \sup_{(f, \frac{1}{2}) \in \mathcal{F}(\alpha, \beta)} R_{t_i}(\pi; f). \tag{3}
\]

Though simple, this observation gives us the freedom to choose different families of instances in $\mathcal{F}(\alpha, \beta)$ targeting different batch indices $i$.

Our proof consists of four steps. In Step 1, we reduce lower bounding the regret of a policy to lower bounding its inferior sampling rate, defined below. In Step 2, we detail the choice of different families of instances for each different batch index $i$. Then in Step 3, we apply an Assouad-type argument to lower bound the average inferior sampling rate over the family of hard instances. Lastly, in Step 4, we combine the arguments to complete the proof.

Step 1: Relating regret to inferior sampling rate.

Given an $M$-batch policy, we define its inferior sampling rate at time $n$ on an instance $(f, \frac{1}{2})$ to be

\[
S_n(\pi; f) \coloneqq \mathbb{E}\left[\sum_{t=1}^{n} \mathbf{1}\left\{\pi_t(X_t) \neq \pi^{\star}(X_t),\ f(X_t) \neq \tfrac{1}{2}\right\}\right].
\]

In words, $S_n(\pi; f)$ counts the expected number of times $\pi$ selects the strictly suboptimal arm up to time $n$. Thanks to the following lemma, we can reduce lower bounding the regret to lower bounding the inferior sampling rate.

Lemma 1 (Lemma 3.1 in [48]).

Suppose that $(f, \frac{1}{2}) \in \mathcal{F}(\alpha, \beta)$. Then for any $1 \leq n \leq T$, we have

\[
S_n(\pi; f) \leq D\, n^{\frac{1}{1+\alpha}} R_n(\pi; f)^{\frac{\alpha}{1+\alpha}},
\]

for some constant $D > 0$.

As an immediate consequence of the above lemma, we obtain

$$\begin{aligned}
\sup_{(f,\frac{1}{2})\in\mathcal{F}(\alpha,\beta)}R_{T}(\pi;f) &\geq\max_{1\leq i\leq M}\sup_{(f,\frac{1}{2})\in\mathcal{F}(\alpha,\beta)}\Big(\frac{1}{D}\Big)^{\frac{1+\alpha}{\alpha}}t_{i}^{-\frac{1}{\alpha}}\big(S_{t_{i}}(\pi;f)\big)^{\frac{1+\alpha}{\alpha}}\\
&=\Big(\frac{1}{D}\Big)^{\frac{1+\alpha}{\alpha}}\max_{1\leq i\leq M}t_{i}^{-\frac{1}{\alpha}}\left[\sup_{(f,\frac{1}{2})\in\mathcal{F}(\alpha,\beta)}S_{t_{i}}(\pi;f)\right]^{\frac{1+\alpha}{\alpha}}.
\end{aligned}$$
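To spell out this step, here is a short derivation sketch. It assumes $\alpha>0$ and that the cumulative regret $R_{n}(\pi;f)$ is nondecreasing in $n$ (each round contributes a nonnegative amount), which is how we read the definition of regret used here. Raising both sides of Lemma 1 to the power $\frac{1+\alpha}{\alpha}$ with $n=t_{i}$ and rearranging gives:

```latex
\begin{aligned}
S_{t_i}(\pi;f)^{\frac{1+\alpha}{\alpha}}
  &\leq D^{\frac{1+\alpha}{\alpha}}\, t_i^{\frac{1}{\alpha}}\, R_{t_i}(\pi;f), \\
R_T(\pi;f) \;\geq\; R_{t_i}(\pi;f)
  &\geq \Big(\frac{1}{D}\Big)^{\frac{1+\alpha}{\alpha}}\, t_i^{-\frac{1}{\alpha}}\,
        S_{t_i}(\pi;f)^{\frac{1+\alpha}{\alpha}} .
\end{aligned}
```

Taking the supremum over instances and the maximum over $1\leq i\leq M$ yields the first line of the display. The second line follows because $x\mapsto x^{\frac{1+\alpha}{\alpha}}$ is continuous and nondecreasing on $[0,\infty)$, so the supremum can be moved inside the power.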

From now on, we focus on lower bounding $\sup_{(f,\frac{1}{2})\in\mathcal{F}(\alpha,\beta)}S_{t_{i}}(\pi;f)$.

Step 2: Introducing the family of reward instances for $t_{i}$.

Our construction of the family of hard instances is adapted from [48]. Define $z_{1}=1$, and $z_{i}=\lceil t_{i-1}^{1/(2\beta+d)}\rceil$ for $i=2,3,\ldots,M$. Henceforth, we will fix some $i$ and write $z_{i}$ as $z$. We partition $[0,1]^{d}$ into $z^{d}$ bins with equal width. Denote the bins by $C_{j}$ for $j=1,\ldots,z^{d}$, and let $q_{j}$ be the center of $C_{j}$.

Define a set of binary sequences $\Omega_{s}\coloneqq\{\pm 1\}^{s}$, with $s\coloneqq\lceil z^{d-\alpha\beta}\rceil$. For each $\omega\in\Omega_{s}$ we define a function $f_{\omega}:[0,1]^{d}\to\mathbb{R}$:

$$f_{\omega}(x)=\frac{1}{2}+\sum_{j=1}^{s}\omega_{j}\varphi_{j}(x),$$

where $\varphi_{j}(x)=D_{\phi}z^{-\beta}\phi(2z(x-q_{j}))\mathbf{1}\{x\in C_{j}\}$ with $\phi(x)=(1-\|x\|_{\infty})^{\beta}\mathbf{1}\{\|x\|_{\infty}\leq 1\}$, and $D_{\phi}=\min(2^{-\beta}L,1/4)$. In all, we consider the family of reward instances

$$\mathcal{C}_{z}\coloneqq\left\{f^{(1)}(x)=f_{\omega}(x),\ f^{(-1)}(x)=\frac{1}{2}\mid\omega\in\Omega_{s}\right\}. \tag{4}$$

With slight abuse of notation, we also use $\mathcal{C}_{z}$ to denote $\{f_{\omega}:\omega\in\Omega_{s}\}$. It is straightforward to check that $\mathcal{C}_{z}\subseteq\mathcal{F}(\alpha,\beta)$.
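For intuition, the construction of $f_{\omega}$ can be sketched numerically. The code below is only an illustration under stated assumptions: the function names `bin_centers` and `make_f_omega` are hypothetical, the bins are enumerated in lexicographic order (the text does not fix an ordering), and the Hölder constant defaults to $L=1$.

```python
import itertools


def bin_centers(z, d):
    """Centers q_j of the z**d equal-width bins of [0, 1]^d.

    Assumption: bins are listed in lexicographic order.
    """
    ticks = [(k + 0.5) / z for k in range(z)]
    return list(itertools.product(ticks, repeat=d))


def make_f_omega(omega, z, d, beta, L=1.0):
    """Return f_omega(x) = 1/2 + sum_j omega_j * varphi_j(x).

    Only the first len(omega) bins carry a bump; elsewhere f_omega = 1/2.
    """
    centers = bin_centers(z, d)[: len(omega)]
    d_phi = min(2.0 ** (-beta) * L, 0.25)  # D_phi = min(2^{-beta} L, 1/4)

    def f(x):
        value = 0.5
        for w, q in zip(omega, centers):
            # Sup-norm of 2z(x - q_j); it is at most 1 exactly on the closed bin C_j.
            r = max(abs(2 * z * (xi - qi)) for xi, qi in zip(x, q))
            if r <= 1.0:
                value += w * d_phi * z ** (-beta) * (1.0 - r) ** beta
        return value

    return f


# Example: d = 1, z = 4 bins, bumps on the first two bins with signs (+1, -1).
f = make_f_omega(omega=(1, -1), z=4, d=1, beta=1)
print(f((0.125,)), f((0.375,)), f((0.625,)))
```

Each bump has height $D_{\phi}z^{-\beta}$ at the bin center and vanishes at the bin boundary, so the sign $\omega_{j}$ alone decides which arm is optimal inside $C_{j}$.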

Step 3: Lower bounding the inferior sampling rate.

Fix some $i\in[M]$, and consider $z=z_{i}$. Since $\mathcal{C}_{z}\subseteq\mathcal{F}(\alpha,\beta)$, we have

$$\sup_{(f,\frac{1}{2})\in\mathcal{F}(\alpha,\beta)}S_{t_{i}}(\pi;f)\geq\sup_{f\in\mathcal{C}_{z}}S_{t_{i}}(\pi;f).$$

Using the definitions of $\mathcal{C}_{z}$ and $S_{t_{i}}(\pi;f)$, we have

$$\begin{aligned}
\sup_{f\in\mathcal{C}_{z}}S_{t_{i}}(\pi;f) &=\sup_{\omega\in\Omega_{s}}\mathbb{E}_{\pi,f_{\omega}}\left[\sum_{t=1}^{t_{i}}\mathbf{1}\left\{\pi_{t}(X_{t})\neq\mathrm{sign}\Big(f_{\omega}(X_{t})-\frac{1}{2}\Big),\ f_{\omega}(X_{t})\neq\frac{1}{2}\right\}\right]\\
&\geq\frac{1}{2^{s}}\sum_{\omega\in\Omega_{s}}\mathbb{E}_{\pi,f_{\omega}}\left[\sum_{t=1}^{t_{i}}\mathbf{1}\left\{\pi_{t}(X_{t})\neq\mathrm{sign}\Big(f_{\omega}(X_{t})-\frac{1}{2}\Big),\ f_{\omega}(X_{t})\neq\frac{1}{2}\right\}\right].
\end{aligned}$$

Since $f_{\omega}(x)=\frac{1}{2}$ for $x\notin\bigcup_{j=1}^{s}C_{j}$, we further obtain

$$\sup_{f\in\mathcal{C}_{z}}S_{t_{i}}(\pi;f)\geq\frac{1}{2^{s}}\sum_{\omega\in\Omega_{s}}\sum_{t=1}^{t_{i}}\sum_{j=1}^{s}\mathbb{E}_{\pi,f_{\omega}}^{t}\left[\mathbf{1}\{\pi_{t}(X_{t})\neq\omega_{j},\ X_{t}\in C_{j}\}\right]. \tag{5}$$

Here we use $\mathbb{P}_{\pi,f_{\omega}}^{t}$ to denote the joint distribution of $\{X_{l}\}_{l=1}^{t}\cup\{Y_{l}^{\pi_{l}(X_{l})}\}_{l=1}^{\Gamma(t)-1}$, where $\Gamma(t)$ is the batch index for $t$, i.e., the unique integer such that $t_{\Gamma(t)-1}<t\leq t_{\Gamma(t)}$. We use $\mathbb{E}_{\pi,f_{\omega}}^{t}$ to denote the corresponding expectation. Expanding the right-hand side of (5), we see that

$$\sup_{f\in\mathcal{C}_{z}}S_{t_{i}}(\pi;f)\geq\frac{1}{2^{s}}\sum_{j=1}^{s}\sum_{t=1}^{t_{i}}\sum_{\omega_{[-j]}\in\Omega_{s-1}}\underbrace{\sum_{h\in\{\pm 1\}}\mathbb{E}_{\pi,f_{\omega_{[-j]}^{h}}}^{t}\left[\mathbf{1}\{\pi_{t}(X_{t})\neq h,\ X_{t}\in C_{j}\}\right]}_{W_{j,t,\omega_{[-j]}}}, \tag{6}$$

where $\omega_{[-j]}^{h}$ is the same as $\omega$ except for the $j$-th entry being $h$. Note that here we use the fact that for $f_{\omega_{[-j]}^{h}}$, the optimal arm in the bin $C_{j}$ is $h$. We then relate $W_{j,t,\omega_{[-j]}}$ to a binary testing error,

$$\begin{aligned}
W_{j,t,\omega_{[-j]}} &=\frac{1}{z^{d}}\sum_{h\in\{\pm 1\}}\mathbb{P}_{\pi,f_{\omega_{[-j]}^{h}}}^{t}\left(\pi_{t}(X_{t})\neq h\mid X_{t}\in C_{j}\right)\\
&\geq\frac{1}{4z^{d}}\exp\left[-\mathrm{KL}\left(\mathbb{P}_{\pi,f_{\omega_{[-j]}^{-1}}}^{t},\mathbb{P}_{\pi,f_{\omega_{[-j]}^{1}}}^{t}\right)\right],
\end{aligned} \tag{7}$$

where the second step invokes Le Cam’s method. Under the batch setting, at time $t$, the available information is only up to $t_{\Gamma(t)-1}$. Consequently, we can apply Lemma 5 to obtain

$$\mathrm{KL}\left(\mathbb{P}_{\pi,f_{\omega_{[-j]}^{-1}}}^{t},\mathbb{P}_{\pi,f_{\omega_{[-j]}^{1}}}^{t}\right)=\mathrm{KL}\left(\mathbb{P}_{\pi,f_{\omega_{[-j]}^{-1}}}^{t_{\Gamma(t)-1}},\mathbb{P}_{\pi,f_{\omega_{[-j]}^{1}}}^{t_{\Gamma(t)-1}}\right)\leq 2z^{-(2\beta+d)}t_{\Gamma(t)-1}. \tag{8}$$
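The role of the batch index $\Gamma(t)$ in (8) can be made concrete with a small sketch. The helper names (`batch_index`, `info_horizon`, `kl_upper_bound`) and the example grid are hypothetical; the grid is assumed to be stored as a sorted list `[t_0, t_1, ..., t_M]` with `t_0 = 0`.

```python
from bisect import bisect_left


def batch_index(t, grid):
    """Gamma(t): the unique integer with grid[Gamma(t) - 1] < t <= grid[Gamma(t)].

    Assumption: grid = [t_0, t_1, ..., t_M] is strictly increasing with t_0 = 0.
    """
    if not 1 <= t <= grid[-1]:
        raise ValueError("t must satisfy 1 <= t <= t_M")
    return bisect_left(grid, t)


def info_horizon(t, grid):
    """t_{Gamma(t) - 1}: the last time step whose rewards the policy has seen at time t."""
    return grid[batch_index(t, grid) - 1]


def kl_upper_bound(t, grid, z, beta, d):
    """Right-hand side of (8): 2 * z^{-(2 beta + d)} * t_{Gamma(t) - 1}."""
    return 2.0 * z ** (-(2 * beta + d)) * info_horizon(t, grid)


grid = [0, 10, 100, 1000]  # hypothetical grid with M = 3 batches
print(batch_index(11, grid), info_horizon(11, grid))
print(kl_upper_bound(11, grid, z=3, beta=1, d=1))
```

Within the first batch the horizon is $t_{0}=0$, so the bound in (8) vanishes: no rewards have been observed yet, and the two instances are indistinguishable to the policy.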

Combining (6), (7), and (8), we arrive at

$$\begin{aligned}
\sup_{f\in\mathcal{C}_{z}}S_{t_{i}}(\pi;f) &\geq\frac{1}{8}\sum_{j=1}^{s}\sum_{t=1}^{t_{i}}\frac{1}{z^{d}}\exp\left(-2z^{-(2\beta+d)}t_{\Gamma(t)-1}\right)\\
&\geq\frac{1}{8}\sum_{j=1}^{z^{d-\alpha\beta}}\sum_{l=1}^{i}\frac{t_{l}-t_{l-1}}{z^{d}}\exp\left(-2z^{-(2\beta+d)}t_{l-1}\right)\\
&\geq\frac{1}{8}\sum_{j=1}^{z^{d-\alpha\beta}}\sum_{l=1}^{i}\frac{t_{l}-t_{l-1}}{z^{d}}\exp\left(-2z^{-(2\beta+d)}t_{i-1}\right),
\end{aligned}$$

where the second line uses the fact that $s=\lceil z^{d-\alpha\beta}\rceil\geq z^{d-\alpha\beta}$ and groups the time steps by batch, and the last inequality holds since $t_{l-1}\leq t_{i-1}$ for all $1\leq l\leq i$. Now recall that $z=z_{i}=\lceil t_{i-1}^{1/(2\beta+d)}\rceil$ for $i>1$, and $z=1$ for $i=1$. We can continue the lower bound to see that

$$\begin{aligned}
\sup_{f\in\mathcal{C}_{z_{i}}}S_{t_{i}}(\pi;f) &\geq\frac{1}{8}\sum_{j=1}^{z^{d-\alpha\beta}}\sum_{l=1}^{i}\frac{t_{l}-t_{l-1}}{z^{d}}\exp\left(-2z^{-(2\beta+d)}t_{i-1}\right)\\
&\geq c^{\star}\sum_{j=1}^{z^{d-\alpha\beta}}\sum_{l=1}^{i}\frac{t_{l}-t_{l-1}}{z^{d}}\\
&=c^{\star}\cdot\frac{t_{i}}{z^{\alpha\beta}}=\begin{cases}c^{\star}\cdot\dfrac{t_{i}}{t_{i-1}^{\frac{\alpha\beta}{2\beta+d}}},&i>1\\ c^{\star}t_{1},&i=1\end{cases},
\end{aligned}$$

for some $c^{\star}>0$.
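As a sanity check on this chain, the following sketch evaluates the bound numerically for a hypothetical grid. It relies on our reading that, since $z^{2\beta+d}\geq t_{i-1}$ by the choice of $z$, the exponential factor is at least $e^{-2}$, so the middle inequality holds with, for instance, $c^{\star}=e^{-2}/8$. The function names and the example parameters are assumptions made for illustration only.

```python
import math


def bin_count(t_prev, beta, d):
    """z_i = ceil(t_{i-1}^{1/(2 beta + d)}), with z_1 = 1 because t_0 = 0."""
    if t_prev == 0:
        return 1
    return math.ceil(t_prev ** (1.0 / (2 * beta + d)))


def sampling_bound(grid, i, alpha, beta, d):
    """Evaluate (1/8) * (t_i / z^{alpha beta}) * exp(-2 z^{-(2 beta + d)} t_{i-1}).

    Returns the value together with z and the exponential damping factor.
    Assumption: grid = [t_0, t_1, ..., t_M] with t_0 = 0.
    """
    t_prev, t_i = grid[i - 1], grid[i]
    z = bin_count(t_prev, beta, d)
    damping = math.exp(-2.0 * z ** (-(2 * beta + d)) * t_prev)
    value = 0.125 * t_i * z ** (-alpha * beta) * damping
    return value, z, damping


grid = [0, 100, 1000]  # hypothetical grid with M = 2 batches
for i in (1, 2):
    value, z, damping = sampling_bound(grid, i, alpha=1, beta=1, d=1)
    # The damping factor never drops below exp(-2) because z^{2 beta + d} >= t_{i-1}.
    print(i, z, round(damping, 4), round(value, 4))
```

For $i=1$ the damping factor equals one and the bound is linear in $t_{1}$; for $i>1$ the bound scales like $t_{i}/t_{i-1}^{\alpha\beta/(2\beta+d)}$, up to the constant absorbed by the ceiling in the definition of $z$.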

Step 4: Combining bounds together.

Combining the previous arguments together leads to the conclusion that

\begin{align*}
\sup_{(f,\frac{1}{2})\in\mathcal{F}(\alpha,\beta)}R_{T}(\pi;f) &\geq\max_{1\leq i\leq M}\sup_{f\in\mathcal{C}_{z_{i}}}R_{t_{i}}(\pi;f) \\
&\geq\left(\frac{1}{D}\right)^{\frac{1+\alpha}{\alpha}}\max_{1\leq i\leq M}t_{i}^{-\frac{1}{\alpha}}\left[\sup_{f\in\mathcal{C}_{z_{i}}}S_{t_{i}}(\pi;f)\right]^{\frac{1+\alpha}{\alpha}} \\
&\gtrsim\max\left\{t_{1},\frac{t_{2}}{t_{1}^{\gamma}},\ldots,\frac{T}{t_{M-1}^{\gamma}}\right\} \tag{9} \\
&\geq\tilde{D}T^{\frac{1-\gamma}{1-\gamma^{M}}}.
\end{align*}

This finishes the proof.

3.2 Lower bound for general $M$-batch policy

Now we are ready to state the minimax lower bound for any general $M$-batch policy $(\Gamma,\pi)$, i.e., when the grid $\Gamma$ is allowed to be adaptively chosen.

Theorem 2.

Suppose that $\alpha\beta\leq 1$, and assume that $P_{X}$ is the uniform distribution on $\mathcal{X}=[0,1]^{d}$. For any $M$-batch policy $(\Gamma,\pi)$, there exists a nonparametric bandit instance in $\mathcal{F}(\alpha,\beta)$ such that the regret of $\pi$ on this instance is lower bounded by

$$\mathbb{E}[R_{T}(\pi)]\geq\tilde{D}_{1}\left(\frac{1}{M}\right)^{\tilde{D}_{2}}T^{\frac{1-\gamma}{1-\gamma^{M}}},$$

where $\tilde{D}_{1},\tilde{D}_{2}>0$ are constants independent of $T$ and $M$.

See Appendix A for the proof.

Our focus is on $M\lesssim\log\log T$ (when $M\gtrsim\log\log T$, by Corollary 1 there exists an algorithm whose regret attains the optimal fully online regret). In this regime the extra factor $(\frac{1}{M})^{\tilde{D}_{2}}$ in Theorem 2 is at least of order $(\log\log T)^{-\tilde{D}_{2}}$, so the lower bounds in Theorem 1 and Theorem 2 differ by at most poly-logarithmic factors in $T$.
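To get a feel for how quickly the batched exponent $\frac{1-\gamma}{1-\gamma^{M}}$ approaches its $M\to\infty$ limit $1-\gamma$, the following minimal sketch evaluates the multiplicative gap $T^{\frac{1-\gamma}{1-\gamma^{M}}-(1-\gamma)}$. The function names and the sample values of $\gamma$ and $T$ are illustrative assumptions, not part of the paper.

```python
def batched_exponent(gamma: float, M: int) -> float:
    """Exponent of T in the M-batch lower bound: (1 - gamma) / (1 - gamma^M)."""
    return (1.0 - gamma) / (1.0 - gamma ** M)


def regret_ratio(T: float, gamma: float, M: int) -> float:
    """Gap T^{(1-gamma)/(1-gamma^M)} / T^{1-gamma} between the M-batch rate and its M -> infinity limit."""
    return T ** (batched_exponent(gamma, M) - (1.0 - gamma))


# Illustrative values (assumed): gamma = 0.5 and T = 10^6.
gaps = {M: regret_ratio(1e6, 0.5, M) for M in (1, 2, 3, 5, 10)}
```

Since $\gamma<1$, the gap shrinks geometrically in $M$; once $\gamma^{M}\lesssim 1/\log T$, it is bounded by a constant, which is the intuition behind the $\log\log T$ threshold.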

In the fixed grid case, we choose a specific family of hard instances to target the regret in a certain batch. We cannot directly do so when the grid is adaptively selected, because the adversary does not know $\{t_{i}\}_{i=1}^{M}$ in advance. Inspired by [21], we overcome this difficulty by using an appropriately defined bad event that happens with sufficient probability to reduce the adaptive case to the fixed grid case. However, the proof in the nonparametric setting is much more challenging because the presence of contexts makes the instances for different batches less indistinguishable from one another. A key ingredient of our proof is a tight control of the total variation distance between two mixture distributions. The full proof can be found in Appendix A.

3.3 Implications on design of optimal $M$-batch policy

As mentioned above, the proof of the lower bound with a fixed grid (Theorem 1) guides the design of an optimal $M$-batch policy.

Grid selection.

First, the lower bound over the whole horizon is reduced to the worst-case regret over a specific batch; see (3). Consequently, we need to design the grid $\Gamma=(t_{0},t_{1},t_{2},\ldots,t_{M-1},t_{M})$ such that the total regret is evenly distributed across batches. More concretely, in view of the lower bound (9), one needs to set $t_{1}\asymp\frac{t_{i}}{t_{i-1}^{\gamma}}\asymp T^{\frac{1-\gamma}{1-\gamma^{M}}}$ for $2\leq i\leq M$.
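The balancing condition can be unpacked with a short calculation. The sketch below ignores constants, logarithmic factors, and integer rounding, and treats the relations $\asymp$ as equalities with a common value denoted $b$ here:

```latex
\begin{align*}
t_{1}=b,\quad t_{i}=b\,t_{i-1}^{\gamma}
&\;\Longrightarrow\; t_{i}=b^{1+\gamma+\cdots+\gamma^{i-1}}=b^{\frac{1-\gamma^{i}}{1-\gamma}}, \\
t_{M}=T
&\;\Longrightarrow\; b=T^{\frac{1-\gamma}{1-\gamma^{M}}}.
\end{align*}
```

In other words, equalizing all $M$ terms inside the maximum in (9) forces each of them to be of order $T^{\frac{1-\gamma}{1-\gamma^{M}}}$.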

Dynamic binning.

In addition, in the proof of the lower bound, for each different batch $i$, we use different families of hard reward instances, parameterized by the number of bins $z_{i}=\lceil t_{i-1}^{1/(2\beta+d)}\rceil$. In other words, from the lower bound perspective, the granularity (i.e., the bin width $1/z_{i}$) at which we investigate the mean reward function depends crucially on the grid points $\{t_{i}\}$: the larger the grid point $t_{i}$, the finer the granularity. This key observation motivates us to consider the batched successive elimination with dynamic binning algorithm to be introduced below.
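As a small numerical illustration of this dependence, the sketch below computes $z_{i}=\lceil t_{i-1}^{1/(2\beta+d)}\rceil$ for batches $i\geq 2$ from a given grid. The function name, the list representation of the grid, and the sample numbers are assumptions made for illustration; batch 1 is skipped because $t_{0}=0$.

```python
import math


def hard_instance_bins(grid, beta, d):
    """Return {i: z_i} for batches i >= 2, where z_i = ceil(t_{i-1}^{1/(2*beta + d)}).

    `grid` is the list [t_1, ..., t_M] of batch end points.
    """
    exponent = 1.0 / (2.0 * beta + d)
    # Batch i uses t_{i-1}, which is stored at position i - 2 of `grid`.
    return {i: math.ceil(grid[i - 2] ** exponent) for i in range(2, len(grid) + 1)}


# Illustrative grid (assumed): later batches get more bins, hence narrower bins of width 1 / z_i.
bins = hard_instance_bins([100, 2000, 50000], beta=1.0, d=1)
```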

4 Batched successive elimination with dynamic binning

Algorithm 1 Batched successive elimination with dynamic binning (BaSEDB)
  Input: Number of batches $M$, grid $\Gamma=\{t_{i}\}_{i=0}^{M}$, split factors $\{g_{i}\}_{i=0}^{M-1}$.
  $\mathcal{L}\leftarrow\mathcal{B}_{1}$
  for $C\in\mathcal{L}$ do
    $\mathcal{I}_{C}=\mathcal{I}$
  for $i=1,\ldots,M-1$ do
    for $t=t_{i-1}+1,\ldots,t_{i}$ do
      $C\leftarrow\mathcal{L}(X_{t})$
      Pull an arm from $\mathcal{I}_{C}$ in a round-robin way.
      if $t=t_{i}$ then
        Update $\mathcal{L}$ and $\{\mathcal{I}_{C}\}_{C\in\mathcal{L}}$ by Algorithm 2 $(\mathcal{L},\{\mathcal{I}_{C}\}_{C\in\mathcal{L}},i,g_{i})$.
  for $t=t_{M-1}+1,\ldots,T$ do
    $C\leftarrow\mathcal{L}(X_{t})$
    Pull any arm from $\mathcal{I}_{C}$.

Algorithm 2 Tree growing subroutine
  Input: Active nodes $\mathcal{L}$, active arm sets $\{\mathcal{I}_{C}\}_{C\in\mathcal{L}}$, batch number $i$, split factor $g_{i}$.
  $\mathcal{L}^{\prime}\leftarrow\{\}$
  for $C\in\mathcal{L}$ do
    if $|\mathcal{I}_{C}|=1$ then
      $\mathcal{L}^{\prime}\leftarrow\mathcal{L}^{\prime}\cup\{C\}$
      Proceed to next $C$ in the iteration.
    $\bar{Y}_{C,i}^{\max}\leftarrow\max_{k\in\mathcal{I}_{C}}\bar{Y}_{C,i}^{(k)}$
    for $k\in\mathcal{I}_{C}$ do
      if $\bar{Y}_{C,i}^{\max}-\bar{Y}_{C,i}^{(k)}>U(m_{C,i},T,C)$ then $\mathcal{I}_{C}\leftarrow\mathcal{I}_{C}\setminus\{k\}$
    if $|\mathcal{I}_{C}|>1$ then
      $\mathcal{I}_{C^{\prime}}\leftarrow\mathcal{I}_{C}$ for $C^{\prime}\in\mathrm{child}(C,g_{i})$
      $\mathcal{L}^{\prime}\leftarrow\mathcal{L}^{\prime}\cup\mathrm{child}(C,g_{i})$
    else
      $\mathcal{L}^{\prime}\leftarrow\mathcal{L}^{\prime}\cup\{C\}$
  Return $\mathcal{L}^{\prime}$

In this section, we present the batched successive elimination with dynamic binning policy (BaSEDB), which attains the minimax lower bound up to logarithmic factors; see Algorithm 1. At a high level, Algorithm 1 gradually partitions the covariate space $\mathcal{X}$ into smaller hypercubes (i.e., bins) throughout the batches based on a list of carefully chosen cube widths, and reduces the nonparametric bandit in each cube to a bandit problem without covariates.

[Figure 1: tree diagram. Root: $[0,1]$. Depth 1: $[0,\frac{1}{4})$, $[\frac{1}{4},\frac{1}{2})$, $[\frac{1}{2},\frac{3}{4})$, $[\frac{3}{4},1]$. Depth 2: children of $[0,\frac{1}{4})$ are $[0,\frac{1}{12})$, $[\frac{1}{12},\frac{1}{6})$, $[\frac{1}{6},\frac{1}{4})$; children of $[\frac{3}{4},1]$ are $[\frac{3}{4},\frac{5}{6})$, $[\frac{5}{6},\frac{11}{12})$, $[\frac{11}{12},1]$.]
Figure 1: An example of the tree growing process for $d=1$, $M=3$, $G=\{4,3,1\}$. The root node is at depth 0. For the first batch, the 4 nodes located at depth 1 of the tree were used. Both $[\frac{1}{4},\frac{1}{2})$ and $[\frac{1}{2},\frac{3}{4})$ only had one active arm remaining so they were not further split and remained in the set of active nodes (green). Meanwhile, $|\mathcal{I}_{[0,\frac{1}{4})}|=|\mathcal{I}_{[\frac{3}{4},1]}|=2$ so each of them was split into 3 smaller nodes, and both nodes were marked as inactive (red). For the second batch, all the green nodes were actively used but arm elimination was performed at the end of batch 2 only for nodes located at depth 2 (the green nodes at depth 1 already have 1 active arm remaining so there is no need to eliminate again).
A tree-based interpretation.

The process is best illustrated with the notion of a tree $\mathcal{T}$ of depth $M$; see Figure 1. Each layer of the tree $\mathcal{T}$ is a set of bins that form a regular partition of $\mathcal{X}$ using hypercubes with equal widths. The common width of the bins $\mathcal{B}_{i}$ in layer $i$ is dictated by a list $\{g_{i}\}_{i=0}^{M-1}$ of split factors. More precisely, we let

$$w_{i}\coloneqq\Big(\prod_{l=0}^{i-1}g_{l}\Big)^{-1} \tag{10}$$

be the width of the cubes in the $i$-th layer $\mathcal{B}_{i}$ for $i\geq 1$, and $w_{0}=1$. In other words, $\mathcal{B}_{i}$ contains all the cubes

$$C_{i,\bm{v}}=\{x\in\mathcal{X}:(v_{j}-1)w_{i}\leq x_{j}<v_{j}w_{i},\ 1\leq j\leq d\},$$

where $\bm{v}=(v_{1},v_{2},\ldots,v_{d})\in[\frac{1}{w_{i}}]^{d}$. As a result, there are in total $(\frac{1}{w_{i}})^{d}$ bins in $\mathcal{B}_{i}$.
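The indexing of bins can be made concrete with a short sketch. It computes $1/w_{i}$ from the split factors as in (10) and returns the index vector $\bm{v}$ of the cube $C_{i,\bm{v}}$ containing a point $x$. The function names are hypothetical, and assigning the boundary value $x_{j}=1$ to the last bin is an assumption of this sketch (Figure 1 closes the rightmost interval in the same way).

```python
from math import prod


def layer_width_inverse(split_factors, i):
    """Return 1 / w_i = g_0 * ... * g_{i-1}, following (10); equals 1 when i = 0."""
    return prod(split_factors[:i])


def bin_index(x, split_factors, i):
    """Return v = (v_1, ..., v_d) with 1 <= v_j <= 1 / w_i such that x lies in C_{i,v}."""
    n = layer_width_inverse(split_factors, i)  # number of bins per coordinate
    # v_j = floor(x_j / w_i) + 1, clamped so that x_j = 1 falls into the last bin.
    return tuple(min(int(xj * n) + 1, n) for xj in x)


# Split factors of Figure 1: layer 1 has 4 bins and layer 2 has 12 bins per coordinate.
G = [4, 3, 1]
v = bin_index((0.8,), G, 2)  # index of the depth-2 bin [3/4, 5/6)
```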

Algorithm 1 proceeds in batches and maintains two key objects: (1) a list $\mathcal{L}$ of active bins, and (2) the corresponding active arms $\mathcal{I}_{C}$ for each $C\in\mathcal{L}$; see Figure 1 for an example. Specifically, prior to the game (i.e., prior to the first batch), $\mathcal{L}$ is set to be $\mathcal{B}_{1}$, all bins in layer 1, and $\mathcal{I}_{C}=\{1,-1\}$ for all $C\in\mathcal{L}$. Within this batch, the statistician pulls the arms in $\mathcal{I}_{C}$ equally often for every bin in $\mathcal{L}$. Then at the end of the batch, given the rewards revealed in this batch, we update the active arms $\mathcal{I}_{C}$ for each $C\in\mathcal{L}$ via successive elimination. If no arm is eliminated from $\mathcal{I}_{C}$, this suggests that the current bin is not fine enough for the statistician to tell the difference between the two arms. As a result, she splits the bin $C\in\mathcal{L}$ into its children $\mathrm{child}(C)$ in $\mathcal{T}$. All the child nodes are added to $\mathcal{L}$, while the parent $C$ stops being active (i.e., $C$ is removed from $\mathcal{L}$). The whole process repeats batch by batch.$^{2}$
$^{2}$ For the final batch $M$, the split factor $g_{M-1}=1$ by default because there is no need to further partition the nodes for estimation.
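The splitting step can be sketched in a few lines. Here a bin is stored as a pair (lower corner, width), which is a representation chosen for this illustration rather than one prescribed by the paper; `children`, `grow`, and `figure1_second_batch` are hypothetical helper names. Exact fractions are used so that the output can be compared with Figure 1.

```python
from fractions import Fraction
from itertools import product


def children(lower, width, g):
    """Split the cube with lower corner `lower` and side length `width` into g^d equal subcubes."""
    child_width = width / g
    return [
        (tuple(lo + s * child_width for lo, s in zip(lower, shift)), child_width)
        for shift in product(range(g), repeat=len(lower))
    ]


def grow(active, g):
    """Update the active bins once elimination has been applied.

    `active` maps a bin, stored as (lower corner, width), to its set of active arms.
    """
    updated = {}
    for (lower, width), arms in active.items():
        if len(arms) > 1:
            # Arms are still indistinguishable: the children replace the parent.
            for child in children(lower, width, g):
                updated[child] = set(arms)
        else:
            # A single arm remains: the bin stays active and is not split.
            updated[(lower, width)] = set(arms)
    return updated


def figure1_second_batch():
    """Active bins for batch 2 in the setting of Figure 1 (d = 1, g_1 = 3)."""
    q = Fraction(1, 4)
    # Which arm survives in the two middle bins is an arbitrary choice made for illustration.
    after_batch_1 = {
        ((Fraction(0),), q): {1, -1},
        ((q,), q): {1},
        ((2 * q,), q): {-1},
        ((3 * q,), q): {1, -1},
    }
    return grow(after_batch_1, 3)
```

Calling `figure1_second_batch()` yields eight active bins: the two unsplit middle bins of width $\frac{1}{4}$ and six bins of width $\frac{1}{12}$, matching the green nodes of Figure 1.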

Grid $\Gamma$ and split factors $\{g_{i}\}_{i=0}^{M-1}$.

As one can see, the split factor $g_{i}$ controls how many children a node at layer $i$ can have, and its appropriate choice is crucial for obtaining small regret. Intuitively, $g_{i}$ should be selected in a way such that a node $C_{i+1}$ with width $w_{i+1}$ can fully leverage the number of samples allocated to it during the $(i+1)$-th batch. With these goals in mind, we design the grid $\Gamma=\{t_{i}\}$ and split factors $\{g_{i}\}$ as follows. Recall that $\gamma=\frac{\beta(1+\alpha)}{2\beta+d}$. We set

$$b=\Theta\left(T^{\frac{1-\gamma}{1-\gamma^{M}}}\right).$$

The split factors are chosen according to

$$g_{0}=\lfloor b^{\frac{1}{2\beta+d}}\rfloor,\qquad\text{and}\qquad g_{i}=\lfloor g_{i-1}^{\gamma}\rfloor,\quad i=1,\ldots,M-2. \tag{11}$$

In addition, the grid is chosen such that

$$t_{i}-t_{i-1}=\lfloor l_{i}w_{i}^{-(2\beta+d)}\log(Tw_{i}^{d})\rfloor,\quad 1\leq i\leq M-1, \tag{12}$$

where $l_{i}>0$ is a constant to be specified later. It is easy to check that with these choices, we have

$$t_{1}\asymp T^{\frac{1-\gamma}{1-\gamma^{M}}},\qquad\text{and}\qquad t_{i}=\lfloor b(t_{i-1})^{\gamma}\rfloor,\quad\text{for }i=2,\ldots,M.$$

In particular, we set $b$ properly to make $t_{M}=T$. Indeed, these choices taken together meet the expectation laid out in Section 3.3: we need to choose the grid and the split factors appropriately so that (1) the total regret spreads out across different batches, and (2) the granularity becomes finer as we move further to later batches.
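The choices (11) and (12) can be turned into a small schedule calculator. The sketch below is illustrative only: the constant hidden in $b=\Theta(\cdot)$ is taken to be 1, every $l_{i}$ is set to a single value `l`, the last grid point is simply set to $T$ instead of tuning $b$, and a small tolerance is added before taking floors to absorb floating-point error. The function name `basedb_schedule` is hypothetical.

```python
import math


def basedb_schedule(T, M, alpha, beta, d, l=1.0):
    """Return (split_factors, grid) following (11) and (12) under the assumptions stated above."""
    if M < 2:
        raise ValueError("this sketch expects at least two batches")
    eps = 1e-9  # guards floor() against floating-point error, e.g. in 8 ** (2 / 3)
    gamma = beta * (1 + alpha) / (2 * beta + d)
    b = T ** ((1 - gamma) / (1 - gamma ** M))

    # Split factors: g_0 = floor(b^{1/(2 beta + d)}), g_i = floor(g_{i-1}^gamma), and g_{M-1} = 1.
    g = [max(1, math.floor(b ** (1 / (2 * beta + d)) + eps))]
    for _ in range(1, M - 1):
        g.append(max(1, math.floor(g[-1] ** gamma + eps)))
    g.append(1)

    # Grid: t_i - t_{i-1} = floor(l * w_i^{-(2 beta + d)} * log(T * w_i^d)), where 1 / w_i = g_0 ... g_{i-1}.
    grid = [0]
    for i in range(1, M):
        n = math.prod(g[:i])  # 1 / w_i
        length = math.floor(l * n ** (2 * beta + d) * math.log(T / n ** d))
        grid.append(min(T, grid[-1] + max(1, length)))  # clamp so the grid never exceeds T
    grid.append(T)
    return g, grid


# Illustrative parameters (assumed): T = 10^6, M = 3, alpha = beta = 1, d = 1.
split_factors, grid = basedb_schedule(T=10**6, M=3, alpha=1.0, beta=1.0, d=1)
```

With these parameters $\gamma=\frac{2}{3}$, and the split factors shrink from one layer to the next while the batch lengths grow, which mirrors the two design goals stated above.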

When to eliminate arms?

Now we zoom in on the elimination process described in Algorithm 2. The basic idea follows from successive elimination in the bandit literature [16, 39, 21]: the statistician eliminates an arm from Csubscript𝐶\mathcal{I}_{C}caligraphic_I start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT if she expects the arm to be suboptimal in the bin C𝐶Citalic_C given the rewards collected in C𝐶Citalic_C. Specifically, for any node C𝒯𝐶𝒯C\in\mathcal{T}italic_C ∈ caligraphic_T, define

U(τ,T,C)4log(2T|C|d)τ,𝑈𝜏𝑇𝐶42𝑇superscript𝐶𝑑𝜏U(\tau,T,C)\coloneqq 4\sqrt{\frac{\log(2T|C|^{d})}{\tau}},italic_U ( italic_τ , italic_T , italic_C ) ≔ 4 square-root start_ARG divide start_ARG roman_log ( 2 italic_T | italic_C | start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) end_ARG start_ARG italic_τ end_ARG end_ARG ,

where $|C|$ denotes the width of the bin. Let $m_{C,i}\coloneqq\sum_{t=t_{i-1}+1}^{t_{i}}\mathbf{1}\{X_{t}\in C\}$ be the number of times we observe contexts from $C$ in batch $i$. We then define for $k\in\{1,-1\}$ that

$$\bar{Y}_{C,i}^{(k)}\coloneqq\frac{\sum_{t=t_{i-1}+1}^{t_{i}}Y_{t}\cdot\mathbf{1}\{X_{t}\in C,A_{t}=k\}}{\sum_{t=t_{i-1}+1}^{t_{i}}\mathbf{1}\{X_{t}\in C,A_{t}=k\}},$$

which is the empirical mean reward of arm $k$ in node $C$ during the $i$-th batch. It is easy to check that $\bar{Y}_{C,i}^{(k)}$ has expectation $\bar{f}_{C}^{(k)}$ given by

$$\bar{f}_{C}^{(k)}\coloneqq\mathbb{E}[f^{(k)}(X)\mid X\in C]=\frac{1}{\mathbb{P}_{X}(C)}\int_{C}f^{(k)}(x)\,\mathrm{d}\mathbb{P}_{X}(x).$$

Similarly, we define the average optimal reward in bin $C$ to be

$$\bar{f}_{C}^{\star}\coloneqq\frac{1}{\mathbb{P}_{X}(C)}\int_{C}f^{\star}(x)\,\mathrm{d}\mathbb{P}_{X}(x).$$
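
As an illustration, the per-bin batch statistics $m_{C,i}$ and $\bar{Y}_{C,i}^{(k)}$ can be computed as in the following minimal sketch. It assumes $d=1$, a half-open bin $[\mathrm{lo},\mathrm{hi})$, and that the history is stored in arrays; the function name `bin_batch_stats` and its arguments are hypothetical and not part of Algorithm 1.

```python
import numpy as np

def bin_batch_stats(X, A, Y, lo, hi, t_prev, t_cur):
    """Statistics of the bin C = [lo, hi) during the batch of rounds t_prev+1..t_cur.

    X, A, Y hold contexts, pulled arms (+1 or -1) and rewards for rounds 1..T,
    stored at 0-based positions 0..T-1 (a storage convention assumed here).
    Returns (m, means): m is the number of contexts that fell in C during the
    batch, and means[k] is the empirical mean reward of arm k in C.
    """
    X, A, Y = (np.asarray(v) for v in (X, A, Y))
    batch = slice(t_prev, t_cur)              # rounds t_prev+1..t_cur, 0-based
    in_bin = (X[batch] >= lo) & (X[batch] < hi)
    m = int(in_bin.sum())                     # m_{C,i}
    means = {}
    for k in (1, -1):
        pulled = in_bin & (A[batch] == k)     # indicator {X_t in C, A_t = k}
        n_k = int(pulled.sum())
        # The ratio is undefined when arm k was never pulled in C this batch.
        means[k] = float(Y[batch][pulled].mean()) if n_k > 0 else float("nan")
    return m, means
```

Returning NaN for an unpulled arm is a choice made for this sketch only; the text does not specify how an empty denominator is handled.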

The elimination threshold $U(m_{C,i},T,C)$ is chosen such that an arm $k$ with $\bar{f}_{C}^{\star}-\bar{f}_{C}^{(k)}\gg|C|^{\beta}$ is eliminated with high probability at the end of batch $i$. Therefore, when $|\mathcal{I}_{C}|>1$, the remaining arms are statistically indistinguishable from each other, so $C$ is split into smaller nodes to estimate those arms more accurately using samples from future batches. On the other hand, when $|\mathcal{I}_{C}|=1$, the remaining arm is optimal in $C$ with high probability (a consequence of the smoothness condition), and it will be exploited in the later batches.
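
The eliminate-then-split-or-exploit logic described above can be sketched as follows. This is only an illustration under stated assumptions: the comparison rule (drop an arm whose batch mean trails the best batch mean by more than the threshold) is the standard successive-elimination rule and is assumed here, the threshold value is passed in rather than computed from $U(m_{C,i},T,C)$, and the name `update_node` is hypothetical. Algorithm 1 remains the authoritative description.

```python
def update_node(batch_means, active_arms, threshold):
    """Decide what happens to one bin C at the end of a batch.

    batch_means: dict mapping arm -> empirical mean reward in C for this batch
    active_arms: arms still active in C (the set I_C)
    threshold:   numeric value standing in for U(m_{C,i}, T, C) (assumed given)
    Returns (survivors, action), where action is "split" or "exploit".
    """
    best = max(batch_means[k] for k in active_arms)
    # Assumed rule: keep an arm only if it is within `threshold` of the best.
    survivors = [k for k in active_arms if best - batch_means[k] <= threshold]
    # More than one survivor: the arms are indistinguishable at this
    # resolution, so the bin is refined. One survivor: exploit it.
    action = "split" if len(survivors) > 1 else "exploit"
    return survivors, action
```

For example, with batch means 0.7 and 0.4, a threshold of 0.1 leaves a single arm to exploit, whereas a threshold of 0.5 keeps both arms and triggers a split.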

Connections and differences with ABSE in [39].

In appearance, BaSEDB (Algorithm 1) looks quite similar to the Adaptively Binned Successive Elimination (ABSE) proposed in [39]. However, we would like to emphasize several fundamental differences. First, the motivations for the algorithms are completely different. [39] designs ABSE to adapt to the unknown margin condition $\alpha$, while our focus is to tackle the batch constraint. In fact, without the batch constraints, if $\alpha$ is known, adaptive binning is not needed to achieve the optimal regret [39]. This is certainly not the case in the batched setting. Fixing the number of bins used across different batches is suboptimal because one can construct instances that cause the regret incurred during a certain batch to explode. We will expand on this phenomenon in Section 4.3. Second, the algorithm in [39] partitions a bin into a fixed number $2^{d}$ of smaller ones once the original bin is unable to distinguish the remaining arms. In this way, the algorithm can adapt to the difference in the local difficulty of the problem. In comparison, one of our main contributions is to carefully design the list of varying split factors that allows the new cubes to maximally utilize the number of samples allocated to them during the next batch.

4.1 Regret guarantees

Now we are ready to present the regret performance of BaSEDB (Algorithm 1).

Theorem 3.

Suppose that $\alpha\beta\leq 1$. Fix any constant $D_{1}>0$ and suppose that $M\leq D_{1}\log T$. Equipped with the grid and split factors list that satisfy (12) and (11), the policy $\hat{\pi}$ given by Algorithm 1 obeys

$$\mathbb{E}[R_{T}(\hat{\pi})]\leq\tilde{C}(\log T)^{2}\cdot T^{\frac{1-\gamma}{1-\gamma^{M}}},$$

where $\tilde{C}>0$ is a constant independent of $T$ and $M$.

See Appendix B for the proof.

While Theorem 3 requires $M\lesssim\log T$, we see from the corollary below that it is in fact sufficient to show the optimality of Algorithm 1.

Corollary 1.

As long as $M\geq D_{2}\log\log(T)$, where $D_{2}$ depends on $\gamma=\frac{\beta(1+\alpha)}{2\beta+d}$, Algorithm 1 achieves

$$\mathbb{E}[R_{T}(\hat{\pi})]\leq\tilde{C}(\log T)^{2}\cdot T^{1-\gamma},$$

where $\tilde{C}>0$ is a constant independent of $T$ and $M$.
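
The step from the bound of Theorem 3 to that of Corollary 1 can be made explicit with a short calculation. The following is a sketch that assumes only $0<\gamma<1$, $M\geq 1$ and $T>e$; the value $D_{2}=1/\log(1/\gamma)$ used below is one admissible choice and is not necessarily the constant used in the formal proof.

```latex
\frac{1-\gamma}{1-\gamma^{M}}
  = (1-\gamma) + \frac{(1-\gamma)\,\gamma^{M}}{1-\gamma^{M}},
\qquad\text{so}\qquad
T^{\frac{1-\gamma}{1-\gamma^{M}}}
  = T^{1-\gamma}\cdot
    \exp\!\Big(\frac{1-\gamma}{1-\gamma^{M}}\cdot\gamma^{M}\log T\Big).

% If M \ge \log\log T / \log(1/\gamma), then \gamma^{M}\log T \le 1.
% Moreover \gamma^{M} \le \gamma gives (1-\gamma)/(1-\gamma^{M}) \le 1. Hence
T^{\frac{1-\gamma}{1-\gamma^{M}}} \le e\cdot T^{1-\gamma}.
```

In other words, once $M$ is of order $\log\log T$, the batched rate exceeds the rate $T^{1-\gamma}$ by at most a constant factor.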

Theorem 3, together with Corollary 1 and Theorem 2, establishes the fundamental limits of batch learning for the nonparametric bandits with covariates, as well as the optimality of BaSEDB, up to logarithmic factors. To see this, when $M\lesssim\log\log(T)$, the upper bound in Theorem 3 matches the lower bounds in Theorem 1 and Theorem 2, apart from log factors. At the other extreme, when $M\gtrsim\log\log(T)$, Algorithm 1, while splitting the horizon into $M$ batches, achieves the optimal regret (up to log factors) for the setting without the batch constraint [39]. It is evident that Algorithm 1 is optimal in this case.
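
To see numerically how quickly the batched exponent $\frac{1-\gamma}{1-\gamma^{M}}$ approaches the fully online exponent $1-\gamma$, one can tabulate it for a few batch budgets. The sketch below only evaluates the two formulas stated above; the function names are hypothetical.

```python
def gamma(alpha, beta, d):
    """gamma = beta * (1 + alpha) / (2 * beta + d), as defined in Corollary 1."""
    return beta * (1 + alpha) / (2 * beta + d)

def batched_exponent(g, M):
    """Exponent of T in the Theorem 3 upper bound for M batches."""
    return (1 - g) / (1 - g ** M)

def exponent_table(alpha, beta, d, budgets):
    """List of (M, batched exponent, gap to the online exponent 1 - gamma)."""
    g = gamma(alpha, beta, d)
    return [(M, batched_exponent(g, M), batched_exponent(g, M) - (1 - g))
            for M in budgets]
```

For $\alpha=\beta=d=1$ we have $\gamma=2/3$, and `batched_exponent(2/3, 3)` evaluates to $9/19$ up to floating-point error, the exponent that appears in Section 4.3. The gap to $1-\gamma$ shrinks geometrically in $M$, which is consistent with a budget of order $\log\log T$ being enough.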

4.2 Numerical experiments

In this section, we provide some experiments on the empirical performance of Algorithm 1. We set $T=50000$, $d=\beta=1$, $\alpha=0.2$. We let $P_{X}$ be the uniform distribution on $[0,1]$. Denote $q_{j}=(j-1/2)/4$ and $C_{j}=[q_{j}-1/8,q_{j}+1/8]$ for $1\leq j\leq 4$. For the mean reward functions, we choose $f^{(1)},f^{(-1)}:[0,1]\rightarrow\mathbb{R}$ such that

$$f^{(1)}(x)=\frac{1}{2}+\sum_{j=1}^{4}\omega_{j}\varphi_{j}(x),\qquad f^{(-1)}(x)=\frac{1}{2},$$

where the $\omega_{j}$'s are sampled i.i.d. from $\mathrm{Rad}(\frac{1}{2})$, $\varphi_{j}(x)=\frac{1}{4}\phi(8(x-q_{j}))\mathbf{1}\{x\in C_{j}\}$ and $\phi(x)=(1-|x|)\mathbf{1}\{|x|\leq 1\}$. We let $Y^{(k)}\sim\mathrm{Bernoulli}(f^{(k)}(x))$. To illustrate the performance of Algorithm 1, we compare it with the Binned Successive Elimination (BSE) policy from [39], which is shown to be minimax optimal in the fully online case. Figure 2 shows the regret of Algorithm 1 under different batch budgets. One can see that it is sufficient to have $M=5$ batches to achieve the fully online efficiency.

Figure 2: Regret vs. batch budget $M$.
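
The simulated environment of this section can be reproduced along the following lines. This is a minimal sketch of the data-generating process only (not of Algorithm 1 or BSE); the function names, the random seed, and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

Q = (np.arange(1, 5) - 0.5) / 4.0            # bump centers q_1, ..., q_4

def phi(u):
    """Triangular kernel phi(u) = (1 - |u|) * 1{|u| <= 1}."""
    u = np.abs(np.asarray(u, dtype=float))
    return np.where(u <= 1.0, 1.0 - u, 0.0)

def f_plus(x, omega):
    """Mean reward of arm +1: 1/2 + sum_j omega_j * varphi_j(x)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # phi(8 (x - q_j)) vanishes outside C_j = [q_j - 1/8, q_j + 1/8],
    # so the indicator 1{x in C_j} needs no separate treatment.
    bumps = 0.25 * phi(8.0 * (x[:, None] - Q[None, :]))
    return 0.5 + bumps @ np.asarray(omega, dtype=float)

def f_minus(x):
    """Mean reward of arm -1: constant 1/2."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.full(x.shape, 0.5)

def draw_rewards(x, omega, rng):
    """Potential Bernoulli rewards (Y^(1), Y^(-1)) for each context in x."""
    return rng.binomial(1, f_plus(x, omega)), rng.binomial(1, f_minus(x))

rng = np.random.default_rng(0)                # seed chosen arbitrarily
omega = rng.choice([-1.0, 1.0], size=4)       # Rademacher(1/2) signs
x = rng.uniform(0.0, 1.0, size=50_000)        # contexts from Uniform[0, 1]
y_plus, y_minus = draw_rewards(x, omega, rng)
```

Since each bump has height $1/4$, the mean reward of arm $1$ stays in $[1/4,3/4]$, so the Bernoulli parameters are always valid.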

4.3 Failure of static binning

We have seen the power of dynamic binning in solving batched nonparametric bandits by establishing its rate-optimality in minimizing regret. Now we turn to a complementary but intriguing question: is it necessary to use dynamic binning to achieve optimal regret under the batch constraint? To formally address this question, we investigate the performance of successive elimination with static binning, i.e., Algorithm 1 with $g_{0}=g$, and $g_{1}=g_{2}=\cdots=g_{M-2}=1$. Although static binning works when $M$ is large (e.g., a single choice of $g$ attains the optimal regret [48, 39] in the fully online setting), we show that it must fail when $M$ is small.

To bring the failure mode of static binning into focus, we consider the simplest scenario when $M=3$, and $\alpha=\beta=d=1$. Note that the successive elimination with static binning algorithm is parameterized by the grid choice $\Gamma=\{t_{0}=0,t_{1},t_{2},t_{3}=T\}$ and the fixed number $g$ of bins. The following theorem formalizes the failure of static binning in achieving optimal regret when $M=3$.

Theorem 4.

Consider $M=3$, and $\alpha=\beta=d=1$. For any choice of $1\leq t_{1}<t_{2}\leq T-1$, and any choice of $g$, there exists a nonparametric bandit instance in $\mathcal{F}(1,1)$ such that the resulting successive elimination with static binning algorithm $\hat{\pi}_{\mathrm{static}}$ satisfies

$$\mathbb{E}[R_{T}(\hat{\pi}_{\mathrm{static}})]\geq\tilde{C}_{1}T^{\frac{9}{19}+\kappa},$$

for some $\kappa,\tilde{C}_{1}>0$ that are independent of $T$. Here $T^{\frac{9}{19}}$ is the optimal regret achieved by BaSEDB, a successive elimination algorithm with dynamic binning.

While the formal proof is deferred to Appendix C, we would like to immediately point out the intuition underlying the failure of static binning.

Necessary choice of grid $\Gamma$.

It is evident from the proof of the minimax lower bound (Theorem 1) that one needs to set $t_{1}\asymp T^{9/19}$, and $t_{2}\asymp T^{15/19}$. Otherwise, the inequality (9) guarantees the worst-case regret of $\hat{\pi}_{\mathrm{static}}$ exceeds the optimal one $T^{\frac{9}{19}}$. Consequently, we can focus on the algorithm with $t_{1}\asymp T^{9/19}$, $t_{2}\asymp T^{15/19}$, and only consider the design choice $g$.
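
The exponents $9/19$ and $15/19$ can be checked with exact rational arithmetic. The sketch below assumes that the grid follows the geometric recursion $t_{1}\asymp T^{\frac{1-\gamma}{1-\gamma^{M}}}$, $t_{i}\asymp t_{1}\,t_{i-1}^{\gamma}$, which is the usual construction for batched bandits; the precise grid condition is the one referenced as (12) and is not restated in this section, so the recursion should be read as an assumption. The function name is hypothetical.

```python
from fractions import Fraction

def grid_exponents(gamma, M):
    """Exponents e_1, ..., e_M with t_i ~ T^{e_i}.

    Assumes t_1 ~ T^{(1 - gamma) / (1 - gamma^M)} and t_i ~ t_1 * t_{i-1}^gamma.
    Pass gamma as a Fraction to keep the arithmetic exact.
    """
    e1 = (1 - gamma) / (1 - gamma ** M)
    exps = [e1]
    for _ in range(M - 1):
        exps.append(e1 + gamma * exps[-1])   # exponent of t_1 * t_{i-1}^gamma
    return exps

# alpha = beta = d = 1 gives gamma = beta * (1 + alpha) / (2 * beta + d) = 2/3.
exps = grid_exponents(Fraction(2, 3), 3)
```

With $\gamma=2/3$ and $M=3$ the list `exps` equals $[9/19,\,15/19,\,1]$: the first two entries match the grid points above, and the last confirms that the recursion ends at $t_{3}\asymp T$.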

Why fixed $g$ fails.

As a baseline for comparison, recall that in the optimal algorithm with dynamic binning, we set $g_{0}\asymp T^{3/19}$, and $g_{0}g_{1}\asymp T^{5/19}$ so that the worst-case regrets in the three batches are all on the order of $T^{\frac{9}{19}}$. In view of this, we split the choice of $g$ into three cases.

  • Suppose that $g\gg T^{3/19}$. In this case, we can construct an instance such that the reward difference only appears on an interval with length $1/z\gg 1/g$; see Figure 3. In other words, the static binning is finer than that in the reward instance. As a result, the number of pulls in the smaller bin (used by the algorithm) in the first batch is not sufficient to tell the two arms apart; that is, with constant probability, arm elimination will not happen after the first batch. This necessarily yields the blowup of the regret in the second batch.

  • Suppose that $g\ll T^{3/19}$. In this case, we can construct an instance such that the reward difference only appears on an interval with length $1/z\ll 1/g$; see Figure 4. In other words, the static binning is coarser than that in the reward instance. Since the aggregated reward difference on the larger bin is so small, the number of pulls in the larger bin (used by the algorithm) in the first batch is still not sufficient to result in successful arm elimination. Again, the regret on the second batch blows up.

  • Suppose that $g\asymp T^{3/19}$. Since this choice matches $g_{0}$ used in the optimal dynamic binning algorithm, there is no reward instance that can blow up the regret in the first two batches. Nevertheless, since $g\ll g_{0}g_{1}\asymp T^{5/19}$, one can construct an instance similar to the previous case (i.e., Figure 4) such that the regret on the third batch blows up.

Figure 3: Instance with $g>z$. Each bin $B$ produced by $\hat{\pi}_{\mathrm{static}}$ has width $1/g$. (Diagram annotations: horizontal axis $x$, lengths $1/g$ and $1/z$, height $\delta/2$.)

Figure 4: Instance with $g<z$. Each bin $B$ produced by $\hat{\pi}_{\mathrm{static}}$ has width $1/g$. (Diagram annotations: horizontal axis $x$, lengths $1/z$ and $1/g$, height $\delta/2$.)
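
The three cases above hinge on comparing $g$ with the bin counts $T^{3/19}$ and $T^{5/19}$. A common heuristic recovers these exponents: a bin of width $w$ receives about $n w^{d}$ samples in a batch of $n$ rounds, so its estimation noise is of order $(n w^{d})^{-1/2}$, while $\beta$-smoothness makes gaps of order $w^{\beta}$ the relevant resolution; balancing the two gives $w\asymp n^{-1/(2\beta+d)}$. The sketch below encodes this heuristic, which is offered as intuition and is not the argument of Appendix C; the names are hypothetical.

```python
from fractions import Fraction

def balanced_bin_exponent(batch_exp, beta=1, d=1):
    """Exponent b such that about T^b bins per axis balance noise and
    resolution for a batch of about T^batch_exp rounds (heuristic only)."""
    return Fraction(batch_exp) / (2 * beta + d)

G0_EXP = balanced_bin_exponent(Fraction(9, 19))      # tuned to t_1 ~ T^{9/19}
G0G1_EXP = balanced_bin_exponent(Fraction(15, 19))   # tuned to t_2 ~ T^{15/19}

def static_binning_case(g_exp):
    """Which of the three cases applies to a static bin count g ~ T^g_exp."""
    g_exp = Fraction(g_exp)
    if g_exp > G0_EXP:
        return "finer than g0: too few pulls per bin in batch 1"
    if g_exp < G0_EXP:
        return "coarser than g0: averaged gap too small in batch 1"
    return "matches g0 but coarser than g0*g1: batch 3 suffers"
```

Here `G0_EXP` equals $3/19$ and `G0G1_EXP` equals $5/19$, in agreement with the choices $g_{0}\asymp T^{3/19}$ and $g_{0}g_{1}\asymp T^{5/19}$ quoted above. No single exponent can equal both, which is the obstruction faced by a fixed $g$.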

5 Adaptivity to margin parameter $\alpha$

In this section, we discuss the possibility of adapting to the margin parameter $\alpha$ when it is unknown. Recall from Section 4 that the grid choice of Algorithm 1 requires knowledge of $\alpha$. One may ask if such knowledge is essential in obtaining small regret. Unfortunately, the following theorem demonstrates that the price of not knowing $\alpha$ is at least a polynomial increase in regret.

Theorem 5.

Consider $M=2$ (or $3$) and $\beta=d=1$. For any algorithm that does not know the true margin parameter $\alpha^{\star}$, there exists a choice of $\alpha^{\star}$ such that

$$\sup_{\mathcal{F}(\alpha^{\star},1)}\mathbb{E}[R_{T}(\pi)]\geq\tilde{D}_{3}T^{\frac{1-\frac{\alpha^{\star}+1}{3}}{1-(\frac{\alpha^{\star}+1}{3})^{M}}+\kappa_{1}},$$

for some $\tilde{D}_{3},\kappa_{1}>0$ that are independent of $T$.

See Appendix D for the proof.

Theorem 5 says that any algorithm without knowledge of $\alpha^{\star}$ incurs regret at least a polynomial factor larger than the optimal regret attained by Algorithm 1. This result shows that batch learning for nonparametric bandits is, in this respect, harder than the fully online case, where adaptivity to $\alpha^{\star}$ can be achieved for free [39].

The intuition behind the proof of Theorem 5 is that since the algorithm does not know $\alpha^{\star}$, it has little hope to pick the first batch size $t_{1}$ optimally. If $t_{1}$ is too large, then the adversary can choose a large $\alpha^{\star}$, which corresponds to the family of reward functions with larger gaps, so that the algorithm's regret during the first batch explodes. If $t_{1}$ is too small, then the adversary can choose a small $\alpha^{\star}$, which corresponds to the family of reward functions with smaller gaps, so that the algorithm's knowledge gathered during the first batch is not enough to distinguish the arms and its regret will explode in later batches.
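
A short calculation shows how strongly the ideal first batch depends on $\alpha^{\star}$. For $M=2$ and $\beta=d=1$, the exponent in Theorem 5 (without $\kappa_{1}$) simplifies as follows; this is a worked instance of the stated formula and nothing more.

```latex
\gamma^{\star}=\frac{\alpha^{\star}+1}{3},
\qquad
\frac{1-\gamma^{\star}}{1-(\gamma^{\star})^{2}}
  =\frac{1}{1+\gamma^{\star}}
  =\frac{3}{4+\alpha^{\star}}.
% \alpha^{\star}=1 gives 3/5, while \alpha^{\star}\to 0 gives 3/4.
```

If the first grid point is taken to scale with this exponent, as $t_{1}\asymp T^{9/19}$ does in the $M=3$ example of Section 4.3 (an assumption here, since the grid formula is not restated in this section), then the ideal $t_{1}$ ranges from about $T^{3/5}$ to about $T^{3/4}$ as $\alpha^{\star}$ varies. A single choice of $t_{1}$ made without knowing $\alpha^{\star}$ is therefore off by a polynomial factor for some $\alpha^{\star}$.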

6 Conclusions

In this paper, we characterize the fundamental limits of batch learning in nonparametric contextual bandits. In particular, our optimal batch learning algorithm (i.e., Algorithm 1) is able to match the optimal regret in the fully online setting with only $O(\log\log T)$ policy updates. Our work opens a few interesting avenues to explore in the future.

Extensions to multiple arms.

With slight modification, our algorithm works for nonparametric contextual bandits with more than two arms. However, it remains unclear what the fundamental limits of batch learning are in this multi-armed case (i.e., when $K$ is large).

Improving the log factor.

Comparing the upper and lower bounds, it is evident that Algorithm 1 is near-optimal up to log factors. It is certainly interesting to improve this log factor, either by strengthening the lower bound, or making the upper bound more efficient.

Adapting to margin parameters.

While we have shown that adaptivity to the margin parameter is not possible when the batch constraint is stringent, i.e., when $M=2$ (or 3), it leaves open the question of designing an optimal adaptive algorithm when $M$ is large, say $M\asymp\log\log T$. In fact, when $M=T$, i.e., in the fully online setting, [39] provides an adaptively binned successive elimination algorithm that is capable of adapting to the margin parameter optimally.

Acknowledgements

CM is partially supported by the National Science Foundation via grant DMS-2311127.

References

  • [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
  • [2] Sakshi Arya and Yuhong Yang. Randomized allocation with nonparametric estimation for contextual multi-armed bandits with delayed rewards. Statistics & Probability Letters, 164:108818, 2020.
  • [3] Jean-Yves Audibert and Alexandre B Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.
  • [4] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • [5] Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low switching cost. Advances in Neural Information Processing Systems, 32, 2019.
  • [6] Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.
  • [7] Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. Management Science, 67(3):1329–1349, 2021.
  • [8] Dimitris Bertsimas and Adam J Mersereau. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135, 2007.
  • [9] Moise Blanchard, Steve Hanneke, and Patrick Jaillet. Non-stationary contextual bandits and universal learning. arXiv preprint arXiv:2302.07186, 2023.
  • [10] Changxiao Cai, T Tony Cai, and Hongzhe Li. Transfer learning for contextual multi-armed bandits. arXiv preprint arXiv:2211.12612, 2022.
  • [11] T Tony Cai and Hongming Pu. Stochastic continuum-armed bandits with additive models: Minimax regrets and adaptive algorithm. The Annals of Statistics, 50(4):2179–2204, 2022.
  • [12] Nicolo Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 26, 2013.
  • [13] Olivier Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1097–1105, 2014.
  • [14] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. Advances in neural information processing systems, 24, 2011.
  • [15] Stephen E Chick and Noah Gans. Economic analysis of simulation selection problems. Management Science, 55(3):421–437, 2009.
  • [16] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6):1079–1105, 2006.
  • [17] Jianqing Fan, Zhaoran Wang, Zhuoran Yang, and Chenlu Ye. Provably efficient high-dimensional bandit learning with batched feedbacks. arXiv preprint arXiv:2311.13180, 2023.
  • [18] Yasong Feng, Zengfeng Huang, and Tianyu Wang. Lipschitz bandits with batched feedback. Advances in Neural Information Processing Systems, 35:19836–19848, 2022.
  • [19] Manegueu Anne Gael, Claire Vernade, Alexandra Carpentier, and Michal Valko. Stochastic bandits with arm-dependent delays. In International Conference on Machine Learning, pages 3348–3356. PMLR, 2020.
  • [20] Minbo Gao, Tianle Xie, Simon S Du, and Lin F Yang. A provably efficient algorithm for linear markov decision process with low switching cost. arXiv preprint arXiv:2101.00494, 2021.
  • [21] Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019.
  • [22] Alexander Goldenshluger and Assaf Zeevi. Woodroofe’s one-armed bandit problem revisited. The Annals of Applied Probability, 19(4):1603–1633, 2009.
  • [23] Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
  • [24] Melody Guan and Heinrich Jiang. Nonparametric stochastic contextual bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [25] Yonatan Gur, Ahmadreza Momeni, and Stefan Wager. Smoothness-adaptive contextual bandits. Operations Research, 70(6):3198–3216, 2022.
  • [26] Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou, Jose Blanchet, Peter W Glynn, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.
  • [27] Yichun Hu, Nathan Kallus, and Xiaojie Mao. Smooth contextual bandits: Bridging the parametric and nondifferentiable regret regimes. Operations Research, 70(6):3261–3281, 2022.
  • [28] Cem Kalkanli and Ayfer Ozgur. Batched thompson sampling. Advances in Neural Information Processing Systems, 34:29984–29994, 2021.
  • [29] Amin Karbasi, Vahab Mirrokni, and Mohammad Shadravan. Parallelizing thompson sampling. Advances in Neural Information Processing Systems, 34:10535–10548, 2021.
  • [30] Edward S Kim, Roy S Herbst, Ignacio I Wistuba, J Jack Lee, George R Blumenschein Jr, Anne Tsao, David J Stewart, Marshall E Hicks, Jeremy Erasmus Jr, Sanjay Gupta, et al. The battle trial: personalizing therapy for lung cancer. Cancer discovery, 1(1):44–53, 2011.
  • [31] Aniket Kittur, Ed H Chi, and Bongwon Suh. Crowdsourcing user studies with mechanical turk. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 453–456, 2008.
  • [32] Anders Bredahl Kock and Martin Thyrsgaard. Optimal sequential treatment allocation. arXiv preprint arXiv:1705.09952, 2017.
  • [33] Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. The Journal of Machine Learning Research, 21(1):5402–5446, 2020.
  • [34] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • [35] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010.
  • [36] Andrea Locatelli and Alexandra Carpentier. Adaptivity to smoothness in x-armed bandits. In Conference on Learning Theory, pages 1463–1492. PMLR, 2018.
  • [37] Tyler Lu, Dávid Pál, and Martin Pál. Showing relevant ads via context multi-armed bandits. In Proceedings of AISTATS, 2009.
  • [38] Enno Mammen and Alexandre B Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
  • [39] Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates. Ann. Statist., 41(2):693–721, 2013.
  • [40] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. Ann. Statist., 44(2):660–681, 2016.
  • [41] Wei Qian, Ching-Kang Ing, and Ji Liu. Adaptive algorithm for multi-armed bandit problem with high-dimensional covariates. Journal of the American Statistical Association, pages 1–13, 2023.
  • [42] Wei Qian and Yuhong Yang. Kernel estimation and model combination in a bandit problem with covariates. Journal of Machine Learning Research, 17(149), 2016.
  • [43] Wei Qian and Yuhong Yang. Randomized allocation with arm elimination in a bandit problem with covariates. Electronic Journal of Statistics, 10(1):242–270, 2016.
  • [44] Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with loglog(T) switching cost. In International Conference on Machine Learning, pages 18031–18061. PMLR, 2022.
  • [45] Henry Reeve, Joe Mellor, and Gavin Brown. The k-nearest neighbour UCB algorithm for multi-armed bandits with covariates. In Algorithmic Learning Theory, pages 725–752. PMLR, 2018.
  • [46] Zhimei Ren and Zhengyuan Zhou. Dynamic batch learning in high-dimensional sparse linear contextual bandits. Management Science, 2023.
  • [47] Zhimei Ren, Zhengyuan Zhou, and Jayant R Kalagnanam. Batched learning in generalized linear contextual bandits with general decision sets. IEEE Control Systems Letters, 6:37–42, 2020.
  • [48] Philippe Rigollet and Assaf Zeevi. Nonparametric bandits with covariates. arXiv preprint arXiv:1003.1630, 2010.
  • [49] Herbert E. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.
  • [50] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522, 2017.
  • [51] Joe Suk and Samory Kpotufe. Tracking most significant shifts in nonparametric contextual bandits. arXiv preprint arXiv:2307.05341, 2023.
  • [52] Joseph Suk and Samory Kpotufe. Self-tuning bandits over unknown covariate-shifts. In Algorithmic Learning Theory, pages 1114–1156. PMLR, 2021.
  • [53] Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.
  • [54] Alexander B Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
  • [55] Claire Vernade, Olivier Cappé, and Vianney Perchet. Stochastic bandit models for delayed conversions. arXiv preprint arXiv:1706.09186, 2017.
  • [56] Chi-Hua Wang and Guang Cheng. Online batch decision-making with high-dimensional covariates. In International Conference on Artificial Intelligence and Statistics, pages 3848–3857. PMLR, 2020.
  • [57] Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. Advances in Neural Information Processing Systems, 34:13524–13536, 2021.
  • [58] Michael Woodroofe. A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799–806, 1979.
  • [59] Yuhong Yang and Dan Zhu. Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics, 30(1):100–121, 2002.
  • [60] Kelly Zhang, Lucas Janson, and Susan Murphy. Inference for batched bandits. Advances in neural information processing systems, 33:9818–9829, 2020.
  • [61] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33:15198–15207, 2020.
  • [62] Zhijin Zhou, Yingfei Wang, Hamed Mamani, and David G Coffey. How do tumor cytogenetics inform cancer treatments? Dynamic risk stratification and precision medicine using multi-armed bandits. June 17, 2019.

Appendix A Proof of Theorem 2

It is worth emphasizing that Theorem 2 aims to establish the hardness of batched nonparametric bandits even when the grid $\Gamma$ is allowed to be adaptively chosen. Nevertheless, the proof of Theorem 1 in the fixed-grid case is still useful.

Define $b\asymp T^{(1-\gamma)/(1-\gamma^{M})}$. For each $1\leq i\leq M$, we set $T_{i}=\lfloor b^{(1-\gamma^{i})/(1-\gamma)}\rfloor$, $z_{i}=\lceil(36T_{i-1}M^{2})^{1/(2\beta+d)}\rceil$, and $s_{i}\coloneqq\lceil z_{i}^{d-\alpha\beta}\rceil$. We reuse the family of hard instances $\mathcal{C}_{z_{i}}$ as defined in (4), and define the mixture distribution

$$Q_{i}(\cdot)=\frac{1}{s_{i}2^{s_{i}-1}}\sum_{j=1}^{s_{i}}\sum_{\omega_{[-j]}\in\Omega_{s_{i}-1}}\mathbb{P}_{\pi,i,f_{\omega_{[-j]}^{-1}}}(\cdot)=\sum_{\omega\in\Omega_{s_{i}}}q_{i}(\omega)\mathbb{P}_{\pi,i,f_{\omega}}(\cdot), \tag{13}$$

where $q_{i}:\Omega_{s_{i}}\rightarrow[0,1]$ is selected so that the above equality holds, and $\mathbb{P}_{\pi,i,f_{\omega}}$ is the distribution of the observations when $f_{\omega}\in\mathcal{C}_{z_{i}}$. It is easy to see that $\sum_{\omega\in\Omega_{s_{i}}}q_{i}(\omega)=1$.
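To get a feel for the scales involved, the sequences $T_i$, $z_i$, and $s_i$ defined at the start of this proof can be tabulated numerically. The following is a minimal sketch, not part of the original argument. It assumes that the constant hidden in $b\asymp T^{(1-\gamma)/(1-\gamma^{M})}$ equals one, that $\gamma\in(0,1)$ is supplied by the caller, and that $T_0=1$ (the value obtained by extending the formula for $T_i$ to $i=0$). The function name `lower_bound_grid` is hypothetical.

```python
import math


def lower_bound_grid(T, M, alpha, beta, d, gamma, T0=1):
    """Tabulate (T_i, z_i, s_i) for i = 1..M as defined in the proof.

    Assumptions (not from the paper): the constant in b ~ T^{...} is 1,
    and T_0 = 1, i.e. the formula for T_i evaluated at i = 0.
    """
    # b = T^{(1 - gamma) / (1 - gamma^M)}
    b = T ** ((1 - gamma) / (1 - gamma ** M))
    grid = []
    T_prev = T0
    for i in range(1, M + 1):
        # T_i = floor(b^{(1 - gamma^i) / (1 - gamma)})
        T_i = math.floor(b ** ((1 - gamma ** i) / (1 - gamma)))
        # z_i = ceil((36 T_{i-1} M^2)^{1 / (2 beta + d)})
        z_i = math.ceil((36 * T_prev * M ** 2) ** (1 / (2 * beta + d)))
        # s_i = ceil(z_i^{d - alpha beta})
        s_i = math.ceil(z_i ** (d - alpha * beta))
        grid.append((T_i, z_i, s_i))
        T_prev = T_i
    return grid


if __name__ == "__main__":
    for i, (T_i, z_i, s_i) in enumerate(
        lower_bound_grid(T=10**6, M=3, alpha=1.0, beta=1.0, d=2, gamma=0.5), start=1
    ):
        print(f"i={i}: T_i={T_i}, z_i={z_i}, s_i={s_i}")
```

Because of the floor and floating-point rounding, the last grid point $T_M$ may come out one below $T$; the sketch makes no attempt to correct this.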

We pause here to state a useful claim regarding the family $\{Q_{i}\}_{i=1}^{M}$ of mixture distributions, namely, they are all close to each other under the total variation distance.

Lemma 2.

For any $1\leq i\leq M$, one has $\mathrm{TV}(Q_{M}^{T_{i-1}},Q_{i}^{T_{i-1}})\leq\frac{1}{2}\sqrt{T_{i-1}z_{i}^{-(2\beta+d)}}.$

Define the event

$$A_{i}=\{t_{i-1}<T_{i-1}<T_{i}\leq t_{i}\}.$$

Intuitively, $A_{i}$ models the event when the algorithm’s selected grid points $t_{i-1}$ and $t_{i}$ are suboptimal. When $A_{i}$ happens, the goal is to design a problem instance such that the observations up to $t_{i-1}$ cannot distinguish the optimal arm, and therefore the policy must incur a large regret between $t_{i-1}$ and $t_{i}$.

The following lemma ensures that at least one of the bad events $A_{i}$ happens with sufficiently large probability under the mixture distribution.

Lemma 3.

There exists $i^{\star}\in[M]$ such that $Q_{i^{\star}}(A_{i^{\star}})\geq 1/(2M)$.

The next lemma indeed shows that when $A_{i}$ happens, the regret must be large.

Lemma 4.

If $Q_{i}(A_{i})\geq 1/(2M)$, then

$$\sup_{f\in\mathcal{C}_{z_{i}}}R_{T_{i}}(\pi;f)\gtrsim T_{i}z_{i}^{-\beta(1+\alpha)}M^{-\frac{1+\alpha}{\alpha}}.$$

Now we are ready to establish the desired claim in the theorem. It is straightforward to see that

$$\sup_{(f,\frac{1}{2})\in\mathcal{F}(\alpha,\beta)}R_{T}(\pi;f)\geq\sup_{f\in\mathcal{C}_{z_{i^{\star}}}}R_{T_{i^{\star}}}(\pi;f)\gtrsim T_{i^{\star}}z_{i^{\star}}^{-\beta(1+\alpha)}M^{-\frac{1+\alpha}{\alpha}}=\tilde{D}_{1}\left(\frac{1}{M}\right)^{\tilde{D}_{2}}T^{\frac{1-\gamma}{1-\gamma^{M}}},$$

where the second inequality uses Lemma 4, and the last relation arises from the definitions of $T_{i}$ and $z_{i}$.
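The last step of the display above can be unpacked as follows. This is only a sketch: it assumes $\gamma=\beta(1+\alpha)/(2\beta+d)$ (the definition is not restated in this appendix), ignores the floors and ceilings in $T_{i}$ and $z_{i}$, and uses $T_{0}=1$. The resulting constants are one possible way to collect terms; the constants $\tilde{D}_{1}$ and $\tilde{D}_{2}$ in the theorem statement may be defined differently.

```latex
\begin{aligned}
z_{i}^{-\beta(1+\alpha)}
  &\approx (36\,T_{i-1}M^{2})^{-\frac{\beta(1+\alpha)}{2\beta+d}}
   = (36M^{2})^{-\gamma}\,T_{i-1}^{-\gamma}, \\
T_{i}\,T_{i-1}^{-\gamma}
  &\approx b^{\frac{(1-\gamma^{i})-\gamma(1-\gamma^{i-1})}{1-\gamma}}
   = b^{\frac{1-\gamma}{1-\gamma}} = b, \\
T_{i}\,z_{i}^{-\beta(1+\alpha)}\,M^{-\frac{1+\alpha}{\alpha}}
  &\approx 36^{-\gamma}\left(\frac{1}{M}\right)^{2\gamma+\frac{1+\alpha}{\alpha}} b
   \asymp 36^{-\gamma}\left(\frac{1}{M}\right)^{2\gamma+\frac{1+\alpha}{\alpha}}
     T^{\frac{1-\gamma}{1-\gamma^{M}}}.
\end{aligned}
```

Notably, the product $T_{i}T_{i-1}^{-\gamma}$ does not depend on $i$, which is why the same rate is obtained whichever index $i^{\star}$ Lemma 3 selects.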

A.1 Proof of Lemma 2

It suffices to bound their KL divergence. By the standard decomposition of KL divergence and Bernoulli reward structure,

$$\mathrm{KL}(Q_{M}^{T_{i-1}},Q_{i}^{T_{i-1}})\leq 8\sum_{t=1}^{T_{i-1}}\mathbb{E}_{Q_{M}}\left[\left(\underbrace{\sum_{\omega\in\Omega_{M}}q_{M}(\omega)f_{M,\omega}(X_{t})-\sum_{\omega\in\Omega_{i}}q_{i}(\omega)f_{i,\omega}(X_{t})}_{\Delta_{t}}\right)^{2}\mathbf{1}\{\pi_{t}(X_{t})=1\}\right], \tag{14}$$

where $f_{i,\omega}$ denotes an instance from $\mathcal{C}_{z_{i}}$. To control $\Delta_{t}$, we further decompose it as

$$\Delta_{t}=\sum_{j=1}^{s_{i}}\left(\sum_{\omega\in\Omega_{M}}q_{M}(\omega)f_{M,\omega}(X_{t})-\sum_{\omega\in\Omega_{i}}q_{i}(\omega)f_{i,\omega}(X_{t})\right)\mathbf{1}\{X_{t}\in C_{i,j}\},$$

where $C_{i,j}$ is the $j$-th bin corresponding to the instance family $\mathcal{C}_{z_{i}}$. Here, the difference between the two sums can be restricted to $\cup_{j=1}^{s_{i}}C_{i,j}$ because $\cup_{j=1}^{s_{M}}C_{M,j}\subseteq\cup_{j=1}^{s_{i}}C_{i,j}$. Indeed, the area of effective bins for an instance in $\mathcal{C}_{z_{i}}$ is $s_{i}z_{i}^{-d}=z_{i}^{-\alpha\beta}$, which decreases as $i$ increases. Notice that for $X_{t}\in C_{i,j}$,

$$\left|\sum_{\omega\in\Omega_{i}}q_{i}(\omega)f_{i,\omega}(X_{t})-\frac{1}{2}\right|\leq\frac{2^{s_{i}-1}}{s_{i}\cdot 2^{s_{i}-1}}\cdot\frac{z_{i}^{-\beta}}{4}=\frac{z_{i}^{-\beta}}{4s_{i}},$$

because all the $\omega_{[-j]}^{-1}$’s have a negative sign in the $j$-th bin and there are $2^{s_{i}-1}$ of them, while for each $k\neq j$, the positive and negative spikes within the $j$-th bin cancel each other out when summing over the $\omega_{[-k]}^{-1}$’s, due to symmetry. Therefore,

$$|\Delta_{t}|\leq\frac{z_{i}^{-\beta}}{4s_{i}}\sum_{j=1}^{s_{i}}\mathbf{1}\{X_{t}\in C_{i,j}\}=\frac{1}{4}z_{i}^{-\beta-d+\alpha\beta}\sum_{j=1}^{s_{i}}\mathbf{1}\{X_{t}\in C_{i,j}\}.$$
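The cancellation argument behind the $1/s_{i}$ factor in the mixture mean can be checked by brute-force enumeration for small $s$. The sketch below is a simplified stand-in, not the construction in (4): each instance is reduced to its sign vector in $\{-1,+1\}^{s}$, the common spike amplitude is factored out, and the mixture draws $j$ uniformly, fixes the sign in bin $j$ to $-1$, and lets the remaining $s-1$ signs range over all combinations. The function name `mixture_mean_signs` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def mixture_mean_signs(s):
    """Average sign in each bin under the (simplified) mixture.

    The mixture ranges over all pairs (j, rest): bin j is forced to -1,
    and the other s - 1 bins take every combination of signs in {-1, +1}.
    """
    totals = [0] * s
    count = 0
    for j in range(s):
        for rest in product((-1, 1), repeat=s - 1):
            # Insert the forced -1 at position j.
            omega = list(rest[:j]) + [-1] + list(rest[j:])
            for k in range(s):
                totals[k] += omega[k]
            count += 1
    # count equals s * 2^(s - 1), the normalisation in (13).
    return [Fraction(total, count) for total in totals]


if __name__ == "__main__":
    print(mixture_mean_signs(4))
```

For every bin the average sign is exactly $-1/s$: the $2^{s-1}$ components with the forced sign in that bin contribute $-1$ each, and all other components cancel by symmetry.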

Plugging this bound on $|\Delta_{t}|$ back into (14), we obtain

$$\begin{aligned}
\mathrm{KL}(Q_{M}^{T_{i-1}},Q_{i}^{T_{i-1}})
&\leq\frac{1}{2}\sum_{t=1}^{T_{i-1}}\mathbb{E}_{Q_{M}}\left[z_{i}^{-2(\beta+d-\alpha\beta)}\sum_{j=1}^{s_{i}}\mathbf{1}\{X_{t}\in C_{i,j}\}\mathbf{1}\{\pi_{t}(X_{t})=1\}\right]\\
&=\frac{1}{2}z_{i}^{-2(\beta+d-\alpha\beta)}\sum_{t=1}^{T_{i-1}}\left[\sum_{j=1}^{s_{i}}\mathbb{P}_{Q_{M}}(X_{t}\in C_{i,j},\pi_{t}(X_{t})=1)\right]\\
&\leq\frac{1}{2}z_{i}^{-2(\beta+d-\alpha\beta)}\sum_{t=1}^{T_{i-1}}\sum_{j=1}^{s_{i}}z_{i}^{-d}=\frac{1}{2}T_{i-1}z_{i}^{-(2\beta+d+(d-\alpha\beta))},
\end{aligned}$$

where the last inequality holds because $\mathbb{P}_{Q_{M}}(X_{t}\in C_{i,j},\pi_{t}(X_{t})=1)=z_{i}^{-d}\mathbb{P}_{Q_{M}}(\pi_{t}(X_{t})=1\mid X_{t}\in C_{i,j})\leq z_{i}^{-d}$. Since $\alpha\beta\leq 1\leq d$, the exponent $d-\alpha\beta$ is nonnegative, and with $z_{i}\geq 1$ we get

$$\mathrm{KL}(Q_{M}^{T_{i-1}},Q_{i}^{T_{i-1}})\leq\frac{1}{2}T_{i-1}z_{i}^{-(2\beta+d+(d-\alpha\beta))}\leq\frac{1}{2}T_{i-1}z_{i}^{-(2\beta+d)}.$$

By Pinsker’s inequality, we can conclude

$$\mathrm{TV}(Q_{M}^{T_{i-1}},Q_{i}^{T_{i-1}})\leq\sqrt{\frac{1}{2}\mathrm{KL}(Q_{M}^{T_{i-1}},Q_{i}^{T_{i-1}})}\leq\frac{1}{2}\sqrt{T_{i-1}z_{i}^{-(2\beta+d)}}.$$
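As a quick numerical sanity check of the Pinsker step, the sketch below evaluates both sides of $\mathrm{TV}\leq\sqrt{\frac{1}{2}\mathrm{KL}}$ on a pair of small discrete distributions. The vectors `p` and `q` and the value `x` are arbitrary illustrative choices, not quantities from the paper, and the helper names are hypothetical. The last lines confirm the constant used above: $\sqrt{\frac{1}{2}\cdot\frac{1}{2}x}=\frac{1}{2}\sqrt{x}$.

```python
import math

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q); assumes q[k] > 0 wherever p[k] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Arbitrary toy distributions (illustrative only).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

tv = total_variation(p, q)
pinsker = math.sqrt(0.5 * kl_divergence(p, q))
print(tv <= pinsker)

# Constant bookkeeping: sqrt((1/2) * (1/2) * x) equals (1/2) * sqrt(x).
x = 0.37  # stands in for T_{i-1} * z_i^{-(2*beta + d)}
print(math.isclose(math.sqrt(0.5 * 0.5 * x), 0.5 * math.sqrt(x)))
```

The check covers only one pair of distributions, so it illustrates the inequality rather than proving it.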

A.2 Proof of Lemma 3

For any $1\leq i\leq M$, we have

$$|Q_{M}(A_{i})-Q_{i}(A_{i})|\overset{\mathrm{(i)}}{=}|Q_{M}^{T_{i-1}}(A_{i})-Q_{i}^{T_{i-1}}(A_{i})|\overset{\mathrm{(ii)}}{\leq}\mathrm{TV}(Q_{M}^{T_{i-1}},Q_{i}^{T_{i-1}})\overset{\mathrm{(iii)}}{\leq}\frac{1}{2M},\tag{15}$$

where step (i) is because $A_{i}$ can be determined by observations up to $T_{i-1}$, step (ii) uses the definition of TV, and step (iii) applies Lemma 2 and the definition of $z_{i}$. Consequently,

$$\begin{aligned}
\sum_{i=1}^{M}Q_{i}(A_{i}) &=Q_{M}(A_{M})+\sum_{i=1}^{M-1}Q_{i}(A_{i})\\
&=Q_{M}(A_{M})+\sum_{i=1}^{M-1}\bigl(Q_{i}(A_{i})-Q_{M}(A_{i})+Q_{M}(A_{i})\bigr)\\
&\overset{\mathrm{(iv)}}{\geq}Q_{M}(A_{M})+\sum_{i=1}^{M-1}\Bigl(Q_{M}(A_{i})-\frac{1}{2M}\Bigr)\geq\sum_{i=1}^{M}Q_{M}(A_{i})-\frac{1}{2}\overset{\mathrm{(v)}}{=}\frac{1}{2},
\end{aligned}$$

where step (iv) uses inequality (15), and step (v) uses the fact that $\sum_{i=1}^{M}Q_{M}(A_{i})=1$.
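The averaging argument above is pure arithmetic, so it can be checked exactly on a toy example. The sketch below uses made-up values `qm` for $Q_{M}(A_{i})$ with $M=5$ (hypothetical numbers, not from the paper), lowers each term with $i<M$ by the full $1/(2M)$ permitted by inequality (15), and verifies that the resulting sum is still at least $1/2$. It also checks the pigeonhole consequence that some index must then satisfy $Q_{i}(A_{i})\geq 1/(2M)$.

```python
from fractions import Fraction

def worst_case_values(qm):
    """Smallest Q_i(A_i) allowed by inequality (15), given Q_M(A_i) = qm[i]."""
    M = len(qm)
    slack = Fraction(1, 2 * M)
    # For i < M each term may drop by 1/(2M); the i = M term is Q_M(A_M) itself.
    return [p - slack for p in qm[:-1]] + [qm[-1]]

# Hypothetical probabilities of the events A_1, ..., A_M under Q_M; they sum to 1.
qm = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10),
      Fraction(3, 10), Fraction(1, 10)]
M = len(qm)

values = worst_case_values(qm)
total = sum(values)

print(total >= Fraction(1, 2))             # conclusion of Lemma 3 on this example
print(max(values) >= Fraction(1, 2 * M))   # pigeonhole consequence
```

Exact rationals are used instead of floats so that the comparisons are not affected by rounding.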

A.3 Proof of Lemma 4

We try to lower-bound the number of mistakes we make up to $T_{i}$. By inequality (5),

$$\begin{aligned}
\sup_{f\in\mathcal{C}_{z}}S_{T_{i}}(\pi,f,\tfrac{1}{2}) &\geq\frac{1}{2^{s}}\sum_{j=1}^{s}\sum_{t=1}^{T_{i}}\sum_{\omega_{[-j]}\in\Omega_{s-1}}\frac{1}{z^{d}}\sum_{h\in\{\pm 1\}}\mathbb{P}_{\pi,f_{\omega_{[-j]}^{h}}}(\pi_{t}(X_{t})\neq h\mid X_{t}\in C_{j})\\
&\geq\frac{1}{2^{s}}\sum_{j=1}^{s}\sum_{\omega_{[-j]}\in\Omega_{s-1}}\frac{1}{z^{d}}\sum_{t=1}^{T_{i}}\int\min\{dP_{\pi,f_{\omega_{[-j]}^{-1}}}^{t},dP_{\pi,f_{\omega_{[-j]}^{1}}}^{t}\}\\
&\geq\frac{1}{2^{s}}\sum_{j=1}^{s}\sum_{\omega_{[-j]}\in\Omega_{s-1}}\frac{T_{i}}{z^{d}}\int\min\{dP_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i}},dP_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i}}\},
\end{aligned}$$

where the second inequality invokes Le Cam’s method, and the last inequality holds since $\int\min\{dP,dQ\}=1-\mathrm{TV}(P,Q)$, and $\mathrm{TV}(P_{\pi,f_{\omega_{[-j]}^{-1}}}^{t},P_{\pi,f_{\omega_{[-j]}^{1}}}^{t})\leq\mathrm{TV}(P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i}},P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i}})$ for $t\leq T_{i}$. We continue the lower-bound to see that

$$\begin{aligned}
\sup_{f\in\mathcal{C}_{z}}S_{T_{i}}(\pi,f,\tfrac{1}{2}) &\geq\frac{1}{2^{s}}\sum_{j=1}^{s}\sum_{\omega_{[-j]}\in\Omega_{s-1}}\frac{T_{i}}{z^{d}}\int\min\{dP_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i}},dP_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i}}\}\\
&\geq\frac{1}{2^{s}}\sum_{j=1}^{s}\sum_{\omega_{[-j]}\in\Omega_{s-1}}\frac{T_{i}}{z^{d}}\int_{A_{i}}\min\{dP_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i}},dP_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i}}\}\\
&=\frac{1}{2^{s}}\sum_{j=1}^{s}\sum_{\omega_{[-j]}\in\Omega_{s-1}}\frac{T_{i}}{z^{d}}\int_{A_{i}}\min\{dP_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},dP_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}}\},
\end{aligned}$$

where the last step uses the fact that under $A_{i}$, the available observations for $\pi$ at $T_{i}$ are the same as those at $T_{i-1}$. Using properties of TV, we reach

$$\begin{aligned}
\sup_{f\in\mathcal{C}_{z}}S_{T_{i}}(\pi,f,\tfrac{1}{2}) &\geq\frac{1}{2^{s}}\sum_{j=1}^{s}\sum_{\omega_{[-j]}\in\Omega_{s-1}}\frac{T_{i}}{z^{d}}\int_{A_{i}}\min\{dP_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},dP_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}}\}\\
&\geq\frac{1}{2^{s}}\cdot\frac{T_{i}}{z^{d}}\sum_{j=1}^{s}\sum_{\omega_{[-j]}\in\Omega_{s-1}}\left(P_{\pi,f_{\omega_{[-j]}^{-1}}}(A_{i})-\frac{3}{2}\mathrm{TV}(P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}})\right)\\
&\geq\frac{1}{2^{s}}\cdot\frac{T_{i}}{z^{d}}\sum_{j=1}^{s}\sum_{\omega_{[-j]}\in\Omega_{s-1}}\left(P_{\pi,f_{\omega_{[-j]}^{-1}}}(A_{i})-\frac{3}{2}\sqrt{\frac{1}{2}\mathrm{KL}(P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}})}\right),
\end{aligned}$$

where the second inequality applies Lemma 6, and the third inequality is due to Pinsker’s inequality. Taking $z=z_{i}$ and using Lemma 5, we have

$$\begin{aligned}
\sup_{f\in\mathcal{C}_{z_{i}}}S_{T_{i}}(\pi,f,\tfrac{1}{2}) &\geq\frac{1}{2^{s_{i}}}\cdot\frac{T_{i}}{z_{i}^{d}}\sum_{j=1}^{s_{i}}\sum_{\omega_{[-j]}\in\Omega_{s_{i}-1}}\left(P_{\pi,i,f_{\omega_{[-j]}^{-1}}}(A_{i})-\frac{3}{2}\sqrt{z_{i}^{-(2\beta+d)}T_{i-1}}\right)\\
&=\frac{1}{2^{s_{i}}}\cdot\frac{T_{i}}{z_{i}^{d}}\left(s_{i}2^{s_{i}-1}Q_{i}(A_{i})-\frac{3}{2}s_{i}2^{s_{i}-1}\sqrt{\frac{1}{36M^{2}}}\right)\\
&\geq\frac{1}{8}T_{i}z_{i}^{-\alpha\beta}\frac{1}{M},
\end{aligned}$$

where the last inequality uses the assumption $Q_{i}(A_{i})\geq 1/(2M)$. Therefore, we arrive at

supf𝒞ziRTi(π;f)Ti1α[supf𝒞ziSti(π;f)]1+ααTiziβ(1+α)M1+αα.subscriptsupremum𝑓subscript𝒞subscript𝑧𝑖subscript𝑅subscript𝑇𝑖𝜋𝑓superscriptsubscript𝑇𝑖1𝛼superscriptdelimited-[]subscriptsupremum𝑓subscript𝒞subscript𝑧𝑖subscript𝑆subscript𝑡𝑖𝜋𝑓1𝛼𝛼subscript𝑇𝑖superscriptsubscript𝑧𝑖𝛽1𝛼superscript𝑀1𝛼𝛼\displaystyle\sup_{f\in\mathcal{C}_{z_{i}}}R_{T_{i}}(\pi;f)\apprge T_{i}^{-% \frac{1}{\alpha}}\left[\sup_{f\in\mathcal{C}_{z_{i}}}S_{t_{i}}(\pi;f)\right]^{% \frac{1+\alpha}{\alpha}}\apprge T_{i}z_{i}^{-\beta(1+\alpha)}M^{-\frac{1+% \alpha}{\alpha}}.roman_sup start_POSTSUBSCRIPT italic_f ∈ caligraphic_C start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_π ; italic_f ) ≳ italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG italic_α end_ARG end_POSTSUPERSCRIPT [ roman_sup start_POSTSUBSCRIPT italic_f ∈ caligraphic_C start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_S start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_π ; italic_f ) ] start_POSTSUPERSCRIPT divide start_ARG 1 + italic_α end_ARG start_ARG italic_α end_ARG end_POSTSUPERSCRIPT ≳ italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - italic_β ( 1 + italic_α ) end_POSTSUPERSCRIPT italic_M start_POSTSUPERSCRIPT - divide start_ARG 1 + italic_α end_ARG start_ARG italic_α end_ARG end_POSTSUPERSCRIPT .
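
As a sanity check on the exponents in the last display, one can substitute a lower bound of order $T_i z_i^{-\alpha\beta}/M$ for the bracketed supremum (which is what the preceding computation appears to provide) and collect powers:

\[
T_i^{-\frac{1}{\alpha}} \left( \frac{T_i z_i^{-\alpha\beta}}{M} \right)^{\frac{1+\alpha}{\alpha}} = T_i^{-\frac{1}{\alpha} + \frac{1+\alpha}{\alpha}} \, z_i^{-\beta(1+\alpha)} \, M^{-\frac{1+\alpha}{\alpha}} = T_i z_i^{-\beta(1+\alpha)} M^{-\frac{1+\alpha}{\alpha}},
\]

since $-\frac{1}{\alpha} + \frac{1+\alpha}{\alpha} = 1$.
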
Lemma 5.

Fix $z>0$ and suppose $f_{\omega}\in\mathcal{C}_{z}$. For any $n\in[T]$ and any policy $\pi$, one has

\[
\mathrm{KL}\left(\mathbb{P}_{\pi,f_{\omega_{[-j]}^{-1}}}^{n}, \mathbb{P}_{\pi,f_{\omega_{[-j]}^{1}}}^{n}\right) \leq 2nz^{-(2\beta+d)}.
\]
Proof.

We can compute

\begin{align*}
\mathrm{KL}\left(\mathbb{P}_{\pi,f_{\omega_{[-j]}^{-1}}}^{n}, \mathbb{P}_{\pi,f_{\omega_{[-j]}^{1}}}^{n}\right)
&\overset{\mathrm{(i)}}{\leq} 8\, \mathbb{E}_{\pi,f_{\omega_{[-j]}^{-1}}}\left[ \sum_{t=1}^{n} \left( f_{\omega_{[-j]}^{-1}}(X_t) - f_{\omega_{[-j]}^{1}}(X_t) \right)^{2} \mathbf{1}\{\pi_t(X_t)=1\} \right] \\
&\overset{\mathrm{(ii)}}{\leq} 32 D_{\phi}^{2} z^{-2\beta}\, \mathbb{E}_{\pi,f_{\omega_{[-j]}^{-1}}}\left[ \sum_{t=1}^{n} \mathbf{1}\{\pi_t(X_t)=1,\, X_t\in C_j\} \right] \\
&\overset{\mathrm{(iii)}}{=} 32 D_{\phi}^{2} z^{-(2\beta+d)} \sum_{t=1}^{n} \mathbb{P}_{\pi,f_{\omega_{[-j]}^{-1}}}^{t}\left( \pi_t(X_t)=1 \mid X_t\in C_j \right) \\
&\overset{\mathrm{(iv)}}{\leq} 32 D_{\phi}^{2} z^{-(2\beta+d)} n \leq 2nz^{-(2\beta+d)}.
\end{align*}

Here, step (i) uses the standard decomposition of the KL divergence together with the Bernoulli reward structure; step (ii) is due to the definition of $f_{\omega}$; step (iii) uses $\mathbb{P}(X_t\in C_j)=1/z^{d}$; and step (iv) arises from $\mathbb{P}_{\pi,f_{\omega_{[-j]}^{-1}}}^{t}(\pi_t(X_t)=1 \mid X_t\in C_j)\leq 1$ for any $1\leq t\leq n$. ∎
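
The constant 8 in step (i) can be sanity-checked numerically. The sketch below is only an illustration under an assumption that is not stated in this section, namely that all mean rewards lie in $[1/4, 3/4]$; it evaluates the ratio $\mathrm{KL}(\mathrm{Ber}(p)\,\|\,\mathrm{Ber}(q))/(p-q)^2$ on a grid. The function names are hypothetical.

```python
import math


def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def worst_ratio(lo=0.25, hi=0.75, steps=200):
    """Largest value of KL(Ber(p) || Ber(q)) / (p - q)^2 over a grid on [lo, hi].

    The range [lo, hi] for the mean rewards is an assumption made for this
    illustration, not a quantity taken from the paper.
    """
    worst = 0.0
    for a in range(steps + 1):
        p = lo + (hi - lo) * a / steps
        for b in range(steps + 1):
            if a == b:
                continue  # skip p == q, where the ratio is 0/0
            q = lo + (hi - lo) * b / steps
            worst = max(worst, kl_bernoulli(p, q) / (p - q) ** 2)
    return worst


if __name__ == "__main__":
    print(f"worst ratio on the grid: {worst_ratio():.3f}")
```

On this range the ratio is bounded analytically by $1/(q(1-q)) \leq 16/3$ (the chi-square upper bound on the KL divergence), so the grid maximum stays below 8.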

Lemma 6.

For any $i\in[M]$, one has

\[
\int_{A_i} \min\left\{ dP_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},\, dP_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} \right\} \geq P_{\pi,f_{\omega_{[-j]}^{-1}}}(A_i) - \frac{3}{2}\, \mathrm{TV}\left( P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},\, P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} \right).
\]
Proof.

We can compute

\begin{align*}
\int_{A_i} \min\left\{ dP_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},\, dP_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} \right\}
&= \int_{A_i} \frac{ dP_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}} + dP_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} - \left| dP_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}} - dP_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} \right| }{2} \\
&\geq \frac{1}{2}\left( P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}}(A_i) + P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}}(A_i) \right) - \mathrm{TV}\left( P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},\, P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} \right) \\
&= \frac{1}{2}\left( 2P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}}(A_i) + P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}}(A_i) - P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}}(A_i) \right) - \mathrm{TV}\left( P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},\, P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} \right) \\
&\overset{\mathrm{(i)}}{\geq} P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}}(A_i) - \frac{1}{2}\, \mathrm{TV}\left( P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},\, P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} \right) - \mathrm{TV}\left( P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},\, P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} \right) \\
&\overset{\mathrm{(ii)}}{=} P_{\pi,f_{\omega_{[-j]}^{-1}}}(A_i) - \frac{3}{2}\, \mathrm{TV}\left( P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},\, P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} \right),
\end{align*}

where step (i) is due to $\left| P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}}(A_i) - P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}}(A_i) \right| \leq \mathrm{TV}\left( P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}},\, P_{\pi,f_{\omega_{[-j]}^{1}}}^{T_{i-1}} \right)$, and step (ii) uses the fact that $P_{\pi,f_{\omega_{[-j]}^{-1}}}$ and $P_{\pi,f_{\omega_{[-j]}^{-1}}}^{T_{i-1}}$ are equivalent on $A_i$. ∎
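
The inequality of Lemma 6 holds for any pair of probability measures and any event, which makes it easy to spot-check on finite sample spaces. The following sketch does so for random discrete distributions; it is a numerical illustration only, with a single distribution `p` standing in for both $P_{\pi,f_{\omega_{[-j]}^{-1}}}$ and its restriction to the first $T_{i-1}$ rounds, and with all function names hypothetical.

```python
import random


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def overlap_on(p, q, event):
    """Discrete analogue of the integral of min{dP, dQ} over the event."""
    return sum(min(p[k], q[k]) for k in event)


def random_distribution(rng, size):
    """Random probability vector obtained by normalizing uniform weights."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def lemma6_holds(trials=1000, size=6, seed=0):
    """Check overlap_on(P, Q, A) >= P(A) - 1.5 * TV(P, Q) on random instances."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = random_distribution(rng, size)
        q = random_distribution(rng, size)
        event = [k for k in range(size) if rng.random() < 0.5]
        lhs = overlap_on(p, q, event)
        rhs = sum(p[k] for k in event) - 1.5 * total_variation(p, q)
        if lhs < rhs - 1e-12:  # small tolerance for floating-point error
            return False
    return True


if __name__ == "__main__":
    print(lemma6_holds())
```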

Appendix B Proof of Theorem 3

Our proof of Theorem 3 is inspired by the framework developed in [39]. Our setting presents additional technical difficulty due to the batch constraint.

We begin by introducing some useful notation. Recall the tree growing process described in Section 4, where we have defined a tree $\mathcal{T}$ of depth $M$. The root (depth 0) of the tree is the whole space $\mathcal{X}$. At depth $1$, $\mathcal{X}$ has $g_0^{d}$ children, each of which is a bin of width $1/g_0$. Each bin at depth 1 has $g_1^{d}$ children, each of which is a bin of width $1/(g_0 g_1)$. These children form the depth-2 nodes of the tree $\mathcal{T}$. We form the tree recursively until depth $M$.
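
To make the recursion concrete, the sketch below tabulates the bin width and the number of bins at each depth for a given split sequence $g_0, g_1, \ldots$. It assumes the bins are axis-aligned cubes partitioning a unit cube in $d$ dimensions; the function name and the example split sequence are hypothetical and not taken from the paper.

```python
from fractions import Fraction


def layer_geometry(grid, d):
    """Return (depth, bin width, number of bins) for depths 0..len(grid).

    `grid` holds the per-axis split factors g_0, g_1, ...; `d` is the
    covariate dimension. Exact fractions avoid rounding in the widths.
    """
    layers = [(0, Fraction(1), 1)]  # the root is the whole space
    splits = 1  # g_0 * ... * g_{k-1}: cells per axis at depth k
    for depth, g in enumerate(grid, start=1):
        splits *= g
        layers.append((depth, Fraction(1, splits), splits ** d))
    return layers


if __name__ == "__main__":
    for depth, width, count in layer_geometry([2, 3], d=2):
        print(depth, width, count)
```

With `grid = [2, 3]` and `d = 2`, depth 1 has $2^2 = 4$ bins of width $1/2$ and depth 2 has $6^2 = 36$ bins of width $1/6$, matching the widths $1/g_0$ and $1/(g_0 g_1)$ above.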

For a bin $C\in\mathcal{T}$, we define its parent by $\mathsf{p}(C)=\{C^{\prime}\in\mathcal{T}: C\in\mathsf{child}(C^{\prime})\}$. Moreover, we let $\mathsf{p}^{1}(C)=\mathsf{p}(C)$ and define $\mathsf{p}^{k}(C)=\mathsf{p}(\mathsf{p}^{k-1}(C))$ for $k\geq 2$ recursively. Finally, we denote by $\mathcal{P}(C)=\{C^{\prime}\in\mathcal{T}: C^{\prime}=\mathsf{p}^{k}(C) \textrm{ for some } k\geq 1\}$ the set of all ancestors of the bin $C$.
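
One possible way to realize the parent and ancestor maps in code is to index each bin by its depth and a tuple of per-axis cell indices. This encoding is an assumption made for illustration (the paper does not prescribe a data structure), and the names `parent` and `ancestors` are hypothetical.

```python
def parent(depth, index, grid):
    """Parent of the bin at `depth` with per-axis cell indices `index`.

    `grid` holds the split factors g_0, g_1, ...; a bin at depth k was
    created by cutting its parent into grid[k - 1] pieces along each axis.
    """
    if depth == 0:
        raise ValueError("the root has no parent")
    g = grid[depth - 1]
    return depth - 1, tuple(j // g for j in index)


def ancestors(depth, index, grid):
    """All ancestors p^1(C), p^2(C), ... of a bin, ending at the root."""
    chain = []
    while depth > 0:
        depth, index = parent(depth, index, grid)
        chain.append((depth, index))
    return chain


if __name__ == "__main__":
    # Depth-2 bin with cell indices (5, 2) when g_0 = 2 and g_1 = 3.
    print(ancestors(2, (5, 2), [2, 3]))
```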

We also define $\mathcal{L}_t$ to be the set of active bins at time $t$, with the dummy case $\mathcal{L}_0=\{\mathcal{X}\}$. Clearly, for $1\leq t\leq t_1$, one has $\mathcal{L}_t=\mathcal{B}_1$, where $\mathcal{B}_1$ is the set of all bins in the first layer.

B.1 Two clean events

The regret analysis relies on two clean events. First, fix a batch i1𝑖1i\geq 1italic_i ≥ 1, and recall ti1+1subscriptsubscript𝑡𝑖11\mathcal{L}_{t_{i-1}+1}caligraphic_L start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT + 1 end_POSTSUBSCRIPT is the set of active bins at time ti1+1subscript𝑡𝑖11t_{i-1}+1italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT + 1. We denote the random number of pulls for a bin Cti1+1𝐶subscriptsubscript𝑡𝑖11C\in\mathcal{L}_{t_{i-1}+1}italic_C ∈ caligraphic_L start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT + 1 end_POSTSUBSCRIPT within batch i𝑖iitalic_i to be

mC,it=ti1+1ti𝟏{XtC}.subscript𝑚𝐶𝑖superscriptsubscript𝑡subscript𝑡𝑖11subscript𝑡𝑖1subscript𝑋𝑡𝐶m_{C,i}\coloneqq\sum_{t=t_{i-1}+1}^{t_{i}}\mathbf{1}\{X_{t}\in C\}.italic_m start_POSTSUBSCRIPT italic_C , italic_i end_POSTSUBSCRIPT ≔ ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT bold_1 { italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ italic_C } .

Clearly, it has expectation

mC,i=𝔼[mC,i]=(titi1)X(XC).superscriptsubscript𝑚𝐶𝑖𝔼delimited-[]subscript𝑚𝐶𝑖subscript𝑡𝑖subscript𝑡𝑖1subscript𝑋𝑋𝐶m_{C,i}^{\star}=\mathbb{E}[m_{C,i}]=(t_{i}-t_{i-1})\mathbb{P}_{X}(X\in C).italic_m start_POSTSUBSCRIPT italic_C , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT = blackboard_E [ italic_m start_POSTSUBSCRIPT italic_C , italic_i end_POSTSUBSCRIPT ] = ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ) blackboard_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_X ∈ italic_C ) .

The first clean event claims that $m_{C,i}$ concentrates well around its expectation $m_{C,i}^{\star}$ uniformly over all $C\in\mathcal{T}$. We denote this event by $E$.

Lemma 7.

Suppose that $M\leq D_{1}\log(T)$ for some constant $D_{1}>0$. With probability at least $1-1/T$, for all $1\leq i\leq M$ and $C\in\mathcal{L}_{t_{i-1}+1}$, we have

$$\frac{1}{2}m_{C,i}^{\star}\leq m_{C,i}\leq\frac{3}{2}m_{C,i}^{\star}.$$

See Section B.5.1 for the proof.
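To build intuition for the event $E$, the following is a minimal simulation sketch for a single bin and a single batch. It is only an illustration and not part of the proof: Lemma 7 is uniform over all batches and active bins, whereas this check looks at one bin. The function names and the parameters `batch_len` (standing in for $t_i - t_{i-1}$) and `p_bin` (standing in for $\mathbb{P}_X(X\in C)$) are hypothetical choices made here.

```python
import random


def count_pulls(batch_len, p_bin, rng):
    """Simulate m_{C,i}: how many of the batch's covariates land in bin C."""
    return sum(1 for _ in range(batch_len) if rng.random() < p_bin)


def clean_event_frequency(batch_len, p_bin, trials=200, seed=0):
    """Fraction of simulated batches with m*/2 <= m <= 3m*/2 for one bin."""
    rng = random.Random(seed)
    m_star = batch_len * p_bin  # expectation m*_{C,i}
    hits = 0
    for _ in range(trials):
        m = count_pulls(batch_len, p_bin, rng)
        if 0.5 * m_star <= m <= 1.5 * m_star:
            hits += 1
    return hits / trials
```

When $m_{C,i}^{\star}$ is large (for example `batch_len=2000`, `p_bin=0.05`), the returned frequency should be essentially one, in line with the multiplicative concentration stated in the lemma.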

Since $M\leq D_{1}\log(T)$ by assumption, we can apply Lemma 7 to obtain

$$\mathbb{E}[R_{T}(\hat{\pi})\mathbf{1}(E^{c})]\leq T\,\mathbb{P}(E^{c})\leq 1.$$

Therefore, in the remaining proof, we condition on $E$ and focus on bounding $\mathbb{E}[R_{T}(\hat{\pi})\mathbf{1}(E)]$.
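For completeness, the reduction to the clean event can be written out as below. This is a sketch that assumes the per-round regret is at most one (for instance, rewards taking values in $[0,1]$), so that $R_{T}(\hat{\pi})\leq T$; that bound is an assumption made here and is not restated in this section. The last step uses $\mathbb{P}(E^{c})\leq 1/T$ from Lemma 7.

```latex
\begin{align*}
\mathbb{E}[R_{T}(\hat{\pi})]
&= \mathbb{E}[R_{T}(\hat{\pi})\mathbf{1}(E)] + \mathbb{E}[R_{T}(\hat{\pi})\mathbf{1}(E^{c})] \\
&\leq \mathbb{E}[R_{T}(\hat{\pi})\mathbf{1}(E)] + T\,\mathbb{P}(E^{c}) \\
&\leq \mathbb{E}[R_{T}(\hat{\pi})\mathbf{1}(E)] + 1.
\end{align*}
```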

The second clean event is on the elimination process. Since we use successive elimination in each bin, it is natural to expect that the optimal arm in each bin is not eliminated during the process. To mathematically specify this event, we need a few notations.

For each bin $C\in\mathcal{L}_{i}$, let $\mathcal{I}_{C}^{\prime}$ be the set of remaining arms at the end of batch $i$, i.e., after Algorithm 2 is invoked. Define

$$\begin{aligned}
\bar{\mathcal{I}}_{C} &=\left\{k\in\{1,-1\}:\sup_{x\in C}f^{\star}(x)-f^{(k)}(x)\leq c_{1}|C|^{\beta}\right\},\\
\underline{\mathcal{I}}_{C} &=\left\{k\in\{1,-1\}:\sup_{x\in C}f^{\star}(x)-f^{(k)}(x)\leq c_{0}|C|^{\beta}\right\},
\end{aligned}$$

where $c_{0}=2Ld^{\beta/2}+1$ and $c_{1}=8c_{0}$. Since $c_{0}\leq c_{1}$, we clearly have

$$\underline{\mathcal{I}}_{C}\subseteq\bar{\mathcal{I}}_{C}.$$

Define a good event $\mathcal{A}_{C}=\{\underline{\mathcal{I}}_{C}\subseteq\mathcal{I}_{C}^{\prime}\subseteq\bar{\mathcal{I}}_{C}\}$, which is the event that the remaining arms in $C$ have gaps of the correct order. In addition, define $\mathcal{G}_{C}=\cap_{C^{\prime}\in\mathcal{P}(C)}\mathcal{A}_{C^{\prime}}$. Recall that $\mathcal{B}_{i}$ is the set of bins $C$ with $|C|=(\prod_{l=0}^{i-1}g_{l})^{-1}=w_{i}$ for $i\geq 1$.
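As a small illustration of the width formula $w_{i}=(\prod_{l=0}^{i-1}g_{l})^{-1}$ recalled above, the sketch below computes layer widths from a list of split factors. It assumes, for the sake of exact arithmetic, that the $g_{l}$ are positive integers; the function name `bin_width` and the example values are hypothetical.

```python
from fractions import Fraction
from math import prod


def bin_width(split_factors, i):
    """Width w_i of bins in layer i, given split factors g_0, g_1, ...

    Uses w_i = 1 / (g_0 * g_1 * ... * g_{i-1}); with i = 0 the product is
    empty and the width is 1, i.e. the whole covariate space.
    """
    return Fraction(1, prod(split_factors[:i]))
```

For instance, with split factors `[2, 4, 2]`, layers 1, 2, and 3 have widths $1/2$, $1/8$, and $1/16$, so each refinement shrinks the bins by the corresponding factor $g_{l}$.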

Lemma 8.

For any $1\leq i\leq M-1$ and $C\in\mathcal{B}_{i}$, we have

$$\mathbb{P}(E\cap\mathcal{G}_{C}\cap\mathcal{A}_{C}^{c})\leq\frac{4m_{C,i}^{\star}}{T|C|^{d}}.$$

In words, Lemma 8 guarantees that $\mathcal{A}_{C}$ happens with high probability if $E$ holds and $\mathcal{A}_{C^{\prime}}$ holds for all the ancestors $C^{\prime}$ of $C$. See Section B.5.2 for the proof.

B.2 Regret decomposition

In this section, we decompose the regret into three terms. First, for a bin $C$, we define

$$r_{T}^{\mathrm{live}}(C)\coloneqq\sum_{t=1}^{T}\left(f^{\star}(X_{t})-f^{(\pi_{t}(X_{t}))}(X_{t})\right)\mathbf{1}(X_{t}\in C)\,\mathbf{1}(C\in\mathcal{L}_{t}).$$

In addition, define $\mathcal{J}_{t}\coloneqq\cup_{s\leq t}\mathcal{L}_{s}$ to be the set of bins that have been live up until time $t$. Correspondingly, we define

$$r_{T}^{\mathrm{born}}(C)\coloneqq\sum_{t=1}^{T}\left(f^{\star}(X_{t})-f^{(\pi_{t}(X_{t}))}(X_{t})\right)\mathbf{1}(X_{t}\in C)\,\mathbf{1}(C\in\mathcal{J}_{t}).$$

It is clear from the definition that for any $C\in\mathcal{T}$, one has

$$\begin{aligned}
r_{T}^{\mathrm{born}}(C) &=r_{T}^{\mathrm{live}}(C)+\sum_{C^{\prime}\in\mathsf{child}(C)}r_{T}^{\mathrm{born}}(C^{\prime})\\
&=r_{T}^{\mathrm{born}}(C)\mathbf{1}(\mathcal{A}_{C}^{c})+r_{T}^{\mathrm{live}}(C)\mathbf{1}(\mathcal{A}_{C})+\sum_{C^{\prime}\in\mathsf{child}(C)}r_{T}^{\mathrm{born}}(C^{\prime})\mathbf{1}(\mathcal{A}_{C}).
\end{aligned}$$

Applying this relation recursively leads to the following regret decomposition:

$$\begin{aligned}
R_{T}(\pi) &=r_{T}^{\mathrm{born}}(\mathcal{X})\\
&=\underbrace{r_{T}^{\mathrm{live}}(\mathcal{X})}_{=0}+\sum_{C^{\prime}\in\mathsf{child}(\mathcal{X})}r_{T}^{\mathrm{born}}(C^{\prime})\\
&=\sum_{1\leq i<M}\left(\underbrace{\sum_{C\in\mathcal{B}_{i}}r_{T}^{\mathrm{born}}(C)\mathbf{1}(\mathcal{G}_{C}\cap\mathcal{A}_{C}^{c})}_{\eqqcolon U_{i}}+\underbrace{\sum_{C\in\mathcal{B}_{i}}r_{T}^{\mathrm{live}}(C)\mathbf{1}(\mathcal{G}_{C}\cap\mathcal{A}_{C})}_{\eqqcolon V_{i}}\right)\\
&\quad+\sum_{C\in\mathcal{B}_{M}}r_{T}^{\mathrm{live}}(C)\mathbf{1}(\mathcal{G}_{C}),
\end{aligned}$$

where the second equality arises from the fact that $r_{T}^{\mathrm{live}}(\mathcal{X})=0$. Indeed, $\mathcal{X}\notin\mathcal{L}_{t}$ for any $1\leq t\leq T$.
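The first identity in the recursion, $r_{T}^{\mathrm{born}}(C)=r_{T}^{\mathrm{live}}(C)+\sum_{C^{\prime}\in\mathsf{child}(C)}r_{T}^{\mathrm{born}}(C^{\prime})$, can be checked on a toy bin tree. The sketch below is illustrative only: the `Bin` class, the helper names, and the numeric live-regret values are hypothetical, and the indicator events $\mathcal{A}_{C}$ and $\mathcal{G}_{C}$ are deliberately left out. It shows that unrolling the recursion from the root collects the live regret of every bin in the tree.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Bin:
    """A node of the bin tree carrying the regret incurred while it was live."""
    live_regret: float
    children: List["Bin"] = field(default_factory=list)


def born_regret(node: Bin) -> float:
    """r^born(C) = r^live(C) + sum of r^born over the children of C."""
    return node.live_regret + sum(born_regret(c) for c in node.children)


def all_bins(node: Bin) -> Iterator[Bin]:
    """Enumerate the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from all_bins(child)
```

In this toy setting the root plays the role of $\mathcal{X}$, whose live regret is zero because it is never an active bin, so the total regret equals the sum of live regrets over the remaining bins.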

B.3 Controlling three terms

In what follows, we control $V_{i}$, $U_{i}$, and the last batch separately.

B.3.1 Controlling $V_{i}$

Fix some $1\leq i\leq M-1$, and some bin $C\in\mathcal{B}_{i}$. On the event $\mathcal{G}_{C}$ we have $\mathcal{I}_{\mathsf{p}(C)}^{\prime}\subseteq\bar{\mathcal{I}}_{\mathsf{p}(C)}$, that is, for any $k\in\mathcal{I}_{\mathsf{p}(C)}^{\prime}$,

$$\sup_{x\in\mathsf{p}(C)}f^{\star}(x)-f^{(k)}(x)\leq c_{1}|\mathsf{p}(C)|^{\beta}.$$

This implies that for any $x\in C$ and $k\in\mathcal{I}_{\mathsf{p}(C)}^{\prime}$,

$$\left(f^{\star}(x)-f^{(k)}(x)\right)\mathbf{1}\{\mathcal{G}_{C}\}\leq c_{1}|\mathsf{p}(C)|^{\beta}\,\mathbf{1}\left(0<\left|f^{(1)}(x)-f^{(-1)}(x)\right|\leq c_{1}|\mathsf{p}(C)|^{\beta}\right). \tag{16}$$

As a result, we obtain

$$\begin{aligned}
&\mathbb{E}[r_{T}^{\mathrm{live}}(C)\mathbf{1}(\mathcal{G}_{C}\cap\mathcal{A}_{C})]=\mathbb{E}\left[\sum_{t=1}^{T}\left(f^{\star}(X_{t})-f^{(\pi_{t}(X_{t}))}(X_{t})\right)\mathbf{1}(X_{t}\in C)\mathbf{1}(C\in\mathcal{L}_{t})\mathbf{1}(\mathcal{G}_{C}\cap\mathcal{A}_{C})\right]\\
&\quad\overset{\mathrm{(i)}}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}c_{1}|\mathsf{p}(C)|^{\beta}\mathbf{1}\left(0<\left|f^{(1)}(X_{t})-f^{(-1)}(X_{t})\right|\leq c_{1}|\mathsf{p}(C)|^{\beta}\right)\mathbf{1}(X_{t}\in C,C\in\mathcal{L}_{t})\mathbf{1}(\mathcal{G}_{C}\cap\mathcal{A}_{C})\right]\\
&\quad\overset{\mathrm{(ii)}}{\leq}c_{1}|\mathsf{p}(C)|^{\beta}\,\mathbb{E}\left[\sum_{t=t_{i-1}+1}^{t_{i}}\mathbf{1}\left(0<\left|f^{(1)}(X_{t})-f^{(-1)}(X_{t})\right|\leq c_{1}|\mathsf{p}(C)|^{\beta},X_{t}\in C\right)\mathbf{1}(\mathcal{G}_{C}\cap\mathcal{A}_{C})\right]\\
&\quad\overset{\mathrm{(iii)}}{\leq}c_{1}|\mathsf{p}(C)|^{\beta}\sum_{t=t_{i-1}+1}^{t_{i}}\mathbb{P}\left(0<\left|f^{(1)}(X_{t})-f^{(-1)}(X_{t})\right|\leq c_{1}|\mathsf{p}(C)|^{\beta},X_{t}\in C\right)\\
&\quad=c_{1}|\mathsf{p}(C)|^{\beta}(t_{i}-t_{i-1})\,\mathbb{P}\left(0<\left|f^{(1)}(X)-f^{(-1)}(X)\right|\leq c_{1}|\mathsf{p}(C)|^{\beta},X\in C\right).
\end{aligned}$$

Here, step (i) uses relation (16), and the fact that $\pi_{t}(X_{t})\in\mathcal{I}_{\mathsf{p}(C)}^{\prime}$ when $X_{t}\in C$. For step (ii), if $C$ is split, then it is no longer live, so the live regret incurred on the remaining batches is zero. On the other hand, if $C$ is not split, then $|\mathcal{I}_{C}^{\prime}|=1$. Without loss of generality, assume that arm $-1$ is eliminated. Conditioned on $\mathcal{A}_{C}$, this means $-1\notin\underline{\mathcal{I}}_{C}$ and there exists $x_{0}\in C$ such that $f^{(1)}(x_{0})-f^{(-1)}(x_{0})>c_{0}|C|^{\beta}$. By the smoothness condition, having a gap exceeding $c_{0}|C|^{\beta}$ at a single point in $C$ implies $f^{(1)}(x)-f^{(-1)}(x)>|C|^{\beta}$ for all $x\in C$. Therefore, arm $1$, which is the remaining one, is the optimal arm for all $x\in C$ and incurs no further regret. Step (iii) holds since $\mathbf{1}(\mathcal{G}_{C}\cap\mathcal{A}_{C})\leq 1$.
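The smoothness step in the argument for (ii) can be spelled out as follows. This is a sketch under two assumptions that are not restated in this section: the smoothness condition is taken to be $|f^{(k)}(x)-f^{(k)}(y)|\leq L\|x-y\|_{2}^{\beta}$ for each arm $k$, and $|C|$ is taken to be the side length of the cube $C\subset\mathbb{R}^{d}$, so that $\|x-x_{0}\|_{2}\leq\sqrt{d}\,|C|$ for $x,x_{0}\in C$. Under these assumptions, the choice $c_{0}=2Ld^{\beta/2}+1$ is exactly what makes the gap survive across the bin.

```latex
\begin{align*}
f^{(1)}(x)-f^{(-1)}(x)
&\geq f^{(1)}(x_{0})-f^{(-1)}(x_{0})
   -\left|f^{(1)}(x)-f^{(1)}(x_{0})\right|
   -\left|f^{(-1)}(x)-f^{(-1)}(x_{0})\right| \\
&> c_{0}|C|^{\beta}-2L\left(\sqrt{d}\,|C|\right)^{\beta} \\
&= \left(2Ld^{\beta/2}+1\right)|C|^{\beta}-2Ld^{\beta/2}|C|^{\beta}
 = |C|^{\beta}.
\end{align*}
```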

Taking the sum over all bins in $\mathcal{B}_{i}$ and using the fact that $|\mathsf{p}(C)|=w_{i-1}$, we obtain

$$\begin{aligned}
\sum_{C\in\mathcal{B}_{i}}\mathbb{E}[r_{T}^{\mathrm{live}}(C)\mathbf{1}(\mathcal{G}_{C}\cap\mathcal{A}_{C})] &\leq\sum_{C\in\mathcal{B}_{i}}c_{1}w_{i-1}^{\beta}(t_{i}-t_{i-1})\,\mathbb{P}\left(0<\left|f^{(1)}(X)-f^{(-1)}(X)\right|\leq c_{1}|\mathsf{p}(C)|^{\beta},X\in C\right)\\
&=c_{1}w_{i-1}^{\beta}(t_{i}-t_{i-1})\sum_{C\in\mathcal{B}_{i}}\mathbb{P}\left(0<\left|f^{(1)}(X)-f^{(-1)}(X)\right|\leq c_{1}w_{i-1}^{\beta},X\in C\right).
\end{aligned} \tag{17}$$

Note that

$$\begin{aligned}
\sum_{C\in\mathcal{B}_{i}}\mathbb{P}\left(0<\left|f^{(1)}(X)-f^{(-1)}(X)\right|\leq c_{1}w_{i-1}^{\beta},X\in C\right) &=\mathbb{P}\left(0<\left|f^{(1)}(X)-f^{(-1)}(X)\right|\leq c_{1}w_{i-1}^{\beta}\right)\\
&\leq D_{0}\cdot\left[c_{1}w_{i-1}^{\beta}\right]^{\alpha},
\end{aligned} \tag{18}$$

where the last inequality follows from the margin condition. Combining relations (18) and (17), we reach

$$\sum_{C\in\mathcal{B}_i} \mathbb{E}\left[r_T^{\textrm{live}}(C)\,\mathbf{1}(\mathcal{G}_C \cap \mathcal{A}_C)\right] \le (t_i - t_{i-1}) \cdot \left[c_1 w_{i-1}^{\beta}\right]^{1+\alpha} \cdot D_0.$$

B.3.2 Controlling $U_i$

Fix some $1 \le i \le M-1$, and some bin $C \in \mathcal{B}_i$. Again, using the definition of $\mathcal{G}_C$, we obtain

$$\begin{aligned}
\mathbb{E}\left[r_T^{\textrm{born}}(C)\,\mathbf{1}(\mathcal{G}_C \cap \mathcal{A}_C^c)\right]
&= \mathbb{E}\left[\sum_{t=1}^{T} \left(f^{\star}(X_t) - f^{(\pi_t(X_t))}(X_t)\right) \mathbf{1}(X_t \in C)\,\mathbf{1}(C \in \mathcal{J}_t)\,\mathbf{1}(\mathcal{G}_C \cap \mathcal{A}_C^c)\right] \\
&\le \mathbb{E}\left[\sum_{t=1}^{T} c_1 |\mathsf{p}(C)|^{\beta}\,\mathbf{1}\left(0 < \left|f^{(1)}(X_t) - f^{(-1)}(X_t)\right| \le c_1 |\mathsf{p}(C)|^{\beta}\right) \mathbf{1}(X_t \in C,\ C \in \mathcal{J}_t)\,\mathbf{1}(\mathcal{G}_C \cap \mathcal{A}_C^c)\right] \\
&\le c_1 |\mathsf{p}(C)|^{\beta}\, T\, \mathbb{P}\left(0 < \left|f^{(1)}(X) - f^{(-1)}(X)\right| \le c_1 |\mathsf{p}(C)|^{\beta},\ X \in C\right) \mathbb{P}(\mathcal{G}_C \cap \mathcal{A}_C^c).
\end{aligned}$$

Apply Lemma 8 to see that

$$\begin{aligned}
\mathbb{E}\left[r_T^{\textrm{born}}(C)\,\mathbf{1}(\mathcal{G}_C \cap \mathcal{A}_C^c)\right]
&\le c_1 |\mathsf{p}(C)|^{\beta}\, T\, \mathbb{P}\left(0 < \left|f^{(1)}(X) - f^{(-1)}(X)\right| \le c_1 |\mathsf{p}(C)|^{\beta},\ X \in C\right) \frac{4 m_{C,i}^{\star}}{T |C|^{d}} \\
&= c_1 w_{i-1}^{\beta}\, \mathbb{P}\left(0 < \left|f^{(1)}(X) - f^{(-1)}(X)\right| \le c_1 w_{i-1}^{\beta},\ X \in C\right) \frac{4 (t_i - t_{i-1})\, \mathbb{P}_X(X \in C)}{|C|^{d}} \\
&\le 4 \bar{c}\, c_1 w_{i-1}^{\beta}\, \mathbb{P}\left(0 < \left|f^{(1)}(X) - f^{(-1)}(X)\right| \le c_1 w_{i-1}^{\beta},\ X \in C\right) (t_i - t_{i-1}),
\end{aligned}$$

where we use the fact that $\mathbb{P}_X(X \in C) \le \bar{c}\,|C|^{d}$ in the second inequality. Summing over all bins in $\mathcal{B}_i$, we obtain

$$\begin{aligned}
\sum_{C\in\mathcal{B}_i} \mathbb{E}\left[r_T^{\textrm{born}}(C)\,\mathbf{1}(\mathcal{G}_C \cap \mathcal{A}_C^c)\right]
&\le 4 \bar{c}\, c_1 w_{i-1}^{\beta} (t_i - t_{i-1}) \sum_{C\in\mathcal{B}_i} \mathbb{P}\left(0 < \left|f^{(1)}(X) - f^{(-1)}(X)\right| \le c_1 w_{i-1}^{\beta},\ X \in C\right) \\
&\le 4 \bar{c}\, c_1 w_{i-1}^{\beta} (t_i - t_{i-1})\, D_0 \cdot \left[c_1 w_{i-1}^{\beta}\right]^{\alpha} \\
&= 4 D_0 \bar{c}\, (t_i - t_{i-1}) \left[c_1 w_{i-1}^{\beta}\right]^{1+\alpha},
\end{aligned}$$

where the second inequality reuses the bound in (18).

B.3.3 Last Batch

For $C \in \mathcal{B}_M$, one can similarly obtain

$$\mathbb{E}\left[r_T^{\textrm{live}}(C)\,\mathbf{1}(\mathcal{G}_C)\right] \le c_1 |\mathsf{p}(C)|^{\beta} (T - t_{M-1})\, \mathbb{P}\left(0 < \left|f^{(1)}(X) - f^{(-1)}(X)\right| \le c_1 |\mathsf{p}(C)|^{\beta},\ X \in C\right).$$

Consequently, summing over $C \in \mathcal{B}_M$ yields

$$\begin{aligned}
\sum_{C\in\mathcal{B}_M} \mathbb{E}\left[r_T^{\textrm{live}}(C)\,\mathbf{1}(\mathcal{G}_C)\right]
&\le \sum_{C\in\mathcal{B}_M} c_1 |\mathsf{p}(C)|^{\beta} (T - t_{M-1})\, \mathbb{P}\left(0 < \left|f^{(1)}(X) - f^{(-1)}(X)\right| \le c_1 |\mathsf{p}(C)|^{\beta},\ X \in C\right) \\
&\le c_1 w_{M-1}^{\beta} (T - t_{M-1})\, D_0 \cdot \left[c_1 w_{M-1}^{\beta}\right]^{\alpha} \\
&= D_0 (T - t_{M-1}) \left[c_1 w_{M-1}^{\beta}\right]^{1+\alpha}.
\end{aligned}$$

B.4 Putting things together

In sum, the total regret is bounded by

$$\mathbb{E}[R_T(\pi)] \le c \left( t_1 + \sum_{i=2}^{M-1} (t_i - t_{i-1}) \cdot w_{i-1}^{\beta + \alpha\beta} + (T - t_{M-1})\, w_{M-1}^{\beta + \alpha\beta} \right),$$

where $c$ is a constant that depends on $(\alpha, \beta, D, L)$. Recall that $w_i = (\prod_{l=0}^{i-1} g_l)^{-1}$, and recall the choices of the batch sizes and the split factors in (11)-(12). We then obtain

$$\begin{aligned}
t_1 &\lesssim T^{\frac{1-\gamma}{1-\gamma^{M}}} \log T, \\
(t_i - t_{i-1}) \cdot w_{i-1}^{\beta + \alpha\beta} &\lesssim T^{\frac{1-\gamma}{1-\gamma^{M}}} \log T, \qquad \text{for } 2 \le i \le M-1, \\
(T - t_{M-1})\, w_{M-1}^{\beta + \alpha\beta} &\le T w_{M-1}^{\beta + \alpha\beta} \lesssim T^{\frac{1-\gamma}{1-\gamma^{M}}} \log T.
\end{aligned}$$

The proof is finished by combining the above three bounds.

B.5 Proofs for the clean events

We are left with proving that the two clean events happen with high probability.

B.5.1 Proof of Lemma 7

Fix the batch index $i$, and a node $C$ in layer-$i$ of the tree $\mathcal{T}$. By relation (12), we have

$$\begin{aligned}
m_{C,i}^{\star} &= (t_i - t_{i-1})\, \mathbb{P}_X(X \in C) \\
&\asymp |C|^{-(2\beta + d)} \log(T |C|^{d})\, \mathbb{P}_X(X \in C) \\
&\gtrsim |C|^{-2\beta} \ge g_0^{2\beta} \asymp T^{\frac{1-\gamma}{1-\gamma^{M}} \cdot \frac{2\beta}{2\beta + d}},
\end{aligned}$$

where the last step uses the fact that $\mathbb{P}_X(X \in C) \ge \underline{c}\,|C|^{d}$. Therefore, $m_{C,i}^{\star} \ge \frac{3}{4} \log(2T^2)$ for all $i$ and $C$, as long as $T$ is sufficiently large. This allows us to invoke Chernoff's bound to obtain that with probability at most $1/T^2$

$$\left| \sum\nolimits_{t = t_{i-1}+1}^{t_i} \mathbf{1}\{X_t \in C\} - m_{C,i}^{\star} \right| \ge \sqrt{3 \log(2T^2)\, m_{C,i}^{\star}}.$$
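The choice of the deviation level can be checked numerically. The following is a minimal sketch, assuming the two-sided multiplicative Chernoff bound $\mathbb{P}(|S - \mu| \ge \delta\mu) \le 2\exp(-\delta^2\mu/3)$; the text does not say which version of Chernoff's bound it invokes, so this form is an assumption, and the function name `chernoff_tail` is hypothetical. Plugging in the deviation $\sqrt{3\log(2T^2)\,\mu}$ makes the bound collapse to $1/T^2$ regardless of $\mu$.

```python
import math


def chernoff_tail(mu: float, T: int) -> float:
    """Assumed two-sided multiplicative Chernoff tail 2*exp(-delta^2 * mu / 3),
    evaluated at the absolute deviation sqrt(3 * log(2 T^2) * mu)."""
    deviation = math.sqrt(3.0 * math.log(2.0 * T**2) * mu)
    delta = deviation / mu  # relative deviation delta, so deviation = delta * mu
    return 2.0 * math.exp(-(delta**2) * mu / 3.0)


T = 1000
for mu in (50.0, 500.0, 5000.0):  # illustrative values of the expected count
    # The bound and the target 1/T^2 agree up to floating-point error.
    print(mu, chernoff_tail(mu, T), 1.0 / T**2)
```

The assumed form of the bound is usually stated for relative deviations $\delta \le 1$; the sketch only verifies the algebra and does not check that range.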

Denote $E^c = \left\{ \exists\, 1 \le i \le M,\ C \in \mathcal{L}_{t_{i-1}+1} \text{ such that } \left| \sum_{t = t_{i-1}+1}^{t_i} \mathbf{1}\{X_t \in C\} - m_{C,i}^{\star} \right| \ge \sqrt{3 \log(2T^2)\, m_{C,i}^{\star}} \right\}$. Applying the union bound, we reach

$$\mathbb{P}(E^c) \le \sum_{C \in \mathcal{T}} \frac{1}{T^2} \overset{\mathrm{(i)}}{\le} \frac{1}{T^2} \left( \sum_{i=1}^{M} \Big( \prod_{l=0}^{i-1} g_l \Big)^{d} \right) \overset{\mathrm{(ii)}}{\le} \frac{1}{T^2} \cdot M \cdot \Big( \prod_{l=0}^{M-1} g_l \Big)^{d},$$

where step (i) sums over all possible nodes of $\mathcal{T}$ across batches, and step (ii) is due to $(\prod_{l=0}^{i-1} g_l)^{d} \le (\prod_{l=0}^{M-1} g_l)^{d}$ for any $1 \le i \le M$. Since $g_{M-1} = 1$, we further obtain

$$\mathbb{P}(E^c) \le \frac{1}{T^2} \cdot M \cdot \Big( \prod_{l=0}^{M-2} g_l \Big)^{d} \overset{\mathrm{(iii)}}{\le} \frac{1}{T^2} \cdot M \cdot t_{M-1}^{\frac{d}{2\beta + d}} \overset{\mathrm{(iv)}}{\le} D_1 \frac{1}{T^2} \cdot \log T \cdot T^{\frac{d}{2\beta + d}} \le \frac{1}{T},$$

where step (iii) invokes relation (12), and step (iv) uses the assumption $M \le D_1 \log T$. This completes the proof.
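The last inequality in the chain, $D_1 T^{-2} \log T \cdot T^{d/(2\beta+d)} \le 1/T$, is equivalent to $D_1 \log T \le T^{2\beta/(2\beta+d)}$, so it holds once $T$ is large enough relative to $D_1$. A rough numerical sketch follows; the function names and all parameter values are illustrative assumptions, not taken from the paper.

```python
import math


def union_bound_rhs(T: float, beta: float, d: int, D1: float) -> float:
    """Right-hand side after step (iv): D1 * log(T) * T^(d/(2*beta+d)) / T^2."""
    return D1 * math.log(T) * T ** (d / (2.0 * beta + d)) / T**2


def final_step_holds(T: float, beta: float, d: int, D1: float) -> bool:
    """True when the bound after step (iv) is at most 1/T."""
    return union_bound_rhs(T, beta, d, D1) <= 1.0 / T


# Illustrative values only: a large horizon passes, a tiny one need not.
print(final_step_holds(10**6, beta=1.0, d=2, D1=1.0))
print(final_step_holds(10, beta=1.0, d=8, D1=5.0))
```

The second call shows why the argument is stated for sufficiently large $T$: with a short horizon and a high dimension, the logarithmic factor can dominate.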

B.5.2 Proof of Lemma 8

To simplify notation, for any event $F$, we define $\mathbb{P}^{\mathcal{G}_C}(F) = \mathbb{P}(E \cap \mathcal{G}_C \cap F)$.

Let $\mathcal{D}_C^1$ be the event that an arm $k \in \underline{\mathcal{I}}_C$ is eliminated at the end of batch $i$, and $\mathcal{D}_C^2$ be the event that an arm $k \notin \bar{\mathcal{I}}_C$ is not eliminated at the end of batch $i$. Consequently, we have

$$\mathbb{P}^{\mathcal{G}_C}(\mathcal{A}_C^c) = \mathbb{P}^{\mathcal{G}_C}(\mathcal{D}_C^1) + \mathbb{P}^{\mathcal{G}_C}\big((\mathcal{D}_C^1)^c \cap \mathcal{D}_C^2\big).$$

Recall $U(\tau, T, C) = 4\sqrt{\frac{\log(2T|C|^d)}{\tau}}$. By relation (12), we can write

\begin{align*}
m_{C,i}^{\star} &= (t_i - t_{i-1})\, \mathbb{P}_X(X \in C) \\
&= l_i |C|^{-(2\beta+d)} \log(T|C|^d)\, \mathbb{P}_X(X \in C),
\end{align*}

where $l_i > 0$ is a constant chosen such that $U(2m_{C,i}^{\star}, T, C) = 2c_0|C|^{\beta}$. Under $E$, we have $U(m_{C,i}, T, C) \leq 4c_0|C|^{\beta}$ because $m_{C,i} \geq \frac{1}{2} m_{C,i}^{\star}$.
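The factor-of-two relation between the two confidence radii can be checked numerically. The following is a minimal sketch, not part of the algorithm: the values of `T`, `d`, `width` (standing in for $|C|$), and `m_star` are arbitrary illustrative assumptions. It relies only on the identity $U(\tau/4, T, C) = 2U(\tau, T, C)$ and on $U$ being decreasing in $\tau$.

```python
import math

def U(tau, T, width, d):
    """Confidence radius U(tau, T, C) = 4 * sqrt(log(2 T |C|^d) / tau)."""
    return 4.0 * math.sqrt(math.log(2 * T * width**d) / tau)

# Illustrative values (assumptions, not taken from the paper).
T, d, width = 10_000, 2, 0.25
m_star = 400.0

# target plays the role of 2 * c0 * |C|^beta, i.e. U evaluated at 2 * m_star.
target = U(2 * m_star, T, width, d)

# Any realized sample size m >= m_star / 2 yields U(m) <= 2 * target,
# which corresponds to the bound 4 * c0 * |C|^beta in the text.
worst = max(U(m, T, width, d) for m in (0.5 * m_star, m_star, 2 * m_star))
```

Here `worst` is attained at the smallest admissible sample size `m_star / 2`, where the bound holds with equality.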

  1. Upper bounding $\mathbb{P}^{\mathcal{G}_C}(\mathcal{D}_C^1)$: when $\mathcal{D}_C^1$ occurs, an arm $k \in \underline{\mathcal{I}}_C$ is eliminated by some $k' \in \mathcal{I}'_{\mathsf{p}(C)}$ at the end of batch $i$. This means $\bar{Y}_{C,i}^{(k')} - \bar{Y}_{C,i}^{(k)} > U(m_{C,i}, T, C)$. Meanwhile,

    \[
    \bar{f}_C^{(k')} - \bar{f}_C^{(k)} \leq \bar{f}_C^{\star} - \bar{f}_C^{(k)} \overset{\mathrm{(i)}}{\leq} c_0|C|^{\beta} \leq \frac{1}{2} U(2m_{C,i}^{\star}, T, C),
    \]

    where step (i) uses the definition of $\underline{\mathcal{I}}_C$. Consequently, $|\bar{Y}_{C,i}^{(k')} - \bar{f}_C^{(k')}| \leq U(m_{C,i}, T, C)/4$ and $|\bar{Y}_{C,i}^{(k)} - \bar{f}_C^{(k)}| \leq U(m_{C,i}, T, C)/4$ cannot hold simultaneously. Otherwise, we would have $\bar{Y}_{C,i}^{(k')} - \bar{Y}_{C,i}^{(k)} \leq \frac{1}{2} U(2m_{C,i}^{\star}, T, C) + \frac{1}{2} U(m_{C,i}, T, C) \leq U(m_{C,i}, T, C)$, where the last step uses $m_{C,i} \leq 2m_{C,i}^{\star}$ under $E$. This contradicts $\bar{Y}_{C,i}^{(k')} - \bar{Y}_{C,i}^{(k)} > U(m_{C,i}, T, C)$. Therefore,

    \[
    \mathbb{P}^{\mathcal{G}_C}(\mathcal{D}_C^1) \leq \mathbb{P}\left\{\exists k \in \mathcal{I}'_{\mathsf{p}(C)},\ m_{C,i} \leq 2m_{C,i}^{\star} : |\bar{Y}_{C,i}^{(k)} - \bar{f}_C^{(k)}| \geq \frac{1}{4} U(m_{C,i}, T, C)\right\}.
    \]
  2. Upper bounding $\mathbb{P}^{\mathcal{G}_C}((\mathcal{D}_C^1)^c \cap \mathcal{D}_C^2)$: when $(\mathcal{D}_C^1)^c \cap \mathcal{D}_C^2$ happens, no arm in $\underline{\mathcal{I}}_C$ is eliminated while some $k \notin \bar{\mathcal{I}}_C$ remains in the active arm set. By definition, there exists $x^{(k)}$ such that $f^{\star}(x^{(k)}) - f^{(k)}(x^{(k)}) > 8c_0|C|^{\beta}$. Let $\eta(k)$ be any arm that satisfies $f^{\star}(x^{(k)}) = f^{(\eta(k))}(x^{(k)})$, and one can easily verify $\eta(k) \in \underline{\mathcal{I}}_C$. Since $k$ is not eliminated, we have $\bar{Y}_{C,i}^{(\eta(k))} - \bar{Y}_{C,i}^{(k)} \leq U(m_{C,i}, T, C)$. On the other hand,

    \begin{align*}
    \bar{f}_C^{(\eta(k))} &\overset{\mathrm{(iii)}}{\geq} f^{(\eta(k))}(x^{(k)}) - c_0|C|^{\beta} \\
    &\geq f^{(k)}(x^{(k)}) + 8c_0|C|^{\beta} - c_0|C|^{\beta} \\
    &= f^{(k)}(x^{(k)}) + 7c_0|C|^{\beta} \\
    &\overset{\mathrm{(iv)}}{\geq} \bar{f}_C^{(k)} + 6c_0|C|^{\beta} \geq \bar{f}_C^{(k)} + \frac{3}{2} U(m_{C,i}, T, C), \tag{19}
    \end{align*}

    where steps (iii) and (iv) use Lemma 10. Inequality (19) together with the fact that $\bar{Y}_{C,i}^{(\eta(k))} - \bar{Y}_{C,i}^{(k)} \leq U(m_{C,i}, T, C)$ implies $|\bar{Y}_{C,i}^{(k_0)} - \bar{f}_C^{(k_0)}| \geq U(m_{C,i}, T, C)/4$ for either $k_0 = k$ or $k_0 = \eta(k)$. Consequently,

    \[
    \mathbb{P}^{\mathcal{G}_C}\big((\mathcal{D}_C^1)^c \cap \mathcal{D}_C^2\big) \leq \mathbb{P}\left\{\exists k \in \mathcal{I}'_{\mathsf{p}(C)},\ m_{C,i} \leq 2m_{C,i}^{\star} : |\bar{Y}_{C,i}^{(k)} - \bar{f}_C^{(k)}| \geq \frac{1}{4} U(m_{C,i}, T, C)\right\}.
    \]

Combining the two parts, we obtain

\begin{align*}
\mathbb{P}^{\mathcal{G}_C}(\mathcal{A}_C^c) &= \mathbb{P}^{\mathcal{G}_C}(\mathcal{D}_C^1) + \mathbb{P}^{\mathcal{G}_C}\big((\mathcal{D}_C^1)^c \cap \mathcal{D}_C^2\big) \\
&\leq 2 \cdot \mathbb{P}\left\{\exists k \in \mathcal{I}'_{\mathsf{p}(C)},\ m_{C,i} \leq 2m_{C,i}^{\star} : |\bar{Y}_{C,i}^{(k)} - \bar{f}_C^{(k)}| \geq \frac{1}{4} U(m_{C,i}, T, C)\right\} \\
&\leq \frac{4m_{C,i}^{\star}}{T|C|^d},
\end{align*}

where the last inequality applies Lemma 9.

B.6 Auxiliary lemmas

Lemma 9.

For any $1 \leq i \leq M-1$ and $C \in \mathcal{B}_i$, one has

\[
\mathbb{P}\left\{\exists k \in \mathcal{I}'_{\mathsf{p}(C)},\ m_{C,i} \leq 2m_{C,i}^{\star} : |\bar{Y}_{C,i}^{(k)} - \bar{f}_C^{(k)}| \geq \frac{1}{4} U(m_{C,i}, T, C)\right\} \leq \frac{2m_{C,i}^{\star}}{T|C|^d}.
\]

Proof.

Recall that in Algorithm 1 we pull each arm in a round-robin fashion within a bin during batch $i$. Fix $\tau > 0$. Let $\bar{Y}_{\tau}^{(k)} = \sum_{j=1}^{\tau} Y_j^{(k)} / \tau$, where the $Y_j^{(k)}$'s are i.i.d. random variables with $Y_j^{(k)} \in [0,1]$ and $\mathbb{E}[Y_j^{(k)}] = \bar{f}_C^{(k)}$. By Hoeffding's inequality, with probability at most $1/(T|C|^d)$, we have

\[
|\bar{Y}_{\tau}^{(k)} - \bar{f}_C^{(k)}| \geq \sqrt{\frac{\log(2T|C|^d)}{2\tau}}.
\]

Applying the union bound over the arms and the possible values of $\tau$, we get

\[
\mathbb{P}\left\{\exists k \in \mathcal{I}_{\mathsf{p}(C)},\ 0 \leq \tau \leq m_{C,i}^{\star} : |\bar{Y}_{\tau}^{(k)} - \bar{f}_C^{(k)}| \geq \sqrt{\frac{\log(2T|C|^d)}{2\tau}}\right\} \leq \frac{2m_{C,i}^{\star}}{T|C|^d},
\]

which completes the proof. ∎
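The choice of threshold in the Hoeffding step can be verified with a short computation. The sketch below assumes the standard two-sided Hoeffding bound $2\exp(-2\tau t^2)$ for averages of $[0,1]$-valued variables; the numerical values of `T`, `d`, and `width` (standing in for $|C|$) are illustrative assumptions only.

```python
import math

def hoeffding_tail(tau, t):
    """Two-sided Hoeffding bound for the mean of tau i.i.d. variables in [0, 1]."""
    return 2.0 * math.exp(-2.0 * tau * t**2)

def deviation(tau, T, width, d):
    """Threshold sqrt(log(2 T |C|^d) / (2 tau)) from the proof of Lemma 9."""
    return math.sqrt(math.log(2 * T * width**d) / (2.0 * tau))

# Illustrative values (assumptions, not taken from the paper).
T, d, width = 10_000, 2, 0.25

# For every tau, plugging the threshold into the Hoeffding bound gives
# exactly 1 / (T |C|^d), independently of tau.
per_pair = [hoeffding_tail(tau, deviation(tau, T, width, d)) for tau in (1, 10, 100)]
```

Each entry of `per_pair` equals $1/(T|C|^d)$ up to floating-point error, so summing over two arms and at most $m_{C,i}^{\star}$ sample sizes gives the stated $2m_{C,i}^{\star}/(T|C|^d)$.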

Lemma 10.

Fix $k \in \{1, -1\}$ and $C \in \mathcal{T}$. For any $x \in C$, one has

\[
|\bar{f}_C^{(k)} - f^{(k)}(x)| \leq c_0|C|^{\beta},
\]

where $c_0 = 2Ld^{\beta/2} + 1$.

Proof.

For notational simplicity, we write $f$ for $f^{(k)}$ in the following proof. By definition,

\begin{align*}
|\bar{f}_C - f(x)| &= \left|\frac{1}{\mathbb{P}(C)} \int_C (f(y) - f(x))\, \mathsf{d}\mathbb{P}(y)\right| \\
&\leq \frac{1}{\mathbb{P}(C)} \int_C |f(y) - f(x)|\, \mathsf{d}\mathbb{P}(y) \\
&\leq \frac{1}{\mathbb{P}(C)} \int_C L\|x - y\|_2^{\beta}\, \mathsf{d}\mathbb{P}(y),
\end{align*}

where the first inequality uses the triangle inequality, and the second inequality is due to the smoothness condition. Since $x \in C$, we further have

\begin{align*}
|\bar{f}_C - f(x)| &\leq \frac{1}{\mathbb{P}(C)} \int_C L\|x - y\|_2^{\beta}\, \mathsf{d}\mathbb{P}(y) \\
&\leq \frac{1}{\mathbb{P}(C)} \int_C L d^{\beta/2} |C|^{\beta}\, \mathsf{d}\mathbb{P}(y) \\
&\leq c_0|C|^{\beta}.
\end{align*}

This completes the proof. ∎
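The bound of Lemma 10 can be illustrated on a one-dimensional example. This is a minimal sketch under assumptions that are not in the paper: the covariate is taken to be uniform on the bin, the bin is $[0.25, 0.5]$, and the reward function is the hypothetical $f(x) = L|x - 0.3|^{\beta}$ with $L = 1$ and $\beta = 1/2$, which satisfies the smoothness condition with these constants.

```python
L_const, beta = 1.0, 0.5     # smoothness constants (illustrative assumptions)
lo, hi = 0.25, 0.5           # the bin C in dimension d = 1
width = hi - lo

def f(x):
    # Hypothetical reward function; |x - 0.3|^beta is Holder with exponent beta.
    return L_const * abs(x - 0.3) ** beta

# Midpoint-rule approximation of the bin average under a uniform covariate law.
n = 10_000
grid = [lo + (j + 0.5) * width / n for j in range(n)]
values = [f(x) for x in grid]
f_bar = sum(values) / n

# Largest gap between the bin average and a pointwise value inside the bin.
max_gap = max(abs(f_bar - v) for v in values)

# L * d^{beta/2} * |C|^beta with d = 1; this is below c0 * |C|^beta.
bound = L_const * width ** beta
```

In this example `max_gap` stays below `bound`, and hence below $c_0|C|^{\beta}$, as the lemma predicts.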

Appendix C Proof of Theorem 4

As we argued after the statement of Theorem 4, one needs to set $t_1 \asymp T^{9/19}$ and $t_2 \asymp T^{15/19}$. Therefore, throughout the proof, we assume this is true and only focus on the number $g$ of bins.

To construct a hard instance, we partition [0,1]01[0,1][ 0 , 1 ] into z𝑧zitalic_z bins with equal width. Denote the bins by Cjsubscript𝐶𝑗C_{j}italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT for j=1,,z𝑗1𝑧j=1,...,zitalic_j = 1 , … , italic_z, and let qjsubscript𝑞𝑗q_{j}italic_q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT be the center of Cjsubscript𝐶𝑗C_{j}italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. Define a function ϕ:[0,1]:italic-ϕmaps-to01\phi:[0,1]\mapsto\mathbb{R}italic_ϕ : [ 0 , 1 ] ↦ blackboard_R as ϕ(x)=(1|x|)𝟏{|x|1}.italic-ϕ𝑥1𝑥1𝑥1\phi(x)=(1-|x|)\mathbf{1}\{|x|\leq 1\}.italic_ϕ ( italic_x ) = ( 1 - | italic_x | ) bold_1 { | italic_x | ≤ 1 } . Correspondingly define a function φj:[0,1]:subscript𝜑𝑗maps-to01\varphi_{j}:[0,1]\mapsto\mathbb{R}italic_φ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT : [ 0 , 1 ] ↦ blackboard_R as φj(x)=Dϕz1ϕ(2z(xqj))𝟏{xCj},subscript𝜑𝑗𝑥subscript𝐷italic-ϕsuperscript𝑧1italic-ϕ2𝑧𝑥subscript𝑞𝑗1𝑥subscript𝐶𝑗\varphi_{j}(x)=D_{\phi}z^{-1}\phi(2z(x-q_{j}))\mathbf{1}\{x\in C_{j}\},italic_φ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_x ) = italic_D start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_ϕ ( 2 italic_z ( italic_x - italic_q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) bold_1 { italic_x ∈ italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } , where Dϕ=min(21L,1/4)subscript𝐷italic-ϕsuperscript21𝐿14D_{\phi}=\min(2^{-1}L,1/4)italic_D start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT = roman_min ( 2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_L , 1 / 4 ). Define a function f:[0,1]:𝑓maps-to01f:[0,1]\mapsto\mathbb{R}italic_f : [ 0 , 1 ] ↦ blackboard_R:

\[
f(x)=\frac{1}{2}+\varphi_1(x).
\]

The problem instance of interest is $v=(f^{(1)}(x)=f(x),\ f^{(-1)}(x)=\frac{1}{2})$. It is easy to verify $v\in\mathcal{F}(1,1)$. Throughout the proof, we condition on the event $E$ specified by Lemma 7, which says the number of samples allocated to a bin concentrates well around its expectation. We will show that even under this good event, there exists a choice of $z$ that makes successive elimination fail to remove the suboptimal arms at the end of a batch with constant probability.
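The construction above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `phi` and `make_instance` are hypothetical, the default Lipschitz constant of 1 is an assumption, and we assume $C_1$ is the leftmost bin $[0,1/z]$, so that $q_1=1/(2z)$.

```python
def phi(x: float) -> float:
    """Tent function phi(x) = (1 - |x|) * 1{|x| <= 1}."""
    return max(0.0, 1.0 - abs(x))


def make_instance(z: int, lipschitz: float = 1.0):
    """Return (f_plus, f_minus, d_phi) for the instance v built from z bins.

    Assumption (not stated in the text): C_1 is the leftmost bin [0, 1/z],
    so its center is q_1 = 1 / (2z).
    """
    d_phi = min(lipschitz / 2.0, 0.25)  # D_phi = min(L / 2, 1 / 4)
    q1 = 1.0 / (2.0 * z)

    def f_plus(x: float) -> float:
        # f^{(1)}(x) = 1/2 + varphi_1(x); the bump vanishes outside C_1.
        if 0.0 <= x <= 1.0 / z:
            return 0.5 + (d_phi / z) * phi(2.0 * z * (x - q1))
        return 0.5

    def f_minus(x: float) -> float:
        # f^{(-1)}(x) = 1/2 everywhere.
        return 0.5

    return f_plus, f_minus, d_phi


f_plus, f_minus, d_phi = make_instance(z=4)
# Gap between the two arms at the center q_1 = 1/8 of C_1: D_phi / z.
peak_gap = f_plus(1.0 / 8.0) - f_minus(1.0 / 8.0)
```

In this sketch the bump has height $D_{\phi}z^{-1}$ and slope $2D_{\phi}\leq L$ on $C_1$, and the two arms coincide outside $C_1$.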

C.1 A helper lemma

We begin by presenting a helper lemma that will be used extensively in the later part of the proof. The claim is intuitive: if the sample size is small, it is not sufficient to tell apart two Bernoulli distributions with similar means. Then, in our context, arm elimination will not occur.

Lemma 11.

Fix any $B\subseteq[0,1]$ and $i\in\{1,2\}$, and assume $m_{B,i}\leq 2m_{B,i}^{\star}$. If $\bar{f}_{B}^{(1)}-\bar{f}_{B}^{(-1)}\leq\delta\leq 1/\sqrt{m_{B,i}^{\star}}$ for some $\delta>0$, then

\[
\mathbb{P}\left(\bar{Y}_{B,i}^{(1)}-\bar{Y}_{B,i}^{(-1)}>U(m_{B,i},T,B)\right)\leq\frac{t_i}{T}.
\]
Proof.

Fix $0<\tau\leq m_{B,i}^{\star}$. Let $\bar{Y}_{\tau}^{(k)}=\sum_{l=1}^{\tau}Y_{l}^{(k)}/\tau$, where the $Y_{l}^{(k)}$'s are i.i.d. random variables with $Y_{l}^{(k)}\in[0,1]$ and $\mathbb{E}[Y_{l}^{(k)}]=\bar{f}_{B}^{(k)}$ for $k\in\{1,-1\}$. Recall $U(\tau,T,B)=4\sqrt{\frac{\log(2T|B|)}{\tau}}$. (Footnote 3: We remark that the constant 4 is not essential for the proof to work. For any $c>0$, $c\log(2T|B|)=\log((2T|B|)^{c})$, so the final success probability is still tiny as long as $T$ is sufficiently large.) Then,

\[
\begin{aligned}
\mathbb{P}\left(\bar{Y}_{\tau}^{(1)}-\bar{Y}_{\tau}^{(-1)}>U(2\tau,T,B)\right)
&\overset{\mathrm{(i)}}{\leq}\mathbb{P}\left(\bar{Y}_{\tau}^{(1)}-\bar{Y}_{\tau}^{(-1)}>\delta+\sqrt{\frac{\log(2T/g)}{2\tau}}\right)\\
&\overset{\mathrm{(ii)}}{\leq}\mathbb{P}\left(\bar{Y}_{\tau}^{(1)}-\bar{Y}_{\tau}^{(-1)}>\bar{f}_{B}^{(1)}-\bar{f}_{B}^{(-1)}+\sqrt{\frac{\log(2T/g)}{2\tau}}\right)\\
&\overset{\mathrm{(iii)}}{\leq}\frac{g}{T},
\end{aligned}
\]

where step (i) is because $\delta\leq 1/\sqrt{m_{B,i}^{\star}}\leq 1/\sqrt{\tau}$, step (ii) is due to $\bar{f}_{B}^{(1)}-\bar{f}_{B}^{(-1)}\leq\delta$, and step (iii) uses Hoeffding's inequality. Applying a union bound over $\tau$, we get

\[
\mathbb{P}\left(\exists\, 0<\tau\leq m_{B,i}^{\star}:\ \bar{Y}_{\tau}^{(1)}-\bar{Y}_{\tau}^{(-1)}>U(2\tau,T,B)\right)\leq\frac{m_{B,i}^{\star}\,g}{T}\leq\frac{t_i}{T}.
\]

This finishes the proof. ∎
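Step (i) of the proof compares the threshold $U(2\tau,T,B)$ with $\delta+\sqrt{\log(2T/g)/(2\tau)}$ in the worst case $\delta=1/\sqrt{\tau}$. The sketch below checks that comparison numerically. It assumes $|B|=1/g$ (the proof uses $\log(2T|B|)$ and $\log(2T/g)$ interchangeably); the function names are hypothetical. Under that assumption the comparison reduces to $3\sqrt{\log(2T/g)/2}\geq 1$, which is why it needs $T/g$ not to be too small.

```python
import math


def U(tau: float, T: float, width: float) -> float:
    """Elimination threshold U(tau, T, B) = 4 * sqrt(log(2 T |B|) / tau)."""
    return 4.0 * math.sqrt(math.log(2.0 * T * width) / tau)


def step_i_holds(tau: float, T: float, g: float) -> bool:
    """Check U(2 tau, T, B) >= 1/sqrt(tau) + sqrt(log(2T/g) / (2 tau)).

    Assumption: the bin B has width |B| = 1/g, and delta takes its largest
    allowed value 1/sqrt(tau).
    """
    lhs = U(2.0 * tau, T, 1.0 / g)
    rhs = 1.0 / math.sqrt(tau) + math.sqrt(math.log(2.0 * T / g) / (2.0 * tau))
    return lhs >= rhs


# With a large horizon the comparison holds for every sample size tried.
all_ok = all(step_i_holds(tau, T=1e6, g=10.0) for tau in range(1, 1000))
```

When $2T/g$ is close to 1, the logarithm is tiny and the comparison can fail, for example `step_i_holds(1, T=1, g=1.9)` is false; this is consistent with the footnote's requirement that $T$ be sufficiently large.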

C.2 Three failure cases for $g$

Fix some small constant $\varepsilon>0$ to be specified later. From now on, we use $\hat{\pi}$ to denote $\hat{\pi}_{\mathrm{static}}$ for simplicity. We split the proof into three cases: (1) $g\geq T^{3/19+\varepsilon}$; (2) $g\leq T^{3/19-\varepsilon}$; and (3) $g\in(T^{3/19-\varepsilon},T^{3/19+\varepsilon})$.

Case 1: $g\geq T^{3/19+\varepsilon}$.

Set $z=T^{3/19-\varepsilon/2}$. Assume without loss of generality that $g=H\cdot z$ for some $H\geq 4$; see Figure 3 for an illustration of the instance. Suppose $C_1=\cup_{l=1}^{H}B_l$, where the $B_l$'s are the bins produced by $\hat{\pi}$ that lie in $C_1$. It is clear that

\[
\begin{aligned}
\mathbb{E}[R_T(\hat{\pi})]
&\overset{\mathrm{(i)}}{\geq}\mathbb{E}\left[\sum_{t=t_1+1}^{t_2}\left(f^{\star}(X_t)-f^{\hat{\pi}_t(X_t)}(X_t)\right)\right]\\
&\overset{\mathrm{(ii)}}{=}\mathbb{E}\left[\sum_{t=t_1+1}^{t_2}\left(f^{\star}(X_t)-f^{\hat{\pi}_t(X_t)}(X_t)\right)\mathbf{1}\{X_t\in C_1\}\right]\\
&\overset{\mathrm{(iii)}}{\geq}\sum_{t=t_1+1}^{t_2}\sum_{l=H/4}^{3H/4}\mathbb{E}\left[\left(f^{\star}(X_t)-f^{\hat{\pi}_t(X_t)}(X_t)\right)\mathbf{1}\{X_t\in B_l\}\right],
\end{aligned}
\tag{20}
\]

where step (i) is because the total regret is greater than the regret incurred during the second batch, step (ii) uses the fact that under the instance $v$, the mean rewards of the two arms differ only in $C_1$, and step (iii) arises since $C_1=\cup_{l=1}^{H}B_l$. Now we turn to lower bounding $\mathbb{E}\left[\left(f^{\star}(X_t)-f^{\hat{\pi}_t(X_t)}(X_t)\right)\mathbf{1}\{X_t\in B_l\}\right]$ for each $H/4\leq l\leq 3H/4$.

Consider any such $B_l$. We drop the subscript and write $B$ instead for simplicity. By the design of $v$, we have $\bar{f}_{B}^{(1)}-\bar{f}_{B}^{(-1)}\leq D_{\phi}z^{-1}=\delta$, which obeys $D_{\phi}z^{-1}\leq 1/\sqrt{m_{B,1}^{\star}}$, a consequence of the choice of $z$. Additionally, we have $m_{B,1}\leq 2m_{B,1}^{\star}$ under $E$. Therefore, we can invoke Lemma 11 to obtain

\[
\mathbb{P}\left(\bar{Y}_{B,1}^{(1)}-\bar{Y}_{B,1}^{(-1)}>U(m_{B,1},T,B)\right)\leq\frac{t_1}{T}\leq\frac{1}{2}.
\]

In words, with probability exceeding $1/2$, no elimination will happen for the bin $B$. As a result, we obtain

\[
\begin{aligned}
\mathbb{E}[R_T(\hat{\pi})]
&\geq\sum_{t=t_1+1}^{t_2}\sum_{l=H/4}^{3H/4}\mathbb{E}\left[\left(f^{\star}(X_t)-f^{\hat{\pi}_t(X_t)}(X_t)\right)\mathbf{1}\{X_t\in B_l\}\right]\\
&\gtrsim H\cdot\frac{t_2}{g}\cdot z^{-1}\asymp\frac{t_2}{z^{2}}=T^{\frac{9}{19}+\varepsilon},
\end{aligned}
\]

where we have used the choice of $z$. So Theorem 4 holds with $\kappa=\varepsilon$.
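The exponent in the last display of Case 1 is simple bookkeeping: with $t_2\asymp T^{15/19}$ and $z=T^{3/19-\varepsilon/2}$, the ratio $t_2/z^2$ has exponent $15/19-2(3/19-\varepsilon/2)$. The sketch below verifies this with exact rational arithmetic, tracking an exponent $a+b\varepsilon$ as the pair $(a,b)$; the helper name is hypothetical.

```python
from fractions import Fraction


def exponent_of_t2_over_z2(t2_exp, z_const, z_eps_coef):
    """Exponent of T in t2 / z^2, returned as (constant, coefficient of eps).

    t2 ~ T^{t2_exp} and z = T^{z_const + z_eps_coef * eps}.
    """
    const = t2_exp - 2 * z_const
    eps_coef = -2 * z_eps_coef
    return const, eps_coef


# Case 1: t2 ~ T^{15/19}, z = T^{3/19 - eps/2}.
const, eps_coef = exponent_of_t2_over_z2(
    Fraction(15, 19), Fraction(3, 19), Fraction(-1, 2)
)
print(const, eps_coef)  # prints "9/19 1", i.e. exponent 9/19 + eps
```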

Case 2: $g\leq T^{3/19-\varepsilon}$.

Set $z=T^{3/19-\varepsilon/8}$. We have $g<z$ and there exists $H>1$ such that $z=H\cdot g$; see Figure 4 for an illustration of the instance. Let $B$ be the bin produced by $\hat{\pi}$ such that $C_1\subset B$. By the design of $v$, we have

\[
\bar{f}_{B}^{(1)}-\bar{f}_{B}^{(-1)}\leq\frac{1}{H}\left(\frac{1}{2}+D_{\phi}z^{-1}\right)+\left(1-\frac{1}{H}\right)\frac{1}{2}-\frac{1}{2}=\frac{D_{\phi}z^{-1}}{H}.
\]

Letting $\delta=\frac{D_{\phi}z^{-1}}{H}$, we have $\delta\leq 1/\sqrt{m_{B,1}^{\star}}$ due to our choice of $z$. Additionally, we have $m_{B,1}\leq 2m_{B,1}^{\star}$ under $E$. Therefore, we can invoke Lemma 11 to obtain

\[
\mathbb{P}\left(\bar{Y}_{B,1}^{(1)}-\bar{Y}_{B,1}^{(-1)}>U(m_{B,1},T,B)\right)\leq\frac{t_1}{T}\leq\frac{1}{2}.
\]

Thus, with probability exceeding $1/2$, the suboptimal arm is not eliminated in $B$. Similar to the previous case, we obtain

\[
\begin{aligned}
\mathbb{E}[R_T(\hat{\pi})]
&\geq\mathbb{E}\left[\sum_{t=t_1+1}^{t_2}\left(f^{\star}(X_t)-f^{\hat{\pi}_t(X_t)}(X_t)\right)\right]\\
&=\mathbb{E}\left[\sum_{t=t_1+1}^{t_2}\left(f^{\star}(X_t)-f^{\hat{\pi}_t(X_t)}(X_t)\right)\mathbf{1}\{X_t\in C_1\}\right]\\
&\gtrsim\frac{t_2}{z^{2}}=T^{\frac{9}{19}+\frac{\varepsilon}{4}}.
\end{aligned}
\]

So Theorem 4 holds with $\kappa=\varepsilon/4$.
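The averaged-gap bound used in Case 2 can be checked numerically. The sketch below is illustrative only: it assumes uniformly distributed covariates, takes $C_1=[0,1/z]$ and $B=[0,H/z]$, and uses $D_{\phi}=1/4$; the function name `averaged_gap` is hypothetical. It averages $f^{(1)}-f^{(-1)}=\varphi_1$ over $B$ with a midpoint rule and compares the result with $D_{\phi}z^{-1}/H$.

```python
def phi(x: float) -> float:
    """Tent function phi(x) = (1 - |x|) * 1{|x| <= 1}."""
    return max(0.0, 1.0 - abs(x))


def averaged_gap(z: int, H: int, d_phi: float = 0.25, n: int = 20_000) -> float:
    """Average of varphi_1 over B = [0, H/z] under uniform covariates.

    Assumptions: C_1 = [0, 1/z] with center q_1 = 1/(2z), and B contains C_1.
    Uses a midpoint Riemann sum with n cells on B.
    """
    width_b = H / z
    q1 = 0.5 / z
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * width_b / n
        if x <= 1.0 / z:  # the bump is supported on C_1 only
            total += (d_phi / z) * phi(2.0 * z * (x - q1))
    return total / n


gap = averaged_gap(z=8, H=4)
bound = 0.25 / (8 * 4)  # D_phi * z^{-1} / H
```

Under these assumptions the average equals $D_{\phi}z^{-1}/(2H)$, half of the stated bound, because the tent $\phi$ has mean $1/2$ on its support; the bound in the text replaces the bump by its peak value.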

Case 3: $g\in(T^{3/19-\varepsilon},T^{3/19+\varepsilon})$.

Set $z\asymp T^{1/4}$. We then have $g<z$, as long as $\varepsilon\leq 1/19$. Moreover, there exists $H>1$ such that $z=H\cdot g$; see Figure 4 for an illustration of the instance. Let $B$ be the bin produced by $\hat{\pi}$ such that $C_1\subset B$. By the design of $v$, we have

\[
\bar{f}_{B}^{(1)}-\bar{f}_{B}^{(-1)}\leq\frac{1}{H}\left(\frac{1}{2}+D_{\phi}z^{-1}\right)+\left(1-\frac{1}{H}\right)\frac{1}{2}-\frac{1}{2}=\frac{D_{\phi}z^{-1}}{H}.
\]

Letting $\delta=\frac{D_{\phi}z^{-1}}{H}$, we have $\delta\leq 1/\sqrt{m_{B,1}^{\star}}$ due to our choice of $z$. Additionally, we have $m_{B,1}\leq 2m_{B,1}^{\star}$ under $E$. Therefore, we can invoke Lemma 11 to obtain

\[
\mathbb{P}\left(\bar{Y}_{B,1}^{(1)}-\bar{Y}_{B,1}^{(-1)}>U(m_{B,1},T,B)\right)\leq\frac{t_1}{T}\leq\frac{1}{4}.
\]

This means that with probability at least $3/4$, arm elimination does not occur in $B$ after the first batch. Moreover, since $\delta\leq 1/\sqrt{m_{B,2}^{\star}}$ by the choice of $z$, and $m_{B,2}\leq 2m_{B,2}^{\star}$ under $E$, we can apply Lemma 11 again to get

\[
\mathbb{P}\left(\bar{Y}_{B,2}^{(1)}-\bar{Y}_{B,2}^{(-1)}>U(m_{B,2},T,B)\right)\leq\frac{t_2}{T}\leq\frac{1}{4}.
\]

In all, with probability at least $1/2$, arm elimination does not occur in $B$ after the second batch. Similar to before, we reach the conclusion that

$$\begin{aligned}
\mathbb{E}[R_{T}(\hat{\pi})] &\geq \mathbb{E}\left[\sum_{t=t_{2}+1}^{T}\left(f^{\star}(X_{t})-f^{\hat{\pi}_{t}(X_{t})}(X_{t})\right)\right]\\
&= \mathbb{E}\left[\sum_{t=t_{2}+1}^{T}\left(f^{\star}(X_{t})-f^{\hat{\pi}_{t}(X_{t})}(X_{t})\right)\mathbf{1}\{X_{t}\in C_{1}\}\right]\\
&\gtrsim \frac{T}{z^{2}} = T^{\frac{1}{2}}.
\end{aligned}$$

We see that Theorem 4 holds with $\kappa=1/38$.

Appendix D Proof of Theorem 5

When $\beta=d=1$, we denote $\gamma(\alpha^{\star})=(\alpha^{\star}+1)/3$. Fix some small constant $\epsilon>0$ to be specified later. We deal with $M=2$ and $M=3$ separately. In both cases, we use the fact that the algorithm needs to provide its first batch size $t_{1}$ prior to the game, and design $\alpha^{\star}$ such that any choice of $t_{1}$ would fail.

D.1 When $M=2$

Under $M=2$, the theoretical optimal rate is $T^{(1-\gamma(\alpha^{\star}))/(1-\gamma(\alpha^{\star})^{2})}=T^{3/(\alpha^{\star}+4)}$.
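The exponent identity in the rate above can be verified by hand. The short derivation below is added for convenience; it uses only the definition $\gamma(\alpha^{\star})=(\alpha^{\star}+1)/3$ given at the start of this appendix. The two endpoint values are computed here as an illustration, and they appear to match the thresholds $T^{3/5}$ and $T^{3/4}$ used in the two cases that follow.

```latex
\begin{align*}
\frac{1-\gamma(\alpha^{\star})}{1-\gamma(\alpha^{\star})^{2}}
  &= \frac{1}{1+\gamma(\alpha^{\star})}
   = \frac{1}{1+\frac{\alpha^{\star}+1}{3}}
   = \frac{3}{\alpha^{\star}+4}, \\
\alpha^{\star}=1 &:\quad \frac{3}{\alpha^{\star}+4}=\frac{3}{5}, \\
\alpha^{\star}=0 &:\quad \frac{3}{\alpha^{\star}+4}=\frac{3}{4}.
\end{align*}
```

Since $3/(\alpha^{\star}+4)$ is decreasing in $\alpha^{\star}$, the optimal exponent ranges between $3/5$ and $3/4$ as $\alpha^{\star}$ moves from $1$ down to $0$.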

Case of $t_{1}=T^{3/5+\epsilon}$.

Take $\alpha^{\star}=1$. Since the fixed grid and the adaptive grid are the same when $M=2$, by relation (9), we have

$$\sup_{f\in\mathcal{F}(\alpha^{\star},\beta)}\mathbb{E}[R_{T}(\pi)]\gtrsim\max\left\{t_{1},\frac{T}{t_{1}^{\frac{\alpha^{\star}+1}{3}}}\right\}\geq t_{1}=T^{\frac{3}{5}+\kappa_{1}},$$

where $\kappa_{1}=\epsilon$.

Case of $t_{1}=T^{3/4-\epsilon}$.

Take $\alpha^{\star}=o(1)$. By relation (9), we have

$$\sup_{f\in\mathcal{F}(\alpha^{\star},\beta)}\mathbb{E}[R_{T}(\pi)]\gtrsim\max\left\{t_{1},\frac{T}{t_{1}^{\frac{\alpha^{\star}+1}{3}}}\right\}\geq\frac{T}{t_{1}^{\frac{\alpha^{\star}+1}{3}}}=T^{1-\left(\frac{3}{4}-\epsilon\right)\frac{\alpha^{\star}+1}{3}}=T^{\frac{3}{4}+\kappa_{1}},$$

where $\kappa_{1}=(\alpha^{\star}+1)\epsilon/3-\alpha^{\star}/4>0$.
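The value of $\kappa_{1}$ in this case follows from expanding the exponent. The steps below are a sketch added for the reader; the explicit smallness condition on $\alpha^{\star}$ in the last line is our own rearrangement (assuming $\epsilon<3/4$) and is not stated in the text, which only requires $\alpha^{\star}=o(1)$.

```latex
\begin{align*}
1-\left(\frac{3}{4}-\epsilon\right)\frac{\alpha^{\star}+1}{3}
  &= 1-\frac{\alpha^{\star}+1}{4}+\frac{(\alpha^{\star}+1)\epsilon}{3} \\
  &= \frac{3}{4}+\underbrace{\frac{(\alpha^{\star}+1)\epsilon}{3}-\frac{\alpha^{\star}}{4}}_{\kappa_{1}}, \\
\kappa_{1}>0
  &\iff \alpha^{\star}\left(\frac{1}{4}-\frac{\epsilon}{3}\right)<\frac{\epsilon}{3}
   \iff \alpha^{\star}<\frac{4\epsilon}{3-4\epsilon}.
\end{align*}
```

Because the optimal exponent $3/(\alpha^{\star}+4)$ is at most $3/4$, a lower bound of order $T^{3/4+\kappa_{1}}$ with $\kappa_{1}>0$ exceeds the optimal rate by a polynomial factor.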

D.2 When $M=3$

Under $M=3$, the theoretical optimal rate is $T^{(1-\gamma(\alpha^{\star}))/(1-\gamma(\alpha^{\star})^{3})}=T^{9/((\alpha^{\star})^{2}+5\alpha^{\star}+13)}$.
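As with $M=2$, the exponent can be checked directly from $\gamma(\alpha^{\star})=(\alpha^{\star}+1)/3$. The derivation below is an added sketch; the endpoint evaluations are ours and appear to correspond to the thresholds $T^{9/19}$ and $T^{9/13}$ in the two cases that follow.

```latex
\begin{align*}
\frac{1-\gamma(\alpha^{\star})}{1-\gamma(\alpha^{\star})^{3}}
  &= \frac{1}{1+\gamma(\alpha^{\star})+\gamma(\alpha^{\star})^{2}}
   = \frac{1}{1+\frac{\alpha^{\star}+1}{3}+\frac{(\alpha^{\star}+1)^{2}}{9}} \\
  &= \frac{9}{9+3(\alpha^{\star}+1)+(\alpha^{\star}+1)^{2}}
   = \frac{9}{(\alpha^{\star})^{2}+5\alpha^{\star}+13}, \\
\alpha^{\star}=1 &:\quad \frac{9}{1+5+13}=\frac{9}{19}, \\
\alpha^{\star}=0 &:\quad \frac{9}{0+0+13}=\frac{9}{13}.
\end{align*}
```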

Case of $t_{1}=T^{9/19+\epsilon}$.

Take $\alpha^{\star}=1$. During the first batch, the learner can do no better than pull an arm uniformly at random. We can use the instance $(f_{1}(x)=1,\ f_{2}(x)=0)$ so that

$$\sup_{f\in\mathcal{F}(\alpha^{\star},\beta)}\mathbb{E}[R_{T}(\pi)]\gtrsim t_{1}=T^{\frac{9}{19}+\kappa_{1}},$$

where $\kappa_{1}=\epsilon$.

Case of $t_{1}=T^{9/13-\epsilon}$.

Take $\alpha^{\star}=o(1)$. Denote $T_{2}=T^{1/(\gamma(\alpha^{\star})+1)}t_{1}^{\gamma(\alpha^{\star})/(\gamma(\alpha^{\star})+1)}$. Define the events $E_{2}=\{T_{2}<t_{2}\}$ and $E_{3}=\{t_{2}\leq T_{2}\}$. Recall $Q_{i}(\cdot)=\sum_{\omega\in\Omega_{s_{i}}}q_{i}(\omega)\mathbb{P}_{\pi,i,\omega}(\cdot)$ from (13), given $z_{i}>0$ and $s_{i}\coloneqq\lceil z_{i}^{d-\alpha\beta}\rceil$. Take $z_{2}=\lceil(36t_{1}2^{2})^{1/(2\beta+d)}\rceil$ and $z_{3}=\lceil(36T_{2}2^{2})^{1/(2\beta+d)}\rceil$. Since $E_{2}$ can be determined by observations up to $t_{1}$, we have

$$|Q_{2}(E_{2})-Q_{3}(E_{2})|=|Q_{2}^{t_{1}}(E_{2})-Q_{3}^{t_{1}}(E_{2})|\leq\mathrm{TV}(Q_{2}^{t_{1}},Q_{3}^{t_{1}})\overset{\mathrm{(i)}}{\leq}\frac{1}{2}\sqrt{t_{1}z_{2}^{-(2\beta+d)}}\overset{\mathrm{(ii)}}{\leq}\frac{1}{4},\tag{21}$$

where step (i) applies Lemma 2, and step (ii) is due to the definition of $z_{2}$. Consequently,

$$\begin{aligned}
Q_{2}(E_{2})+Q_{3}(E_{3}) &= Q_{2}(E_{2})-Q_{3}(E_{2})+Q_{3}(E_{2})+Q_{3}(E_{3})\\
&\geq -\frac{1}{4}+Q_{3}(E_{2})+Q_{3}(E_{3})=\frac{3}{4},
\end{aligned}$$

where the second step uses inequality (21), and the last equality holds because $E_{2}$ and $E_{3}$ partition the probability space. Hence at least one of $Q_{2}(E_{2})\geq 1/4$ or $Q_{3}(E_{3})\geq 1/4$ holds. If $Q_{2}(E_{2})\geq 1/4$, by Lemma 4 we obtain

$$\sup_{f\in\mathcal{F}(\alpha^{\star},\beta)}\mathbb{E}[R_{T}(\pi)]\geq\sup_{f\in\mathcal{C}_{z_{2}}}R_{T_{2}}(\pi;f)\gtrsim T_{2}z_{2}^{-\beta(1+\alpha^{\star})}\asymp T^{\frac{9}{13}+\kappa_{1}},$$

for some $\kappa_{1}>0$. If $Q_{3}(E_{3})\geq 1/4$, we similarly have

$$\sup_{f\in\mathcal{F}(\alpha^{\star},\beta)}\mathbb{E}[R_{T}(\pi)]\geq\sup_{f\in\mathcal{C}_{z_{3}}}R_{T_{3}}(\pi;f)\gtrsim Tz_{3}^{-\beta(1+\alpha^{\star})}\asymp T^{\frac{9}{13}+\kappa_{1}},$$

for some $\kappa_{1}>0$.