A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit

Priyank Agrawal [email protected] Theja Tulabandhula [email protected] Vashist Avadhanula [email protected] 500 W 120th St, New York, NY 10027 University Hall, 601 S Morgan St, Chicago, IL 60607 1100 Enterprise Way, Sunnyvale, CA 94089

Abstract

In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describe the products, and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision maker’s problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$ . Though this problem has recently attracted considerable attention, many existing methods often involve solving an intractable non-convex optimization problem. Their theoretical performance guarantees depend on a problem-dependent parameter which could be prohibitively large. In particular, current algorithms for this problem have regret bounded by $O(\sqrt{\kappa dT})$ , where $\kappa$ is a problem-dependent constant that may have an exponential dependency on the number of attributes, $d$ . In this paper, we propose an optimistic algorithm and show that the regret is bounded by $O(\sqrt{dT}+\kappa)$ , significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favorable regret guarantee. We also demonstrate that our algorithm has robust performance for varying $\kappa$ values through numerical experiments.

keywords:

Revenue management , OR in marketing , Multi-armed bandit , Multinomial Logit model , Sequential decision-making

Journal: European Journal of Operational Research

1 Introduction

Assortment optimization problems arise in many industries, and prominent examples include retailing and online advertising (see Alfandari et al. [2021], Timonina-Farkas et al. [2020], Wang et al. [2020], and Kök & Fisher [2007] for a detailed review). The problem faced by a decision-maker is that of selecting a subset (assortment) of items to offer from a universe of substitutable items such that the expected revenue is maximized. (Footnote 1: If all consumers have identical preferences towards the same characteristics of an item, then that item is termed substitutable.) In many e-commerce applications, the data on consumer choices tends to be either limited or non-existent (similar to the cold start problem in recommendation systems). Consumer preferences must be learned by experimenting with various assortments and observing consumer choices, but this experimentation with various assortments must be balanced to maximize cumulative revenue. Furthermore, in many settings, the retailer has to consider a very large number of products that are similar (examples range from apparel to consumer electronics). The commonality in their features can be expressed with the aid of auxiliary variables which summarize product attributes. This enables a significant reduction in dimensionality but introduces additional challenges in designing policies that have to dynamically balance demand learning (exploration) while simultaneously maximizing cumulative revenues (exploitation).

Motivated by these issues, we consider the dynamic assortment optimization problem. In every round, the retailer offers a subset (assortment) of products to a consumer and observes the consumer response. Consumers purchase at most one product from each assortment, choosing the product that maximizes their utility, and the retailer earns revenue from each successful purchase. We assume that the products are described by a set of attributes and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the retailer’s problem of dynamically learning the model parameters while optimizing cumulative revenues over the selling horizon $T$ . Specifically, we have a universe of $N$ substitutable items, and each item $i$ is associated with an attribute vector $x_{i}\in\mathbb{R}^{d},$ which is known a priori. The mean utility for the consumer for the product $i$ is given by the inner product $\theta\cdot x_{i},$ where $\theta\in\mathbb{R}^{d}$ is some fixed but initially unknown parameter vector. Each of the $d$ coordinates of $x_{i}$ for product $i$ represents a characteristic such as cost, popularity, brand, etc. Given the substitutable good assumption, the preferences of all consumers towards these characteristics are identical and denoted by the same parameter $\theta\in\mathbb{R}^{d}$ . (Footnote 2: This assumption may appear quite restrictive at first. But, as described in the following paragraph and Section 2.2, the model is rich enough to model non-identical consumer behavior as well.) Further, any two products $i$ and $j$ could vary in terms of these characteristics and hence are associated with different vectors $x_{i}$ and $x_{j}\in\mathbb{R}^{d}$ respectively. 
Our goal is to offer assortments $\mathcal{Q}_{1},\cdots,\mathcal{Q}_{T}$ at times $1,\cdots,T$ from a feasible collection of assortments such that the cumulative expected revenue of the retailer over the said horizon is maximized. In general, the feasible set of assortments can reflect the constraints of retailers and online platforms (such as cardinality, inventory availability and other related constraints).

For an intuitive understanding of the choice model, consider an example of an online furniture retailer that offers $N$ distinct products where the $i^{th}$ product has an attribute vector $x_{i}$ (in general, this attribute can vary over time, representing varying consumers’ choices, and is more appropriately represented by $x_{t,i}$ ). Suppose consumers query for a specific product category, say tables. In this example, the $\theta$ parameter will be a distinct vector corresponding to the product category: table. As discussed before, the true $\theta_{*}$ that determines consumer choice behavior is unknown. With each interaction with the consumer, the online retailer learns which of the $N$ products offers the most utility (captured by $\theta\cdot x_{i}$ for each product $i$ ) to the consumer by observing the past purchase decisions of the consumers. The online furniture retailer is constrained to offer at most $K$ of $N$ products in each interaction with the consumer. Such a constraint may be encountered in practical situations: the online consumer interface may be unable to display a large number of products, or the consumer may prefer to examine only a subset of products at a time. Out of $N$ furniture items, some particular table $j$ could have high utility, $\theta\cdot x_{j}$ , whereas, for some other product $k$ (say, a table with an unpopular color, bad design, or inferior material) $\theta\cdot x_{k}$ could be low. The consumer may purchase one or none of the $K$ presented products. Later in Section 2.2, we demonstrate that when the consumer’s propensity to purchase a specific product is driven by its utility, the retailer’s expected revenue at each round is given by a softmax function.

The rest of this section is organized as follows: We first describe the related literature and qualitative significance of the parameter $\kappa$ . Then, we highlight our contributions and end the section by contrasting them with recent notable research works.

1.1 Related literature

The MNL model is a widely used choice model for capturing consumer purchase behavior in assortment selection models (see Flores et al. [2019] and Avadhanula [2019]). Recently, large-scale field experiments at Alibaba [Feldman et al., 2018] have demonstrated the efficacy of the MNL model in boosting revenues. Rusmevichientong et al. [2010] and Sauré & Zeevi [2013] were a couple of early works that studied explore-then-commit strategies for the dynamic assortment selection problem under the MNL model when there are no contexts/product features. The works of Agrawal et al. [2019] and Agrawal et al. [2017] revisited this problem and presented adaptive online learning algorithms based on the Upper Confidence Bounds (UCB) and Thompson Sampling (TS) ideas. These approaches, unlike earlier ideas, did not require prior information about the problem parameters and had near-optimal regret bounds. Following these developments, the contextual variant of the problem has received considerable attention. Cheung & Simchi-Levi [2017] and Oh & Iyengar [2019] propose TS-based approaches and establish Bayesian regret bounds on their performance. (Footnote 3: Our results give a worst-case regret bound, which is strictly stronger than a Bayesian regret bound. Worst-case regret bounds directly imply Bayesian regret bounds with the same order dependence.) Chen et al. [2020] present a UCB-based algorithm and establish min-max regret bounds. However, these contextual MNL algorithms and their performance bounds depend on a problem parameter $\kappa$ that can be prohibitively large, even for simple real-life examples. See Figure 1 for an illustration and Section 1.2 for a detailed discussion.

Figure 1: Illustration of the impact of the $\kappa$ parameter (logistic case, multinomial logit case closely follows): A representative plot of the derivative of the reward function. The x-axis represents the linear function $x^{\top}\theta$ and the y-axis is proportional to $1/\kappa$ . Parameter $\kappa$ is small only in the narrow region around $0$ and grows arbitrarily large depending on the problem instance (captured by $x^{\top}\theta$ values).

We note that Ou et al. [2018] also consider a similar problem of developing an online algorithm for the MNL model with linear utility parameters. Though they establish a regret bound that does not depend on the aforementioned parameter $\kappa$ , they work with an inaccurate version of the MNL model. More specifically, in the MNL model, the probability of a consumer preferring an item is proportional to the exponential of the utility parameter and is not linear in the utility parameter as assumed in Ou et al. [2018].

The multi-armed bandit problem, which underlies these dynamic decision making settings, has been well studied in the literature (see Xu et al. [2021], Grant & Szechtman [2021]). Our problem is closely related to the parametric bandit problem, where a common unknown parameter connects the rewards of each arm. In particular, for linear bandits, each arm $a\in A$ (consider $A$ to be the set of all arms) is associated with a $d$ -dimensional vector $x_{a}\in\mathbb{R}^{d}$ known a priori. And the expected reward upon selecting arm $a\in A$ is given by the inner product $\theta\cdot x_{a}$ , for some unknown parameter vector $\theta$ (see Dani et al. [2008], Rusmevichientong & Tsitsiklis [2010], Abbasi-Yadkori et al. [2011]). The key difference is that the rewards (i.e., the revenue of the retailer) corresponding to an assortment under the MNL cannot be modeled in the framework of linear payoffs. Closer to our formulation is the literature on generalized linear bandits (see Filippi et al. [2010] and Faury et al. [2020]), where the expected payoff upon selecting arm $a$ is given by $f(\theta\cdot x_{a})$ , where $f$ is a real-valued, non-linear function. However, unlike our setting, where an arm could be a collection of $K$ products (thus involving $K$ $d$ -dimensional vectors), $f(.)$ is a single variable function in these prior works.

1.2 On the parameter $\kappa$

As discussed earlier, the retailer’s revenue (reward function) is the softmax function. Intuitively, the curvature of the reward function influences how easy (or difficult) it is to learn the true choice parameter $\theta_{*}$ . In Section 2, we explicitly define $\kappa$ as inversely proportional to the lower bound on the derivative of the reward function over the entire decision region. Existence of a global lower bound on the curvature of the reward function is a necessary assumption for the maximum likelihood estimation of $\theta_{*}$ .

In previous works on generalized linear bandits and variants [Filippi et al., 2010, Li et al., 2017, Oh & Iyengar, 2019], the quantity $\kappa$ features in regret guarantees as a multiplicative factor of the primary term (i.e., as $\tilde{\mathrm{O}}(\kappa\sqrt{T})$ ). This is because they ignore the local effect of the curvature and use global properties (via $\kappa$ ), leading to loose worst-case bounds. For a cleaner exposition of this issue, let us take $K=1$ , i.e., the rewards are given by a sigmoid function of $\theta\cdot x$ . The derivative of the sigmoid is “bell”-shaped (see Figure 1). When $\theta\cdot x$ is very high (i.e., the assortment contains products with high utilities) or when $\theta\cdot x$ is very low (i.e., the assortment contains products with low utility), the value of $\kappa$ will be large. From Assumption 2, for $K=1$ , $\kappa$ is equivalent to $\max\frac{1}{a(1-a)}$ , for some $a\in(0,1)$ . Thus, when $a$ is close to $1$ or $0$ , the value of $\kappa$ will be large. The exponential dependence for the $K=1$ case follows when we replace $a$ with a sigmoid function. In the context of our problem, this translates to an exponential dependence of the per-round regret on the magnitude of utilities (i.e., $\theta\cdot x$ ).
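To make the exponential growth concrete, the following minimal sketch evaluates the $K=1$ expression $1/(a(1-a))$ with $a$ replaced by a sigmoid. The function names and the idea of bounding $|\theta\cdot x|$ by a single number are illustrative assumptions, not part of the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: the K = 1 purchase probability for utility z."""
    return 1.0 / (1.0 + math.exp(-z))


def kappa_logistic(max_abs_utility: float) -> float:
    """Worst case of 1 / (a (1 - a)) with a = sigmoid(z), over |z| <= max_abs_utility.

    a (1 - a) is symmetric in z and decreasing in |z|, so the worst case sits
    at the boundary of the utility range (an assumed bound, for illustration).
    """
    a = sigmoid(max_abs_utility)
    return 1.0 / (a * (1.0 - a))


if __name__ == "__main__":
    for bound in (0.0, 2.0, 5.0, 10.0):
        print(f"max |theta.x| = {bound:4.1f} -> kappa = {kappa_logistic(bound):.1f}")
```

Algebraically, $1/(a(1-a)) = e^{z}+2+e^{-z}$ for $a=\sigma(z)$, which is why the value grows exponentially in the utility bound.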

1.3 Contributions

In this paper, we build on recent developments for generalized linear bandits (Faury et al. [2020]) to propose a new optimistic algorithm, CB-MNL, for the problem of contextual multinomial logit bandits. CB-MNL follows the standard template of optimistic parameter search strategies (also known as optimism in the face of uncertainty approaches) [Abbasi-Yadkori et al., 2011, Abeille et al., 2021]. We use Bernstein-style concentration for self-normalized martingales, which were previously proposed in the context of scalar logistic bandits in Faury et al. [2020], to define our confidence set over the true parameter, taking into account the effects of the local curvature of the reward function. We show that the performance of CB-MNL (as measured by regret) is bounded as $\tilde{\mathrm{O}}\left(d\sqrt{T}+\kappa\right)$ , significantly improving the theoretical performance over existing algorithms where $\kappa$ appears as a multiplicative factor in the leading term. We also leverage a self-concordance-like [Bach, 2010] relation for the multinomial logit reward function [Zhang & Lin, 2015], which helps us limit the effect of $\kappa$ on the final regret upper bound to only the higher-order terms. Finally, we propose a different convex confidence set for the optimization problem in the decision set of CB-MNL, which reduces the optimization problem to a constrained convex problem.

In summary, our work establishes strong worst-case regret guarantees by carefully accounting for local gradient information and using second-order function approximation for the estimation error.

1.4 Comparison with notable prior works

Comparison with Filippi et al. [2010] Our setting is different from the standard generalized linear bandit of Filippi et al. [2010]. In our setting, the reward due to an action (assortment) can be dependent on up to $K$ variables ( $\theta_{*}\cdot x_{t,i},\,i\in\mathcal{Q}_{t}$ ) instead of a single variable. Further, we focus on removing the multiplicative dependence on $\kappa$ from the regret bounds. This leads to a more involved technical treatment in our work.

Comparison with Oh & Iyengar [2019] The Thompson Sampling based approach is inherently different from our Optimism in the face of uncertainty (OFU) style Algorithm CB-MNL. However, the main result in Oh & Iyengar [2019] also relies on a confidence set based analysis along the lines of Filippi et al. [2010] but has a multiplicative $\kappa$ factor in the bound.

Comparison with Faury et al. [2020] Faury et al. [2020] use a bonus term for optimization in each round, and their algorithm performs non-trivial projections on the admissible log-odds. While we do reuse the Bernstein-style concentration inequality as proposed by them, their results do not seem to extend directly to the MNL setting without requiring significantly more work. Further, our algorithm CB-MNL performs an optimistic parameter search for making decisions instead of using a bonus term, which allows for a cleaner and shorter analysis.

Comparison with Oh & Iyengar [2021] While the authors in Oh & Iyengar [2021] provide sharper bounds by a factor of $\tilde{\mathrm{O}}(\sqrt{d})$ , they still retain the $\kappa$ multiplicative factor in their regret bounds. Their focus is on improving the dependence on the dimension parameter $d$ for the dynamic assortment optimization problem.

Comparison with Abeille et al. [2021] Abeille et al. [2021] recently proposed the idea of convex relaxation of the confidence set for the more straightforward logistic bandit setting. Our work can be viewed as an extension of their construction to the MNL setting.

Comparison with Amani & Thrampoulidis [2021] While the authors in Amani & Thrampoulidis [2021] also extend the algorithms of Faury et al. [2020] to a multinomial problem, their setting is materially different from ours. They model various click-types for the same advertisement (action) via the multinomial distribution. Further, they consider actions played at each round to be non-combinatorial, i.e., a single action as opposed to a bundle of actions, which differs from the assortment optimization setting in this work. Therefore, their approach and technical analysis are different from ours.

2 Preliminaries

2.1 Notations

For a vector $x\,\in\,\mathbb{R}^{d}$ , $x^{\top}$ denotes the transpose. Given a positive definite matrix $\mathbf{M}\,\in\,\mathbb{R}^{d\times d}$ , the induced norm is given by $||x||_{\mathbf{M}}=\sqrt{x^{\top}\mathbf{M}x}$ . For two symmetric matrices $\mathbf{M_{1}}$ and $\mathbf{M_{2}}$ , $\mathbf{M_{1}}\succeq\mathbf{M_{2}}$ means that $\mathbf{M_{1}}-\mathbf{M_{2}}$ is positive semi-definite. For any positive integer $n$ , $[n]\coloneqq\{1,2,3,\cdots,n\}$ . $\mathbf{I}_{d}$ denotes an identity matrix of dimension $d\times d$ . The platform (i.e. the learner) is referred to using the pronouns she/her/hers.

2.2 Model setting

Rewards Model:

At every round $t$ , the platform (learner) is presented with a set $\mathcal{N}$ of distinct items, indexed by $i\,\in\,[N]$ and their attribute vectors (contexts): $\{x_{t,i}\}_{i=1}^{N}$ such that $\forall\,i\,\in[N],\,x_{t,i}\,\in\,\mathbb{R}^{d}$ , where $N=|\mathcal{N}|$ is the cardinality of set $\mathcal{N}$ . The platform then selects an assortment $\mathcal{Q}_{t}\subset\mathcal{N}$ and the interacting consumer (environment) offers the reward $r_{t}$ to the platform. The assortments have a cardinality of at most $K$ , i.e. $|\mathcal{Q}_{t}|\leq K$ . The platform’s decision is based on the entire history of interaction. The history is represented by the filtration set $\mathcal{F}_{t}\coloneqq\{\mathcal{F}_{0},\sigma(\{\{x_{s,i}\}_{i=1}^{N},\mathcal{Q}_{s}\}_{s=1}^{t-1})\}$ , where $\mathcal{F}_{0}$ is any prior information available to the platform. (Footnote 4: $\sigma(\{\cdot\})$ denotes the $\sigma$ -algebra set over the sequence $\{\cdot\}$ .) The interaction lasts for $t=1,2,\cdots,T$ rounds. Conditioned on $\mathcal{F}_{t}$ , the reward $r_{t}$ is a binary vector such that $r_{t}\,\in\,\{0,1\}^{N}$ and the vector $\{r_{t,i}\}_{i\in\mathcal{Q}_{t}}$ follows a multinomial distribution. We have $r_{t,i}=0,\forall\,i\,\notin\,\mathcal{Q}_{t}$ . Specifically, the probability that $r_{t,i}=1,\forall\,i\,\in\,\mathcal{Q}_{t}$ is given by the softmax function:

\displaystyle\mathbb{P}(r_{t,i}=1|\mathcal{Q}_{t},\mathcal{F}_{t})=\mu_{i}(\mathcal{Q}_{t},\theta_{*})\coloneqq\frac{\exp(x_{t,i}^{\top}\theta_{*})}{1+\sum_{j\in\mathcal{Q}_{t}}\exp(x_{t,j}^{\top}\theta_{*})},

(1)

where $\theta_{*}$ is an unknown time-invariant parameter. The numeral $1$ in the denominator accounts for the case when the consumer purchases none of the items in the assortment. By definition, $\sum_{i\in\mathcal{Q}_{t}}r_{t,i}\leq 1$ , i.e., $r_{t}$ is multinomial with a single trial. (Footnote 5: Each item $i$ is also associated with a price (or revenue) parameter, $p_{t,i}$ for round $t$ . We assume $p_{t,i}=1$ for all items and rounds for an uncluttered exposition of results. If $p_{t,i}$ is not $1$ , then it features as a fixed factor in the definition of $\mu_{i}(\cdot)$ and the analysis follows exactly as presented here for $p_{t,i}=1$ .) Also, the expected revenue due to the assortment $\mathcal{Q}_{t}$ is given by:

\mu(\mathcal{Q}_{t},\theta_{*})\coloneqq\sum_{i\in\mathcal{Q}_{t}}\mu_{i}(\mathcal{Q}_{t},\theta_{*}).

(2)

Also, $\{x_{t,i}\}$ may vary adversarially in each round in our model, unlike in Li et al. [2017], where the attribute vectors are assumed to be drawn from an unknown i.i.d. distribution. When $K=1$ , the above model reduces to the case of the logistic bandit.

Choice Modeling Perspective:

Eq (1) can be considered from a discrete choice modeling viewpoint, where the platform presents an assortment of items to a user, and the user selects at most one item from this assortment. In this interpretation, the probability of choosing an item $i$ is given by $\mu_{i}(\mathcal{Q}_{t},\theta_{*})$ . Likewise, the probability of the user not selecting any item is given by: $1/(1+\sum_{j\in\mathcal{Q}_{t}}\exp(x_{t,j}^{\top}\theta_{*}))$ . The platform is motivated to offer such an assortment that the user’s propensity to make a successful selection is high.
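As an illustration of Eqs (1) and (2), the sketch below computes the per-item choice probabilities, the expected revenue under unit prices, and the no-purchase probability for one assortment. The function names and the list-of-lists context format are assumptions made for this example.

```python
import math
from typing import List, Sequence


def mnl_probabilities(contexts: Sequence[Sequence[float]],
                      theta: Sequence[float]) -> List[float]:
    """Return mu_i(Q, theta) for every item of the offered assortment (Eq (1))."""
    utilities = [sum(x_k * th_k for x_k, th_k in zip(x, theta)) for x in contexts]
    weights = [math.exp(u) for u in utilities]
    denominator = 1.0 + sum(weights)  # the 1 stands for the no-purchase option
    return [w / denominator for w in weights]


def expected_revenue(contexts: Sequence[Sequence[float]],
                     theta: Sequence[float]) -> float:
    """Return mu(Q, theta) from Eq (2), assuming unit prices."""
    return sum(mnl_probabilities(contexts, theta))


def no_purchase_probability(contexts: Sequence[Sequence[float]],
                            theta: Sequence[float]) -> float:
    """Probability that the user selects none of the offered items."""
    return 1.0 - expected_revenue(contexts, theta)
```

With two items and $\theta=0$, every outcome (item 1, item 2, no purchase) has probability $1/3$, so the expected revenue is $2/3$.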

Regret:

The platform does not know the value of $\theta_{*}$ . Our learning algorithm CB-MNL (see Algorithm 1) sequentially makes the assortment selection decisions, $\mathcal{Q}_{1},\mathcal{Q}_{2},\cdots,\mathcal{Q}_{T}$ so that the cumulative expected revenue $\sum_{t=1}^{T}\mu(\mathcal{Q}_{t},\theta_{*})$ is high. Its performance is quantified by pseudo-regret, which is the gap between the expected revenue generated by the algorithm and that of the optimal assortments in hindsight. The learning goal is to minimize the cumulative pseudo-regret up to time $T$ , defined as:

\mathbf{R}_{T}\coloneqq\sum_{t=1}^{T}[\mu(\mathcal{Q}_{t}^{*},\theta_{*})-\mu(\mathcal{Q}_{t},\theta_{*})],

(3)

where $\mathcal{Q}_{t}^{*}$ is the offline optimal assortment at round $t$ under full information of $\theta_{*}$ , defined as: $\mathcal{Q}_{t}^{*}\coloneqq\operatorname*{argmax}_{\mathcal{Q}\subset\mathcal{N}}\mu(\mathcal{Q},\theta_{*}).$
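The pseudo-regret of Eq (3) can be evaluated in simulation when $\theta_{*}$ is known. The sketch below finds $\mathcal{Q}_{t}^{*}$ by brute-force enumeration of all assortments of size at most $K$, which is only practical for small $N$ and $K$; the helper names and the use of index tuples for assortments are assumptions for illustration.

```python
import math
from itertools import combinations


def revenue(assortment, contexts, theta):
    """Expected revenue mu(Q, theta) of an assortment given as item indices."""
    weights = [math.exp(sum(a * b for a, b in zip(contexts[i], theta)))
               for i in assortment]
    return sum(weights) / (1.0 + sum(weights))


def best_assortment(contexts, theta, K):
    """Offline optimum: enumerate every non-empty assortment with at most K items."""
    n = len(contexts)
    candidates = (c for size in range(1, K + 1)
                  for c in combinations(range(n), size))
    return max(candidates, key=lambda c: revenue(c, contexts, theta))


def pseudo_regret(played, contexts_per_round, theta_star, K):
    """Sum over rounds of mu(Q_t^*, theta_*) - mu(Q_t, theta_*), as in Eq (3)."""
    total = 0.0
    for q_t, contexts in zip(played, contexts_per_round):
        q_star = best_assortment(contexts, theta_star, K)
        total += revenue(q_star, contexts, theta_star) - revenue(q_t, contexts, theta_star)
    return total
```

Each summand is non-negative by construction, since $\mathcal{Q}_{t}^{*}$ maximizes the same revenue function that scores the played assortment.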

As in the case of contextual linear bandits [Abbasi-Yadkori et al., 2011, Chu et al., 2011], the emphasis here is to make good sequential decisions while tracking the true parameter $\theta_{*}$ with a close estimate $\hat{\theta}_{t}$ (see Section 2.4). Our algorithm (like others) does not necessarily improve the estimate at each round. However, it ensures that $\theta_{*}$ always lies within a confidence set around the estimate $\hat{\theta}_{t}$ (with high probability), and the subsequent analysis demonstrates that the aggregate prediction error over all $T$ rounds is bounded.

Our model is fairly general, as the contextual information $x_{t,i}$ may be used to model combined information of the item $i$ in the set $\mathcal{N}$ and the user at round $t$ . Suppose the user at round $t$ is represented by a vector $v_{t}$ and the item $i$ has attribute vector as $w_{t,i}$ , then $x_{t,i}=\text{vec}(v_{t}w_{t,i}^{\top})$ (vectorized outer product of $v_{t}$ and $w_{t,i}$ ). We assume that the platform knows the interaction horizon $T$ .
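A minimal sketch of the joint user-item context follows. It flattens the outer product row by row, which is an assumed convention (a column-major $\text{vec}$ only permutes the coordinates); the function names are hypothetical.

```python
from typing import List, Sequence


def joint_context(user: Sequence[float], item: Sequence[float]) -> List[float]:
    """Flatten the outer product v w^T row by row into a single context vector."""
    return [v_k * w_l for v_k in user for w_l in item]


def utility(theta: Sequence[float], user: Sequence[float],
            item: Sequence[float]) -> float:
    """Inner product theta . x for x = vec(v w^T); theta has len(user) * len(item) entries."""
    return sum(t * x for t, x in zip(theta, joint_context(user, item)))
```

Under this construction the utility is bilinear in the user and item vectors, so the same $\theta$ can produce different rankings of items for different users.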
Additional notations: $\mathbf{X}_{\mathcal{Q}_{t}}$ denotes a design matrix whose columns are the attribute vectors ( $x_{t,i}$ ) of the items in the assortment $\mathcal{Q}_{t}$ . Also, we now denote $\mu(\mathcal{Q}_{t},\theta_{*})$ as $\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})$ to signify that $\mu(\mathcal{Q}_{t},\theta_{*}):\mathbb{R}^{|\mathcal{Q}_{t}|}\to\mathbb{R}$ .

2.3 Assumptions

Following Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2019], Faury et al. [2020], we introduce the following assumptions on the problem structure.

Assumption 1 (Bounded parameters).

$\theta_{*}\,\in\,\Theta$ , where $\Theta$ is a compact subset of $\mathbb{R}^{d}$ . $S\coloneqq\max_{\theta\in\Theta}||\theta||_{2}$ is known to the learner. Further, $||x_{t,i}||_{2}\leq 1$ for all values of $t$ and $i$ .

This assumption simplifies analysis and removes scaling constants from the equations.

Assumption 2.

There exists $\kappa>0$ such that for every item $i\,\in\,\mathcal{Q}_{t}$ and for any $\mathcal{Q}_{t}\subset\mathcal{N}$ and all rounds $t$ :

\inf_{\mathcal{Q}_{t}\subset\mathcal{N},\theta\in\mathbb{R}^{d}}\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta)(1-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta))\geq\frac{1}{\kappa}.

Note that $\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta)(1-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta))$ denotes the derivative of the softmax function along the $i^{th}$ direction. This assumption is necessary from likelihood theory [Lehmann & Casella, 2006], as it ensures that the Fisher information matrix for the estimation of $\theta_{*}$ is invertible for all possible input instances. We refer to Oh & Iyengar [2019] for a detailed discussion in this regard. We denote $L$ and $M$ as the upper bounds on the first and second derivatives of the softmax function along any component, respectively. We have $L,M\leq 1$ [Gao & Pavel, 2017] for all problem instances.

2.4 Maximum likelihood estimate

CB-MNL, described in Algorithm 1, uses a regularized maximum likelihood estimator to compute an estimate $\hat{\theta}_{t}$ of $\theta_{*}$ . Since $\{r_{t,i}\}_{i\in\mathcal{Q}_{t}}$ follows a multinomial distribution, the regularized log-likelihood (negative cross entropy loss) function, till the $(t-1)^{th}$ round, under parameter $\theta$ could be written as:

\displaystyle\mathcal{L}_{t}^{\lambda_{t}}(\theta)=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}r_{s,i}\log(\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta))-\frac{\lambda_{t}}{2}||\theta||_{2}^{2},

(4)

$\mathcal{L}_{t}^{\lambda_{t}}(\theta)$ is concave in $\theta$ for $\lambda_{t}>0$ , and the maximum likelihood estimator is given by calculating the critical point of $\mathcal{L}_{t}^{\lambda_{t}}(\theta)$ . Setting $\nabla_{\theta}\mathcal{L}_{t}^{\lambda_{t}}(\theta)=0$ , we get $\hat{\theta}_{t}$ as the solution of:

\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}[\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\hat{\theta}_{t})-r_{s,i}]x_{s,i}+\lambda_{t}\hat{\theta}_{t}=0.

(5)

For future analysis we also define

\displaystyle g_{t}(\theta)\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)x_{s,i}+\lambda_{t}\theta,\quad g_{t}(\hat{\theta}_{t})\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}r_{s,i}x_{s,i}.

(6)

At the start of the interaction, when no contexts have been observed, $\hat{\theta}_{t}$ is well-defined by Eq (5) when $\lambda_{t}>0$ . Therefore, the regularization parameter $\lambda_{t}$ makes CB-MNL burn-in period free, in contrast to some previous works, e.g. Filippi et al. [2010].
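One way to solve Eq (5) numerically is a fixed-step descent on its left-hand side, sketched below. The solver, the step-size rule, the iteration count, and the `history` format (a list of `(contexts, rewards)` pairs restricted to the offered items) are all assumptions for illustration; the paper does not prescribe a particular solver.

```python
import math


def mnl_probs(contexts, theta):
    """Choice probabilities mu_i for the offered items (Eq (1))."""
    weights = [math.exp(sum(a * b for a, b in zip(x, theta))) for x in contexts]
    denominator = 1.0 + sum(weights)
    return [w / denominator for w in weights]


def score(history, theta, lam):
    """Left-hand side of Eq (5): sum_s sum_i (mu_i - r_{s,i}) x_{s,i} + lam * theta."""
    g = [lam * th for th in theta]
    for contexts, rewards in history:
        probs = mnl_probs(contexts, theta)
        for x, mu_i, r_i in zip(contexts, probs, rewards):
            for k in range(len(theta)):
                g[k] += (mu_i - r_i) * x[k]
    return g


def fit_theta(history, d, lam=1.0, iters=500):
    """Drive score(...) to zero by repeated steps against it, starting from 0.

    The step 1 / (lam + number of rounds) is a conservative choice that
    assumes ||x_{s,i}||_2 <= 1, as in Assumption 1.
    """
    theta = [0.0] * d
    step = 1.0 / (lam + len(history))
    for _ in range(iters):
        g = score(history, theta, lam)
        theta = [th - step * g_k for th, g_k in zip(theta, g)]
    return theta
```

With an empty history the routine returns the zero vector, matching the observation that $\hat{\theta}_{t}$ is well-defined before any context is observed whenever $\lambda_{t}>0$.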

2.5 Confidence sets

Algorithm 1 follows the template of optimism in the face of uncertainty (OFU) strategies [Auer et al., 2002, Filippi et al., 2010, Faury et al., 2020]. Technical analysis of OFU algorithms relies on two key factors: the design of the confidence set and the ease of choosing an action using the confidence set.

In Section 4, we derive $E_{t}(\delta)$ (defined below) as the confidence set on $\theta_{*}$ such that $\theta_{*}\in E_{t}(\delta),\,\forall t$ with probability at least $1-\delta$ (randomness is over user choices). $E_{t}(\delta)$ is used for making decisions at each round (see Eq (12)) by CB-MNL in Algorithm 1:

E_{t}(\delta)\coloneqq\{\theta\in\Theta,\,\mathcal{L}^{\lambda_{t}}_{t}(\hat{\theta}_{t})-\mathcal{L}^{\lambda_{t}}_{t}(\theta)\leq\beta^{2}_{t}(\delta)\},

(7)

where $\beta_{t}(\delta)\coloneqq\gamma_{t}(\delta)+\frac{\gamma_{t}^{2}(\delta)}{\lambda_{t}}$ , and

\displaystyle\gamma_{t}(\delta)\coloneqq\frac{\sqrt{\lambda_{t}}}{2}+\frac{2}{\sqrt{\lambda_{t}}}\log\left(\frac{(\lambda_{t}+LKt/d)^{d/2}\lambda_{t}^{-d/2}}{\delta}\right)+\frac{2d}{\sqrt{\lambda_{t}}}\log(2).

(8)
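The radii $\gamma_{t}(\delta)$ and $\beta_{t}(\delta)$ are closed-form and cheap to evaluate. The sketch below implements Eq (8) and the definition of $\beta_{t}(\delta)$ directly; the function names and the default $L=1$ (the upper bound stated in Section 2.3) are choices made for this example.

```python
import math


def gamma_t(t: int, d: int, K: int, lam: float, delta: float, L: float = 1.0) -> float:
    """Confidence radius gamma_t(delta) from Eq (8)."""
    # log((lam + L K t / d)^(d/2) * lam^(-d/2) / delta), expanded for numerical stability
    log_term = (d / 2.0) * math.log((lam + L * K * t / d) / lam) + math.log(1.0 / delta)
    return (math.sqrt(lam) / 2.0
            + (2.0 / math.sqrt(lam)) * log_term
            + (2.0 * d / math.sqrt(lam)) * math.log(2.0))


def beta_t(t: int, d: int, K: int, lam: float, delta: float, L: float = 1.0) -> float:
    """Radius beta_t(delta) = gamma_t(delta) + gamma_t(delta)^2 / lam."""
    g = gamma_t(t, d, K, lam, delta, L)
    return g + g * g / lam
```

For fixed $\lambda_{t}$, the radius depends on $t$ only through the logarithmic term.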

A confidence set similar to $E_{t}(\delta)$ in Eq (7) was recently proposed in Abeille et al. [2021] for the simpler logistic bandit setting. Here, we extend its construction to the MNL setting. The set $E_{t}(\delta)$ is convex since the log-loss function is convex. This makes the decision step in Eq (12) a constrained convex optimization problem. However, it is difficult to prove bounds directly with $E_{t}(\delta)$ . Therefore we leverage a result in Faury et al. [2020], where the authors proposed a new Bernstein-like tail inequality for self-normalized vectorial martingales (see Appendix A.1), to derive another confidence set on $\theta_{*}$ :

C_{t}(\delta)\coloneqq\{\theta\in\Theta,\,||g_{t}(\theta)-g_{t}(\hat{\theta}_{t})||_{\mathbf{H}_{t}^{-1}(\theta)}\leq\gamma_{t}(\delta)\}.

(9)

where

\mathbf{H}_{t}(\theta_{1})\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}.

(10)

$\dot{\mu}_{i}(\cdot)$ is the partial derivative of $\mu_{i}$ in the direction of the $i^{th}$ component of the assortment and $\gamma_{t}(\delta)$ is defined in Eq (8). The value of $\gamma_{t}(\delta)$ is an outcome of the concentration result of Faury et al. [2020]. As a consequence of this concentration, we have $\theta_{*}\,\in\,C_{t}(\delta)$ with probability at least $1-\delta$ (randomness is over user choices). The Bernstein-like concentration inequality used here is similar to Theorem 1 of Abbasi-Yadkori et al. [2011] with the difference that we take into account local variance information (hence local curvature information of the reward function) in defining $\mathbf{H}_{t}$ . The above discussion is formalized in Appendix A.1.

The set $C_{t}(\delta)$ is non-convex, which follows from the non-linearity of $\mathbf{H}^{-1}_{t}(\theta)$ . We use $C_{t}(\delta)$ directly to prove regret guarantees. In Section 4.3, we show how the convex set $E_{t}(\delta)$ is related to $C_{t}(\delta)$ and shares many of its useful properties. Till then, to maintain ease of technical flow and to compare it with the previous work Faury et al. [2020], we assume that the algorithm uses $C_{t}(\delta)$ as the confidence set. We highlight that for the confidence sets, $C_{t}(\delta)$ and $E_{t}(\delta)$ , Algorithm CB-MNL is identical except for the calculation in Eq (12). For later sections we also define the following norm-inducing design matrix based on all the contexts observed till time $t-1$ :

\mathbf{V}_{t}\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}.

(11)
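The matrices $\mathbf{H}_{t}(\theta)$ of Eq (10) and $\mathbf{V}_{t}$ of Eq (11) differ only in the per-item weight on each outer product. The sketch below builds both, taking $\dot{\mu}_{i}=\mu_{i}(1-\mu_{i})$ as stated after Assumption 2; the function names and the `history` format (one list of offered-item contexts per past round) are assumptions for illustration.

```python
import math


def mnl_probs(contexts, theta):
    """Choice probabilities mu_i for the offered items (Eq (1))."""
    weights = [math.exp(sum(a * b for a, b in zip(x, theta))) for x in contexts]
    denominator = 1.0 + sum(weights)
    return [w / denominator for w in weights]


def add_outer(matrix, x, weight):
    """In place: matrix += weight * x x^T."""
    for a in range(len(x)):
        for b in range(len(x)):
            matrix[a][b] += weight * x[a] * x[b]


def design_matrices(history, theta, lam, d):
    """Return (H_t(theta), V_t) built from the contexts of past assortments."""
    H = [[lam if a == b else 0.0 for b in range(d)] for a in range(d)]
    V = [[lam if a == b else 0.0 for b in range(d)] for a in range(d)]
    for contexts in history:
        for x, mu_i in zip(contexts, mnl_probs(contexts, theta)):
            add_outer(H, x, mu_i * (1.0 - mu_i))  # local curvature weight
            add_outer(V, x, 1.0)                  # unweighted design matrix
    return H, V
```

Because $\mu_{i}(1-\mu_{i})\leq 1/4$, every data term of $\mathbf{H}_{t}(\theta)$ is a down-weighted copy of the corresponding term in $\mathbf{V}_{t}$, so $\mathbf{V}_{t}\succeq\mathbf{H}_{t}(\theta)$.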

3 Algorithm

At each round $t$ , the attribute parameters (contexts) $\{x_{t,1},x_{t,2},\cdots,x_{t,N}\}$ are made available to the algorithm (online platform) CB-MNL. The algorithm calculates an estimate of the true parameter $\theta_{*}$ according to Eq (5). The algorithm keeps track of the confidence set $C_{t}(\delta)$ (or $E_{t}(\delta)$ ) as defined in Eq (9) (respectively, Eq (7)). Let the set $\mathbf{\mathcal{A}}$ contain all feasible assortments of $\mathcal{N}$ with cardinality up to $K$ . The algorithm makes the following decision:

(\mathcal{Q}_{t},\theta_{t})=\operatorname*{argmax}_{A_{t}\in\mathcal{A},\theta\in C_{t}(\delta)}\mu(\mathbf{X}_{A_{t}}^{\top}\theta).

(12)

In each round $t$ , the reward of the online platform is denoted by the vector $r_{t}$ . The prediction error of $\theta$ at $\mathbf{X}_{\mathcal{Q}_{t}}$ is defined as:

\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta)\coloneqq|\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta)|.

(13)

$\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta)$ represents the difference in perceived rewards due to the inaccuracy in the estimation of the parameter $\theta_{*}$ .

Remark 1 (Optimistic parameter search).

CB-MNL enforces optimism via an optimistic parameter search (e.g., as in Abbasi-Yadkori et al. [2011]), in contrast to the use of an exploration bonus as in Faury et al. [2020], Filippi et al. [2010]. Optimistic parameter search provides a cleaner description of the learning strategy. In non-linear reward models, the two approaches may not follow similar trajectories but may have overlapping analysis styles (see Filippi et al. [2010] for a short discussion).

Remark 2 (Tractable decision-making).

In Section 4.3, we show that the decision problem of Eq (12) can be relaxed to a convex optimization problem by using a convex set $E_{t}(\delta)$ instead of $C_{t}(\delta)$ , while keeping the regret performance of Algorithm 1 intact up to constant factors.

Input: regularization parameters $\lambda_{t},\forall\,t\,\in\,[T]$ ; set $\mathcal{N}$ of $N$ distinct items; maximum assortment size $K$
for $t\,\geq\,1$ do
    Given: set $\{x_{t,1},x_{t,2},\cdots,x_{t,N}\}$ of $d$ -dimensional parameters.
    Estimate $\hat{\theta}_{t}$ according to Eq (5).
    Construct $C_{t}(\delta)$ as defined in Eq (9).
    Construct the set $\mathbf{\mathcal{A}}$ of all feasible assortments of $\mathcal{N}$ with cardinality up to $K$ .
    Play $(\mathcal{Q}_{t},\theta_{t})=\operatorname*{argmax}_{A_{t}\in\mathcal{A},\theta\in C_{t}(\delta)}\mu(\mathbf{X}_{A_{t}}^{\top}\theta)$ .
    Observe rewards $\mathbf{r}_{t}$ .
end for

Algorithm 1 CB-MNL
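As a rough illustration of the assortment-construction and play steps of Algorithm 1, the following sketch enumerates the feasible assortments and maximizes the expected reward for one fixed parameter value. It is not the joint search over $\mathcal{A}\times C_{t}(\delta)$ in Eq (12). The function names are hypothetical, the empty assortment is excluded by assumption, and we assume unit revenue per item so that the reward is the sum of the MNL purchase probabilities.

```python
from itertools import combinations
from math import exp


def feasible_assortments(n_items, max_size):
    """Yield every non-empty subset of {0, ..., n_items - 1} with at most max_size items."""
    for size in range(1, max_size + 1):
        yield from combinations(range(n_items), size)


def expected_reward(utilities, assortment):
    """Sum of MNL purchase probabilities of the offered items (assumes unit revenue)."""
    total_weight = sum(exp(utilities[i]) for i in assortment)
    return total_weight / (1.0 + total_weight)


def best_assortment(utilities, max_size):
    """Brute-force maximizer of expected_reward for one fixed parameter value."""
    return max(
        feasible_assortments(len(utilities), max_size),
        key=lambda assortment: expected_reward(utilities, assortment),
    )


# Size of the search space for N = 15 items and assortments of at most K = 4 items.
n_assortments = sum(1 for _ in feasible_assortments(15, 4))
```

The enumeration grows combinatorially in $N$ and $K$ , so this brute-force step is only practical for small instances.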

4 Main results

We present a regret upper bound for the CB-MNL algorithm in Theorem 1.

Theorem 1.

With probability at least $1-\delta$ over the randomness of user choices:

\mathbf{R}_{T}\leq C_{1}\gamma_{T}(\delta)\sqrt{2d\log\left(1+\frac{LKT}{d\lambda_{T}}\right)T}+C_{2}\kappa\gamma_{T}(\delta)^{2}d\log\left(1+\frac{KT}{d\lambda_{T}}\right),

where the constants are given as $C_{1}=(4+8S)$ , $C_{2}=4(4+8S)^{\nicefrac{{3}}{{2}}}M$ , and $\gamma_{T}(\delta)$ is given by Eq (8).

The formal proof is deferred to the technical Appendix. In this section, we discuss the key technical ideas leading to this result. The order dependence on the model parameters is made explicit by the following corollary.

Corollary 2.

Setting the regularization parameter $\lambda_{T}=\mathrm{O}(d\log(KT))$ , where $K$ is the maximum cardinality of the assortments to be selected, makes $\gamma_{T}(\delta)=\mathrm{O}(d^{\nicefrac{{1}}{{2}}}\log^{\nicefrac{{1}}{{2}}}(KT))$ . The regret upper bound is given by $\mathbf{R}_{T}=\mathrm{O}(d\sqrt{T}\log(KT)+\kappa d^{2}\log^{2}(KT))$ .
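To get a feel for when the additive $\kappa$ term in Corollary 2 is dominated by the leading term, the following sketch evaluates both terms with all constants and the dependence on $\delta$ dropped. The function names are hypothetical and the numbers it produces are only order-of-magnitude indications, not actual regret values.

```python
from math import log, sqrt


def regret_order_terms(d, K, T, kappa):
    """Return the two terms of Corollary 2 with all constants dropped."""
    log_term = log(K * T)
    leading = d * sqrt(T) * log_term            # d * sqrt(T) * log(KT)
    additive = kappa * d**2 * log_term**2       # kappa * d^2 * log^2(KT)
    return leading, additive


def kappa_crossover(d, K, T):
    """Largest kappa for which the additive term does not exceed the leading term."""
    return sqrt(T) / (d * log(K * T))
```

Under these simplifications, the additive term stays below the leading term whenever $\kappa\leq\sqrt{T}/(d\log(KT))$ , which is the quantity returned by `kappa_crossover`.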

Recall the expression for cumulative regret

\begin{aligned}
\mathbf{R}_{T}&=\sum_{t=1}^{T}[\mu(\mathbf{X}_{\mathcal{Q}^{*}_{t}}^{\top}\theta_{*})-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})]\\
&=\sum_{t=1}^{T}\underbrace{[\mu(\mathbf{X}_{\mathcal{Q}^{*}_{t}}^{\top}\theta_{*})-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{t})]}_{\text{pessimism}}+\sum_{t=1}^{T}\underbrace{[\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{t})-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})]}_{\text{prediction error}},
\end{aligned}

where pessimism is the additive inverse of the optimism (the difference between the payoffs under the true parameters and those estimated by CB-MNL). Due to optimistic decision-making and the fact that $\theta_{*}\in C_{t}(\delta)$ (see Eq (12)), pessimism is non-positive for all rounds. Thus, the regret is upper bounded by the sum of the prediction errors over $T$ rounds. In Section 4.1, we derive an upper bound on the prediction error for a single round $t$ . We also contrast our analysis with the previous works Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021] and point out the specific technical differences that allow us to use a Bernstein-like tail concentration inequality and, therefore, achieve stronger regret guarantees. In Section 4.2, we describe the additional steps leading to the statement of Theorem 1. The style of the arguments is simpler and shorter than that in Faury et al. [2020]. Finally, in Section 4.3, we discuss the relationship between the two confidence sets $C_{t}(\delta)$ and $E_{t}(\delta)$ and show that, even when using $E_{t}(\delta)$ in place of $C_{t}(\delta)$ , we obtain regret upper bounds with the same parameter dependence as in Corollary 2. Lemma 3 gives the expression for an upper bound on the prediction error.

4.1 Bounds on prediction error

Lemma 3.

For $\theta_{t}\in C_{t}(\delta)$ (see Eq (12)) with probability at least $1-\delta$ :

\begin{aligned}
\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t})\leq{}&(2+4S)\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})\|x_{t,i}\|_{\mathbf{H}_{t}^{-1}(\theta_{*})}\\
&+4\kappa(1+2S)^{2}M\gamma_{t}(\delta)^{2}\sum_{i\in\mathcal{Q}_{t}}\|x_{t,i}\|^{2}_{\mathbf{V}_{t}^{-1}},
\end{aligned}

(14)

where $\mathbf{V}_{t}$ is given by Eq (11).

The detailed proof is provided in A.4. Here we develop the main ideas leading to this result and an analytical flow that will be re-used when working with the convex confidence set $E_{t}(\delta)$ in Section 4.3. In the previous works Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021], global upper and lower bounds on the derivative of the link function (here, softmax) are employed early in the analysis, leading to a loss of the local information carried by the MLE estimate $\theta_{t}$ . In those works, the first step was to upper bound the prediction error using the Lipschitz constant (a global property) of the softmax (or sigmoid, for the logistic bandit case) function, as:

|\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{t})|\leq L|\mathbf{X}_{\mathcal{Q}_{t}}^{\top}(\theta_{*}-\theta_{t})|.

(15)

To build intuition, assume that $\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}$ lies in a “flatter” region of $\mu(\cdot)$ ; then Eq (15) is a loose upper bound.
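The looseness of Eq (15) in a flat region can be seen in a small numerical sketch. We assume the single-item case ($K=1$), where the MNL probability reduces to the sigmoid and the Lipschitz constant is $L=1/4$ ; the function names and the chosen utilities are illustrative only.

```python
from math import exp


def mu(z):
    """Purchase probability for a single offered item with utility z (sigmoid)."""
    return 1.0 / (1.0 + exp(-z))


def mu_dot(z):
    """Derivative of mu; it is small when z lies in a flat region."""
    p = mu(z)
    return p * (1.0 - p)


LIPSCHITZ = 0.25  # maximum of mu_dot over the real line, attained at z = 0


def compare_bounds(z_star, z_hat):
    """Return (actual prediction error, global Lipschitz bound, local first-order term)."""
    gap = abs(z_star - z_hat)
    actual = abs(mu(z_star) - mu(z_hat))
    global_bound = LIPSCHITZ * gap
    # First-order term at z_star; an approximation, not an upper bound by itself.
    local_term = mu_dot(z_star) * gap
    return actual, global_bound, local_term


actual, global_bound, local_term = compare_bounds(6.0, 5.5)
```

For utilities far from zero, the global bound exceeds the actual error by a large factor, while the local derivative tracks it much more closely. This is the information that the analysis below retains.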

Next, we show how using a global lower bound in the form of $\kappa$ (see Assumption 2) early in the analysis, as in Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021], leads to a loose upper bound on the prediction error. For this, we first introduce new notation:

\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*})x_{t,i}^{\top}(\theta_{*}-\theta_{t})\coloneqq\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{t}).

(16)

We also define $\mathbf{G}_{t}(\theta_{t},\theta_{*})\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{t},\theta_{*})x_{s,i}x_{s,i}^{\top}+\lambda\mathbf{I}_{d}.$ From Eq (6), we obtain (see A.2 for details of this derivation):

g(\theta_{*})-g(\theta_{t})=\mathbf{G}_{t}(\theta_{t},\theta_{*})(\theta_{*}-\theta_{t}).

(17)

From Assumption 2, $\mathbf{G}_{t}(\theta_{t},\theta_{*})$ is a positive definite matrix for $\lambda>0$ and can therefore be used to define a norm. Using the Cauchy-Schwarz inequality with Eq (17) simplifies the prediction error as:

\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t})\leq\big{|}\sum_{i\in\mathcal{Q}_{t}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*})||x_{t,i}||_{\mathbf{G}_{t}^{-1}(\theta_{t},\theta_{*})}||\theta_{*}-\theta_{t}||_{\mathbf{G}_{t}(\theta_{t},\theta_{*})}\big{|}

(18)

The previous literature Filippi et al. [2010], Oh & Iyengar [2021] has utilized $\mathbf{G}_{t}(\theta_{t},\theta_{*})\succeq\kappa^{-1}\mathbf{V}_{t}$ and upper bounded $\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*})$ by the Lipschitz constant (directly at this stage), thereby incurring loose regret bounds. Instead, here we work with the norm induced by $\mathbf{H}_{t}(\theta_{*})$ and retain the location information in $\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*})$ :

\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t})\leq\big{|}\sum_{i\in\mathcal{Q}_{t}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*})||x_{t,i}||_{\mathbf{H}_{t}^{-1}(\theta_{*})}||\theta_{*}-\theta_{t}||_{\mathbf{H}_{t}(\theta_{*})}\big{|}

(19)

It is not straightforward to bound $||\theta_{*}-\theta_{t}||_{\mathbf{H}_{t}(\theta_{*})}$ . We extend the self-concordance style relations of Faury et al. [2020] to the multinomial logit function, which allows us to relate $\mathbf{G}_{t}^{-1}(\theta_{t},\theta_{*})$ and $\mathbf{H}_{t}^{-1}(\theta_{t})$ (or $\mathbf{H}_{t}^{-1}(\theta_{*})$ ) and thereby develop a bound on $||\theta_{*}-\theta||_{\mathbf{H}_{t}(\theta_{*})}$ .

Lemma 4.

For all $\theta_{1},\theta_{2}\,\in\,\Theta$ , the following inequalities hold:

\mathbf{G}_{t}(\theta_{1},\theta_{2})\succeq(1+2S)^{-1}\mathbf{H}_{t}(\theta_{1}),\quad\mathbf{G}_{t}(\theta_{1},\theta_{2})\succeq(1+2S)^{-1}\mathbf{H}_{t}(\theta_{2})
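The scalar inequality behind Lemma 4 can be checked numerically in the single-item (logistic) special case. This is a sketch under that simplifying assumption, not the multinomial argument of A.3: for utilities $z_{1},z_{2}\in[-S,S]$ , the secant slope $\alpha(z_{1},z_{2})$ should satisfy $\alpha(z_{1},z_{2})\geq(1+2S)^{-1}\dot{\mu}(z_{1})$ . The function names and the grid resolution are illustrative choices.

```python
from math import exp


def mu(z):
    """Sigmoid link, i.e. the MNL purchase probability with a single offered item."""
    return 1.0 / (1.0 + exp(-z))


def mu_dot(z):
    """Derivative of the sigmoid."""
    p = mu(z)
    return p * (1.0 - p)


def alpha(z1, z2):
    """Secant slope of mu between z1 and z2 (equal to mu_dot when z1 == z2)."""
    if z1 == z2:
        return mu_dot(z1)
    return (mu(z1) - mu(z2)) / (z1 - z2)


def lemma4_scalar_check(S, steps=41):
    """Check alpha(z1, z2) >= mu_dot(z1) / (1 + 2S) on a grid over [-S, S]^2."""
    grid = [-S + 2.0 * S * k / (steps - 1) for k in range(steps)]
    return all(
        alpha(z1, z2) >= mu_dot(z1) / (1.0 + 2.0 * S) - 1e-12
        for z1 in grid
        for z2 in grid
    )
```

Since `alpha` is symmetric in its arguments, the same check also covers the second inequality of the lemma in this scalar setting.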

Lemma 5.

For $\theta_{t}\in C_{t}(\delta)$ , we have the following relation with probability at least $1-\delta$ : $||\theta_{t}-\theta_{*}||_{\mathbf{H}_{t}(\theta_{*})}\leq 2(1+2S)\gamma_{t}(\delta).$

Proofs of Lemmas 4 and 5 are deferred to A.3. Notice that Lemma 5 is a key result that characterizes the worthiness of the confidence set $C_{t}(\delta)$ . Recall that $\gamma_{T}(\delta)=\mathrm{O}(\sqrt{d\log(KT)})$ (with a tuned $\lambda$ as in Corollary 2). Therefore, any $\theta\,\in\,C_{t}(\delta)$ is not too far from the optimal $\theta_{*}$ under the norm $||\cdot||_{\mathbf{H}_{t}(\theta_{*})}$ . Now, we use Lemma 5 in Eq (19) to get:

\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t})\leq 2(1+2S)\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}|\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta_{t})||x_{t,i}||_{\mathbf{H}_{t}^{-1}(\theta_{*})}|.

(20)

The quantity $\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*})$ , as described in Eq (16), is upper bounded in the following result.

Lemma 6.

For the assortment chosen by the algorithm CB-MNL, $\mathcal{Q}_{t}$ as given by Eq (12), and any $\theta\in C_{t}(\delta)$ , the following holds with probability at least $1-\delta$ : $\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)\leq\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})+2(1+2S)M\gamma_{t}(\delta)||x_{t,i}||_{\mathbf{H}_{t}^{-1}(\theta_{*})}.$

We use the result of Lemma 6 in Eq (20), followed by an application of Lemma 4 and the relation $\sum_{i\in\mathcal{Q}_{t}}||x_{t,i}||^{2}_{\mathbf{H}_{t}^{-1}(\theta_{t})}\leq\kappa\sum_{i\in\mathcal{Q}_{t}}||x_{t,i}||^{2}_{\mathbf{V}_{t}^{-1}}$ from Assumption 2, to arrive at the statement of Lemma 3.

4.2 Regret calculation

The complete technical work is provided in A.4 and A.5. The key step in retrieving the upper bounds of Theorem 1 is to calculate the summation over $T$ rounds of the prediction error as given in Eq (3). Compared to the previous literature Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021], the term $\sum_{i\in\mathcal{Q}_{t}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})||x_{t,i}||_{\mathbf{H}_{t}^{-1}(\theta_{*})}$ is new here. Further, our treatment of this term is simpler and more straightforward than that in Faury et al. [2020].

4.3 Convex relaxation of the optimization step

Sections 4.1 & 4.2 provide an analytical framework for calculating the regret bounds given: (1) the confidence set (Eq (9)) with the guarantee that $\theta_{*}\,\in\,C_{t}(\delta)$ with probability at least $1-\delta$ ; (2) the assurance that the confidence set is small (Lemma 5). In order to re-use the previously developed techniques, we show: (1) $E_{t}(\delta)\supseteq C_{t}(\delta)$ (see Eq (7)) and therefore $\theta_{*}\,\in\,E_{t}(\delta)$ with probability at least $1-\delta$ (see Lemma 7); (2) an analog of Lemma 5 using $E_{t}(\delta)$ (see Lemma 8). The proof of Theorem 1 is therefore repeated using Lemma 8, following the steps sketched in Sections 4.1 & 4.2. The order dependence of the regret upper bound is retained (see Corollary 2).

Lemma 7.

$E_{t}(\delta)\supseteq C_{t}(\delta)$ , therefore for any $\theta\in C_{t}(\delta)$ , we also have $\theta\,\in\,E_{t}(\delta)$ (see Eq (7)).

The complete proof is provided in A.6. We highlight the usefulness of Lemma 7. Since the whole set $C_{t}(\delta)$ lies within $E_{t}(\delta)$ , the concentration inequality also implies $\mathbb{P}(\forall t\geq 1,\theta_{*}\in E_{t}(\delta))\geq 1-\delta$ .

Lemma 8.

Under the event $\theta_{*}\,\in\,C_{t}(\delta)$ , the following holds $\forall\,\theta\,\in\,E_{t}(\delta)$ :

||\theta-\theta_{*}||_{\mathbf{H}_{t}(\theta_{*})}\leq(2+2S)\gamma_{t}(\delta)+2\sqrt{1+S}\beta_{t}(\delta).

When $\lambda_{t}=\mathrm{O}(d\log(Kt))$ , then $\gamma_{t}(\delta)=\tilde{\mathrm{O}}(\sqrt{d\log(t)})$ , $\beta_{t}(\delta)=\tilde{\mathrm{O}}(\sqrt{d\log(t)})$ , and $||\theta-\theta_{*}||_{\mathbf{H}_{t}(\theta_{*})}=\tilde{\mathrm{O}}(\sqrt{d\log(t)})$ .

The complete proof can be found in A.6.

5 Numerical experiments

In this section, we compare the empirical performance of our proposed algorithm CB-MNL with the previous state of the art in the MNL contextual bandit literature, UCB-MNL [Oh & Iyengar, 2021] and TS-MNL [Oh & Iyengar, 2019], on artificial data. We focus on performance comparison for varying values of the parameter $\kappa$ and show that our algorithm has consistently superior performance for different $\kappa$ values in Figure 2. This highlights the primary contribution of our theoretical analysis. Refer to A.8 for additional empirical analysis.

For each experimental configuration, we consider a problem instance with inventory size $N=15$ , instance dimension $d=5$ , maximum assortment size $K=4$ , and time horizon $T=100$ , averaged over $25$ Monte Carlo simulation runs. $\theta_{*}\in\mathbb{R}^{d}$ is a $d$ -dimensional random vector with each coordinate in $[0,1]$ , independently and uniformly distributed. The contexts follow a multivariate Gaussian distribution. The $\lambda$ parameter is manually tuned. Algorithm CB-MNL only knows the values of $N,T,K,d$ . In contrast, algorithms TS-MNL and UCB-MNL also need to know the value of $\kappa$ for their implementation. We observe that our algorithm CB-MNL has robust performance for varying values of $\kappa$ .
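A problem instance of the kind described above could be generated roughly as follows. This is a minimal sketch rather than the code used for the reported experiments: the function names are hypothetical, and we assume standard normal contexts with independent coordinates, since the mean and covariance of the Gaussian are not specified here.

```python
import random
from math import exp


def make_instance(n_items=15, dim=5, seed=0):
    """Draw theta_star uniformly from [0, 1]^dim and one Gaussian context per item."""
    rng = random.Random(seed)
    theta_star = [rng.uniform(0.0, 1.0) for _ in range(dim)]
    # Assumption: independent standard normal coordinates for every context.
    contexts = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_items)]
    return theta_star, contexts, rng


def choice_probabilities(theta, contexts, assortment):
    """MNL purchase probabilities of the offered items; the remainder is no purchase."""
    weights = [
        exp(sum(x * t for x, t in zip(contexts[i], theta))) for i in assortment
    ]
    denom = 1.0 + sum(weights)
    return [w / denom for w in weights]


def sample_choice(theta, contexts, assortment, rng):
    """Sample the purchased item from the assortment, or None for no purchase."""
    probs = choice_probabilities(theta, contexts, assortment)
    u = rng.random()
    cumulative = 0.0
    for item, p in zip(assortment, probs):
        cumulative += p
        if u < cumulative:
            return item
    return None
```

A simulation run would call `sample_choice` once per round for the assortment chosen by the learning algorithm, with `assortment` containing at most $K=4$ item indices.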

6 Conclusion and discussion

In this work, we proposed an optimistic algorithm for learning under the MNL contextual bandit framework. Using techniques from Faury et al. [2020], we developed an improved technical analysis to deal with the non-linear nature of the MNL reward function. As a result, the leading term in our regret bound does not suffer from the problem-dependent parameter $\kappa$ . This contribution is significant as $\kappa$ can be very large (refer to Section 1.2). For example, for $\kappa=\mathrm{O}(\sqrt{T})$ , the results of Oh & Iyengar [2021, 2019] suffer $\tilde{\mathrm{O}}(T)$ regret, while our algorithm continues to enjoy $\tilde{\mathrm{O}}(\sqrt{T})$ . Further, we also presented a tractable version of the decision-making step of the algorithm by constructing a convex relaxation of the confidence set.

Our result is still $\mathrm{O}(\sqrt{d})$ away from the minimax lower bound of Chu et al. [2011] known for the linear contextual bandit. In the case of logistic bandits, Li et al. [2017] make an i.i.d. assumption on the contexts to bridge the gap (however, they still retain the $\kappa$ factor). Improving the worst-case regret bound by $\mathrm{O}(\sqrt{d})$ while keeping $\kappa$ as an additive term is an open problem. It may be possible to improve the dependence on $\kappa$ by using a higher-order approximation for the estimation error. Finding a lower bound on the dependence on $\kappa$ is an interesting open problem and may require newer techniques than those presented in this work.

Oh & Iyengar [2019] gave a Thompson sampling (TS) based learning strategy for the MNL contextual bandit. Thompson sampling approaches may not have to search the entire action space to make decisions, as optimistic algorithms (such as ours) do. TS-based strategies are likely to have better empirical performance. The authors of Oh & Iyengar [2019] use a confidence-set-based analysis to bound the estimation error term. However, the results in Oh & Iyengar [2019] suffer from the prohibitive scaling of the problem-dependent parameter $\kappa$ that we have overcome here. Modifying our analysis for a TS-based learning strategy could bring together the best of both worlds.

References

Abbasi-Yadkori et al. [2011] Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2312–2320.
Abeille et al. [2021] Abeille, M., Faury, L., & Calauzènes, C. (2021). Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics (pp. 3691–3699). PMLR.
Agrawal et al. [2017] Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2017). Thompson sampling for the mnl-bandit. In Conference on Learning Theory (pp. 76–78). PMLR.
Agrawal et al. [2019] Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2019). Mnl-bandit: A dynamic learning approach to assortment selection. Operations Research, 67, 1453–1485. doi:10.1287/opre.2018.1832.
Alfandari et al. [2021] Alfandari, L., Hassanzadeh, A., & Ljubić, I. (2021). An exact method for assortment optimization under the nested logit model. European Journal of Operational Research, 291, 830–845. doi:https://doi.org/10.1016/j.ejor.2020.12.007.
Amani & Thrampoulidis [2021] Amani, S., & Thrampoulidis, C. (2021). Ucb-based algorithms for multinomial logistic regression bandits. Advances in Neural Information Processing Systems, 34, 2913–2924.
Auer et al. [2002] Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47, 235–256.
Avadhanula [2019] Avadhanula, V. (2019). The MNL-Bandit Problem: Theory and Applications. Ph.D. thesis Columbia University.
Bach [2010] Bach, F. (2010). Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4, 384–414. doi:10.1214/09-EJS521.
Chen et al. [2020] Chen, X., Wang, Y., & Zhou, Y. (2020). Dynamic assortment optimization with changing contextual information. Journal of Machine Learning Research, 21, 1–44.
Cheung & Simchi-Levi [2017] Cheung, W. C., & Simchi-Levi, D. (2017). Thompson sampling for online personalized assortment optimization problems with multinomial logit choice models. Available at SSRN 3075658.
Chu et al. [2011] Chu, W., Li, L., Reyzin, L., & Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 208–214).
Dani et al. [2008] Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Conference on Learning Theory.
Faury et al. [2020] Faury, L., Abeille, M., Calauzènes, C., & Fercoq, O. (2020). Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning (pp. 3052–3060). PMLR.
Feldman et al. [2018] Feldman, J., Zhang, D., Liu, X., & Zhang, N. (2018). Taking assortment optimization from theory to practice: Evidence from large field experiments on alibaba. Available at SSRN.
Filippi et al. [2010] Filippi, S., Cappe, O., Garivier, A., & Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems (pp. 586–594).
Flores et al. [2019] Flores, A., Berbeglia, G., & Van Hentenryck, P. (2019). Assortment optimization under the sequential multinomial logit model. European Journal of Operational Research, 273, 1052–1064. doi:https://doi.org/10.1016/j.ejor.2018.08.047.
Gao & Pavel [2017] Gao, B., & Pavel, L. (2017). On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805.
Grant & Szechtman [2021] Grant, J. A., & Szechtman, R. (2021). Filtered poisson process bandit on a continuum. European Journal of Operational Research, 295, 575–586. doi:https://doi.org/10.1016/j.ejor.2021.03.033.
Kök & Fisher [2007] Kök, A. G., & Fisher, M. L. (2007). Demand estimation and assortment optimization under substitution: Methodology and application. Operations Research, 55, 1001–1021. doi:https://doi.org/10.1287/opre.1070.0409.
Lehmann & Casella [2006] Lehmann, E. L., & Casella, G. (2006). Theory of point estimation. Springer Science & Business Media.
Li et al. [2017] Li, L., Lu, Y., & Zhou, D. (2017). Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 2071–2080).
Oh & Iyengar [2019] Oh, M.-h., & Iyengar, G. (2019). Thompson sampling for multinomial logit contextual bandits. In Advances in Neural Information Processing Systems (pp. 3151–3161).
Oh & Iyengar [2021] Oh, M.-h., & Iyengar, G. (2021). Multinomial logit contextual bandits: Provable optimality and practicality. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 9205–9213). volume 35.
Ou et al. [2018] Ou, M., Li, N., Zhu, S., & Jin, R. (2018). Multinomial logit bandit with linear utility functions. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (pp. 2602–2608).
Perivier & Goyal [2022] Perivier, N., & Goyal, V. (2022). Dynamic pricing and assortment under a contextual mnl demand. Advances in Neural Information Processing Systems, 35, 3461–3474.
Rusmevichientong et al. [2010] Rusmevichientong, P., Shen, Z.-J. M., & Shmoys, D. B. (2010). Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations research, 58, 1666–1680. doi:https://doi.org/10.1287/opre.1100.0866.
Rusmevichientong & Tsitsiklis [2010] Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35, 395–411. doi:https://doi.org/10.1287/moor.1100.0446.
Sauré & Zeevi [2013] Sauré, D., & Zeevi, A. (2013). Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15, 387–404. doi:https://doi.org/10.1287/msom.2013.0429.
Timonina-Farkas et al. [2020] Timonina-Farkas, A., Katsifou, A., & Seifert, R. W. (2020). Product assortment and space allocation strategies to attract loyal and non-loyal customers. European Journal of Operational Research, 285, 1058–1076. doi:https://doi.org/10.1016/j.ejor.2020.02.019.
Wang et al. [2020] Wang, X., Zhao, X., & Liu, B. (2020). Design and pricing of extended warranty menus based on the multinomial logit choice model. European Journal of Operational Research, 287, 237–250. doi:https://doi.org/10.1016/j.ejor.2020.05.012.
Xu et al. [2021] Xu, J., Chen, L., & Tang, O. (2021). An online algorithm for the risk-aware restless bandit. European Journal of Operational Research, 290, 622–639. doi:https://doi.org/10.1016/j.ejor.2020.08.028.
Zhang & Lin [2015] Zhang, Y., & Lin, X. (2015). Disco: Distributed optimization for self-concordant empirical loss. In International conference on machine learning (pp. 362–370).

Appendix A Appendix

A.1 Confidence set

In this section, we justify the design of the confidence set defined in Eq (9). This particular choice is based on the following concentration inequality for self-normalized vectorial martingales.

Theorem 9.

(Appears as Theorem 4 in Abeille et al. [2021].) Let $\{\mathcal{F}_{t}\}_{t=1}^{\infty}$ be a filtration. Let $\{x_{t}\}_{t=1}^{\infty}$ be a stochastic process in $\mathcal{B}_{2}(d)$ such that $x_{t}$ is $\mathcal{F}_{t}$ -measurable. Let $\{\varepsilon_{t}\}_{t=2}^{\infty}$ be a martingale difference sequence such that $\varepsilon_{t+1}$ is $\mathcal{F}_{t+1}$ -measurable. Furthermore, assume that conditionally on $\mathcal{F}_{t}$ we have $|\varepsilon_{t+1}|\leq 1$ almost surely, and denote $\sigma_{t}^{2}\coloneqq\mathbb{E}\left[\varepsilon_{t+1}^{2}|\mathcal{F}_{t}\right]$ . Let $\{\lambda_{t}\}_{t=1}^{\infty}$ be a predictable sequence of non-negative scalars. Define:

\mathbf{H}_{t}\coloneqq\sum_{s=1}^{t-1}\sigma_{s}^{2}x_{s}x_{s}^{\top}+\lambda_{t}\mathbf{I}_{d},\qquad S_{t}\coloneqq\sum_{s=1}^{t-1}\varepsilon_{s+1}x_{s}.

Then for any $\delta\in(0,1]$ :

\mathbb{P}\Bigg{(}\exists t\geq 1,\,\left\lVert S_{t}\right\rVert_{\mathbf{H}_{t}^{-1}}\geq\frac{\sqrt{\lambda_{t}}}{2}+\frac{2}{\sqrt{\lambda_{t}}}\log\left(\frac{\det\left(\mathbf{H}_{t}\right)^{\frac{1}{2}}\lambda_{t}^{-\frac{d}{2}}}{\delta}\right)+\frac{2}{\sqrt{\lambda_{t}}}d\log(2)\Bigg{)}\leq\delta.

Theorem 9 cannot be used directly in our setting because, in the MNL model, the actual rewards $\{r_{s,i}\}_{i\in Q_{s}}$ (for any time step $s$ ) are correlated. Hence, a concentration result almost identical to Theorem 9 (differing only in minor constant modifications), appearing as Theorem C.6 in Perivier & Goyal [2022], is used instead.

Lemma 10 (confidence bounds for multinomial logistic rewards).

With $\hat{\theta}_{t}$ as the regularized maximum log-likelihood estimate defined in Eq (5), the following holds with probability at least $1-\delta$ :

\forall t\,\geq\,1,\quad\left\lVert g_{t}(\hat{\theta}_{t})-g_{t}(\theta_{*})\right\rVert_{\mathbf{H}_{t}^{-1}}\leq\gamma_{t}(\delta)

where $\mathbf{H}_{t}(\theta_{1})=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})x_{s,i}x_{s,i}^{\top}+\lambda\mathbf{I}_{d}$ and $g_{t}(\cdot)$ is defined in Eq (6).

Proof.

$\hat{\theta}_{t}$ is the maximizer of the regularized log-likelihood:

\mathcal{L}_{t}^{\lambda_{t}}(\theta)=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}r_{s,i}\log\left(\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)\right)-\frac{\lambda_{t}}{2}\left\lVert\theta\right\rVert_{2}^{2},

where $\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)$ is given by Eq (1) as $\frac{e^{x^{\top}_{s,i}\theta}}{1+\sum_{j\in\mathcal{Q}_{s}}e^{x^{\top}_{s,j}\theta}}$ . Solving for $\nabla_{\theta}\mathcal{L}_{t}^{\lambda_{t}}=0$ , we obtain:

\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\hat{\theta}_{t})x_{s,i}+\lambda_{t}\hat{\theta}_{t}=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}r_{s,i}x_{s,i}

This result, combined with the definition of $g_{t}(\theta_{*})=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})x_{s,i}+\lambda_{t}\theta_{*}$ , yields:

\begin{aligned}
g_{t}(\hat{\theta}_{t})-g_{t}(\theta_{*})&=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\varepsilon_{s,i}x_{s,i}-\lambda_{t}\theta_{*}\\
&=S_{t,K}-\lambda_{t}\theta_{*}
\end{aligned}

where we denoted $\varepsilon_{s,i}\coloneqq r_{s,i}-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})$ for all $s\geq 1$ and $i\in[K]$ , and $S_{t,K}\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\varepsilon_{s,i}x_{s,i}$ for all $t\geq 1$ . From the definition of $\mathbf{H}_{t}(\theta_{*})$ it follows that $\mathbf{H}_{t}(\theta_{*})\succeq\lambda_{t}\mathbf{I}_{d}$ , and hence $\mathbf{H}^{-1}_{t}(\theta_{*})\preceq\lambda_{t}^{-1}\mathbf{I}_{d}$ . Therefore, $\left\lVert\lambda_{t}\theta_{*}\right\rVert_{\mathbf{H}^{-1}_{t}(\theta_{*})}\leq\lambda_{t}^{-1/2}\left\lVert\lambda_{t}\theta_{*}\right\rVert_{2}\leq\sqrt{\lambda_{t}}S$ , using $\left\lVert\theta_{*}\right\rVert_{2}\leq S$ . Later, in the proof of Theorem 1, we present our choice of $\lambda_{t}$ , which always ensures $\lambda_{t}\geq 1$ . Applying the triangle inequality, we obtain:

\left\lVert g_{t}(\hat{\theta}_{t})-g_{t}(\theta_{*})\right\rVert_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\left\lVert S_{t,K}\right\rVert_{\mathbf{H}_{t}^{-1}(\theta_{*})}+\sqrt{\lambda_{t}}S

(21)

Conditioned on the filtration set $\mathcal{F}_{t,i}$ (see Section 2.2 to review the definition of the filtration set), $\varepsilon_{s,i}$ is a martingale difference bounded by $1$ , as we assume that the maximum reward accrued in any round is upper bounded by $1$ . We calculate for all $s\geq 1$ :

\begin{aligned}
\mathbb{E}\left[\varepsilon^{2}_{s,i}\,\big|\,\mathcal{F}_{t}\right]&=\mathbb{E}\left[\left(r_{s,i}-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})\right)^{2}\,\bigg|\,\mathcal{F}_{t}\right]\\
&=\mathbb{V}\left[r_{s,i}\,|\,\mathcal{F}_{t}\right]=\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})\left(1-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})\right).
\end{aligned}

(22)

Also, from Remark 3, we have:

\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})=\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})\left(1-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})\right).

Therefore, setting ${H}_{t}$ as $\mathbf{H}_{t}(\theta_{*})=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}$ and $U_{t}$ as $S_{t,K}$ , we invoke an instance of Theorem C.6 in Perivier & Goyal [2022] to obtain:

1-\delta\leq\mathbb{P}\left(\forall t\geq 1,\left\lVert S_{t}\right\rVert_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\frac{\sqrt{\lambda_{t}}}{2}+\frac{2}{\sqrt{\lambda_{t}}}\log\left(\frac{2^{d}\det(\mathbf{H}_{t}(\theta_{*}))^{1/2}\lambda_{t}^{-d/2}}{\delta}\right)+\frac{2d}{\sqrt{\lambda_{t}}}\log(2)\right)

We simplify $\det(\mathbf{H}_{t}(\theta_{*}))$ , using the fact that the multinomial logistic function is $L$ -Lipschitz (see Assumption 2):

\begin{aligned}
\det(\mathbf{H}_{t}(\theta_{*}))&=\det\left(\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}\right)\\
&\leq L^{d}\det\left(\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}x_{s,i}x_{s,i}^{\top}+\frac{\lambda_{t}}{L}\mathbf{I}_{d}\right).
\end{aligned}

Further, using Lemma 18 and $\left\lVert x_{s,i}\right\rVert_{2}\leq 1$ , we write:

L^{d}\det\left(\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}x_{s,i}x_{s,i}^{\top}+\frac{\lambda_{t}}{L}\mathbf{I}_{d}\right)\leq\left(\lambda_{t}+\frac{LKt}{d}\right)^{d}.

Thus, we simplify Eq (10) as:

\begin{aligned}
1-\delta\leq{}&\mathbb{P}\left(\forall t\geq 1,\left\lVert S_{t}\right\rVert_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\frac{\sqrt{\lambda_{t}}}{2}+\frac{2}{\sqrt{\lambda_{t}}}\log\left(\frac{\left(\lambda_{t}+LKt/d\right)^{d/2}\lambda_{t}^{-d/2}}{\delta}\right)+\frac{2d}{\sqrt{\lambda_{t}}}\log(2)\right)\\
\leq{}&\mathbb{P}\left(\forall t\geq 1,\left\lVert S_{t}\right\rVert_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\frac{\sqrt{\lambda_{t}}}{2}+\frac{2}{\sqrt{\lambda_{t}}}\log\left(\frac{\left(1+\frac{LKt}{\lambda_{t}d}\right)^{d/2}}{\delta}\right)+\frac{2d}{\sqrt{\lambda_{t}}}\log(2)\right)\\
={}&\mathbb{P}\left(\forall t\geq 1,\left\lVert S_{t}\right\rVert_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\gamma_{t}(\delta)-\sqrt{\lambda_{t}}S\right)
\end{aligned}

(24)

Combining Eq (21) and Eq (24) yields:

\begin{aligned}
&\mathbb{P}\left(\forall t\geq 1,\,\left\lVert g_{t}(\hat{\theta}_{t})-g_{t}(\theta_{*})\right\rVert_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\gamma_{t}(\delta)\right)\\
&\geq\mathbb{P}\left(\forall t\geq 1,\,\left\lVert S_{t}\right\rVert_{\mathbf{H}_{t}^{-1}(\theta_{*})}+\sqrt{\lambda_{t}}S\leq\gamma_{t}(\delta)\right)\\
&\geq 1-\delta.
\end{aligned}

This completes the proof. ∎
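The first-order condition used at the start of the proof can be sanity-checked numerically. The sketch below compares the analytic gradient $\sum_{s}\sum_{i}(r_{s,i}-\mu_{i})x_{s,i}-\lambda_{t}\theta$ with a finite-difference gradient of the regularized log-likelihood. We assume that the likelihood also includes the no-purchase outcome, which is what makes the gradient take this form; the function names and data layout are hypothetical.

```python
from math import exp, log


def mnl_probs(theta, xs):
    """MNL purchase probabilities for the offered contexts xs under parameter theta."""
    weights = [exp(sum(a * b for a, b in zip(x, theta))) for x in xs]
    denom = 1.0 + sum(weights)
    return [w / denom for w in weights]


def log_likelihood(theta, rounds, lam):
    """Regularized log-likelihood; each round is (contexts, one-hot-or-zero rewards)."""
    total = 0.0
    for xs, r in rounds:
        p = mnl_probs(theta, xs)
        p_none = 1.0 - sum(p)  # probability of the no-purchase outcome (assumed included)
        total += sum(ri * log(pi) for ri, pi in zip(r, p))
        total += (1 - sum(r)) * log(p_none)
    return total - 0.5 * lam * sum(t * t for t in theta)


def gradient(theta, rounds, lam):
    """Analytic gradient: sum over rounds and items of (r - mu) * x, minus lam * theta."""
    grad = [-lam * t for t in theta]
    for xs, r in rounds:
        p = mnl_probs(theta, xs)
        for x, ri, pi in zip(xs, r, p):
            for k in range(len(theta)):
                grad[k] += (ri - pi) * x[k]
    return grad


def numeric_gradient(theta, rounds, lam, h=1e-6):
    """Central finite differences of log_likelihood, used only as a check."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += h
        down[k] -= h
        grad.append(
            (log_likelihood(up, rounds, lam) - log_likelihood(down, rounds, lam)) / (2 * h)
        )
    return grad
```

Setting the analytic gradient to zero recovers the stationarity equation for $\hat{\theta}_{t}$ stated in the proof.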

It is insightful to compare Theorem 9 with Theorem 1 of Abbasi-Yadkori et al. [2011]. The latter is restated below:

Theorem 11.

Let $\{\mathcal{F}_{t}\}_{t=0}^{\infty}$ be a filtration. Let $\{\eta_{t}\}_{t=1}^{\infty}$ be a real-valued stochastic process such that $\eta_{t}$ is $\mathcal{F}_{t}$ -measurable and $\eta_{t}$ is conditionally $R$ -sub-Gaussian for some $R\geq 0$ , i.e.,

\forall\,\lambda_{t}\,\in\mathbb{R},\qquad\mathbb{E}\sbr{\exp(\lambda_{t}\eta_% {t})\mid\mathcal{F}_{t-1}}\leq\exp\del{\frac{\lambda_{t}^{2}R^{2}}{2}}.

Let $\{x_{t}\}_{t=1}^{\infty}$ be an $\mathbb{R}^{d}$ -valued stochastic process such that $x_{t}$ is $\mathcal{F}_{t-1}$ -measurable. Assume $\mathbf{V}$ is a $d\times d$ positive definite matrix. For any $t\geq 0$ , define:

\overline{\mathbf{V}}_{t}=\mathbf{V}+\sum_{s=1}^{t}x_{s}x_{s}^{\top},\qquad% \quad S_{t}=\sum_{s=1}^{t}\eta_{s}x_{s}.

Then, for any $\delta>0$ , with probability at least $1-\delta$ , for all $t\geq 0$ ,

||S_{t}||^{2}_{\overline{\mathbf{V}}_{t}^{-1}}\leq 2R^{2}\log\del{\frac{\det(\overline{\mathbf{V}}_{t})^{\nicefrac{{1}}{{2}}}\det(\mathbf{V})^{-\nicefrac{{1}}{{2}}}}{\delta}}.

Theorem 11 makes a uniform sub-Gaussian assumption and, unlike Theorem 9, does not take local variance information into account.

A.2 Local information preserving norm

Deviating from previous analyses such as Filippi et al. [2010] and Li et al. [2017], we describe a norm that preserves local information. The matrix $\mathbf{X}_{\mathcal{Q}_{s}}$ is the design matrix whose columns are the contexts $x_{s,1},x_{s,2},\cdots,x_{s,K}$ received at time step $s$ . The expected reward due to the $i$ -th item in the assortment is given by:

\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)=\frac{e^{x^{\top}_{s,i}% \theta}}{1+\sum_{j\in\mathcal{Q}_{s}}e^{x^{\top}_{s,j}\theta}}.
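As a concrete illustration, the choice probabilities above can be computed as in the following minimal sketch. The function name `mnl_probabilities`, the example contexts, and the parameter vector are illustrative assumptions and do not come from the paper; the outside (no-purchase) option is taken to have utility zero, matching the $1$ in the denominator.

```python
import math


def mnl_probabilities(contexts, theta):
    """Return mu_i(X^T theta) for every item i in the assortment.

    contexts: list of d-dimensional feature vectors x_{s,i} (hypothetical data).
    theta:    d-dimensional parameter vector.
    """
    utilities = [sum(a * b for a, b in zip(x, theta)) for x in contexts]
    # Shift by the largest utility (or 0 for the outside option) to avoid overflow.
    shift = max(0.0, max(utilities))
    weights = [math.exp(u - shift) for u in utilities]
    denominator = math.exp(-shift) + sum(weights)
    return [w / denominator for w in weights]


# Hypothetical assortment of K = 3 items with d = 2 attributes, each ||x||_2 <= 1.
contexts = [[0.6, 0.0], [0.0, 0.8], [0.5, 0.5]]
theta = [1.0, -0.5]
probs = mnl_probabilities(contexts, theta)
no_purchase = 1.0 - sum(probs)
print(probs, no_purchase)
```

The item probabilities and the no-purchase probability sum to one, so `no_purchase` is the mass left to the outside option.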

Further, we consider the following integral:

\displaystyle\int_{\nu=0}^{1}\dot{\mu}_{i}\del{\nu\mathbf{X}_{\mathcal{Q}_{s}}% ^{\top}\theta_{2}+\del{1-\nu}\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}}% \cdot d\nu

\displaystyle=\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i}^{\top}\theta_{2}}\frac{% 1}{x_{s,i}^{\top}(\theta_{2}-\theta_{1})}\dot{\mu}_{i}(t_{i})\cdot dt_{i},

(25)

where $\dot{\mu}_{i}$ is the partial derivative of $\mu_{i}$ in the direction of the $i$ -th component and $\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i}^{\top}\theta_{2}}\dot{\mu}_{i}(t_{i})\cdot dt_{i}$ represents integration of $\dot{\mu}_{i}(\cdot)$ with respect to the coordinate $t_{i}$ (hence the limits of the integration only consider change in the coordinate $t_{i}$ ). For notational convenience, which will become clear later, we define:

$\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{1},\theta_{2})x_{% s,i}^{\top}(\theta_{2}-\theta_{1})$	$\displaystyle\coloneqq\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2})-% \mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})$
	$\displaystyle=\frac{e^{x_{s,i}^{\top}\theta_{2}}}{1+\sum_{j\in\mathcal{Q}_{s}}% e^{x_{s,j}^{\top}\theta_{2}}}-\frac{e^{x_{s,i}^{\top}\theta_{1}}}{1+\sum_{j\in% \mathcal{Q}_{s}}e^{x_{s,j}^{\top}\theta_{1}}}$	(26)
	$\displaystyle=\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i}^{\top}\theta_{2}}\dot{% \mu}_{i}(t_{i})\cdot dt_{i},$

where the second step is due to the Fundamental Theorem of Calculus. We have exploited the two ways to view the multinomial logit function: as a sum of individual probabilities and as a vector-valued function. We write:

\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{1},\theta_{2})x_{s,i}^{\top}(\theta_{2}-\theta_{1})=\sum_{i\in\mathcal{Q}_{s}}x_{s,i}^{\top}(\theta_{2}-\theta_{1})\int_{\nu=0}^{1}\dot{\mu}_{i}\del{\nu\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2}+\del{1-\nu}\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}}\cdot d\nu

(27)

We also have:

\displaystyle\mu(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})-\mu(\mathbf{X}% _{\mathcal{Q}_{s}}^{\top}\theta_{2})=\sum_{i=1}^{K}\alpha_{i}(\mathbf{X}_{% \mathcal{Q}_{s}},\theta_{2},\theta_{1})x_{s,i}^{\top}(\theta_{1}-\theta_{2}).

(28)
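The definition in Eq (26) and the identity in Eq (28) can be checked numerically with the sketch below. It reads $\mu(\cdot)$ in Eq (28) as the sum of the individual probabilities $\mu_{i}(\cdot)$, and treats $\alpha_{i}$ as the difference quotient implied by Eq (26). The helper names `dot`, `mnl_probabilities`, and `alpha`, as well as all numeric values, are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mnl_probabilities(contexts, theta):
    """Choice probabilities mu_i with a zero-utility outside option."""
    weights = [math.exp(dot(x, theta)) for x in contexts]
    denominator = 1.0 + sum(weights)
    return [w / denominator for w in weights]


def alpha(contexts, theta_1, theta_2):
    """Difference quotients alpha_i(X, theta_1, theta_2) implied by Eq (26)."""
    p1 = mnl_probabilities(contexts, theta_1)
    p2 = mnl_probabilities(contexts, theta_2)
    values = []
    for x, a, b in zip(contexts, p1, p2):
        gap = dot(x, theta_2) - dot(x, theta_1)
        if gap == 0.0:
            raise ValueError("alpha_i is a 0/0 quotient when x^T(theta_2 - theta_1) = 0")
        values.append((b - a) / gap)
    return values


# Hypothetical contexts and parameters.
contexts = [[0.6, 0.0], [0.0, 0.8], [0.5, 0.5]]
theta_1 = [1.0, -0.5]
theta_2 = [0.2, 0.7]

# Left- and right-hand sides of Eq (28).
lhs = sum(mnl_probabilities(contexts, theta_1)) - sum(mnl_probabilities(contexts, theta_2))
rhs = sum(
    a * (dot(x, theta_1) - dot(x, theta_2))
    for a, x in zip(alpha(contexts, theta_2, theta_1), contexts)
)
print(lhs, rhs)
```

Because the quotient is unchanged when both numerator and denominator flip sign, `alpha` returns the same values when its two parameter arguments are swapped.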

It follows that:

	$\displaystyle g(\theta_{1})-g(\theta_{2})=$	$\displaystyle\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\del{\frac{e^{x_{s,i}^{% \top}\theta_{1}}}{1+\sum_{j\in\mathcal{Q}_{s}}e^{x_{s,j}^{\top}\theta_{1}}}-% \frac{e^{x_{s,i}^{\top}\theta_{2}}}{1+\sum_{j\in\mathcal{Q}_{s}}e^{x_{s,j}^{% \top}\theta_{2}}}}x_{s,i}$
		$\displaystyle+\lambda_{t}(\theta_{1}-\theta_{2})$
	$\displaystyle=$	$\displaystyle\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{2},\theta_{1})x_{s,i}x_{s,i}^{\top}(\theta_{1}-\theta_{2})+\lambda_{t}(\theta_{1}-\theta_{2})$
	$\displaystyle=$	$\displaystyle\mathbf{G}_{t}(\theta_{2},\theta_{1})(\theta_{1}-\theta_{2}),$

where $\mathbf{G}_{t}\del{\theta_{1},\theta_{2}}\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}\del{\mathbf{X}_{\mathcal{Q}_{s}},\theta_{1},\theta_{2}}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}$ . Since $\alpha_{i}\del{\mathbf{X}_{\mathcal{Q}_{s}},\theta_{1},\theta_{2}}\geq\frac{1}{\kappa}$ (from Assumption 2), we have $\mathbf{G}_{t}(\theta_{1},\theta_{2})\succ\mathbf{O}_{d\times d}$ . Hence we get:

\enVert{\theta_{1}-\theta_{2}}_{\mathbf{G}_{t}(\theta_{2},\theta_{1})}=\enVert% {g(\theta_{1})-g(\theta_{2})}_{\mathbf{G}_{t}^{-1}(\theta_{2},\theta_{1})}.

(29)
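The identity in Eq (29) is a one-line consequence of the preceding display. Writing $v=\theta_{1}-\theta_{2}$ and $\mathbf{G}=\mathbf{G}_{t}(\theta_{2},\theta_{1})$ (shorthand introduced only for this sketch), and using that $\mathbf{G}$ is symmetric and invertible with $g(\theta_{1})-g(\theta_{2})=\mathbf{G}v$:

```latex
\begin{aligned}
\left\|g(\theta_{1})-g(\theta_{2})\right\|^{2}_{\mathbf{G}^{-1}}
=(\mathbf{G}v)^{\top}\mathbf{G}^{-1}(\mathbf{G}v)
=v^{\top}\mathbf{G}\,\mathbf{G}^{-1}\mathbf{G}\,v
=v^{\top}\mathbf{G}v
=\left\|\theta_{1}-\theta_{2}\right\|^{2}_{\mathbf{G}}.
\end{aligned}
```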

A.3 Self-Concordance Style Relations for Multinomial Logistic Function

Lemma 12.

For an assortment $\mathcal{Q}_{s}$ and $\theta_{1},\theta_{2}\,\in\,\Theta$ , the following holds:

	$\displaystyle\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}}% ,\theta_{2},\theta_{1})$	$\displaystyle=\sum_{i\in\mathcal{Q}_{s}}\int_{\nu=0}^{1}\dot{\mu}_{i}\del{\nu% \mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2}+\del{1-\nu}\mathbf{X}_{\mathcal{% Q}_{s}}^{\top}\theta_{1}}\cdot d\nu$
		$\displaystyle\geq\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{% Q}_{s}}^{\top}\theta_{1})\del{1+\|x_{s,i}^{\top}\theta_{1}-x_{s,i}^{\top}\theta% _{2}\|}^{-1}$

Proof.

We write:

\displaystyle\sum_{i\in\mathcal{Q}_{s}}\int_{\nu=0}^{1}\dot{\mu}_{i}\del{\nu\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2}+\del{1-\nu}\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}}\cdot d\nu

\displaystyle=\sum_{i\in\mathcal{Q}_{s}}\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,% i}^{\top}\theta_{2}}\frac{1}{x_{s,i}^{\top}(\theta_{2}-\theta_{1})}\dot{\mu}_{% i}(t_{i})\cdot dt_{i},

(30)

where $\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i}^{\top}\theta_{2}}\dot{\mu}_{i}(t_{i})% \cdot dt_{i}$ represents integration of $\dot{\mu}(\cdot)$ with respect to the coordinate $t_{i}$ ( hence the limits of the integration only consider change in the coordinate $t_{i}$ ). For some $z>z_{1}\,\in\,\mathbb{R}$ , consider:

\displaystyle\int_{z_{1}}^{z}\frac{d}{dt_{i}}\log\del{\dot{\mu}_{i}(t_{i})}% \cdot dt_{i}=\int_{z_{1}}^{z}\frac{\nabla^{2}\mu_{i,i}(t_{i})}{\dot{\mu}_{i}(t% _{i})}dt_{i},

where $\nabla^{2}\mu_{i,i}(\cdot)$ is the second derivative of $\mu_{i}(\cdot)$ with respect to $t_{i}$ . Using Lemma 16, we have $-1\leq\frac{\nabla^{2}\mu_{i,i}(\cdot)}{\dot{\mu}_{i}(\cdot)}\leq 1$ . Thus we get:

-(z-z_{1})\leq\int_{z_{1}}^{z}\frac{d}{dt_{i}}\log\del{\dot{\mu}_{i}(t_{i})}\cdot dt_{i}\leq(z-z_{1}).

Using the Fundamental Theorem of Calculus, we get:

		$\displaystyle-(z-z_{1})\leq\log\del{\dot{\mu}_{i}(z)}-\log\del{\dot{\mu}_{i}(z% _{1})}\leq(z-z_{1})$
	$\displaystyle\therefore$	$\displaystyle~{}\dot{\mu}_{i}(z_{1})\exp(-(z-z_{1}))\leq\dot{\mu}_{i}(z)\leq% \dot{\mu}_{i}(z_{1})\exp(z-z_{1})$		(31)

Using Eq (31), for $z_{2}\geq z_{1}\,\in\,\mathbb{R}$ and for all $i\,\in\,[K]$ , we have:

		$\displaystyle\dot{\mu}_{i}(z_{1})\del{1-\exp(-(z_{2}-z_{1}))}\leq\int_{z_{1}}^% {z_{2}}\dot{\mu}(t_{i})dt_{i}\leq\dot{\mu}_{i}(z_{1})\del{\exp(z_{2}-z_{1})-1}$
	$\displaystyle\therefore$	$\displaystyle~{}\dot{\mu}_{i}(z_{1})\frac{1-\exp(-(z_{2}-z_{1}))}{z_{2}-z_{1}}% \leq\frac{1}{z_{2}-z_{1}}\int_{z_{1}}^{z_{2}}\dot{\mu}(t_{i})dt_{i}\leq\dot{% \mu}_{i}(z_{1})\frac{\exp(z_{2}-z_{1})-1}{z_{2}-z_{1}}.$		(32)

Reversing the roles of $z_{1}$ and $z_{2}$ , so that $z_{2}\leq z_{1}$ , and again using Eq (31), we write:

\displaystyle\dot{\mu}_{i}(z_{1})\frac{\exp(-(z_{1}-z_{2}))-1}{z_{2}-z_{1}}\leq\frac{1}{z_{2}-z_{1}}\int_{z_{1}}^{z_{2}}\dot{\mu}(t_{i})dt_{i}\leq\dot{\mu}_{i}(z_{1})\frac{\exp(z_{1}-z_{2})-1}{z_{1}-z_{2}}.

(33)

Combining Eq (32) and (33), for all $i\,\in\,[K]$ we get:

\displaystyle\dot{\mu}_{i}(z_{1})\frac{1-\exp(-|z_{1}-z_{2}|)}{|z_{1}-z_{2}|}% \leq\frac{1}{z_{2}-z_{1}}\int_{z_{1}}^{z_{2}}\dot{\mu}(t_{i})dt_{i}.

(34)

If $x\geq 0$ , then $e^{-x}\leq(1+x)^{-1}$ , and therefore $(1-e^{-x})/x\geq(1+x)^{-1}$ . Thus we lower bound the left hand side of Eq (34) as:

\displaystyle\dot{\mu}_{i}(z_{1})\del{1+|z_{1}-z_{2}|}^{-1}\leq\dot{\mu}_{i}(z% _{1})\frac{1-\exp(-|z_{1}-z_{2}|)}{|z_{1}-z_{2}|}\leq\frac{1}{z_{2}-z_{1}}\int% _{z_{1}}^{z_{2}}\dot{\mu}(t_{i})dt_{i}.
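The elementary inequality $(1-e^{-x})/x\geq(1+x)^{-1}$ used in this step can be sanity-checked numerically. The sketch below is only an illustration on a hypothetical grid of $x$ values, not a proof; the name `lower_ratio` is introduced here for convenience.

```python
import math


def lower_ratio(x):
    """(1 - e^{-x}) / x for x > 0, using expm1 for accuracy near zero."""
    return -math.expm1(-x) / x


# Hypothetical grid of gaps |z_1 - z_2|, from 1e-6 up to 100.
grid = [10.0 ** k for k in range(-6, 3)]
gaps = [lower_ratio(x) - 1.0 / (1.0 + x) for x in grid]
for x, gap in zip(grid, gaps):
    print(x, gap)
```

Every entry of `gaps` is non-negative; the gap vanishes as $x\to 0$, where both sides tend to $1$.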

Using the above with $z_{2}=x_{s,i}^{\top}\theta_{2}$ and $z_{1}=x_{s,i}^{\top}\theta_{1}$ in Eq (30) gives:

		$\displaystyle\sum_{i\in\mathcal{Q}_{s}}\int_{\nu=0}^{1}\dot{\mu}_{i}\del{\nu% \mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2}+\del{1-\nu}\mathbf{X}_{\mathcal{% Q}_{s}}^{\top}\theta_{1}}\cdot d\nu$
	$\displaystyle=$	$\displaystyle\sum_{i\in\mathcal{Q}_{s}}\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i% }^{\top}\theta_{2}}\frac{1}{x_{s,i}^{\top}(\theta_{2}-\theta_{1})}\dot{\mu}_{i% }(t_{i})\cdot dt_{i}\geq\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{% \mathcal{Q}_{s}}^{\top}\theta_{1})\del{1+\|x_{s,i}^{\top}\theta_{1}-x_{s,i}^{% \top}\theta_{2}\|}^{-1}.$

∎

Lemma 4.

For all $\theta_{1},\theta_{2}\,\in\,\Theta$ , with $S\coloneqq\max_{\theta\,\in\,\Theta}\enVert{\theta}_{2}$ (Assumption 1), the following inequalities hold:

	$\displaystyle\mathbf{G}_{t}(\theta_{1},\theta_{2})\succeq(1+2S)^{-1}\mathbf{H}% _{t}(\theta_{1})$
	$\displaystyle\mathbf{G}_{t}(\theta_{1},\theta_{2})\succeq(1+2S)^{-1}\mathbf{H}% _{t}(\theta_{2})$

Proof.

From Lemma 12, we have:

$\displaystyle\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}}% ,\theta_{2},\theta_{1})$	$\displaystyle\geq\sum_{i\in\mathcal{Q}_{s}}\del{1+\|x_{s,i}^{\top}\theta_{1}-x_% {s,i}^{\top}\theta_{2}\|}^{-1}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}% \theta_{1})$
	$\displaystyle\geq\sum_{i\in\mathcal{Q}_{s}}\left(1+\enVert{x_{s,i}}_{2}\enVert{\theta_{1}-\theta_{2}}_{2}\right)^{-1}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})$	(Cauchy-Schwarz)
	$\displaystyle\geq\sum_{i\in\mathcal{Q}_{s}}\left(1+2S\right)^{-1}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})$	( $\theta_{1},\theta_{2}\in\Theta,\,\|x_{s,i}\|_{2}\leq 1$ )

Now we write $\mathbf{G}_{t}(\theta_{1},\theta_{2})$ as:

	$\displaystyle\mathbf{G}_{t}(\theta_{1},\theta_{2})$	$\displaystyle=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_% {\mathcal{Q}_{s}},\theta_{2},\theta_{1})x_{s,i}x_{s,i}^{\top}+\lambda_{t}% \mathbf{I}_{d}$
		$\displaystyle\succeq(1+2S)^{-1}\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{% \mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})x_{s,i}x_{s,i}^{\top}+% \lambda_{t}\mathbf{I}_{d}$
		$\displaystyle=(1+2S)^{-1}\del{\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{% \mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})x_{s,i}x_{s,i}^{\top}+(% 1+2S)\lambda_{t}\mathbf{I}_{d}}$
		$\displaystyle\succeq(1+2S)^{-1}\del{\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}% \dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})x_{s,i}x_{s,i}^{% \top}+\lambda_{t}\mathbf{I}_{d}}$
		$\displaystyle=(1+2S)^{-1}\mathbf{H}_{t}(\theta_{1}).$

Since $\theta_{1}$ and $\theta_{2}$ play symmetric roles in the definition of $\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{2},\theta_{1})$ , the second relation follows directly by exchanging the two variables. ∎

The following Lemma presents a crucial bound over the deviation $(\theta-\theta_{*})$ , which we extensively use in our derivations.

Lemma 5.

For $\theta\in C_{t}\del{\delta}$ , we have the following relation with probability at least $1-\delta$ :

\displaystyle\enVert{\theta-\theta_{*}}_{\mathbf{H}_{t}(\theta)}\leq 2(1+2S)% \gamma_{t}(\delta).

(35)

Proof.

Since $\theta,\theta_{*}\in\Theta$ , then by Lemma 4, it follows that:

\enVert{\theta-\theta_{*}}_{\mathbf{H}_{t}(\theta)}\leq\sqrt{1+2S}\enVert{% \theta-\theta_{*}}_{\mathbf{G}_{t}(\theta,\theta_{*})}.

From the triangle inequality, we write:

\displaystyle\enVert{g(\theta_{*})-g(\theta)}_{\mathbf{G}_{t}^{-1}(\theta,% \theta_{*})}\leq\enVert{g(\theta_{*})-g(\hat{\theta}_{t})}_{\mathbf{G}_{t}^{-1% }(\theta,\theta_{*})}+\enVert{g(\hat{\theta}_{t})-g(\theta)}_{\mathbf{G}_{t}^{% -1}(\theta,\theta_{*})},

where $\hat{\theta}_{t}$ is the MLE estimate. Further Lemma 4 gives:

	$\displaystyle\enVert{g(\theta_{*})-g(\theta)}_{\mathbf{G}_{t}^{-1}(\theta,\theta_{*})}$	$\displaystyle\leq\sqrt{1+2S}\enVert{g(\theta_{*})-g(\hat{\theta}_{t})}_{\mathbf{H}_{t}^{-1}(\theta_{*})}$
		$\displaystyle+\sqrt{1+2S}\enVert{g(\hat{\theta}_{t})-g(\theta)}_{\mathbf{H}_{t}^{-1}(\theta)}.$

Since $\theta$ is the minimizer of $\enVert{g(\theta)-g(\hat{\theta}_{t})}_{\mathbf{H}_{t}^{-1}(\theta)}$ , we write:

\displaystyle\enVert{g(\theta_{*})-g(\theta)}_{\mathbf{G}_{t}^{-1}(\theta,% \theta_{*})}\leq 2\sqrt{1+2S}\enVert{g(\theta_{*})-g(\hat{\theta}_{t})}_{% \mathbf{H}_{t}^{-1}(\theta_{*})}.

Finally, Eq (35) follows by an application of Lemma 10 as:

\displaystyle\enVert{g(\theta_{*})-g(\hat{\theta}_{t})}_{\mathbf{H}_{t}^{-1}(\theta_{*})}

\displaystyle\leq\gamma_{t}(\delta).

∎

A.4 Bounds on prediction error

Lemma 6.

For the assortment chosen by the algorithm CB-MNL, $\mathcal{Q}_{t}$ as given by Eq (12) and any $\theta\in C_{t}(\delta)$ the following holds with probability at least $1-\delta$ :

\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)

\displaystyle\leq\dot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{% *}}+2(1+2S)M\gamma_{t}(\delta)\enVert{x_{t,i}}_{\mathbf{H}_{t}^{-1}(\theta_{*}% )}.

Proof.

Consider the multinomial logit function:

\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)x_{t,i}% ^{\top}(\theta-\theta_{*})=\frac{e^{x_{t,i}^{\top}\theta}}{1+\sum_{j\in% \mathcal{Q}_{t}}e^{x_{t,j}^{\top}\theta}}-\frac{e^{x_{t,i}^{\top}\theta_{*}}}{% 1+\sum_{j\in\mathcal{Q}_{t}}e^{x_{t,j}^{\top}\theta_{*}}}.

(36)

We use a second-order Taylor expansion of each component of the multinomial logit function at $a_{i}$ . Consider, for all $i\,\in\,[K]$ :

	$\displaystyle f_{i}(r_{i})$	$\displaystyle=\frac{e^{r_{i}}}{1+e^{r_{i}}+\sum_{j\in\mathcal{Q}_{s},j\neq i}e% ^{r_{j}}}$
		$\displaystyle\leq f_{i}(a_{i})+f_{i}^{\prime}(a_{i})(r_{i}-a_{i})+\frac{f_{i}^{\prime\prime}(a_{i})(r_{i}-a_{i})^{2}}{2}.$		(37)

In Eq (37), we substitute: $f_{i}(\cdot)\to\mu_{i}$ , $r_{i}\to x_{t,i}^{\top}\theta$ , and $a_{i}\to x_{t,i}^{\top}\theta_{*}$ . Thus we re-write Eq (36) as:

	$\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)x_{t,i}^{\top}(\theta-\theta_{*})$	$\displaystyle\leq\dot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}}\del{x_{t,i}^{\top}(\theta-\theta_{*})}+\ddot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}}\del{x_{t,i}^{\top}(\theta-\theta_{*})}^{2},$
	$\displaystyle\therefore~{}~{}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)$	$\displaystyle\leq\dot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}}+\ddot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}}\envert{x_{t,i}^{\top}(\theta-\theta_{*})}$
		$\displaystyle\leq\dot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}}+M\envert{x_{t,i}^{\top}(\theta-\theta_{*})},$

where we upper bound $\ddot{\mu}_{i}$ by $M$ . An application of Cauchy-Schwarz gives us:

\displaystyle\envert{x_{t,i}^{\top}(\theta_{*}-\theta)}

\displaystyle\leq\enVert{x_{t,i}}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\enVert{% \theta_{*}-\theta}_{\mathbf{H}_{t}(\theta_{*})}

(38)

Upon combining the last two equations we get:

\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)

\displaystyle\leq\dot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}}+M\enVert{x_{t,i}}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\enVert{\theta_{*}-\theta}_{\mathbf{H}_{t}(\theta_{*})}.

From Lemma 5 we get:

\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)

\displaystyle\leq\dot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{% *}}+2(1+2S)M\gamma_{t}(\delta)\enVert{x_{t,i}}_{\mathbf{H}_{t}^{-1}(\theta_{*}% )}.

∎

Lemma 3.

For the assortment chosen by the algorithm CB-MNL, $\mathcal{Q}_{t}$ as given by Eq (12) and any $\theta\in C_{t}(\delta)$ the following holds with probability at least $1-\delta$ :

	$\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta)\leq$	$\displaystyle\del{2+4S}\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\dot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}}\enVert{x_{t,i}}_{\mathbf{H}_{t}^{-1}(\theta_{*})}$
		$\displaystyle+4\kappa(1+2S)^{2}M\gamma_{t}(\delta)^{2}\sum_{i\in\mathcal{Q}_{t% }}\enVert{x_{t,i}}^{2}_{\mathbf{V}_{t}^{-1}}$

Proof.

$\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta)$	$\displaystyle=~{}\envert{\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta)-\mu(% \mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})}$
$\displaystyle=$	$\displaystyle~{}\envert{\sum_{i\in\mathcal{Q}_{t}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)x_{t,i}^{\top}(\theta-\theta_{*})}$	(From Eq (28))
$\displaystyle\leq$	$\displaystyle~{}\envert{\sum_{i\in\mathcal{Q}_{t}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)\enVert{x_{t,i}}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\enVert{\theta_{*}-\theta}_{\mathbf{H}_{t}(\theta_{*})}}$	(Cauchy-Schwarz inequality and Eq (29))
$\displaystyle\leq$	$\displaystyle~{}2(1+2S)\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\envert{\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)\enVert{x_{t,i}}_{\mathbf{H}_{t}^{-1}(\theta_{*})}}$	(From Lemma 5)
$\displaystyle\leq$	$\displaystyle~{}2(1+2S)\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\left(\dot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}}\enVert{x_{t,i}}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\right.$
	$\displaystyle+\left.2(1+2S)M\gamma_{t}(\delta)\enVert{x_{t,i}}^{2}_{\mathbf{H}% _{t}^{-1}(\theta_{*})}\right)$	(From Lemma 6)

Upon re-arranging the terms we get:

	$\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta)\leq$	$\displaystyle\del{2+4S}\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\dot{\mu}_{i}\del{\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}}\enVert{x_{t,i}}_{\mathbf{H}_{t}^{-1}(\theta_{*})}$
		$\displaystyle+4\kappa(1+2S)^{2}M\gamma_{t}(\delta)^{2}\sum_{i\in\mathcal{Q}_{t% }}\enVert{x_{t,i}}^{2}_{\mathbf{V}_{t}^{-1}},$

where we use $\mathbf{H}_{t}(\theta_{*})\succeq\kappa^{-1}\mathbf{V}_{t}$ , equivalently $\mathbf{H}_{t}^{-1}(\theta_{*})\preceq\kappa\mathbf{V}_{t}^{-1}$ , from Assumption 2. ∎

Corollary 7.

For the assortment chosen by the algorithm CB-MNL, $\mathcal{Q}_{t}$ as given by Eq (12) and any $\theta\in C_{t}(\delta)$ the following holds with probability at least $1-\delta$ :

	$\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta)\leq$	$\displaystyle 2\del{1+2S}\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\enVert{% \tilde{x}_{t,i}}_{\mathbf{J}_{t}^{-1}}$
		$\displaystyle+4\kappa(1+2S)^{2}M\gamma_{t}(\delta)^{2}\sum_{i\in\mathcal{Q}_{t% }}\enVert{x_{t,i}}^{2}_{\mathbf{V}_{t}^{-1}},$

where $\tilde{x}_{t,i}=\sqrt{\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_% {*})}x_{t,i}$ and $\enVert{x}_{\mathbf{H}^{-1}_{t}(\theta_{*})}=\enVert{x}_{\mathbf{J}^{-1}_{t}}$ .

Proof.

This directly follows from the uniqueness and realizability of $\theta_{*}$ .

∎

A.5 Regret calculation

The following two lemmas give the upper bounds on the self-normalized vector summations.

Lemma 13.

\displaystyle\sum_{t=1}^{T}\min\cbr{\sum_{i\in\mathcal{Q}_{t}}\enVert{\tilde{x}_{t,i}}^{2}_{\mathbf{J}^{-1}_{T+1}(\theta)},1}

\displaystyle\leq~{}2d\log\del{1+\frac{LKT}{d\lambda_{t}}}.

Proof.

The proof follows by a direct application of Lemma 17 and 18 as:

	$\displaystyle\sum_{t=1}^{T}\min\cbr{\sum_{i\in\mathcal{Q}_{t}}\enVert{\tilde{x% }_{t,i}}^{2}_{\mathbf{J}^{-1}_{T+1}(\theta)},1}$
$\displaystyle\leq$	$\displaystyle~{}2\log\del{\frac{\det(\mathbf{J}_{T+1})}{\lambda_{t}^{d}}}$	(From Lemma 17)
$\displaystyle=$	$\displaystyle~{}2\log\del{\frac{\det\del{\sum_{s=1}^{T}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}}}{\lambda_{t}^{d}}}$
$\displaystyle\leq$	$\displaystyle~{}2\log\del{\frac{\det\del{\sum_{s=1}^{T}\sum_{i\in\mathcal{Q}_{s}}Lx_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}}}{\lambda_{t}^{d}}}$	(Upper bound by Lipschitz constant)
$\displaystyle=$	$\displaystyle~{}2\log\del{\frac{L^{d}\det\del{\sum_{s=1}^{T}\sum_{i\in\mathcal{Q}_{s}}x_{s,i}x_{s,i}^{\top}+\nicefrac{{\lambda_{t}}}{{L}}\mathbf{I}_{d}}}{\lambda_{t}^{d}}}$
$\displaystyle\leq$	$\displaystyle~{}2d\log\del{1+\frac{LKT}{d\lambda_{t}}}.$	(From Lemma 18)

∎

Similar to Lemma 13, we prove the following.

Lemma 14.

\displaystyle\sum_{t=1}^{T}\min\cbr{\sum_{i\in\mathcal{Q}_{t}}\enVert{x_{t,i}}% ^{2}_{\mathbf{V}^{-1}_{T+1}(\theta)},1}

\displaystyle\leq~{}2d\log\del{1+\frac{KT}{d\lambda_{t}}}.

Proof.

	$\displaystyle\sum_{t=1}^{T}\min\cbr{\sum_{i\in\mathcal{Q}_{t}}\enVert{x_{t,i}}% ^{2}_{\mathbf{V}^{-1}_{T+1}(\theta)},1}$
$\displaystyle\leq$	$\displaystyle~{}2\log\del{\frac{\det(\mathbf{V}_{T+1})}{\lambda_{t}^{d}}}$	(From Lemma 17, set $\dot{\underline{\mu_{i}}}(\cdot)=1$ )
$\displaystyle=$	$\displaystyle~{}2\log\del{\frac{\det\del{\sum_{s=1}^{T}\sum_{i\in\mathcal{Q}_{s}}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}}}{\lambda_{t}^{d}}}$
$\displaystyle\leq$	$\displaystyle~{}2d\log\del{1+\frac{KT}{d\lambda_{t}}}.$	(From Lemma 18)

∎

Theorem 1.

With probability at least $1-\delta$ :

\displaystyle\mathbf{R}_{T}\leq

\displaystyle C_{1}\gamma_{t}(\delta)\sqrt{2d\log\del{1+\frac{LKT}{d\lambda_{t% }}}T}+C_{2}\kappa\gamma_{t}(\delta)^{2}d\log\del{1+\frac{KT}{d\lambda_{t}}},

where the constants are given as $C_{1}=\del{4+8S}$ , $C_{2}=4(4+8S)^{\nicefrac{{3}}{{2}}}M$ and $\gamma_{t}(\delta)$ is given by Eq (8).

Proof.

The regret is upper bounded by the prediction error.

$\displaystyle\mathbf{R}_{T}\leq$	$\displaystyle\sum_{t=1}^{T}\min\cbr{\Delta^{\text{pred}}\del{\mathbf{X}_{% \mathcal{Q}_{t}},\theta_{t}^{\,\text{est}}},1}$	( $R_{\max}=1$ )
$\displaystyle\leq$	$\displaystyle\sum_{t=1}^{T}\min\left\{\del{2+4S}\gamma_{t}(\delta)\sum_{i\in% \mathcal{Q}_{t}}\enVert{x_{t,i}}_{\mathbf{J}_{t}^{-1}}\right.$
	$\displaystyle+\left.8\kappa(1+2S)^{2}M\gamma_{t}(\delta)^{2}\sum_{i\in\mathcal{Q}_{t}}\enVert{x_{t,i}}^{2}_{\mathbf{V}_{t}^{-1}},1\right\}$	(From Corollary 7)
$\displaystyle\leq$	$\displaystyle 2\del{1+2S}\gamma_{t}(\delta)\sum_{t=1}^{T}\min\cbr{\sum_{i\in% \mathcal{Q}_{t}}\enVert{x_{t,i}}_{\mathbf{J}_{t}^{-1}},1}$
	$\displaystyle+8(1+2S)^{2}\kappa M\gamma_{t}(\delta)^{2}\sum_{t=1}^{T}\min\cbr{% \sum_{i\in\mathcal{Q}_{t}}\enVert{x_{t,i}}^{2}_{\mathbf{V}_{t}^{-1}},1}$
$\displaystyle\leq$	$\displaystyle 2\del{1+2S}\gamma_{t}(\delta)\sqrt{T}\sqrt{\sum_{t=1}^{T}\min% \cbr{\sum_{i\in\mathcal{Q}_{t}}\enVert{x_{t,i}}^{2}_{\mathbf{J}_{t}^{-1}},1}}$
	$\displaystyle+8(1+2S)^{2}\kappa M\gamma_{t}(\delta)^{2}\sum_{t=1}^{T}\min\cbr{% \sum_{i\in\mathcal{Q}_{t}}\enVert{x_{t,i}}^{2}_{\mathbf{V}_{t}^{-1}},1}$	(Using Cauchy-Schwarz inequality)
$\displaystyle\leq$	$\displaystyle 2\del{1+2S}\gamma_{t}(\delta)\sqrt{2d\log\del{1+\frac{LKT}{d% \lambda_{t}}}T}$
	$\displaystyle+8(1+2S)^{2}\kappa M\gamma_{t}(\delta)^{2}d\log\del{1+\frac{KT}{d% \lambda_{t}}}.$	(From Lemma 13 and 14)

For the choice $\lambda_{t}=d\log(KT)$ , we have $\gamma_{t}(\delta)=\mathrm{O}\del{d^{\nicefrac{{1}}{{2}}}\log^{\nicefrac{{1}}{{2}}}\del{KT}}$ . ∎
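To get a feel for the two terms of the bound, the sketch below evaluates the last display of the proof for a few horizons. It assumes that $\gamma_{t}(\delta)$ from Eq (8) has the form implied by Eq (24), namely $\gamma_{t}(\delta)=\frac{\sqrt{\lambda_{t}}}{2}+\frac{2}{\sqrt{\lambda_{t}}}\log\big((1+\frac{LKt}{\lambda_{t}d})^{d/2}/\delta\big)+\frac{2d}{\sqrt{\lambda_{t}}}\log 2+\sqrt{\lambda_{t}}S$, and it uses the constants $2(1+2S)$ and $8(1+2S)^{2}\kappa M$ from the proof rather than $C_{1},C_{2}$. The function names and all numeric values of $d,K,S,L,M,\kappa,\delta$ are hypothetical.

```python
import math


def gamma(t, d, K, S, L, delta, lam):
    """Confidence radius gamma_t(delta), in the form implied by Eq (24) (assumed)."""
    log_term = 0.5 * d * math.log(1.0 + L * K * t / (lam * d)) - math.log(delta)
    return (
        math.sqrt(lam) / 2.0
        + 2.0 / math.sqrt(lam) * log_term
        + 2.0 * d / math.sqrt(lam) * math.log(2.0)
        + math.sqrt(lam) * S
    )


def regret_bound_terms(T, d, K, S, L, M, kappa, delta):
    """Return (sqrt(T) term, kappa term) from the last display of the proof."""
    lam = d * math.log(K * T)  # regularization choice stated at the end of the proof
    g = gamma(T, d, K, S, L, delta, lam)
    leading = 2.0 * (1.0 + 2.0 * S) * g * math.sqrt(
        2.0 * d * math.log(1.0 + L * K * T / (d * lam)) * T
    )
    kappa_term = (
        8.0 * (1.0 + 2.0 * S) ** 2 * kappa * M * g ** 2
        * d * math.log(1.0 + K * T / (d * lam))
    )
    return leading, kappa_term


# Hypothetical problem sizes; none of these values come from the paper.
params = dict(d=5, K=4, S=1.0, L=0.25, M=0.25, delta=0.05)
for T in (10**4, 10**6, 10**8):
    leading, kappa_term = regret_bound_terms(T, kappa=100.0, **params)
    print(T, round(leading, 1), round(kappa_term, 1), round(kappa_term / leading, 3))
```

Only the second term depends on $\kappa$, and it grows polylogarithmically in $T$, so its ratio to the $\sqrt{T}$ term shrinks as the horizon grows.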

A.6 Convex relaxation

Lemma 8.

$E_{t}\del{\delta}\supseteq C_{t}\del{\delta}$ , therefore for any $\theta\in C_{t}(\delta)$ , we also have $\theta\,\in\,E_{t}(\delta)$ (see Eq (7)).

Proof.

Let $\hat{\theta}_{t}$ be the maximum likelihood estimate (see Eq (5)). The second-order Taylor series expansion of the log-loss (with integral remainder term) at any $\theta\in\mathbb{R}^{d}$ is given by:

	$\displaystyle\mathcal{L}^{\lambda}_{t}(\theta)=$	$\displaystyle\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})+\nabla\mathcal{L}^{% \lambda}_{t}(\hat{\theta}_{t})^{\top}(\theta-\hat{\theta}_{t})$
		$\displaystyle+(\theta-\hat{\theta}_{t})^{\top}\del{\int^{1}_{\nu=0}(1-\nu)\nabla^{2}\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t}+\nu(\theta-\hat{\theta}_{t}))\cdot d\nu}(\theta-\hat{\theta}_{t})$		(40)

$\nabla\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})=0$ by definition, since $\hat{\theta}_{t}$ is the maximum likelihood estimate. Therefore:

$\displaystyle\mathcal{L}^{\lambda}_{t}(\theta)$	$\displaystyle=\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})+(\theta-\hat{\theta}% _{t})^{\top}\del{\int^{1}_{\nu=0}(1-\nu)\nabla^{2}\mathcal{L}^{\lambda}_{t}(% \hat{\theta}_{t}+\nu(\theta-\hat{\theta}_{t}))\cdot d\nu}(\theta-\hat{\theta}_% {t})$
	$\displaystyle=\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})+(\theta-\hat{\theta}% _{t})^{\top}\del{\int^{1}_{\nu=0}(1-\nu)\mathbf{H}_{t}(\hat{\theta}_{t}+\nu(% \theta-\hat{\theta}_{t}))\cdot d\nu}(\theta-\hat{\theta}_{t})$	( $\nabla^{2}\mathcal{L}^{\lambda}_{t}(\cdot)=\mathbf{H}_{t}(\cdot)$ )
	$\displaystyle\leq\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})+\enVert{\theta-% \hat{\theta}_{t}}^{2}_{\mathbf{G}_{t}(\theta,\hat{\theta}_{t})}$	(def. of $\mathbf{G}_{t}(\theta,\hat{\theta}_{t})$ )
	$\displaystyle\leq\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})+\enVert{g_{t}(% \theta)-g_{t}(\hat{\theta}_{t})}^{2}_{\mathbf{G}^{-1}_{t}(\theta,\hat{\theta}_% {t})}.$	(Eq (29))

Thus we obtain:

	$\displaystyle\mathcal{L}^{\lambda}_{t}(\theta)-\mathcal{L}^{\lambda}_{t}(\hat{% \theta}_{t})$	$\displaystyle\leq\enVert{g_{t}(\theta)-g_{t}(\hat{\theta_{t}})}^{2}_{\mathbf{G% }_{t}^{-1}(\theta,\hat{\theta}_{t})}$
		$\displaystyle\leq\del{\frac{\gamma_{t}^{2}(\delta)}{\lambda_{t}}+\gamma_{t}(% \delta)}^{2}=\beta^{2}_{t}(\delta),$		(from Lemma 15)

where the last inequality suggests that $\theta\in E_{t}(\delta)$ by the definition of the set $E_{t}(\delta)$ . Therefore, $\mathbb{P}\del{\forall t\geq 1,\theta_{*}\in E_{t}(\delta)}\geq 1-\delta$ . ∎

The following helper lemma translates the confidence set definition of Lemma 10 to the norm defined by $\mathbf{G}_{t}^{-1}(\theta_{1},\theta_{2})$ .

Lemma 15.

Let $\delta\in(0,1]$ . For all $\theta\in C_{t}(\delta)$ , with $\hat{\theta}_{t}$ the maximum likelihood estimate in Eq (5), we have:

\enVert{g_{t}(\theta)-g_{t}(\hat{\theta_{t}})}_{\mathbf{G}_{t}^{-1}(\theta,% \hat{\theta}_{t})}\leq\frac{\gamma_{t}^{2}(\delta)}{\lambda_{t}}+\gamma_{t}(% \delta).

Proof.

We have:

$\displaystyle\mathbf{G}_{t}\del{\theta,\hat{\theta}_{t}}$	$\displaystyle=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}\del{\mathbf% {X}_{\mathcal{Q}_{s}},\theta,\hat{\theta}_{t}}x_{s,i}x_{s,i}^{\top}+\lambda_{t% }\mathbf{I}_{d}$	(def. of $\mathbf{G}_{t}\del{\theta,\hat{\theta}_{t}}$ )
	$\displaystyle\geq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(% \mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)\del{1+\|x_{s,i}^{\top}\theta-x_{s,i}% ^{\top}\hat{\theta}_{t}\|}^{-1}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}$	(from Lemma 12)
	$\displaystyle\geq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(% \mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)\del{1+\enVert{x_{s,i}}_{\mathbf{G}^% {-1}_{t}\del{\theta,\hat{\theta}_{t}}}\enVert{\theta-\hat{\theta}_{t}}_{% \mathbf{G}_{t}\del{\theta,\hat{\theta}_{t}}}}^{-1}x_{s,i}x_{s,i}^{\top}$
	$\displaystyle+\lambda_{t}\mathbf{I}_{d}$	(Cauchy-Schwarz inequality)
	$\displaystyle\geq\del{1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\enVert{\theta-\hat{% \theta}_{t}}_{\mathbf{G}_{t}\del{\theta,\hat{\theta}_{t}}}}^{-1}\sum_{s=1}^{t-% 1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}% \theta)x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}$	( $\mathbf{G}_{t}(\theta,\hat{\theta}_{t})\succeq\lambda_{t}\mathbf{I}_{d}$ )
	$\displaystyle\geq\del{1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\enVert{\theta-\hat{% \theta}_{t}}_{\mathbf{G}_{t}\del{\theta,\hat{\theta}_{t}}}}^{-1}\del{\sum_{s=1% }^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{% \top}\theta)x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}}$
	$\displaystyle=\del{1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\enVert{\theta-\hat{% \theta}_{t}}_{\mathbf{G}_{t}\del{\theta,\hat{\theta}_{t}}}}^{-1}\mathbf{H}_{t}% \del{\theta}$	(def. of $\mathbf{H}_{t}\del{\theta}$ )
	$\displaystyle=\del{1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\enVert{g_{t}(\theta)-g% _{t}(\hat{\theta}_{t})}_{\mathbf{G}_{t}^{-1}\del{\theta,\hat{\theta}_{t}}}}^{-% 1}\mathbf{H}_{t}\del{\theta},$	(from Eq (29))

where

\mathbf{G}_{t}\del{\theta,\hat{\theta}_{t}}\succeq\del{1+\lambda_{t}^{-% \nicefrac{{1}}{{2}}}\enVert{g_{t}(\theta)-g_{t}(\hat{\theta}_{t})}_{\mathbf{G}% _{t}^{-1}\del{\theta,\hat{\theta}_{t}}}}^{-1}\mathbf{H}_{t}\del{\theta}

is the local-information-preserving counterpart of the relation in Lemma 4. This gives:

	$\displaystyle\enVert{g_{t}(\theta)-g_{t}(\hat{\theta}_{t})}^{2}_{\mathbf{G}_{t% }^{-1}\del{\theta,\hat{\theta}_{t}}}$
$\displaystyle\leq$	$\displaystyle\del{1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\enVert{g_{t}(\theta)-g_{t}(\hat{\theta}_{t})}_{\mathbf{G}_{t}^{-1}\del{\theta,\hat{\theta}_{t}}}}\enVert{g_{t}(\theta)-g_{t}(\hat{\theta}_{t})}^{2}_{\mathbf{H}_{t}^{-1}\del{\theta}}$
$\displaystyle\leq$	$\displaystyle\lambda_{t}^{-\nicefrac{{1}}{{2}}}\gamma_{t}^{2}(\delta)\enVert{g% _{t}(\theta)-g_{t}(\hat{\theta}_{t})}_{\mathbf{G}_{t}^{-1}\del{\theta,\hat{% \theta}_{t}}}+\gamma^{2}_{t}(\delta),$	(from Lemma 10)

where the last relation is a quadratic inequality in $\enVert{g_{t}(\theta)-g_{t}(\hat{\theta}_{t})}_{\mathbf{G}_{t}^{-1}\del{\theta% ,\hat{\theta}_{t}}}$ , which on solving completes the proof of the statement in the lemma. ∎

Lemma 9.

Under the event $\theta_{*}\,\in\,C_{t}(\delta)$ , the following holds $\forall\,\theta\,\in\,E_{t}(\delta)$ :

\displaystyle\enVert{\theta-\theta_{*}}_{\mathbf{H}_{t}(\theta_{*})}\leq(2+2S)% \gamma_{t}(\delta)+2\sqrt{1+S}\beta_{t}(\delta).

When $\lambda_{t}=d\log(t)$ , then $\gamma_{t}(\delta)=\tilde{\mathrm{O}}\del{\sqrt{d\log(t)}}$ , $\beta_{t}(\delta)=\tilde{\mathrm{O}}\del{\sqrt{d\log(t)}}$ , and

\enVert{\theta-\theta_{*}}_{\mathbf{H}_{t}(\theta_{*})}=\tilde{\mathrm{O}}\del% {\sqrt{d\log(t)}}.

Proof.

Second-order Taylor expansion of the log-likelihood function with integral remainder term gives:

	$\displaystyle\mathcal{L}^{\lambda}_{t}(\theta)=$	$\displaystyle\mathcal{L}^{\lambda}_{t}(\theta_{*})+\nabla\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})^{\top}(\theta-\theta_{*})$
		$\displaystyle+(\theta-\theta_{*})^{\top}\del{\int^{1}_{\nu=0}(1-\nu)\nabla^{2}\mathcal{L}^{\lambda}_{t}(\theta_{*}+\nu(\theta-\theta_{*}))\cdot d\nu}(\theta-\theta_{*})$
	$\displaystyle=$	$\displaystyle\mathcal{L}^{\lambda}_{t}(\theta_{*})+\nabla\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})^{\top}(\theta-\theta_{*})+\enVert{\theta-\theta_{*}}^{2}_{\mathbf{\tilde{G}}_{t}(\theta_{*},\theta)},$

where $\mathbf{\tilde{G}}_{t}(\theta_{*},\theta)=\int^{1}_{\nu=0}(1-\nu)\mathbf{H}_{t}(\theta_{*}+\nu(\theta-\theta_{*}))\cdot d\nu$ . From Lemma 4 and Lemma 8 of Abeille et al. [2021], it also follows that

\enVert{\theta-\theta_{*}}^{2}_{\mathbf{\tilde{G}}_{t}(\theta_{*},\theta)}\geq% (2+2S)^{-1}\enVert{\theta-\theta_{*}}^{2}_{\mathbf{H}_{t}(\theta_{*})}

Therefore we have:

	$\displaystyle\enVert{\theta-\theta_{*}}^{2}_{\mathbf{H}_{t}(\theta_{*})}$
$\displaystyle\leq$	$\displaystyle(2+2S)\envert{\mathcal{L}^{\lambda}_{t}(\theta)-\mathcal{L}^{\lambda}_{t}(\theta_{*})}+(2+2S)\envert{\nabla\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})^{\top}(\theta-\theta_{*})}$
$\displaystyle\leq$	$\displaystyle 2(2+2S)\beta_{t}^{2}(\delta)+(2+2S)\envert{\nabla\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})^{\top}(\theta-\theta_{*})}$	(def. of $E_{t}(\delta)$ )
$\displaystyle\leq$	$\displaystyle 2(2+2S)\beta_{t}^{2}(\delta)+(2+2S)\enVert{\nabla\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\enVert{\theta-\theta_{*}}_{\mathbf{H}_{t}(\theta_{*})}$	(Cauchy-Schwarz inequality)
$\displaystyle\leq$	$\displaystyle 2(2+2S)\beta_{t}^{2}(\delta)+(2+2S)\gamma_{t}(\delta)\enVert{\theta-\theta_{*}}_{\mathbf{H}_{t}(\theta_{*})}.$

Solving the quadratic inequality in $\enVert{\theta-\theta_{*}}_{\mathbf{H}_{t}(\theta_{*})}$ , we get:

$$\enVert{\theta-\theta_{*}}_{\mathbf{H}_{t}(\theta_{*})}\leq(2+2S)\gamma_{t}(\delta)+2\sqrt{1+S}\,\beta_{t}(\delta).$$
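
For completeness, the quadratic step can be spelled out as follows. The shorthand $x$, $b$, $c$ is introduced here only for illustration and is not notation used elsewhere in the paper. Set $x=\enVert{\theta-\theta_{*}}_{\mathbf{H}_{t}(\theta_{*})}$, $b=(2+2S)\gamma_{t}(\delta)$ and $c=2(2+2S)\beta_{t}^{2}(\delta)$, so that the last display reads $x^{2}\leq bx+c$. Then

$$
\begin{aligned}
x &\leq \frac{b+\sqrt{b^{2}+4c}}{2} \leq \frac{b+b+2\sqrt{c}}{2} = b+\sqrt{c}, \\
\sqrt{c} &= \sqrt{4(1+S)}\,\beta_{t}(\delta) = 2\sqrt{1+S}\,\beta_{t}(\delta),
\end{aligned}
$$

where the second inequality in the first line uses $\sqrt{b^{2}+4c}\leq b+2\sqrt{c}$ for $b,c\geq 0$.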

When $\lambda_{t}=d\log(t)$ , then $\gamma_{t}(\delta)=\tilde{\mathrm{O}}\del{\sqrt{d\log(t)}}$ and $\beta_{t}(\delta)=\tilde{\mathrm{O}}\del{\sqrt{d\log(t)}}$ . ∎

A.7 Technical lemmas

Remark 3 (Derivatives for MNL choice function).

For the multinomial logit choice function, where the expected reward due to item $i$ of the assortment $S_{t}$ is modeled as

$$f_{i}(S_{t},\mathbf{r})=\frac{e^{r_{i}}}{1+e^{r_{i}}+\sum_{j\,\in\,S_{t},\,j\neq i}^{K}e^{r_{j}}},$$

the partial derivative of $f_{i}$ with respect to $r_{i}$ is

$$\frac{\partial f_{i}}{\partial r_{i}}=f_{i}(S_{t},\mathbf{r})\del{1-f_{i}(S_{t},\mathbf{r})},$$

and the second derivative is

$$\frac{\partial^{2}f_{i}}{\partial r_{i}^{2}}=f_{i}(S_{t},\mathbf{r})\del{1-f_{i}(S_{t},\mathbf{r})}\del{1-2f_{i}(S_{t},\mathbf{r})}.$$

Lemma 16 (Self-Concordance like relation for MNL).

For the multinomial logit choice function, where the expected reward due to item $i$ of the assortment $S_{t}$ is modeled as:

$$f_{i}(S_{t},\mathbf{r})=\frac{e^{r_{i}}}{1+e^{r_{i}}+\sum_{j\,\in\,S_{t},\,j\neq i}^{K}e^{r_{j}}},$$

the following relation holds:

$$\envert{\frac{\partial^{2}f_{i}}{\partial r_{i}^{2}}}\leq\frac{\partial f_{i}}{\partial r_{i}}.$$

Proof.

The proof follows directly from Remark 3 and the observation that $\envert{1-2f_{i}(S_{t},\mathbf{r})}\leq 1$, since $f_{i}(S_{t},\mathbf{r})\in[0,1]$, for every item $i$ in the assortment. ∎
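
The closed forms in Remark 3 and the inequality in Lemma 16 can be sanity-checked numerically. The following is a minimal sketch, not part of the paper's experiments: the function names, the random utilities, and the finite-difference step `h` are all assumptions made for illustration.

```python
import numpy as np

def mnl_prob(r, i):
    """f_i(S_t, r): MNL choice probability of item i (outside option has weight 1)."""
    return np.exp(r[i]) / (1.0 + np.exp(r).sum())

def d1(r, i):
    """Closed-form first derivative from Remark 3: f_i (1 - f_i)."""
    f = mnl_prob(r, i)
    return f * (1.0 - f)

def d2(r, i):
    """Closed-form second derivative from Remark 3: f_i (1 - f_i)(1 - 2 f_i)."""
    f = mnl_prob(r, i)
    return f * (1.0 - f) * (1.0 - 2.0 * f)

def central_diff(fn, r, i, h=1e-5):
    """Central finite difference of fn(., i) with respect to r[i]."""
    up, down = r.copy(), r.copy()
    up[i] += h
    down[i] -= h
    return (fn(up, i) - fn(down, i)) / (2.0 * h)

rng = np.random.default_rng(0)
r = rng.normal(size=5)          # hypothetical utilities for an assortment of 5 items
items = range(len(r))

# Compare the closed forms against finite differences.
max_err_first = max(abs(central_diff(mnl_prob, r, i) - d1(r, i)) for i in items)
max_err_second = max(abs(central_diff(d1, r, i) - d2(r, i)) for i in items)

# Lemma 16: |second derivative| <= first derivative for every item.
lemma16_holds = all(abs(d2(r, i)) <= d1(r, i) for i in items)
print(max_err_first, max_err_second, lemma16_holds)
```

Both finite-difference errors should be close to zero, and the self-concordance-like inequality should hold for each item regardless of the utilities drawn.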

Lemma 17 (Generalized elliptical potential).

Let $\cbr{\mathbf{X}_{\mathcal{Q}_{s}}}$ be a sequence in $\mathbb{R}^{d\times K}$ such that, for each $s$, $\mathbf{X}_{\mathcal{Q}_{s}}$ has columns $\{x_{s,1},x_{s,2},\cdots,x_{s,K}\}$, where $x_{s,i}\in\mathbb{R}^{d}$ and $\enVert{x_{s,i}}_{2}\leq w$ for all $s\geq 1$ and $i\in[K]$. Also, let $\lambda_{t}$ be a non-negative scalar. For $t\geq 1$, define $\mathbf{J}_{t}\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\underline{\mu_{i}}}\del{\mathbf{X}_{\mathcal{Q}_{s}}}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}$, where $\dot{\underline{\mu_{i}}}\del{\mathbf{X}_{\mathcal{Q}_{s}}}$ is strictly positive for all $i\in[K]$. Then the following inequality holds:

$$\sum_{t=1}^{T}\min\cbr{\sum_{i\in\mathcal{Q}_{t}}\enVert{\tilde{x}_{t,i}}^{2}_{\mathbf{J}^{-1}_{t}},1}\leq 2\log\del{\frac{\det(\mathbf{J}_{T+1})}{\lambda_{t}^{d}}}$$

with $\tilde{x}_{t,i}=\sqrt{\dot{\underline{\mu_{i}}}\del{\mathbf{X}_{\mathcal{Q}_{t}}}}\,x_{t,i}$.

Proof.

By the definition of $\mathbf{J}_{t}$ :

$$
\begin{aligned}
\det\del{\mathbf{J}_{t+1}} &= \det\del{\mathbf{J}_{t}+\sum_{i\in\mathcal{Q}_{t}}\tilde{x}_{t,i}\tilde{x}_{t,i}^{\top}} \\
&= \det\del{\mathbf{J}_{t}}\det\del{\mathbf{I}_{d}+\mathbf{J}_{t}^{-\nicefrac{1}{2}}\sum_{i\in\mathcal{Q}_{t}}\tilde{x}_{t,i}\tilde{x}_{t,i}^{\top}\mathbf{J}_{t}^{-\nicefrac{1}{2}}} \\
&\geq \det\del{\mathbf{J}_{t}}\del{1+\sum_{i\in\mathcal{Q}_{t}}\enVert{\tilde{x}_{t,i}}^{2}_{\mathbf{J}^{-1}_{t}}},
\end{aligned}
$$

where the last step uses $\det(\mathbf{I}_{d}+\mathbf{M})\geq 1+\operatorname{tr}(\mathbf{M})$ for any positive semi-definite matrix $\mathbf{M}$, with equality when $\mathbf{M}$ has rank at most one. Taking logarithms on both sides and summing from $t=1$ to $T$:

$$
\begin{aligned}
\sum_{t=1}^{T}\log\del{1+\sum_{i\in\mathcal{Q}_{t}}\enVert{\tilde{x}_{t,i}}^{2}_{\mathbf{J}^{-1}_{t}}}
&\leq \sum_{t=1}^{T}\del{\log\det(\mathbf{J}_{t+1})-\log\det(\mathbf{J}_{t})} \\
&= \sum_{t=1}^{T}\log\del{\frac{\det(\mathbf{J}_{t+1})}{\det(\mathbf{J}_{t})}} \\
&= \log\del{\frac{\det(\mathbf{J}_{T+1})}{\det(\lambda_{t}\mathbf{I}_{d})}} && \text{(by a telescoping sum)} \\
&= \log\del{\frac{\det(\mathbf{J}_{T+1})}{\lambda_{t}^{d}}}. && \text{(44)}
\end{aligned}
$$

For any $a$ with $0\leq a\leq 1$, we have $a\leq 2\log(1+a)$. Therefore, we write:

$$
\begin{aligned}
\sum_{t=1}^{T}\min\cbr{\sum_{i\in\mathcal{Q}_{t}}\enVert{\tilde{x}_{t,i}}^{2}_{\mathbf{J}^{-1}_{t}},1}
&\leq 2\sum_{t=1}^{T}\log\del{1+\sum_{i\in\mathcal{Q}_{t}}\enVert{\tilde{x}_{t,i}}^{2}_{\mathbf{J}^{-1}_{t}}} \\
&\leq 2\log\del{\frac{\det(\mathbf{J}_{T+1})}{\lambda_{t}^{d}}}. && \text{(from Eq. (44))}
\end{aligned}
$$

∎
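
The bound of Lemma 17 can be checked on simulated data. The sketch below rests on assumptions that are not taken from the paper: a constant regularizer `lam`, Gaussian feature vectors clipped to unit norm (so $w=1$), and random positive weights standing in for the terms $\dot{\underline{\mu_{i}}}$. The function name and default values are hypothetical.

```python
import numpy as np

def elliptical_potential_check(T=200, K=4, d=5, lam=0.5, seed=0):
    """Return (lhs, rhs) of the Lemma 17 inequality on a random instance."""
    rng = np.random.default_rng(seed)
    J = lam * np.eye(d)                      # J_1 = lam * I_d
    lhs = 0.0
    for _ in range(T):
        X = rng.normal(size=(K, d))
        # Clip each feature vector so that its Euclidean norm is at most 1.
        X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
        # Strictly positive weights: a stand-in for the mu-dot terms (assumption).
        weights = rng.uniform(0.05, 0.25, size=K)
        X_tilde = np.sqrt(weights)[:, None] * X
        # Sum over items of the squared J_t^{-1}-norm of x_tilde.
        potential = np.einsum("ki,ij,kj->", X_tilde, np.linalg.inv(J), X_tilde)
        lhs += min(potential, 1.0)
        J += X_tilde.T @ X_tilde             # J_{t+1} = J_t + sum_i x_tilde x_tilde^T
    _, logdet = np.linalg.slogdet(J)
    rhs = 2.0 * (logdet - d * np.log(lam))   # 2 log(det(J_{T+1}) / lam^d)
    return lhs, rhs

lhs, rhs = elliptical_potential_check()
print(lhs, rhs)
```

Using `slogdet` rather than `det` keeps the right-hand side numerically stable when $d$ or $T$ is large.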

Lemma 18 (Determinant-trace inequality, see Lemma 10 in Abbasi-Yadkori et al. [2011]).

Let $\{x_{s}\}_{s=1}^{\infty}$ be a sequence in $\mathbb{R}^{d}$ such that $\enVert{x_{s}}_{2}\leq X$ for all $s\in\mathbb{N}$, and let $\lambda_{t}$ be a non-negative scalar. For $t\geq 1$, define $\mathbf{V}_{t}\coloneqq\sum_{s=1}^{t-1}x_{s}x_{s}^{\top}+\lambda_{t}\mathbf{I}_{d}$. The following inequality holds:

$$\det(\mathbf{V}_{t+1})\leq\left(\lambda_{t}+tX^{2}/d\right)^{d}.$$
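
A short sketch of why this bound holds, written for a fixed regularizer $\lambda$ (dropping the time index on $\lambda_{t}$ is an assumption made only for this illustration, and the eigenvalue symbols $\alpha_{j}$ are not notation from the paper); the cited reference contains the complete proof. Let $\alpha_{1},\dots,\alpha_{d}\geq 0$ denote the eigenvalues of $\mathbf{V}_{t+1}$. By the inequality of arithmetic and geometric means,

$$
\det(\mathbf{V}_{t+1})=\prod_{j=1}^{d}\alpha_{j}\leq\del{\frac{1}{d}\sum_{j=1}^{d}\alpha_{j}}^{d}=\del{\frac{\operatorname{tr}(\mathbf{V}_{t+1})}{d}}^{d},
\qquad
\operatorname{tr}(\mathbf{V}_{t+1})=d\lambda+\sum_{s=1}^{t}\enVert{x_{s}}_{2}^{2}\leq d\lambda+tX^{2},
$$

and combining the two relations gives $\det(\mathbf{V}_{t+1})\leq\del{\lambda+tX^{2}/d}^{d}$.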

A.8 Numerical experiments

We build on Section 5 and compare the empirical performance of our proposed algorithm CB-MNL with the previous state of the art in the MNL contextual bandit literature, UCB-MNL [Oh & Iyengar, 2021] and TS-MNL [Oh & Iyengar, 2019], on artificial data for varying model parameters. The parameter $\theta_{*}\in\mathbb{R}^{d}$ is a $d$-dimensional random vector whose coordinates are drawn independently and uniformly from $[0,1]$. The contexts follow a multivariate Gaussian distribution. Algorithm CB-MNL only needs to know the values of $N,T,K,d$, whereas TS-MNL and UCB-MNL additionally need the value of $\kappa$ for their implementation. Here we simulate two additional parameter instances, again averaged over $25$ Monte Carlo runs.
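
The data-generating process described above can be sketched as follows. This is an illustrative setup only, not the authors' experiment code: the values of `N`, `d`, and the assortment size, the zero mean and identity covariance of the Gaussian contexts, and all function names are assumptions.

```python
import numpy as np

def make_instance(N=50, d=5, seed=0):
    """Draw theta_* uniformly from [0, 1]^d and Gaussian contexts for N products."""
    rng = np.random.default_rng(seed)
    theta_star = rng.uniform(0.0, 1.0, size=d)
    # Zero mean and identity covariance are assumptions; the text only says "Gaussian".
    contexts = rng.multivariate_normal(np.zeros(d), np.eye(d), size=N)
    return theta_star, contexts, rng

def mnl_choice_probs(assortment, contexts, theta):
    """MNL probabilities for the offered items; the last entry is the no-purchase option."""
    weights = np.exp(contexts[assortment] @ theta)
    denom = 1.0 + weights.sum()
    return np.append(weights / denom, 1.0 / denom)

theta_star, contexts, rng = make_instance()
# Offer K = 4 products chosen at random (a placeholder for the algorithm's decision).
assortment = rng.choice(len(contexts), size=4, replace=False)
probs = mnl_choice_probs(assortment, contexts, theta_star)
# Sample the consumer's response; index 4 corresponds to no purchase.
choice = rng.choice(len(probs), p=probs)
print(probs, choice)
```

In a full simulation, the random assortment would be replaced by the set selected by CB-MNL, UCB-MNL, or TS-MNL in each round, and the sampled `choice` would be fed back to the learner.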
