A Statistical Framework for Data-dependent Retrieval-Augmented Models

Soumya Basu^∗, Ankit Singh Rawat^∗, Manzil Zaheer
^∗Equal contribution, in alphabetical order.
Google, New York
{basusoumya,ankitsrawat,manzilzaheer}@google.com

Abstract

Modern ML systems increasingly augment input instances with additional relevant information to enhance final prediction. Despite growing interest in such retrieval-augmented models, their fundamental properties and training are not well understood. We propose a statistical framework to study such models with two components: 1) a retriever to identify the relevant information out of a large corpus via a data-dependent metric; and 2) a predictor that consumes the input instances along with the retrieved information to make the final predictions. We present a principled method for end-to-end training of both components and draw connections with various training approaches in the literature. Furthermore, we establish excess risk bounds for retrieval-augmented models while delineating the contributions of both retriever and predictor towards the model performance. We validate the utility of our proposed training methods, along with the key takeaways from our statistical analysis, on open-domain question answering tasks where retrieval augmentation is important.

1 Introduction

Recent advancements in machine learning (ML) have not only led to breakthroughs on long-standing challenging tasks across various fields, but they have also inspired a great deal of interest to develop ML models that can solve even harder tasks (Meinhardt et al., 2022; Lewkowycz et al., 2022; Cramer, 2021) or focus on completely new fields (Austin et al., 2021; OpenAI, 2023; Singhal et al., 2023). While scaling the size of parametric ML models, such as neural networks, is becoming the predominant approach to meet such demands (Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023; Dosovitskiy et al., 2021; Dehghani et al., 2023), the excellent performance realized by this approach is marred by drawbacks such as high computational cost, inefficient storage of world knowledge in parameters, lack of transparency in model behavior, and reduced grounding/factuality of model predictions.

Recognizing these shortcomings, retrieval-augmented models (RAMs) have emerged as a promising alternative. Such models typically employ two components, namely a retriever and a predictor, during inference on a given input instance: the retriever first identifies instance-specific relevant information from a data-store, and then the predictor jointly processes the retrieved information and the input instance to make a final prediction. In practice, RAMs have enjoyed a favorable performance vs. compute trade-off (Borgeaud et al., 2021; Das et al., 2021; Thai et al., 2023), as employing moderate-size parametric models as retriever and predictor in a RAM often matches or exceeds the performance of a much larger standalone ML model that directly maps input instances to predictions. Similarly, conditioning predictions on the retrieved information has been shown to improve grounding (Shuster et al., 2021; Lin et al., 2023; Asai et al., 2023). Furthermore, having access to an external corpus can obviate the need to store task-specific world knowledge in model parameters and enable incorporating dynamically evolving knowledge (Izacard et al., 2022; Liska et al., 2022).

Despite these desirable characteristics, training RAMs presents multiple challenges. The natural approach of independently training the retriever and the predictor can be sub-optimal (Izacard et al., 2022). Moreover, it requires collecting intermediate supervision on the instance-dependent relevant information to retrieve, which is missing in common datasets and expensive to obtain in general. A common strategy to circumvent the lack of intermediate supervision is to perform end-to-end training, which presents its own unique challenges in the context of RAMs. Fundamentally, retrieval corresponds to the non-differentiable discrete operation of selecting relevant information from a data-store, e.g., via top-k selection based on retriever scores, which prevents direct gradient propagation to the entire retriever. Several clever solutions to the above-mentioned issues have been proposed in the literature that focus on different training objectives to propagate the learning signal from the predictor into the retriever. However, a formal study that unifies these solutions is missing from the literature.

Another key challenge that prevents the resource-efficient development and deployment of RAMs is the limited understanding of their basic properties such as their generalization behavior and expressive power. For instance, how do the retriever and predictor components interact to ensure good task-specific performance? Are there any principles guiding the selection of the retriever and predictor components? How does (size of) the data-store feature in the final performance of a RAM?

In this paper, we address both aforementioned shortcomings in the literature pertaining to RAMs. To unify the training of RAMs, we begin by writing down the natural objective function, which has somehow eluded the literature. This natural objective simply minimizes the expected prediction loss, where the expectation is taken over the distribution induced by the retriever. Empirically, we find this objective to be effective on standard benchmarks: NaturalQuestions (NQ; Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017).

As for the generalization and expressive power, we present an excess risk bound for RAMs that captures the effect of retrieval and prediction function classes. The proposed bound allows us to highlight how retriever and predictor components play complementary roles to reduce approximation error as we increase their respective function class complexity. We also capture the role of data store in improving the model performance by reducing the approximation error. On the generalization front, we carefully decouple the generalization term in the excess risk over the predictor and retriever function classes. This allows us to tightly control the generalization term with only logarithmic dependence on the data store size. As a concrete instantiation for our excess risk bounds, we consider feed-forward neural networks of varying depth for both the retriever and the predictor.

To summarize, our main contributions include:

• We present a principled objective for end-to-end training of RAMs, focusing on a classification setting (Sec. 2 and 3), and draw connections between existing approaches for training RAMs (Sec. 3.6).

• We derive an excess risk bound highlighting the roles played by the retriever and predictor function classes, as well as the data-store, in ensuring improved performance by RAMs (Sec. 3.4), and capturing the trade-off between model capacities at the retriever and the predictor (Sec. 3.5).

• We validate the utility of the proposed objective on two standard QA benchmarks: NaturalQuestions (NQ) and TriviaQA (Sec. 4).

2 Problem setup

In this paper, we focus on developing a systematic understanding of RAMs with learned retrievers in a classification setting where the model has access to a data-store. Towards this, we begin by formally defining the problem setup and providing the necessary background along with the notations used.

Let’s first consider the standard classification setting, which requires predicting a class in $\mathscr{Y}$ for a given instance $x\in\mathscr{X}$. Assume that $\mathsf{D}_{XY}$ captures the underlying data distribution and one has access to $n$ training examples $\mathscr{S}_{n}\triangleq\{(x_{i},y_{i})\}_{i\in[n]}$ that are independent and identically distributed (i.i.d.) according to $\mathsf{D}_{XY}$. Given $\mathscr{S}_{n}$, one hopes to learn a classifier $f:\mathscr{X}\to\mathbb{R}^{|\mathscr{Y}|}$ that minimizes the misclassification error:

\displaystyle R(f)=\mathbb{P}_{(X,Y)\sim\mathsf{D}_{XY}}\big[\operatorname*{arg\,max}_{y\in\mathscr{Y}}f^{y}(X)\neq Y\big],

(1)

where $f^{y}(x)$ denotes the score that $f$ assigns to the $y$-th class, given the input instance $x$. Since directly optimizing the misclassification error or $0/1$-loss poses computational challenges, one typically selects the classifier that minimizes the empirical risk associated with a well-behaved surrogate loss function $\ell:\mathbb{R}^{|\mathscr{Y}|}\times\mathscr{Y}\to\mathbb{R}$ on the training sample $\mathscr{S}_{n}$:

\displaystyle R_{\ell,n}(f)=\frac{1}{n}\sum_{i\in[n]}\ell\big(f(x_{i}),y_{i}\big).

(2)

The (population) risk associated with the surrogate loss function takes the following form:

\displaystyle R_{\ell}(f)=\mathbb{E}_{(X,Y)\sim\mathsf{D}_{XY}}\big[\ell\big(f(X),Y\big)\big].

(3)

Different from the standard classification setup described above, we now consider the classification task with access to a data-store: Given an instance $x$ , the classifier can potentially leverage a data-store $\mathscr{I}\subseteq\mathscr{Z}$ – a collection of potentially relevant information or evidences, where $\mathscr{Z}$ denotes the space of all possible evidences. Accordingly, one can define the empirical and population risks of a classifier $f(\cdot,\mathscr{I}):\mathscr{X}\to\mathbb{R}^{|\mathscr{Y}|}$ as follows:

\displaystyle R_{\ell,\mathscr{I},n}(f)=\frac{1}{n}\sum_{i\in[n]}\ell\big(f(x_{i},\mathscr{I}),y_{i}\big),

(4)

\displaystyle R_{\ell,\mathscr{I}}(f)=\mathbb{E}\big[\ell\big(f(X,\mathscr{I}),Y\big)\big],

(5)

where the expectation is taken over $(X,Y)\sim\mathsf{D}_{XY}$ as well as the possible randomness in $f(\cdot,\mathscr{I})$. However, due to its prohibitive computational cost, such a general classifier that directly processes the entire data-store for each prediction is far from how an additional data-store is utilized by ML models in practice.

This motivates us to study the following explicit retrieval-augmented classification setup to utilize the data-store: Given an input instance $x\in\mathscr{X}$, one first retrieves input-dependent supporting evidences $\mathscr{E}^{x}\subset\mathscr{I}$ with the help of a retriever model which has access to the entire data-store $\mathscr{I}$. Now, given $x$ and $\mathscr{E}^{x}$, one invokes a predictor model to predict the class associated with $x$. Thus, a retrieval-augmented classification setup consists of two key component models: 1) a retriever model and 2) a predictor model, which we formally introduce next.

Retriever model. For the retrieval stage, we rely on a retriever model to capture the relevance of an evidence $z\in\mathscr{I}$ towards the input instance $x\in\mathscr{X}$ . Let $r_{\theta}:\mathscr{X}\times\mathscr{Z}\to\mathbb{R}$ be the retriever model parameterized by $\theta\in\Theta$ that assigns a relevance score $r_{\theta}(x,z)$ to the instance-evidence pair $(x,z)$ . Furthermore, for each instance $x$ , the retriever model $r_{\theta}$ induces the following distribution over the set of potential evidences:

\displaystyle p_{\theta,\mathscr{I}}\big(z|x\big)=\frac{\exp\big(r_{\theta}(x,z)\big)}{\sum_{z^{\prime}\in\mathscr{I}}\exp\big(r_{\theta}(x,z^{\prime})\big)},\quad\forall~z\in\mathscr{I}.

(6)

There are multiple strategies to construct the set of input-dependent supporting evidences $\mathscr{E}^{x}$ based on $r_{\theta}$ . For example, for a fixed integer $k\geq 1$ , one could select $k$ evidences corresponding to the $k$ highest scores in $\{r_{\theta}(x,z)\}_{z\in\mathscr{I}}$ . Another strategy is to sample $k$ evidences according to the distribution $p_{\theta,\mathscr{I}}(\cdot|x)$ in (6). Here, one could perform the sampling with or without replacement. In what follows, we denote the retrieved supporting evidence for the instance $x$ as $\mathscr{E}^{x}_{\theta}$ to highlight the dependence on the underlying retriever model.
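The selection strategies above can be sketched in a few lines of Python. This is a minimal illustration, assuming the relevance scores $r_{\theta}(x,z)$ for one instance have already been computed and stored in a list; the function names and the toy scores are hypothetical, and the without-replacement variant shown (draw one evidence at a time and renormalize over the remainder) is just one possible choice.

```python
import math
import random

def retriever_distribution(scores):
    """Softmax over relevance scores, one score per evidence z, as in (6)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, k):
    """Indices of the k evidences with the highest scores."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

def sample_k(scores, k, rng, replace=True):
    """Sample k evidence indices from p(.|x), with or without replacement."""
    probs = retriever_distribution(scores)
    if replace:
        return rng.choices(range(len(scores)), weights=probs, k=k)
    chosen, remaining = [], list(range(len(scores)))
    for _ in range(k):
        # rng.choices renormalizes the weights over the remaining evidences
        weights = [probs[j] for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

scores = [2.0, 0.5, -1.0, 1.5]  # hypothetical r_theta(x, z) for 4 evidences
probs = retriever_distribution(scores)
best_two = top_k(scores, 2)
drawn = sample_k(scores, 2, random.Random(0), replace=False)
```

Top-$k$ selection is deterministic given the scores, whereas the sampling variants depend on the random generator passed in.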

Predictor model. Let $h_{\xi}:\mathscr{X}\times\mathscr{I}^{\ast}\to\mathbb{R}^{|\mathscr{Y}|}$ be the predictor model parameterized by $\xi\in\Xi$, where $\mathscr{I}^{\ast}$ denotes the Kleene star on $\mathscr{I}$. Given $x\in\mathscr{X}$ and $\mathscr{E}\in\mathscr{I}^{\ast}$, the predictor model $h_{\xi}$ assigns a score to each class in $\mathscr{Y}$, defining a distribution over $\mathscr{Y}$ as follows:

\displaystyle p_{\xi}\big(y|x,\mathscr{E}\big)=\frac{\exp\big(h^{y}_{\xi}(x,\mathscr{E})\big)}{\sum_{y^{\prime}\in\mathscr{Y}}\exp\big(h^{y^{\prime}}_{\xi}(x,\mathscr{E})\big)},\quad\forall~y\in\mathscr{Y},

(7)

where $h^{y}_{\xi}(\cdot,\cdot)$ denotes the score assigned to the $y$ -th class by the predictor model $h_{\xi}$ .

For ease of exposition, we focus on the setting with $k=|\mathscr{E}^{x}_{\theta}|=1,\forall x\in\mathscr{X},$ in our analysis throughout this paper. This corresponds to retrieving a single supporting evidence for each input instance. Our analysis can be generalized to $k>1$ by working with $\tilde{\mathscr{I}}\subseteq\mathscr{I}^{k}$ as the new data-store and $\tilde{p}_{\theta,\mathscr{I}}(\cdot|x)$ as a distribution over $\tilde{\mathscr{I}}$ obtained by suitably modifying $p_{\theta,\mathscr{I}}$ in (6). For example, when $k$ supporting evidences are sampled with replacement, the following holds $\forall(z_{1},\ldots,z_{k})\in\mathscr{I}^{k}$:

\displaystyle\tilde{p}_{\theta,\mathscr{I}}\big((z_{1},\ldots,z_{k})\big|x\big)=\prod_{j\in[k]}p_{\theta,\mathscr{I}}(z_{j}|x).
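As a small sanity check of the product form above, the following sketch enumerates all ordered $k$-tuples over a toy data-store and confirms that the tuple probabilities form a distribution over $\mathscr{I}^{k}$. The probabilities and the helper name are hypothetical, chosen only for illustration.

```python
import itertools
import math

def tuple_probability(probs, indices):
    """Probability of drawing the ordered tuple `indices` with replacement."""
    return math.prod(probs[j] for j in indices)

probs = [0.5, 0.3, 0.2]  # hypothetical p(z|x) over a 3-element data-store
k = 2
tuple_probs = {
    idx: tuple_probability(probs, idx)
    for idx in itertools.product(range(len(probs)), repeat=k)
}
total = sum(tuple_probs.values())  # should be 1 up to rounding
```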

Empirical risk minimization and excess risk for RAMs. For a pair of retriever and predictor models parameterized by $\theta$ and $\xi$ , respectively, we can define the empirical and population risks associated with a (surrogate) loss function $\ell$ as follows:

\displaystyle R_{\ell,\mathscr{I},n}(\xi,\theta)=\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta,\mathscr{I}}(z|x_{i})\,\ell\big(h_{\xi}(x_{i},z),y_{i}\big),

(8)

\displaystyle R_{\ell,\mathscr{I}}(\xi,\theta)=\mathbb{E}\big[\ell\big(h_{\xi}(X,\mathscr{E}^{X}_{\theta}),Y\big)\big].

(9)

Note that the expectation in (9) is taken over $(X,Y)\sim\mathsf{D}_{XY}$ as well as the randomness involved in the retrieval stage, e.g., sampling the evidences according to $p_{\theta,\mathscr{I}}(\cdot|x)$ in (6). Given a pair of predictor class $\Xi$ and retriever class $\Theta$ , let $(\hat{\xi},\hat{\theta})$ denote the predictor-retriever pair obtained via empirical risk minimization (ERM) as follows:

\displaystyle(\hat{\xi},\hat{\theta})\in\operatorname*{arg\,min}_{(\xi,\theta)\in\Xi\times\Theta}R_{\ell,\mathscr{I},n}(\xi,\theta).

(10)

Let $\mathscr{F}_{\rm all}$ denote the set of all measurable functions from $\mathscr{X}\times\mathscr{Z}$ to $\mathbb{R}^{|\mathscr{Y}|}$ . The optimal risk for the classification with access to the data-store is achieved by the best possible predictor $f^{\ell}_{\rm opt,\mathscr{I}}\in\mathscr{F}_{\rm all}$ when it has access to the best retrieved evidence in $\mathscr{I}$ . In particular, we have

\displaystyle f^{\ell}_{\mathrm{opt},\mathscr{I}}=\operatorname*{arg\,min}_{f\in\mathscr{F}_{\mathrm{all}}}\mathbb{E}\big[\min_{z\in\mathscr{I}}\ell(f(X,z),Y)\big].

(11)

Given $f^{\ell}_{\mathrm{opt},\mathscr{I}}$, we define the excess risk of a predictor-retriever pair $(\xi,\theta)$ as follows:

\displaystyle\Delta_{\ell,\mathscr{I}}(\xi,\theta)=R_{\ell,\mathscr{I}}(\xi,\theta)-R_{\ell,\mathscr{I}}(f^{\ell}_{\mathrm{opt},\mathscr{I}})\triangleq R_{\ell,\mathscr{I}}(\xi,\theta)-\mathbb{E}\big[\min_{z\in\mathscr{I}}\ell(f^{\ell}_{\mathrm{opt},\mathscr{I}}(X,z),Y)\big].

(12)

With the formal definition of the classification setting with access to a data-store and the necessary background in place, we proceed to address the two key objectives of this work: 1) Proposing a natural and efficient joint end-to-end training procedure for the predictor-retriever pair in a RAM; and 2) Developing a rigorous statistical understanding of RAMs focusing on the interaction between predictor and retriever components towards reducing overall excess risk.

3 Joint training and excess risk

Recall that training a RAM involves training both the retriever $r_{\theta}:\mathscr{X}\times\mathscr{Z}\to\mathbb{R}$ and the predictor $h_{\xi}:\mathscr{X}\times\mathscr{I}\to\mathbb{R}^{|\mathscr{Y}|}$ components of the model without access to intermediate supervision on retrieval, which is infeasible to obtain in most practical settings. Thus, it becomes critical to devise methods to jointly train $r_{\theta}$ and $h_{\xi}$ with access to only labeled instances $\mathscr{S}_{n}=\{(x_{i},y_{i})\}_{i\in[n]}\subseteq\mathscr{X}\times\mathscr{Y}$ with the predictor guiding the retriever training based on how valuable the retriever-provided evidences are towards the correct final prediction.

Towards this, we leverage the empirical risk from (8) along with the log-loss $\ell(h_{\xi}(x,z),y)=-\log p_{\xi}(y|x,z)$ , where $p_{\xi}(y|x,z)$ is defined in (7). In particular, this leads to the following joint end-to-end training objective:

\displaystyle\mathscr{L}_{n}(\xi,\theta;\mathscr{I})\triangleq R_{{\rm log},\mathscr{I},n}(\xi,\theta)=-\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta,\mathscr{I}}(z|x_{i})\cdot\log p_{\xi}(y_{i}|x_{i},z).

(13)

Note that the objective in (13) aims to improve the end-to-end performance of a RAM in deployment, in the sense that it minimizes the expected loss given the evidences selected as per the retriever-induced distribution. One can use gradient-based methods to jointly minimize the objective in (13) with respect to $(\xi,\theta)$; however, its efficient implementation is non-trivial due to the sum over the entire data-store $\mathscr{I}$. In App. C.1, we discuss some approximate design choices. Lastly, please refer to Sec. 3.6 for connections between our proposed objective in (13) and some of the existing end-to-end training approaches for RAMs.
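To make the objective in (13) concrete, here is a minimal sketch that evaluates it exactly on a toy problem by summing over the whole data-store. It assumes the retriever scores and predictor logits are already available as nested lists; the function names and toy numbers are hypothetical, and a practical implementation would need the approximations mentioned above rather than the full sum.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def joint_objective(retriever_scores, predictor_logits, labels):
    """Exact value of (13) on a toy problem.

    retriever_scores[i][z]    : r_theta(x_i, z)
    predictor_logits[i][z][y] : h_xi^y(x_i, z)
    labels[i]                 : y_i (class index)
    """
    total = 0.0
    for r_i, h_i, y_i in zip(retriever_scores, predictor_logits, labels):
        p_retrieve = softmax(r_i)  # p_theta(z | x_i) as in (6)
        for p_z, logits in zip(p_retrieve, h_i):
            p_label = softmax(logits)[y_i]  # p_xi(y_i | x_i, z) as in (7)
            total += p_z * (-math.log(p_label))
    return total / len(labels)

# One instance, two evidences, two classes (hypothetical numbers).
predictor_logits = [[[2.0, 0.0], [0.0, 2.0]]]  # evidence 0 supports class 0
labels = [0]
loss_good = joint_objective([[1.0, -1.0]], predictor_logits, labels)
loss_bad = joint_objective([[-1.0, 1.0]], predictor_logits, labels)
loss_uniform = joint_objective([[0.0, 0.0]], [[[0.0, 0.0], [0.0, 0.0]]], [1])
```

In this toy case the objective is lower when the retriever puts more mass on the evidence under which the predictor is confident in the correct label, which is the signal the predictor passes to the retriever during joint training.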

Next, to study the generalization and expressive power of RAMs, we want to bound the excess risk $\Delta_{\ell,\mathscr{I}}(\hat{\xi},\hat{\theta})$ as defined in (12). We consider $\mathscr{X}$ to be a compact subspace of $\mathbb{R}^{d_{x}}$ and, for simplicity, take $\mathscr{X}\subseteq[-1,1]^{d_{x}}$ . Similarly, we consider that each retrieval example $z\in\mathscr{I}$ is embedded in the space $[-1,1]^{d_{z}}$ . We consider a data-store that polynomially scales with training data size, i.e., $|\mathscr{I}|={\rm poly}(n)$ . For the purpose of analysis, we specialize our log-loss to be bounded by $\ell_{\max}>0$ , which is given as

\displaystyle\ell(h_{\xi}(x,z),y)=\min(\ell_{\max},-\log p_{\xi}(y|x,z))=\min\bigg(\ell_{\max},\log\Big(\sum_{y^{\prime}\in\mathscr{Y}}\exp(h^{y^{\prime}}_{\xi}(x,z))\Big)-h^{y}_{\xi}(x,z)\bigg),

where $p_{\xi}(y|x,z)$ and $h^{y}_{\xi}(x,z)$ are defined in (7).
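The bounded log-loss above can be computed directly from the predictor scores. The sketch below follows the second form of the definition (log-sum-exp of the scores minus the score of the true class, clipped at $\ell_{\max}$); the function name and example values are hypothetical.

```python
import math

def clipped_log_loss(logits, y, ell_max):
    """min(ell_max, -log p(y | x, z)) computed from the class scores."""
    m = max(logits)
    # log-sum-exp with the max subtracted for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return min(ell_max, log_sum_exp - logits[y])

confident_wrong = clipped_log_loss([0.0, 100.0], 0, ell_max=5.0)  # clipped
uninformative = clipped_log_loss([0.0, 0.0], 0, ell_max=5.0)      # log(2)
```

The clipping only matters when the predictor assigns very small probability to the true label; otherwise the loss coincides with the usual log-loss.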

3.1 Excess risk decomposition

Our excess risk bound relies on separating the contributions of the retriever and the predictor during joint training. Moreover, the retriever and predictor errors can each be split into a generalization error and an approximation error.

The population risk optimizer of our joint training over the space $\Xi\times\Theta$ is defined as

\displaystyle\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint}=\operatorname*{arg\,min}_{(\xi,\theta)\in\Xi\times\Theta}\mathbb{E}_{X}\big[\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}\mathbb{E}_{Y|X}\ell\big(h_{\xi}(X,Z),Y\big)\big].

For a predictor $\xi$ , sample $x\in\mathscr{X}$ and retrieved example $z\in\mathscr{I}$ , let us denote the risk averaged over the labels $\mathscr{Y}$ as

g_{\xi}(x,z)=\mathbb{E}_{Y|X=x}\big[\ell\big(h_{\xi}(x,z),Y\big)\big].

(14)

For any fixed predictor $\xi$ (not necessarily in $\Xi$) and fixed data-store $\mathscr{I}$, the retriever that optimizes the joint population risk is given as $p^{\ast,\xi}(z|x)=\mathbbm{1}_{\operatorname*{arg\,min}_{z^{\prime}\in\mathscr{I}}g_{\xi}(x,z^{\prime})}(z)$, where ties are broken arbitrarily. Note that, for each sample $x$, the best retrieved evidence $z$ may change. We define the optimal predictor within the class $\Xi$ with the best possible retriever as

\xi^{\ast}=\operatorname*{arg\,min}_{\xi\in\Xi}\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big].
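The statement that the best retriever for a fixed predictor is an indicator on the lowest-loss evidence can be checked on a toy example: the expected value of $g_{\xi}(x,\cdot)$ under any distribution over $\mathscr{I}$ is at least $\min_{z}g_{\xi}(x,z)$, and the indicator attains it. The values of $g_{\xi}$ and the helper names below are hypothetical.

```python
def expected_loss(p, g):
    """Expectation of g(x, Z) when Z is drawn from the distribution p."""
    return sum(p_z * g_z for p_z, g_z in zip(p, g))

def optimal_retrieval_distribution(g):
    """Indicator on an argmin of g; ties go to the first minimizer."""
    best = min(range(len(g)), key=lambda z: g[z])
    return [1.0 if z == best else 0.0 for z in range(len(g))]

g = [0.9, 0.2, 0.7]  # hypothetical g_xi(x, z) for three evidences
p_star = optimal_retrieval_distribution(g)
uniform = [1.0 / len(g)] * len(g)
loss_star = expected_loss(p_star, g)
loss_uniform = expected_loss(uniform, g)
```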

The optimal retriever within the class $\Theta$ for a given predictor $\xi$ is defined as

\theta(\xi)=\operatorname*{arg\,min}_{\theta\in\Theta}\mathbb{E}_{X}\big[\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi}(X,Z)\big].

The excess risk for the classes $\Theta$ and $\Xi$ can be bounded as

\displaystyle\Delta_{\ell,\mathscr{I}}(\hat{\xi},\hat{\theta})\leq\underbrace{\sum_{(\theta,\xi)\in\{(\hat{\theta},\hat{\xi}),(\theta^{\ast}_{\rm joint},\xi^{\ast}_{\rm joint})\}}\big|R_{\ell,\mathscr{I}}(\xi,\theta)-R_{\ell,\mathscr{I},n}(\xi,\theta)\big|}_{\text{generalization error}}
\displaystyle\qquad+\underbrace{R_{\ell,\mathscr{I}}(\xi^{\ast},\theta(\xi^{\ast}))-\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{\ast}}(X,z)\big]}_{\text{retriever error}}+\underbrace{\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{\ast}}(X,z)\big]-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})}_{\text{predictor error}}

(15)
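One way to arrive at the bound in (15) is to add and subtract intermediate risks and drop the differences that are non-positive. The following is a sketch of that telescoping argument using only the definitions above; it is a plausible reading of the decomposition, not a reproduction of the authors' proof.

```latex
\begin{aligned}
\Delta_{\ell,\mathscr{I}}(\hat{\xi},\hat{\theta})
&= \big[R_{\ell,\mathscr{I}}(\hat{\xi},\hat{\theta}) - R_{\ell,\mathscr{I},n}(\hat{\xi},\hat{\theta})\big]
 + \underbrace{\big[R_{\ell,\mathscr{I},n}(\hat{\xi},\hat{\theta}) - R_{\ell,\mathscr{I},n}(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})\big]}_{\leq 0 \text{ since } (\hat{\xi},\hat{\theta}) \text{ minimizes the empirical risk, (10)}} \\
&\quad + \big[R_{\ell,\mathscr{I},n}(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint}) - R_{\ell,\mathscr{I}}(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})\big]
 + \underbrace{\big[R_{\ell,\mathscr{I}}(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint}) - R_{\ell,\mathscr{I}}(\xi^{\ast},\theta(\xi^{\ast}))\big]}_{\leq 0 \text{ since } (\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint}) \text{ minimizes the population risk over } \Xi\times\Theta} \\
&\quad + \Big[R_{\ell,\mathscr{I}}(\xi^{\ast},\theta(\xi^{\ast})) - \mathbb{E}_{X}\big[\min_{z\in\mathscr{I}} g_{\xi^{\ast}}(X,z)\big]\Big]
 + \Big[\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}} g_{\xi^{\ast}}(X,z)\big] - R_{\ell,\mathscr{I}}(f^{\ell}_{{\rm opt},\mathscr{I}})\Big].
\end{aligned}
```

Bounding the two remaining empirical-versus-population differences by their absolute values gives the generalization term, and the last two brackets are the retriever and predictor errors.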

3.2 Generalization error

We first bound the generalization error and relate it to the covering number of the retriever and predictor class.

As our loss is bounded by $\ell_{\max}$ , through standard concentration bounds (Shalev-Shwartz and Ben-David, 2014), we obtain that, for any $\delta>0$ , with probability at least $(1-\delta)$ :

|R_{\ell,\mathscr{I}}(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})-R_{\ell,\mathscr{I},n}(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})|\leq 3\ell_{\max}\sqrt{\tfrac{\log(1/\delta)}{n}}.

However, $(\hat{\xi},\hat{\theta})$ is learned from the data, so a high-probability generalization bound requires a union bound over the space $\Xi\times\Theta$. We employ generalization bounds based on Rademacher complexity, and use the covering numbers of the retriever and predictor classes to bound the associated Rademacher complexity. See Shalev-Shwartz and Ben-David (2014) for details.

We define two norms which are used in defining the covering numbers for $\Theta$ and $\Xi$ . In particular, $\forall\mathbf{u}\in\mathbb{R}^{n\times|\mathscr{I}|}$ and fixed $\xi\in\Xi,\theta\in\Theta$ ,

\displaystyle\|\mathbf{u}\|_{2,[n],\xi}=\Big(\tfrac{1}{n}\sum_{i\in[n]}\big(\sum_{z\in\mathscr{I}}u_{i,z}\ell\big(h_{\xi}(x_{i},z),y_{i}\big)\big)^{2}\Big)^{1/2},
\displaystyle\|\mathbf{u}\|_{2,[n],\theta}=\Big(\tfrac{1}{n}\sum_{i\in[n]}\big(\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})u_{i,z}\big)^{2}\Big)^{1/2}.

(16)

We also define $\mathcal{N}(\Xi,\nu,{\|\cdot\|_{2,[n],\theta}})$ to be the $\nu$ -covering number for the class $\Xi$ with respect to the norm $\|\cdot\|_{2,[n],\theta}$ , and $\mathcal{N}(\Theta,\nu,{\|\cdot\|_{2,[n],\xi}})$ to be the $\nu$ -covering number for the class $\Theta$ with respect to the norm $\|\cdot\|_{2,[n],\xi}$ . Then we have the generalization bound given as

\displaystyle|R_{\ell,\mathscr{I}}(\hat{\xi},\hat{\theta})-R_{\ell,\mathscr{I},n}(\hat{\xi},\hat{\theta})|\leq\inf_{\varepsilon\in[0,\ell_{\max}/2]}\Big(8\varepsilon+\tfrac{24}{\sqrt{n}}\int_{\varepsilon}^{\tfrac{\ell_{\max}}{2}}f_{\mathcal{N}}(\nu/2;\Theta,\Xi)+f_{\mathcal{N}}(\nu/2;\Xi,\Theta)d\nu\Big),

(17)

for $f_{\mathcal{N}}(\nu;\mathcal{A},\mathcal{B})=\sup_{b\in\mathcal{B}}\sqrt{\log(\mathcal{N}(\mathcal{A},\nu,\|\cdot\|_{2,[n],b}))}.$

We use ideas from Zhang (2023) to upper bound the covering number by the pseudo-dimension (defined in Appendix A) of the function class. This allows us to obtain a $\log|\mathscr{I}|$ dependence in the generalization error while working with function classes of unbounded norm.

3.3 Approximation error

We next proceed to bound the retriever and predictor approximation errors. Towards this, we make extensive use of Sobolev function spaces. A Sobolev space for a domain $\Omega$ is characterized by two quantities: $\kappa$ – the number of weak derivatives a (real-valued) function within it possesses, and $L_{p}(\Omega)$ – the norm with respect to which these derivatives are integrable. Please see Appendix A for a complete definition.

3.3.1 Retriever error

The retriever error is given by how well the score function $r_{\theta}(x,z)$ approximates the optimal retriever given $\xi^{*}$ . In order to do so we first need to impose some smoothness constraints on the function $g_{\xi^{*}}:\mathscr{X}\times\mathscr{Z}\to\mathbb{R}$ . In particular, we assume the following.

Assumption 3.1 (Complexity of $g_{\xi^{*}}$ ).

There exists a baseline function $b_{\xi^{*}}:[-1,1]^{d_{x}}\to\mathbb{R}$ such that the function $\mathrm{gap}_{\xi^{*}}:[-1,1]^{d_{x}+d_{z}}\to\mathbb{R}$ defined by $\mathrm{gap}_{\xi^{*}}(x,z)\triangleq(g_{\xi^{*}}(x,z)-b_{\xi^{*}}(x))$ lies in the Sobolev space with $\kappa$ derivatives and $L_{\infty}([-1,1]^{d_{x}+d_{z}})$ norm.

The above assumption says that for the predictor $\xi^{*}$ the loss profile (averaged over labels in $\mathscr{Y}$ ) $g_{\xi^{*}}(x,z)$ , has two components – a (possibly) complex $b_{\xi^{*}}(x)$ component that is uniform over $z$ , and a ‘smooth’ $\mathrm{gap}_{\xi^{*}}(x,z)$ component. In other words, given two similar retrieved evidences, the predictor incurs similar losses when each of the evidences is utilized with an input instance.

Then, for any $\tau>0$, we can bound the retriever error as follows:

\displaystyle R_{\ell,\mathscr{I}}(\xi^{*},\theta(\xi^{\ast}))-\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{*}}(X,z)\big]\leq\inf_{\theta\in\Theta}\ell_{\max}\|r_{\theta}+\tau\cdot\mathrm{gap}_{\xi^{*}}\|_{\infty}+\frac{\log|\mathscr{I}|}{\tau^{2}}

(18)
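The role of $\tau$ in (18) can be illustrated numerically. If the retriever scores match $-\tau\cdot\mathrm{gap}_{\xi^{*}}$ exactly, the first term of the bound vanishes and the induced distribution in (6) is a softmax of $-\tau\,g_{\xi^{*}}(x,\cdot)$ (the baseline $b_{\xi^{*}}(x)$ is constant in $z$ and cancels). The sketch below, with hypothetical loss values, shows the expected loss under this distribution approaching the best achievable loss as $\tau$ grows. It illustrates only this qualitative effect, not the constants or rates in (18).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieval_gap(g, tau):
    """Expected loss under p(z) proportional to exp(-tau * g[z]), minus min(g)."""
    p = softmax([-tau * g_z for g_z in g])
    return sum(p_z * g_z for p_z, g_z in zip(p, g)) - min(g)

g = [0.9, 0.2, 0.7, 0.25]  # hypothetical g_xi*(x, z) over four evidences
gaps = {tau: retrieval_gap(g, tau) for tau in (1.0, 10.0, 100.0)}
```

A larger $\tau$ requires the retriever class to represent larger-magnitude scores, which is where the first term of (18) enters.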

3.3.2 Predictor error

The predictor error is measured with the optimal retrieval (as the retriever error is considered separately above). For this, we need to first quantify how the retrieval augmentation using the data-store $\mathscr{I}$ helps.

Usefulness of retrieval set:

We start with a characterization of the prediction task in the presence of the data-store $\mathscr{I}\subset\mathscr{Z}$. We assume that there exists a score function $h_{*}:\mathscr{X}\times\mathscr{Z}\to\mathbb{R}^{|\mathscr{Y}|}$, and the corresponding probability distribution

p_{*}^{y}(x,z)=\frac{\exp(h_{*}^{y}(x,z))}{\sum_{y^{\prime}}\exp(h_{*}^{y^{\prime}}(x,z))},

(19)

that approximates $p_{\mathsf{D}_{XY}}^{y}(x):=\mathbb{P}_{Y\sim\mathsf{D}_{Y|X}}(y|X=x)$ well for all $x\in\mathscr{X}$ and $y\in\mathscr{Y}$. Furthermore, we want this score function $h_{*}$ to lie coordinate-wise in a Sobolev space. The following assumption formalizes this.

Assumption 3.2 (Retrieval quality).

There exists a score function $h_{*}:\mathscr{X}\times\mathscr{Z}\to\mathbb{R}^{|\mathscr{Y}|}$ such that

1. for each $y\in\mathscr{Y}$, the function $h_{*}^{y}$ lies in the Sobolev space with $\kappa_{\mathscr{I}}$ derivatives and finite $L_{\infty}([-1,1]^{d_{x}+d_{z}})$ norm, and

2. for any $x\in\mathscr{X}$, there exists a retrieved evidence $z^{*}(x)\in\mathscr{I}$ such that $p_{*}^{y}(x,z)$, as defined in (19), satisfies

\max_{y\in\mathscr{Y}}\sup_{x\in\mathscr{X}}|p_{*}^{y}(x,z^{*}(x))-p_{\mathsf{D}_{XY}}^{y}(x)|\leq c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}.

Note that this assumption is independent of the retriever class $\Theta$ and the predictor class $\Xi$, and captures an intrinsic property of the data-store $\mathscr{I}$. The tuple $(\gamma_{\mathscr{I}},d_{z},\kappa_{\mathscr{I}})$ defines the usefulness of $\mathscr{I}$. In particular, the higher $\gamma_{\mathscr{I}}$, the closer the approximation; and the higher $\kappa_{\mathscr{I}}$ and the smaller the embedding dimension $d_{z}$, the ‘simpler’ the score function used for this approximation.

Under Assumption 3.2, we bound the predictor error as

\displaystyle\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{*}}(X,z)\big]-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})\leq\inf_{\xi\in\Xi}2\mathbb{E}_{X}\big[\max_{y\in\mathscr{Y}}|h_{\xi}^{y}(X,z^{*}(X))-h_{*}^{y}(X,z^{*}(X))|\big]+
\displaystyle\qquad\quad(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max}).

(20)

One key step in arriving at the above inequality is expressing the loss of $f_{{\rm opt},\mathscr{I}}^{\ell}$ using the score function $h_{*}$ defined in Assumption 3.2. In particular, under Assumption 3.2, we show that

\displaystyle\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{f_{{\rm opt},\mathscr{I}}^{\ell}}(X,z)\big]\geq\mathbb{E}_{X}\big[g_{h_{*}}(X,z^{*}(X))\big]-(|\mathscr{Y}|-1)\exp(-\ell_{\max})-c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max}).

3.4 Final excess risk bound

We now combine the three components of the excess risk bounds under Assumptions 3.1 and 3.2 and discuss the design tradeoffs. The following theorem captures our main theoretical result.

Theorem 3.3 (Excess risk of joint training).

Under Assumptions 3.1 and 3.2, the excess risk for the retriever class $\Theta$ and predictor class $\Xi$ is bounded as

\displaystyle\Delta_{\ell,\mathscr{I}}(\hat{\xi},\hat{\theta})\leq 3\ell_{\max}(\tfrac{1}{n}+\sqrt{\tfrac{\log(n)}{n}})+\inf_{\varepsilon\in[0,\tfrac{\ell_{\max}}{2}]}8\varepsilon+\tfrac{24}{\sqrt{n}}\int_{\varepsilon}^{\tfrac{\ell_{\max}}{2}}f_{\mathcal{N}}(\tfrac{\nu}{2};\Theta,\Xi)+f_{\mathcal{N}}(\tfrac{\nu}{2};\Xi,\Theta)d\nu
\displaystyle\qquad+\inf_{\theta\in\Theta}\inf_{\tau>0}\ell_{\max}\|r_{\theta}+\tau\cdot\mathrm{gap}_{\xi^{*}}\|_{\infty}+\frac{\log|\mathscr{I}|}{\tau^{2}}
\displaystyle\qquad+\inf_{\xi\in\Xi}2\mathbb{E}_{X}\big[\max_{y\in\mathscr{Y}}|h_{\xi}^{y}(X,z^{*}(X))-h_{*}^{y}(X,z^{*}(X))|\big]+(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max}),

where $f_{\mathcal{N}}(\nu;\mathcal{A},\mathcal{B})\triangleq\sup_{b\in\mathcal{B}}\sqrt{\log(\mathcal{N}(\mathcal{A},\nu,\|\cdot\|_{2,[n],b}))}$, and $\|\cdot\|_{2,[n],\theta}$ and $\|\cdot\|_{2,[n],\xi}$ are defined in (16).

3.5 Illustrative example: MLPs

We instantiate our retriever and predictor classes as multi-layer perceptrons (MLPs) with depth $L_{\rm ret}$ & width $W_{\rm ret}=O(d_{x}+d_{z})$ and depth $L_{\rm pred}$ & width $W_{\rm pred}=O(|\mathscr{Y}|(d_{x}+d_{z}))$, respectively. The class ${\rm MLP}\left(\mathbb{R}^{d},\mathbb{R}^{k};L,W\right)$ is defined in Appendix A. The specialized excess risk bound for this setting is given as follows.

Theorem 3.4 (Excess risk for MLP).

Under Assumptions 3.1 and 3.2, the excess risk for the retriever class $\Theta={\rm MLP}\left(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R};L_{\rm ret},O(d_{x}+d_{z})\right)$ and predictor class $\Xi={\rm MLP}\left(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R}^{|\mathscr{Y}|};L_{\rm pred},O(|\mathscr{Y}|(d_{x}+d_{z}))\right)$ is bounded as

\displaystyle\Delta_{\ell,\mathscr{I}}(\hat{\xi},\hat{\theta})\leq\tilde{O}\left(\frac{\ell_{\max}}{\sqrt{n}}\left(L_{\rm ret}+L_{\rm pred}|\mathscr{Y}|\right)\right)+O\Big(\ell_{\max}L_{\rm ret}^{-\tfrac{4\kappa}{3(d_{x}+d_{z})}}\log^{1/3}(|\mathscr{I}|)\Big)+
\displaystyle O\left(L_{\rm pred}^{-\tfrac{2\kappa_{\mathscr{I}}}{(d_{x}+d_{z})}}+(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})\right).

Finally, to capture the optimal trade-off under finite data size $n$, we consider classes of retrievers and predictors that change with the data size, denoted by $\Theta_{n}$ and $\Xi_{n}$, with growing depths $L_{{\rm ret},n}$ and $L_{{\rm pred},n}$, respectively. Similarly, we also consider a growing upper bound on the loss function, denoted by $\ell_{\max,n}$. Let $d_{\rm tot}=d_{x}+d_{z}$. For $L_{{\rm ret},n}=n^{\tfrac{3d_{\rm tot}}{6d_{\rm tot}+8\kappa}}$, $L_{{\rm pred},n}=(\sqrt{n}/|\mathscr{Y}|)^{\tfrac{d_{\rm tot}}{2d_{\rm tot}+4\kappa_{\mathscr{I}}}}$, and $\ell_{\max,n}=\log|\mathscr{Y}|+\frac{\kappa_{\mathscr{I}}}{(d_{\rm tot}+2\kappa_{\mathscr{I}})}\log n$, the excess risk is bounded by

O\bigg(n^{-\tfrac{2\kappa}{3d_{\rm tot}+4\kappa}}+\max\Big(|\mathscr{I}|^{-\gamma_{\mathscr{I}}}|\mathscr{Y}|n^{\tfrac{\kappa_{\mathscr{I}}}{d_{\rm tot}+2\kappa_{\mathscr{I}}}},\big(\frac{n}{|\mathscr{Y}|^{2}}\big)^{-\tfrac{\kappa_{\mathscr{I}}}{d_{\rm tot}+2\kappa_{\mathscr{I}}}}\Big)\bigg).

We contrast the above result with prediction without retrieval. Assume that the functions $p_{\mathsf{D}_{XY}}^{y}(x)$, for all $y\in\mathscr{Y}$, lie in the Sobolev space with $\kappa_{\rm true}$ derivatives and $L_{\infty}$ norm. The predictor excess risk rate with $L_{{\rm pred},n}=(\sqrt{n}/|\mathscr{Y}|)^{\frac{d_{x}}{d_{x}+2\kappa_{\rm true}}}$ is $O((n/|\mathscr{Y}|^{2})^{-\tfrac{\kappa_{\rm true}}{d_{x}+2\kappa_{\rm true}}})$.

Our analysis indicates that we may gain through retrieval. For a large enough data store, $|\mathscr{I}|\geq|\mathscr{Y}|^{\tfrac{d_{\rm tot}\gamma_{\mathscr{I}}^{-1}}{d_{\rm tot}+2\kappa_{\mathscr{I}}}}n^{\tfrac{2\kappa_{\mathscr{I}}\gamma_{\mathscr{I}}^{-1}}{d_{\rm tot}+2\kappa_{\mathscr{I}}}}$, the maximum in the bound above is attained by its second argument, $(n/|\mathscr{Y}|^{2})^{-\tfrac{\kappa_{\mathscr{I}}}{d_{\rm tot}+2\kappa_{\mathscr{I}}}}$. As the data size $n$ increases, the retrieval-augmented bound then decays faster than the no-retrieval rate whenever $\kappa>\tfrac{3d_{\rm tot}}{2d_{x}}\kappa_{\rm true}$ and $\kappa_{\mathscr{I}}>\tfrac{d_{\rm tot}}{d_{x}}\kappa_{\rm true}$ (see Fig. 1).
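As a rough numerical illustration of this comparison, the following Python sketch evaluates the two rate expressions stated above with the constants hidden in $O(\cdot)$ dropped. The function names and all numeric values (dimensions, smoothness parameters, $\gamma_{\mathscr{I}}$, number of labels) are hypothetical, chosen only so that the two smoothness conditions hold; the sketch is not a reproduction of Figure 1.

```python
def ram_rate(n, num_labels, store_size, d_x, d_z, kappa, kappa_store, gamma_store):
    """Rate expression for the RAM bound, with constants inside O(.) dropped."""
    d_tot = d_x + d_z
    exponent = kappa_store / (d_tot + 2 * kappa_store)
    retriever_term = n ** (-2 * kappa / (3 * d_tot + 4 * kappa))
    store_term = store_size ** (-gamma_store) * num_labels * n ** exponent
    predictor_term = (n / num_labels ** 2) ** (-exponent)
    return retriever_term + max(store_term, predictor_term)


def no_retrieval_rate(n, num_labels, d_x, kappa_true):
    """Rate expression for the direct predictor without retrieval."""
    return (n / num_labels ** 2) ** (-kappa_true / (d_x + 2 * kappa_true))


def store_threshold(n, num_labels, d_x, d_z, kappa_store, gamma_store):
    """Data-store size at which the store term equals the predictor term."""
    d_tot = d_x + d_z
    denom = gamma_store * (d_tot + 2 * kappa_store)
    return num_labels ** (d_tot / denom) * n ** (2 * kappa_store / denom)


# Hypothetical values. With kappa_true = 1 and d_tot / d_x = 2, the conditions
# read kappa > 3 and kappa_store > 2, which the choices below satisfy.
CFG = dict(d_x=8, d_z=8, kappa=6.0, kappa_store=4.0, gamma_store=1.0)
KAPPA_TRUE, NUM_LABELS = 1.0, 2


def rate_ratio(n):
    """RAM rate divided by the no-retrieval rate, with the store at its threshold size."""
    store = store_threshold(n, NUM_LABELS, CFG["d_x"], CFG["d_z"],
                            CFG["kappa_store"], CFG["gamma_store"])
    ram = ram_rate(n, NUM_LABELS, store, **CFG)
    return ram / no_retrieval_rate(n, NUM_LABELS, CFG["d_x"], KAPPA_TRUE)


if __name__ == "__main__":
    for n in (10**4, 10**6, 10**8):
        print(n, round(rate_ratio(n), 3))
```

Under these assumed values the ratio shrinks as $n$ grows, which is the sense in which the retrieval-augmented rate is faster; for small $n$ the dropped constants can matter more than the exponents.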

Figure 1: Left: Excess risk bound as we vary the retriever and predictor sizes for a fixed $n$ and $\mathscr{I}$, based on Theorem 3.4. Note that different size combinations of predictor and retriever achieve the same risk bound. Right: Excess risk bound of the RAM as we increase the data-store size, in contrast to a direct MLP predictor with no retrieval. We plot curves for various values of $n$, with each color corresponding to a fixed $n$.

Predictor size	small			base			large
Method / Retriever size	small	base	large	small	base	large	small	base	large
No retriever, train predictor $\xi$
Reverse Cross-Entropy		19.6			25.5			29.1
Fixed retriever $\theta_{0}$ , train predictor $\xi$
Reverse Cross-Entropy	23.2	26.6	28.3	27.5	32.4	34.7	32.2	36.4	37.8
Fixed predictor $\xi^{\star}(\theta_{0})$ , train retriever $\theta$
EMDR2	23.9	28.5	31.0	29.2	34.2	36.6	33.4	38.0	40.8
PDist	30.1	34.5	38.4	34.0	39.7	42.8	37.6	42.8	44.7
Reverse Cross-Entropy + PG	25.9	30.6	31.7	31.5	36.4	37.9	36.0	40.2	41.4
Reverse Cross-Entropy + TopK	29.4	35.5	37.9	33.8	39.7	43.0	37.2	42.3	45.0
Jointly train predictor $\xi$ and retriever $\theta$
EMDR2	24.1	30.4	32.7	30.4	35.6	39.3	34.5	39.7	42.1
PDist	28.7	33.2	36.6	33.3	37.1	38.8	36.2	40.2	41.6
Reverse Cross-Entropy + PG	27.1	31.0	32.7	33.3	37.2	38.2	36.5	39.8	41.4
Reverse Cross-Entropy + TopK	32.8	37.8	40.1	36.6	41.8	44.8	38.8	43.8	46.4

Table 1: Exact match accuracy on NQ. We measure the performance of RAMs across various training paradigms and model sizes. Top row specifies the predictor size and the second row specifies the retriever size.

3.6 Connections with prior end-to-end training

We conclude our treatment of end-to-end training of RAMs by drawing parallels between our proposed method and some representative approaches from the literature.

EMDR² Sachan et al. (2021) minimize the following objective based on the negative log-likelihood:

\displaystyle\mathscr{L}^{\textsc{Emdr}^{2}}_{n}(\xi,\theta;\mathscr{I})=-\frac{1}{n}\sum_{i\in[n]}\log p_{\xi,\theta,\mathscr{I}}(y_{i}|x_{i})=-\frac{1}{n}\sum_{i\in[n]}\log\Big(\sum_{z\in\mathscr{I}}p_{\theta,\mathscr{I}}(z|x_{i})\cdot p_{\xi}(y_{i}|x_{i},z)\Big).

(21)

It follows from the convexity of $-\log(\cdot)$ and Jensen’s inequality that our objective in (13) upper bounds the EMDR² objective in (21); as a result, driving down the former also drives down the latter, but not vice versa.
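The inequality can be checked numerically for a single example. The sketch below assumes that the per-example form of (13) is the expected loss in (23) with the log-loss $\ell=-\log p_{\xi}(y|x,z)$; the function names and the toy probabilities are hypothetical.

```python
import math


def emdr2_loss(retriever_probs, predictor_probs):
    """Per-example EMDR2 term: -log sum_z p_theta(z|x) * p_xi(y|x,z)."""
    marginal = sum(q * p for q, p in zip(retriever_probs, predictor_probs))
    return -math.log(marginal)


def expected_log_loss(retriever_probs, predictor_probs):
    """Per-example expected log-loss: sum_z p_theta(z|x) * (-log p_xi(y|x,z))."""
    return sum(-q * math.log(p) for q, p in zip(retriever_probs, predictor_probs))


if __name__ == "__main__":
    # Toy data-store with three evidences (values are made up).
    retriever = [0.7, 0.2, 0.1]
    predictor = [0.9, 0.05, 0.3]
    print(emdr2_loss(retriever, predictor), expected_log_loss(retriever, predictor))

    # "Not vice versa": the marginal likelihood can stay moderate while the
    # expected loss is large, because one evidence is nearly useless.
    retriever = [0.5, 0.5]
    predictor = [0.999, 1e-6]
    print(emdr2_loss(retriever, predictor), expected_log_loss(retriever, predictor))
```

The second case shows why a small EMDR² loss does not force a small expected loss: the logarithm of an average is dominated by the best evidence, whereas the average of logarithms is penalized by every evidence the retriever places mass on.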

Perplexity distillation (PDist) Another approach for joint training of RAMs in the literature involves optimizing two distinct objectives for training the predictor and retriever components. For example, Izacard et al. (2022) propose multiple objectives for retriever training, including PDist (Sachan et al., 2023) which is defined as follows:

\displaystyle\mathscr{L}^{\textsc{PDist}}_{\mathscr{I},n}(\theta;\xi,\mathscr{I})=\frac{1}{n}\sum_{i\in[n]}\mathrm{CE}\big(p^{\textsc{PDist}}_{\xi,\mathscr{I}}(Z|x_{i},y_{i}),p_{\theta,\mathscr{I}}(Z|x_{i})\big),

(22)

where $\mathrm{CE}(\cdot,\cdot)$ denotes the cross entropy between two distributions and

\displaystyle p^{\textsc{PDist}}_{\xi,\mathscr{I}}(z|x,y)={p_{\xi}(y|x,z)}/{\sum_{z^{\prime}\in\mathscr{I}}p_{\xi}(y|x,z^{\prime})}\quad\forall~z\in\mathscr{I},

represents a predictor-assigned distribution over evidences based on their utility towards making the correct prediction. As for the predictor training, they optimize an objective akin to (13) with respect to $\xi$. Besides this similarity in the predictor training, our approach for retriever training has a subtle connection with PDist. Note that PDist optimizes the forward cross-entropy between the predictor-induced and retriever-induced distributions to train the retriever. On the other hand, our objective in (13) is closer to $\frac{1}{n}\sum_{i}\mathrm{CE}(p_{\theta,\mathscr{I}}(Z|x_{i}),p^{\textsc{PDist}}_{\xi,\mathscr{I}}(Z|x_{i},y_{i}))$, the reversed cross-entropy between the two distributions. The former has “mean-seeking” behavior whereas the latter has “mode-seeking” behavior (Huszár, 2015; Gu et al., 2023; Agarwal et al., 2023).
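The contrast between the two directions can be seen on a toy data-store with three evidences. The sketch below is illustrative only: the likelihood values and the three candidate retriever distributions are made up, and the log-loss $\ell=-\log p_{\xi}(y|x,z)$ is assumed. It also checks that the reversed cross-entropy equals the expected log-loss plus a term that does not depend on the retriever.

```python
import math


def cross_entropy(p, q):
    """CE(p, q) = -sum_z p(z) log q(z); terms with p(z) = 0 contribute zero."""
    return -sum(pz * math.log(qz) for pz, qz in zip(p, q) if pz > 0)


def pdist_target(likelihoods):
    """Normalize predictor likelihoods p_xi(y|x,z) over the data-store."""
    total = sum(likelihoods)
    return [v / total for v in likelihoods]


def expected_log_loss(retriever_probs, likelihoods):
    """sum_z p_theta(z|x) * (-log p_xi(y|x,z))."""
    return sum(-q * math.log(v) for q, v in zip(retriever_probs, likelihoods))


# Hypothetical predictor likelihoods for three evidences.
LIKELIHOODS = [0.06, 0.03, 0.01]
TARGET = pdist_target(LIKELIHOODS)

# Hypothetical candidate retriever distributions.
CANDIDATES = {
    "matching": TARGET,            # spreads mass like the target
    "peaked": [0.98, 0.01, 0.01],  # concentrates on the best evidence
    "uniform": [1 / 3, 1 / 3, 1 / 3],
}


def best_candidate(direction):
    """Name of the candidate minimizing the forward or the reverse cross-entropy."""
    if direction == "forward":
        score = lambda q: cross_entropy(TARGET, q)   # PDist-style
    else:
        score = lambda q: cross_entropy(q, TARGET)   # reversed direction
    return min(CANDIDATES, key=lambda name: score(CANDIDATES[name]))


if __name__ == "__main__":
    print("forward CE prefers:", best_candidate("forward"))
    print("reverse CE prefers:", best_candidate("reverse"))
```

Since the reversed cross-entropy is linear in the retriever distribution, it is lowered by shifting mass onto the single most useful evidence, while the forward direction is minimized by matching the whole target distribution.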

Similarity with RLHF/RLAIF Note that the per-example objective of our retrieval training approach takes the form:

\displaystyle\mathbb{E}_{Z\sim p_{\theta,\mathscr{I}}(\cdot|x_{i})}\big[\ell\big(h_{\xi}(x_{i},Z),y_{i}\big)\big],

(23)

i.e., the predictor model provides feedback on the value of the evidences sampled by the retriever model. Alternatively, one can view $-\ell\big(h_{\xi}(x_{i},Z),y_{i}\big)$ as the reward assigned to the evidence $Z$ by the predictor model $h_{\xi}$, and the retriever model aims to select those evidences that maximize this reward. This is similar to the RLHF (Ziegler et al., 2019) or RLAIF (Bai et al., 2022) paradigm, where the underlying LLM aims to sample those generations that maximize the reward assigned by a reward model. However, note that in the RLHF/RLAIF paradigm the policy network and the reward model are not trained jointly, unlike in RAMs.
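The reward view can be illustrated with a toy retriever that is a softmax over one free score per evidence, trained by gradient descent on the expected loss in (23) while the predictor's per-evidence losses are held fixed. This is a minimal sketch under those assumptions, not the training procedure used in the experiments; the loss values, step count, and learning rate are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_loss_grad(scores, losses):
    """Exact gradient of E_{Z ~ softmax(scores)}[loss(Z)] with respect to the scores."""
    probs = softmax(scores)
    mean_loss = sum(p * l for p, l in zip(probs, losses))
    # d/ds_j sum_z p_z l_z = p_j * (l_j - mean_loss)
    return [p * (l - mean_loss) for p, l in zip(probs, losses)]


def train_retriever(losses, steps=200, lr=1.0):
    """Gradient descent on the expected loss, starting from a uniform retriever."""
    scores = [0.0] * len(losses)
    for _ in range(steps):
        grad = expected_loss_grad(scores, losses)
        scores = [s - lr * g for s, g in zip(scores, grad)]
    return softmax(scores)


if __name__ == "__main__":
    # Hypothetical predictor losses for three evidences (reward = -loss).
    losses = [2.0, 0.1, 1.0]
    print([round(p, 3) for p in train_retriever(losses)])
```

Evidences whose loss is above the current average lose probability mass and those below gain it, so the retriever drifts toward the evidence with the highest reward.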

4 Experiments

There have been numerous successful practical applications of RAMs in the literature (e.g., Sachan et al. (2021); Izacard et al. (2022)). Here, we present a brief empirical study of such models to corroborate the benefits predicted by our theoretical results. In particular, we consider the task of open-domain question answering, show that the proposed objective is competitive with the objectives proposed in the literature, and observe the trade-offs in model capacity between the retriever and predictor models.

Data Our evaluation is based on two benchmark datasets, NQOpen (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), which serve as sources of supervised examples $(x,y)$, while chunked Wikipedia 2018 is used as the data-store $\mathscr{I}$, following the literature (Karpukhin et al., 2020a). Consistent with established practice, we employ the exact match metric to assess the correspondence between the predicted answers and the ground truth. Additionally, we introduce a recall metric to measure the frequency at which the answer string appears within the retrieved documents.
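For concreteness, the two metrics can be sketched as follows. The normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace) is an assumption based on common practice in question answering evaluation; the text above does not specify it, and the helper names are hypothetical.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def answer_recall(retrieved_passages, gold_answers):
    """True if any gold answer string appears in any retrieved passage."""
    passages = [normalize_answer(p) for p in retrieved_passages]
    golds = [g for g in (normalize_answer(a) for a in gold_answers) if g]
    return any(gold in passage for gold in golds for passage in passages)


def corpus_scores(examples):
    """Average both metrics (in percent) over (prediction, passages, gold_answers) triples."""
    em = sum(exact_match(pred, golds) for pred, _, golds in examples)
    rec = sum(answer_recall(passages, golds) for _, passages, golds in examples)
    return 100.0 * em / len(examples), 100.0 * rec / len(examples)


if __name__ == "__main__":
    examples = [
        ("The Eiffel Tower.", ["The Eiffel Tower is in Paris."], ["Eiffel Tower"]),
        ("London", ["Paris is the capital of France."], ["Paris"]),
    ]
    print(corpus_scores(examples))
```

The second toy example shows how the two metrics can disagree: the retrieved passage contains the answer, so recall counts it, while the prediction itself is wrong.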

Models We implement the retriever component using GTR (Ni et al., 2022) and the predictor component using T5 (Raffel et al., 2020). We sweep across small, base, and large configurations for both retriever and predictor. The details regarding the model sizes, expressed in terms of the number of parameters, are provided in Table 6 (App. C).

Predictor size	small			base			large
Method / Retriever size	small	base	large	small	base	large	small	base	large
No retriever, train predictor $\xi$
Reverse Cross-Entropy		17.9			23.1			28.0
Fixed retriever $\theta_{0}$ , train predictor $\xi$
Reverse Cross-Entropy	31.5	34.9	38.8	37.0	40.6	44.4	43.4	45.9	49.7
Fixed predictor $\xi^{\star}(\theta_{0})$ , train retriever $\theta$
EMDR2	34.6	41.3	48.3	40.1	48.2	53.4	46.0	50.7	54.9
PDist	45.7	53.3	57.2	50.8	53.2	61.6	53.5	55.4	62.3
Reverse Cross-Entropy + PG	43.2	46.7	54.3	48.6	56.1	55.1	51.7	56.4	56.7
Reverse Cross-Entropy + TopK	43.6	50.4	54.4	48.6	54.9	58.5	52.1	56.6	60.3
Jointly train predictor $\xi$ and retriever $\theta$
EMDR2	37.0	43.1	49.7	42.4	50.5	55.6	47.1	53.4	59.2
PDist	46.7	54.3	57.3	48.8	56.7	60.7	51.0	58.5	63.3
Reverse Cross-Entropy + PG	47.0	52.9	55.7	49.9	57.6	61.1	52.1	59.8	59.2
Reverse Cross-Entropy + TopK	46.8	52.9	56.0	49.2	56.6	60.1	52.3	58.8	62.4

Table 2: Exact match accuracy on TriviaQA. We measure the performance of RAMs across various training paradigms and model sizes. Top row specifies the predictor size and the second row specifies the retriever size.

Methods We compare the following approaches: 1) utilizing no retriever and directly training the predictor, 2) employing a fixed retriever but training the predictor, 3) using a fixed predictor but training the retriever, and 4) jointly training both components. For the joint training and the retriever training phases, we experiment with multiple objectives: EMDR2 (cf. (21)), PDist (cf. (22)), Reverse Cross-Entropy + PG (cf. (45) in App. C.1), and Reverse Cross-Entropy + TopK (cf. (44) in App. C.1). Efficiently implementing any of these objectives is challenging due to the need to compute the gradient with respect to an expectation over the entire data-store. We consider two approaches for computing the gradients approximately: 1) restricting the expectation to the top-K elements, similar to EMDR2 and PDist; and 2) using REINFORCE (Williams, 1992) to obtain an unbiased estimate. More details can be found in App. C.1.
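The two approximations can be sketched on a toy softmax retriever with one free score per evidence. This is an illustration under stated assumptions, not the implementation from App. C.1: in particular, renormalizing the retriever distribution over the top-K elements is an assumed detail, and the function names, scores, and losses are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def exact_grad(scores, losses):
    """Exact gradient of E_{Z ~ softmax(scores)}[loss(Z)] with respect to the scores."""
    probs = softmax(scores)
    mean_loss = sum(p * l for p, l in zip(probs, losses))
    return [p * (l - mean_loss) for p, l in zip(probs, losses)]


def topk_grad(scores, losses, k):
    """Gradient of the expectation restricted to, and renormalized over, the top-k scores."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    sub = exact_grad([scores[j] for j in top], [losses[j] for j in top])
    grad = [0.0] * len(scores)
    for j, g in zip(top, sub):
        grad[j] = g  # evidences outside the top-k receive no gradient
    return grad


def reinforce_grad(scores, losses, num_samples, rng):
    """Score-function (REINFORCE) estimate: average of loss(z) * grad log p(z)."""
    probs = softmax(scores)
    indices = range(len(scores))
    grad = [0.0] * len(scores)
    for _ in range(num_samples):
        z = rng.choices(indices, weights=probs)[0]
        for j in indices:
            indicator = 1.0 if j == z else 0.0
            grad[j] += losses[z] * (indicator - probs[j]) / num_samples
    return grad


if __name__ == "__main__":
    scores = [0.5, -0.2, 1.0, 0.0, 0.3]
    losses = [0.2, 1.5, 0.7, 2.0, 1.0]
    print("exact    ", [round(g, 3) for g in exact_grad(scores, losses)])
    print("top-3    ", [round(g, 3) for g in topk_grad(scores, losses, 3)])
    print("REINFORCE", [round(g, 3) for g in reinforce_grad(scores, losses, 20000, random.Random(0))])
```

The top-K variant is deterministic but biased whenever evidences outside the top-K carry mass, whereas the sampled estimate is unbiased but noisy; in practice a baseline is often subtracted from the loss to reduce its variance.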

Observation 1 The addition of a retrieval component markedly enhances performance, as demonstrated in Tables 1 and 2, which present the exact match accuracy. Further improvements are observed when the retriever is specifically trained while keeping the predictor fixed. Joint training emerges as the most effective strategy.

Observation 2 Tables 4 and 5 (App. C) list the recall for the presence of the answer string within the retrieved content. PDist consistently achieves the highest recall, aligning with expectations given its design for distilling the retriever based on the predictor’s scores. However, despite its superior recall, other objectives may lead to better overall performance than PDist, suggesting that different objectives optimize the retriever and predictor with varying efficiencies.

Observation 3 Finally, in Table 3, we report the queries per second (QPS), as a proxy for computational cost, achieved by different configurations of retriever and predictor model sizes. To achieve a specific accuracy threshold (e.g., $\geq$ 38.8 on NQ), multiple configurations are viable, such as pairing a large predictor with a small retriever, a base model for both, or a small predictor with a large retriever. The associated QPS rates for these configurations are 135, 333, and 800, respectively, illustrating that the same accuracy threshold can be met at significantly different QPS rates. This corroborates the trade-offs in our excess risk bounds for MLPs with different capacities in the retriever and predictor components, as illustrated in Figure 1. Thus, adding capacity to different parts of the model has different repercussions for quality and computational cost.
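This selection can be reproduced directly from the reported numbers. The sketch below copies the NQ exact match values of the jointly trained Reverse Cross-Entropy + TopK row of Table 1 and the QPS values of Table 3, keyed by (predictor size, retriever size); the function name is hypothetical.

```python
# NQ exact match, joint training with Reverse Cross-Entropy + TopK (Table 1).
NQ_EXACT_MATCH = {
    ("small", "small"): 32.8, ("small", "base"): 37.8, ("small", "large"): 40.1,
    ("base", "small"): 36.6, ("base", "base"): 41.8, ("base", "large"): 44.8,
    ("large", "small"): 38.8, ("large", "base"): 43.8, ("large", "large"): 46.4,
}

# Queries per second for the same (predictor, retriever) pairs (Table 3).
QPS = {
    ("small", "small"): 822.60, ("small", "base"): 819.83, ("small", "large"): 800.89,
    ("base", "small"): 334.30, ("base", "base"): 333.22, ("base", "large"): 331.06,
    ("large", "small"): 135.06, ("large", "base"): 135.34, ("large", "large"): 134.87,
}


def viable_configs(threshold):
    """(predictor, retriever) pairs meeting the accuracy threshold, fastest first."""
    ok = [cfg for cfg, em in NQ_EXACT_MATCH.items() if em >= threshold]
    return sorted(ok, key=lambda cfg: QPS[cfg], reverse=True)


if __name__ == "__main__":
    for predictor, retriever in viable_configs(38.8):
        cfg = (predictor, retriever)
        print(f"predictor={predictor:5s} retriever={retriever:5s} "
              f"EM={NQ_EXACT_MATCH[cfg]:.1f} QPS={QPS[cfg]:.2f}")
```

Among the configurations that meet the threshold, the small predictor with a large retriever is the fastest, since QPS in Table 3 is driven almost entirely by the predictor size.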

5 Discussion and related work

Predictor size	small				base				large
Retriever size	small	base	large		small	base	large		small	base	large
QPS	822.60	819.83	800.89		334.30	333.22	331.06		135.06	135.34	134.87

Table 3: Queries per second. We measure the queries per second processed by RAMs as a proxy for computational cost across various model sizes. The top row specifies the predictor size and the second row specifies the retriever size.

Several works have proposed some form of retrieval augmented models. Here, we provide a brief account of the evolution of RAMs and discuss how our proposed joint-learning objective and the framework for excess risk analysis compare with existing end-to-end training methods.

Augment with local neighborhood The earliest approaches, dating back to the 1970s, simply augmented the input with training instances from its local neighborhood in the input space (Stone, 1977, 1980). Such approaches gained a lot of attention as parametric regression was not adequate in various practical applications of the time. This line of work aims to fit a low-degree polynomial at each point in the data set based on a subset of data points, which resulted in a rich literature on local polynomial regression in low dimensions (Katkovnik and Kheisin, 1979; Cleveland, 1979; Pinsker, 1980; Donoho and Liu, 1988; Ruppert and Wand, 1994; Ibragimov and Has Minskii, 2013). These classical ideas have found application in many ML algorithms, such as face recognition (Jain and Learned-Miller, 2011), dimensionality reduction via local linear embeddings (Roweis and Saul, 2000), domain adaptation (Yang et al., 2021), and test-time training on neighboring points (Sun et al., 2020; Gandelsman et al., 2022). Recently, Basu et al. (2023) generalized this setup of augmenting with a local neighborhood of the input instance to modern ML models like neural networks and proposed a statistical framework to study such retrieval-based models. However, they do not consider a learned or specialized distance metric for finding the augmenting set, which is critical for realizing good performance in practice (Schonberger et al., 2017; Karpukhin et al., 2020b) and is studied in the present work.

Fixed retriever augmentation The next generation of retrieval-augmented models started to deploy either a hand-crafted or a learned retriever. Zhang et al. (2006) employed SIFT-based (Lowe, 1999) retrieval followed by an SVM (Cortes and Vapnik, 1995) classifier to improve performance on multiple vision tasks. Chen et al. (2009) studied generalization bounds for SVM-kNN methods, one of the few works in this domain with a formal analysis. For natural language understanding, methods like TF-IDF (Sparck Jones, 1972) were employed in tasks like case-based reasoning (Leake et al., 1996) and open-domain question answering (ODQA; Voorhees et al. 1999). Unlike many previous methods, one retrieves relevant text passages in ODQA settings as opposed to retrieving labelled training pairs. With the introduction of transformers (Vaswani et al., 2017), retriever and predictor models based on encoders and decoders, respectively, have become popular across various domains, including image classification (Long et al., 2022; Iscen et al., 2023), text classification (Wang et al., 2022; Zemlyanskiy et al., 2022), ODQA (Lee et al., 2019; Izacard and Grave, 2021), language modelling (Borgeaud et al., 2021), and even protein folding prediction (Cramer, 2021). Even using the same transformer model as both retriever and predictor boosts performance in language modeling (Khandelwal et al., 2020). Unlike for SVM-kNN (Chen et al., 2009), to the best of our knowledge, a formal analysis of retrieval-augmented approaches with modern neural networks is missing from the literature. Interestingly, retrieving examples also helps in-context learning (Rubin et al., 2022; Li et al., 2023). Our framework covers this scenario with $z$ representing the in-context examples retrieved from a data-store of examples. Our risk bounds can provide insights into why in-context learning with retrieved few-shot examples performs better than a zero-shot model.

End-to-end trained retriever augmentation For ODQA, Guu et al. (2020) proposed maximizing the marginalized likelihood by considering the retrieved set as a latent variable. EMDR2 (Sachan et al., 2021) optimized the same objective by approximating it based on the retriever-induced distribution over the elements that receive the top-K scores from the retriever. Hindsight (Paranjape et al., 2022) instead optimizes the ELBO by introducing a variational distribution with access to the outputs. VOD (Liévin et al., 2023) further generalized the standard ELBO based on KL divergence by employing Rényi divergence, thereby tightening the lower bound. On the other hand, Atlas (Izacard et al., 2022) proposed an auxiliary loss for training the retriever directly rather than following the latent variable approach. Interestingly, RAG (Lewis et al., 2020) proposed to train only the query encoder of the retriever, leaving the retrieval index fixed, thereby alleviating much of the end-to-end training difficulty of RAMs, but at the cost of limiting the flexibility of model adaptation. None of these prior works studied statistical properties vis-à-vis expressivity and generalization of RAMs.

6 Conclusion

In this work, we initiate the development of a theoretical framework to study the statistical properties of RAMs with data-dependent retrieval. Our excess risk analysis allows us to highlight how the retriever and predictor components play complementary roles in reducing approximation error as we increase their respective function class complexity. We surface, both theoretically and empirically, a Pareto surface of predictor and retriever sizes that achieve the same performance. As future work, it would be interesting to study the effect of a dynamically updatable data-store and multi-step retrieval on predictions.

References

Agarwal et al. [2023] Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. Gkd: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649, 2023.
Asai et al. [2023] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.
Austin et al. [2021] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Bai et al. [2022] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
Bartlett et al. [2019] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301, 2019.
Basu et al. [2023] Soumya Basu, Ankit Singh Rawat, and Manzil Zaheer. A statistical perspective on retrieval-based models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 1852–1886. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/basu23a.html.
Borgeaud et al. [2021] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. Improving language models by retrieving from trillions of tokens. CoRR, abs/2112.04426, 2021.
Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
Burda et al. [2015] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
Chen et al. [2009] Yihua Chen, Eric K Garcia, Maya R Gupta, Ali Rahimi, and Luca Cazzanti. Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research, 10(3), 2009.
Chowdhery et al. [2022] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
Cleveland [1979] William S Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American statistical association, 74(368):829–836, 1979.
Cortes and Vapnik [1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
Cramer [2021] Patrick Cramer. Alphafold2 and the future of structural biology. Nature Structural & Molecular Biology, 28(9):704–705, 2021.
Das et al. [2021] Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay Yoon Lee, Lizhen Tan, Lazaros Polymenakos, and Andrew McCallum. Case-based reasoning for natural language queries over knowledge bases. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9594–9611, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.755.
Dehghani et al. [2023] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023.
Donoho and Liu [1988] David L Donoho and Richard C Liu. The "automatic" robustness of minimum distance functionals. The Annals of Statistics, 16(2):552–586, 1988.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
Epasto et al. [2020] Alessandro Epasto, Mohammad Mahdian, Vahab Mirrokni, and Emmanouil Zampetakis. Optimal approximation-smoothness tradeoffs for soft-max functions. Advances in Neural Information Processing Systems, 33:2651–2660, 2020.
Gandelsman et al. [2022] Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei Efros. Test-time training with masked autoencoders. Advances in Neural Information Processing Systems, 35:29374–29385, 2022.
Grathwohl et al. [2021] Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, and Chris Maddison. Oops i took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pages 3831–3841. PMLR, 2021.
Gu et al. [2023] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023.
Guu et al. [2020] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
henrikl [2021] henrikl (https://math.stackexchange.com/users/351007/henrikl). 1-smoothness of the symmetric softmax function. Mathematics Stack Exchange, 2021. URL https://math.stackexchange.com/q/4170855 (version: 2021-06-12).
Huszár [2015] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.
Ibragimov and Has Minskii [2013] Ildar Abdulovich Ibragimov and Rafail Zalmanovich Has Minskii. Statistical estimation: asymptotic theory, volume 16. Springer Science & Business Media, 2013.
Iscen et al. [2023] Ahmet Iscen, Alireza Fathi, and Cordelia Schmid. Improving image recognition by retrieving from web-scale image-text data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19295–19304, 2023.
Izacard and Grave [2021] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.74. URL https://aclanthology.org/2021.eacl-main.74.
Izacard et al. [2022] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022.
Jain and Learned-Miller [2011] Vidit Jain and Erik Learned-Miller. Online domain adaptation of a pre-trained cascade of classifiers. In CVPR 2011, pages 577–584. IEEE, 2011.
Joshi et al. [2017] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017.
Karpukhin et al. [2020a] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020a. Association for Computational Linguistics.
Karpukhin et al. [2020b] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020b.
Katkovnik and Kheisin [1979] Vladimir Yakovlevich Katkovnik and VE Kheisin. Dynamic stochastic approximation of polynomials drifts. Avtomatika i Telemekhanika, pages 89–98, 1979.
Khandelwal et al. [2020] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations, 2020.
Kwiatkowski et al. [2019] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
Leake et al. [1996] David B. Leake, Andrew Kinley, and David C. Wilson. Acquiring case adaptation knowledge: A hybrid approach. In AAAI/IAAI, Vol. 1, 1996. URL https://api.semanticscholar.org/CorpusID:11169287.
Lee et al. [2019] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1612. URL https://aclanthology.org/P19-1612.
Lewis et al. [2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
Lewkowycz et al. [2022] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
Li et al. [2023] Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19565–19594. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/li23l.html.
Liévin et al. [2023] Valentin Liévin, Andreas Geert Motzfeldt, Ida Riis Jensen, and Ole Winther. Variational open-domain question answering. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 20950–20977. PMLR, 23–29 Jul 2023.
Lin et al. [2023] Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352, 2023.
Liska et al. [2022] Adam Liska, Tomas Kocisky, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, Cyprien De Masson D’Autume, Tim Scholtes, Manzil Zaheer, Susannah Young, Ellen Gilsenan-Mcmahon, Sophia Austin, Phil Blunsom, and Angeliki Lazaridou. StreamingQA: A benchmark for adaptation to new knowledge over time in question answering models. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 13604–13622. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/liska22a.html.
Long et al. [2022] Alexander Long, Wei Yin, Thalaiyasingam Ajanthan, Vu Nguyen, Pulak Purkait, Ravi Garg, Alan Blair, Chunhua Shen, and Anton van den Hengel. Retrieval augmented classification for long-tail visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6959–6969, 2022.
Lowe [1999] David G Lowe. Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on computer vision, volume 2, pages 1150–1157. Ieee, 1999.
McSherry and Talwar [2007] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007.
Meinhardt et al. [2022] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8844–8854, 2022.
Ni et al. [2022] Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, et al. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844–9855, 2022.
OpenAI [2023] OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023. URL https://api.semanticscholar.org/CorpusID:257532815.
Paranjape et al. [2022] Ashwin Paranjape, Omar Khattab, Christopher Potts, Matei Zaharia, and Christopher D Manning. Hindsight: Posterior-guided training of retrievers for improved open-ended generation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Vr_BTpw3wz.
Pinsker [1980] Mark Semenovich Pinsker. Optimal filtering of square-integrable signals in gaussian noise. Problemy Peredachi Informatsii, 16(2):52–68, 1980.
Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Roweis and Saul [2000] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
Rubin et al. [2022] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.191. URL https://aclanthology.org/2022.naacl-main.191.
Ruppert and Wand [1994] David Ruppert and Matthew P Wand. Multivariate locally weighted least squares regression. The annals of statistics, pages 1346–1370, 1994.
Sachan et al. [2021] Devendra Singh Sachan, Siva Reddy, William L. Hamilton, Chris Dyer, and Dani Yogatama. End-to-end training of multi-document reader and retriever for open-domain question answering. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=5KWmB6JePx.
Sachan et al. [2023] Devendra Singh Sachan, Mike Lewis, Dani Yogatama, Luke Zettlemoyer, Joelle Pineau, and Manzil Zaheer. Questions are all you need to train a dense passage retriever. Transactions of the Association for Computational Linguistics, 11:600–616, 2023.
Schonberger et al. [2017] Johannes L Schonberger, Hans Hardmeier, Torsten Sattler, and Marc Pollefeys. Comparative evaluation of hand-crafted and learned local features. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1482–1491, 2017.
Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
Shuster et al. [2021] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567, 2021.
Siegel [2023] Jonathan W Siegel. Optimal approximation rates for deep relu neural networks on sobolev and besov spaces. Journal of Machine Learning Research, 24(357):1–52, 2023.
Singhal et al. [2023] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, pages 1–9, 2023.
Sparck Jones [1972] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1):11–21, 1972.
Stone [1977] Charles J Stone. Consistent nonparametric regression. The annals of statistics, pages 595–620, 1977.
Stone [1980] Charles J Stone. Optimal rates of convergence for nonparametric estimators. The annals of Statistics, pages 1348–1360, 1980.
Sun et al. [2020] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International conference on machine learning, pages 9229–9248. PMLR, 2020.
Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
Thai et al. [2023] Dung Thai, Dhruv Agarwal, Mudit Chaudhary, Rajarshi Das, Manzil Zaheer, Jay-Yoon Lee, Hannaneh Hajishirzi, and Andrew McCallum. Machine reading comprehension using case-based reasoning. arXiv preprint arXiv:2305.14815, 2023.
Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Voorhees et al. [1999] Ellen M Voorhees et al. The trec-8 question answering track report. In Trec, volume 99, pages 77–82, 1999.
Wang et al. [2022] Shuohang Wang, Yichong Xu, Yuwei Fang, Yang Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, and Michael Zeng. Training data is more valuable than you think: A simple and effective method by retrieving from training data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3170–3179, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.226. URL https://aclanthology.org/2022.acl-long.226.
Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
Yang et al. [2021] Shiqi Yang, Joost van de Weijer, Luis Herranz, Shangling Jui, et al. Exploiting the intrinsic neighborhood structure for source-free domain adaptation. Advances in neural information processing systems, 34:29393–29405, 2021.
Yarotsky [2017] Dmitry Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
Zemlyanskiy et al. [2022] Yury Zemlyanskiy, Michiel de Jong, Joshua Ainslie, Panupong Pasupat, Peter Shaw, Linlu Qiu, Sumit Sanghai, and Fei Sha. Generate-and-retrieve: Use your predictions to improve retrieval for semantic parsing. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4946–4951, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.438.
Zhang et al. [2006] Hao Zhang, Alexander C Berg, Michael Maire, and Jitendra Malik. Svm-knn: Discriminative nearest neighbor classification for visual category recognition. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 2126–2136. IEEE, 2006.
Zhang [2023] Tong Zhang. Mathematical analysis of machine learning algorithms. Cambridge University Press, 2023.
Ziegler et al. [2019] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

Appendix A Preliminaries

Definition A.1 (Rademacher complexity).

Given a sample $\mathscr{S}_{n}=\{(x_{i},y_{i})\}_{i\in[n]}\subset\mathscr{X}\times\mathscr{Y}$ and a real-valued function class $\mathscr{F}:\mathscr{X}\times\mathscr{Y}\to\mathbb{R}$ , the empirical Rademacher complexity of $\mathscr{F}$ with respect to $\mathscr{S}_{n}$ is defined as

\displaystyle\mathfrak{R}_{\mathscr{S}_{n}}(\mathscr{F})=\frac{1}{n}\mathbb{E}_{\bm{\sigma}}\left[\sup_{f\in\mathscr{F}}\sum_{i=1}^{n}\sigma_{i}f(x_{i},y_{i})\right],

(24)

where $\bm{\sigma}=\{\sigma_{i}\}_{i\in[n]}$ is a collection of $n$ i.i.d. Rademacher random variables, each uniform on $\{-1,+1\}$. For $n\in{\mathbb{N}}$, the Rademacher complexity $\bar{\mathfrak{R}}_{n}(\mathscr{F})$ and the worst-case Rademacher complexity $\mathfrak{R}_{n}(\mathscr{F})$ are defined as follows.

\displaystyle\bar{\mathfrak{R}}_{n}(\mathscr{F})=\mathbb{E}_{\mathscr{S}_{n}\sim\mathsf{D}^{n}}\left[\mathfrak{R}_{\mathscr{S}_{n}}(\mathscr{F})\right],\quad\text{and}\quad\mathfrak{R}_{n}(\mathscr{F})=\sup_{\mathscr{S}_{n}\in(\mathscr{X}\times\mathscr{Y})^{n}}\mathfrak{R}_{\mathscr{S}_{n}}(\mathscr{F}).

(25)
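To make Eq. (24) concrete, the following is a minimal sketch that estimates the empirical Rademacher complexity of a finite function class by Monte Carlo. The function name `empirical_rademacher`, the toy class, and the number of sign draws are illustrative assumptions, not part of the paper's analysis.

```python
import numpy as np

def empirical_rademacher(values, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity, Eq. (24).

    values[k, i] holds f_k(x_i, y_i) for the k-th function of a finite class.
    """
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    # Each row is one draw of n i.i.d. signs, uniform on {-1, +1}.
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    # correlations[t, k] = sum_i sigma[t, i] * f_k(x_i, y_i)
    correlations = sigma @ values.T
    # Supremum over the finite class, average over sign draws, scale by 1/n.
    return float(correlations.max(axis=1).mean() / n)

# Hypothetical toy class: 5 functions evaluated on n = 50 sample points.
toy_values = np.random.default_rng(1).uniform(-1.0, 1.0, size=(5, 50))
print(empirical_rademacher(toy_values))
```

For a class containing a single function the estimate is close to zero, since the signs are centered; the value grows as the class becomes richer.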

Definition A.2 (Covering number).

Let $\epsilon>0$ and $\|\cdot\|$ be a norm defined over $\mathbb{R}^{n}$. Given a function class $\mathscr{F}:\mathscr{X}\times\mathscr{Y}\to\mathbb{R}$ and a collection of points $\mathscr{S}_{n}=\{(x_{i},y_{i})\}_{i\in[n]}\subset\mathscr{X}\times\mathscr{Y}$, we call a set of points $\{u_{j}\}_{j\in[m]}\subset\mathbb{R}^{n}$ an $(\epsilon,\|\cdot\|)$-cover of $\mathscr{F}$ with respect to $\mathscr{S}_{n}$ if we have

\displaystyle\sup_{f\in\mathscr{F}}\min_{j\in[m]}\|f(\mathscr{S}_{n})-u_{j}\|\leq\epsilon,

(26)

where $f(\mathscr{S}_{n})=\big{(}f(x_{1},y_{1}),\ldots,f(x_{n},y_{n})\big{)}\in\mathbb{R}^{n}$. The $\|\cdot\|$-covering number $\mathcal{N}(\mathscr{F},\epsilon,\|\cdot\|;\mathscr{S}_{n})$ denotes the cardinality of a minimal $(\epsilon,\|\cdot\|)$-cover of $\mathscr{F}$ with respect to $\mathscr{S}_{n}$. In particular, if $\|\cdot\|$ is an $\ell_{p}$ norm (i.e., $\|v\|=(\sum_{j=1}^{n}|v_{j}|^{p})^{1/p}$ for $v\in\mathbb{R}^{n}$), then we simply use $\mathcal{N}(\mathscr{F},\epsilon,\|\cdot\|_{L_{p}};\mathscr{S}_{n})$ to denote the corresponding $\ell_{p}$-covering number.

When $\mathscr{S}_{n}$ is unambiguous we may drop it, i.e., we use $\mathcal{N}(\mathscr{F},\epsilon,\|\cdot\|_{L_{p}})$ to represent the covering number.
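As an illustration of Definition A.2, the sketch below greedily builds an $(\epsilon,\ell_{p})$-cover for a finite class given by its evaluation vectors $f(\mathscr{S}_{n})$. The helper `greedy_cover` and the toy class of linear functions are hypothetical. The greedy cover only uses centers from the class itself and need not be minimal, so its size is an upper bound on the covering number rather than the covering number itself.

```python
import numpy as np

def greedy_cover(evals, eps, p=2):
    """Greedily select centers so that every row of evals is within eps of one.

    evals[k] = f_k(S_n) in R^n. Returns the indices of the chosen centers.
    """
    centers = []
    uncovered = np.ones(len(evals), dtype=bool)
    while uncovered.any():
        k = int(np.argmax(uncovered))  # index of the first uncovered function
        centers.append(k)
        dist = np.linalg.norm(evals - evals[k], ord=p, axis=1)
        uncovered &= dist > eps        # everything within eps of k is covered
    return centers

# Hypothetical class: f_a(x) = a * x for 101 slopes a, evaluated on 20 points.
slopes = np.linspace(0.0, 1.0, 101)
points = np.linspace(-1.0, 1.0, 20)
evals = slopes[:, None] * points[None, :]
print(len(greedy_cover(evals, eps=0.5)))
```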

Definition A.3 (Multi-layer perceptron (MLP)).

For both the retriever and the predictor, we consider the class of multi-layer perceptrons, i.e., fully connected deep neural networks, with the ReLU nonlinearity $\sigma(x)=\max(x,0)$. An MLP is specified by the number of layers $L$ and the width $W$. For a weight matrix $\mathbf{W}\in\mathbb{R}^{d_{2}\times d_{1}}$ and a bias $b\in\mathbb{R}^{d_{2}}$, we define the affine transform $A_{\mathbf{W},b}(\mathbb{R}^{d_{1}},\mathbb{R}^{d_{2}}):x\mapsto\mathbf{W}x+b$. Let $\sigma\circ A_{\mathbf{W},b}(\mathbb{R}^{d_{1}},\mathbb{R}^{d_{2}})$ denote the elementwise application of the ReLU nonlinearity to the affine transform. The class of MLPs with $L$ layers and width $W$ is defined as

{\rm MLP}(\mathbb{R}^{d},\mathbb{R}^{k};W,L)=\{A_{\mathbf{W}_{L},b_{L}}\circ\sigma\circ A_{\mathbf{W}_{L-1},b_{L-1}}\circ\dots\circ\sigma\circ A_{\mathbf{W}_{0},b_{0}}\},

(27)

where $\mathbf{W}_{L}\in\mathbb{R}^{k\times W}$ and $b_{L}\in\mathbb{R}^{k}$ ; $\mathbf{W}_{i}\in\mathbb{R}^{W\times W}$ and $b_{i}\in\mathbb{R}^{W}$ , for $1\leq i\leq(L-1)$ ; and $\mathbf{W}_{0}\in\mathbb{R}^{W\times d}$ and $b_{0}\in\mathbb{R}^{W}$ .
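The composition in Eq. (27) can be sketched in a few lines of NumPy. The helper names and the random initialization scale below are assumptions made for illustration only; the layer shapes follow the definition above.

```python
import numpy as np

def init_mlp(d, k, width, depth, seed=0):
    """Random parameters for MLP(R^d, R^k; W=width, L=depth) as in Eq. (27)."""
    rng = np.random.default_rng(seed)
    sizes = [d] + [width] * depth + [k]  # depth + 1 affine maps A_0, ..., A_L
    return [(rng.standard_normal((n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    h = x
    for weight, bias in params[:-1]:
        h = np.maximum(weight @ h + bias, 0.0)  # sigma applied after A_i
    weight, bias = params[-1]
    return weight @ h + bias                    # final affine map A_L, no ReLU

def num_params(params):
    return sum(weight.size + bias.size for weight, bias in params)

params = init_mlp(d=3, k=2, width=4, depth=3)
print(mlp_forward(params, np.ones(3)).shape, num_params(params))
```

The parameter count grows as $O(LW^{2})$ for fixed input and output dimensions, which is the quantity that enters the pseudo-dimension bounds used later.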

Definition A.4 (Sobolev space).

For $p\geq 1$, we denote the set of functions with finite $L_{p}$ norm over $\Omega$ as $L_{p}(\Omega)$, i.e., for any $f\in L_{p}(\Omega)$, $\|f\|_{L_{p}(\Omega)}\triangleq\big{(}\int_{s\in\Omega}|f(s)|^{p}ds\big{)}^{1/p}<\infty$. Note that for $p=\infty$, we have $\|f\|_{L_{\infty}(\Omega)}=\mathrm{ess}\sup_{s\in\Omega}|f(s)|.$ Let $\alpha\in\mathbb{N}^{d}$ denote a multi-index, and $|\alpha|=\sum_{i\in[d]}\alpha_{i}$ be its degree. We denote by $D^{\alpha}$ the weak derivative with respect to the multi-index $\alpha$ for any function.

For an integer $\kappa>0$ , the Sobolev semi-norm $W^{\kappa}(L_{p}(\Omega))$ for a function $f$ that has weak-derivatives of order $\kappa$ is defined as

\forall 1\leq p<\infty,\quad|f|_{W^{\kappa}(L_{p}(\Omega))}\triangleq\big{(}\sum_{\alpha:|\alpha|=\kappa}\|D^{\alpha}f\|_{L_{p}(\Omega)}^{p}\big{)}^{1/p}\text{ and }|f|_{W^{\kappa}(L_{\infty}(\Omega))}\triangleq\max_{\alpha:|\alpha|=\kappa}\|D^{\alpha}f\|_{L_{\infty}(\Omega)}.

The Sobolev norm $W^{\kappa}(L_{p}(\Omega))$ for the same function $f$ is defined as $\|f\|_{W^{\kappa}(L_{p}(\Omega))}=\|f\|_{L_{p}(\Omega)}+|f|_{W^{\kappa}(L_{p}(\Omega))}.$ A function $f$ with all weak derivatives of order $\kappa$ and a finite $W^{\kappa}(L_{p}(\Omega))$ norm lies in the Sobolev space with $\kappa$ derivatives and $L_{p}(\Omega)$ norm.

In our approximation guarantees for the MLP retriever and predictor classes later, we use [Siegel, 2023, Theorem 1]. We restate the result here for completeness.

Theorem A.5 (Restated Siegel [2023] Theorem 1).

Let $f_{0}:\Omega\to\mathbb{R}$ be a function in the Sobolev space with $\kappa$ derivatives and norm $L_{q}(\Omega)$, for $q\in[1,\infty)$ and $\kappa\in(0,\infty)$. For ${\Omega=[-1,1]^{d}}$ and any $p\in[1,\infty)$ satisfying $(1/q-1/p)\leq\kappa/d$, we have, for a constant $C=c(\kappa,d)<\infty$ and width $W=25d+31$,

\inf_{f\in{\rm MLP}(\mathbb{R}^{d},\mathbb{R};W,L)}\|f-f_{0}\|_{L_{p}(\Omega)}\leq C\|f_{0}\|_{W^{\kappa}(L_{q}(\Omega))}L^{-\tfrac{2\kappa}{d}}.
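As a small worked example of the rate in Theorem A.5, the sketch below inverts the bound to find a depth $L$ that brings the right-hand side under a target error. The quantity `scaled_norm` stands in for the product $C\|f_{0}\|_{W^{\kappa}(L_{q}(\Omega))}$, whose value the theorem does not make explicit, so it is a placeholder assumption here.

```python
import math

def depth_for_target_error(eps, kappa, d, scaled_norm=1.0):
    """Return a depth L with scaled_norm * L**(-2*kappa/d) <= eps.

    scaled_norm is a hypothetical stand-in for C * ||f_0||_{W^kappa(L_q)}.
    """
    L = max(1, math.ceil((scaled_norm / eps) ** (d / (2.0 * kappa))))
    # Guard against floating-point rounding at the boundary.
    while scaled_norm * L ** (-2.0 * kappa / d) > eps:
        L += 1
    return L

# Smoother targets (larger kappa) need far less depth at the same dimension d.
print(depth_for_target_error(0.05, kappa=1, d=4))
print(depth_for_target_error(0.05, kappa=2, d=4))
```

The required depth scales as $\epsilon^{-d/(2\kappa)}$, which makes the dependence on the input dimension $d$ and the smoothness $\kappa$ explicit.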

Our generalization bounds leverage VC-dimension bounds for MLPs [Bartlett et al., 2019]. Here, we state some results from Bartlett et al. [2019] for completeness.

Definition A.6 (VC dimension and growth of a binary function class).

For $\mathcal{H}$, a class of functions from $\mathcal{A}$ to $\{0,1\}$, the growth function of $\mathcal{H}$ evaluated on an input set of size $m$ is defined as

\Pi_{\mathcal{H}}(m)=\max_{a_{1},\dots,a_{m}\in\mathcal{A}}|\{(h(a_{1}),\dots,h(a_{m})):h\in\mathcal{H}\}|.

The ${\rm VCdim}(\mathcal{H})$ is defined as the largest $m$ such that $\Pi_{\mathcal{H}}(m)=2^{m}$; if no such largest $m$ exists, we have ${\rm VCdim}(\mathcal{H})=\infty$.
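The definition can be checked by brute force on a simple class. The sketch below counts the labelings realized by one-dimensional threshold functions $h_{t}(a)=\mathbbm{1}(a\geq t)$ on a fixed point set; this class and the threshold grid are illustrative assumptions, chosen because the answer is easy to verify by hand ($m+1$ labelings on $m$ distinct points, hence VC dimension 1).

```python
import numpy as np

def num_labelings(points, thresholds):
    """Number of distinct label vectors (h_t(a_1), ..., h_t(a_m)) over the grid."""
    labelings = {tuple((points >= t).astype(int)) for t in thresholds}
    return len(labelings)

grid = np.linspace(-1.0, 2.0, 301)  # hypothetical grid of thresholds t
for pts in ([0.5], [0.1, 0.4], [0.1, 0.4, 0.7]):
    pts = np.array(pts)
    # A set of m points is shattered only if all 2**m labelings appear.
    print(len(pts), num_labelings(pts, grid), 2 ** len(pts))
```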

Definition A.7 (Pseudo dimension of real valued function class).

Let $\mathcal{F}$ be a class of functions from some space $\mathcal{A}$ to the real $\mathbb{R}$ . The pseudo-dimension of class $\mathcal{F}$ , denoted by $Pdim(\mathcal{F})$ , is the largest $m$ such that there exists $\{a_{1},\dots,a_{m},r_{1},\dots,r_{m}\}\in\mathcal{A}^{m}\times\mathbb{R}^{m}$ such that for any binary sequence $\{b_{1},\dots,b_{m}\}\in\{0,1\}^{m}$ there exists a function $f\in\mathcal{F}$ satisfying $\forall i:f(a_{i})>r_{i}\iff b_{i}=1$ .

Note that the pseudo-dimension is the same as the VC dimension of the subgraph of class $\mathcal{F}$, which is used in Zhang [2023]. Let $sgn(x)=\mathbbm{1}(x\geq 0)$. We denote by $sgn(f)$ the sign of the function $f:\mathcal{A}\to\mathbb{R}$. We define $sgn(\mathcal{F})\triangleq\{sgn(f):f\in\mathcal{F}\}$, and the VC dimension of the real-valued function class $\mathcal{F}$ as ${\rm VCdim}(\mathcal{F})\triangleq{\rm VCdim}(sgn(\mathcal{F}))$. It is mentioned in Bartlett et al. [2019] that for neural networks with a fixed architecture and fixed activation functions, namely the class ${\rm MLP}$, we have ${\rm VCdim}(sgn({\rm MLP}))=Pdim({\rm MLP})$.

We now adapt [Bartlett et al., 2019, Theorem 6] to the class ${\rm MLP}(\mathbb{R}^{d},\mathbb{R};L,W)$ that employs the ReLU nonlinearity. In the terminology of Bartlett et al. [2019], this amounts to focusing on the number of breakpoints $pnt=1$ and the degree of polynomial $deg=1$. (Footnote 1: Originally in Bartlett et al. [2019], the degree is denoted by $d$ and the number of breakpoints by $p$, but we use $deg$ and $pnt$, respectively, to avoid confusion.) These notations are used for the rest of the paper.

Theorem A.8 (Adaptation of Bartlett et al. [2019] Theorem 6).

Consider the neural network class ${\rm MLP}(\mathbb{R}^{d},\mathbb{R};L,W)$ that has the Relu non-linearity. Let $W_{total,l}$ denote the total number of parameters (weights and biases) up to layer $l\leq(L-1)$ , and $k_{l}$ denote the number of non-linear units (output width) in layer $l$ . Also define the parameters $\bar{L}=\tfrac{1}{W_{total,L}}\sum_{l=1}^{L}W_{total,l}\leq L$ , and $R=\sum_{l=1}^{L}lk_{l}\leq L^{2}W$ . Then for the function class $\mathcal{F}$ of all real-valued functions computed by the MLP class and $m$

\Pi_{sgn(\mathcal{F})}(m)\leq\prod_{l=1}^{L}2\left(\frac{2emk_{l}l}{W_{total,l}}\right)^{W_{total,l}}\leq(4emL)^{W_{total,L}}.

Moreover, we have

{\rm VCdim}(\mathcal{F})\leq L+\bar{L}W_{total,L}\log_{2}(4e\sum_{l}lk_{l}\log_{2}(\sum_{l}2elk_{l}))=O(\bar{L}W_{total,L}\log(L^{2}W)).

We generalize the above result to capture the MLP with multi-dimensional output, as used by our predictor.

Theorem A.9 (Multi-output version of Bartlett et al. [2019] Theorem 6).

Consider the neural network class ${\rm MLP}(\mathbb{R}^{d},\mathbb{R}^{k};L,W)$ that has the ReLU nonlinearity, with $W_{total,l}$, $k_{l}$, $\bar{L}$, and $R$ as defined in Theorem A.8. We denote by $\mathcal{F}$ the class of functions $f:\mathbb{R}^{d}\times[k]\to\mathbb{R}$ where, for each $j\in[k]$, $f(\cdot,j)$ is the $j$-th output coordinate of a neural network in the class ${\rm MLP}(\mathbb{R}^{d},\mathbb{R}^{k};L,W)$. Then, we have

{\rm VCdim}(\mathcal{F})\leq L+\bar{L}W_{total,L}\log_{2}(4e\sum_{l}lk_{l}\log_{2}(\sum_{l}2elk_{l}))=O(\bar{L}W_{total,L}\log(L^{2}W)).

Proof.

Let $a\in\mathbb{R}^{W_{total,L}}$ parameterize a function $f\in\mathcal{F}$. Based on the discussion above, we need to find the ${\rm VCdim}$ of the set $\{sgn(f(x_{i},j,a)):a\in\mathbb{R}^{W_{total,L}},j\in[k],i\in[m]\}$. Note that here $f:\mathbb{R}^{W_{total,L}}\times[m]\times[k]\to\mathbb{R}$ is a function mapping the tuple $(x_{i},j,a)$ to a real number.

We obtain the following inequality.

	$\displaystyle|\{sgn(f(x_{i},j,a)):a\in\mathbb{R}^{W_{total,L}},i\in[m],j\in[k]\}|$
	$\displaystyle\qquad\leq\sum_{j\in[k]}|\{sgn(f(x_{i},j,a)):a\in\mathbb{R}^{W_{total,L}},i\in[m]\}|$
	$\displaystyle\qquad\leq\sum_{j\in[k]}\Pi_{sgn({\rm MLP}(\mathbb{R}^{d},\mathbb{R};L,W))}(m)$
	$\displaystyle\qquad\leq k2^{L}(2eRm/{W_{total,L}})^{W_{total,L}}.$

In the first inequality, we partition the set with respect to $j\in[k]$. For the second inequality, we notice that for a fixed $j$ the function $f(x_{i},j,a)$ is computed by ${\rm MLP}(\mathbb{R}^{d},\mathbb{R};L,W)$, and we bound the cardinality with the growth function $\Pi_{sgn({\rm MLP}(\mathbb{R}^{d},\mathbb{R};L,W))}$ over $m$ points. For the third inequality, we apply the bound on $\Pi_{sgn({\rm MLP}(\mathbb{R}^{d},\mathbb{R};L,W))}(m)$ given inside the proof of Theorem 6 in Bartlett et al. [2019]. Note that here we have specialized to the ReLU nonlinearity, i.e., number of breakpoints $pnt=1$ and degree $deg=1$. Applying Lemma 6 in Bartlett et al. [2019] we obtain

{\rm VCdim}(\mathcal{F})\leq L\log(k)+W_{total,L}\log_{2}(4eR\log_{2}(4eR))=O(L\log(k)+L^{2}W^{2}\log(LW)).

∎

Finally, we state a bounded version of Gibbs' inequality, which lower bounds the cross entropy of two discrete probability distributions.

Proposition A.10 (Truncated Gibbs' inequality).

Let us consider two discrete distributions $\alpha,\beta$ over alphabet size $K$ . Then for any constant $C>0$ , we have

\sum_{i=1}^{K}\alpha_{i}\min(C,-\log(\beta_{i}))\geq\sum_{i=1}^{K}\alpha_{i}% \min(C,-\log(\alpha_{i}))-(K-1)\exp(-C).

Proof.

For two discrete distributions $\alpha,\beta$ over an alphabet of size $K$, we have

	$\displaystyle\sum_{i=1}^{K}\alpha_{i}\min(C,-\log(\beta_{i}))$
	$\displaystyle=-\sum_{i=1}^{K}\alpha_{i}\log(\max(\exp(-C),\beta_{i}))$
	$\displaystyle=-\sum_{i=1}^{K}\alpha_{i}\log(\alpha_{i})+\sum_{i=1}^{K}\alpha_{i}\log\big{(}\alpha_{i}/\max(\exp(-C),\beta_{i})\big{)}$
	$\displaystyle\geq-\sum_{i=1}^{K}\alpha_{i}\log(\alpha_{i})+\big{(}\sum_{i=1}^{K}\alpha_{i}\big{)}\log\big{(}\sum_{i=1}^{K}\alpha_{i}/\sum_{i=1}^{K}\max(\exp(-C),\beta_{i})\big{)}$
	$\displaystyle\geq-\sum_{i=1}^{K}\alpha_{i}\log(\alpha_{i})-\log(1+(K-1)\exp(-C))$
	$\displaystyle\geq-\sum_{i=1}^{K}\alpha_{i}\log(\alpha_{i})-(K-1)\exp(-C)$
	$\displaystyle\geq\sum_{i=1}^{K}\alpha_{i}\min(C,-\log(\alpha_{i}))-(K-1)\exp(-C).$

The first inequality follows from the log-sum inequality. The second inequality follows because $\sum_{i=1}^{K}\alpha_{i}=1$ and $\sum_{i=1}^{K}\max(\exp(-C),\beta_{i})$ is maximized by setting $\beta_{i}=1$ for some $1\leq i\leq K$ and the rest to $0$, which gives the value $1+(K-1)\exp(-C)$. The third inequality follows from $\log(1+x)\leq x$. The final inequality holds because taking a minimum with $C$ can only decrease the value. ∎
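Proposition A.10 is easy to sanity-check numerically. The sketch below draws random pairs of distributions and evaluates the slack between the two sides; the helper names, the Dirichlet sampling, and the values of $K$ and $C$ are assumptions made for illustration.

```python
import numpy as np

def truncated_cross_entropy(alpha, beta, C):
    """sum_i alpha_i * min(C, -log(beta_i)), with -log(0) treated as +inf."""
    with np.errstate(divide="ignore"):
        neg_log = -np.log(beta)
    return float(np.sum(alpha * np.minimum(C, neg_log)))

def gibbs_slack(alpha, beta, C):
    """Left side minus right side of Proposition A.10; should be >= 0."""
    K = len(alpha)
    lhs = truncated_cross_entropy(alpha, beta, C)
    rhs = truncated_cross_entropy(alpha, alpha, C) - (K - 1) * np.exp(-C)
    return lhs - rhs

rng = np.random.default_rng(0)
slacks = [gibbs_slack(rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6)), C=2.0)
          for _ in range(1000)]
print(min(slacks))  # smallest observed slack over the random draws
```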

Appendix B Derivations of main result

As discussed in Section 2, the objective here is to study the excess risk in Eq. (12), which has three main components: generalization error, retriever approximation error, and predictor approximation error (cf. (3.1)). In this section, we structure our results somewhat differently than the main body to capture the general setting of learning the retriever with a fixed predictor, and vice versa. We first prove excess risk bounds for learning the retriever, then excess risk bounds for learning the predictor. Finally, we combine the results to obtain the guarantees for jointly learning the retriever and the predictor presented in the paper. For the rest of the analysis we need to specify the space of retrieved examples to define the complexity of the gap function (cf. 3.1). We recall that our retrieved samples are embedded in a compact subset of $\mathbb{R}^{d_{z}}$, and $\mathscr{X}$ is a compact subset of $\mathbb{R}^{d_{x}}$. In particular, for simplicity, we assume that $\mathscr{X}\subseteq[-1,1]^{d_{x}}$ and $\mathscr{Z}\subseteq[-1,1]^{d_{z}}$.

B.1 Learning the retriever

We first study learning the retriever over class $\Theta$ when the predictor $\xi$ is fixed. The task of learning the retriever corresponds to minimizing the following over $\theta\in\Theta$ ,

\displaystyle\mathbb{E}_{(X,Y)\sim\mathscr{D}}[\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}\ell(h_{\xi}(X,Z),Y)]

\displaystyle=\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}\mathbb{E}_{Y|X}\big{[}\ell(h_{\xi}(X,Z),Y)|X\big{]}\big{]}=\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi}(X,Z)\big{]},

where $g_{\xi}(X,Z)=\mathbb{E}_{Y|X}\ell(h_{\xi}(X,Z),Y)$. We have a closed form for the optimal retriever when it is not restricted to a function class: $p^{\ast,\xi}(z|x)=\mathbbm{1}_{\operatorname*{arg\,min}_{z^{\prime}\in\mathscr{I}}g_{\xi}(x,z^{\prime})}(z)$, where ties are broken arbitrarily.

For the fixed predictor $\xi$, let $\hat{\theta}(\xi)$ minimize the empirical risk on the given sample, and let $\theta(\xi)$ minimize the population risk over the class $\Theta$, i.e.,

	$\displaystyle\hat{\theta}(\xi)=\operatorname*{arg\,min}_{\theta\in\Theta}\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)},$
	$\displaystyle\theta(\xi)=\operatorname*{arg\,min}_{\theta\in\Theta}\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi}(X,Z)\big{]}.$

Here, the probability is defined using the softmax operator for a given $\theta\in\Theta$ as follows:

p_{\theta,\mathscr{I}}\big{(}z|x\big{)}=\frac{\exp\big{(}r_{\theta}(x,z)\big{)}}{\sum_{z^{\prime}\in\mathscr{I}}\exp\big{(}r_{\theta}(x,z^{\prime})\big{)}},\quad\forall~{}z\in\mathscr{I},x\in\mathscr{X}.
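The softmax retriever and the empirical objective minimized by $\hat{\theta}(\xi)$ can be sketched as follows. The arrays `scores` (standing for $r_{\theta}(x_{i},z)$) and `losses` (standing for $\ell(h_{\xi}(x_{i},z),y_{i})$) are hypothetical precomputed inputs; how they are produced by actual retriever and predictor models is outside this sketch.

```python
import numpy as np

def retriever_distribution(scores):
    """Row-wise softmax: scores[i, z] = r_theta(x_i, z) -> p_theta(z | x_i)."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

def empirical_retriever_risk(scores, losses):
    """(1/n) sum_i sum_z p_theta(z | x_i) * losses[i, z]."""
    return float((retriever_distribution(scores) * losses).sum(axis=1).mean())

rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 1.0, size=(4, 5))  # n = 4 inputs, |I| = 5 candidates
print(empirical_retriever_risk(np.zeros((4, 5)), losses))  # uniform retriever
print(empirical_retriever_risk(-50.0 * losses, losses))    # sharply peaked retriever
```

A retriever whose scores track the negative loss concentrates its mass on the best candidate for each input, so its risk approaches the average of the per-input minimum losses.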

Hardness of retrieval:

We recall the Sobolev space with $\kappa$ derivatives as defined in Appendix A. The following is a restatement of Assumption 3.1, but for any $\xi\in\Xi$ and not just the optimal one $\xi^{*}$.

Assumption B.1 (Complexity of $\mathrm{g}_{\xi}$ ).

For any $\xi\in\Xi$ , there exists a baseline $b_{\xi}:[-1,1]^{d_{x}}\to\mathbb{R}$ such that the function $\mathrm{gap}_{\xi}:[-1,1]^{d_{x}+d_{z}}\to\mathbb{R}$ with baseline $b_{\xi}$ , as defined by $\mathrm{gap}_{\xi}(x,z)=(g_{\xi}(x,z)-b_{\xi}(x))$ lies in the Sobolev space with $\kappa$ derivatives and $L_{\infty}([-1,1]^{d_{x}+d_{z}})$ norm.

As noted in the main text, this means that the predictor loss has a possibly ‘complex’ component $b_{\xi}(x)$ and a relatively ‘smooth’ component $\mathrm{gap}_{\xi}(x,z)$, which ensures that two retrieved examples that are close lead to similar losses for the predictor $\xi$, for any $x\in\mathscr{X}$. As $\mathrm{gap}_{\xi}(x,z)$ solely determines the optimal retrieved set, its smoothness defines the hardness of the underlying retrieval task.

Excess risk decomposition:

With the fixed predictor $\xi$, the excess risk in (12) can be bounded as follows:

	$\displaystyle R_{\ell,\mathscr{I}}(\xi,\hat{\theta}(\xi))-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})$
	$\displaystyle\qquad\leq\underbrace{\sum_{\theta\in\{\theta(\xi),\hat{\theta}(\xi)\}}\big{|}\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}-\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi}(X,Z)\big{]}\big{|}}_{\text{retriever generalization error}}$
	$\displaystyle\qquad+\underbrace{R_{\ell,\mathscr{I}}(\xi,\theta(\xi))-\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big{]}}_{\text{retriever approximation error}}+\underbrace{\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big{]}-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})}_{\text{error from predictor $\xi$}}.$

The inequality holds because $\hat{\theta}(\xi)$ minimizes the empirical risk over $\Theta$, so the empirical risk of $\hat{\theta}(\xi)$ is at most that of $\theta(\xi)$.

B.1.1 Generalization error

We now proceed to bound the generalization error using the Rademacher complexity. For any $\delta\in(0,1)$, with probability at least $(1-\delta)$,

	$\displaystyle\Big{|}\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\hat{\theta}(\xi)}(\cdot|X)}g_{\xi}(X,Z)\big{]}-\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\hat{\theta}(\xi)}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\Big{|}$
	$\displaystyle\qquad\leq 2\mathbb{E}_{\bm{\sigma}}\Big{[}\max_{\theta\in\Theta}\frac{1}{n}\sum_{i\in[n]}\sigma_{i}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\Big{]}+3\ell_{\max}\sqrt{\tfrac{\log(2/\delta)}{n}}$
	$\displaystyle\qquad\leq 2\times\inf_{\varepsilon\in[0,c_{\xi}/2]}\big{(}4\varepsilon+\tfrac{12}{\sqrt{n}}\int_{\varepsilon}^{c_{\xi}/2}\sqrt{\log(\mathcal{N}(\Theta,\nu,\|\cdot\|_{2,[n],\xi}))}d\nu\big{)}+3\ell_{\max}\sqrt{\tfrac{\log(2/\delta)}{n}}$		(28)

Using covering number bound with chaining we obtain the final inequality, where

c_{\xi}=\sup_{\theta\in\Theta}\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\big{)}^{2}\Big{)}^{1/2},

and $\mathcal{N}(\Theta,\nu,\|\cdot\|_{2,[n],\xi})$ denotes the covering number of the retriever class $\Theta$ at scale $\nu$ in the $L_{2}$ norm w.r.t. the set $\{(x_{i},y_{i}):i\in[n]\}$ with $\xi$ fixed, where the norm is defined as

\|\mathbf{u}\|_{2,[n],\xi}=\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\sum_{z\in\mathscr{I}}u_{i,z}\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\big{)}^{2}\Big{)}^{1/2},\quad\forall\mathbf{u}\in\mathbb{R}^{n\times|\mathscr{I}|}.

The generalization error in retriever learning depends on the covering number of $\Theta$ (which we shall see is dependent on the embedding space of the retrieved examples).

As $\theta(\xi)$ is a fixed retriever, we do not need to take any union bound over the retriever space. Therefore, with probability at least $(1-\delta)$, we have

\Big{|}\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\theta(\xi)}(\cdot|X)}g_{\xi}(X,Z)\big{]}-\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta(\xi)}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\Big{|}\leq 3\ell_{\max}\sqrt{\tfrac{\log(2/\delta)}{n}}.

B.1.2 Approximation error

The approximation error for learning the retriever depends on the hardness of the function $\min_{z\in\mathscr{I}}g_{\xi}(X,z)$ . We recall that this term is approximated using softmax over $r_{\theta}(X,Z)$ (cf. (6)).

We want to approximate the term $\min_{z\in\mathscr{I}}g_{\xi}(x,z)$ for all $x\in\mathscr{X}$ , by $\sum_{z\in\mathscr{I}}p_{\theta,\mathscr{I}}(z|x)g_{\xi}(x,z)$ . We can break down the approximation into two parts. First we show that the function $\mathrm{softmax}(-\tau\times g_{\xi}(x,z))$ approximates $\min_{z}g_{\xi}(x,z)$ for large $\tau$ . In particular, if $\tau=O(\log(|\mathscr{I}|)/\delta)$ then softmax approximates minimum with error $\delta$ (see, McSherry and Talwar [2007], Epasto et al. [2020]). Second, we show that $p_{\theta,\mathscr{I}}\big{(}z|x\big{)}$ can approximate $\mathrm{softmax}(-\tau\times g_{\xi}(x,z))$ well in $L_{2}$ norm.
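The first part of this argument, that a softmax with a large temperature parameter $\tau$ approximates the minimum, can be observed numerically. The sketch below compares the softmax-weighted loss with the true minimum on a hypothetical loss vector and prints $\log(|\mathscr{I}|)/\tau$ alongside, which is the rate implied by the choice $\tau=O(\log(|\mathscr{I}|)/\delta)$ above. It is an illustration under these assumptions and does not reproduce the constants of the cited theorems.

```python
import numpy as np

def soft_min_gap(g, tau):
    """sum_z softmax(-tau * g)_z * g_z - min_z g_z for a loss vector g."""
    logits = -tau * g
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return float(p @ g - g.min())

g = np.random.default_rng(0).uniform(0.0, 1.0, size=100)  # hypothetical losses
for tau in [1.0, 10.0, 100.0, 1000.0]:
    print(tau, soft_min_gap(g, tau), np.log(len(g)) / tau)
```

The gap is non-negative and shrinks as $\tau$ grows, so a retriever that matches $-\tau g_{\xi}$ closely for large $\tau$ behaves almost like the optimal hard retriever.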

We define

\tilde{p}_{\xi}(z|x)=\frac{\exp(-\tau g_{\xi}(x,z))}{\sum_{z^{\prime}}\exp(-\tau g_{\xi}(x,z^{\prime}))}=\frac{\exp(-\tau(g_{\xi}(x,z)-b_{\xi}(x)))}{\sum_{z^{\prime}}\exp(-\tau(g_{\xi}(x,z^{\prime})-b_{\xi}(x)))}.

Here recall that $b_{\xi}(x)$ is the baseline function in Assumption 3.1. An example of such baseline is the loss under the optimal retrieved sample for each $x\in\mathscr{X}$ , i.e. $b_{\xi}(x)=\min_{\tilde{z}}g_{\xi}(x,\tilde{z})$ .

For any $\theta\in\Theta$ and $\tau>0$, we have

	$\displaystyle R_{\ell,\mathscr{I}}(\xi,\theta(\xi))-\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big{]}$
	$\displaystyle\qquad\overset{(i)}{\leq}R_{\ell,\mathscr{I}}(\xi,\theta)-\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big{]}$
	$\displaystyle\qquad\overset{(ii)}{=}\mathbb{E}_{X}\big{[}\sum_{z\in\mathscr{I}}(p_{\theta,\mathscr{I}}(z|X)-\tilde{p}_{\xi}(z|X))g_{\xi}(X,z)\big{]}+\mathbb{E}_{X}\big{[}\sum_{z\in\mathscr{I}}\tilde{p}_{\xi}(z|X)g_{\xi}(X,z)-\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big{]}$
	$\displaystyle\qquad\overset{(iii)}{\leq}\mathbb{E}_{X}\big{[}\|g_{\xi}(X,\cdot)\|_{\infty}\|p_{\theta,\mathscr{I}}(\cdot|X)-\tilde{p}_{\xi}(\cdot|X)\|_{1}\big{]}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\qquad\overset{(iv)}{\leq}\mathbb{E}_{X}\big{[}\|g_{\xi}(X,\cdot)\|_{\infty}\|r_{\theta}(X,\cdot)+\tau\mathrm{gap}_{\xi}(X,\cdot)\|_{\infty}\big{]}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\qquad\overset{(v)}{\leq}\ell_{\max}\|r_{\theta}+\tau\mathrm{gap}_{\xi}\|_{\infty}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}.$

In the first inequality $(i)$, we replace $\theta(\xi)$, which is the optimal retriever in $\Theta$ for predictor $\xi$, with an arbitrary retriever $\theta\in\Theta$. The equality $(ii)$ adds and subtracts the expected loss under $\tilde{p}_{\xi}$. The first term in the inequality $(iii)$ uses Hölder's inequality for the inner product, while the second term follows from Theorem 3.1 in [Epasto et al., 2020] (which originates from [McSherry and Talwar, 2007]). The inequality $(iv)$ uses the fact that softmax functions over $K$ classes satisfy $\|softmax(x)-softmax(y)\|_{1}\leq\|x-y\|_{\infty}$ (see henrikl [https://math.stackexchange.com/users/351007/henrikl]). In the final inequality $(v)$, we use $\ell_{\max}$ to bound the norm of $g_{\xi}$.
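The softmax stability property used in step $(iv)$ can be probed empirically. The following sketch samples random pairs of score vectors, including pairs that differ only by a small perturbation, and records the largest observed ratio of the $\ell_{1}$ distance between the softmax outputs to the $\ell_{\infty}$ distance between the inputs. The dimension, sampling scheme, and perturbation scales are arbitrary choices for illustration.

```python
import numpy as np

def softmax(v):
    w = np.exp(v - v.max())  # shift by the max for numerical stability
    return w / w.sum()

rng = np.random.default_rng(0)
worst_ratio = 0.0
for _ in range(2000):
    x = rng.normal(size=8) * 3.0
    # Perturb x at a randomly chosen scale to probe both near and far pairs.
    y = x + rng.normal(size=8) * rng.choice([0.01, 0.1, 1.0])
    l1_gap = np.abs(softmax(x) - softmax(y)).sum()
    linf_gap = np.abs(x - y).max()
    worst_ratio = max(worst_ratio, l1_gap / linf_gap)
print(worst_ratio)  # stays at or below 1 if the property holds on these draws
```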

As the above bound holds for any $\tau>0$ and any $\theta\in\Theta$, by optimizing over $\tau$ and $\theta$ we obtain

R_{\ell,\mathscr{I}}(\xi,\theta(\xi))-\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big{]}\leq\inf_{\theta\in\Theta}\inf_{\tau>0}\Big{(}\ell_{\max}\|r_{\theta}+\tau\mathrm{gap}_{\xi}\|_{\infty}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}\Big{)}.

(29)

Since the right-hand side of the inequality $(v)$ holds for any $\theta\in\Theta$, if there exists a $\theta\in\Theta$ such that the function $r_{\theta}(x,z)$ approximates the function $-\tau\mathrm{gap}_{\xi}(x,z)$ well, we end up with a small approximation error.

B.1.3 Instantiation of MLP retriever

We consider $\Theta$ to be the class of MLPs defined in Equation (27). Since MLPs with appropriate depth and width have universal approximation properties, this choice of $\Theta$ ensures that the function $r_{\theta}(x,z)$ can approximate the function $-\tau\mathrm{gap}_{\xi}(x,z)$ well. To bound the excess risk of learning the retriever, we need to prove generalization error and approximation error bounds for the MLP class.

Generalization error for MLP retriever:

To bound the generalization error, we need to first bound the covering number $\mathcal{N}(\Theta,\nu,\|\cdot\|_{2,[n],\xi})$ , for $\Theta={\rm MLP}(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R};W,L)$ . Here, $\mathscr{X}\subseteq\mathbb{R}^{d_{x}}$ and $\mathscr{I}\subseteq\mathbb{R}^{d_{z}}$ i.e., the retrieved space is embedded in $\mathbb{R}^{d_{z}}$ . We first want to bound the covering number $\mathcal{N}(\Theta,\nu,\|\cdot\|_{2,[n],\xi})$ with a covering number of ${\rm MLP}(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R};W,L)$ .

For a fixed data set $\mathcal{S}_{n}:=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$, predictor $\xi$, and two retrievers $\theta,\theta^{\prime}\in\Theta$, we have

	$\displaystyle\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\sum_{z\in\mathscr{I}}(p_{\theta}(z|x_{i})-p_{\theta^{\prime}}(z|x_{i}))\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\big{)}^{2}\Big{)}^{1/2}$
	$\displaystyle\qquad\overset{(i)}{\leq}\ell_{\max}\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\sum_{z\in\mathscr{I}}|p_{\theta}(z|x_{i})-p_{\theta^{\prime}}(z|x_{i})|\big{)}^{2}\Big{)}^{1/2}$
	$\displaystyle\qquad\overset{(ii)}{\leq}\ell_{\max}\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\max_{z\in\mathscr{I}}|r_{\theta}(x_{i},z)-r_{\theta^{\prime}}(x_{i},z)|\big{)}^{2}\Big{)}^{1/2}$
	$\displaystyle\qquad\overset{(iii)}{\leq}\ell_{\max}\sup_{x\in\mathcal{S}_{n},z\in\mathscr{I}}|r_{\theta}(x,z)-r_{\theta^{\prime}}(x,z)|.$

Above, inequality $(i)$ follows by upper bounding $\ell\big(h_{\xi}(x_{i},z),y_{i}\big)$ with $\ell_{\max}$. Inequality $(ii)$ uses the fact that the softmax function over $K$ classes satisfies $\|\mathrm{softmax}(x)-\mathrm{softmax}(y)\|_{1}\leq\|x-y\|_{\infty}$. Inequality $(iii)$ bounds the empirical average over $i\in[n]$ by the supremum over the data set.
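The softmax property used in step $(ii)$ can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names (`softmax`, `lipschitz_ratio`, `max_ratio`) and the sampling range are illustrative assumptions. It draws random score vectors and records the largest observed ratio $\|\mathrm{softmax}(u)-\mathrm{softmax}(v)\|_{1}/\|u-v\|_{\infty}$, which should not exceed one.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax of a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lipschitz_ratio(u, v):
    """Return ||softmax(u) - softmax(v)||_1 / ||u - v||_inf."""
    l1 = sum(abs(a - b) for a, b in zip(softmax(u), softmax(v)))
    linf = max(abs(a - b) for a, b in zip(u, v))
    return l1 / linf


def max_ratio(num_trials=2000, num_classes=5, seed=0):
    """Largest ratio observed over random pairs of score vectors."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_trials):
        u = [rng.uniform(-3.0, 3.0) for _ in range(num_classes)]
        v = [rng.uniform(-3.0, 3.0) for _ in range(num_classes)]
        if max(abs(a - b) for a, b in zip(u, v)) < 1e-6:
            continue  # skip near-identical pairs to avoid dividing by ~0
        worst = max(worst, lipschitz_ratio(u, v))
    return worst
```

In the proof the role of the $K$ classes is played by the elements of the data-store $\mathscr{I}$, and the scores are $r_{\theta}(x_{i},\cdot)$ and $r_{\theta^{\prime}}(x_{i},\cdot)$.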

Let us define the norm $\|\cdot\|_{\infty,n|\mathscr{I}|}$ as $\|u\|_{\infty,n|\mathscr{I}|}\triangleq\sup_{x_{i}\in\mathcal{S}_{n}}\sup_{z% \in\mathscr{I}}|u_{i,z}|,~{}\forall\mathbf{u}\in\mathbb{R}^{n\times|\mathscr{I% }|}.$ Now consider a $\|\cdot\|_{\infty,n|\mathscr{I}|}$ norm cover of $\Theta$ , $\Theta_{{\rm cov}}$ with cardinality $\mathcal{N}(\Theta,\nu/\ell_{\max},\|\cdot\|_{\infty,n|\mathscr{I}|})$ .

Note that, by definition, for any $\theta\in\Theta$, there exists a $\theta_{{\rm cov}}(\theta)\in\Theta_{{\rm cov}}$ such that $\sup_{x\in\mathcal{S}_{n},z\in\mathscr{I}}|r_{\theta}(x,z)-r_{\theta_{{\rm cov}}(\theta)}(x,z)|\leq\nu/\ell_{\max}$. By the chain of inequalities above, this means that $\Theta_{{\rm cov}}$ forms a $\nu$-cover in the $\|\cdot\|_{2,[n],\xi}$ norm. In other words, we have $\mathcal{N}(\Theta,\nu,\|\cdot\|_{2,[n],\xi})\leq\mathcal{N}(\Theta,\nu/\ell_{\max},\|\cdot\|_{\infty,n|\mathscr{I}|}).$

Most existing covering number bounds for MLPs assume norm bounds on the weights and biases. We do not impose such norm bounds. Therefore, we use the pseudo-dimension of the class $\Theta$ from Bartlett et al. [2019] to bound the covering number $\mathcal{N}(\Theta,\nu,\|\cdot\|_{\infty,n|\mathscr{I}|})$ via Zhang [2023]. In particular, if the pseudo-dimension of $\Theta$ is $d_{VC}$, then we have $\log\mathcal{N}(\Theta,\nu,\|\cdot\|_{\infty,n|\mathscr{I}|})\leq 1+\log(1+d_{VC})+d_{VC}\log(\max\{2,en|\mathscr{I}|/(d_{VC}\nu)\})$ by [Zhang, 2023, Theorem 5.11]. From [Bartlett et al., 2019, Theorem 6] we know that for the class ${\rm MLP}(\mathbb{R}^{d},\mathbb{R};W,L)$ the pseudo-dimension is $O(LN\log(M))$, where $N=O(LW^{2})$ is the number of parameters and $M=O(LW)$ is the number of computation units. By setting $\varepsilon=c/\sqrt{n}$ for a constant $c$ and $\delta=1/n$ in Equation (28), for large enough $L$ (we will set $L$ as a function of the data size $n$) we obtain the final generalization error as

\Big{|}\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\hat{\theta}(\xi)}(\cdot|X)}g% _{\xi}(X,Z)\big{]}-\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\hat{% \theta}(\xi)}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\Big{|}=O\big{(}% \frac{\ell_{\max}LW\sqrt{\log(LW)\log(n|\mathscr{I}|)}}{\sqrt{n}}\big{)}.

(30)
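To get a feel for the covering number bound that feeds into Equation (30), the sketch below evaluates the right-hand side of the bound quoted from [Zhang, 2023, Theorem 5.11]. The function name `log_covering_bound` and its argument names are hypothetical, and the code assumes that the ratio inside the maximum is read as $en|\mathscr{I}|/(d_{VC}\nu)$.

```python
import math


def log_covering_bound(pseudo_dim, n, store_size, nu):
    """Evaluate 1 + log(1 + d) + d * log(max(2, e * n * |I| / (d * nu))).

    pseudo_dim: pseudo-dimension d_VC of the retriever class (assumed given)
    n:          number of training examples
    store_size: data-store size |I|
    nu:         covering radius
    """
    ratio = math.e * n * store_size / (pseudo_dim * nu)
    return 1.0 + math.log(1.0 + pseudo_dim) + pseudo_dim * math.log(max(2.0, ratio))
```

The bound grows only logarithmically in $n|\mathscr{I}|/\nu$ and linearly in the pseudo-dimension, which is why the $\log(n|\mathscr{I}|)$ factor appears under the square root in the generalization error.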

Approximation error for MLP retriever:

Our excess risk bounds closely follow the work of Siegel [2023], which generalizes Yarotsky [2017]. (Footnote 2: Siegel [2023] works with $\Omega=[0,1]^{d}$ and, as mentioned therein, the analysis can be extended to bounded domains, e.g. $[a,b]^{d}$, which includes our setting. Furthermore, one can extend the analysis to non-integer Sobolev and Besov spaces following Siegel [2023].) Under Assumption B.1, by specializing [Siegel, 2023, Theorem 1] with $p=q=\infty$ we get that

\inf_{f\in{\rm MLP}(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R};W,L)}\|f-\mathrm{gap}_% {\xi}\|_{L_{\infty}(\Omega)}\leq C\|\mathrm{gap}_{\xi}\|_{W^{\kappa}(L_{\infty% }(\Omega))}L^{-2\kappa/(d_{x}+d_{z})}

for $\Omega=[-1,1]^{d_{x}+d_{z}}$, $W=25(d_{x}+d_{z})+31$, and $C=c(\kappa,d_{x}+d_{z})<\infty$ (independent of $L$). Note that $\kappa$ is the number of derivatives of the Sobolev space under consideration in Assumption B.1.

Therefore, under Assumption B.1 for $\Theta={\rm MLP}(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R};25(d_{x}+d_{z})+31,L)$ we show that

R_{\ell,\mathscr{I}}(\xi,\theta(\xi))-\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big]\leq C^{\prime}\ell_{\max}\tau L^{-2\kappa/(d_{x}+d_{z})}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}.

(31)

This follows from the following series of inequalities:

	$\displaystyle R_{\ell,\mathscr{I}}(\xi,\theta(\xi))-\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big]$
	$\displaystyle\qquad\overset{(i)}{\leq}\mathbb{E}_{X}\big[\|g_{\xi}(x,\cdot)\|_{\infty}\big]\mathbb{E}_{X}\big[\|r_{\theta}(x,\cdot)+\tau\mathrm{gap}_{\xi}(x,\cdot)\|_{\infty}\big]+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\qquad\overset{(ii)}{=}\tau\mathbb{E}_{X}\big[\|g_{\xi}(x,\cdot)\|_{\infty}\big]\|\tilde{r}_{\theta}-\mathrm{gap}_{\xi}\|_{L_{\infty}(\Omega)}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\qquad\overset{(iii)}{\leq}C\tau\mathbb{E}_{X}\big[\|g_{\xi}(x,\cdot)\|_{\infty}\big]\|\mathrm{gap}_{\xi}\|_{W^{\kappa}(L_{\infty}(\Omega))}L^{-2\kappa/(d_{x}+d_{z})}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\qquad\overset{(iv)}{\leq}C^{\prime}\ell_{\max}\tau L^{-2\kappa/(d_{x}+d_{z})}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$

The first inequality $(i)$ follows from Equation (29). The equality $(ii)$ substitutes $\tilde{r}_{\theta}=-r_{\theta}/\tau$, so that $r_{\theta}+\tau\mathrm{gap}_{\xi}=-\tau(\tilde{r}_{\theta}-\mathrm{gap}_{\xi})$. Inequality $(iii)$ follows by optimizing $\tilde{r}_{\theta}$ over the class $\Theta$ (since no norm bounds are imposed on the weights, $-r_{\theta}/\tau$ also lies in $\Theta$) and applying Theorem 1 in Siegel [2023]. The final inequality $(iv)$ sets $C^{\prime}=C\|\mathrm{gap}_{\xi}\|_{W^{\kappa}(L_{\infty}(\Omega))}$ and bounds $\mathbb{E}_{X}\big[\|g_{\xi}(x,\cdot)\|_{\infty}\big]\leq\ell_{\max}$.

Note that $\tau$ does not enter the algorithm, so we are free to optimize over it. Balancing the two terms in $(iv)$ requires $\tau^{3}\propto L^{2\kappa/(d_{x}+d_{z})}\log(|\mathscr{I}|)$. In particular, we choose $\tau=cL^{\frac{2\kappa}{3(d_{x}+d_{z})}}\log^{1/3}(|\mathscr{I}|)$ to obtain the approximation error bound $O\big(\ell_{\max}L^{-\frac{4\kappa}{3(d_{x}+d_{z})}}\log^{1/3}(|\mathscr{I}|)\big)$, where we treat the remaining terms that are independent of $\tau$ and $L$ as constants.
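The optimization over $\tau$ is an instance of minimizing a function of the form $a\tau+b/\tau^{2}$ over $\tau>0$. The sketch below, with hypothetical names `balanced_tau` and `objective`, assumes that $a$ stands for the coefficient $C^{\prime}\ell_{\max}L^{-2\kappa/(d_{x}+d_{z})}$ of $\tau$ in step $(iv)$ and $b$ for $\log(|\mathscr{I}|)$. Setting the derivative $a-2b/\tau^{3}$ to zero gives the minimizer.

```python
def objective(a, b, tau):
    """Upper bound of the form a * tau + b / tau**2."""
    return a * tau + b / tau ** 2


def balanced_tau(a, b):
    """Minimizer of objective(a, b, tau) over tau > 0.

    Obtained from the first-order condition a - 2 * b / tau**3 = 0.
    """
    return (2.0 * b / a) ** (1.0 / 3.0)
```

At the minimizer the bound equals $a^{2/3}b^{1/3}(2^{1/3}+2^{-2/3})$, so it scales as $a^{2/3}b^{1/3}$. Under the stated reading of $a$ and $b$ this matches the $L^{-4\kappa/(3(d_{x}+d_{z}))}\log^{1/3}(|\mathscr{I}|)$ dependence of the approximation error.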

Excess risk for MLP retriever learning:

Adding the approximation error (31), and the generalization error (30) we bound the excess risk as

	Excess Risk	$\displaystyle\leq\underbrace{\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big]-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})}_{\text{error from predictor $\xi$}}+\underbrace{O\big(\ell_{\max}L^{-\tfrac{4\kappa}{3(d_{x}+d_{z})}}\log^{1/3}(|\mathscr{I}|)\big)}_{\text{retriever approximation error}}$
		$\displaystyle\qquad\qquad\quad+\underbrace{O\big(\frac{\ell_{\max}LW\sqrt{\log(LW)\log(n|\mathscr{I}|)}}{\sqrt{n}}\big)}_{\text{retriever generalization error}}$		(32)

By choosing $L=n^{\tfrac{3(d_{x}+d_{z})}{6(d_{x}+d_{z})+8\kappa}}$ and using the data-store size $|\mathscr{I}|=\mathrm{poly}(n)$ and width $W=O(d_{x}+d_{z})$, we obtain

\displaystyle\text{ Excess Risk}\leq\underbrace{\mathbb{E}_{X}\big{[}\min_{z% \in\mathscr{I}}g_{\xi}(X,z)\big{]}-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{% I}}^{\ell})}_{\text{error from predictor $\xi$}}+\underbrace{\tilde{O}(\ell_{% \max}n^{-\tfrac{2\kappa}{3(d_{x}+d_{z})+4\kappa}})}_{\text{retriever combined % error}}.

(33)
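The choice of depth leading to Equation (33) can be verified with exact rational arithmetic. The sketch below is illustrative only: the function names are hypothetical, `d` stands for $d_{x}+d_{z}$, and logarithmic factors and the constant width $W$ are ignored. Writing $L=n^{a}$, the approximation term scales as $n^{-4\kappa a/(3d)}$ and the generalization term as $n^{a-1/2}$; the two exponents coincide at the stated choice of $a$.

```python
from fractions import Fraction


def retriever_depth_exponent(d, kappa):
    """Exponent a in L = n**a, as chosen before Equation (33)."""
    d, kappa = Fraction(d), Fraction(kappa)
    return 3 * d / (6 * d + 8 * kappa)


def retriever_rate_exponent(d, kappa):
    """Exponent b such that the combined retriever error is n**(-b)."""
    d, kappa = Fraction(d), Fraction(kappa)
    return 2 * kappa / (3 * d + 4 * kappa)


def approximation_exponent(d, kappa):
    """Decay exponent of L**(-4*kappa/(3*d)) when L = n**a."""
    a = retriever_depth_exponent(d, kappa)
    return a * Fraction(4 * kappa, 3 * d)


def generalization_exponent(d, kappa):
    """Decay exponent of L / sqrt(n) when L = n**a."""
    return Fraction(1, 2) - retriever_depth_exponent(d, kappa)
```

For example, with $d=4$ and $\kappa=2$ the depth exponent is $3/10$ and both error terms decay as $n^{-1/5}$.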

B.2 Learning the predictor

We now quantify the excess risk of a predictor $\xi$ for a fixed retriever $\theta$ . For a fixed retriever $\theta$ , the learning task of the predictor is to minimize

\displaystyle\mathbb{E}_{(X,Y)\sim\mathscr{D}_{XY}}[\mathbb{E}_{Z\sim p_{% \theta}(\cdot|X)}\ell(h_{\xi}(X,Z),Y)]=\mathbb{E}_{((X,Z),Y)\sim\mathscr{D}_{% XY}\times p_{\theta}(\cdot|X)}\big{[}\ell(h_{\xi}(X,Z),Y)|X\big{]}

The predictor now learns from the joint distribution $\mathscr{D}_{XY}\times p_{\theta}(\cdot|X)$ . We assume that the hardness of the classification task performed by the predictor varies with the selected retriever $\theta$ .

Similar to retriever learning in Section B.1, for a fixed retriever $\theta$ , the predictor that minimizes the empirical risk $\hat{\xi}(\theta)$ , and the predictor that minimizes the population risk $\xi^{\ast}(\theta)$ over the class $\Xi$ are defined as

	$\displaystyle\hat{\xi}(\theta)=\operatorname*{arg\,min}_{\xi\in\Xi}\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big(h_{\xi}(x_{i},z),y_{i}\big),$
	$\displaystyle\xi^{\ast}(\theta)=\operatorname*{arg\,min}_{\xi\in\Xi}\mathbb{E}_{X}\big[\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi}(X,Z)\big],$

where $g_{\xi}(X,Z)=\mathbb{E}_{Y|X}\ell(h_{\xi}(X,Z),Y)$. We also define the predictor over the class $\Xi$ that minimizes the population risk under ‘optimal’ retrieval (possibly outside of $\Theta$) as $\xi^{\ast}=\operatorname*{arg\,min}_{\xi\in\Xi}\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big].$

Usefulness of data-store:

We start with characterization of the prediction task in the presence of the data-store $\mathscr{I}$ . We consider that there exists a score function $h_{*}:\mathscr{X}\times\mathscr{Z}\to\mathbb{R}^{|\mathscr{Y}|}$ and corresponding probability distribution

p_{*}^{y}(x,z)=\frac{\exp(h_{*}^{y}(x,z))}{\sum_{y^{\prime}}\exp(h_{*}^{y^{% \prime}}(x,z))}

(34)

that approximates $p_{\mathsf{D}_{XY}}^{y}(x)\triangleq\mathbb{P}_{Y\sim\mathsf{D}_{XY}}(y|X=x)$ well for all $x\in\mathscr{X}$ and $y\in\mathscr{Y}$. Furthermore, this score function $h_{*}$ lies coordinate-wise in the Sobolev space (see Definition A.4). Assumption 3.2 captures the above; we restate it here for convenience.

Assumption B.2 (Retrieval quality).

There exists a score function $h_{*}:\mathscr{X}\times\mathscr{Z}\to\mathbb{R}^{|\mathscr{Y}|}$ such that

1. for each $y\in\mathscr{Y}$, the function $h_{*}^{y}$ (the $y$-th coordinate of $h_{*}$) lies in the Sobolev space with $\kappa_{\mathscr{I}}$ derivatives and finite $L_{\infty}([-1,1]^{d_{x}+d_{z}})$ norm, and

2. for any $x\in\mathscr{X}$ there exists a retrieved example $z^{*}(x)\in\mathscr{I}$ such that, for $p_{*}^{y}(x,z)$ as defined in Equation (34),

\max_{y\in\mathscr{Y}}\sup_{x\in\mathscr{X}}|p_{*}^{y}(x,z^{*}(x))-p_{\mathsf{D}_{XY}}^{y}(x)|\leq c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}.

Note that the tuple $(\gamma_{\mathscr{I}},d_{z},\kappa_{\mathscr{I}})$ defines the usefulness of the data-store $\mathscr{I}$ . In particular, the higher the $\gamma_{\mathscr{I}}$ the closer the approximation, and the higher the $\kappa_{\mathscr{I}}$ and the smaller the embedding dimension $d_{z}$ the ‘easier’ the score function used for this approximation.

Excess risk decomposition

The excess risk decomposition for the learned predictor $\hat{\xi}(\theta)$ takes the following form.

	$\displaystyle R_{\ell,\mathscr{I}}(\hat{\xi}(\theta),\theta)-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})$
	$\displaystyle\qquad\overset{(i)}{\leq}\sum_{\xi\in\{\xi^{\ast}(\theta),\hat{\xi}(\theta)\}}\big|\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big(h_{\xi}(x_{i},z),y_{i}\big)-\mathbb{E}_{X}\big[\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi}(X,Z)\big]\big|$
	$\displaystyle\qquad\quad+R_{\ell,\mathscr{I}}(\xi^{\ast}(\theta),\theta)-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})$
	$\displaystyle\qquad\overset{(ii)}{\leq}\sum_{\xi\in\{\xi^{\ast}(\theta),\hat{\xi}(\theta)\}}\big|\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big(h_{\xi}(x_{i},z),y_{i}\big)-\mathbb{E}_{X}\big[\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi}(X,Z)\big]\big|$
	$\displaystyle\qquad\quad+\underbrace{R_{\ell,\mathscr{I}}(\xi^{\ast}(\theta),\theta)-R_{\ell,\mathscr{I}}(\xi^{\ast},\theta)}_{\leq 0}+R_{\ell,\mathscr{I}}(\xi^{\ast},\theta)-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})$
	$\displaystyle\qquad\overset{(iii)}{\leq}\underbrace{\sum_{\xi\in\{\xi^{\ast}(\theta),\hat{\xi}(\theta)\}}\big|\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big(h_{\xi}(x_{i},z),y_{i}\big)-\mathbb{E}_{X}\big[\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi}(X,Z)\big]\big|}_{\text{generalization error}}$
	$\displaystyle\qquad\quad+\underbrace{R_{\ell,\mathscr{I}}(\xi^{\ast},\theta)-\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{\ast}}(X,z)\big]}_{\text{retriever error}}+\underbrace{\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{\ast}}(X,z)\big]-\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{f_{{\rm opt},\mathscr{I}}^{\ell}}(X,z)\big]}_{\text{predictor error}}$		(35)

Note that in the inequality $(ii)$ , the predictor $\xi^{\ast}(\theta)$ which is optimised for the fixed retriever $\theta$ has lower risk compared to the predictor $\xi^{\ast}$ , i.e. $R_{\ell,\mathscr{I}}(\xi^{\ast}(\theta),\theta)\leq R_{\ell,\mathscr{I}}(\xi^{% \ast},\theta)$ .

B.2.1 Approximation error

We specialize our analysis for the log-loss bounded by $\ell_{\max}>0$ given as

\ell(h_{\xi}(x,z),y)=\min(\ell_{\max},-\log(p_{\xi}(y|x,z)))=\min(\ell_{\max},% \log(\sum_{y^{\prime}\in\mathscr{Y}}\exp(h^{y^{\prime}}_{\xi}(x,z)))-h^{y}_{% \xi}(x,z)).

(36)
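As a concrete reading of Equation (36), the following minimal sketch computes the bounded log-loss from a vector of scores. The function name `bounded_log_loss` and the list-based representation of $h_{\xi}(x,z)$ are illustrative assumptions; the log-sum-exp is shifted by the maximum score for numerical stability, which does not change its value.

```python
import math


def bounded_log_loss(scores, label, loss_max):
    """min(loss_max, log(sum_k exp(scores[k])) - scores[label]).

    scores:   list of floats, one score per class (the vector h_xi(x, z))
    label:    integer index of the true class y
    loss_max: clipping level ell_max > 0
    """
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return min(loss_max, log_sum_exp - scores[label])
```

With two equal scores the unclipped loss is $\log 2$, while a strongly mispredicted label is clipped at $\ell_{\max}$.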

Note that we need to bound the predictor error $(\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{\ast}}(X,z)\big]-\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{f_{{\rm opt},\mathscr{I}}^{\ell}}(X,z)\big])$ for the bounded log-loss. We want to relate this term to $p_{*}^{y}(x,z)$ (cf. Equation (34)), whose complexity we can control well. We first need a lower bound for $\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{f_{{\rm opt},\mathscr{I}}^{\ell}}(X,z)\big]$ in terms of $p_{*}^{y}(x,z)$. We proceed as follows:

	$\displaystyle\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{f_{{\rm opt},\mathscr{I}}^{\ell}}(X,z)\big]$
	$\displaystyle\qquad\overset{(i)}{\geq}\mathbb{E}_{X}\big[\sum_{y\in\mathscr{Y}}p_{\mathsf{D}_{XY}}^{y}(X)\min(\ell_{\max},-\ln(p_{\mathsf{D}_{XY}}^{y}(X)))\big]-(|\mathscr{Y}|-1)\exp(-\ell_{\max})$
	$\displaystyle\qquad\overset{(ii)}{\geq}\mathbb{E}_{X}\big[\sum_{y\in\mathscr{Y}}p_{\mathsf{D}_{XY}}^{y}(X)\min(\ell_{\max},-\ln(p_{*}^{y}(X,z^{*}(X))))\big]-(|\mathscr{Y}|-1)\exp(-\ell_{\max})$
	$\displaystyle\quad\quad-\exp(\ell_{\max})\,\mathbb{E}_{X}\big[\max_{y\in\mathscr{Y}}|p_{*}^{y}(X,z^{*}(X))-p_{\mathsf{D}_{XY}}^{y}(X)|\big]$
	$\displaystyle\qquad\overset{(iii)}{\geq}\mathbb{E}_{X}\big[\sum_{y\in\mathscr{Y}}p_{\mathsf{D}_{XY}}^{y}(X)\min(\ell_{\max},-\ln(p_{*}^{y}(X,z^{*}(X))))\big]-(|\mathscr{Y}|-1)\exp(-\ell_{\max})-c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})$
	$\displaystyle\qquad\overset{(iv)}{=}\mathbb{E}_{X}\big[g_{h_{*}}(X,z^{*}(X))\big]-(|\mathscr{Y}|-1)\exp(-\ell_{\max})-c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})$		(37)

In the first inequality, applying Proposition A.10 to our setting with $C=\ell_{\max}$ and $K=|\mathscr{Y}|$ gives the lower bound. The second inequality follows from the mean value theorem, as below:

\displaystyle|\min(C,-\log(x))-\min(C,-\log(y))|\leq\sup_{x}\big{\lvert}\frac{% \partial}{\partial x}\min(C,-\log(x))\big{\rvert}\times|x-y|\leq\exp(C)|x-y|

Next, inequality $(iii)$ is obtained from Assumption B.2, with $z^{*}(x)$ as defined therein. The final equality $(iv)$ substitutes $g_{h_{*}}(x,z^{*}(x))=\mathbb{E}_{Y|X=x}[\ell(h_{*}(x,z^{*}(x)),Y)]$, where $h_{*}(x,z)$ is the score function used in Equation (34).
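The Lipschitz property behind the second inequality can be sanity-checked numerically. The sketch below is an illustration under assumed names (`clipped_neg_log`, `max_slope`): it samples pairs of probabilities in $(0,1]$ and records the largest secant slope of $p\mapsto\min(C,-\log p)$, which the mean value argument bounds by $\exp(C)$.

```python
import math
import random


def clipped_neg_log(p, c):
    """min(c, -log(p)) for a probability p in (0, 1]."""
    return min(c, -math.log(p))


def max_slope(c, num_trials=5000, seed=0):
    """Largest |f(p) - f(q)| / |p - q| over random pairs, f = clipped_neg_log."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_trials):
        p = rng.uniform(1e-6, 1.0)
        q = rng.uniform(1e-6, 1.0)
        if abs(p - q) < 1e-6:
            continue  # skip near-identical pairs to keep the ratio well conditioned
        slope = abs(clipped_neg_log(p, c) - clipped_neg_log(q, c)) / abs(p - q)
        worst = max(worst, slope)
    return worst
```

The function is constant for $p\leq\exp(-C)$ and has derivative $-1/p$ of magnitude at most $\exp(C)$ for $p\geq\exp(-C)$, which is where the factor $\exp(\ell_{\max})$ in Equation (37) comes from.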

We now derive an upper bound for the predictor error part of our excess risk bound in Equation (35). Let $\xi\in\Xi$ be an arbitrary predictor

	Predictor Error	$\displaystyle\triangleq\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{\ast}}(X,z)\big]-\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{f_{{\rm opt},\mathscr{I}}^{\ell}}(X,z)\big]$
		$\displaystyle\overset{(i)}{\leq}\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{\ast}}(X,z)\big]-\mathbb{E}_{X}\big[g_{h_{*}}(X,z^{*}(X))\big]+(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})$
		$\displaystyle\overset{(ii)}{\leq}\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big]-\mathbb{E}_{X}\big[g_{h_{*}}(X,z^{*}(X))\big]+(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})$
		$\displaystyle\overset{(iii)}{\leq}\mathbb{E}_{X}\big[g_{\xi}(X,z^{*}(X))\big]-\mathbb{E}_{X}\big[g_{h_{*}}(X,z^{*}(X))\big]+(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})$

The first inequality $(i)$ follows by substituting the lower bound of $\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{f_{{\rm opt},\mathscr{I}}^{\ell}}(X,z)\big]$ from Equation (37). The second inequality $(ii)$ holds because $\xi^{\ast}$ optimizes the $\ell$-risk over $\Xi$, so replacing it with the arbitrary predictor $\xi$ gives an upper bound. The final inequality $(iii)$ is obtained by substituting $z^{*}(X)$ instead of minimizing with respect to $z\in\mathscr{I}$. Note that the final inequality holds for all $\xi\in\Xi$, as the initial choice of $\xi$ was arbitrary.

Bounding the term $\mathbb{E}_{X}\big[g_{\xi}(X,z^{*}(X))\big]-\mathbb{E}_{X}\big[g_{h_{*}}(X,z^{*}(X))\big]$ is similar to bounding the $\ell$-risk for classification with the data distribution $\mathbb{P}(X=x,Z=z,Y=y)=\mathbb{P}_{\mathsf{D}_{XY}}(X=x,Y=y)\mathbbm{1}(z=z^{*}(x))$. Our strategy is to bound the $\ell$-risk by the $L_{\infty}$ distance between the score function $h_{\xi}^{y}(x,z)$ and the score function $h_{*}^{y}(x,z)$, which lies in the Sobolev space as given in Assumption B.2. In particular, we have the following $L_{\infty}$ norm bound.

	$\displaystyle\mathbb{E}_{X}\big[g_{\xi}(X,z^{*}(X))\big]-\mathbb{E}_{X}\big[g_{h_{*}}(X,z^{*}(X))\big]$
	$\displaystyle\qquad\overset{(i)}{=}\mathbb{E}_{XY}\big[\ell(h_{\xi}^{Y}(X,z^{*}(X)))-\ell(h_{*}^{Y}(X,z^{*}(X)))\big]$
	$\displaystyle\qquad\overset{(ii)}{\leq}\mathbb{E}_{XY}\big[|h_{\xi}^{Y}(X,z^{*}(X))-h_{*}^{Y}(X,z^{*}(X))|+\max_{y\in\mathscr{Y}}|h_{\xi}^{y}(X,z^{*}(X))-h_{*}^{y}(X,z^{*}(X))|\big]$
	$\displaystyle\qquad\overset{(iii)}{\leq}2\times\mathbb{E}_{X}\big[\max_{y\in\mathscr{Y}}|h_{\xi}^{y}(X,z^{*}(X))-h_{*}^{y}(X,z^{*}(X))|\big]$

The inequality $(ii)$ follows by substituting the bounded log-loss, noting that the clipping $\min(\ell_{\max},\cdot)$ is 1-Lipschitz, and using the fact that for any two $s,s^{\prime}\in\mathbb{R}^{K}$, $|\log(\sum_{k}\exp(s_{k}))-\log(\sum_{k}\exp(s^{\prime}_{k}))|\leq\max_{k}|s_{k}-s^{\prime}_{k}|$. The final inequality $(iii)$ follows by bounding the first term by the second.

We note that the above holds for all $\xi$ . This gives the general approximation error bound as

\text{Predictor Error}\leq\inf_{\xi\in\Xi}2\mathbb{E}_{X}\big{[}\max_{y\in% \mathscr{Y}}|h_{\xi}^{y}(X,z^{*}(X))-h_{*}^{y}(X,z^{*}(X))|\big{]}+(|\mathscr{% Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}% \exp(\ell_{\max}).

(38)
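The pointwise inequality behind step $(ii)$ in the derivation preceding Equation (38) can be checked on random score vectors. This is a minimal sketch with hypothetical names (`bounded_log_loss`, `lipschitz_gap`, `min_gap`), assuming the loss of Equation (36); it verifies that the loss difference never exceeds the label-coordinate gap plus the largest coordinate gap.

```python
import math
import random


def bounded_log_loss(scores, label, loss_max):
    """Bounded log-loss of Equation (36) for a list of class scores."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return min(loss_max, log_sum_exp - scores[label])


def lipschitz_gap(h, h_prime, label, loss_max):
    """Right-hand side minus left-hand side of the pointwise bound; should be >= 0."""
    lhs = abs(bounded_log_loss(h, label, loss_max)
              - bounded_log_loss(h_prime, label, loss_max))
    rhs = abs(h[label] - h_prime[label]) + max(abs(a - b) for a, b in zip(h, h_prime))
    return rhs - lhs


def min_gap(num_trials=2000, num_classes=4, loss_max=3.0, seed=0):
    """Smallest gap observed over random pairs of score vectors and labels."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(num_trials):
        h = [rng.uniform(-4.0, 4.0) for _ in range(num_classes)]
        h_prime = [rng.uniform(-4.0, 4.0) for _ in range(num_classes)]
        label = rng.randrange(num_classes)
        worst = min(worst, lipschitz_gap(h, h_prime, label, loss_max))
    return worst
```

Since both terms on the right are at most $\max_{y}|h^{y}-h^{\prime y}|$, the same check also supports the factor of two in Equation (38).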

Note that the predictor approximation error is independent of retriever learning, as it is measured with respect to the Bayes optimal retriever (i.e. $\min_{z\in\mathscr{I}}g_{\xi}(x,z)$), as seen in Equation (35).

B.2.2 Generalization error

The generalization error in Equation (35) can be bounded in a similar manner as the retriever learning in Section B.1. Note that the predictor is learnt over the space $\Xi$ while the retriever is fixed in this setup.

	$\displaystyle\big|\mathbb{E}_{X}\big[\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\hat{\xi}(\theta)}(X,Z)\big]-\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big(h_{\hat{\xi}(\theta)}(x_{i},z),y_{i}\big)\big|$
	$\displaystyle\qquad\overset{(i)}{\leq}2\mathbb{E}_{\bm{\sigma}}\Big[\max_{\xi\in\Xi}\frac{1}{n}\sum_{i\in[n]}\sigma_{i}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big(h_{\xi}(x_{i},z),y_{i}\big)\Big]+3\ell_{\max}\sqrt{\tfrac{\log(2/\delta)}{n}}$
	$\displaystyle\qquad\overset{(ii)}{\leq}2\times\inf_{\varepsilon\in[0,c_{\theta}/2]}\big(4\varepsilon+\tfrac{12}{\sqrt{n}}\int_{\varepsilon}^{c_{\theta}/2}\sqrt{\log(\mathcal{N}(\Xi,\nu,\|\cdot\|_{2,[n],\theta}))}d\nu\big)+3\ell_{\max}\sqrt{\tfrac{\log(2/\delta)}{n}}$

The final inequality again follows using covering number based bounds with chaining (cf. Shalev-Shwartz and Ben-David [2014]). We have used for a fixed retriever $\theta$

c_{\theta}=\sup_{\xi\in\Xi}\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\sum_{z\in% \mathscr{I}}p_{\theta}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\big{)}% ^{2}\Big{)}^{1/2},

and $\mathcal{N}(\Xi,\nu,\|\cdot\|_{2,[n],\theta})$ denotes the covering number of the predictor function class $\Xi$ with error $\nu$ in the $L_{2}$ norm w.r.t. the set $\{(x_{i},y_{i}):i\in[n]\}$ and fixed $\theta$,

\|\mathbf{u}\|_{2,[n],\theta}:=\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\sum_{z% \in\mathscr{I}}p_{\theta}(z|x_{i})u_{i,z}\big{)}^{2}\Big{)}^{1/2},\,\forall% \mathbf{u}\in\mathbb{R}^{n\times|\mathscr{I}|}.
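The retriever-weighted norm defined above can be computed directly from its definition. The sketch below is illustrative: the names `weighted_l2_norm` and `sup_norm` and the nested-list layout (`probs[i][z]` for $p_{\theta}(z|x_{i})$, `values[i][z]` for $u_{i,z}$) are assumptions made for this example.

```python
import math


def weighted_l2_norm(probs, values):
    """Compute ( (1/n) * sum_i ( sum_z p(z|x_i) * u_{i,z} )**2 )**0.5."""
    n = len(probs)
    total = 0.0
    for p_row, u_row in zip(probs, values):
        inner = sum(p * u for p, u in zip(p_row, u_row))
        total += inner ** 2
    return math.sqrt(total / n)


def sup_norm(values):
    """Largest absolute entry of u over all examples and data-store elements."""
    return max(abs(u) for row in values for u in row)
```

Because each row of `probs` is a probability distribution, the weighted norm never exceeds the sup norm, which is the mechanism used to pass from covers in the weighted $L_{2}$ norm to covers in the sup norm.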

As $\xi^{\ast}(\theta)$ is fixed for a fixed $\theta$ , we can directly bound without any union over the learner/predictor space,

|\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi^{\ast}(% \theta)}(X,Z)\big{]}-\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}% (z|x_{i})\ell\big{(}h_{\xi^{\ast}(\theta)}(x_{i},z),y_{i}\big{)}|\leq 3\ell_{% \max}\sqrt{\tfrac{\log(2/\delta)}{n}}.

B.2.3 Instantiation of MLP predictor

As a concrete example, we now consider the space $\Xi={\rm MLP}(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R}^{|\mathscr{Y}|};W,L)$ as defined in Equation (27).

Approximation error of MLP predictor:

Our approximation results rely mainly on the results in Siegel [2023]. The key difference here is the output is now $|\mathscr{Y}|$ dimensional. We find an MLP of depth $L$ and width at most $W^{\prime}=O(d_{x}+d_{z})$ to individually approximate the functions $h_{*}^{y}(x,z)$ for each $y\in\mathscr{Y}$ . Later we can join these networks in parallel to obtain a final network with depth $L$ and width at most $O((d_{x}+d_{z})|\mathscr{Y}|)$ . In principle, these networks may share sub-networks (e.g. the bit extraction networks, the sub-domain indexation network for $p=q$ in Siegel [2023]) used for constructing the approximation. However, this is out of scope for this work, and we leave this open.

From Theorem 1 in Siegel [2023], by taking $p=q=\infty$ in the theorem statement, under Assumption B.2 we get that for each $y\in\mathscr{Y}$ there exists an MLP $f_{y}\in{\rm MLP}(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R};W,L)$ such that

\|f_{y}-h_{*}^{y}\|_{L_{\infty}(\Omega)}\leq C_{y}\|h_{*}^{y}\|_{W^{\kappa}(L_% {\infty}(\Omega))}L^{-2\kappa_{\mathscr{I}}/(d_{x}+d_{z})}

for $\Omega=[-1,1]^{d_{x}+d_{z}}$, $W=25(d_{x}+d_{z})+31$, and $C_{y}=c(\kappa_{\mathscr{I}},d_{x}+d_{z})<\infty$ (independent of $L$). By concatenating the networks $f_{y}$ for $y\in\mathscr{Y}$ in parallel (cf. Lemma 5 in Siegel [2023]), and using the first layer to share the $(d_{x}+d_{z})$-dimensional input across these parallel networks, we obtain an MLP $f_{{\rm opt}}\in{\rm MLP}(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R}^{K};W_{\mathscr{Y}},L+1)$ with $K=|\mathscr{Y}|$ and $W_{\mathscr{Y}}=O(|\mathscr{Y}|(d_{x}+d_{z}))$, such that for each $y\in\mathscr{Y}$ we have

\|f^{y}_{{\rm opt}}-h_{*}^{y}\|_{L_{\infty}(\Omega)}\leq\big{(}\max_{y\in% \mathscr{Y}}C_{y}\|h_{*}^{y}\|_{W^{\kappa}(L_{\infty}(\Omega))}\big{)}L^{-2% \kappa_{\mathscr{I}}/(d_{x}+d_{z})}.

By using $\xi=f_{{\rm opt}}$ in our bounds we obtain the predictor error as

\displaystyle\text{Predictor Error}\leq 2\big(\max_{y\in\mathscr{Y}}C_{y}\|h_{*}^{y}\|_{W^{\kappa}(L_{\infty}(\Omega))}\big)L^{-2\kappa_{\mathscr{I}}/(d_{x}+d_{z})}+(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max}).

(39)

Generalization error for MLP predictor:

We now bound the generalization error in Equation (35) when $\Xi$ denotes a class of multi-layer perceptrons (MLPs) with ReLU nonlinearity, ${\rm MLP}(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R}^{|\mathscr{Y}|};W,L)$.

The first step is to bound the covering number $\mathcal{N}(\Xi,\nu,\|\cdot\|_{2,[n],\theta})$ by the covering number $\mathcal{N}(\Xi,\nu,\|\cdot\|_{\infty,n|\mathscr{I}||\mathscr{Y}|})$, where $\|\cdot\|_{\infty,n|\mathscr{I}||\mathscr{Y}|}$ is defined as $\|\mathbf{u}\|_{\infty,n|\mathscr{I}||\mathscr{Y}|}=\sup_{x_{i}\in\mathcal{S}_{n}}\sup_{z\in\mathscr{I}}\sup_{y\in\mathscr{Y}}|u_{i,z,y}|,~\forall\mathbf{u}\in\mathbb{R}^{n\times|\mathscr{I}|\times|\mathscr{Y}|}.$

For a fixed data set $\mathcal{S}_{n}:=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$, a fixed retriever $\theta$, and two predictors $\xi,\xi^{\prime}\in\Xi$, we have

	$\displaystyle\Big(\tfrac{1}{n}\sum_{i\in[n]}\big(\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})(\ell\big(h_{\xi}(x_{i},z),y_{i}\big)-\ell\big(h_{\xi^{\prime}}(x_{i},z),y_{i}\big))\big)^{2}\Big)^{1/2}$
	$\displaystyle\qquad\overset{(i)}{\leq}\Big(\tfrac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\big(\ell\big(h_{\xi}(x_{i},z),y_{i}\big)-\ell\big(h_{\xi^{\prime}}(x_{i},z),y_{i}\big)\big)^{2}\Big)^{1/2}$
	$\displaystyle\qquad\overset{(ii)}{\leq}\Big(\tfrac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\big(|h^{y_{i}}_{\xi}(x_{i},z)-h^{y_{i}}_{\xi^{\prime}}(x_{i},z)|+\max_{y\in\mathscr{Y}}|h^{y}_{\xi}(x_{i},z)-h^{y}_{\xi^{\prime}}(x_{i},z)|\big)^{2}\Big)^{1/2}$
	$\displaystyle\qquad\overset{(iii)}{\leq}\Big(\tfrac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\big(|h^{y_{i}}_{\xi}(x_{i},z)-h^{y_{i}}_{\xi^{\prime}}(x_{i},z)|+\max_{y\in\mathscr{Y}}|h^{y}_{\xi}(x_{i},z)-h^{y}_{\xi^{\prime}}(x_{i},z)|\big)^{2}\Big)^{1/2}$
	$\displaystyle\qquad\overset{(iv)}{\leq}\sqrt{2}\sup_{x\in\mathscr{X}}\sup_{y\in\mathscr{Y}}\sup_{z\in\mathscr{I}}|h^{y}_{\xi}(x,z)-h^{y}_{\xi^{\prime}}(x,z)|$

The first inequality follows from Jensen. For the case of bounded log-loss, we obtain the second inequality using the fact that for any two $s,s^{\prime}\in\mathbb{R}^{K}$ , $|\log(\sum_{k}\exp(s_{k}))-\log(\sum_{k}\exp(s^{\prime}_{k}))|\leq\max_{k}|s_{% k}-s^{\prime}_{k}|$ .

Let $\Xi_{{\rm cov}}$ be a $\|\cdot\|_{\infty,n|\mathscr{I}||\mathscr{Y}|}$ norm cover for the space $\Xi$ of cardinality $\mathcal{N}(\Xi,\nu,\|\cdot\|_{\infty,n|\mathscr{I}||\mathscr{Y}|})$ . That implies, for any $\xi\in\Xi$ there exists a $\xi^{\prime}(\xi)\in\Xi_{{\rm cov}}$ such that $\sup_{x\in\mathscr{X}}\sup_{y\in\mathscr{Y}}\sup_{z\in\mathscr{I}}|h^{y}_{\xi}% (x,z)-h^{y}_{\xi^{\prime}}(x,z)|\leq\nu$ . Therefore, due to the above inequality, we have $\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{% i})(\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}-\ell\big{(}h_{\xi^{\prime}}(x_{i}% ,z),y_{i}\big{)})\big{)}^{2}\Big{)}^{1/2}\leq\nu$ . So $\Xi_{{\rm cov}}$ forms a cover of $\Xi$ with respect to the $\|\cdot\|_{2,[n],\theta}$ norm. Hence, $\mathcal{N}(\Xi,\nu,\|\cdot\|_{2,[n],\theta})\leq\mathcal{N}(\Xi,\nu,\|\cdot\|% _{\infty,n|\mathscr{I}||\mathscr{Y}|}).$

We need to bound $\mathcal{N}(\Xi,\nu,\|\cdot\|_{\infty,n|\mathscr{I}||\mathscr{Y}|})$ next. Similar to the retrieval analysis in Section B.1, we first apply Zhang [2023] to bound the covering number $\mathcal{N}(\Xi,\nu,\|\cdot\|_{\infty,n|\mathscr{I}||\mathscr{Y}|})$ in terms of the pseudo-dimension. However, we need a slight reformulation of the function $h_{\xi}:\mathscr{X}\times\mathscr{Z}\to\mathbb{R}^{|\mathscr{Y}|}$ to apply the results therein. Let us define the function $\tilde{h}_{\xi}:\mathscr{X}\times\mathscr{Z}\times\mathscr{Y}\to\mathbb{R}$, where for each $y\in\mathscr{Y}$ we have $\tilde{h}_{\xi}(x,z,y)=h^{y}_{\xi}(x,z)$. It is easy to see that the covering number $\mathcal{N}(\Xi,\nu,\|\cdot\|_{\infty,n|\mathscr{I}||\mathscr{Y}|})$ of the set $\Xi$ remains unchanged under this reformulation. In particular, if the pseudo-dimension of $\{\tilde{h}_{\xi}:\xi\in\Xi\}$ is $\tilde{d}_{VC}$, then we have $\log\mathcal{N}(\Xi,\nu,\|\cdot\|_{\infty,n|\mathscr{I}||\mathscr{Y}|})\leq 1+\log(1+\tilde{d}_{VC})+\tilde{d}_{VC}\log(\max\{2,en|\mathscr{I}||\mathscr{Y}|/(\tilde{d}_{VC}\nu)\})$ as per Theorem 5.11 in Zhang [2023].

Next we derive the pseudo-dimension of the class $\{\tilde{h}_{\xi}:\xi\in\Xi\}$ using Bartlett et al. [2019]. One challenge here is that, for the MLP we are considering, the label $y$ does not lie in the input space; rather, it corresponds to one coordinate of the $|\mathscr{Y}|$-dimensional output. This can be captured with a slight modification of Theorem 6 in Bartlett et al. [2019], namely Theorem A.9 in Appendix A. By Theorem A.9, for $\Xi={\rm MLP}(\mathbb{R}^{d_{x}+d_{z}},\mathbb{R}^{|\mathscr{Y}|};W,L)$ the VC dimension of $\Xi$ is ${\rm VCdim}(\Xi)=O(L\log(|\mathscr{Y}|)+L^{2}W^{2}\log(LW))$. The resulting generalization bound is

\text{Generalization Error}\leq O\bigg{(}\frac{\ell_{\max}\sqrt{(L\log(|% \mathscr{Y}|)+L^{2}W^{2}\log(LW))\log(n|\mathscr{I}||\mathscr{Y}|)}}{\sqrt{n}}% \bigg{)}.

(40)

Excess risk of predictor learning:

We can now combine the generalization error (40) and approximation error (38) to obtain the final excess risk. The final excess risk is upper bounded as

Excess Risk	$\displaystyle\leq\underbrace{R_{\ell,\mathscr{I}}(\xi^{\ast},\theta)-\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{\ast}}(X,z)\big]}_{\text{error from retriever $\theta$}}+\underbrace{O\big(L^{-2\kappa_{\mathscr{I}}/(d_{x}+d_{z})}+(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})\big)}_{\text{predictor approximation error}}$
	$\displaystyle+\underbrace{O\bigg(\frac{\ell_{\max}\sqrt{(L\log(|\mathscr{Y}|)+L^{2}W^{2}\log(LW))\log(n|\mathscr{I}||\mathscr{Y}|)}}{\sqrt{n}}\bigg)}_{\text{predictor generalization error}}$
	$\displaystyle=\underbrace{R_{\ell,\mathscr{I}}(\xi^{\ast},\theta)-\mathbb{E}_{X}\big[\min_{z\in\mathscr{I}}g_{\xi^{\ast}}(X,z)\big]}_{\text{error from retriever $\theta$}}+\underbrace{\tilde{O}\bigg(|\mathscr{Y}|^{\tfrac{2\kappa_{\mathscr{I}}}{(d_{x}+d_{z})+2\kappa_{\mathscr{I}}}}n^{-\tfrac{\kappa_{\mathscr{I}}}{(d_{x}+d_{z})+2\kappa_{\mathscr{I}}}}\bigg)}_{\text{predictor combined error}}$	(41)

We let the data-store grow polynomially with the data size, $|\mathscr{I}|=\Omega(n^{s}|\mathscr{Y}|^{1/\gamma_{\mathscr{I}}})$, and we set $\ell_{\max}=\log(|\mathscr{Y}|)+s^{\prime}\log(n)$. For $s\geq\frac{2\kappa_{\mathscr{I}}}{((d_{x}+d_{z})+2\kappa_{\mathscr{I}})\gamma_{\mathscr{I}}}$ and $s^{\prime}\geq\frac{\kappa_{\mathscr{I}}}{(d_{x}+d_{z})+2\kappa_{\mathscr{I}}}$, the final error bound for the predictor follows by setting $L=n^{\tfrac{(d_{x}+d_{z})}{2(d_{x}+d_{z})+4\kappa_{\mathscr{I}}}}|\mathscr{Y}|^{-\frac{d_{x}+d_{z}}{(d_{x}+d_{z})+2\kappa_{\mathscr{I}}}}$. Note that the choices of $L$ and $W$ here relate to the predictor size and are independent of the choices made for the retriever size. Moreover, Assumption B.2 makes the quality of the retrieval set the bottleneck in the predictor excess risk if $|\mathscr{I}|=o(n^{s}|\mathscr{Y}|^{1/\gamma_{\mathscr{I}}})$ for $s=\frac{2\kappa_{\mathscr{I}}}{((d_{x}+d_{z})+2\kappa_{\mathscr{I}})\gamma_{\mathscr{I}}}$.
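The choice of predictor depth can be checked with exact rational arithmetic, in the same spirit as for the retriever. The sketch below is a simplified reading, not the paper's derivation: the names are hypothetical, `d` stands for $d_{x}+d_{z}$ and `kappa` for $\kappa_{\mathscr{I}}$, logarithmic factors are dropped, the generalization term is assumed to be dominated by $LW/\sqrt{n}$, and the width is assumed to scale as $W\propto|\mathscr{Y}|$. Each quantity is tracked as a pair of exponents (power of $n$, power of $|\mathscr{Y}|$).

```python
from fractions import Fraction


def predictor_depth_exponents(d, kappa):
    """Exponents (of n, of |Y|) in the stated choice of depth L."""
    d, kappa = Fraction(d), Fraction(kappa)
    return d / (2 * d + 4 * kappa), -d / (d + 2 * kappa)


def predictor_rate_exponents(d, kappa):
    """Exponents (of n, of |Y|) in the combined predictor error of Equation (41)."""
    d, kappa = Fraction(d), Fraction(kappa)
    return -kappa / (d + 2 * kappa), 2 * kappa / (d + 2 * kappa)


def approximation_exponents(d, kappa):
    """Exponents of L**(-2*kappa/d) for the stated L."""
    a_n, a_y = predictor_depth_exponents(d, kappa)
    scale = -Fraction(2 * kappa, d)
    return scale * a_n, scale * a_y


def generalization_exponents(d, kappa):
    """Exponents of L * |Y| / sqrt(n) for the stated L (assumes W ~ |Y|)."""
    a_n, a_y = predictor_depth_exponents(d, kappa)
    return a_n - Fraction(1, 2), a_y + 1
```

Under these assumptions both terms reduce to $|\mathscr{Y}|^{2\kappa_{\mathscr{I}}/((d_{x}+d_{z})+2\kappa_{\mathscr{I}})}n^{-\kappa_{\mathscr{I}}/((d_{x}+d_{z})+2\kappa_{\mathscr{I}})}$, in agreement with the combined error in Equation (41).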

B.3 Joint learning of retriever and predictor

In this section, we consider the task of jointly learning the predictor and the retriever from the spaces $\Xi$ and $\Theta$, respectively. The empirical optimizer pair $(\hat{\xi}_{\rm joint},\hat{\theta}_{\rm joint})$ and the population optimizer pair $(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})$ for the joint task are given as follows.

	$\displaystyle(\hat{\xi}_{\rm joint},\hat{\theta}_{\rm joint})=\operatorname*{arg\,min}_{\xi\in\Xi,\theta\in\Theta}\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big(h_{\xi}(x_{i},z),y_{i}\big),$
	$\displaystyle(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})=\operatorname*{arg\,min}_{\xi\in\Xi,\theta\in\Theta}\mathbb{E}_{X}\big[\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi}(X,Z)\big].$

Recall, the optimal predictor with best possible retrieval is $\xi^{\ast}=\operatorname*{arg\,min}_{\xi\in\Xi}\mathbb{E}_{X}\big{[}\min_{z\in% \mathscr{I}}g_{\xi}(X,z)\big{]}.$ We denote the optimal retriever for $\xi^{*}$ as $\theta(\xi^{\ast})=\operatorname*{arg\,min}_{\theta\in\Theta}\mathbb{E}_{X}% \big{[}\mathbb{E}_{Z\sim p_{\theta}(\cdot|X)}g_{\xi^{\ast}}(X,Z)\big{]}$ .

The excess risk for the classes $\Theta$ and $\Xi$ can be bounded as

	$\displaystyle R_{\ell,\mathscr{I}}(\hat{\xi}_{\rm joint},\hat{\theta}_{\rm joint})-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})$
	$\displaystyle\qquad\overset{(i)}{\leq}R_{\ell,\mathscr{I}}(\hat{\xi}_{\rm joint},\hat{\theta}_{\rm joint})-\underbrace{\bigg{(}R_{\ell,\mathscr{I},n}(\hat{\xi}_{\rm joint},\hat{\theta}_{\rm joint})-R_{\ell,\mathscr{I},n}(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})\bigg{)}}_{\leq 0\text{ as ERM minimizes empirical risk}}$
	$\displaystyle\qquad\quad-R_{\ell,\mathscr{I}}(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})+R_{\ell,\mathscr{I}}(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})$
	$\displaystyle\qquad\overset{(ii)}{\leq}\sum_{(\theta,\xi)\in\{(\hat{\theta}_{\rm joint},\hat{\xi}_{\rm joint}),(\theta^{\ast}_{\rm joint},\xi^{\ast}_{\rm joint})\}}|R_{\ell,\mathscr{I}}(\xi,\theta)-R_{\ell,\mathscr{I},n}(\xi,\theta)|+R_{\ell,\mathscr{I}}(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})$
	$\displaystyle\qquad\overset{(iii)}{\leq}\sum_{(\theta,\xi)\in\{(\hat{\theta}_{\rm joint},\hat{\xi}_{\rm joint}),(\theta^{\ast}_{\rm joint},\xi^{\ast}_{\rm joint})\}}|R_{\ell,\mathscr{I}}(\xi,\theta)-R_{\ell,\mathscr{I},n}(\xi,\theta)|+R_{\ell,\mathscr{I}}(\xi^{*},\theta(\xi^{\ast}))-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})$
	$\displaystyle\qquad\overset{(iv)}{\leq}\underbrace{\sum_{(\theta,\xi)\in\{(\hat{\theta}_{\rm joint},\hat{\xi}_{\rm joint}),(\theta^{\ast}_{\rm joint},\xi^{\ast}_{\rm joint})\}}|R_{\ell,\mathscr{I}}(\xi,\theta)-R_{\ell,\mathscr{I},n}(\xi,\theta)|}_{\text{Generalization Error}}$
	$\displaystyle\qquad\quad+\underbrace{R_{\ell,\mathscr{I}}(\xi^{*},\theta(\xi^{\ast}))-\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{\xi^{*}}(X,z)\big{]}}_{\text{retriever error}}+\underbrace{\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{\xi^{*}}(X,z)\big{]}-R_{\ell,\mathscr{I}}(f_{{\rm opt},\mathscr{I}}^{\ell})}_{\text{predictor error}}$

In inequality $(iii)$, we substitute the pair $(\xi^{*},\theta(\xi^{\ast}))$ for $(\xi^{\ast}_{\rm joint},\theta^{\ast}_{\rm joint})$; this is valid because the latter minimizes the population risk, so the former cannot have a lower risk. For the pair $(\xi^{*},\theta(\xi^{\ast}))$ the predictor error is easily controlled. Also, note that the retriever $\theta(\xi^{\ast})$ is optimized for the optimal predictor $\xi^{\ast}$. Therefore, unlike the fixed predictor case in Section B.1, we do not incur an additional predictor error. We next bound the generalization and approximation errors separately by combining the retriever and predictor errors derived earlier.

B.3.1 Generalization Error

First, for the fixed $(\theta^{\ast},\xi^{\ast})$ pair we bound the generalization error as

|\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\theta^{\ast}}(\cdot|X)}g_{\xi^{\ast}}(X,Z)\big{]}-\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\theta^{\ast}}(z|x_{i})\ell\big{(}h_{\xi^{\ast}}(x_{i},z),y_{i}\big{)}|\leq 3\ell_{\max}\sqrt{\tfrac{\log(2/\delta)}{n}}.

Next, the generalization error for the pair $(\hat{\xi},\hat{\theta})$ can be bounded as

	$\displaystyle\Big|\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\hat{\theta}}(\cdot|X)}g_{\hat{\xi}}(X,Z)\big{]}-\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\hat{\theta}}(z|x_{i})\ell\big{(}h_{\hat{\xi}}(x_{i},z),y_{i}\big{)}\Big|$
	$\displaystyle\qquad\overset{(i)}{\leq}2\mathbb{E}_{\bm{\sigma}}\Big{[}\max_{(\theta,\xi)\in\Theta\times\Xi}\frac{1}{n}\sum_{i\in[n]}\sigma_{i}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\Big{]}+3\ell_{\max}\sqrt{\tfrac{\log(2/\delta)}{n}}$
	$\displaystyle\qquad\overset{(ii)}{\leq}2\times\inf_{\varepsilon\in[0,c_{\max}/2]}\big{(}4\varepsilon+\tfrac{12}{\sqrt{n}}\int_{\varepsilon}^{c_{\max}/2}\sqrt{\log(\mathcal{N}(\Theta\times\Xi,\nu,\|\cdot\|_{2,[n]}))}d\nu\big{)}+3\ell_{\max}\sqrt{\tfrac{\log(2/\delta)}{n}}.$		(42)

The second inequality again follows from covering-number-based bounds with chaining (see Shalev-Shwartz and Ben-David [2014]). Here we have used

c_{\max}=\sup_{(\theta,\xi)\in\Theta\times\Xi}\Big{(}\sum_{i\in[n]}\big{(}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\big{)}^{2}\Big{)}^{1/2},

and $\mathcal{N}(\Theta\times\Xi,\nu,\|\cdot\|_{2,[n]})$ denotes the covering number of the joint retriever and predictor function class $\Theta\times\Xi$ with error $\nu$ in the $L_{2}$ norm w.r.t. the set $\{(x_{i},y_{i}):i\in[n]\}$, i.e.,

\|\mathbf{u}\|_{2,[n]}:=\Big{(}\sum_{i\in[n]}\big{(}\sum_{z\in\mathscr{I}}u_{i,z}\big{)}^{2}\Big{)}^{1/2},\,\forall\mathbf{u}\in\mathbb{R}^{n\times|\mathscr{I}|}.

The covering number in Equation (42) can be bounded using the retriever and predictor learning complexities as

\sqrt{\log(\mathcal{N}(\Theta\times\Xi,\nu,\|\cdot\|_{2,[n]}))}\leq\max_{\xi\in\Xi}\sqrt{\log(\mathcal{N}(\Theta,\nu/2,\|\cdot\|_{2,[n],\xi}))}+\max_{\theta\in\Theta}\sqrt{\log(\mathcal{N}(\Xi,\nu/2,\|\cdot\|_{2,[n],\theta}))}.

This implies that the generalization error of joint learning is (orderwise) bounded by the sum of the generalization error of retriever learning (cf. (30)) and predictor learning (cf. (40)).

B.3.2 Approximation error

The approximation errors of the predictor and the retriever decouple under our decomposition and under Assumptions B.1 and B.2. Hence, the approximation error is bounded by the sum of the approximation error of retriever learning with the optimal predictor and the approximation error of predictor learning. Our bound on the approximation error of the retriever holds uniformly over all predictors, so it also holds for the optimal predictor. This implies that the joint retriever and predictor learning error is bounded (orderwise) by the sum of the predictor and retriever errors derived earlier in (29) and (38).

Proof of Theorem 3.3.

We define $f_{\mathcal{N}}(\nu;\mathcal{A},\mathcal{B})=\sup_{b\in\mathcal{B}}\sqrt{\log(\mathcal{N}(\mathcal{A},\nu,\|\cdot\|_{2,[n],b}))}$. Putting the approximation and generalization errors together, we obtain the final excess risk bound as

	$\displaystyle\Delta_{\ell,\mathscr{I}}(\hat{\xi},\hat{\theta})$
	$\displaystyle\quad\leq 3\ell_{\max}(\tfrac{1}{n}+\sqrt{\tfrac{\log(n)}{n}})+\inf_{\varepsilon\in[0,\tfrac{\ell_{\max}}{2}]}\Big{(}8\varepsilon+\tfrac{24}{\sqrt{n}}\int_{\varepsilon}^{\tfrac{\ell_{\max}}{2}}\big{(}f_{\mathcal{N}}(\tfrac{\nu}{2};\Theta,\Xi)+f_{\mathcal{N}}(\tfrac{\nu}{2};\Xi,\Theta)\big{)}d\nu\Big{)}$
	$\displaystyle\quad+\inf_{\theta\in\Theta}\inf_{\tau>0}\ell_{\max}\|r_{\theta}+\tau\mathrm{gap}_{\xi}\|_{\infty}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\quad+\inf_{\xi\in\Xi}2\mathbb{E}_{X}\big{[}\max_{y\in\mathscr{Y}}|h_{\xi}^{y}(X,z^{*}(X))-h_{*}^{y}(X,z^{*}(X))|\big{]}+(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max}).$

This completes the proof. ∎

B.3.3 Instantiation of MLP retriever and predictor

For the scenario where the retriever and the predictor are both MLPs, we can reuse the earlier analysis to provide the excess risk bound.

Proof of Theorem 3.4.

Recall from Appendix B.1.3, Equation (32), that a retriever MLP with depth $L_{{\rm ret}}$ and width $O(d_{x}+d_{z})$ has an approximation error $O\left(\ell_{\max}L_{{\rm ret}}^{-\tfrac{4\kappa}{3(d_{x}+d_{z})}}\log^{1/3}(|\mathscr{I}|)\right)$ and a generalization error $O\left(\frac{\ell_{\max}LW\sqrt{\log(LW)\log(n|\mathscr{I}|)}}{\sqrt{n}}\right)$, where $L=L_{{\rm ret}}$ and $W$ denote the depth and width of the retriever MLP.

Similarly, from Appendix B.2.3, Equation (39), an MLP predictor with depth $L_{{\rm pred}}$ and width $O(|\mathscr{Y}|(d_{x}+d_{z}))$ has an approximation error $O\left(L_{{\rm pred}}^{-2\kappa_{\mathscr{I}}/(d_{x}+d_{z})}+(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})\right)$ and a generalization error $O\left(\frac{\ell_{\max}\sqrt{(L_{{\rm pred}}\log(|\mathscr{Y}|)+L_{{\rm pred}}|\mathscr{Y}|\log(L_{{\rm pred}}|\mathscr{Y}|))\log(n|\mathscr{I}||\mathscr{Y}|)}}{\sqrt{n}}\right)$.

Thus, the combined error in this case is given as

	$\displaystyle\Delta_{\ell,\mathscr{I}}(\hat{\xi},\hat{\theta})$	$\displaystyle\leq\tilde{O}\left(\frac{\ell_{\max}}{\sqrt{n}}\left(L_{{\rm ret}}+L_{{\rm pred}}|\mathscr{Y}|\right)\right)+O\Big{(}\ell_{\max}L_{{\rm ret}}^{-\tfrac{4\kappa}{3(d_{x}+d_{z})}}\log^{1/3}(|\mathscr{I}|)\Big{)}$
		$\displaystyle+O\left(L_{{\rm pred}}^{-\tfrac{2\kappa_{\mathscr{I}}}{(d_{x}+d_{z})}}+(|\mathscr{Y}|-1)\exp(-\ell_{\max})+c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})\right).$

This completes the proof. ∎

Finally, letting $\ell_{\max}=\log(|\mathscr{Y}|)+\frac{\kappa_{\mathscr{I}}}{((d_{x}+d_{z})+2\kappa_{\mathscr{I}})}\log(n)$ and combining the excess risk of retriever learning (2nd term in (33)) and of predictor learning (2nd term in (41)), the joint learning excess error rate is given as

	$\displaystyle\text{Joint Excess Risk MLP}\leq\begin{cases}\tilde{O}\left(n^{-\tfrac{2\kappa}{3(d_{x}+d_{z})+4\kappa}}+|\mathscr{Y}|^{\tfrac{2\kappa_{\mathscr{I}}}{(d_{x}+d_{z})+2\kappa_{\mathscr{I}}}}n^{-\tfrac{\kappa_{\mathscr{I}}}{(d_{x}+d_{z})+2\kappa_{\mathscr{I}}}}\right),&\text{if}~{}|\mathscr{I}|=\Omega\Big{(}|\mathscr{Y}|^{\gamma_{\mathscr{I}}^{-1}}n^{\frac{2\kappa_{\mathscr{I}}\gamma_{\mathscr{I}}^{-1}}{((d_{x}+d_{z})+2\kappa_{\mathscr{I}})}}\Big{)},\\ \tilde{O}\left(n^{-\tfrac{2\kappa}{3(d_{x}+d_{z})+4\kappa}}+|\mathscr{I}|^{-\gamma_{\mathscr{I}}}|\mathscr{Y}|n^{\frac{\kappa_{\mathscr{I}}}{((d_{x}+d_{z})+2\kappa_{\mathscr{I}})}}\right),&\text{otherwise.}\end{cases}$		(43)

Here $\kappa$ is defined in Assumption B.1, and $(\kappa_{\mathscr{I}},\gamma_{\mathscr{I}})$ are defined in Assumption B.2. Also, $d_{x}$ is the embedding dimension of input $x\in\mathscr{X}$ and $d_{z}$ is the embedding dimension of retrieved example $z\in\mathscr{I}$ .
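
As a worked illustration of (43), the following Python sketch evaluates the two terms of the rate and the data-store size threshold that separates the two regimes. It is only a sketch: all hidden constants and logarithmic factors in $\tilde{O}(\cdot)$ and $\Omega(\cdot)$ are set to one, and the function name `joint_rate_terms` and its argument names are hypothetical.

```python
def joint_rate_terms(n, num_labels, store_size, d, kappa, kappa_store, gamma_store):
    """Evaluate the terms of the joint MLP rate with all constants set to 1.

    d stands for d_x + d_z, kappa is from Assumption B.1, and
    (kappa_store, gamma_store) stand for (kappa_I, gamma_I) from Assumption B.2.
    Returns (retriever_term, predictor_term, store_threshold).
    """
    # Retriever contribution: n^{-2 kappa / (3 d + 4 kappa)}.
    retriever_term = n ** (-2.0 * kappa / (3.0 * d + 4.0 * kappa))

    # Shared exponent kappa_I / (d + 2 kappa_I).
    a = kappa_store / (d + 2.0 * kappa_store)

    # Store size needed for the first case of the bound.
    store_threshold = num_labels ** (1.0 / gamma_store) * n ** (2.0 * a / gamma_store)

    if store_size >= store_threshold:
        # Large store: the predictor term decays with n.
        predictor_term = num_labels ** (2.0 * a) * n ** (-a)
    else:
        # Small store: the retrieval set quality is the bottleneck.
        predictor_term = store_size ** (-gamma_store) * num_labels * n ** a
    return retriever_term, predictor_term, store_threshold
```

For example, with $n=10^{6}$, $|\mathscr{Y}|=2$, $d_{x}+d_{z}=6$, and $\kappa=\kappa_{\mathscr{I}}=\gamma_{\mathscr{I}}=1$, the threshold in this sketch is $2\cdot 10^{1.5}\approx 63$, so a store with 100 items falls in the first regime while a store with 10 items falls in the second.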

Appendix C More experiments

Predictor size	small			base			large
Method / Retriever size	small	base	large	small	base	large	small	base	large
EMDR2	40.0	47.7	52.0	41.5	48.0	51.4	41.6	48.8	52.6
PDist	49.7	57.4	61.3	48.6	57.0	61.0	47.7	55.7	58.9
Reverse Cross-Entropy + PG	44.9	52.6	54.7	45.3	53.3	55.2	44.9	51.7	54.9
Reverse Cross-Entropy + TopK	48.9	56.8	60.9	47.9	55.5	59.6	46.7	54.3	58.2

Table 4: Recall on NQ. We measure recall, i.e., whether the answer string is present in the retrieved passages, for RAMs across various training objectives and model sizes. The top row specifies the predictor size and the second row specifies the retriever size.

Predictor size	small			base			large
Method / Retriever size	small	base	large	small	base	large	small	base	large
EMDR2	46.6	54.7	62.4	46.1	55.7	61.6	46.0	53.9	59.5
PDist	59.6	68.6	72.8	59.1	61.9	72.2	56.4	59.3	69.3
Reverse Cross-Entropy + PG	58.1	60.7	70.7	56.9	66.1	64.2	54.2	61.4	61.3
Reverse Cross-Entropy + TopK	57.1	64.5	69.1	55.9	63.5	68.1	54.2	61.2	65.8

Table 5: Recall on TriviaQA. We measure recall, i.e., whether the answer string is present in the retrieved passages, for RAMs across various training objectives and model sizes. The top row specifies the predictor size and the second row specifies the retriever size.

Predictor size	small			base			large
Retriever size	small	base	large	small	base	large	small	base	large
Parameters	96.4M	170.9M	396.4M	258.8M	333.3M	558.9M	773.6M	848.1M	1073.7M

Table 6: Parameters. We report the number of model parameters of RAMs for each combination of model sizes. The top row specifies the predictor size and the second row specifies the retriever size.

C.1 Implementation details

Computing the objective (13), let alone its gradient, requires evaluating the retriever and the predictor over the entire data store $\mathscr{I}$, which is prohibitively expensive. We explore two ways to approximately compute the objective:

Top-K approximation

This approach constrains the summation to a specific subset. Periodically, we compute $p_{\theta}(z|x)$ for all items $z\in\mathscr{I}$ based on the current value of $\theta$. We use this to obtain the set $\mathscr{Z}(x_{i})$ of $K$ documents with the highest (stale) scores, i.e., $\mathcal{T}_{K}(p_{\theta}(\cdot|x_{i}))$, and evaluate the sum over this set:

	$\displaystyle\mathscr{L}^{\textsc{RCE+TopK}}_{\mathscr{I},n}(\theta;\xi,\mathscr{I})=-\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{Z}(x_{i})}p_{\theta,\mathscr{I}}(z|x_{i})\cdot\log p_{\xi}(y_{i}|x_{i},z)$		(44)

This methodology is akin to those adopted by EMDR2 and PDist, with the set being refreshed every 500 training steps and the selection of $K=64$ .
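
The Top-K approximation in (44) can be sketched in a few lines of Python. This is a hedged illustration, not the code used for the experiments: the function names are hypothetical, and we assume, purely for the sketch, that the retriever probabilities are renormalized with a softmax over the $K$ candidates rather than over the full data store.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(stale_scores, k):
    """Indices of the k highest (possibly stale) retriever scores."""
    order = sorted(range(len(stale_scores)), key=lambda j: stale_scores[j], reverse=True)
    return order[:k]


def rce_topk_loss(batch):
    """Reverse cross-entropy restricted to each example's candidate set.

    batch is a list of (scores, log_preds) pairs, one per example, where
    scores[j] is the current retriever score of candidate j and log_preds[j]
    is log p_xi(y | x, z_j) for the same candidate.
    """
    total = 0.0
    for scores, log_preds in batch:
        # Assumption: probabilities are renormalized over the candidate set.
        probs = softmax(scores)
        total += sum(p * lp for p, lp in zip(probs, log_preds))
    return -total / len(batch)
```

The candidate sets would be recomputed with `top_k_indices` only at each refresh, while `rce_topk_loss` is evaluated at every training step with the current scores on the stale candidates.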

Policy gradient

Based on the connection to RLHF/RLAIF, we propose to use the policy gradient method [Sutton and Barto, 2018] to efficiently obtain an unbiased estimate of the gradient with respect to $\theta$. However, as policy gradients suffer from high variance [Burda et al., 2015, Grathwohl et al., 2021], we use a constant baseline [Williams, 1992] for variance reduction, i.e., our objective becomes

	$\displaystyle\mathscr{L}^{\textsc{RCE+PG}}_{\mathscr{I},n}(\theta;\xi,\mathscr{I})$	$\displaystyle=-\frac{1}{n}\sum_{i\in[n]}\sum_{j\in[K]}p_{\theta,\mathscr{I}}(z_{j}(x_{i})|x_{i})\cdot\big{[}\log p_{\xi}(y_{i}|x_{i},z_{j}(x_{i}))-b\big{]}$		(45)
	$\displaystyle\nabla_{\theta}\mathscr{L}^{\textsc{RCE+PG}}_{\mathscr{I},n}(\theta;\xi,\mathscr{I})$	$\displaystyle=-\frac{1}{n}\sum_{i\in[n]}\sum_{j\in[K]}\nabla_{\theta}\log p_{\theta,\mathscr{I}}(z_{j}(x_{i})|x_{i})\cdot\big{[}\log p_{\xi}(y_{i}|x_{i},z_{j}(x_{i}))-b\big{]},$		(45)

where $z_{j}(x_{i})\sim p_{\theta}(\cdot|x_{i})$ are $K$ i.i.d. samples from the retriever distribution. We use $K=64$ and $b=5$ .
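
The gradient estimate above can be illustrated with a small Python sketch for a single example. To keep it self-contained we assume a tabular retriever, i.e., one free logit per passage with $p_{\theta}(\cdot|x)$ given by a softmax of these logits, so that $\nabla_{\theta}\log p_{\theta}(z|x)$ has the closed form "one-hot minus softmax"; a neural retriever would obtain this term by backpropagation instead. The function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_passages(logits, k, rng):
    """Draw k i.i.d. passage indices from the softmax retriever distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=k)


def pg_gradient(logits, sampled, log_pred, baseline):
    """Gradient of the RCE+PG loss for one example w.r.t. tabular logits.

    Uses grad log p(z) = onehot(z) - softmax(logits), which holds only for
    the tabular softmax parameterization assumed in this sketch.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for j in sampled:
        # Reward of the sampled passage minus the constant baseline b.
        advantage = log_pred[j] - baseline
        for z in range(len(logits)):
            score = (1.0 if z == j else 0.0) - probs[z]
            grad[z] -= score * advantage
    return grad
```

A gradient descent step with this estimate raises the logit of a sampled passage whose reward $\log p_{\xi}(y|x,z)$ exceeds the baseline and lowers it otherwise; the baseline shifts the rewards without changing the expected gradient.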

C.2 Training details

Dataset. The versions of the open-domain QA datasets we use are:

• TriviaQA: https://www.tensorflow.org/datasets/catalog/trivia_qa#trivia_qaunfilterednocontext
• NQOpen: https://www.tensorflow.org/datasets/catalog/natural_questions_open

Optimization. For all of our experiments, we use the Adam optimizer with weight decay, a short warm-up period (2000 steps), and a linear decay schedule. The peak learning rate is $1\times 10^{-4}$, the weight decay factor is 0.1, and the batch size is $64$. The total number of training steps is as follows:

• No retriever, train predictor $\xi$: 40,000
• Fixed retriever $\theta_{0}$, train predictor $\xi$: 20,000
• Fixed predictor $\xi^{\star}(\theta_{0})$, train retriever $\theta$: 20,000
• Jointly train predictor $\xi$ and retriever $\theta$: 40,000
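
For concreteness, the learning rate schedule described above can be sketched as follows. The text only specifies a 2000-step warm-up, a linear decay, and a peak of $1\times 10^{-4}$; the sketch additionally assumes that the warm-up is linear from zero and that the decay reaches zero at the final training step. The function name `learning_rate` is hypothetical.

```python
def learning_rate(step, total_steps, peak_lr=1e-4, warmup_steps=2000):
    """Linear warm-up to peak_lr, then linear decay (assumed to end at zero)."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Assumption: the warm-up ramps linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Assumption: the decay is linear and reaches 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

For the 40,000-step configurations, this gives half the peak rate at step 1,000 (during warm-up) and again at step 21,000 (halfway through the decay).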

Initializations. We initialize models for the different configurations as follows:

• No retriever, train predictor $\xi$: We initialize the predictor from the public pretrained T5 checkpoint.
• Fixed retriever $\theta_{0}$, train predictor $\xi$: We initialize the fixed retriever from the public pretrained GTR checkpoint and the predictor from the public pretrained T5 checkpoint.
• Fixed predictor $\xi^{\star}(\theta_{0})$, train retriever $\theta$: We initialize the fixed predictor from the final checkpoint of the previous run, i.e., “Fixed retriever $\theta_{0}$, train predictor $\xi$”. The retriever is initialized from the public pretrained GTR checkpoint.
• Jointly train predictor $\xi$ and retriever $\theta$: We initialize the retriever from the public pretrained GTR checkpoint and the predictor from the public pretrained T5 checkpoint.

	$\displaystyle\Big{|}\mathbb{E}_{X}\big{[}\mathbb{E}_{Z\sim p_{\hat{\theta}(\xi)}(\cdot|X)}g_{\xi}(X,Z)\big{]}-\frac{1}{n}\sum_{i\in[n]}\sum_{z\in\mathscr{I}}p_{\hat{\theta}(\xi)}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\Big{|}$
	$\displaystyle\qquad\leq 2\mathbb{E}_{\bm{\sigma}}\Big{[}\max_{\theta\in\Theta}\frac{1}{n}\sum_{i\in[n]}\sigma_{i}\sum_{z\in\mathscr{I}}p_{\theta}(z|x_{i})\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\Big{]}+3\ell_{\max}\sqrt{\tfrac{\log(2/\delta)}{n}}$
	$\displaystyle\qquad\leq 2\times\inf_{\varepsilon\in[0,c_{\xi}/2]}\big{(}4\varepsilon+\tfrac{12}{\sqrt{n}}\int_{\varepsilon}^{c_{\xi}/2}\sqrt{\log(\mathcal{N}(\Theta,\nu,\|\cdot\|_{2,[n],\xi}))}d\nu\big{)}+3\ell_{\max}\sqrt{\tfrac{\log(2/\delta)}{n}}$		(28)

	$\displaystyle R_{\ell,\mathscr{I}}(\xi,\theta(\xi))-\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big{]}$
	$\displaystyle\qquad\overset{(i)}{\leq}R_{\ell,\mathscr{I}}(\xi,\theta)-\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big{]}$
	$\displaystyle\qquad\overset{(ii)}{=}\mathbb{E}_{X}\big{[}\sum_{z\in\mathscr{I}}(p_{\theta,\mathscr{I}}(z|x)-\tilde{p}_{\xi}(z|x))g_{\xi}(x,z)\big{]}+\mathbb{E}_{X}\big{[}\sum_{z\in\mathscr{I}}\tilde{p}_{\xi}(z|x)g_{\xi}(x,z)-\min_{z\in\mathscr{I}}g_{\xi}(x,z)\big{]}$
	$\displaystyle\qquad\overset{(iii)}{\leq}\mathbb{E}_{X}\big{[}\|g_{\xi}(x,\cdot)\|_{\infty}\|p_{\theta,\mathscr{I}}(\cdot|x)-\tilde{p}_{\xi}(\cdot|x)\|_{1}\big{]}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\qquad\overset{(iv)}{\leq}\mathbb{E}_{X}\big{[}\|g_{\xi}(x,\cdot)\|_{\infty}\|r_{\theta}(x,\cdot)+\tau\mathrm{gap}_{\xi}(x,\cdot)\|_{\infty}\big{]}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\qquad\overset{(v)}{\leq}\ell_{\max}\|r_{\theta}+\tau\mathrm{gap}_{\xi}\|_{\infty}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$

	$\displaystyle\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\sum_{z\in\mathscr{I}}(p_{\theta}(z|x_{i})-p_{\theta^{\prime}}(z|x_{i}))\ell\big{(}h_{\xi}(x_{i},z),y_{i}\big{)}\big{)}^{2}\Big{)}^{1/2}$
	$\displaystyle\qquad\overset{(i)}{\leq}\ell_{\max}\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\sum_{z\in\mathscr{I}}|p_{\theta}(z|x_{i})-p_{\theta^{\prime}}(z|x_{i})|\big{)}^{2}\Big{)}^{1/2}$
	$\displaystyle\qquad\overset{(ii)}{\leq}\ell_{\max}\Big{(}\tfrac{1}{n}\sum_{i\in[n]}\big{(}\max_{z\in\mathscr{I}}|r_{\theta}(x_{i},z)-r_{\theta^{\prime}}(x_{i},z)|\big{)}^{2}\Big{)}^{1/2}$
	$\displaystyle\qquad\overset{(iii)}{\leq}\ell_{\max}\sup_{x\in\mathcal{S}_{n},z\in\mathscr{I}}|r_{\theta}(x,z)-r_{\theta^{\prime}}(x,z)|$

	$\displaystyle R_{\ell,\mathscr{I}}(\xi,\theta(\xi))-\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{\xi}(X,z)\big{]}$
	$\displaystyle\qquad\overset{(i)}{\leq}\mathbb{E}_{X}\big{[}\|g_{\xi}(x,\cdot)\|_{\infty}\big{]}\mathbb{E}_{X}\big{[}\|r_{\theta}(x,\cdot)+\tau\mathrm{gap}_{\xi}(x,\cdot)\|_{\infty}\big{]}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\qquad\overset{(ii)}{=}\tau\mathbb{E}_{X}\big{[}\|g_{\xi}(x,\cdot)\|_{\infty}\big{]}\|\tilde{r}_{\theta}-\mathrm{gap}_{\xi}\|_{L_{\infty}(\Omega)}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\qquad\overset{(iii)}{\leq}C\tau\mathbb{E}_{X}\big{[}\|g_{\xi}(x,\cdot)\|_{\infty}\big{]}\|\mathrm{gap}_{\xi}\|_{W^{\kappa}(L_{\infty}(\Omega))}L^{-2\kappa/(d_{x}+d_{z})}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$
	$\displaystyle\qquad\overset{(iv)}{\leq}C^{\prime}\ell_{\max}\tau L^{-2\kappa/(d_{x}+d_{z})}+\frac{\log(|\mathscr{I}|)}{\tau^{2}}$

	$\displaystyle\mathbb{E}_{X}\big{[}\min_{z\in\mathscr{I}}g_{f_{{\rm opt},\mathscr{I}}^{\ell}}(X,z)\big{]}$
	$\displaystyle\qquad\overset{(i)}{\geq}\mathbb{E}_{X}\big{[}\sum_{y\in\mathscr{Y}}p_{\mathsf{D}_{XY}}^{y}(X)\min(\ell_{\max},-\ln(p_{\mathsf{D}_{XY}}^{y}(X)))\big{]}-(|\mathscr{Y}|-1)\exp(-\ell_{\max})$
	$\displaystyle\qquad\overset{(ii)}{\geq}\mathbb{E}_{X}\big{[}\sum_{y\in\mathscr{Y}}p_{\mathsf{D}_{XY}}^{y}(X)\min(\ell_{\max},-\ln(p_{*}^{y}(X,z^{*}(X))))\big{]}-(|\mathscr{Y}|-1)\exp(-\ell_{\max})$
	$\displaystyle\quad\quad-\exp(\ell_{\max})\,\mathbb{E}_{X}\big{[}\max_{y\in\mathscr{Y}}|p_{*}^{y}(X,z^{*}(X))-p_{\mathsf{D}_{XY}}^{y}(X)|\big{]}$
	$\displaystyle\qquad\overset{(iii)}{\geq}\mathbb{E}_{X}\big{[}\sum_{y\in\mathscr{Y}}p_{\mathsf{D}_{XY}}^{y}(X)\min(\ell_{\max},-\ln(p_{*}^{y}(X,z^{*}(X))))\big{]}-(|\mathscr{Y}|-1)\exp(-\ell_{\max})-c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})$
	$\displaystyle\qquad\overset{(iv)}{=}\mathbb{E}_{X}\big{[}g_{h_{*}}(X,z^{*}(X))\big{]}-(|\mathscr{Y}|-1)\exp(-\ell_{\max})-c_{\mathscr{I}}|\mathscr{I}|^{-\gamma_{\mathscr{I}}}\exp(\ell_{\max})$		(37)
