Simple Transferability Estimation for Regression Tasks

Cuong N. Nguyen$^{1}$, Phong Tran$^{2,3}$, Lam Si Tung Ho$^{4}$, Vu Dinh$^{5}$, Anh T. Tran$^{2}$, Tal Hassner$^{6}$, Cuong V. Nguyen

$^{1}$ Florida International University, USA

$^{2}$ VinAI Research, Vietnam

$^{3}$ MBZUAI, UAE

$^{4}$ Dalhousie University, Canada

$^{5}$ University of Delaware, USA

$^{6}$ Meta AI, USA

Abstract

We consider transferability estimation, the problem of estimating how well deep learning models transfer from a source to a target task. We focus on regression tasks, which have received little attention so far, and propose two simple and computationally efficient approaches that estimate transferability based on the negative regularized mean squared error of a linear regression model. We prove novel theoretical results connecting our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing state-of-the-art regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.

1 Introduction

Transferability estimation [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020] aims to develop computationally efficient metrics to predict the effectiveness of transferring a deep learning model from a source to a target task. This problem has recently gained attention as a means for model and task selection [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020, Bolya et al., 2021, You et al., 2021] that can potentially improve the performance and reduce the cost of transfer learning, especially for expensive deep learning models. In recent years, new transferability estimators were also developed and used in applications such as checkpoint ranking [Huang et al., 2021, Li et al., 2021] and few-shot learning [Tong et al., 2021].

Nearly all existing methods consider only the transferability between classification tasks [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020, Deshpande et al., 2021, Li et al., 2021, Tan et al., 2021, Huang et al., 2022], with very few designed for regression [You et al., 2021, Huang et al., 2022], despite the importance of regression problems in a wide range of applications such as landmark detection [Fard et al., 2021, Poster et al., 2021], object detection and localization [Cai et al., 2020, Bu et al., 2021], pose estimation [Schwarz et al., 2015, Doersch and Zisserman, 2019], or image generation [Ramesh et al., 2021, Razavi et al., 2019]. Moreover, those few methods are often a byproduct of a classification transferability estimator and were never tested against regression transferability estimation baselines.

In this paper, we explicitly consider transferability estimation for regression tasks and formulate a novel definition for this problem. Our formulation is based on the practical usage of transferability estimation: to compare the actual transferability between different tasks [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020, You et al., 2021]. We then propose two simple, efficient, and theoretically grounded approaches for this problem that estimate transferability using the negative regularized mean squared error (MSE) of a linear regression model computed from the source and target training sets. The first approach, Linear MSE, uses the linear regression model between features extracted from the source model (a model trained on the source task) and true labels of the target training set. The second approach, Label MSE, estimates transferability by regressing between the dummy labels, obtained from the source model, and true labels of the target data. In special cases where the source and target data share the inputs, the Label MSE estimators can be computed even more efficiently from the true labels without a source model.

In addition to their simplicity, we show our transferability estimators to have theoretical properties relating them to the actual transferability of the transferred target model. In particular, we prove that the transferability of the target model obtained from transfer learning is lower bounded by the Label MSE minus a complexity term, which depends on the target dataset size and the model architecture. Similar theoretical results can also be proven for the case where the source and target tasks share the inputs.

We conduct extensive experiments on two real-world keypoint detection datasets, CUB-200-2011 [Wah et al., 2011] and OpenMonkey [Yao et al., 2021], as well as the dSprites shape regression dataset [Matthey et al., 2017] to show the advantages of our approaches. The results clearly demonstrate that despite their simplicity, our approaches outperform recently published, state-of-the-art (SotA) regression transferability estimators, such as LogME [You et al., 2021] and TransRate [Huang et al., 2022], in both effectiveness and efficiency. In particular, our approaches can improve SotA results from 12% to 36% on average, while being at least 27% faster.

Summary of contributions. (1) We formulate a new definition for the transferability estimation problem that can be used for comparing the actual transferability (§3). (2) We propose Linear MSE and Label MSE, two simple yet effective transferability estimators for regression tasks (§4). (3) We prove novel theoretical results for these estimators to connect them with the actual task transferability (§5). (4) We rigorously test our approaches in various settings and challenging benchmarks, showing their advantages compared to SotA regression transferability methods (§6). (Footnote 1: Implementations of our methods are available at: https://github.com/CuongNN218/regression_transferability.)

2 Related work

Our paper is one of the recent attempts to develop efficient and effective transferability estimators for deep transfer learning [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020, Deshpande et al., 2021, Li et al., 2021, Tan et al., 2021, You et al., 2021, Huang et al., 2022, Nguyen et al., 2022], which is closely related to the generalization estimation problem [Chuang et al., 2020, Deng and Zheng, 2021]. Most of the existing work for transferability estimation focuses on classification [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020, Deshpande et al., 2021, Li et al., 2021, Tan et al., 2021, Nguyen et al., 2022], while we are only aware of two methods developed for regression [You et al., 2021, Huang et al., 2022].

One regression transferability method, called LogME [You et al., 2021], takes a Bayesian approach and uses the maximum log evidence of the target data as the transferability estimator. While this method can be sped up using matrix decomposition, its scalability is still limited since the required memory is large. In contrast, our proposed approaches are simpler, faster, and more effective. We also provide novel theoretical properties for our methods that were not available for LogME. Another approach for transferability estimation between regression tasks, called TransRate [Huang et al., 2022], is to divide the real-valued outputs into different bins and apply a classification transferability estimator. In our experiments, we will show that this approach is less accurate than both LogME and our approaches.

Transferability can also be inferred from a task taxonomy [Zamir et al., 2018, Dwivedi and Roig, 2019, Dwivedi et al., 2020] or a task space representation [Achille et al., 2019], which embeds tasks as vectors on a vector space. A popular task taxonomy, Taskonomy [Zamir et al., 2018], exploits the underlying structure of visual tasks by computing a task affinity matrix that can be used for estimating transferability. Constructing the Taskonomy requires training a small classification head, which resembles the training of the regularized linear regression models in our approaches. However, they investigate the global taxonomy of classification tasks, while our paper studies regression tasks with a focus on estimating their transferability efficiently.

Our paper is also related to transfer learning with kernel methods [Radhakrishnan et al., 2022] and with deep models [Tan et al., 2018], which has been successful in real-world regression problems such as object detection and localization [Cai et al., 2020, Bu et al., 2021], landmark detection [Fard et al., 2021, Poster et al., 2021], or pose estimation [Schwarz et al., 2015, Doersch and Zisserman, 2019]. Several previous works have investigated theoretical bounds for transfer learning [Ben-David and Schuller, 2003, Blitzer et al., 2007, Mansour et al., 2009, Azizzadenesheli et al., 2019, Wang et al., 2019, Tripuraneni et al., 2020]; however, these bounds are hard to compute in practice and thus unsuitable for transferability estimation. Some previous transferability estimators have theoretical bounds on the empirical loss of the transferred model [Tran et al., 2019, Nguyen et al., 2020], but these bounds were for classification and did not relate directly to transferability. Our bounds, on the other hand, focus on regression and connect our approaches directly to the notion of transferability.

3 Transferability between regression tasks

In this section, we describe the transfer learning setting that will be used in our subsequent analysis. We then propose a definition of transferability for regression tasks and a new formulation for the transferability estimation problem.

3.1 Transfer learning for regression

Consider a source training set $\mathcal{D}_{s}=\{(x^{s}_{i},y^{s}_{i})\}_{i=1}^{n_{s}}$ and a target training set $\mathcal{D}_{t}=\{(x^{t}_{i},y^{t}_{i})\}_{i=1}^{n_{t}}$ consisting of $n_{s}$ and $n_{t}$ examples respectively, where $x^{s}_{i},x^{t}_{i}\in\mathbb{R}^{d}$ are $d$-dimensional input vectors, $y^{s}_{i}\in\mathbb{R}^{d_{s}}$ is a $d_{s}$-dimensional source label vector, and $y^{t}_{i}\in\mathbb{R}^{d_{t}}$ is a $d_{t}$-dimensional target label vector. Here we allow multi-output regression tasks (with $d_{s},d_{t}\geq 1$) where the source and target labels may have different dimensions ($d_{s}\neq d_{t}$). In the simplest case, the source and target tasks are both single-output regression tasks where $d_{s}=d_{t}=1$.

In this paper, we will refer to a model (such as $w$, $w^{*}$, $h$, $h^{*}$, $k$, or $k^{*}$) and its parameters interchangeably. Using the source dataset $\mathcal{D}_{s}$, we train a deep learning model $(w^{*},h^{*})$ consisting of an optimal feature extractor $w^{*}$ and an optimal regression head $h^{*}$ that minimizes the empirical MSE loss (Footnote 2: Here we assume $(w^{*},h^{*})$ is a global minimum of Eq. (1). However, practical optimization algorithms often only return a local minimum for this problem. The same is also true for Eq. (3).):

$$w^{*},h^{*}=\operatorname*{argmin}_{w,h}\mathcal{L}(w,h;\mathcal{D}_{s}), \tag{1}$$

where $w:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{r}}$ is a feature extractor network that transforms a $d$-dimensional input vector into a $d_{r}$-dimensional feature vector, $h:\mathbb{R}^{d_{r}}\rightarrow\mathbb{R}^{d_{s}}$ is a source regression head network that transforms a $d_{r}$-dimensional feature vector into a $d_{s}$-dimensional output vector, and $\mathcal{L}(w,h;\mathcal{D}_{s})$ is the empirical MSE loss of the whole model $(w,h)$ on the dataset $\mathcal{D}_{s}$:

$$\mathcal{L}(w,h;\mathcal{D}_{s})=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\|y^{s}_{i}-h(w(x^{s}_{i}))\|^{2}, \tag{2}$$

with $\|\cdot\|$ being the $\ell_{2}$ norm. In practice, we usually consider a source model (e.g., a ResNet [He et al., 2016]) as a whole and use its first $l$ layers from the input (for some chosen number $l$) as the feature extractor $w$. The regression head $h$ is the remaining part of the model from the $l$-th layer to the output layer, and the prediction for any input $x$ is $h(w(x))$.
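As a small illustration of Eq. (2), the following sketch evaluates the empirical MSE for given predictions. The function name `empirical_mse` is hypothetical and not taken from the authors' code. Note that the squared error is summed over the output dimensions (the squared $\ell_{2}$ norm) and only then averaged over examples.

```python
import numpy as np

def empirical_mse(predictions, labels):
    """Eq. (2): average over examples of the squared l2 norm of the residual."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Sum squared errors over the output dimensions, then average over examples.
    per_example = np.sum((labels - predictions) ** 2, axis=1)
    return float(per_example.mean())

# Two examples with 2-dimensional labels; the residuals are (3, 4) and (0, 0).
loss = empirical_mse([[3.0, 4.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]])
print(loss)  # (25 + 0) / 2 = 12.5
```

An MSE that also averages over the output dimensions would differ from this value by a constant factor equal to the label dimension.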

After training the optimal source model $(w^{*},h^{*})$ , we perform transfer learning to the target task by freezing the optimal feature extractor $w^{*}$ and re-training a new regression head $k^{*}$ using the target dataset $\mathcal{D}_{t}$ , also by minimizing the empirical MSE loss:

$$k^{*}=\operatorname*{argmin}_{k}\mathcal{L}(w^{*},k;\mathcal{D}_{t})=\operatorname*{argmin}_{k}\Big\{\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\|y^{t}_{i}-k(w^{*}(x^{t}_{i}))\|^{2}\Big\}, \tag{3}$$

where $k:\mathbb{R}^{d_{r}}\rightarrow\mathbb{R}^{d_{t}}$ is a target regression head network that may have a different architecture than that of $h$ . In general, the regression heads $h$ and $k$ may contain multiple layers and are not necessarily linear.

This transfer learning algorithm, usually called head re-training, has been widely used for deep learning models [Donahue et al., 2014, Oquab et al., 2014, Sharif Razavian et al., 2014, Whatmough et al., 2019] and will be used for our theoretical analysis. In practice and in our experiments, we also consider another transfer learning algorithm, widely known as fine-tuning, where we fine-tune the trained feature extractor $w^{*}$ on the target set, and then train a new target regression head $k^{*}$ with this fine-tuned feature extractor [Agrawal et al., 2014, Girshick et al., 2014, Chatfield et al., 2014, Dhillon et al., 2020].

3.2 Transferability estimation

As our first contribution, we propose a definition of transferability for regression tasks and a new formulation for the transferability estimation problem. For this purpose, we make the standard assumption that the target data $\mathcal{D}_{t}$ are drawn iid from the true but unknown distribution $\mathbb{P}_{t}:=\mathbb{P}(X^{t},Y^{t})$ ; that is, $(x^{t}_{i},y^{t}_{i})\stackrel{{\scriptstyle\mathrm{iid}}}{{\sim}}\mathbb{P}_{t}$ . We do not make any assumption on the distribution of the source data $\mathcal{D}_{s}$ , but we assume a source model $(w^{*},h^{*})$ is pre-trained on $\mathcal{D}_{s}$ and then transferred to a target model $(w^{*},k^{*})$ using the procedure in Section 3.1.

We now define the transferability between the source dataset $\mathcal{D}_{s}$ and the target task represented by $\mathbb{P}_{t}$ . In our Definition 3.1 below, the transferability is the expected negative $\ell_{2}$ loss of the target model $(w^{*},k^{*})$ on a random example drawn from $\mathbb{P}_{t}$ . From this definition, the lower the loss of $(w^{*},k^{*})$ , the higher the transferability.

Definition 3.1.

The transferability between a source dataset $\mathcal{D}_{s}$ and a target task $\mathbb{P}_{t}$ is defined as: $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t}):=\mathbb{E}_{(x^{t},y^{t})\sim\mathbb{P}_{t}}\left\{-\|y^{t}-k^{*}(w^{*}(x^{t}))\|^{2}\right\}$.

In the above definition, transferability is also equivalent to the negative expected (true) risk of $(w^{*},k^{*})$. Next, we formulate the transferability estimation problem. Previous work [Tran et al., 2019, Huang et al., 2022] defined this problem as estimating $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})$ from the training sets $(\mathcal{D}_{s},\mathcal{D}_{t})$, i.e., to derive a real-valued metric $\mathcal{T}(\mathcal{D}_{s},\mathcal{D}_{t})\in\mathbb{R}$ such that $\mathcal{T}(\mathcal{D}_{s},\mathcal{D}_{t})\approx\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})$. However, in most applications of transferability estimation such as task selection [Tran et al., 2019, Huang et al., 2022, You et al., 2021] or model ranking [Huang et al., 2021, Li et al., 2021], an accurate approximation of $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})$ is usually not required since $\mathcal{T}(\mathcal{D}_{s},\mathcal{D}_{t})$ is only used for comparing tasks or models. Thus, we propose below an alternative definition for this problem that better aligns with its practical usage.

Definition 3.2.

Transferability estimation aims to find a computationally efficient real-valued metric $\mathcal{T}(\mathcal{D}_{s},\mathcal{D}_{t})\in\mathbb{R}$ for any pair of training datasets $(\mathcal{D}_{s},\mathcal{D}_{t})$ such that: $\mathcal{T}(\mathcal{D}_{s},\mathcal{D}_{t})\leq\mathcal{T}(\mathcal{D}^{\prime}_{s},\mathcal{D}^{\prime}_{t})$ if and only if $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})\leq\mathrm{Tr}(\mathcal{D}^{\prime}_{s},\mathbb{P}^{\prime}_{t})$, where $\mathbb{P}_{t}$ and $\mathbb{P}^{\prime}_{t}$ are the tasks corresponding to the datasets $\mathcal{D}_{t}$ and $\mathcal{D}^{\prime}_{t}$ respectively.

In our new definition, a transferability estimator $\mathcal{T}(\mathcal{D}_{s},\mathcal{D}_{t})$ is a function of $(\mathcal{D}_{s},\mathcal{D}_{t})$ that can be used for comparing or ranking transferability. It does not need to be an approximation of $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})$ . This is a generalization of previous definitions [Nguyen et al., 2020, Huang et al., 2022] and can be used for source task selection (when $\mathbb{P}_{t}=\mathbb{P}^{\prime}_{t}$ and ${\mathcal{D}_{t}=\mathcal{D}^{\prime}_{t}}$ ) as well as target task selection (when $\mathcal{D}_{s}=\mathcal{D}^{\prime}_{s}$ ). It is consistent with the usage of transferability estimators and the way they are evaluated in the literature by correlation analysis [Tran et al., 2019, Nguyen et al., 2020, You et al., 2021, Huang et al., 2022].
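To make the ordering requirement of Definition 3.2 concrete, the sketch below measures how often an estimator orders pairs of transfer settings the same way as the true transferability. The function name `order_agreement` and the toy numbers are illustrative assumptions. In practice the true transferability is unknown and would be replaced by a proxy such as the negative test MSE.

```python
from itertools import combinations

import numpy as np

def order_agreement(estimates, transferabilities):
    """Fraction of pairs (i, j) on which the "<=" ordering agrees in both directions."""
    est = np.asarray(estimates, dtype=float)
    tr = np.asarray(transferabilities, dtype=float)
    pairs = list(combinations(range(len(est)), 2))
    agree = 0
    for i, j in pairs:
        forward = (est[i] <= est[j]) == (tr[i] <= tr[j])
        backward = (est[j] <= est[i]) == (tr[j] <= tr[i])
        agree += int(forward and backward)
    return agree / len(pairs)

# Toy scores for three transfer settings; the last two are swapped by the estimator.
print(order_agreement([1.0, 2.0, 3.0], [10.0, 30.0, 20.0]))  # 2 of 3 pairs agree
```

A value of 1.0 means the estimator satisfies the condition of Definition 3.2 on the given pairs. Lower values indicate ranking mistakes, even if the estimator's raw values are on a completely different scale from the true transferability.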

4 Simple transferability estimators for regression

In theory, we can use $-\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})$ , the negative MSE of the transferred target model $(w^{*},k^{*})$ , as a transferability estimator, since it is an empirical estimation of $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})$ using the dataset $\mathcal{D}_{t}$ . However, this method requires us to run the actual transfer learning process, which could be expensive if the network architecture of the target regression heads (e.g., $k$ and $k^{*}$ ) is deep and complex. This violates a crucial requirement for a transferability estimator in Definition 3.2: the estimator must be computationally efficient since it will be computed several times for task comparison. In this section, we propose two simple regression transferability estimators to address this problem.

4.1 Linear MSE estimator

To reduce the cost of computing $\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})$ , a simple idea is to approximate it with an $\ell_{2}$ -regularized linear regression (Ridge regression) head. This leads to our first simple transferability estimator, Linear MSE, which is defined as the negative regularized MSE of this Ridge regression head. In this definition, $\|\cdot\|_{F}$ is the Frobenius norm.

Definition 4.1.

The Linear MSE transferability estimator with a regularization parameter $\lambda\geq 0$ is: $\mathcal{T}^{\mathrm{lin}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t}):=-\min_{A,b}\big\{\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\|y^{t}_{i}-Aw^{*}(x^{t}_{i})-b\|^{2}+\lambda\|A\|_{F}^{2}\big\}$, where $A\in\mathbb{R}^{d_{t}\times d_{r}}$ is a $d_{t}\times d_{r}$ real-valued matrix (so that $Aw^{*}(x^{t}_{i})\in\mathbb{R}^{d_{t}}$) and $b\in\mathbb{R}^{d_{t}}$ is a $d_{t}$-dimensional real-valued vector.

Here we add a regularizer to avoid overfitting when the target dataset $\mathcal{D}_{t}$ is small. Previous work such as LogME [You et al., 2021] proposed to prevent overfitting by taking a Bayesian approach, which is more complicated and expensive. We will show empirically in our experiments (Section 6.3) that our simple regularization approach can tackle the issue more effectively and efficiently.

Given a pre-trained feature extractor $w^{*}$ and a target set $\mathcal{D}_{t}$ , we can compute $\mathcal{T}^{\mathrm{lin}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})$ efficiently using the closed form solution for Ridge regression or using second-order optimization [Bishop, 2006]. If the target regression head $k^{*}$ is a linear regression model, $\mathcal{T}^{\mathrm{lin}}_{0}(\mathcal{D}_{s},\mathcal{D}_{t})$ with $\lambda=0$ is the negative MSE of the transferred target model $(w^{*},k^{*})$ on $\mathcal{D}_{t}$ . If $k^{*}$ has more than one layer with a non-linear activation, $\mathcal{T}^{\mathrm{lin}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})$ can be regarded as using a regularized linear model to approximate this non-linear head.
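The closed-form computation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' released implementation. The function name `linear_mse_score` is hypothetical, and the features are assumed to be already extracted into an array of shape $(n_{t}, d_{r})$. Since the intercept $b$ is not regularized, the sketch centers the features and labels first. It then solves an augmented least-squares system whose squared residual equals the regularized objective, which also covers $\lambda=0$ with rank-deficient features.

```python
import numpy as np

def linear_mse_score(features, targets, lam=1.0):
    """Negative regularized MSE of a ridge head with an unpenalized intercept."""
    X = np.asarray(features, dtype=float)
    Y = np.asarray(targets, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]
    n, p = X.shape
    # The optimal intercept is absorbed by centering both features and labels.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Augmented system: its squared residual is (1/n)||Yc - Xc W||^2 + lam ||W||_F^2,
    # where W has shape (p, d_t) and predictions are computed as Xc @ W.
    aug_X = np.vstack([Xc / np.sqrt(n), np.sqrt(lam) * np.eye(p)])
    aug_Y = np.vstack([Yc / np.sqrt(n), np.zeros((p, Y.shape[1]))])
    W, *_ = np.linalg.lstsq(aug_X, aug_Y, rcond=None)
    objective = np.sum((aug_X @ W - aug_Y) ** 2)
    return -float(objective)

# Toy usage: random stand-ins for extracted features and 2-D target labels.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(100, 16))
toy_targets = rng.normal(size=(100, 2))
print(linear_mse_score(toy_features, toy_targets, lam=1.0))
```

Scores are always non-positive, and a score closer to zero indicates that the frozen features explain the target labels better under a linear head.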

4.2 Label MSE estimator

Although the Linear MSE transferability score above can be computed efficiently, this computation may still be relatively expensive if the feature vectors $w^{*}(x^{t}_{i})$ are high-dimensional. To further reduce the costs, we propose another transferability estimator, Label MSE, which replaces $w^{*}(x^{t}_{i})$ by the “dummy” source label $z_{i}=h^{*}(w^{*}(x^{t}_{i}))$ . Using dummy labels from the pre-trained source model $(w^{*},h^{*})$ is a technique previously used to compute the LEEP transferability score for classification [Nguyen et al., 2020]. We define our Label MSE estimator below.

Definition 4.2.

The Label MSE transferability estimator with a regularization parameter $\lambda\geq 0$ is: $\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t}):=-\min_{A,b}\big\{\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\|y^{t}_{i}-Az_{i}-b\|^{2}+\lambda\|A\|_{F}^{2}\big\}$, where $A\in\mathbb{R}^{d_{t}\times d_{s}}$ is a $d_{t}\times d_{s}$ real-valued matrix (so that $Az_{i}\in\mathbb{R}^{d_{t}}$), $b\in\mathbb{R}^{d_{t}}$ is a $d_{t}$-dimensional real-valued vector, and $z_{i}=h^{*}(w^{*}(x^{t}_{i}))$.

In practice, since the size of $z_{i}$ is usually much smaller than that of $w^{*}(x^{t}_{i})$ (i.e., $d_{s}\ll d_{r}$ ), computing the Label MSE is usually faster than computing the Linear MSE.
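The following sketch illustrates the Label MSE pipeline on synthetic data. Everything here is an assumption made for illustration: the random "source model" (`extract_features` for $w^{*}$ and `source_head` for $h^{*}$), the helper `ridge_score`, and the data itself. With $\lambda>0$ the ridge problem can be solved from the normal equations, whose size is $d_{s}\times d_{s}$ for dummy labels versus $d_{r}\times d_{r}$ for features.

```python
import numpy as np

def ridge_score(inputs, targets, lam):
    """Negative ridge objective with an unpenalized intercept (requires lam > 0)."""
    X = inputs - inputs.mean(axis=0)
    Y = targets - targets.mean(axis=0)
    n, p = X.shape
    # Normal equations of the centered problem; the system is p x p.
    W = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ Y / n)
    return -float(np.sum((Y - X @ W) ** 2) / n + lam * np.sum(W ** 2))

rng = np.random.default_rng(0)
n_t, d, d_r, d_s, d_t = 200, 10, 64, 2, 2

# Hypothetical stand-ins for a trained source model (w*, h*).
W_feat = rng.normal(size=(d, d_r))
W_head = rng.normal(size=(d_r, d_s)) / np.sqrt(d_r)

def extract_features(x):
    return np.maximum(x @ W_feat, 0.0)  # plays the role of w*

def source_head(f):
    return f @ W_head  # plays the role of h*

x_t = rng.normal(size=(n_t, d))
y_t = x_t[:, :d_t] + 0.1 * rng.normal(size=(n_t, d_t))

features = extract_features(x_t)      # shape (n_t, d_r)
dummy_labels = source_head(features)  # shape (n_t, d_s), the z_i

label_mse = ridge_score(dummy_labels, y_t, lam=1.0)  # solves a d_s x d_s system
linear_mse = ridge_score(features, y_t, lam=1.0)     # solves a d_r x d_r system
print(label_mse, linear_mse)
```

The two scores are generally different, since the dummy labels compress the features into only $d_{s}$ numbers per example.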

$\bullet$ Special case with shared inputs. When the source and target datasets have the same inputs, i.e., ${\mathcal{D}_{s}=\{(x_{i},y^{s}_{i})\}_{i=1}^{n}}$ and $\mathcal{D}_{t}=\{(x_{i},y^{t}_{i})\}_{i=1}^{n}$ , we can compute the Label MSE even faster using only the true labels. Particularly, we can consider the following version of the Label MSE.

Definition 4.3.

The Shared Inputs Label MSE transferability estimator with a regularization parameter $\lambda\geq 0$ is: $\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t}):=-\min_{A,b}\Big\{\frac{1}{n}\sum_{i=1}^{n}\|y^{t}_{i}-Ay^{s}_{i}-b\|^{2}+\lambda\|A\|_{F}^{2}\Big\}$, where $A\in\mathbb{R}^{d_{t}\times d_{s}}$ (so that $Ay^{s}_{i}\in\mathbb{R}^{d_{t}}$) and $b\in\mathbb{R}^{d_{t}}$.

In this definition, the Shared Inputs Label MSE is computed by training a Ridge regression model directly from the true label pairs $(y^{s}_{i},y^{t}_{i})$ , which is less expensive than the original Label MSE since we do not need to train the source model $(w^{*},h^{*})$ or compute the dummy labels.

Intuitively, our estimators use a weaker version of the actual target model that helps trade off the estimators’ accuracy for computational speed. Our estimators can also be viewed as instances of the kernel Ridge regression approach [Smale and Zhou, 2007, Hastie et al., 2009]. While the Linear MSE can be interpreted as a linear approximation to $-\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})$ , properties of the Label MSE and Shared Inputs Label MSE are not well understood. In the next section, we shall prove novel theoretical properties for these estimators.

5 Theoretical properties

We now prove some theoretical properties for the Label MSE with ReLU feed-forward neural networks. These properties are in the form of generalization bounds relating $\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})$ with the transferability $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})$ . Throughout this section, we assume the space of all target regression heads $k$ , which may have more than one layer, is a superset of all the linear regression models. This assumption is generally true for ReLU networks [Arora et al., 2018].

First, we show in Lemma 5.1 below a relationship between the negative MSE loss $-\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})$ of $(w^{*},k^{*})$ and the Label MSE. This lemma states that the negative MSE loss $-\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})$ upper bounds the Label MSE. The proof for this lemma is in Appendix A.1.

Lemma 5.1.

For any $\lambda\geq 0$, we have: $\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})\leq-\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})$.

Using this lemma, we can prove our main theoretical result in Theorem 5.2 below. In this theorem, $L$ is the number of layers of the ReLU feed-forward neural network $(w^{*},k^{*})$, and we assume the number of hidden nodes and parameters in each layer are upper bounded by $H$ and $M\geq 1$ respectively. Without loss of generality, we also assume all input and output data are upper bounded by $1$ in $\ell_{\infty}$-norm. This assumption can easily be satisfied by a pre-processing step that scales them into $[0,1]$, which bounds their $\ell_{\infty}$-norm by $1$.

Theorem 5.2.

For any source dataset $\mathcal{D}_{s}$, $\lambda\geq 0$ and $\delta>0$, with probability at least $1-\delta$ over the randomness of $\mathcal{D}_{t}$, we have: $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})\geq\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})-C(d,d_{t},M,H,L,\delta)/\sqrt{n_{t}}$, where $C(d,d_{t},M,H,L,\delta)=16M^{2L+2}H^{2L}[d_{t}^{2}d\sqrt{L+1+\ln d}+d_{t}d^{2}\sqrt{2\ln(4/\delta)}]$.

The proof for this theorem is in Appendix A.2. The theorem shows that the transferability $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})$ is lower bounded by the Label MSE $\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})$ minus a complexity term $C(d,d_{t},M,H,L,\delta)/\sqrt{n_{t}}$ that depends on the target dataset (specifically, the input and output dimensions, as well as the dataset size) and the architecture of the target network. When this complexity term is small (e.g., when $n_{t}$ is large enough), the bound in Theorem 5.2 will be tighter. In this case, a higher Label MSE score will likely lead to better transferability.
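For reference, the constant and the resulting lower bound in Theorem 5.2 can be evaluated directly from the stated formula. The sketch below is only a transcription of that formula. The function names and the toy argument values are illustrative assumptions and do not correspond to any network used in the experiments.

```python
import math

def complexity_constant(d, d_t, M, H, L, delta):
    """C(d, d_t, M, H, L, delta) as stated in Theorem 5.2."""
    scale = 16.0 * M ** (2 * L + 2) * H ** (2 * L)
    bracket = (d_t ** 2 * d * math.sqrt(L + 1 + math.log(d))
               + d_t * d ** 2 * math.sqrt(2.0 * math.log(4.0 / delta)))
    return scale * bracket

def transferability_lower_bound(label_mse, n_t, d, d_t, M, H, L, delta):
    """Right-hand side of Theorem 5.2: Label MSE minus C / sqrt(n_t)."""
    return label_mse - complexity_constant(d, d_t, M, H, L, delta) / math.sqrt(n_t)

# Toy values: the gap between the score and the bound shrinks as n_t grows.
for n_t in (10**2, 10**4, 10**6):
    print(n_t, transferability_lower_bound(-0.05, n_t, d=4, d_t=2, M=1, H=2, L=2, delta=0.05))
```

Because $C$ scales as $M^{2L+2}H^{2L}$, it grows quickly with the depth and width of the network, so the $1/\sqrt{n_{t}}$ factor needs a correspondingly large target set to make the complexity term small.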

$\bullet$ Shared inputs case. We can also derive similar bounds for the Shared Inputs Label MSE $\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})$. Denote $A^{*}_{\lambda},b^{*}_{\lambda}:=\operatorname*{argmin}_{A,b}\big\{\frac{1}{n}\sum_{i}\|y^{t}_{i}-Ay^{s}_{i}-b\|^{2}+\lambda\|A\|_{F}^{2}\big\}$. We first show the following lemma relating $\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})$ and the losses of the source and target models.

Lemma 5.3.

For any $\lambda\geq 0$, we have: $\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})\leq-\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})/2+\|A^{*}_{\lambda}\|_{F}^{2}\mathcal{L}(w^{*},h^{*};\mathcal{D}_{s})$.

Using this lemma, we can prove the following theorem for this shared inputs setting. The proofs for these results are in Appendix A.3.

Theorem 5.4.

For any source dataset $\mathcal{D}_{s}$, $\lambda\geq 0$ and $\delta>0$, with probability at least $1-\delta$ over the randomness of $\mathcal{D}_{t}$, we have: $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})\geq 2\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})-2\|A^{*}_{\lambda}\|_{F}^{2}\mathcal{L}(w^{*},h^{*};\mathcal{D}_{s})-C(d,d_{t},M,H,L,\delta)/\sqrt{n}$.

From the theorem, $\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})$ can indirectly tell us information about the transferability $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})$ without actually training $w^{*}$, $h^{*}$, and $k^{*}$. This bound becomes tighter when $n$ is large or $\mathcal{L}(w^{*},h^{*};\mathcal{D}_{s})$ is small (e.g., when the source model is expressive enough to fit the source data). An experiment to investigate the usefulness of our theoretical bounds in this section is available in Appendix A.4.

6 Experiments

In this section, we conduct experiments to evaluate our approaches on the keypoint (or landmark) regression tasks using the following two large-scale public datasets:

$\bullet$ CUB-200-2011 [Wah et al., 2011]. This dataset contains 11,788 bird images with 15 labeled keypoints indicating 15 different parts of a bird body. We use 9,788 images for training and 2,000 images for testing. Since the annotations for occluded keypoints are highly inaccurate, we remove all occluded keypoints during the training for both source and target tasks.

$\bullet$ OpenMonkey [Yao et al., 2021]. This is a benchmark for the non-human pose tracking problem. It offers over 100,000 monkey images in natural contexts, annotated with 17 body landmarks. We use the original train-test split, which contains 66,917 training images and 22,306 testing images.

In our experiments, we use ResNet34 [He et al., 2016] as the backbone since it provides good performance as a source model. Following previous work [Tran et al., 2019, Nguyen et al., 2020, Huang et al., 2022, Nguyen et al., 2022], we investigate how well our transferability estimators correlate (using Pearson correlation) with the negative test MSE of the target model obtained from actual transfer learning. This correlation analysis is a good method to measure how well transferability estimators satisfy our Definition 3.2. In Tables C.1 and C.2 in Appendix C.2, we provide additional results for rank-based correlation measures, including Kendall's $\tau$ and Spearman correlations. The conclusions in our paper remain the same when comparing these correlations.
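The correlation analysis described above can be sketched as follows. The scores and test MSEs below are made-up placeholders rather than results from the paper, and the helper names are hypothetical. The Spearman helper is a simplified version that assumes there are no tied values.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman correlation as Pearson on ranks (assumes no ties)."""
    def rank(v):
        return np.argsort(np.argsort(v))
    return pearson(rank(a), rank(b))

# Placeholder values: one transferability score and one test MSE per transfer pair.
scores = np.array([-0.90, -0.40, -0.65, -0.20, -0.75])
test_mse = np.array([0.050, 0.021, 0.034, 0.015, 0.044])

# Estimators are compared against the NEGATIVE test MSE of the transferred model.
print(pearson(scores, -test_mse), spearman(scores, -test_mse))
```

A coefficient close to 1 means that ranking transfer pairs by the estimator closely matches ranking them by actual test performance.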

We consider three standard transfer learning algorithms: (1) head re-training [Donahue et al., 2014, Sharif Razavian et al., 2014]: We fix all layers of the source model up until the penultimate layer and re-train the last fully-connected (FC) layer using the target training set; (2) half fine-tuning [Donahue et al., 2014, Sharif Razavian et al., 2014]: We fine-tune the last convolutional block and all the FC layers of the source model, while keeping all other layers fixed; and (3) full fine-tuning [Agrawal et al., 2014, Girshick et al., 2014]: We fine-tune the whole source model using the target training set. Among these settings, head re-training resembles the transfer scenario in Section 3.1, while half and full fine-tuning are more commonly used in practice. For half fine-tuning, around half of the parameters in the network will be fine-tuned ($\sim$13M parameters). More details of our experiment settings are in Appendix B.1.

We compare our transferability estimators, Linear MSE and Label MSE, with two recent SotA baselines for regression: LogME [You et al., 2021] and TransRate [Huang et al., 2022]. For our methods, we consider $\lambda=0$ (named LinMSE0 and LabMSE0) for the estimators without regularization, and $\lambda=1$ (named LinMSE1 and LabMSE1) for the estimators with the default $\lambda$ value. The effects of $\lambda$ on our algorithms are investigated in Section 6.6.

For the baselines, besides the usual versions (LogME and TransRate) that are computed from the extracted features and the target labels, we also consider the versions where they are computed from the dummy labels and the target labels (named LabLogME and LabTransRate). As in previous work [Huang et al., 2022], we divide the target label values into equal-sized bins (five bins in our case) to compute TransRate and LabTransRate.

6.1 General transfer between two different domains

Table 1: Correlation coefficients when transferring from OpenMonkey to CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. Detailed correlation plots are in Appendix C.2. Our estimators improve up to 25.9% in comparison with SotA (LogME) while being 12.9% better on average.

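Label-based methods: LabLogME, LabTransRate, LabMSE0, LabMSE1. Feature-based methods: LogME, TransRate, LinMSE0, LinMSE1.

| Transfer setting | LabLogME | LabTransRate | LabMSE0 | LabMSE1 | LogME | TransRate | LinMSE0 | LinMSE1 |
|---|---|---|---|---|---|---|---|---|
| Head re-training | 0.824 | 0.165 | 0.991 | 0.995* | 0.969 | 0.121 | 0.982 | 0.995* |
| Half fine-tuning | 0.706 | 0.392 | 0.881 | 0.885* | 0.870 | 0.304 | 0.866 | 0.885* |
| Full fine-tuning | 0.691 | 0.410 | 0.870* | 0.869 | 0.861 | 0.311 | 0.855 | 0.869* |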

This experiment considers the general case where source models are trained on one dataset (OpenMonkey) and then transferred to another (CUB-200-2011). Specifically, we train a source model for each of the 17 keypoints of the OpenMonkey dataset and transfer them to each of the 15 keypoints of the CUB-200-2011 dataset, resulting in a total of 255 final models. Since each keypoint consists of x and y positions, all source and target tasks in this experiment have two-dimensional labels. The actual MSEs of these models are computed on the respective test sets and then used to calculate the Pearson correlation coefficients with the transferability estimators. In this experiment, LabMSE0, LabMSE1, LabLogME, and LabTransRate are computed from the dummy source labels and the actual target labels.
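The evaluation step can be sketched as follows. The numbers below are hypothetical and only illustrate the computation. The sign convention (correlating scores with the negative test MSE so that a good estimator gives a coefficient near +1) is an assumption, since the text does not spell it out.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical values for four source-target pairs (not taken from the paper).
scores = [-0.10, -0.25, -0.40, -0.80]   # transferability estimates
test_mse = [0.12, 0.20, 0.45, 0.70]     # test MSE of the transferred models

# Assumption: correlate scores with the negative test MSE.
corr = pearson(scores, [-m for m in test_mse])
```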

Results for this experiment are in Table 1. In this setting, TransRate and LabTransRate perform poorly, while our methods are equal to or better than LogME and LabLogME in most cases, especially when using $\lambda=1$ (LinMSE1) or dummy labels (LabMSE0 and LabMSE1). The results show that our approaches improve by up to 25.9% over the SotA (LogME) while being 12.9% better on average.

It is interesting to observe that LabMSE0 and LabMSE1 provide competitive or even better correlations than LinMSE0 and LinMSE1 in this experiment. This shows that the dummy labels (i.e., body parts of monkeys) can provide as much information about the target labels (i.e., body parts of birds) as the extracted features.

In the Appendix C.2, we also report additional results where both source and target tasks have 10-dimensional labels (i.e., each task predicts 5 keypoints simultaneously). We also achieve better correlations than the baselines in this case.

6.2 Transfer with shared-inputs tasks

Table 2: Correlation coefficients when transferring between tasks with shared inputs. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. Detailed correlation plots are in the Appendix C.3. Our estimators improve up to 113% in comparison with SotA (LogME) while being 36.6% better on average.

Dataset	Transfer setting	Label-based method				Feature-based method
Dataset	Transfer setting	LabLogME	LabTransRate	LabMSE0	LabMSE1	LogME	TransRate	LinMSE0	LinMSE1
CUB-200-2011	Head re-training	0.547	0.019	0.916	0.946*	0.890	0.029	0.921	0.960*
	Half fine-tuning	0.401	0.006	0.536	0.565*	0.560	0.064	0.628*	0.619
	Full fine-tuning	0.128*	0.041	0.056	0.057	0.100	0.109*	0.097	0.082
Open Monkey	Head re-training	0.890	0.666	0.973*	0.773	0.695	0.711	0.946	0.975*
	Half fine-tuning	0.615	0.340	0.754	0.890*	0.446	0.488	0.899*	0.801
	Full fine-tuning	0.569	0.269	0.705	0.882*	0.403	0.439	0.859*	0.761

In this experiment, we consider the setting where the source and target tasks have the same inputs (the special setting in Section 4.2). Since images in our datasets contain multiple labels (15 keypoints for CUB-200-2011 and 17 keypoints for OpenMonkey), we can use any two different keypoints on the same dataset as source and target tasks. In total, we construct 210 source-target pairs for CUB-200-2011 and 272 pairs for OpenMonkey that all have the same source and target inputs but different labels. The labels for all tasks are also two-dimensional real values.
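The pair counts follow from taking every ordered (source, target) pair of distinct keypoints. A short sketch, with integer indices standing in for keypoint names:

```python
from itertools import permutations


def ordered_pairs(num_keypoints):
    """All ordered (source, target) pairs with source != target."""
    return list(permutations(range(num_keypoints), 2))


# Keypoint counts as stated in the text.
for name, k in {"CUB-200-2011": 15, "OpenMonkey": 17}.items():
    print(name, len(ordered_pairs(k)))  # 15 * 14 = 210 and 17 * 16 = 272
```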

We repeat the experiment in Section 6.1 with these source-target pairs for CUB-200-2011 and OpenMonkey separately. The main difference in this experiment is that we use the true source labels (instead of dummy labels) when computing LabLogME, LabTransRate, LabMSE0, and LabMSE1. Under this setting, the LabMSE estimators are the Shared Inputs Label MSE estimators in Definition 4.3. These estimators can be computed without any source models, and thus incur very low computational costs.

Results for these experiments are in Table 2. In the results, both versions of TransRate perform poorly on CUB-200-2011, while TransRate is slightly better than LogME on OpenMonkey. In most settings, LabMSE0 and LabMSE1 both outperform LabLogME and LabTransRate, while LinMSE0 and LinMSE1 both outperform LogME and TransRate. In the setting where we transfer by full fine-tuning on the CUB-200-2011 dataset, all methods perform poorly. From these results, our approaches improve up to 113% in comparison with SotA (LogME) while being 36.6% better on average.

We also report in the Appendix C.3 additional results for each individual source task. The results show that our methods are consistently better than LogME, LabLogME, TransRate, and LabTransRate for most source tasks on both datasets. Furthermore, our methods are also better than these baselines when transferring to higher dimensional target tasks (tasks that predict 5 keypoints simultaneously and have 10-dimensional labels). These additional results further confirm the effectiveness of our approaches.

6.3 Evaluations on small target sets

Figure 1: Correlation coefficients with small target training sets on CUB-200-2011 (left) and OpenMonkey (right). LinMSE1 and LogME are designed to avoid overfitting, but LinMSE1 is better than LogME on both datasets.

In many real-world transfer learning scenarios, the target set is small. This experiment evaluates the effectiveness of the feature-based transferability estimators (LogME, TransRate, LinMSE0, and LinMSE1) in this small data regime, where the number of samples is smaller than the feature dimension. For this experiment, we fix a source task (Belly for CUB-200-2011 and Right eye for OpenMonkey) and transfer to all other tasks in the corresponding dataset using head re-training. These source tasks are chosen since they have fewer missing labels and thus can be used to train reasonably good source models for transfer learning. For each target task, instead of using the full data, we randomly select a small subset of 100 to 400 images to perform transfer learning and to compute the transferability scores. The actual MSEs of the transferred models are still computed using the full target test sets.

Figure 1 compares the correlations of the 4 methods on different target set sizes between 100 and 400. The results are averaged over 10 runs with 10 different random seeds. From the figure, LogME and LinMSE1 are better than TransRate and LinMSE0. This is expected since LogME and LinMSE1 are designed to avoid overfitting on small data. Both LogME and LinMSE1 are also more stable, but LinMSE1 is slightly better than LogME on all dataset sizes.
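One plausible way to see why an unregularized estimator struggles in this regime: when the number of samples does not exceed the feature dimension, a linear fit with an intercept can generically interpolate any labels, so its training MSE carries little information. The sketch below uses synthetic Gaussian features and random labels. The sizes (100 samples, 512 dimensions) are hypothetical stand-ins, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 512                  # fewer samples than feature dimensions
Z = rng.normal(size=(n, d))      # stand-in for extracted features
Y = rng.normal(size=(n, 2))      # random labels, unrelated to Z

# Unregularized linear fit with intercept (append a column of ones).
Z1 = np.hstack([Z, np.ones((n, 1))])
W, *_ = np.linalg.lstsq(Z1, Y, rcond=None)

# Training MSE is numerically zero even though the labels are pure noise.
train_mse = np.mean(np.sum((Y - Z1 @ W) ** 2, axis=1))
```

A positive $\lambda$ penalizes the large weights needed for such interpolation, which is consistent with the better behaviour of LinMSE1 reported above.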

6.4 Efficiency of our estimators

One of the main strengths of our methods is their efficiency due to the simplicity of training the Ridge regression head. In this experiment, we first use the settings in Section 6.2 to compare the running time of our methods with that of the baselines on the CUB-200-2011 dataset. Figure 2 (left) reports the results (averaged over 5 runs with different random seeds) for this experiment. From these results, our methods, LabMSE0, LabMSE1, LinMSE0, and LinMSE1, are all faster than the corresponding label-based or feature-based baselines. The figure also shows that LabMSE1 and LinMSE1 achieve the best running time among the label-based and feature-based methods respectively.

In Figure 2 (right), we also compare the average running time of the 4 transferability estimators using the CUB-200-2011 experiment in Section 6.3. This figure clearly shows that our methods, LinMSE0 and LinMSE1, are more computationally efficient than LogME and TransRate. Both results in Figure 2 show that LinMSE1 and LabMSE1 are significantly faster than the other corresponding feature-based and label-based methods. In these experiments, LinMSE1 and LabMSE1 converge faster than LinMSE0 and LabMSE0 respectively, and thus are more efficient.

6.5 Source task selection

Source task selection is important for applying transfer learning since the right source task can improve transfer learning performance [Nguyen et al., 2020]. In this experiment, we examine the application of our transferability estimation methods for selecting source tasks on the CUB-200-2011 dataset. We use the head re-training setting similar to Section 6.2, but fix one of the tasks as the target and choose the best source task from the rest of the task pool. We repeat this process for all 15 target tasks and measure the top- $k$ matching rate of each transferability estimator.

The top- $k$ matching rate is defined as $m_{\text{match}}/m_{\text{target}}$ , where $m_{\text{target}}$ is the total number of target tasks (15 in our case), and $m_{\text{match}}$ is the number of times the selected source task gives a target model within the best $k$ models. Here the best $k$ models are determined by the actual test MSE on the target task.
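The definition above can be sketched as follows. The nested-dictionary layout and the function name `top_k_matching_rate` are illustrative assumptions. The selected source is taken to be the one with the highest transferability score.

```python
def top_k_matching_rate(est_scores, test_mse, k):
    """Fraction of target tasks whose selected source is among the best k.

    est_scores[t][s]: estimator score for source s on target t (higher is better).
    test_mse[t][s]: actual test MSE after transfer from s to t (lower is better).
    """
    matches = 0
    for t in est_scores:
        # Source task chosen by the transferability estimator.
        selected = max(est_scores[t], key=est_scores[t].get)
        # The k sources with the lowest actual test MSE on this target.
        best_k = sorted(test_mse[t], key=test_mse[t].get)[:k]
        matches += selected in best_k
    return matches / len(est_scores)
```

With 15 target tasks, a returned value of 0.8 would correspond to an entry of 12/15 in a table of matching rates.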

Results for this experiment are in Table 3. From the results, our methods are better than the baselines in terms of top- $3$ and top- $5$ matching rates. When comparing top- $1$ matching rates, our methods are competitive with LogME and LabLogME for the feature-based and label-based approaches respectively. This experiment shows that our transferability estimators are useful for source task selection.

Table 3: Top- $k$ matching rates for source task selection on CUB-200-2011. Bold numbers indicate best results in each column. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods.

$k$	Label-based method				Feature-based method
$k$	LabLogME	LabTransRate	LabMSE0	LabMSE1	LogME	TransRate	LinMSE0	LinMSE1
1	6/15*	4/15	6/15*	2/15	11/15*	2/15	9/15	10/15
3	9/15	9/15	10/15*	9/15	12/15	6/15	12/15	13/15*
5	10/15	12/15	14/15*	14/15*	12/15	6/15	12/15	13/15*

6.6 Effects of $\lambda$

Table 4: Correlation coefficients for different values of $\lambda$ on CUB-200-2011. Bold numbers indicate best results in each column. Results of the baselines are given in the last 2 rows for comparison. When there are meaningful correlations (head re-training and half fine-tuning), our methods are better than the corresponding baselines for all $\lambda$ values.

$\lambda$	Head re-training		Half fine-tuning		Full fine-tuning
$\lambda$	LabMSE	LinMSE	LabMSE	LinMSE	LabMSE	LinMSE
0	0.916	0.921	0.536	0.628	0.056	0.097
0.001	0.921	0.933	0.562	0.645	0.051	0.091
0.01	0.922	0.943	0.560	0.643	0.048	0.089
0.1	0.935	0.954	0.552	0.639	0.043	0.089
0.5	0.945	0.960	0.562	0.629	0.053	0.085
1	0.946	0.960	0.565	0.619	0.057	0.082
2	0.945	0.958	0.567	0.607	0.059	0.077
5	0.945	0.954	0.568	0.594	0.061	0.072
10	0.945	0.951	0.568	0.586	0.061	0.069
15	0.945	0.950	0.568	0.582	0.061	0.067
20	0.945	0.949	0.568	0.580	0.061	0.066
(Lab)LogME	0.547	0.889	0.400	0.560	0.120	0.099
(Lab)TransRate	0.008	0.029	0.006	0.006	0.001	0.100

In this experiment, we investigate the effects of $\lambda$ on our proposed transferability estimators. We use the setting in Section 6.2 with the CUB-200-2011 dataset and vary the value of $\lambda$ in [0, 20] for both LabMSE and LinMSE. Table 4 reports the results for all three transfer learning settings.

For head re-training, we observe that the best correlations are achieved at $\lambda=1$ for both LabMSE and LinMSE. For half fine-tuning, $\lambda\geq 5$ gives the best result for LabMSE, while $\lambda=0.001$ gives the best result for LinMSE. For full fine-tuning, we do not observe significant correlations for either transferability estimator.

Notably, from the results in Table 4 for the head re-training and half fine-tuning settings (where we have significant correlations for at least one transferability estimator), LabMSE with any tested $\lambda$ value in [0, 20] is better than LabLogME and LabTransRate, while LinMSE with any tested $\lambda$ value in this range is better than LogME and TransRate. These results show that our methods are better than the baselines for a wide range of $\lambda$ values.

6.7 Beyond regression

Although our paper mainly focuses on regression tasks, the main idea of using the negative regularized MSE of a Ridge regression model for transferability estimation goes beyond regression. In principle, this idea can be applied for transferring between classification tasks (in this case, we should train a linear classifier and use its regularized log-likelihood as the transferability estimator) or between a classification and a regression task.

In this section, we demonstrate that our idea can be applied for transferability estimation between a classification and a regression task. In particular, we use 8 source models pre-trained on ImageNet [Deng et al., 2009] and transfer to a target regression task on the dSprites dataset [Matthey et al., 2017] using full fine-tuning. This setting is similar to You et al. [2021] where the target is a regression task with 4-dimensional labels: x and y positions, scale, and orientation. We compute the transferability scores from the extracted features and the labels of the target training set. More details about this experiment are in Appendix B.2.

From the results in Figure 3, the trends for LogME, LinMSE0, and LinMSE1 are correct (i.e., transferability scores have negative correlations with actual MSEs), while that of TransRate is incorrect. Note that there is a discrepancy between the ranges of the transferability and the transferred MSE for two reasons: (1) the transferability estimators are computed from the target training set, while the transferred MSEs are computed from the target test set, and (2) there is a mismatch between the source task (ImageNet classification) and the target task (dSprites shape regression).

To compare the transferability estimation methods, we fit a linear regression to the points in each plot and compute its RMSE to these points, where we obtain: $6.12\times 10^{-3}$ (LogME), $6.16\times 10^{-3}$ (TransRate), $6.10\times 10^{-3}$ (LinMSE0), and $\textbf{5.46}\times 10^{-3}$ (LinMSE1). These results show that LinMSE0 and LinMSE1 are better than LogME and TransRate.
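The comparison metric can be sketched as below. The text does not state which variable is regressed on which, so this sketch assumes the transferred MSE is regressed on the transferability score. The function name `linear_fit_rmse` is illustrative.

```python
import numpy as np


def linear_fit_rmse(scores, mses):
    """Fit mses ~ slope * scores + intercept and return (slope, RMSE of the fit)."""
    x = np.asarray(scores, dtype=float)
    y = np.asarray(mses, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return float(slope), float(np.sqrt(np.mean(residuals ** 2)))
```

A negative slope matches the expected trend (higher transferability, lower MSE), and a smaller RMSE indicates that the points lie closer to the fitted line.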

7 Conclusion

We formulated transferability estimation for regression tasks and proposed the Linear MSE and Label MSE estimators, two simple but effective approaches for this problem. We proved novel theoretical results for these estimators, showing their relationship with the actual task transferability. Our extensive experiments demonstrated that the proposed approaches are superior to recent, relevant SotA methods in terms of efficiency and effectiveness. Our proposed ideas can also be extended to mixed cases where one of the tasks is a classification problem.

Acknowledgements.

LSTH was supported by the Canada Research Chairs program, the NSERC Discovery Grant RGPIN-2018-05447, and the NSERC Discovery Launch Supplement DGECR-2018-00181. VD was supported by the University of Delaware Research Foundation (UDRF) Strategic Initiatives Grant, and the National Science Foundation Grant DMS-1951474.

References

Achille et al. [2019] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for meta-learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
Agrawal et al. [2014] Pulkit Agrawal, Ross Girshick, and Jitendra Malik. Analyzing the performance of multilayer neural networks for object recognition. In European Conference on Computer Vision, 2014.
Arora et al. [2018] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.
Azizzadenesheli et al. [2019] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations, 2019.
Bao et al. [2019] Yajie Bao, Yang Li, Shao-Lun Huang, Lin Zhang, Lizhong Zheng, Amir Zamir, and Leonidas Guibas. An information-theoretic approach to transferability in task transfer learning. In IEEE International Conference on Image Processing, 2019.
Ben-David and Schuller [2003] Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. Learning theory and kernel machines, 2003.
Bishop [2006] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
Blitzer et al. [2007] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, 2007.
Bolya et al. [2021] Daniel Bolya, Rohit Mittapalli, and Judy Hoffman. Scalable diverse model selection for accessible transfer learning. In Advances in Neural Information Processing Systems, 2021.
Bu et al. [2021] Xingyuan Bu, Junran Peng, Junjie Yan, Tieniu Tan, and Zhaoxiang Zhang. GAIA: A transfer learning system of object detection that fits your needs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
Cai et al. [2020] Enyu Cai, Sriram Baireddy, Changye Yang, Melba Crawford, and Edward J Delp. Deep transfer learning for plant center localization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.
Chatfield et al. [2014] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
Chuang et al. [2020] Ching-Yao Chuang, Antonio Torralba, and Stefanie Jegelka. Estimating generalization under distribution shifts via domain-invariant representations. In International Conference on Machine Learning, 2020.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.
Deng and Zheng [2021] Weijian Deng and Liang Zheng. Are labels always necessary for classifier accuracy evaluation? In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
Deshpande et al. [2021] Aditya Deshpande, Alessandro Achille, Avinash Ravichandran, Hao Li, Luca Zancato, Charless Fowlkes, Rahul Bhotika, Stefano Soatto, and Pietro Perona. A linearized framework and a new benchmark for model selection for fine-tuning. arXiv:2102.00084, 2021.
Dhillon et al. [2020] Guneet S. Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. In International Conference on Learning Representations, 2020.
Doersch and Zisserman [2019] Carl Doersch and Andrew Zisserman. Sim2real transfer learning for 3D human pose estimation: Motion to the rescue. In Advances in Neural Information Processing Systems, 2019.
Donahue et al. [2014] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, 2014.
Dwivedi and Roig [2019] Kshitij Dwivedi and Gemma Roig. Representation similarity analysis for efficient task taxonomy & transfer learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
Dwivedi et al. [2020] Kshitij Dwivedi, Jiahui Huang, Radoslaw Martin Cichy, and Gemma Roig. Duality diagram similarity: A generic framework for initialization selection in task transfer learning. In European Conference on Computer Vision, 2020.
Fard et al. [2021] Ali Pourramezan Fard, Hojjat Abdollahi, and Mohammad Mahoor. ASMNet: A lightweight deep neural network for face alignment and pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021.
Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014.
Golowich et al. [2018] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Annual Conference on Learning Theory, 2018.
Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The elements of statistical learning: Data mining, inference, and prediction, volume 2. Springer, 2009.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
Huang et al. [2021] Jiaji Huang, Qiang Qiu, and Kenneth Church. Exploiting a zoo of checkpoints for unseen tasks. In Advances in Neural Information Processing Systems, 2021.
Huang et al. [2022] Long-Kai Huang, Junzhou Huang, Yu Rong, Qiang Yang, and Ying Wei. Frustratingly easy transferability estimation. In International Conference on Machine Learning, 2022.
Li et al. [2021] Yandong Li, Xuhui Jia, Ruoxin Sang, Yukun Zhu, Bradley Green, Liqiang Wang, and Boqing Gong. Ranking neural checkpoints. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
Mansour et al. [2009] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Annual Conference on Learning Theory, 2009.
Matthey et al. [2017] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing Sprites dataset, 2017. https://github.com/deepmind/dsprites-dataset/.
Nguyen et al. [2022] Cuong N Nguyen, Lam Si Tung Ho, Vu Dinh, Tal Hassner, and Cuong V Nguyen. Generalization bounds for deep transfer learning using majority predictor accuracy. In International Symposium on Information Theory and Its Applications, 2022.
Nguyen et al. [2020] Cuong V Nguyen, Tal Hassner, Matthias Seeger, and Cedric Archambeau. LEEP: A new measure to evaluate transferability of learned representations. In International Conference on Machine Learning, 2020.
Oquab et al. [2014] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014.
Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
Poster et al. [2021] Domenick D Poster, Shuowen Hu, Nathan J Short, Benjamin S Riggan, and Nasser M Nasrabadi. Visible-to-thermal transfer learning for facial landmark detection. IEEE Access, 2021.
Radhakrishnan et al. [2022] Adityanarayanan Radhakrishnan, Max Ruiz Luyten, Neha Prasad, and Caroline Uhler. Transfer learning with kernel methods. arXiv:2211.00227, 2022.
Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, 2021.
Razavi et al. [2019] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, 2019.
Schwarz et al. [2015] Max Schwarz, Hannes Schulz, and Sven Behnke. RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In IEEE International Conference on Robotics and Automation, 2015.
Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
Sharif Razavian et al. [2014] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2014.
Smale and Zhou [2007] Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.
Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015.
Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
Tan et al. [2018] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, 2018.
Tan et al. [2021] Yang Tan, Yang Li, and Shao-Lun Huang. OTCE: A transferability metric for cross-domain cross-task representations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
Tong et al. [2021] Xinyi Tong, Xiangxiang Xu, Shao-Lun Huang, and Lizhong Zheng. A mathematical framework for quantifying transferability in multi-source transfer learning. In Advances in Neural Information Processing Systems, 2021.
Tran et al. [2019] Anh T Tran, Cuong V Nguyen, and Tal Hassner. Transferability and hardness of supervised classification tasks. In IEEE/CVF International Conference on Computer Vision, 2019.
Tripuraneni et al. [2020] Nilesh Tripuraneni, Michael Jordan, and Chi Jin. On the theory of transfer learning: The importance of task diversity. In Advances in Neural Information Processing Systems, 2020.
Wah et al. [2011] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report, 2011. https://authors.library.caltech.edu/27452/.
Wang et al. [2019] Boyu Wang, Jorge Mendez, Mingbo Cai, and Eric Eaton. Transfer learning via minimizing the performance gap between domains. In Advances in Neural Information Processing Systems, 2019.
Whatmough et al. [2019] Paul N Whatmough, Chuteng Zhou, Patrick Hansen, Shreyas Kolala Venkataramanaiah, Jae-sun Seo, and Matthew Mattina. FixyNN: Efficient hardware for mobile computer vision via transfer learning. In Conference on Systems and Machine Learning, 2019.
Yao et al. [2021] Yuan Yao, Abhiraj Abhiraj Mohan, Eliza Bliss-Moreau, Kristine Coleman, Sienna M Freeman, Christopher J Machado, Jessica Raper, Jan Zimmermann, Benjamin Y Hayden, and Hyun Soo Park. OpenMonkeyChallenge: Dataset and Benchmark Challenges for Pose Tracking of Non-human Primates. bioRxiv, 2021. http://openmonkeychallenge.com/.
You et al. [2021] Kaichao You, Yong Liu, Jianmin Wang, and Mingsheng Long. LogME: Practical assessment of pre-trained models for transfer learning. In International Conference on Machine Learning, 2021.
Zamir et al. [2018] Amir R. Zamir, Alexander Sax, William B. Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.

Simple Transferability Estimation for Regression Tasks
(Supplementary Material)

The contents of this supplementary include:

1. Appendix A.1: Proof of Lemma 5.1 in the main paper.
2. Appendix A.2: Proof of Theorem 5.2 in the main paper.
3. Appendix A.3: Proof of Lemma 5.3 in the main paper.
4. Appendix A.4: Proof of Theorem 5.4 in the main paper.
5. Appendix B.1: More details for the experiment settings in Sections 6.1–6.6 of the main paper.
6. Appendix B.2: More details for the experiment setting in Section 6.7 of the main paper.
7. Appendix C.1: An additional experiment to show the usefulness of our theoretical bounds.
8. Appendix C.2: Additional experiment results for Section 6.1 of the main paper.
9. Appendix C.3: Additional experiment results for Section 6.2 of the main paper.

Appendix A Mathematical proofs

A.1 Proof of Lemma 5.1

Denote $\displaystyle A^{*},b^{*}=\operatorname*{argmin}_{A,b}\left\{\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}{\|y^{t}_{i}-Az_{i}-b\|^{2}}+\lambda\|A\|_{F}^{2}\right\}.$

For all $k$ , we have:

$\displaystyle\sqrt{\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})}$	$\displaystyle\leq\sqrt{\mathcal{L}(w^{*},k;\mathcal{D}_{t})}$	(definition of $k^{*}$ )
	$\displaystyle=\left[\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\|y^{t}_{i}-k(w^{*}(x^{t}_{i}))\|^{2}\right]^{1/2}$	(definition of $\mathcal{L}$ )
	$\displaystyle\leq\left[\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\|y^{t}_{i}-A^{*}z_{i}-b^{*}\|^{2}\right]^{1/2}+\left[\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\|A^{*}z_{i}+b^{*}-k(w^{*}(x^{t}_{i}))\|^{2}\right]^{1/2}$	(triangle inequality)
	$\displaystyle\leq\sqrt{-\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})}+\left[\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\|A^{*}z_{i}+b^{*}-k(w^{*}(x^{t}_{i}))\|^{2}\right]^{1/2}$
	$\displaystyle=\sqrt{-\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})}+\left[\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\|A^{*}h^{*}(w^{*}(x^{t}_{i}))+b^{*}-k(w^{*}(x^{t}_{i}))\|^{2}\right]^{1/2}.$	(definition of $z_{i}$ )

By choosing $k(\cdot)=A^{*}h^{*}(\cdot)+b^{*}$ , the second term in the above inequality becomes 0. This implies $\sqrt{\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})}\leq\sqrt{-\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})}$ and thus the lemma.
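The fourth line of the chain above, which carries no annotation, drops the non-negative penalty term. As a sketch, assuming $\mathcal{T}^{\mathrm{lab}}_{\lambda}$ is the negative of the minimized regularized objective attained at $(A^{*},b^{*})$, the step reads:

```latex
\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\|y^{t}_{i}-A^{*}z_{i}-b^{*}\|^{2}
\;\leq\;
\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\|y^{t}_{i}-A^{*}z_{i}-b^{*}\|^{2}+\lambda\|A^{*}\|_{F}^{2}
\;=\;
-\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t}).
```

The inequality holds because $\lambda\geq 0$, and taking square roots on both sides gives the first term of the bound.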

A.2 Proof of Theorem 5.2

First, we need to define the notion of expected (true) risk. Given any model $(w,k)$ for the target task, the expected risk of $(w,k)$ is defined as:

\mathcal{R}(w,k):=\mathbb{E}_{(x^{t},y^{t})\sim\mathbb{P}_{t}}\left\{\|y^{t}-k(w(x^{t}))\|^{2}\right\}.	(4)

Note that $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})=-\mathcal{R}(w^{*},k^{*})$ . We prove the uniform bound in Lemma A.1 below that can help us prove Theorem 5.2.

Lemma A.1.

For any $\delta>0$ , with probability at least ${1-\delta}$ , for all ReLU feed-forward neural networks $(w,k)$ of the target task, we have:

|\mathcal{R}(w,k)-\mathcal{L}(w,k;\mathcal{D}_{t})|\leq C(d,d_{t},M,H,L,\delta)/\sqrt{n_{t}}.

Proof.

We recall the definition of Rademacher complexity. Given a real-valued function class $\mathcal{G}$ and a set of data points $\mathcal{D}=\{u_{i}\}_{i=1}^{n}$ , the (empirical) Rademacher complexity $\widehat{R}_{\mathcal{D}}(\mathcal{G})$ is defined as:

\widehat{R}_{\mathcal{D}}(\mathcal{G})=\mathbb{E}_{\epsilon}\left[\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^{n}{\epsilon_{i}g(u_{i})}\right],

where $\epsilon=(\epsilon_{1},\epsilon_{2},\ldots,\epsilon_{n})$ is a vector uniformly distributed in $\{-1,+1\}^{n}$ .

In our setting, the hypothesis space $\Phi$ is the class of $L$ -layer ReLU feed-forward neural networks whose number of hidden nodes and parameters in each layer are bounded from above by $H$ and $M\geq 1$ respectively. For all $(w,k)\in\Phi$ and $x$ such that $\|x\|_{\infty}\leq 1$ , we have:

\|k(w(x))\|_{\infty}\leq dM^{L+1}H^{L}.

Define $f_{w,k}(x,y)=y-k(w(x))$ and note that $f_{w,k}(x,y)\in\mathbb{R}^{d_{t}}$ . For any $j=1,2,\ldots,d_{t}$ , let $[\cdot]_{j}$ be the projection map to the $j$ -th coordinate. We consider the following real-valued function classes:

	$\displaystyle\mathcal{F}$	$\displaystyle=\{\|f_{w,k}\|^{2}:(w,k)\in\Phi\},$
	$\displaystyle\mathcal{F}_{j}$	$\displaystyle=\{[f_{w,k}]_{j}:(w,k)\in\Phi\},$
	$\displaystyle\Phi_{j}$	$\displaystyle=\{[k(w(\cdot))]_{j}:(w,k)\in\Phi\},$

where each element of $\mathcal{F}$ or $\mathcal{F}_{j}$ is a function with variables $(x,y)$ , and each element of $\Phi_{j}$ is a function with variable $x$ . Let $\mathcal{D}^{x}_{t}=\{x^{t}_{i}\}_{i=1}^{n_{t}}$ be the set of target inputs. By Theorem 2 of Golowich et al. [2018], for all $j=1,2,\ldots,d_{t}$ , we have:

\widehat{R}_{\mathcal{D}^{x}_{t}}(\Phi_{j})\leq 2d_{t}M^{L+1}H^{L}\sqrt{\frac{% L+1+\ln d}{n_{t}}}.

We note that for any $i=1,2,\ldots,n_{t}$ , the function $r_{i}(a)=(a-y_{i}^{t})^{2}$ mapping from ${a\in[-dM^{L+1}H^{L},dM^{L+1}H^{L}]}$ to $\mathbb{R}$ is Lipschitz with constant $4dM^{L+1}H^{L}$ . Thus, applying the Contraction Lemma (Lemma 26.9 in Shalev-Shwartz and Ben-David [2014]), we obtain:

\widehat{R}_{\mathcal{D}_{t}}(\mathcal{F}_{j})\leq 4dM^{L+1}H^{L}\widehat{R}_{% \mathcal{D}^{x}_{t}}(\Phi_{j})\leq 8dd_{t}M^{2L+2}H^{2L}\sqrt{\frac{L+1+\ln d}% {n_{t}}}.

Therefore,

\widehat{R}_{\mathcal{D}_{t}}(\mathcal{F})\leq\sum_{j=1}^{d_{t}}\widehat{R}_{% \mathcal{D}_{t}}(\mathcal{F}_{j})\leq 8dd_{t}^{2}M^{2L+2}H^{2L}\sqrt{\frac{L+1% +\ln d}{n_{t}}}.

Using this inequality, the result of Lemma A.1 follows from Theorem 26.5 in Shalev-Shwartz and Ben-David [2014]. ∎

To prove Theorem 5.2, we apply Lemma 5.1 in the main paper and Lemma A.1 above for the transferred target model $(w^{*},k^{*})$ . Thus, for any $\lambda\geq 0$ and $\delta>0$ , with probability at least $1-\delta$ , we have:

	$\displaystyle\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})$	$\displaystyle\leq-\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})$
		$\displaystyle\leq-\mathcal{R}(w^{*},k^{*})+C(d,d_{t},M,H,L,\delta)/\sqrt{n_{t}}$
		$\displaystyle=\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})+C(d,d_{t},M,H,L,\delta)/\sqrt{n_{t}}.$

Therefore, Theorem 5.2 holds.

A.3 Proof of Lemma 5.3

Note that $\displaystyle A^{*}_{\lambda},b^{*}_{\lambda}=\operatorname*{argmin}_{A,b}% \left\{\frac{1}{n}\sum_{i=1}^{n}\|y^{t}_{i}-Ay^{s}_{i}-b\|^{2}+\lambda\|A\|_{F% }^{2}\right\}.$

For all $k$ , we have:

	$\displaystyle\sqrt{\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})}$	$\displaystyle\leq\sqrt{\mathcal{L}(w^{*},k;\mathcal{D}_{t})}$	(definition of $k^{*}$ )
		$\displaystyle=\left[\frac{1}{n}\sum_{i=1}^{n}\|y^{t}_{i}-k(w^{*}(x_{i}))\|^{2}\right]^{1/2}$	(definition of $\mathcal{L}$ )
		$\displaystyle\leq\left[\frac{1}{n}\sum_{i=1}^{n}\|y^{t}_{i}-A^{*}_{\lambda}y^{s}_{i}-b^{*}_{\lambda}\|^{2}\right]^{1/2}+\left[\frac{1}{n}\sum_{i=1}^{n}\|A^{*}_{\lambda}y^{s}_{i}+b^{*}_{\lambda}-k(w^{*}(x_{i}))\|^{2}\right]^{1/2}$	(triangle inequality)
		$\displaystyle\leq\sqrt{-\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})}+\left[\frac{1}{n}\sum_{i=1}^{n}\|A^{*}_{\lambda}y^{s}_{i}+b^{*}_{\lambda}-k(w^{*}(x_{i}))\|^{2}\right]^{1/2}.$	(definition of $\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}$ )

Picking $k(\cdot)=A^{*}_{\lambda}h^{*}(\cdot)+b^{*}_{\lambda}$ , this inequality becomes:

	$\displaystyle\sqrt{\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})}$	$\displaystyle\leq\sqrt{-\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})}+\left[\frac{1}{n}\sum_{i=1}^{n}\|A^{*}_{\lambda}[y^{s}_{i}-h^{*}(w^{*}(x_{i}))]\|^{2}\right]^{1/2}$
		$\displaystyle\leq\sqrt{-\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})}+\|A^{*}_{\lambda}\|_{F}\left[\frac{1}{n}\sum_{i=1}^{n}\|y^{s}_{i}-h^{*}(w^{*}(x_{i}))\|^{2}\right]^{1/2}$
		$\displaystyle=\sqrt{-\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})}+\|A^{*}_{\lambda}\|_{F}\sqrt{\mathcal{L}(w^{*},h^{*};\mathcal{D}_{s})}.$

Note that if $0\leq a\leq b+c$ , then $a^{2}\leq(b+c)^{2}\leq 2b^{2}+2c^{2}$ . Applying this fact to the above inequality, we have:

\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})\leq-2\widehat{\mathcal{T}}^{\mathrm{% lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})+2\|A^{*}_{\lambda}\|^{2}_{F}% \mathcal{L}(w^{*},h^{*};\mathcal{D}_{s}).

Thus, Lemma 5.3 holds.
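The minimizer $(A^{*}_{\lambda},b^{*}_{\lambda})$ that appears at the start of this proof has a closed form, which the sketch below computes with NumPy. It assumes the intercept $b$ is unpenalized, as in the objective above, and reads the score as the negative of the minimized objective, in line with the abstract's "negative regularized mean squared error". The names `fit_label_ridge` and `objective` and the synthetic labels are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def fit_label_ridge(ys, yt, lam):
    """Minimize (1/n) * sum_i ||yt_i - A ys_i - b||^2 + lam * ||A||_F^2 over (A, b)."""
    n = ys.shape[0]
    ys_mean, yt_mean = ys.mean(axis=0), yt.mean(axis=0)
    ys_c, yt_c = ys - ys_mean, yt - yt_mean
    # Normal equations for A^T after eliminating the unpenalized intercept b.
    gram = ys_c.T @ ys_c + n * lam * np.eye(ys.shape[1])
    # lstsq also covers lam = 0 with a rank-deficient Gram matrix.
    a = np.linalg.lstsq(gram, ys_c.T @ yt_c, rcond=None)[0].T
    b = yt_mean - a @ ys_mean
    return a, b


def objective(ys, yt, a, b, lam):
    """Regularized mean squared error of the linear map y^s -> A y^s + b."""
    resid = yt - ys @ a.T - b
    return float(np.mean(np.sum(resid ** 2, axis=1)) + lam * np.sum(a ** 2))


# Synthetic labels (hypothetical): target labels are an exact affine map of source labels.
rng = np.random.default_rng(0)
ys = rng.normal(size=(50, 2))
a_true = np.array([[2.0, 0.0], [0.5, -1.0], [1.0, 1.0]])
b_true = np.array([1.0, -2.0, 0.5])
yt = ys @ a_true.T + b_true

a0, b0 = fit_label_ridge(ys, yt, lam=0.0)
score0 = -objective(ys, yt, a0, b0, 0.0)   # close to 0: labels are perfectly linear
a1, b1 = fit_label_ridge(ys, yt, lam=1.0)
score1 = -objective(ys, yt, a1, b1, 1.0)   # lower score once ||A||_F is penalized
print(score0, score1)
```

With $\lambda=0$ the fit recovers the affine map exactly, while $\lambda=1$ shrinks $\|A^{*}_{\lambda}\|_{F}$ , the factor that multiplies the source loss in Lemma 5.3.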

A.4 Proof of Theorem 5.4

For any $\lambda\geq 0$ and $\delta>0$ , applying Lemma A.1 for $(w^{*},k^{*})$ and Lemma 5.3, with probability at least $1-\delta$ :

	$\displaystyle\mathcal{R}(w^{*},k^{*})$	$\displaystyle\leq\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})+C(d,d_{t},M,H,L,\delta)/\sqrt{n}$
		$\displaystyle\leq-2\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})+2\|A^{*}_{\lambda}\|^{2}_{F}\,\mathcal{L}(w^{*},h^{*};\mathcal{D}_{s})+C(d,d_{t},M,H,L,\delta)/\sqrt{n}.$

Since $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})=-\mathcal{R}(w^{*},k^{*})$ , Theorem 5.4 holds.

Appendix B More details for experiment settings

B.1 More details for Sections 6.1–6.6

For these experiments, we train our source models from scratch using the MSE loss with the AdamW optimizer [Loshchilov and Hutter, 2019], which we run for 40 epochs with a batch size of 64 and a cosine learning rate scheduler. To obtain good source models, we resize all input images to 256 $\times$ 256 and apply basic image augmentations without horizontal flipping (i.e., affine transformation, Gaussian blur, and color jitter). We also scale all labels into $[0,1]$ using the width and height of the input images.
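As a small illustration of the label scaling step, the sketch below assumes that each keypoint label is an $(x,y)$ pixel coordinate and that scaling divides $x$ by the image width and $y$ by the image height. The function name `scale_keypoints` and the sample coordinates are hypothetical.

```python
import numpy as np


def scale_keypoints(keypoints_px, width, height):
    """Map (x, y) pixel coordinates into [0, 1] using the image width and height."""
    keypoints_px = np.asarray(keypoints_px, dtype=float)
    return keypoints_px / np.array([width, height], dtype=float)


# Two hypothetical keypoints on a 256 x 256 image.
scaled = scale_keypoints([[128.0, 64.0], [256.0, 0.0]], width=256, height=256)
print(scaled)
```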

For the transfer learning setting with head re-training, we freeze the trained feature extractor and re-train the regression head on the target dataset using the same setting above, except that we run 15 epochs on the CUB-200-2011 dataset and 30 epochs on the OpenMonkey dataset. For half fine-tuning, we unfreeze the last convolution layer and the head classifier since the number of trainable parameters is around half of the total number of parameters. For full fine-tuning, we unfreeze the whole network. In these two fine-tuning settings, we fine-tune for 15 epochs on both datasets. We use PyTorch [Paszke et al., 2019] for implementation.

B.2 More details for Section 6.7

For this experiment, we use the following 8 ImageNet pre-trained models as the source models: ResNet50, ResNet101, ResNet152 [He et al., 2016], DenseNet121, DenseNet169, DenseNet201 [Huang et al., 2017], GoogleNet [Szegedy et al., 2015], and Inceptionv3 [Szegedy et al., 2016]. These models are taken from the PyTorch Model Zoo.

We use the dSprites dataset [Matthey et al., 2017] for the target task. This dataset contains 737,280 images with 4 outputs for regression: x and y positions, scale, and orientation. The train-test split is similar to the settings in You et al. [2021]: 60% for training, 20% for validation, and 20% for testing. The transferred MSE is computed on the test set. We train our models for 10 epochs using the AdamW optimizer. The initial learning rate is $10^{-3}$ , which is divided by 10 every 3 epochs.
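The split sizes and the step schedule described above can be worked out as follows. This is a sketch under two assumptions: epochs are counted from 0, and the learning rate drops after every 3 completed epochs. In PyTorch, a schedule of this shape would typically be expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)`, although the source does not state which scheduler class was used.

```python
NUM_IMAGES = 737_280

# 60% / 20% / 20% split of the dSprites images.
n_train = NUM_IMAGES * 60 // 100
n_val = NUM_IMAGES * 20 // 100
n_test = NUM_IMAGES - n_train - n_val


def learning_rate(epoch, base_lr=1e-3, step=3, factor=10.0):
    """Step decay: divide base_lr by `factor` once every `step` epochs (0-indexed)."""
    return base_lr / factor ** (epoch // step)


schedule = [learning_rate(epoch) for epoch in range(10)]
print(n_train, n_val, n_test)
print(schedule)
```

Under these assumptions, the last of the 10 epochs runs at a learning rate of about $10^{-6}$ .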

Appendix C Additional experiment results

C.1 Usefulness of theoretical bounds

Although the theoretical bounds in Section 5 show the relationships between the transferability of the optimal transferred model and our transferability estimators, these bounds could be loose in practice unless the number of samples is large. This is in fact a limitation of this type of generalization bound. To show the usefulness of our bounds in practice, we conduct an experiment to investigate the generalization gap using the head re-training setting in Section 6.1.

The generalization gap is defined as the difference between our transferability score and the negative MSE (the transferability) of the transferred model. According to our theorems, this generalization gap is bounded above by the complexity term. We will compare the generalization gap with the absolute value of our transferability score and also inspect whether it has any significant correlation with the actual transferred MSE.

From this experiment, the ratios between the absolute value of the transferability score and the generalization gap for our transferability estimators are: 1.6 (LinMSE0), 2.0 (LinMSE1), 2.3 (LabMSE0), and 2.3 (LabMSE1). These results show that the transferability scores dominate the generalization gap in practice. More importantly, there is no significant correlation between the generalization gap and the actual transferred MSE. These findings indicate that the complexity term in our bounds may have little effect on transferability estimation, as opposed to the transferability score term, which has a strong effect (shown by the high correlations in our main experiments).
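For clarity, the ratio reported above can be computed for a single source-target pair as in the following sketch. The numbers are hypothetical and not taken from our experiments, and the helper names are illustrative; the sketch also assumes a positive gap, i.e., a score that is larger than the negative transferred MSE.

```python
def generalization_gap(score, transferred_mse):
    """Difference between a transferability score and the negative transferred MSE."""
    return score - (-transferred_mse)


def score_to_gap_ratio(score, transferred_mse):
    """Ratio between the absolute score and the generalization gap."""
    return abs(score) / generalization_gap(score, transferred_mse)


# Hypothetical values: score of -0.02 and transferred MSE of 0.03.
gap = generalization_gap(score=-0.02, transferred_mse=0.03)
ratio = score_to_gap_ratio(score=-0.02, transferred_mse=0.03)
print(gap, ratio)
```

A ratio above 1 means that the score term is larger in magnitude than the gap that the complexity term has to cover.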

Table C.1: Kendall’s- $\tau$ correlation coefficients when transferring from OpenMonkey to CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. Our estimators improve up to 28.4% in comparison with SotA (LogME) while being 13% better on average.

Transfer setting	Label-based method				Feature-based method
Transfer setting	LabLogME	LabTransRate	LabMSE0	LabMSE1	LogME	TransRate	LinMSE0	LinMSE1
Head re-training	0.728	0.028	0.935*	0.924	0.906	0.104	0.896	0.922*
Half fine-tuning	0.525	0.392	0.644	0.646*	0.651	0.291	0.667*	0.646
Full fine-tuning	0.497	0.289	0.606*	0.594	0.611	0.328	0.616*	0.594

Table C.2: Spearman correlation coefficients when transferring from OpenMonkey to CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. Our estimators improve up to 19.9% in comparison with SotA (LogME) while being 9.7% better on average.

Transfer setting	Label-based method				Feature-based method
Transfer setting	LabLogME	LabTransRate	LabMSE0	LabMSE1	LogME	TransRate	LinMSE0	LinMSE1
Head re-training	0.857	0.102	0.994*	0.991	0.988	0.215	0.984	0.990*
Half fine-tuning	0.726	0.409	0.857	0.858*	0.857	0.437	0.865*	0.858
Full fine-tuning	0.689	0.433	0.826*	0.823	0.827*	0.474	0.827*	0.823

Table C.3: Correlation coefficients when transferring between 10d-output tasks from OpenMonkey to CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. All correlations are statistically significant with $p<0.001$ . Our estimators with both $\lambda$ values are better than SotA (LogME).

Transfer setting	Label-based method				Feature-based method
Transfer setting	LabLogME	LabTransRate	LabMSE0	LabMSE1	LogME	TransRate	LinMSE0	LinMSE1
Head re-training	0.970	0.719	0.991*	0.989	0.968	0.656	0.990	0.995*
Half fine-tuning	0.944	0.742	0.963*	0.943	0.954	0.684	0.980*	0.958
Full fine-tuning	0.878	0.736	0.892*	0.863	0.892	0.669	0.916*	0.881

C.2 Additional results for Section 6.1

Detailed correlation plots for Table 1. In Figures C.3, C.3, and C.3, we show the detailed correlation plots and $p$ -values for our experiment results reported in Table 1 of the main paper. From these plots, all correlations are statistically significant with $p<0.001$ , except for TransRate and LabTransRate with head re-training.

Additional results with non-linear correlation metrics. In Tables C.1 and C.2, we report the Kendall’s- $\tau$ and Spearman correlation coefficients to complement the results in Table 1 of the main paper. These coefficients, as described in Bolya et al. [2021], are used to assess the ranking associations or the monotonic relationships between the transferability measures and the model performance. Based on the findings presented in these tables, our proposed scores are generally on par with or outperform the current state-of-the-art (SotA) approach, LogME [You et al., 2021], with an average correlation improvement of 9.7% and 13% for Spearman and Kendall’s- $\tau$ coefficients, respectively. This serves as strong evidence of the effectiveness of our proposed measures, not only in the linear relationship assessment, but also in the non-linear one.
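To illustrate what these rank-based coefficients measure, the sketch below implements both for data without ties, using only NumPy. The helper names and the toy scores are hypothetical; in practice, `scipy.stats.spearmanr` and `scipy.stats.kendalltau` provide tie-aware versions of these statistics.

```python
import numpy as np


def ranks(values):
    """Ranks 0..n-1 of the entries, assuming there are no ties."""
    return np.argsort(np.argsort(values))


def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all n(n-1)/2 pairs."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return float(total / (n * (n - 1) / 2))


# Hypothetical transferability scores and a monotone but non-linear performance.
scores = np.array([-0.9, -0.5, -0.3, -0.2, -0.1])
neg_mse = scores ** 3
# Same performance values with the two best tasks swapped.
neg_mse_swapped = neg_mse[[0, 1, 2, 4, 3]]

print(pearson(scores, neg_mse), spearman(scores, neg_mse), kendall_tau(scores, neg_mse))
print(spearman(scores, neg_mse_swapped), kendall_tau(scores, neg_mse_swapped))
```

In the first case the relationship is monotone but not linear, so both rank coefficients equal 1 while the Pearson coefficient stays below 1. Swapping one adjacent pair lowers the Spearman coefficient to 0.9 and Kendall's $\tau$ to 0.8.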

Additional result with high-dimensional labels. Using the setting in Section 6.1, we also conducted an additional experiment where both source and target tasks have 10-dimensional labels. In particular, we train a source model to predict five OpenMonkey keypoints: right eye, left eye, nose, head, and neck simultaneously (i.e., this source model returns a 10-dimensional output). The source model is then transferred to a target task that predicts a combination of five CUB-200-2011 keypoints. We consider each combination of 5 keypoints among 10 CUB-200-2011 keypoints as a target task, resulting in 252 target tasks that all have 10-dimensional labels.

We also run 3 transfer learning algorithms: head re-training, half fine-tuning, and full fine-tuning, using the same training settings as in Section 6.1. For TransRate and LabTransRate, we use 2 bins per dimension instead of 5 bins to reduce the computational costs. The results for this experiment are reported in Table C.3. From these results, our approaches are better than the baselines for both $\lambda$ values.

C.3 Additional results for Section 6.2

Table C.4: Correlation coefficients when transferring from 2d-output tasks to 10d-output tasks on CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. Except for TransRate with half and full fine-tuning, all correlations are statistically significant with $p<0.001$ . Our estimators are better than SotA (LogME) in most cases.

Transfer setting	Label-based method				Feature-based method
Transfer setting	LabLogME	LabTransRate	LabMSE0	LabMSE1	LogME	TransRate	LinMSE0	LinMSE1
Head re-training	0.602	0.632	0.868*	0.816	0.885	0.549	0.901	0.973*
Half fine-tuning	0.491	0.645	0.771	0.881*	0.804	0.072	0.913*	0.818
Full fine-tuning	0.397	0.632	0.727	0.888*	0.756	0.050	0.884*	0.833

Detailed correlation plots for Table 2. In Figures C.6–C.9, we show the detailed correlation plots and $p$ -values for our experiment results reported in Table 2 of the main paper. From these plots, all correlations are statistically significant with $p<0.001$ , except for TransRate and LabTransRate as well as the full fine-tuning setting on the CUB-200-2011 dataset.

Additional result for each individual source task. We report in Tables C.5 and C.6 more comprehensive results for all source tasks on CUB-200-2011 and OpenMonkey respectively. Each row of the tables corresponds to one source task and shows the correlation coefficients when transferring to all other tasks in the respective dataset. From the tables, our transferability estimators are consistently better than LogME, LabLogME, TransRate, and LabTransRate for most source tasks on both datasets. These results confirm the effectiveness of our proposed methods.

Additional result with high-dimensional labels. In this additional experiment, we further show the effectiveness of our proposed methods when the target tasks have higher dimensional labels. In particular, we transfer from 4 source tasks on CUB-200-2011 (back, beak, belly, and breast) to all the combinations of 5 attributes among the remaining tasks (except for right eye, right leg, and right wing, which may not always be available in the data). In total, we have 224 source-target pairs, where the source tasks have 2-dimensional labels and the target tasks have 10-dimensional labels. We use the same training settings as in Section 6.2 of the main paper, except that we also use 2 bins per dimension when calculating TransRate and LabTransRate to reduce computational costs. Table C.4 reports the results for this experiment. These results clearly show that our methods, LinMSE0 and LinMSE1, are better than the LogME and TransRate baselines in most cases.
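The count of 224 source-target pairs can be reproduced by enumerating the combinations directly, as in the sketch below. The keypoint names are those listed in Table C.5; the variable names are illustrative, and the sketch assumes that "the remaining tasks" refers to all keypoints other than the 4 source tasks.

```python
from itertools import combinations

CUB_KEYPOINTS = [
    "back", "beak", "belly", "breast", "crown", "forehead", "left eye",
    "left leg", "left wing", "nape", "right eye", "right leg", "right wing",
    "tail", "throat",
]
SOURCE_TASKS = ["back", "beak", "belly", "breast"]
EXCLUDED = {"right eye", "right leg", "right wing"}  # may be missing in the data

# Keypoints that may appear in a target task: neither a source task nor excluded.
candidates = [k for k in CUB_KEYPOINTS if k not in SOURCE_TASKS and k not in EXCLUDED]

# Each target task combines 5 keypoints, i.e., a 10-dimensional label.
target_tasks = list(combinations(candidates, 5))
pairs = [(source, target) for source in SOURCE_TASKS for target in target_tasks]

print(len(candidates), len(target_tasks), len(pairs))
```

With 8 candidate keypoints there are 56 target tasks, and pairing each with the 4 source tasks gives the 224 pairs.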

Table C.5: Correlation coefficients for all source tasks on CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods.

Transfer setting	Source task	Label-based method				Feature-based method
Transfer setting	Source task	LabLogME	LabTransRate	LabMSE0	LabMSE1	LogME	TransRate	LinMSE0	LinMSE1
Head re-training	Back	0.743	0.116	0.956	0.966*	0.920	0.273	0.931	0.964*
	Beak	0.863	0.229	0.922*	0.915	0.878	0.158	0.906	0.945*
	Belly	0.892	0.097	0.970	0.982*	0.933	0.188	0.932	0.982*
	Breast	0.915	0.120	0.935	0.945*	0.903	0.279	0.922	0.961*
	Crown	0.917	0.041	0.962	0.966*	0.913	0.251	0.945	0.979*
	Forehead	0.888	0.076	0.941*	0.939	0.885	0.221	0.924	0.966*
	Left eye	0.035	0.076	0.913	0.964*	0.924	0.289	0.945	0.969*
	Left leg	0.261	0.221	0.935	0.975*	0.935	0.223	0.953	0.975*
	Left wing	0.260	0.170	0.964	0.994*	0.980	0.173	0.994*	0.994*
	Nape	0.889	0.085	0.922	0.942*	0.900	0.300	0.929	0.953*
	Right eye	0.625	0.242	0.904	0.974*	0.921	0.244	0.948	0.975*
	Right leg	0.508	0.047	0.958	0.989*	0.942	0.217	0.954	0.990*
	Right wing	0.521	0.167	0.907	0.979*	0.935	0.270	0.946	0.980*
	Tail	0.591	0.392	0.900	0.927*	0.872	0.544	0.880	0.890*
	Throat	0.896	0.124	0.938	0.941*	0.890	0.291	0.924	0.956*
Half fine-tuning	Back	0.714	0.076	0.791	0.814*	0.835	0.168	0.911*	0.873
	Beak	0.663	0.160	0.831*	0.772	0.765	0.076	0.883	0.899*
	Belly	0.528	0.233	0.655	0.752*	0.758	0.309	0.849*	0.764
	Breast	0.730	0.100	0.802*	0.779	0.762	0.152	0.867*	0.850
	Crown	0.644	0.068	0.752	0.776*	0.714	0.165	0.832*	0.816
	Forehead	0.654	0.032	0.804*	0.786	0.727	0.120	0.859	0.873*
	Left eye	0.420	0.046	0.913*	0.853	0.812	0.227	0.892*	0.865
	Left leg	0.121	0.095	0.721	0.819*	0.845	0.150	0.893*	0.832
	Left wing	0.352	0.150	0.949*	0.918	0.859	0.189	0.919*	0.918
	Nape	0.660	0.055	0.705	0.770*	0.751	0.181	0.863*	0.802
	Right eye	0.561	0.221	0.911*	0.873	0.786	0.180	0.871	0.890*
	Right leg	0.268	0.125	0.690	0.804*	0.810	0.069	0.861*	0.820
	Right wing	0.407	0.133	0.495	0.613*	0.516	0.338	0.521	0.617*
	Tail	0.801	0.117	0.930*	0.812	0.848	0.285	0.924	0.968*
	Throat	0.767	0.013	0.870*	0.810	0.811	0.253	0.900*	0.873
Full fine-tuning	Back	0.710	0.085	0.785	0.808*	0.829	0.178	0.906*	0.868
	Beak	0.659	0.161	0.826*	0.780	0.758	0.073	0.877	0.899*
	Belly	0.645	0.273	0.782	0.847*	0.862	0.365	0.926*	0.856
	Breast	0.740	0.104	0.811*	0.791	0.768	0.152	0.871*	0.859
	Crown	0.647	0.073	0.756	0.784*	0.717	0.157	0.834*	0.821
	Forehead	0.648	0.037	0.799*	0.783	0.723	0.111	0.855	0.869*
	Left eye	0.224	0.456*	0.297	0.347	0.333*	0.246	0.282	0.326
	Left leg	0.057	0.067	0.659	0.769*	0.796	0.146	0.850*	0.783
	Left wing	0.342	0.159	0.954*	0.915	0.860	0.195	0.920*	0.914
	Nape	0.667	0.041	0.713	0.779*	0.752	0.177	0.864*	0.810
	Right eye	0.549	0.213	0.915*	0.876	0.794	0.199	0.877	0.893*
	Right leg	0.237	0.377	0.673	0.692*	0.755	0.431	0.766*	0.693
	Right wing	0.254*	0.046	0.237	0.223	0.225	0.093	0.227*	0.220
	Tail	0.803	0.122	0.930*	0.818	0.846	0.288	0.923	0.969*
	Throat	0.665	0.027	0.801*	0.779	0.744	0.256	0.850*	0.834

Table C.6: Correlation coefficients for all source tasks on OpenMonkey. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods.

Transfer setting	Source task	Label-based method				Feature-based method
Transfer setting	Source task	LabLogME	LabTransRate	LabMSE0	LabMSE1	LogME	TransRate	LinMSE0	LinMSE1
Head re-training	Right eye	0.894	0.859	0.986*	0.835	0.918	0.846	0.978	0.986*
	Left eye	0.895	0.854	0.987*	0.838	0.868	0.858	0.981	0.987*
	Nose	0.908	0.849	0.988*	0.849	0.818	0.837	0.978	0.989*
	Head	0.941	0.881	0.992*	0.821	0.897	0.884	0.983*	0.978
	Neck	0.972	0.862	0.998*	0.887	0.932	0.839	0.982	0.987*
	Right shoulder	0.977	0.837	0.994*	0.891	0.842	0.811	0.982*	0.980
	Right elbow	0.963	0.529	0.994*	0.940	0.469	0.564	0.969	0.990*
	Right wrist	0.970	0.753	0.993*	0.939	0.615	0.446	0.963	0.990*
	Left shoulder	0.972	0.800	0.997*	0.915	0.823	0.808	0.988*	0.988*
	Left elbow	0.960	0.546	0.994*	0.948	0.711	0.572	0.969	0.989*
	Left wrist	0.975	0.597	0.993*	0.951	0.964	0.544	0.963	0.993*
	Hip	0.922	0.540	0.989*	0.325	0.874	0.557	0.800	0.991*
	Right knee	0.925	0.080	0.975*	0.850	0.766	0.331	0.945	0.993*
	Right ankle	0.931	0.411	0.989*	0.770	0.737	0.371	0.930	0.997*
	Left knee	0.923	0.160	0.978*	0.848	0.692	0.209	0.936	0.994*
	Left ankle	0.916	0.416	0.986*	0.775	0.852	0.329	0.925	0.998*
	Tail	0.936	0.712	0.993*	0.312	0.821	0.662	0.897	0.990*
Half fine-tuning	Right eye	0.795	0.734	0.906*	0.883	0.835	0.709	0.963*	0.923
	Left eye	0.797	0.731	0.905*	0.879	0.771	0.719	0.960*	0.918
	Nose	0.829	0.736	0.914*	0.872	0.649	0.721	0.968*	0.916
	Head	0.835	0.759	0.921*	0.882	0.804	0.751	0.964*	0.928
	Neck	0.902	0.793	0.929*	0.871	0.745	0.765	0.969*	0.915
	Right shoulder	0.887	0.725	0.924*	0.890	0.751	0.758	0.972*	0.924
	Right elbow	0.764	0.250	0.806	0.914*	0.048	0.602	0.931*	0.821
	Right wrist	0.806	0.501	0.823	0.903*	0.172	0.643	0.929*	0.819
	Left shoulder	0.893	0.718	0.927*	0.899	0.702	0.774	0.972*	0.930
	Left elbow	0.782	0.369	0.824	0.919*	0.366	0.594	0.946*	0.839
	Left wrist	0.822	0.523	0.828	0.902*	0.765	0.663	0.932*	0.824
	Hip	0.030	0.487	0.233	0.910*	0.006	0.359	0.800*	0.305
	Right knee	0.481	0.429	0.598	0.906*	0.186	0.067	0.831*	0.687
	Right ankle	0.357	0.275	0.534	0.910*	0.286	0.226	0.806*	0.632
	Left knee	0.467	0.355	0.601	0.899*	0.172	0.215	0.855*	0.692
	Left ankle	0.331	0.242	0.530	0.904*	0.197	0.303	0.822*	0.632
	Tail	0.231	0.196	0.434	0.829*	0.160	0.121	0.729*	0.494
Full fine-tuning	Right eye	0.796	0.711	0.905*	0.894	0.821	0.694	0.959*	0.927
	Left eye	0.790	0.734	0.904*	0.882	0.763	0.714	0.957*	0.921
	Nose	0.810	0.731	0.912*	0.892	0.642	0.709	0.960*	0.932
	Head	0.801	0.737	0.900*	0.892	0.772	0.718	0.947*	0.920
	Neck	0.893	0.782	0.930*	0.886	0.755	0.743	0.962*	0.926
	Right shoulder	0.896	0.722	0.936*	0.908	0.759	0.750	0.975*	0.940
	Right elbow	0.689	0.168	0.736	0.878*	0.047	0.562	0.888*	0.761
	Right wrist	0.796	0.505	0.805	0.876*	0.199	0.644	0.910*	0.803
	Left shoulder	0.872	0.690	0.901*	0.882	0.670	0.762	0.955*	0.903
	Left elbow	0.726	0.282	0.774	0.904*	0.326	0.538	0.914*	0.797
	Left wrist	0.787	0.488	0.787	0.868*	0.725	0.672	0.903*	0.785
	Hip	0.016	0.518	0.173	0.894*	0.038	0.382	0.757*	0.238
	Right knee	0.391	0.518	0.516	0.891*	0.096	0.141	0.763*	0.614
	Right ankle	0.246	0.396	0.437	0.889*	0.185	0.340	0.726*	0.546
	Left knee	0.381	0.448	0.521	0.891*	0.149	0.303	0.789*	0.618
	Left ankle	0.244	0.297	0.444	0.871*	0.098	0.357	0.751*	0.551
	Tail	0.105	0.299	0.309	0.824*	0.047	0.212	0.628*	0.372
