License: CC BY 4.0
arXiv:2312.00656v2 [cs.LG] 04 Dec 2023

Simple Transferability Estimation for Regression Tasks

Cuong N. Nguyen$^{1}$, Phong Tran$^{2,3}$, Lam Si Tung Ho$^{4}$, Vu Dinh$^{5}$, Anh T. Tran$^{2}$, Tal Hassner$^{6}$, Cuong V. Nguyen
$^{1}$Florida International University, USA; $^{2}$VinAI Research, Vietnam; $^{3}$MBZUAI, UAE; $^{4}$Dalhousie University, Canada; $^{5}$University of Delaware, USA; $^{6}$Meta AI, USA
Abstract

We consider transferability estimation, the problem of estimating how well deep learning models transfer from a source to a target task. We focus on regression tasks, which received little previous attention, and propose two simple and computationally efficient approaches that estimate transferability based on the negative regularized mean squared error of a linear regression model. We prove novel theoretical results connecting our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing state-of-the-art regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.

1 Introduction

Transferability estimation [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020] aims to develop computationally efficient metrics to predict the effectiveness of transferring a deep learning model from a source to a target task. This problem has recently gained attention as a means for model and task selection [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020, Bolya et al., 2021, You et al., 2021] that can potentially improve the performance and reduce the cost of transfer learning, especially for expensive deep learning models. In recent years, new transferability estimators were also developed and used in applications such as checkpoint ranking [Huang et al., 2021, Li et al., 2021] and few-shot learning [Tong et al., 2021].

Nearly all existing methods consider only the transferability between classification tasks [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020, Deshpande et al., 2021, Li et al., 2021, Tan et al., 2021, Huang et al., 2022], with very few designed for regression [You et al., 2021, Huang et al., 2022], despite the importance of regression problems in a wide range of applications such as landmark detection [Fard et al., 2021, Poster et al., 2021], object detection and localization [Cai et al., 2020, Bu et al., 2021], pose estimation [Schwarz et al., 2015, Doersch and Zisserman, 2019], or image generation [Ramesh et al., 2021, Razavi et al., 2019]. Moreover, those few methods are often a byproduct of a classification transferability estimator and were never tested against regression transferability estimation baselines.

In this paper, we explicitly consider transferability estimation for regression tasks and formulate a novel definition for this problem. Our formulation is based on the practical usage of transferability estimation: to compare the actual transferability between different tasks [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020, You et al., 2021]. We then propose two simple, efficient, and theoretically grounded approaches for this problem that estimate transferability using the negative regularized mean squared error (MSE) of a linear regression model computed from the source and target training sets. The first approach, Linear MSE, uses the linear regression model between features extracted from the source model (a model trained on the source task) and true labels of the target training set. The second approach, Label MSE, estimates transferability by regressing between the dummy labels, obtained from the source model, and true labels of the target data. In special cases where the source and target data share the inputs, the Label MSE estimators can be computed even more efficiently from the true labels without a source model.

In addition to their simplicity, we show our transferability estimators to have theoretical properties relating them to the actual transferability of the transferred target model. In particular, we prove that the transferability of the target model obtained from transfer learning is lower bounded by the Label MSE minus a complexity term, which depends on the target dataset size and the model architecture. Similar theoretical results can also be proven for the case where the source and target tasks share the inputs.

We conduct extensive experiments on two real-world keypoint detection datasets, CUB-200-2011 [Wah et al., 2011] and OpenMonkey [Yao et al., 2021], as well as the dSprites shape regression dataset [Matthey et al., 2017] to show the advantages of our approaches. The results clearly demonstrate that despite their simplicity, our approaches outperform recently published, state-of-the-art (SotA) regression transferability estimators, such as LogME [You et al., 2021] and TransRate [Huang et al., 2022], in both effectiveness and efficiency. In particular, our approaches improve on SotA results by 12% to 36% on average, while being at least 27% faster.

Summary of contributions. (1) We formulate a new definition for the transferability estimation problem that can be used for comparing the actual transferability (§3). (2) We propose Linear MSE and Label MSE, two simple yet effective transferability estimators for regression tasks (§4). (3) We prove novel theoretical results for these estimators to connect them with the actual task transferability (§5). (4) We rigorously test our approaches in various settings and challenging benchmarks, showing their advantages compared to SotA regression transferability methods (§6). [Footnote 1: Implementations of our methods are available at: https://github.com/CuongNN218/regression_transferability.]

2 Related work

Our paper is one of the recent attempts to develop efficient and effective transferability estimators for deep transfer learning [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020, Deshpande et al., 2021, Li et al., 2021, Tan et al., 2021, You et al., 2021, Huang et al., 2022, Nguyen et al., 2022], which is closely related to the generalization estimation problem [Chuang et al., 2020, Deng and Zheng, 2021]. Most of the existing work for transferability estimation focuses on classification [Bao et al., 2019, Tran et al., 2019, Nguyen et al., 2020, Deshpande et al., 2021, Li et al., 2021, Tan et al., 2021, Nguyen et al., 2022], while we are only aware of two methods developed for regression [You et al., 2021, Huang et al., 2022].

One regression transferability method, called LogME [You et al., 2021], takes a Bayesian approach and uses the maximum log evidence of the target data as the transferability estimator. While this method can be sped up using matrix decomposition, its scalability is still limited since the required memory is large. In contrast, our proposed approaches are simpler, faster, and more effective. We also provide novel theoretical properties for our methods that were not available for LogME. Another approach for transferability estimation between regression tasks, called TransRate [Huang et al., 2022], is to divide the real-valued outputs into different bins and apply a classification transferability estimator. In our experiments, we will show that this approach is less accurate than both LogME and our approaches.

Transferability can also be inferred from a task taxonomy [Zamir et al., 2018, Dwivedi and Roig, 2019, Dwivedi et al., 2020] or a task space representation [Achille et al., 2019], which embeds tasks as vectors in a vector space. A popular task taxonomy, Taskonomy [Zamir et al., 2018], exploits the underlying structure of visual tasks by computing a task affinity matrix that can be used for estimating transferability. Constructing the Taskonomy requires training a small classification head, which resembles the training of the regularized linear regression models in our approaches. However, that work investigates the global taxonomy of classification tasks, while our paper studies regression tasks with a focus on estimating their transferability efficiently.

Our paper is also related to transfer learning with kernel methods [Radhakrishnan et al., 2022] and with deep models [Tan et al., 2018], which has been successful in real-world regression problems such as object detection and localization [Cai et al., 2020, Bu et al., 2021], landmark detection [Fard et al., 2021, Poster et al., 2021], or pose estimation [Schwarz et al., 2015, Doersch and Zisserman, 2019]. Several previous works have investigated theoretical bounds for transfer learning [Ben-David and Schuller, 2003, Blitzer et al., 2007, Mansour et al., 2009, Azizzadenesheli et al., 2019, Wang et al., 2019, Tripuraneni et al., 2020]; however, these bounds are hard to compute in practice and thus unsuitable for transferability estimation. Some previous transferability estimators have theoretical bounds on the empirical loss of the transferred model [Tran et al., 2019, Nguyen et al., 2020], but these bounds were for classification and did not relate directly to transferability. Our bounds, on the other hand, focus on regression and connect our approaches directly to the notion of transferability.

3 Transferability between regression tasks

In this section, we describe the transfer learning setting that will be used in our subsequent analysis. We then propose a definition of transferability for regression tasks and a new formulation for the transferability estimation problem.

3.1 Transfer learning for regression

Consider a source training set $\mathcal{D}_s = \{(x^s_i, y^s_i)\}_{i=1}^{n_s}$ and a target training set $\mathcal{D}_t = \{(x^t_i, y^t_i)\}_{i=1}^{n_t}$ consisting of $n_s$ and $n_t$ examples respectively, where $x^s_i, x^t_i \in \mathbb{R}^{d}$ are $d$-dimensional input vectors, $y^s_i \in \mathbb{R}^{d_s}$ is a $d_s$-dimensional source label vector, and $y^t_i \in \mathbb{R}^{d_t}$ is a $d_t$-dimensional target label vector. Here we allow multi-output regression tasks (with $d_s, d_t \geq 1$) where the source and target labels may have different dimensions ($d_s \neq d_t$). In the simplest case, the source and target tasks are both single-output regression tasks where $d_s = d_t = 1$.

In this paper, we will refer to a model (such as $w$, $w^*$, $h$, $h^*$, $k$, or $k^*$) and its parameters interchangeably. Using the source dataset $\mathcal{D}_s$, we train a deep learning model $(w^*, h^*)$ consisting of an optimal feature extractor $w^*$ and an optimal regression head $h^*$ that minimizes the empirical MSE loss [Footnote 2: Here we assume $(w^*, h^*)$ is a global minimum of Eq. (1). However, practical optimization algorithms often only return a local minimum for this problem. The same is also true for Eq. (3).]:

$$w^*, h^* = \operatorname*{argmin}_{w,h} \mathcal{L}(w, h; \mathcal{D}_s), \tag{1}$$

where $w : \mathbb{R}^{d} \rightarrow \mathbb{R}^{d_r}$ is a feature extractor network that transforms a $d$-dimensional input vector into a $d_r$-dimensional feature vector, $h : \mathbb{R}^{d_r} \rightarrow \mathbb{R}^{d_s}$ is a source regression head network that transforms a $d_r$-dimensional feature vector into a $d_s$-dimensional output vector, and $\mathcal{L}(w, h; \mathcal{D}_s)$ is the empirical MSE loss of the whole model $(w, h)$ on the dataset $\mathcal{D}_s$:

$$\mathcal{L}(w, h; \mathcal{D}_s) = \frac{1}{n_s} \sum_{i=1}^{n_s} \| y^s_i - h(w(x^s_i)) \|^2, \tag{2}$$

with $\|\cdot\|$ being the $\ell_2$ norm. In practice, we usually consider a source model (e.g., a ResNet [He et al., 2016]) as a whole and use its first $l$ layers from the input (for some chosen number $l$) as the feature extractor $w$. The regression head $h$ is the remaining part of the model from the $l$-th layer to the output layer, and the prediction for any input $x$ is $h(w(x))$.
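As a concrete reading of Eq. (2), the following is a minimal NumPy sketch. The function name `empirical_mse` and the toy two-layer model (random weights `W1`, `W2`, a `tanh` extractor, a linear head) are illustrative assumptions and not part of the paper; only the loss formula itself comes from the text.

```python
import numpy as np

def empirical_mse(w, h, X, Y):
    """Empirical MSE of Eq. (2): average over examples of ||y_i - h(w(x_i))||^2."""
    preds = h(w(X))                       # shape (n, d_out)
    residuals = Y - preds
    # Squared l2 norm per example (sum over output dimensions),
    # then the mean over the n examples.
    return float(np.mean(np.sum(residuals ** 2, axis=1)))

# Hypothetical toy model: d = 5 inputs, d_r = 3 features, d_s = 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(3, 2))
w = lambda X: np.tanh(X @ W1)             # stand-in feature extractor
h = lambda Z: Z @ W2                      # stand-in regression head

X = rng.normal(size=(100, 5))
Y = h(w(X)) + 0.1 * rng.normal(size=(100, 2))   # noisy synthetic labels
loss = empirical_mse(w, h, X, Y)
```

Note that the squared norm sums over the output dimensions and only the outer average runs over examples. A helper that also averages over outputs would differ from Eq. (2) by a factor of the output dimension.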

After training the optimal source model $(w^*, h^*)$, we perform transfer learning to the target task by freezing the optimal feature extractor $w^*$ and re-training a new regression head $k^*$ using the target dataset $\mathcal{D}_t$, also by minimizing the empirical MSE loss:

$$\begin{aligned} k^* &= \operatorname*{argmin}_{k} \mathcal{L}(w^*, k; \mathcal{D}_t) \\ &= \operatorname*{argmin}_{k} \Big\{ \frac{1}{n_t} \sum_{i=1}^{n_t} \| y^t_i - k(w^*(x^t_i)) \|^2 \Big\}, \end{aligned} \tag{3}$$

where $k : \mathbb{R}^{d_r} \rightarrow \mathbb{R}^{d_t}$ is a target regression head network that may have a different architecture than that of $h$. In general, the regression heads $h$ and $k$ may contain multiple layers and are not necessarily linear.

This transfer learning algorithm, usually called head re-training, has been widely used for deep learning models [Donahue et al., 2014, Oquab et al., 2014, Sharif Razavian et al., 2014, Whatmough et al., 2019] and will be used for our theoretical analysis. In practice and in our experiments, we also consider another transfer learning algorithm, widely known as fine-tuning, where we fine-tune the trained feature extractor $w^*$ on the target set, and then train a new target regression head $k^*$ with this fine-tuned feature extractor [Agrawal et al., 2014, Girshick et al., 2014, Chatfield et al., 2014, Dhillon et al., 2020].
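Head re-training as in Eq. (3) can be sketched as follows, under a strong simplifying assumption: the head $k$ is restricted to an affine map so that the minimizer has a closed form via least squares. The paper allows multi-layer, nonlinear heads, which would instead need an iterative optimizer. The names `retrain_affine_head`, `W_frozen`, and the synthetic data are hypothetical.

```python
import numpy as np

def retrain_affine_head(w_star, X_t, Y_t):
    """Fit an affine head k(z) = z @ A + b on frozen features w_star(x).

    Simplification: Eq. (3) permits arbitrary heads; an affine head is
    assumed here only so the argmin can be computed in closed form.
    """
    Z = w_star(X_t)                                    # frozen extractor, applied once
    Z1 = np.hstack([Z, np.ones((Z.shape[0], 1))])      # extra column for the bias
    # Least squares minimizes the summed squared residual norms,
    # which is n_t times the objective of Eq. (3).
    theta, *_ = np.linalg.lstsq(Z1, Y_t, rcond=None)
    A, b = theta[:-1], theta[-1]
    return lambda Z_new: Z_new @ A + b

# Hypothetical target task whose labels are affine in the frozen features.
rng = np.random.default_rng(0)
W_frozen = rng.normal(size=(4, 3))
w_star = lambda X: np.tanh(X @ W_frozen)               # stand-in for the source extractor
X_t = rng.normal(size=(200, 4))
A_true = rng.normal(size=(3, 2))
Y_t = w_star(X_t) @ A_true + 0.5

k_star = retrain_affine_head(w_star, X_t, Y_t)
train_mse = float(np.mean(np.sum((Y_t - k_star(w_star(X_t))) ** 2, axis=1)))
```

The extractor is only evaluated, never updated, which is what distinguishes head re-training from fine-tuning. Fine-tuning would additionally update the extractor's parameters on the target set and is not shown here.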

3.2 Transferability estimation

As our first contribution, we propose a definition of transferability for regression tasks and a new formulation for the transferability estimation problem. For this purpose, we make the standard assumption that the target data $\mathcal{D}_t$ are drawn iid from the true but unknown distribution $\mathbb{P}_t := \mathbb{P}(X^t, Y^t)$; that is, $(x^t_i, y^t_i) \stackrel{\mathrm{iid}}{\sim} \mathbb{P}_t$. We do not make any assumption on the distribution of the source data $\mathcal{D}_s$, but we assume a source model $(w^*, h^*)$ is pre-trained on $\mathcal{D}_s$ and then transferred to a target model $(w^*, k^*)$ using the procedure in Section 3.1.

We now define the transferability between the source dataset $\mathcal{D}_s$ and the target task represented by $\mathbb{P}_t$. In our Definition 3.1 below, the transferability is the expected negative $\ell_2$ loss of the target model $(w^*, k^*)$ on a random example drawn from $\mathbb{P}_t$. From this definition, the lower the loss of $(w^*, k^*)$, the higher the transferability.

Definition 3.1.

The transferability between a source dataset $\mathcal{D}_s$ and a target task $\mathbb{P}_t$ is defined as: $\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t) := \mathbb{E}_{(x^t, y^t) \sim \mathbb{P}_t} \left\{ -\| y^t - k^*(w^*(x^t)) \|^2 \right\}$.

In the above definition, transferability is also equivalent to the negative expected (true) risk of $(w^*, k^*)$. Next, we formulate the transferability estimation problem. Previous work [Tran et al., 2019, Huang et al., 2022] defined this problem as estimating $\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t)$ from the training sets $(\mathcal{D}_s, \mathcal{D}_t)$, i.e., to derive a real-valued metric $\mathcal{T}(\mathcal{D}_s, \mathcal{D}_t) \in \mathbb{R}$ such that $\mathcal{T}(\mathcal{D}_s, \mathcal{D}_t) \approx \mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t)$.
However, in most applications of transferability estimation such as task selection [Tran et al., 2019, Huang et al., 2022, You et al., 2021] or model ranking [Huang et al., 2021, Li et al., 2021], an accurate approximation of $\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t)$ is usually not required since $\mathcal{T}(\mathcal{D}_s, \mathcal{D}_t)$ is only used for comparing tasks or models. Thus, we propose below an alternative definition for this problem that better aligns with its practical usage.

Definition 3.2.

Transferability estimation aims to find a computationally efficient real-valued metric $\mathcal{T}(\mathcal{D}_s, \mathcal{D}_t) \in \mathbb{R}$ for any pair of training datasets $(\mathcal{D}_s, \mathcal{D}_t)$ such that: $\mathcal{T}(\mathcal{D}_s, \mathcal{D}_t) \leq \mathcal{T}(\mathcal{D}'_s, \mathcal{D}'_t)$ if and only if $\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t) \leq \mathrm{Tr}(\mathcal{D}'_s, \mathbb{P}'_t)$, where $\mathbb{P}_t$ and $\mathbb{P}'_t$ are the tasks corresponding with the datasets $\mathcal{D}_t$ and $\mathcal{D}'_t$ respectively.

In our new definition, a transferability estimator $\mathcal{T}(\mathcal{D}_s, \mathcal{D}_t)$ is a function of $(\mathcal{D}_s, \mathcal{D}_t)$ that can be used for comparing or ranking transferability. It does not need to be an approximation of $\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t)$. This is a generalization of previous definitions [Nguyen et al., 2020, Huang et al., 2022] and can be used for source task selection (when $\mathbb{P}_t = \mathbb{P}'_t$ and $\mathcal{D}_t = \mathcal{D}'_t$) as well as target task selection (when $\mathcal{D}_s = \mathcal{D}'_s$). It is consistent with the usage of transferability estimators and the way they are evaluated in the literature by correlation analysis [Tran et al., 2019, Nguyen et al., 2020, You et al., 2021, Huang et al., 2022].
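The order-preservation requirement of Definition 3.2 can be illustrated with a small diagnostic that counts how many pairs of tasks satisfy the "if and only if" condition. This is a hypothetical helper (`order_agreement`) written for illustration, not the correlation analysis used in the literature, and the score values below are made up.

```python
from itertools import combinations

def order_agreement(scores, transferabilities):
    """Fraction of task pairs on which T(a) <= T(b) iff Tr(a) <= Tr(b).

    `scores` holds estimator values T, `transferabilities` holds the
    corresponding actual transferabilities Tr (both hypothetical inputs).
    Requires at least two entries.
    """
    pairs = list(combinations(range(len(scores)), 2))
    agree = 0
    for i, j in pairs:
        # Check the biconditional of Definition 3.2 in both directions.
        fwd = (scores[i] <= scores[j]) == (transferabilities[i] <= transferabilities[j])
        bwd = (scores[j] <= scores[i]) == (transferabilities[j] <= transferabilities[i])
        agree += fwd and bwd
    return agree / len(pairs)

# Made-up values: the estimator is on a different scale than Tr,
# yet it ranks the three candidate tasks identically.
same_order = order_agreement([-3.0, -1.0, -2.0], [-0.9, -0.2, -0.5])
# Here the last two tasks are ranked in the opposite order.
one_swap = order_agreement([-3.0, -1.0, -2.0], [-0.9, -0.5, -0.2])
```

In the first call the scores are far from the transferability values numerically but every pairwise comparison agrees, which is all Definition 3.2 asks for. In the second call one of the three pairs disagrees.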

4 Simple transferability estimators for regression

In theory, we can use $-\mathcal{L}(w^*, k^*; \mathcal{D}_t)$, the negative MSE of the transferred target model $(w^*, k^*)$, as a transferability estimator, since it is an empirical estimation of $\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t)$ using the dataset $\mathcal{D}_t$. However, this method requires us to run the actual transfer learning process, which could be expensive if the network architecture of the target regression heads (e.g., $k$ and $k^*$) is deep and complex. This violates a crucial requirement for a transferability estimator in Definition 3.2: the estimator must be computationally efficient since it will be computed several times for task comparison. In this section, we propose two simple regression transferability estimators to address this problem.

4.1 Linear MSE estimator

To reduce the cost of computing $\mathcal{L}(w^*, k^*; \mathcal{D}_t)$, a simple idea is to approximate it with an $\ell_2$-regularized linear regression (Ridge regression) head. This leads to our first simple transferability estimator, Linear MSE, which is defined as the negative regularized MSE of this Ridge regression head. In this definition, $\|\cdot\|_F$ is the Frobenius norm.

Definition 4.1.

The Linear MSE transferability estimator with a regularization parameter $\lambda \geq 0$ is:

$$\mathcal{T}^{\mathrm{lin}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t) := -\min_{A,b} \Big\{ \frac{1}{n_t} \sum_{i=1}^{n_t} \|y^t_i - A w^*(x^t_i) - b\|^2 + \lambda \|A\|_F^2 \Big\},$$

where $A \in \mathbb{R}^{d_r \times d_t}$ is a $d_r \times d_t$ real-valued matrix and $b \in \mathbb{R}^{d_t}$ is a $d_t$-dimensional real-valued vector.

Here we add a regularizer to avoid overfitting when the target dataset $\mathcal{D}_t$ is small. Previous work such as LogME [You et al., 2021] proposed to prevent overfitting by taking a Bayesian approach, which is more complicated and expensive. We will show empirically in our experiments (Section 6.3) that our simple regularization approach can tackle the issue more effectively and efficiently.

Given a pre-trained feature extractor $w^*$ and a target set $\mathcal{D}_t$, we can compute $\mathcal{T}^{\mathrm{lin}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)$ efficiently using the closed form solution for Ridge regression or using second-order optimization [Bishop, 2006]. If the target regression head $k^*$ is a linear regression model, $\mathcal{T}^{\mathrm{lin}}_{0}(\mathcal{D}_s, \mathcal{D}_t)$ with $\lambda = 0$ is the negative MSE of the transferred target model $(w^*, k^*)$ on $\mathcal{D}_t$. If $k^*$ has more than one layer with a non-linear activation, $\mathcal{T}^{\mathrm{lin}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)$ can be regarded as using a regularized linear model to approximate this non-linear head.
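The closed-form computation mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `linear_mse_score` is hypothetical, the features are assumed to be already extracted into an array, and the unpenalized bias is eliminated by centering before solving the normal equations.

```python
import numpy as np


def linear_mse_score(features, targets, lam=1.0):
    """Negative regularized MSE of a Ridge head fitted on extracted features.

    features: (n_t, d_r) array standing in for w*(x_i^t) (assumed precomputed).
    targets:  (n_t, d_t) array of target labels y_i^t.
    """
    F = np.asarray(features, dtype=float)
    Y = np.asarray(targets, dtype=float)
    n, d_r = F.shape

    # The bias b is not penalized, so it can be eliminated by centering.
    Fc = F - F.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Normal equations of the centered problem:
    #   (Fc^T Fc + n * lam * I) W = Fc^T Yc,
    # where W has shape (d_r, d_t) and predictions are F @ W + b.
    gram = Fc.T @ Fc + n * lam * np.eye(d_r)
    # lstsq gives a minimum-norm solution if gram is singular (possible when lam = 0).
    W, *_ = np.linalg.lstsq(gram, Fc.T @ Yc, rcond=None)

    residual = Yc - Fc @ W
    objective = (residual ** 2).sum() / n + lam * (W ** 2).sum()
    return -float(objective)
```

Calling `linear_mse_score(features, targets, lam=0.0)` gives the unregularized variant, and larger scores (closer to zero) indicate a better linear fit. Solving a $d_r \times d_r$ system is the dominant cost, which is why high-dimensional features make this estimator comparatively expensive.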

4.2 Label MSE estimator

Although the Linear MSE transferability score above can be computed efficiently, this computation may still be relatively expensive if the feature vectors $w^*(x^t_i)$ are high-dimensional. To further reduce the costs, we propose another transferability estimator, Label MSE, which replaces $w^*(x^t_i)$ by the “dummy” source label $z_i = h^*(w^*(x^t_i))$. Using dummy labels from the pre-trained source model $(w^*, h^*)$ is a technique previously used to compute the LEEP transferability score for classification [Nguyen et al., 2020]. We define our Label MSE estimator below.

Definition 4.2.

The Label MSE transferability estimator with a regularization parameter $\lambda \geq 0$ is:

$$\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t) := -\min_{A,b} \Big\{ \frac{1}{n_t} \sum_{i=1}^{n_t} \|y^t_i - A z_i - b\|^2 + \lambda \|A\|_F^2 \Big\},$$

where $A \in \mathbb{R}^{d_s \times d_t}$ is a $d_s \times d_t$ real-valued matrix, $b \in \mathbb{R}^{d_t}$ is a $d_t$-dimensional real-valued vector, and $z_i = h^*(w^*(x^t_i))$.

In practice, since the size of $z_i$ is usually much smaller than that of $w^*(x^t_i)$ (i.e., $d_s \ll d_r$), computing the Label MSE is usually faster than computing the Linear MSE.
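As a hedged illustration of Definition 4.2, the sketch below computes the Label MSE from dummy labels. The names `label_mse_score` and `source_head` are hypothetical; `source_head` is assumed to be any callable that maps extracted target features to the source model's outputs. Instead of the normal equations, this version solves the Ridge problem as an augmented least-squares problem, which is an equivalent and standard formulation.

```python
import numpy as np


def label_mse_score(target_features, target_labels, source_head, lam=1.0):
    """Negative regularized MSE of a Ridge fit from dummy labels to target labels.

    target_features: (n_t, d_r) array standing in for w*(x_i^t).
    target_labels:   (n_t, d_t) array of y_i^t.
    source_head:     hypothetical callable playing the role of h*.
    """
    Z = np.asarray(source_head(target_features), dtype=float)  # dummy labels z_i
    Y = np.asarray(target_labels, dtype=float)
    n, d_s = Z.shape

    # Centering removes the unpenalized bias b.
    Zc = Z - Z.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # ||Zc W - Yc||^2 + n*lam*||W||^2 equals the squared residual of the
    # stacked system [Zc; sqrt(n*lam) I] W = [Yc; 0].
    Z_aug = np.vstack([Zc, np.sqrt(n * lam) * np.eye(d_s)])
    Y_aug = np.vstack([Yc, np.zeros((d_s, Y.shape[1]))])
    W, *_ = np.linalg.lstsq(Z_aug, Y_aug, rcond=None)

    return -float(((Z_aug @ W - Y_aug) ** 2).sum() / n)
```

The regression here involves only $d_s$ input dimensions, so the linear system is much smaller than the one over $d_r$-dimensional features whenever $d_s \ll d_r$.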

• Special case with shared inputs. When the source and target datasets have the same inputs, i.e., $\mathcal{D}_s = \{(x_i, y^s_i)\}_{i=1}^{n}$ and $\mathcal{D}_t = \{(x_i, y^t_i)\}_{i=1}^{n}$, we can compute the Label MSE even faster using only the true labels. In particular, we can consider the following version of the Label MSE.

Definition 4.3.

The Shared Inputs Label MSE transferability estimator with a regularization parameter $\lambda \geq 0$ is:

$$\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t) := -\min_{A,b} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \|y^t_i - A y^s_i - b\|^2 + \lambda \|A\|_F^2 \Big\},$$

where $A \in \mathbb{R}^{d_s \times d_t}$ and $b \in \mathbb{R}^{d_t}$.

In this definition, the Shared Inputs Label MSE is computed by training a Ridge regression model directly from the true label pairs $(y^s_i, y^t_i)$, which is less expensive than the original Label MSE since we do not need to train the source model $(w^*, h^*)$ or compute the dummy labels.
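A minimal sketch of Definition 4.3 follows, assuming the source and target labels are stored as row-aligned arrays. The function name `shared_inputs_label_mse` is hypothetical. Besides the score, it returns the squared Frobenius norm of the fitted matrix, since $\|A^*_\lambda\|_F^2$ appears in the bounds of Section 5.

```python
import numpy as np


def shared_inputs_label_mse(source_labels, target_labels, lam=1.0):
    """Return (score, ||A||_F^2) for a Ridge fit from y_i^s to y_i^t.

    source_labels: (n, d_s) array of true source labels.
    target_labels: (n, d_t) array of true target labels for the same inputs.
    """
    Ys = np.asarray(source_labels, dtype=float)
    Yt = np.asarray(target_labels, dtype=float)
    n, d_s = Ys.shape

    # Centering removes the unpenalized bias b.
    Ysc = Ys - Ys.mean(axis=0)
    Ytc = Yt - Yt.mean(axis=0)

    # Closed-form Ridge solution; no network is trained or evaluated here.
    gram = Ysc.T @ Ysc + n * lam * np.eye(d_s)
    W, *_ = np.linalg.lstsq(gram, Ysc.T @ Ytc, rcond=None)

    sq_norm = float((W ** 2).sum())
    mse = float(((Ytc - Ysc @ W) ** 2).sum() / n)
    return -(mse + lam * sq_norm), sq_norm
```

Only the two label arrays are needed, so the cost is independent of the backbone and of the image data.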

Intuitively, our estimators use a weaker version of the actual target model that helps trade off the estimators’ accuracy for computational speed. Our estimators can also be viewed as instances of the kernel Ridge regression approach [Smale and Zhou, 2007, Hastie et al., 2009]. While the Linear MSE can be interpreted as a linear approximation to $-\mathcal{L}(w^*, k^*; \mathcal{D}_t)$, properties of the Label MSE and Shared Inputs Label MSE are not well understood. In the next section, we shall prove novel theoretical properties for these estimators.
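One way to read the kernel Ridge regression remark is through the dual form with a linear kernel. The sketch below is an illustration under that assumption, not a construction taken from the paper: for $\lambda > 0$ it computes the same score from the $d \times d$ normal equations (primal) and from the $n \times n$ Gram matrix of centered inputs (dual). Both function names are hypothetical.

```python
import numpy as np


def primal_ridge_score(Z, Y, lam):
    """Score from the (d x d) normal equations; requires lam > 0 or full rank."""
    n, d = Z.shape
    Zc, Yc = Z - Z.mean(axis=0), Y - Y.mean(axis=0)
    W = np.linalg.solve(Zc.T @ Zc + n * lam * np.eye(d), Zc.T @ Yc)
    return -float(((Yc - Zc @ W) ** 2).sum() / n + lam * (W ** 2).sum())


def dual_ridge_score(Z, Y, lam):
    """Same score from the (n x n) linear-kernel Gram matrix; requires lam > 0."""
    n = Z.shape[0]
    Zc, Yc = Z - Z.mean(axis=0), Y - Y.mean(axis=0)
    K = Zc @ Zc.T  # linear kernel on centered inputs
    alpha = np.linalg.solve(K + n * lam * np.eye(n), Yc)
    fitted = K @ alpha
    # trace(alpha^T K alpha) equals ||W||_F^2 for W = Zc^T alpha.
    penalty = lam * (alpha * fitted).sum()
    return -float(((Yc - fitted) ** 2).sum() / n + penalty)
```

The primal form is cheaper when the input dimension is smaller than the number of samples, and the dual form is cheaper in the opposite regime.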

5 Theoretical properties

We now prove some theoretical properties for the Label MSE with ReLU feed-forward neural networks. These properties are in the form of generalization bounds relating $\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)$ with the transferability $\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t)$. Throughout this section, we assume the space of all target regression heads $k$, which may have more than one layer, is a superset of all the linear regression models. This assumption is generally true for ReLU networks [Arora et al., 2018].

First, we show in Lemma 5.1 below a relationship between the negative MSE loss $-\mathcal{L}(w^*, k^*; \mathcal{D}_t)$ of $(w^*, k^*)$ and the Label MSE. This lemma states that the negative MSE loss $-\mathcal{L}(w^*, k^*; \mathcal{D}_t)$ upper bounds the Label MSE. The proof for this lemma is in Appendix A.1.

Lemma 5.1.

For any $\lambda \geq 0$, we have: $\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t) \leq -\mathcal{L}(w^*, k^*; \mathcal{D}_t)$.

Using this lemma, we can prove our main theoretical result in Theorem 5.2 below. In this theorem, $L$ is the number of layers of the ReLU feed-forward neural network $(w^*, k^*)$, and we assume the number of hidden nodes and parameters in each layer are upper bounded by $H$ and $M \geq 1$ respectively. Without loss of generality, we also assume all input and output data are upper bounded by $1$ in $\ell_\infty$-norm. This assumption can easily be satisfied by a pre-processing step that scales them to $[0, 1]$ in $\ell_\infty$-norm.
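The pre-processing step can be sketched as follows, assuming a single global scale factor is acceptable (the text does not specify how the scaling is done). The helper name `scale_to_unit_linf` is hypothetical.

```python
import numpy as np


def scale_to_unit_linf(data):
    """Divide by the largest absolute entry so every row has l_inf norm <= 1."""
    data = np.asarray(data, dtype=float)
    max_abs = np.abs(data).max()
    # An all-zero array already satisfies the bound; avoid dividing by zero.
    return data if max_abs == 0 else data / max_abs
```

Because one factor is applied to the whole array, relative magnitudes between samples and coordinates are preserved.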

Theorem 5.2.

For any source dataset $\mathcal{D}_s$, $\lambda \geq 0$ and $\delta > 0$, with probability at least $1 - \delta$ over the randomness of $\mathcal{D}_t$, we have:

$$\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t) \geq \mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t) - C(d, d_t, M, H, L, \delta) / \sqrt{n_t},$$

where $C(d, d_t, M, H, L, \delta) = 16 M^{2L+2} H^{2L} \big[ d_t^2 d \sqrt{L + 1 + \ln d} + d_t d^2 \sqrt{2 \ln(4/\delta)} \big]$.

The proof for this theorem is in Appendix A.2. The theorem shows that the transferability $\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t)$ is lower bounded by the Label MSE $\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)$ minus a complexity term $C(d, d_t, M, H, L, \delta) / \sqrt{n_t}$ that depends on the target dataset (specifically, the input and output dimensions, as well as the dataset size) and the architecture of the target network. When this complexity term is small (e.g., when $n_t$ is large enough), the bound in Theorem 5.2 will be tighter. In this case, a higher Label MSE score will likely lead to better transferability.
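To get a feel for the size of the complexity term, the following sketch evaluates $C(d, d_t, M, H, L, \delta)/\sqrt{n_t}$ from Theorem 5.2. It reads $d$ as the input dimension, as the discussion above suggests, and works in log space because $M^{2L+2} H^{2L}$ overflows floating point for deep networks. The function name is hypothetical.

```python
import math


def log_complexity_term(d, d_t, M, H, L, delta, n_t):
    """Natural log of C(d, d_t, M, H, L, delta) / sqrt(n_t) from Theorem 5.2.

    Assumes d >= 1, M >= 1, H >= 1, 0 < delta < 4 and n_t >= 1.
    """
    bracket = (d_t ** 2 * d * math.sqrt(L + 1 + math.log(d))
               + d_t * d ** 2 * math.sqrt(2 * math.log(4 / delta)))
    log_c = (math.log(16)
             + (2 * L + 2) * math.log(M)
             + 2 * L * math.log(H)
             + math.log(bracket))
    # Dividing by sqrt(n_t) subtracts half of log(n_t).
    return log_c - 0.5 * math.log(n_t)
```

Since the term decays as $1/\sqrt{n_t}$, quadrupling the target set size halves it, which lowers the returned value by $\ln 2$.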

• Shared inputs case. We can also derive similar bounds for the Shared Inputs Label MSE $\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)$. Denote $A^*_\lambda, b^*_\lambda := \operatorname*{argmin}_{A,b} \big\{ \frac{1}{n} \sum_{i} \|y^t_i - A y^s_i - b\|^2 + \lambda \|A\|_F^2 \big\}$. We first show the following lemma relating $\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)$ and the losses of the source and target models.

Lemma 5.3.

For any $\lambda \geq 0$, we have:

$$\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t) \leq -\mathcal{L}(w^*, k^*; \mathcal{D}_t)/2 + \|A^*_\lambda\|_F^2 \, \mathcal{L}(w^*, h^*; \mathcal{D}_s).$$

Using this lemma, we can prove the following theorem for this shared inputs setting. The proofs for these results are in Appendix A.3.

Theorem 5.4.

For any source dataset $\mathcal{D}_s$, $\lambda \geq 0$ and $\delta > 0$, with probability at least $1 - \delta$ over the randomness of $\mathcal{D}_t$, we have:

$$\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t) \geq 2 \widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t) - 2 \|A^*_\lambda\|_F^2 \, \mathcal{L}(w^*, h^*; \mathcal{D}_s) - C(d, d_t, M, H, L, \delta) / \sqrt{n}.$$

From the theorem, $\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)$ can indirectly tell us information about the transferability $\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t)$ without actually training $w^*$, $h^*$, and $k^*$. This bound becomes tighter when $n$ is large or $\mathcal{L}(w^*, h^*; \mathcal{D}_s)$ is small (e.g., when the source model is expressive enough to fit the source data). An experiment to investigate the usefulness of our theoretical bounds in this section is available in Appendix A.4.

6 Experiments

In this section, we conduct experiments to evaluate our approaches on the keypoint (or landmark) regression tasks using the following two large-scale public datasets:

• CUB-200-2011 [Wah et al., 2011]. This dataset contains 11,788 bird images with 15 labeled keypoints indicating 15 different parts of a bird body. We use 9,788 images for training and 2,000 images for testing. Since the annotations for occluded keypoints are highly inaccurate, we remove all occluded keypoints during training for both source and target tasks.

• OpenMonkey [Yao et al., 2021]. This is a benchmark for the non-human pose tracking problem. It offers over 100,000 monkey images in natural contexts, annotated with 17 body landmarks. We use the original train-test split, which contains 66,917 training images and 22,306 testing images.

In our experiments, we use ResNet34 [He et al., 2016] as the backbone since it provides good performance as a source model. Following previous work [Tran et al., 2019, Nguyen et al., 2020, Huang et al., 2022, Nguyen et al., 2022], we investigate how well our transferability estimators correlate (using Pearson correlation) with the negative test MSE of the target model obtained from actual transfer learning. This correlation analysis is a good method to measure how well transferability estimators satisfy our Definition 3.2. In Tables C.1 and C.2 in Appendix C.2, we provide additional results for other non-linear correlation measures, including Kendall’s $\tau$ and Spearman correlations. The conclusions in our paper remain the same when comparing these correlations.
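The correlation analysis can be sketched with plain NumPy as follows. This is an illustrative implementation, not the evaluation code used in the paper: the function names are hypothetical, the Spearman version assumes there are no ties, and the Kendall version is the simple tau-a variant computed over all pairs.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation between estimator scores and negative test MSEs."""
    return float(np.corrcoef(a, b)[0, 1])


def spearman(a, b):
    """Pearson correlation of the ranks (assumes no tied values)."""
    rank = lambda v: np.argsort(np.argsort(np.asarray(v, dtype=float)))
    return pearson(rank(a), rank(b))


def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sign_a = np.sign(a[:, None] - a[None, :])
    sign_b = np.sign(b[:, None] - b[None, :])
    n = len(a)
    # Every unordered pair is counted twice in the sum, hence n * (n - 1).
    return float((sign_a * sign_b).sum() / (n * (n - 1)))
```

Each function takes one transferability score and one negative test MSE per source-target pair; values near 1 mean the estimator ranks the pairs in nearly the same order as actual transfer performance.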

We consider three standard transfer learning algorithms: (1) head re-training [Donahue et al., 2014, Sharif Razavian et al., 2014]: We fix all layers of the source model up until the penultimate layer and re-train the last fully-connected (FC) layer using the target training set; (2) half fine-tuning [Donahue et al., 2014, Sharif Razavian et al., 2014]: We fine-tune the last convolutional block and all the FC layers of the source model, while keeping all other layers fixed; and (3) full fine-tuning [Agrawal et al., 2014, Girshick et al., 2014]: We fine-tune the whole source model using the target training set. Among these settings, head re-training resembles the transfer scenario in Section 3.1, while half and full fine-tuning are more commonly used in practice. For half fine-tuning, around half of the parameters in the network will be fine-tuned ($\sim$13M parameters). More details of our experiment settings are in Appendix B.1.
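The figure of roughly 13M fine-tuned parameters can be sanity-checked with a short calculation. The sketch below counts the parameters of the last convolutional block, assuming the standard ResNet34 layout (three basic blocks with 512 channels, bias-free convolutions, and a 1x1 projection shortcut in the first block). The FC layers are left out because their sizes are not given in this section; the helper names are illustrative.

```python
def conv_params(c_in, c_out, k):
    """Number of weights of a k x k convolution without bias."""
    return c_in * c_out * k * k

def bn_params(c):
    """Learnable scale and shift of a batch-norm layer."""
    return 2 * c

# First basic block: 256 -> 512 channels, plus a 1x1 projection shortcut.
first_block = (
    conv_params(256, 512, 3) + bn_params(512)
    + conv_params(512, 512, 3) + bn_params(512)
    + conv_params(256, 512, 1) + bn_params(512)  # projection shortcut
)
# Each remaining basic block: two 3x3 convolutions, each followed by batch norm.
other_block = 2 * (conv_params(512, 512, 3) + bn_params(512))

last_block_total = first_block + 2 * other_block
print(last_block_total)  # 13114368, i.e. about 13.1M
```

Under these assumptions the last convolutional block alone accounts for about 13.1M parameters, in line with the count quoted above.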

We compare our transferability estimators, Linear MSE and Label MSE, with two recent SotA baselines for regression: LogME [You et al., 2021] and TransRate [Huang et al., 2022]. For our methods, we consider $\lambda=0$ (named LinMSE0 and LabMSE0) for the estimators without regularization, and $\lambda=1$ (named LinMSE1 and LabMSE1) for the estimators with the default $\lambda$ value. The effects of $\lambda$ on our algorithms are investigated in Section 6.6.
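For concreteness, a minimal sketch of such an estimator is given below. It assumes the score is the negative of the minimum of the regularized objective $\frac{1}{n}\sum_i \|y_i - Az_i - b\|^2 + \lambda\|A\|_F^2$ stated in Appendix A.1, with the bias left unpenalized. The function name, the synthetic data, and the augmented least-squares solver are illustrative choices, not the authors' implementation.

```python
import numpy as np

def neg_regularized_mse(Z, Y, lam=1.0):
    """Negative of min_{A,b} (1/n) sum_i ||y_i - A z_i - b||^2 + lam * ||A||_F^2.

    Z: (n, d) inputs of the linear model (extracted features or dummy labels).
    Y: (n, k) target labels.
    """
    Z = np.asarray(Z, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, d = Z.shape
    # The bias b is not penalized, so it is absorbed by centering.
    Zc = Z - Z.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Ridge regression as an augmented least-squares problem:
    # ||Yc - Zc W||^2 + n * lam * ||W||^2, with W = A^T.
    Z_aug = np.vstack([Zc, np.sqrt(n * lam) * np.eye(d)])
    Y_aug = np.vstack([Yc, np.zeros((d, Y.shape[1]))])
    W, *_ = np.linalg.lstsq(Z_aug, Y_aug, rcond=None)
    mse = np.sum((Yc - Zc @ W) ** 2) / n
    return -(mse + lam * np.sum(W ** 2))

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 16))                    # hypothetical features
A_true = rng.normal(size=(16, 2))
Y = Z @ A_true + 0.1 * rng.normal(size=(200, 2))  # 2-D keypoint-like labels

lin_mse0 = neg_regularized_mse(Z, Y, lam=0.0)  # no regularization
lin_mse1 = neg_regularized_mse(Z, Y, lam=1.0)  # default lambda
print(lin_mse0, lin_mse1)
```

Passing dummy labels instead of extracted features as `Z` would give the label-based variant under the same assumptions. Scores are never positive, and a larger $\lambda$ can only lower the score for the same data.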

For the baselines, besides the usual versions (LogME and TransRate) that are computed from the extracted features and the target labels, we also consider the versions where they are computed from the dummy labels and the target labels (named LabLogME and LabTransRate). As in previous work [Huang et al., 2022], we divide the target label values into equal-sized bins (five bins in our case) to compute TransRate and LabTransRate.

6.1 General transfer between two different domains

Table 1: Correlation coefficients when transferring from OpenMonkey to CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. Detailed correlation plots are in Appendix C.2. Our estimators improve up to 25.9% in comparison with SotA (LogME) while being 12.9% better on average.

| Transfer setting | Label-based method | | | | Feature-based method | | | |
|---|---|---|---|---|---|---|---|---|
| | LabLogME | LabTransRate | LabMSE0 | LabMSE1 | LogME | TransRate | LinMSE0 | LinMSE1 |
| Head re-training | 0.824 | 0.165 | 0.991 | 0.995* | 0.969 | 0.121 | 0.982 | 0.995* |
| Half fine-tuning | 0.706 | 0.392 | 0.881 | 0.885* | 0.870 | 0.304 | 0.866 | 0.885* |
| Full fine-tuning | 0.691 | 0.410 | 0.870* | 0.869 | 0.861 | 0.311 | 0.855 | 0.869* |

This experiment considers the general case where source models are trained on one dataset (OpenMonkey) and then transferred to another (CUB-200-2011). Specifically, we train a source model for each of the 17 keypoints of the OpenMonkey dataset and transfer them to each of the 15 keypoints of the CUB-200-2011 dataset, resulting in a total of 255 final models. Since each keypoint consists of x and y positions, all source and target tasks in this experiment have two-dimensional labels. The actual MSEs of these models are computed on the respective test sets and then used to calculate the Pearson correlation coefficients with the transferability estimators. In this experiment, LabMSE0, LabMSE1, LabLogME, and LabTransRate are computed from the dummy source labels and the actual target labels.

Results for this experiment are in Table 1. In this setting, TransRate and LabTransRate perform poorly, while our methods are equal to or better than LogME and LabLogME in most cases, especially when using $\lambda=1$ (LinMSE1) or dummy labels (LabMSE0 and LabMSE1). The results show that our approaches improve by up to 25.9% in comparison with SotA (LogME) while being 12.9% better on average.
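The two summary percentages can be reproduced from the entries of Table 1 under one plausible reading: for each transfer setting and each method family, compare the better of our two estimators with the corresponding LogME variant, then take the maximum and the mean of the relative gains. This reading is an assumption, not a procedure stated in the text.

```python
# Pearson correlations from Table 1:
# (LogME-type baseline, best of our two estimators in the same family).
pairs = {
    ("head re-training", "label"):   (0.824, 0.995),
    ("half fine-tuning", "label"):   (0.706, 0.885),
    ("full fine-tuning", "label"):   (0.691, 0.870),
    ("head re-training", "feature"): (0.969, 0.995),
    ("half fine-tuning", "feature"): (0.870, 0.885),
    ("full fine-tuning", "feature"): (0.861, 0.869),
}

# Relative gain in percent over the baseline for each (setting, family) pair.
gains = {key: 100.0 * (ours - base) / base for key, (base, ours) in pairs.items()}
max_gain = max(gains.values())
mean_gain = sum(gains.values()) / len(gains)
print(round(max_gain, 1), round(mean_gain, 1))  # 25.9 12.9
```

The largest gain comes from the label-based estimators under full fine-tuning (0.870 vs. 0.691).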

It is interesting to observe that LabMSE0 and LabMSE1 provide competitive or even better correlations than LinMSE0 and LinMSE1 in this experiment. This shows that the dummy labels (i.e., body parts of monkeys) can provide as much information about the target labels (i.e., body parts of birds) as the extracted features.

In the Appendix C.2, we also report additional results where both source and target tasks have 10-dimensional labels (i.e., each task predicts 5 keypoints simultaneously). We also achieve better correlations than the baselines in this case.

6.2 Transfer with shared-inputs tasks

Table 2: Correlation coefficients when transferring between tasks with shared inputs. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. Detailed correlation plots are in Appendix C.3. Our estimators improve up to 113% in comparison with SotA (LogME) while being 36.6% better on average.

| Dataset | Transfer setting | Label-based method | | | | Feature-based method | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | LabLogME | LabTransRate | LabMSE0 | LabMSE1 | LogME | TransRate | LinMSE0 | LinMSE1 |
| CUB-200-2011 | Head re-training | 0.547 | 0.019 | 0.916 | 0.946* | 0.890 | 0.029 | 0.921 | 0.960* |
| CUB-200-2011 | Half fine-tuning | 0.401 | 0.006 | 0.536 | 0.565* | 0.560 | 0.064 | 0.628* | 0.619 |
| CUB-200-2011 | Full fine-tuning | 0.128* | 0.041 | 0.056 | 0.057 | 0.100 | 0.109* | 0.097 | 0.082 |
| OpenMonkey | Head re-training | 0.890 | 0.666 | 0.973* | 0.773 | 0.695 | 0.711 | 0.946 | 0.975* |
| OpenMonkey | Half fine-tuning | 0.615 | 0.340 | 0.754 | 0.890* | 0.446 | 0.488 | 0.899* | 0.801 |
| OpenMonkey | Full fine-tuning | 0.569 | 0.269 | 0.705 | 0.882* | 0.403 | 0.439 | 0.859* | 0.761 |

In this experiment, we consider the setting where the source and target tasks have the same inputs (the special setting in Section 4.2). Since images in our datasets contain multiple labels (15 keypoints for CUB-200-2011 and 17 keypoints for OpenMonkey), we can use any two different keypoints on the same dataset as source and target tasks. In total, we construct 210 source-target pairs for CUB-200-2011 and 272 pairs for OpenMonkey that all have the same source and target inputs but different labels. The labels for all tasks are also two-dimensional real values.
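The pair counts quoted in this section and in Section 6.1 follow from simple enumeration, as the short sketch below shows. It assumes that source-target pairs are ordered and that a keypoint is never paired with itself within a dataset; the variable names are illustrative.

```python
from itertools import permutations, product

n_monkey, n_cub = 17, 15  # keypoints per dataset

# Section 6.1: every OpenMonkey keypoint is transferred to every CUB keypoint.
cross_pairs = list(product(range(n_monkey), range(n_cub)))

# Section 6.2: ordered pairs of two different keypoints within one dataset.
cub_pairs = list(permutations(range(n_cub), 2))
monkey_pairs = list(permutations(range(n_monkey), 2))

print(len(cross_pairs), len(cub_pairs), len(monkey_pairs))  # 255 210 272
```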

We repeat the experiment in Section 6.1 with these source-target pairs for CUB-200-2011 and OpenMonkey separately. The main difference in this experiment is that we use the true source labels (instead of dummy labels) when computing LabLogME, LabTransRate, LabMSE0, and LabMSE1. Under this setting, the LabMSE estimators here are the Shared Inputs Label MSE estimators in Definition 4.3. These estimators can be computed without any source models, and thus incur very low computational costs in this setting.

Results for these experiments are in Table 2. In the results, both versions of TransRate perform poorly on CUB-200-2011, while TransRate is slightly better than LogME on OpenMonkey. In most settings, LabMSE0 and LabMSE1 both outperform LabLogME and LabTransRate, while LinMSE0 and LinMSE1 both outperform LogME and TransRate. In the setting where we transfer by full fine-tuning on the CUB-200-2011 dataset, all methods perform poorly. From these results, our approaches improve up to 113% in comparison with SotA (LogME) while being 36.6% better on average.

We also report in the Appendix C.3 additional results for each individual source task. The results show that our methods are consistently better than LogME, LabLogME, TransRate, and LabTransRate for most source tasks on both datasets. Furthermore, our methods are also better than these baselines when transferring to higher dimensional target tasks (tasks that predict 5 keypoints simultaneously and have 10-dimensional labels). These additional results further confirm the effectiveness of our approaches.

6.3 Evaluations on small target sets

Figure 1: Correlation coefficients with small target training sets on CUB-200-2011 (left) and OpenMonkey (right). LinMSE1 and LogME are designed to avoid overfitting, but LinMSE1 is better than LogME in both datasets.

Figure 2: Average running time (in milliseconds) for the experiments in Sections 6.2 (left) and 6.3 (right).

Figure 3: Test MSEs vs. transferability scores when transferring from pre-trained classification models to a target regression task. Panels: (a) LogME, (b) TransRate, (c) LinMSE0, (d) LinMSE1. The $x$-axis represents the transferability scores. A linear regression model (dashed line) is fitted to the points in each plot. Our methods give better fits than the baselines.

In many real-world transfer learning scenarios, the target set is usually small. This experiment will evaluate the effectiveness of the feature-based transferability estimators (LogME, TransRate, LinMSE0, and LinMSE1) in this small data regime where the number of samples is smaller than the feature dimension. For this experiment, we fix a source task (Belly for CUB-200-2011 and Right eye for OpenMonkey) and transfer to all other tasks in the corresponding dataset using head re-training. These source tasks are chosen since they have fewer missing labels and thus can be used to train reasonably good source models for transfer learning. For each target task, instead of using the full data, we randomly select a small subset of 100 to 400 images to perform transfer learning and to compute the transferability scores. The actual MSEs of the transferred models are still computed using the full target test sets.

Figure 1 compares the correlations of the 4 methods on different target set sizes between 100 and 400. The results are averaged over 10 runs with 10 different random seeds. From the figure, LogME and LinMSE1 are better than TransRate and LinMSE0. This is expected since LogME and LinMSE1 are designed to avoid overfitting on small data. Both LogME and LinMSE1 are also more stable, but LinMSE1 is slightly better than LogME on all dataset sizes.
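To see why an unregularized estimator can be unreliable when the number of samples is smaller than the feature dimension, consider the sketch below. It fits random features to labels that are unrelated to them, using the ridge objective from Appendix A.1. With $\lambda=0$ the linear model interpolates the labels, so the training MSE is essentially zero whatever the labels are, whereas with $\lambda=1$ the regularized objective stays clearly positive. The sizes, the data, and the helper name are illustrative assumptions.

```python
import numpy as np

def regularized_mse(Z, Y, lam):
    """Return (training MSE, regularized objective) of the ridge fit."""
    n, d = Z.shape
    Zc, Yc = Z - Z.mean(axis=0), Y - Y.mean(axis=0)  # unpenalized bias
    Z_aug = np.vstack([Zc, np.sqrt(n * lam) * np.eye(d)])
    Y_aug = np.vstack([Yc, np.zeros((d, Y.shape[1]))])
    W, *_ = np.linalg.lstsq(Z_aug, Y_aug, rcond=None)
    mse = np.sum((Yc - Zc @ W) ** 2) / n
    return mse, mse + lam * np.sum(W ** 2)

rng = np.random.default_rng(0)
n, d = 100, 512              # fewer samples than feature dimensions
Z = rng.normal(size=(n, d))  # hypothetical extracted features
Y = rng.normal(size=(n, 2))  # labels unrelated to the features

mse0, obj0 = regularized_mse(Z, Y, lam=0.0)
mse1, obj1 = regularized_mse(Z, Y, lam=1.0)
print(mse0, obj0)  # both near zero: the unregularized fit interpolates noise
print(mse1, obj1)  # objective clearly positive: the penalty prevents a free fit
```

In this toy case a $\lambda=0$ score would be close to its maximum of zero even though the features carry no information about the labels.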

6.4 Efficiency of our estimators

One of the main strengths of our methods is their efficiency due to the simplicity of training the Ridge regression head. In this experiment, we first use the settings in Section 6.2 to compare the running time of our methods with that of the baselines on the CUB-200-2011 dataset. Figure 2 (left) reports the results (averaged over 5 runs with different random seeds) for this experiment. From these results, our methods, LabMSE0, LabMSE1, LinMSE0, and LinMSE1, are all faster than the corresponding label-based or feature-based baselines. The figure also shows that LabMSE1 and LinMSE1 achieve the best running time among the label-based and feature-based methods respectively.

In Figure 2 (right), we also compare the average running time of the 4 transferability estimators using the CUB-200-2011 experiment in Section 6.3. This figure clearly shows that our methods, LinMSE0 and LinMSE1, are more computationally efficient than LogME and TransRate. Both results in Figure 2 show that LinMSE1 and LabMSE1 are significantly faster than other corresponding feature-based and label-based methods. In these experiments, LinMSE1 and LabMSE1 converge faster than LinMSE0 and LabMSE0 respectively, and thus are more efficient.

6.5 Source task selection

Source task selection is important for applying transfer learning since the right source task can improve transfer learning performance [Nguyen et al., 2020]. In this experiment, we examine the application of our transferability estimation methods for selecting source tasks on the CUB-200-2011 dataset. We use the head re-training setting similar to Section 6.2, but fix one of the tasks as the target and choose the best source task from the rest of the task pool. We repeat this process for all 15 target tasks and measure the top-$k$ matching rate of each transferability estimator.

The top-$k$ matching rate is defined as $m_{\text{match}}/m_{\text{target}}$, where $m_{\text{target}}$ is the total number of target tasks (15 in our case), and $m_{\text{match}}$ is the number of times the selected source task gives a target model within the best $k$ models. Here the best $k$ models are determined by the actual test MSE on the target task.
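The definition above can be sketched in a few lines. The snippet assumes that, for each target, the estimator selects the candidate source with the highest transferability score, and that the candidate sources for each target are stored along the columns of two arrays. The array layout, the function name, and the toy values are illustrative assumptions.

```python
import numpy as np

def top_k_matching_rate(scores, test_mse, k):
    """Fraction of targets whose selected source is among the k best sources.

    scores[t, s]:   transferability estimate of candidate source s for target t.
    test_mse[t, s]: actual test MSE after transferring source s to target t.
    """
    scores = np.asarray(scores, dtype=float)
    test_mse = np.asarray(test_mse, dtype=float)
    m_target = scores.shape[0]
    m_match = 0
    for t in range(m_target):
        selected = int(np.argmax(scores[t]))  # source picked by the estimator
        best_k = np.argsort(test_mse[t])[:k]  # k sources with the lowest test MSE
        m_match += int(selected in best_k)
    return m_match / m_target

# Hypothetical toy data: 3 targets, 4 candidate sources each.
scores = np.array([[0.9, 0.1, 0.3, 0.2],
                   [0.2, 0.8, 0.1, 0.4],
                   [0.5, 0.4, 0.3, 0.6]])
test_mse = np.array([[0.10, 0.50, 0.30, 0.40],
                     [0.30, 0.25, 0.20, 0.60],
                     [0.20, 0.30, 0.40, 0.50]])

rate_top1 = top_k_matching_rate(scores, test_mse, k=1)
rate_top2 = top_k_matching_rate(scores, test_mse, k=2)
print(rate_top1, rate_top2)
```

In the toy data only the first target is matched at $k=1$; the second target is also matched at $k=2$, and the third target is never matched because the selected source has the highest test MSE.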

Results for this experiment are in Table 3. From the results, our methods are better than the baselines in terms of top-3 and top-5 matching rates. When comparing top-1 matching rates, our methods are competitive with LogME and LabLogME for the feature-based and label-based approaches respectively. This experiment shows that our transferability estimators are useful for source task selection.

Table 3: Top-$k$ matching rates for source task selection on CUB-200-2011. Bold numbers indicate best results in each column. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods.

| $k$ | Label-based method | | | | Feature-based method | | | |
|---|---|---|---|---|---|---|---|---|
| | LabLogME | LabTransRate | LabMSE0 | LabMSE1 | LogME | TransRate | LinMSE0 | LinMSE1 |
| 1 | 6/15* | 4/15 | 6/15* | 2/15 | 11/15* | 2/15 | 9/15 | 10/15 |
| 3 | 9/15 | 9/15 | 10/15* | 9/15 | 12/15 | 6/15 | 12/15 | 13/15* |
| 5 | 10/15 | 12/15 | 14/15* | 14/15* | 12/15 | 6/15 | 12/15 | 13/15* |

6.6 Effects of $\lambda$

Table 4: Correlation coefficients for different values of $\lambda$ on CUB-200-2011. Bold numbers indicate best results in each column. Results of the baselines are given in the last 2 rows for comparison. When there are meaningful correlations (head re-training and half fine-tuning), our methods are better than the corresponding baselines for all $\lambda$ values.

| $\lambda$ | Head re-training | | Half fine-tuning | | Full fine-tuning | |
|---|---|---|---|---|---|---|
| | LabMSE | LinMSE | LabMSE | LinMSE | LabMSE | LinMSE |
| 0 | 0.916 | 0.921 | 0.536 | 0.628 | 0.056 | 0.097 |
| 0.001 | 0.921 | 0.933 | 0.562 | 0.645 | 0.051 | 0.091 |
| 0.01 | 0.922 | 0.943 | 0.560 | 0.643 | 0.048 | 0.089 |
| 0.1 | 0.935 | 0.954 | 0.552 | 0.639 | 0.043 | 0.089 |
| 0.5 | 0.945 | 0.960 | 0.562 | 0.629 | 0.053 | 0.085 |
| 1 | 0.946 | 0.960 | 0.565 | 0.619 | 0.057 | 0.082 |
| 2 | 0.945 | 0.958 | 0.567 | 0.607 | 0.059 | 0.077 |
| 5 | 0.945 | 0.954 | 0.568 | 0.594 | 0.061 | 0.072 |
| 10 | 0.945 | 0.951 | 0.568 | 0.586 | 0.061 | 0.069 |
| 15 | 0.945 | 0.950 | 0.568 | 0.582 | 0.061 | 0.067 |
| 20 | 0.945 | 0.949 | 0.568 | 0.580 | 0.061 | 0.066 |
| (Lab)LogME | 0.547 | 0.889 | 0.400 | 0.560 | 0.120 | 0.099 |
| (Lab)TransRate | 0.008 | 0.029 | 0.006 | 0.006 | 0.001 | 0.100 |

In this experiment, we investigate the effects of $\lambda$ on our proposed transferability estimators. We use the setting in Section 6.2 with the CUB-200-2011 dataset and vary the value of $\lambda$ in [0, 20] for both LabMSE and LinMSE. Table 4 reports the results for all three transfer learning settings.

For head re-training, we observe that the best correlations are achieved at $\lambda=1$ for both LabMSE and LinMSE. For half fine-tuning, $\lambda\geq 5$ gives the best result for LabMSE, while $\lambda=0.001$ gives the best result for LinMSE. For full fine-tuning, we do not observe significant correlations for either transferability estimator.

Notably, from the results in Table 4 for the head re-training and half fine-tuning settings (where we have significant correlations for at least one transferability estimator), LabMSE with any tested $\lambda$ value in [0, 20] is better than LabLogME and LabTransRate, while LinMSE with any tested $\lambda$ value in this range is better than LogME and TransRate. These results show that our methods are better than the baselines for a wide range of $\lambda$ values.
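These observations can be checked directly against the entries of Table 4. The sketch below copies the head re-training and half fine-tuning columns and verifies that the worst correlation over all tested $\lambda$ values still exceeds the best baseline in the same column. The dictionary layout is an illustrative choice.

```python
lams = [0, 0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10, 15, 20]

# Correlations copied from Table 4 (CUB-200-2011), one list entry per lambda.
ours = {
    ("head", "LabMSE"): [0.916, 0.921, 0.922, 0.935, 0.945, 0.946, 0.945, 0.945, 0.945, 0.945, 0.945],
    ("head", "LinMSE"): [0.921, 0.933, 0.943, 0.954, 0.960, 0.960, 0.958, 0.954, 0.951, 0.950, 0.949],
    ("half", "LabMSE"): [0.536, 0.562, 0.560, 0.552, 0.562, 0.565, 0.567, 0.568, 0.568, 0.568, 0.568],
    ("half", "LinMSE"): [0.628, 0.645, 0.643, 0.639, 0.629, 0.619, 0.607, 0.594, 0.586, 0.582, 0.580],
}
# Best baseline per column: max of the (Lab)LogME and (Lab)TransRate rows.
baseline = {
    ("head", "LabMSE"): max(0.547, 0.008),
    ("head", "LinMSE"): max(0.889, 0.029),
    ("half", "LabMSE"): max(0.400, 0.006),
    ("half", "LinMSE"): max(0.560, 0.006),
}

# True if even the worst lambda beats the best baseline in that column.
always_better = {key: min(vals) > baseline[key] for key, vals in ours.items()}
# First lambda attaining the column maximum (ties resolved to the smaller value).
best_lam = {key: lams[vals.index(max(vals))] for key, vals in ours.items()}
print(always_better)
print(best_lam)
```

Some maxima are shared by several $\lambda$ values (for instance, 0.960 for LinMSE under head re-training at both $\lambda=0.5$ and $\lambda=1$); `best_lam` reports the smallest such value.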

6.7 Beyond regression

Although our paper mainly focuses on regression tasks, the main idea of using the negative regularized MSE of a Ridge regression model for transferability estimation goes beyond regression. In principle, this idea can be applied for transferring between classification tasks (in this case, we should train a linear classifier and use its regularized log-likelihood as the transferability estimator) or between a classification and a regression task.

In this section, we demonstrate that our idea can be applied for transferability estimation between a classification and a regression task. In particular, we use 8 source models pre-trained on ImageNet [Deng et al., 2009] and transfer to a target regression task on the dSprites dataset [Matthey et al., 2017] using full fine-tuning. This setting is similar to You et al. [2021] where the target is a regression task with 4-dimensional labels: x and y positions, scale, and orientation. We compute the transferability scores from the extracted features and the labels of the target training set. More details about this experiment are in Appendix B.2.

From the results in Figure 3, the trends for LogME, LinMSE0, and LinMSE1 are correct (i.e., transferability scores have negative correlations with actual MSEs), while that of TransRate is incorrect. Note that there is a discrepancy between the ranges of the transferability and the transferred MSE for two reasons: (1) the transferability estimators are computed from the target training set, while the transferred MSEs are computed from the target test set, and (2) there is a mismatch between the source task (ImageNet classification) and the target task (dSprites shape regression).

To compare the transferability estimation methods, we fit a linear regression to the points in each plot and compute its RMSE to these points, where we obtain: $6.12\times 10^{-3}$ (LogME), $6.16\times 10^{-3}$ (TransRate), $6.10\times 10^{-3}$ (LinMSE0), and $\mathbf{5.46}\times 10^{-3}$ (LinMSE1). These results show that LinMSE0 and LinMSE1 are better than LogME and TransRate.
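The comparison above amounts to fitting a least-squares line to the (score, test MSE) points of each estimator and measuring the RMSE of the residuals. The following is a minimal sketch of that computation; the function name and the data are hypothetical and do not reproduce the values reported above.

```python
import numpy as np

def linear_fit_rmse(scores, test_mse):
    """Fit test_mse ~ slope * score + intercept; return (RMSE, slope)."""
    scores = np.asarray(scores, dtype=float)
    test_mse = np.asarray(test_mse, dtype=float)
    slope, intercept = np.polyfit(scores, test_mse, deg=1)
    residuals = test_mse - (slope * scores + intercept)
    return float(np.sqrt(np.mean(residuals ** 2))), float(slope)

# Hypothetical transferability scores and test MSEs for 8 source models.
scores = np.array([-1.2, -1.0, -0.9, -0.7, -0.6, -0.5, -0.3, -0.2])
test_mse = np.array([0.050, 0.046, 0.047, 0.041, 0.040, 0.036, 0.034, 0.030])

rmse, slope = linear_fit_rmse(scores, test_mse)
print(rmse, slope)  # a negative slope means higher scores go with lower MSE
```

A smaller RMSE indicates that the test MSE is closer to a linear function of the score, and a negative slope corresponds to the correct trend described above.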

7 Conclusion

We formulated transferability estimation for regression tasks and proposed the Linear MSE and Label MSE estimators, two simple but effective approaches for this problem. We proved novel theoretical results for these estimators, showing their relationship with the actual task transferability. Our extensive experiments demonstrated that the proposed approaches are superior to recent, relevant SotA methods in terms of efficiency and effectiveness. Our proposed ideas can also be extended to mixed cases where one of the tasks is a classification problem.

Acknowledgements.
LSTH was supported by the Canada Research Chairs program, the NSERC Discovery Grant RGPIN-2018-05447, and the NSERC Discovery Launch Supplement DGECR-2018-00181. VD was supported by the University of Delaware Research Foundation (UDRF) Strategic Initiatives Grant, and the National Science Foundation Grant DMS-1951474.

References

  • Achille et al. [2019] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for meta-learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Agrawal et al. [2014] Pulkit Agrawal, Ross Girshick, and Jitendra Malik. Analyzing the performance of multilayer neural networks for object recognition. In European Conference on Computer Vision, 2014.
  • Arora et al. [2018] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.
  • Azizzadenesheli et al. [2019] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations, 2019.
  • Bao et al. [2019] Yajie Bao, Yang Li, Shao-Lun Huang, Lin Zhang, Lizhong Zheng, Amir Zamir, and Leonidas Guibas. An information-theoretic approach to transferability in task transfer learning. In IEEE International Conference on Image Processing, 2019.
  • Ben-David and Schuller [2003] Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. Learning theory and kernel machines, 2003.
  • Bishop [2006] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
  • Blitzer et al. [2007] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, 2007.
  • Bolya et al. [2021] Daniel Bolya, Rohit Mittapalli, and Judy Hoffman. Scalable diverse model selection for accessible transfer learning. In Advances in Neural Information Processing Systems, 2021.
  • Bu et al. [2021] Xingyuan Bu, Junran Peng, Junjie Yan, Tieniu Tan, and Zhaoxiang Zhang. GAIA: A transfer learning system of object detection that fits your needs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Cai et al. [2020] Enyu Cai, Sriram Baireddy, Changye Yang, Melba Crawford, and Edward J Delp. Deep transfer learning for plant center localization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.
  • Chatfield et al. [2014] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
  • Chuang et al. [2020] Ching-Yao Chuang, Antonio Torralba, and Stefanie Jegelka. Estimating generalization under distribution shifts via domain-invariant representations. In International Conference on Machine Learning, 2020.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.
  • Deng and Zheng [2021] Weijian Deng and Liang Zheng. Are labels always necessary for classifier accuracy evaluation? In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Deshpande et al. [2021] Aditya Deshpande, Alessandro Achille, Avinash Ravichandran, Hao Li, Luca Zancato, Charless Fowlkes, Rahul Bhotika, Stefano Soatto, and Pietro Perona. A linearized framework and a new benchmark for model selection for fine-tuning. arXiv:2102.00084, 2021.
  • Dhillon et al. [2020] Guneet S. Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. In International Conference on Learning Representations, 2020.
  • Doersch and Zisserman [2019] Carl Doersch and Andrew Zisserman. Sim2real transfer learning for 3D human pose estimation: Motion to the rescue. In Advances in Neural Information Processing Systems, 2019.
  • Donahue et al. [2014] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, 2014.
  • Dwivedi and Roig [2019] Kshitij Dwivedi and Gemma Roig. Representation similarity analysis for efficient task taxonomy & transfer learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Dwivedi et al. [2020] Kshitij Dwivedi, Jiahui Huang, Radoslaw Martin Cichy, and Gemma Roig. Duality diagram similarity: A generic framework for initialization selection in task transfer learning. In European Conference on Computer Vision, 2020.
  • Fard et al. [2021] Ali Pourramezan Fard, Hojjat Abdollahi, and Mohammad Mahoor. ASMNet: A lightweight deep neural network for face alignment and pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021.
  • Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014.
  • Golowich et al. [2018] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Annual Conference on Learning Theory, 2018.
  • Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The elements of statistical learning: Data mining, inference, and prediction, volume 2. Springer, 2009.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
  • Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
  • Huang et al. [2021] Jiaji Huang, Qiang Qiu, and Kenneth Church. Exploiting a zoo of checkpoints for unseen tasks. In Advances in Neural Information Processing Systems, 2021.
  • Huang et al. [2022] Long-Kai Huang, Junzhou Huang, Yu Rong, Qiang Yang, and Ying Wei. Frustratingly easy transferability estimation. In International Conference on Machine Learning, 2022.
  • Li et al. [2021] Yandong Li, Xuhui Jia, Ruoxin Sang, Yukun Zhu, Bradley Green, Liqiang Wang, and Boqing Gong. Ranking neural checkpoints. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  • Mansour et al. [2009] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Annual Conference on Learning Theory, 2009.
  • Matthey et al. [2017] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing Sprites dataset, 2017. https://github.com/deepmind/dsprites-dataset/.
  • Nguyen et al. [2022] Cuong N Nguyen, Lam Si Tung Ho, Vu Dinh, Tal Hassner, and Cuong V Nguyen. Generalization bounds for deep transfer learning using majority predictor accuracy. In International Symposium on Information Theory and Its Applications, 2022.
  • Nguyen et al. [2020] Cuong V Nguyen, Tal Hassner, Matthias Seeger, and Cedric Archambeau. LEEP: A new measure to evaluate transferability of learned representations. In International Conference on Machine Learning, 2020.
  • Oquab et al. [2014] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
  • Poster et al. [2021] Domenick D Poster, Shuowen Hu, Nathan J Short, Benjamin S Riggan, and Nasser M Nasrabadi. Visible-to-thermal transfer learning for facial landmark detection. IEEE Access, 2021.
  • Radhakrishnan et al. [2022] Adityanarayanan Radhakrishnan, Max Ruiz Luyten, Neha Prasad, and Caroline Uhler. Transfer learning with kernel methods. arXiv:2211.00227, 2022.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, 2021.
  • Razavi et al. [2019] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, 2019.
  • Schwarz et al. [2015] Max Schwarz, Hannes Schulz, and Sven Behnke. RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In IEEE International Conference on Robotics and Automation, 2015.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Sharif Razavian et al. [2014] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2014.
  • Smale and Zhou [2007] Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
  • Tan et al. [2018] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, 2018.
  • Tan et al. [2021] Yang Tan, Yang Li, and Shao-Lun Huang. OTCE: A transferability metric for cross-domain cross-task representations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Tong et al. [2021] Xinyi Tong, Xiangxiang Xu, Shao-Lun Huang, and Lizhong Zheng. A mathematical framework for quantifying transferability in multi-source transfer learning. In Advances in Neural Information Processing Systems, 2021.
  • Tran et al. [2019] Anh T Tran, Cuong V Nguyen, and Tal Hassner. Transferability and hardness of supervised classification tasks. In IEEE/CVF International Conference on Computer Vision, 2019.
  • Tripuraneni et al. [2020] Nilesh Tripuraneni, Michael Jordan, and Chi Jin. On the theory of transfer learning: The importance of task diversity. In Advances in Neural Information Processing Systems, 2020.
  • Wah et al. [2011] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report, 2011. https://authors.library.caltech.edu/27452/.
  • Wang et al. [2019] Boyu Wang, Jorge Mendez, Mingbo Cai, and Eric Eaton. Transfer learning via minimizing the performance gap between domains. In Advances in Neural Information Processing Systems, 2019.
  • Whatmough et al. [2019] Paul N Whatmough, Chuteng Zhou, Patrick Hansen, Shreyas Kolala Venkataramanaiah, Jae-sun Seo, and Matthew Mattina. FixyNN: Efficient hardware for mobile computer vision via transfer learning. In Conference on Systems and Machine Learning, 2019.
  • Yao et al. [2021] Yuan Yao, Abhiraj Mohan, Eliza Bliss-Moreau, Kristine Coleman, Sienna M Freeman, Christopher J Machado, Jessica Raper, Jan Zimmermann, Benjamin Y Hayden, and Hyun Soo Park. OpenMonkeyChallenge: Dataset and Benchmark Challenges for Pose Tracking of Non-human Primates. bioRxiv, 2021. http://openmonkeychallenge.com/.
  • You et al. [2021] Kaichao You, Yong Liu, Jianmin Wang, and Mingsheng Long. LogME: Practical assessment of pre-trained models for transfer learning. In International Conference on Machine Learning, 2021.
  • Zamir et al. [2018] Amir R. Zamir, Alexander Sax, William B. Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.

Simple Transferability Estimation for Regression Tasks
(Supplementary Material)

The contents of this supplementary include:

  1. Appendix A.1: Proof of Lemma 5.1 in the main paper.
  2. Appendix A.2: Proof of Theorem 5.2 in the main paper.
  3. Appendix A.3: Proof of Lemma 5.3 in the main paper.
  4. Appendix A.4: Proof of Theorem 5.4 in the main paper.
  5. Appendix B.1: More details for the experiment settings in Sections 6.1-6.6 of the main paper.
  6. Appendix B.2: More details for the experiment setting in Section 6.7 of the main paper.
  7. Appendix C.1: An additional experiment to show the usefulness of our theoretical bounds.
  8. Appendix C.2: Additional experiment results for Section 6.1 of the main paper.
  9. Appendix C.3: Additional experiment results for Section 6.2 of the main paper.

Appendix A Mathematical proofs

A.1 Proof of Lemma 5.1

Denote

$$
A^*, b^* = \operatorname*{argmin}_{A,b} \left\{ \frac{1}{n_t} \sum_{i=1}^{n_t} \|y^t_i - A z_i - b\|^2 + \lambda \|A\|_F^2 \right\}.
$$

For all $k$, we have:

$$
\begin{aligned}
\sqrt{\mathcal{L}(w^*, k^*; \mathcal{D}_t)} &\le \sqrt{\mathcal{L}(w^*, k; \mathcal{D}_t)} && \text{(definition of $k^*$)} \\
&= \left[ \frac{1}{n_t} \sum_{i=1}^{n_t} \|y^t_i - k(w^*(x^t_i))\|^2 \right]^{1/2} && \text{(definition of $\mathcal{L}$)} \\
&\le \left[ \frac{1}{n_t} \sum_{i=1}^{n_t} \|y^t_i - A^* z_i - b^*\|^2 \right]^{1/2} + \left[ \frac{1}{n_t} \sum_{i=1}^{n_t} \|A^* z_i + b^* - k(w^*(x^t_i))\|^2 \right]^{1/2} && \text{(triangle inequality)} \\
&\le \sqrt{-\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)} + \left[ \frac{1}{n_t} \sum_{i=1}^{n_t} \|A^* z_i + b^* - k(w^*(x^t_i))\|^2 \right]^{1/2} \\
&= \sqrt{-\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)} + \left[ \frac{1}{n_t} \sum_{i=1}^{n_t} \|A^* h^*(w^*(x^t_i)) + b^* - k(w^*(x^t_i))\|^2 \right]^{1/2}. && \text{(definition of $z_i$)}
\end{aligned}
$$

By choosing $k(\cdot) = A^* h^*(\cdot) + b^*$, the second term in the above inequality becomes 0. This implies $\sqrt{\mathcal{L}(w^*, k^*; \mathcal{D}_t)} \le \sqrt{-\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)}$ and thus the lemma.
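The inequality of Lemma 5.1 can be illustrated numerically. The sketch below is not the authors' implementation: it assumes synthetic Gaussian data in place of the dummy labels $z_i = h^*(w^*(x^t_i))$, restricts the retrained head to affine maps, and uses the hypothetical helper names `label_mse` and `retrained_affine_head_loss`. It computes the ridge minimizer $(A^*, b^*)$ in closed form (the bias is unpenalized, so centering the data removes it) and compares the loss of the best affine head with $-\mathcal{T}^{\mathrm{lab}}_{\lambda}$.

```python
import numpy as np


def label_mse(Z, Y, lam):
    """Ridge-regress targets Y (n x d_t) on dummy labels Z (n x d_s).

    Returns (T, A, b), where T = -(mean_i ||y_i - A z_i - b||^2 + lam * ||A||_F^2)
    evaluated at the minimizer (A, b). Assumes lam > 0 or a full-rank centered Z.
    """
    n, d_s = Z.shape
    z_mean, y_mean = Z.mean(axis=0), Y.mean(axis=0)
    Zc, Yc = Z - z_mean, Y - y_mean              # centering eliminates the unpenalized bias
    gram = Zc.T @ Zc / n + lam * np.eye(d_s)
    A = np.linalg.solve(gram, Zc.T @ Yc / n).T   # shape (d_t, d_s)
    b = y_mean - A @ z_mean
    mse = np.mean(np.sum((Y - Z @ A.T - b) ** 2, axis=1))
    return -(mse + lam * np.sum(A ** 2)), A, b


def retrained_affine_head_loss(Z, Y):
    """Empirical MSE of the best unregularized affine head fitted on top of Z."""
    Z1 = np.hstack([Z, np.ones((Z.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Z1, Y, rcond=None)
    return np.mean(np.sum((Y - Z1 @ W) ** 2, axis=1))


rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 5))   # synthetic stand-in for the dummy labels z_i
Y = Z @ rng.normal(size=(5, 3)) + 0.1 * rng.normal(size=(200, 3))

T, A, b = label_mse(Z, Y, lam=1.0)
head_loss = retrained_affine_head_loss(Z, Y)
print(head_loss <= -T)
```

In this restricted setting the comparison holds for the same reason as in the proof: the affine head $A^* h^*(\cdot) + b^*$ is one admissible choice of $k$, and dropping the nonnegative penalty $\lambda \|A^*\|_F^2$ can only lower the value.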

A.2 Proof of Theorem 5.2

First, we need to define the notion of expected (true) risk. Given any model $(w, k)$ for the target task, the expected risk of $(w, k)$ is defined as:

$$
\mathcal{R}(w, k) := \mathbb{E}_{(x^t, y^t) \sim \mathbb{P}_t} \left\{ \|y^t - k(w(x^t))\|^2 \right\}. \tag{4}
$$

Note that $\mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t) = -\mathcal{R}(w^*, k^*)$. We prove the uniform bound in Lemma A.1 below that can help us prove Theorem 5.2.

Lemma A.1.

For any $\delta > 0$, with probability at least $1 - \delta$, for all ReLU feed-forward neural networks $(w, k)$ of the target task, we have:

$$
|\mathcal{R}(w, k) - \mathcal{L}(w, k; \mathcal{D}_t)| \le C(d, d_t, M, H, L, \delta) / \sqrt{n_t}.
$$
Proof.

We recall the definition of Rademacher complexity. Given a real-valued function class $\mathcal{G}$ and a set of data points $\mathcal{D} = \{u_i\}_{i=1}^{n}$, the (empirical) Rademacher complexity $\widehat{R}_{\mathcal{D}}(\mathcal{G})$ is defined as:

$$
\widehat{R}_{\mathcal{D}}(\mathcal{G}) = \mathbb{E}_{\epsilon} \left[ \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i g(u_i) \right],
$$

where $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)$ is a vector uniformly distributed in $\{-1, +1\}^n$.
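To make the definition concrete, the following sketch estimates the empirical Rademacher complexity by Monte Carlo sampling of the sign vector $\epsilon$. It is an illustration only and assumes a finite function class, represented by a hypothetical matrix `values` with `values[m, i]` equal to $g_m(u_i)$; the neural network class used in the proof is infinite and cannot be handled by this enumeration.

```python
import numpy as np


def empirical_rademacher(values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps[ sup_g (1/n) sum_i eps_i g(u_i) ].

    values[m, i] holds g_m(u_i) for the m-th function of a finite class.
    """
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(num_draws, n))   # uniform signs in {-1, +1}^n
    # correlations[r, m] = (1/n) * sum_i eps[r, i] * values[m, i]
    correlations = eps @ values.T / n
    # supremum over the class for each draw, then average over the draws
    return correlations.max(axis=1).mean()


n = 100
constant = np.ones((1, n))                    # class containing a single function g = 1
symmetric = np.vstack([constant, -constant])  # class {g, -g}
print(empirical_rademacher(constant), empirical_rademacher(symmetric))
```

A single fixed function cannot correlate with random signs on average, so its estimate should be close to zero, while the two-element class $\{g, -g\}$ yields a strictly positive value of order $1/\sqrt{n}$.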

In our setting, the hypothesis space $\Phi$ is the class of $L$-layer ReLU feed-forward neural networks whose number of hidden nodes and parameters in each layer are bounded from above by $H$ and $M \ge 1$ respectively. For all $(w, k) \in \Phi$ and $x$ such that $\|x\|_{\infty} \le 1$, we have:

$$
\|k(w(x))\|_{\infty} \le d M^{L+1} H^{L}.
$$

Define $f_{w,k}(x, y) = y - k(w(x))$ and note that $f_{w,k}(x, y) \in \mathbb{R}^{d_t}$. For any $j = 1, 2, \ldots, d_t$, let $[\cdot]_j$ be the projection map to the $j$-th coordinate. We consider the following real-valued function classes:

$$
\begin{aligned}
\mathcal{F} &= \{ \|f_{w,k}\|^2 : (w, k) \in \Phi \}, \\
\mathcal{F}_j &= \{ [f_{w,k}]_j : (w, k) \in \Phi \}, \\
\Phi_j &= \{ [k(w(\cdot))]_j : (w, k) \in \Phi \},
\end{aligned}
$$

where each element of $\mathcal{F}$ or $\mathcal{F}_j$ is a function with variables $(x, y)$, and each element of $\Phi_j$ is a function with variable $x$. Let $\mathcal{D}^x_t = \{x^t_i\}_{i=1}^{n_t}$ be the set of target inputs. By Theorem 2 of Golowich et al. [2018], for all $j = 1, 2, \ldots, d_t$, we have:

$$
\widehat{R}_{\mathcal{D}^x_t}(\Phi_j) \le 2 d_t M^{L+1} H^{L} \sqrt{\frac{L + 1 + \ln d}{n_t}}.
$$

We note that for any $i = 1, 2, \ldots, n_t$, the function $r_i(a) = (a - y_i^t)^2$ mapping from $a \in [-dM^{L+1}H^{L}, dM^{L+1}H^{L}]$ to $\mathbb{R}$ is Lipschitz with constant $4dM^{L+1}H^{L}$. Thus, applying the Contraction Lemma (Lemma 26.9 in Shalev-Shwartz and Ben-David [2014]), we obtain:

$$
\widehat{R}_{\mathcal{D}_t}(\mathcal{F}_j) \le 4dM^{L+1}H^{L} \, \widehat{R}_{\mathcal{D}^x_t}(\Phi_j) \le 8 d d_t M^{2L+2} H^{2L} \sqrt{\frac{L + 1 + \ln d}{n_t}}.
$$
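The Lipschitz constant used in this step can be checked with a short calculation. The derivation below is a sketch under two assumptions that are not stated in this proof: $y_i^t$ is read as a scalar (one coordinate of the target label), and it satisfies $|y_i^t| \le B$, where $B := dM^{L+1}H^{L}$ is shorthand introduced here for the output bound.

```latex
\begin{align*}
|r_i(a) - r_i(a')| &= \left| (a - y_i^t)^2 - (a' - y_i^t)^2 \right| \\
&= |a - a'| \cdot |a + a' - 2 y_i^t| \\
&\le |a - a'| \left( |a| + |a'| + 2 |y_i^t| \right) \\
&\le 4B \, |a - a'| \qquad \text{for all } a, a' \in [-B, B].
\end{align*}
```

Multiplying the constant $4B = 4dM^{L+1}H^{L}$ by the bound $2 d_t M^{L+1} H^{L} \sqrt{(L + 1 + \ln d)/n_t}$ on $\widehat{R}_{\mathcal{D}^x_t}(\Phi_j)$ gives the factor $8 d d_t M^{2L+2} H^{2L}$ in the final bound on $\widehat{R}_{\mathcal{D}_t}(\mathcal{F}_j)$.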

Therefore,

$$
\widehat{R}_{\mathcal{D}_t}(\mathcal{F}) \le \sum_{j=1}^{d_t} \widehat{R}_{\mathcal{D}_t}(\mathcal{F}_j) \le 8 d d_t^2 M^{2L+2} H^{2L} \sqrt{\frac{L + 1 + \ln d}{n_t}}.
$$

Using this inequality, the result of Lemma A.1 follows from Theorem 26.5 in Shalev-Shwartz and Ben-David [2014]. ∎

To prove Theorem 5.2, we apply Lemma 5.1 in the main paper and Lemma A.1 above for the transferred target model $(w^*, k^*)$. Thus, for any $\lambda \ge 0$ and $\delta > 0$, with probability at least $1 - \delta$, we have:

$$
\begin{aligned}
\mathcal{T}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t) &\le -\mathcal{L}(w^*, k^*; \mathcal{D}_t) \\
&\le -\mathcal{R}(w^*, k^*) + C(d, d_t, M, H, L, \delta) / \sqrt{n_t} \\
&= \mathrm{Tr}(\mathcal{D}_s, \mathbb{P}_t) + C(d, d_t, M, H, L, \delta) / \sqrt{n_t}.
\end{aligned}
$$

Therefore, Theorem 5.2 holds.

A.3 Proof of Lemma 5.3

Note that

$$
A^*_{\lambda}, b^*_{\lambda} = \operatorname*{argmin}_{A,b} \left\{ \frac{1}{n} \sum_{i=1}^{n} \|y^t_i - A y^s_i - b\|^2 + \lambda \|A\|_F^2 \right\}.
$$

For all $k$, we have:

$$
\begin{aligned}
\sqrt{\mathcal{L}(w^*, k^*; \mathcal{D}_t)} &\le \sqrt{\mathcal{L}(w^*, k; \mathcal{D}_t)} && \text{(definition of $k^*$)} \\
&= \left[ \frac{1}{n} \sum_{i=1}^{n} \|y^t_i - k(w^*(x_i))\|^2 \right]^{1/2} && \text{(definition of $\mathcal{L}$)} \\
&\le \left[ \frac{1}{n} \sum_{i=1}^{n} \|y^t_i - A^*_{\lambda} y^s_i - b^*_{\lambda}\|^2 \right]^{1/2} + \left[ \frac{1}{n} \sum_{i=1}^{n} \|A^*_{\lambda} y^s_i + b^*_{\lambda} - k(w^*(x_i))\|^2 \right]^{1/2} && \text{(triangle inequality)} \\
&\le \sqrt{-\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)} + \left[ \frac{1}{n} \sum_{i=1}^{n} \|A^*_{\lambda} y^s_i + b^*_{\lambda} - k(w^*(x_i))\|^2 \right]^{1/2}. && \text{(definition of $\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}$)}
\end{aligned}
$$

Picking $k(\cdot) = A^*_{\lambda} h^*(\cdot) + b^*_{\lambda}$, this inequality becomes:

$$
\begin{aligned}
\sqrt{\mathcal{L}(w^*, k^*; \mathcal{D}_t)} &\le \sqrt{-\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)} + \left[ \frac{1}{n} \sum_{i=1}^{n} \|A^*_{\lambda} [y^s_i - h^*(w^*(x_i))]\|^2 \right]^{1/2} \\
&\le \sqrt{-\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)} + \|A^*_{\lambda}\|_F \left[ \frac{1}{n} \sum_{i=1}^{n} \|y^s_i - h^*(w^*(x_i))\|^2 \right]^{1/2} \\
&= \sqrt{-\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_s, \mathcal{D}_t)} + \|A^*_{\lambda}\|_F \sqrt{\mathcal{L}(w^*, h^*; \mathcal{D}_s)}.
\end{aligned}
$$

Note that if $0\leq a\leq b+c$, then $a^{2}\leq 2b^{2}+2c^{2}$. Applying this fact to the above inequality, we have:

$$
\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})\leq-2\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})+2\|A^{*}_{\lambda}\|^{2}_{F}\,\mathcal{L}(w^{*},h^{*};\mathcal{D}_{s}).
$$

Thus, Lemma 5.3 holds.
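For readers who want the two elementary steps of this proof spelled out, the following is a short sketch and not part of the original argument. The symbols $v$ (a generic vector standing in for $y^{s}_{i}-h^{*}(w^{*}(x_{i}))$), $A_{j,:}$ (the $j$-th row of a matrix $A$), and the non-negative reals $a$, $b$, $c$ are notation introduced here for illustration only. The first line uses the Cauchy-Schwarz inequality row by row to pull out the Frobenius norm; the second uses $2bc\leq b^{2}+c^{2}$.

```latex
% Step 1: matrix-vector norm bounded via the Frobenius norm
\|Av\|^{2}
  = \sum_{j}\langle A_{j,:},v\rangle^{2}
  \leq \sum_{j}\|A_{j,:}\|^{2}\,\|v\|^{2}
  = \|A\|_{F}^{2}\,\|v\|^{2}.

% Step 2: squaring a sum bound, for 0 <= a <= b + c
a^{2}\leq (b+c)^{2} = b^{2}+2bc+c^{2}\leq 2b^{2}+2c^{2},
\qquad\text{since } (b-c)^{2}\geq 0 \iff 2bc\leq b^{2}+c^{2}.
```

In the proof above, Step 2 is applied with $a=\sqrt{\mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})}$, which is non-negative by construction.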

A.4 Proof of Theorem 5.4

For any $\lambda\geq 0$ and $\delta>0$, applying Lemma A.1 for $(w^{*},k^{*})$ and Lemma 5.3, with probability at least $1-\delta$:

$$
\begin{aligned}
\mathcal{R}(w^{*},k^{*})
&\leq \mathcal{L}(w^{*},k^{*};\mathcal{D}_{t})+C(d,d_{t},M,H,L,\delta)/\sqrt{n} \\
&\leq -2\widehat{\mathcal{T}}^{\mathrm{lab}}_{\lambda}(\mathcal{D}_{s},\mathcal{D}_{t})+2\|A^{*}_{\lambda}\|^{2}_{F}\,\mathcal{L}(w^{*},h^{*};\mathcal{D}_{s})+C(d,d_{t},M,H,L,\delta)/\sqrt{n}.
\end{aligned}
$$

Since $\mathrm{Tr}(\mathcal{D}_{s},\mathbb{P}_{t})=-\mathcal{R}(w^{*},k^{*})$, Theorem 5.4 holds.

Appendix B More details for experiment settings

B.1 More details for Sections 6.1-6.6

For these experiments, we train our source models from scratch using the MSE loss with the AdamW optimizer [Loshchilov and Hutter, 2019], which we run for 40 epochs with a batch size of 64 and the cosine learning rate scheduler. To obtain good source models, we resize all input images to 256×256 and apply basic image augmentations without horizontal flipping (i.e., affine transformation, Gaussian blur, and color jitter). We also scale all labels into $[0,1]$ using the width and height of the input images.
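As a small illustration of the label scaling step, the sketch below maps a pixel-space keypoint into $[0,1]^2$ by dividing by the image width and height. The exact formula is an assumption (the text only says that width and height are used), and the function names `scale_keypoint` and `unscale_keypoint` are hypothetical helpers, not taken from the authors' code.

```python
def scale_keypoint(x, y, width, height):
    """Map pixel coordinates (x, y) into [0, 1] x [0, 1].

    Assumption: scaling is a plain division by the image size.
    """
    return x / width, y / height


def unscale_keypoint(u, v, width, height):
    """Inverse of scale_keypoint: map normalized coordinates back to pixels."""
    return u * width, v * height


# A keypoint at pixel (64, 192) in a 256x256 input image.
u, v = scale_keypoint(64, 192, 256, 256)
print(u, v)  # 0.25 0.75
```

Keeping the inverse mapping around is convenient when errors need to be reported in pixels rather than in normalized units.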

For the transfer learning setting with head re-training, we freeze the trained feature extractor and re-train the regression head on the target dataset using the same setting above, except that we run 15 epochs on the CUB-200-2011 dataset and 30 epochs on the OpenMonkey dataset. For half fine-tuning, we unfreeze the last convolution layer and the head, since together they contain around half of the total number of parameters. For full fine-tuning, we unfreeze the whole network. In these two fine-tuning settings, we fine-tune for 15 epochs on both datasets. We use PyTorch [Paszke et al., 2019] for implementation.

B.2 More details for Section 6.7

For this experiment, we use the following 8 ImageNet pre-trained models as the source models: ResNet50, ResNet101, ResNet152 [He et al., 2016], DenseNet121, DenseNet169, DenseNet201 [Huang et al., 2017], GoogleNet [Szegedy et al., 2015], and Inceptionv3 [Szegedy et al., 2016]. These models are taken from the PyTorch Model Zoo.

We use the dSprites dataset [Matthey et al., 2017] for the target task. This dataset contains 737,280 images with 4 outputs for regression: x and y positions, scale, and orientation. The train-test split is similar to the settings in You et al. [2021]: 60% for training, 20% for validation, and 20% for testing. The transferred MSE is computed on the test set. We train our models for 10 epochs using the AdamW optimizer. The initial learning rate is $10^{-3}$, which is divided by 10 every 3 epochs.
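The split sizes and the step learning-rate schedule described above can be worked out with a few lines of arithmetic. This is a minimal sketch under two assumptions that the text does not state: epochs are 0-indexed, and the first decay happens at the start of epoch 3. The helper names `split_sizes` and `step_lr` are hypothetical.

```python
TOTAL_IMAGES = 737_280  # number of dSprites images, as stated in the text


def split_sizes(total, percentages=(60, 20, 20)):
    """Return (train, val, test) sizes for integer percentage splits."""
    return tuple(total * pct // 100 for pct in percentages)


def step_lr(epoch, base_lr=1e-3, step=3, factor=10.0):
    """Learning rate at a 0-indexed epoch, divided by `factor` every `step` epochs."""
    return base_lr / factor ** (epoch // step)


n_train, n_val, n_test = split_sizes(TOTAL_IMAGES)
print(n_train, n_val, n_test)  # 442368 147456 147456

# Learning rate for each of the 10 training epochs.
schedule = [step_lr(epoch) for epoch in range(10)]
```

Under these assumptions the rate is $10^{-3}$ for epochs 0 to 2, $10^{-4}$ for epochs 3 to 5, $10^{-5}$ for epochs 6 to 8, and $10^{-6}$ for the final epoch.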

Appendix C Additional experiment results

C.1 Usefulness of theoretical bounds

Although the theoretical bounds in Section 5 show the relationships between the transferability of the optimal transferred model and our transferability estimators, these bounds could be loose in practice unless the number of samples is large. This is in fact a limitation of this type of generalization bound. To show the usefulness of our bounds in practice, we conduct an experiment to investigate the generalization gap using the head re-training setting in Section 6.1.

The generalization gap is defined as the difference between our transferability score and the negative MSE (the transferability) of the transferred model. According to our theorems, this generalization gap is bounded above by the complexity term. We will compare the generalization gap with the absolute value of our transferability score and also inspect whether it has any significant correlation with the actual transferred MSE.

From this experiment, the ratios between the absolute value of transferability score and the generalization gap for our transferability estimators are: 1.6 (LinMSE0), 2.0 (LinMSE1), 2.3 (LabMSE0), and 2.3 (LabMSE1). These results show that the transferability scores dominate the generalization gap in practice. More importantly, there is no significant correlation between the generalization gap and the actual transferred MSE. These findings indicate that the complexity term in our bounds may have little effect on transferability estimation, as opposed to the transferability score term, which has a strong effect (shown by the high correlations in our main experiments).
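To make the quantities in this experiment concrete, here is a minimal sketch of the generalization gap and the score-to-gap ratio as defined above. The input numbers are made up for illustration and are not results from the paper. The function names are hypothetical, and how the reported ratios are aggregated across tasks is not specified in the text, so only a single source-target pair is shown.

```python
def generalization_gap(score, test_mse):
    """Gap between a transferability score and the negative test MSE."""
    return score - (-test_mse)


def score_to_gap_ratio(score, test_mse):
    """Ratio |score| / |gap|; values above 1 mean the score dominates the gap."""
    return abs(score) / abs(generalization_gap(score, test_mse))


# Illustrative (made-up) values: the score is a negative regularized MSE.
score = -0.5
test_mse = 0.75

print(generalization_gap(score, test_mse))   # 0.25
print(score_to_gap_ratio(score, test_mse))   # 2.0
```

A ratio of 2.0 here would mean the score's magnitude is twice the gap, which is the sense in which the reported ratios of 1.6 to 2.3 indicate that the score term dominates.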

Table C.1: Kendall’s-$\tau$ correlation coefficients when transferring from OpenMonkey to CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. Our estimators improve up to 28.4% in comparison with SotA (LogME) while being 13% better on average.
Label-based methods: LabLogME, LabTransRate, LabMSE0, LabMSE1. Feature-based methods: LogME, TransRate, LinMSE0, LinMSE1.

| Transfer setting | LabLogME | LabTransRate | LabMSE0 | LabMSE1 | LogME | TransRate | LinMSE0 | LinMSE1 |
|---|---|---|---|---|---|---|---|---|
| Head re-training | 0.728 | 0.028 | 0.935* | 0.924 | 0.906 | 0.104 | 0.896 | 0.922* |
| Half fine-tuning | 0.525 | 0.392 | 0.644 | 0.646* | 0.651 | 0.291 | 0.667* | 0.646 |
| Full fine-tuning | 0.497 | 0.289 | 0.606* | 0.594 | 0.611 | 0.328 | 0.616* | 0.594 |

Table C.2: Spearman correlation coefficients when transferring from OpenMonkey to CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. Our estimators improve up to 19.9% in comparison with SotA (LogME) while being 9.7% better on average.
Label-based methods: LabLogME, LabTransRate, LabMSE0, LabMSE1. Feature-based methods: LogME, TransRate, LinMSE0, LinMSE1.

| Transfer setting | LabLogME | LabTransRate | LabMSE0 | LabMSE1 | LogME | TransRate | LinMSE0 | LinMSE1 |
|---|---|---|---|---|---|---|---|---|
| Head re-training | 0.857 | 0.102 | 0.994* | 0.991 | 0.988 | 0.215 | 0.984 | 0.990* |
| Half fine-tuning | 0.726 | 0.409 | 0.857 | 0.858* | 0.857 | 0.437 | 0.865* | 0.858 |
| Full fine-tuning | 0.689 | 0.433 | 0.826* | 0.823 | 0.827* | 0.474 | 0.827* | 0.823 |

Table C.3: Correlation coefficients when transferring between 10d-output tasks from OpenMonkey to CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. All correlations are statistically significant with $p<0.001$. Our estimators with both $\lambda$ values are better than SotA (LogME).
Label-based methods: LabLogME, LabTransRate, LabMSE0, LabMSE1. Feature-based methods: LogME, TransRate, LinMSE0, LinMSE1.

| Transfer setting | LabLogME | LabTransRate | LabMSE0 | LabMSE1 | LogME | TransRate | LinMSE0 | LinMSE1 |
|---|---|---|---|---|---|---|---|---|
| Head re-training | 0.970 | 0.719 | 0.991* | 0.989 | 0.968 | 0.656 | 0.990 | 0.995* |
| Half fine-tuning | 0.944 | 0.742 | 0.963* | 0.943 | 0.954 | 0.684 | 0.980* | 0.958 |
| Full fine-tuning | 0.878 | 0.736 | 0.892* | 0.863 | 0.892 | 0.669 | 0.916* | 0.881 |

C.2 Additional results for Section 6.1

Detailed correlation plots for Table 1. In Figures C.1, C.2, and C.3, we show the detailed correlation plots and $p$-values for our experiment results reported in Table 1 of the main paper. From these plots, all correlations are statistically significant with $p<0.001$, except for TransRate and LabTransRate with head re-training.

Additional results with non-linear correlation metrics. In Tables C.1 and C.2, we report the Kendall’s-$\tau$ and Spearman correlation coefficients to complement the results in Table 1 of the main paper. These coefficients, as described in Bolya et al. [2021], are used to assess the ranking associations or the monotonic relationships between the transferability measures and the model performance. Based on the findings presented in these tables, our proposed scores are generally on par with or outperform the current state-of-the-art (SotA) approach, LogME [You et al., 2021], with an average correlation improvement of 9.7% and 13% for Spearman and Kendall’s-$\tau$ coefficients, respectively. This provides strong evidence for the effectiveness of our proposed measures, not only under linear correlation but also under rank-based (non-linear) correlation.
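To illustrate why rank-based coefficients complement a linear one, the sketch below implements Pearson, Spearman, and Kendall's $\tau$ in plain Python and applies them to made-up data where the relationship is monotonic but not linear. It assumes no ties (so this simple $\tau$-a coincides with the tie-corrected variants). It is not the evaluation code used in the paper, and the sample values are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson (linear) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def ranks(values):
    """Ranks 1..n of the values, assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs, no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up scores and a monotonic, non-linear "performance" (negative square).
scores = [-0.9, -0.5, -0.3, -0.2, -0.1]
neg_mse = [-(s * s) for s in scores]

print(kendall_tau(scores, neg_mse))  # 1.0
print(spearman(scores, neg_mse))     # 1.0
print(pearson(scores, neg_mse) < 1)  # True
```

Both rank coefficients reach 1 because the ordering is perfectly preserved, while the Pearson coefficient stays below 1 because the relationship is curved.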

Additional result with high-dimensional labels. Using the setting in Section 6.1, we also conducted an additional experiment where both source and target tasks have 10-dimensional labels. In particular, we train a source model to predict five OpenMonkey keypoints: right eye, left eye, nose, head, and neck simultaneously (i.e., this source model returns a 10-dimensional output). The source model is then transferred to a target task that predicts a combination of five CUB-200-2011 keypoints. We consider each combination of 5 keypoints among 10 CUB-200-2011 keypoints as a target task, resulting in 252 target tasks that all have 10-dimensional labels.

We also run 3 transfer learning algorithms: head re-training, half fine-tuning, and full fine-tuning, using the same training settings as in Section 6.1. For TransRate and LabTransRate, we use 2 bins per dimension instead of 5 bins to reduce the computational costs. The results for this experiment are reported in Table C.3. From these results, our approaches are better than the baselines for both $\lambda$ values.
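The counts in this experiment can be checked with a short calculation. The sketch below verifies the number of target tasks and, as a rough indication of why fewer bins help, counts the cells of a discretization grid. The grid count assumes that label discretization forms a product grid over all label dimensions; that is an assumption made here for illustration, not a detail stated in this section. `grid_cells` is a hypothetical helper.

```python
from math import comb

# Choosing 5 keypoints out of 10 gives the number of 10-dimensional target tasks.
num_target_tasks = comb(10, 5)
print(num_target_tasks)  # 252


def grid_cells(bins_per_dim, label_dim):
    """Number of cells when each label dimension is split into equal bins.

    Assumption: the joint label space is discretized as a product grid.
    """
    return bins_per_dim ** label_dim


print(grid_cells(5, 10))  # 9765625
print(grid_cells(2, 10))  # 1024
```

Under this assumption, moving from 5 to 2 bins per dimension shrinks the number of possible discrete classes for a 10-dimensional label by roughly four orders of magnitude.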

C.3 Additional results for Section 6.2

Table C.4: Correlation coefficients when transferring from 2d-output tasks to 10d-output tasks on CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods. Except for TransRate with half and full fine-tuning, all correlations are statistically significant with $p<0.001$. Our estimators are better than SotA (LogME) in most cases.
Label-based methods: LabLogME, LabTransRate, LabMSE0, LabMSE1. Feature-based methods: LogME, TransRate, LinMSE0, LinMSE1.

| Transfer setting | LabLogME | LabTransRate | LabMSE0 | LabMSE1 | LogME | TransRate | LinMSE0 | LinMSE1 |
|---|---|---|---|---|---|---|---|---|
| Head re-training | 0.602 | 0.632 | 0.868* | 0.816 | 0.885 | 0.549 | 0.901 | 0.973* |
| Half fine-tuning | 0.491 | 0.645 | 0.771 | 0.881* | 0.804 | 0.072 | 0.913* | 0.818 |
| Full fine-tuning | 0.397 | 0.632 | 0.727 | 0.888* | 0.756 | 0.050 | 0.884* | 0.833 |

Detailed correlation plots for Table 2. In Figures C.4-C.9, we show the detailed correlation plots and $p$-values for our experiment results reported in Table 2 of the main paper. From these plots, all correlations are statistically significant with $p<0.001$, except for TransRate and LabTransRate as well as the full fine-tuning setting on the CUB-200-2011 dataset.

Additional result for each individual source task. We report in Tables C.5 and C.6 more comprehensive results for all source tasks on CUB-200-2011 and OpenMonkey respectively. Each row of the tables corresponds to one source task and shows the correlation coefficients when transferring to all other tasks in the respective dataset. From the tables, our transferability estimators are consistently better than LogME, LabLogME, TransRate, and LabTransRate for most source tasks on both datasets. These results confirm the effectiveness of our proposed methods.

Additional result with high-dimensional labels. In this additional experiment, we further show the effectiveness of our proposed methods when the target tasks have higher dimensional labels. In particular, we transfer from 4 source tasks on CUB-200-2011 (back, beak, belly, and breast) to all the combinations of 5 attributes among the remaining tasks (except for right eye, right leg, and right wing, which may not always be available in the data). In total, we have 224 source-target pairs, where the source tasks have 2-dimensional labels and the target tasks have 10-dimensional labels. We use the same training settings as in Section 6.2 of the main paper, except that we also use 2 bins per dimension when calculating TransRate and LabTransRate to reduce computational costs. Table C.4 reports the results for this experiment. These results clearly show that our methods, LinMSE0 and LinMSE1, are better than the LogME and TransRate baselines in most cases.
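The count of 224 source-target pairs follows from the setup described above, and the sketch below enumerates the pairs to confirm it. The keypoint names are the 15 CUB-200-2011 source tasks listed in Table C.5; the variable names are hypothetical and the code is only a consistency check, not the authors' experiment pipeline.

```python
from itertools import combinations

# The 15 CUB-200-2011 keypoints, as listed in Table C.5.
CUB_KEYPOINTS = [
    "back", "beak", "belly", "breast", "crown", "forehead",
    "left eye", "left leg", "left wing", "nape",
    "right eye", "right leg", "right wing", "tail", "throat",
]
SOURCE_TASKS = ["back", "beak", "belly", "breast"]
# Keypoints excluded because they may not always be available in the data.
EXCLUDED = {"right eye", "right leg", "right wing"}

# Keypoints that can be part of a 10-dimensional target task.
candidates = [
    name for name in CUB_KEYPOINTS
    if name not in SOURCE_TASKS and name not in EXCLUDED
]

# Each target task is a combination of 5 candidate keypoints (2 coordinates each).
target_tasks = list(combinations(candidates, 5))
pairs = [(source, target) for source in SOURCE_TASKS for target in target_tasks]

print(len(candidates), len(target_tasks), len(pairs))  # 8 56 224
```

Eight candidate keypoints give 56 target combinations, and pairing each with the 4 source tasks yields the 224 pairs reported.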

Figure C.1: Correlation coefficients and $p$-values between transferability estimators and negative test MSEs when transferring with head re-training from OpenMonkey to CUB-200-2011. Panels: (a) LogME, (b) TransRate, (c) LinMSE0, (d) LinMSE1, (e) LabLogME, (f) LabTransRate, (g) LabMSE0, (h) LabMSE1.
Figure C.2: Correlation coefficients and $p$-values between transferability estimators and negative test MSEs when transferring with half fine-tuning from OpenMonkey to CUB-200-2011. Panels: (a) LogME, (b) TransRate, (c) LinMSE0, (d) LinMSE1, (e) LabLogME, (f) LabTransRate, (g) LabMSE0, (h) LabMSE1.
Figure C.3: Correlation coefficients and $p$-values between transferability estimators and negative test MSEs when transferring with full fine-tuning from OpenMonkey to CUB-200-2011. Panels: (a) LogME, (b) TransRate, (c) LinMSE0, (d) LinMSE1, (e) LabLogME, (f) LabTransRate, (g) LabMSE0, (h) LabMSE1.
Figure C.4: Correlation coefficients and $p$-values between transferability estimators and negative test MSEs when transferring with head re-training between any two different keypoints (with shared inputs) on CUB-200-2011. Panels: (a) LogME, (b) TransRate, (c) LinMSE0, (d) LinMSE1, (e) LabLogME, (f) LabTransRate, (g) LabMSE0, (h) LabMSE1.
Figure C.5: Correlation coefficients and $p$-values between transferability estimators and negative test MSEs when transferring with half fine-tuning between any two different keypoints (with shared inputs) on CUB-200-2011. Panels: (a) LogME, (b) TransRate, (c) LinMSE0, (d) LinMSE1, (e) LabLogME, (f) LabTransRate, (g) LabMSE0, (h) LabMSE1.
Figure C.6: Correlation coefficients and $p$-values between transferability estimators and negative test MSEs when transferring with full fine-tuning between any two different keypoints (with shared inputs) on CUB-200-2011. Panels: (a) LogME, (b) TransRate, (c) LinMSE0, (d) LinMSE1, (e) LabLogME, (f) LabTransRate, (g) LabMSE0, (h) LabMSE1.
Figure C.7: Correlation coefficients and $p$-values between transferability estimators and negative test MSEs when transferring with head re-training between any two different keypoints (with shared inputs) on OpenMonkey. Panels: (a) LogME, (b) TransRate, (c) LinMSE0, (d) LinMSE1, (e) LabLogME, (f) LabTransRate, (g) LabMSE0, (h) LabMSE1.
Figure C.8: Correlation coefficients and $p$-values between transferability estimators and negative test MSEs when transferring with half fine-tuning between any two different keypoints (with shared inputs) on OpenMonkey. Panels: (a) LogME, (b) TransRate, (c) LinMSE0, (d) LinMSE1, (e) LabLogME, (f) LabTransRate, (g) LabMSE0, (h) LabMSE1.
Figure C.9: Correlation coefficients and $p$-values between transferability estimators and negative test MSEs when transferring with full fine-tuning between any two different keypoints (with shared inputs) on OpenMonkey. Panels: (a) LogME, (b) TransRate, (c) LinMSE0, (d) LinMSE1, (e) LabLogME, (f) LabTransRate, (g) LabMSE0, (h) LabMSE1.
Table C.5: Correlation coefficients for all source tasks on CUB-200-2011. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods.
Label-based methods: LabLogME, LabTransRate, LabMSE0, LabMSE1. Feature-based methods: LogME, TransRate, LinMSE0, LinMSE1.

| Transfer setting | Source task | LabLogME | LabTransRate | LabMSE0 | LabMSE1 | LogME | TransRate | LinMSE0 | LinMSE1 |
|---|---|---|---|---|---|---|---|---|---|
| Head re-training | Back | 0.743 | 0.116 | 0.956 | 0.966* | 0.920 | 0.273 | 0.931 | 0.964* |
| | Beak | 0.863 | 0.229 | 0.922* | 0.915 | 0.878 | 0.158 | 0.906 | 0.945* |
| | Belly | 0.892 | 0.097 | 0.970 | 0.982* | 0.933 | 0.188 | 0.932 | 0.982* |
| | Breast | 0.915 | 0.120 | 0.935 | 0.945* | 0.903 | 0.279 | 0.922 | 0.961* |
| | Crown | 0.917 | 0.041 | 0.962 | 0.966* | 0.913 | 0.251 | 0.945 | 0.979* |
| | Forehead | 0.888 | 0.076 | 0.941* | 0.939 | 0.885 | 0.221 | 0.924 | 0.966* |
| | Left eye | 0.035 | 0.076 | 0.913 | 0.964* | 0.924 | 0.289 | 0.945 | 0.969* |
| | Left leg | 0.261 | 0.221 | 0.935 | 0.975* | 0.935 | 0.223 | 0.953 | 0.975* |
| | Left wing | 0.260 | 0.170 | 0.964 | 0.994* | 0.980 | 0.173 | 0.994* | 0.994* |
| | Nape | 0.889 | 0.085 | 0.922 | 0.942* | 0.900 | 0.300 | 0.929 | 0.953* |
| | Right eye | 0.625 | 0.242 | 0.904 | 0.974* | 0.921 | 0.244 | 0.948 | 0.975* |
| | Right leg | 0.508 | 0.047 | 0.958 | 0.989* | 0.942 | 0.217 | 0.954 | 0.990* |
| | Right wing | 0.521 | 0.167 | 0.907 | 0.979* | 0.935 | 0.270 | 0.946 | 0.980* |
| | Tail | 0.591 | 0.392 | 0.900 | 0.927* | 0.872 | 0.544 | 0.880 | 0.890* |
| | Throat | 0.896 | 0.124 | 0.938 | 0.941* | 0.890 | 0.291 | 0.924 | 0.956* |
| Half fine-tuning | Back | 0.714 | 0.076 | 0.791 | 0.814* | 0.835 | 0.168 | 0.911* | 0.873 |
| | Beak | 0.663 | 0.160 | 0.831* | 0.772 | 0.765 | 0.076 | 0.883 | 0.899* |
| | Belly | 0.528 | 0.233 | 0.655 | 0.752* | 0.758 | 0.309 | 0.849* | 0.764 |
| | Breast | 0.730 | 0.100 | 0.802* | 0.779 | 0.762 | 0.152 | 0.867* | 0.850 |
| | Crown | 0.644 | 0.068 | 0.752 | 0.776* | 0.714 | 0.165 | 0.832* | 0.816 |
| | Forehead | 0.654 | 0.032 | 0.804* | 0.786 | 0.727 | 0.120 | 0.859 | 0.873* |
| | Left eye | 0.420 | 0.046 | 0.913* | 0.853 | 0.812 | 0.227 | 0.892* | 0.865 |
| | Left leg | 0.121 | 0.095 | 0.721 | 0.819* | 0.845 | 0.150 | 0.893* | 0.832 |
| | Left wing | 0.352 | 0.150 | 0.949* | 0.918 | 0.859 | 0.189 | 0.919* | 0.918 |
| | Nape | 0.660 | 0.055 | 0.705 | 0.770* | 0.751 | 0.181 | 0.863* | 0.802 |
| | Right eye | 0.561 | 0.221 | 0.911* | 0.873 | 0.786 | 0.180 | 0.871 | 0.890* |
| | Right leg | 0.268 | 0.125 | 0.690 | 0.804* | 0.810 | 0.069 | 0.861* | 0.820 |
| | Right wing | 0.407 | 0.133 | 0.495 | 0.613* | 0.516 | 0.338 | 0.521 | 0.617* |
| | Tail | 0.801 | 0.117 | 0.930* | 0.812 | 0.848 | 0.285 | 0.924 | 0.968* |
| | Throat | 0.767 | 0.013 | 0.870* | 0.810 | 0.811 | 0.253 | 0.900* | 0.873 |
| Full fine-tuning | Back | 0.710 | 0.085 | 0.785 | 0.808* | 0.829 | 0.178 | 0.906* | 0.868 |
| | Beak | 0.659 | 0.161 | 0.826* | 0.780 | 0.758 | 0.073 | 0.877 | 0.899* |
| | Belly | 0.645 | 0.273 | 0.782 | 0.847* | 0.862 | 0.365 | 0.926* | 0.856 |
| | Breast | 0.740 | 0.104 | 0.811* | 0.791 | 0.768 | 0.152 | 0.871* | 0.859 |
| | Crown | 0.647 | 0.073 | 0.756 | 0.784* | 0.717 | 0.157 | 0.834* | 0.821 |
| | Forehead | 0.648 | 0.037 | 0.799* | 0.783 | 0.723 | 0.111 | 0.855 | 0.869* |
| | Left eye | 0.224 | 0.456* | 0.297 | 0.347 | 0.333* | 0.246 | 0.282 | 0.326 |
| | Left leg | 0.057 | 0.067 | 0.659 | 0.769* | 0.796 | 0.146 | 0.850* | 0.783 |
| | Left wing | 0.342 | 0.159 | 0.954* | 0.915 | 0.860 | 0.195 | 0.920* | 0.914 |
| | Nape | 0.667 | 0.041 | 0.713 | 0.779* | 0.752 | 0.177 | 0.864* | 0.810 |
| | Right eye | 0.549 | 0.213 | 0.915* | 0.876 | 0.794 | 0.199 | 0.877 | 0.893* |
| | Right leg | 0.237 | 0.377 | 0.673 | 0.692* | 0.755 | 0.431 | 0.766* | 0.693 |
| | Right wing | 0.254* | 0.046 | 0.237 | 0.223 | 0.225 | 0.093 | 0.227* | 0.220 |
| | Tail | 0.803 | 0.122 | 0.930* | 0.818 | 0.846 | 0.288 | 0.923 | 0.969* |
| | Throat | 0.665 | 0.027 | 0.801* | 0.779 | 0.744 | 0.256 | 0.850* | 0.834 |

Table C.6: Correlation coefficients for all source tasks on OpenMonkey. Bold numbers indicate best results in each row. Asterisks (*) indicate best results among the corresponding label-based or feature-based methods.
Label-based methods: LabLogME, LabTransRate, LabMSE0, LabMSE1. Feature-based methods: LogME, TransRate, LinMSE0, LinMSE1.

| Transfer setting | Source task | LabLogME | LabTransRate | LabMSE0 | LabMSE1 | LogME | TransRate | LinMSE0 | LinMSE1 |
|---|---|---|---|---|---|---|---|---|---|
| Head re-training | Right eye | 0.894 | 0.859 | 0.986* | 0.835 | 0.918 | 0.846 | 0.978 | 0.986* |
| | Left eye | 0.895 | 0.854 | 0.987* | 0.838 | 0.868 | 0.858 | 0.981 | 0.987* |
| | Nose | 0.908 | 0.849 | 0.988* | 0.849 | 0.818 | 0.837 | 0.978 | 0.989* |
| | Head | 0.941 | 0.881 | 0.992* | 0.821 | 0.897 | 0.884 | 0.983* | 0.978 |
| | Neck | 0.972 | 0.862 | 0.998* | 0.887 | 0.932 | 0.839 | 0.982 | 0.987* |
| | Right shoulder | 0.977 | 0.837 | 0.994* | 0.891 | 0.842 | 0.811 | 0.982* | 0.980 |
| | Right elbow | 0.963 | 0.529 | 0.994* | 0.940 | 0.469 | 0.564 | 0.969 | 0.990* |
| | Right wrist | 0.970 | 0.753 | 0.993* | 0.939 | 0.615 | 0.446 | 0.963 | 0.990* |
| | Left shoulder | 0.972 | 0.800 | 0.997* | 0.915 | 0.823 | 0.808 | 0.988* | 0.988* |
| | Left elbow | 0.960 | 0.546 | 0.994* | 0.948 | 0.711 | 0.572 | 0.969 | 0.989* |
| | Left wrist | 0.975 | 0.597 | 0.993* | 0.951 | 0.964 | 0.544 | 0.963 | 0.993* |
| | Hip | 0.922 | 0.540 | 0.989* | 0.325 | 0.874 | 0.557 | 0.800 | 0.991* |
| | Right knee | 0.925 | 0.080 | 0.975* | 0.850 | 0.766 | 0.331 | 0.945 | 0.993* |
| | Right ankle | 0.931 | 0.411 | 0.989* | 0.770 | 0.737 | 0.371 | 0.930 | 0.997* |
| | Left knee | 0.923 | 0.160 | 0.978* | 0.848 | 0.692 | 0.209 | 0.936 | 0.994* |
| | Left ankle | 0.916 | 0.416 | 0.986* | 0.775 | 0.852 | 0.329 | 0.925 | 0.998* |
| | Tail | 0.936 | 0.712 | 0.993* | 0.312 | 0.821 | 0.662 | 0.897 | 0.990* |
| Half fine-tuning | Right eye | 0.795 | 0.734 | 0.906* | 0.883 | 0.835 | 0.709 | 0.963* | 0.923 |
| | Left eye | 0.797 | 0.731 | 0.905* | 0.879 | 0.771 | 0.719 | 0.960* | 0.918 |
| | Nose | 0.829 | 0.736 | 0.914* | 0.872 | 0.649 | 0.721 | 0.968* | 0.916 |
| | Head | 0.835 | 0.759 | 0.921* | 0.882 | 0.804 | 0.751 | 0.964* | 0.928 |
| | Neck | 0.902 | 0.793 | 0.929* | 0.871 | 0.745 | 0.765 | 0.969* | 0.915 |
| | Right shoulder | 0.887 | 0.725 | 0.924* | 0.890 | 0.751 | 0.758 | 0.972* | 0.924 |
| | Right elbow | 0.764 | 0.250 | 0.806 | 0.914* | 0.048 | 0.602 | 0.931* | 0.821 |
| | Right wrist | 0.806 | 0.501 | 0.823 | 0.903* | 0.172 | 0.643 | 0.929* | 0.819 |
| | Left shoulder | 0.893 | 0.718 | 0.927* | 0.899 | 0.702 | 0.774 | 0.972* | 0.930 |
| | Left elbow | 0.782 | 0.369 | 0.824 | 0.919* | 0.366 | 0.594 | 0.946* | 0.839 |
| | Left wrist | 0.822 | 0.523 | 0.828 | 0.902* | 0.765 | 0.663 | 0.932* | 0.824 |
| | Hip | 0.030 | 0.487 | 0.233 | 0.910* | 0.006 | 0.359 | 0.800* | 0.305 |
| | Right knee | 0.481 | 0.429 | 0.598 | 0.906* | 0.186 | 0.067 | 0.831* | 0.687 |
| | Right ankle | 0.357 | 0.275 | 0.534 | 0.910* | 0.286 | 0.226 | 0.806* | 0.632 |
| | Left knee | 0.467 | 0.355 | 0.601 | 0.899* | 0.172 | 0.215 | 0.855* | 0.692 |
| | Left ankle | 0.331 | 0.242 | 0.530 | 0.904* | 0.197 | 0.303 | 0.822* | 0.632 |
| | Tail | 0.231 | 0.196 | 0.434 | 0.829* | 0.160 | 0.121 | 0.729* | 0.494 |
| Full fine-tuning | Right eye | 0.796 | 0.711 | 0.905* | 0.894 | 0.821 | 0.694 | 0.959* | 0.927 |
| | Left eye | 0.790 | 0.734 | 0.904* | 0.882 | 0.763 | 0.714 | 0.957* | 0.921 |
| | Nose | 0.810 | 0.731 | 0.912* | 0.892 | 0.642 | 0.709 | 0.960* | 0.932 |
| | Head | 0.801 | 0.737 | 0.900* | 0.892 | 0.772 | 0.718 | 0.947* | 0.920 |
| | Neck | 0.893 | 0.782 | 0.930* | 0.886 | 0.755 | 0.743 | 0.962* | 0.926 |
| | Right shoulder | 0.896 | 0.722 | 0.936* | 0.908 | 0.759 | 0.750 | 0.975* | 0.940 |
| | Right elbow | 0.689 | 0.168 | 0.736 | 0.878* | 0.047 | 0.562 | 0.888* | 0.761 |
| | Right wrist | 0.796 | 0.505 | 0.805 | 0.876* | 0.199 | 0.644 | 0.910* | 0.803 |
| | Left shoulder | 0.872 | 0.690 | 0.901* | 0.882 | 0.670 | 0.762 | 0.955* | 0.903 |
| | Left elbow | 0.726 | 0.282 | 0.774 | 0.904* | 0.326 | 0.538 | 0.914* | 0.797 |
| | Left wrist | 0.787 | 0.488 | 0.787 | 0.868* | 0.725 | 0.672 | 0.903* | 0.785 |
| | Hip | 0.016 | 0.518 | 0.173 | 0.894* | 0.038 | 0.382 | 0.757* | 0.238 |
| | Right knee | 0.391 | 0.518 | 0.516 | 0.891* | 0.096 | 0.141 | 0.763* | 0.614 |
| | Right ankle | 0.246 | 0.396 | 0.437 | 0.889* | 0.185 | 0.340 | 0.726* | 0.546 |
| | Left knee | 0.381 | 0.448 | 0.521 | 0.891* | 0.149 | 0.303 | 0.789* | 0.618 |
| | Left ankle | 0.244 | 0.297 | 0.444 | 0.871* | 0.098 | 0.357 | 0.751* | 0.551 |
| | Tail | 0.105 | 0.299 | 0.309 | 0.824* | 0.047 | 0.212 | 0.628* | 0.372 |